What is OmniRetrieval and why is it significant?

OmniRetrieval is a unified retrieval framework from KAIST and DeepAuto.ai that can query four types of knowledge sources: unstructured text corpora, SQL relational databases, RDF knowledge graphs (SPARQL), and labeled property graphs (Cypher). Many existing approaches force everything into a shared embedding space. OmniRetrieval instead keeps each source's native query language and structural affordances. It reaches 44.34% retrieval accuracy across 309 knowledge bases. That is 4.36 points (about 11% relative) above the strongest baseline, KB Routing, and well above the single-backend baselines (13.69% to 17.93%).

How does OmniRetrieval differ from traditional RAG systems?

Traditional RAG systems typically operate over a single unstructured corpus using vector similarity. OmniRetrieval extends this to structured sources (SQL databases, knowledge graphs, property graphs) by generating native queries in SQL, SPARQL, or Cypher rather than embedding everything. This preserves structural operations like SQL joins, graph traversals, and compositional queries that embedding-based approaches lose.

What are the three key steps in OmniRetrieval's approach?

First, source selection uses a long-context LLM to identify relevant knowledge sources from the catalog. Second, query formulation generates executable native queries (SQL, SPARQL, Cypher, or text) for each selected source. Third, cross-source evidence selection consolidates results from multiple heterogeneous sources (table rows, graph triples, document passages) into a unified answer.

What datasets and benchmarks did OmniRetrieval use?

OmniRetrieval was evaluated on 13 datasets spanning 309 knowledge bases: 7 BEIR document corpora (NFCorpus, SciFact, FiQA, MS MARCO, FEVER, Natural Questions, HotpotQA), 286 SQL databases (Spider, BIRD), 1 RDF knowledge graph with 3 benchmark datasets (SimpleQuestions, QALD-10, LC-QuAD 2.0 on Wikidata), and 15 labeled property graphs (Text2Cypher on Neo4j).

OmniRetrieval: KAIST's Framework That Finally Unifies | explainx.ai Blog

Real-world questions don't respect database boundaries.

Ask "Which artists were born on the same date as Rachel Stevens?" and the answer lives in Wikidata's RDF knowledge graph, queryable via SPARQL.

Ask "How many online purchases did Ole Group make in May 2019?" and you need SQL against a normalized relational database.

Ask "Which actors acted in movies directed by the person who directed Speed Racer?" and you're traversing a labeled property graph with Cypher.

Ask "What is the cancer risk from French fries?" and you're searching unstructured biomedical documents.

Current retrieval systems force you to pick one. Use a document retriever (BM25, DPR) for text, text-to-SQL for databases, text-to-SPARQL for knowledge graphs, or text-to-Cypher for property graphs—but not all of them for a single question.

The natural solution seems obvious: collapse everything into a shared embedding space and retrieve by similarity.

Except that doesn't work.

As researchers from KAIST and DeepAuto.ai demonstrate in their new paper "OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources," flattening structured data into embeddings throws away the structural affordances—joins, traversals, compositional operators—that make each source valuable in the first place.

Their alternative: meet each knowledge source on its own terms.

The Problem: Retrieval Is Fragmented Across Incompatible Backends

Modern knowledge exists in structurally diverse forms:

1. Unstructured Text Corpora

Medical articles (PubMed)
Wikipedia passages
Financial documents
Support tickets

Query method: Free-form natural language → BM25 or dense retrieval

2. Relational Databases (SQL)

Enterprise databases
E-commerce transactions
Analytics warehouses

Query method: Natural language → SQL with joins, aggregations, filters

3. RDF Knowledge Graphs (SPARQL)

Wikidata (15+ billion triples)
DBpedia
Domain ontologies

Query method: Natural language → SPARQL triple patterns and property paths

4. Labeled Property Graphs (Cypher)

Neo4j graphs
Social networks
Supply chains
Recommendation systems

Query method: Natural language → Cypher graph traversals

The fragmentation problem: Each backend has:

Its own native query language
Its own execution engine
Its own structural context (schema, ontology, graph model)
Its own result format (passages, rows, triples, paths)

Existing retrieval approaches operate on one source at a time, leaving the broader knowledge landscape unreachable behind incompatible interfaces.

The Failed Solution: Unified Embeddings

The obvious approach is to project everything into a shared vector space:

Embed all documents, table rows, knowledge graph triples, and graph paths
Embed the user's query
Retrieve by cosine similarity

Why this fails:

1. Modality Gap Bias

Embeddings cluster by source type rather than by semantic content. As a result, the retriever favors sources whose format resembles the query, not the sources that actually answer it.

2. Loss of Structural Operators

SQL joins become individual row embeddings. Graph traversals become separate edge embeddings. Multi-hop reasoning is lost.

Consider the query: "Find companies founded by MIT graduates who later joined Google."

SQL version:

```sql
SELECT DISTINCT c.name
FROM companies c
JOIN founders f ON c.id = f.company_id
JOIN education e ON f.person_id = e.person_id
JOIN employment emp ON f.person_id = emp.person_id
WHERE e.institution = 'MIT'
  AND emp.company = 'Google'
  AND emp.start_date > f.founded_date
```

Embedding version: You get individual rows for companies, founders, education records, and employment records. The join logic is gone.
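Here is a minimal sketch that makes the contrast concrete. It runs the SQL version above against a tiny in-memory SQLite database. The table contents are invented for illustration. The answer exists only after the joins and the date comparison have run, and no single row embedding captures that.

```python
import sqlite3

# Hypothetical toy schema mirroring the tables named in the query above.
SCHEMA = """
CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE founders (company_id INTEGER, person_id INTEGER, founded_date TEXT);
CREATE TABLE education (person_id INTEGER, institution TEXT);
CREATE TABLE employment (person_id INTEGER, company TEXT, start_date TEXT);
"""

# Invented rows: person 10 joined Google after founding; person 20 joined before.
ROWS = {
    "companies": [(1, "Acme Robotics"), (2, "Beta Labs")],
    "founders": [(1, 10, "2012-01-01"), (2, 20, "2015-06-01")],
    "education": [(10, "MIT"), (20, "MIT")],
    "employment": [(10, "Google", "2016-03-01"), (20, "Google", "2010-09-01")],
}

QUERY = """
SELECT DISTINCT c.name
FROM companies c
JOIN founders f ON c.id = f.company_id
JOIN education e ON f.person_id = e.person_id
JOIN employment emp ON f.person_id = emp.person_id
WHERE e.institution = 'MIT'
  AND emp.company = 'Google'
  AND emp.start_date > f.founded_date
"""


def run_join_demo() -> list[str]:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    for table, rows in ROWS.items():
        placeholders = ", ".join("?" * len(rows[0]))  # e.g. "?, ?, ?"
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    # ISO-8601 date strings compare correctly as text.
    names = [row[0] for row in conn.execute(QUERY)]
    conn.close()
    return names


print(run_join_demo())  # ['Acme Robotics']
```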

3. Scale Impossibility

Wikidata has 15+ billion triples. Embedding all possible paths in a property graph grows exponentially with hop length—one graph in the benchmark has tens of billions of 3-hop paths.

Materializing a shared embedding space for real-world knowledge sources is computationally infeasible.
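A back-of-the-envelope sketch shows why materializing paths blows up. The function below counts k-hop walks, that is, paths that may revisit nodes, which gives an upper bound on simple paths. It uses dynamic programming over an adjacency list. If every node has out-degree d, the count is n·d^k. The graph sizes in the last line are hypothetical and not taken from the benchmark.

```python
def count_k_hop_walks(adjacency: dict[str, list[str]], k: int) -> int:
    """Count directed walks with exactly k edges (nodes may repeat)."""
    # walks[node] = number of walks of the current length that end at node
    walks = {node: 1 for node in adjacency}  # length-0 walks: one per node
    for _ in range(k):
        extended = {node: 0 for node in adjacency}
        for src, count in walks.items():
            for dst in adjacency[src]:
                extended[dst] += count
        walks = extended
    return sum(walks.values())


# Complete directed graph on 4 nodes (out-degree 3): 4 * 3**3 = 108 three-hop walks.
nodes = ["a", "b", "c", "d"]
complete = {n: [m for m in nodes if m != n] for n in nodes}
print(count_k_hop_walks(complete, 3))  # 108

# Closed form n * d**k for a hypothetical graph: 1M nodes, out-degree 50, k = 3.
print(f"{1_000_000 * 50**3:.2e}")  # 1.25e+11
```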

OmniRetrieval's Solution: Native Query Dispatch

Instead of homogenization, OmniRetrieval provides a coordination layer that:

Identifies which knowledge sources are relevant
Generates executable queries in each source's native language
Consolidates heterogeneous results into a unified answer

The Three-Stage Pipeline

Stage 1: Source Selection

Input: Natural language query + catalog of source descriptors

Process: A long-context LLM reads the full catalog of structural contexts (schemas, ontologies, corpus descriptions) and returns a ranked list of candidate sources.

Example:

```text
Query: "Which artists were born on the same date as Rachel Stevens?"

Catalog contains:
- 7 document corpora (medical, scientific, financial, Wikipedia)
- 286 SQL databases (various domains)
- Wikidata RDF graph (15B+ triples, encyclopedic facts)
- 15 property graphs (movies, social networks, companies)

Selected: Wikidata (contains birth dates of public figures)
```

Key insight: The LLM can reason over heterogeneous descriptors (table schemas, graph ontologies, corpus summaries) directly, without forcing them into a shared representation.
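Here is a minimal sketch of Stage 1. It assumes the LLM is any callable that maps a prompt string to a reply string, and that the LLM is told to answer with a JSON array of source ids. The descriptor fields, prompt wording, and function names are hypothetical. They are not taken from the OmniRetrieval code.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class SourceDescriptor:
    source_id: str
    paradigm: str  # "text", "sql", "sparql", or "cypher"
    context: str   # schema, ontology summary, or corpus description


def build_selection_prompt(query: str, catalog: list[SourceDescriptor], k: int) -> str:
    # Every descriptor is placed in the prompt as text; nothing is embedded.
    entries = "\n".join(f"[{d.source_id}] ({d.paradigm}) {d.context}" for d in catalog)
    return (
        f"Knowledge source catalog:\n{entries}\n\n"
        f"Question: {query}\n"
        f"Return a JSON array with the ids of the {k} most relevant sources, best first."
    )


def select_sources(
    query: str,
    catalog: list[SourceDescriptor],
    llm: Callable[[str], str],
    k: int = 3,
) -> list[str]:
    reply = json.loads(llm(build_selection_prompt(query, catalog, k)))
    known = {d.source_id for d in catalog}
    ranked: list[str] = []
    for source_id in reply:
        # Drop ids that are not in the catalog and repeated ids, keeping the LLM's order.
        if source_id in known and source_id not in ranked:
            ranked.append(source_id)
    return ranked[:k]


# Usage with a stub in place of a real long-context LLM call.
catalog = [
    SourceDescriptor("wiki_docs", "text", "Wikipedia passages about people and places"),
    SourceDescriptor("sales_db", "sql", "tables: orders(id, customer, date), customers(id, name)"),
    SourceDescriptor("wikidata", "sparql", "Wikidata RDF graph of encyclopedic facts"),
]


def stub_llm(prompt: str) -> str:
    return '["wikidata", "not_a_source", "wiki_docs", "wikidata"]'


print(select_sources("Which artists were born on the same date as Rachel Stevens?", catalog, stub_llm))
# ['wikidata', 'wiki_docs']
```

Keeping the top k rather than only the first id leaves room for the later stages to recover when the LLM's first choice is wrong.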

Stage 2: Query Formulation

Input: Query + structural context for each selected source

Process: For each source, generate an executable native query conditioned on its schema/ontology.

Example:

For Wikidata (SPARQL):

```sparql
SELECT DISTINCT ?artist WHERE {
  ?rachel rdfs:label "Rachel Stevens"@en .
  ?rachel wdt:P569 ?birthdate .                  # date of birth
  ?artist wdt:P569 ?birthdate .
  ?artist wdt:P106/wdt:P279* wd:Q483501 .        # occupation: artist or a subclass of artist
  FILTER (?artist != ?rachel)
}
```

For a SQL database:

```sql
SELECT a.name
FROM artists a
JOIN persons p1 ON a.person_id = p1.id
JOIN persons p2 ON p1.birthdate = p2.birthdate
WHERE p2.name = 'Rachel Stevens'
  AND a.person_id != p2.id
```

For a property graph (Cypher):

```cypher
MATCH (rachel:Person {name: "Rachel Stevens"})-[:BORN_ON]->(date:Date)
MATCH (artist:Artist)-[:BORN_ON]->(date)
WHERE artist <> rachel
RETURN artist.name
```

Key insight: Each query is grounded in the source's actual schema (table names, predicates, relationship types), not generic templates.
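Here is a minimal sketch of Stage 2 for one selected source. It assumes the same prompt-in, text-out LLM callable. The prompt text, the paradigm-to-target mapping, and the fence-stripping helper are hypothetical illustrations, not the paper's prompts.

```python
import re
from typing import Callable

# Hypothetical mapping from paradigm to the artifact the LLM should produce.
TARGET_BY_PARADIGM = {
    "sql": "SQL query",
    "sparql": "SPARQL query",
    "cypher": "Cypher query",
    "text": "search query",
}

FENCE = "`" * 3  # three backticks, built at runtime to keep this example's own fence intact
FENCED_BODY = re.compile(FENCE + r"[\w-]*\n(.*?)" + FENCE, re.DOTALL)


def build_formulation_prompt(query: str, paradigm: str, context: str) -> str:
    target = TARGET_BY_PARADIGM[paradigm]
    return (
        f"Structural context:\n{context}\n\n"
        f"Question: {query}\n"
        f"Write one executable {target} that answers the question. "
        "Use only tables, predicates, labels, and relationship types "
        "that appear in the structural context."
    )


def extract_query(reply: str) -> str:
    # LLMs often wrap code in a fenced block; keep only its body if present.
    match = FENCED_BODY.search(reply)
    return (match.group(1) if match else reply).strip()


def formulate_query(query: str, paradigm: str, context: str, llm: Callable[[str], str]) -> str:
    return extract_query(llm(build_formulation_prompt(query, paradigm, context)))


def stub_llm(prompt: str) -> str:
    return FENCE + "sql\nSELECT name FROM persons WHERE birthdate = '1978-04-09'\n" + FENCE


print(formulate_query("Who was born on 9 April 1978?", "sql", "persons(id, name, birthdate)", stub_llm))
# SELECT name FROM persons WHERE birthdate = '1978-04-09'
```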

Stage 3: Cross-Source Evidence Selection

Input: Executor outputs from multiple heterogeneous sources

Process: An LLM selects the subset of results relevant to the original query, filtering across different result formats:

Document passages
SQL table rows
RDF triples
Property graph paths

Example:

```text
Results from 3 sources:

Source 1 (Wikidata SPARQL):
- Artist A (born 1978-04-09)
- Artist B (born 1978-04-09)

Source 2 (Wikipedia docs):
- Passage: "Rachel Stevens (born 9 April 1978)..."

Source 3 (SQL celebrity_db):
- Row: {name: "Artist A", birthdate: "1978-04-09"}

Evidence selection:
- Keeps Source 1 results (they answer the query)
- Filters out Source 2 (context, not an answer)
- Drops Source 3 as a duplicate (same fact as Source 1)
```

Key insight: The consolidation step handles format heterogeneity (triples vs rows vs passages) and selects semantically equivalent answers even when surface forms differ.
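Here is a minimal sketch of the consolidation step. It assumes each result has already been verbalized to text. Relevance is delegated to an injected `is_relevant` callable, which in OmniRetrieval is an LLM judgment. Deduplication here is a crude exact match after normalizing case and whitespace. Unlike the LLM, it will not merge facts that are phrased differently. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Evidence:
    source_id: str
    text: str  # verbalized result: passage, row, triple, or path


def normalize(text: str) -> str:
    # Crude surface normalization: lowercase and collapse whitespace.
    return " ".join(text.lower().split())


def select_evidence(
    query: str,
    candidates: list[Evidence],
    is_relevant: Callable[[str, Evidence], bool],
) -> list[Evidence]:
    selected: list[Evidence] = []
    seen: set[str] = set()
    for ev in candidates:
        key = normalize(ev.text)
        if key in seen:
            continue  # same fact already kept from an earlier source
        if is_relevant(query, ev):  # stand-in for the LLM's relevance judgment
            selected.append(ev)
            seen.add(key)
    return selected
```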

The Benchmark: 309 Knowledge Bases, 13 Datasets

OmniRetrieval was evaluated on an extensive benchmark spanning:

Document Search (7 BEIR Datasets)

NFCorpus: Medical (PubMed abstracts)
SciFact: Scientific claim verification
FiQA: Financial question answering
MS MARCO: Web passages
FEVER: Wikipedia fact verification
Natural Questions: Short-answer QA
HotpotQA: Multi-hop reasoning

Relational Databases (286 Databases)

Spider: 206 databases across diverse domains
BIRD: 80 databases from real-world applications

RDF Knowledge Graphs (1 Graph, 3 Datasets)

Wikidata queried via:
- SimpleQuestions: Single-triple factoid queries
- QALD-10: Hand-curated factoid and aggregation queries
- LC-QuAD 2.0: Large-scale compositional queries

Labeled Property Graphs (15 Graphs)

Text2Cypher: Neo4j graphs covering movies, company structures, social networks, financial investigations

Total: 309 distinct knowledge bases; 300 questions per dataset × 13 datasets = 3,900 queries

Results: OmniRetrieval Beats Single-Source Baselines

Evaluated on five LLM backbones (GPT-5.4, Gemini-3.1 Pro, Sonnet-4.6, Qwen-3.5 27B, Gemma-4 31B):

Source Selection Accuracy

Single-backend baselines: 14.73% - 24.84% (each pinned to one paradigm)
KB Routing: 61.65% (picks one source per query)
OmniRetrieval: 65.71% (+4.06pp over KB Routing)
Oracle (perfect selection): 100%

Retrieval Accuracy

Single-backend baselines: 13.69% - 17.93%
KB Routing: 39.98%
OmniRetrieval: 44.34% (+4.36pp, +11% relative improvement)
Oracle: 61.85%

LLM-as-a-Judge (Semantic Equivalence)

Single-backend baselines: 25.65% - 39.49%
KB Routing: 57.99%
OmniRetrieval: 65.88% (+7.89pp)
Oracle: 74.55%

Key finding: The gap to oracle narrows from 34.29pp (source selection) to 17.51pp (retrieval) to 8.67pp (judge). This suggests that cross-source evidence selection often recovers semantically equivalent answers even when source selection misses the gold source.
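The headline deltas can be recomputed from the figures quoted above. This snippet computes three values for each metric: the absolute gain over KB Routing, the relative gain, and the remaining gap to the oracle.

```python
# Figures quoted above (percent): KB Routing, OmniRetrieval, and the oracle.
RESULTS = {
    "source_selection": {"kb_routing": 61.65, "omni": 65.71, "oracle": 100.0},
    "retrieval": {"kb_routing": 39.98, "omni": 44.34, "oracle": 61.85},
    "judge": {"kb_routing": 57.99, "omni": 65.88, "oracle": 74.55},
}


def summarize(results: dict[str, dict[str, float]]) -> dict[str, dict[str, float]]:
    summary = {}
    for metric, r in results.items():
        gain = r["omni"] - r["kb_routing"]
        summary[metric] = {
            "gain_pp": round(gain, 2),                                   # percentage points
            "relative_gain_pct": round(100 * gain / r["kb_routing"], 1),  # percent of baseline
            "gap_to_oracle_pp": round(r["oracle"] - r["omni"], 2),
        }
    return summary


for metric, row in summarize(RESULTS).items():
    print(metric, row)
```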

Why OmniRetrieval Works: Four Key Insights

1. Long-Context Source Selection Scales

Rather than embedding source descriptors into a shared space, OmniRetrieval reads the full catalog of schemas, ontologies, and corpus descriptions directly.

This works because:

Long-context LLMs (GPT-5.4, Gemini-3.1) can handle 128k+ tokens
Structural contexts are heterogeneous but relatively compact (schemas fit in <2k tokens each)
The LLM can reason about actual contents (table names, predicate types) rather than similarity scores

Result: 65.71% source selection accuracy across 309 knowledge bases

2. Native Queries Preserve Structural Affordances

By generating SQL, SPARQL, or Cypher instead of embedding atomic units:

SQL preserves:

Joins across normalized tables
Aggregations (COUNT, SUM, AVG)
Window functions
Subqueries

SPARQL preserves:

Triple pattern matching
Property paths (multi-hop traversals)
OPTIONAL and UNION operators
FILTER constraints

Cypher preserves:

Graph pattern matching
Variable-length paths
Relationship property filtering
Shortest path algorithms

Embedding-based approaches lose all of this.

3. Multi-Candidate Exploration Defers Commitment

OmniRetrieval returns a short list of k candidates (default k=3) rather than committing to one source upfront.

Effect of candidate size (source selection accuracy):

k=1: 57.81% (a single source per query, as in KB Routing)
k=3: 65.71% (+7.9pp)
k=5: 67.12% (+1.41pp)
k=10: 68.29% (+1.17pp)

Insight: Returns diminish beyond k=3 because evidence selection accuracy drops from 67.5% at k=3 to 62.8% at k=10—more candidates introduce more noise.
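Dividing each improvement by the number of extra candidates makes the diminishing returns explicit. Here is a minimal sketch that uses only the accuracy figures listed for k = 1, 3, 5, and 10.

```python
# Accuracy (%) by candidate-list size k, as listed above.
ACCURACY_BY_K = {1: 57.81, 3: 65.71, 5: 67.12, 10: 68.29}


def marginal_gain_per_candidate(acc_by_k: dict[int, float]) -> dict[tuple[int, int], float]:
    """Accuracy gained per additional candidate between consecutive k values."""
    ks = sorted(acc_by_k)
    return {
        (lo, hi): round((acc_by_k[hi] - acc_by_k[lo]) / (hi - lo), 3)
        for lo, hi in zip(ks, ks[1:])
    }


for (lo, hi), gain in marginal_gain_per_candidate(ACCURACY_BY_K).items():
    print(f"k={lo}->{hi}: {gain:+.3f}pp per extra candidate")
```

Each extra candidate is worth about 3.95pp between k=1 and k=3, but only about 0.23pp between k=5 and k=10. Meanwhile, every added candidate costs another query generation and execution.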

4. Cross-Source Evidence Selection Handles Heterogeneity

The final consolidation step verbalizes results from different formats:

SQL results → Natural language:

```text
Row: {company: "Tesla", founded: 2003, employees: 127855}
→ "Tesla was founded in 2003 and has 127,855 employees."
```

SPARQL triples → Natural language:

```text
<Rachel Stevens> wdt:P569 "1978-04-09"
→ "Rachel Stevens was born on April 9, 1978."
```

Cypher paths → Natural language:

```text
(:Person {name: "Lana Wachowski"})-[:DIRECTED]->(:Movie {title: "Speed Racer"})<-[:ACTED_IN]-(:Person {name: "Emile Hirsch"})
→ "Emile Hirsch acted in Speed Racer, directed by Lana Wachowski."
```

The LLM then selects results that answer the query, handling:

Format differences (rows vs triples vs paths)
Semantic equivalence (same info from different sources)
Redundancy elimination
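The sentences above are fluent natural language. A cheaper template-based verbalizer can serve as a first pass or a fallback. The sketch below is a swapped-in technique rather than the paper's verbalization method. The function names and the `Q_RS` identifier are hypothetical.

```python
def verbalize_row(table: str, row: dict[str, object]) -> str:
    # SQL row -> "Row from <table>: col = value; ..."
    fields = "; ".join(f"{column} = {value}" for column, value in row.items())
    return f"Row from {table}: {fields}."


def verbalize_triple(subject: str, predicate: str, obj: str, labels: dict[str, str]) -> str:
    # RDF triple -> sentence, swapping identifiers for labels when known.
    def name(term: str) -> str:
        return labels.get(term, term)

    return f"{name(subject)} {name(predicate)} {name(obj)}."


def verbalize_path(nodes: list[str], relationships: list[tuple[str, bool]]) -> str:
    # Property-graph path -> Cypher-like arrow notation; True means the edge points forward.
    parts = [nodes[0]]
    for (rel_type, forward), node in zip(relationships, nodes[1:]):
        parts.append(f"-[{rel_type}]->" if forward else f"<-[{rel_type}]-")
        parts.append(node)
    return " ".join(parts)


print(verbalize_row("companies", {"company": "Tesla", "founded": 2003}))
print(verbalize_triple("Q_RS", "P569", "1978-04-09", {"Q_RS": "Rachel Stevens", "P569": "date of birth"}))
print(verbalize_path(["Lana Wachowski", "Speed Racer", "Emile Hirsch"], [("DIRECTED", True), ("ACTED_IN", False)]))
```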

Cross-Paradigm Coverage: Where Each Backend Excels

The researchers analyzed which query types each backend can answer:

Document Search has the widest cross-paradigm coverage (28.2% off-diagonal accuracy), especially for SPARQL questions where Wikipedia-derived corpora overlap with Wikidata's factual content.

Structured backends (SQL, SPARQL, Cypher) have narrower coverage (15.2% - 22.1% off-diagonal) because their answers depend on specific schema elements.

Key insight: No single backend is sufficient. Even the best single-paradigm approach (Document Search) only reaches 28.2% cross-paradigm coverage.

OmniRetrieval reaches 65.88% on the LLM-as-a-judge metric by engaging the appropriate backend for each query.

Implementation Details That Matter

Backbone Models

Closed-source: GPT-5.4, Gemini-3.1 Pro, Sonnet-4.6
Open-source: Qwen-3.5 (27B), Gemma-4 (31B) served via vLLM

Document Retrieval

Encoder: all-MiniLM-L6-v2
Query rewriting: Natural language → hypothetical passage → embed (similar to HyDE)
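HyDE-style query rewriting asks an LLM to draft a hypothetical answer passage, then embeds that draft instead of the raw question. Here is a minimal sketch. The generator and encoder are injected as plain callables, so any LLM or encoder can be plugged in. The prompt text, function names, and toy bag-of-words encoder are assumptions made for illustration.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(
    query: str,
    corpus: list[str],
    generate: Callable[[str], str],
    embed: Callable[[str], Vector],
    top_k: int = 3,
) -> list[int]:
    # 1) Rewrite the question into a hypothetical answer passage.
    passage = generate(f"Write a short passage that answers: {query}")
    # 2) Embed the passage, not the question, and rank documents by cosine similarity.
    q_vec = embed(passage)
    scores = [cosine(q_vec, embed(doc)) for doc in corpus]
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Toy stand-ins: a bag-of-words "encoder" and a canned "generator".
VOCAB = ["acrylamide", "cancer", "fries", "risk", "stock", "protein"]


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def toy_generate(prompt: str) -> str:
    return "acrylamide in fries may raise cancer risk"


corpus = ["french fries acrylamide cancer risk", "stock market returns", "protein folding methods"]
print(hyde_search("What is the cancer risk from French fries?", corpus, toy_generate, toy_embed, top_k=1))  # [0]
```

For real use, `embed` could wrap `SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode` from the sentence-transformers library.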

SPARQL Entity Linking

Follows the Think-on-Graph (ToG) procedure for Wikidata entity resolution
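ToG's entity resolution involves more than a single lookup. Still, a label search is a common first step for Wikidata entity linkers. The sketch below uses the public MediaWiki `wbsearchentities` action to fetch candidate QIDs for a surface string. It is a simplified stand-in, not the ToG procedure, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def search_url(label: str, language: str = "en", limit: int = 5) -> str:
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def parse_candidates(payload: dict) -> list[tuple[str, str, str]]:
    # Each hit carries an entity id (QID) plus a label and short description
    # that a later disambiguation step (for example, an LLM) can choose among.
    return [
        (hit["id"], hit.get("label", ""), hit.get("description", ""))
        for hit in payload.get("search", [])
    ]


def link_entity(label: str) -> list[tuple[str, str, str]]:
    # Live network call; Wikimedia asks clients to send a descriptive User-Agent.
    request = Request(search_url(label), headers={"User-Agent": "omniretrieval-sketch/0.1"})
    with urlopen(request, timeout=10) as response:
        return parse_candidates(json.load(response))
```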

Sampling

Temperature: 0.0 (deterministic)
Max tokens: 1024
Single run per configuration (no averaging)

Infrastructure

Open-source models run on a single NVIDIA H200 GPU
Knowledge bases are accessed through public endpoints (Wikidata SPARQL, Neo4j demo servers) or local SQLite files

When OmniRetrieval Struggles: Failure Modes

1. Source Selection Remains the Bottleneck

Even at k=3 candidates, source selection only achieves 65.71% accuracy. The gap to oracle (100%) is largest at this stage.

Why: The catalog contains 309 knowledge bases with similar-sounding schemas. SQL databases in particular (286 of 309 sources) create high ambiguity.

2. Evidence Selection Drops at Higher k

As candidate list size grows from k=3 to k=10, evidence selection accuracy drops from 67.5% to 62.8%.

Why: More candidates introduce more noise, making it harder for the LLM to identify which results actually answer the query.

3. Structured Query Generation Has Schema Linking Errors

Text-to-SQL, text-to-SPARQL, and text-to-Cypher inherit the same failure modes as existing single-backend systems:

Incorrect table/predicate selection
Missing JOIN conditions
Wrong aggregation functions
Entity linking errors (especially for SPARQL)
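One common mitigation is to check generated SQL against the live schema before execution and feed any error back to the generator. The paper does not claim to use this. With SQLite, `EXPLAIN` compiles a statement without running it, so unknown tables or columns surface as errors. Here is a minimal sketch with a hypothetical `persons` table:

```python
import sqlite3


def validate_sql(conn: sqlite3.Connection, sql: str) -> tuple[bool, str]:
    """Compile `sql` against the connected schema without executing it."""
    try:
        conn.execute("EXPLAIN " + sql)
    except sqlite3.Error as exc:
        return False, str(exc)  # e.g. a "no such column" message to feed back to the LLM
    return True, ""


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT, birthdate TEXT)")

print(validate_sql(conn, "SELECT name FROM persons"))       # (True, '')
print(validate_sql(conn, "SELECT birth_date FROM persons"))  # False plus SQLite's error message
```

This catches wrong table and column names, which are schema linking errors. It cannot catch a query that is valid but semantically wrong, such as a missing join condition or the wrong aggregation.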

4. Embedding-Based Baselines Can't Scale

The paper attempted to compare against unified-representation approaches (UniK, UDT, DiFaR) but had to constrain the setup severely:

Only gold-touched triples/edges included for graphs
Random distractors added for balance
Full SQL tables included
Documents kept at full scale

Even in this massively favorable setup, unified embeddings only reached 23% retrieval accuracy vs OmniRetrieval's 46.62%—and this is on a tiny fraction of real-world graph scale.

Fundamental limit: Embedding 15+ billion Wikidata triples, or tens of billions of property graph paths, is impractical.

What This Means for RAG Systems

OmniRetrieval demonstrates that RAG doesn't have to be limited to document retrieval.

Current RAG Stack Limitations

Most production RAG systems look like:

Embed documents into vector database
Embed user query
Retrieve top-k by cosine similarity
Pass to LLM for generation

This only works for unstructured text.

If your knowledge includes:

SQL databases (customer records, transactions, analytics)
Knowledge graphs (entity relationships, ontologies)
Property graphs (social networks, supply chains, recommendations)

You are left with one of three options:

Manually writing SQL/SPARQL/Cypher queries per question
Flattening structured data into text documents (losing structure)
Maintaining separate retrieval pipelines per backend

OmniRetrieval's Alternative

A unified retrieval layer that:

Automatically selects the right knowledge source(s)
Generates native queries (SQL, SPARQL, Cypher, or text retrieval)
Consolidates results across heterogeneous formats
Passes unified context to the generation LLM

Benefits:

Users query in natural language regardless of backend
Structural operations (joins, traversals) preserved
New sources added by registration (no retraining embeddings)
Multiple sources engaged per query when needed

Implications for Enterprise Knowledge Systems

Most enterprises have knowledge fragmented across:

Unstructured:

Confluence/Notion documents
Slack/Teams messages
Email archives
Support tickets

Structured:

Salesforce (CRM)
SAP/Oracle (ERP)
Snowflake/BigQuery (data warehouses)
Neo4j/TigerGraph (graph databases)

Current solution: Build separate search/query interfaces for each.

OmniRetrieval approach: Single natural-language interface that routes to appropriate backends and consolidates results.

Example enterprise query:

"Which customers purchased product X in Q1 2026 and then opened support tickets about installation issues?"

Requires:

SQL query against sales database (purchases)
SQL query against support ticket system (tickets)
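Joining the two result sets across separate systems
Potential text search in ticket descriptions

OmniRetrieval's design would handle this by issuing a native query to each system and consolidating the returned evidence. Because the databases are separate, the join itself happens during cross-source evidence selection rather than inside either database.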

Implementation Roadmap: Building Your Own OmniRetrieval

The KAIST team released code at github.com/JinheonBaek/OmniRetrieval.

Core Components Needed

1. Source Registry

Catalog of available knowledge sources
Structural context (schemas, ontologies) per source
Access credentials/endpoints

2. Source Selector

Long-context LLM (GPT-5.4, Gemini-3.1, Claude Sonnet-4.6)
Prompt template for catalog reading
Ranking logic for top-k candidates

3. Query Generators (Per Backend)

Text-to-SQL: Schema linking + SQL synthesis
Text-to-SPARQL: Entity linking + triple pattern generation
Text-to-Cypher: Graph schema grounding + path queries
Text retrieval: Query rewriting (optional)

4. Execution Engines

SQL: Database connectors (SQLite, PostgreSQL, MySQL)
SPARQL: RDF endpoint clients (Wikidata, custom)
Cypher: Neo4j connector
Text: Vector database (Pinecone, Weaviate, Milvus)

5. Evidence Selector

Result verbalizer (format-specific)
LLM-based relevance filtering
Deduplication logic

Practical Deployment Considerations

Latency:

Source selection: 1-2 seconds (long-context LLM call)
Query generation: 0.5-1 second per source (can parallelize)
Execution: Varies by backend (SQL <1s, SPARQL 1-5s, text retrieval <1s)
Evidence selection: 1-2 seconds

Total: 4-10 seconds for k=3 candidates
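Each selected source is handled independently, so query generation and execution for the k candidates can run concurrently. That stage then costs roughly as much as the slowest source rather than the sum of all sources. Here is a minimal asyncio sketch. In it, `asyncio.sleep` stands in for the LLM call plus the backend round trip, and the delays are made up.

```python
import asyncio
import time


async def handle_source(source_id: str, delay_s: float) -> str:
    # Stand-in for per-source query generation plus backend execution.
    await asyncio.sleep(delay_s)
    return f"{source_id}: done"


async def fan_out(delays: dict[str, float]) -> list[str]:
    # gather() runs the coroutines concurrently and returns results in input order.
    return await asyncio.gather(*(handle_source(s, d) for s, d in delays.items()))


def timed_fan_out(delays: dict[str, float]) -> tuple[list[str], float]:
    start = time.perf_counter()
    results = asyncio.run(fan_out(delays))
    return results, time.perf_counter() - start


results, elapsed = timed_fan_out({"wikidata": 0.3, "sales_db": 0.1, "wiki_docs": 0.2})
print(results)
print(f"{elapsed:.2f}s elapsed, versus a 0.6s sequential sum")
```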

Cost (per query):

Source selection: ~5k-10k input tokens (catalog size)
Query generation: ~2k input tokens × k candidates
Evidence selection: ~1k-3k input tokens

At GPT-5.4 pricing: ~$0.01-0.03 per query
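The per-query token budget follows directly from the estimates above. This sketch sums input tokens only and ignores output tokens. It takes the price per million input tokens as a parameter, since no specific model price is assumed here.

```python
def input_tokens_per_query(catalog: int, per_source_generation: int, evidence: int, k: int = 3) -> int:
    """Source selection reads the catalog once; query generation runs once per candidate."""
    return catalog + per_source_generation * k + evidence


def input_cost_usd(tokens: int, usd_per_million_input_tokens: float) -> float:
    return tokens * usd_per_million_input_tokens / 1_000_000


low = input_tokens_per_query(catalog=5_000, per_source_generation=2_000, evidence=1_000)
high = input_tokens_per_query(catalog=10_000, per_source_generation=2_000, evidence=3_000)
print(low, high)  # 12000 19000
```

Multiplying these totals by a provider's actual input price (via `input_cost_usd`) and adding output-token costs gives a per-query estimate.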

Scaling:

Add new sources by appending to catalog (no retraining)
Catalog size grows linearly with sources
Long-context LLMs handle catalogs up to 100k tokens (~500-1000 sources)

Future Directions

The KAIST team identifies several areas for improvement:

1. Fine-Tuned Evidence Selection

Current approach uses zero-shot LLM prompting. Supervised fine-tuning on labeled cross-source selections could improve accuracy.

2. Reinforcement Learning from Answer Quality

Use downstream answer correctness as reward signal to improve source selection and evidence ranking.

3. Operator-Specific Specialization

Rather than a single shared LLM, specialize models for:

Source selection
Per-backend query generation
Evidence consolidation

4. Handling Temporal and Versioned Sources

Current approach assumes static knowledge bases. Real-world sources change over time.

5. Interactive Refinement

Allow users to provide feedback on selected sources and refine queries iteratively.

Comparison to Related Work

vs. Text-to-SQL Systems (Spider, BIRD)

OmniRetrieval advantage: Works across multiple backends, not just SQL

Limitation: Individual SQL generation quality may trail specialized text-to-SQL models

vs. Universal RAG (UniversalRAG, UniK)

OmniRetrieval advantage: Preserves structural operators instead of embedding everything

Trade-off: Higher complexity, more moving parts

vs. LLM Tool Use (ReAct, Toolformer)

OmniRetrieval advantage: Specialized for knowledge retrieval with schema-grounded query synthesis

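Difference: Tool use is generic function calling. OmniRetrieval grounds each generated query in the target source's schema or ontology, which matters for complex sources (e.g., databases with 100+ table schemas).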

vs. Hybrid Search Systems

OmniRetrieval advantage: Handles graph traversals and multi-hop reasoning, not just keyword + vector

The Bigger Picture: Toward Universal Knowledge Interfaces

OmniRetrieval represents a shift from homogenization to coordination.

Rather than forcing everything into a shared representation that loses structure, build a meta-layer that:

Understands the question
Knows what sources exist and what they contain
Speaks each source's native language
Synthesizes results into coherent answers

This is how humans work:

We don't memorize all knowledge in one format
We know which books, databases, experts, or tools to consult
We query each appropriately
We integrate findings from multiple sources

OmniRetrieval automates this for machines.

Practical Takeaways

For Researchers

Unified embeddings hit fundamental scale limits for structured data
Long-context LLMs enable heterogeneous catalog reasoning without shared representations
Multi-candidate exploration + deferred commitment outperforms single-source routing
Evidence selection recovers from imperfect source selection, narrowing the gap to oracle

For RAG Engineers

Your retrieval layer can cover SQL, graphs, and text—not just documents
Native query generation preserves structural operations embeddings can't express
Cross-source consolidation is a natural training target: the authors suggest supervised fine-tuning could improve evidence selection
Cost and latency look manageable, at roughly $0.01-0.03 and 4-10 seconds per query by the estimates above

For Enterprise Architects

Fragmented knowledge systems can share a natural-language interface
New data sources integrate by registration, not infrastructure rebuilds
Structured and unstructured knowledge complement each other—don't force a choice

Conclusion: The End of Single-Backend Retrieval

OmniRetrieval doesn't just benchmark higher than existing approaches.

It demonstrates a fundamentally different architecture for knowledge access:

Old paradigm: Pick your backend (text, SQL, or graph), build a specialized retriever, accept that other knowledge is unreachable.

New paradigm: Register all knowledge sources, let the system route queries to appropriate backends in native languages, consolidate heterogeneous results.

As knowledge continues fragmenting across incompatible formats—unstructured documents, relational databases, knowledge graphs, property graphs, vector databases, and future formats we haven't invented yet—the coordination approach scales where homogenization fails.

The 309 knowledge bases in this benchmark are a tiny slice of enterprise knowledge, which is a tiny slice of human knowledge.

But OmniRetrieval proves the path forward:

Meet each source on its own terms. Preserve what makes each valuable. Unify at the interface, not the representation.

That's how you build retrieval systems for the real world.

Paper: OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

Authors: Jinheon Baek, Soyeong Jeong, Sangwoo Park, Woongyeong Yeo, Minki Kang, Patara Trirat, Heejun Lee, Sung Ju Hwang (KAIST & DeepAuto.ai)

Code: github.com/JinheonBaek/OmniRetrieval

Benchmark: 13 datasets, 309 knowledge bases (BEIR, Spider, BIRD, SimpleQuestions, QALD-10, LC-QuAD 2.0, Text2Cypher)
