Real-world questions don't respect database boundaries.
Ask "Which artists were born on the same date as Rachel Stevens?" and the answer lives in Wikidata's RDF knowledge graph, queryable via SPARQL.
Ask "How many online purchases did Ole Group make in May 2019?" and you need SQL against a normalized relational database.
Ask "Which actors acted in movies directed by the person who directed Speed Racer?" and you're traversing a labeled property graph with Cypher.
Ask "What is the cancer risk from French fries?" and you're searching unstructured biomedical documents.
Current retrieval systems force you to pick one. Use a document retriever (BM25, DPR) for text, text-to-SQL for databases, text-to-SPARQL for knowledge graphs, or text-to-Cypher for property graphs—but not all of them for a single question.
The natural solution seems obvious: collapse everything into a shared embedding space and retrieve by similarity.
Except that doesn't work.
As researchers from KAIST and DeepAuto.ai demonstrate in their new paper "OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources," flattening structured data into embeddings throws away the structural affordances—joins, traversals, compositional operators—that make each source valuable in the first place.
Their alternative: meet each knowledge source on its own terms.
The Problem: Retrieval Is Fragmented Across Incompatible Backends
Modern knowledge exists in structurally diverse forms:
1. Unstructured Text Corpora
- Medical articles (PubMed)
- Wikipedia passages
- Financial documents
- Support tickets
Query method: Free-form natural language → BM25 or dense retrieval
2. Relational Databases (SQL)
- Enterprise databases
- E-commerce transactions
- Analytics warehouses
Query method: Natural language → SQL with joins, aggregations, filters
3. RDF Knowledge Graphs (SPARQL)
- Wikidata (15+ billion triples)
- DBpedia
- Domain ontologies
Query method: Natural language → SPARQL triple patterns and property paths
4. Labeled Property Graphs (Cypher)
- Neo4j graphs
- Social networks
- Supply chains
- Recommendation systems
Query method: Natural language → Cypher graph traversals
The fragmentation problem: Each backend has:
- Its own native query language
- Its own execution engine
- Its own structural context (schema, ontology, graph model)
- Its own result format (passages, rows, triples, paths)
Existing retrieval approaches operate on one source at a time, leaving the broader knowledge landscape unreachable behind incompatible interfaces.
The Failed Solution: Unified Embeddings
The obvious approach is to project everything into a shared vector space:
- Embed all documents, table rows, knowledge graph triples, and graph paths
- Embed the user's query
- Retrieve by cosine similarity
Why this fails:
1. Modality Gap Bias
Embeddings cluster by source type rather than semantic content. The retriever biases toward sources that look like the query structurally, not sources that answer it.
2. Loss of Structural Operators
SQL joins become individual row embeddings. Graph traversals become separate edge embeddings. Multi-hop reasoning is lost.
Consider the query: "Find companies founded by MIT graduates who later joined Google."
SQL version:
SELECT DISTINCT c.name
FROM companies c
JOIN founders f ON c.id = f.company_id
JOIN education e ON f.person_id = e.person_id
JOIN employment emp ON f.person_id = emp.person_id
WHERE e.institution = 'MIT'
AND emp.company = 'Google'
AND emp.start_date > f.founded_date
Embedding version: You get individual rows for companies, founders, education records, and employment records. The join logic is gone.
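The contrast can be reproduced on a toy database. The sketch below assumes a made-up four-table schema that mirrors the table names in the SQL above (the rows, company names, and dates are invented for illustration) and compares the join against a crude per-row lookup that stands in for row-level retrieval.

```python
import sqlite3

# Hypothetical toy data; only the table and column names follow the SQL above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE companies  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE founders   (company_id INTEGER, person_id INTEGER, founded_date TEXT);
CREATE TABLE education  (person_id INTEGER, institution TEXT);
CREATE TABLE employment (person_id INTEGER, company TEXT, start_date TEXT);

INSERT INTO companies  VALUES (1, 'AlphaCo'), (2, 'BetaCo');
INSERT INTO founders   VALUES (1, 10, '2010-01-01'), (2, 20, '2012-06-01');
INSERT INTO education  VALUES (10, 'MIT'), (20, 'MIT');
INSERT INTO employment VALUES (10, 'Google', '2015-03-01'), (20, 'Google', '2011-01-01');
""")

JOIN_QUERY = """
SELECT DISTINCT c.name
FROM companies c
JOIN founders f     ON c.id = f.company_id
JOIN education e    ON f.person_id = e.person_id
JOIN employment emp ON f.person_id = emp.person_id
WHERE e.institution = 'MIT'
  AND emp.company = 'Google'
  AND emp.start_date > f.founded_date
"""

def companies_via_join(connection):
    """Run the four-table join; ISO date strings compare correctly as text."""
    return [row[0] for row in connection.execute(JOIN_QUERY)]

def rows_mentioning(connection, keyword):
    """Stand-in for per-row retrieval: every row is matched in isolation."""
    hits = []
    for table in ("companies", "founders", "education", "employment"):
        for row in connection.execute(f"SELECT * FROM {table}"):
            if any(keyword.lower() in str(value).lower() for value in row):
                hits.append((table, row))
    return hits

print(companies_via_join(conn))          # only the founder who joined Google after founding
print(rows_mentioning(conn, "Google"))   # employment rows only, with no company name attached
```

The per-row lookup returns both employment rows, including the founder who worked at Google before founding, because the date comparison lives in the join and not in any single row.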
3. Scale Impossibility
Wikidata has 15+ billion triples. The number of paths in a property graph grows exponentially with hop length, so embedding every path quickly becomes impractical: one graph in the benchmark has tens of billions of 3-hop paths.
Materializing a shared embedding space for real-world knowledge sources is computationally infeasible.
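A back-of-the-envelope estimate shows how quickly path counts grow. The node count and average out-degree below are hypothetical values chosen for illustration, not figures from the benchmark, and the formula counts walks (repeated nodes allowed), so it is a rough upper bound.

```python
def estimated_paths(num_nodes: int, avg_out_degree: int, hops: int) -> int:
    """Rough upper bound: every node starts avg_out_degree**hops walks."""
    return num_nodes * avg_out_degree ** hops

# Hypothetical graph size, used only to illustrate the growth rate.
NUM_NODES = 1_000_000
AVG_OUT_DEGREE = 30

for hops in (1, 2, 3, 4):
    print(hops, f"{estimated_paths(NUM_NODES, AVG_OUT_DEGREE, hops):.1e}")
```

Under these assumptions a graph with a million nodes already has about 2.7e10 three-hop walks, and each extra hop multiplies that by the average degree.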
OmniRetrieval's Solution: Native Query Dispatch
Instead of homogenization, OmniRetrieval provides a coordination layer that:
- Identifies which knowledge sources are relevant
- Generates executable queries in each source's native language
- Consolidates heterogeneous results into a unified answer
The Three-Stage Pipeline
Stage 1: Source Selection
Input: Natural language query + catalog of source descriptors
Process: A long-context LLM reads the full catalog of structural contexts (schemas, ontologies, corpus descriptions) and returns a ranked list of candidate sources.
Example:
Query: "Which artists were born on the same date as Rachel Stevens?"
Catalog contains:
- 7 document corpora (medical, scientific, financial, Wikipedia)
- 286 SQL databases (various domains)
- Wikidata RDF graph (15B+ triples, encyclopedic facts)
- 15 property graphs (movies, social networks, companies)
Selected: Wikidata (contains birth dates of public figures)
Key insight: The LLM can reason over heterogeneous descriptors (table schemas, graph ontologies, corpus summaries) directly, without forcing them into a shared representation.
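A minimal sketch of this stage, assuming the catalog is a plain mapping from source id to descriptor text and the model is any callable that takes a prompt and returns text. The prompt wording, the JSON reply format, and the names `build_selection_prompt` and `select_sources` are illustrative assumptions, not the paper's implementation.

```python
import json
from typing import Callable

def build_selection_prompt(question: str, catalog: dict[str, str], k: int) -> str:
    """Concatenate every source descriptor into one long-context prompt."""
    entries = "\n".join(f"[{source_id}] {text}" for source_id, text in catalog.items())
    return (
        f"Question: {question}\n\n"
        f"Available sources:\n{entries}\n\n"
        f"Return a JSON list of at most {k} source ids, most relevant first."
    )

def select_sources(question: str, catalog: dict[str, str],
                   llm: Callable[[str], str], k: int = 3) -> list[str]:
    """Ask the model for a ranked list and keep only ids that exist in the catalog."""
    reply = llm(build_selection_prompt(question, catalog, k))
    try:
        ranked = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(ranked, list):
        return []
    valid: list[str] = []
    for source_id in ranked:
        if isinstance(source_id, str) and source_id in catalog and source_id not in valid:
            valid.append(source_id)
    return valid[:k]

# Hypothetical three-source catalog and a canned reply standing in for a real model.
catalog = {
    "wikidata": "RDF graph of encyclopedic facts (people, dates, occupations)",
    "movies_graph": "Property graph: (:Person)-[:DIRECTED|ACTED_IN]->(:Movie)",
    "nfcorpus": "Text corpus of medical abstracts",
}
fake_llm = lambda prompt: '["wikidata", "made_up_source", "movies_graph"]'
print(select_sources("Which artists share Rachel Stevens's birth date?", catalog, fake_llm))
```

Filtering the reply against the catalog guards against the model naming a source that was never registered.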
Stage 2: Query Formulation
Input: Query + structural context for each selected source
Process: For each source, generate an executable native query conditioned on its schema/ontology.
Example:
For Wikidata (SPARQL):
SELECT DISTINCT ?artist WHERE {
  ?rachel rdfs:label "Rachel Stevens"@en .
  ?rachel wdt:P569 ?birthdate .
  ?artist wdt:P569 ?birthdate .
  ?artist wdt:P106 wd:Q483501 .  # occupation: artist
  FILTER (?artist != ?rachel)
}
For a SQL database:
SELECT a.name
FROM artists a
JOIN persons p1 ON a.person_id = p1.id
JOIN persons p2 ON p1.birthdate = p2.birthdate
WHERE p2.name = 'Rachel Stevens'
AND a.person_id != p2.id
For a property graph (Cypher):
MATCH (rachel:Person {name: "Rachel Stevens"})-[:BORN_ON]->(date:Date)
MATCH (artist:Artist)-[:BORN_ON]->(date)
WHERE artist <> rachel
RETURN artist.name
Key insight: Each query is grounded in the source's actual schema (table names, predicates, relationship types), not generic templates.
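One way to picture this stage is a dispatcher that picks the query language from the source kind, conditions the prompt on that source's structural context, and hands the generated query to the matching executor. In the sketch below the kind labels, prompt wording, toy `persons` table, and canned model reply are all assumptions made for illustration; only the SQL executor is wired up.

```python
import sqlite3
from typing import Any, Callable

# Query language requested for each source kind (the kind labels are hypothetical).
QUERY_LANGUAGE = {"sql": "SQL", "rdf": "SPARQL", "lpg": "Cypher"}

def build_formulation_prompt(kind: str, question: str, structural_context: str) -> str:
    """Condition the request on the source's own schema or ontology."""
    return (
        f"Write one executable {QUERY_LANGUAGE[kind]} query that answers the question.\n"
        f"Use only names that appear in the structural context.\n\n"
        f"Structural context:\n{structural_context}\n\n"
        f"Question: {question}\n"
        f"Return only the query."
    )

def formulate_and_run(kind: str, question: str, structural_context: str,
                      llm: Callable[[str], str],
                      executors: dict[str, Callable[[str], Any]]) -> tuple[str, Any]:
    """Generate a native query, then execute it with the backend's own engine."""
    query = llm(build_formulation_prompt(kind, question, structural_context)).strip()
    return query, executors[kind](query)

# Toy SQL source with invented rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT, birthdate TEXT);
INSERT INTO persons VALUES (1, 'Rachel Stevens', '1978-04-09'),
                           (2, 'Artist A', '1978-04-09'),
                           (3, 'Artist B', '1980-01-01');
""")

def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned query."""
    return ("SELECT p1.name FROM persons p1 "
            "JOIN persons p2 ON p1.birthdate = p2.birthdate "
            "WHERE p2.name = 'Rachel Stevens' AND p1.id != p2.id")

executors = {"sql": lambda query: conn.execute(query).fetchall()}
query, rows = formulate_and_run("sql", "Who shares a birth date with Rachel Stevens?",
                                "persons(id, name, birthdate)", fake_llm, executors)
print(rows)
```

SPARQL and Cypher executors would slot into the same `executors` mapping, each wrapping its own client.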
Stage 3: Cross-Source Evidence Selection
Input: Executor outputs from multiple heterogeneous sources
Process: An LLM selects the subset of results relevant to the original query, filtering across different result formats:
- Document passages
- SQL table rows
- RDF triples
- Property graph paths
Example:
Results from 3 sources (illustrative; artist names are placeholders):
Source 1 (Wikidata SPARQL):
- Artist: Artist A (born 1978-04-09)
- Artist: Artist B (born 1978-04-09)
Source 2 (Wikipedia docs):
- Passage: "Rachel Stevens (born 9 April 1978)..."
Source 3 (SQL celebrity_db):
- Row: {name: "Artist A", birthdate: "1978-04-09"}
Evidence Selection:
- Keeps Source 1 results (they answer the query directly)
- Filters out Source 2 (context, not an answer)
- Deduplicates Source 3 (same info as Source 1)
Key insight: The consolidation step handles format heterogeneity (triples vs rows vs passages) and selects semantically equivalent answers even when surface forms differ.
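The filter-then-deduplicate logic can be sketched as follows. The `Evidence` record, the `normalize` helper, and the toy relevance check are hypothetical; in the described pipeline the relevance decision is made by an LLM, which the `is_relevant` callable stands in for here.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Evidence:
    source_id: str
    answer: str   # the entity or value this result proposes as an answer
    text: str     # verbalized form that would be shown to the LLM

def normalize(answer: str) -> str:
    """Collapse case and whitespace so equivalent surface forms share one key."""
    return " ".join(answer.lower().split())

def select_evidence(question: str, candidates: list[Evidence],
                    is_relevant: Callable[[str, str], bool]) -> list[Evidence]:
    """Keep relevant results, dropping later duplicates of an already-kept answer."""
    kept: list[Evidence] = []
    seen: set[str] = set()
    for item in candidates:
        if not is_relevant(question, item.text):
            continue
        key = normalize(item.answer)
        if key in seen:
            continue
        seen.add(key)
        kept.append(item)
    return kept

# Placeholder results in three different formats.
candidates = [
    Evidence("wikidata", "Artist A", "Artist A was born on 1978-04-09."),
    Evidence("wikipedia", "Rachel Stevens", "Rachel Stevens (born 9 April 1978)..."),
    Evidence("celebrity_db", "artist  a", "name = Artist A; birthdate = 1978-04-09"),
]

# Toy judge: treats text about the query subject herself as context, not an answer.
toy_judge = lambda question, text: not text.startswith("Rachel Stevens")

selected = select_evidence("Which artists share Rachel Stevens's birth date?",
                           candidates, toy_judge)
print([item.source_id for item in selected])
```

Exact-match keys only catch trivial variants; recognizing that two differently worded results are equivalent is the part the LLM is relied on for.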
The Benchmark: 309 Knowledge Bases, 13 Datasets
OmniRetrieval was evaluated on an extensive benchmark spanning:
Document Search (7 BEIR Datasets)
- NFCorpus: Medical (PubMed abstracts)
- SciFact: Scientific claim verification
- FiQA: Financial question answering
- MS MARCO: Web passages
- FEVER: Wikipedia fact verification
- Natural Questions: Short-answer QA
- HotpotQA: Multi-hop reasoning
Relational Databases (286 Databases)
- Spider: 206 databases across diverse domains
- BIRD: 80 databases from real-world applications
RDF Knowledge Graphs (1 Graph, 3 Datasets)
- Wikidata queried via:
- SimpleQuestions: Single-triple factoid queries
- QALD-10: Hand-curated factoid and aggregation queries
- LC-QuAD 2.0: Large-scale compositional queries
Labeled Property Graphs (15 Graphs)
- Text2Cypher: Neo4j graphs covering movies, company structures, social networks, financial investigations
Total: 309 distinct knowledge bases, 300 questions per dataset = 3,900 total queries
Results: OmniRetrieval Beats Single-Source Baselines
Evaluated on five LLM backbones (GPT-5.4, Gemini-3.1 Pro, Sonnet-4.6, Qwen-3.5 27B, Gemma-4 31B):
Source Selection Accuracy
- Single-backend baselines: 14.73% - 24.84% (each pinned to one paradigm)
- KB Routing: 61.65% (picks one source per query)
- OmniRetrieval: 65.71% (+4.06pp over KB Routing)
- Oracle (perfect selection): 100%
Retrieval Accuracy
- Single-backend baselines: 13.69% - 17.93%
- KB Routing: 39.98%
- OmniRetrieval: 44.34% (+4.36pp, +11% relative improvement)
- Oracle: 61.85%
LLM-as-a-Judge (Semantic Equivalence)
- Single-backend baselines: 25.65% - 39.49%
- KB Routing: 57.99%
- OmniRetrieval: 65.88% (+7.89pp)
- Oracle: 74.55%
Key finding: The gap to oracle narrows from 34.29pp (source selection) → 17.51pp (retrieval) → 8.67pp (judge), showing that cross-source evidence selection often recovers semantically equivalent answers even when source selection misses the gold standard.
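The gaps follow directly from the scores listed above. A short check, using only those reported numbers:

```python
# Scores (%) as listed in the three result groups above.
SCORES = {
    "source selection": {"omniretrieval": 65.71, "oracle": 100.00},
    "retrieval":        {"omniretrieval": 44.34, "oracle": 61.85},
    "llm-as-a-judge":   {"omniretrieval": 65.88, "oracle": 74.55},
}

def gap_to_oracle(stage: str) -> float:
    """Oracle score minus OmniRetrieval score, in percentage points."""
    row = SCORES[stage]
    return round(row["oracle"] - row["omniretrieval"], 2)

for stage in SCORES:
    print(f"{stage}: {gap_to_oracle(stage):.2f} pp")
```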
Why OmniRetrieval Works: Four Key Insights
1. Long-Context Source Selection Scales
Rather than embedding source descriptors into a shared space, OmniRetrieval reads the full catalog of schemas, ontologies, and corpus descriptions directly.
This works because:
- Long-context LLMs (GPT-5.4, Gemini-3.1) can handle 128k+ tokens
- Structural contexts are heterogeneous but relatively compact (schemas fit in <2k tokens each)
- The LLM can reason about actual contents (table names, predicate types) rather than similarity scores
Result: 65.71% source selection accuracy across 309 knowledge bases
2. Native Queries Preserve Structural Affordances
By generating SQL, SPARQL, or Cypher instead of embedding atomic units:
SQL preserves:
- Joins across normalized tables
- Aggregations (COUNT, SUM, AVG)
- Window functions
- Subqueries
SPARQL preserves:
- Triple pattern matching
- Property paths (multi-hop traversals)
- OPTIONAL and UNION operators
- FILTER constraints
Cypher preserves:
- Graph pattern matching
- Variable-length paths
- Relationship property filtering
- Shortest path algorithms
Embedding-based approaches lose all of this.
3. Multi-Candidate Exploration Defers Commitment
OmniRetrieval returns a short list of k candidates (default k=3) rather than committing to one source upfront.
Effect of candidate size:
- k=1: 57.81% source selection accuracy (a single source per query, as in KB Routing)
- k=3: 65.71% source selection accuracy (+7.9pp)
- k=5: 67.12% source selection accuracy (+1.41pp)
- k=10: 68.29% source selection accuracy (+1.17pp)
Insight: Returns diminish beyond k=3. Evidence selection accuracy also drops from 67.5% at k=3 to 62.8% at k=10, because more candidates introduce more noise.
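Deferring commitment means running the top-k candidates and letting a later stage decide among their results. A minimal sketch, assuming a hypothetical `run_source` callable that formulates and executes a query for one source; running the candidates in threads is a design choice made here, not something the source specifies.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def explore_candidates(question: str, ranked_sources: list[str],
                       run_source: Callable[[str, str], list],
                       k: int = 3) -> dict[str, list]:
    """Query the top-k sources concurrently; a failing source yields no results."""
    chosen = ranked_sources[:k]

    def safe_run(source_id: str) -> list:
        try:
            return run_source(source_id, question)
        except Exception:
            # A broken query or unreachable backend should not sink the others.
            return []

    with ThreadPoolExecutor(max_workers=max(1, len(chosen))) as pool:
        results = list(pool.map(safe_run, chosen))  # map preserves input order
    return dict(zip(chosen, results))

def fake_run(source_id: str, question: str) -> list:
    """Stand-in executor: one source is 'down', the rest return a dummy row."""
    if source_id == "broken":
        raise RuntimeError("backend unavailable")
    return [f"{source_id}: result"]

print(explore_candidates("q", ["a", "broken", "c", "d"], fake_run, k=3))
```

Because every candidate's output is kept, the evidence selection stage still sees results from a lower-ranked source when the top-ranked one turns out to be wrong.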
4. Cross-Source Evidence Selection Handles Heterogeneity
The final consolidation step verbalizes results from different formats:
SQL results → Natural language:
Row: {company: "Tesla", founded: 2003, employees: 127855}
→ "Tesla was founded in 2003 and has 127,855 employees."
SPARQL triples → Natural language:
<Rachel Stevens entity ID> <P569> "1978-04-09"
→ "Rachel Stevens was born on April 9, 1978."
Cypher paths → Natural language:
(:Person {name: "Lana Wachowski"})-[:DIRECTED]->(:Movie {title: "Speed Racer"})<-[:ACTED_IN]-(:Person {name: "Emile Hirsch"})
→ "Emile Hirsch acted in Speed Racer, directed by Lana Wachowski."
The LLM then selects results that answer the query, handling:
- Format differences (rows vs triples vs paths)
- Semantic equivalence (same info from different sources)
- Redundancy elimination
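Verbalization could be approximated with simple templates before any LLM sees the results. The three helpers below are hypothetical and produce plainer text than the fluent sentences shown above; they only illustrate mapping rows, triples, and paths into one textual form.

```python
def verbalize_row(row: dict) -> str:
    """Render a table row as 'column = value' pairs."""
    return "; ".join(f"{column} = {value}" for column, value in row.items())

def verbalize_triple(subject: str, predicate: str, obj: str, labels: dict[str, str]) -> str:
    """Render a triple, swapping ids for human-readable labels where known."""
    parts = [labels.get(term, term) for term in (subject, predicate, obj)]
    return " ".join(parts) + "."

def verbalize_path(steps: list[tuple[str, str, str]]) -> str:
    """Render each (head, RELATIONSHIP_TYPE, tail) step as a short clause."""
    clauses = [f"{head} {relation.replace('_', ' ').lower()} {tail}"
               for head, relation, tail in steps]
    return "; ".join(clauses)

print(verbalize_row({"company": "Tesla", "founded": 2003, "employees": 127855}))
print(verbalize_triple("Rachel Stevens", "P569", "1978-04-09", {"P569": "was born on"}))
print(verbalize_path([("Lana Wachowski", "DIRECTED", "Speed Racer"),
                      ("Emile Hirsch", "ACTED_IN", "Speed Racer")]))
```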
Cross-Paradigm Coverage: Where Each Backend Excels
The researchers analyzed which query types each backend can answer:
Document Search has the widest cross-paradigm coverage (28.2% off-diagonal accuracy), especially for SPARQL questions where Wikipedia-derived corpora overlap with Wikidata's factual content.
Structured backends (SQL, SPARQL, Cypher) have narrower coverage (15.2% - 22.1% off-diagonal) because their answers depend on specific schema elements.
Key insight: No single backend is sufficient. Even the best single-paradigm approach (Document Search) only reaches 28.2% cross-paradigm coverage.
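OmniRetrieval reaches 65.88% on the LLM-as-a-judge metric by engaging the right backend per query. Note that this is an end-to-end score rather than a coverage figure, so the two numbers are not directly comparable.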
Implementation Details That Matter
Backbone Models
- Closed-source: GPT-5.4, Gemini-3.1 Pro, Sonnet-4.6
- Open-source: Qwen-3.5 (27B), Gemma-4 (31B) served via vLLM
Document Retrieval
- Encoder: all-MiniLM-L6-v2
- Query rewriting: Natural language → hypothetical passage → embed (similar to HyDE)
SPARQL Entity Linking
- Follows ToG (Think-on-Graph) procedure for Wikidata entity resolution
Sampling
- Temperature: 0.0 (deterministic)
- Max tokens: 1024
- Single run per configuration (no averaging)
Infrastructure
- Open-source models run on single NVIDIA H200 GPU
- All knowledge bases accessed through public endpoints (Wikidata SPARQL, Neo4j demo servers, SQLite files)
When OmniRetrieval Struggles: Failure Modes
1. Source Selection Remains the Bottleneck
Even at k=3 candidates, source selection only achieves 65.71% accuracy. The gap to oracle (100%) is largest at this stage.
Why: The catalog contains 309 knowledge bases with similar-sounding schemas. SQL databases in particular (286 of 309 sources) create high ambiguity.
2. Evidence Selection Drops at Higher k
As candidate list size grows from k=3 to k=10, evidence selection accuracy drops from 67.5% to 62.8%.
Why: More candidates introduce more noise, making it harder for the LLM to identify which results actually answer the query.
3. Structured Query Generation Has Schema Linking Errors
Text-to-SQL, text-to-SPARQL, and text-to-Cypher inherit the same failure modes as existing single-backend systems:
- Incorrect table/predicate selection
- Missing JOIN conditions
- Wrong aggregation functions
- Entity linking errors (especially for SPARQL)
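Some of these errors can be caught before execution. As one possible mitigation (not something the source describes), the sketch below asks SQLite to compile a generated query with EXPLAIN, which fails on unknown tables or columns without running the query. The `check_sql` helper and toy schema are assumptions; logically wrong joins or aggregations still pass this check.

```python
import sqlite3
from typing import Optional

def check_sql(connection: sqlite3.Connection, query: str) -> Optional[str]:
    """Return None if SQLite can compile the query, else the error message."""
    try:
        # EXPLAIN compiles the statement and returns its plan instead of running it.
        connection.execute(f"EXPLAIN {query}")
    except sqlite3.Error as error:
        return str(error)
    return None

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons (id INTEGER PRIMARY KEY, name TEXT, birthdate TEXT)")

print(check_sql(conn, "SELECT name FROM persons WHERE birthdate = '1978-04-09'"))
print(check_sql(conn, "SELECT name FROM people"))             # wrong table name
print(check_sql(conn, "SELECT birth_date FROM persons"))      # wrong column name
```

The returned message could be fed back to the query generator for one repair attempt.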
4. Embedding-Based Baselines Can't Scale
The paper attempted to compare against unified-representation approaches (UniK, UDT, DiFaR) but had to constrain the setup severely:
- Only gold-touched triples/edges included for graphs
- Random distractors added for balance
- Full SQL tables included
- Documents kept at full scale
Even in this massively favorable setup, unified embeddings only reached 23% retrieval accuracy vs OmniRetrieval's 46.62%—and this is on a tiny fraction of real-world graph scale.
Fundamental limit: You can't embed 15 billion Wikidata triples or tens of billions of property graph paths.
What This Means for RAG Systems
OmniRetrieval demonstrates that RAG doesn't have to be limited to document retrieval.
Current RAG Stack Limitations
Most production RAG systems look like:
- Embed documents into vector database
- Embed user query
- Retrieve top-k by cosine similarity
- Pass to LLM for generation
This only works for unstructured text.
If your knowledge includes:
- SQL databases (customer records, transactions, analytics)
- Knowledge graphs (entity relationships, ontologies)
- Property graphs (social networks, supply chains, recommendations)
You're stuck either:
- Manually writing SQL/SPARQL/Cypher queries per question
- Flattening structured data into text documents (losing structure)
- Maintaining separate retrieval pipelines per backend
OmniRetrieval's Alternative
A unified retrieval layer that:
- Automatically selects the right knowledge source(s)
- Generates native queries (SQL, SPARQL, Cypher, or text retrieval)
- Consolidates results across heterogeneous formats
- Passes unified context to the generation LLM
Benefits:
- Users query in natural language regardless of backend
- Structural operations (joins, traversals) preserved
- New sources added by registration (no retraining embeddings)
- Multiple sources engaged per query when needed
Implications for Enterprise Knowledge Systems
Most enterprises have knowledge fragmented across:
Unstructured:
- Confluence/Notion documents
- Slack/Teams messages
- Email archives
- Support tickets
Structured:
- Salesforce (CRM)
- SAP/Oracle (ERP)
- Snowflake/BigQuery (data warehouses)
- Neo4j/TigerGraph (graph databases)
Current solution: Build separate search/query interfaces for each.
OmniRetrieval approach: Single natural-language interface that routes to appropriate backends and consolidates results.
Example enterprise query:
"Which customers purchased product X in Q1 2026 and then opened support tickets about installation issues?"
Requires:
- SQL query against sales database (purchases)
- SQL query against support ticket system (tickets)
- JOIN across separate systems
- Potential text search in ticket descriptions
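An OmniRetrieval-style layer would handle this by issuing a native query to each system and consolidating the results during evidence selection, with the question posed in natural language.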
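Since the two systems do not share an engine, the join has to happen after each one has answered its own query. The sketch below uses two in-memory SQLite databases as stand-ins for the sales and support systems; the schemas, rows, and the `buyers_with_install_tickets` helper are invented for illustration.

```python
import sqlite3

def make_db(script: str) -> sqlite3.Connection:
    """Create an isolated in-memory database standing in for one system."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(script)
    return connection

sales = make_db("""
CREATE TABLE purchases (customer_id INTEGER, product TEXT, purchased_on TEXT);
INSERT INTO purchases VALUES (1, 'X', '2026-02-10'), (2, 'X', '2026-03-05'),
                             (3, 'Y', '2026-01-20');
""")
support = make_db("""
CREATE TABLE tickets (customer_id INTEGER, opened_on TEXT, description TEXT);
INSERT INTO tickets VALUES (1, '2026-02-15', 'Installation fails at step 3'),
                           (2, '2026-03-01', 'Installation question'),
                           (3, '2026-02-01', 'Billing issue');
""")

def buyers_with_install_tickets(sales_conn, support_conn) -> list[int]:
    """Query each system natively, then join the two result sets in application code."""
    purchases = sales_conn.execute(
        "SELECT customer_id, purchased_on FROM purchases "
        "WHERE product = 'X' AND purchased_on BETWEEN '2026-01-01' AND '2026-03-31'"
    ).fetchall()
    tickets = support_conn.execute(
        "SELECT customer_id, opened_on FROM tickets WHERE description LIKE '%install%'"
    ).fetchall()

    first_purchase: dict[int, str] = {}
    for customer_id, purchased_on in purchases:
        earlier = first_purchase.get(customer_id, purchased_on)
        first_purchase[customer_id] = min(purchased_on, earlier)

    # Keep customers whose ticket was opened on or after their first Q1 purchase.
    return sorted({customer_id for customer_id, opened_on in tickets
                   if customer_id in first_purchase
                   and opened_on >= first_purchase[customer_id]})

print(buyers_with_install_tickets(sales, support))
```

Customer 2 is excluded because the ticket predates the purchase, a condition neither system could check on its own.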
Implementation Roadmap: Building Your Own OmniRetrieval
The KAIST team released code at github.com/JinheonBaek/OmniRetrieval.
Core Components Needed
1. Source Registry
- Catalog of available knowledge sources
- Structural context (schemas, ontologies) per source
- Access credentials/endpoints
2. Source Selector
- Long-context LLM (GPT-5.4, Gemini-3.1, Claude Sonnet-4.6)
- Prompt template for catalog reading
- Ranking logic for top-k candidates
3. Query Generators (Per Backend)
- Text-to-SQL: Schema linking + SQL synthesis
- Text-to-SPARQL: Entity linking + triple pattern generation
- Text-to-Cypher: Graph schema grounding + path queries
- Text retrieval: Query rewriting (optional)
4. Execution Engines
- SQL: Database connectors (SQLite, PostgreSQL, MySQL)
- SPARQL: RDF endpoint clients (Wikidata, custom)
- Cypher: Neo4j connector
- Text: Vector database (Pinecone, Weaviate, Milvus)
5. Evidence Selector
- Result verbalizer (format-specific)
- LLM-based relevance filtering
- Deduplication logic
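The first component, the source registry, might look like the following. The `Source` fields, the kind labels, and the `sqlite:///sales.db` endpoint are hypothetical; the released code may organize this differently.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Source:
    source_id: str
    kind: str                 # hypothetical labels: "text", "sql", "rdf", "lpg"
    structural_context: str   # schema, ontology summary, or corpus description
    endpoint: str             # where the executor connects

@dataclass
class SourceRegistry:
    _sources: dict[str, Source] = field(default_factory=dict)

    def register(self, source: Source) -> None:
        """Adding a source is a dictionary insert; nothing is re-embedded or retrained."""
        if source.source_id in self._sources:
            raise ValueError(f"duplicate source id: {source.source_id}")
        self._sources[source.source_id] = source

    def get(self, source_id: str) -> Source:
        return self._sources[source_id]

    def catalog(self) -> str:
        """Text handed to the source selector: one descriptor line per source."""
        return "\n".join(f"[{s.source_id}] ({s.kind}) {s.structural_context}"
                         for s in self._sources.values())

registry = SourceRegistry()
registry.register(Source("wikidata", "rdf", "Encyclopedic facts as RDF triples",
                         "https://query.wikidata.org/sparql"))
registry.register(Source("sales", "sql", "purchases(customer_id, product, purchased_on)",
                         "sqlite:///sales.db"))
print(registry.catalog())
```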
Practical Deployment Considerations
Latency:
- Source selection: 1-2 seconds (long-context LLM call)
- Query generation: 0.5-1 second per source (can parallelize)
- Execution: Varies by backend (SQL <1s, SPARQL 1-5s, text retrieval <1s)
- Evidence selection: 1-2 seconds
Total: 4-10 seconds for k=3 candidates
Cost (per query):
- Source selection: ~5k-10k input tokens (catalog size)
- Query generation: ~2k input tokens × k candidates
- Evidence selection: ~1k-3k input tokens
At GPT-5.4 pricing: ~$0.01-0.03 per query
Scaling:
- Add new sources by appending to catalog (no retraining)
- Catalog size grows linearly with sources
- Long-context LLMs handle catalogs up to 100k tokens (~500-1000 sources)
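The token figures above can be combined into a rough per-query budget. The function names below are hypothetical, and the calculation uses only the estimates quoted in the cost and scaling lists; no pricing is assumed.

```python
def input_tokens_per_query(catalog_tokens: int, k: int,
                           per_source_tokens: int, evidence_tokens: int) -> int:
    """One catalog read, k query-generation calls, one evidence-selection call."""
    return catalog_tokens + k * per_source_tokens + evidence_tokens

def tokens_per_descriptor(context_budget: int, num_sources: int) -> float:
    """Average descriptor size that lets num_sources fit in the context budget."""
    return context_budget / num_sources

low = input_tokens_per_query(5_000, k=3, per_source_tokens=2_000, evidence_tokens=1_000)
high = input_tokens_per_query(10_000, k=3, per_source_tokens=2_000, evidence_tokens=3_000)
print(low, high)

# Descriptor size implied by fitting 500-1000 sources into 100k tokens.
print(tokens_per_descriptor(100_000, 1_000), tokens_per_descriptor(100_000, 500))
```

This gives roughly 12k to 19k input tokens per query at k=3, and implies descriptors of about 100 to 200 tokens each. The catalog term dominates as sources are added, so the per-query figure is sensitive to how compactly each source is described.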
Future Directions
The KAIST team identifies several areas for improvement:
1. Fine-Tuned Evidence Selection
Current approach uses zero-shot LLM prompting. Supervised fine-tuning on labeled cross-source selections could improve accuracy.
2. Reinforcement Learning from Answer Quality
Use downstream answer correctness as reward signal to improve source selection and evidence ranking.
3. Operator-Specific Specialization
Rather than a single shared LLM, specialize models for:
- Source selection
- Per-backend query generation
- Evidence consolidation
4. Handling Temporal and Versioned Sources
Current approach assumes static knowledge bases. Real-world sources change over time.
5. Interactive Refinement
Allow users to provide feedback on selected sources and refine queries iteratively.
Comparison to Related Work
vs. Text-to-SQL Systems (Spider, BIRD)
OmniRetrieval advantage: Works across multiple backends, not just SQL
Limitation: Individual SQL generation quality may trail specialized text-to-SQL models
vs. Universal RAG (UniversalRAG, UniK)
OmniRetrieval advantage: Preserves structural operators instead of embedding everything
Trade-off: Higher complexity, more moving parts
vs. LLM Tool Use (ReAct, Toolformer)
OmniRetrieval advantage: Specialized for knowledge retrieval with schema-grounded query synthesis
Difference: Tool use is generic function calling; OmniRetrieval handles complex queries (100+ table schemas)
vs. Hybrid Search Systems
OmniRetrieval advantage: Handles graph traversals and multi-hop reasoning, not just keyword + vector
The Bigger Picture: Toward Universal Knowledge Interfaces
OmniRetrieval represents a shift from homogenization to coordination.
Rather than forcing everything into a shared representation that loses structure, build a meta-layer that:
- Understands the question
- Knows what sources exist and what they contain
- Speaks each source's native language
- Synthesizes results into coherent answers
This is how humans work:
- We don't memorize all knowledge in one format
- We know which books, databases, experts, or tools to consult
- We query each appropriately
- We integrate findings from multiple sources
OmniRetrieval automates this for machines.
Practical Takeaways
For Researchers
- Unified embeddings hit fundamental scale limits for structured data
- Long-context LLMs enable heterogeneous catalog reasoning without shared representations
- Multi-candidate exploration + deferred commitment outperforms single-source routing
- Evidence selection recovers from imperfect source selection, narrowing the gap to oracle
For RAG Engineers
- Your retrieval layer can cover SQL, graphs, and text—not just documents
- Native query generation preserves structural operations embeddings can't express
- Cross-source consolidation looks trainable: supervised fine-tuning could improve evidence selection
- Cost and latency appear manageable, at roughly $0.01-0.03 and 4-10 seconds per query
For Enterprise Architects
- Fragmented knowledge systems can share a natural-language interface
- New data sources integrate by registration, not infrastructure rebuilds
- Structured and unstructured knowledge complement each other—don't force a choice
Conclusion: The End of Single-Backend Retrieval
OmniRetrieval doesn't just benchmark higher than existing approaches.
It demonstrates a fundamentally different architecture for knowledge access:
Old paradigm: Pick your backend (text, SQL, or graph), build a specialized retriever, accept that other knowledge is unreachable.
New paradigm: Register all knowledge sources, let the system route queries to appropriate backends in native languages, consolidate heterogeneous results.
As knowledge continues fragmenting across incompatible formats—unstructured documents, relational databases, knowledge graphs, property graphs, vector databases, and future formats we haven't invented yet—the coordination approach scales where homogenization fails.
The 309 knowledge bases in this benchmark are a tiny slice of enterprise knowledge, which is a tiny slice of human knowledge.
But OmniRetrieval proves the path forward:
Meet each source on its own terms. Preserve what makes each valuable. Unify at the interface, not the representation.
That's how you build retrieval systems for the real world.
Paper: OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources
Authors: Jinheon Baek, Soyeong Jeong, Sangwoo Park, Woongyeong Yeo, Minki Kang, Patara Trirat, Heejun Lee, Sung Ju Hwang (KAIST & DeepAuto.ai)
Code: github.com/JinheonBaek/OmniRetrieval
Benchmark: 13 datasets, 309 knowledge bases (BEIR, Spider, BIRD, SimpleQuestions, QALD-10, LC-QuAD 2.0, Text2Cypher)