Retrieval-augmented generation gets described as a retrieval problem. It isn't — or not only. The retrieval half is finding relevant chunks. The equally important half is injecting those chunks into the context window in a way the model can use.
This guide covers the full pipeline: how to chunk documents, how to embed and retrieve, how to score and filter, and how to inject retrieved content into the context package in a way that actually improves outputs.
Why RAG is a context engineering problem
The failure mode most teams hit with RAG isn't "the retriever found irrelevant chunks." It's "the retriever found relevant chunks, but the model couldn't use them." This happens because:
- The chunks were injected in the middle of a long context where attention is weakest
- Too many chunks were included, diluting the signal
- The retrieved content was structurally formatted in a way that confused the model
- No relevance threshold was applied, so low-quality matches were injected alongside high-quality ones
- The model wasn't told explicitly to use the retrieved content
All of these are context injection problems, not retrieval problems.
The two-part mental model: retrieval determines what you find; context engineering determines whether the model can use it.
Stage 1: Document chunking
Chunking is the first context engineering decision in a RAG pipeline. How you split documents determines what units are available to retrieve — and whether a single retrieved chunk contains enough information to be useful.
Fixed-size chunking
The simplest approach: split every document into chunks of N tokens, with an overlap of M tokens between adjacent chunks.
chunk_size = 512 tokens
overlap = 64 tokens
Advantages: predictable, easy to implement. Disadvantages: splits arbitrarily mid-sentence or mid-concept, requiring the model to infer context from adjacent chunks that may not be retrieved.
Fixed-size chunking works adequately for dense, uniform text (e.g., encyclopedia articles). It breaks down for structured content, dialogue, or documents where concepts span variable lengths.
Semantic chunking
Split on natural topic or section boundaries rather than fixed token counts. Detect boundaries using:
- Heading structure (H1/H2/H3 in Markdown or HTML)
- Paragraph breaks in prose
- Sentence-level embedding similarity (high cosine distance between adjacent sentences signals a topic shift)
Semantic chunks contain more complete ideas. The tradeoff: chunks have variable sizes, which complicates token budget estimation and retrieval scoring.
Document-level and hierarchical chunking
For code, preserve function or class boundaries. For legal or technical documents, chunk by section with parent-context summaries attached. Hierarchical chunking adds a summary chunk for each section alongside the detail chunks — retrieve at the summary level first, then fetch detail chunks from the most relevant section.
This "two-stage" approach (summarize → select → detail) is more expensive but dramatically improves precision for long documents.
The practical test for chunk quality
For any chunking strategy: take 20 representative queries. Retrieve the top-3 chunks for each. Read them as a human without access to the rest of the document. Can you answer the query from the chunk alone? If you need to infer heavily from context that isn't in the chunk, your chunks are too small or split in the wrong places.
Stage 2: Embedding models
Embedding models convert text to vectors. Similar meanings produce vectors that are close in embedding space, enabling semantic (not just keyword) retrieval.
Choosing an embedding model in 2026
The embedding model market in 2026 has several solid options:
| Model | Dimension | Use case |
|---|---|---|
| OpenAI text-embedding-3-large | 3072 | General purpose, strong multilingual |
| Voyage-3-large (Anthropic) | 1024–2048 | Strong on code and technical content |
| Cohere embed-v4 | 1024 | Enterprise multilingual |
| nomic-embed-text-v2 | 768 | Open-source, runs locally |
| BGE-M3 | 1024 | Open-source, strong multilingual |
Key considerations:
Domain fit. Models trained on code and technical documents embed code better than models trained primarily on web text. Test on a representative sample from your actual documents.
Dimension size. Higher dimensions generally improve recall but increase storage and query cost. Most production systems land between 768 and 1536 dimensions.
Query-document asymmetry. Some models (notably Voyage and Cohere) are optimized for asymmetric retrieval: short queries retrieving from long documents. Others assume symmetric text length. Using a symmetric model for short-query + long-document retrieval degrades recall.
Batch embedding. Embed documents at index time, not at query time. Retrieval latency comes from query embedding and vector search, not document embedding.
Stage 3: Retrieval and scoring
Top-k retrieval
The standard approach: embed the query, find the k most similar document vectors by cosine similarity, return them.
The k parameter is a context budget decision. k = 5 returns 5 chunks; if each chunk is 512 tokens, that's 2,560 tokens of retrieval content. Multiply by the number of retrieval calls in an agent session and you see how retrieval decisions drive total context cost.
Don't over-retrieve. The temptation is to set k high "to make sure we include everything relevant." But injecting 20 retrieved chunks means 18 of them are competing with the 2 that are actually useful. Attention dilution is real — more is not always better.
Relevance thresholds
Apply a minimum similarity threshold before injecting chunks. Typical cutoffs range from 0.70 to 0.85 cosine similarity, depending on your embedding model and domain. Chunks below the threshold get dropped regardless of k.
This is one of the highest-leverage context engineering decisions in a RAG pipeline. Without a threshold, every query injects some results even when no relevant content exists. With a threshold, the model gets a clean empty-context signal when the knowledge base doesn't have an answer — which is usually better than injecting low-quality matches.
Hybrid retrieval
Pure vector search misses exact keyword matches (product names, version numbers, proper nouns). Pure BM25 keyword search misses semantic equivalents ("car" vs "automobile"). Hybrid retrieval combines both:
final_score = α * vector_score + (1 - α) * keyword_score
Most production RAG systems in 2026 use hybrid retrieval. The α hyperparameter depends on your document type — technical documentation with precise terminology weights toward keyword; prose and natural language queries weight toward vector.
Reranking
A retrieval pass typically returns coarse top-k results. A reranking model (cross-encoder rather than bi-encoder) takes the query and each retrieved chunk together and produces a more precise relevance score. Reranking dramatically improves precision at the cost of additional latency.
Common reranking models: Cohere Rerank 3.5, Voyage Rerank 2, BGE Reranker v2. For latency-sensitive applications, apply reranking only to the top 10-20 candidates from the initial retrieval pass.
Stage 4: Context injection patterns
Retrieved chunks are not self-explanatory. The model needs to understand what they are, why they're there, and how to use them. Context injection is the practice of wrapping retrieved content in structure that helps the model treat it correctly.
Basic injection template
[RETRIEVED CONTEXT]
Source: {document_title}, Section: {section_name}
Relevance score: {score}
---
{chunk_content}
---
[RETRIEVED CONTEXT]
Source: {document_title}, Section: {section_name}
Relevance score: {score}
---
{chunk_content}
---
[END OF RETRIEVED CONTEXT]
The explicit tags ([RETRIEVED CONTEXT], [END OF RETRIEVED CONTEXT]) help the model distinguish retrieved material from the system prompt and user message. This reduces the chance of the model conflating instructions with content.
Show relevance scores
Including the relevance score in the injection template has two benefits: it signals to the model which chunks are more trustworthy, and it makes your retrieved context transparent for debugging. A chunk with score 0.61 should get less weight than a chunk with 0.93 — including the score lets the model calibrate.
Cite the source
Include document titles and section names. This enables the model to attribute claims to sources and to hedge when it's working from partial information. It also helps with debugging — when the model says something incorrect, you can trace it back to the retrieved chunk that caused it.
Placement in the context window
Attention research consistently shows that content at the beginning and end of context windows receives more weight than content in the middle. For retrieval-augmented systems:
- System prompt — at the top
- Retrieved context — immediately after the system prompt, before conversation history
- Conversation history — after retrieved content
- User message — at the end
This ordering ensures the model sees retrieved content early (high attention) and the current task late (also high attention). Burying retrieved documents in the middle of a long conversation history is one of the most common context injection mistakes.
Explicit usage instruction
Include an explicit instruction in the system prompt to use the retrieved context:
When answering questions, first check the RETRIEVED CONTEXT below.
Prefer information from the retrieved context over your training knowledge
when they conflict. If the retrieved context does not contain enough
information to answer confidently, say so explicitly rather than guessing.
Without this instruction, models sometimes ignore retrieved content in favor of training knowledge — especially when the retrieved content is short or formatted differently from what the model expects.
Stage 5: Agentic RAG
In standard RAG, retrieval is a fixed preprocessing step: before generating a response, retrieve k chunks and inject them. Agentic RAG puts retrieval under the model's control as a tool call.
{
"name": "search_knowledge_base",
"description": "Search the product documentation for relevant information. Use when the user asks about product features, pricing, or policies.",
"parameters": {
"query": "The search query — make it specific and concrete",
"max_results": "Number of results to retrieve (1-5). Default 3."
}
}
The model decides when to call the tool, what query to use, and how many results to retrieve. This produces several advantages:
Query refinement. The agent can issue multiple retrieval calls with refined queries if the first results are insufficient, rather than being locked into a single retrieval pass.
Conditional retrieval. For queries that don't need external information, the agent skips retrieval entirely — saving tokens and latency.
Incremental retrieval. In long agentic sessions, the agent can retrieve information as it becomes relevant rather than front-loading everything.
The tradeoff: agentic RAG requires the model to have good tool-calling behavior and reliable retrieval tool schemas. A poorly designed tool schema produces poor retrieval queries. The system prompt must clearly define when and how to use the retrieval tool.
Common failure modes and fixes
Failure: Retrieved chunks are too small
Symptom: Model answers are vague or miss detail that exists in the documents.
Cause: Chunks are split so finely that no single chunk contains a complete answer. The relevant information is spread across adjacent chunks, but only one was retrieved.
Fix: Increase chunk size or use parent-document retrieval (retrieve the small chunk for ranking, but inject the larger parent document). Also consider adding an overlap buffer between chunks.
Failure: Relevance collapse
Symptom: The model gives the same response regardless of the query, or ignores retrieved context.
Cause: Retrieved chunks are consistently off-topic due to poor chunking, wrong embedding model, or no relevance threshold.
Fix: Audit retrieved chunks for 20+ representative queries. If retrieved content is consistently irrelevant, diagnose whether the problem is in chunking (content not represented) or retrieval (wrong model or missing threshold). Add relevance thresholds if you're not filtering low-score results.
Failure: Context flooding
Symptom: Model performance degrades as more retrieved content is included.
Cause: Too many chunks are injected, and the relevant information is buried in low-quality matches. Attention is spread thin.
Fix: Reduce k and raise the relevance threshold. Apply reranking to select the top 2-3 chunks from a larger candidate set rather than injecting all top-k candidates.
Failure: Lost-in-the-middle degradation
Symptom: Model performs well with 1-2 retrieved chunks but degrades with 5-10.
Cause: Critical information is placed in the middle of a long context where attention is weakest.
Fix: Reorder injected chunks so the most relevant appear first (after the system prompt). Consider repeating the top chunk at the end of the context package for high-stakes retrieval.
Failure: Hallucination despite retrieval
Symptom: Model confidently states facts not present in retrieved documents.
Cause: The model isn't using retrieved content — it's falling back on training knowledge. Or the retrieved content contains the information but the model didn't weight it appropriately.
Fix: Add explicit usage instructions in the system prompt. Use RAG-focused evaluation to measure whether model answers are grounded in retrieved content vs training knowledge. Consider adding a post-generation grounding check.
Summary: the RAG context engineering checklist
Before shipping a RAG system:
- Chunk size tested against representative queries — single chunk answers typical questions without missing context
- Embedding model validated on your domain (not just general benchmarks)
- Relevance threshold set — low-score results are dropped, not injected
- Reranking applied for precision-sensitive applications
- Retrieved content placed before conversation history in the context window
- Source labels and relevance scores included in injection templates
- Explicit usage instruction in system prompt
- Token budget estimated for retrieval component across a typical session
- Agentic retrieval considered for multi-step agent tasks
The retrieval pipeline and the context injection layer are both part of context engineering. Teams that optimize only the retrieval half ship RAG systems that find good content but can't get the model to use it. Optimize both.