What is RAG in the context of context engineering?

Retrieval-augmented generation (RAG) is the practice of dynamically retrieving relevant information from an external knowledge base and injecting it into the model's context window before generating a response. From a context engineering perspective, RAG is the mechanism for populating the "retrieved documents" component of the context package — it determines what information the model has access to beyond what's in its training data.

Why does RAG often underperform in production?

The most common failure modes are: retrieving chunks that are too small to contain complete answers, retrieving chunks that match keywords but not semantic intent, injecting too many retrieved documents and diluting the signal, placing retrieved content in the middle of a long context where attention is weakest, and failing to filter out retrieved chunks with low relevance scores. RAG is often treated as a retrieval problem when it is equally a context injection problem.

What chunking strategy works best for RAG?

There is no single best strategy — it depends on your document type and query patterns. Semantic chunking (splitting on natural topic boundaries rather than fixed character counts) generally outperforms fixed-size chunking for unstructured text. For code, file-level or function-level chunks tend to work better than paragraph-level. For structured documents, preserve section hierarchy. The key test: can a single chunk answer a typical query without requiring the model to infer missing context from adjacent chunks?

Should I use RAG or a long context window?

Long context windows (1M+ tokens) make it tempting to just dump everything in. But retrieval still wins for cost efficiency, latency, and focused attention. Retrieve and inject only what's relevant for the current query rather than filling the full window with potentially irrelevant content. Long context is most valuable for tasks that genuinely need the full document — legal review, code refactoring across a large file — not as a substitute for retrieval design.

Agentic RAG is when the retrieval pipeline itself is controlled by an agent rather than being a fixed preprocessing step. Instead of always retrieving the top-k chunks before the model response, the agent decides when to retrieve, what to retrieve, and how many results to use based on the current task state. This reduces unnecessary retrieval calls and allows the agent to refine its queries based on earlier tool outputs.

RAG and Context Injection: Pipeline Design Guide 2026 | explainx.ai Blog

explainx.ainewsletter3.5k

workshops ↗

RAG and Context Injection: Pipeline Design Guide 2026 | explainx.ai Blog | explainx.ai

Retrieval-augmented generation gets described as a retrieval problem. It isn't — or not only. The retrieval half is finding relevant chunks. The equally important half is injecting those chunks into the context window in a way the model can use.

This guide covers the full pipeline: how to chunk documents, how to embed and retrieve, how to score and filter, and how to inject retrieved content into the context package in a way that actually improves outputs.

Why RAG is a context engineering problem

The failure mode most teams hit with RAG isn't "the retriever found irrelevant chunks." It's "the retriever found relevant chunks, but the model couldn't use them." This happens because:

The chunks were injected in the middle of a long context where attention is weakest
Too many chunks were included, diluting the signal
The retrieved content was structurally formatted in a way that confused the model
No relevance threshold was applied, so low-quality matches were injected alongside high-quality ones
The model wasn't told explicitly to use the retrieved content

All of these are context injection problems, not retrieval problems.

The two-part mental model: retrieval determines what you find; context engineering determines whether the model can use it.

Stage 1: Document chunking

Chunking is the first context engineering decision in a RAG pipeline. How you split documents determines what units are available to retrieve — and whether a single retrieved chunk contains enough information to be useful.

Fixed-size chunking

The simplest approach: split every document into chunks of N tokens, with an overlap of M tokens between adjacent chunks.

snippet

chunk_size = 512 tokens
overlap = 64 tokens

Advantages: predictable, easy to implement. Disadvantages: splits arbitrarily mid-sentence or mid-concept, requiring the model to infer context from adjacent chunks that may not be retrieved.

Fixed-size chunking works adequately for dense, uniform text (e.g., encyclopedia articles). It breaks down for structured content, dialogue, or documents where concepts span variable lengths.

Semantic chunking

Split on natural topic or section boundaries rather than fixed token counts. Detect boundaries using:

Heading structure (H1/H2/H3 in Markdown or HTML)
Paragraph breaks in prose
Sentence-level embedding similarity (high cosine distance between adjacent sentences signals a topic shift)

Semantic chunks contain more complete ideas. The tradeoff: chunks have variable sizes, which complicates token budget estimation and retrieval scoring.

Document-level and hierarchical chunking

Model	Dimension	Use case
OpenAI text-embedding-3-large	3072	General purpose, strong multilingual
Voyage-3-large (Anthropic)	1024–2048	Strong on code and technical content
Cohere embed-v4	1024	Enterprise multilingual
nomic-embed-text-v2	768	Open-source, runs locally
BGE-M3	1024	Open-source, strong multilingual

snippet

[RETRIEVED CONTEXT]
Source: {document_title}, Section: {section_name}
Relevance score: {score}
---
{chunk_content}
---

[RETRIEVED CONTEXT]
Source: {document_title}, Section: {section_name}
Relevance score: {score}
---
{chunk_content}
---

[END OF RETRIEVED CONTEXT]

snippet

When answering questions, first check the RETRIEVED CONTEXT below. 
Prefer information from the retrieved context over your training knowledge
when they conflict. If the retrieved context does not contain enough 
information to answer confidently, say so explicitly rather than guessing.

json

{
  "name": "search_knowledge_base",
  "description": "Search the product documentation for relevant information. Use when the user asks about product features, pricing, or policies.",
  "parameters": {
    "query": "The search query — make it specific and concrete",
    "max_results": "Number of results to retrieve (1-5). Default 3."
  }
}

RAG and context injection: designing retrieval pipelines that actually work in 2026

Why RAG is a context engineering problem

Stage 1: Document chunking

Fixed-size chunking

Semantic chunking

Document-level and hierarchical chunking

Related posts

Context engineering vs prompt engineering: a precise distinction for 2026

Tool definition and schema design: the context engineering layer most teams get wrong in 2026

Agentic context design: how to engineer the context window for multi-turn AI systems in 2026

The practical test for chunk quality

Stage 2: Embedding models

Choosing an embedding model in 2026

Stage 3: Retrieval and scoring

Top-k retrieval

Relevance thresholds

Hybrid retrieval

Reranking

Stage 4: Context injection patterns

Basic injection template

Show relevance scores

Cite the source

Placement in the context window

Explicit usage instruction

Stage 5: Agentic RAG

Common failure modes and fixes

Failure: Retrieved chunks are too small

Failure: Relevance collapse

Failure: Context flooding

Failure: Lost-in-the-middle degradation

Failure: Hallucination despite retrieval

Summary: the RAG context engineering checklist