On June 23, 2026, Mistral AI released Mistral OCR 4 — a document extraction model that returns not just text, but where each block sits, what type it is, and how confident the model is in every region. @MistralAI pitched it as the ingestion layer for enterprise search, RAG, and agentic document workflows.
The timing is notable: Baidu Unlimited-OCR dropped the day before with a different bet — open weights and one-shot multi-page parsing. PixelRAG, released two days earlier, skips text extraction entirely and retrieves over page screenshots. Mistral's answer is structured text OCR — managed, citation-ready, and designed to feed vector indexes rather than vision-only pipelines.
TL;DR
| Spec | Mistral OCR 4 | Document AI (same engine) |
|---|---|---|
| Release | June 23, 2026 | Same endpoint, extra params |
| Model ID | mistral-ocr-latest | mistral-ocr-latest + schema |
| Output | Text, bboxes, block types, confidence | + JSON schema / image annotation |
| Languages | 170 across 10 groups | Same |
| API price | $4 / 1,000 pages | $5 / 1,000 pages |
| Batch API | $2 / 1,000 pages (50% off) | Same tier |
| OlmOCRBench | 85.20 (Mistral-reported leader) | — |
| Self-host | Single container (enterprise) | Same |
| Best for | RAG ingestion, compliance, agents | Structured field extraction without custom parsers |
What changed from OCR 3
Previous Mistral OCR generations focused on clean text and tables. OCR 4 returns a structured representation of each page:
- Bounding boxes — pixel coordinates for every block (Mistral's most-requested feature)
- Block classification — titles, tables, equations, signatures, headers, footers, and more
- Inline confidence scores — per-page and per-word certainty for human-in-the-loop review
Downstream systems get three primitives that plain OCR never supplied: location, role, and reliability. That trio powers source-grounded citations in RAG, redaction pipelines, and agent workflows that fill forms or validate invoices.
From the official announcement:
OCR 4 returns a structured representation of the document. Each block is localized with a bounding box, classified by type, and inline confidence scores are generated per-page and per-word.
Supported formats include PDF, DOC, PPT, and OpenDocument. Language coverage spans 170 languages across 10 groups — with the widest gains on rare and low-resource scripts (Hindi, Georgian, Bengali, Armenian, Hebrew, Greek, Gujarati, Tamil, Malayalam, Kannada, Telugu) where many competing systems degrade.
Block types OCR 4 recognizes
Mistral classifies each extracted region. Typical block types returned in the structured output:
| Block type | Downstream use |
|---|---|
| title / heading | Section boundaries for semantic chunking |
| paragraph | Base retrieval unit for prose |
| table | Keep intact — never split rows across embedding chunks |
| equation | LaTeX blocks for scientific corpora |
| figure / image | Pair with bbox_annotation_format for caption extraction |
| signature | Compliance and contract verification |
| header / footer | Optional via extract_header / extract_footer |
| list | Preserve bullet hierarchy in markdown output |
Plain OCR (Tesseract, early cloud APIs) returns a character stream. OCR 4 returns a typed document tree — the same abstraction PageIndex-style RAG builds with graph traversal, but generated at extraction time.
Anatomy of an OCR 4 API response
Every POST /v1/ocr call returns a JSON object with a pages array. Each page includes:
| Field | Purpose |
|---|---|
| index | 0-based page number |
| markdown | Full page text with structure preserved |
| dimensions | width, height, dpi for bbox coordinate mapping |
| images | Detected figures with bbox coordinates |
| tables | Extracted tables (when table_format is set) |
| hyperlinks | URLs found on the page |
| header / footer | Optional when extraction flags are enabled |
| Confidence scores | Per-word or per-page via confidence_scores_granularity |
From the Mistral Document AI docs, OCR 2512+ supports separate table_format values (html, markdown, or inline), header/footer extraction, and confidence at word or page granularity.
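The dimensions field is what lets a pixel bbox be mapped back onto the source document. A minimal sketch of that mapping follows, assuming boxes arrive as (x0, y0, x1, y1) pixel coordinates with a top-left origin. The `BBox` type and both helper names are illustrative, not part of the Mistral SDK, and the sample numbers are made up.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Axis-aligned box in (x0, y0, x1, y1) order. Layout is an assumption."""
    x0: float
    y0: float
    x1: float
    y1: float


def pixels_to_points(bbox: BBox, dpi: float) -> BBox:
    """Convert pixel coordinates to PDF points (1 point = 1/72 inch)."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return BBox(*(v * 72.0 / dpi for v in bbox))


def normalize(bbox: BBox, width: float, height: float) -> BBox:
    """Scale a pixel bbox into the 0..1 range so it survives re-rendering."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    return BBox(bbox.x0 / width, bbox.y0 / height,
                bbox.x1 / width, bbox.y1 / height)


# Hypothetical page: a 200 dpi render of US Letter (1700 x 2200 px).
box = BBox(170, 220, 850, 440)
print(pixels_to_points(box, dpi=200))
print(normalize(box, width=1700, height=2200))
```

Storing the normalized form alongside the page index keeps citations valid even if the viewer renders the page at a different resolution.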
Parsing bounding boxes in Python
After a call, walk detected blocks to build citation metadata for your index:
def blocks_for_rag(ocr_response, min_confidence=0.85):
    """Turn OCR 4 pages into citation-ready chunks."""
    chunks = []
    for page in ocr_response.pages:
        page_idx = page.index
        dims = page.dimensions
        for block in getattr(page, "blocks", []) or []:
            conf = getattr(block, "confidence", 1.0)
            if conf < min_confidence:
                continue  # route to human review queue
            chunks.append({
                "text": block.text,
                "type": block.type,  # title, table, paragraph, etc.
                "bbox": block.bbox,  # normalized or pixel coords
                "page": page_idx,
                "source_dims": dims,
                "confidence": conf,
            })
    return chunks
Low-confidence blocks are the hook for human-in-the-loop pipelines — the same pattern teams use with Microsoft Presidio to flag PII regions before indexing, except here the model tells you which OCR regions it distrusts.
TypeScript (Node SDK)
import { Mistral } from "@mistralai/mistralai";
const client = new Mistral({ apiKey: process.env.MISTRAL_API_KEY });
const ocrResponse = await client.ocr.process({
  model: "mistral-ocr-latest",
  document: {
    type: "document_url",
    documentUrl: "https://arxiv.org/pdf/2201.04234",
  },
  tableFormat: "html",
  includeImageBase64: true,
});
console.log(ocrResponse.pages[0].markdown);
Benchmarks and the scoring caveat
Mistral ran OCR 4 against AI-native OCR models, frontier general-purpose models, enterprise document services, and its own OCR 3.
Human preference (600+ documents)
Annotators blindly ranked competitor output against OCR 4 on 600+ real-world documents across 12+ languages, sourced from third-party vendors. OCR 4 was preferred in the majority of documents against every system tested — win rates averaging 72%.
"We benchmarked Mistral OCR 4 against the leading agentic document parsers across a chart and figure dense financial QA dataset and reached equivalent accuracy at roughly 8x lower cost and 17x lower latency." — Aidan Donohue, AI Engineer, Rogo
Public and internal scores
| Benchmark | OCR 4 score | Notes |
|---|---|---|
| OlmOCRBench | 85.20 | Top among models Mistral tested |
| OmniDocBench | 93.07 | Aggregate; see caveats below |
| Crawl Multilingual (internal) | .98 | Leads all 8 language groups |
"Mistral OCR is roughly 4x faster per page than our incumbent provider, an impressive result for the high-volume docketing workflows where speed is critical." — Ivan Mihailov, AI engineer, Anaqua
Mistral is transparent about benchmark limitations. When they audited mismatches, most were scoring artifacts, not model errors:
| Artifact type | What happens |
|---|---|
| Ground-truth errors | Reference annotations wrong; model read the page correctly |
| Equivalent math notation | Different LaTeX that renders identically counts as mismatch |
| Equation segmentation | Single vs split equation blocks fail string alignment |
| Multi-column reading order | Hyphenation across columns flagged as order failure |
| Block-type attribution | Headers/footers stripped for scoring remove valid titles |
These artifacts concentrate in mathematical, scientific, and multi-column documents. Mistral treats aggregate scores as directional — evaluate on your own corpus before production. Our AI benchmarks guide walks through how to read leaderboard scores without overfitting to a single number.
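One way to act on the "evaluate on your own corpus" advice is a character error rate (CER) check against a small set of hand-verified pages. The sketch below is a generic edit-distance metric, not Mistral's scoring method, and it shares the weaknesses listed in the table above: equivalent LaTeX or a different reading order will still count as errors. The function names and sample strings are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete, and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER after collapsing whitespace, so line wrapping is not penalized."""
    ref = " ".join(reference.split())
    hyp = " ".join(hypothesis.split())
    if not ref:
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)


# Hypothetical pair: the OCR output confuses the digit 0 with the letter O.
print(character_error_rate("Total: 120.50", "Total: 12O.50"))
```

Tracking CER per document type (invoices, scanned contracts, scientific papers) tends to be more informative than a single corpus-wide average.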
Multilingual breakdown (Crawl Multilingual)
On Mistral's internal eval, OCR 4 leads across all eight language groups:
| Language group | OCR 4 position | Notes |
|---|---|---|
| English | Leader | Baseline for most public benchmarks |
| Western Europe | Leader | French, German, Spanish, Italian |
| Eastern Europe | Leader | Polish, Czech, Romanian |
| Middle Eastern | Leader | Arabic, Persian, Hebrew |
| Chinese | Leader | Simplified and traditional |
| East Asian | Leader | Japanese, Korean |
| Southeast Asian | Leader | Thai, Vietnamese, Indonesian |
| Rare / low-resource | Widest gap | Hindi, Georgian, Bengali, Armenian, Telugu, and others |
The rare-language gap matters for global enterprises: a parser that works on English financial filings but fails on Hindi vendor invoices breaks closed-loop agent workflows at the ingestion step.
OCR 4 API vs Document AI
Both modes hit POST /v1/ocr with model mistral-ocr-latest. Every call returns extracted content, bounding boxes, block types, confidence scores, and markdown-structured text. Document AI adds optional layers on the same response.
| Mode | When to use |
|---|---|
| Pure OCR | Raw extraction, custom downstream logic, high-volume batch ingestion, self-host |
| Document AI | Pass a JSON schema for structured fields, annotate images with a schema, or add a custom prompt for interpretation |
Decision rule: need raw blocks → OCR as-is. Need invoice fields or domain-specific JSON → add document_annotation_format and optional document_annotation_prompt.
Basic extraction (Python)
from mistralai import Mistral
client = Mistral(api_key="YOUR_MISTRAL_API_KEY")
ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234",
    },
    table_format="html",  # or "markdown"
    include_image_base64=True,
)
print(ocr_response.pages[0].markdown)
Document AI with JSON schema
ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://example.com/invoice.pdf",
    },
    document_annotation_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "vendor": {"type": "string"},
                    "total": {"type": "number"},
                    "line_items": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "description": {"type": "string"},
                                "amount": {"type": "number"},
                            },
                        },
                    },
                },
                "required": ["vendor", "total"],
            },
        },
    },
    document_annotation_prompt="Extract invoice fields from this document.",
)
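Schema-constrained output still deserves a sanity check before it reaches an agent. The sketch below assumes the annotation comes back as a JSON string matching the invoice schema above; how you read that string off the response object is not shown and should be taken from the API docs. `check_invoice` is a hypothetical helper, and the rule that line items sum to the total ignores tax and discounts, so treat it as a starting point.

```python
import json


def check_invoice(annotation_json: str, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems found in an extracted invoice annotation."""
    data = json.loads(annotation_json)
    problems = []
    # Mirror the schema's "required" list.
    for field in ("vendor", "total"):
        if field not in data:
            problems.append(f"missing required field: {field}")
    # Cross-check line items against the total when both are present.
    items = data.get("line_items", [])
    if items and "total" in data:
        line_sum = sum(item.get("amount", 0.0) for item in items)
        if abs(line_sum - data["total"]) > tolerance:
            problems.append(
                f"line items sum to {line_sum:.2f}, total is {data['total']:.2f}"
            )
    return problems


# Hypothetical annotation payload for illustration only.
sample = json.dumps({
    "vendor": "Acme GmbH",
    "total": 130.0,
    "line_items": [
        {"description": "Widgets", "amount": 100.0},
        {"description": "Shipping", "amount": 20.5},
    ],
})
print(check_invoice(sample))
```

Invoices that return a non-empty problem list are natural candidates for the human review queue.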
cURL
curl https://api.mistral.ai/v1/ocr \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -d '{
    "model": "mistral-ocr-latest",
    "document": {
      "type": "document_url",
      "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    "table_format": "html"
  }'
Optional parameters from the OCR API docs:
- pages: "0,2-4" or list of integers (0-indexed)
- table_format: html or markdown
- bbox_annotation_format: structured JSON per detected image region
- extract_header / extract_footer: include header/footer blocks (OCR 2512+)
- confidence_scores_granularity: "word" or "page" for inline confidence
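When a pipeline needs to know which pages a spec such as "0,2-4" covers (for cost estimates or for matching results back to source pages), it can expand the string locally. This is a minimal sketch of one plausible reading of that format, with inclusive ranges; the API's exact grammar is not spelled out here, and `expand_pages` is a hypothetical helper rather than an SDK function.

```python
def expand_pages(spec: str) -> list[int]:
    """Expand a spec like "0,2-4" into sorted 0-indexed page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    return sorted(pages)


print(expand_pages("0,2-4"))
```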
Base64 and local PDF upload
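For local files that are not reachable at a URL, base64-encode the file and pass it as a data URL, the same pattern as the Mistral basic OCR cookbook. The bytes are still sent to Mistral's API, so this does not keep the document inside your network: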
import base64
with open("contract.pdf", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode()
ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{b64}",
    },
)
For strict residency, skip the API entirely and run the single-container self-host path instead — the trade-off matrix in our closed vs open source AI guide applies here.
RAG ingestion pipeline with OCR 4
OCR 4 handles the extract step of an extract → chunk → embed → retrieve pipeline. Here is a production-shaped flow:
flowchart LR
A[PDF / DOC upload] --> B[Mistral OCR 4 API]
B --> C[Typed blocks + bboxes]
C --> D{confidence >= threshold?}
D -->|yes| E[Semantic chunks]
D -->|no| F[Human review queue]
E --> G[Embed + vector index]
G --> H[RAG or agent retrieval]
H --> I[Answer with bbox citations]
Why bboxes matter for RAG: when an LLM cites a source, users expect to see the highlighted region in the original PDF. Bounding boxes make that possible without re-running layout analysis at query time — a direct counter to AI hallucination in uncited answers.
Search Toolkit integration: Mistral's open-source Search Toolkit (public preview, announced at AI Now Summit 2026) accepts OCR 4 structured output as ingestion input. If you are assembling a composable search stack, OCR 4 is the document connector; the toolkit handles retrieval and eval loops.
For teams comparing text OCR against visual retrieval, read our PixelRAG guide — PixelRAG skips markdown extraction and retrieves screenshot tiles instead. The approaches are complementary: OCR 4 for searchable text + citations; PixelRAG when layout and charts must survive as images.
Semantic chunking by block type
Fixed-size chunking (512 tokens, overlap 64) is the default in many RAG tutorials — and it destroys tables, splits equations mid-line, and separates titles from body text. OCR 4's block types give you a better default:
| Block type | Chunk strategy |
|---|---|
| title | Attach to the following paragraph as metadata, or keep as section header |
| paragraph | One chunk per block; merge only if under 100 tokens |
| table | Single chunk; store as HTML from table_format="html" |
| equation | Never split; keep LaTeX intact |
| figure | Chunk caption + optional bbox_annotation_format JSON |
| list | One chunk per list block; preserve nesting in markdown |
This aligns with agentic RAG thinking from our RAG vs agentic RAG guide: structure-aware retrieval beats blind similarity search when documents have hierarchy. OCR 4 supplies that hierarchy at extraction time so you do not need a separate layout model.
For very long documents, pair typed chunks with context compression before sending retrieved blocks to the LLM — especially when tables and footnotes inflate token counts.
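The chunk strategies in the table above can be sketched as a single pass over the blocks of a document. The code assumes each block is a plain dict with "type" and "text" keys, which is an illustrative shape rather than the SDK's response type, and it approximates token counts by whitespace-separated words. `chunk_blocks` is a hypothetical helper.

```python
def chunk_blocks(blocks, min_tokens=100):
    """Group typed blocks into retrieval chunks without splitting structure."""
    chunks = []
    section = None   # most recent title, attached to chunks as metadata
    pending = None   # short paragraph chunk still open for merging

    def flush():
        nonlocal pending
        if pending is not None:
            chunks.append(pending)
            pending = None

    for block in blocks:
        kind, text = block["type"], block["text"]
        if kind in ("title", "heading"):
            flush()
            section = text
        elif kind == "paragraph":
            if pending is None:
                pending = {"type": "paragraph", "text": text, "section": section}
            else:
                pending["text"] += "\n\n" + text
            # Word count stands in for a real tokenizer here.
            if len(pending["text"].split()) >= min_tokens:
                flush()
        else:
            # Tables, equations, figures, and lists stay whole.
            flush()
            chunks.append({"type": kind, "text": text, "section": section})
    flush()
    return chunks


# Hypothetical blocks for illustration.
demo = [
    {"type": "title", "text": "Results"},
    {"type": "paragraph", "text": "short one"},
    {"type": "paragraph", "text": "short two"},
    {"type": "table", "text": "<table><tr><td>1</td></tr></table>"},
    {"type": "paragraph", "text": "tail"},
]
for chunk in chunk_blocks(demo):
    print(chunk["type"], "|", chunk["section"])
```

Short paragraphs merge only with neighbors in the same section, because a new title or an atomic block closes the open paragraph chunk first.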
Confidence scores and human-in-the-loop routing
OCR 4 emits confidence at word or page granularity. A practical routing policy:
| Confidence band | Action |
|---|---|
| ≥ 0.95 | Auto-index; no review |
| 0.85 to < 0.95 | Index with needs_review flag |
| < 0.85 | Queue for human verifier |
Invoice and compliance pipelines benefit most: extract structured fields via Document AI, then send only low-confidence line items to reviewers instead of the full document. Combine with Presidio when bboxes overlap known PII entity types (names, SSN regions) for redaction before indexing.
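The routing policy above reduces to a small threshold function. This is a sketch under the assumption that each block carries a confidence value between 0 and 1; the bucket names and the dict-shaped blocks are illustrative choices, and the thresholds should be tuned on your own review data.

```python
def route(confidence: float, auto: float = 0.95, review: float = 0.85) -> str:
    """Map a confidence score to one of three handling buckets."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= auto:
        return "auto_index"
    if confidence >= review:
        return "index_needs_review"
    return "human_queue"


def partition(blocks):
    """Split blocks into buckets keyed by routing decision."""
    buckets = {"auto_index": [], "index_needs_review": [], "human_queue": []}
    for block in blocks:
        buckets[route(block["confidence"])].append(block)
    return buckets


# Hypothetical blocks for illustration.
demo_blocks = [
    {"text": "Invoice 1042", "confidence": 0.99},
    {"text": "Tota1 due", "confidence": 0.90},
    {"text": "~~~", "confidence": 0.40},
]
for name, items in partition(demo_blocks).items():
    print(name, len(items))
```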
Where it fits in production pipelines
Mistral positions OCR 4 as an ingestion component, not a decision engine. Recommended workloads:
| Use case | Why OCR 4 fits |
|---|---|
| RAG semantic chunking | Classified blocks become better retrieval units than raw page text |
| Agentic workflows | Bboxes + types give agents structural primitives for forms and compliance |
| Enterprise search | Typed output feeds connectors and entity extraction |
| Human-in-the-loop | Confidence scores route low-certainty regions to reviewers |
| Redaction | Bounding boxes localize PII for automated or assisted redaction |
OCR 4 integrates with the Mistral Search Toolkit (public preview) — Mistral's open-source composable search framework announced at AI Now Summit 2026. Structured OCR output feeds citation-ready ingestion for retrieval and evaluation workflows.
Out of scope: medical diagnosis, legal judgment, high-stakes financial decisions, safety-critical systems, real-time latency-sensitive paths, and non-document inputs (audio, video).
Deployment options
| Channel | Status |
|---|---|
| Mistral Studio API | Available |
| Amazon SageMaker | Available |
| Microsoft Foundry | Available |
| Snowflake Parse Document | Coming soon |
| Self-hosted (single container) | Enterprise — contact Mistral sales |
"The availability of Mistral Document AI with OCR 4 in Microsoft Foundry marks an important milestone in our partnership." — Kimmi Grewal, VP, AI Ecosystem Partnerships, Microsoft
For strict data residency, the single-container self-host path keeps documents inside your VPC. For teams without GPU ops capacity, the managed API with Batch pricing at $2/1k pages targets high-volume archive digitization.
Cost at scale
Mistral OCR 4 pricing is page-based, not token-based — which simplifies budgeting for document-heavy workloads compared to vision-LLM page-by-page parsing. See our generative AI cost optimization guide for broader FinOps patterns.
| Volume (pages) | Standard API ($4/1k) | Batch API ($2/1k) | Document AI ($5/1k) |
|---|---|---|---|
| 10,000 | $40 | $20 | $50 |
| 100,000 | $400 | $200 | $500 |
| 1,000,000 | $4,000 | $2,000 | $5,000 |
Batch API applies a 50% discount for non-real-time jobs such as archive digitization, nightly ingestion, and bulk contract processing. Rogo reported roughly 8× lower cost than the leading agentic document parsers on a chart-heavy financial QA dataset at equivalent accuracy.
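The table rows follow from a linear per-page rate, which makes budgeting easy to script. The sketch below uses the three rates quoted in this article and assumes billing is strictly proportional to page count with no minimums or volume tiers; the tier names are labels chosen here, not API identifiers.

```python
# Rates in USD per 1,000 pages, as quoted in this article.
RATES_PER_1K = {"standard": 4.0, "batch": 2.0, "document_ai": 5.0}


def ocr_cost(pages: int, tier: str = "standard") -> float:
    """Estimated cost in USD, assuming linear per-page billing."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if tier not in RATES_PER_1K:
        raise ValueError(f"unknown tier: {tier!r}")
    return pages / 1000 * RATES_PER_1K[tier]


for volume in (10_000, 100_000, 1_000_000):
    row = ", ".join(f"{tier}: ${ocr_cost(volume, tier):,.0f}" for tier in RATES_PER_1K)
    print(f"{volume:>9,} pages -> {row}")
```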
Managed OCR landscape (June 2026)
| Solution | Structured output | Bboxes | Self-host | Typical pricing model |
|---|---|---|---|---|
| Mistral OCR 4 | ✅ types + confidence | ✅ | Enterprise container | $2–5 / 1k pages |
| AWS Textract | Forms + tables | Partial | ❌ | Per page + feature tier |
| Google Document AI | ✅ | Partial | ❌ | Per page |
| Azure Document Intelligence | ✅ | Partial | ❌ | Per page |
| Baidu Unlimited-OCR | Text | ❌ | ✅ MIT weights | GPU compute only |
| PixelRAG | Visual tiles | N/A (screenshots) | ✅ Apache 2.0 | Self-host / hosted API |
Mistral's differentiation is the combined package: bboxes + block types + confidence + Document AI schema layer on one endpoint. Cloud incumbents offer structured extraction but rarely ship per-word confidence and typed blocks in a single OCR-native response.
Batch API workflow
For million-page archives, structure jobs around the Batch API rather than synchronous calls:
- Upload documents to object storage (S3, GCS, Azure Blob)
- Submit batch jobs with public or signed URLs per document
- Poll for completion — no rate-limit pressure on synchronous endpoints
- Parse structured JSON; route low-confidence blocks to review queues
- Index chunks into your vector store (embeddings guide)
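# Pseudocode: batch ingestion loop.
# client.batch.ocr.submit, list_pending_pdfs, and track_job are placeholders,
# not confirmed SDK calls; check the Batch API docs for the real job-creation call.
documents = list_pending_pdfs("s3://archive/invoices/2025/")
for doc_url in documents:
    job = client.batch.ocr.submit(
        model="mistral-ocr-latest",
        document={"type": "document_url", "document_url": doc_url},
        table_format="html",
    )
    track_job(job.id, source=doc_url)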
Anaqua's 4× faster per-page result versus their incumbent matters most here — docketing and IP workflows process high page counts daily, and latency compounds into missed deadlines.
Document AI landscape: same week, three bets
June 2026 delivered three distinct document-ingestion philosophies within 72 hours:
| Approach | Representative | Core bet |
|---|---|---|
| Structured text OCR | Mistral OCR 4 | Bboxes + types + confidence for RAG citations |
| Open long-horizon parsing | Baidu Unlimited-OCR | One-pass multi-page, MIT weights |
| Visual retrieval | PixelRAG | Skip text; retrieve screenshot tiles |
Most production stacks will mix layers: OCR 4 for searchable text and compliance metadata, PixelRAG or vision models where charts dominate, and agents on top to act on extracted structure.
Mistral OCR 4 vs Baidu Unlimited-OCR (same week)
Both models landed within 24 hours of each other — a signal that document AI is having a moment in June 2026.
| Dimension | Mistral OCR 4 | Baidu Unlimited-OCR |
|---|---|---|
| Release | June 23, 2026 | June 22–23, 2026 |
| License | Managed API / enterprise self-host | MIT, open weights |
| Multi-page | API per document | One forward pass, 32k context |
| Structured output | Bboxes, types, confidence | Text-focused parsing |
| Cost model | $2–5 / 1k pages | Self-host GPU cost only |
| Document AI layer | Built-in JSON schema | Roll your own |
| Human preference | 72% avg win rate (Mistral-reported) | Not yet published |
| Deepseek-OCR lineage | Separate model family | ✅ extends Deepseek-OCR ngram suppression |
| Best for | Managed RAG + citations | Self-hosted bulk PDF parsing |
Practical split: teams under compliance pressure who want bounding boxes and confidence scores without building parsers → Mistral. Teams who need open weights, unlimited-length PDFs in one shot, and zero per-page API fees → Baidu. Teams where tables and charts must stay visual → PixelRAG.
Neither replaces the other today — they optimize for different constraints. For a broader build-vs-buy framing, see closed source vs local open alternatives.
Getting started
- API key — Mistral Studio
- Cookbook — Getting Started with OCR 4 (bounding boxes and block classification walkthrough)
- Webinar — OCR 4 in Production, July 7, 2026, 6:00 PM CET
- Model card — docs.mistral.ai/models/model-cards/ocr-4-0
Related ExplainX guides
Document ingestion cluster (June 2026):
- Baidu Unlimited-OCR: one-shot long-horizon parsing — open-weight alternative from the same week
- PixelRAG: visual RAG from web screenshots — when layout beats text extraction
RAG and retrieval:
- RAG vs agentic RAG — structure-aware chunking vs fixed windows
- What are embeddings and vector search? — indexing OCR output
- Perplexity Search-as-Code — agentic retrieval patterns
- Headroom: context compression for agents — long document token control
Trust, cost, and agents:
- AI models hallucinate — why and how to catch it — bbox citations reduce uncited answers
- Microsoft Presidio PII detection — redaction before indexing
- Optimising generative AI costs — FinOps for document pipelines
- What are AI agents? — agents acting on extracted structure
- Closed source vs open source AI alternatives — build vs buy for document AI
- AI benchmarks complete guide — reading OCR leaderboard scores
- What are LLM tokens? — cost math when OCR feeds downstream LLM calls
Primary sources: Mistral OCR 4 announcement · OCR API reference · Document AI docs · @MistralAI
Summary
Mistral OCR 4 shifts document extraction from flat text to structured blocks — bounding boxes, typed regions, and confidence scores across 170 languages. It tops Mistral's human preference tests (72% average win rate) and public benchmarks like OlmOCRBench (85.20), with honest caveats about automated scoring.
The API starts at $4/1k pages ($2 via Batch). Document AI on the same endpoint adds schema-driven extraction without a separate parser. For open-weight, one-pass multi-page parsing, compare against Baidu Unlimited-OCR released the day before. For visual retrieval without text parsing, see PixelRAG.
For RAG ingestion, compliance workflows, and enterprise search connectors, OCR 4 is the most complete managed option Mistral has shipped to date — especially when source-grounded citations and structure-aware chunking matter.
Pricing, benchmark scores, and deployment channels reflect Mistral's June 23, 2026 release. Re-check mistral.ai/news/ocr-4 and the API docs before production deployment.