Mistral OCR 4: Bounding Boxes, Document AI, and the New OCR API

Mistral OCR 4 ships June 23, 2026 with bounding boxes, block classification, and confidence scores across 170 languages. API at $4/1k pages, 85.20 on OlmOCRBench, and self-host on one container.

16 min read · Yash Thakker
Mistral AI · OCR · Document AI · RAG · Enterprise AI · API
On June 23, 2026, Mistral AI released Mistral OCR 4 — a document extraction model that returns not just text, but where each block sits, what type it is, and how confident the model is in every region. @MistralAI pitched it as the ingestion layer for enterprise search, RAG, and agentic document workflows.

The timing is notable: Baidu Unlimited-OCR dropped the day before with a different bet — open weights and one-shot multi-page parsing. PixelRAG, released two days earlier, skips text extraction entirely and retrieves over page screenshots. Mistral's answer is structured text OCR — managed, citation-ready, and designed to feed vector indexes rather than vision-only pipelines.

TL;DR

| Spec | Mistral OCR 4 | Document AI (same engine) |
| --- | --- | --- |
| Release | June 23, 2026 | Same endpoint, extra params |
| Model ID | mistral-ocr-latest | mistral-ocr-latest + schema |
| Output | Text, bboxes, block types, confidence | + JSON schema / image annotation |
| Languages | 170 across 10 groups | Same |
| API price | $4 / 1,000 pages | $5 / 1,000 pages |
| Batch API | $2 / 1,000 pages (50% off) | Same tier |
| OlmOCRBench | 85.20 (Mistral-reported leader) | |
| Self-host | Single container (enterprise) | Same |
| Best for | RAG ingestion, compliance, agents | Structured field extraction without custom parsers |

What changed from OCR 3

Previous Mistral OCR generations focused on clean text and tables. OCR 4 returns a structured representation of each page:

  • Bounding boxes — pixel coordinates for every block (Mistral's most-requested feature)
  • Block classification — titles, tables, equations, signatures, headers, footers, and more
  • Inline confidence scores — per-page and per-word certainty for human-in-the-loop review

Downstream systems get three primitives that plain OCR never supplied: location, role, and reliability. That trio powers source-grounded citations in RAG, redaction pipelines, and agent workflows that fill forms or validate invoices.

From the official announcement:

OCR 4 returns a structured representation of the document. Each block is localized with a bounding box, classified by type, and inline confidence scores are generated per-page and per-word.

Supported formats include PDF, DOC, PPT, and OpenDocument. Language coverage spans 170 languages across 10 groups — with the widest gains on rare and low-resource scripts (Hindi, Georgian, Bengali, Armenian, Hebrew, Greek, Gujarati, Tamil, Malayalam, Kannada, Telugu) where many competing systems degrade.

Block types OCR 4 recognizes

Mistral classifies each extracted region. Typical block types returned in the structured output:

| Block type | Downstream use |
| --- | --- |
| title / heading | Section boundaries for semantic chunking |
| paragraph | Base retrieval unit for prose |
| table | Keep intact; never split rows across embedding chunks |
| equation | LaTeX blocks for scientific corpora |
| figure / image | Pair with bbox_annotation_format for caption extraction |
| signature | Compliance and contract verification |
| header / footer | Optional via extract_header / extract_footer |
| list | Preserve bullet hierarchy in markdown output |

Plain OCR (Tesseract, early cloud APIs) returns a character stream. OCR 4 returns a typed document tree — the same abstraction PageIndex-style RAG builds with graph traversal, but generated at extraction time.


Anatomy of an OCR 4 API response

Every POST /v1/ocr call returns a JSON object with a pages array. Each page includes:

| Field | Purpose |
| --- | --- |
| index | 0-based page number |
| markdown | Full page text with structure preserved |
| dimensions | width, height, dpi for bbox coordinate mapping |
| images | Detected figures with bbox coordinates |
| tables | Extracted tables (when table_format is set) |
| hyperlinks | URLs found on the page |
| header / footer | Optional when extraction flags are enabled |
| Confidence scores | Per-word or per-page via confidence_scores_granularity |

From the Mistral Document AI docs, OCR 2512+ supports a separate table_format (html or markdown; when it is unset, tables stay inline in the page markdown), header/footer extraction, and confidence at word or page granularity.

Parsing bounding boxes in Python

After a call, walk detected blocks to build citation metadata for your index:

def blocks_for_rag(ocr_response, min_confidence=0.85):
    """Turn OCR 4 pages into citation-ready chunks."""
    chunks = []
    for page in ocr_response.pages:
        page_idx = page.index
        dims = page.dimensions
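        # "blocks" is an assumed attribute name; fall back to an empty list if absent
        for block in getattr(page, "blocks", []) or []:
            # A block without a score is treated as fully confident
            conf = getattr(block, "confidence", 1.0)
            if conf < min_confidence:
                continue  # dropped here; send to a human review queue in a real pipeline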
            chunks.append({
                "text": block.text,
                "type": block.type,           # title, table, paragraph, etc.
                "bbox": block.bbox,           # normalized or pixel coords
                "page": page_idx,
                "source_dims": dims,
                "confidence": conf,
            })
    return chunks

Low-confidence blocks are the hook for human-in-the-loop pipelines — the same pattern teams use with Microsoft Presidio to flag PII regions before indexing, except here the model tells you which OCR regions it distrusts.

TypeScript (Node SDK)

import { Mistral } from "@mistralai/mistralai";

const client = new Mistral({ apiKey: process.env.MISTRAL_API_KEY });

const ocrResponse = await client.ocr.process({
  model: "mistral-ocr-latest",
  document: {
    type: "document_url",
    documentUrl: "https://arxiv.org/pdf/2201.04234",
  },
  tableFormat: "html",
  includeImageBase64: true,
});

console.log(ocrResponse.pages[0].markdown);

Benchmarks and the scoring caveat

Mistral ran OCR 4 against AI-native OCR models, frontier general-purpose models, enterprise document services, and its own OCR 3.

Human preference (600+ documents)

Annotators blindly ranked competitor output against OCR 4 on 600+ real-world documents across 12+ languages, sourced from third-party vendors. OCR 4 was preferred in the majority of documents against every system tested — win rates averaging 72%.

"We benchmarked Mistral OCR 4 against the leading agentic document parsers across a chart and figure dense financial QA dataset and reached equivalent accuracy at roughly 8x lower cost and 17x lower latency." — Aidan Donohue, AI Engineer, Rogo

Public and internal scores

| Benchmark | OCR 4 score | Notes |
| --- | --- | --- |
| OlmOCRBench | 85.20 | Top among models Mistral tested |
| OmniDocBench | 93.07 | Aggregate; see caveats below |
| Crawl Multilingual (internal) | 0.98 | Leads all 8 language groups |

"Mistral OCR is roughly 4x faster per page than our incumbent provider, an impressive result for the high-volume docketing workflows where speed is critical." — Ivan Mihailov, AI engineer, Anaqua

Mistral is transparent about benchmark limitations. When they audited mismatches, most were scoring artifacts, not model errors:

| Artifact type | What happens |
| --- | --- |
| Ground-truth errors | Reference annotations wrong; model read the page correctly |
| Equivalent math notation | Different LaTeX that renders identically counts as mismatch |
| Equation segmentation | Single vs split equation blocks fail string alignment |
| Multi-column reading order | Hyphenation across columns flagged as order failure |
| Block-type attribution | Headers/footers stripped for scoring remove valid titles |

These artifacts concentrate in mathematical, scientific, and multi-column documents. Mistral treats aggregate scores as directional — evaluate on your own corpus before production. Our AI benchmarks guide walks through how to read leaderboard scores without overfitting to a single number.
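
One way to act on the "evaluate on your own corpus" advice is a character error rate (CER) check against a handful of hand-verified pages. The sketch below is a generic metric, not Mistral's scoring method; the function names are illustrative, and the normalization step (lowercasing, collapsing whitespace) is an assumption you may want to loosen or tighten for your documents.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete, and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences are not scored as errors."""
    return " ".join(text.lower().split())


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length, computed after normalization."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return levenshtein(ref, hyp) / len(ref)
```

A plain string metric like this reproduces the artifacts listed above (equivalent LaTeX still counts as a mismatch), so treat it as a first-pass signal and inspect the worst-scoring pages by hand.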

Multilingual breakdown (Crawl Multilingual)

On Mistral's internal eval, OCR 4 leads across all eight language groups:

| Language group | OCR 4 position | Notes |
| --- | --- | --- |
| English | Leader | Baseline for most public benchmarks |
| Western Europe | Leader | French, German, Spanish, Italian |
| Eastern Europe | Leader | Polish, Czech, Romanian |
| Middle Eastern | Leader | Arabic, Persian, Hebrew |
| Chinese | Leader | Simplified and traditional |
| East Asian | Leader | Japanese, Korean |
| Southeast Asian | Leader | Thai, Vietnamese, Indonesian |
| Rare / low-resource | Widest gap | Hindi, Georgian, Bengali, Armenian, Telugu, and others |

The rare-language gap matters for global enterprises: a parser that works on English financial filings but fails on Hindi vendor invoices breaks closed-loop agent workflows at the ingestion step.


OCR 4 API vs Document AI

Both modes hit POST /v1/ocr with model mistral-ocr-latest. Every call returns extracted content, bounding boxes, block types, confidence scores, and markdown-structured text. Document AI adds optional layers on the same response.

| Mode | When to use |
| --- | --- |
| Pure OCR | Raw extraction, custom downstream logic, high-volume batch ingestion, self-host |
| Document AI | Pass a JSON schema for structured fields, annotate images with a schema, or add a custom prompt for interpretation |

Decision rule: need raw blocks → OCR as-is. Need invoice fields or domain-specific JSON → add document_annotation_format and optional document_annotation_prompt.

Basic extraction (Python)

from mistralai import Mistral

client = Mistral(api_key="YOUR_MISTRAL_API_KEY")

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234",
    },
    table_format="html",  # or "markdown"
    include_image_base64=True,
)

print(ocr_response.pages[0].markdown)

Document AI with JSON schema

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://example.com/invoice.pdf",
    },
    document_annotation_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "vendor": {"type": "string"},
                    "total": {"type": "number"},
                    "line_items": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "description": {"type": "string"},
                                "amount": {"type": "number"},
                            },
                        },
                    },
                },
                "required": ["vendor", "total"],
            },
        },
    },
    document_annotation_prompt="Extract invoice fields from this document.",
)
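
Schema-constrained output still deserves a sanity check before it reaches a ledger or an index. The sketch below assumes the annotation comes back as a JSON string (for example on a `document_annotation` attribute of the response, which you should confirm in the API reference); `parse_invoice_annotation` and the `totals_match` key are illustrative names, not part of the SDK.

```python
import json


def parse_invoice_annotation(raw: str, tolerance: float = 0.01) -> dict:
    """Parse a JSON annotation string and run basic checks on the invoice fields."""
    data = json.loads(raw)
    missing = [key for key in ("vendor", "total") if key not in data]
    if missing:
        raise ValueError(f"annotation is missing required fields: {missing}")
    line_items = data.get("line_items") or []
    items_sum = sum(item.get("amount", 0.0) for item in line_items)
    # Only compare when line items exist; a mismatch may simply mean tax or fees,
    # so this is a review flag rather than a hard failure
    data["totals_match"] = (not line_items) or abs(items_sum - data["total"]) <= tolerance
    return data


# Hand-written sample standing in for a real annotation string
sample = (
    '{"vendor": "Acme", "total": 30.0, "line_items": ['
    '{"description": "A", "amount": 10.0}, {"description": "B", "amount": 20.0}]}'
)
invoice = parse_invoice_annotation(sample)
```

Invoices where `totals_match` is false are natural candidates for the human review path rather than automatic posting.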

cURL

curl https://api.mistral.ai/v1/ocr \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -d '{
    "model": "mistral-ocr-latest",
    "document": {
      "type": "document_url",
      "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    "table_format": "html"
  }'

Optional parameters from the OCR API docs:

  • pages: "0,2-4" or a list of integers (0-indexed)
  • table_format: html or markdown
  • bbox_annotation_format: structured JSON per detected image region
  • extract_header / extract_footer: include header/footer blocks (OCR 2512+)
  • confidence_scores_granularity: "word" or "page" for inline confidence
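
If you accept page selections from users, it can help to expand a spec such as "0,2-4" into explicit indices before sending it, so bad input fails locally. This is a small sketch under the assumption that ranges are inclusive on both ends; `parse_pages` is a hypothetical helper, not an SDK function.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a spec such as "0,2-4" into sorted, de-duplicated 0-based page indices."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # inclusive range (assumption)
        else:
            pages.add(int(part))
    return sorted(pages)


selected = parse_pages("0,2-4")  # [0, 2, 3, 4]
```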

Base64 and local PDF upload

For local files that are not reachable at a public URL, encode them locally and pass a data URL, the same pattern as the Mistral basic OCR cookbook. The document bytes are still sent to the API in the request body:

import base64

with open("contract.pdf", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode()

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{b64}",
    },
)

For strict residency, skip the API entirely and run the single-container self-host path instead — the trade-off matrix in our closed vs open source AI guide applies here.


RAG ingestion pipeline with OCR 4

OCR 4 handles the extract step of an extract → chunk → embed → retrieve pipeline. Here is a production-shaped flow:

flowchart LR
  A[PDF / DOC upload] --> B[Mistral OCR 4 API]
  B --> C[Typed blocks + bboxes]
  C --> D{confidence >= threshold?}
  D -->|yes| E[Semantic chunks]
  D -->|no| F[Human review queue]
  E --> G[Embed + vector index]
  G --> H[RAG or agent retrieval]
  H --> I[Answer with bbox citations]

Why bboxes matter for RAG: when an LLM cites a source, users expect to see the highlighted region in the original PDF. Bounding boxes make that possible without re-running layout analysis at query time — a direct counter to AI hallucination in uncited answers.
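
To draw that highlight in a PDF viewer, a pixel box has to be mapped into PDF coordinates. The sketch below assumes boxes arrive as top-left-origin pixel corners and that the page `dimensions` carry the `dpi` and pixel height used for rendering; the `PixelBox` field names are illustrative. It relies on the default PDF user space (72 units per inch, origin at the bottom-left) and ignores page rotation and crop-box offsets.

```python
from dataclasses import dataclass

POINTS_PER_INCH = 72  # default PDF user-space units per inch


@dataclass
class PixelBox:
    """Assumed bbox shape: pixel corners with the origin at the page's top-left."""
    top_left_x: float
    top_left_y: float
    bottom_right_x: float
    bottom_right_y: float


def bbox_to_pdf_rect(box: PixelBox, dpi: int, page_height_px: int) -> tuple:
    """Return (x0, y0, x1, y1) in PDF points with a bottom-left origin."""
    def to_points(pixels: float) -> float:
        return pixels * POINTS_PER_INCH / dpi

    x0 = to_points(box.top_left_x)
    x1 = to_points(box.bottom_right_x)
    # Image y grows downward, PDF y grows upward, so flip against the page height
    y0 = to_points(page_height_px - box.bottom_right_y)
    y1 = to_points(page_height_px - box.top_left_y)
    return (x0, y0, x1, y1)


rect = bbox_to_pdf_rect(PixelBox(200, 400, 1000, 600), dpi=200, page_height_px=2200)
# rect == (72.0, 576.0, 360.0, 648.0)
```

Storing the converted rectangle alongside each chunk keeps query-time citation rendering to a simple lookup.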

Search Toolkit integration: Mistral's open-source Search Toolkit (public preview, announced at AI Now Summit 2026) accepts OCR 4 structured output as ingestion input. If you are assembling a composable search stack, OCR 4 is the document connector; the toolkit handles retrieval and eval loops.

For teams comparing text OCR against visual retrieval, read our PixelRAG guide — PixelRAG skips markdown extraction and retrieves screenshot tiles instead. The approaches are complementary: OCR 4 for searchable text + citations; PixelRAG when layout and charts must survive as images.


Semantic chunking by block type

Fixed-size chunking (512 tokens, overlap 64) is the default in many RAG tutorials — and it destroys tables, splits equations mid-line, and separates titles from body text. OCR 4's block types give you a better default:

| Block type | Chunk strategy |
| --- | --- |
| title | Attach to the following paragraph as metadata, or keep as section header |
| paragraph | One chunk per block; merge only if under 100 tokens |
| table | Single chunk; store as HTML from table_format="html" |
| equation | Never split; keep LaTeX intact |
| figure | Chunk caption + optional bbox_annotation_format JSON |
| list | One chunk per list block; preserve nesting in markdown |

This aligns with agentic RAG thinking from our RAG vs agentic RAG guide: structure-aware retrieval beats blind similarity search when documents have hierarchy. OCR 4 supplies that hierarchy at extraction time so you do not need a separate layout model.
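
The chunking rules above can be sketched in a few lines. This is a minimal illustration, assuming each block exposes a `type` and `text`; the `Block` and `Chunk` classes, the whitespace-based token estimate, and the choice to treat every non-paragraph block as atomic are assumptions made for the example, not part of the OCR response format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    type: str
    text: str


@dataclass
class Chunk:
    text: str
    types: list
    section: Optional[str] = None  # most recent title, kept as metadata


def approx_tokens(text: str) -> int:
    """Whitespace word count as a crude stand-in for a real tokenizer."""
    return len(text.split())


def chunk_blocks(blocks: list, min_tokens: int = 100) -> list:
    chunks: list = []
    section: Optional[str] = None
    for block in blocks:
        if block.type == "title":
            section = block.text  # titles become metadata, not standalone chunks
            continue
        if block.type != "paragraph":
            # Tables, equations, figures, and lists are never merged or split
            chunks.append(Chunk(block.text, [block.type], section))
            continue
        last = chunks[-1] if chunks else None
        if (
            last is not None
            and set(last.types) == {"paragraph"}
            and last.section == section
            and approx_tokens(last.text) < min_tokens
        ):
            # Previous paragraph chunk is still short: merge into it
            last.text = f"{last.text}\n\n{block.text}"
            last.types.append("paragraph")
        else:
            chunks.append(Chunk(block.text, ["paragraph"], section))
    return chunks


demo = chunk_blocks([
    Block("title", "Intro"),
    Block("paragraph", "short one"),
    Block("paragraph", "short two"),
    Block("table", "<table>...</table>"),
    Block("paragraph", "after table"),
])
```

In the demo, the two short paragraphs merge into one chunk under the "Intro" section, the table stays whole, and the paragraph after the table starts a new chunk.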

For very long documents, pair typed chunks with context compression before sending retrieved blocks to the LLM — especially when tables and footnotes inflate token counts.


Confidence scores and human-in-the-loop routing

OCR 4 emits confidence at word or page granularity. A practical routing policy:

| Confidence band | Action |
| --- | --- |
| ≥ 0.95 | Auto-index; no review |
| 0.85 to < 0.95 | Index with needs_review flag |
| < 0.85 | Queue for human verifier |

Invoice and compliance pipelines benefit most: extract structured fields via Document AI, then send only low-confidence line items to reviewers instead of the full document. Combine with Presidio when bboxes overlap known PII entity types (names, SSN regions) for redaction before indexing.
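
The routing policy maps directly onto a small function. A minimal sketch, assuming confidence is a float in [0, 1]; the route names and default thresholds simply mirror the bands above and are not API values.

```python
def route_by_confidence(
    confidence: float,
    auto_threshold: float = 0.95,
    review_threshold: float = 0.85,
) -> str:
    """Map a confidence score to one of three illustrative routes."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    if confidence >= auto_threshold:
        return "auto_index"
    if confidence >= review_threshold:
        return "index_needs_review"
    return "human_queue"


def partition(items: list) -> dict:
    """Group (item_id, confidence) pairs by route."""
    routes = {"auto_index": [], "index_needs_review": [], "human_queue": []}
    for item_id, confidence in items:
        routes[route_by_confidence(confidence)].append(item_id)
    return routes


routed = partition([("line-1", 0.99), ("line-2", 0.90), ("line-3", 0.40)])
```

Keeping the thresholds as parameters makes it easy to tune them per document type once you have reviewer feedback.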


Where it fits in production pipelines

Mistral positions OCR 4 as an ingestion component, not a decision engine. Recommended workloads:

| Use case | Why OCR 4 fits |
| --- | --- |
| RAG semantic chunking | Classified blocks become better retrieval units than raw page text |
| Agentic workflows | Bboxes + types give agents structural primitives for forms and compliance |
| Enterprise search | Typed output feeds connectors and entity extraction |
| Human-in-the-loop | Confidence scores route low-certainty regions to reviewers |
| Redaction | Bounding boxes localize PII for automated or assisted redaction |

OCR 4 integrates with the Mistral Search Toolkit (public preview) — Mistral's open-source composable search framework announced at AI Now Summit 2026. Structured OCR output feeds citation-ready ingestion for retrieval and evaluation workflows.

Out of scope: medical diagnosis, legal judgment, high-stakes financial decisions, safety-critical systems, real-time latency-sensitive paths, and non-document inputs (audio, video).


Deployment options

| Channel | Status |
| --- | --- |
| Mistral Studio API | Available |
| Amazon SageMaker | Available |
| Microsoft Foundry | Available |
| Snowflake Parse Document | Coming soon |
| Self-hosted (single container) | Enterprise (contact Mistral sales) |

"The availability of Mistral Document AI with OCR 4 in Microsoft Foundry marks an important milestone in our partnership." — Kimmi Grewal, VP, AI Ecosystem Partnerships, Microsoft

For strict data residency, the single-container self-host path keeps documents inside your VPC. For teams without GPU ops capacity, the managed API with Batch pricing at $2/1k pages targets high-volume archive digitization.


Cost at scale

Mistral OCR 4 pricing is page-based, not token-based — which simplifies budgeting for document-heavy workloads compared to vision-LLM page-by-page parsing. See our generative AI cost optimization guide for broader FinOps patterns.

| Volume (pages) | Standard API ($4/1k) | Batch API ($2/1k) | Document AI ($5/1k) |
| --- | --- | --- | --- |
| 10,000 | $40 | $20 | $50 |
| 100,000 | $400 | $200 | $500 |
| 1,000,000 | $4,000 | $2,000 | $5,000 |

Batch API applies a 50% discount for non-real-time jobs — archive digitization, nightly ingestion, bulk contract processing. Rogo reported 8× lower cost versus their prior agentic document parser on a financial QA dataset at equivalent accuracy.
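
For budgeting, the per-page arithmetic is easy to script. The sketch below hard-codes the list prices quoted in this article (re-check them before relying on the output); `estimate_cost` and the tier names are illustrative.

```python
from decimal import Decimal

# USD per 1,000 pages, as quoted in this article
PRICE_PER_1K = {
    "standard": Decimal("4"),
    "batch": Decimal("2"),
    "document_ai": Decimal("5"),
}


def estimate_cost(pages: int, tier: str = "standard") -> Decimal:
    """Return the estimated USD cost for a page count at the given tier."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if tier not in PRICE_PER_1K:
        raise ValueError(f"unknown tier: {tier!r}")
    return PRICE_PER_1K[tier] * pages / 1000


archive_cost = estimate_cost(1_000_000, "batch")  # Decimal("2000")
```

Using `Decimal` avoids floating-point drift when the estimate is summed across many jobs.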

Managed OCR landscape (June 2026)

SolutionStructured outputBboxesSelf-hostTypical pricing model
Mistral OCR 4✅ types + confidenceEnterprise container$2–5 / 1k pages
AWS TextractForms + tablesPartialPer page + feature tier
Google Document AIPartialPer page
Azure Document IntelligencePartialPer page
Baidu Unlimited-OCRText✅ MIT weightsGPU compute only
PixelRAGVisual tilesN/A (screenshots)✅ Apache 2.0Self-host / hosted API

Mistral's differentiation is the combined package: bboxes + block types + confidence + Document AI schema layer on one endpoint. Cloud incumbents offer structured extraction but rarely ship per-word confidence and typed blocks in a single OCR-native response.


Batch API workflow

For million-page archives, structure jobs around the Batch API rather than synchronous calls:

  1. Upload documents to object storage (S3, GCS, Azure Blob)
  2. Submit batch jobs with public or signed URLs per document
  3. Poll for completion — no rate-limit pressure on synchronous endpoints
  4. Parse structured JSON; route low-confidence blocks to review queues
  5. Index chunks into your vector store (embeddings guide)
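
# Sketch: batch ingestion via a JSONL request file.
# list_pending_pdfs and track_job are hypothetical helpers; the files/batch calls
# follow the Mistral Python SDK's batch interface, so verify them against current docs.
import json

documents = list_pending_pdfs("s3://archive/invoices/2025/")
requests = [
    {"custom_id": str(i), "body": {"document": {"type": "document_url", "document_url": url}, "table_format": "html"}}
    for i, url in enumerate(documents)
]
payload = "\n".join(json.dumps(r) for r in requests).encode("utf-8")
batch_file = client.files.upload(file={"file_name": "ocr_batch.jsonl", "content": payload}, purpose="batch")
job = client.batch.jobs.create(input_files=[batch_file.id], model="mistral-ocr-latest", endpoint="/v1/ocr")
track_job(job.id, source="s3://archive/invoices/2025/")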

Anaqua's 4× faster per-page result versus their incumbent matters most here — docketing and IP workflows process high page counts daily, and latency compounds into missed deadlines.


Document AI landscape: same week, three bets

June 2026 delivered three distinct document-ingestion philosophies within 72 hours:

| Approach | Representative | Core bet |
| --- | --- | --- |
| Structured text OCR | Mistral OCR 4 | Bboxes + types + confidence for RAG citations |
| Open long-horizon parsing | Baidu Unlimited-OCR | One-pass multi-page, MIT weights |
| Visual retrieval | PixelRAG | Skip text; retrieve screenshot tiles |

Most production stacks will mix layers: OCR 4 for searchable text and compliance metadata, PixelRAG or vision models where charts dominate, and agents on top to act on extracted structure.


Mistral OCR 4 vs Baidu Unlimited-OCR (same week)

Both models landed within 24 hours of each other — a signal that document AI is having a moment in June 2026.

| Dimension | Mistral OCR 4 | Baidu Unlimited-OCR |
| --- | --- | --- |
| Release | June 23, 2026 | June 22–23, 2026 |
| License | Managed API / enterprise self-host | MIT, open weights |
| Multi-page | API per document | One forward pass, 32k context |
| Structured output | Bboxes, types, confidence | Text-focused parsing |
| Cost model | $2–5 / 1k pages | Self-host GPU cost only |
| Document AI layer | Built-in JSON schema | Roll your own |
| Human preference | 72% avg win rate (Mistral-reported) | Not yet published |
| Deepseek-OCR lineage | Separate model family | ✅ extends Deepseek-OCR ngram suppression |
| Best for | Managed RAG + citations | Self-hosted bulk PDF parsing |

Practical split: teams under compliance pressure who want bounding boxes and confidence scores without building parsers → Mistral. Teams who need open weights, unlimited-length PDFs in one shot, and zero per-page API fees → Baidu. Teams where tables and charts must stay visual → PixelRAG.

Neither replaces the other today — they optimize for different constraints. For a broader build-vs-buy framing, see closed source vs local open alternatives.


Getting started

  1. API key: Mistral Studio
  2. Cookbook: Getting Started with OCR 4 (bounding boxes and block classification walkthrough)
  3. Webinar: OCR 4 in Production, July 7, 2026, 6:00 PM CET
  4. Model card: docs.mistral.ai/models/model-cards/ocr-4-0

Primary sources: Mistral OCR 4 announcement · OCR API reference · Document AI docs · @MistralAI


Summary

Mistral OCR 4 shifts document extraction from flat text to structured blocks — bounding boxes, typed regions, and confidence scores across 170 languages. It tops Mistral's human preference tests (72% average win rate) and public benchmarks like OlmOCRBench (85.20), with honest caveats about automated scoring.

The API starts at $4/1k pages ($2 via Batch). Document AI on the same endpoint adds schema-driven extraction without a separate parser. For open-weight, one-pass multi-page parsing, compare against Baidu Unlimited-OCR released the day before. For visual retrieval without text parsing, see PixelRAG.

For RAG ingestion, compliance workflows, and enterprise search connectors, OCR 4 is the most complete managed option Mistral has shipped to date — especially when source-grounded citations and structure-aware chunking matter.


Pricing, benchmark scores, and deployment channels reflect Mistral's June 23, 2026 release. Re-check mistral.ai/news/ocr-4 and the API docs before production deployment.
