What is Mistral OCR 4?

Mistral OCR 4 is a document understanding model released by Mistral AI on June 23, 2026. It extracts text from PDF, DOC, PPT, and OpenDocument files and returns bounding boxes, block classifications (titles, tables, equations, signatures), and inline confidence scores at page or word granularity. The API model ID is mistral-ocr-latest.

How much does Mistral OCR 4 cost?

Mistral OCR 4 via API costs $4 per 1,000 pages. The Batch API cuts that to $2 per 1,000 pages (50% discount). Document AI — which adds JSON schema extraction and image annotation on top of the same OCR engine — costs $5 per 1,000 pages.

How does Mistral OCR 4 score on benchmarks?

Mistral reports 85.20 on OlmOCRBench (the highest among the models it tested), 93.07 on OmniDocBench, and 0.98 on an internal Crawl Multilingual evaluation. In blind human preference tests on 600+ documents across 12+ languages, Mistral says independent annotators preferred OCR 4 over every competitor tested, with win rates averaging 72%.

What is the difference between Mistral OCR 4 and Document AI?

Both use the same POST /v1/ocr endpoint and the same underlying model. Pure OCR mode returns extracted text, bounding boxes, block types, and confidence scores. Document AI adds optional parameters — a JSON schema, image annotation schema, or custom prompt — so mistral-small-2603 reshapes the extracted content into structured fields without separate downstream parsing.

Can Mistral OCR 4 run self-hosted?

Yes. Mistral OCR 4 is compact enough to deploy in a single container, keeping document data inside your environment for residency and compliance. Self-managed deployment is available to enterprise customers through Mistral sales. The model also ships via Amazon SageMaker and Microsoft Foundry.

How does Mistral OCR 4 compare to Baidu Unlimited-OCR?

Baidu Unlimited-OCR (released June 22–23, 2026) is open-weight and processes entire multi-page PDFs in one forward pass under a 32k context, which suits self-hosted pipelines with no per-page fee. Mistral OCR 4 is a managed API focused on structured output with bounding boxes, confidence scores, and Document AI schema extraction. Choose Baidu for open weights and long-horizon batch parsing; choose Mistral for production APIs, citation-ready RAG ingestion, and enterprise integrations.

How should I chunk Mistral OCR 4 output for RAG?

Use block classification as your chunk boundary: one retrieval unit per title, paragraph, table, or equation block rather than fixed token windows. Attach bounding box metadata and page index to each chunk so answers can cite exact regions. For hybrid pipelines, embed classified text blocks with a vector index (see our embeddings guide) while keeping table blocks as HTML or markdown without splitting rows across chunks.

Mistral OCR 4: Structured Document Extraction API Guide (2026) | explainx.ai Blog


On June 23, 2026, Mistral AI released Mistral OCR 4 — a document extraction model that returns not just text, but where each block sits, what type it is, and how confident the model is in every region. @MistralAI pitched it as the ingestion layer for enterprise search, RAG, and agentic document workflows.

The timing is notable: Baidu Unlimited-OCR dropped the day before with a different bet — open weights and one-shot multi-page parsing. PixelRAG, released two days earlier, skips text extraction entirely and retrieves over page screenshots. Mistral's answer is structured text OCR — managed, citation-ready, and designed to feed vector indexes rather than vision-only pipelines.

TL;DR

Spec	Mistral OCR 4	Document AI (same engine)
Release	June 23, 2026	Same endpoint, extra params
Model ID	`mistral-ocr-latest`	`mistral-ocr-latest` + schema
Output	Text, bboxes, block types, confidence	+ JSON schema / image annotation
Languages	170 across 10 groups	Same
API price	$4 / 1,000 pages	$5 / 1,000 pages
Batch API	$2 / 1,000 pages (50% off)	Same tier

Block type	Downstream use
title / heading	Section boundaries for semantic chunking
paragraph	Base retrieval unit for prose
table	Keep intact — never split rows across embedding chunks
equation	LaTeX blocks for scientific corpora
figure / image	Pair with `bbox_annotation_format` for caption extraction
signature	Compliance and contract verification
header / footer	Optional via `extract_header` / `extract_footer`
list	Preserve bullet hierarchy in markdown output

Field	Purpose
`index`	0-based page number
`markdown`	Full page text with structure preserved
`dimensions`	`width`, `height`, `dpi` for bbox coordinate mapping
`images`	Detected figures with bbox coordinates
`tables`	Extracted tables (when `table_format` is set)
`hyperlinks`	URLs found on the page
`header` / `footer`	Optional when extraction flags are enabled
Confidence scores	Per-word or per-page via `confidence_scores_granularity`
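
The `dimensions` field is what lets a bounding box be mapped onto a rendered page. A minimal sketch of that mapping, assuming boxes arrive as pixel coordinates under `top_left_x`, `top_left_y`, `bottom_right_x`, and `bottom_right_y` keys; the key names, the plain-dict shapes, and both helper functions are assumptions for illustration, not confirmed fields of the OCR 4 response:

```python
def normalize_bbox(bbox, dimensions):
    """Convert a pixel-space box to fractions of page width and height."""
    width = dimensions["width"]
    height = dimensions["height"]
    if width <= 0 or height <= 0:
        raise ValueError("page dimensions must be positive")
    return {
        "x0": bbox["top_left_x"] / width,
        "y0": bbox["top_left_y"] / height,
        "x1": bbox["bottom_right_x"] / width,
        "y1": bbox["bottom_right_y"] / height,
    }


def to_points(bbox, dimensions):
    """Convert a pixel-space box to PDF points (1/72 inch) using the page dpi."""
    scale = 72.0 / dimensions["dpi"]
    return {key: value * scale for key, value in bbox.items()}


# Hypothetical page and figure box, chosen only to make the arithmetic easy to follow.
page_dims = {"width": 1700, "height": 2200, "dpi": 200}
figure_box = {
    "top_left_x": 170,
    "top_left_y": 220,
    "bottom_right_x": 850,
    "bottom_right_y": 1100,
}

print(normalize_bbox(figure_box, page_dims))
print(to_points(figure_box, page_dims))
```

Normalized coordinates survive re-rendering at a different resolution, which is why they are usually the safer thing to store next to a chunk; points are handy when drawing highlights directly onto the source PDF.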

python

def blocks_for_rag(ocr_response, min_confidence=0.85):
    """Turn OCR 4 pages into citation-ready chunks."""
    chunks = []
    for page in ocr_response.pages:
        page_idx = page.index
        dims = page.dimensions
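        # Assumes each page exposes typed blocks as `page.blocks`; falls back to none.
        for block in getattr(page, "blocks", None) or []:
            conf = getattr(block, "confidence", None)
            if conf is None:
                conf = 1.0  # no score reported; treat as trusted
            if conf < min_confidence:
                continue  # skipped here; send these to a human review queue instead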
            chunks.append({
                "text": block.text,
                "type": block.type,           # title, table, paragraph, etc.
                "bbox": block.bbox,           # normalized or pixel coords
                "page": page_idx,
                "source_dims": dims,
                "confidence": conf,
            })
    return chunks

typescript

import { Mistral } from "@mistralai/mistralai";

const client = new Mistral({ apiKey: process.env.MISTRAL_API_KEY });

const ocrResponse = await client.ocr.process({
  model: "mistral-ocr-latest",
  document: {
    type: "document_url",
    documentUrl: "https://arxiv.org/pdf/2201.04234",
  },
  tableFormat: "html",
  includeImageBase64: true,
});

console.log(ocrResponse.pages[0].markdown);

Benchmark	OCR 4 score	Notes
OlmOCRBench	85.20	Top among models Mistral tested
OmniDocBench	93.07	Aggregate; see caveats below
Crawl Multilingual (internal)	0.98	Leads all 8 language groups

Artifact type	What happens
Ground-truth errors	Reference annotations wrong; model read the page correctly
Equivalent math notation	Different LaTeX that renders identically counts as mismatch
Equation segmentation	Single vs split equation blocks fail string alignment
Multi-column reading order	Hyphenation across columns flagged as order failure
Block-type attribution	Headers/footers stripped for scoring remove valid titles
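
The "equivalent math notation" row is easy to reproduce. A minimal sketch, using a deliberately small and hypothetical normalizer (`normalize_latex` is not part of any benchmark harness), of how exact string comparison penalizes LaTeX that renders identically:

```python
import re


def normalize_latex(expr):
    """Collapse two notation differences that do not change rendering.

    Intended for comparison only: stripping whitespace is not safe if the
    result is going to be rendered again.
    """
    expr = re.sub(r"\s+", "", expr)
    # x^{2} -> x^2 and a_{i} -> a_i when the script is a single character.
    expr = re.sub(r"([\^_])\{([A-Za-z0-9])\}", r"\1\2", expr)
    return expr


reference = r"E = mc^{2}"   # what the ground truth might contain
predicted = r"E=mc^2"       # what a model might emit for the same equation

print("exact match:     ", reference == predicted)
print("normalized match:", normalize_latex(reference) == normalize_latex(predicted))
```

Strict string alignment counts this pair as an error even though both render the same, which is one reason to read aggregate scores alongside the caveats in the table.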

Language group	OCR 4 position	Notes
English	Leader	Baseline for most public benchmarks
Western Europe	Leader	French, German, Spanish, Italian
Eastern Europe	Leader	Polish, Czech, Romanian
Middle Eastern	Leader	Arabic, Persian, Hebrew
Chinese	Leader	Simplified and traditional
East Asian	Leader	Japanese, Korean
Southeast Asian	Leader	Thai, Vietnamese, Indonesian
Rare / low-resource	Widest gap	Hindi, Georgian, Bengali, Armenian, Telugu, and others

Mode	When to use
Pure OCR	Raw extraction, custom downstream logic, high-volume batch ingestion, self-host
Document AI	Pass a JSON schema for structured fields, annotate images with a schema, or add a custom prompt for interpretation

python

import os

from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])  # read the key from the environment

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://arxiv.org/pdf/2201.04234",
    },
    table_format="html",  # or "markdown"
    include_image_base64=True,
)

print(ocr_response.pages[0].markdown)

python

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": "https://example.com/invoice.pdf",
    },
    document_annotation_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "vendor": {"type": "string"},
                    "total": {"type": "number"},
                    "line_items": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "description": {"type": "string"},
                                "amount": {"type": "number"},
                            },
                        },
                    },
                },
                "required": ["vendor", "total"],
            },
        },
    },
    document_annotation_prompt="Extract invoice fields from this document.",
)
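
Once the call returns, the structured result still deserves a sanity check before it reaches downstream systems. A minimal sketch, assuming the annotation comes back as a JSON string; the sample below is hand-written for illustration, not real API output, and `parse_invoice_annotation` is a hypothetical helper:

```python
import json

REQUIRED_FIELDS = ("vendor", "total")  # mirrors "required" in the schema above


def parse_invoice_annotation(raw_annotation):
    """Parse the annotation JSON and check the required invoice fields."""
    data = json.loads(raw_annotation)
    missing = [name for name in REQUIRED_FIELDS if name not in data]
    if missing:
        raise ValueError(f"annotation is missing required fields: {missing}")
    total = data["total"]
    if isinstance(total, bool) or not isinstance(total, (int, float)):
        raise ValueError("total must be a number")
    # Cross-check: line items, when present, should sum to the stated total.
    items = data.get("line_items", [])
    if items:
        items_sum = sum(item.get("amount", 0) for item in items)
        data["total_matches_items"] = abs(items_sum - total) < 0.01
    return data


# Hand-written sample standing in for a returned annotation string.
sample = json.dumps({
    "vendor": "Acme GmbH",
    "total": 120.5,
    "line_items": [
        {"description": "Hosting", "amount": 100.0},
        {"description": "Support", "amount": 20.5},
    ],
})

invoice = parse_invoice_annotation(sample)
print(invoice["vendor"], invoice["total"], invoice["total_matches_items"])
```

The line-item cross-check is a cheap way to catch extraction errors that still satisfy the schema, such as a misread digit in the total.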

bash

curl https://api.mistral.ai/v1/ocr \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $MISTRAL_API_KEY" \
  -d '{
    "model": "mistral-ocr-latest",
    "document": {
      "type": "document_url",
      "document_url": "https://arxiv.org/pdf/2201.04234"
    },
    "table_format": "html"
  }'

python

import base64

with open("contract.pdf", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode()

ocr_response = client.ocr.process(
    model="mistral-ocr-latest",
    document={
        "type": "document_url",
        "document_url": f"data:application/pdf;base64,{b64}",
    },
)

mermaid

flowchart LR
  A[PDF / DOC upload] --> B[Mistral OCR 4 API]
  B --> C[Typed blocks + bboxes]
  C --> D{confidence >= threshold?}
  D -->|yes| E[Semantic chunks]
  D -->|no| F[Human review queue]
  E --> G[Embed + vector index]
  G --> H[RAG or agent retrieval]
  H --> I[Answer with bbox citations]

Block type	Chunk strategy
title	Attach to the following paragraph as metadata, or keep as section header
paragraph	One chunk per block; merge only if under 100 tokens
table	Single chunk; store as HTML from `table_format="html"`
equation	Never split; keep LaTeX intact
figure	Chunk caption + optional `bbox_annotation_format` JSON
list	One chunk per list block; preserve nesting in markdown
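
The strategy table above can be sketched as a single pass over typed blocks. This is a minimal illustration on plain dicts: the `type` and `text` keys, the whitespace-based token count, and the `chunk_blocks` helper are all assumptions, not part of the Mistral SDK. A paragraph is merged into the preceding paragraph only while that preceding chunk is still under the threshold.

```python
def rough_token_count(text):
    """Approximate tokens by whitespace-separated words (swap in a real tokenizer)."""
    return len(text.split())


def chunk_blocks(blocks, merge_under=100):
    """Titles become section metadata, short paragraphs merge, other blocks stay whole."""
    chunks = []
    section = None
    for block in blocks:
        kind = block["type"]
        if kind == "title":
            section = block["text"]  # carried as metadata on the chunks that follow
            continue
        previous = chunks[-1] if chunks else None
        can_merge = (
            kind == "paragraph"
            and previous is not None
            and previous["type"] == "paragraph"
            and previous["section"] == section
            and rough_token_count(previous["text"]) < merge_under
        )
        if can_merge:
            previous["text"] += "\n\n" + block["text"]
        else:
            # Tables, equations, figures, and lists are never split or merged.
            chunks.append({"type": kind, "text": block["text"], "section": section})
    return chunks


# Hypothetical blocks standing in for one parsed page.
blocks = [
    {"type": "title", "text": "Payment terms"},
    {"type": "paragraph", "text": "Invoices are due within 30 days."},
    {"type": "paragraph", "text": "Late payments accrue interest."},
    {"type": "table", "text": "<table><tr><td>Net 30</td></tr></table>"},
    {"type": "paragraph", "text": "Contact billing with questions."},
]

for chunk in chunk_blocks(blocks):
    print(chunk["type"], "|", chunk["section"], "|", repr(chunk["text"]))
```

Merging stops at tables and at section changes, so a retrieved chunk never mixes prose from two sections or carries half a table.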

Confidence band	Action
≥ 0.95	Auto-index; no review
0.85 – 0.95	Index with `needs_review` flag
< 0.85	Queue for human verifier
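
The three bands translate directly into a routing function. A minimal sketch, assuming each chunk is a dict with a `confidence` value between 0 and 1; the queue names and both helpers are hypothetical:

```python
from collections import defaultdict


def route_by_confidence(confidence, auto_index_at=0.95, review_below=0.85):
    """Map a confidence score to one of the three actions in the table above."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= auto_index_at:
        return "auto_index"
    if confidence >= review_below:
        return "index_needs_review"
    return "human_review"


def partition_chunks(chunks):
    """Group chunks by the action their confidence score calls for."""
    queues = defaultdict(list)
    for chunk in chunks:
        queues[route_by_confidence(chunk["confidence"])].append(chunk)
    return dict(queues)


# Hypothetical chunks, one per band.
chunks = [
    {"text": "Total due: 120.50", "confidence": 0.99},
    {"text": "Signed by J. Doe", "confidence": 0.90},
    {"text": "Ref: 8B-00l7", "confidence": 0.62},
]

for action, items in partition_chunks(chunks).items():
    print(action, [item["text"] for item in items])
```

Keeping the thresholds as parameters makes it easy to tighten them for high-stakes corpora such as contracts, where a wrong digit costs more than a review.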

Use case	Why OCR 4 fits
RAG semantic chunking	Classified blocks become better retrieval units than raw page text
Agentic workflows	Bboxes + types give agents structural primitives for forms and compliance
Enterprise search	Typed output feeds connectors and entity extraction
Human-in-the-loop	Confidence scores route low-certainty regions to reviewers
Redaction	Bounding boxes localize PII for automated or assisted redaction

Channel	Status
Mistral Studio API	Available
Amazon SageMaker	Available
Microsoft Foundry	Available
Snowflake Parse Document	Coming soon
Self-hosted (single container)	Enterprise — contact Mistral sales

Volume (pages)	Standard API ($4/1k)	Batch API ($2/1k)	Document AI ($5/1k)
10,000	$40	$20	$50
100,000	$400	$200	$500
1,000,000	$4,000	$2,000	$5,000
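
The table follows from the per-1,000-page rates. A minimal sketch for estimating other volumes, assuming strictly linear pricing with no minimums, free tiers, or rounding rules (actual billing may differ); `estimate_cost` and the tier names are hypothetical:

```python
RATES_PER_1K_PAGES = {
    "standard": 4.0,     # Standard API
    "batch": 2.0,        # Batch API (50% discount)
    "document_ai": 5.0,  # Document AI
}


def estimate_cost(pages, tier="standard"):
    """Estimate the cost in USD for a page count at a given pricing tier."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if tier not in RATES_PER_1K_PAGES:
        raise ValueError(f"unknown tier: {tier}")
    return pages / 1000 * RATES_PER_1K_PAGES[tier]


for volume in (10_000, 100_000, 1_000_000):
    row = {tier: estimate_cost(volume, tier) for tier in RATES_PER_1K_PAGES}
    print(f"{volume:>9,} pages: {row}")
```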

Solution	Structured output	Bboxes	Self-host	Typical pricing model
Mistral OCR 4	✅ types + confidence	✅	Enterprise container	$2–5 / 1k pages
AWS Textract	Forms + tables	Partial	❌	Per page + feature tier
Google Document AI	✅	Partial	❌	Per page
Azure Document Intelligence	✅	Partial	❌	Per page
Baidu Unlimited-OCR	Text	❌	✅ MIT weights	GPU compute only
PixelRAG	Visual tiles	N/A (screenshots)	✅ Apache 2.0	Self-host / hosted API

python

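# Sketch: batch ingestion as one JSONL upload plus one batch job (assumes the SDK's batch API).
# list_pending_pdfs and track_job are hypothetical helpers you supply.
import json

documents = list_pending_pdfs("s3://archive/invoices/2025/")
rows = [
    {"custom_id": str(i), "body": {
        "document": {"type": "document_url", "document_url": doc_url},
        "table_format": "html",
    }}
    for i, doc_url in enumerate(documents)
]
payload = "\n".join(json.dumps(row) for row in rows).encode()
batch_file = client.files.upload(
    file={"file_name": "ocr_batch.jsonl", "content": payload}, purpose="batch")
job = client.batch.jobs.create(
    input_files=[batch_file.id], model="mistral-ocr-latest", endpoint="/v1/ocr")
track_job(job.id, source=documents)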

Approach	Representative	Core bet
Structured text OCR	Mistral OCR 4	Bboxes + types + confidence for RAG citations
Open long-horizon parsing	Baidu Unlimited-OCR	One-pass multi-page, MIT weights
Visual retrieval	PixelRAG	Skip text; retrieve screenshot tiles

Dimension	Mistral OCR 4	Baidu Unlimited-OCR
Release	June 23, 2026	June 22–23, 2026
License	Managed API / enterprise self-host	MIT, open weights
Multi-page	API per document	One forward pass, 32k context
Structured output	Bboxes, types, confidence	Text-focused parsing
Cost model	$2–5 / 1k pages	Self-host GPU cost only
Document AI layer	Built-in JSON schema	Roll your own
Human preference	72% avg win rate (Mistral-reported)	Not yet published
Deepseek-OCR lineage	Separate model family	✅ extends Deepseek-OCR ngram suppression
Best for	Managed RAG + citations	Self-hosted bulk PDF parsing


What changed from OCR 3

Block types OCR 4 recognizes

Anatomy of an OCR 4 API response

Parsing bounding boxes in Python

TypeScript (Node SDK)

Benchmarks and the scoring caveat

Human preference (600+ documents)

Public and internal scores

Multilingual breakdown (Crawl Multilingual)

OCR 4 API vs Document AI

Basic extraction (Python)

Document AI with JSON schema

cURL

Base64 and local PDF upload

RAG ingestion pipeline with OCR 4

Semantic chunking by block type

Confidence scores and human-in-the-loop routing

Where it fits in production pipelines

Deployment options

Cost at scale

Managed OCR landscape (June 2026)

Batch API workflow

Document AI landscape: same week, three bets

Mistral OCR 4 vs Baidu Unlimited-OCR (same week)

Getting started

Summary
