← Back to blog

explainx / blog

Baidu's Unlimited-OCR: One-Shot Long-Horizon Document Parsing Is Here

Baidu has open-sourced Unlimited-OCR, a model that pushes past Deepseek-OCR to handle unlimited-length documents, PDFs, and multi-page images in a single inference pass. Here is what it does, how to run it, and why it matters.

·5 min read·Yash Thakker
OCRBaiduOpen SourceDocument AIVision ModelsAI Tools
Baidu's Unlimited-OCR: One-Shot Long-Horizon Document Parsing Is Here

Baidu shipped Unlimited-OCR on June 22, 2026 — and it collected 1.8k GitHub stars in under 24 hours. The model tackles one of the most persistent pain points in document AI: parsing entire PDFs and multi-page scans in a single forward pass, without chunking the input or stitching the output back together afterward.

Mistral OCR 4 landed the next day (full guide) with the opposite trade-off: managed API, bounding boxes, block classification, and confidence scores — but not open weights. The two releases frame the June 2026 document-AI split: self-hosted long-horizon parsing vs structured managed extraction.

The arXiv paper dropped the same day. The model is live on Hugging Face and ModelScope, and the full inference code — including a bundled SGLang wheel — is at github.com/baidu/Unlimited-OCR.

What problem does it solve

Most OCR and document parsing pipelines have a hard limit: they process one page or one fixed-size image at a time, then glue the outputs together. That stitching step is where errors compound. A table that spans two pages gets split. A footnote reference loses its anchor. Layout context that spans multiple sections disappears.

Unlimited-OCR's central claim is long-horizon parsing — treating an entire document as a single sequence and maintaining structural context across pages. The project frames itself as pushing Deepseek-OCR further, building on the ngram-based repetition suppression that made Deepseek-OCR reliable on dense text.

Two inference modes: gundam and base

The model ships with two named configurations:

Configimage_sizecrop_modeBest for
gundam640TrueSingle images, fast throughput
base1024FalseMulti-page docs, PDFs, full fidelity

Gundam trades resolution for speed by cropping aggressively. Base preserves full image size for documents where layout and density matter — scientific papers, financial reports, legal filings.

Live WorkshopAug 1–2, 2026 · 2 days

Claude for Work

Use Claude as a thought partner for writing, research & decisions — no coding required. 2 live sessions with Yash Thakker.

Register now

Claude for Work is a 2-day live workshop on using Claude to supercharge your daily work — writing, research, analysis, and decision-making — without any coding required. Learn how to set up Claude Projects with custom instructions, run deep-research sprints, co-write documents that sound like you, and build repeatable prompt systems for your team. August 1–2, 2026. Hosted by Yash Thakker, founder of AISOLO Technologies, instructor to 350,000+ students.

Includes 1-year access to all session recordings, a personal prompt library, Discord community access, and a certificate of completion. No coding or technical background required. Designed for managers, marketers, founders, and writers.

Running it with Transformers

The simplest path uses Hugging Face Transformers with bfloat16 on a CUDA GPU:

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('baidu/Unlimited-OCR', trust_remote_code=True)
model = AutoModel.from_pretrained(
    'baidu/Unlimited-OCR',
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()

# Single image — gundam config
model.infer(
    tokenizer,
    prompt='<image>document parsing.',
    image_file='your_image.jpg',
    output_path='./output',
    base_size=1024, crop_mode=True,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=128,
    save_results=True,
)

# Multi-page PDF
model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=['page1.png', 'page2.png', 'page3.png'],
    output_path='./output',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=1024,
    save_results=True,
)

no_repeat_ngram_size=35 and ngram_window are the repetition-suppression parameters inherited from the Deepseek-OCR lineage — they are what stops the model from looping on dense repeated patterns in tables and forms.

PDF-native workflow

The repo ships a PyMuPDF helper that converts PDF pages to PNG at 300 DPI before feeding them to infer_multi:

import tempfile, fitz

def pdf_to_images(pdf_path, dpi=300):
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix='pdf_ocr_')
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    paths = []
    for i, page in enumerate(doc):
        out = os.path.join(tmp_dir, f'page_{i+1:04d}.png')
        page.get_pixmap(matrix=mat).save(out)
        paths.append(out)
    doc.close()
    return paths

model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=pdf_to_images('your_doc.pdf', dpi=300),
    output_path='./output',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=1024,
    save_results=True,
)

High-throughput with SGLang

For production workloads, the repository bundles an SGLang wheel that runs an OpenAI-compatible API server with streaming support and concurrent request handling:

# Start the server
python -m sglang.launch_server \
    --model baidu/Unlimited-OCR \
    --served-model-name Unlimited-OCR \
    --attention-backend fa3 \
    --context-length 32768 \
    --enable-custom-logit-processor \
    --host 0.0.0.0 \
    --port 10000

Clients send streaming requests to http://localhost:10000/v1/chat/completions using standard multimodal message format. The server accepts images_config.image_mode (gundam or base) and custom_params for ngram_size and window_size.

For batch jobs, infer.py starts the SGLang server automatically and dispatches concurrent requests:

# Image directory
python infer.py \
    --image_dir ./examples/images \
    --output_dir ./outputs \
    --concurrency 8 \
    --image_mode gundam

# PDF
python infer.py \
    --pdf ./examples/document.pdf \
    --output_dir ./outputs \
    --concurrency 8 \
    --image_mode gundam

The --concurrency flag controls how many pages are processed in parallel — useful for large PDF batches.

What makes the ngram suppression significant

One of the recurring failures of long-context OCR models is repetition: the model starts looping on a header, a table row, or a footer as it loses track of what it has already generated. Deepseek-OCR introduced no_repeat_ngram_size as a hard constraint at the logit level. Unlimited-OCR inherits this and extends ngram_window — so the constraint is applied across a sliding window rather than the full context, which becomes important when documents are hundreds of pages long and exact repetition from chapter to chapter is legitimate.

Who is it for

Legal and compliance teams parsing dense contracts, regulatory filings, and multi-page agreements where a missed clause is a liability.

Finance and accounting extracting structured data from annual reports, balance sheets, and multi-table PDFs.

Research and academia digitising scanned papers, dissertations, and archival documents where standard OCR breaks on equations, footnotes, and mixed-column layouts.

Developers building document pipelines who need a reliable open-weight model they can self-host without per-page API costs.

How it compares to the alternatives

ModelMulti-page supportOpen weightsPDF nativeContext lengthBboxes / confidence
Unlimited-OCR✅ infer_multi✅ MIT✅ via PyMuPDF32,768
Mistral OCR 4API per docEnterprise self-host✅ nativeAPI
Deepseek-OCRSingle imageShorter
Deepseek-OCR-2LimitedLonger
GPT-4o VisionPage-by-page APIVia preprocessingAPI limitPartial

The MIT licence and self-hostable weights make it a credible alternative for teams that cannot send documents to third-party APIs for compliance or cost reasons.

Getting started

# Install dependencies
pip install torch==2.10.0 torchvision==0.25.0 transformers==4.57.1
pip install Pillow matplotlib einops addict easydict pymupdf psutil

# Pull model and run
python -c "
from transformers import AutoModel, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained('baidu/Unlimited-OCR', trust_remote_code=True)
model = AutoModel.from_pretrained('baidu/Unlimited-OCR', trust_remote_code=True, torch_dtype=torch.bfloat16).eval().cuda()
model.infer(tokenizer, prompt='<image>document parsing.', image_file='test.jpg', output_path='./out', base_size=1024, crop_mode=True, max_length=32768, no_repeat_ngram_size=35, ngram_window=128, save_results=True)
"

The paper is on arXiv at 2606.23050. The model is at baidu/Unlimited-OCR on Hugging Face. The code — including the SGLang wheel and infer.py batch runner — is at github.com/baidu/Unlimited-OCR.

For teams already using Deepseek-OCR or page-by-page vision APIs, Unlimited-OCR is worth a close look this week. If you need bounding boxes, typed blocks, and a managed Document AI layer instead, read Mistral OCR 4: bounding boxes and API guide. For visual retrieval without text extraction, see PixelRAG.


Related ExplainX guides

Related posts