What is Unlimited-OCR?

Unlimited-OCR is an open-source document parsing model released by Baidu on June 22–23, 2026. It extends the Deepseek-OCR lineage to support unlimited-length documents — entire PDFs, multi-page scans, and dense images — in a single inference pass without chunking or post-processing stitching.

How is Unlimited-OCR different from Deepseek-OCR?

Deepseek-OCR handles single images or short documents. Unlimited-OCR adds multi-page and PDF-native inference (infer_multi), two operating configurations (gundam for high-speed single-image processing and base for long documents), and an SGLang server backend for high-throughput concurrent batch jobs.

What are the two inference configs gundam and base?

Gundam uses base_size=1024, image_size=640, and crop_mode=True — optimised for fast single-image parsing with aggressive cropping. Base uses base_size=1024, image_size=1024, and crop_mode=False — optimised for full fidelity on long documents and multi-page PDFs.

Can Unlimited-OCR process PDFs directly?

Yes. The model ships with a PyMuPDF helper that converts PDF pages to PNG images at 300 DPI and feeds them into infer_multi. The SGLang backend supports the same workflow with concurrent streaming requests.

What hardware does Unlimited-OCR require?

The Transformers path requires an NVIDIA GPU with CUDA 12.9 and runs on bfloat16. The SGLang server has been tested with FA3 attention backend. The repository was tested on Python 3.12.3 with torch 2.10.0.

Where can I find the model weights?

Weights are available on Hugging Face at baidu/Unlimited-OCR and on ModelScope. The GitHub repository at github.com/baidu/Unlimited-OCR contains inference scripts, the SGLang wheel, and documentation.

Baidu Unlimited-OCR: One-Shot Long-Horizon Document Parsing Explained | explainx.ai Blog

explainx.ainewsletter3.5k

workshops ↗

Baidu Unlimited-OCR: One-Shot Long-Horizon Document Parsing Explained | explainx.ai Blog | explainx.ai

Baidu shipped Unlimited-OCR on June 22, 2026 — and it collected 1.8k GitHub stars in under 24 hours. The model tackles one of the most persistent pain points in document AI: parsing entire PDFs and multi-page scans in a single forward pass, without chunking the input or stitching the output back together afterward.

Mistral OCR 4 landed the next day (full guide) with the opposite trade-off: managed API, bounding boxes, block classification, and confidence scores — but not open weights. MinerU 3.4 (~70k stars) offers the full multi-backend production stack — pipeline, hybrid, VLM, and mineru-router for multi-GPU. The June 2026 document-AI landscape: long-horizon vision parsing, managed API extraction, and self-hosted ingestion engines.

The arXiv paper dropped the same day. The model is live on Hugging Face and ModelScope, and the full inference code — including a bundled SGLang wheel — is at github.com/baidu/Unlimited-OCR.

What problem does it solve

Most OCR and document parsing pipelines have a hard limit: they process one page or one fixed-size image at a time, then glue the outputs together. That stitching step is where errors compound. A table that spans two pages gets split. A footnote reference loses its anchor. Layout context that spans multiple sections disappears.

Unlimited-OCR's central claim is long-horizon parsing — treating an entire document as a single sequence and maintaining structural context across pages. The project frames itself as pushing Deepseek-OCR further, building on the ngram-based repetition suppression that made Deepseek-OCR reliable on dense text.

Two inference modes: gundam and base

The model ships with two named configurations:

Config	image_size	crop_mode	Best for
gundam	640	True	Single images, fast throughput
base	1024	False	Multi-page docs, PDFs, full fidelity

Gundam trades resolution for speed by cropping aggressively. Base preserves full image size for documents where layout and density matter — scientific papers, financial reports, legal filings.

Running it with Transformers

The simplest path uses Hugging Face Transformers with bfloat16 on a CUDA GPU:

python

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('baidu/Unlimited-OCR', trust_remote_code=True)
model = AutoModel.from_pretrained(
    'baidu/Unlimited-OCR',
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
).eval().cuda()

# Single image — gundam config
model.infer(
    tokenizer,
    prompt='<image>document parsing.',
    image_file='your_image.jpg',
    output_path='./output',
    base_size=1024, crop_mode=True,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=128,
    save_results=True,
)

# Multi-page PDF
model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=['page1.png', 'page2.png', 'page3.png'],
    output_path='./output',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=1024,
    save_results=True,
)

no_repeat_ngram_size=35 and ngram_window are the repetition-suppression parameters inherited from the Deepseek-OCR lineage — they are what stops the model from looping on dense repeated patterns in tables and forms.

PDF-native workflow

The repo ships a PyMuPDF helper that converts PDF pages to PNG at 300 DPI before feeding them to infer_multi:

python

import tempfile, fitz

def pdf_to_images(pdf_path, dpi=300):
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix='pdf_ocr_')
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    paths = []
    for i, page in enumerate(doc):
        out = os.path.join(tmp_dir, f'page_{i+1:04d}.png')
        page.get_pixmap(matrix=mat).save(out)
        paths.append(out)
    doc.close()
    return paths

model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=pdf_to_images('your_doc.pdf', dpi=300),
    output_path='./output',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=1024,
    save_results=True,
)

High-throughput with SGLang

For production workloads, the repository bundles an SGLang wheel that runs an OpenAI-compatible API server with streaming support and concurrent request handling:

bash

# Start the server
python -m sglang.launch_server \
    --model baidu/Unlimited-OCR \
    --served-model-name Unlimited-OCR \
    --attention-backend fa3 \
    --context-length 32768 \
    --enable-custom-logit-processor \
    --host 0.0.0.0 \
    --port 10000

Clients send streaming requests to http://localhost:10000/v1/chat/completions using standard multimodal message format. The server accepts images_config.image_mode (gundam or base) and custom_params for ngram_size and window_size.

For batch jobs, infer.py starts the SGLang server automatically and dispatches concurrent requests:

bash

# Image directory
python infer.py \
    --image_dir ./examples/images \
    --output_dir ./outputs \
    --concurrency 8 \
    --image_mode gundam

# PDF
python infer.py \
    --pdf ./examples/document.pdf \
    --output_dir ./outputs \
    --concurrency 8 \
    --image_mode gundam

The --concurrency flag controls how many pages are processed in parallel — useful for large PDF batches.

What makes the ngram suppression significant

One of the recurring failures of long-context OCR models is repetition: the model starts looping on a header, a table row, or a footer as it loses track of what it has already generated. Deepseek-OCR introduced no_repeat_ngram_size as a hard constraint at the logit level. Unlimited-OCR inherits this and extends ngram_window — so the constraint is applied across a sliding window rather than the full context, which becomes important when documents are hundreds of pages long and exact repetition from chapter to chapter is legitimate.

Who is it for

Legal and compliance teams parsing dense contracts, regulatory filings, and multi-page agreements where a missed clause is a liability.

Finance and accounting extracting structured data from annual reports, balance sheets, and multi-table PDFs.

Research and academia digitising scanned papers, dissertations, and archival documents where standard OCR breaks on equations, footnotes, and mixed-column layouts.

Developers building document pipelines who need a reliable open-weight model they can self-host without per-page API costs.

How it compares to the alternatives

Model	Multi-page support	Open weights	PDF native	Context length	Bboxes / confidence
Unlimited-OCR	✅ infer_multi	✅ MIT	✅ via PyMuPDF	32,768	❌
Mistral OCR 4	API per doc	Enterprise self-host	✅ native	API	✅
Deepseek-OCR	Single image	✅	❌	Shorter	❌
Deepseek-OCR-2	Limited	✅	❌	Longer	❌
GPT-4o Vision	Page-by-page API	❌	Via preprocessing	API limit	Partial

The MIT licence and self-hostable weights make it a credible alternative for teams that cannot send documents to third-party APIs for compliance or cost reasons.

Getting started

bash

# Install dependencies
pip install torch==2.10.0 torchvision==0.25.0 transformers==4.57.1
pip install Pillow matplotlib einops addict easydict pymupdf psutil

# Pull model and run
python -c "
from transformers import AutoModel, AutoTokenizer
import torch
tokenizer = AutoTokenizer.from_pretrained('baidu/Unlimited-OCR', trust_remote_code=True)
model = AutoModel.from_pretrained('baidu/Unlimited-OCR', trust_remote_code=True, torch_dtype=torch.bfloat16).eval().cuda()
model.infer(tokenizer, prompt='<image>document parsing.', image_file='test.jpg', output_path='./out', base_size=1024, crop_mode=True, max_length=32768, no_repeat_ngram_size=35, ngram_window=128, save_results=True)
"

The paper is on arXiv at 2606.23050. The model is at baidu/Unlimited-OCR on Hugging Face. The code — including the SGLang wheel and infer.py batch runner — is at github.com/baidu/Unlimited-OCR.

For teams already using Deepseek-OCR or page-by-page vision APIs, Unlimited-OCR is worth a close look this week. If you need bounding boxes, typed blocks, and a managed Document AI layer instead, read Mistral OCR 4: bounding boxes and API guide. For visual retrieval without text extraction, see PixelRAG.

Mistral OCR 4: bounding boxes, Document AI, and API — managed structured extraction (released June 23)
PixelRAG: visual RAG from screenshots — skip text parsing entirely
RAG vs agentic RAG — chunking strategies for parsed documents
What are embeddings and vector search? — indexing extracted text
Closed source vs open source AI alternatives — when to self-host vs use APIs

Baidu's Unlimited-OCR: One-Shot Long-Horizon Document Parsing Is Here

Related posts

MinerU 3.4: PDF and Office Parsing for LLM, RAG, and Agent Workflows

video-use: Edit Videos With Claude Code — No Premiere Pro Needed

World Monitor: The Open-Source Real-Time Global Intelligence Dashboard [2026]

What problem does it solve

Two inference modes: gundam and base

Running it with Transformers

PDF-native workflow

High-throughput with SGLang

What makes the ngram suppression significant

Who is it for

How it compares to the alternatives

Getting started

Related posts

MinerU 3.4: PDF and Office Parsing for LLM, RAG, and Agent Workflows

video-use: Edit Videos With Claude Code — No Premiere Pro Needed

World Monitor: The Open-Source Real-Time Global Intelligence Dashboard [2026]

What problem does it solve

Two inference modes: gundam and base

Running it with Transformers

PDF-native workflow

High-throughput with SGLang

What makes the ngram suppression significant

Who is it for

How it compares to the alternatives

Getting started

Related explainx.ai guides