DeepSeek-OCR
Skill by ara.so - Daily 2026 Skills collection.
DeepSeek-OCR is a vision-language model for optical character recognition (OCR) built around "Contexts Optical Compression." It supports native and dynamic resolutions and several prompt modes (document-to-markdown, free OCR, figure parsing, grounding), and it runs on either vLLM (high throughput) or HuggingFace Transformers. It accepts images and PDFs and outputs structured text or markdown.
Installation
Prerequisites
- CUDA 11.8+, PyTorch 2.6.0
- Python 3.12.9 (via conda recommended)
Setup
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
cd DeepSeek-OCR
conda create -n deepseek-ocr python=3.12.9 -y
conda activate deepseek-ocr
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 \
    --index-url https://download.pytorch.org/whl/cu118
# The vllm-0.8.5 wheel file must already be downloaded into the current directory.
pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
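After the install steps, it can help to confirm that the pinned packages actually resolved to the expected versions. The sketch below is not part of the DeepSeek-OCR repository: it uses only the standard library's `importlib.metadata`, and the `EXPECTED` table and the `check_pins` helper are hypothetical names built from the version pins listed above.

```python
from importlib import metadata

# Version pins copied from the setup commands above (illustrative table).
EXPECTED = {
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "torchaudio": "2.6.0",
    "vllm": "0.8.5",
    "flash-attn": "2.7.3",
}


def check_pins(expected=EXPECTED):
    """Return {package: (installed_or_None, expected, ok)} for each pin."""
    report = {}
    for name, want in expected.items():
        try:
            have = metadata.version(name)
        except metadata.PackageNotFoundError:
            have = None  # package is not installed in this environment
        # Local build tags such as "+cu118" are ignored for the comparison.
        ok = have is not None and have.split("+")[0] == want
        report[name] = (have, want, ok)
    return report


if __name__ == "__main__":
    for name, (have, want, ok) in check_pins().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name:12s} installed={have} expected={want} {status}")
```

Running the script inside the `deepseek-ocr` conda environment prints one line per package; any `MISMATCH` line points at a pin that did not install as written.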
Alternative: upstream vLLM (nightly)
uv venv
source .venv/bin/activate
uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
Model Download
The model is available on HuggingFace as deepseek-ai/DeepSeek-OCR:
from huggingface_hub import snapshot_download
snapshot_download(repo_id="deepseek-ai/DeepSeek-OCR")
Inference: vLLM (Recommended for Production)
Single Image - Streaming
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image
# Prefix caching and the multimodal processor cache are both disabled here.
llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
)
image = Image.open("document.png").convert("RGB")
prompt = "<image>\nFree OCR."
sampling_params = SamplingParams(
    temperature=0.0,  # greedy decoding
    max_tokens=8192,
    # Options passed through to the n-gram logits processor
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},
    ),
    skip_special_tokens=False,
)
outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling_params,
)
print(outputs[0].outputs[0].text)
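The `ngram_size`, `window_size`, and `whitelist_token_ids` arguments configure repetition suppression during decoding. The following is a simplified pure-Python sketch of the general no-repeat n-gram idea, assuming the processor bans any token that would complete an n-gram already present in the last `window_size` tokens unless that token is whitelisted. It is an illustration of the technique, not vLLM's implementation, and `banned_next_tokens` is a hypothetical name.

```python
def banned_next_tokens(token_ids, ngram_size, window_size, whitelist=frozenset()):
    """Return the tokens that would repeat an n-gram seen in the recent window."""
    if ngram_size < 2:
        raise ValueError("ngram_size must be at least 2")
    if len(token_ids) < ngram_size:
        return set()

    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(token_ids[-(ngram_size - 1):])

    # Only n-grams that start inside the recent window are considered.
    start = max(0, len(token_ids) - window_size)
    end = len(token_ids) - ngram_size + 1

    banned = set()
    for i in range(start, end):
        ngram = tuple(token_ids[i:i + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])  # emitting this token would repeat the n-gram

    # Whitelisted tokens are always allowed to repeat.
    return banned - set(whitelist)


if __name__ == "__main__":
    history = [1, 2, 3, 9, 1, 2]
    print(banned_next_tokens(history, ngram_size=3, window_size=90))
```

In this toy history the sequence `1, 2` was previously followed by `3`, so `3` is banned as the next token; shrinking the window to 3 tokens or whitelisting `3` lifts the ban.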
Batch Images
from vllm import LLM, SamplingParams
from vllm.model_executor.models.deepseek_ocr import NGramPerReqLogitsProcessor
from PIL import Image
llm = LLM(
    model="deepseek-ai/DeepSeek-OCR",
    enable_prefix_caching=False,
    mm_processor_cache_gb=0,
    logits_processors=[NGramPerReqLogitsProcessor],
)
image_paths = ["page1.png", "page2.png", "page3.png"]
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
# One request per page; every request shares the same prompt.
model_input = [
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open(p).convert("RGB")},
    }
    for p in image_paths
]
sampling_params = SamplingParams(
    temperature=0.0,  # greedy decoding
    max_tokens=8192,
    # Options passed through to the n-gram logits processor
    extra_args=dict(
        ngram_size=30,
        window_size=90,
        whitelist_token_ids={128821, 128822},
    ),
    skip_special_tokens=False,
)
outputs = llm.generate(model_input, sampling_params)
for path, output in zip(image_paths, outputs):
    print(f"=== {path} ===")
    print(output.outputs[0].text)
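Because the batch example uses a `<|grounding|>` prompt and keeps special tokens (`skip_special_tokens=False`), the returned text may carry location tags alongside the markdown. The sketch below assumes the tags follow the layout `<|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2]]<|/det|>`; that layout, the helper names `parse_grounding` and `strip_grounding`, and the sample string are assumptions for illustration, so check them against real output before relying on them. The coordinate units are not specified here.

```python
import ast
import re

# Assumed tag layout: <|ref|>label<|/ref|><|det|>[[x1, y1, x2, y2], ...]<|/det|>
GROUNDING_RE = re.compile(
    r"<\|ref\|>(.*?)<\|/ref\|><\|det\|>(.*?)<\|/det\|>", re.DOTALL
)


def parse_grounding(text):
    """Return a list of (label, boxes) pairs found in the model output."""
    results = []
    for label, raw_boxes in GROUNDING_RE.findall(text):
        try:
            boxes = ast.literal_eval(raw_boxes)  # safe parse of the list literal
        except (ValueError, SyntaxError):
            continue  # skip malformed coordinate lists
        results.append((label, boxes))
    return results


def strip_grounding(text):
    """Remove the grounding tags and keep only the surrounding text."""
    return GROUNDING_RE.sub("", text).strip()


if __name__ == "__main__":
    # Hypothetical sample output, not produced by a real model run.
    sample = "<|ref|>title<|/ref|><|det|>[[10, 20, 300, 60]]<|/det|>\n# Report\n"
    print(parse_grounding(sample))
    print(strip_grounding(sample))
```

Keeping the parsing separate from the stripping makes it possible to store the boxes for layout analysis while still writing clean markdown to disk.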
PDF Processing (via vLLM scripts)
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
python run_dpsk_ocr_pdf.py
Benchmark Evaluation
cd DeepSeek-OCR-master/DeepSeek-OCR-vllm
python run_dpsk_ocr_eval_batch.py
Inference: HuggingFace Transformers
import os
import torch
from transformers import AutoModel, AutoTokenizer
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    _attn_implementation="flash_attention_2",
    trust_remote_code=True,
    use_safetensors=True,
)
# Switch to eval mode, move to the GPU, and cast the weights to bfloat16.
model = model.eval().cuda().to(torch.bfloat16)
res = model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown. ",
    image_file="document.jpg",
    output_path="./output/",
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
    test_compress=True,
)
print(res)
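The `base_size`, `image_size`, and `crop_mode` arguments together select a resolution mode. The sketch below collects the mode settings as they are commonly listed for DeepSeek-OCR (Tiny, Small, Base, Large, and the dynamic "Gundam" mode used in the call above); the values are recalled from the upstream README and should be verified against your checkout. `RESOLUTION_PRESETS` and `infer_kwargs` are hypothetical helper names.

```python
# Resolution presets; values should be verified against the upstream README.
RESOLUTION_PRESETS = {
    "tiny":   {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "small":  {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "base":   {"base_size": 1024, "image_size": 1024, "crop_mode": False},
    "large":  {"base_size": 1280, "image_size": 1280, "crop_mode": False},
    # Dynamic resolution: cropped 640-pixel tiles plus one 1024-pixel global view.
    "gundam": {"base_size": 1024, "image_size": 640,  "crop_mode": True},
}


def infer_kwargs(mode="gundam", **overrides):
    """Build keyword arguments for model.infer() from a named preset."""
    if mode not in RESOLUTION_PRESETS:
        raise ValueError(f"unknown mode {mode!r}; choose from {sorted(RESOLUTION_PRESETS)}")
    kwargs = dict(RESOLUTION_PRESETS[mode])  # copy so the preset stays untouched
    kwargs.update(overrides)
    return kwargs


if __name__ == "__main__":
    print(infer_kwargs("base", save_results=True))
```

With this helper, a call such as `model.infer(tokenizer, prompt=..., image_file=..., output_path=..., **infer_kwargs("base"))` switches modes by name instead of editing three arguments by hand.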
Transformers Script
cd DeepSeek-OCR-master/DeepSeek-OCR-hf
python run_dpsk_ocr.py
Prompt Reference
| Use Case | Prompt |
| --- | --- |
| Document to Markdown | `<image>\n<\|grounding\|>Convert the document to markdown.` |
| General OCR | `<image>\n<\|grounding\|>OCR this image.` |
| Free OCR (no layout) | `<image>\nFree OCR.` |
| Parse figure/chart | `<image>\nParse the figure.` |
| General description | `<image>\nDescribe this image in detail.` |
| Grounded REC | `<image>\nLocate <\|ref\|>TARGET_TEXT<\|/ref\|> in the image.` |
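The grounded REC prompt is the only entry with a placeholder (`TARGET_TEXT`), so it is convenient to build it programmatically. This is a minimal sketch; `REC_TEMPLATE` and `build_rec_prompt` are hypothetical names, and the template string is taken from the prompt reference above.

```python
# Template taken from the "Grounded REC" prompt, with TARGET_TEXT made a field.
REC_TEMPLATE = "<image>\nLocate <|ref|>{target}<|/ref|> in the image."


def build_rec_prompt(target):
    """Return a grounded REC prompt that asks the model to locate `target`."""
    target = target.strip()
    if not target:
        raise ValueError("target text must not be empty")
    return REC_TEMPLATE.format(target=target)


if __name__ == "__main__":
    print(build_rec_prompt("Invoice Number"))
```

The resulting string can be passed as the `prompt` in either the vLLM or the Transformers examples.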
PROMPTS = {
"document_markdown": "<image>\n<|grounding|>Convert the document to markdown. ",
"ocr_image": "<image>\n<|grounding|>OCR this image. ",
"free_ocr": "<image>\nFree OCR. ",
"parse_figure": "<image>\nParse the figure. ",
"describe": "<