Docling Document Parser
Docling is a document parsing library that converts PDFs, Word documents, PowerPoint presentations, images, and other formats into a structured document model, using layout analysis to recover elements such as tables and reading order.
Quick Start
Basic document conversion:
from docling.document_converter import DocumentConverter
source = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
Core Concepts
DocumentConverter
The main entry point for document conversion. Supports various input formats and conversion options.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Default converter with no format restrictions
converter = DocumentConverter()

# Restrict the accepted input formats
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX]
)

# Customize the PDF pipeline
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True              # enable OCR
pipeline_options.do_table_structure = True  # enable table structure recognition
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
ConversionResult
All conversion operations return a ConversionResult containing:
- document: The parsed DoclingDocument
- status: ConversionStatus.SUCCESS, PARTIAL_SUCCESS, or FAILURE
- errors: List of errors encountered during conversion
- input: Information about the source document
from docling.datamodel.base_models import ConversionStatus

result = converter.convert("document.pdf")
if result.status == ConversionStatus.SUCCESS:
    markdown = result.document.export_to_markdown()
    html = result.document.export_to_html()
    data = result.document.export_to_dict()
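The status check can be wrapped in a small reusable helper. The following is a minimal sketch, not part of Docling: `usable_document` and `allow_partial` are hypothetical names, and the helper compares the status by its enum member name so that it also works against a stand-in object.

```python
from types import SimpleNamespace


def usable_document(result, allow_partial=True):
    """Return result.document when the conversion produced usable output, else None.

    `result` is expected to look like a ConversionResult: it has `.status`
    (an enum member such as ConversionStatus.SUCCESS) and `.document`.
    """
    # Compare by member name so the helper does not need to import docling.
    status_name = getattr(result.status, "name", str(result.status))
    accepted = {"SUCCESS"}
    if allow_partial:
        accepted.add("PARTIAL_SUCCESS")
    return result.document if status_name in accepted else None


# Stand-in object for illustration only; a real call would pass the value
# returned by converter.convert("document.pdf") instead.
fake = SimpleNamespace(status=SimpleNamespace(name="FAILURE"), document=None)
print(usable_document(fake))
```

Whether a PARTIAL_SUCCESS document is good enough depends on the application, which is why the sketch makes that choice a parameter.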
Supported Formats
Input Formats
- Documents: PDF, DOCX, PPTX, XLSX
- Markup: HTML, Markdown, AsciiDoc
- Data: CSV, JSON (Docling format)
- Images: PNG, JPEG, TIFF, BMP, WEBP
- Audio: WAV, MP3
- Video Text: WebVTT
- Schema-specific: USPTO XML, JATS XML, METS-GBS
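When scanning a directory, it can help to discard files that are clearly out of scope before handing them to the converter. The sketch below assumes a hand-written extension table based on the list above; the table, `format_family`, and `split_supported` are illustrative names, not Docling APIs, and Docling performs its own format detection regardless. The schema-specific XML formats are left out because a `.xml` suffix alone does not identify them.

```python
from pathlib import Path

# Hypothetical extension table derived from the input format list; this is
# only a cheap pre-filter, not how Docling itself detects formats.
EXTENSION_FAMILIES = {
    ".pdf": "Documents", ".docx": "Documents",
    ".pptx": "Documents", ".xlsx": "Documents",
    ".html": "Markup", ".htm": "Markup", ".md": "Markup", ".adoc": "Markup",
    ".csv": "Data", ".json": "Data",  # JSON must be in Docling's own format
    ".png": "Images", ".jpg": "Images", ".jpeg": "Images",
    ".tif": "Images", ".tiff": "Images", ".bmp": "Images", ".webp": "Images",
    ".wav": "Audio", ".mp3": "Audio",
    ".vtt": "Video Text",
}


def format_family(path):
    """Return the format family for a path, or None if the suffix is not listed."""
    return EXTENSION_FAMILIES.get(Path(path).suffix.lower())


def split_supported(paths):
    """Partition paths into (supported, unsupported) lists by file suffix."""
    supported = [p for p in paths if format_family(p) is not None]
    unsupported = [p for p in paths if format_family(p) is None]
    return supported, unsupported
```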
Output Formats
- Markdown: export_to_markdown() or save_as_markdown()
- HTML: export_to_html() or save_as_html()
- JSON: export_to_dict() or save_as_json() (note: there is no export_to_json() method)
- Text: export_to_text(), export_to_markdown(strict_text=True), or save_as_markdown(strict_text=True)
- DocTags: export_to_doctags() or save_as_doctags()
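One way to use the save_as_* methods listed above is to pick the method from the output file's suffix. This is a minimal sketch under stated assumptions: `save_by_extension` is a hypothetical helper, the suffix mapping (including `.txt` for plain text) is a choice made here for illustration, and `doc` is any object exposing the save methods named in the list.

```python
from pathlib import Path


def save_by_extension(doc, path):
    """Save a DoclingDocument-like object using the method matching the suffix."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".md":
        doc.save_as_markdown(path)
    elif suffix == ".html":
        doc.save_as_html(path)
    elif suffix == ".json":
        doc.save_as_json(path)
    elif suffix == ".txt":
        # Plain text reuses the Markdown writer with strict_text=True.
        doc.save_as_markdown(path, strict_text=True)
    else:
        raise ValueError(f"No exporter mapped for suffix {suffix!r}")
    return path
```

With a real conversion result this would be called as `save_by_extension(result.document, "output.md")`.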
Common Patterns
Single File Conversion
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("document.pdf")

# Export to in-memory strings or a dict
markdown = result.document.export_to_markdown()
html = result.document.export_to_html()
json_data = result.document.export_to_dict()

# Or write straight to files
result.document.save_as_markdown("output.md")
result.document.save_as_html("output.html")
result.document.save_as_json("output.json")
Batch Processing
See references/batch.md for details on convert_all().
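As a rough illustration of the batch pattern, the sketch below groups the results of convert_all() by status. It assumes that convert_all() accepts a `raises_on_error` flag and yields one ConversionResult per source; `convert_many` is a hypothetical helper name, and references/batch.md remains the authoritative description.

```python
from collections import defaultdict


def convert_many(converter, sources):
    """Convert each source and group the results by status name."""
    by_status = defaultdict(list)
    # raises_on_error=False keeps one bad file from aborting the whole batch.
    for result in converter.convert_all(sources, raises_on_error=False):
        status_name = getattr(result.status, "name", str(result.status))
        by_status[status_name].append(result)
    return dict(by_status)
```

With a real converter this would be called as `convert_many(DocumentConverter(), ["a.pdf", "b.docx"])`, after which the "FAILURE" group can be logged or retried separately.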
URL Conversion
converter = DocumentConverter()
result = converter.convert("https://example.com/document.pdf")
Binary Stream Conversion
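from io import BytesIO

from docling.datamodel.base_models import DocumentStream
from docling.document_converter import DocumentConverter

# Read the file into memory and wrap it in a named stream
with open("document.pdf", "rb") as f:
    buf = BytesIO(f.read())

source = DocumentStream(name="document.pdf", stream=buf)
converter = DocumentConverter()
result = converter.convert(source)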
Format-Specific Options
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options.lang = ["en", "es"]  # language codes depend on the OCR engine
pipeline_options.do_table_structure = True
pipeline_options.generate_page_images = True      # keep rendered page images

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
Resource Limits
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert(
    "large_document.pdf",
    max_file_size=20_971_520,  # bytes (20 MiB)
    max_num_pages=100,
)
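The size limit is given in bytes, so 20_971_520 corresponds to 20 × 1024 × 1024, or 20 MiB. A tiny helper makes such limits easier to read; `mib` and `limits` are hypothetical names used only for this sketch.

```python
def mib(n):
    """Convert a size in mebibytes to bytes."""
    return n * 1024 * 1024


# Keyword arguments matching the two limits used in the example above.
limits = {"max_file_size": mib(20), "max_num_pages": 100}
print(limits["max_file_size"])  # 20971520
```

The dictionary can then be unpacked into the call, as in `converter.convert("large_document.pdf", **limits)`.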
Document Chunking
See references/chunking.md for RAG integration.
DoclingDocument Structure
The DoclingDocument is a Pydantic model representing parsed content:
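doc = result.document

doc.texts            # text items (paragraphs, headings, ...)
doc.tables           # table items
doc.pictures         # picture items
doc.key_value_items  # key-value items
doc.body             # root of the main content tree
doc.furniture        # root for content outside the body (headers, footers)
doc.groups           # container items such as lists

# Walk the document tree; not every item type has a .text attribute
for item, level in doc.iterate_items():
    text = getattr(item, "text", "")
    print(f"{'  ' * level}{item.label}: {text[:50]}")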
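The same iteration can be used to get a quick overview of what a document contains. The sketch below counts items per label; `label_histogram` is a hypothetical helper, and it only assumes that iterate_items() yields (item, level) pairs whose items carry a `label`, as in the loop above.

```python
from collections import Counter


def label_histogram(doc):
    """Count the items yielded by doc.iterate_items() per label."""
    counts = Counter()
    for item, _level in doc.iterate_items():
        label = item.label
        # Labels are enum members in Docling; fall back to str() for stand-ins.
        counts[str(getattr(label, "value", label))] += 1
    return counts
```

Calling `label_histogram(result.document).most_common()` then lists the labels from most to least frequent.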
Advanced Features
OCR Configuration
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
    TesseractOcrOptions,
    TesseractCliOcrOptions,
    OcrMacOptions,
    RapidOcrOptions,
)

# EasyOCR with English and German
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = EasyOcrOptions(lang=["en", "de"])

# Tesseract, which uses three-letter language codes
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.ocr_options = TesseractOcrOptions(lang=["eng", "deu"])

pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr