doc-parser
claude-office-skills/skills · updated Apr 15, 2026
Document Parser Skill
Overview
This skill parses documents with docling, IBM's document understanding library. It handles complex PDFs, Word documents, and images while preserving document structure, extracting tables and figures, and reading multi-column layouts in order.
How to Use
- Provide the document to parse
- Specify what you want to extract (text, tables, figures, etc.)
- I'll parse it and return structured data
Example prompts:
- "Parse this PDF and extract all tables"
- "Convert this academic paper to structured markdown"
- "Extract figures and captions from this document"
- "Parse this report preserving the document structure"
Domain Knowledge
docling Fundamentals
from docling.document_converter import DocumentConverter
# Initialize converter
converter = DocumentConverter()
# Convert document
result = converter.convert("document.pdf")
# Access parsed content
doc = result.document
print(doc.export_to_markdown())
Supported Formats
| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Native and scanned |
| Word | .docx | Full structure preserved |
| PowerPoint | .pptx | Slides as sections |
| Images | .png, .jpg | OCR + layout analysis |
| HTML | .html | Structure preserved |
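The table above can double as a quick pre-flight check before a file is handed to the converter. The sketch below is an illustration only: `SUPPORTED_EXTENSIONS`, `describe_format`, and `split_supported` are hypothetical helpers that mirror the table, not part of docling, and docling itself may accept more formats than are listed here.

```python
from pathlib import Path

# Hypothetical lookup that mirrors the table above; not part of docling.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".pptx": "PowerPoint",
    ".png": "Images",
    ".jpg": "Images",
    ".html": "HTML",
}


def describe_format(path):
    """Return the format name for a path, or None if the extension is not listed."""
    return SUPPORTED_EXTENSIONS.get(Path(path).suffix.lower())


def split_supported(paths):
    """Partition paths into (supported, unsupported) by file extension."""
    supported, unsupported = [], []
    for path in paths:
        (supported if describe_format(path) else unsupported).append(path)
    return supported, unsupported
```

Checking the extension first lets a batch job report unsupported files up front instead of failing partway through.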
Basic Usage
from docling.document_converter import DocumentConverter
# Create converter
converter = DocumentConverter()
# Convert single document
result = converter.convert("report.pdf")
# Access document
doc = result.document
# Export options
markdown = doc.export_to_markdown()
text = doc.export_to_text()
json_doc = doc.export_to_dict()
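Once a document is parsed, the exports usually need to land on disk. The following is a minimal sketch, assuming `doc` offers the `export_to_markdown()` and `export_to_dict()` methods used above and that the dictionary export is JSON-serializable; `save_exports` and its `stem` parameter are hypothetical names introduced for illustration.

```python
import json
from pathlib import Path


def save_exports(doc, output_dir, stem="document"):
    """Write the Markdown and JSON exports of a parsed document to disk.

    `doc` is expected to offer export_to_markdown() and export_to_dict(),
    as the document object in the snippet above does.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    # Markdown export, written as UTF-8 text
    md_path = out / f"{stem}.md"
    md_path.write_text(doc.export_to_markdown(), encoding="utf-8")

    # Dictionary export, serialized as pretty-printed JSON
    json_path = out / f"{stem}.json"
    json_path.write_text(
        json.dumps(doc.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return md_path, json_path
```

A call such as `save_exports(result.document, "out", stem="report")` would produce `out/report.md` and `out/report.json`.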
Advanced Configuration
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
# Configure pipeline
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True
# Create converter with options; PDF pipeline options are passed per format
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX],
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    },
)
result = converter.convert("document.pdf")
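A converter that has been configured once can be reused across many files. The sketch below is one possible way to do that, not docling's own batch API: `convert_many` is a hypothetical helper that relies only on the `converter.convert()` call shown above and assumes a failed conversion raises an exception.

```python
def convert_many(converter, paths):
    """Convert each path with one converter, collecting documents and errors."""
    results, errors = {}, {}
    for path in paths:
        try:
            results[path] = converter.convert(path).document
        except Exception as exc:  # keep going and report the failure afterwards
            errors[path] = str(exc)
    return results, errors
```

Returning the errors alongside the results keeps one unreadable file from aborting the whole batch.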
Document Structure
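from docling_core.types.doc import TableItem, TextItem
# Document hierarchy
doc = result.document
# Access metadata
print(doc.name)
print(doc.origin)
# Iterate through content; iterate_items() yields (item, level) pairs
for item, level in doc.iterate_items():
    print(f"Type: {item.label}")
    # Only text-bearing items carry a .text attribute
    if isinstance(item, TextItem):
        print(f"Text: {item.text}")
    if isinstance(item, TableItem):
        print(f"Rows: {item.data.num_rows}")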
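Because `iterate_items()` yields each item together with its nesting level, the same walk can produce a compact summary of a document. The sketch below assumes only that shape; `summarize_items` is a hypothetical helper, and it reads `label` and `text` defensively since not every item type carries text.

```python
from collections import Counter


def summarize_items(items):
    """Count element labels and build an indented outline.

    `items` is any iterable of (item, level) pairs, the shape that
    doc.iterate_items() yields.
    """
    counts = Counter()
    outline = []
    for item, level in items:
        label = str(getattr(item, "label", "unknown"))
        counts[label] += 1
        text = getattr(item, "text", "")
        if text:
            # Indent by nesting level and truncate long passages
            outline.append(f"{'  ' * level}[{label}] {text[:60]}")
    return counts, outline
```

Calling `summarize_items(doc.iterate_items())` gives a quick view of how many tables, pictures, and text blocks were detected before any heavier extraction is run.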
Extracting Tables
from docling.document_converter import DocumentConverter
from docling_core.types.doc import TableItem

def extract_tables(doc_path):
    """Extract all tables from document."""
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document
    tables = []
    # iterate_items() yields (item, level) pairs
    for item, _level in doc.iterate_items():
        if isinstance(item, TableItem):
            # Get table data as a pandas DataFrame
            table_data = item.export_to_dataframe()
            tables.append({
                'page': item.prov[0].page_no if item.prov else None,
                'dataframe': table_data
            })
    return tables

# Usage
tables = extract_tables("report.pdf")
for i, table in enumerate(tables):
    print(f"Table {i+1} on page {table['page']}:")
    print(table['dataframe'])
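The list returned by `extract_tables()` can be written out as one CSV file per table. This is a minimal sketch under the assumption that each entry has the `page` and `dataframe` keys built above and that the dataframe offers pandas' `to_csv()`; `save_tables_as_csv` and the file naming scheme are hypothetical.

```python
from pathlib import Path


def save_tables_as_csv(tables, output_dir):
    """Write each extracted table to its own CSV file and return the paths.

    `tables` is the list built by extract_tables(): dicts with a 'page'
    number (or None) and a 'dataframe' that offers to_csv().
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, table in enumerate(tables, start=1):
        # Fall back to a placeholder when no page provenance was recorded
        page = table["page"] if table["page"] is not None else "unknown"
        path = out / f"table_{index}_page_{page}.csv"
        table["dataframe"].to_csv(path, index=False)
        paths.append(path)
    return paths
```

Including the page number in the file name makes it easier to trace a CSV back to its place in the source document.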
Extracting Figures
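import os
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import PictureItem

def extract_figures(doc_path, output_dir):
    """Extract figures with captions."""
    # Picture images are only retained when the PDF pipeline generates them
    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_picture_images = True
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    result = converter.convert(doc_path)
    doc = result.document
    figures = []
    os.makedirs(output_dir, exist_ok=True)
    # iterate_items() yields (item, level) pairs
    for item, _level in doc.iterate_items():
        if isinstance(item, PictureItem):
            figure_info = {
                'caption': item.caption_text(doc) or None,
                'page': item.prov[0].page_no if item.prov else None,
            }
            # Save image if available
            image = item.get_image(doc)
            if image is not None:
                img_path = os.path.join(output_dir, f"figure_{len(figures)+1}.png")
                image.save(img_path)
                figure_info['path'] = img_path
            figures.append(figure_info)
    return figures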
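The figure list returned above is easier to reuse when it is saved next to the images. The sketch below writes it as a JSON manifest; `write_figure_manifest` is a hypothetical helper, and it assumes each entry is a dict with optional `caption`, `page`, and `path` keys whose values are plain strings, numbers, or None.

```python
import json
from pathlib import Path


def write_figure_manifest(figures, manifest_path):
    """Save figure metadata as JSON and return the records that have captions."""
    records = [
        {
            "index": index,
            "page": figure.get("page"),
            "caption": figure.get("caption"),
            "path": figure.get("path"),
        }
        for index, figure in enumerate(figures, start=1)
    ]
    Path(manifest_path).write_text(json.dumps(records, indent=2), encoding="utf-8")
    # Captioned figures are usually the ones worth reviewing first
    return [record for record in records if record["caption"]]
```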
Handling Multi-column Layouts
from docling.document_converter import DocumentConverter

def parse_multicolumn(doc_path):
    """Parse document with multi-column layout."""
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document
    # docling automatically handles column detection
    # Text is returned in reading order
    structured_content = []
    # iterate_items() yields (item, level) pairs
    for element, _level in doc.iterate_items():
        content_ite