doc-parser

claude-office-skills/skills · updated Apr 15, 2026

$ npx skills add https://github.com/claude-office-skills/skills --skill doc-parser

skill.md

Document Parser Skill

Overview

This skill parses documents with docling, an open-source document conversion library developed at IBM Research. It handles complex PDFs, Word documents, and images while preserving document structure, extracting tables and figures, and returning multi-column layouts in reading order.

How to Use

  1. Provide the document to parse
  2. Specify what you want to extract (text, tables, figures, etc.)
  3. I'll parse it and return structured data

Example prompts:

  • "Parse this PDF and extract all tables"
  • "Convert this academic paper to structured markdown"
  • "Extract figures and captions from this document"
  • "Parse this report preserving the document structure"

Domain Knowledge

docling Fundamentals

from docling.document_converter import DocumentConverter

# Initialize converter
converter = DocumentConverter()

# Convert document
result = converter.convert("document.pdf")

# Access parsed content
doc = result.document
print(doc.export_to_markdown())

Supported Formats

Format        Extension     Notes
PDF           .pdf          Native and scanned
Word          .docx         Full structure preserved
PowerPoint    .pptx         Slides as sections
Images        .png, .jpg    OCR + layout analysis
HTML          .html         Structure preserved
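
Before handing a file to the converter, it can help to check its extension against the formats listed above. The following is a minimal sketch: the `SUPPORTED_EXTENSIONS` mapping and the `detect_format` helper are illustrative names, not part of docling, and the mapping only mirrors the formats listed here.

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping that mirrors the "Supported Formats" table;
# docling itself may accept more formats than are listed here.
SUPPORTED_EXTENSIONS = {
    ".pdf": "PDF",
    ".docx": "Word",
    ".pptx": "PowerPoint",
    ".png": "Images",
    ".jpg": "Images",
    ".html": "HTML",
}


def detect_format(path: str) -> Optional[str]:
    """Return the format name for a path, or None if the extension is not listed."""
    suffix = Path(path).suffix.lower()  # case-insensitive match on the extension
    return SUPPORTED_EXTENSIONS.get(suffix)
```

A caller could skip or report files for which `detect_format` returns `None` instead of passing them to `converter.convert`.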

Basic Usage

from docling.document_converter import DocumentConverter

# Create converter
converter = DocumentConverter()

# Convert single document
result = converter.convert("report.pdf")

# Access document
doc = result.document

# Export options
markdown = doc.export_to_markdown()
text = doc.export_to_text()
json_doc = doc.export_to_dict()
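
The exports above are plain Python values, so they can be written to disk with the standard library. This is a small sketch, assuming `doc` is the `result.document` object from the snippet above; the `save_exports` helper and its `stem` parameter are hypothetical names introduced for illustration.

```python
import json
from pathlib import Path


def save_exports(doc, output_dir, stem="document"):
    """Write Markdown and JSON exports of a parsed document to output_dir.

    Only doc.export_to_markdown() and doc.export_to_dict() are used, so any
    object providing those two methods works here.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    md_path = out / f"{stem}.md"
    json_path = out / f"{stem}.json"

    md_path.write_text(doc.export_to_markdown(), encoding="utf-8")
    json_path.write_text(
        json.dumps(doc.export_to_dict(), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return md_path, json_path
```

For example, `save_exports(result.document, "out", stem="report")` would produce `out/report.md` and `out/report.json`.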

Advanced Configuration

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Configure pipeline
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.do_cell_matching = True

# Create converter with options; PDF pipeline options are passed per format
converter = DocumentConverter(
    allowed_formats=[InputFormat.PDF, InputFormat.DOCX],
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
    },
)

result = converter.convert("document.pdf")

Document Structure

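from docling_core.types.doc import TableItem

# Document hierarchy
doc = result.document

# Access metadata
print(doc.name)
print(doc.origin)

# Iterate through content; iterate_items() yields (item, level) pairs
for item, level in doc.iterate_items():
    print(f"Type: {item.label} (level {level})")

    if isinstance(item, TableItem):
        print(f"Rows: {item.data.num_rows}")
    elif hasattr(item, "text"):
        # Text-like items (paragraphs, headings, list items) carry a text field
        print(f"Text: {item.text}")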
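
A quick way to get a feel for a document's structure is to count its items by kind. The sketch below assumes, as in docling 2.x, that `doc.iterate_items()` yields `(item, level)` pairs and that each item exposes a `label`; the `summarize_labels` helper is a hypothetical name, not a docling function.

```python
from collections import Counter


def summarize_labels(items_with_levels):
    """Count document items by label.

    `items_with_levels` is any iterable of (item, level) pairs, for example
    doc.iterate_items(); each item is assumed to expose a `label` attribute.
    """
    counts = Counter()
    for item, _level in items_with_levels:
        # Labels may be enum members or plain strings; normalize to str.
        label = getattr(item.label, "value", item.label)
        counts[str(label)] += 1
    return dict(counts)
```

Calling `summarize_labels(doc.iterate_items())` returns a dictionary mapping each label to the number of items carrying it.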

Extracting Tables

from docling.document_converter import DocumentConverter
import pandas as pd

def extract_tables(doc_path):
    """Extract all tables from document."""
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document
    
    tables = []
    
    # doc.tables lists every table item found in the document
    for table in doc.tables:
        # Convert the table to a pandas DataFrame
        table_data = table.export_to_dataframe()
        tables.append({
            'page': table.prov[0].page_no if table.prov else None,
            'dataframe': table_data
        })
    
    return tables

# Usage
tables = extract_tables("report.pdf")
for i, table in enumerate(tables):
    print(f"Table {i+1} on page {table['page']}:")
    print(table['dataframe'])
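
Once the tables are available as DataFrames, a common next step is to persist them. This is a minimal sketch, assuming the list-of-dicts shape returned by `extract_tables` above; `save_tables_as_csv` and its file-naming scheme are illustrative choices rather than anything prescribed by docling.

```python
from pathlib import Path


def save_tables_as_csv(tables, output_dir):
    """Write each extracted table to its own CSV file and return the paths.

    `tables` is the list returned by extract_tables(); each entry's
    'dataframe' is expected to provide a pandas-style to_csv() method.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    paths = []
    for index, table in enumerate(tables, start=1):
        page = table.get("page")
        # Fall back to "unknown" when no page provenance was recorded.
        page_tag = f"p{page}" if page is not None else "unknown"
        csv_path = out / f"table_{index}_{page_tag}.csv"
        table["dataframe"].to_csv(csv_path, index=False)
        paths.append(csv_path)
    return paths
```

For example, `save_tables_as_csv(extract_tables("report.pdf"), "tables")` would write one CSV file per table into the `tables` directory.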

Extracting Figures

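import os

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

def extract_figures(doc_path, output_dir):
    """Extract figures with captions."""
    # Picture images are only kept when requested in the pipeline options
    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_picture_images = True

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
        }
    )
    result = converter.convert(doc_path)
    doc = result.document

    figures = []
    os.makedirs(output_dir, exist_ok=True)

    # doc.pictures lists every picture item found in the document
    for picture in doc.pictures:
        figure_info = {
            # caption_text() resolves the caption references to plain text
            'caption': picture.caption_text(doc) or None,
            'page': picture.prov[0].page_no if picture.prov else None,
        }

        # Save image if available (get_image returns a PIL image or None)
        image = picture.get_image(doc)
        if image is not None:
            img_path = os.path.join(output_dir, f"figure_{len(figures)+1}.png")
            image.save(img_path, "PNG")
            figure_info['path'] = img_path

        figures.append(figure_info)

    return figures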
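
The figure metadata can be stored next to the saved images so later steps do not need to re-parse the document. The sketch below assumes the list-of-dicts shape returned by `extract_figures` above; `write_figure_manifest` and the `figures.json` file name are hypothetical.

```python
import json
from pathlib import Path


def write_figure_manifest(figures, output_dir, filename="figures.json"):
    """Save figure metadata (caption, page, path) as a JSON manifest."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    manifest = []
    for index, figure in enumerate(figures, start=1):
        caption = figure.get("caption")
        manifest.append({
            "index": index,
            # Coerce to str so non-string caption objects stay serializable.
            "caption": str(caption) if caption is not None else None,
            "page": figure.get("page"),
            "path": figure.get("path"),  # None when no image was saved
        })

    manifest_path = out / filename
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path
```

For example, `write_figure_manifest(extract_figures("paper.pdf", "figs"), "figs")` would write `figs/figures.json` alongside the extracted images.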

Handling Multi-column Layouts

from docling.document_converter import DocumentConverter

def parse_multicolumn(doc_path):
    """Parse document with multi-column layout."""
    
    converter = DocumentConverter()
    result = converter.convert(doc_path)
    doc = result.document
    
    # docling automatically handles column detection
    # Text is returned in reading order
    
    structured_content = []
    
    for element, _level in doc.iterate_items():
        content_ite
