pdf-extraction

claude-office-skills/skills · updated Apr 20, 2026

$ npx skills add https://github.com/claude-office-skills/skills --skill pdf-extraction
summary

Extract text, tables, and metadata from PDF documents with character-level precision.

  • Supports text extraction with layout preservation, word-level positioning, and character-level access including font and size metadata
  • Includes advanced table detection with customizable strategies (lines, text, explicit) and tolerance tuning for complex layouts
  • Provides visual debugging via image rendering with overlays for characters, words, lines, and detected table boundaries
  • Handles cropping
skill.md

PDF Extraction Skill

Overview

This skill extracts text, tables, and metadata from PDF documents using pdfplumber, a widely used Python library for PDF data extraction. Unlike basic PDF readers, pdfplumber exposes character-level positioning, configurable table detection, and visual debugging.

How to Use

  1. Provide the PDF file you want to extract from
  2. Specify what you need: text, tables, images, or metadata
  3. I'll generate pdfplumber code and execute it

Example prompts:

  • "Extract all tables from this financial report"
  • "Get text from pages 5-10 of this document"
  • "Find and extract the invoice total from this PDF"
  • "Convert this PDF table to CSV/Excel"

Domain Knowledge

pdfplumber Fundamentals

import pdfplumber

# Open PDF
with pdfplumber.open('document.pdf') as pdf:
    # Access pages
    first_page = pdf.pages[0]
    
    # Document metadata
    print(pdf.metadata)
    
    # Number of pages
    print(len(pdf.pages))
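
The values in `pdf.metadata` are usually raw strings, and date fields such as `CreationDate` commonly follow the PDF date convention `D:YYYYMMDDHHmmSS` with an optional timezone suffix. A minimal sketch for converting such a string, assuming all 14 timestamp digits are present (the helper name `parse_pdf_date` is hypothetical and not part of pdfplumber):

```python
from datetime import datetime

def parse_pdf_date(value):
    """Parse a PDF date string such as "D:20240115093000+00'00'".

    Assumption: the string carries all 14 timestamp digits. The timezone
    suffix, if any, is ignored. Returns None when the value does not match.
    """
    if not value:
        return None
    digits = value[2:] if value.startswith("D:") else value
    try:
        return datetime.strptime(digits[:14], "%Y%m%d%H%M%S")
    except ValueError:
        return None

print(parse_pdf_date("D:20240115093000+00'00'"))  # 2024-01-15 09:30:00
```

Real documents may store shortened or non-standard dates, so treating a `None` result as "unknown" is safer than raising.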

PDF Structure

PDF Document
├── metadata (title, author, creation date)
├── pages[]
│   ├── chars (individual characters with position)
│   ├── words (grouped characters, built by extract_words())
│   ├── lines (horizontal/vertical lines)
│   ├── rects (rectangles)
│   ├── curves (bezier curves)
│   └── images (embedded images)
└── outline (bookmarks/TOC)

Text Extraction

Basic Text

with pdfplumber.open('document.pdf') as pdf:
    # Single page
    text = pdf.pages[0].extract_text()
    
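    # All pages, separated by newlines so text from adjacent pages does not run together
    # (the `or ''` guards against pages that yield no text)
    full_text = '\n'.join(page.extract_text() or '' for page in pdf.pages)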
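
Requests such as "pages 5-10" use 1-based, inclusive page numbers, while `pdf.pages` is a 0-based list. A small sketch of the conversion, where `page_indices` is a hypothetical helper name rather than a pdfplumber function:

```python
def page_indices(first, last, total_pages):
    """Return 0-based indices for the 1-based, inclusive range first..last."""
    if first < 1 or last < first:
        raise ValueError("expected 1 <= first <= last")
    # Clamp the end of the range to the document length.
    return list(range(first - 1, min(last, total_pages)))

print(page_indices(5, 10, total_pages=8))  # [4, 5, 6, 7]
```

With an open document this could be used as `[pdf.pages[i].extract_text() or '' for i in page_indices(5, 10, len(pdf.pages))]`.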

Advanced Text Options

# With layout preservation
text = page.extract_text(
    x_tolerance=3,      # Horizontal tolerance for grouping
    y_tolerance=3,      # Vertical tolerance
    layout=True,        # Preserve layout
    x_density=7.25,     # Chars per unit width
    y_density=13        # Chars per unit height
)

# Extract words with positions
words = page.extract_words(
    x_tolerance=3,
    y_tolerance=3,
    keep_blank_chars=False,
    use_text_flow=False
)

# Each word includes: text, x0, top, x1, bottom, etc.
for word in words:
    print(f"{word['text']} at ({word['x0']}, {word['top']})")
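
Word positions make it possible to rebuild reading order by hand. The sketch below groups word dictionaries into lines by their `top` coordinate. It runs on hand-written sample data shaped like `extract_words()` output, and the helper name `words_to_lines` and its tolerance default are assumptions for illustration:

```python
def words_to_lines(words, y_tolerance=3):
    """Group word dicts into text lines using their vertical position."""
    lines = []
    current = []
    current_top = None
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        # Start a new line when the word sits too far below the current one.
        if current and abs(word["top"] - current_top) > y_tolerance:
            lines.append(current)
            current = []
        if not current:
            current_top = word["top"]
        current.append(word)
    if current:
        lines.append(current)
    # Within each line, order words left to right.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]

sample_words = [
    {"text": "Total:", "x0": 50, "top": 100.0},
    {"text": "Invoice", "x0": 50, "top": 80.2},
    {"text": "$42.00", "x0": 120, "top": 100.4},
    {"text": "#1001", "x0": 110, "top": 80.0},
]
print(words_to_lines(sample_words))  # ['Invoice #1001', 'Total: $42.00']
```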

Character-Level Access

# Get all characters
chars = page.chars

for char in chars:
    print(f"'{char['text']}' at ({char['x0']}, {char['top']})")
    print(f"  Font: {char['fontname']}, Size: {char['size']}")
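
Font metadata is often useful for separating headings from body text. A rough sketch, using sample dictionaries shaped like entries of `page.chars` (the helper `font_histogram` is hypothetical, and "most common font is the body font" is a heuristic, not a rule):

```python
from collections import Counter

def font_histogram(chars):
    """Count characters per (fontname, size) pair, rounding size to 0.1."""
    return Counter((c["fontname"], round(c["size"], 1)) for c in chars)

sample_chars = [
    {"text": "T", "fontname": "Helvetica-Bold", "size": 18.0},
    {"text": "i", "fontname": "Helvetica-Bold", "size": 18.0},
    {"text": "b", "fontname": "Helvetica", "size": 10.0},
    {"text": "o", "fontname": "Helvetica", "size": 10.0},
    {"text": "d", "fontname": "Helvetica", "size": 9.96},
]

histogram = font_histogram(sample_chars)
# Heuristic: treat the most frequent font/size pair as body text.
(body_font, body_size), _count = histogram.most_common(1)[0]
heading_text = "".join(c["text"] for c in sample_chars if c["size"] > body_size)
print(body_font, body_size, heading_text)  # Helvetica 10.0 Ti
```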

Table Extraction

Basic Table Extraction

with pdfplumber.open('report.pdf') as pdf:
    page = pdf.pages[0]
    
    # Extract all tables
    tables = page.extract_tables()
    
    for i, table in enumerate(tables):
        print(f"Table {i+1}:")
        for row in table:
            print(row)
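
Each extracted table is a list of rows, and each row is a list of cell strings in which empty cells may be `None`. For the "convert this table to CSV" use case, a minimal sketch with the standard `csv` module follows. The sample table is made up, and `table_to_csv` is a hypothetical helper:

```python
import csv
import io

def table_to_csv(table):
    """Serialize a list-of-rows table to CSV text, blanking out None cells."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(
            # Collapse line breaks inside a cell so each record stays on one line.
            "" if cell is None else cell.replace("\n", " ").strip()
            for cell in row
        )
    return buffer.getvalue()

sample_table = [
    ["Item", "Qty", "Price"],
    ["Widget", "2", "9.99"],
    ["Multi\nline", None, "1,000.00"],
]
print(table_to_csv(sample_table))
```

Cells that contain commas are quoted by the `csv` writer, so values such as `1,000.00` survive a round trip.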

Advanced Table Settings

# Custom table detection
table_settings = {
    "vertical_strategy": "lines",      # or "text", "explicit"
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],     # Custom line positions
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "intersection_tolerance": 3,
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
}

tables = page.extract_tables(table_settings)
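
In practice only the keys that differ from the defaults usually need to be passed. The sketch below builds a partial settings dictionary and checks the strategy names before handing it to `extract_tables`. The helper `build_table_settings` is hypothetical, and the set of strategy names reflects my understanding of pdfplumber rather than anything stated in this document:

```python
# Strategy names assumed to be accepted by pdfplumber's table finder.
VALID_STRATEGIES = {"lines", "lines_strict", "text", "explicit"}

def build_table_settings(**overrides):
    """Return a table-settings dict with validated strategy values."""
    settings = {"vertical_strategy": "lines", "horizontal_strategy": "lines"}
    settings.update(overrides)
    for axis in ("vertical", "horizontal"):
        strategy = settings[f"{axis}_strategy"]
        if strategy not in VALID_STRATEGIES:
            raise ValueError(f"unknown {axis}_strategy: {strategy!r}")
        # The explicit strategy only makes sense with user-supplied lines.
        if strategy == "explicit" and not settings.get(f"explicit_{axis}_lines"):
            raise ValueError(f"{axis}_strategy='explicit' needs explicit_{axis}_lines")
    return settings

# Borderless tables: infer both column and row boundaries from text alignment.
borderless = build_table_settings(vertical_strategy="text", horizontal_strategy="text")
print(borderless)
```

The result can then be passed as `page.extract_tables(borderless)`.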

Table Finding

# Find tables (without extracting); find_tables() returns a list of Table objects
found_tables = page.find_tables()

for table in found_tables:
    print(f"Table at: {table.bbox}")  # (x0, top, x1, bottom)
    
    # Extract specific table
    data = table.extract()
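
When a page holds several tables, the bounding boxes can be used to choose one before extracting. A small sketch that picks the largest table by area, shown on plain `(x0, top, x1, bottom)` tuples; `bbox_area` is a hypothetical helper:

```python
def bbox_area(bbox):
    """Area of an (x0, top, x1, bottom) bounding box, never negative."""
    x0, top, x1, bottom = bbox
    return max(0, x1 - x0) * max(0, bottom - top)

sample_bboxes = [(50, 100, 300, 180), (50, 300, 550, 700), (400, 50, 560, 90)]
largest = max(sample_bboxes, key=bbox_area)
print(largest)  # (50, 300, 550, 700)
```

With real tables the same idea reads `max(page.find_tables(), key=lambda t: bbox_area(t.bbox))`.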

Visual Debugging

# Create visual debug image
im = page.to_image(resolution=150)

# Draw detected objects
im.draw_rects(page.chars)            # Character bounding boxes
im.draw_rects(page.extract_words())  # Word bounding boxes (words come from extract_words())
im.draw_lines(page.lines)            # Lines
im.draw_rects(page.rects)            # Rectangles

# Save debug image
im.save('debug.png')

# Debug tables
im.reset()
im.debug_tablefinder()
im.save('table_debug.png')

Cropping and Filtering

Crop to Region

# Define bounding box (x0, top, x1, bottom)
bbox = (0, 0, 300, 200)

# Crop page
cropped = page.crop(bbox)

# Extract from cropped area
text = cropped.extract_text()
tables = cropped.extract_tables()
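
To my knowledge, `crop` rejects by default a bounding box that extends beyond the page, so it can help to clamp a requested region to the page bounds first. A sketch under that assumption, with `clamp_bbox` as a hypothetical helper and a US Letter page size used purely as sample data:

```python
def clamp_bbox(bbox, page_bbox):
    """Shrink bbox so it lies within page_bbox; both are (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    clamped = (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))
    # An empty intersection means the requested region misses the page entirely.
    if clamped[0] >= clamped[2] or clamped[1] >= clamped[3]:
        raise ValueError(f"bbox {bbox} does not overlap page {page_bbox}")
    return clamped

print(clamp_bbox((-10, 0, 700, 200), (0, 0, 612, 792)))  # (0, 0, 612, 200)
```

With a real page this would be `page.crop(clamp_bbox(bbox, page.bbox))`.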

Filter by Position

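# Filter characters by region (body reconstructed, assuming obj has x0/top/x1/bottom keys)
def within_bbox(obj, bbox):
    x0, top, x1, bottom = bbox
    return (obj['x0'] >= x0 and obj['top'] >= top
            and obj['x1'] <= x1 and obj['bottom'] <= bottom)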

general reviews

Ratings

4.5 · 69 reviews
  • Harper Gupta · Dec 24, 2024

    pdf-extraction reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • Meera Liu · Dec 24, 2024

    Registry listing for pdf-extraction matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Jin Park · Dec 20, 2024

    Keeps context tight: pdf-extraction is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Naina Martinez · Dec 20, 2024

    I recommend pdf-extraction for anyone iterating fast on agent tooling; clear intent and a small, reviewable surface area.

  • Nia Okafor · Dec 12, 2024

    Useful defaults in pdf-extraction — fewer surprises than typical one-off scripts, and it plays nicely with `npx skills` flows.

  • Meera Shah · Dec 8, 2024

    We added pdf-extraction from the explainx registry; install was straightforward and the SKILL.md answered most questions upfront.

  • Nia Abebe · Dec 8, 2024

    Keeps context tight: pdf-extraction is the kind of skill you can hand to a new teammate without a long onboarding doc.

  • Li Srinivasan · Dec 4, 2024

    pdf-extraction reduced setup friction for our internal harness; good balance of opinion and flexibility.

  • Luis Johnson · Nov 27, 2024

    Registry listing for pdf-extraction matched our evaluation — installs cleanly and behaves as described in the markdown.

  • Rahul Santra · Nov 19, 2024

    pdf-extraction reduced setup friction for our internal harness; good balance of opinion and flexibility.
