pdf-extraction
claude-office-skills/skills · updated Apr 20, 2026
Extract text, tables, and metadata from PDF documents with character-level precision.
- Supports text extraction with layout preservation, word-level positioning, and character-level access including font and size metadata
- Includes advanced table detection with customizable strategies (lines, text, explicit) and tolerance tuning for complex layouts
- Provides visual debugging via image rendering with overlays for characters, words, lines, and detected table boundaries
- Handles cropping
PDF Extraction Skill
Overview
This skill enables precise extraction of text, tables, and metadata from PDF documents using pdfplumber, a Python library for PDF data extraction. Unlike basic PDF text readers, pdfplumber exposes character-level positions, table detection, and visual debugging.
How to Use
- Provide the PDF file you want to extract from
- Specify what you need: text, tables, images, or metadata
- I'll generate pdfplumber code and execute it
Example prompts:
- "Extract all tables from this financial report"
- "Get text from pages 5-10 of this document"
- "Find and extract the invoice total from this PDF"
- "Convert this PDF table to CSV/Excel"
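For the "Convert this PDF table to CSV/Excel" prompt, the generated code typically ends by serializing a table that has already been extracted. The following is a minimal sketch of that last step, assuming the table has the list-of-rows shape that `extract_tables()` returns and that empty cells may come back as `None`. The function name `table_to_csv` and the sample data are hypothetical.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows, each a list of cells) to CSV text.

    Cells that are None are written as empty strings.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


# Hypothetical sample shaped like one entry of page.extract_tables()
sample_table = [["Item", "Amount"], ["Widget", "12.50"], ["Shipping", None]]
csv_text = table_to_csv(sample_table)
print(csv_text)
```

Writing `csv_text` to a file opened with `newline=""` gives a CSV that spreadsheet tools can open.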
Domain Knowledge
pdfplumber Fundamentals
import pdfplumber

# Open PDF
with pdfplumber.open('document.pdf') as pdf:
    # Access pages
    first_page = pdf.pages[0]
    # Document metadata
    print(pdf.metadata)
    # Number of pages
    print(len(pdf.pages))
PDF Structure
PDF Document
├── metadata (title, author, creation date)
├── pages[]
│ ├── chars (individual characters with position)
│ ├── words (grouped characters, obtained via extract_words())
│ ├── lines (horizontal/vertical lines)
│ ├── rects (rectangles)
│ ├── curves (bezier curves)
│ └── images (embedded images)
└── outline (bookmarks/TOC)
Text Extraction
Basic Text
with pdfplumber.open('document.pdf') as pdf:
    # Single page
    text = pdf.pages[0].extract_text()
    # All pages
    full_text = ''
    for page in pdf.pages:
        full_text += page.extract_text() or ''
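Requests such as "Get text from pages 5-10" use 1-based page numbers, while `pdf.pages` is 0-indexed. The sketch below shows one way to select an inclusive page range from per-page text and join it with a separator, treating missing text as empty in the same way as the `or ''` above. The helper name `join_page_texts` and the sample list are hypothetical.

```python
def join_page_texts(page_texts, first=1, last=None, separator="\n\n"):
    """Join per-page text for the 1-based inclusive page range first..last.

    page_texts holds one entry per page; None entries are treated as empty.
    """
    if first < 1:
        raise ValueError("first must be >= 1")
    # Slice end is exclusive, so the 1-based inclusive 'last' maps directly.
    selected = page_texts[first - 1:last]
    return separator.join(text or "" for text in selected)


# Hypothetical per-page results, e.g. [p.extract_text() for p in pdf.pages]
sample_texts = ["Page one", None, "Page three", "Page four"]
joined = join_page_texts(sample_texts, first=3, last=4)
print(joined)
```

With a real document this could be called as `join_page_texts([p.extract_text() for p in pdf.pages], 5, 10)`.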
Advanced Text Options
# With layout preservation
text = page.extract_text(
    x_tolerance=3,   # Horizontal tolerance for grouping
    y_tolerance=3,   # Vertical tolerance
    layout=True,     # Preserve layout
    x_density=7.25,  # Points per character (horizontal), used when layout=True
    y_density=13     # Points per line (vertical), used when layout=True
)

# Extract words with positions
words = page.extract_words(
    x_tolerance=3,
    y_tolerance=3,
    keep_blank_chars=False,
    use_text_flow=False
)

# Each word includes: text, x0, top, x1, bottom, etc.
for word in words:
    print(f"{word['text']} at ({word['x0']}, {word['top']})")
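Word positions make it possible to rebuild reading order without relying on `extract_text()`. The following is a minimal sketch that groups word dictionaries into lines by their `top` coordinate. It assumes each word has at least the `text`, `x0`, and `top` keys shown above; the helper name `group_words_into_lines` and the sample words are hypothetical.

```python
def group_words_into_lines(words, y_tolerance=3):
    """Group word dicts into text lines using their 'top' coordinate.

    A word joins the current line if its 'top' is within y_tolerance of the
    first word on that line; otherwise it starts a new line.
    """
    lines = []
    for word in sorted(words, key=lambda w: (w["top"], w["x0"])):
        if lines and abs(word["top"] - lines[-1][0]["top"]) <= y_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order words left to right before joining.
    return [
        " ".join(w["text"] for w in sorted(line, key=lambda w: w["x0"]))
        for line in lines
    ]


# Hypothetical sample shaped like page.extract_words() output
sample_words = [
    {"text": "Total:", "x0": 50, "top": 100.0},
    {"text": "Invoice", "x0": 50, "top": 80.2},
    {"text": "$42.00", "x0": 120, "top": 100.5},
    {"text": "#1001", "x0": 110, "top": 80.0},
]
text_lines = group_words_into_lines(sample_words)
print(text_lines)
```

This is a simplification: it ignores columns and rotated text, so multi-column pages would need cropping first.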
Character-Level Access
# Get all characters
chars = page.chars
for char in chars:
    print(f"'{char['text']}' at ({char['x0']}, {char['top']})")
    print(f"  Font: {char['fontname']}, Size: {char['size']}")
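Font and size metadata are often more useful in aggregate than per character, for example to tell body text apart from headings. The sketch below counts non-blank characters per (font name, size) pair. It assumes character dictionaries with the `text`, `fontname`, and `size` keys used above; the helper name `font_usage` and the sample data are hypothetical.

```python
from collections import Counter


def font_usage(chars):
    """Count non-blank characters per (fontname, rounded size) pair.

    Returns a list of ((fontname, size), count) tuples, most common first.
    """
    counts = Counter(
        (c["fontname"], round(c["size"], 1))
        for c in chars
        if c["text"].strip()
    )
    return counts.most_common()


# Hypothetical sample shaped like page.chars
sample_chars = [
    {"text": "T", "fontname": "Helvetica-Bold", "size": 18.0},
    {"text": "i", "fontname": "Helvetica-Bold", "size": 18.0},
    {"text": " ", "fontname": "Helvetica", "size": 10.0},
    {"text": "a", "fontname": "Helvetica", "size": 10.0},
    {"text": "b", "fontname": "Helvetica", "size": 10.0},
    {"text": "c", "fontname": "Helvetica", "size": 10.0},
]
usage = font_usage(sample_chars)
print(usage)
```

The most common pair is usually the body font; larger or bolder pairs that occur rarely are candidates for headings.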
Table Extraction
Basic Table Extraction
with pdfplumber.open('report.pdf') as pdf:
    page = pdf.pages[0]
    # Extract all tables
    tables = page.extract_tables()
    for i, table in enumerate(tables):
        print(f"Table {i+1}:")
        for row in table:
            print(row)
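Each extracted table is a list of rows, so downstream code usually wants the first row treated as a header. The following sketch turns such a table into a list of dictionaries, assuming the first row holds column names and that cells may be `None`. The helper name `table_to_records` and the sample table are hypothetical.

```python
def table_to_records(table):
    """Convert a list-of-rows table into dicts keyed by the header row.

    None cells become empty strings and surrounding whitespace is stripped.
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


# Hypothetical sample shaped like one entry of page.extract_tables()
sample_table = [["Item", "Amount"], ["Widget ", "12.50"], ["Shipping", None]]
records = table_to_records(sample_table)
print(records)
```

Tables whose header spans several rows, or that repeat the header on each page, need extra handling that this sketch does not attempt.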
Advanced Table Settings
# Custom table detection
table_settings = {
    "vertical_strategy": "lines",  # or "text", "explicit"
    "horizontal_strategy": "lines",
    "explicit_vertical_lines": [],  # Custom line positions
    "explicit_horizontal_lines": [],
    "snap_tolerance": 3,
    "snap_x_tolerance": 3,
    "snap_y_tolerance": 3,
    "join_tolerance": 3,
    "edge_min_length": 3,
    "min_words_vertical": 3,
    "min_words_horizontal": 1,
    "intersection_tolerance": 3,
    "text_tolerance": 3,
    "text_x_tolerance": 3,
    "text_y_tolerance": 3,
}
tables = page.extract_tables(table_settings)
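When a table has no ruling lines, one option is the "explicit" strategy with column boundaries supplied by hand. The sketch below builds such a settings dictionary from a list of x positions. The helper name `explicit_column_settings` and the sample positions are hypothetical, and pairing explicit columns with text-based rows is an assumption about what suits a given layout rather than a recommended default.

```python
def explicit_column_settings(column_edges):
    """Build table settings that use caller-supplied x positions as columns.

    Duplicate positions are dropped and the rest are sorted left to right.
    """
    edges = sorted(set(column_edges))
    if len(edges) < 2:
        raise ValueError("need at least two distinct column edges")
    return {
        "vertical_strategy": "explicit",
        "explicit_vertical_lines": edges,
        "horizontal_strategy": "text",  # assumption: rows inferred from text
    }


# Hypothetical x positions, e.g. read off a visual debug image
settings = explicit_column_settings([300, 50, 180, 300])
print(settings)
```

The result can be passed the same way as above, for example `page.extract_tables(settings)`.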
Table Finding
# Find tables (without extracting)
table_finder = page.find_tables()
for table in table_finder:
    print(f"Table at: {table.bbox}")  # (x0, top, x1, bottom)
    # Extract specific table
    data = table.extract()
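Because each found table carries a bounding box, tables can be ranked before any cell text is extracted. The sketch below picks the largest bounding box by area, assuming boxes in the `(x0, top, x1, bottom)` form shown above. The helper names `bbox_area` and `largest_bbox` and the sample boxes are hypothetical.

```python
def bbox_area(bbox):
    """Area of an (x0, top, x1, bottom) box; degenerate boxes count as 0."""
    x0, top, x1, bottom = bbox
    return max(x1 - x0, 0) * max(bottom - top, 0)


def largest_bbox(bboxes):
    """Return the box with the greatest area, or None for an empty list."""
    return max(bboxes, key=bbox_area) if bboxes else None


# Hypothetical boxes, e.g. [t.bbox for t in page.find_tables()]
sample_bboxes = [(50, 100, 300, 200), (50, 250, 550, 700), (400, 100, 550, 150)]
main_table_bbox = largest_bbox(sample_bboxes)
print(main_table_bbox)
```

The same key works on the table objects directly, for example `max(page.find_tables(), key=lambda t: bbox_area(t.bbox))`, which avoids extracting small decorative tables.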
Visual Debugging
# Create visual debug image
im = page.to_image(resolution=150)
# Draw detected objects
im.draw_rects(page.chars) # Character bounding boxes
im.draw_rects(page.extract_words()) # Word bounding boxes
im.draw_lines(page.lines) # Lines
im.draw_rects(page.rects) # Rectangles
# Save debug image
im.save('debug.png')
# Debug tables
im.reset()
im.debug_tablefinder()
im.save('table_debug.png')
Cropping and Filtering
Crop to Region
# Define bounding box (x0, top, x1, bottom)
bbox = (0, 0, 300, 200)
# Crop page
cropped = page.crop(bbox)
# Extract from cropped area
text = cropped.extract_text()
tables = cropped.extract_tables()
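Hard-coded coordinates such as `(0, 0, 300, 200)` only fit one page size. A minimal sketch of computing the bounding box from page dimensions instead is shown below, using fractions of the width and height. The helper name `fractional_bbox` is hypothetical, and the 612 x 792 dimensions are just the US Letter size in points used as sample input.

```python
def fractional_bbox(page_width, page_height, left=0.0, top=0.0, right=1.0, bottom=1.0):
    """Return (x0, top, x1, bottom) for a region given as page fractions."""
    for name, value in (("left", left), ("top", top), ("right", right), ("bottom", bottom)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    if left >= right or top >= bottom:
        raise ValueError("region must have positive width and height")
    return (page_width * left, page_height * top, page_width * right, page_height * bottom)


# Top quarter of a hypothetical US Letter page (612 x 792 points)
header_bbox = fractional_bbox(612, 792, bottom=0.25)
print(header_bbox)
```

With a real page this could be used as `page.crop(fractional_bbox(page.width, page.height, bottom=0.25))`, which keeps the box inside the page bounds by construction.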
Filter by Position
# Filter characters by region
def within_bbox(obj, bbox):
    x0, top
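The snippet above is cut off in the source. As an illustration only, and not a reconstruction of the original function, the following sketch shows what a position filter can look like: a predicate that checks whether an object's box lies fully inside a region. It assumes objects with `x0`, `top`, `x1`, and `bottom` keys; the name `obj_within_bbox` and the sample characters are hypothetical.

```python
def obj_within_bbox(obj, bbox):
    """True if the object's box lies entirely inside (x0, top, x1, bottom)."""
    x0, top, x1, bottom = bbox
    return (
        obj["x0"] >= x0
        and obj["top"] >= top
        and obj["x1"] <= x1
        and obj["bottom"] <= bottom
    )


# Hypothetical sample shaped like page.chars
sample_chars = [
    {"text": "A", "x0": 10, "top": 10, "x1": 18, "bottom": 22},
    {"text": "B", "x0": 295, "top": 10, "x1": 305, "bottom": 22},  # crosses right edge
    {"text": "C", "x0": 10, "top": 250, "x1": 18, "bottom": 262},  # below region
]
region = (0, 0, 300, 200)
inside = [c["text"] for c in sample_chars if obj_within_bbox(c, region)]
print(inside)
```

A predicate like this can be passed to pdfplumber's `page.filter(...)`, for example `page.filter(lambda obj: obj_within_bbox(obj, region))`; for a plain rectangular region, `page.within_bbox(region)` is the built-in alternative.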