extracting-pdf-text
letta-ai/skills · updated Apr 8, 2026
Extracting PDF Text for LLMs
This skill provides tools and guidance for extracting text from PDFs in formats suitable for language model consumption.
Quick Decision Guide
| PDF Type | Best Approach | Script |
|---|---|---|
| Simple text PDF | PyMuPDF | scripts/extract_pymupdf.py |
| PDF with tables | pdfplumber | scripts/extract_pdfplumber.py |
| Scanned/image PDF (local) | pytesseract | scripts/extract_with_ocr.py |
| Complex layout, highest accuracy | Mistral OCR API | scripts/extract_mistral_ocr.py |
| End-to-end RAG pipeline | marker-pdf | pip install marker-pdf |
Recommended Workflow
- Try PyMuPDF first: it is the fastest option and handles most text-based PDFs well.
- If tables come out mangled, switch to pdfplumber.
- If the PDF is scanned or image-based, use the Mistral OCR API (best accuracy) or local OCR (free but slower).
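The first and last steps of this workflow can be roughed out in code. The sketch below is an illustration, not part of the skill's scripts: `pick_extractor`, `read_page_texts`, and the 50-character threshold are hypothetical names and values chosen for the example. It assumes a page with almost no extractable text is probably a scan. Whether tables are mangled still needs a human (or downstream) check, so that step is not automated here.

```python
from typing import List


def pick_extractor(page_texts: List[str], min_chars_per_page: int = 50) -> str:
    """Return "pymupdf" if most pages carry a text layer, otherwise "ocr".

    The threshold is an assumed heuristic; tune it for your documents.
    """
    if not page_texts:
        return "ocr"
    # Count pages whose extracted text is too short to be a real text layer.
    sparse = sum(1 for text in page_texts if len(text.strip()) < min_chars_per_page)
    return "ocr" if sparse > len(page_texts) / 2 else "pymupdf"


def read_page_texts(pdf_path: str) -> List[str]:
    """Read plain text per page with PyMuPDF (imported lazily)."""
    import pymupdf  # older PyMuPDF releases expose this module as `fitz`

    with pymupdf.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

A typical call would be `pick_extractor(read_page_texts("input.pdf"))`, followed by running the matching script from the table above.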
Local Extraction (No API Required)
PyMuPDF - Fast General Extraction
Best for: Text-heavy PDFs, speed-critical workflows, basic structure preservation.
uv run scripts/extract_pymupdf.py input.pdf output.md
The script outputs markdown that preserves headings and paragraphs. For LLM-optimized output, it uses pymupdf4llm, which formats text for RAG systems.
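For readers who want to call pymupdf4llm directly, a minimal sketch follows. It is not the contents of scripts/extract_pymupdf.py, which is not shown here; `pdf_to_markdown` is a hypothetical helper name, and the sketch assumes `pymupdf4llm` is installed (`pip install pymupdf4llm`).

```python
from pathlib import Path


def pdf_to_markdown(pdf_path: str, out_path: str) -> int:
    """Convert a text-based PDF to markdown and write it to out_path.

    Returns the number of characters written.
    """
    import pymupdf4llm  # imported lazily so the module loads without it

    # to_markdown accepts a file path and returns one markdown string.
    md_text = pymupdf4llm.to_markdown(pdf_path)
    Path(out_path).write_text(md_text, encoding="utf-8")
    return len(md_text)
```

Usage would look like `pdf_to_markdown("input.pdf", "output.md")`.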
pdfplumber - Table Extraction
Best for: PDFs with tables, financial documents, structured data.
uv run scripts/extract_pdfplumber.py input.pdf output.md
Tables are converted to markdown format. Note: pdfplumber works best on machine-generated PDFs, not scanned documents.
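The table-to-markdown step could look roughly like the sketch below. It assumes the shape pdfplumber's `extract_tables()` returns (a list of tables, each a list of rows whose cells are strings or `None`) and treats the first row as the header. `table_to_markdown` and `extract_tables_as_markdown` are hypothetical names, not functions from the skill's script.

```python
from typing import List, Optional

Row = List[Optional[str]]


def table_to_markdown(rows: List[Row]) -> str:
    """Render one extracted table as a markdown pipe table."""
    if not rows:
        return ""

    def clean(cell: Optional[str]) -> str:
        # Empty cells arrive as None; newlines and pipes would break the row.
        return (cell or "").replace("\n", " ").replace("|", "\\|").strip()

    width = max(len(row) for row in rows)
    padded = [[clean(c) for c in row] + [""] * (width - len(row)) for row in rows]
    header, body = padded[0], padded[1:]
    lines = ["| " + " | ".join(header) + " |", "|" + "|".join(["---"] * width) + "|"]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def extract_tables_as_markdown(pdf_path: str) -> List[str]:
    """Collect every table in the PDF as a markdown string."""
    import pdfplumber  # imported lazily; pip install pdfplumber

    rendered = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                rendered.append(table_to_markdown(table))
    return rendered
```

Treating the first row as the header is a simplification; tables with multi-row headers or merged cells would need extra handling.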
Local OCR - Scanned Documents
Best for: Scanned PDFs when API access is unavailable.
uv run scripts/extract_with_ocr.py input.pdf output.txt
Requires: the pytesseract and pdf2image Python packages, plus the Tesseract binary (brew install tesseract on macOS).
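A local OCR pass with these dependencies can be sketched as follows. This is an illustration under assumptions, not the script itself: `ocr_pdf` and `join_pages` are hypothetical names, and the 300 DPI default and page-separator format are choices made for the example. Note that pdf2image also relies on Poppler being installed on the system.

```python
from typing import List


def ocr_pdf(pdf_path: str, dpi: int = 300, lang: str = "eng") -> List[str]:
    """Rasterize each page and run Tesseract on it; returns one string per page."""
    from pdf2image import convert_from_path  # needs Poppler on the system
    import pytesseract  # needs the Tesseract binary on PATH

    images = convert_from_path(pdf_path, dpi=dpi)
    return [pytesseract.image_to_string(image, lang=lang) for image in images]


def join_pages(page_texts: List[str]) -> str:
    """Join per-page OCR text with visible page markers."""
    parts = [
        f"--- Page {number} ---\n{text.strip()}"
        for number, text in enumerate(page_texts, start=1)
    ]
    return "\n\n".join(parts) + "\n"
```

Keeping page markers in the output makes it easier to trace an LLM answer back to the source page.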
API-Based Extraction
Mistral OCR API
Best for: Complex layouts, scanned documents, highest accuracy, multilingual content, math formulas.
Pricing: roughly 1,000 pages per dollar, which makes it very cost-effective.
export MISTRAL_API_KEY="your-key"
uv run scripts/extract_mistral_ocr.py input.pdf output.md
Features:
- Outputs clean markdown
- Preserves document structure (headings, lists, tables)
- Handles images, math equations, multilingual text
- 95%+ accuracy on complex documents
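For orientation, here is a hedged sketch of what a direct call might look like, along with a cost estimate based on the "about 1,000 pages per dollar" figure above. The client calls reflect the `mistralai` Python package as commonly documented (file upload with `purpose="ocr"`, a signed URL, then `ocr.process` with the `mistral-ocr-latest` model); treat them as assumptions and check the current client documentation, since names may change. `estimate_cost_usd` and `mistral_ocr_to_markdown` are hypothetical helper names, not part of the skill's script.

```python
import os


def estimate_cost_usd(num_pages: int, pages_per_dollar: float = 1000.0) -> float:
    """Rough cost estimate from the approximate pages-per-dollar rate."""
    return num_pages / pages_per_dollar


def mistral_ocr_to_markdown(pdf_path: str) -> str:
    """Upload a PDF, run OCR, and return the pages joined as markdown."""
    from mistralai import Mistral  # imported lazily; pip install mistralai

    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

    # Upload the file so the OCR endpoint can fetch it via a signed URL.
    with open(pdf_path, "rb") as fh:
        uploaded = client.files.upload(
            file={"file_name": os.path.basename(pdf_path), "content": fh},
            purpose="ocr",
        )
    signed = client.files.get_signed_url(file_id=uploaded.id)

    response = client.ocr.process(
        model="mistral-ocr-latest",
        document={"type": "document_url", "document_url": signed.url},
    )
    # Each page in the response carries its own markdown text.
    return "\n\n".join(page.markdown for page in response.pages)
```

At the quoted rate, `estimate_cost_usd(2500)` comes to about 2.50 dollars for a 2,500-page batch.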
For detailed API options and other services, see references/api-services.md.
Output Format Recommendations
For LLM consumption, markdown is preferred:
- Preserves semantic structure (headings become context boundaries)
- Tables remain readable
- Compatible with most RAG chunking strategies
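To illustrate why headings work as context boundaries, the sketch below splits extracted markdown into one chunk per heading. It is a simplified, assumed chunking strategy rather than anything shipped with the skill; `chunk_by_headings` is a hypothetical name, and only ATX-style headings (`#` to `######`) are recognized.

```python
import re
from typing import List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code-fence marker, built this way to keep the example readable


def chunk_by_headings(markdown: str) -> List[Tuple[str, str]]:
    """Split markdown into (heading, body) chunks, one per heading.

    Text before the first heading is returned with an empty heading.
    Lines inside fenced code blocks are never treated as headings.
    """
    chunks: List[Tuple[str, str]] = []
    title, lines, in_fence = "", [], False

    def flush() -> None:
        # Emit the current chunk unless it is completely empty.
        if title or any(line.strip() for line in lines):
            chunks.append((title, "\n".join(lines).strip()))

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return chunks
```

Each chunk keeps its heading, so a retriever can prepend it to the body as context. Long sections may still need a secondary split by length.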
For detailed comparisons of local tools, see references/local-tools.md.