document-processing

eyadsibai/ltk · updated Apr 8, 2026

$ npx skills add https://github.com/eyadsibai/ltk --skill document-processing
summary

Process, extract, and manipulate PDF, Excel, Word, and PowerPoint documents programmatically.

  • Supports four major office formats (PDF, XLSX, DOCX, PPTX) with format-specific tools: pypdf and pdfplumber for PDFs, openpyxl and pandas for Excel, python-docx for Word, python-pptx for PowerPoint
  • Core operations include text and table extraction, document merging and splitting, format conversion, and OCR for scanned PDFs
  • Excel-specific guidance emphasizes writing formulas rather than stati
skill.md

Document Processing Guide

Work with office documents: PDF, Excel, Word, and PowerPoint.


Format Overview

| Format | Extension | Structure | Best For |
| --- | --- | --- | --- |
| PDF | .pdf | Binary/text | Reports, forms, archives |
| Excel | .xlsx | XML in ZIP | Data, calculations, models |
| Word | .docx | XML in ZIP | Text documents, contracts |
| PowerPoint | .pptx | XML in ZIP | Presentations, slides |

Key concept: XLSX, DOCX, and PPTX are all ZIP archives containing XML files. You can unzip them to access raw content.
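
As a minimal sketch of that idea, the standard-library `zipfile` module can list and read the parts of such a package without any office-specific library. The helper names `list_parts` and `read_part` are hypothetical, and the in-memory archive is only a stand-in so the example runs without a real document.

```python
import zipfile
from io import BytesIO


def list_parts(source):
    """Return the sorted names of all parts inside a ZIP-based office file."""
    with zipfile.ZipFile(source) as package:
        return sorted(package.namelist())


def read_part(source, name):
    """Return one part (for example word/document.xml) as text."""
    with zipfile.ZipFile(source) as package:
        return package.read(name).decode("utf-8")


# Stand-in archive for the demo; a real .docx/.xlsx/.pptx path works the same way.
demo = BytesIO()
with zipfile.ZipFile(demo, "w") as package:
    package.writestr("[Content_Types].xml", "<Types/>")
    package.writestr("word/document.xml", "<document/>")

parts = list_parts(demo)
print(parts)
```

With a real file, `list_parts("report.docx")` would show the XML parts described in the format-specific sections of this guide.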


PDF Processing

PDF Tools

| Task | Best Tool |
| --- | --- |
| Basic read/write | pypdf |
| Text extraction | pdfplumber |
| Table extraction | pdfplumber |
| Create PDFs | reportlab |
| OCR scanned PDFs | pytesseract + pdf2image |
| Command line | qpdf, pdftotext |

Common Operations

| Operation | Approach |
| --- | --- |
| Merge | Loop through the input files and add every page to a single writer |
| Split | Create a new writer per page |
| Extract tables | Use pdfplumber, convert to DataFrame |
| Rotate | Call .rotate(degrees) on the page (multiples of 90) |
| Encrypt | Use the writer's .encrypt() method |
| OCR | Convert pages to images, run pytesseract |

Excel Processing

Excel Tools

| Task | Best Tool |
| --- | --- |
| Data analysis | pandas |
| Formulas & formatting | openpyxl |
| Simple CSV | pandas |
| Financial models | openpyxl |

Critical Rule: Use Formulas

| Approach | Result |
| --- | --- |
| Wrong: Calculate in Python, write the value | Static number that goes stale when the data changes |
| Right: Write an Excel formula | Dynamic: the spreadsheet application recalculates it when the data changes |
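
A minimal sketch of the rule with openpyxl, assuming a single column of amounts; the function name `build_sheet` and the column layout are hypothetical. Note that openpyxl stores the formula text but does not evaluate it, so the total is computed when the file is opened in a spreadsheet application.

```python
from openpyxl import Workbook


def build_sheet(values):
    """Write values to column A and a SUM formula beneath them."""
    workbook = Workbook()
    sheet = workbook.active
    sheet["A1"] = "Amount"
    for row, value in enumerate(values, start=2):
        sheet.cell(row=row, column=1, value=value)
    last_data_row = len(values) + 1
    # Right: write the formula, not the result of sum(values) computed in Python.
    sheet.cell(row=last_data_row + 1, column=1, value=f"=SUM(A2:A{last_data_row})")
    return workbook


workbook = build_sheet([10, 20, 30])
total_cell = workbook.active["A5"]
print(total_cell.value)
```

Writing `sum(values)` into the cell instead would produce the static number the table warns about.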

Financial Model Standards

| Convention | Meaning |
| --- | --- |
| Blue text | Hardcoded inputs |
| Black text | Formulas |
| Green text | Links to other sheets |
| Yellow fill | Needs attention |
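
One way these conventions might be applied with openpyxl is sketched below. The exact hex colors, the helper names `font_for` and `write_cell`, and the rule "a formula containing `!` is a cross-sheet link" are all assumptions made for illustration.

```python
from openpyxl import Workbook
from openpyxl.styles import Font, PatternFill

INPUT_FONT = Font(color="0000FF")    # blue: hardcoded inputs
FORMULA_FONT = Font(color="000000")  # black: formulas
LINK_FONT = Font(color="008000")     # green: links to other sheets
ATTENTION_FILL = PatternFill(fill_type="solid", start_color="FFFF00")  # yellow


def font_for(value):
    """Pick a font by cell content; returns None for labels and other text."""
    if isinstance(value, str) and value.startswith("="):
        # Assumption: a sheet reference such as Assumptions!B4 marks a link.
        return LINK_FONT if "!" in value else FORMULA_FONT
    if isinstance(value, (int, float)):
        return INPUT_FONT
    return None


def write_cell(sheet, ref, value, needs_attention=False):
    """Write a value and style it according to the conventions above."""
    cell = sheet[ref]
    cell.value = value
    font = font_for(value)
    if font is not None:
        cell.font = font
    if needs_attention:
        cell.fill = ATTENTION_FILL
    return cell


workbook = Workbook()
sheet = workbook.active
growth = write_cell(sheet, "B1", 0.05, needs_attention=True)
projected = write_cell(sheet, "B2", "=B1*100")
linked = write_cell(sheet, "B3", "=Assumptions!B4")
```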

Common Formula Errors

| Error | Cause |
| --- | --- |
| #REF! | Invalid cell reference |
| #DIV/0! | Division by zero |
| #VALUE! | Wrong data type |
| #NAME? | Unknown function name |
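
A small, library-independent sketch of scanning cell values for these errors is shown here. The function name `find_formula_errors` and the sample rows are hypothetical; the rows stand in for whatever values have been read from a workbook.

```python
EXCEL_ERRORS = {
    "#REF!": "Invalid cell reference",
    "#DIV/0!": "Division by zero",
    "#VALUE!": "Wrong data type",
    "#NAME?": "Unknown function name",
}


def find_formula_errors(rows):
    """Return (row, column, error, cause) for every error value, 1-indexed."""
    hits = []
    for row_number, row in enumerate(rows, start=1):
        for column_number, value in enumerate(row, start=1):
            if isinstance(value, str) and value in EXCEL_ERRORS:
                hits.append((row_number, column_number, value, EXCEL_ERRORS[value]))
    return hits


# Sample data standing in for computed cell values.
rows = [
    ["Revenue", 100, 120],
    ["Growth", "#DIV/0!", 0.2],
    ["Margin", "#REF!", None],
]
errors = find_formula_errors(rows)
print(errors)
```

With openpyxl, the rows could come from `load_workbook(path, data_only=True).active.iter_rows(values_only=True)`; the cached results are only present if a spreadsheet application has calculated and saved the file.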

Word Processing

Word Tools

| Task | Best Tool |
| --- | --- |
| Text extraction | pandoc |
| Create new | python-docx or docx-js |
| Simple edits | python-docx |
| Tracked changes | Direct XML editing |

Document Structure

| File | Contains |
| --- | --- |
| word/document.xml | Main content |
| word/comments.xml | Comments |
| word/media/ | Images |
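
To illustrate how the main content sits in `word/document.xml`, here is a standard-library sketch that reads paragraph text straight from the package. The function name `docx_paragraphs` is hypothetical, and the tiny in-memory archive is a stand-in rather than a complete Word file.

```python
import zipfile
import xml.etree.ElementTree as ET
from io import BytesIO

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source):
    """Return the text of each w:p paragraph in word/document.xml."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # Only w:t runs are joined, so text inside w:delText (tracked
        # deletions) is left out while tracked insertions are included.
        text = "".join(node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


# Stand-in document for the demo.
DEMO_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "</w:body></w:document>"
)
demo = BytesIO()
with zipfile.ZipFile(demo, "w") as package:
    package.writestr("word/document.xml", DEMO_XML)

paragraphs = docx_paragraphs(demo)
print(paragraphs)
```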

Tracked Changes (Redlining)

| Element | XML Tag |
| --- | --- |
| Deletion | <w:del><w:r><w:delText>...</w:delText></w:r></w:del> |
| Insertion | <w:ins><w:r><w:t>...</w:t></w:r></w:ins> |

Key concept: For professional/legal documents, use tracked changes XML rather than replacing text directly.
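
A hedged sketch of building such a redline with the standard library follows. The function name `tracked_replace`, the author name, and the change ids are hypothetical; the sketch only produces the two elements and leaves splicing them into the right paragraph of `word/document.xml` to the caller.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ET.register_namespace("w", W_NS)


def w(tag):
    """Qualify a tag or attribute name with the WordprocessingML namespace."""
    return f"{{{W_NS}}}{tag}"


def tracked_replace(old_text, new_text, author, change_id, when=None):
    """Return (deletion, insertion) elements that redline old_text to new_text."""
    stamp = (when or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")

    def revision_attrs(revision_id):
        return {w("id"): str(revision_id), w("author"): author, w("date"): stamp}

    # Deleted text lives in w:delText inside a run, wrapped by w:del.
    deletion = ET.Element(w("del"), revision_attrs(change_id))
    ET.SubElement(ET.SubElement(deletion, w("r")), w("delText")).text = old_text

    # Inserted text is an ordinary run (w:r/w:t) wrapped by w:ins.
    insertion = ET.Element(w("ins"), revision_attrs(change_id + 1))
    ET.SubElement(ET.SubElement(insertion, w("r")), w("t")).text = new_text
    return deletion, insertion


deletion, insertion = tracked_replace("30 days", "60 days", "Editor", 1)
deletion_xml = ET.tostring(deletion, encoding="unicode")
insertion_xml = ET.tostring(insertion, encoding="unicode")
```

Keeping the old text in a deletion element, instead of overwriting it, is what lets a reviewer accept or reject the change later in Word.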


PowerPoint Processing

PowerPoint Tools

| Task | Best Tool |
| --- | --- |
| Text extraction | markitdown |
| Create new | pptxgenjs (JS) or python-pptx |
| Edit existing | Direct XML or python-pptx |

Slide Structure

| Path | Contains |
| --- | --- |
| ppt/slides/slide{N}.xml | Slide content |
| ppt/notesSlides/ | Speaker notes |
| ppt/slideMasters/ | Master templates |
| ppt/media/ | Images |
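
The layout above can be explored with the standard library alone, as in this sketch. The helper names `list_slides` and `slide_text` are hypothetical, and the in-memory archive is a stand-in for a real deck. Sorting by the number in the file name is an assumption for convenience: the presentation order is defined elsewhere in the package and may differ.

```python
import re
import zipfile
import xml.etree.ElementTree as ET
from io import BytesIO

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
P_NS = "http://schemas.openxmlformats.org/presentationml/2006/main"
SLIDE_RE = re.compile(r"^ppt/slides/slide(\d+)\.xml$")


def list_slides(source):
    """Return slide part names sorted by their numeric suffix."""
    with zipfile.ZipFile(source) as package:
        names = package.namelist()
    numbered = []
    for name in names:
        match = SLIDE_RE.match(name)
        if match:
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]


def slide_text(source, slide_name):
    """Return the text runs (a:t elements) found on one slide."""
    with zipfile.ZipFile(source) as package:
        root = ET.fromstring(package.read(slide_name))
    return [node.text or "" for node in root.iter(f"{{{A_NS}}}t")]


def demo_slide(title):
    """Demo helper (hypothetical): minimal slide XML holding one text run."""
    return (
        f'<p:sld xmlns:p="{P_NS}" xmlns:a="{A_NS}"><p:cSld><p:spTree><p:sp>'
        f"<p:txBody><a:p><a:r><a:t>{title}</a:t></a:r></a:p></p:txBody>"
        "</p:sp></p:spTree></p:cSld></p:sld>"
    )


demo = BytesIO()
with zipfile.ZipFile(demo, "w") as package:
    for number in (1, 10, 2):
        package.writestr(f"ppt/slides/slide{number}.xml", demo_slide(f"Slide {number}"))
    package.writestr("ppt/notesSlides/notesSlide1.xml", "<notes/>")

slides = list_slides(demo)
first_text = slide_text(demo, slides[0])
```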

Design Principles

| Principle | Guideline |
| --- | --- |
| Fonts | Use web-safe: Arial, Helvetica, Georgia |
| Layout | Two-column preferred, avoid vertical stacking |
| Hierarchy | Size, weight, color for emphasis |
| Consistency | Repeat patterns across slides |

Converting Between Formats

| Conversion | Tool |
| --- | --- |
| Any → PDF | LibreOffice headless |
| PDF → Images | pdftoppm |
| DOCX → Markdown | pandoc |
| Any → Text | Format-specific extractor (e.g. pdftotext, pandoc, markitdown) |
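
The conversions above are usually driven from the command line. The sketch below builds the argument lists in Python and runs them with `subprocess`; the function names and default DPI are hypothetical, and it assumes the LibreOffice binary is available as `soffice` (some installs expose it as `libreoffice`) alongside `pdftoppm` and `pandoc` on the PATH.

```python
import subprocess


def to_pdf_cmd(source, out_dir):
    """LibreOffice headless: convert an office document to PDF in out_dir."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(source)]


def pdf_to_png_cmd(pdf, prefix, dpi=150):
    """pdftoppm: render each page to <prefix>-<page>.png at the given DPI."""
    return ["pdftoppm", "-png", "-r", str(dpi), str(pdf), str(prefix)]


def docx_to_markdown_cmd(docx, markdown):
    """pandoc: convert a Word document to Markdown."""
    return ["pandoc", str(docx), "-t", "markdown", "-o", str(markdown)]


def run(command):
    """Run a conversion and raise CalledProcessError if the tool fails."""
    subprocess.run(command, check=True)


pdf_command = to_pdf_cmd("report.docx", "out")
print(" ".join(pdf_command))
```

Building the command as a list rather than a shell string avoids quoting problems with file names that contain spaces.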

Best Practices

| Practice | Why |
| --- | --- |
| Use formulas in Excel | Dynamic calculations |
| Preserve formatting on edit | Don't lose styles |
| Test output opens correctly | Catch corruption early |
| Use tracked changes for contracts | Audit trail |
| Extract to markdown for analysis | Easier to process |
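
For the "test output opens correctly" practice, a cheap first check on XLSX, DOCX, and PPTX output is sketched below. The function name `check_ooxml` is hypothetical, and the check only covers ZIP integrity and the presence of `[Content_Types].xml`; it is not a substitute for opening the file in the target application.

```python
import zipfile
from io import BytesIO


def check_ooxml(source):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    try:
        with zipfile.ZipFile(source) as package:
            corrupt = package.testzip()  # name of the first bad member, or None
            if corrupt is not None:
                problems.append(f"corrupt member: {corrupt}")
            if "[Content_Types].xml" not in package.namelist():
                problems.append("missing [Content_Types].xml")
    except zipfile.BadZipFile:
        problems.append("not a ZIP archive")
    return problems


# Stand-in inputs for the demo.
good = BytesIO()
with zipfile.ZipFile(good, "w") as package:
    package.writestr("[Content_Types].xml", "<Types/>")

good_result = check_ooxml(good)
bad_result = check_ooxml(BytesIO(b"this is not a zip file"))
```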

Common Packages

| Language | Packages |
| --- | --- |
| Python | pypdf, pdfplumber, openpyxl, python-docx, python-pptx |
| JavaScript | docx, pptxgenjs |
| CLI | pandoc, qpdf, pdftotext, libreoffice |
