Document Processing Guide
Work with office documents: PDF, Excel, Word, and PowerPoint.
Format Overview
| Format | Extension | Structure | Best For |
|---|---|---|---|
| PDF | .pdf | Binary/text | Reports, forms, archives |
| Excel | .xlsx | XML in ZIP | Data, calculations, models |
| Word | .docx | XML in ZIP | Text documents, contracts |
| PowerPoint | .pptx | XML in ZIP | Presentations, slides |
Key concept: XLSX, DOCX, and PPTX are all ZIP archives containing XML files. You can unzip them to access raw content.
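To make the ZIP point concrete, here is a minimal sketch that lists the parts inside any of these files using only the Python standard library. The helper name `list_parts` is made up for illustration.

```python
import zipfile


def list_parts(path):
    """Return the names of every file stored inside an OOXML package.

    Works for .xlsx, .docx, and .pptx alike, because all three are
    ordinary ZIP archives.
    """
    with zipfile.ZipFile(path) as package:
        return package.namelist()
```

Calling `list_parts("report.docx")` on a hypothetical file would typically return entries such as `word/document.xml` alongside the package metadata files.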
PDF Processing
PDF Tools
| Task | Best Tool |
|---|---|
| Basic read/write | pypdf |
| Text extraction | pdfplumber |
| Table extraction | pdfplumber |
| Create PDFs | reportlab |
| OCR scanned PDFs | pytesseract + pdf2image |
| Command line | qpdf, pdftotext |
Common Operations
| Operation | Approach |
|---|---|
| Merge | Loop through files, add pages to writer |
| Split | Create new writer per page |
| Extract tables | Use pdfplumber, convert to DataFrame |
| Rotate | Call .rotate(degrees) on page (multiples of 90) |
| Encrypt | Use writer's .encrypt() method |
| OCR | Convert to images, run pytesseract |
Excel Processing
Excel Tools
| Task | Best Tool |
|---|---|
| Data analysis | pandas |
| Formulas & formatting | openpyxl |
| Simple CSV | pandas |
| Financial models | openpyxl |
Critical Rule: Use Formulas
| Approach | Result |
|---|---|
| Wrong: Calculate in Python, write value | Static number, breaks when data changes |
| Right: Write Excel formula | Dynamic, recalculates automatically |
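A minimal openpyxl sketch of the "right" approach: the total is written as a formula string instead of a number computed in Python. The function name, sheet layout, and header text are invented for illustration, and the sketch assumes `amounts` is non-empty.

```python
from openpyxl import Workbook


def build_sales_sheet(path, amounts):
    """Write amounts to column A and a live SUM formula beneath them.

    Returns the coordinate of the cell holding the formula.
    """
    wb = Workbook()
    ws = wb.active
    ws["A1"] = "Amount"
    for row, amount in enumerate(amounts, start=2):
        ws.cell(row=row, column=1, value=amount)
    last_row = len(amounts) + 1
    total_cell = ws.cell(row=last_row + 1, column=1)
    # A string starting with "=" is stored as a formula, not as text.
    total_cell.value = f"=SUM(A2:A{last_row})"
    wb.save(path)
    return total_cell.coordinate
```

openpyxl stores the formula but does not evaluate it. The value is computed when the file is opened in a spreadsheet application.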
Financial Model Standards
| Convention | Meaning |
|---|---|
| Blue text | Hardcoded inputs |
| Black text | Formulas |
| Green text | Links to other sheets |
| Yellow fill | Needs attention |
Common Formula Errors
| Error | Cause |
|---|---|
| #REF! | Invalid cell reference |
| #DIV/0! | Division by zero |
| #VALUE! | Wrong data type |
| #NAME? | Unknown function name |
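One way to act on this table is to scan computed cell values for the error strings. The sketch below works on plain rows of values, so it does not depend on any particular spreadsheet library. The helper name and the return shape are assumptions.

```python
ERROR_CAUSES = {
    "#REF!": "Invalid cell reference",
    "#DIV/0!": "Division by zero",
    "#VALUE!": "Wrong data type",
    "#NAME?": "Unknown function name",
}


def find_formula_errors(rows):
    """Return (row, column, error, cause) tuples, 1-indexed, for each error."""
    found = []
    for r, row in enumerate(rows, start=1):
        for c, value in enumerate(row, start=1):
            if isinstance(value, str) and value in ERROR_CAUSES:
                found.append((r, c, value, ERROR_CAUSES[value]))
    return found
```

The rows could come from any reader that yields cached cell values, for instance after the workbook has been recalculated and saved by a spreadsheet application.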
Word Processing
Word Tools
| Task | Best Tool |
|---|---|
| Text extraction | pandoc |
| Create new | python-docx or docx-js |
| Simple edits | python-docx |
| Tracked changes | Direct XML editing |
Document Structure
| File | Contains |
|---|---|
| word/document.xml | Main content |
| word/comments.xml | Comments |
| word/media/ | Images |
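As an illustration of reading `word/document.xml` directly, the sketch below pulls paragraph text out with the standard library. It is deliberately simple: it ignores tables, headers, footnotes, and tracked deletions, and the function name is made up for the example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, conventionally bound to the "w" prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def read_paragraphs(docx_path):
    """Return the text of each w:p paragraph in the main document part."""
    with zipfile.ZipFile(docx_path) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across one or more w:t elements.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs
```

For anything beyond a quick look, pandoc or python-docx from the tools table will handle more of the format.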
Tracked Changes (Redlining)
| Element | XML Tag |
|---|---|
| Deletion | <w:del><w:delText>...</w:delText></w:del> |
| Insertion | <w:ins><w:t>...</w:t></w:ins> |
Key concept: For professional/legal documents, use tracked changes XML rather than replacing text directly.
PowerPoint Processing
PowerPoint Tools
| Task | Best Tool |
|---|---|
| Text extraction | markitdown |
| Create new | pptxgenjs (JS) or python-pptx |
| Edit existing | Direct XML or python-pptx |
Slide Structure
| Path | Contains |
|---|---|
| ppt/slides/slide{N}.xml | Slide content |
| ppt/notesSlides/ | Speaker notes |
| ppt/slideMasters/ | Master templates |
| ppt/media/ | Images |
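Given that layout, slide text can be read without any PowerPoint library. The following sketch collects the DrawingML text runs (`a:t`) from each `ppt/slides/slide{N}.xml` part. The function name is illustrative. Note that the file number N does not necessarily match the on-screen slide order, which is defined in `ppt/presentation.xml`.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# DrawingML main namespace, conventionally bound to the "a" prefix.
A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
SLIDE_RE = re.compile(r"^ppt/slides/slide(\d+)\.xml$")


def slide_texts(pptx_path):
    """Map slide file number -> list of text runs, sorted numerically."""
    result = {}
    with zipfile.ZipFile(pptx_path) as package:
        for name in package.namelist():
            match = SLIDE_RE.match(name)
            if not match:
                continue  # skips _rels files and other package parts
            root = ET.fromstring(package.read(name))
            runs = [node.text for node in root.iter(f"{{{A_NS}}}t") if node.text]
            result[int(match.group(1))] = runs
    # Sort on the integer so slide10 comes after slide2.
    return dict(sorted(result.items()))
```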
Design Principles
| Principle | Guideline |
|---|---|
| Fonts | Use web-safe: Arial, Helvetica, Georgia |
| Layout | Two-column preferred, avoid vertical stacking |
| Hierarchy | Size, weight, color for emphasis |
| Consistency | Repeat patterns across slides |
Converting Between Formats
| Conversion | Tool |
|---|---|
| Any → PDF | LibreOffice headless |
| PDF → Images | pdftoppm |
| DOCX → Markdown | pandoc |
| Any → Text | Format-specific extractor (pdftotext, pandoc, markitdown) |
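These conversions are usually driven from the command line. The sketch below builds the argument lists and hands them to `subprocess`. The function names are illustrative, and the LibreOffice binary is assumed to be on the PATH as `soffice` (some installations name it `libreoffice`).

```python
import subprocess


def libreoffice_pdf_command(input_path, output_dir, binary="soffice"):
    """Arguments for a headless LibreOffice conversion to PDF."""
    return [binary, "--headless", "--convert-to", "pdf",
            "--outdir", output_dir, input_path]


def pdftoppm_command(pdf_path, output_prefix, dpi=150):
    """Arguments to render each PDF page as a PNG at the given resolution."""
    return ["pdftoppm", "-png", "-r", str(dpi), pdf_path, output_prefix]


def pandoc_markdown_command(docx_path, md_path):
    """Arguments for DOCX -> Markdown; pandoc infers formats from extensions."""
    return ["pandoc", docx_path, "-o", md_path]


def run(command):
    """Run a command list, raising CalledProcessError on a non-zero exit."""
    subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes the arguments easy to inspect or log before anything runs, for example `run(pandoc_markdown_command("contract.docx", "contract.md"))` with hypothetical file names.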
Best Practices
| Practice | Why |
|---|---|
| Use formulas in Excel | Dynamic calculations |
| Preserve formatting on edit | Don't lose styles |
| Test output opens correctly | Catch corruption early |
| Use tracked changes for contracts | Audit trail |
| Extract to markdown for analysis | Easier to process |
Common Packages
| Language | Packages |
|---|---|
| Python | pypdf, pdfplumber, openpyxl, python-docx, python-pptx |
| JavaScript | docx, pptxgenjs |
| CLI | pandoc, qpdf, pdftotext, libreoffice |