Microsoft Presidio is the open-source standard for PII and PHI de-identification—detect sensitive entities in text, images, and tables, then mask, hash, replace, or encrypt them before data hits logs, analytics, or LLM context windows.
Eight years in development, 8,800+ GitHub stars, 183 contributors, MIT license, and active 2026 releases (latest 2.2.362, March 2026). If you are building AI agents, RAG pipelines, or healthcare/fintech workflows, Presidio is the guardrail layer Microsoft ships instead of asking you to roll your own regex for credit cards.
TL;DR
| Topic | Detail |
|---|---|
| Repo | github.com/microsoft/presidio |
| Docs | microsoft.github.io/presidio |
| License | MIT |
| Stars | 8.8K+ |
| Components | Analyzer, Anonymizer, Image Redactor, Structured, CLI |
| Detection | Regex, NER (spaCy/Stanza/transformers), checksums, context |
| Deploy | Python, PySpark, Docker, Kubernetes |
| Images | PNG/JPEG + DICOM medical imaging |
| Caveat | Automated—not 100% recall; layer other controls |
What Is Presidio?
From the README:
Context aware, pluggable and customizable PII de-identification service for text and images.
The name comes from Latin praesidium (“protection, garrison”). Presidio’s goals:
- Democratize de-identification — transparent, auditable decisions
- Extensibility — custom recognizers for your domain
- Multi-platform — Python notebooks to K8s clusters
Microsoft’s explicit warning (do not skip):
Presidio can help identify sensitive/PII data in unstructured text. However, because it is using automated detection mechanisms, there is no guarantee that Presidio will find all sensitive information. Consequently, additional systems and protections should be employed.
That honesty matters for compliance narratives—Presidio is a strong filter, not a magic compliance checkbox.
The Five Packages
| Package | Role | PyPI |
|---|---|---|
| presidio-analyzer | Detect PII spans in text | presidio-analyzer |
| presidio-anonymizer | Replace/mask/hash detected entities | presidio-anonymizer |
| presidio-image-redactor | Redact PII in images + DICOM | presidio-image-redactor |
| presidio-structured | Column-level PII in tables | presidio-structured |
| presidio-cli | Command-line scanning | presidio-cli |
Each package publishes separately with download and coverage badges on GitHub.
Architecture: Analyze → Anonymize
Input text / image / dataframe
          │
          ▼
┌───────────────────┐
│ AnalyzerEngine    │ ← RecognizerRegistry (100+ predefined + custom)
│ + NlpEngine       │ ← spaCy / Stanza / Transformers
│ + ContextEnhancer │ ← Boost scores using surrounding words
└─────────┬─────────┘
          │ List[RecognizerResult] (entity_type, start, end, score)
          ▼
┌───────────────────┐
│ AnonymizerEngine  │ ← replace | redact | hash | encrypt | custom
└─────────┬─────────┘
          ▼
De-identified output
Key classes (Analyzer docs)
| Class | Purpose |
|---|---|
| AnalyzerEngine | Main entry—runs all recognizers on text |
| RecognizerRegistry | Holds predefined + custom EntityRecognizer instances |
| PatternRecognizer | Regex + context words + validation logic |
| NlpEngine | spaCy, Stanza, or Hugging Face backends |
| ContextAwareEnhancer | Raises a detection's score when supporting context words (lemmas) appear nearby, so weak pattern matches only pass the threshold with context |
| RecognizerResult | Detected entity type, span, confidence score |
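To make the hand-off between the two engines concrete, here is a minimal, library-free sketch of what consuming a list of detection spans involves. The `Span` dataclass, `apply_placeholders`, and `span_for` are hypothetical names for illustration, not Presidio classes; the sketch assumes non-overlapping spans and only mirrors the `(entity_type, start, end, score)` shape listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for one detection: entity type, offsets, confidence."""
    entity_type: str
    start: int
    end: int
    score: float


def apply_placeholders(text: str, spans: list) -> str:
    """Replace each span with <ENTITY_TYPE>, right to left so earlier offsets stay valid."""
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        if not 0 <= span.start < span.end <= len(text):
            raise ValueError(f"{span} does not fit a text of length {len(text)}")
        text = text[:span.start] + f"<{span.entity_type}>" + text[span.end:]
    return text


def span_for(text: str, needle: str, entity_type: str, score: float) -> Span:
    """Demo helper: locate a literal substring and wrap it as a Span."""
    start = text.index(needle)
    return Span(entity_type, start, start + len(needle), score)


sample = "Call 212-555-5555 or write to jane@example.com"
spans = [
    span_for(sample, "212-555-5555", "PHONE_NUMBER", 0.75),
    span_for(sample, "jane@example.com", "EMAIL_ADDRESS", 1.0),
]
redacted = apply_placeholders(sample, spans)
print(redacted)
```

The offsets only make sense for the exact string that was analyzed, which is why the sketch rejects spans that fall outside the text it is given.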
Quick Start (Python)
From the official getting-started guide:
pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg
Detect:
from presidio_analyzer import AnalyzerEngine
# AnalyzerEngine() loads the default spaCy model and the predefined recognizers.
analyzer = AnalyzerEngine()
# The email address is a placeholder on a reserved example domain.
text = "My phone number is 212-555-5555 and email is jane.doe@example.com"
results = analyzer.analyze(
    text=text,
    entities=["PHONE_NUMBER", "EMAIL_ADDRESS"],
    language="en",
)
for r in results:
    print(r.entity_type, r.start, r.end, r.score)
Anonymize:
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
anonymizer = AnonymizerEngine()
# Pass the same text that was analyzed: the results carry character offsets into it.
anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={"PHONE_NUMBER": OperatorConfig("replace", {"new_value": "<PHONE>"})},
)
print(anonymized.text)
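The operator you pick decides what downstream systems can still do with the value. The following is a library-free sketch of the semantics of three operator styles (replace, mask, hash); the function names and signatures are hypothetical and are not Presidio's operator implementations.

```python
import hashlib


def replace_op(value: str, new_value: str) -> str:
    """Swap the whole value for a fixed placeholder (readable, not joinable)."""
    return new_value


def mask_op(value: str, chars_to_mask: int, masking_char: str = "*") -> str:
    """Hide the first chars_to_mask characters and keep the remainder visible."""
    n = min(chars_to_mask, len(value))
    return masking_char * n + value[n:]


def hash_op(value: str) -> str:
    """Deterministic SHA-256 digest: equal inputs give equal tokens, so joins still work."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


phone = "212-555-5555"
print(replace_op(phone, "<PHONE>"))
print(mask_op(phone, 8))
print(hash_op(phone) == hash_op("212-555-5555"))
```

An unsalted hash of a low-entropy value such as a phone number can be reversed by brute force, so treat hashed columns as pseudonymized rather than anonymous.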
HTTP server (Docker):
cd presidio-analyzer
# Maps host port 5002 to container port 3000; assumes a local image tagged presidio-analyzer.
docker run -p 5002:3000 presidio-analyzer
# From a second terminal, call the host port:
curl -d '{"text":"John Smith drivers license is AC432223", "language":"en"}' \
  -H "Content-Type: application/json" \
  -X POST http://localhost:5002/analyze
The repo ships docker-compose variants for text, image, and transformer backends.
Predefined Recognizers and 2026 Updates
Presidio ships region-specific recognizers maintained by Microsoft and contributors:
| Region / entity | Notes |
|---|---|
| US | SSN, phone, credit card, driver license patterns |
| DE_* | German PII recognizers (2026) |
| PH_TIN | Philippines Tax Identification Number (#2016, June 2026) |
| Global | Email, IP, IBAN, crypto wallet, person, location, date |
Country filter: load_predefined_recognizers() now supports optional country filtering—load only recognizers relevant to your jurisdiction (reduces false positives and startup time).
LangExtract integration: Docker Compose supports language model pipelines for richer extraction (LangExtract PR #1775).
Custom recognizers: Add PatternRecognizer with regex, context words, and checksum validators—tutorial for internal employee IDs, contract numbers, etc.
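Checksum validation is what separates a real card number from any 16-digit string. Below is a minimal, library-free sketch of the idea: a regex proposes candidates and a Luhn check confirms them. The names `CARD_CANDIDATE`, `luhn_valid`, and `find_cards` are hypothetical, and this is not Presidio's own credit card recognizer.

```python
import re

# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text: str) -> list:
    """Return regex candidates that also pass the checksum."""
    return [m.group() for m in CARD_CANDIDATE.finditer(text) if luhn_valid(m.group())]


print(find_cards("Cards: 4111 1111 1111 1111 and 4111 1111 1111 1112"))
print(find_cards("order 1234567890123456"))
```

In a real recognizer the same check would typically live in the recognizer's validation step, so that candidates failing the checksum are dropped before scoring.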
Image and Structured Data
presidio-image-redactor
- Standard images (PNG, JPEG) via OCR + text redaction
- DICOM medical imaging—critical for HIPAA workflows
- Recent fix: return rendered image when no text detected (avoid empty outputs)
Pairs with our face blur / video privacy content for visual media—Presidio handles text-in-image PII; dedicated video tools handle motion blur.
presidio-structured
Scan DataFrames and tables column-by-column—useful before loading customer CSVs into warehouses or fine-tuning datasets.
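As a rough illustration of column-by-column scanning, the sketch below labels a column with an entity type when most of its non-empty values match one pattern. It uses plain dictionaries and two simplified regexes; `PATTERNS`, `classify_value`, `scan_columns`, and the `min_share` threshold are all hypothetical and do not reflect the presidio-structured API.

```python
import re
from collections import Counter

# Simplified patterns for the demo only.
PATTERNS = {
    "EMAIL_ADDRESS": re.compile(r"^[\w.+-]+@[\w-]+(?:\.[\w-]+)+$"),
    "PHONE_NUMBER": re.compile(r"^\+?\d[\d\s().-]{6,}\d$"),
}


def classify_value(value: str):
    """Return the first entity type whose pattern matches the whole value, else None."""
    for entity, pattern in PATTERNS.items():
        if pattern.match(value.strip()):
            return entity
    return None


def scan_columns(rows: list, min_share: float = 0.5) -> dict:
    """Map column name to entity type when at least min_share of its non-empty values match."""
    columns = {}
    for row in rows:
        for col, val in row.items():
            columns.setdefault(col, []).append(str(val))
    findings = {}
    for col, values in columns.items():
        non_empty = [v for v in values if v.strip()]
        counts = Counter(e for e in map(classify_value, non_empty) if e)
        if counts:
            entity, hits = counts.most_common(1)[0]
            if hits / len(non_empty) >= min_share:
                findings[col] = entity
    return findings


rows = [
    {"id": "1001", "contact": "ana@example.com", "phone": "212-555-0100"},
    {"id": "1002", "contact": "li@example.org", "phone": "+1 (212) 555-0101"},
    {"id": "1003", "contact": "", "phone": "n/a"},
]
print(scan_columns(rows))
```

Deciding per column rather than per cell keeps the output consistent: once a column is labeled, every value in it gets the same operator, including values the pattern missed.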
Presidio for AI Agents and LLM Pipelines
Presidio fits the guardrails layer—not the model layer:
| Stage | Presidio role |
|---|---|
| User prompt ingress | Strip emails, phones, SSN before LLM |
| RAG retrieval | Scan chunks; block or redact high-PII docs |
| Tool / MCP traces | Anonymize payloads in logs (AI interpretability) |
| Agent memory | De-identify before writing to LLM wiki / OKF stores |
| Export / training | Structured redaction for fine-tune corpora |
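The "block or redact" decision in the RAG row can be expressed as a small policy function. This sketch assumes you already have PII character offsets for a chunk (from any detector) and uses a hypothetical `block_ratio` threshold; the 0.3 default is illustrative, not a recommended value.

```python
def triage_chunk(chunk: str, spans: list, block_ratio: float = 0.3) -> str:
    """Return 'pass', 'redact', or 'block' from the share of characters covered by PII.

    spans is a list of (start, end) offsets; overlapping spans are counted once.
    """
    if not spans:
        return "pass"
    covered = set()
    for start, end in spans:
        covered.update(range(start, end))
    ratio = len(covered) / max(len(chunk), 1)
    return "block" if ratio >= block_ratio else "redact"


print(triage_chunk("no pii here", []))
print(triage_chunk("x" * 100, [(0, 10)]))
print(triage_chunk("x" * 100, [(0, 20), (10, 40)]))
```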
OpenAI Deployment Simulation research (our guide) emphasizes pre-release safety testing—Presidio is the kind of privacy-preserving de-identification tool teams wire into those pipelines.
Not a replacement for:
- Access control and encryption at rest
- Human review for high-stakes decisions
- DLP at the network edge
- Model-level refusals (Anthropic/OpenAI policies)
Combine Presidio with Agent Skills that encode “never log PII” procedures discoverable at /skills.
Deployment Options
| Mode | When to use |
|---|---|
| Embedded Python | Notebooks, FastAPI middleware, agent pre-processors |
| Docker Compose | Local dev; text / image / transformer stacks |
| Kubernetes | Production microservice (samples gallery) |
| PySpark | Batch de-ID on data lakes |
| presidio-cli | CI scans on exports, support tickets, chat logs |
OpenSSF Best Practices badge on the repo—supply-chain hardening includes consolidated Dependabot and pinned dependencies (2026 chore commits).
Presidio vs Alternatives
| Tool | Focus | Open source |
|---|---|---|
| Presidio | General PII/PHI, multi-modal, customizable | MIT |
| AWS Comprehend PII | Managed AWS API | Proprietary |
| Google DLP | Cloud DLP API | Proprietary |
| Microsoft Purview | Enterprise data governance | Commercial |
| Guardrails AI / NeMo | LLM output validators | Mixed |
| Regex-only | Fast, brittle | N/A |
Presidio wins when you need self-hosted, auditable, multi-language PII detection with custom recognizers and no cloud lock-in.
Running Presidio in Kubernetes (outline)
Production teams typically deploy:
- Analyzer Deployment: horizontal pod autoscaler on CPU; readiness probe on /health.
- Anonymizer sidecar or service: same release train as the analyzer to avoid schema skew.
- Secrets: encryption keys for the encrypt operator via KMS, not plaintext environment variables.
- Network policy: only the ingress gateway and batch jobs may call the analyzer's HTTP API.
- Observability: metrics on entities detected per type, p95 latency, and false-positive samples routed to a human review queue.
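For the observability item, a minimal in-process sketch of the two numbers mentioned (entity counts per type and p95 latency) might look like the following. `DetectionMetrics` is a hypothetical helper; a production service would more likely export these through its existing metrics stack.

```python
from collections import Counter


class DetectionMetrics:
    """Hypothetical collector for per-entity counts and request latencies."""

    def __init__(self):
        self.entity_counts = Counter()
        self.latencies_ms = []

    def record(self, entity_types: list, latency_ms: float) -> None:
        """Record one analyze call: the entity types it returned and how long it took."""
        self.entity_counts.update(entity_types)
        self.latencies_ms.append(latency_ms)

    def p95_latency_ms(self):
        """Nearest-rank 95th percentile, or None when nothing was recorded."""
        if not self.latencies_ms:
            return None
        ordered = sorted(self.latencies_ms)
        rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
        return ordered[rank - 1]


metrics = DetectionMetrics()
metrics.record(["EMAIL_ADDRESS", "PHONE_NUMBER"], 12.5)
metrics.record(["EMAIL_ADDRESS"], 48.0)
print(metrics.entity_counts.most_common())
print(metrics.p95_latency_ms())
```

Record entity types only, never the matched text, so the metrics path does not become a second place where PII is stored.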
Official samples gallery includes Docker and K8s references—pin image digests in manifests.
Batch PySpark jobs suit lake exports: analyze columns at scale before sharing parquet to partners.
When to choose managed DLP instead
Presidio shines self-hosted. Choose Google DLP, AWS Comprehend PII, or Purview when legal requires a vendor BAA/DPA, you want zero ML ops, and cloud spend is acceptable. Many enterprises run Presidio in CI plus cloud DLP at the edge—defense in depth, not either/or.
OpenSSF and supply chain
The Presidio repo carries OpenSSF Best Practices badge work—Dependabot consolidation, pinned dependencies, and security policy in SECURITY.md. Treat Presidio like any other production dependency: pin versions in requirements.txt, scan containers, and review recognizer PRs that add new entity types (they change detection behavior).
June 2026 highlights: Philippines PH_TIN recognizer, German DE_* pack, optional country filter on predefined recognizers, LangExtract docker path, and custom operator validate() fix in anonymizer (#2025).
Wire Presidio into agent pipelines alongside skills that encode security procedures and MCP servers that fetch data—redact before tool results enter the model, not only before logs hit Splunk.
Agent pipeline sequence (recommended)
User input → Presidio analyze → anonymize → RAG retrieve → Presidio on chunks
→ assemble prompt → LLM → Presidio on output → store redacted trace
Skipping chunk redaction is a common gap: retrieved docs often contain more PII than the user’s latest message. Run analyzer on each retrieved segment or on the assembled prompt immediately before the model call.
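The sequence can be sketched as one function with a pluggable redactor, which makes the three redaction points (input, retrieved chunks, output) hard to skip. Everything here is an assumption for illustration: `run_turn`, `regex_redact`, and the `retrieve` and `llm` callables are hypothetical, and the email-only regex stands in for an analyze-plus-anonymize call.

```python
import re
from typing import Callable

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def regex_redact(text: str) -> str:
    """Placeholder redactor; a real pipeline would run analyze + anonymize here."""
    return EMAIL.sub("<EMAIL_ADDRESS>", text)


def run_turn(
    user_input: str,
    retrieve: Callable[[str], list],
    llm: Callable[[str], str],
    redact: Callable[[str], str] = regex_redact,
):
    """Redact the input, each retrieved chunk, and the model output; return output and trace."""
    safe_input = redact(user_input)
    chunks = [redact(chunk) for chunk in retrieve(safe_input)]
    prompt = "\n\n".join(["Context:", *chunks, "Question:", safe_input])
    safe_output = redact(llm(prompt))
    trace = {"prompt": prompt, "output": safe_output}  # only redacted text is stored
    return safe_output, trace


def fake_retrieve(query: str) -> list:
    return ["Ticket from bob@corp.example about billing."]


def fake_llm(prompt: str) -> str:
    return "Reply to carol@corp.example with the invoice."


output, trace = run_turn("I am alice@corp.example, help", fake_retrieve, fake_llm)
print(trace["prompt"])
print(output)
```

Passing the redactor in as a parameter lets tests use a cheap stand-in while production wires in the real engines.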
Presidio’s June 2026 release train also improved CLI dependencies (#2058) and image redactor empty-text behavior—check CHANGELOG before upgrading production clusters.
For regulated teams, document which recognizers run in each environment (US-only vs DE vs PH_TIN) and retention for analyzer logs that might contain sensitive spans before redaction.
Stats and recognizer lists from microsoft/presidio as of June 2026.
Start with pip install presidio-analyzer presidio-anonymizer, run the quick-start snippet on sample text containing an email and phone, then add one custom recognizer for an internal ID your logs actually leak—that single spike teaches more than reading ten pages of documentation.
Browse related privacy tooling at /skills and /mcp-servers when wiring agent guardrails.
Operational Tips
- Tune thresholds: a lower score threshold catches more PII but increases false positives; use decision tracing (docs) to see why an entity was flagged
- Pick NLP backend: spaCy for speed; transformers for accuracy on messy text
- Allow lists: AnalyzerEngine supports allow lists for known-safe tokens (SECURITY.md)
- Multi-language: load language-specific spaCy models; recognizers vary by locale
- Test recall: run golden files through the analyzer before production; Microsoft does not guarantee full recall
- Anonymizer operators: choose replace for readability, hash for joinability, encrypt for reversibility under key management
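A golden-file recall check can be as small as a set comparison. The sketch below assumes both the labeled data and the detector output are reduced to `(entity_type, start, end)` tuples and counts only exact matches; `score_detection` is a hypothetical helper, and stricter or looser matching (for example, partial overlap) is a design choice left open.

```python
def score_detection(gold: set, predicted: set) -> dict:
    """Compare exact (entity_type, start, end) tuples and report recall, precision, misses."""
    true_positives = len(gold & predicted)
    recall = true_positives / len(gold) if gold else 1.0
    precision = true_positives / len(predicted) if predicted else 1.0
    return {
        "recall": recall,
        "precision": precision,
        "missed": sorted(gold - predicted),
    }


gold = {("EMAIL_ADDRESS", 0, 5), ("PHONE_NUMBER", 10, 22), ("PERSON", 30, 40)}
predicted = {("EMAIL_ADDRESS", 0, 5), ("PHONE_NUMBER", 10, 22), ("LOCATION", 50, 55)}
report = score_detection(gold, predicted)
print(report["missed"])
```

The `missed` list is the useful part in practice: each entry is a concrete false negative to review before lowering a threshold or adding a recognizer.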
FastAPI middleware pattern (agent ingress)
Wire Presidio before your LLM handler so PII never enters the model context:
from fastapi import FastAPI, Request
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
app = FastAPI()
# Build the engines once at startup; constructing them per request is expensive.
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()
@app.middleware("http")
async def redact_pii(request: Request, call_next):
    # Read the raw body and scan it as plain text.
    body = await request.body()
    text = body.decode("utf-8", errors="ignore")
    results = analyzer.analyze(text=text, language="en")
    if results:
        anon = anonymizer.anonymize(text=text, analyzer_results=results)
        # Replace the request body for downstream handlers. Overwriting the private
        # _body attribute is a shortcut whose behavior depends on the framework
        # version, and the Content-Length header is not updated here.
        request._body = anon.text.encode("utf-8")
    return await call_next(request)
Production middleware also needs allow lists for known-safe tokens (internal job IDs), async handling for large bodies, and metrics on detection rates—Microsoft documents allow-list behavior in SECURITY.md.
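One way to keep allow lists and thresholds out of the request handler is a small wrapper that takes the engines as arguments. This is a sketch under stated assumptions: `build_engines` and `redact` are hypothetical helper names, the Presidio imports are deferred so the module loads without the library installed, and `allow_list` and `score_threshold` are passed as keyword arguments to `analyze`.

```python
def build_engines():
    """Create the real engines. Imported lazily; requires Presidio and a spaCy model."""
    from presidio_analyzer import AnalyzerEngine
    from presidio_anonymizer import AnonymizerEngine
    return AnalyzerEngine(), AnonymizerEngine()


def redact(text, analyzer, anonymizer, allow_list=None, score_threshold=None, language="en"):
    """Analyze then anonymize; return (redacted_text, number_of_detections)."""
    results = analyzer.analyze(
        text=text,
        language=language,
        allow_list=allow_list,            # known-safe tokens, e.g. internal job IDs
        score_threshold=score_threshold,  # drop detections below this confidence
    )
    if not results:
        return text, 0
    anonymized = anonymizer.anonymize(text=text, analyzer_results=results)
    return anonymized.text, len(results)


# Usage, assuming Presidio is installed:
#   analyzer, anonymizer = build_engines()
#   safe_text, count = redact(body_text, analyzer, anonymizer,
#                             allow_list=["JOB-20417"], score_threshold=0.5)
```

Returning the detection count alongside the text gives the caller something to feed into detection-rate metrics without logging the matched values.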
HIPAA-oriented workflow (text + DICOM)
| Step | Presidio component | Notes |
|---|---|---|
| Ingest clinical note | Analyzer + Anonymizer | Tune thresholds; log false negative reviews |
| De-ID imaging export | Image Redactor + DICOM support | OCR text in burned-in annotations |
| Tabular exports | Structured | Column-level entity types |
| Audit | External SIEM | Presidio is not an audit log product |
Combine with access control, BAA-covered vendors, and human review for high-stakes releases—Presidio reduces accidental leakage; it does not certify compliance by itself.
Custom recognizer example (internal employee ID)
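from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer
analyzer = AnalyzerEngine()
# Matches IDs such as EMP-123456; 0.8 is the base confidence for a regex hit.
emp_id = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    patterns=[Pattern("Employee ID pattern", r"EMP-\d{6}", 0.8)],
    context=["employee", "staff", "hr"],
)
# Register the recognizer so analyzer.analyze() also returns EMPLOYEE_ID results.
analyzer.registry.add_recognizer(emp_id)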
Use context words to cut false positives on generic numeric strings. Tune score thresholds per entity type in regulated environments.
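To show how context words change a score, here is a simplified, library-free sketch: a pattern match starts at a low base score and is boosted when a context word appears within a few tokens. The function names, the 0.4 base score, the 0.35 boost, and the 5-token window are illustrative assumptions, and the exact-token matching here is cruder than the lemma-based matching Presidio describes.

```python
import re

EMP_ID = re.compile(r"EMP-\d{6}")


def score_with_context(text, start, end, context_words,
                       base_score=0.4, boost=0.35, window=5):
    """Return base_score, raised by boost when a context word sits within
    `window` word tokens on either side of text[start:end]."""
    before = re.findall(r"\w+", text[:start].lower())[-window:]
    after = re.findall(r"\w+", text[end:].lower())[:window]
    wanted = {w.lower() for w in context_words}
    if wanted.intersection(before + after):
        return min(base_score + boost, 1.0)
    return base_score


def detect(text, context_words=("employee", "staff", "hr")):
    """Return (matched_text, score) pairs for every EMP-###### match."""
    return [
        (m.group(), score_with_context(text, m.start(), m.end(), context_words))
        for m in EMP_ID.finditer(text)
    ]


print(detect("Employee badge EMP-123456 was issued"))
print(detect("Crate EMP-123456 shipped"))
```

With a threshold between the two scores, the same string is reported in an HR sentence and ignored on a shipping label, which is the false-positive reduction the context list is meant to buy.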
Contributing and Support
- Docs: microsoft.github.io/presidio
- Discussions: GitHub Discussions board
- Issues: GitHub Issues (check docs first)
- Email: presidio@microsoft.com
- CLA: Microsoft Contributor License Agreement for PRs
Recent contributor activity: Philippines TIN recognizer, German recognizers, custom operator validation fix (#2025), explicit Click dependencies for CLI packages.
Summary
Microsoft Presidio is the open-source PII/PHI workhorse for engineering teams: Analyzer to find entities, Anonymizer to scrub them, Image Redactor for DICOM and OCR text, Structured for tables, CLI for CI.
For AI agents, wire it before context assembly and after tool calls—not instead of encryption, policy, or human review. The SDK is mature, actively maintained, and honest about limits.
Start here: pip install presidio-analyzer presidio-anonymizer → getting started docs → custom recognizers for your domain IDs.
Related Reading
- Karpathy LLM Wiki Pattern
- OpenAI Deployment Simulation
- AI Interpretability and Monitoring
- Face Blur and Video Privacy
- Agent Skills Security
- Browse Agent Skills
Features and stats from microsoft/presidio README and Presidio documentation as of June 2026.