Microsoft Presidio: Open-Source PII Detection and De-Identification Guide

Microsoft Presidio detects and anonymizes PII in text, images, and structured data—8.8K GitHub stars, MIT license. Analyzer, Anonymizer, Image Redactor, CLI, spaCy/transformers, DICOM, and AI agent guardrails.

9 min read · Yash Thakker
Tags: Microsoft Presidio, PII, Data Privacy, Open Source, AI Guardrails, HIPAA

Microsoft Presidio is the open-source standard for PII and PHI de-identification—detect sensitive entities in text, images, and tables, then mask, hash, replace, or encrypt them before data hits logs, analytics, or LLM context windows.

Eight years in development, 8,800+ GitHub stars, 183 contributors, MIT license, and active 2026 releases (latest 2.2.362, March 2026). If you are building AI agents, RAG pipelines, or healthcare/fintech workflows, Presidio is the guardrail layer Microsoft ships instead of asking you to roll your own regex for credit cards.


TL;DR

| Topic | Detail |
| --- | --- |
| Repo | github.com/microsoft/presidio |
| Docs | microsoft.github.io/presidio |
| License | MIT |
| Stars | 8.8K+ |
| Components | Analyzer, Anonymizer, Image Redactor, Structured, CLI |
| Detection | Regex, NER (spaCy/Stanza/transformers), checksums, context |
| Deploy | Python, PySpark, Docker, Kubernetes |
| Images | PNG/JPEG + DICOM medical imaging |
| Caveat | Automated, not 100% recall; layer other controls |

What Is Presidio?

From the README:

Context aware, pluggable and customizable PII de-identification service for text and images.

The name comes from Latin praesidium (“protection, garrison”). Presidio’s goals:

  1. Democratize de-identification — transparent, auditable decisions
  2. Extensibility — custom recognizers for your domain
  3. Multi-platform — Python notebooks to K8s clusters

Microsoft’s explicit warning (do not skip):

Presidio can help identify sensitive/PII data in unstructured text. However, because it is using automated detection mechanisms, there is no guarantee that Presidio will find all sensitive information. Consequently, additional systems and protections should be employed.

That honesty matters for compliance narratives—Presidio is a strong filter, not a magic compliance checkbox.


The Five Packages

| Package | Role | PyPI |
| --- | --- | --- |
| presidio-analyzer | Detect PII spans in text | presidio-analyzer |
| presidio-anonymizer | Replace/mask/hash detected entities | presidio-anonymizer |
| presidio-image-redactor | Redact PII in images + DICOM | presidio-image-redactor |
| presidio-structured | Column-level PII in tables | presidio-structured |
| presidio-cli | Command-line scanning | presidio-cli |

Each package publishes separately with download and coverage badges on GitHub.


Architecture: Analyze → Anonymize

Input text / image / dataframe
        │
        ▼
┌───────────────────┐
│  AnalyzerEngine   │  ← RecognizerRegistry (100+ predefined + custom)
│  + NlpEngine      │  ← spaCy / Stanza / Transformers
│  + ContextEnhancer│  ← Boost scores using surrounding words
└─────────┬─────────┘
          │ List[RecognizerResult]  (entity_type, start, end, score)
          ▼
┌───────────────────┐
│ AnonymizerEngine  │  ← replace | redact | hash | encrypt | custom
└─────────┬─────────┘
          ▼
   De-identified output

Key classes (Analyzer docs)

| Class | Purpose |
| --- | --- |
| AnalyzerEngine | Main entry point; runs all recognizers on text |
| RecognizerRegistry | Holds predefined + custom EntityRecognizer instances |
| PatternRecognizer | Regex + context words + validation logic |
| NlpEngine | spaCy, Stanza, or Hugging Face backends |
| ContextAwareEnhancer | Uses lemmas/context to reduce false positives |
| RecognizerResult | Detected entity type, span, confidence score |
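
To make the hand-off between the two engines concrete, here is a minimal standard-library sketch of the replace step. The `Span` dataclass and `replace_spans` function are hypothetical stand-ins written for this post, not Presidio classes, and the sketch assumes the spans do not overlap. It only illustrates why a list of (entity_type, start, end, score) results is enough to rewrite the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a detection result (not Presidio's RecognizerResult)."""
    entity_type: str
    start: int
    end: int
    score: float


def replace_spans(text: str, spans: list[Span], min_score: float = 0.0) -> str:
    """Replace each kept span with <ENTITY_TYPE>.

    Spans are applied right to left so that earlier offsets stay valid
    after each substitution. Assumes spans do not overlap.
    """
    kept = [s for s in spans if s.score >= min_score]
    for s in sorted(kept, key=lambda s: s.start, reverse=True):
        text = text[:s.start] + f"<{s.entity_type}>" + text[s.end:]
    return text


sample = "Call 212-555-5555 or mail jane@example.com"
spans = [
    Span("PHONE_NUMBER", 5, 17, 0.75),
    Span("EMAIL_ADDRESS", 26, 42, 1.0),
]
redacted = replace_spans(sample, spans)
print(redacted)
```

Raising `min_score` keeps only high-confidence detections, which is the same trade-off as tuning the analyzer's score threshold.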

Quick Start (Python)

From official getting started:

pip install presidio-analyzer presidio-anonymizer
python -m spacy download en_core_web_lg

Detect:

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

results = analyzer.analyze(
    text="My phone number is 212-555-5555 and email is jane.doe@example.com",  # placeholder address
    entities=["PHONE_NUMBER", "EMAIL_ADDRESS"],
    language="en",
)

for r in results:
    print(r.entity_type, r.start, r.end, r.score)

Anonymize:

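from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# Analyze and anonymize the same string: result offsets only apply to the text they came from.
text = "My phone number is 212-555-5555"
results = AnalyzerEngine().analyze(text=text, entities=["PHONE_NUMBER"], language="en")

anonymizer = AnonymizerEngine()

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    operators={"PHONE_NUMBER": OperatorConfig("replace", {"new_value": "<PHONE>"})},
)

print(anonymized.text)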
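
The quick start uses only the replace operator, but the operator you pick decides what downstream systems can still do with the data. The sketch below is a standard-library illustration of three operator styles applied to one detected value. `apply_operator` and its `keep_last` parameter are hypothetical names for this post; this is not how Presidio implements its operators, and its real parameters differ.

```python
import hashlib


def apply_operator(value: str, operator: str, **params: str) -> str:
    """Hypothetical illustration of three operator styles on one detected value."""
    if operator == "replace":
        # Readable placeholder; the original value is gone.
        return params.get("new_value", "<REDACTED>")
    if operator == "mask":
        # Hide everything except the last `keep_last` characters.
        keep_last = int(params.get("keep_last", "4"))
        hidden = max(len(value) - keep_last, 0)
        return "*" * hidden + value[hidden:]
    if operator == "hash":
        # The same input always yields the same token, so records stay joinable.
        return hashlib.sha256(value.encode("utf-8")).hexdigest()
    raise ValueError(f"unknown operator: {operator}")


phone = "212-555-5555"
print(apply_operator(phone, "replace", new_value="<PHONE>"))
print(apply_operator(phone, "mask", keep_last="4"))
print(apply_operator(phone, "hash")[:12])
```

An unsalted hash of a low-entropy value such as a phone number can be brute-forced, so treat hashing as pseudonymization rather than anonymization.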

HTTP server (Docker):

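cd presidio-analyzer
# Assumes a presidio-analyzer image is already built or pulled locally.
docker run -p 5002:3000 presidio-analyzer

# Host port 5002 maps to the container's port 3000, so call 5002 from the host.
curl -d '{"text":"John Smith drivers license is AC432223", "language":"en"}' \
  -H "Content-Type: application/json" \
  -X POST http://localhost:5002/analyze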

Repo ships docker-compose variants for text, image, and transformer backends.


Predefined Recognizers and 2026 Updates

Presidio ships region-specific recognizers maintained by Microsoft and contributors:

| Region / entity | Notes |
| --- | --- |
| US | SSN, phone, credit card, driver license patterns |
| DE_* | German PII recognizers (2026) |
| PH_TIN | Philippines Tax Identification Number (#2016, June 2026) |
| Global | Email, IP, IBAN, crypto wallet, person, location, date |

Country filter: load_predefined_recognizers() now supports optional country filtering—load only recognizers relevant to your jurisdiction (reduces false positives and startup time).

LangExtract integration: Docker Compose supports language model pipelines for richer extraction (LangExtract PR #1775).

Custom recognizers: Add PatternRecognizer with regex, context words, and checksum validators—tutorial for internal employee IDs, contract numbers, etc.
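
Checksum validation is what separates a real card number from any 16-digit string. As a rough illustration of the regex-plus-checksum idea, here is a standard-library sketch using the Luhn algorithm. `CARD_PATTERN`, `luhn_valid`, and `find_cards` are hypothetical names for this post; Presidio's own credit card recognizer is more thorough.

```python
import re

# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_cards(text: str) -> list[str]:
    """Keep only regex matches that also pass the checksum."""
    return [m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]


text = "Card 4111 1111 1111 1111 was charged; order 1234567890123456 shipped."
print(find_cards(text))
```

The order number matches the regex but fails the checksum, so it is dropped. That is the false-positive reduction a validator buys you over a regex alone.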


Image and Structured Data

presidio-image-redactor

  • Standard images (PNG, JPEG) via OCR + text redaction
  • DICOM medical imaging—critical for HIPAA workflows
  • Recent fix: return rendered image when no text detected (avoid empty outputs)

Pairs with our face blur / video privacy content for visual media—Presidio handles text-in-image PII; dedicated video tools handle motion blur.

presidio-structured

Scan DataFrames and tables column-by-column—useful before loading customer CSVs into warehouses or fine-tuning datasets.
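
The column-by-column idea can be sketched without any dependencies: test each column's values against detectors and label the column with the entity type that most values match. Everything here (`DETECTORS`, `classify_columns`, the 0.5 cutoff) is a hypothetical illustration, not the presidio-structured API.

```python
import re
from collections import Counter

# Toy detectors; a real pipeline would call an analyzer instead.
DETECTORS = {
    "EMAIL_ADDRESS": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "PHONE_NUMBER": re.compile(r"\d{3}-\d{3}-\d{4}"),
}


def classify_columns(rows: list[dict], min_fraction: float = 0.5) -> dict[str, str]:
    """Map each column name to the entity type matched by most of its values."""
    columns: dict[str, str] = {}
    for name in rows[0]:
        values = [str(row[name]) for row in rows]
        hits: Counter = Counter()
        for value in values:
            for entity, pattern in DETECTORS.items():
                if pattern.fullmatch(value):
                    hits[entity] += 1
        if hits:
            entity, count = hits.most_common(1)[0]
            # Only label the column if enough of its values matched.
            if count / len(values) >= min_fraction:
                columns[name] = entity
    return columns


rows = [
    {"name": "order-1", "contact": "a@example.com", "phone": "212-555-0100"},
    {"name": "order-2", "contact": "b@example.com", "phone": "n/a"},
]
print(classify_columns(rows))
```

Once a column has a label, the whole column can be anonymized with one operator, which is cheaper than analyzing every cell as free text.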


Presidio for AI Agents and LLM Pipelines

Presidio fits the guardrails layer—not the model layer:

| Stage | Presidio role |
| --- | --- |
| User prompt ingress | Strip emails, phones, SSN before LLM |
| RAG retrieval | Scan chunks; block or redact high-PII docs |
| Tool / MCP traces | Anonymize payloads in logs (AI interpretability) |
| Agent memory | De-identify before writing to LLM wiki / OKF stores |
| Export / training | Structured redaction for fine-tune corpora |

OpenAI Deployment Simulation research (our guide) emphasizes pre-release safety testing—Presidio is the kind of privacy-preserving de-identification tool teams wire into those pipelines.

Not a replacement for:

  • Access control and encryption at rest
  • Human review for high-stakes decisions
  • DLP at the network edge
  • Model-level refusals (Anthropic/OpenAI policies)

Combine Presidio with Agent Skills that encode “never log PII” procedures discoverable at /skills.


Deployment Options

| Mode | When to use |
| --- | --- |
| Embedded Python | Notebooks, FastAPI middleware, agent pre-processors |
| Docker Compose | Local dev; text / image / transformer stacks |
| Kubernetes | Production microservice (samples gallery) |
| PySpark | Batch de-ID on data lakes |
| presidio-cli | CI scans on exports, support tickets, chat logs |

OpenSSF Best Practices badge on the repo—supply-chain hardening includes consolidated Dependabot and pinned dependencies (2026 chore commits).


Presidio vs Alternatives

| Tool | Focus | License |
| --- | --- | --- |
| Presidio | General PII/PHI, multi-modal, customizable | MIT |
| AWS Comprehend PII | Managed AWS API | Proprietary |
| Google DLP | Cloud DLP API | Proprietary |
| Microsoft Purview | Enterprise data governance | Commercial |
| Guardrails AI / NeMo | LLM output validators | Mixed |
| Regex-only | Fast, brittle | N/A |

Presidio wins when you need self-hosted, auditable, multi-language PII detection with custom recognizers and no cloud lock-in.


Running Presidio in Kubernetes (outline)

Production teams typically deploy:

  1. Analyzer Deployment: horizontal pod autoscaler on CPU; readiness probe on /health.
  2. Anonymizer sidecar or service: same release train as the analyzer to avoid schema skew.
  3. Secrets: encryption keys for the encrypt operator come from a KMS, not plaintext environment variables.
  4. Network policy: only the ingress gateway and batch jobs may call the analyzer's HTTP API.
  5. Observability: metrics on entities detected per type, p95 latency, and false-positive samples routed to a human review queue.

Official samples gallery includes Docker and K8s references—pin image digests in manifests.

Batch PySpark jobs suit lake exports: analyze columns at scale before sharing parquet to partners.

When to choose managed DLP instead

Presidio shines self-hosted. Choose Google DLP, AWS Comprehend PII, or Purview when legal requires a vendor BAA/DPA, you want zero ML ops, and cloud spend is acceptable. Many enterprises run Presidio in CI plus cloud DLP at the edge—defense in depth, not either/or.

OpenSSF and supply chain

The Presidio repo carries OpenSSF Best Practices badge work—Dependabot consolidation, pinned dependencies, and security policy in SECURITY.md. Treat Presidio like any other production dependency: pin versions in requirements.txt, scan containers, and review recognizer PRs that add new entity types (they change detection behavior).

June 2026 highlights: Philippines PH_TIN recognizer, German DE_* pack, optional country filter on predefined recognizers, LangExtract docker path, and custom operator validate() fix in anonymizer (#2025).

Wire Presidio into agent pipelines alongside skills that encode security procedures and MCP servers that fetch data—redact before tool results enter the model, not only before logs hit Splunk.

Agent pipeline sequence (recommended)

User input → Presidio analyze → anonymize → RAG retrieve → Presidio on chunks
    → assemble prompt → LLM → Presidio on output → store redacted trace

Skipping chunk redaction is a common gap: retrieved docs often contain more PII than the user’s latest message. Run analyzer on each retrieved segment or on the assembled prompt immediately before the model call.
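
A minimal sketch of that ordering, assuming a pluggable `redact` callable: both the user input and every retrieved chunk pass through it before the prompt is assembled. `stub_redact` and `build_prompt` are hypothetical names for this post, and the stub only catches emails with a regex. In practice you would swap in a function that runs the Presidio analyzer and anonymizer.

```python
import re
from typing import Callable

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def stub_redact(text: str) -> str:
    """Placeholder redactor; swap in a Presidio analyze + anonymize call."""
    return EMAIL.sub("<EMAIL_ADDRESS>", text)


def build_prompt(user_input: str, chunks: list[str], redact: Callable[[str], str]) -> str:
    """Redact the user input and every retrieved chunk before assembling the prompt."""
    safe_input = redact(user_input)
    safe_chunks = [redact(chunk) for chunk in chunks]
    context = "\n".join(f"- {chunk}" for chunk in safe_chunks)
    return f"Context:\n{context}\n\nQuestion: {safe_input}"


prompt = build_prompt(
    "What did bob@example.com order?",
    ["Order 17 placed by bob@example.com", "Shipping contact: alice@example.org"],
    stub_redact,
)
print(prompt)
```

Passing the redactor in as a parameter keeps the pipeline testable with a stub and lets you change recognizers without touching prompt assembly.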

Presidio’s June 2026 release train also improved CLI dependencies (#2058) and image redactor empty-text behavior—check CHANGELOG before upgrading production clusters.

For regulated teams, document which recognizers run in each environment (US-only vs DE vs PH_TIN) and retention for analyzer logs that might contain sensitive spans before redaction.

Stats and recognizer lists from microsoft/presidio as of June 2026.

Start with pip install presidio-analyzer presidio-anonymizer, run the quick-start snippet on sample text containing an email and phone, then add one custom recognizer for an internal ID your logs actually leak—that single spike teaches more than reading ten pages of documentation.

Browse related privacy tooling at /skills and /mcp-servers when wiring agent guardrails.


Operational Tips

  1. Tune thresholds: a lower score threshold catches more PII but increases false positives; use decision tracing (docs) to see why each entity was flagged
  2. Pick NLP backend: spaCy for speed; transformers for accuracy on messy text
  3. Allow lists: AnalyzerEngine supports allow lists for known-safe tokens (SECURITY.md)
  4. Multi-language: load language-specific spaCy models; recognizers vary by locale
  5. Test recall: run golden files through the analyzer before production; Microsoft does not guarantee full recall
  6. Anonymizer operators: choose replace for readability, hash for joinability, encrypt for reversibility under key management
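
For the recall tip, a golden-file check can be as small as comparing expected spans with detected ones. The sketch below uses exact matching on (entity_type, start, end) tuples; `recall_precision` and the sample data are hypothetical, and in a real test the `found` list would come from the analyzer's output.

```python
def recall_precision(expected, detected):
    """Score detections by exact match on (entity_type, start, end) tuples."""
    expected, detected = set(expected), set(detected)
    true_positives = len(expected & detected)
    recall = true_positives / len(expected) if expected else 1.0
    precision = true_positives / len(detected) if detected else 1.0
    return recall, precision


# Hand-labeled spans for one golden document.
golden = [("PHONE_NUMBER", 5, 17), ("EMAIL_ADDRESS", 26, 42), ("PERSON", 50, 60)]
# What the detector returned: it missed the PERSON and added a spurious LOCATION.
found = [("PHONE_NUMBER", 5, 17), ("EMAIL_ADDRESS", 26, 42), ("LOCATION", 70, 76)]

recall, precision = recall_precision(golden, found)
print(f"recall={recall:.2f} precision={precision:.2f}")
```

Exact matching is strict; some teams also count partial overlaps. Whichever rule you choose, track recall per entity type so a regression in one recognizer is not hidden by the others.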

FastAPI middleware pattern (agent ingress)

Wire Presidio before your LLM handler so PII never enters the model context:

from fastapi import FastAPI, Request
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

app = FastAPI()
# Build the engines once at startup; loading the NLP model is expensive.
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

@app.middleware("http")
async def redact_pii(request: Request, call_next):
    body = await request.body()
    if body:
        text = body.decode("utf-8", errors="ignore")
        # analyze() is synchronous and CPU-bound; for large bodies consider a thread pool.
        results = analyzer.analyze(text=text, language="en")
        if results:
            anon = anonymizer.anonymize(text=text, analyzer_results=results)
            # Overwriting the cached body relies on a private Starlette attribute;
            # whether downstream handlers see it depends on the framework version.
            request._body = anon.text.encode("utf-8")
    return await call_next(request)

Production middleware also needs allow lists for known-safe tokens (internal job IDs), async handling for large bodies, and metrics on detection rates—Microsoft documents allow-list behavior in SECURITY.md.
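
The allow-list idea itself is simple: drop any detection whose matched text is known to be safe. Here is a standard-library sketch with hypothetical names (`drop_allowed`, plain tuples for spans). It illustrates the behavior only and is not Presidio's implementation.

```python
def drop_allowed(text, spans, allow_list):
    """Remove detections whose matched text is on the allow list (case-insensitive)."""
    allowed = {item.lower() for item in allow_list}
    return [
        (entity_type, start, end)
        for entity_type, start, end in spans
        if text[start:end].lower() not in allowed
    ]


text = "Email support@example.com or jane@example.com"
spans = [("EMAIL_ADDRESS", 6, 25), ("EMAIL_ADDRESS", 29, 45)]

# The public support address is safe to keep; the personal one is not.
remaining = drop_allowed(text, spans, allow_list=["support@example.com"])
print(remaining)
```

Keep the allow list short and reviewed. Every entry is a value the pipeline will let through unredacted.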


HIPAA-oriented workflow (text + DICOM)

| Step | Presidio component | Notes |
| --- | --- | --- |
| Ingest clinical note | Analyzer + Anonymizer | Tune thresholds; log false negative reviews |
| De-ID imaging export | Image Redactor + DICOM support | OCR text in burned-in annotations |
| Tabular exports | Structured | Column-level entity types |
| Audit | External SIEM | Presidio is not an audit log product |

Combine with access control, BAA-covered vendors, and human review for high-stakes releases—Presidio reduces accidental leakage; it does not certify compliance by itself.


Custom recognizer example (internal employee ID)

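from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

analyzer = AnalyzerEngine()

# Matches IDs such as EMP-123456; the base score is 0.8.
emp_id = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    patterns=[Pattern("Employee ID pattern", r"EMP-\d{6}", 0.8)],
    context=["employee", "staff", "hr"],
)

# Register the recognizer so later analyze() calls can return EMPLOYEE_ID.
analyzer.registry.add_recognizer(emp_id)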

Use context words to cut false positives on generic numeric strings. Tune score thresholds per entity type in regulated environments.
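
To show how context words can move a score, here is a simplified standard-library sketch: a match gets a boost when a context word appears among the few tokens before it. `boost_with_context`, the five-token window, and the 0.35 boost are assumptions for illustration. Presidio's context enhancer works on lemmas and has its own defaults.

```python
import re


def boost_with_context(text, start, score, context_words, window=5, boost=0.35):
    """Raise a match's score when a context word appears in the preceding tokens."""
    preceding = re.findall(r"\w+", text[:start].lower())[-window:]
    if any(word in preceding for word in context_words):
        return min(score + boost, 1.0)
    return score


context = ["employee", "staff", "hr"]
pattern = re.compile(r"EMP-\d{6}")

for sentence in (
    "Employee badge EMP-123456 was revoked",
    "Part number EMP-123456 is in stock",
):
    match = pattern.search(sentence)
    # Start from a deliberately weak base score of 0.4.
    score = boost_with_context(sentence, match.start(), 0.4, context)
    print(f"{sentence!r}: {score:.2f}")
```

With a threshold between the two scores, only the sentence that mentions an employee is flagged. That is how a weak pattern plus context can beat a strict pattern alone.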


Contributing and Support

Recent contributor activity: Philippines TIN recognizer, German recognizers, custom operator validation fix (#2025), explicit Click dependencies for CLI packages.


Summary

Microsoft Presidio is the open-source PII/PHI workhorse for engineering teams: Analyzer to find entities, Anonymizer to scrub them, Image Redactor for DICOM and OCR text, Structured for tables, CLI for CI.

For AI agents, wire it before context assembly and after tool calls—not instead of encryption, policy, or human review. The SDK is mature, actively maintained, and honest about limits.

Start here: pip install presidio-analyzer presidio-anonymizer → getting started docs → custom recognizers for your domain IDs.


Related Reading

Features and stats from microsoft/presidio README and Presidio documentation as of June 2026.
