What is Microsoft Presidio?

Presidio is an open-source data protection SDK from Microsoft for detecting, redacting, masking, and anonymizing personally identifiable information (PII) and protected health information (PHI) in text, images, and structured data. It detects entities with NLP models (spaCy, Stanza, Hugging Face transformers), regex patterns, checksums, and custom recognizers. It is MIT licensed and has 8,800+ GitHub stars.

Is Presidio free to use?

Yes. Presidio is MIT-licensed open source. Install it via pip (presidio-analyzer, presidio-anonymizer, presidio-image-redactor, presidio-structured), Docker, or from source. There is no license fee; you pay only for compute and any external NLP models you choose.

How does Presidio work?

Presidio Analyzer scans text with pluggable EntityRecognizers (pattern, NER, rule-based) and returns RecognizerResult spans. Presidio Anonymizer then replaces, masks, hashes, or encrypts those spans using configurable operators. Image Redactor handles OCR text and DICOM medical images; Structured handles tabular columns.

Can Presidio run before LLM or RAG pipelines?

Yes. A common pattern is to analyze user input and retrieved chunks before they are sent to an LLM, anonymize logs and tool traces after the response, and redact exports. Presidio is not a substitute for access control or encryption. Microsoft warns that automated detection may miss sensitive data, so layer additional controls on top.

What PII types does Presidio detect out of the box?

Predefined recognizers cover credit cards, US SSN, phone numbers, emails, IP addresses, IBAN, crypto wallets, person names, locations, dates, and region-specific IDs (e.g., German DE_* recognizers, Philippines PH_TIN added June 2026). Custom PatternRecognizers extend coverage for internal IDs.

How do I install Presidio?

Run pip install presidio-analyzer presidio-anonymizer. The analyzer also requires a spaCy model (e.g., python -m spacy download en_core_web_lg). Docker Compose files ship for text, image, and transformer backends. See microsoft.github.io/presidio for Kubernetes samples and presidio-cli for command-line usage.

Microsoft Presidio: PII Detection Guide 2026 | explainx.ai Blog


Microsoft Presidio is the open-source standard for PII and PHI de-identification—detect sensitive entities in text, images, and tables, then mask, hash, replace, or encrypt them before data hits logs, analytics, or LLM context windows.

Eight years in development, 8,800+ GitHub stars, 183 contributors, MIT license, and active 2026 releases (latest 2.2.362, March 2026). If you are building AI agents, RAG pipelines, or healthcare/fintech workflows, Presidio is the guardrail layer Microsoft ships instead of asking you to roll your own regex for credit cards.

TL;DR

Topic	Detail
Repo	github.com/microsoft/presidio
Docs	microsoft.github.io/presidio
License	MIT
Stars	8.8K+
Components	Analyzer, Anonymizer, Image Redactor, Structured, CLI

Package	Role	PyPI
presidio-analyzer	Detect PII spans in text	`presidio-analyzer`
presidio-anonymizer	Replace/mask/hash detected entities	`presidio-anonymizer`
presidio-image-redactor	Redact PII in images + DICOM	`presidio-image-redactor`
presidio-structured	Column-level PII in tables	`presidio-structured`
presidio-cli	Command-line scanning	`presidio-cli`
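The column-level idea behind presidio-structured can be pictured with a small standard-library sketch: sample values from each column, run a detector on every sampled cell, and label the column with the most common entity type. The `detect` function and its two regexes are stand-ins invented for this example, not Presidio recognizers, and the real package's API and selection strategies may differ.

```python
import re
from collections import Counter
from typing import Optional

# Toy detectors standing in for an analyzer (hypothetical, for illustration only).
_EMAIL = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")
_PHONE = re.compile(r"^\d{3}-\d{3}-\d{4}$")


def detect(value: str) -> Optional[str]:
    """Return an entity label for a single cell, or None if nothing matches."""
    if _EMAIL.match(value):
        return "EMAIL_ADDRESS"
    if _PHONE.match(value):
        return "PHONE_NUMBER"
    return None


def label_columns(rows: list[dict[str, str]], sample_size: int = 100) -> dict[str, str]:
    """Map each column to the entity type detected most often in its sampled cells."""
    labels: dict[str, str] = {}
    if not rows:
        return labels
    for column in rows[0]:
        hits: Counter = Counter()
        for row in rows[:sample_size]:
            label = detect(row.get(column, ""))
            if label is not None:
                hits[label] += 1
        if hits:
            labels[column] = hits.most_common(1)[0][0]
    return labels


rows = [
    {"id": "1", "contact": "ana@example.com", "phone": "212-555-0100"},
    {"id": "2", "contact": "bo@example.org", "phone": "212-555-0101"},
    {"id": "3", "contact": "n/a", "phone": "212-555-0102"},
]
column_labels = label_columns(rows)
print(column_labels)
```

Labeling a column once and then applying one operator to the whole column is cheaper than analyzing every cell, at the cost of missing stray PII in columns that look clean in the sample.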

snippet

Input text / image / dataframe
        │
        ▼
┌───────────────────┐
│  AnalyzerEngine   │  ← RecognizerRegistry (100+ predefined + custom)
│  + NlpEngine      │  ← spaCy / Stanza / Transformers
│  + ContextEnhancer│  ← Boost scores using surrounding words
└─────────┬─────────┘
          │ List[RecognizerResult]  (entity_type, start, end, score)
          ▼
┌───────────────────┐
│ AnonymizerEngine  │  ← replace | redact | hash | encrypt | custom
└─────────┬─────────┘
          ▼
   De-identified output

Class	Purpose
AnalyzerEngine	Main entry—runs all recognizers on text
RecognizerRegistry	Holds predefined + custom EntityRecognizer instances
PatternRecognizer	Regex + context words + validation logic
NlpEngine	spaCy, Stanza, or Hugging Face backends
ContextAwareEnhancer	Raises a result's score when supporting context words (matched by lemma) appear near it
RecognizerResult	Detected entity type, span, confidence score
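To make the context-enhancement step more concrete, here is a deliberately simplified sketch of the idea: a low-confidence detection gets a higher score when a supporting word appears shortly before it. The `Detection` class, the `boost_with_context` function, and the `boost` and `floor` values are assumptions made for this example. Presidio's own enhancer works on lemmas from the NLP engine rather than whitespace tokens, so treat this as an illustration of the concept only.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Detection:
    """Hypothetical stand-in with the same four fields as a RecognizerResult."""
    entity_type: str
    start: int
    end: int
    score: float


def boost_with_context(text, detection, context_words, window=5, boost=0.35, floor=0.4):
    """Raise the score if a context word occurs in the `window` tokens before the match."""
    preceding = text[: detection.start].lower().split()[-window:]
    tokens = {token.strip(".,:;!?()") for token in preceding}
    if tokens & {word.lower() for word in context_words}:
        new_score = min(1.0, max(detection.score + boost, floor))
        return replace(detection, score=new_score)
    return detection


with_context = "Driver license number: AC432223"
start = with_context.index("AC432223")
weak = Detection("US_DRIVER_LICENSE", start, start + 8, 0.3)
boosted = boost_with_context(with_context, weak, ["driver", "license"])

without_context = "Ref AC432223"
start = without_context.index("AC432223")
plain = Detection("US_DRIVER_LICENSE", start, start + 8, 0.3)
unchanged = boost_with_context(without_context, plain, ["driver", "license"])

print(round(boosted.score, 2), unchanged.score)
```

The same alphanumeric string ends up above or below a typical acceptance threshold depending on its surroundings, which is why weak patterns are usually paired with context words.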

python

from presidio_analyzer import AnalyzerEngine

analyzer = AnalyzerEngine()

results = analyzer.analyze(
    # The address below is a placeholder; the original was lost to email obfuscation.
    text="My phone number is 212-555-5555 and email is jane.doe@example.com",
    entities=["PHONE_NUMBER", "EMAIL_ADDRESS"],
    language="en",
)

for r in results:
    print(r.entity_type, r.start, r.end, r.score)

python

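from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

text = "My phone number is 212-555-5555"

# Analyze the same string that will be anonymized so the span offsets line up.
results = AnalyzerEngine().analyze(text=text, entities=["PHONE_NUMBER"], language="en")

anonymizer = AnonymizerEngine()

anonymized = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    # Replace every detected phone number with a fixed placeholder.
    operators={"PHONE_NUMBER": OperatorConfig("replace", {"new_value": "<PHONE>"})},
)

print(anonymized.text)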
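Conceptually, the anonymizer maps each entity type to an operator and applies it to the detected spans. The sketch below mimics that with the standard library only. `Span`, `mask_tail`, `sha256_hex`, and `apply_operators` are names invented for this example, and it assumes the spans do not overlap. It is not Presidio's implementation, which also resolves conflicting results.

```python
import hashlib
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for a RecognizerResult (score omitted)."""
    entity_type: str
    start: int
    end: int


def mask_tail(value: str, keep: int = 4, char: str = "*") -> str:
    """Mask everything except the last `keep` characters."""
    hidden = max(len(value) - keep, 0)
    return char * hidden + value[hidden:]


def sha256_hex(value: str) -> str:
    """Return the SHA-256 hex digest of the value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def apply_operators(text, spans, operators, default=lambda value: "<REDACTED>"):
    """Apply one operator per entity type; unknown types fall back to `default`."""
    # Work right to left so earlier offsets stay valid after each substitution.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        operator = operators.get(span.entity_type, default)
        text = text[: span.start] + operator(text[span.start : span.end]) + text[span.end :]
    return text


text = "Call 212-555-5555 or write to ana@example.com"
phone, email = "212-555-5555", "ana@example.com"
spans = [
    Span("PHONE_NUMBER", text.index(phone), text.index(phone) + len(phone)),
    Span("EMAIL_ADDRESS", text.index(email), text.index(email) + len(email)),
]

result = apply_operators(
    text, spans, {"PHONE_NUMBER": mask_tail, "EMAIL_ADDRESS": sha256_hex}
)
print(result)
```

Masking keeps a recognizable suffix for support workflows, while hashing keeps values joinable across records without exposing them. Which one fits depends on how the output is consumed.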

bash

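# Build or pull the presidio-analyzer image first; -p maps host port 5002 to container port 3000.
cd presidio-analyzer
docker run -p 5002:3000 presidio-analyzer

# Call the host port (5002), not the container port.
curl -d '{"text":"John Smith drivers license is AC432223", "language":"en"}' \
  -H "Content-Type: application/json" \
  -X POST http://localhost:5002/analyze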
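The same call can be made from Python without extra dependencies. This is a minimal sketch that assumes the analyzer container is reachable on host port 5002, as set by the `-p 5002:3000` mapping in the docker command. The function names are ours, and the request is built separately from sending it so it can be inspected without a running service.

```python
import json
import urllib.request


def build_analyze_request(text, language="en", base_url="http://localhost:5002"):
    """Build (but do not send) a POST request for the analyzer's /analyze route."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/analyze",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(text, **kwargs):
    """Send the request and return the decoded JSON response (needs a running container)."""
    request = build_analyze_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


request = build_analyze_request("John Smith drivers license is AC432223")
print(request.get_method(), request.full_url)
```

Calling `analyze(...)` only works once the service is up. Keeping `base_url` as a parameter makes it easy to point the same client at a Kubernetes service name later.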

Region / entity	Notes
US	SSN, phone, credit card, driver license patterns
DE_*	German PII recognizers (2026)
PH_TIN	Philippines Tax Identification Number (#2016, June 2026)
Global	Email, IP, IBAN, crypto wallet, person, location, date

Stage	Presidio role
User prompt ingress	Strip emails, phones, SSN before LLM
RAG retrieval	Scan chunks; block or redact high-PII docs
Tool / MCP traces	Anonymize payloads in logs (AI interpretability)
Agent memory	De-identify before writing to LLM wiki / OKF stores
Export / training	Structured redaction for fine-tune corpora
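For the RAG retrieval stage, "block or redact" needs a concrete rule. One possible policy, sketched below with the standard library, measures how much of a chunk is covered by detected spans and blocks the chunk above a cutoff. The `Span` type, the helper names, and the 0.3 cutoff are assumptions for illustration. In practice the spans would come from the analyzer, and the cutoff would be tuned on your own data.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical stand-in for a RecognizerResult (score omitted)."""
    entity_type: str
    start: int
    end: int


def find_span(text: str, value: str, entity_type: str) -> Span:
    """Helper for the demo: locate a known value and wrap it as a span."""
    start = text.index(value)
    return Span(entity_type, start, start + len(value))


def pii_fraction(text: str, spans: list[Span]) -> float:
    """Fraction of characters covered by at least one span."""
    if not text:
        return 0.0
    covered = set()
    for span in spans:
        covered.update(range(max(span.start, 0), min(span.end, len(text))))
    return len(covered) / len(text)


def route_chunk(text: str, spans: list[Span], block_above: float = 0.3) -> str:
    """Return 'block', 'redact', or 'pass' for a retrieved chunk."""
    if pii_fraction(text, spans) > block_above:
        return "block"
    return "redact" if spans else "pass"


policy = "Refund policy: contact ana@example.com within 30 days of purchase for a full refund."
contact_card = "ana@example.com 212-555-5555"
clean = "Shipping takes 3 to 5 business days."

decisions = [
    route_chunk(policy, [find_span(policy, "ana@example.com", "EMAIL_ADDRESS")]),
    route_chunk(
        contact_card,
        [
            find_span(contact_card, "ana@example.com", "EMAIL_ADDRESS"),
            find_span(contact_card, "212-555-5555", "PHONE_NUMBER"),
        ],
    ),
    route_chunk(clean, []),
]
print(decisions)
```

A density rule like this keeps mostly informational chunks usable after redaction while dropping chunks that are little more than contact records.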

Mode	When to use
Embedded Python	Notebooks, FastAPI middleware, agent pre-processors
Docker Compose	Local dev; text / image / transformer stacks
Kubernetes	Production microservice (samples gallery)
PySpark	Batch de-ID on data lakes
presidio-cli	CI scans on exports, support tickets, chat logs

Tool	Focus	Open source
Presidio	General PII/PHI, multi-modal, customizable	MIT
AWS Comprehend PII	Managed AWS API	Proprietary
Google DLP	Cloud DLP API	Proprietary
Microsoft Purview	Enterprise data governance	Commercial
Guardrails AI / NeMo	LLM output validators	Mixed
Regex-only	Fast, brittle	N/A
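To see why regex-only detection is brittle, consider 16-digit numbers: a pattern alone cannot tell a card number from an order number, while a checksum can rule many of them out. The sketch below is a generic Luhn check written for this example, not code taken from Presidio, and the sample text is made up.

```python
import re

# A naive "card number" pattern: any standalone run of 16 digits.
CARD_PATTERN = re.compile(r"\b\d{16}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits pass the Luhn checksum used by payment cards."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Starting from the rightmost digit, double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


text = "Order 1234567812345678 was paid with card 4111111111111111."
regex_hits = CARD_PATTERN.findall(text)
checked_hits = [hit for hit in regex_hits if luhn_valid(hit)]

print(regex_hits)
print(checked_hits)
```

The regex flags both numbers, and only the well-known test card number survives the checksum. Combining patterns with validation and context is what separates a recognizer from a bare regex.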

python

from fastapi import FastAPI, Request
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

app = FastAPI()
# Build the engines once at startup; loading the NLP model is expensive.
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

@app.middleware("http")
async def redact_pii(request: Request, call_next):
    body = await request.body()
    text = body.decode("utf-8", errors="ignore")
    # analyze() is synchronous and CPU-bound, so it blocks the event loop while it runs.
    results = analyzer.analyze(text=text, language="en") if text else []
    if results:
        anon = anonymizer.anonymize(text=text, analyzer_results=results)
        # Overwriting the private _body cache is a shortcut, not a public API;
        # whether downstream handlers see the new body depends on the Starlette version.
        request._body = anon.text.encode("utf-8")
    return await call_next(request)

Step	Presidio component	Notes
Ingest clinical note	Analyzer + Anonymizer	Tune thresholds; review and log false negatives
De-ID imaging export	Image Redactor + DICOM support	OCR text in burned-in annotations
Tabular exports	Structured	Column-level entity types
Audit	External SIEM	Presidio is not an audit log product
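"Tune thresholds" in the workflow above usually means comparing detections against a small hand-labeled sample at several score cutoffs. The sketch below is one simple way to do that with exact span matching. The spans and scores are made-up illustration data, not Presidio output, and `precision_recall` is a name chosen for this example.

```python
def precision_recall(scored, gold, threshold):
    """Compute precision and recall for detections kept at a score threshold.

    scored: list of ((start, end, entity_type), score) pairs
    gold:   set of (start, end, entity_type) tuples labeled by a reviewer
    """
    predicted = {span for span, score in scored if score >= threshold}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall


# Hypothetical labeled sample for one clinical note.
gold = {(0, 10, "PERSON"), (25, 36, "US_SSN"), (50, 60, "DATE_TIME")}
scored = [
    ((0, 10, "PERSON"), 0.85),
    ((25, 36, "US_SSN"), 0.5),
    ((40, 45, "LOCATION"), 0.3),   # not in the gold set: a false positive
    ((50, 60, "DATE_TIME"), 0.6),
]

for threshold in (0.2, 0.4, 0.7):
    precision, recall = precision_recall(scored, gold, threshold)
    print(threshold, round(precision, 2), round(recall, 2))
```

For de-identification, a missed entity is typically the costlier error, so recall at the chosen threshold deserves more attention than precision. The entities dropped when the threshold rises are the false negatives worth reviewing.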

python

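from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# Match internal IDs such as EMP-123456; context words raise the score when they appear nearby.
emp_id = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    patterns=[Pattern(name="Employee ID pattern", regex=r"EMP-\d{6}", score=0.8)],
    context=["employee", "staff", "hr"],
)

# Register the recognizer on an engine so analyze() can return EMPLOYEE_ID results.
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(emp_id)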
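Before registering a pattern like the one above, it can help to try the regex on a few strings with plain `re`. The sketch below compares the pattern as written with a variant that adds word boundaries. The stricter variant and the sample strings are suggestions made for this example, not part of the original recognizer.

```python
import re

LOOSE = re.compile(r"EMP-\d{6}")        # the pattern as written
STRICT = re.compile(r"\bEMP-\d{6}\b")   # same pattern with word boundaries added

samples = [
    "Employee EMP-123456 joined HR",  # a genuine ID
    "ticket EMP-1234567",             # seven digits, so not a valid ID
    "TEMP-123456",                    # ID-like text inside a longer token
]

loose_hits = [LOOSE.findall(sample) for sample in samples]
strict_hits = [STRICT.findall(sample) for sample in samples]

for sample, loose, strict in zip(samples, loose_hits, strict_hits):
    print(f"{sample!r}: loose={loose} strict={strict}")
```

The unanchored pattern matches inside the longer number and inside `TEMP-123456`, while the bounded version only matches the genuine ID. Whether the stricter form suits you depends on how IDs appear in your data.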

Microsoft Presidio: Open-Source PII Detection and De-Identification Guide

TL;DR

What Is Presidio?

The Five Packages

Architecture: Analyze → Anonymize

Key classes (Analyzer docs)

Quick Start (Python)

Predefined Recognizers and 2026 Updates

Image and Structured Data

presidio-image-redactor

presidio-structured

Presidio for AI Agents and LLM Pipelines

Deployment Options

Presidio vs Alternatives

Running Presidio in Kubernetes (outline)

When to choose managed DLP instead

OpenSSF and supply chain

Agent pipeline sequence (recommended)

Operational Tips

FastAPI middleware pattern (agent ingress)

HIPAA-oriented workflow (text + DICOM)

Custom recognizer example (internal employee ID)

Contributing and Support

Summary
