Technical · 15 min read

50+ Types of Sensitive Data: AI Detection and Protection Guide

A practical catalog of the data you must control in LLM workflows—PII, PHI, financial identifiers, secrets, technical IDs—plus detection tactics, policy actions, and evaluation tips to keep false positives low while protecting what matters.


Alex Kim

January 3, 2025

Use this as a working field guide. Configure your redaction policies around a concrete entity catalog, and you’ll reduce leaks without drowning useful prompts in black boxes. Each section below lists examples, detection tips, common false positives, and recommended policy actions.

Personal identifiers (PII)

Names

Examples: first, last, and full names of employees, customers, and providers.

Detection: NER with context (titles, honorifics) + organization directory for higher precision.

False positives: product names, place names.

Policy: Mask with <PERSON#>; allow public figures if policy permits.
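
As a rough sketch of that approach, the snippet below runs spaCy's pretrained English pipeline and boosts confidence when a PERSON span matches a directory lookup; the model name and directory contents are placeholders for whatever your stack actually uses.

    # Sketch: spaCy NER for PERSON spans, boosted by an org directory lookup.
    import spacy

    nlp = spacy.load("en_core_web_sm")  # assumes this pretrained model is installed

    EMPLOYEE_DIRECTORY = {"dana reyes", "omar haddad"}  # hypothetical directory

    def detect_names(text: str):
        findings = []
        for ent in nlp(text).ents:
            if ent.label_ != "PERSON":
                continue
            in_directory = ent.text.lower() in EMPLOYEE_DIRECTORY
            findings.append({
                "text": ent.text,
                "start": ent.start_char,
                "end": ent.end_char,
                # Directory hits score higher; bare NER hits stay lower to avoid
                # masking product or place names that look like people.
                "confidence": 0.95 if in_directory else 0.7,
            })
        return findings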

Contact details

Emails, phones, addresses. Use patterns (RFC 5322 for email syntax, E.164 for phone numbers) plus context words ("call", "ship to").

Policy: Mask; optionally allow company-owned, public aliases.
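
A minimal pattern-based sketch, assuming a simplified email regex (the full RFC 5322 grammar is far broader) and an E.164-style phone pattern; the context words are illustrative.

    import re

    # Simplified email pattern; the full RFC 5322 grammar is much broader.
    EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
    # E.164-style numbers: leading +, country code, up to 15 digits total.
    PHONE_RE = re.compile(r"\+[1-9]\d{7,14}\b")
    CONTEXT_WORDS = ("call", "ship to", "reach me at", "contact")  # illustrative

    def detect_contacts(text: str):
        hits = [("EMAIL", m) for m in EMAIL_RE.finditer(text)]
        hits += [("PHONE", m) for m in PHONE_RE.finditer(text)]
        results = []
        for label, m in hits:
            window = text[max(0, m.start() - 40):m.end() + 40].lower()
            results.append({
                "type": label,
                "value": m.group(),
                # Nearby context words push borderline matches over the threshold.
                "context_boost": any(w in window for w in CONTEXT_WORDS),
            })
        return results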

Government IDs

SSNs/national IDs, passports, driver’s licenses. Use checksum or format rules where available (many national IDs carry check digits; US SSNs do not, so lean on format plus context); beware random 9-digit numbers.

Policy: Mask or drop; never store raw in logs.
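
US SSNs carry no check digit, so a sketch can only apply the published structural rules (area not 000, 666, or 900-999; group not 00; serial not 0000) and look for nearby context; treat the example below as illustrative, not a validator.

    import re

    SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

    def plausible_ssn(area: str, group: str, serial: str) -> bool:
        # Structural rules only; SSNs have no checksum to verify.
        if area in ("000", "666") or area.startswith("9"):
            return False
        return group != "00" and serial != "0000"

    def detect_ssns(text: str):
        findings = []
        for m in SSN_RE.finditer(text):
            if plausible_ssn(*m.groups()):
                window = text[max(0, m.start() - 30):m.start()].lower()
                findings.append({
                    "value": m.group(),
                    # Context separates real SSNs from random dash-formatted numbers.
                    "has_context": any(w in window for w in ("ssn", "social security")),
                })
        return findings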

Financial

Card data (PAN, CVV, expiry)

Use Luhn validation. Distinguish masked vs. full numbers. Never restore CVV.

Policy: Mask PAN with <PAN#>; drop CVV; restoration happens only inside PCI-scoped systems.
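
A minimal Luhn check with masking, using the <PAN#> placeholder convention from above; the candidate regex is intentionally loose and the sample strings are made up.

    import re

    CANDIDATE_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")  # 13-19 digits, separators allowed

    def luhn_ok(digits: str) -> bool:
        total = 0
        for i, ch in enumerate(reversed(digits)):
            d = int(ch)
            if i % 2 == 1:                       # double every second digit from the right
                d = d * 2 - 9 if d > 4 else d * 2
            total += d
        return total % 10 == 0

    def mask_pans(text: str) -> str:
        counter = 0
        def repl(m):
            nonlocal counter
            digits = re.sub(r"[ -]", "", m.group())
            if not luhn_ok(digits):
                return m.group()                 # leave non-Luhn digit runs (order IDs, etc.) alone
            counter += 1
            return f"<PAN{counter}>"
        return CANDIDATE_RE.sub(repl, text)

    print(mask_pans("Card 4111 1111 1111 1111, order 1234567890123"))
    # Card <PAN1>, order 1234567890123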

Accounts and routing/IBAN

Match on pattern plus country context (IBAN length and structure vary by country). Watch for test strings.

Policy: Mask; allow last-4 display where necessary post-inference.
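
One way to cut false positives is the standard IBAN mod-97 check, sketched below; country-specific length tables are assumed to be enforced elsewhere.

    import re

    IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")

    def iban_checksum_ok(candidate: str) -> bool:
        iban = candidate.replace(" ", "").upper()
        rearranged = iban[4:] + iban[:4]                  # country code + check digits go last
        numeric = "".join(str(int(ch, 36)) for ch in rearranged)  # A=10 ... Z=35
        return int(numeric) % 97 == 1

    def detect_ibans(text: str):
        return [m.group() for m in IBAN_RE.finditer(text) if iban_checksum_ok(m.group())]

    print(detect_ibans("Wire to GB82WEST12345698765432 by Friday."))  # well-known test IBAN passes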

Healthcare (PHI)

MRN, claim IDs, visit numbers

Provider-specific formats; verify via domain metadata (EHR system field labels).

Policy: Mask; restore only into payer/provider communications with approvals.
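
A sketch of provider-specific matching; both the per-system patterns and the field labels below are hypothetical and would come from each EHR's documentation.

    import re

    # Hypothetical per-system MRN formats; real patterns come from each EHR's spec.
    MRN_PATTERNS = {
        "ehr_alpha": re.compile(r"\b\d{8}\b"),
        "ehr_beta": re.compile(r"\bMR-\d{6}\b"),
    }
    FIELD_LABELS = ("mrn", "medical record number", "record #")

    def detect_mrns(text: str, system: str):
        findings = []
        for m in MRN_PATTERNS[system].finditer(text):
            window = text[max(0, m.start() - 40):m.start()].lower()
            # Require a nearby field label so bare eight-digit numbers aren't flagged.
            if any(label in window for label in FIELD_LABELS):
                findings.append(m.group())
        return findings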

Diagnosis, treatment, labs

Use medical ontologies for classification; avoid over-masking clinical meaning.

Policy: Selectively mask identifiers; allow clinical terms unless policy forbids.

Employment & education

Employee IDs, payroll, student IDs, transcripts.

Policy: Mask by default; restrict restoration to HR/registrar workflows.

Secrets & credentials

API keys, OAuth tokens, passwords, private keys.

Policy: Block and alert; never restore; rotate the credential.
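
A sketch keyed on widely documented prefixes (AWS access key IDs beginning with AKIA, classic GitHub tokens beginning with ghp_, PEM private-key headers); exact token lengths vary, and the alerting hook is a stand-in for your real workflow.

    import re

    SECRET_PATTERNS = {
        "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
        "github_token": re.compile(r"\bghp_[A-Za-z0-9]{36,}\b"),
        "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    }

    def alert_security_team(hits):
        # Stub: wire this to your real alerting and credential-rotation workflow.
        print(f"ALERT: {len(hits)} potential credential(s) detected")

    def enforce(text: str) -> str:
        hits = [
            {"type": name, "span": m.span()}
            for name, pattern in SECRET_PATTERNS.items()
            for m in pattern.finditer(text)
        ]
        if hits:
            alert_security_team(hits)
            # Block the prompt outright; never mask-and-continue, never restore.
            raise PermissionError(f"Blocked: {len(hits)} credential(s) found in prompt")
        return text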

Technical identifiers

Device IDs, IPs, session tokens, cookie values.

Policy: Mask or hash; allow aggregation metrics; never log raw values.
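
A sketch of keyed (HMAC) pseudonymization so identifiers stay joinable for aggregation without being reversible; key management is out of scope here, and the environment variable name is an assumption.

    import hashlib
    import hmac
    import os

    # The key must live in a secret manager; the env var name here is illustrative.
    PSEUDONYM_KEY = os.environ.get("REDACTION_HMAC_KEY", "dev-only-key").encode()

    def pseudonymize(value: str) -> str:
        digest = hmac.new(PSEUDONYM_KEY, value.encode(), hashlib.sha256).hexdigest()
        return digest[:16]   # a shortened token is enough to join aggregate metrics

    print(pseudonymize("203.0.113.42"))  # same input + key -> same token; raw IP never logged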

Other sensitive business data

Non-public pricing, unreleased product names, M&A code names, legal matter numbers.

Policy: Mask with domain-specific placeholders; maintain allowlists for public items.

Detection tactics that work in production

  • Hybrid detection: Patterns + ML NER + domain lists (see the sketch after this list).
  • Context windows: Examine surrounding terms for disambiguation (e.g., "invoice", "member", "DOB").
  • Entity linking: Unify duplicates across a prompt to keep references consistent.
  • Thresholds per entity: Different risk/recall trade-offs for PAN vs. first names.
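
A compact sketch of how these tactics compose, assuming detections arrive with confidence scores: each entity type gets its own threshold, and repeated values share a placeholder so references stay consistent. The thresholds and sample inputs are illustrative.

    from dataclasses import dataclass

    @dataclass
    class Detection:
        entity: str        # e.g. "PAN", "PERSON"
        value: str
        confidence: float

    # Per-entity thresholds: high-risk entities mask aggressively, names need more evidence.
    THRESHOLDS = {"PAN": 0.5, "SSN": 0.5, "PERSON": 0.8}

    def apply(text: str, detections: list) -> str:
        placeholders, counters = {}, {}
        for d in detections:
            if d.confidence < THRESHOLDS.get(d.entity, 0.9):
                continue
            key = (d.entity, d.value)
            if key not in placeholders:        # entity linking: same value, same placeholder
                counters[d.entity] = counters.get(d.entity, 0) + 1
                placeholders[key] = f"<{d.entity}{counters[d.entity]}>"
            text = text.replace(d.value, placeholders[key])
        return text

    print(apply(
        "Email Dana Reyes about Dana Reyes's card 4111 1111 1111 1111.",
        [Detection("PERSON", "Dana Reyes", 0.92),
         Detection("PAN", "4111 1111 1111 1111", 0.99)],
    ))  # Email <PERSON1> about <PERSON1>'s card <PAN1>.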

Policy actions: mask, drop, allow, hash

Define actions per entity, environment, and destination. Example: allow staff first names in internal chat but mask in outbound email drafts; always drop secrets; hash IPs when logging.
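
A sketch of that kind of policy table, keyed by entity, environment, and destination; the entries mirror the examples above and all names are illustrative.

    # (entity, environment, destination) -> action; first match wins, "*" is a wildcard.
    POLICY = [
        (("SECRET", "*", "*"), "drop"),
        (("IP", "*", "logs"), "hash"),
        (("PERSON", "prod", "internal_chat"), "allow"),
        (("PERSON", "prod", "outbound_email"), "mask"),
        (("PAN", "*", "*"), "mask"),
    ]

    def resolve(entity: str, environment: str, destination: str) -> str:
        for (ent, env, dest), action in POLICY:
            if ent in (entity, "*") and env in (environment, "*") and dest in (destination, "*"):
                return action
        return "mask"   # safe default for anything unlisted

    print(resolve("PERSON", "prod", "outbound_email"))   # mask
    print(resolve("SECRET", "dev", "internal_chat"))     # drop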

Evaluation and tuning

Build a labeled test set that reflects your domain: support tickets, emails, notes, transcripts. Track precision/recall per entity and by team. Review false positives weekly at first, then monthly. Add allowlists for frequent non-sensitive terms and denylists for dangerous strings (e.g., known secret prefixes).
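
A sketch of per-entity precision/recall tracking, assuming detections and labels are compared as (entity, start, end) tuples over the same documents; the data shapes are assumptions.

    from collections import defaultdict

    def per_entity_metrics(predicted: set, labeled: set) -> dict:
        """Both sets hold (entity_type, start, end) tuples from the same documents."""
        stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
        for item in predicted:
            stats[item[0]]["tp" if item in labeled else "fp"] += 1
        for item in labeled - predicted:
            stats[item[0]]["fn"] += 1
        report = {}
        for entity, s in stats.items():
            p = s["tp"] / (s["tp"] + s["fp"]) if s["tp"] + s["fp"] else 0.0
            r = s["tp"] / (s["tp"] + s["fn"]) if s["tp"] + s["fn"] else 0.0
            report[entity] = {"precision": round(p, 3), "recall": round(r, 3)}
        return report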

Rolling out safely

  1. Start in observe-only mode; measure detections without masking.
  2. Flip to mask for high-risk entities (PAN, SSN, secrets) first.
  3. Add restoration and approvals for workflows that need originals.
  4. Expand coverage and tighten thresholds over time.

Why this matters

A precise entity catalog is the difference between noisy blocking and quiet, scalable protection. Get the catalog right and everything downstream—restoration, audits, and user trust—gets easier.

Tags: PII detection, PHI redaction, financial data masking, secret scanning, entity catalog, LLM policy actions
