Use this as a working field guide. Configure your redaction policies around a concrete entity catalog, and you’ll reduce leaks without drowning useful prompts in black boxes. Each section below lists examples, detection tips, common false positives, and recommended policy actions.
Personal identifiers (PII)
Names
Examples: first/last and full names of employees, customers, and providers.
Detection: NER with context (titles, honorifics) + organization directory for higher precision.
False positives: product names, place names.
Policy: Mask with <PERSON#>; allow public figures if policy permits.
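As a minimal sketch of NER-plus-directory masking, assuming spaCy's pretrained English model is installed; the directory entries and placeholder format are illustrative, not a prescribed implementation:

```python
# Minimal sketch: NER-based name masking with consistent <PERSON#> placeholders.
# Assumes spaCy and its small English model are installed; the directory set is illustrative.
import spacy

nlp = spacy.load("en_core_web_sm")
EMPLOYEE_DIRECTORY = {"Jane Doe", "Aldo Rossi"}  # hypothetical org directory for extra precision

def mask_person_names(text: str) -> tuple[str, dict[str, str]]:
    doc = nlp(text)
    mapping: dict[str, str] = {}
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ != "PERSON":
            continue
        # Reuse the same placeholder for repeated mentions of the same name.
        placeholder = mapping.setdefault(ent.text, f"<PERSON{len(mapping) + 1}>")
        out.append(text[last:ent.start_char])
        out.append(placeholder)
        last = ent.end_char
    out.append(text[last:])
    return "".join(out), mapping

masked, mapping = mask_person_names("Email Jane Doe about the Q3 rollout.")
```

Keeping the placeholder mapping around is what later makes restoration and consistent cross-references possible.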
Contact details
Emails, phones, addresses. Use patterns (RFC 5322, E.164) plus context words ("call", "ship to").
Policy: Mask; optionally allow company-owned, public aliases.
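A sketch of pattern-plus-context detection; the regexes below are simplified approximations of RFC 5322 and E.164, and the context words are illustrative:

```python
# Sketch: simplified email and E.164 phone patterns, with context cues to cut noise.
import re

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
E164_RE = re.compile(r"\+[1-9]\d{7,14}\b")               # +country code, up to 15 digits total
CONTEXT_WORDS = {"call", "ship to", "contact", "reach"}   # illustrative context cues

def find_contacts(text: str) -> list[tuple[str, str]]:
    hits = [("EMAIL", m.group()) for m in EMAIL_RE.finditer(text)]
    has_context = any(w in text.lower() for w in CONTEXT_WORDS)
    for m in E164_RE.finditer(text):
        # Bare digit strings are noisy; require a context cue before flagging phones.
        if has_context:
            hits.append(("PHONE", m.group()))
    return hits
```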
Government IDs
SSN/national IDs, passports, driver’s licenses. Apply checksum or format rules where the ID scheme defines them; beware random 9-digit numbers that merely look like SSNs.
Policy: Mask or drop; never store raw in logs.
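For US SSNs specifically there is no checksum, only format rules; a sketch of the format filter (other national IDs often have real check digits you can verify instead):

```python
# Sketch: reject 9-digit strings that cannot be valid US SSNs before masking.
import re

SSN_RE = re.compile(r"(\d{3})-?(\d{2})-?(\d{4})")

def looks_like_ssn(candidate: str) -> bool:
    m = SSN_RE.fullmatch(candidate)
    if not m:
        return False
    area, group, serial = m.groups()
    if area in {"000", "666"} or area >= "900":   # reserved / never-issued area ranges
        return False
    return group != "00" and serial != "0000"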
Financial
Card data (PAN, CVV, expiry)
Use Luhn validation. Distinguish masked vs. full numbers. Never restore CVV.
Policy: Mask PAN with <PAN#>; drop CVV; restoration happens only inside PCI-scoped systems.
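A minimal Luhn check for candidate PANs; a pass is necessary but not sufficient, so combine it with length/BIN checks and context before masking:

```python
# Sketch: Luhn validation for candidate card numbers.
def luhn_ok(number: str) -> bool:
    digits = [int(c) for c in number if c.isdigit()]
    if not 13 <= len(digits) <= 19:
        return False
    total = 0
    # Double every second digit from the right; subtract 9 when the product exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

assert luhn_ok("4111 1111 1111 1111")  # well-known test PAN
```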
Accounts and routing/IBAN
Pattern matching plus country context (IBAN length and structure vary by country). Watch for known test strings.
Policy: Mask; allow last-4 display where necessary post-inference.
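A sketch of the standard IBAN mod-97 check; pair it with per-country length rules and a denylist of known test strings:

```python
# Sketch: IBAN mod-97 validation (ISO 13616 check-digit scheme).
import re

def iban_ok(candidate: str) -> bool:
    s = candidate.replace(" ", "").upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    rearranged = s[4:] + s[:4]                                # move country code + check digits to the end
    as_digits = "".join(str(int(c, 36)) for c in rearranged)  # A=10 ... Z=35
    return int(as_digits) % 97 == 1

assert iban_ok("GB82 WEST 1234 5698 7654 32")  # ISO example IBAN
```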
Healthcare (PHI)
MRN, claim IDs, visit numbers
Provider-specific formats; verify via domain metadata (EHR system field labels).
Policy: Mask; restore only into payer/provider communications with approvals.
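One way this can look in practice is a small label-to-entity table driven by your EHR metadata; the labels, window size, and digit lengths below are illustrative assumptions:

```python
# Sketch: use nearby EHR field labels as domain metadata to classify numeric tokens.
import re

LABEL_TO_ENTITY = {
    "mrn": "MRN",
    "medical record number": "MRN",
    "claim id": "CLAIM_ID",
    "visit #": "VISIT_NUMBER",
}

def classify_health_ids(text: str, window: int = 40) -> list[tuple[str, str]]:
    hits = []
    for m in re.finditer(r"\b\d{6,12}\b", text):            # provider formats vary; tune per system
        context = text[max(0, m.start() - window):m.start()].lower()
        for label, entity in LABEL_TO_ENTITY.items():
            if label in context:
                hits.append((entity, m.group()))
                break
    return hits
```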
Diagnosis, treatment, labs
Use medical ontologies for classification; avoid over-masking clinical meaning.
Policy: Mask selective identifiers; allow clinical terms unless policy forbids.
Employment & education
Employee IDs, payroll, student IDs, transcripts.
Policy: Mask by default; restrict restoration to HR/registrar workflows.
Secrets & credentials
API keys, OAuth tokens, passwords, private keys.
Policy: Block and alert; never restore; rotate the credential.
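A sketch of prefix/shape rules for common credential formats; the prefixes shown (AWS access key IDs, GitHub personal access tokens, PEM private keys) are widely documented, but in practice the ruleset should come from a maintained secret scanner:

```python
# Sketch: block-and-alert scanning for a few well-known secret shapes (illustrative list).
import re

SECRET_PATTERNS = {
    "AWS_ACCESS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "GITHUB_TOKEN": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
    "PRIVATE_KEY_BLOCK": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

def scan_for_secrets(text: str) -> list[str]:
    # On any hit: reject the prompt, alert, and rotate the credential; never restore.
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]
```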
Technical identifiers
Device IDs, IPs, session tokens, cookie values.
Policy: Mask or hash; allow aggregation metrics; never log raw values.
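A sketch of keyed hashing so logs stay joinable for aggregation without retaining raw IPs; the environment-variable name and token format are illustrative:

```python
# Sketch: pseudonymize IPs with a keyed hash before they reach logs.
import hashlib
import hmac
import os

HASH_KEY = os.environ.get("IP_HASH_KEY", "rotate-me").encode()  # hypothetical key source

def pseudonymize_ip(ip: str) -> str:
    digest = hmac.new(HASH_KEY, ip.encode(), hashlib.sha256).hexdigest()
    return f"ip_{digest[:16]}"   # stable token per IP within a key-rotation period
```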
Other sensitive business data
Non-public pricing, unreleased product names, M&A code names, legal matter numbers.
Policy: Mask with domain-specific placeholders; maintain allowlists for public items.
Detection tactics that work in production
- Hybrid detection: Patterns + ML NER + domain lists.
- Context windows: Examine surrounding terms for disambiguation (e.g., "invoice", "member", "DOB").
- Entity linking: Unify duplicates across a prompt to keep references consistent.
- Thresholds per entity: Different risk/recall trade-offs for PAN vs. first names (see the sketch after this list).
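A minimal sketch of per-entity thresholds applied to hits from a hybrid detector; the entity names, threshold values, and Hit shape are assumptions for illustration:

```python
# Sketch: per-entity confidence thresholds over ensemble hits (patterns + ML NER + domain lists).
from dataclasses import dataclass

# Lower thresholds for high-risk entities (favor recall), higher for noisy ones (favor precision).
THRESHOLDS = {"PAN": 0.30, "SSN": 0.35, "SECRET": 0.20, "PERSON": 0.80, "IP": 0.60}

@dataclass
class Hit:
    entity: str        # e.g. "PAN", "PERSON"
    text: str          # the matched span
    confidence: float  # score from the pattern/NER/list ensemble

def actionable(hit: Hit) -> bool:
    return hit.confidence >= THRESHOLDS.get(hit.entity, 0.50)
```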
Policy actions: mask, drop, allow, hash
Define actions per entity, environment, and destination. Example: allow staff first names in internal chat but mask in outbound email drafts; always drop secrets; hash IPs when logging.
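A sketch of that example as a per-entity, per-destination policy table; the entity names, destinations, and wildcard convention are illustrative:

```python
# Sketch: action lookup by (entity, destination); "*" is a catch-all destination.
POLICY = {
    ("PERSON", "internal_chat"): "allow",
    ("PERSON", "outbound_email"): "mask",
    ("SECRET", "*"): "drop",
    ("IP", "logs"): "hash",
}

def action_for(entity: str, destination: str, default: str = "mask") -> str:
    return POLICY.get((entity, destination)) or POLICY.get((entity, "*"), default)
```

Defaulting to "mask" keeps unknown combinations fail-safe rather than fail-open.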
Evaluation and tuning
Build a labeled test set that reflects your domain: support tickets, emails, notes, transcripts. Track precision/recall per entity and by team. Review false positives weekly at first, then monthly. Add allowlists for frequent non-sensitive terms and denylists for dangerous strings (e.g., known secret prefixes).
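A sketch of per-entity precision/recall over such a test set; items are (entity, span) pairs, and exact-span matching is a simplification of real evaluation:

```python
# Sketch: per-entity precision/recall from predicted vs. gold (entity, span) pairs.
from collections import defaultdict

def per_entity_metrics(predicted: set[tuple[str, str]], gold: set[tuple[str, str]]) -> dict:
    stats = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for item in predicted:
        stats[item[0]]["tp" if item in gold else "fp"] += 1
    for item in gold - predicted:
        stats[item[0]]["fn"] += 1
    report = {}
    for entity, s in stats.items():
        precision = s["tp"] / (s["tp"] + s["fp"]) if s["tp"] + s["fp"] else 0.0
        recall = s["tp"] / (s["tp"] + s["fn"]) if s["tp"] + s["fn"] else 0.0
        report[entity] = {"precision": precision, "recall": recall}
    return report
```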
Rolling out safely
- Start in observe-only mode; measure detections without masking.
- Flip to mask for high-risk entities (PAN, SSN, secrets) first.
- Add restoration and approvals for workflows that need originals.
- Expand coverage and tighten thresholds over time.
Why this matters
A precise entity catalog is the difference between noisy blocking and quiet, scalable protection. Get the catalog right and everything downstream—restoration, audits, and user trust—gets easier.