Short answer: Run redaction wherever the text first enters your control plane (often at the edge or inside your VPC), keep restoration where your highest controls live (often on-prem or in a tightly restricted VPC), and use cloud LLMs for compute-heavy inference if and when policy allows. That hybrid split gives you performance and privacy without rebuilding your stack every time policy changes.
What you’re optimizing for
- Data sensitivity: If inputs commonly contain regulated or commercially sensitive data, push redaction closest to source.
- Latency & throughput: Inference may need GPUs you don’t own; redaction is CPU-friendly and fast.
- Residency/sovereignty: Some regions require local processing and key residency; separate restoration makes this easy.
- Operational maturity: On-prem demands SRE/infra rigor; cloud gives elasticity and managed services.
- Total cost: GPUs and high-availability on-prem are expensive; redaction is comparatively cheap to run anywhere.
Four deployment archetypes
- All-cloud (fastest to start): Redaction gateway in your VPC, models via cloud APIs, restoration in a locked-down subnet with KMS/HSM. Best for low to moderate sensitivity with strong vendor controls.
- Hybrid (most common): Redaction at the edge/VPC; models primarily in cloud; restoration on-prem or in a dedicated compliance VPC per region. Balances control and speed (a minimal placement sketch follows this list).
- VPC-hosted model (private cloud): You host an open-weight model in your VPC; redaction still at ingress; restoration in a compliance subnet. Good for tighter residency and performance predictability.
- All on-prem (strictest): Redaction, models, and restoration inside your data center. Choose when regulations or contracts make external calls unacceptable. Expect higher cost and slower iteration.
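To make the hybrid archetype concrete, here is a minimal placement map in Python. Every name, region, and endpoint below is an illustrative assumption, not a prescribed configuration:

```python
# Hypothetical placement map for the hybrid archetype.
# All component names, regions, and endpoints are illustrative assumptions.
HYBRID_DEPLOYMENT = {
    "redaction_gateway": {
        "venue": "vpc-edge",      # redaction runs at ingress, inside your VPC
        "region": "eu-west-1",
    },
    "model_plane": {
        "venue": "cloud-api",     # inference via a cloud LLM provider
        "allowed_endpoints": ["https://api.example-llm.com"],  # egress allow-list
    },
    "restoration_service": {
        "venue": "on-prem",       # originals only re-enter here
        "requires": ["reason_code", "approval", "immutable_log"],
    },
}
```

The point of writing placement down as data is that moving a component (say, restoration from on-prem to a compliance VPC) becomes a reviewed config change rather than a re-architecture.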
Decision framework (scorecard)
Create a simple weighted score (1–5) for each factor: sensitivity, residency, latency, cost, staffing, vendor posture. If sensitivity and residency dominate, lean hybrid or on-prem restoration. If latency and cost dominate with low sensitivity, lean cloud with strong redaction and telemetry hygiene.
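As a worked example, here is a minimal scoring sketch in Python. The weights and per-archetype scores are illustrative assumptions you would replace with your own:

```python
# Minimal weighted scorecard; weights and scores (1-5) are illustrative assumptions.
weights = {"sensitivity": 5, "residency": 4, "latency": 3,
           "cost": 3, "staffing": 2, "vendor_posture": 2}

# Score each archetype per factor (1 = poor fit, 5 = strong fit).
archetypes = {
    "all_cloud": {"sensitivity": 2, "residency": 2, "latency": 4,
                  "cost": 5, "staffing": 5, "vendor_posture": 3},
    "hybrid":    {"sensitivity": 4, "residency": 4, "latency": 4,
                  "cost": 4, "staffing": 4, "vendor_posture": 4},
    "on_prem":   {"sensitivity": 5, "residency": 5, "latency": 3,
                  "cost": 2, "staffing": 2, "vendor_posture": 5},
}

for name, scores in archetypes.items():
    total = sum(weights[f] * scores[f] for f in weights)
    print(f"{name}: {total}")
```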
Data flow in a pragmatic hybrid
- Client → Gateway (your VPC): Ingress through a proxy/SDK that performs redaction. Only placeholders cross to the model plane. Secrets are blocked.
- Gateway → Model (cloud): The model operates on placeholders and non-sensitive context. Outputs stream back to the gateway.
- Gateway → Destinations: Most outputs remain redacted (tickets, knowledge bases). If a destination requires originals (e.g., a signed letter), the app calls the restoration service.
- App → Restoration (compliance VPC/on-prem): With a reason code and approvals, placeholders become allowed originals. Events are logged immutably (see the end-to-end sketch after this list).
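A minimal, self-contained sketch of that flow, assuming a toy email detector and hypothetical stand-ins for the model and restoration services; the placeholder format and reason codes are illustrative, not a fixed API:

```python
import re

def redact(text: str):
    """Replace email addresses with placeholders (toy detector for the sketch)."""
    placeholder_map, counter = {}, 0
    def _sub(match):
        nonlocal counter
        counter += 1
        token = f"{{{{EMAIL_{counter}}}}}"
        placeholder_map[token] = match.group(0)
        return token
    redacted = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", _sub, text)
    return redacted, placeholder_map

def call_model(redacted_text: str) -> str:
    """Stub for the cloud model call; it only ever sees placeholders."""
    return f"Draft reply: please contact {redacted_text.split()[-1]}"

def restore(text: str, placeholder_map, reason_code: str, approver: str) -> str:
    """Swap placeholders back to originals; in production this runs in the
    compliance VPC / on-prem, gated by approvals and immutably logged."""
    for token, original in placeholder_map.items():
        text = text.replace(token, original)
    return text

redacted, mapping = redact("Escalate to ops@example.com")
output = call_model(redacted)  # cloud model never sees the address
print(restore(output, mapping, reason_code="signed_letter", approver="duty_manager"))
```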
Security controls that matter more than venue
- Identity: SSO + MFA for users; workload identity for services; no shared API keys.
- Network: Egress policies that force all model traffic through your gateway; vendor endpoints allow-listed.
- Policy-as-code: Versioned rules for masking/dropping/allowing by entity and destination; approvals for changes (a minimal rules sketch follows this list).
- Telemetry hygiene: Logs/analytics cannot accept raw text; schemas enforce safe fields.
- Key management: Separate KMS/HSM for restoration; envelope encryption; short-lived tokens.
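A minimal policy-as-code sketch in Python; the entity names, destinations, and actions are illustrative assumptions, and in practice these rules would be versioned and change-controlled like any other code:

```python
# Illustrative policy rules: action keyed by (entity, destination).
# Entity and destination names are assumptions for the sketch.
POLICY_VERSION = "2024-06-01"

RULES = {
    ("EMAIL",      "cloud_model"):   "mask",
    ("PERSON",     "cloud_model"):   "mask",
    ("API_SECRET", "cloud_model"):   "drop",   # secrets never leave, even masked
    ("EMAIL",      "analytics_log"): "drop",   # telemetry hygiene: no raw PII in logs
    ("PERSON",     "signed_letter"): "allow",  # restoration path, approval required
}

def decide(entity: str, destination: str) -> str:
    # Default-deny: unknown combinations are dropped, never passed through.
    return RULES.get((entity, destination), "drop")

assert decide("API_SECRET", "cloud_model") == "drop"
assert decide("UNKNOWN", "anywhere") == "drop"
```

Default-deny is the design choice that matters here: a new entity type or destination gets the strictest treatment until someone explicitly approves a rule for it.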
Performance playbook
Redaction latency should be single-digit milliseconds per kilobyte on CPU. Stream outputs back through the gateway so you can apply post-processing guards (e.g., detect a stray email and mask it inline). Use routing tables to shift traffic between models for cost/performance (e.g., a small model for boilerplate, a larger one for complex drafting).
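A minimal routing-table sketch; the model names and the length heuristic are assumptions to tune against your own traffic:

```python
# Toy model router: cheap boilerplate goes to a small model, long or complex
# drafting to a larger one. Model names and thresholds are illustrative.
ROUTES = [
    # (predicate, model) pairs, evaluated in order.
    (lambda req: req["task"] == "boilerplate", "small-model-v1"),
    (lambda req: len(req["prompt"]) > 2000,    "large-model-v1"),
    (lambda req: True,                         "medium-model-v1"),  # default
]

def route(request: dict) -> str:
    for predicate, model in ROUTES:
        if predicate(request):
            return model
    raise RuntimeError("unreachable: the default route matches everything")

print(route({"task": "boilerplate", "prompt": "Reset password email"}))  # small-model-v1
```

Because the table is ordered data, shifting traffic between providers is an edit and redeploy of the table, not a code change in every calling app.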
Cost modeling
Estimate monthly cost as: gateway CPU + egress bandwidth + model tokens + restoration operations + SRE/ops. Redaction is cheap; tokens dominate for high-volume apps. Hybrid allows you to reserve premium models for high-ROI flows and default others to economical providers.
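A back-of-the-envelope version of that formula in Python; every rate below is a placeholder assumption, not a real price:

```python
# Back-of-the-envelope monthly cost model; all rates are placeholder assumptions.
def monthly_cost(requests: int, avg_tokens: int) -> float:
    gateway_cpu  = 400.0                          # flat: redaction is CPU-cheap
    egress       = requests * 0.000002            # bandwidth per request (assumed)
    model_tokens = requests * avg_tokens * 1e-5   # blended $/token (assumed)
    restoration  = requests * 0.02 * 0.001        # ~2% of flows restore (assumed)
    sre_ops      = 2000.0                         # fractional SRE time (assumed)
    return gateway_cpu + egress + model_tokens + restoration + sre_ops

# At 1M requests/month and 1,500 tokens each, the token line dominates:
print(f"${monthly_cost(1_000_000, 1_500):,.0f}")
```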
Operational maturity checklist
- Automated deploys for gateway/policy; CI tests for detection quality and log-field bans (a minimal test sketch follows this list).
- Runbooks for vendor outages (fallback to alt model/template).
- DR plans: secondary region/vendor; restoration key backup with split knowledge.
- Monthly failover tests and incident drills.
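As one example of those CI tests, a minimal pytest-style sketch that checks detection and the logging ban; the `redact` helper and `LOG_SCHEMA` are hypothetical stand-ins for your gateway's real interfaces:

```python
# Minimal CI-style tests; redact() and LOG_SCHEMA stand in for your
# gateway's actual detection library and log schema.
import re

def redact(text: str) -> str:
    return re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "{{EMAIL}}", text)

LOG_SCHEMA = {"timestamp", "route", "entity_counts", "latency_ms"}  # no raw-text field

def test_email_is_detected():
    assert "{{EMAIL}}" in redact("Contact jane.doe@example.com today")

def test_log_schema_bans_raw_text():
    assert "raw_text" not in LOG_SCHEMA and "prompt" not in LOG_SCHEMA

test_email_is_detected()
test_log_schema_bans_raw_text()
```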
Migration plan in six steps
- Discover: Inventory AI calls; label sensitivity by route.
- Pilot: Add gateway in observe-only; measure detections and latency.
- Enforce: Mask/drop high-risk entities; block secrets; fix logs.
- Isolate restoration: Stand up the service with approvals; wire one workflow.
- Optimize models: Route simple tasks to economical models; pin regions.
- Harden: Add DSAR support (subject-key indexing), residency constraints, and evidence dashboards.
When all on-prem really is necessary
Some healthcare, defense, or sovereign contexts require no external calls. In those cases, pick compact models fine-tuned for your tasks, invest in GPU scheduling, and temper expectations about generative complexity. Even on-prem, keep redaction and restoration separate with strict keys and logs.
KPIs for deployment success
- >90% of AI calls routed through the gateway within 60 days.
- Median redaction latency < 25 ms per request; p95 < 75 ms.
- Leak rate < 1 incident per 10k requests; mean time to detect (MTTD) < 1 hour; mean time to contain (MTTC) < 24 hours.
- Restoration events linked to approvals 100% of the time.
FAQs
Q: Can we keep using cloud models if our data is very sensitive? A: Yes, if you minimize aggressively. With placeholders, the model sees structure without secrets; for example, it might receive "Patient {{PERSON_1}} ({{EMAIL_1}}) missed the appointment" while the originals never leave your boundary. Keep restoration on-prem and block analytics from collecting raw text.
Q: Does self-hosting a model eliminate risk? A: It shifts risk, but doesn’t eliminate it. You still need redaction, telemetry hygiene, and strong identity/keys. Self-hosting adds ops complexity; choose it for sovereignty or cost/predictability, not as a silver bullet.
Q: How do we handle multi-region teams? A: Run redaction in-region, route to region-pinned models, and keep restoration keys local. Replicate policies, not data.