source of truth for security policy, making it reusable across web, mobile, and batch processing services.
2. Bidirectional Filtering: Prompt injection targets inputs, but models can leak PII, generate harmful content, or drift off-topic in outputs. The proxy must evaluate both directions to prevent data exfiltration and compliance violations.
3. Policy-Driven Evaluation: Hardcoding detector logic ties security to deployment cycles. A declarative policy format allows security teams to adjust thresholds, actions, and detector weights without touching application code.
4. Fail-Fast vs. Aggregate Scoring: Detectors run sequentially with configurable failure modes. Critical violations (e.g., jailbreak attempts) trigger immediate blocking, while lower-severity flags (e.g., off-topic drift) can be quarantined or logged for review.
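The fail-fast and aggregate behaviour in item 4 can be sketched roughly as follows. The names `Detector`, `ChainOutcome`, and `runChain`, and the `critical`/`advisory` split, are hypothetical and chosen for illustration. Downgrading advisory blocks to flags is one possible policy, not the only one.

```typescript
type Verdict = 'pass' | 'flag' | 'block';

interface Detector {
  name: string;
  // Assumption: 'critical' detectors may stop the chain; 'advisory' ones only raise flags.
  severity: 'critical' | 'advisory';
  evaluate(payload: string): Promise<{ score: number; verdict: Verdict }>;
}

interface ChainOutcome {
  verdict: Verdict;
  results: { name: string; score: number; verdict: Verdict }[];
}

async function runChain(detectors: Detector[], payload: string): Promise<ChainOutcome> {
  const results: ChainOutcome['results'] = [];
  for (const detector of detectors) {
    const { score, verdict } = await detector.evaluate(payload);
    results.push({ name: detector.name, score, verdict });
    // Fail fast: a block from a critical detector ends evaluation immediately.
    if (detector.severity === 'critical' && verdict === 'block') {
      return { verdict: 'block', results };
    }
  }
  // Aggregate: any remaining non-pass result is surfaced as a flag for review.
  const flagged = results.some(r => r.verdict !== 'pass');
  return { verdict: flagged ? 'flag' : 'pass', results };
}
```

Placing critical detectors first keeps the blocking path short, since later detectors never run once a critical violation is found.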
Implementation Example (TypeScript)
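The following implementation sketches the structure of a guardrail proxy. It wraps a standard LLM client, evaluates traffic through a detector chain, and maintains an audit trail. Detector internals are stubbed, and ./shield-core stands in for your own policy, pipeline, and audit modules.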
import { createHash } from 'node:crypto';
import { OpenAI } from 'openai';
import { PolicyEngine, DetectorPipeline, AuditRecorder } from './shield-core';

// 1. Define detector implementations
class InstructionOverrideDetector {
  async evaluate(payload: string): Promise<{ score: number; verdict: 'pass' | 'block' | 'flag' }> {
    // Semantic analysis + pattern matching for instruction override attempts
    const semanticRisk = await this.analyzeIntent(payload);
    const patternRisk = this.scanKnownPatterns(payload);
    const combined = Math.max(semanticRisk, patternRisk);
    return {
      score: combined,
      verdict: combined > 0.85 ? 'block' : combined > 0.6 ? 'flag' : 'pass'
    };
  }

  // Stubs: replace with real scoring logic that returns a risk in [0, 1].
  private async analyzeIntent(text: string): Promise<number> { /* embedding similarity */ return 0; }
  private scanKnownPatterns(text: string): number { /* regex/trie scan */ return 0; }
}

class OutputSanitizer {
  async evaluate(payload: string): Promise<{ score: number; verdict: 'pass' | 'block' | 'flag' }> {
    // PII detection + harmful content classification
    const piiScore = await this.detectSensitiveData(payload);
    const harmScore = await this.classifyContent(payload);
    const maxRisk = Math.max(piiScore, harmScore);
    return {
      score: maxRisk,
      verdict: maxRisk > 0.9 ? 'block' : maxRisk > 0.7 ? 'flag' : 'pass'
    };
  }

  // Stubs: replace with real scoring logic that returns a risk in [0, 1].
  private async detectSensitiveData(text: string): Promise<number> { /* NER/regex */ return 0; }
  private async classifyContent(text: string): Promise<number> { /* classifier */ return 0; }
}

// 2. Build the shielded client
class ShieldedLLMClient {
  private policy: PolicyEngine;
  private pipeline: DetectorPipeline;
  private auditor: AuditRecorder;
  private baseClient: OpenAI;

  constructor(config: {
    baseClient: OpenAI;
    policyPath: string;
    auditStore: string;
  }) {
    this.baseClient = config.baseClient;
    this.policy = new PolicyEngine(config.policyPath);
    this.pipeline = new DetectorPipeline([
      new InstructionOverrideDetector(),
      new OutputSanitizer()
    ]);
    this.auditor = new AuditRecorder(config.auditStore);
  }

  async chatCompletion(messages: any[], options?: any) {
    const userContent = messages.map(m => m.content).join('\n');

    // Pre-flight evaluation
    const preResult = await this.pipeline.evaluateInput(userContent);
    if (preResult.verdict === 'block') {
      // Store a digest, not the raw payload, so the audit log holds no sensitive text.
      await this.auditor.record({ type: 'input_blocked', payloadHash: this.hash(userContent), reason: preResult.reason });
      throw new Error('POLICY_VIOLATION: Input rejected by safety layer');
    }

    // Forward to LLM (options must supply the model name)
    const response = await this.baseClient.chat.completions.create({ messages, ...options });
    const modelOutput = response.choices[0]?.message?.content ?? '';

    // Post-flight evaluation
    const postResult = await this.pipeline.evaluateOutput(modelOutput);
    if (postResult.verdict === 'block') {
      await this.auditor.record({ type: 'output_blocked', payloadHash: this.hash(modelOutput), reason: postResult.reason });
      return { choices: [{ message: { content: 'Request cannot be fulfilled due to safety policy.' } }] };
    }

    // Log and return
    await this.auditor.record({
      type: 'completed',
      inputHash: this.hash(userContent),
      outputHash: this.hash(modelOutput),
      flags: postResult.verdict === 'flag' ? postResult.reason : null
    });
    return response;
  }

  private hash(str: string): string {
    // SHA-256 is one-way; plain base64 encoding is reversible and would expose the payload prefix.
    return createHash('sha256').update(str).digest('hex').slice(0, 16);
  }
}
Why This Structure Works
- Separation of Concerns: The ShieldedLLMClient handles orchestration. Detectors remain isolated, making them independently testable and replaceable.
- Deterministic Policy Resolution: The PolicyEngine parses configuration files and maps detector outputs to actions (pass, flag, block), so thresholds live in policy rather than in code. The inline thresholds in the example detectors should be read as placeholders for those policy values.
- Audit-First Design: Every request outcome is recorded, and completed requests store hashed payloads rather than raw text to preserve privacy while maintaining forensic traceability. The example awaits each audit write for simplicity; in production, queue the writes so they do not block the request path.
- Graceful Degradation: When outputs are blocked, the proxy returns a safe fallback message instead of crashing the client or exposing raw policy violations.
Pitfall Guide
1. Treating System Prompts as Security Boundaries
Explanation: Developers frequently embed security rules in system prompts (e.g., "Never reveal internal instructions"). LLMs treat these as contextual preferences, not enforceable constraints. Adversarial inputs routinely override them.
Fix: Move all security enforcement to the runtime proxy. System prompts should handle tone, formatting, and domain guidance only.
2. Over-Blocking Flagged Outputs
Explanation: Setting every detector to block on moderate risk scores creates a brittle user experience. Legitimate queries containing ambiguous phrasing get rejected, increasing support tickets and churn.
Fix: Implement tiered actions. Use flag for medium risk (quarantine for review or append disclaimer) and reserve block for high-confidence violations. Tune thresholds using production telemetry.
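A minimal sketch of tiered actions, assuming each detector yields a risk score in [0, 1]. The names `Thresholds`, `resolveAction`, `applyAction`, and the disclaimer text are illustrative assumptions; the fallback message is the one used in the example client above.

```typescript
type Action = 'pass' | 'flag' | 'block';

interface Thresholds {
  flagAt: number;  // scores at or above this value are flagged
  blockAt: number; // scores at or above this value are blocked
}

function resolveAction(score: number, t: Thresholds): Action {
  if (t.flagAt > t.blockAt) {
    throw new RangeError('flagAt must not exceed blockAt');
  }
  if (score >= t.blockAt) return 'block';
  if (score >= t.flagAt) return 'flag';
  return 'pass';
}

const DISCLAIMER = '\n\n[This response was flagged for review.]';
const FALLBACK = 'Request cannot be fulfilled due to safety policy.';

// Flagged output is still delivered, with a disclaimer; blocked output is replaced.
function applyAction(output: string, action: Action): string {
  switch (action) {
    case 'block':
      return FALLBACK;
    case 'flag':
      return output + DISCLAIMER;
    default:
      return output;
  }
}
```

Keeping the two thresholds as data rather than literals makes it straightforward to widen or narrow the flag band as telemetry comes in.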
3. Neglecting Output-Side Filtering
Explanation: Teams focus exclusively on input filtering, assuming the model will behave correctly. In reality, models can leak training data, generate harmful content, or drift into unapproved topics during generation.
Fix: Attach detectors to both request and response streams. For non-streaming endpoints, evaluate the full generated text before it reaches the client. For streaming endpoints, buffer the output and validate it incrementally, releasing each segment only after it passes.
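For streaming endpoints, one way to validate incrementally is to hold chunks back until a window of text has been checked. The sketch below assumes the model output arrives as an `AsyncIterable<string>` and that `evaluate` is your own output detector; `guardStream` and `windowSize` are hypothetical names.

```typescript
type Verdict = 'pass' | 'flag' | 'block';
type Evaluate = (text: string) => Promise<Verdict>;

async function* guardStream(
  chunks: AsyncIterable<string>,
  evaluate: Evaluate,
  windowSize = 200,
): AsyncGenerator<string> {
  let released = ''; // text already sent to the client
  let pending = '';  // text held back until it has been checked

  const check = async (text: string): Promise<void> => {
    if ((await evaluate(text)) === 'block') {
      throw new Error('POLICY_VIOLATION: stream stopped by safety layer');
    }
  };

  for await (const chunk of chunks) {
    pending += chunk;
    if (pending.length < windowSize) continue;
    // Evaluate everything generated so far so violations spanning windows are seen.
    await check(released + pending);
    yield pending;
    released += pending;
    pending = '';
  }

  // Flush whatever remains once the model stops generating.
  if (pending.length > 0) {
    await check(released + pending);
    yield pending;
  }
}
```

Text that has already been released cannot be recalled, so `windowSize` trades time-to-first-token against how much output can escape before a violation is detected.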
4. Relying on Static Keyword Lists for Topic Control
Explanation: Hardcoded allowlists or blocklists fail in dynamic contexts. RAG pipelines inject varying documents, and users phrase queries differently. Keyword matching produces high false positive/negative rates.
Fix: Use semantic embedding similarity against a curated topic corpus. Compute the cosine similarity between the query embedding and each allowed topic centroid, and treat the query as on-topic when the best match clears a threshold. Tune that threshold to the specificity of the domain.
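The centroid comparison might look like the sketch below. It assumes embeddings have already been computed elsewhere and are passed in as plain number arrays; `matchTopic` and its return shape are hypothetical.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface TopicMatch {
  topic: string | null;
  similarity: number;
  onTopic: boolean;
}

// Finds the closest allowed centroid and compares it with the threshold.
function matchTopic(
  query: number[],
  centroids: Record<string, number[]>,
  threshold: number,
): TopicMatch {
  let topic: string | null = null;
  let similarity = -1;
  for (const [name, centroid] of Object.entries(centroids)) {
    const s = cosineSimilarity(query, centroid);
    if (s > similarity) {
      similarity = s;
      topic = name;
    }
  }
  return { topic, similarity, onTopic: similarity >= threshold };
}
```

Returning the best-matching topic and its score, not just a boolean, gives the audit log enough detail to tune the threshold later.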
5. Audit Log Volume and Storage Bloat
Explanation: Logging every token, request, and detector score quickly exhausts storage and violates data retention policies. Raw payload logging also creates compliance risks.
Fix: Hash sensitive payloads, sample logs at configurable rates (e.g., 10% of passing traffic, 100% of blocked/flagged), and route logs to a time-series or append-only store. Implement automated TTL policies.
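The sampling and hashing rules can be sketched as below, assuming a Node.js runtime for `node:crypto`. The names `shouldRecord`, `toAuditEntry`, and `AuditEntry` are illustrative, and the injectable `random` parameter exists only to make the sampling decision testable.

```typescript
import { createHash } from 'node:crypto';

type Verdict = 'pass' | 'flag' | 'block';

interface AuditEntry {
  verdict: Verdict;
  payloadHash: string; // SHA-256 hex digest; the raw payload is never stored
  recordedAt: string;  // ISO 8601 timestamp, usable by a TTL job
}

// Blocked and flagged traffic is always recorded; passing traffic is sampled.
function shouldRecord(
  verdict: Verdict,
  passSampleRate: number,
  random: () => number = Math.random,
): boolean {
  if (verdict !== 'pass') return true;
  return random() < passSampleRate;
}

function toAuditEntry(verdict: Verdict, payload: string): AuditEntry {
  return {
    verdict,
    payloadHash: createHash('sha256').update(payload, 'utf8').digest('hex'),
    recordedAt: new Date().toISOString(),
  };
}
```

With `passSampleRate` set to 0.1, roughly one in ten passing requests is logged while every blocked or flagged request is kept.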
6. Sequential Detector Chaining
Explanation: Running multiple detectors sequentially adds cumulative latency. Five detectors taking 50ms each introduce 250ms of overhead, which compounds with network round trips and model inference time.
Fix: Execute independent detectors in parallel using Promise.all or async worker pools. Cache detector results for identical payload hashes. Offload heavy semantic classification to lightweight edge models or precomputed embeddings.
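A rough sketch of parallel execution with a per-payload cache, assuming detectors are independent async functions and a Node.js runtime for `node:crypto`. `ParallelPipeline` and `DetectorFn` are hypothetical names, and the cache here has no TTL or size limit, which a real deployment would need.

```typescript
import { createHash } from 'node:crypto';

type Verdict = 'pass' | 'flag' | 'block';
interface DetectorResult { score: number; verdict: Verdict }
type DetectorFn = (payload: string) => Promise<DetectorResult>;

const RANK: Record<Verdict, number> = { pass: 0, flag: 1, block: 2 };

class ParallelPipeline {
  private readonly detectors: DetectorFn[];
  // Keyed by payload hash; storing the promise also deduplicates concurrent calls.
  private readonly cache = new Map<string, Promise<DetectorResult>>();

  constructor(detectors: DetectorFn[]) {
    this.detectors = detectors;
  }

  evaluate(payload: string): Promise<DetectorResult> {
    const key = createHash('sha256').update(payload).digest('hex');
    let cached = this.cache.get(key);
    if (!cached) {
      cached = this.runAll(payload);
      // Evict failed evaluations so a transient error is not served from cache.
      cached.catch(() => this.cache.delete(key));
      this.cache.set(key, cached);
    }
    return cached;
  }

  private async runAll(payload: string): Promise<DetectorResult> {
    // All detectors start together; latency is roughly that of the slowest one.
    const results = await Promise.all(this.detectors.map(d => d(payload)));
    // Keep the highest score and the most severe verdict.
    return results.reduce<DetectorResult>(
      (worst, r) => ({
        score: Math.max(worst.score, r.score),
        verdict: RANK[r.verdict] > RANK[worst.verdict] ? r.verdict : worst.verdict,
      }),
      { score: 0, verdict: 'pass' },
    );
  }
}
```

`Promise.all` rejects as soon as any detector rejects, so whether a failed detector should fail the request or be treated as a flag is a policy decision to make explicitly.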
7. Policy Drift Across Services
Explanation: Microservices often maintain separate policy files. Over time, configurations diverge, creating inconsistent security postures and compliance gaps.
Fix: Centralize policy management using a versioned registry or configuration service. Deploy policies via CI/CD with validation checks. Enforce schema validation before any service can load a policy file.
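Schema validation before load can be as simple as the hand-rolled check below, which could run both in CI and at service start-up. It assumes the policy file has already been parsed into a plain object and only checks the `version`, `detectors[].name`, `action`, and `threshold` fields; `validatePolicy` is a hypothetical helper, and a schema library would be a reasonable substitute.

```typescript
const ACTIONS: readonly string[] = ['pass', 'flag', 'block'];

// Returns a list of human-readable problems; an empty list means the policy is loadable.
function validatePolicy(raw: unknown): string[] {
  if (typeof raw !== 'object' || raw === null) {
    return ['policy must be an object'];
  }
  const policy = raw as Record<string, unknown>;
  const errors: string[] = [];

  if (policy.version === undefined) {
    errors.push('version is required');
  }
  if (!Array.isArray(policy.detectors) || policy.detectors.length === 0) {
    errors.push('detectors must be a non-empty array');
    return errors;
  }

  policy.detectors.forEach((entry: unknown, i: number) => {
    const det = (typeof entry === 'object' && entry !== null ? entry : {}) as Record<string, unknown>;
    if (typeof det.name !== 'string' || det.name === '') {
      errors.push(`detectors[${i}].name must be a non-empty string`);
    }
    if (typeof det.action !== 'string' || !ACTIONS.includes(det.action)) {
      errors.push(`detectors[${i}].action must be one of ${ACTIONS.join(', ')}`);
    }
    if (typeof det.threshold !== 'number' || det.threshold < 0 || det.threshold > 1) {
      errors.push(`detectors[${i}].threshold must be a number between 0 and 1`);
    }
  });
  return errors;
}
```

A service would refuse to start, or keep its previous policy, whenever the returned list is non-empty.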
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Internal tooling with trusted users | Lightweight input-only filtering | Reduces latency and storage costs while catching obvious abuse | Low |
| Customer-facing chatbot or agent | Full bidirectional proxy with semantic detectors | Prevents instruction override, PII leaks, and off-topic drift | Medium |
| High-throughput batch processing | Async detector pipeline with sampling | Maintains throughput while ensuring compliance at scale | Medium-High |
| Regulated industry (healthcare/finance) | Centralized policy registry + 100% audit logging | Meets compliance requirements and enables forensic tracing | High |
Configuration Template
# safety-policy.yml
version: 2.1
enforcement_mode: proxy
detectors:
  - name: instruction_override
    type: semantic_pattern
    action: block
    threshold: 0.85
    config:
      include_jailbreak_signatures: true
      allow_developer_mode: false
  - name: sensitive_data_leak
    type: pii_classifier
    action: flag
    threshold: 0.75
    config:
      scan_output_only: true
      redact_on_flag: true
  - name: topic_drift
    type: embedding_similarity
    action: block
    threshold: 0.60
    config:
      allowed_centroids: ["billing", "account_management", "technical_support"]
      fallback_action: flag
audit:
  enabled: true
  storage: cloudwatch
  sampling_rate: 0.1
  retention_days: 90
  hash_payloads: true
performance:
  parallel_detection: true
  cache_ttl_seconds: 300
  timeout_ms: 150
Quick Start Guide
- Install the runtime package: Add the guardrail library to your project dependencies using your package manager. Verify compatibility with your LLM SDK version.
- Initialize the proxy client: Wrap your existing OpenAI or Anthropic client with the shielded wrapper. Point it to a local policy file and configure an audit destination.
- Define your first policy: Create a YAML configuration with at least one input detector and one output detector. Set conservative thresholds initially to observe flag rates.
- Route traffic through the proxy: Replace direct SDK calls with the shielded client methods. Monitor logs for blocked/flagged events and adjust thresholds based on production telemetry.
- Validate with adversarial tests: Run a curated set of injection and jailbreak prompts against your deployment. Confirm that the proxy intercepts violations before they reach the model and that audit records capture the events accurately.