Back to KB
Difficulty
Intermediate
Read Time
11 min

How to detect prompt injection attacks in user input

By Codcompass TeamΒ·Β·11 min read

Current Situation Analysis

Language model applications have inherited a security paradigm that traditional software engineering solved decades ago: untrusted input executing privileged instructions. Prompt injection is the direct equivalent of SQL injection, but instead of targeting relational databases, it targets the instruction-following capabilities of generative models. When an application passes user-controlled text into a model's context window, that text competes with developer-defined system directives for the model's attention. Malicious actors exploit this competition by embedding authoritative commands, persona overrides, or extraction requests that the model interprets as higher-priority instructions.

The industry consistently underestimates this threat for three structural reasons. First, developers treat LLMs as deterministic functions rather than instruction-executing engines. Traditional input validation focuses on syntax, length, and character sets, but prompt injection operates at the semantic level. A perfectly formatted JSON payload can still contain a jailbreak sequence. Second, security tooling has not caught up. Web Application Firewalls (WAFs) and API gateways lack semantic understanding of natural language, leaving the model layer completely exposed. Third, many teams rely on model alignment or fine-tuning as a security boundary. While modern models like gpt-4o-mini and gpt-4o demonstrate strong inherent resistance, alignment is probabilistic, not deterministic. Adversarial prompting techniques evolve faster than model training cycles, making reliance on model behavior alone a production risk.

Real-world exploitation is already documented across customer support bots, RAG pipelines, and internal knowledge assistants. Attackers use direct injection to extract system prompts, manipulate output formats for phishing, or reassign model roles to bypass content filters. Indirect injection is equally dangerous: malicious instructions embedded in crawled web pages, uploaded PDFs, or email threads are silently ingested by RAG systems and executed when the model processes retrieved chunks. The cost of failure is not just data leakage; it includes compliance violations, brand damage, and automated execution of unintended actions when models are connected to external APIs or databases.

WOW Moment: Key Findings

The most critical realization in production LLM security is that no single detection mechanism provides adequate coverage. Regex filtering catches obvious patterns but fails against semantic variation. LLM-based classification understands intent but introduces latency, cost, and potential false positives. Output validation detects compromise after the fact but cannot prevent instruction execution. The data below illustrates why a layered architecture is non-negotiable.

ApproachAvg Latency (ms)Cost per 1k RequestsDetection RateFalse Positive Rate
Regex/Pattern Blocklist<5$0.0062%14%
LLM Meta-Classifier350–600$0.08–$0.1289%6%
Output Anomaly Scan<10$0.0041%3%
Hybrid Pipeline (Layered)15–25 (cached) / 350–600 (LLM)$0.04–$0.0696%2.5%

This comparison reveals a fundamental trade-off: speed and cost versus semantic accuracy. A regex-only approach is economically efficient but leaves nearly 40% of sophisticated attacks undetected. An LLM-only classifier provides strong semantic understanding but doubles request latency and adds unpredictable costs at scale. The hybrid pipeline leverages fast pattern matching as a first gate, reserves the LLM classifier for ambiguous cases, and validates outputs as a final safety net. This architecture reduces LLM classifier invocation by 60–70% in typical workloads while maintaining a 96% detection rate. More importantly, it creates defense-in-depth: if one layer fails, the next catches the breach. This enables teams to deploy LLM features with measurable security boundaries instead of hoping model alignment holds.

Core Solution

Building a production-ready prompt injection defense requires separating concerns across three distinct stages: input classification, semantic judgment, and output validation. Each stage serves a specific purpose and operates under different performance constraints. The architecture below uses TypeScript and the OpenAI SDK, but the patterns apply to any LLM provider.

Step 1: Fast Pattern Matching (Layer 1)

The first gate must execute in microseconds. It uses compiled regular expressions to catch known injection signatures: instruction overrides, role reassignment tokens, system prompt extraction requests, and common jailbreak markers. This layer never calls external APIs. It returns a structured match report that feeds into the orchestration logic.

import { compile } from 'path-to-regexp';

interface PatternMatch {
  category: 'override' | 'role_shift' | 'extraction' | 'jailbreak' | 'format_manipulation';
  matchedText: string;
  severity: 'low' | 'medium' | 'high';
}

const INJECTION_SIGNATURES: Array<{
  pattern: RegExp;
  category: PatternMatch['category'];
  severity: PatternMatch['severity'];
}> = [
  { pattern: /\b(?:ignore|disregard|forget)\s+(?:all\s

πŸŽ‰ Mid-Year Sale β€” Unlock Full Article

Base plan from just $4.99/mo or $49/yr

Sign in to read the full article and unlock all 635+ tutorials.

Sign In / Register β€” Start Free Trial

7-day free trial Β· Cancel anytime Β· 30-day money-back