tances. Below is a modular implementation of three core analyzers: placeholder detection, XML tag validation, and sentence length enforcement.
import re
from typing import List

# DiagnosticReport is assumed to be the report type defined in the previous step.


class PlaceholderAnalyzer:
    # IGNORECASE also lets the {TOKEN} branch match lowercase names such as {user_id}.
    PATTERN = re.compile(r"\{[A-Z_]{2,}\}|\[INSERT.*?\]|TODO[:\s]|FIXME[:\s]|<placeholder>", re.IGNORECASE)

    def analyze(self, text: str) -> List[DiagnosticReport]:
        diagnostics = []
        for idx, line in enumerate(text.splitlines(), start=1):
            # Only the first match on each line is reported.
            match = self.PATTERN.search(line)
            if match:
                diagnostics.append(DiagnosticReport(
                    rule_id="no_unresolved_placeholders",
                    severity="error",
                    line_number=idx,
                    description="Unresolved template token detected",
                    offending_snippet=match.group(0)
                ))
        return diagnostics


class XmlStructureAnalyzer:
    def analyze(self, text: str) -> List[DiagnosticReport]:
        diagnostics = []
        stack = []  # (tag_name, line_number) for every tag still open
        # Matches simple tags only; tags with attributes or self-closing tags are ignored.
        tag_pattern = re.compile(r"<(/?)([\w_]+)>")
        for idx, line in enumerate(text.splitlines(), start=1):
            for match in tag_pattern.finditer(line):
                is_closing, tag_name = match.groups()
                if not is_closing:
                    stack.append((tag_name, idx))
                elif stack and stack[-1][0] == tag_name:
                    stack.pop()
                else:
                    # Closing tag does not match the most recently opened tag.
                    diagnostics.append(DiagnosticReport(
                        rule_id="no_mismatched_xml",
                        severity="error",
                        line_number=idx,
                        description=f"Unexpected closing tag </{tag_name}>",
                        offending_snippet=match.group(0)
                    ))
        # Anything left on the stack was opened but never closed.
        for tag, line_num in stack:
            diagnostics.append(DiagnosticReport(
                rule_id="no_unclosed_xml",
                severity="error",
                line_number=line_num,
                description=f"Unclosed XML tag <{tag}>",
                offending_snippet=f"<{tag}>"
            ))
        return diagnostics


class SentenceLengthAnalyzer:
    def __init__(self, max_chars: int = 200):
        self.max_chars = max_chars
        self.sentence_splitter = re.compile(r'[.!?]\s+')

    def analyze(self, text: str) -> List[DiagnosticReport]:
        diagnostics = []
        # Sentences are split within each line, so a sentence that wraps
        # across lines is measured one line at a time.
        for idx, line in enumerate(text.splitlines(), start=1):
            sentences = self.sentence_splitter.split(line)
            for sentence in sentences:
                if len(sentence.strip()) > self.max_chars:
                    diagnostics.append(DiagnosticReport(
                        rule_id="max_sentence_length",
                        severity="warning",
                        line_number=idx,
                        description=f"Sentence exceeds {self.max_chars} character limit",
                        offending_snippet=sentence.strip()[:50] + "..."
                    ))
        return diagnostics

Step 3: Assemble the Validation Engine
The engine registers analyzers, executes them sequentially, and aggregates results. It should support severity filtering and exit code mapping for CI integration.
from typing import List, Type


class PromptValidationEngine:
    def __init__(self, analyzers: List[Type], fail_on_warnings: bool = False):
        # Analyzer classes are instantiated with their default arguments.
        self.analyzers = [a() for a in analyzers]
        self.fail_on_warnings = fail_on_warnings

    def validate(self, prompt_text: str) -> List[DiagnosticReport]:
        all_diagnostics = []
        for analyzer in self.analyzers:
            all_diagnostics.extend(analyzer.analyze(prompt_text))
        return all_diagnostics

    def should_fail_pipeline(self, diagnostics: List[DiagnosticReport]) -> bool:
        has_errors = any(d.severity == "error" for d in diagnostics)
        has_warnings = any(d.severity == "warning" for d in diagnostics)
        return has_errors or (self.fail_on_warnings and has_warnings)

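The engine above aggregates diagnostics and decides whether to fail, but the severity filtering mentioned in the step description is not shown. The following is a minimal sketch of one way to add it, assuming diagnostics carry a severity string as in the analyzers. The Diagnostic stand-in, SEVERITY_RANK, filter_by_severity, and exit_code are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


# Stand-in for the article's DiagnosticReport, reduced to the fields used here (assumption).
@dataclass
class Diagnostic:
    rule_id: str
    severity: str
    line_number: int


# Hypothetical ordering; the article only uses "warning" and "error".
SEVERITY_RANK = {"warning": 1, "error": 2}


def filter_by_severity(diagnostics: List[Diagnostic], minimum: str) -> List[Diagnostic]:
    """Keep diagnostics whose severity is at or above `minimum`."""
    floor = SEVERITY_RANK[minimum]
    return [d for d in diagnostics if SEVERITY_RANK.get(d.severity, 0) >= floor]


def exit_code(diagnostics: List[Diagnostic], fail_on_warnings: bool = False) -> int:
    """Map diagnostics to a CI exit code: 1 fails the step, 0 passes."""
    threshold = "warning" if fail_on_warnings else "error"
    return 1 if filter_by_severity(diagnostics, threshold) else 0


sample = [
    Diagnostic("max_sentence_length", "warning", 3),
    Diagnostic("no_unclosed_xml", "error", 7),
]
errors_only = filter_by_severity(sample, "error")
```

Keeping the rank table separate from the filter means a later "info" level could be added without touching the exit code logic.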
Architecture Decisions & Rationale
- Rule Isolation: Each analyzer operates independently. This prevents cross-rule side effects and allows teams to swap or disable specific checks without breaking the pipeline.
- Stack-Based XML Parsing: Regex alone cannot reliably track nested or mismatched tags. A push/pop stack ensures accurate detection of unclosed or improperly ordered XML, which is critical for tool-use and structured output formats.
- Severity Separation: Errors block deployment; warnings flag stylistic concerns. This distinction prevents CI pipelines from failing on non-critical issues while maintaining visibility into prompt hygiene.
- Line-Aware Reporting: LLM prompts are often long. Mapping each violation to a line number and a short excerpt lets reviewers go straight to the defect instead of rereading the whole prompt.
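To make the stack-based rationale concrete, here is a small sketch that contrasts a naive per-tag count with a push/pop stack on a prompt whose tags are crossed. Both helper functions and the sample string are illustrative assumptions, not part of the analyzers above.

```python
import re
from typing import Dict, List

# Same simple tag shape as the analyzer: <name> or </name>, no attributes.
TAG = re.compile(r"<(/?)(\w+)>")


def balanced_by_count(text: str) -> bool:
    """Naive check: each tag name has as many closings as openings."""
    counts: Dict[str, int] = {}
    for closing, name in TAG.findall(text):
        counts[name] = counts.get(name, 0) + (-1 if closing else 1)
    return all(v == 0 for v in counts.values())


def balanced_by_stack(text: str) -> bool:
    """Stack check: every closing tag must match the most recently opened tag."""
    stack: List[str] = []
    for closing, name in TAG.findall(text):
        if not closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            return False
    return not stack


# Tags are balanced in number but closed in the wrong order.
crossed = "<rules><example></rules></example>"
count_says_ok = balanced_by_count(crossed)
stack_says_ok = balanced_by_stack(crossed)
```

The count-based check accepts the crossed input because it ignores ordering, while the stack rejects it at the first out-of-order closing tag.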
Pitfall Guide
1. Confusing Structural Linting with Semantic Evaluation
Static analysis validates syntax, structure, and style. It cannot determine whether a prompt actually produces the desired model behavior. A perfectly linted prompt can still yield poor responses if the instructions are logically flawed.
Fix: Treat linting as a pre-flight check. Use prompt-eval-rubric or similar runtime scoring tools to measure actual instruction adherence after deployment.
2. Over-Enforcing Sentence Length Limits
Setting max_sentence_length too aggressively (e.g., < 100 characters) can fragment complex instructions and degrade model comprehension. LLMs handle nuanced, multi-clause directives better when they are cohesively structured.
Fix: Configure thresholds between 180–250 characters. Allow exceptions for technical specifications or legal disclaimers where precision requires longer constructions.
3. Ignoring Template Context in Placeholder Detection
Blanket placeholder detection will flag legitimate template variables like {user_id} or {context_window}. This creates false positives that teams eventually disable entirely.
Fix: Maintain a whitelist of approved template tokens. Configure the analyzer to only flag unknown uppercase patterns, TODO markers, or generic placeholders like <FILL_IN>.
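A whitelist-aware check could look roughly like the sketch below. It is a simplification under stated assumptions: any {token} not on the whitelist is flagged regardless of case, and the TOKEN and MARKER patterns and the find_unresolved helper are hypothetical rather than part of PlaceholderAnalyzer.

```python
import re
from typing import Iterable, List, Tuple

# Any {token} made of letters, digits, or underscores (assumed token shape).
TOKEN = re.compile(r"\{[A-Za-z_][A-Za-z0-9_]*\}")
# Markers that should never survive into a finished prompt.
MARKER = re.compile(r"\[INSERT.*?\]|\b(?:TODO|FIXME)[:\s]|<FILL_IN>|<placeholder>", re.IGNORECASE)


def find_unresolved(text: str, whitelist: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line_number, snippet) for non-whitelisted tokens and for markers."""
    allowed = set(whitelist)
    hits: List[Tuple[int, str]] = []
    for idx, line in enumerate(text.splitlines(), start=1):
        for m in TOKEN.finditer(line):
            if m.group(0) not in allowed:
                hits.append((idx, m.group(0)))
        for m in MARKER.finditer(line):
            hits.append((idx, m.group(0)))
    return hits


prompt = "Greet {user_id} politely.\nSummarize {DOC_TITLE} here.\nTODO: add examples"
hits = find_unresolved(prompt, ["{user_id}", "{session_context}"])
```

Here {user_id} passes because it is whitelisted, while {DOC_TITLE} and the TODO marker are reported with their line numbers.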
4. Treating Warnings as Errors in CI
Failing pipelines on warnings slows development velocity and encourages developers to suppress lint output rather than address it.
Fix: Reserve pipeline failures for error severity. Route warnings to PR comments or dashboard alerts. Enable fail_on_warnings only in staging or release branches.
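One way to express this routing is sketched below. The branch naming convention (staging, release/...) and both helper functions are assumptions for illustration; diagnostics are represented as plain dictionaries standing in for DiagnosticReport.

```python
from typing import Dict, List, Tuple

# Stand-in for DiagnosticReport; only the "severity" key is read (assumption).
Diag = Dict[str, str]


def fail_on_warnings_for(branch: str) -> bool:
    """Treat warnings as blocking only on staging and release branches (assumed naming)."""
    return branch == "staging" or branch.startswith("release/")


def split_for_ci(diagnostics: List[Diag], branch: str) -> Tuple[List[Diag], List[Diag]]:
    """Split diagnostics into (blocking, advisory) for the given branch."""
    strict = fail_on_warnings_for(branch)
    blocking: List[Diag] = []
    advisory: List[Diag] = []
    for d in diagnostics:
        is_blocking = d["severity"] == "error" or (strict and d["severity"] == "warning")
        (blocking if is_blocking else advisory).append(d)
    return blocking, advisory


found = [
    {"rule_id": "max_sentence_length", "severity": "warning"},
    {"rule_id": "no_unclosed_xml", "severity": "error"},
]
blocking, advisory = split_for_ci(found, "feature/new-tone")
```

The advisory list is what would be posted as a PR comment or sent to a dashboard, while only the blocking list affects the exit code.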
5. Assuming Lint Catches Injection Attacks
Static analysis operates on prompt text in isolation. It cannot detect runtime injection payloads that depend on user input, tool responses, or dynamic context assembly.
Fix: Deploy prompt-shield or equivalent runtime injection detectors in the execution layer. Lint handles authoring safety; runtime tools handle adversarial input.
6. Skipping Version Fingerprinting
Prompts evolve rapidly. Without versioning, teams cannot tell which lint rules applied to a specific prompt iteration, which makes rollbacks and audits unreliable.
Fix: Integrate prompt-template-version to generate content hashes and semantic version tags. Store lint results alongside version metadata in your artifact registry.
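Independent of any particular tool, the content-hash idea can be sketched with the standard library alone. The fingerprint and lint_record helpers, the record layout, and the sample version string are hypothetical; they do not reflect the interface of prompt-template-version.

```python
import hashlib
import json
from typing import List


def fingerprint(prompt_text: str) -> str:
    """SHA-256 hex digest of the prompt, with line endings normalized first."""
    normalized = prompt_text.replace("\r\n", "\n").strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def lint_record(prompt_text: str, version: str, rule_ids: List[str]) -> str:
    """Serialize a record tying lint results to one exact prompt revision."""
    return json.dumps(
        {
            "version": version,
            "sha256": fingerprint(prompt_text),
            "violated_rules": sorted(rule_ids),
        },
        sort_keys=True,
    )


record = json.loads(
    lint_record("You are a helpful assistant.\n", "1.4.0", ["max_sentence_length"])
)
```

Normalizing line endings before hashing keeps the fingerprint stable when the same prompt is checked out on different operating systems.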
7. Hardcoding Rules Instead of Using Configuration
Embedding rule thresholds and enabled/disabled states directly in code creates maintenance debt. Different projects require different strictness levels.
Fix: Externalize rule configuration into YAML or JSON. Load thresholds, severity mappings, and rule toggles at runtime. This enables team-specific overrides without code changes.
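As a rough illustration of externalized configuration, the sketch below parses a JSON document and keeps only the enabled rules. JSON is used so the example needs no third-party packages; the DEFAULTS table, the load_rules helper, and the rule names in the sample are assumptions for illustration.

```python
import json
from typing import Any, Dict

# Fallback settings applied when a rule omits a key (hypothetical defaults).
DEFAULTS: Dict[str, Any] = {"enabled": True, "severity": "warning"}


def load_rules(raw: str) -> Dict[str, Dict[str, Any]]:
    """Parse a JSON config and return merged settings for enabled rules only."""
    config = json.loads(raw)
    rules: Dict[str, Dict[str, Any]] = {}
    for rule_id, settings in config.get("rules", {}).items():
        merged = {**DEFAULTS, **settings}
        if merged["enabled"]:
            rules[rule_id] = merged
    return rules


raw_config = """
{
  "rules": {
    "max_sentence_length": {"severity": "warning", "threshold_chars": 210},
    "min_specificity": {"enabled": false}
  }
}
"""
active = load_rules(raw_config)
max_chars = active["max_sentence_length"]["threshold_chars"]
```

A value such as max_chars could then be passed to SentenceLengthAnalyzer(max_chars=...), although an engine that instantiates analyzer classes without arguments would need to accept pre-built instances for that to work.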
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Pre-merge prompt changes | Static linting (prompt-lint) | Catches structural defects before merge; zero API cost | $0 |
| Post-deployment behavior tuning | Runtime scoring (prompt-eval-rubric) | Measures actual model adherence to instructions | Moderate (API calls) |
| User-facing input handling | Injection detection (prompt-shield) | Detects adversarial payloads in dynamic context | Low-Moderate (runtime overhead) |
| Tool output validation | Shape validation (llm-output-validator) | Ensures structured responses match schema | Low (parsing overhead) |
| Multi-section prompt assembly | Context builder (agent-context-builder) | Standardizes prompt composition from reusable blocks | Low (development time) |
Configuration Template
# prompt-lint-config.yaml
rules:
  no_unresolved_placeholders:
    enabled: true
    severity: error
    whitelist:
      - "{user_id}"
      - "{session_context}"
      - "{tool_output}"
  no_mismatched_xml:
    enabled: true
    severity: error
  no_unclosed_xml:
    enabled: true
    severity: error
  max_sentence_length:
    enabled: true
    severity: warning
    threshold_chars: 210
  no_duplicate_instructions:
    enabled: true
    severity: warning
  min_specificity:
    enabled: false
    severity: warning
pipeline:
  fail_on_warnings: false
  report_format: "ci_summary"
  output_path: "./reports/prompt_validation.json"

# runner.py
import sys
from pathlib import Path

import yaml

from prompt_validation_engine import PromptValidationEngine
from analyzers import PlaceholderAnalyzer, XmlStructureAnalyzer, SentenceLengthAnalyzer


def load_config(path: str) -> dict:
    with open(path, "r") as f:
        return yaml.safe_load(f)


def run_validation(prompt_path: str, config_path: str) -> int:
    config = load_config(config_path)
    # This list is fixed: the per-rule "enabled" flags and thresholds in the
    # config are not applied here; only pipeline.fail_on_warnings is read.
    enabled_rules = [
        PlaceholderAnalyzer,
        XmlStructureAnalyzer,
        SentenceLengthAnalyzer
    ]
    engine = PromptValidationEngine(
        analyzers=enabled_rules,
        fail_on_warnings=config["pipeline"]["fail_on_warnings"]
    )
    prompt_text = Path(prompt_path).read_text()
    diagnostics = engine.validate(prompt_text)
    for d in diagnostics:
        level = d.severity.upper()
        print(f"[{level}] Rule: {d.rule_id} | Line: {d.line_number}")
        print(f" Message: {d.description}")
        print(f" Snippet: {d.offending_snippet}\n")
    # Exit code 1 fails the CI step; 0 lets it pass.
    return 1 if engine.should_fail_pipeline(diagnostics) else 0


if __name__ == "__main__":
    sys.exit(run_validation(sys.argv[1], sys.argv[2]))

Quick Start Guide
- Install dependencies: pip install pyyaml (the engine itself uses only the standard library for regex and dataclasses)
- Create configuration: Save the YAML template above as prompt-lint-config.yaml in your project root
- Write your first prompt: Create system_prompt.txt with your draft instructions
- Run validation: python runner.py system_prompt.txt prompt-lint-config.yaml
- Integrate into CI: Add the runner command to your pipeline before the deployment stage. Configure the step to fail on non-zero exit codes.
Static analysis does not replace prompt engineering. It enforces the structural discipline required for reliable model interaction. By treating prompts as versioned, lintable artifacts, teams eliminate preventable defects, reduce evaluation noise, and establish a repeatable path from draft to production.