A week with ctxbudgeter: how I cut Claude code-review costs 60%
Compiling Context: A Policy-Driven Architecture for LLM Prompt Optimization
Current Situation Analysis
AI agent development teams consistently treat prompt assembly as a string concatenation problem rather than a resource allocation problem. The pattern is predictable: a system prompt starts lean, then documentation, style guides, historical logs, and source files are incrementally appended to "improve accuracy." Over months, this accretion creates a black box where token distribution, cache efficiency, and security exposure become invisible.
The industry overlooks this because context compilation happens at runtime, hidden inside framework abstractions. Developers measure success by model output quality, not by input token efficiency or cache hit rates. When costs spike or latency degrades, the default response is to switch models or add retries, rather than audit the context pipeline.
Data from production code-review pipelines reveals the structural gap. A typical Claude-backed reviewer ingests ~22,000 input tokens per pull request, regardless of diff size. At scale, this translates to ~$0.066 per call and 8-second latency. Utilization rarely exceeds 55% because static documentation and low-priority files consume budget that should be reserved for task-critical diffs. Worse, without compile-time validation, sensitive snippets (API keys, credentials, internal tokens) routinely ship to third-party endpoints. The problem isn't model capability; it's the absence of a deterministic compilation layer that enforces budgets, cache strategies, and security policies before the request leaves the host environment.
WOW Moment: Key Findings
Shifting from runtime concatenation to compile-time context budgeting produces compounding efficiency gains. The following comparison isolates the delta between naive prompt assembly and a policy-compiled architecture:
| Approach | Input Tokens | Cost/Call | Latency (P95) | Cache Efficiency | Security Posture |
|---|---|---|---|---|---|
| Naive Concatenation | ~22,000 | $0.066 | 8.2s | 0% (no cache control) | Runtime-only, leak-prone |
| Policy-Compiled | 3,400 + 7,400 cached | $0.013 | 3.1s | 68% warm-cache hit rate | Compile-time enforcement |
The 60% cost reduction and 62% latency drop stem from three structural changes:
- Token budgeting with priority scoring eliminates low-signal documentation from every request.
- Deterministic cache breakpoint placement leverages Anthropic's prompt caching by marking the longest stable prefix.
- Compile-time policy enforcement intercepts sensitive content before serialization, removing the need for post-hoc scanning.
This finding matters because it transforms prompt engineering from an experimental art into a deterministic build step. Context becomes versioned, auditable, and gateable in CI. Teams can predict billing, enforce security contracts, and scale agent throughput without model upgrades.
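The per-call figures in the table can be reproduced with simple arithmetic. The sketch below assumes a base price of $3.00 per million input tokens, with cache reads billed at 0.1x and cache writes at 1.25x that rate. These prices are assumptions that this article does not state, and input_cost is a hypothetical helper, so substitute your own rates.

```python
# Assumed prices in USD per million input tokens (not from the article; adjust to your plan).
BASE_INPUT = 3.00
CACHE_READ = BASE_INPUT * 0.10   # assumed cache-read multiplier
CACHE_WRITE = BASE_INPUT * 1.25  # assumed cache-write multiplier


def input_cost(uncached: int, cache_read: int = 0, cache_write: int = 0) -> float:
    """Return the input-side cost in USD for one request."""
    total = uncached * BASE_INPUT + cache_read * CACHE_READ + cache_write * CACHE_WRITE
    return total / 1_000_000


naive = input_cost(uncached=22_000)                   # everything billed at the base rate
warm = input_cost(uncached=3_400, cache_read=7_400)   # stable prefix served from cache
cold = input_cost(uncached=3_400, cache_write=7_400)  # first call pays to create the cache

print(f"naive={naive:.4f} warm={warm:.4f} cold={cold:.4f}")
```

Under these assumptions the naive call comes to $0.066, matching the table, and a warm-cache call lands slightly above one cent. A cold call costs noticeably more than a warm one, so the realized saving depends on how often the cache is actually warm.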
Core Solution
The architecture replaces ad-hoc prompt construction with a three-phase compiler: Allocation → Policy Enforcement → Serialization. Each phase operates on a typed context schema, ensuring predictable token distribution and cache behavior.
Phase 1: Budget Allocation & Priority Scoring
Define a context schema that separates static documentation, task-critical diffs, and lazy-loaded references. Assign priority weights and token estimates. The allocator trims or defers items that exceed the budget.
from dataclasses import dataclass
from typing import List, Optional  # Optional is used by LazyReference in Phase 4


@dataclass
class ContextItem:
    name: str
    content: str
    kind: str  # system, task, doc, code
    priority: int
    estimated_tokens: int
    cache_policy: str = "none"  # stable, dynamic, none
    required: bool = False


class ContextAllocator:
    def __init__(self, model: str, total_budget: int, output_reserve: int):
        self.model = model
        # Input budget is what remains after reserving room for the model's output
        self.effective_budget = total_budget - output_reserve
        self.items: List[ContextItem] = []

    def add(self, item: ContextItem):
        self.items.append(item)

    def compile(self) -> List[ContextItem]:
        # Required items first, then by priority descending
        sorted_items = sorted(
            self.items,
            key=lambda x: (x.required, x.priority),
            reverse=True
        )
        allocated = []
        used_tokens = 0
        for item in sorted_items:
            # Required items are always kept, even if they push usage past the budget
            if item.required or (used_tokens + item.estimated_tokens) <= self.effective_budget:
                allocated.append(item)
                used_tokens += item.estimated_tokens
            # Low-priority items silently dropped when budget exhausted
        return allocated
Rationale: Separating total_budget from output_reserve prevents the compiler from starving the model's generation window. Priority sorting ensures task-critical diffs and system instructions survive budget pressure, while documentation gracefully degrades.
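The compile() method above drops items silently once the budget is spent. A minimal sketch of a variant that also reports what was dropped, so the caller can log or alert on it, might look like the following. The Item and Allocation types and the allocate function are hypothetical names introduced for illustration, not part of the code above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    # Hypothetical, trimmed-down stand-in for ContextItem
    name: str
    priority: int
    estimated_tokens: int
    required: bool = False


@dataclass
class Allocation:
    kept: List[Item] = field(default_factory=list)
    dropped: List[Item] = field(default_factory=list)
    used_tokens: int = 0

    def over_budget(self, budget: int) -> bool:
        # Required items are always kept, so usage can exceed the budget
        return self.used_tokens > budget


def allocate(items: List[Item], total_budget: int, output_reserve: int) -> Allocation:
    """Greedy allocation: required items first, then by descending priority."""
    budget = total_budget - output_reserve
    result = Allocation()
    for item in sorted(items, key=lambda i: (i.required, i.priority), reverse=True):
        if item.required or result.used_tokens + item.estimated_tokens <= budget:
            result.kept.append(item)
            result.used_tokens += item.estimated_tokens
        else:
            result.dropped.append(item)  # recorded instead of silently discarded
    return result


items = [
    Item("system_instructions", 100, 1_200, required=True),
    Item("pr_diff", 95, 2_500, required=True),
    Item("style_guide", 85, 2_000),
    Item("changelog", 40, 6_000),
]
report = allocate(items, total_budget=10_000, output_reserve=4_000)
print([i.name for i in report.kept], [i.name for i in report.dropped])
```

Returning the dropped list keeps the allocation itself deterministic while giving a CI gate or alerting hook something concrete to inspect.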
Phase 2: Policy Enforcement & Secret Interception
Security policies must operate at compile time. Runtime scanning introduces latency and fails to block serialization. The policy engine evaluates sensitivity tags and applies refuse, redact, or warn directives before the payload leaves the host.
import re
from enum import Enum
from typing import List


class SecretPolicy(Enum):
    REFUSE = "refuse"
    REDACT = "redact"
    WARN = "warn"
    ALLOW = "allow"


class PolicyEnforcer:
    # Placeholder patterns: sk- prefixed API keys, GitHub personal access tokens (ghp_),
    # and AWS access key IDs (AKIA)
    SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{20,}|ghp_[A-Za-z0-9]{36}|AKIA[0-9A-Z]{16}")

    def __init__(self, policy: SecretPolicy = SecretPolicy.REFUSE):
        self.policy = policy

    def validate(self, items: List[ContextItem]) -> List[ContextItem]:
        validated = []
        for item in items:
            if self.SECRET_PATTERN.search(item.content):
                if self.policy is SecretPolicy.REFUSE:
                    raise ValueError(f"Secret detected in '{item.name}'. Build halted.")
                elif self.policy is SecretPolicy.REDACT:
                    # Replaces the whole item, not just the matched span
                    item.content = "[REDACTED - sensitivity=secret]"
                    item.estimated_tokens = 5
                elif self.policy is SecretPolicy.WARN:
                    print(f"WARNING: Secret in '{item.name}'. Proceeding with policy={self.policy.value}")
                # ALLOW: the item passes through unchanged
            validated.append(item)
        return validated
Rationale: Regex detection is a placeholder; production systems should integrate gitleaks or trufflehog via subprocess or SDK. The enforcement layer is the critical component: policies live in code, are version-controlled, and fail CI deterministically.
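The REDACT branch above replaces the entire item, which also discards any useful content around the secret. As an alternative sketch, only the matched spans can be masked with re.subn. The redact_spans helper and its placeholder text are hypothetical; the pattern is the same placeholder regex used above.

```python
import re
from typing import Tuple

# Same placeholder pattern as PolicyEnforcer.SECRET_PATTERN above
SECRET_PATTERN = re.compile(r"sk-[A-Za-z0-9]{20,}|ghp_[A-Za-z0-9]{36}|AKIA[0-9A-Z]{16}")


def redact_spans(text: str, placeholder: str = "[REDACTED]") -> Tuple[str, int]:
    """Mask only the matched secrets; return the new text and the match count."""
    return SECRET_PATTERN.subn(placeholder, text)


# Obviously fake key, used only to exercise the pattern
sample = 'aws_key = "AKIAABCDEFGHIJKLMNOP"  # rotate me'
clean, hits = redact_spans(sample)
print(clean, hits)
```

If this variant is used, the item's estimated_tokens should be re-estimated after masking rather than set to a fixed value, since most of the content survives.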
Phase 3: Cache-Aware Serialization
Anthropic's prompt caching requires a stable prefix marked with cache_control. The serializer identifies the last item with cache_policy="stable" and applies the breakpoint. Subsequent requests with identical prefixes trigger cache reads instead of creation.
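from typing import Any, Dict, List


class CacheSerializer:
    @staticmethod
    def to_anthropic_payload(allocated_items: List[ContextItem], user_query: str) -> Dict[str, Any]:
        system_blocks = []
        stable_blocks = []   # cacheable user-turn content
        dynamic_blocks = []  # per-request content (diffs, resolved references)
        last_stable_system = None
        for item in allocated_items:
            block = {"type": "text", "text": item.content}
            is_stable = item.cache_policy == "stable"
            if item.kind == "system":
                system_blocks.append(block)
                if is_stable:
                    last_stable_system = block
            elif is_stable:
                stable_blocks.append(block)
            else:
                dynamic_blocks.append(block)
        # Apply the cache breakpoint to the last stable block. Stable user blocks are
        # emitted before dynamic ones so the cached prefix never contains per-request
        # content. Tracking the block itself avoids indexing user_blocks with a
        # position taken from allocated_items.
        breakpoint_block = stable_blocks[-1] if stable_blocks else last_stable_system
        if breakpoint_block is not None:
            breakpoint_block["cache_control"] = {"type": "ephemeral"}
        user_blocks = stable_blocks + dynamic_blocks
        return {
            "model": "claude-sonnet-4.6",
            "system": system_blocks,
            "messages": [{"role": "user", "content": user_blocks + [{"type": "text", "text": user_query}]}],
            "max_tokens": 4000
        }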
Rationale: Anthropic allows up to four cache breakpoints, but a single well-placed breakpoint on the longest stable prefix maximizes cache hit probability. Deterministic ordering is mandatory; shuffling blocks invalidates the cache.
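Because any change to the stable prefix forces a fresh cache write, it can help to fingerprint that prefix and log when it changes. The sketch below hashes every block up to and including the last one carrying cache_control. The prefix_fingerprint helper is hypothetical, and it only inspects a single list of content blocks shaped like the ones built above.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def prefix_fingerprint(blocks: List[Dict[str, Any]]) -> Optional[str]:
    """Hash the blocks up to and including the last cache_control breakpoint.

    Returns None when no block carries a breakpoint.
    """
    last = max((i for i, b in enumerate(blocks) if "cache_control" in b), default=None)
    if last is None:
        return None
    # sort_keys makes the serialization independent of dict insertion order
    canonical = json.dumps(blocks[: last + 1], sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


style = {"type": "text", "text": "STYLE GUIDE v3", "cache_control": {"type": "ephemeral"}}
call_a = [style, {"type": "text", "text": "diff for PR A"}]
call_b = [style, {"type": "text", "text": "diff for PR B"}]
shuffled = [{"type": "text", "text": "diff for PR A"}, style]

# Same stable prefix, different diffs: fingerprints should match
print(prefix_fingerprint(call_a) == prefix_fingerprint(call_b))
# Diff placed ahead of the breakpoint: the prefix now varies per request
print(prefix_fingerprint(call_a) == prefix_fingerprint(shuffled))
```

Logging this fingerprint next to each request makes an unexpected prefix change visible before it shows up as a drop in cache reads.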
Phase 4: Just-In-Time Reference Resolution
Eagerly loading all touched files causes token bloat on large refactors. Replace static file inclusion with lazy references that resolve only when budget permits.
from typing import Callable, Optional


class LazyReference:
    def __init__(self, name: str, query: str, resolver: Callable[[str], str], est_tokens: int):
        self.name = name
        self.query = query
        self.resolver = resolver
        self.est_tokens = est_tokens
        self.resolved_content: Optional[str] = None

    def resolve(self) -> str:
        # Call the resolver at most once and memoize the result
        if self.resolved_content is None:
            self.resolved_content = self.resolver(self.query)
        return self.resolved_content


# Usage in allocator
# vector_store is assumed to be an existing LangChain-style vector store
ref = LazyReference(
    name="def:authenticate_user",
    query="authenticate_user",
    resolver=lambda q: vector_store.similarity_search(q, k=1)[0].page_content,
    est_tokens=400
)
Rationale: The allocator evaluates est_tokens before calling the resolver. If budget is exhausted, the reference is skipped entirely, avoiding unnecessary vector database queries and network latency.
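The ContextAllocator shown in Phase 1 only handles eager ContextItem objects, so the budget check for lazy references has to be added separately. One possible shape, assuming a reference exposes est_tokens and resolve() as LazyReference does, is sketched below. The Ref class, resolve_within_budget, and fake_resolver are hypothetical names used only for this illustration.

```python
from typing import Callable, List, Optional, Tuple


class Ref:
    """Hypothetical stand-in mirroring LazyReference's interface."""

    def __init__(self, name: str, query: str, resolver: Callable[[str], str], est_tokens: int):
        self.name, self.query, self.resolver, self.est_tokens = name, query, resolver, est_tokens
        self.resolved_content: Optional[str] = None

    def resolve(self) -> str:
        if self.resolved_content is None:
            self.resolved_content = self.resolver(self.query)
        return self.resolved_content


def resolve_within_budget(refs: List[Ref], remaining_tokens: int) -> Tuple[List[str], List[str]]:
    """Resolve references in order while the estimate fits; skip the rest unresolved."""
    resolved, skipped = [], []
    for ref in refs:
        if ref.est_tokens <= remaining_tokens:
            resolved.append(ref.resolve())  # the resolver runs only on this path
            remaining_tokens -= ref.est_tokens
        else:
            skipped.append(ref.name)        # the resolver is never called
    return resolved, skipped


calls = []


def fake_resolver(query: str) -> str:
    # Stands in for a vector-store or file lookup; records each invocation
    calls.append(query)
    return f"def {query}(): ..."


refs = [
    Ref("def:authenticate_user", "authenticate_user", fake_resolver, 400),
    Ref("def:hash_password", "hash_password", fake_resolver, 400),
]
resolved, skipped = resolve_within_budget(refs, remaining_tokens=500)
print(resolved, skipped, calls)
```

The calls list confirms the point of the design: a skipped reference never triggers its lookup, so an exhausted budget costs nothing beyond the comparison.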
Pitfall Guide
| Pitfall | Explanation | Fix |
|---|---|---|
| Eager Loading All Touched Files | Including every modified file regardless of relevance causes token explosion on large refactors. | Use lazy references with estimated token costs. Resolve only when budget permits. |
| Ignoring Cache Breakpoint Placement | Random block ordering or missing cache_control tags forces full cache creation on every call. | Mark the longest stable prefix with cache_policy="stable". Place cache_control on the last stable block. Maintain deterministic ordering. |
| Runtime Secret Scanning | Scanning after serialization adds latency and fails to block credential leakage. | Enforce sensitivity policies at compile time. Use refuse mode in CI, redact in staging, warn in dev. |
| Hardcoding Priority Scores | Static priorities don't adapt to diff size or file type, causing critical context to be dropped. | Compute dynamic scores: base_priority + (diff_proximity_weight * file_relevance). Re-evaluate per request. |
| Forgetting Output Token Reservation | Allocating 100% of the budget to input leaves insufficient space for model generation, causing truncation. | Always subtract output_reserve from total_budget before allocation. Monitor actual output usage and adjust. |
| Treating Cache as Free | Cache creation costs more than reads. Inconsistent prefixes or frequent policy changes invalidate warm caches. | Track cache_creation vs cache_read metrics. Stabilize system prompts and documentation ordering. Use deterministic hashes for prefix keys. |
| No Fallback for Budget Overflow | When budget is exceeded, the compiler silently drops context, degrading model accuracy without visibility. | Implement graceful degradation: truncate lowest-priority items, split into multi-turn requests, or trigger a budget_exceeded event for alerting. |
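
The dynamic scoring formula from the table, base_priority + (diff_proximity_weight * file_relevance), can be sketched as follows. The relevance heuristic (1.0 for a changed file, 0.5 for a file in the same directory, 0.0 otherwise) and the default weight of 20 are illustrative assumptions, not values from this article.

```python
from pathlib import PurePosixPath
from typing import Set


def file_relevance(path: str, changed_files: Set[str]) -> float:
    """Illustrative heuristic: 1.0 for a changed file, 0.5 for a sibling, else 0.0."""
    if path in changed_files:
        return 1.0
    parent = PurePosixPath(path).parent
    if any(PurePosixPath(c).parent == parent for c in changed_files):
        return 0.5
    return 0.0


def dynamic_priority(base_priority: int, path: str, changed_files: Set[str],
                     diff_proximity_weight: float = 20.0) -> float:
    """base_priority + (diff_proximity_weight * file_relevance), recomputed per request."""
    return base_priority + diff_proximity_weight * file_relevance(path, changed_files)


changed = {"auth/login.py"}
for path in ("auth/login.py", "auth/session.py", "billing/invoice.py"):
    print(path, dynamic_priority(70, path, changed))
```

Because the score is recomputed from the current diff, the same file can rank high in one request and low in the next without any change to the static configuration.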
Production Bundle
Action Checklist
- Audit existing prompt assembly: extract token counts, cache hit rates, and latency baselines.
- Define a context schema: separate system, task, documentation, and reference items with priority weights.
- Implement compile-time budget allocation: reserve output tokens, sort by priority, drop low-signal items.
- Wire cache-aware serialization: mark stable prefixes, place cache_control breakpoints, enforce deterministic ordering.
- Enforce secret policies at compile time: integrate refuse/redact modes, replace regex with production scanners in CI.
- Add CI gates: fail builds on budget overflow, cache misconfiguration, or policy violations.
- Monitor cache metrics: track cache_creation vs cache_read, adjust prefix stability, and alert on hit rate degradation.
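
For the monitoring item, the Messages API reports cache_creation_input_tokens and cache_read_input_tokens in each response's usage object. A minimal sketch of an accumulator over those counters follows. The CacheStats class is hypothetical, and it takes plain dicts so it can run offline; converting the SDK's usage object into a dict is left as an assumption.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass
class CacheStats:
    calls: int = 0
    warm_calls: int = 0
    created_tokens: int = 0
    read_tokens: int = 0

    def record(self, usage: Mapping[str, int]) -> None:
        """Accumulate one response's usage counters (missing or null keys count as 0)."""
        created = usage.get("cache_creation_input_tokens", 0) or 0
        read = usage.get("cache_read_input_tokens", 0) or 0
        self.calls += 1
        self.warm_calls += 1 if read > 0 else 0
        self.created_tokens += created
        self.read_tokens += read

    @property
    def hit_rate(self) -> float:
        # Share of calls that read at least part of the prompt from cache
        return self.warm_calls / self.calls if self.calls else 0.0


stats = CacheStats()
stats.record({"input_tokens": 3400, "cache_creation_input_tokens": 7400, "cache_read_input_tokens": 0})
stats.record({"input_tokens": 3400, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 7400})
stats.record({"input_tokens": 3500, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 7400})
print(f"hit_rate={stats.hit_rate:.2f} created={stats.created_tokens} read={stats.read_tokens}")
```

A rising created_tokens total alongside a falling hit_rate is the signal that the prefix has stopped being stable.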
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Small PRs (<5 files) | Eager loading + stable cache prefix | Low token footprint; cache hits maximize ROI | -40% vs naive |
| Large refactors (20+ files) | JIT references + dynamic priority scoring | Prevents token bloat; resolves only relevant symbols | -65% vs naive |
| Security-sensitive repos | Compile-time refuse policy + gitleaks integration | Blocks credential leakage before serialization | Zero leak incidents |
| Multi-model routing | Policy-compiled context + adapter layer | Decouples budget logic from model-specific serialization | Consistent billing across providers |
| High-throughput queues | Deterministic prefix ordering + warm cache strategy | Maximizes cache read ratio; reduces P95 latency | -60% input cost |
Configuration Template
# context_budget.yaml
model: claude-sonnet-4.6
total_budget: 24000
output_reserve: 4000
policies:
  secret_detection:
    mode: refuse  # refuse | redact | warn | allow
    scanner: gitleaks  # regex | gitleaks | trufflehog
  cache_strategy:
    stable_prefix: true
    breakpoint_placement: last_stable
  fallback:
    on_overflow: truncate_lowest_priority
    alert_channel: slack  # slack | pagerduty | none
items:
  - name: system_instructions
    kind: system
    priority: 100
    cache_policy: stable
    required: true
    source: prompts/system.md
  - name: style_guide
    kind: doc
    priority: 85
    cache_policy: stable
    source: docs/STYLE_GUIDE.md
  - name: pr_diff
    kind: task
    priority: 95
    cache_policy: none
    required: true
    source: webhook.payload.diff
  - name: touched_symbols
    kind: code
    priority: 70
    cache_policy: none
    loader: vector_search
    estimated_tokens: 400
    max_items: 5
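
A small validator can catch misconfiguration before the allocator runs. The sketch below assumes the template has already been parsed into a dict (for example with yaml.safe_load) and checks only the budget fields and the item entries. The validate_config function and its rules are hypothetical, with the allowed kind and cache_policy values taken from the comments in the ContextItem definition.

```python
from typing import Any, Dict, List

VALID_KINDS = {"system", "task", "doc", "code"}
VALID_CACHE = {"stable", "dynamic", "none"}


def validate_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config passed."""
    errors = []
    total, reserve = cfg.get("total_budget", 0), cfg.get("output_reserve", 0)
    if reserve >= total:
        errors.append("output_reserve must be smaller than total_budget")
    seen = set()
    for item in cfg.get("items", []):
        name = item.get("name", "<unnamed>")
        if name in seen:
            errors.append(f"duplicate item name: {name}")
        seen.add(name)
        if item.get("kind") not in VALID_KINDS:
            errors.append(f"{name}: unknown kind {item.get('kind')!r}")
        if item.get("cache_policy", "none") not in VALID_CACHE:
            errors.append(f"{name}: unknown cache_policy {item.get('cache_policy')!r}")
        if not isinstance(item.get("priority"), int):
            errors.append(f"{name}: priority must be an integer")
    return errors


good = {"total_budget": 24000, "output_reserve": 4000,
        "items": [{"name": "pr_diff", "kind": "task", "priority": 95, "cache_policy": "none"}]}
bad = {"total_budget": 4000, "output_reserve": 4000,
       "items": [{"name": "notes", "kind": "memo", "priority": "high"}]}
print(validate_config(good))
print(validate_config(bad))
```

Returning a list rather than raising on the first problem lets a CI job print every issue in one run.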
Quick Start Guide
- Install dependencies: pip install anthropic tiktoken pyyaml
- Define your schema: Create a context_budget.yaml matching your repository structure and token limits.
- Wire the compiler: Instantiate ContextAllocator, load YAML config, and attach your resolver functions (file reader, vector store, webhook parser).
- Serialize & send: Call CacheSerializer.to_anthropic_payload(), pass the result to anthropic.Anthropic().messages.create(), and log cache_creation / cache_read metrics.
- Gate in CI: Add a pre-commit hook or GitHub Action that runs the allocator in dry_run mode. Fail on budget overflow or policy violations before merging prompt changes.
This architecture transforms context management from an experimental variable into a deterministic build artifact. By enforcing budgets, stabilizing cache prefixes, and intercepting sensitive content at compile time, teams achieve predictable billing, P95 latency near 3 seconds, and structurally secure agent pipelines.