Cache-Aware Spawning: What Changed in llm-cli-gateway, a Week On
Architecting Cache-Aware CLI Wrappers for Multi-Provider LLM Efficiency
Current Situation Analysis
The Industry Pain Point
Modern LLM orchestration frequently involves routing requests across multiple model providers to balance cost, capability, and vendor diversity. A common pattern in these workflows is the repetition of large, static context blocks (system instructions, tool definitions, and reference documents) across sequential calls. Without explicit cache management, every invocation transmits these bytes anew, resulting in redundant input token charges. While major providers (Anthropic, OpenAI, Google, xAI, Mistral) implement prompt caching, the mechanisms are fragmented. Cache hit rates depend on byte-level prefix stability, minimum token thresholds, and provider-specific TTL policies, none of which are uniform.
Why This Is Overlooked
Developers often treat prompts as opaque strings, concatenating dynamic user queries with static context ad hoc. This approach breaks prefix stability because volatile data (timestamps, session IDs, or user input) is frequently interleaved with stable content. Furthermore, the assumption that caching is automatic or uniform across providers leads to suboptimal routing. For instance, Anthropic models require minimum stable token counts before caching activates, while Codex reports usage metrics under different field names. Ignoring these nuances results in "phantom cache misses" where the infrastructure supports caching, but the input composition prevents it.
Data-Backed Evidence
Analysis of multi-provider gateway telemetry reveals significant inefficiencies in naive prompt construction:
- Threshold Variance: Anthropic Sonnet requires a minimum of 1,024 stable tokens to trigger caching, whereas Opus and Haiku models require 4,096 tokens. Haiku on Vertex AI requires 2,048. Sub-threshold inputs never cache, regardless of repetition.
- Metric Divergence: Codex reports cached usage via `cached_input_tokens`, while Anthropic uses `cache_read_input_tokens`. Parsers assuming a unified schema fail to capture savings data.
- Provider Fragmentation: Five distinct providers (Claude, Codex, Gemini, Grok, Mistral) each expose cache state differently. Mistral Vibe, for example, relies on environment variable injection (`VIBE_ACTIVE_MODEL`) rather than CLI flags for model resolution, requiring specific wrapper logic.
WOW Moment: Key Findings
Restructuring prompt composition to enforce canonical ordering of stable prefixes yields immediate, measurable efficiency gains. By decoupling volatile task data from static context, gateways can achieve high cache hit rates without modifying the underlying CLI invocation model.
| Approach | Cache Hit Rate | Token Waste | Implementation Effort | Provider Agnostic |
|---|---|---|---|---|
| Naive String Concat | < 40% | High | Low | Yes |
| Structured Prefix | > 85% | Near Zero | Medium | Yes |
| API Proxying | Variable | Medium | High | No |
Why This Matters
The structured prefix approach enables "cache hygiene" at the composition layer. It allows a gateway to wrap CLI binaries while still delivering cache benefits, preserving the architectural simplicity of CLI wrapping without sacrificing the cost efficiency of API-level caching. This enables cost-effective multi-model sampling, where a workflow can query five different providers with the same context without incurring five times the input cost.
Core Solution
Implementation Strategy: Structured Prompt Composition
The core solution involves defining a PromptComponents interface that separates stable context from volatile tasks. The gateway then concatenates these components in a fixed, canonical order. This ensures that the byte sequence preceding the user query remains identical across calls, satisfying the prefix stability requirement for all providers.
Step 1: Define the Component Schema
Create a TypeScript interface that enforces separation of concerns. This prevents accidental interleaving of dynamic data.
// Stable blocks come first; only userQuery may change between calls.
export interface PromptComponents {
  systemInstruction: string;
  toolDefinitions: string;
  referenceContext: string;
  userQuery: string; // volatile: the per-call task
}

export interface CacheAwareRequest {
  provider: 'claude' | 'codex' | 'gemini' | 'grok' | 'mistral';
  components: PromptComponents;
  sessionId?: string;
}
Step 2: Canonical Composition Engine
Implement a composition function that joins components with a stable delimiter. The order must be immutable: System → Tools → Context → Query.
const STABLE_SEPARATOR = '\n---GATEWAY_BOUNDARY---\n';

export function composeCacheOptimizedPayload(
  components: PromptComponents
): string {
  // Stable prefix construction
  const stablePrefix = [
    components.systemInstruction,
    components.toolDefinitions,
    components.referenceContext
  ].join(STABLE_SEPARATOR);

  // Volatile tail
  const volatileTail = components.userQuery;

  return `${stablePrefix}${STABLE_SEPARATOR}${volatileTail}`;
}
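
One way to sanity-check the prefix-stability property is to compose two payloads that differ only in the query and count how many leading characters they share. The sketch below re-declares a minimal composition function so it runs standalone. `sharedPrefixLength` is a hypothetical audit helper, not part of the gateway, and it compares UTF-16 code units as a stand-in for the byte-level comparison providers perform.

```typescript
// Hypothetical audit helper; not part of llm-cli-gateway.
interface PromptComponents {
  systemInstruction: string;
  toolDefinitions: string;
  referenceContext: string;
  userQuery: string;
}

const STABLE_SEPARATOR = '\n---GATEWAY_BOUNDARY---\n';

// Minimal stand-in for the canonical composition function above.
function compose(c: PromptComponents): string {
  return [c.systemInstruction, c.toolDefinitions, c.referenceContext, c.userQuery]
    .join(STABLE_SEPARATOR);
}

// Number of leading UTF-16 code units two strings have in common.
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a.charCodeAt(i) === b.charCodeAt(i)) i++;
  return i;
}

// Placeholder content for illustration only.
const base = {
  systemInstruction: 'You are a code reviewer.',
  toolDefinitions: '[]',
  referenceContext: 'Style guide v1',
};
const firstQuery = 'Review file A';
const first = compose({ ...base, userQuery: firstQuery });
const second = compose({ ...base, userQuery: 'Summarise file B' });

// Everything up to and including the final separator should be shared.
const stableLength = first.length - firstQuery.length;
console.log(sharedPrefixLength(first, second) >= stableLength);
```

If the shared length comes out shorter than the stable portion, something volatile has leaked into one of the first three components.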
Step 3: Provider-Specific Adaptations
The gateway must handle provider nuances during the spawn phase.
- Mistral Vibe Integration: Mistral Vibe does not accept a `--model` flag. The gateway must resolve the model alias and inject it via the `VIBE_ACTIVE_MODEL` environment variable. Additionally, session resumption requires `[session_logging] enabled = true` in the Vibe configuration; the gateway should validate this to prevent opaque failures.
- Anthropic Threshold Validation: Before dispatching, check the stable token count against model-specific minimums. If the stable prefix is below the threshold, the gateway can warn the user or pad the context to ensure caching eligibility.
- Codex Usage Parsing: When ingesting telemetry, the parser must handle field name divergence. Use a provider-aware mapper to normalize `cached_input_tokens` (Codex) and `cache_read_input_tokens` (Anthropic) into a unified `cacheReadTokens` metric.
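
The provider-aware mapper from the last bullet might look roughly like the following. Only the two field names and the `cacheReadTokens` target come from the text above; the `NormalizedUsage` shape, the `normalizeUsage` name, and the choice to return `null` for unmapped providers are assumptions made for illustration.

```typescript
// Hypothetical adapter; only the two provider field names are taken from the article.
type Provider = 'claude' | 'codex' | 'gemini' | 'grok' | 'mistral';

interface NormalizedUsage {
  provider: Provider;
  // null means the provider has no mapping here or the field was absent.
  cacheReadTokens: number | null;
}

// Providers without an entry are simply not mapped in this sketch.
const CACHE_READ_FIELD: Partial<Record<Provider, string>> = {
  claude: 'cache_read_input_tokens',
  codex: 'cached_input_tokens',
};

function normalizeUsage(
  provider: Provider,
  rawUsage: Record<string, unknown>
): NormalizedUsage {
  const field = CACHE_READ_FIELD[provider];
  const value = field === undefined ? undefined : rawUsage[field];
  return {
    provider,
    cacheReadTokens:
      typeof value === 'number' && Number.isFinite(value) ? value : null,
  };
}
```

Returning `null` rather than `0` keeps "no cache reads" distinguishable from "no data", which matters when aggregating savings.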
Step 4: Observability Surface
Expose cache metrics via MCP resources without leaking prompt content. This maintains the invariant that session storage contains no conversation text.
- `cache_state://global`: Aggregates hit rates, token counts, and estimated savings over a rolling window.
- `cache_state://session/{id}`: Provides per-session breakdowns, including distinct prefix counts and TTL remaining for Claude.
- `cache_state://prefix/{hash}`: Tracks usage of specific stable prefixes across providers, enabling identification of redundant context blocks.
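
The text does not say how the `{hash}` in `cache_state://prefix/{hash}` is derived. As a sketch only, assuming a SHA-256 digest of the stable prefix, a key could be produced like this without ever storing the prompt text. Both function names are hypothetical.

```typescript
import { createHash } from 'node:crypto';

// ASSUMPTION: SHA-256 over the UTF-8 bytes of the stable prefix.
// The gateway's real hashing scheme is not specified in the article.
function prefixHash(stablePrefix: string): string {
  return createHash('sha256').update(stablePrefix, 'utf8').digest('hex');
}

// Builds the resource URI for a prefix; only the hash leaves this function.
function prefixResourceUri(stablePrefix: string): string {
  return `cache_state://prefix/${prefixHash(stablePrefix)}`;
}
```

Because identical prefixes hash to the same key, the same context block sent to different providers lands under one resource.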
Architecture Rationale
This solution operates at the CLI wrapping layer, not the API proxy layer. It does not construct JSON request bodies; it composes the input string passed to the CLI binary. This preserves the gateway's architectural thesis while delivering cache efficiency. It avoids the complexity of maintaining provider-specific API schemas and remains resilient to API changes, as it relies on the stable CLI interface.
Pitfall Guide
1. Volatile Prefix Injection
- Explanation: Including timestamps, request IDs, or dynamic metadata in the stable portion of the prompt breaks prefix stability. Even a single byte change invalidates the cache.
- Fix: Strictly isolate all dynamic data to the `userQuery` component. Use deterministic separators and avoid injecting runtime variables into system or context blocks.
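
A lightweight lint can catch the most common offenders before dispatch. The sketch below scans a stable component for a few illustrative patterns; the pattern list and the `findVolatileMarkers` name are assumptions, and a real audit would need patterns tuned to the data a given pipeline actually injects.

```typescript
// Hypothetical lint; the patterns are illustrative, not exhaustive.
const VOLATILE_PATTERNS: ReadonlyArray<{ name: string; pattern: RegExp }> = [
  // ISO 8601 date-time such as 2024-05-01T12:00:00
  { name: 'iso-timestamp', pattern: /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}/ },
  // Canonical 8-4-4-4-12 hex UUID
  { name: 'uuid', pattern: /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i },
  // 13-digit number starting with 1, the shape of a current epoch-millisecond value
  { name: 'epoch-millis', pattern: /\b1\d{12}\b/ },
];

// Returns the names of every pattern found in text meant to be stable.
function findVolatileMarkers(stableText: string): string[] {
  return VOLATILE_PATTERNS
    .filter(({ pattern }) => pattern.test(stableText))
    .map(({ name }) => name);
}
```

Running it over `systemInstruction`, `toolDefinitions`, and `referenceContext` and failing the request on any hit keeps the check cheap and deterministic.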
2. Threshold Blindness
- Explanation: Sending a stable prefix of 500 tokens to an Opus model and expecting a cache hit. Opus requires 4,096 tokens, so the request never caches and the full prefix is billed on every repetition.
- Fix: Implement a `minStableTokens` lookup table per model family. Warn users when stable content is below the threshold and suggest context expansion or model switching.
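
A minimal sketch of such a lookup follows, using the thresholds quoted in this article. The token estimate is a loud assumption: it uses a rough four-characters-per-token heuristic, so a real gateway would substitute the provider's own tokenizer or token-count figure. All names here are hypothetical.

```typescript
// Thresholds are the figures quoted in this article; verify them against
// current provider documentation before relying on them.
type ModelFamily = 'sonnet' | 'opus' | 'haiku' | 'haiku_vertex';

const MIN_STABLE_TOKENS: Record<ModelFamily, number> = {
  sonnet: 1024,
  opus: 4096,
  haiku: 4096,
  haiku_vertex: 2048,
};

// ASSUMPTION: ~4 characters per token, a rough rule of thumb for English text.
// A real tokenizer will give different counts.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface ThresholdCheck {
  eligible: boolean;
  estimatedTokens: number;
  required: number;
  warning?: string;
}

// Compares the estimated size of the stable prefix with the family minimum.
function checkCacheEligibility(
  stablePrefix: string,
  family: ModelFamily
): ThresholdCheck {
  const estimatedTokens = estimateTokens(stablePrefix);
  const required = MIN_STABLE_TOKENS[family];
  if (estimatedTokens >= required) {
    return { eligible: true, estimatedTokens, required };
  }
  return {
    eligible: false,
    estimatedTokens,
    required,
    warning: `Stable prefix is ~${estimatedTokens} tokens; ${family} needs ${required} to cache.`,
  };
}
```

The check is advisory by design: it returns a warning rather than throwing, so the caller decides whether to expand context, switch models, or proceed uncached.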
3. Field Name Assumption
- Explanation: Hardcoding `cache_read_input_tokens` for usage parsing. This fails for Codex, which uses `cached_input_tokens`, leading to null savings data.
- Fix: Abstract usage parsing behind a provider adapter. Map provider-specific fields to a canonical schema during ingestion.
4. TTL Certainty Fallacy
- Explanation: Assuming local TTL calculations match the provider's actual cache state. Local TTL is best-effort based on `lastRequestAt`. Provider-side evictions may occur earlier.
- Fix: Treat TTL warnings as advisory. Implement a `cache_ttl_expiring_soon` warning that fires when local TTL is within 30 seconds of expiry, but do not block execution. Acknowledge that cache misses may still occur.
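
The advisory check reduces to a small pure function. In this sketch the 300-second default mirrors `anthropic_ttl_seconds` from the configuration template and the 30-second window comes from the fix above; the function name, the return shape, and the decision to stay silent once the local TTL has already lapsed are assumptions.

```typescript
interface TtlAdvisory {
  remainingMs: number;
  warning: 'cache_ttl_expiring_soon' | null;
}

// Best-effort local estimate only; the provider may evict earlier.
function ttlAdvisory(
  lastRequestAtMs: number,
  nowMs: number,
  ttlSeconds = 300,
  warnWithinSeconds = 30
): TtlAdvisory {
  const expiresAtMs = lastRequestAtMs + ttlSeconds * 1000;
  const remainingMs = Math.max(0, expiresAtMs - nowMs);
  // Warn only while the cache is still locally considered alive.
  const expiringSoon =
    remainingMs > 0 && remainingMs <= warnWithinSeconds * 1000;
  return {
    remainingMs,
    warning: expiringSoon ? 'cache_ttl_expiring_soon' : null,
  };
}
```

Passing `nowMs` in explicitly, rather than reading the clock inside the function, keeps the check testable and never blocks the request path.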
5. CLI vs. API Confusion
- Explanation: Attempting to inject `cache_control` JSON markers into CLI arguments. This is invalid for CLI wrappers and may cause binary errors.
- Fix: Rely on prefix discipline for CLI-based caching. Reserve explicit cache control markers for API proxy implementations. Keep the CLI wrapper focused on input composition.
6. Mistral Model Flag Error
- Explanation: Passing `--model` to the Mistral Vibe binary. Vibe ignores this flag, causing model resolution failures.
- Fix: Use `VIBE_ACTIVE_MODEL` environment variable for model injection. Document this divergence clearly in the gateway configuration.
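
One way to enforce this in the wrapper is to build the spawn arguments and environment together, stripping any `--model` flag a caller passed and setting the variable instead. The sketch below is a pure function under that assumption; `planVibeSpawn` and `SpawnPlan` are hypothetical names, and the other arguments are passed through untouched because Vibe's flag set is not described here.

```typescript
interface SpawnPlan {
  args: string[];
  env: Record<string, string | undefined>;
}

// Removes --model (in both "--model X" and "--model=X" forms) and injects
// the resolved model through VIBE_ACTIVE_MODEL instead.
function planVibeSpawn(
  baseArgs: readonly string[],
  resolvedModel: string,
  baseEnv: Record<string, string | undefined>
): SpawnPlan {
  const args: string[] = [];
  let skipNext = false;
  for (const arg of baseArgs) {
    if (skipNext) {
      skipNext = false; // this is the value that followed --model
      continue;
    }
    if (arg === '--model') {
      skipNext = true;
      continue;
    }
    if (arg.startsWith('--model=')) continue;
    args.push(arg);
  }
  // Copy rather than mutate so the caller's environment is left alone.
  return { args, env: { ...baseEnv, VIBE_ACTIVE_MODEL: resolvedModel } };
}
```

The resulting `args` and `env` can be handed to Node's `child_process.spawn` as its argument list and `env` option, with `process.env` supplied as `baseEnv`.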
7. Session Logging Neglect
- Explanation: Failing to enable session logging for providers that require it for resumption. This leads to opaque errors when attempting to continue sessions.
- Fix: Validate configuration flags (e.g., `[session_logging] enabled = true`) during gateway initialization. Surface clear error messages if prerequisites are missing.
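
As a rough illustration of the startup check, the function below scans configuration text for an `enabled = true` key inside a `[session_logging]` table. It is a deliberately naive line scanner, not a TOML parser: it strips `#` comments blindly and ignores inline tables and dotted keys, so a production gateway would likely use a proper TOML library. The function name is hypothetical, and how the config text is located and read is left to the caller.

```typescript
// Naive line-based check; a sketch, not a TOML parser.
function isSessionLoggingEnabled(configText: string): boolean {
  let inSection = false;
  for (const rawLine of configText.split(/\r?\n/)) {
    // Drop trailing comments and surrounding whitespace.
    const line = rawLine.replace(/#.*$/, '').trim();
    if (line === '') continue;

    // A [table] header switches the current section.
    const header = /^\[([^\]]+)\]$/.exec(line);
    if (header) {
      inSection = header[1].trim() === 'session_logging';
      continue;
    }

    if (inSection && /^enabled\s*=\s*true$/.test(line)) return true;
  }
  return false;
}
```

A gateway could call this during initialization and fail fast with a message naming the missing setting, instead of letting session resumption fail later.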
Production Bundle
Action Checklist
- Audit existing prompt construction logic for volatile data interleaving.
- Implement `PromptComponents` interface and canonical composition function.
- Configure provider-specific minimum token thresholds in gateway settings.
- Update usage parsers to handle field name divergences (e.g., Codex vs. Anthropic).
- Enable `cache_state://` MCP resources for observability.
- Test TTL warnings for Claude sessions with long intervals.
- Verify Mistral Vibe configuration for `VIBE_ACTIVE_MODEL` and session logging.
- Run load tests to validate cache hit rates across all five providers.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume, static context | Structured Prompt Composition | Maximizes cache hits via prefix stability. | Significant reduction in input token costs. |
| One-off, unique queries | Raw Prompt String | Overhead of composition is unnecessary. | Neutral. |
| Multi-provider routing | CLI Gateway Wrapper | Abstracts provider nuances; enables diverse sampling. | Enables cost-effective model comparison. |
| Low-latency requirements | Structured Prompt + TTL Warning | Prevents cache misses due to expiry. | Reduces retry latency and wasted tokens. |
| Strict privacy constraints | CLI Wrapper with Hash-only Observability | No prompt text stored; only metadata exposed. | Maintains compliance while tracking efficiency. |
Configuration Template
[cache_awareness]
enabled = true
warn_on_ttl_expiry = true
anthropic_ttl_seconds = 300

[cache_awareness.thresholds]
# Minimum stable tokens required to trigger caching
sonnet = 1024
opus = 4096
haiku = 4096
haiku_vertex = 2048

[providers.mistral]
# Mistral Vibe requires env var injection
model_env_var = "VIBE_ACTIVE_MODEL"
session_logging_required = true

[observability]
expose_cache_resources = true
privacy_mode = "hash_only"
Quick Start Guide
- Install Gateway: Deploy `llm-cli-gateway` v1.6.0 or later. Ensure all provider CLIs are accessible.
- Define Components: Refactor your prompt generation to use `PromptComponents`. Separate system, tools, context, and query.
- Compose and Dispatch: Use the canonical composition function to generate the payload. Pass the result to the gateway's request handler.
- Monitor Efficiency: Query `cache_state://global` to verify hit rates. Adjust context blocks if thresholds are not met.
- Enable Warnings: Set `warn_on_ttl_expiry = true` in configuration to receive alerts for expiring Claude caches.
