Cache-Aware Spawning: What Changed in llm-cli-gateway, a Week On

Architecting Cache-Aware CLI Wrappers for Multi-Provider LLM Efficiency

Current Situation Analysis

The Industry Pain Point

Modern LLM orchestration frequently involves routing requests across multiple model providers to balance cost, capability, and vendor diversity. A common pattern in these workflows is the repetition of large, static context blocks (system instructions, tool definitions, and reference documents) across sequential calls. Without explicit cache management, every invocation transmits these bytes anew, resulting in redundant input token charges. While major providers (Anthropic, OpenAI, Google, xAI, Mistral) implement prompt caching, the mechanisms are fragmented. Cache hit rates depend on byte-level prefix stability, minimum token thresholds, and provider-specific TTL policies, none of which are uniform.

Why This Is Overlooked

Developers often treat prompts as opaque strings, concatenating dynamic user queries with static context ad hoc. This approach breaks prefix stability because volatile data (timestamps, session IDs, or user input) is frequently interleaved with stable content. Furthermore, the assumption that caching is automatic or uniform across providers leads to suboptimal routing. For instance, Anthropic models require minimum stable token counts before caching activates, while Codex reports usage metrics under different field names. Ignoring these nuances results in "phantom cache misses", where the infrastructure supports caching but the input composition prevents it.

Data-Backed Evidence

Analysis of multi-provider gateway telemetry reveals significant inefficiencies in naive prompt construction:

Threshold Variance: Anthropic Sonnet requires a minimum of 1,024 stable tokens to trigger caching, whereas Opus and Haiku models require 4,096 tokens. Haiku on Vertex AI requires 2,048. Sub-threshold inputs never cache, regardless of repetition.
Metric Divergence: Codex reports cached usage via cached_input_tokens, while Anthropic uses cache_read_input_tokens. Parsers assuming a unified schema fail to capture savings data.
Provider Fragmentation: Five distinct providers (Claude, Codex, Gemini, Grok, Mistral) each expose cache state differently. Mistral Vibe, for example, relies on environment variable injection (VIBE_ACTIVE_MODEL) rather than CLI flags for model resolution, requiring specific wrapper logic.

WOW Moment: Key Findings

Restructuring prompt composition to enforce canonical ordering of stable prefixes yields immediate, measurable efficiency gains. By decoupling volatile task data from static context, gateways can achieve high cache hit rates without modifying the underlying CLI invocation model.

Approach	Cache Hit Rate	Token Waste	Implementation Effort	Provider Agnostic
Naive String Concat	< 40%	High	Low	Yes
Structured Prefix	> 85%	Near Zero	Medium	Yes
API Proxying	Variable	Medium	High	No

Why This Matters

The structured prefix approach enables "cache hygiene" at the composition layer. It allows a gateway to wrap CLI binaries while still delivering cache benefits, preserving the architectural simplicity of CLI wrapping without sacrificing the cost efficiency of API-level caching. This makes multi-model sampling more affordable. Caches are not shared between providers, so the first call to each of five providers still pays full input cost, but every repeated call with the same context can read that provider's cached prefix instead of paying for it again.
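
To make the cost argument concrete, the sketch below estimates input cost for a run of repeated calls to one provider, with and without caching. `estimateInputCost` is a hypothetical helper, not part of the gateway, and the multipliers (a 1.25x surcharge on the cache-writing call, a 0.1x rate on cache reads) are illustrative assumptions; real pricing differs by provider and model. It also assumes every later call lands inside the cache TTL.

```typescript
// Hypothetical helper. Costs are in "base token units":
// 1 unit = the price of one uncached input token.
interface CostAssumptions {
  cacheWriteMultiplier: number; // assumed surcharge on the first (cache-writing) call
  cacheReadMultiplier: number;  // assumed rate for later (cache-reading) calls
}

function estimateInputCost(
  stableTokens: number,
  volatileTokens: number,
  calls: number,
  pricing: CostAssumptions | null // null = no caching at all
): number {
  if (calls <= 0) return 0;
  if (pricing === null) {
    // Every call pays full price for the whole prompt.
    return calls * (stableTokens + volatileTokens);
  }
  // First call writes the cache; the volatile tail is never cached.
  const firstCall = stableTokens * pricing.cacheWriteMultiplier + volatileTokens;
  // Later calls read the stable prefix from cache (assumes no expiry or eviction).
  const laterCalls =
    (calls - 1) * (stableTokens * pricing.cacheReadMultiplier + volatileTokens);
  return firstCall + laterCalls;
}

const illustrativePricing: CostAssumptions = {
  cacheWriteMultiplier: 1.25,
  cacheReadMultiplier: 0.1,
};

// 8,000 stable tokens + 200 volatile tokens, 10 calls to the same provider.
const uncachedCost = estimateInputCost(8000, 200, 10, null);
const cachedCost = estimateInputCost(8000, 200, 10, illustrativePricing);
console.log({ uncachedCost, cachedCost });
```

Under these assumed multipliers the ten calls cost about 19,200 units with caching versus 82,000 without. The same arithmetic applies independently to each provider in a multi-provider run.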

Core Solution

Implementation Strategy: Structured Prompt Composition

The core solution involves defining a PromptComponents interface that separates stable context from volatile tasks. The gateway then concatenates these components in a fixed, canonical order. This ensures that the byte sequence preceding the user query remains identical across calls, satisfying the prefix stability requirement for all providers.

Step 1: Define the Component Schema

Create a TypeScript interface that enforces separation of concerns. This prevents accidental interleaving of dynamic data.

export interface PromptComponents {
  systemInstruction: string;
  toolDefinitions: string;
  referenceContext: string;
  userQuery: string;
}

export interface CacheAwareRequest {
  provider: 'claude' | 'codex' | 'gemini' | 'grok' | 'mistral';
  components: PromptComponents;
  sessionId?: string;
}

Step 2: Canonical Composition Engine

Implement a composition function that joins components with a stable delimiter. The order must be immutable: System → Tools → Context → Query.

const STABLE_SEPARATOR = '\n---GATEWAY_BOUNDARY---\n';

export function composeCacheOptimizedPayload(
  components: PromptComponents
): string {
  // Stable prefix construction
  const stablePrefix = [
    components.systemInstruction,
    components.toolDefinitions,
    components.referenceContext
  ].join(STABLE_SEPARATOR);

  // Volatile tail
  const volatileTail = components.userQuery;

  return `${stablePrefix}${STABLE_SEPARATOR}${volatileTail}`;
}
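
A quick way to check the prefix-stability property is to compose two payloads that differ only in the query and measure how much of the string they share. The sketch below repeats compact local copies of the separator and composer so it runs on its own; `sharedPrefixLength` is a hypothetical helper. It compares UTF-16 code units rather than provider tokens, so it is a sanity check, not a cache predictor.

```typescript
// Local copies so the snippet is self-contained; they mirror the definitions above.
const BOUNDARY = '\n---GATEWAY_BOUNDARY---\n';

interface Parts {
  systemInstruction: string;
  toolDefinitions: string;
  referenceContext: string;
  userQuery: string;
}

// Same output as joining the stable prefix and then appending separator + query.
function composePayload(p: Parts): string {
  return [p.systemInstruction, p.toolDefinitions, p.referenceContext, p.userQuery].join(BOUNDARY);
}

// Hypothetical helper: length of the common leading substring (UTF-16 code units).
function sharedPrefixLength(a: string, b: string): number {
  const limit = Math.min(a.length, b.length);
  let i = 0;
  while (i < limit && a.charCodeAt(i) === b.charCodeAt(i)) i++;
  return i;
}

const stableParts = {
  systemInstruction: 'You are a code reviewer.',
  toolDefinitions: '[{"name":"read_file"}]',
  referenceContext: 'Style guide: prefer early returns.',
};
const firstQuery = 'Review foo.ts';
const firstPayload = composePayload({ ...stableParts, userQuery: firstQuery });
const secondPayload = composePayload({ ...stableParts, userQuery: 'Summarize bar.ts' });

// Everything up to and including the final separator should be identical.
const stableLength = firstPayload.length - firstQuery.length;
console.log(sharedPrefixLength(firstPayload, secondPayload) === stableLength);
```

If the comparison ever reports a shared length shorter than the stable portion, something volatile has leaked into one of the first three components.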

Step 3: Provider-Specific Adaptations

The gateway must handle provider nuances during the spawn phase.

Mistral Vibe Integration: Mistral Vibe does not accept a --model flag. The gateway must resolve the model alias and inject it via the VIBE_ACTIVE_MODEL environment variable. Additionally, session resumption requires [session_logging] enabled = true in the Vibe configuration; the gateway should validate this to prevent opaque failures.
Anthropic Threshold Validation: Before dispatching, check the stable token count against model-specific minimums. If the stable prefix is below the threshold, the gateway can warn the user or pad the context to ensure caching eligibility.
Codex Usage Parsing: When ingesting telemetry, the parser must handle field name divergence. Use a provider-aware mapper to normalize cached_input_tokens (Codex) and cache_read_input_tokens (Anthropic) into a unified cacheReadTokens metric.
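
The Mistral Vibe adaptation above can be sketched as two small pure functions: one that builds the child environment with VIBE_ACTIVE_MODEL set, and one that removes any --model flag before the arguments reach the binary. The alias table, the model names in it, and both function names are hypothetical placeholders; only the VIBE_ACTIVE_MODEL variable comes from the text.

```typescript
// Sketch: prepare environment and arguments for a Mistral Vibe child process.
type Env = Record<string, string | undefined>;

// Hypothetical alias table; the model names are placeholders, not real identifiers.
const MODEL_ALIASES: Record<string, string> = {
  fast: 'example-small-model',
  best: 'example-large-model',
};

function buildVibeEnv(requestedModel: string, baseEnv: Env): Env {
  // Unknown aliases pass through unchanged.
  const resolved = MODEL_ALIASES[requestedModel] ?? requestedModel;
  // Copy rather than mutate so the parent environment is left untouched.
  return { ...baseEnv, VIBE_ACTIVE_MODEL: resolved };
}

// Drop "--model X" and "--model=X", since the model is supplied via the environment.
function stripModelFlag(args: string[]): string[] {
  const kept: string[] = [];
  let skipNext = false;
  for (const arg of args) {
    if (skipNext) { skipNext = false; continue; } // value that followed --model
    if (arg === '--model') { skipNext = true; continue; }
    if (arg.startsWith('--model=')) continue;
    kept.push(arg);
  }
  return kept;
}

console.log(buildVibeEnv('fast', { PATH: '/usr/bin' }));
console.log(stripModelFlag(['--model', 'fast', '--prompt', 'hello']));
```

In a Node-based gateway the results would typically be handed to `child_process.spawn` as the argument list and the `env` option. The name of the Vibe executable is left out here because the text does not state it.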

Step 4: Observability Surface

Expose cache metrics via MCP resources without leaking prompt content. This maintains the invariant that session storage contains no conversation text.

cache_state://global: Aggregates hit rates, token counts, and estimated savings over a rolling window.
cache_state://session/{id}: Provides per-session breakdowns, including distinct prefix counts and TTL remaining for Claude.
cache_state://prefix/{hash}: Tracks usage of specific stable prefixes across providers, enabling identification of redundant context blocks.
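
A resource handler needs to tell these three URI shapes apart before it can look anything up. The following is a minimal sketch of that routing step; `parseCacheStateUri` and the `CacheStateTarget` type are hypothetical names, and the assumption that prefix hashes are lowercase hex is mine rather than the gateway's documented format.

```typescript
// Discriminated union for the three resource shapes listed above.
type CacheStateTarget =
  | { kind: 'global' }
  | { kind: 'session'; id: string }
  | { kind: 'prefix'; hash: string };

const CACHE_STATE_SCHEME = 'cache_state://';

function parseCacheStateUri(uri: string): CacheStateTarget | null {
  if (!uri.startsWith(CACHE_STATE_SCHEME)) return null;
  const segments = uri.slice(CACHE_STATE_SCHEME.length).split('/');

  if (segments.length === 1 && segments[0] === 'global') {
    return { kind: 'global' };
  }
  if (segments.length === 2) {
    const [kind, value] = segments;
    if (!value) return null; // reject "cache_state://session/"
    if (kind === 'session') return { kind: 'session', id: value };
    // Assumption: hashes are lowercase hex. Adjust if another encoding is used.
    if (kind === 'prefix' && /^[0-9a-f]+$/.test(value)) {
      return { kind: 'prefix', hash: value };
    }
  }
  return null; // unknown shape
}

console.log(parseCacheStateUri('cache_state://global'));
console.log(parseCacheStateUri('cache_state://session/abc123'));
console.log(parseCacheStateUri('cache_state://prefix/9f2c'));
```

Returning `null` for anything unrecognized keeps the handler from guessing, which matters when the resource surface is meant to expose metadata only.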

Architecture Rationale

This solution operates at the CLI wrapping layer, not the API proxy layer. It does not construct JSON request bodies; it composes the input string passed to the CLI binary. This preserves the gateway's architectural thesis while delivering cache efficiency. It avoids the complexity of maintaining provider-specific API schemas and remains resilient to API changes, as it relies on the stable CLI interface.

Pitfall Guide

1. Volatile Prefix Injection

Explanation: Including timestamps, request IDs, or dynamic metadata in the stable portion of the prompt breaks prefix stability. Even a single byte change invalidates the cache.
Fix: Strictly isolate all dynamic data to the userQuery component. Use deterministic separators and avoid injecting runtime variables into system or context blocks.
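
One way to catch this before dispatch is a lint pass over the stable components that flags content that looks volatile. The sketch below is a heuristic under stated assumptions: the three patterns (ISO timestamps, UUIDs, 13-digit epoch milliseconds) are examples chosen for illustration, `findVolatileContent` is a hypothetical helper, and a clean result does not prove the prefix is stable.

```typescript
// Example patterns only; extend or replace them for your own prompts.
// None use the /g flag, so RegExp.test() carries no state between calls.
const VOLATILE_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: 'iso-timestamp', pattern: /\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}/ },
  { label: 'uuid', pattern: /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i },
  { label: 'epoch-millis', pattern: /\b1\d{12}\b/ },
];

// Takes the stable blocks by name and returns "blockName: label" findings.
function findVolatileContent(stableBlocks: Record<string, string>): string[] {
  const findings: string[] = [];
  for (const [blockName, text] of Object.entries(stableBlocks)) {
    for (const { label, pattern } of VOLATILE_PATTERNS) {
      if (pattern.test(text)) findings.push(`${blockName}: ${label}`);
    }
  }
  return findings;
}

const lintFindings = findVolatileContent({
  systemInstruction: 'Today is 2025-01-01T10:00:00Z. Be concise.',
  toolDefinitions: '[{"name":"read_file"}]',
  referenceContext: 'Style guide: prefer early returns.',
});
console.log(lintFindings);
```

Here the timestamp embedded in the system instruction is reported, while the other two blocks pass. Moving the date into userQuery would restore a stable prefix.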

2. Threshold Blindness

Explanation: Sending a 500-token stable prefix to an Opus model and expecting a cache hit. Opus requires 4,096 tokens, so the request never caches and every repetition pays full input price.
Fix: Implement a minStableTokens lookup table per model family. Warn users when stable content is below the threshold and suggest context expansion or model switching.
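
A minimal sketch of such a lookup follows. The threshold values are the ones quoted in this article and should be verified against current provider documentation. The four-characters-per-token estimate is a crude assumption standing in for a real tokenizer, and `checkCacheEligibility` is a hypothetical name.

```typescript
// Minimums as stated in this article, keyed by model family.
const MIN_STABLE_TOKENS: Record<string, number> = {
  sonnet: 1024,
  opus: 4096,
  haiku: 4096,
  haiku_vertex: 2048,
};

// Crude assumption: roughly 4 characters per token. A real tokenizer will differ.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface ThresholdCheck {
  eligible: boolean;
  estimatedTokens: number;
  required: number | null; // null = no known minimum for this family
}

function checkCacheEligibility(modelFamily: string, stablePrefix: string): ThresholdCheck {
  const required = MIN_STABLE_TOKENS[modelFamily] ?? null;
  const estimatedTokens = estimateTokens(stablePrefix);
  return {
    // Unknown families are treated as eligible rather than blocked.
    eligible: required === null || estimatedTokens >= required,
    estimatedTokens,
    required,
  };
}

// A 2,000-character prefix is about 500 estimated tokens: too short for Opus.
const opusCheck = checkCacheEligibility('opus', 'x'.repeat(2000));
if (!opusCheck.eligible) {
  console.warn(
    `Stable prefix is ~${opusCheck.estimatedTokens} tokens; ` +
    `caching needs ${opusCheck.required}.`
  );
}
```

Because the token count is only an estimate, a check like this is better used to warn than to block, particularly for prefixes that sit close to the threshold.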

3. Field Name Assumption

Explanation: Hardcoding cache_read_input_tokens for usage parsing. This fails for Codex, which reports cached_input_tokens instead, so the savings data comes back null.
Fix: Abstract usage parsing behind a provider adapter. Map provider-specific fields to a canonical schema during ingestion.
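
The adapter can be as small as a per-provider field map. The sketch below covers only the two field names given in this article; the shape of the raw usage object and the names `normalizeUsage` and `NormalizedUsage` are assumptions, and other providers would need their own entries.

```typescript
type UsageProvider = 'claude' | 'codex';

// Field names as given in this article.
const CACHE_READ_FIELD: Record<UsageProvider, string> = {
  claude: 'cache_read_input_tokens',
  codex: 'cached_input_tokens',
};

interface NormalizedUsage {
  // null means the field was absent, which is different from a genuine zero.
  cacheReadTokens: number | null;
}

function normalizeUsage(
  provider: UsageProvider,
  rawUsage: Record<string, unknown>
): NormalizedUsage {
  const value = rawUsage[CACHE_READ_FIELD[provider]];
  return { cacheReadTokens: typeof value === 'number' ? value : null };
}

console.log(normalizeUsage('claude', { cache_read_input_tokens: 4096 }));
console.log(normalizeUsage('codex', { cached_input_tokens: 2048 }));
// Wrong field for the provider: reported as absent, not as zero.
console.log(normalizeUsage('codex', { cache_read_input_tokens: 4096 }));
```

Keeping "absent" distinct from zero makes a misconfigured parser visible in the metrics instead of showing up as a run of apparent cache misses.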

4. TTL Certainty Fallacy

Explanation: Assuming local TTL calculations match the provider's actual cache state. Local TTL is best-effort based on lastRequestAt. Provider-side evictions may occur earlier.
Fix: Treat TTL warnings as advisory. Implement a cache_ttl_expiring_soon warning that fires when local TTL is within 30 seconds of expiry, but do not block execution. Acknowledge that cache misses may still occur.
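
The advisory check can be sketched as a pure function of the last request time, the configured TTL, and the current time. Only the `cache_ttl_expiring_soon` warning name and the 30-second window come from the text; `localTtlStatus`, the `TtlStatus` shape, and the `locallyExpired` field are hypothetical.

```typescript
interface TtlStatus {
  remainingMs: number;       // local estimate only; the provider may evict earlier
  locallyExpired: boolean;   // true when the local estimate has run out
  warning: 'cache_ttl_expiring_soon' | null;
}

const WARN_WINDOW_MS = 30 * 1000;

// All times are epoch milliseconds; `now` is a parameter so the function stays testable.
function localTtlStatus(lastRequestAt: number, ttlSeconds: number, now: number): TtlStatus {
  const remainingMs = lastRequestAt + ttlSeconds * 1000 - now;
  if (remainingMs <= 0) {
    return { remainingMs: 0, locallyExpired: true, warning: null };
  }
  return {
    remainingMs,
    locallyExpired: false,
    warning: remainingMs <= WARN_WINDOW_MS ? 'cache_ttl_expiring_soon' : null,
  };
}

// 300-second TTL, last request at t=0, checked at t=280s: 20s left, so warn.
const ttlCheck = localTtlStatus(0, 300, 280 * 1000);
if (ttlCheck.warning) {
  console.warn(`${ttlCheck.warning}: ~${ttlCheck.remainingMs / 1000}s left (advisory)`);
}
```

The function never throws and never blocks; the caller decides whether to surface the warning, which keeps it consistent with treating local TTL as best-effort.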

5. CLI vs. API Confusion

Explanation: Attempting to inject cache_control JSON markers into CLI arguments. These markers belong in API request bodies; a CLI wrapper has nowhere to put them, and passing them as arguments may cause the binary to reject the invocation.
Fix: Rely on prefix discipline for CLI-based caching. Reserve explicit cache control markers for API proxy implementations. Keep the CLI wrapper focused on input composition.

6. Mistral Model Flag Error

Explanation: Passing --model to the Mistral Vibe binary. Vibe ignores this flag, causing model resolution failures.
Fix: Use VIBE_ACTIVE_MODEL environment variable for model injection. Document this divergence clearly in the gateway configuration.

7. Session Logging Neglect

Explanation: Failing to enable session logging for providers that require it for resumption. This leads to opaque errors when attempting to continue sessions.
Fix: Validate configuration flags (e.g., [session_logging] enabled = true) during gateway initialization. Surface clear error messages if prerequisites are missing.
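
As an illustration of the startup check, the sketch below scans configuration text for `enabled = true` under a `[session_logging]` section. It is deliberately simplified: it handles only plain section headers and `key = value` lines with naive comment stripping, and a real gateway should use a proper TOML parser. `isSessionLoggingEnabled` is a hypothetical name, and how the gateway locates the Vibe configuration file is not covered.

```typescript
// Simplified scanner, NOT a TOML parser: no nested tables, inline tables,
// multi-line strings, or '#' characters inside quoted values.
function isSessionLoggingEnabled(tomlText: string): boolean {
  let currentSection = '';
  for (const rawLine of tomlText.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, '').trim(); // naive comment strip
    if (line === '') continue;

    const header = /^\[([^\]]+)\]$/.exec(line);
    if (header) {
      currentSection = (header[1] ?? '').trim();
      continue;
    }
    if (currentSection !== 'session_logging') continue;

    const pair = /^enabled\s*=\s*(\S+)$/.exec(line);
    if (pair) return pair[1] === 'true';
  }
  return false; // section or key missing
}

const sampleConfig = [
  '[ui]',
  'enabled = false',
  '',
  '[session_logging]',
  'enabled = true  # required for session resumption',
].join('\n');

if (!isSessionLoggingEnabled(sampleConfig)) {
  throw new Error('Session resumption requires [session_logging] enabled = true');
}
```

Failing at initialization with a message that names the missing setting replaces the opaque resumption error with one the user can act on.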

Production Bundle

Action Checklist

Audit existing prompt construction logic for volatile data interleaving.
Implement PromptComponents interface and canonical composition function.
Configure provider-specific minimum token thresholds in gateway settings.
Update usage parsers to handle field name divergences (e.g., Codex vs. Anthropic).
Enable cache_state:// MCP resources for observability.
Test TTL warnings for Claude sessions with long intervals.
Verify Mistral Vibe configuration for VIBE_ACTIVE_MODEL and session logging.
Run load tests to validate cache hit rates across all five providers.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
High-volume, static context	Structured Prompt Composition	Maximizes cache hits via prefix stability.	Significant reduction in input token costs.
One-off, unique queries	Raw Prompt String	Overhead of composition is unnecessary.	Neutral.
Multi-provider routing	CLI Gateway Wrapper	Abstracts provider nuances; enables diverse sampling.	Enables cost-effective model comparison.
Low-latency requirements	Structured Prompt + TTL Warning	Prevents cache misses due to expiry.	Reduces retry latency and wasted tokens.
Strict privacy constraints	CLI Wrapper with Hash-only Observability	No prompt text stored; only metadata exposed.	Maintains compliance while tracking efficiency.

Configuration Template

[cache_awareness]
enabled = true
warn_on_ttl_expiry = true
anthropic_ttl_seconds = 300

[cache_awareness.thresholds]
# Minimum stable tokens required to trigger caching
sonnet = 1024
opus = 4096
haiku = 4096
haiku_vertex = 2048

[providers.mistral]
# Mistral Vibe requires env var injection
model_env_var = "VIBE_ACTIVE_MODEL"
session_logging_required = true

[observability]
expose_cache_resources = true
privacy_mode = "hash_only"
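
For the hash_only privacy mode, one plausible reading is that the gateway records a digest of each stable prefix plus size metadata and never stores the text. The sketch below shows that idea with SHA-256 from Node's built-in crypto module. The choice of SHA-256, the `PrefixRecord` shape, and `describePrefix` are assumptions, not the gateway's documented behavior.

```typescript
import { createHash } from 'node:crypto';

// Hash-only record: enough to group identical prefixes, with no prompt content.
interface PrefixRecord {
  prefixHash: string; // SHA-256 hex digest of the stable prefix
  byteLength: number; // UTF-8 size of the prefix, metadata only
}

function describePrefix(stablePrefix: string): PrefixRecord {
  const prefixHash = createHash('sha256').update(stablePrefix, 'utf8').digest('hex');
  return { prefixHash, byteLength: Buffer.byteLength(stablePrefix, 'utf8') };
}

const recordA = describePrefix('You are a code reviewer.');
const recordB = describePrefix('You are a code reviewer.');
const recordC = describePrefix('You are a code reviewer!');

// Identical prefixes map to the same key; a one-character change does not.
console.log(recordA.prefixHash === recordB.prefixHash);
console.log(recordA.prefixHash === recordC.prefixHash);
```

A plain digest does not hide short or guessable prefixes from someone who can try candidates, so deployments with strict privacy requirements may want a keyed hash instead.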

Quick Start Guide

Install Gateway: Deploy llm-cli-gateway v1.6.0 or later. Ensure all provider CLIs are accessible.
Define Components: Refactor your prompt generation to use PromptComponents. Separate system, tools, context, and query.
Compose and Dispatch: Use the canonical composition function to generate the payload. Pass the result to the gateway's request handler.
Monitor Efficiency: Query cache_state://global to verify hit rates. Adjust context blocks if thresholds are not met.
Enable Warnings: Set warn_on_ttl_expiry = true in configuration to receive alerts for expiring Claude caches.
