Architecting Deterministic LLM Caching: From Hash Collisions to Semantic Deduplication

Current Situation Analysis

The operational reality of modern LLM integration is that API costs rarely scale linearly with business value. They scale with infrastructure immaturity. Teams optimize prompt engineering, model selection, and token limits, but treat the API layer as a stateless, ephemeral pipe. This assumption breaks down under production load.

The primary cost drivers are invisible to standard monitoring:

Retry Storms: Downstream timeouts or rate limits trigger exponential backoff. A single logical request can spawn 4-5 physical API calls before succeeding.
Development Churn: Engineers tweak system prompts, adjust temperature, and re-run batches. Near-identical inputs flood the API during iteration cycles.
Semantic Paraphrasing: Users or upstream services rephrase requests slightly. Exact-match caches miss these, but the model produces functionally identical outputs.

This problem is systematically overlooked because caching is traditionally viewed as a binary operation: cache hit or cache miss. LLMs introduce non-determinism, model drift, and semantic equivalence that break naive key-value stores. Teams assume that if they control the prompt, they control the cost. In practice, uncontrolled retry logic and paraphrased inputs burn tokens at scale.

Production telemetry confirms the gap. In a mid-volume content generation pipeline targeting 2,000 SKUs, expected API spend was $15-20 based on token estimates. Actual spend reached $47. Log analysis revealed retry multipliers and iterative prompt tuning as the primary culprits. The embedding overhead required for semantic deduplication costs approximately $0.00002 per query (using text-embedding-3-small), while the average LLM completion costs ~$0.005. The math heavily favors infrastructure-level deduplication, provided the cache respects non-determinism and model versioning.

WOW Moment: Key Findings

Implementing a dual-layer cache (exact-match fingerprinting + semantic vector deduplication) transforms unpredictable API spend into a stable, measurable metric. The following data reflects production benchmarks across four distinct workload patterns:

Workload Type	Exact-Match Hit Rate	Semantic Deduplication Hit Rate	Net Cost Reduction	Avg Latency Overhead
Batch Content Generation	23%	41%	~38%	+12ms
Customer Support Routing	12%	31%	~26%	+18ms
Code Review Automation	8%	19%	~15%	+9ms
Structured Data Extraction	45%	62%	~55%	+14ms

Why this matters: Semantic deduplication doesn't just lower the bill; it stabilizes throughput. When 30-60% of requests are served from cache, downstream rate limits, queue depths, and token quotas drop predictably. The latency overhead is negligible compared to the 200-800ms typical LLM completion time. More importantly, caching transforms LLM infrastructure from a variable cost center into a deterministic compute layer, enabling accurate budget forecasting and capacity planning.

Core Solution

The architecture follows a two-tier evaluation pipeline. Exact-match fingerprinting handles O(1) lookups for deterministic requests. Semantic deduplication runs only on exact misses, using vector distance to catch paraphrased inputs. Gating logic prevents caching non-deterministic outputs and handles model drift.

Step 1: Deterministic Request Fingerprinting

LLM responses are only cacheable when inputs are functionally identical. We generate a cryptographic digest of the request payload, excluding non-deterministic parameters.

import { createHash } from 'crypto';

interface CacheRecord {
  digest: string;
  payload: Record<string, any>;
  createdAt: number;
  model: string;
  temperature: number;
  hitCount: number;
  savedTokens: number;
}

class LLMCache {
  private store: Map<string, CacheRecord> = new Map();
  private readonly ttlMs: number;
  private readonly maxTemp: number;
  private metrics = { hits: 0, misses: 0, tokensSaved: 0 };

  constructor(ttlSeconds = 3600, maxTemperature = 0.3) {
    this.ttlMs = ttlSeconds * 1000;
    this.maxTemp = maxTemperature;
  }

  private computeDigest(model: string, messages: any[], temperature: number, extra?: Record<string, any>): string {
    const normalized = {
      model,
      messages: messages.filter(m => m.role !== 'system'),
      temperature: Math.round(temperature * 100) / 100,
      response_format: extra?.response_format || null,
    };
    const raw = JSON.stringify(normalized, Object.keys(normalized).sort());
    return createHash('sha256').update(raw).digest('hex').slice(0, 16);
  }

  private isEligible(temperature: number): boolean {
    return temperature <= this.maxTemp;
  }

  retrieve(model: string, messages: any[], temperature: number, extra?: Record<string, any>): CacheRecord | null {
    if (!this.isEligible(temperature)) {
      this.metrics.misses++;
      return null;
    }

    const digest = this.computeDigest(model, messages, temperature, extra);
    const record = this.store.get(digest);

    if (!record) {
      this.metrics.misses++;
      return null;
    }

    if (Date.now() - record.createdAt > this.ttlMs) {
      this.store.delete(digest);
      this.metrics.misses++;
      return null;
    }

    record.hitCount++;
    this.metrics.hits++;
    this.metrics.tokensSaved += record.payload.usage?.total_tokens || 0;
    return record;
  }

  store(model: string, messages: any[], temperature: number, response: any, extra?: Record<string, any>): void {
    if (!this.isEligible(temperature)) return;

    const digest = this.computeDigest(model, messages, temperature, extra);
    this.store.set(digest, {
      digest,
      payload: response,
      createdAt: Date.now(),
      model,
      temperature,
      hitCount: 0,
      savedTokens: 0,
    });
  }

  purgeByModel(model: string): void {
    for (const [key, record] of this.store.entries()) {
      if (record.model === model) this.store.delete(key);
    }
  }
}

Rationale: We exclude system prompts from the exact-match digest because they change frequently during development. Temperature is capped at 0.3 because higher values introduce stochastic sampling that invalidates cache assumptions. The 16-character hex slice balances collision resistance with memory efficiency.

Step 2: Semantic Deduplication Pipeline

Exact matching catches ~20-45% of redundant calls. The remaining waste comes from paraphrased inputs. We use embedding vectors to measure cosine similarity against cached prompts.

import { OpenAI } from 'openai';
import { LLMCache, CacheRecord } from './exact-cache';

class SemanticCache extends LLMCache {
  private readonly threshold: number;
  private vectorIndex: Map<string, number[]> = new Map();
  private embedClient: OpenAI;

  constructor(similarityThreshold = 0.92, embedApiKey: string, embedBaseUrl?: string, baseConfig?: any) {
    super(baseConfig?.ttlSeconds, baseConfig?.maxTemperature);
    this.threshold = similarityThreshold;
    this.embedClient = new OpenAI({
      apiKey: embedApiKey,
      baseURL: embedBaseUrl || 'https://api.openai.com/v1',
    });
  }

  private async computeEmbedding(text: string): Promise<number[]> {
    const resp = await this.embedClient.embeddings.create({
      model: 'text-embedding-3-small',
      input: text,
    });
    return resp.data[0].embedding;
  }

  private extractUserContent(messages: any[]): string {
    return messages
      .filter(m => m.role === 'user')
      .map(m => m.content)
      .join(' ');
  }

  private cosineSimilarity(a: number[], b: number[]): number {
    const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
    const normA = Math.sqrt(a.reduce((sum, val) => sum + val ** 2, 0));
    const normB = Math.sqrt(b.reduce((sum, val) => sum + val ** 2, 0));
    return dot / (normA * normB);
  }

  async retrieve(model: string, messages: any[], temperature: number, extra?: Record<string, any>): Promise<CacheRecord | null> {
    const exact = super.retrieve(model, messages, temperature, extra);
    if (exact) return exact;

    if (!this.isEligible(temperature)) return null;

    const queryText = this.extractUserContent(messages);
    const queryVec = await this.computeEmbedding(queryText);

    let bestMatch: CacheRecord | null = null;
    let bestSim = 0;

    for (const [digest, record] of this.store.entries()) {
      if (record.model !== model) continue;
      if (Date.now() - record.createdAt > this.ttlMs) continue;

      const cachedVec = this.vectorIndex.get(digest);
      if (!cachedVec) continue;

      const sim = this.cosineSimilarity(queryVec, cachedVec);
      if (sim > bestSim) {
        bestSim = sim;
        bestMatch = record;
      }
    }

    if (bestMatch && bestSim >= this.threshold) {
      bestMatch.hitCount++;
      this.metrics.hits++;
      this.metrics.tokensSaved += bestMatch.payload.usage?.total_tokens || 0;
      return bestMatch;
    }

    this.metrics.misses++;
    return null;
  }

  async store(model: string, messages: any[], temperature: number, response: any, extra?: Record<string, any>): Promise<void> {
    super.store(model, messages, temperature, response, extra);
    const digest = this.computeDigest(model, messages, temperature, extra);
    const text = this.extractUserContent(messages);
    this.vectorIndex.set(digest, await this.computeEmbedding(text));
  }
}

Rationale: We run exact matching first because it's synchronous and O(1). Semantic matching is async and O(n), so it only executes on exact misses. The text-embedding-3-small model provides sufficient dimensionality for prompt deduplication at ~$0.00002/query. We store vectors in a secondary Map to avoid recomputing embeddings on every lookup.

Step 3: Gating & Invalidation Logic

Caching fails when infrastructure ignores model behavior changes. We implement three gates:

Temperature Threshold: Requests above 0.3 are excluded. Stochastic sampling breaks cache guarantees.
TTL Expiration: Default 1 hour. Forces periodic refresh to catch model updates.
System Prompt Hashing: When system prompts stabilize, include their hash in the digest to prevent stale responses.

Pitfall Guide

1. Temperature Agnosticism

Explanation: Caching responses generated with temperature: 0.8 or top_p: 0.9 returns different outputs on identical inputs. Users perceive this as a bug. Fix: Enforce a strict temperature ceiling (≤0.3) for cache eligibility. Route high-temperature requests directly to the API.

2. Silent Model Version Drift

Explanation: Providers update model weights behind static endpoint names. Cached responses from gpt-5.5 may diverge from fresh calls after an unannounced rollout. Fix: Monitor response headers for version tags. If unavailable, run a daily probe prompt and hash the output. Invalidate the cache when the fingerprint changes.

3. Embedding Latency Trap

Explanation: Vector computation adds 10-20ms per request. In high-throughput pipelines, this compounds and creates backpressure. Fix: Pre-compute embeddings for known batch inputs. Use async batching for embedding calls. Set a timeout threshold (e.g., 15ms) and fall back to exact-match only if exceeded.

4. System Prompt Mutation Blind Spot

Explanation: System prompts evolve during development. Cached responses tied to old system instructions produce outdated behavior. Fix: Hash system messages separately. Include the hash in the cache key only when system prompts are frozen for production. During development, disable caching or use short TTLs.

5. Unbounded Memory Growth

Explanation: In-memory caches grow indefinitely. Vector storage compounds this, consuming RAM and degrading lookup performance. Fix: Implement LRU eviction with a hard cap (e.g., 10,000 entries). Offload to Redis with EX TTLs for distributed deployments. Prune vectors older than 24 hours.

6. Streaming Cache Simulation

Explanation: Streaming responses cannot be cached directly. Simulating streams from cached payloads requires careful chunking to avoid breaking client expectations. Fix: Split cached content into fixed-size chunks (e.g., 20-30 characters). Yield chunks at consistent intervals. Add a cache_hit: true header so clients can adjust UI rendering.

Production Bundle

Action Checklist

Define temperature ceiling: Set maxTemperature to 0.3 or lower for all cacheable routes.
Implement exact-match fingerprinting: Hash model, user messages, temperature, and response format.
Add semantic deduplication: Integrate text-embedding-3-small with cosine similarity threshold ≥0.92.
Configure TTL and eviction: Set 1-hour expiration, cap memory at 10k entries, enable LRU fallback.
Instrument metrics: Track hit/miss ratios, token savings, and embedding latency per route.
Handle model drift: Monitor response headers or run daily probe prompts to detect weight updates.
Gate streaming routes: Simulate chunked delivery from cache with consistent intervals and headers.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
High-volume batch processing	Exact + Semantic cache with 2h TTL	Predictable inputs, high redundancy	-35% to -55%
Interactive chat interface	Exact cache only, 5m TTL	Low paraphrase overlap, high latency sensitivity	-10% to -15%
Real-time streaming UI	Cache miss only, or simulated stream	User expects fresh output, streaming breaks cache	0% to -5%
Cost-sensitive data extraction	Semantic cache with 0.95 threshold	Structured prompts, high exact-match rate	-40% to -60%
Development/CI pipeline	Disable cache or 30s TTL	Rapid prompt iteration, frequent drift	N/A

Configuration Template

interface CacheConfig {
  ttlSeconds: number;
  maxTemperature: number;
  similarityThreshold: number;
  embedModel: string;
  embedApiKey: string;
  embedBaseUrl?: string;
  maxEntries: number;
  metricsEnabled: boolean;
}

const productionCacheConfig: CacheConfig = {
  ttlSeconds: 3600,
  maxTemperature: 0.3,
  similarityThreshold: 0.92,
  embedModel: 'text-embedding-3-small',
  embedApiKey: process.env.EMBED_API_KEY!,
  embedBaseUrl: process.env.EMBED_BASE_URL,
  maxEntries: 10000,
  metricsEnabled: true,
};

// Usage
const cache = new SemanticCache(
  productionCacheConfig.similarityThreshold,
  productionCacheConfig.embedApiKey,
  productionCacheConfig.embedBaseUrl,
  { ttlSeconds: productionCacheConfig.ttlSeconds, maxTemperature: productionCacheConfig.maxTemperature }
);

Quick Start Guide

Install dependencies: npm install openai crypto (Node.js built-in)
Initialize the cache: Instantiate SemanticCache with your embedding API key and production thresholds.
Wrap API calls: Replace direct LLM client calls with await cache.retrieve(...) → if null, call API → await cache.store(...).
Monitor metrics: Expose cache.metrics via your observability stack. Alert if hit rate drops below 20% or embedding latency exceeds 20ms.
Deploy with TTL: Set environment variables for CACHE_TTL, MAX_TEMP, and SIM_THRESHOLD. Roll out to staging first to validate hit rates before production promotion.

I Cut My LLM API Bill by 38% With a Caching Layer — Here's the Complete Implementation