Building a cost-efficient LLM caching layer in Python

By Codcompass Team · 9 min read

Current Situation Analysis

Language model API expenditure has become one of the most volatile line items in modern AI infrastructure. Teams routinely optimize prompt templates, fine-tune model selections, and implement streaming responses, yet they frequently neglect a fundamental inefficiency: request duplication. In production environments serving customer support, internal knowledge assistants, or automated research pipelines, 30–50% of incoming traffic consists of exact repeats or semantically equivalent queries. This redundancy is rarely visible in standard API dashboards, which aggregate token consumption without distinguishing novel requests from repeated ones.

The problem is systematically overlooked because infrastructure teams treat LLM calls as stateless, one-off computations. When a user rephrases a question or retries a failed request, or when automated bots poll the same endpoint, the system pays full price for work it has already done. Without a routing layer that intercepts these patterns, organizations absorb unnecessary latency and unpredictable billing spikes.

Consider a baseline workload processing 100,000 queries daily. Assuming an average request-response cycle consumes 500 tokens and the target model charges $0.01 per 1,000 tokens, the uncached daily expenditure sits at $500. Introducing a caching mechanism that captures just 40% of traffic reduces API calls to 60,000, dropping daily costs to $300 and yielding $6,000 in monthly savings. In mature support or documentation systems, cache hit rates frequently exceed 60% once the index warms, pushing monthly savings past $9,000. The mathematical advantage is clear, but realizing it requires an architecture that balances deterministic matching with semantic understanding without introducing prohibitive lookup latency.
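The arithmetic above can be reproduced with a short sketch. The function name `daily_cost` is hypothetical, a 30-day month is assumed (which is what the $6,000 figure implies), and the cost of computing embeddings for the semantic tier is ignored for simplicity.

```python
def daily_cost(queries_per_day, tokens_per_query, price_per_1k_tokens, hit_rate=0.0):
    """Daily API spend when a fraction `hit_rate` of queries is served from cache."""
    # Only cache misses reach the paid API.
    billed_queries = queries_per_day * (1.0 - hit_rate)
    return billed_queries * tokens_per_query / 1000 * price_per_1k_tokens


# Figures taken from the paragraph above.
baseline = daily_cost(100_000, 500, 0.01)                  # no caching
cached_40 = daily_cost(100_000, 500, 0.01, hit_rate=0.40)  # 40% hit rate
cached_60 = daily_cost(100_000, 500, 0.01, hit_rate=0.60)  # 60% hit rate

# Assumption: 30 billing days per month.
monthly_savings_40 = (baseline - cached_40) * 30
monthly_savings_60 = (baseline - cached_60) * 30

print(f"baseline: ${baseline:,.0f}/day")
print(f"40% hits: ${cached_40:,.0f}/day, saves ${monthly_savings_40:,.0f}/month")
print(f"60% hits: ${cached_60:,.0f}/day, saves ${monthly_savings_60:,.0f}/month")
```

Swapping in your own traffic volume, token count, and per-token price gives a quick estimate of whether a cache is worth building before any infrastructure work starts.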

WOW Moment: Key Findings

The most impactful insight emerges when comparing routing strategies across cost, latency, and operational complexity. A single-tier approach either misses paraphrased intent (exact-only) or incurs heavy vector search overhead (semantic-only). A two-tier design captures the best of both worlds.

| Approach | Daily API Cost | Avg Lookup Latency | Hit Rate Potential | Infra Complexity |
| --- | --- | --- | --- | --- |
| No Caching | $500 | ~1,200 ms | 0% | Minimal |
| Exact-Only (SHA-256) | $350 | ~2 ms | 25–30% | Low |
| Semantic-Only (Vector Scan) | $200 | ~45 ms | 55–65% | High |
| Two-Tier (Exact + Semantic) | $200 | ~5 ms (hit) / ~1,200 ms (miss) | 55–65% | Moderate |

This finding matters because it decouples cost reduction from latency penalties. The exact tier acts as a high-speed filter for bots, retries, and UI duplicates, while the semantic tier catches natural language variations. Together, they deliver near-maximum cost reduction with lookup times that remain imperceptible to end users. The architecture also scales predictably: exact lookups remain O(1), and semantic searches can be offloaded to specialized vector stores once the dataset exceeds linear scan thresholds.

Core Solution

The implementation relies on a request router that evaluates incoming prompts through two sequential filters before falling back to the language model. Each tier serves a distinct purpose, and the routing logic is designed to fail fast on misses while guaranteeing dual-write consistency on hits.
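The routing order described above might look roughly like the following sketch. The name `route` and the idea of passing each tier in as a callable are assumptions made for illustration, not the article's actual interface. Writing a fresh response back into the cache is deliberately left out here.

```python
from typing import Callable, Optional, Tuple

# A lookup takes a prompt and returns a cached response, or None on a miss.
Lookup = Callable[[str], Optional[str]]


def route(
    prompt: str,
    exact_lookup: Lookup,
    semantic_lookup: Lookup,
    call_model: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (response, source), where source is 'exact', 'semantic', or 'model'."""
    # Tier 1: cheapest check first.
    cached = exact_lookup(prompt)
    if cached is not None:
        return cached, "exact"

    # Tier 2: only consulted when the exact tier misses.
    cached = semantic_lookup(prompt)
    if cached is not None:
        return cached, "semantic"

    # Both tiers missed: fall back to the language model.
    return call_model(prompt), "model"


if __name__ == "__main__":
    exact_store = {"What is your refund policy?": "Refunds are available within 30 days."}
    response, source = route(
        "What is your refund policy?",
        exact_store.get,
        lambda p: None,                      # stand-in semantic tier that always misses
        lambda p: f"(model answer for: {p})",  # stand-in for a real API call
    )
    print(source, "->", response)
```

Returning the source alongside the response makes it straightforward to log per-tier hit rates, which is the number the cost estimates depend on.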

Architecture Rationale

  1. Tier 1: Deterministic Hashing
    A cryptographic hash of the prompt combined with the target model identifier creates a unique key. This catches exact duplicates instantly. It is nearly free to compute, requires no embedding call, and handles retries, automated scripts, and UI state refreshes.

  2. Tier 2: Semantic Similarity
    When the exact tier misses, the prompt is embedded using a lightweight model (text-embedding-3-small). The resulting vector is compared against stored embeddings using cosine similarity. A configurable threshold (default 0.92) determines whether the intent matches a previously cached response. This tier handles paraphrasing, synonym substitution, and minor structural variations.

  3. Fallback & Dual Write
    If both tiers miss, the
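The first two tiers above can be sketched with the standard library alone. The helper names (`exact_key`, `cosine_similarity`, `best_semantic_match`) are hypothetical, and SHA-256 is assumed as the hash. Embedding vectors are passed in as plain lists of floats. In practice they would come from an embedding model such as the text-embedding-3-small mentioned above, which this sketch does not call.

```python
import hashlib
import math
from typing import List, Optional, Sequence, Tuple


def exact_key(model: str, prompt: str) -> str:
    """Tier 1 key: SHA-256 over the model identifier and the prompt."""
    # A NUL byte separates the fields so ("ab", "c") and ("a", "bc") differ.
    payload = model.encode("utf-8") + b"\x00" + prompt.encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def best_semantic_match(
    query_vec: Sequence[float],
    entries: List[Tuple[Sequence[float], str]],
    threshold: float = 0.92,
) -> Optional[str]:
    """Tier 2: linear scan over (embedding, response) pairs.

    Returns the most similar response if it reaches the threshold, else None.
    """
    best_score, best_response = -1.0, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None


if __name__ == "__main__":
    print(exact_key("some-model", "What is your refund policy?"))
    stored = [([0.0, 1.0], "answer A"), ([0.9, 0.1], "answer B")]
    print(best_semantic_match([1.0, 0.0], stored))  # close to the second vector
    print(best_semantic_match([0.5, 0.5], stored))  # nothing reaches 0.92
```

The linear scan is fine for small indexes. For larger ones, `best_semantic_match` is the piece that would be replaced by a query against a vector store.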
