Difficulty: Intermediate
Read Time: 9 min

99. Build a Chatbot With Memory

By Codcompass Team · 9 min read

Engineering Persistent Context: Strategies for Stateful LLM Applications

Current Situation Analysis

Large Language Models (LLMs) are, at their core, stateless functions. A model invocation accepts a prompt and returns a completion; it retains no internal state between calls. This architectural reality creates a fundamental friction point for developers building conversational interfaces: users expect continuity, but the model expects isolation.

The common misconception is that "chat" implies memory. In reality, memory is an application-layer responsibility. Every turn in a conversation must be reconstructed and injected into the prompt context. As conversations lengthen, this approach collides with hard constraints: context window limits and token economics.

The Data Reality:

  • Context Limits: Models like GPT-3.5-turbo support 16k tokens, while GPT-4 variants support up to 128k tokens. However, the effective usable context is often lower because of the "lost in the middle" phenomenon: models tend to recall information placed in the middle of a long context less reliably than information near its beginning or end.
  • Token Growth: A verbose multi-turn exchange can consume tokens rapidly. Assuming an average of 4 characters per token and ~4 tokens of overhead per message role/formatting, a 20-turn conversation with detailed responses can easily exceed 4,000 tokens. When combined with system instructions and Retrieval-Augmented Generation (RAG) context, the budget depletes quickly.
  • Cost Implications: Input tokens are billed per request. A naive implementation that resends the entire history on every turn makes each request's cost grow linearly with conversation length, so the cumulative cost of a session grows roughly quadratically, making long-running sessions prohibitively expensive.
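
To make the arithmetic above concrete, here is a minimal sketch of a token estimator built only on the stated heuristics (about 4 characters per token, about 4 tokens of overhead per message). The names `estimateTokens`, `estimateHistoryTokens`, and `cumulativeInputTokens` are illustrative assumptions, and a real application would use the model's actual tokenizer rather than a character count.

```typescript
// Heuristic constants taken from the assumptions above; not a real tokenizer.
const CHARS_PER_TOKEN = 4;
const OVERHEAD_PER_MESSAGE = 4;

interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Estimate the tokens consumed by one message, including role/formatting overhead.
function estimateTokens(message: Message): number {
  return Math.ceil(message.content.length / CHARS_PER_TOKEN) + OVERHEAD_PER_MESSAGE;
}

// Estimate the tokens consumed by a whole history.
function estimateHistoryTokens(history: Message[]): number {
  return history.reduce((sum, m) => sum + estimateTokens(m), 0);
}

// Total input tokens sent over a session when the full history is resent every turn.
// Simplifying assumption: request number n carries n exchanges of `tokensPerTurn` each.
function cumulativeInputTokens(turns: number, tokensPerTurn: number): number {
  let total = 0;
  for (let turn = 1; turn <= turns; turn++) {
    total += turn * tokensPerTurn;
  }
  return total;
}
```

Under these assumptions, a 20-turn session averaging 200 tokens per turn ends with a 4,000-token history, yet `cumulativeInputTokens(20, 200)` shows that 42,000 input tokens were sent in total across the session.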

WOW Moment: Key Findings

The choice of memory strategy dictates the trade-off between context fidelity, token efficiency, and implementation complexity. The following comparison highlights how different approaches handle long-running sessions.

| Strategy | Context Fidelity | Token Efficiency | Latency Impact | Best Use Case |
|---|---|---|---|---|
| Sliding Window | High (Recent), Zero (Old) | Low | Minimal | Task-oriented bots, short troubleshooting |
| Summary Compression | Medium (Abstracted), High (Recent) | High | Moderate (Summarization call) | Creative writing, long-form support |
| Entity Extraction | High (Facts), Low (Narrative) | Very High | High (Extraction call) | Personalized assistants, CRM integration |

Why This Matters: Entity extraction offers the highest token efficiency for applications requiring long-term user profiling, as it compresses narrative history into structured key-value pairs. Summary compression balances narrative flow with cost, making it ideal for scenarios where the "gist" of earlier interactions matters more than verbatim details. Sliding windows remain the most cost-effective for sessions where only immediate context is relevant.
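
As an illustration of how entity extraction compresses narrative into key-value pairs, the sketch below keeps facts in a plain object and renders them as a short block for the prompt. The names `EntityFacts`, `mergeFacts`, and `renderFacts` are hypothetical, and the extraction step itself (normally a separate model call) is assumed to have already produced the `extracted` objects.

```typescript
// Hypothetical fact store: facts are key-value pairs rather than raw transcript.
type EntityFacts = Record<string, string>;

// Merge newly extracted facts into the store; newer values overwrite older ones.
function mergeFacts(store: EntityFacts, extracted: EntityFacts): EntityFacts {
  return { ...store, ...extracted };
}

// Render the facts as a compact text block suitable for a system prompt.
// Keys are sorted so the output is stable from turn to turn.
function renderFacts(store: EntityFacts): string {
  const keys = Object.keys(store).sort();
  if (keys.length === 0) return "";
  const lines = keys.map((key) => `- ${key}: ${store[key]}`);
  return ["Known facts about the user:"].concat(lines).join("\n");
}

// Usage: facts gathered over several turns collapse into a few short lines.
let facts: EntityFacts = {};
facts = mergeFacts(facts, { name: "Ada", plan: "free" });
facts = mergeFacts(facts, { plan: "pro" }); // a later turn updates an earlier fact
const factBlock = renderFacts(facts);
```

However long the conversation that produced them, the rendered block grows only with the number of distinct facts, which is where the token savings come from.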

Core Solution

Building a robust stateful chatbot requires abstracting memory management from the chat loop. The recommended architecture employs the Strategy Pattern, allowing dynamic selection of memory behaviors without altering the core inference logic. This approach supports asynchronous operations, strict token budgeting, and extensible memory types.

Architecture Decisions

  1. Strategy Pattern: Decouples memory logic. The ChatSession interacts with a MemoryStrategy interface, enabling runtime swapping between sliding windows, summaries, or entity stores.
  2. Token Budgeting: Every strategy must respect a global token budget. The memory manager calculates current usage and truncates or compresses history to fit within the limit before sending to the LLM.
  3. Entity Store: For long-term personalization, an entity store extracts and persists facts (e.g., user preferences, names) separately from the conversation history, reducing context bloat.
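
The first two decisions can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive design: the source names `ChatSession` and `MemoryStrategy` but does not show their shape, so the method names (`add`, `buildContext`, `record`, `buildPrompt`), the `SlidingWindowStrategy` class, and the character-based token estimate are all hypothetical.

```typescript
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical shape of the MemoryStrategy interface.
interface MemoryStrategy {
  add(message: Message): void;
  // Returns the messages to send, fitted to the given token budget.
  buildContext(tokenBudget: number): Message[];
}

// Heuristic estimate (~4 chars per token + ~4 tokens overhead); not a real tokenizer.
function estimateTokens(message: Message): number {
  return Math.ceil(message.content.length / 4) + 4;
}

// Sliding window: keep the most recent messages that fit in the budget.
class SlidingWindowStrategy implements MemoryStrategy {
  private history: Message[] = [];

  add(message: Message): void {
    this.history.push(message);
  }

  buildContext(tokenBudget: number): Message[] {
    const kept: Message[] = [];
    let used = 0;
    // Walk from newest to oldest and stop at the first message that does not fit.
    for (const message of this.history.slice().reverse()) {
      const cost = estimateTokens(message);
      if (used + cost > tokenBudget) break;
      kept.unshift(message); // restore chronological order
      used += cost;
    }
    return kept;
  }
}

// The session depends only on the interface, so any strategy can be injected.
class ChatSession {
  private memory: MemoryStrategy;
  private systemPrompt: string;
  private tokenBudget: number;

  constructor(memory: MemoryStrategy, systemPrompt: string, tokenBudget: number) {
    this.memory = memory;
    this.systemPrompt = systemPrompt;
    this.tokenBudget = tokenBudget;
  }

  record(message: Message): void {
    this.memory.add(message);
  }

  // System prompt first, then whatever the strategy can fit in the remaining budget.
  buildPrompt(): Message[] {
    const system: Message = { role: "system", content: this.systemPrompt };
    const remaining = Math.max(0, this.tokenBudget - estimateTokens(system));
    return [system].concat(this.memory.buildContext(remaining));
  }
}
```

Because `ChatSession` never inspects how the context was produced, a summary or entity strategy implementing the same interface could replace `SlidingWindowStrategy` without changes to the session code.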

Implementation (TypeScript)

The following implementation provides a production-ready foundation. It includes a sliding window strategy, a summary strategy, and an entity extraction strategy, all governed by a token-aware manager.
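
As one illustration of the summary strategy named above, the following is a minimal sketch rather than the article's own listing. `Summarizer`, `SummaryState`, `splitOverflow`, `compress`, and `buildContext` are hypothetical names, and the summarizer is passed in as a function because the actual model call depends on the provider being used.

```typescript
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Hypothetical signature: in practice this would wrap an LLM summarization call.
type Summarizer = (previousSummary: string, messages: Message[]) => Promise<string>;

interface SummaryState {
  summary: string;   // running summary of everything already compressed
  recent: Message[]; // verbatim tail of the conversation
}

// Pure step: split off the messages that no longer fit in the verbatim window.
function splitOverflow(
  recent: Message[],
  keepRecent: number,
): { overflow: Message[]; kept: Message[] } {
  const cut = Math.max(0, recent.length - keepRecent);
  return { overflow: recent.slice(0, cut), kept: recent.slice(cut) };
}

// Fold overflowing messages into the running summary.
// Returns the state unchanged (and makes no model call) when nothing overflows.
async function compress(
  state: SummaryState,
  keepRecent: number,
  summarize: Summarizer,
): Promise<SummaryState> {
  const { overflow, kept } = splitOverflow(state.recent, keepRecent);
  if (overflow.length === 0) return state;
  const summary = await summarize(state.summary, overflow);
  return { summary, recent: kept };
}

// Build the context: the summary (if any) as a system message, then recent turns verbatim.
function buildContext(state: SummaryState): Message[] {
  if (state.summary === "") return state.recent;
  const summaryMessage: Message = {
    role: "system",
    content: `Summary of earlier conversation: ${state.summary}`,
  };
  return [summaryMessage].concat(state.recent);
}
```

Keeping `splitOverflow` separate from the asynchronous `compress` step means the windowing logic can be tested without a model, and the extra summarization call is made only when messages actually overflow the window.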
