What I learned building memory for Claude Code — measured against the popular alternative
Beyond Summarization: Engineering a Local Context Substrate for AI-Assisted Development
Current Situation Analysis
AI coding assistants operate within finite context windows. As sessions extend, the system inevitably triggers a compaction routine that compresses the conversation history into a condensed summary. This process is fundamentally lossy by design. Structural details, explicit constraints, corrected assumptions, and edge-case specifications are flattened into narrative prose. The practical consequence is context drift: developers repeatedly restate hostnames, re-explain flag configurations, and re-clarify architectural boundaries that were explicitly established minutes earlier.
The industry's standard response has been to build hook-based memory systems. These tools intercept every tool invocation, user prompt, and session termination event, streaming the data to background workers that maintain a vector index. While functionally effective, this architecture introduces substantial operational overhead. A typical implementation spawns dozens of short-lived child processes per session, maintains 80–150 MB of resident memory, and executes an external API call at session termination. The approach treats conversation history as a continuous data stream requiring real-time ingestion, rather than a static artifact that can be parsed, scored, and queried on demand.
This design conflates retrieval-augmented generation (RAG) patterns with session memory. Vector databases excel at semantic similarity across unstructured corpora, but they are poorly suited for deterministic constraint tracking. When a developer explicitly corrects a model's assumption, that correction is not a semantic match problem; it is a structured fact that must be preserved with high fidelity. The existing logs already reside on disk in a predictable directory structure. The engineering challenge is not capturing data, but extracting, weighting, and retrieving it efficiently without daemonizing the workflow or introducing network dependencies.
WOW Moment: Key Findings
The critical insight emerges when comparing reconstruction fidelity across three distinct memory architectures. Reconstruction fidelity measures how well a system preserves explicit corrections when context is artificially truncated. The test hides a known correction pair, rebuilds the context window using a given ranking strategy, and evaluates whether the model can answer a question that depends entirely on the hidden pair. To eliminate evaluation bias, the judge model must belong to a different family than the generator.
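The test loop described above can be sketched in a few lines. This is a minimal illustration rather than the harness behind the reported numbers: the `Probe` type, the `Ranker`/`Generator`/`Judge` callables, and `recency_rank` are hypothetical names, and it assumes each probe carries its own pool of candidate corrections (including the hidden one) in session order.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class Probe:
    """One truncated session: a pool of corrections and a question about one of them."""
    candidates: Tuple[str, ...]  # corrections in session order, hidden one included
    question: str                # answerable only if the hidden correction is recovered
    expected: str                # reference answer handed to the judge


# Hypothetical interfaces; real ones would wrap model calls.
Ranker = Callable[[Sequence[str], int], List[str]]  # (candidates, budget_chars) -> kept
Generator = Callable[[List[str], str], str]         # (context, question) -> answer
Judge = Callable[[str, str], bool]                  # (answer, expected) -> correct?


def retention_rate(probes: Sequence[Probe], rank: Ranker, generate: Generator,
                   judge: Judge, budget_chars: int = 4096) -> float:
    """Fraction of probes answered correctly from the rebuilt context."""
    if not probes:
        return 0.0
    hits = 0
    for probe in probes:
        context = rank(probe.candidates, budget_chars)  # rebuild the window
        answer = generate(context, probe.question)      # generator model
        hits += bool(judge(answer, probe.expected))     # judge from another family
    return hits / len(probes)


def recency_rank(candidates: Sequence[str], budget_chars: int) -> List[str]:
    """Most recent first, stopping at the first entry that would overflow the budget."""
    kept: List[str] = []
    used = 0
    for text in reversed(candidates):
        if used + len(text) > budget_chars:
            break
        kept.append(text)
        used += len(text)
    return kept
```

Swapping `recency_rank` for another ranking function is the only change needed to compare strategies under the same budget and the same judge.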
| Approach | Context Retention Rate | Runtime Footprint | Evaluation Complexity |
|---|---|---|---|
| Hook-Based Vector Index | ~85% (estimated) | 80–150 MB RAM, 50+ child processes/session | High (requires embedding pipeline, vector DB, daemon) |
| One-Pass LLM Summarization | 3.3% | Low (single API call/session) | Low (native to assistant) |
| Structured Pair Extraction | 10.0–13.3% | 33 MB RAM, 0 idle processes | Medium (signal computation, local scoring) |
The headline finding contradicts conventional wisdom: at small sample sizes, a simple recency heuristic outperformed a seven-signal mixture. Ranking corrections by their position within the session (most recent first) achieved a 13.3% retention rate, while the weighted signal mixture reached 10.0%. Both approaches outperformed one-pass summarization, which retained only 3.3% of critical constraints. The architecture bet, that parsing sessions into discrete premise-correction pairs and querying them structurally beats discarding structure for narrative compression, holds in this test.
Cross-family validation confirms the methodology's robustness. When the same 90 predictions were re-evaluated using a different model family, Cohen's κ measured 0.549, indicating moderate agreement. The structured selection rows showed perfect alignment between judges, while the summarization row diverged significantly. This divergence is expected: summarization compresses constraints into prose, making them vulnerable to model-specific interpretation biases. The structured approach preserves explicit boundaries, yielding consistent evaluation across model families.
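For readers who want to reproduce the agreement statistic, Cohen's κ for two judges issuing pass/fail verdicts on the same predictions can be computed without extra dependencies. The function name and the boolean verdict encoding are assumptions made for this sketch.

```python
from typing import Sequence


def cohens_kappa(judge_a: Sequence[bool], judge_b: Sequence[bool]) -> float:
    """Cohen's kappa for two raters giving binary verdicts on the same items."""
    if len(judge_a) != len(judge_b) or len(judge_a) == 0:
        raise ValueError("verdict lists must be non-empty and the same length")
    n = len(judge_a)
    # Observed agreement: share of items where both judges gave the same verdict.
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement from each judge's marginal pass rate.
    p_a = sum(judge_a) / n
    p_b = sum(judge_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:
        return 1.0  # both judges gave one identical constant verdict
    return (observed - expected) / (1 - expected)
```

κ corrects raw agreement for chance, which matters here because most verdicts are failures: two judges who both say "fail" almost everywhere agree often without that agreement carrying much information.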
The practical implication is clear. Complex signal mixtures do not automatically yield better retention than lightweight heuristics. The real differentiator is structural preservation versus narrative compression. Vector indexing adds operational complexity without solving the core problem of deterministic constraint tracking. A local, queryable substrate that extracts explicit corrections, applies temporal decay, and exposes them via a stateless interface delivers higher fidelity than summarization with far lower overhead than a hook-based index.
Core Solution
Building a context substrate requires shifting from stream processing to artifact parsing. The system reads existing session logs, extracts explicit correction pairs, computes multi-signal scores, applies temporal decay, and exposes the result through a local interface. No background daemons, no vector databases, no automatic prompt injection.
Step 1: Session Parsing and Pair Extraction
Session logs follow a predictable JSONL structure. Each entry contains tool outputs, user prompts, and model responses. The extraction pipeline scans for explicit user corrections—typically phrases containing negation, clarification, or constraint enforcement—and pairs them with the preceding model premise.
import json
import re
from dataclasses import dataclass
from pathlib import Path
from typing import List, Optional

# Whole-word markers, so "no" does not fire on "know" or "notice".
CORRECTION_MARKERS = ["actually", "no", "wait", "correction", "instead", "not", "should be"]
_MARKER_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(m) for m in CORRECTION_MARKERS) + r")\b",
    re.IGNORECASE,
)


@dataclass
class ContextPair:
    pair_id: int
    session_id: str
    premise_text: str
    correction_text: str
    timestamp: float  # epoch seconds
    span_tier: str = "unknown"
    topic_position: float = 0.0


class SessionParser:
    def __init__(self, log_dir: Path):
        self.log_dir = log_dir
        self.pairs: List[ContextPair] = []
        self._counter = 0

    def extract(self) -> List[ContextPair]:
        # One JSONL file per session; the file stem doubles as the session id.
        for log_file in sorted(self.log_dir.glob("*.jsonl")):
            session_id = log_file.stem
            with open(log_file, "r", encoding="utf-8") as f:
                entries = [json.loads(line) for line in f if line.strip()]
            self._parse_entries(entries, session_id)
        return self.pairs

    def _parse_entries(self, entries: List[dict], session_id: str):
        for i, entry in enumerate(entries):
            if entry.get("type") == "user" and self._is_correction(entry.get("content", "")):
                premise = self._find_preceding_premise(entries, i)
                if premise:
                    self._counter += 1
                    self.pairs.append(ContextPair(
                        pair_id=self._counter,
                        session_id=session_id,
                        premise_text=premise,
                        correction_text=entry["content"],
                        timestamp=entry.get("timestamp", 0.0),
                    ))

    def _is_correction(self, text: str) -> bool:
        # Non-string content (for example structured tool results) is never a correction.
        return isinstance(text, str) and bool(_MARKER_RE.search(text))

    def _find_preceding_premise(self, entries: List[dict], index: int) -> Optional[str]:
        # Walk backwards to the nearest assistant message before the correction.
        for j in range(index - 1, -1, -1):
            if entries[j].get("type") == "assistant":
                return entries[j].get("content", "")
        return None
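Marker lists like this one tend to over- or under-trigger, so a dry pass that counts which marker fires on which messages is a cheap way to tune the list before trusting the extracted pairs. The helper below is a hypothetical sketch, not part of the pipeline above; it uses whole-word, case-insensitive matching so that "no" does not count "know".

```python
import re
from collections import Counter
from typing import Iterable, List


def marker_hits(messages: Iterable[str], markers: List[str]) -> Counter:
    """Count, per marker, how many messages contain it as a whole word or phrase."""
    patterns = {
        marker: re.compile(r"\b" + re.escape(marker) + r"\b", re.IGNORECASE)
        for marker in markers
    }
    hits: Counter = Counter()
    for text in messages:
        for marker, pattern in patterns.items():
            if pattern.search(text):
                hits[marker] += 1  # one hit per message, however often it repeats
    return hits
```

A marker that fires on a large share of ordinary messages is a candidate for removal; a marker that never fires may be missing the phrasing a particular developer actually uses.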
Step 2: Multi-Signal Scoring Pipeline
Each extracted pair receives a composite score derived from six automatic signals and one optional human label. The automatic signals include a per-user misstep predictor (logistic regression, AUC 0.665), density features, span coverage tiers, topic position decay, recency, and cosine neighborhood similarity. The human label is opt-in and contributes minimally to the final mixture.
from typing import Dict, List

import numpy as np
from sklearn.linear_model import LogisticRegression


class SignalEngine:
    def __init__(self, pairs: List[ContextPair]):
        self.pairs = pairs
        self.scores: Dict[int, float] = {}

    def compute_misstep_predictor(self, features: np.ndarray, labels: np.ndarray) -> np.ndarray:
        # Labels must be observed misstep outcomes (0/1); random labels would only fit noise.
        model = LogisticRegression()
        model.fit(features, labels)
        return model.predict_proba(features)[:, 1]

    def compute_density(self) -> Dict[int, float]:
        # Unbounded ratio: a long correction to a short premise can dominate the mixture.
        return {p.pair_id: len(p.correction_text) / max(1, len(p.premise_text)) for p in self.pairs}

    def compute_span_coverage(self) -> Dict[int, str]:
        # Placeholder tiering by pair_id; swap in a real coverage measure.
        tiers = ["critical", "high", "medium", "low"]
        return {p.pair_id: tiers[p.pair_id % 4] for p in self.pairs}

    def compute_recency(self) -> Dict[int, float]:
        # Min-max normalization; dividing raw epoch timestamps by the max gives ~1.0 for every pair.
        if not self.pairs:
            return {}
        timestamps = [p.timestamp for p in self.pairs]
        oldest, newest = min(timestamps), max(timestamps)
        span = newest - oldest
        return {p.pair_id: (p.timestamp - oldest) / span if span > 0 else 1.0 for p in self.pairs}

    def compute_cosine_neighborhood(self, embeddings: np.ndarray) -> Dict[int, float]:
        # Row i of `embeddings` must correspond to self.pairs[i].
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        normalized = embeddings / (norms + 1e-8)
        similarity = normalized @ normalized.T
        return {p.pair_id: float(np.mean(similarity[i])) for i, p in enumerate(self.pairs)}

    def compose_mixture(self, embeddings: np.ndarray) -> Dict[int, float]:
        # Mixes three of the signals with fixed weights; embeddings are supplied by the caller.
        recency = self.compute_recency()
        density = self.compute_density()
        cosine = self.compute_cosine_neighborhood(embeddings)
        final_scores = {}
        for p in self.pairs:
            final_scores[p.pair_id] = (
                0.4 * recency.get(p.pair_id, 0.0)
                + 0.3 * density.get(p.pair_id, 0.0)
                + 0.3 * cosine.get(p.pair_id, 0.0)
            )
        return final_scores
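Whether each signal earns its place in the mixture is an empirical question, and a leave-one-out ablation is one way to ask it. The sketch below is illustrative only: `mix`, `ablate`, and the `evaluate` callback (which would wrap a reconstruction fidelity run and return a retention rate) are hypothetical names, and the weights are renormalized after a signal is dropped, which is a design choice rather than something the article specifies.

```python
from typing import Callable, Dict

SignalTable = Dict[str, Dict[int, float]]  # signal name -> {pair_id: score}


def mix(signals: SignalTable, weights: Dict[str, float]) -> Dict[int, float]:
    """Weighted mixture over the signals named in `weights`, renormalized to sum to 1."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one positive weight is required")
    pair_ids = set()
    for name in weights:
        pair_ids.update(signals[name])
    return {
        pid: sum(w * signals[name].get(pid, 0.0) for name, w in weights.items()) / total
        for pid in pair_ids
    }


def ablate(signals: SignalTable, weights: Dict[str, float],
           evaluate: Callable[[Dict[int, float]], float]) -> Dict[str, float]:
    """Retention lost when each signal is removed (positive means the signal helps)."""
    baseline = evaluate(mix(signals, weights))
    deltas: Dict[str, float] = {}
    for name in weights:
        reduced = {k: v for k, v in weights.items() if k != name}
        if not reduced:
            continue  # cannot ablate the only remaining signal
        deltas[name] = baseline - evaluate(mix(signals, reduced))
    return deltas
```

A signal whose delta is zero or negative across repeated runs is adding complexity without adding retention.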
Step 3: Temporal Decay Mechanism
Context relevance decays over time. A nightly process applies a wall-clock half-life multiplier independent of the signal mixture. Corrections from yesterday retain about 91% of their weight, those from one week ago 50%, and those from one month ago about 5%, which corresponds to a seven-day half-life (0.5 raised to the power of age in days divided by 7). This decay runs as a transient process, not a persistent daemon.
import time
from typing import Dict, List


class DecayScheduler:
    def __init__(self, scores: Dict[int, float], pairs: List[ContextPair], half_life_days: float = 7.0):
        self.scores = scores
        self.pairs = pairs
        # A 7-day half-life reproduces the quoted multipliers:
        # 1 day -> ~0.91, 1 week -> 0.50, 30 days -> ~0.05.
        self.half_life_days = half_life_days

    def apply_decay(self) -> Dict[int, float]:
        now = time.time()
        decayed = {}
        for p in self.pairs:
            # Clamp at zero so a timestamp slightly in the future cannot boost a score.
            age_days = max(0.0, (now - p.timestamp) / 86400)
            multiplier = self._get_multiplier(age_days)
            decayed[p.pair_id] = self.scores.get(p.pair_id, 0.0) * multiplier
        return decayed

    def _get_multiplier(self, days: float) -> float:
        # Continuous exponential decay; step buckets would weight a 2-day-old
        # correction the same as a 7-day-old one.
        return 0.5 ** (days / self.half_life_days)
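The multipliers quoted above are what a seven-day exponential half-life produces, and a decay pass is easier to audit if it records the multiplier it applied to each pair. The following is a minimal sketch under those assumptions; `decay_multiplier`, `run_decay_pass`, and the audit record fields are hypothetical, and `audit` is any object with a `write` method, such as an open file in append mode.

```python
import json
import time
from typing import Dict, Optional

HALF_LIFE_DAYS = 7.0  # assumption: inferred from the 0.91 / 0.50 / 0.05 figures


def decay_multiplier(age_days: float, half_life_days: float = HALF_LIFE_DAYS) -> float:
    """Exponential half-life decay; ages below zero are treated as zero."""
    return 0.5 ** (max(0.0, age_days) / half_life_days)


def run_decay_pass(scores: Dict[int, float], timestamps: Dict[int, float],
                   audit, now: Optional[float] = None) -> Dict[int, float]:
    """Apply decay to every score and write one JSON audit line per pair."""
    now = time.time() if now is None else now
    decayed: Dict[int, float] = {}
    for pair_id, score in scores.items():
        age_days = (now - timestamps[pair_id]) / 86400
        multiplier = decay_multiplier(age_days)
        decayed[pair_id] = score * multiplier
        audit.write(json.dumps({
            "pair_id": pair_id,
            "age_days": round(age_days, 3),
            "multiplier": round(multiplier, 4),
        }) + "\n")
    return decayed
```

Passing `now` explicitly keeps the pass reproducible in tests and lets a scheduled run record the exact reference time it used.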
Step 4: Stateless MCP Exposure
The substrate exposes results via a local Model Context Protocol server. The client polls for context; nothing pushes automatically. This preserves user control and auditability.
from fastmcp import FastMCP

mcp = FastMCP("Context Substrate")


@mcp.tool()
def retrieve_context(budget_chars: int = 4096) -> str:
    # The two loaders are not shown here; they are expected to return the
    # extracted pairs and a {pair_id: score} mapping persisted by earlier steps.
    pairs = load_pairs_from_disk()
    scores = load_scores_from_disk()
    ranked = sorted(pairs, key=lambda p: scores.get(p.pair_id, 0.0), reverse=True)
    context_buffer = []
    current_len = 0
    for p in ranked:
        entry = f"[{p.span_tier}] {p.correction_text}\n"
        # Stop at the first entry that would exceed the character budget.
        if current_len + len(entry) > budget_chars:
            break
        context_buffer.append(entry)
        current_len += len(entry)
    return "".join(context_buffer)


if __name__ == "__main__":
    mcp.run(transport="stdio")
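The server code calls `load_pairs_from_disk()` and `load_scores_from_disk()` without showing them. One possible shape, assuming pairs and scores are persisted as two JSON files, is sketched below; the `STORE_DIR` location, the file names, and the JSON layout are all assumptions for illustration. The budget packing is pulled out into a pure function so it can be tested without starting a server.

```python
import json
from pathlib import Path
from types import SimpleNamespace
from typing import Dict, List

STORE_DIR = Path("./substrate-store")  # hypothetical location for persisted artifacts


def load_pairs_from_disk(path: Path = STORE_DIR / "pairs.json") -> List[SimpleNamespace]:
    """Read a JSON array of pair records; a missing file means no pairs yet."""
    if not path.exists():
        return []
    rows = json.loads(path.read_text(encoding="utf-8"))
    return [SimpleNamespace(**row) for row in rows]


def load_scores_from_disk(path: Path = STORE_DIR / "scores.json") -> Dict[int, float]:
    """Read a JSON object of scores; JSON keys are strings, so convert back to int."""
    if not path.exists():
        return {}
    raw = json.loads(path.read_text(encoding="utf-8"))
    return {int(pair_id): float(score) for pair_id, score in raw.items()}


def pack_context(pairs, scores: Dict[int, float], budget_chars: int) -> str:
    """Highest score first; stop at the first entry that would exceed the budget."""
    ranked = sorted(pairs, key=lambda p: scores.get(p.pair_id, 0.0), reverse=True)
    buffer: List[str] = []
    used = 0
    for p in ranked:
        entry = f"[{p.span_tier}] {p.correction_text}\n"
        if used + len(entry) > budget_chars:
            break
        buffer.append(entry)
        used += len(entry)
    return "".join(buffer)
```

Because everything is read from disk on each call, the tool stays stateless: nothing is cached between requests, and deleting the store directory simply yields an empty result.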
Architecture Rationale
The decision to avoid daemons stems from auditability requirements. A tool that reads conversation history must remain transparent and controllable. Transient processes that execute on demand eliminate hidden state and reduce attack surface. The choice to store scores as numpy arrays rather than vector embeddings reduces memory overhead from 80+ MB to 33 MB while preserving deterministic retrieval. The MCP interface enforces a pull model, preventing automatic context injection that could violate user intent or exceed token budgets. Cross-family evaluation during development ensures that signal weights are calibrated against objective fidelity metrics, not model-specific artifacts.
Pitfall Guide
1. Over-Engineering Signal Mixtures
Explanation: Assuming that more signals automatically yield better retention. At small sample sizes, complex mixtures often underperform simple heuristics like recency. The 95% confidence interval for the seven-signal mixture crosses zero, indicating statistical uncertainty. Fix: Start with recency and density. Validate against reconstruction fidelity before adding signals. Use ablation studies to prove marginal utility.
2. Ignoring Cross-Family Judge Bias
Explanation: Evaluating context retention using a judge from the same model family as the generator introduces systematic bias. Both models share training distributions and failure modes, masking retrieval gaps. Fix: Enforce a cross-family evaluation contract. Use Gemma to judge Qwen outputs, and a model from outside the Claude family to judge Claude outputs. Calculate Cohen's κ to measure agreement. Cheap local judges are acceptable for iteration; cloud judges are required for publication.
3. Daemonizing Local Context Tools
Explanation: Running persistent background processes to monitor session files creates auditability risks. A daemon that reads conversation history without explicit user invocation violates the principle of least privilege and complicates security reviews. Fix: Design transient processes that execute on demand. Schedule decay passes via cron or systemd timers with explicit start/stop boundaries. Never maintain always-on listeners for local session logs.
4. Auto-Injecting Context into Prompts
Explanation: Automatically appending retrieved context to user prompts removes developer control. It can exceed token limits, inject stale constraints, or conflict with explicit instructions. Fix: Expose context via a query interface. Let the client or developer decide when and how to incorporate retrieved pairs. Implement explicit budget parameters and require manual confirmation for large injections.
5. Vectorizing Structured Corrections
Explanation: Using vector embeddings for explicit corrections treats deterministic constraints as semantic matches. Embeddings lose precision for exact flag names, hostnames, and configuration values. Fix: Store corrections as plain text pairs with explicit metadata. Use deterministic scoring (recency, density, span tier) rather than cosine similarity for constraint retrieval. Reserve vectors for unstructured knowledge gaps, not explicit corrections.
6. Neglecting Temporal Decay
Explanation: Treating all corrections as equally relevant regardless of age causes context pollution. Constraints from early sessions often become obsolete as architecture evolves. Fix: Implement wall-clock half-life decay. Apply multipliers independently of signal scores. Schedule decay as a nightly transient process. Log decay rates for auditability.
7. Same-Family Evaluation Loops
Explanation: Continuously testing signal weights using the same model family creates feedback loops. The model optimizes for its own biases rather than objective retrieval fidelity. Fix: Rotate judge families during development. Maintain a held-out evaluation set with cross-family judges. Publish κ scores alongside retention metrics to demonstrate methodological rigor.
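The cross-family contract from pitfalls 2 and 7 can be enforced mechanically before any evaluation run starts. This is a hedged sketch: the prefix-to-vendor table and the function names are hypothetical, and a real setup would need a mapping that covers whatever model identifiers it actually uses.

```python
# Hypothetical mapping from model-name prefix to model family (vendor).
FAMILY_PREFIXES = {
    "claude": "anthropic",
    "gemma": "google",
    "qwen": "alibaba",
}


def model_family(model_name: str) -> str:
    """Resolve a model identifier to its family, failing loudly on unknown names."""
    name = model_name.lower()
    for prefix, family in FAMILY_PREFIXES.items():
        if name.startswith(prefix):
            return family
    raise ValueError(f"unknown model family for {model_name!r}")


def assert_cross_family(generator: str, judge: str) -> None:
    """Raise if the judge belongs to the same family as the generator."""
    if model_family(generator) == model_family(judge):
        raise ValueError(
            f"judge {judge!r} and generator {generator!r} share a model family"
        )
```

Failing on unknown names is deliberate: silently treating an unrecognized judge as a different family would defeat the purpose of the check.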
Production Bundle
Action Checklist
- Parse session logs: Extract premise-correction pairs from ~/.claude/projects/ using deterministic pattern matching.
- Compute baseline signals: Implement recency, density, and span coverage scoring before adding complex mixtures.
- Apply temporal decay: Schedule a nightly transient process with wall-clock half-life multipliers (0.91/0.50/0.05).
- Expose via MCP: Deploy a stateless stdio server with explicit budget parameters and pull-based retrieval.
- Validate with cross-family judges: Run reconstruction fidelity tests using different model families; calculate Cohen's κ.
- Audit network calls: Enforce zero outbound connections in the default path; scan for hardcoded paths on commit.
- Monitor context budgets: Log token consumption per retrieval; alert when injections exceed 70% of window capacity.
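The last checklist item can be approximated with a small guard around each retrieval. The sketch below is an assumption-laden illustration: it estimates tokens from characters with a rough four-characters-per-token heuristic rather than a real tokenizer, and the logger name and function names are hypothetical.

```python
import logging

logger = logging.getLogger("substrate.budget")  # hypothetical logger name

ALERT_FRACTION = 0.70   # alert threshold from the checklist
CHARS_PER_TOKEN = 4     # rough heuristic, not a real tokenizer


def estimate_tokens(text: str) -> int:
    """Approximate token count by ceiling division of the character count."""
    return -(-len(text) // CHARS_PER_TOKEN)


def check_budget(context: str, window_tokens: int) -> bool:
    """Log estimated usage; return True when it exceeds the alert threshold."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    used = estimate_tokens(context)
    fraction = used / window_tokens
    logger.info("retrieval used ~%d tokens (%.0f%% of window)", used, fraction * 100)
    over = fraction > ALERT_FRACTION
    if over:
        logger.warning("retrieved context exceeds %.0f%% of the window", ALERT_FRACTION * 100)
    return over
```

Replacing `estimate_tokens` with the tokenizer of the target model would make the threshold exact; the heuristic is only meant to catch gross overruns.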
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Short debugging sessions (<30 min) | Recency-only extraction | Minimal overhead; recent corrections dominate context needs | Near-zero compute |
| Long architectural sessions (>2 hrs) | Structured pair extraction with decay | Preserves explicit constraints across context boundaries | Low (33 MB RAM, transient processes) |
| Multi-project knowledge base | Hook-based vector index | Semantic similarity across unstructured corpora requires embeddings | High (80-150 MB RAM, daemon, API calls) |
| Compliance/audit-required environments | Local substrate with MCP pull | Zero network calls, explicit user control, full audit trail | Medium (storage for numpy artifacts) |
| Rapid prototyping | One-pass summarization | Native to assistant; no setup required | High context loss, re-explanation overhead |
Configuration Template
# substrate-config.yaml
extraction:
  log_directory: "~/.claude/projects/"
  correction_markers:
    - "actually"
    - "no"
    - "correction"
    - "instead"
    - "should be"
  span_tiers: ["critical", "high", "medium", "low"]
scoring:
  signals:
    recency:
      weight: 0.4
      normalization: "min_max"
    density:
      weight: 0.3
      formula: "len(correction) / max(1, len(premise))"
    cosine_neighborhood:
      weight: 0.3
      embedding_dim: 128
  mixture:
    enabled: true
    human_label_weight: 0.05
decay:
  schedule: "0 4 * * *"
  half_life_multipliers:
    1_day: 0.91
    1_week: 0.50
    1_month: 0.05
  independent_of_signals: true
exposure:
  transport: "stdio"
  max_budget_chars: 4096
  auto_inject: false
  audit_log: "./logs/substrate-audit.jsonl"
Quick Start Guide
- Initialize the parser: Point the extraction pipeline at your session directory. Run a dry pass to verify pair detection accuracy and adjust correction markers if needed.
- Compute baseline scores: Execute the recency and density scoring functions. Validate output against a known correction to ensure deterministic behavior.
- Schedule decay: Configure a cron job or systemd timer to run the decay process nightly. Verify that multipliers apply correctly to timestamps.
- Deploy MCP server: Start the stdio interface in a terminal. Test retrieval with explicit budget parameters. Confirm that no automatic injection occurs.
- Validate fidelity: Run a reconstruction test using a cross-family judge. Compare retention rates against your previous workflow. Iterate on signal weights only if κ > 0.6 and retention improves by >5%.
