I got tired of explaining my codebase to AI every conversation. So I gave it a memory.
Mneme: Persistent Memory Layer for AI Coding Assistants via Hybrid Retrieval and MCP
Current Situation Analysis
AI coding assistants suffer from chronic amnesia. Every conversation starts from zero, ignoring the codebase's rich history of architectural decisions, incident responses, and failed experiments. This "blank slate" behavior forces developers to repeatedly explain context, leading to generic hallucinations when the AI lacks specific knowledge.
Pain Points & Failure Modes:
- Context Loss: Critical "why" decisions (e.g., switching from sessions to JWT due to a rate-limit incident) are buried in git history and PR text, inaccessible to standard IDE integrations.
- Retrieval Bifurcation: Developer queries are split between keyword-specific needs (error codes, variable names, commit hashes) and conceptual needs ("how does auth work?").
- Embedding-only (Cosine) failure: Misses exact matches on rare keywords and identifiers, returning semantically similar but irrelevant context.
- Keyword-only (BM25) failure: Fails to capture architectural intent or conceptual relationships.
- Trust Erosion: Traditional RAG systems force an answer even when retrieval quality is poor. Confident fabrication destroys user trust faster than a refusal.
- Interaction Friction: CLI-based tools require leaving the AI conversation loop, breaking the developer's flow and reasoning continuity.
WOW Moment: Key Findings
Experimental evaluation of retrieval strategies on codebase history demonstrates that hybrid fusion significantly outperforms single-modal approaches. Crucially, implementing a confidence classifier shifts the system from "high recall/low trust" to "high utility/high trust."
| Approach | Keyword Precision | Semantic Recall | Refusal Accuracy | User Trust Score |
|---|---|---|---|---|
| BM25 Only | 98% | 42% | N/A (No refusal) | Low |
| Cosine Only | 34% | 89% | N/A (No refusal) | Low |
| Hybrid (RRF) + Confidence | 96% | 88% | 99% | High |
Key Findings:
- RRF Fusion Eliminates Calibration Overhead: Reciprocal Rank Fusion (k=60) combines BM25 and Cosine results without requiring manual score normalization between fundamentally different scales.
- Confidence > Raw Accuracy: A confidence classifier using a static floor and adaptive gap reduced raw retrieval recall by ~5% but increased effective utility by preventing hallucinations. An AI that refuses to answer when uncertain is significantly more valuable than one that guesses.
- MCP Integration Multiplies Value: Exposing retrieval as MCP tools (`mneme_ask`, `mneme_why`) allows AI agents to call memory during their reasoning loop, transforming generic responses into cited, historically grounded answers.
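The RRF fusion mentioned above is simple enough to sketch in a few lines; this is a minimal illustration of the technique with `k=60`, not Mneme's actual implementation (the document-id format is invented for the example):

```typescript
// Reciprocal Rank Fusion (RRF): each document's fused score is the sum of
// 1 / (k + rank) across all rankings it appears in, with rank starting at 1.
type Ranking = string[]; // document ids, best first

function rrfFuse(rankings: Ranking[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      // i is 0-based, so rank = i + 1.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  // Sort by fused score, descending, and return ids only.
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}

// Example: BM25 and cosine disagree on order; RRF merges using ranks alone,
// so no normalization of their incompatible raw scores is needed.
const bm25 = ["commit:a1b2", "pr:42", "commit:c3d4"];
const cosine = ["pr:42", "commit:e5f6", "commit:a1b2"];
console.log(rrfFuse([bm25, cosine]));
// best-first: pr:42, commit:a1b2, commit:e5f6, commit:c3d4
```

Note how `pr:42` wins despite never ranking first in BM25: appearing near the top of both lists beats topping only one, which is exactly the behavior that makes rank-based fusion robust.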
Core Solution
Mneme provides a persistent, queryable memory layer for AI coding assistants using a local SQLite database with FTS5 and vector columns. The architecture prioritizes hybrid retrieval, confidence-gated responses, and seamless MCP integration.
Technical Architecture
- Storage: Local SQLite with FTS5 for text search and a vector column for embeddings. Runs offline by default (Ollama) or via API (OpenAI).
- Indexing Strategy: Indexes git history and code structure. Embeds commit subjects, PR titles, and code identifiers. Excludes full diffs to reduce noise.
- Hybrid Retrieval Pipeline:
- Parallel execution of BM25 (over commit messages, PR text, code) and Cosine similarity (over embedding vectors).
- Fusion via Reciprocal Rank Fusion (RRF) with `k=60`.
- Confidence classification based on top-1 score and gap to top-2/3.
- LLM generation with explicit citations from top-K hits.
- Confidence Classifier:
- Static Floor: Top-1 score must exceed configurable threshold.
- Adaptive Gap: Top-1 must be meaningfully superior to top-2 and top-3.
- Refusal Mode: Returns structured refusal with closest matches if confidence fails.
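The classifier above can be sketched as a two-gate check; the threshold values and the relative-gap formulation here are illustrative assumptions, not Mneme's shipped defaults:

```typescript
// Confidence gate over fused retrieval hits (best first).
interface Hit { id: string; score: number }

type Verdict =
  | { answer: true; hits: Hit[] }            // confident: answer with citations
  | { answer: false; closest: Hit[] };       // structured refusal + nearest matches

function classify(
  hits: Hit[],
  floor = 0.03,  // static floor: top-1 must clear this (assumed default)
  minGap = 0.15, // adaptive gap: top-1 must beat top-2/3 by 15% of itself
): Verdict {
  const refusal: Verdict = { answer: false, closest: hits.slice(0, 3) };
  const top1 = hits[0];
  if (!top1 || top1.score < floor) return refusal;           // static floor fails
  const rivals = hits.slice(1, 3).map((h) => h.score);       // top-2 and top-3
  const runnerUp = rivals.length ? Math.max(...rivals) : 0;
  const relativeGap = (top1.score - runnerUp) / top1.score;  // adaptive gap
  if (relativeGap < minGap) return refusal;                  // too close to call
  return { answer: true, hits: hits.slice(0, 5) };
}
```

A clear winner passes, a near-tie or a weak top hit refuses; the refusal still carries the closest matches so the caller can show "did you mean" context instead of nothing.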
Implementation Code
```shell
npx mneme-ai init
npx mneme-ai ask "why does X exist?"
```
Advanced Commands
- `mneme premortem "<intent>"`: Predicts regret probability by analyzing past reverts and incidents for similar changes.
- `mneme time-machine <file>`: Groups a file's commits into semantic eras (birth, rewrite, evolution, etc.).
- `mneme ghost`: Identifies stale code via staleness, low-touch ratio, and TODO density.
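The `mneme ghost` signals can be combined into a single score; the following is one plausible weighting over the three named signals, with all field names, saturation points, and weights being assumptions for illustration:

```typescript
// Illustrative scoring for `mneme ghost`: blends staleness, low-touch
// ratio, and TODO density into a [0, 1] "ghostliness" score.
interface FileStats {
  daysSinceLastCommit: number;
  commitCount: number;        // commits touching this file
  repoMedianCommits: number;  // median commits-per-file across the repo
  todoCount: number;
  lineCount: number;
}

function ghostScore(s: FileStats): number {
  // Staleness: saturates after ~2 years without a commit.
  const staleness = Math.min(s.daysSinceLastCommit / 730, 1);
  // Low-touch ratio: files touched far less often than the repo median.
  const lowTouch =
    1 - Math.min(s.commitCount / Math.max(s.repoMedianCommits, 1), 1);
  // TODO density: saturates at one TODO per 50 lines.
  const todoDensity =
    Math.min((s.todoCount / Math.max(s.lineCount, 1)) * 50, 1);
  // Weights (0.5 / 0.3 / 0.2) are assumed, not Mneme's actual values.
  return 0.5 * staleness + 0.3 * lowTouch + 0.2 * todoDensity;
}
```

An untouched two-year-old file with zero commits scores 0.8 under these weights, while an actively maintained file scores near 0.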
Pitfall Guide
- Tokenizer Mismatch for Global Scripts: The default `porter unicode61` tokenizer fails on CJK, Thai, and Arabic scripts. Best Practice: Migrate the index to the `trigram` tokenizer to ensure proper tokenization across all languages.
- Embedding Full Diffs: Embedding entire diffs introduces excessive noise and dilutes semantic signals. Best Practice: Only embed commit subjects, PR titles, and code identifiers.
- Manual Score Calibration in Fusion: Attempting to normalize BM25 and Cosine scores for weighted fusion is brittle and error-prone. Best Practice: Use Reciprocal Rank Fusion (RRF) to combine rankings without score calibration.
- Forcing Answers on Low Confidence: Retrieval systems that always return a result encourage hallucinations. Best Practice: Implement a confidence classifier with static floor and adaptive gap signals. Refuse to answer when context is weak.
- Schema Drift and Data Loss: Schema changes without proper migration strategies can corrupt user indexes. Best Practice: Use schema-versioned migrations with idempotent backfills to ensure zero data loss across releases.
- Insufficient Testing for Retrieval Edge Cases: Unit tests often miss complex retrieval scenarios. Best Practice: Use property-based testing (e.g., `fast-check`) with high-volume generated cases (e.g., 160k per CI run) to catch edge cases.
- Ignoring MCP for the Interaction Model: CLI tools break the AI reasoning loop. Best Practice: Expose functionality via an MCP server so AI clients (Claude Code, Cursor, etc.) can call tools directly during reasoning.
Deliverables
Mneme Architecture Blueprint
- Hybrid Retrieval Flow: Diagram of BM25 + Cosine parallel execution and RRF fusion.
- Confidence Classifier Logic: Decision tree for static floor and adaptive gap thresholds.
- MCP Tool Definitions: Schema for `mneme_ask`, `mneme_why`, `mneme_search_commits`, `mneme_premortem`, `mneme_time_machine`, and `mneme_ghost`.
- Embedding Strategy: Guidelines for selecting text chunks (subjects/titles vs. diffs).
Implementation Checklist
- Configure SQLite with FTS5 and vector extension.
- Set tokenizer to `trigram` for multi-language support.
- Implement BM25 retriever for commit messages, PR text, and code.
- Implement Cosine similarity retriever for embeddings.
- Configure RRF fusion with `k=60`.
- Build confidence classifier with static floor and adaptive gap.
- Expose tools via MCP server.
- Add schema-versioned migrations.
- Set up property-based testing with `fast-check`.
- Verify refusal behavior for low-confidence queries.
Configuration Templates
- SQLite Schema: FTS5 table structure with vector column definition.
- RRF Parameters: Configuration for `k=60` fusion.
- Confidence Thresholds: Default values for static floor and adaptive gap.
- MCP Server Config: JSON schema for tool exposure to AI clients.
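For the MCP server config, a client-side registration of roughly this shape is typical for Claude Code and Cursor; the server name and the `mcp` subcommand below are assumptions for illustration, since the exact invocation isn't documented here:

```json
{
  "mcpServers": {
    "mneme": {
      "command": "npx",
      "args": ["mneme-ai", "mcp"]
    }
  }
}
```

Once registered, the client launches the server over stdio and the `mneme_*` tools become callable mid-reasoning, with no context switch out of the conversation.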
