I gave Claude a persistent memory for $0/month using Cloudflare
Building a Semantic Memory Layer for LLM Agents on the Edge
Current Situation Analysis
Large language models operate on a fundamentally stateless architecture. Each new session initializes with a blank context window, discarding prior interactions, project decisions, and user preferences. This design choice optimizes for security and predictable inference costs, but it creates a severe friction point for developers building agentic workflows or long-running development assistants.
The industry has responded with platform-native memory features. While convenient, these implementations are intentionally opaque. They function as heuristic black boxes: you cannot tag entries, filter by timestamp, run semantic queries, or control what gets surfaced. The model decides what to retain based on internal weighting algorithms that are neither documented nor adjustable. For engineering teams, this lack of programmatic control is unacceptable. You cannot build reliable automation around a system you cannot query.
The alternative—external vector databases—introduces infrastructure complexity. Standalone solutions like Pinecone, Weaviate, or Milvus require separate provisioning, authentication layers, and monthly costs that scale with usage. For personal projects or small teams, the operational overhead outweighs the benefit. The gap between stateless LLMs and persistent, queryable knowledge remains one of the most overlooked architectural challenges in modern AI development.
WOW Moment: Key Findings
The breakthrough isn't storing memory; it's how you retrieve it. Traditional keyword search fails when human intent diverges from stored terminology. Semantic vector search decouples storage syntax from retrieval intent, enabling meaning-based matching without exact lexical overlap.
| Approach | Retrieval Precision | Infrastructure Cost (Monthly) | Query Flexibility |
|---|---|---|---|
| Platform Native Memory | Low (heuristic, opaque) | $0 | None (black box) |
| Keyword/SQL Storage | Medium (exact match dependent) | $5–$20+ | High (structured filters) |
| Edge-Hosted Semantic Vector MCP | High (meaning-based) | $0 (personal scale) | Full programmatic control |
This finding matters because it shifts memory from a passive feature to an active engineering primitive. When retrieval operates on semantic proximity rather than string matching, you can store raw observations, technical decisions, or user feedback, and query them using natural language intent. The system surfaces relevant context regardless of whether the original note used the same terminology as the current prompt. This enables true continuity across sessions without bloating context windows or relying on platform-specific workarounds.
Core Solution
The architecture leverages Cloudflare's edge ecosystem to create a self-contained memory layer that communicates with LLMs via the Model Context Protocol (MCP). The stack combines four components:
- Cloudflare Workers: Stateless compute for handling MCP tool calls and routing requests.
- D1 (SQLite): Relational storage for structured metadata (timestamps, categories, sources, deletion flags).
- Vectorize: Managed vector index for storing and querying high-dimensional embeddings.
- Workers AI: On-platform embedding generation using
bge-small-en-v1.5, producing 384-dimensional vectors.
Architecture Decisions & Rationale
Why MCP? MCP standardizes how LLMs interact with external tools. Instead of building custom API wrappers or prompt engineering hacks, you expose typed tools (store_note, fetch_context, list_entries, remove_entry). The LLM client automatically generates tool calls when it detects memory-related intent. This removes guesswork and ensures consistent behavior across different model providers.
Why Semantic Vectors? Keyword search requires you to anticipate query phrasing. Semantic search converts text into mathematical representations where similar concepts cluster together. Storing "users abandon checkout at the payment gateway" and querying "onboarding friction points" returns a match because the embedding space captures conceptual proximity, not lexical overlap.
Why bge-small-en-v1.5? The model outputs 384-dimensional vectors, which balances retrieval accuracy with storage efficiency. Higher dimensions increase index size and query latency without proportional gains in semantic precision for general-purpose text. Workers AI hosts this model natively, eliminating external API calls and keeping the entire pipeline within Cloudflare's free tier.
Implementation
The worker exposes an MCP-compatible endpoint. Below is a production-ready TypeScript implementation demonstrating tool registration, embedding generation, and vector retrieval.
// worker/src/index.ts
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { z } from 'zod';
import { D1Database } from '@cloudflare/workers-types';
export interface Env {
DB: D1Database;
VECTORIZE: any;
AI: any;
MCP_AUTH_TOKEN: string;
}
const server = new McpServer({
name: 'edge-memory-layer',
version: '1.0.0'
});
// Tool 1: Store a new memory entry
server.tool(
'store_note',
'Persist a new observation or decision into the memory layer',
{
content: z.string().min(10).max(2000),
category: z.enum(['project', 'user_pref', 'technical', 'feedback']).optional(),
source: z.string().optional()
},
async ({ content, category, source }) => {
const embedding = await generateEmbedding(content);
// Upsert to vector index
await env.VECTORIZE.upsert([{
id: crypto.randomUUID(),
values: embedding,
metadata: { category: category || 'general', source: source || 'manual', created_at: new Date().toISOString() }
}]);
// Store metadata in D1 for structured filtering
await env.DB.prepare(`
INSERT INTO memory_entries (id, content, category, source, created_at)
VALUES (?, ?, ?, ?, ?)
`).bind(
crypto.randomUUID(),
content,
category || 'general',
source || 'manual',
new Date().toISOString()
).run();
return { content: [{ type: 'text', text: 'Note stored successfully.' }] };
}
);
// Tool 2: Retrieve context via semantic search
server.tool(
'fetch_context',
'Search memory using semantic similarity',
{
query: z.string().min(3),
top_k: z.number().min(1).max(10).default(5),
category_filter: z.string().optional()
},
async ({ query, top_k, category_filter }) => {
const query_embedding = await generateEmbedding(query);
const results = await env.VECTORIZE.query(query_embedding, {
topK: top_k,
returnMetadata: 'all'
});
// Optional: filter by category using D1 if metadata isn't sufficient
let filtered_ids = results.matches.map(m => m.id);
if (category_filter) {
const db_results = await env.DB.prepare(`
SELECT id FROM memory_entries
WHERE category = ? AND id IN (${filtered_ids.map(() => '?').join(',')})
`).bind(category_filter, ...filtered_ids).all();
filtered_ids = db_results.results.map(r => r.id);
}
const context_blocks = results.matches
.filter(m => filtered_ids.includes(m.id))
.map(m => `• [${m.metadata.category}] ${m.metadata.content}`);
return {
content: [{
type: 'text',
text: context_blocks.length > 0
? `Retrieved ${context_blocks.length} relevant entries:\n${context_blocks.join('\n')}`
: 'No matching context found.'
}]
};
}
);
// Tool 3: List recent entries
server.tool(
'list_entries',
'Retrieve recent memory entries with optional pagination',
{
limit: z.number().min(1).max(50).default(10),
offset: z.number().min(0).default(0)
},
async ({ limit, offset }) => {
const rows = await env.DB.prepare(`
SELECT id, content, category, created_at
FROM memory_entries
ORDER BY created_at DESC
LIMIT ? OFFSET ?
`).bind(limit, offset).all();
return {
content: [{
type: 'text',
text: JSON.stringify(rows.results, null, 2)
}]
};
}
);
// Tool 4: Remove an entry
server.tool(
'remove_entry',
'Permanently delete a memory entry by ID',
{
entry_id: z.string().uuid()
},
async ({ entry_id }) => {
await env.DB.prepare('DELETE FROM memory_entries WHERE id = ?').bind(entry_id).run();
await env.VECTORIZE.deleteByIds([entry_id]);
return { content: [{ type: 'text', text: 'Entry removed.' }] };
}
);
// Embedding pipeline
async function generateEmbedding(text: string): Promise<number[]> {
const response = await env.AI.run('@cf/baai/bge-small-en-v1.5', {
text: [text]
});
return response.data[0];
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
// MCP routing logic would go here
return server.handleRequest(request);
}
};
Why This Structure Works
- Separation of Concerns: D1 handles structured queries and deletion. Vectorize handles similarity search. Workers AI handles embedding generation. This prevents vendor lock-in and allows independent scaling.
- 384-Dimension Constraint:
bge-small-en-v1.5outputs exactly 384 floats. Vectorize indexes must be created withdimensions: 384. Mismatching this value causes silent query failures. - MCP Tool Typing: Using Zod schemas ensures the LLM client generates valid payloads. Invalid requests are rejected before hitting the vector index, saving compute and preventing index corruption.
- Metadata-Driven Filtering: Storing category and source in both Vectorize metadata and D1 enables hybrid retrieval. You can run fast semantic search first, then apply relational filters for precision.
Pitfall Guide
1. Embedding Dimension Mismatch
Explanation: Vectorize requires a fixed dimension count at index creation. If you generate 768-dim embeddings but index expects 384, queries return empty results or throw type errors.
Fix: Verify @cf/baai/bge-small-en-v1.5 outputs 384 dimensions. Create the index with dimensions: 384 and validate payload length before upserting.
2. Prompt Injection via Memory Context
Explanation: Retrieved memory blocks are injected directly into the system prompt. If stored content contains malicious instructions or malformed syntax, it can override model behavior.
Fix: Sanitize retrieved text before injection. Wrap memory context in explicit XML tags (<memory_context>...</memory_context>) and instruct the model to treat it as reference data, not executable instructions.
3. Context Window Budget Overflow
Explanation: Recalling too many entries or storing verbose notes quickly consumes the LLM's context window, degrading performance or triggering truncation.
Fix: Enforce top_k limits (3–5 entries). Implement a relevance threshold (e.g., cosine similarity > 0.75). Truncate stored content to essential facts before embedding.
4. Local Development Blind Spots
Explanation: Vectorize and Workers AI do not run in local wrangler dev environments. Developers often mock embeddings incorrectly, leading to production mismatches.
Fix: Use environment flags to toggle between mock embeddings (random 384-dim arrays) and remote API calls. Test vector queries against a staging index before deploying.
5. Unstructured Note Ingestion
Explanation: Dumping raw conversation logs or unformatted text creates noisy embeddings. The model struggles to extract signal from unstructured dumps. Fix: Enforce a ingestion schema. Require category, timestamp, and concise content. Strip UI noise, timestamps, and conversational filler before embedding.
6. Ignoring Free Tier Rate Limits
Explanation: Workers AI and Vectorize have request caps on the free tier. High-frequency recall calls or bulk imports will trigger 429 errors. Fix: Implement request batching for imports. Cache frequent queries with a short TTL. Monitor usage via Cloudflare dashboard and set up alerts at 80% threshold.
7. Missing Auto-Recall Trigger
Explanation: The LLM won't fetch memory unless explicitly prompted. Without an initialization trigger, sessions start blank despite stored data.
Fix: Configure the MCP client to call fetch_context on session start. Pass a default query like "project status and recent decisions" to prime the context window automatically.
Production Bundle
Action Checklist
- Provision Cloudflare D1 database and Vectorize index with
dimensions: 384 - Deploy worker with
bge-small-en-v1.5binding and auth token configuration - Run D1 schema migration to create
memory_entriestable with indexed columns - Configure MCP client to auto-call
fetch_contexton session initialization - Implement content sanitization pipeline before embedding ingestion
- Set up usage monitoring and alerting for Workers AI and Vectorize rate limits
- Test hybrid retrieval (semantic + category filter) with staging data
- Document ingestion schema and enforce it across all capture channels
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Personal AI assistant | Edge-hosted Semantic MCP | Zero infrastructure cost, full control, low latency | $0/month |
| Team knowledge base | Centralized Vector DB + MCP Gateway | Multi-user access, audit trails, role-based filtering | $15–$50/month |
| High-volume API ingestion | Batched embedding pipeline + Redis cache | Reduces AI API calls, handles throughput spikes | $5–$20/month (compute) |
| Compliance-heavy environment | On-prem vector store + local embedding model | Data residency, auditability, no external API calls | Hardware + maintenance |
Configuration Template
MCP Client Configuration (Claude Desktop / Cursor / Custom Client)
{
"mcpServers": {
"edge-memory": {
"command": "npx",
"args": ["mcp-remote", "https://<your-worker-subdomain>.workers.dev/mcp"],
"env": {
"MCP_AUTH_TOKEN": "<your-generated-token>"
}
}
}
}
D1 Schema Migration
CREATE TABLE IF NOT EXISTS memory_entries (
id TEXT PRIMARY KEY,
content TEXT NOT NULL,
category TEXT DEFAULT 'general',
source TEXT DEFAULT 'manual',
created_at TEXT NOT NULL,
is_deleted INTEGER DEFAULT 0
);
CREATE INDEX IF NOT EXISTS idx_category ON memory_entries(category);
CREATE INDEX IF NOT EXISTS idx_created ON memory_entries(created_at DESC);
Wrangler Environment Variables
wrangler secret put MCP_AUTH_TOKEN
wrangler d1 create memory-db
wrangler vectorize create memory-index --dimensions=384 --metric=cosine
Quick Start Guide
- Initialize Resources: Run
wrangler d1 createandwrangler vectorize createwithdimensions=384. Note the database and index IDs. - Deploy Worker: Push the TypeScript implementation to Cloudflare. Bind D1, Vectorize, and AI services in
wrangler.toml. SetMCP_AUTH_TOKENviawrangler secret put. - Run Schema: Execute the D1 migration SQL in the Cloudflare dashboard or via
wrangler d1 execute. - Connect Client: Add the MCP configuration JSON to your LLM client. Restart the application to register tools.
- Verify Pipeline: Call
store_notewith test data, thenfetch_contextwith a semantically related query. Confirm retrieval returns the correct entry within 200ms.
This architecture transforms stateless LLM interactions into continuous, context-aware workflows. By decoupling memory storage from retrieval syntax and hosting the entire pipeline on edge infrastructure, you gain programmatic control without infrastructure overhead. The result is a reliable, queryable memory layer that scales with your usage and remains fully auditable.
