Building a Semantic Memory Layer for LLM Agents on the Edge

Current Situation Analysis

Large language models operate on a fundamentally stateless architecture. Each new session initializes with a blank context window, discarding prior interactions, project decisions, and user preferences. This design choice optimizes for security and predictable inference costs, but it creates a severe friction point for developers building agentic workflows or long-running development assistants.

The industry has responded with platform-native memory features. While convenient, these implementations are intentionally opaque. They function as heuristic black boxes: you cannot tag entries, filter by timestamp, run semantic queries, or control what gets surfaced. The model decides what to retain based on internal weighting algorithms that are neither documented nor adjustable. For engineering teams, this lack of programmatic control is unacceptable. You cannot build reliable automation around a system you cannot query.

The alternative—external vector databases—introduces infrastructure complexity. Standalone solutions like Pinecone, Weaviate, or Milvus require separate provisioning, authentication layers, and monthly costs that scale with usage. For personal projects or small teams, the operational overhead outweighs the benefit. The gap between stateless LLMs and persistent, queryable knowledge remains one of the most overlooked architectural challenges in modern AI development.

WOW Moment: Key Findings

The breakthrough isn't storing memory; it's how you retrieve it. Traditional keyword search fails when human intent diverges from stored terminology. Semantic vector search decouples storage syntax from retrieval intent, enabling meaning-based matching without exact lexical overlap.

Approach	Retrieval Precision	Infrastructure Cost (Monthly)	Query Flexibility
Platform Native Memory	Low (heuristic, opaque)	$0	None (black box)
Keyword/SQL Storage	Medium (exact match dependent)	$5–$20+	High (structured filters)
Edge-Hosted Semantic Vector MCP	High (meaning-based)	$0 (personal scale)	Full programmatic control

This finding matters because it shifts memory from a passive feature to an active engineering primitive. When retrieval operates on semantic proximity rather than string matching, you can store raw observations, technical decisions, or user feedback, and query them using natural language intent. The system surfaces relevant context regardless of whether the original note used the same terminology as the current prompt. This enables true continuity across sessions without bloating context windows or relying on platform-specific workarounds.

Core Solution

The architecture leverages Cloudflare's edge ecosystem to create a self-contained memory layer that communicates with LLMs via the Model Context Protocol (MCP). The stack combines four components:

Cloudflare Workers: Stateless compute for handling MCP tool calls and routing requests.
D1 (SQLite): Relational storage for structured metadata (timestamps, categories, sources, deletion flags).
Vectorize: Managed vector index for storing and querying high-dimensional embeddings.
Workers AI: On-platform embedding generation using bge-small-en-v1.5, producing 384-dimensional vectors.

Architecture Decisions & Rationale

Why MCP? MCP standardizes how LLMs interact with external tools. Instead of building custom API wrappers or prompt engineering hacks, you expose typed tools (store_note, fetch_context, list_entries, remove_entry). The LLM client automatically generates tool calls when it detects memory-related intent. This removes guesswork and ensures consistent behavior across different model providers.

Why Semantic Vectors? Keyword search requires you to anticipate query phrasing. Semantic search converts text into mathematical representations where similar concepts cluster together. Storing "users abandon checkout at the payment gateway" and querying "onboarding friction points" returns a match because the embedding space captures conceptual proximity, not lexical overlap.

Why bge-small-en-v1.5? The model outputs 384-dimensional vectors, which balances retrieval accuracy with storage efficiency. Higher dimensions increase index size and query latency without proportional gains in semantic precision for general-purpose text. Workers AI hosts this model natively, eliminating external API calls and keeping the entire pipeline within Cloudflare's free tier.

Implementation

The worker exposes an MCP-compatible endpoint. Below is a production-ready TypeScript implementation demonstrating tool registration, embedding generation, and vector retrieval.

// worker/src/index.ts
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { z } from 'zod';
import { D1Database } from '@cloudflare/workers-types';

export interface Env {
  DB: D1Database;
  VECTORIZE: any;
  AI: any;
  MCP_AUTH_TOKEN: string;
}

const server = new McpServer({
  name: 'edge-memory-layer',
  version: '1.0.0'
});

// Tool 1: Store a new memory entry
server.tool(
  'store_note',
  'Persist a new observation or decision into the memory layer',
  {
    content: z.string().min(10).max(2000),
    category: z.enum(['project', 'user_pref', 'technical', 'feedback']).optional(),
    source: z.string().optional()
  },
  async ({ content, category, source }) => {
    const embedding = await generateEmbedding(content);
    
    // Upsert to vector index
    await env.VECTORIZE.upsert([{
      id: crypto.randomUUID(),
      values: embedding,
      metadata: { category: category || 'general', source: source || 'manual', created_at: new Date().toISOString() }
    }]);

    // Store metadata in D1 for structured filtering
    await env.DB.prepare(`
      INSERT INTO memory_entries (id, content, category, source, created_at)
      VALUES (?, ?, ?, ?, ?)
    `).bind(
      crypto.randomUUID(),
      content,
      category || 'general',
      source || 'manual',
      new Date().toISOString()
    ).run();

    return { content: [{ type: 'text', text: 'Note stored successfully.' }] };
  }
);

// Tool 2: Retrieve context via semantic search
server.tool(
  'fetch_context',
  'Search memory using semantic similarity',
  {
    query: z.string().min(3),
    top_k: z.number().min(1).max(10).default(5),
    category_filter: z.string().optional()
  },
  async ({ query, top_k, category_filter }) => {
    const query_embedding = await generateEmbedding(query);
    
    const results = await env.VECTORIZE.query(query_embedding, {
      topK: top_k,
      returnMetadata: 'all'
    });

    // Optional: filter by category using D1 if metadata isn't sufficient
    let filtered_ids = results.matches.map(m => m.id);
    if (category_filter) {
      const db_results = await env.DB.prepare(`
        SELECT id FROM memory_entries 
        WHERE category = ? AND id IN (${filtered_ids.map(() => '?').join(',')})
      `).bind(category_filter, ...filtered_ids).all();
      filtered_ids = db_results.results.map(r => r.id);
    }

    const context_blocks = results.matches
      .filter(m => filtered_ids.includes(m.id))
      .map(m => `• [${m.metadata.category}] ${m.metadata.content}`);

    return {
      content: [{
        type: 'text',
        text: context_blocks.length > 0 
          ? `Retrieved ${context_blocks.length} relevant entries:\n${context_blocks.join('\n')}`
          : 'No matching context found.'
      }]
    };
  }
);

// Tool 3: List recent entries
server.tool(
  'list_entries',
  'Retrieve recent memory entries with optional pagination',
  {
    limit: z.number().min(1).max(50).default(10),
    offset: z.number().min(0).default(0)
  },
  async ({ limit, offset }) => {
    const rows = await env.DB.prepare(`
      SELECT id, content, category, created_at 
      FROM memory_entries 
      ORDER BY created_at DESC 
      LIMIT ? OFFSET ?
    `).bind(limit, offset).all();

    return {
      content: [{
        type: 'text',
        text: JSON.stringify(rows.results, null, 2)
      }]
    };
  }
);

// Tool 4: Remove an entry
server.tool(
  'remove_entry',
  'Permanently delete a memory entry by ID',
  {
    entry_id: z.string().uuid()
  },
  async ({ entry_id }) => {
    await env.DB.prepare('DELETE FROM memory_entries WHERE id = ?').bind(entry_id).run();
    await env.VECTORIZE.deleteByIds([entry_id]);
    return { content: [{ type: 'text', text: 'Entry removed.' }] };
  }
);

// Embedding pipeline
async function generateEmbedding(text: string): Promise<number[]> {
  const response = await env.AI.run('@cf/baai/bge-small-en-v1.5', {
    text: [text]
  });
  return response.data[0];
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // MCP routing logic would go here
    return server.handleRequest(request);
  }
};

Why This Structure Works

Separation of Concerns: D1 handles structured queries and deletion. Vectorize handles similarity search. Workers AI handles embedding generation. This prevents vendor lock-in and allows independent scaling.
384-Dimension Constraint: bge-small-en-v1.5 outputs exactly 384 floats. Vectorize indexes must be created with dimensions: 384. Mismatching this value causes silent query failures.
MCP Tool Typing: Using Zod schemas ensures the LLM client generates valid payloads. Invalid requests are rejected before hitting the vector index, saving compute and preventing index corruption.
Metadata-Driven Filtering: Storing category and source in both Vectorize metadata and D1 enables hybrid retrieval. You can run fast semantic search first, then apply relational filters for precision.

Pitfall Guide

1. Embedding Dimension Mismatch

Explanation: Vectorize requires a fixed dimension count at index creation. If you generate 768-dim embeddings but index expects 384, queries return empty results or throw type errors. Fix: Verify @cf/baai/bge-small-en-v1.5 outputs 384 dimensions. Create the index with dimensions: 384 and validate payload length before upserting.

2. Prompt Injection via Memory Context

Explanation: Retrieved memory blocks are injected directly into the system prompt. If stored content contains malicious instructions or malformed syntax, it can override model behavior. Fix: Sanitize retrieved text before injection. Wrap memory context in explicit XML tags (<memory_context>...</memory_context>) and instruct the model to treat it as reference data, not executable instructions.

3. Context Window Budget Overflow

Explanation: Recalling too many entries or storing verbose notes quickly consumes the LLM's context window, degrading performance or triggering truncation. Fix: Enforce top_k limits (3–5 entries). Implement a relevance threshold (e.g., cosine similarity > 0.75). Truncate stored content to essential facts before embedding.

4. Local Development Blind Spots

Explanation: Vectorize and Workers AI do not run in local wrangler dev environments. Developers often mock embeddings incorrectly, leading to production mismatches. Fix: Use environment flags to toggle between mock embeddings (random 384-dim arrays) and remote API calls. Test vector queries against a staging index before deploying.

5. Unstructured Note Ingestion

Explanation: Dumping raw conversation logs or unformatted text creates noisy embeddings. The model struggles to extract signal from unstructured dumps. Fix: Enforce a ingestion schema. Require category, timestamp, and concise content. Strip UI noise, timestamps, and conversational filler before embedding.

6. Ignoring Free Tier Rate Limits

Explanation: Workers AI and Vectorize have request caps on the free tier. High-frequency recall calls or bulk imports will trigger 429 errors. Fix: Implement request batching for imports. Cache frequent queries with a short TTL. Monitor usage via Cloudflare dashboard and set up alerts at 80% threshold.

7. Missing Auto-Recall Trigger

Explanation: The LLM won't fetch memory unless explicitly prompted. Without an initialization trigger, sessions start blank despite stored data. Fix: Configure the MCP client to call fetch_context on session start. Pass a default query like "project status and recent decisions" to prime the context window automatically.

Production Bundle

Action Checklist

Provision Cloudflare D1 database and Vectorize index with dimensions: 384
Deploy worker with bge-small-en-v1.5 binding and auth token configuration
Run D1 schema migration to create memory_entries table with indexed columns
Configure MCP client to auto-call fetch_context on session initialization
Implement content sanitization pipeline before embedding ingestion
Set up usage monitoring and alerting for Workers AI and Vectorize rate limits
Test hybrid retrieval (semantic + category filter) with staging data
Document ingestion schema and enforce it across all capture channels

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Personal AI assistant	Edge-hosted Semantic MCP	Zero infrastructure cost, full control, low latency	$0/month
Team knowledge base	Centralized Vector DB + MCP Gateway	Multi-user access, audit trails, role-based filtering	$15–$50/month
High-volume API ingestion	Batched embedding pipeline + Redis cache	Reduces AI API calls, handles throughput spikes	$5–$20/month (compute)
Compliance-heavy environment	On-prem vector store + local embedding model	Data residency, auditability, no external API calls	Hardware + maintenance

Configuration Template

MCP Client Configuration (Claude Desktop / Cursor / Custom Client)

{
  "mcpServers": {
    "edge-memory": {
      "command": "npx",
      "args": ["mcp-remote", "https://<your-worker-subdomain>.workers.dev/mcp"],
      "env": {
        "MCP_AUTH_TOKEN": "<your-generated-token>"
      }
    }
  }
}

D1 Schema Migration

CREATE TABLE IF NOT EXISTS memory_entries (
  id TEXT PRIMARY KEY,
  content TEXT NOT NULL,
  category TEXT DEFAULT 'general',
  source TEXT DEFAULT 'manual',
  created_at TEXT NOT NULL,
  is_deleted INTEGER DEFAULT 0
);

CREATE INDEX IF NOT EXISTS idx_category ON memory_entries(category);
CREATE INDEX IF NOT EXISTS idx_created ON memory_entries(created_at DESC);

Wrangler Environment Variables

wrangler secret put MCP_AUTH_TOKEN
wrangler d1 create memory-db
wrangler vectorize create memory-index --dimensions=384 --metric=cosine

Quick Start Guide

Initialize Resources: Run wrangler d1 create and wrangler vectorize create with dimensions=384. Note the database and index IDs.
Deploy Worker: Push the TypeScript implementation to Cloudflare. Bind D1, Vectorize, and AI services in wrangler.toml. Set MCP_AUTH_TOKEN via wrangler secret put.
Run Schema: Execute the D1 migration SQL in the Cloudflare dashboard or via wrangler d1 execute.
Connect Client: Add the MCP configuration JSON to your LLM client. Restart the application to register tools.
Verify Pipeline: Call store_note with test data, then fetch_context with a semantically related query. Confirm retrieval returns the correct entry within 200ms.

This architecture transforms stateless LLM interactions into continuous, context-aware workflows. By decoupling memory storage from retrieval syntax and hosting the entire pipeline on edge infrastructure, you gain programmatic control without infrastructure overhead. The result is a reliable, queryable memory layer that scales with your usage and remains fully auditable.