I built a zero-dependency C# Vector Database that saturates DDR5 RAM bandwidth

Current Situation Analysis

Modern AI pipelines, particularly Retrieval-Augmented Generation (RAG) and autonomous agent memory systems, rely heavily on vector similarity search. The industry standard response to this requirement has converged on three patterns: managed cloud vector services, containerized heavyweights written in Rust or Go, and high-level language wrappers that abstract away the underlying math. While these options reduce initial development friction, they introduce significant runtime overhead that becomes problematic at scale.

The core misunderstanding lies in treating vector search as a complex distributed systems problem rather than a linear algebra operation. A vector database is fundamentally a dense, two-dimensional matrix of floating-point numbers paired with a tight comparison loop. When developers reach for containerized solutions or managed wrappers, they inherit network serialization costs, inter-process communication latency, and dependency trees that can exceed 50 MB. More critically, they often ignore how the host runtime manages memory.

Consider a standard RAG workload: 100,000 document chunks, each embedded using OpenAI's text-embedding-3-small model (1,536 dimensions). This yields 153.6 million floating-point values; at 4 bytes each, that is about 614 MB of RAM. If stored as a jagged array (float[][]), the .NET runtime allocates 100,000 separate array objects, each with its own object header. This fractures cache locality, forces the garbage collector to track 100,000 long-lived references, and can introduce unpredictable pause times during high-throughput search operations. The overhead isn't in the algorithm; it's in the memory layout and runtime abstraction layers.

WOW Moment: Key Findings

When the memory layout is flattened, bounds checking is eliminated in the hot path, and hardware intrinsics are leveraged correctly, a managed runtime can bypass traditional bottlenecks and approach physical hardware limits. The following comparison illustrates the performance delta between a conventional containerized approach and a zero-dependency, SIMD-optimized .NET implementation.

Approach	Search Latency (100k/1536d)	Memory Bandwidth Utilization	GC Pressure	Dependency Footprint
Containerized (Docker/Python)	45–120 ms	15–25 GB/s	High (interpreter + runtime)	50–200 MB
Optimized .NET (Flat/SIMD)	7.15 ms	~85 GB/s	Near-zero (pre-allocated)	0 MB

The optimized implementation scans the entire 153.6 million float dataset (about 614 MB) in 7.15 milliseconds. At that throughput, the compute kernel demands approximately 85 to 86 GB/s of memory bandwidth. That is close to the theoretical peak of dual-channel DDR5 (89.6 GB/s at DDR5-5600, for example), so the bottleneck shifts from software overhead to the memory subsystem. The result suggests that with careful memory management and instruction-level parallelism, vector search at this scale can run entirely in-process without external services.
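The bandwidth figure can be rechecked with a few lines of arithmetic. The sketch below is illustrative only: the class and method names are hypothetical, and the DDR5-5600 speed grade is an assumption, since the article does not state the memory speed of the test machine.

```csharp
using System;

public static class BandwidthEstimate
{
    // Bytes occupied by `count` vectors of `dimensions` 32-bit floats.
    public static long DatasetBytes(long count, long dimensions) =>
        count * dimensions * sizeof(float);

    // Decimal gigabytes per second needed to read `bytes` once in `milliseconds`.
    public static double RequiredGBps(long bytes, double milliseconds) =>
        bytes / (milliseconds / 1000.0) / 1e9;

    // Theoretical peak: transfers per second * 8 bytes per 64-bit channel * channels.
    // The speed grade passed in is an assumption about the hardware, not a measurement.
    public static double PeakGBps(double megaTransfersPerSecond, int channels) =>
        megaTransfersPerSecond * 1e6 * 8 * channels / 1e9;
}
```

With 100,000 vectors of 1,536 dimensions and a 7.15 ms scan, `RequiredGBps` comes out just under 86 GB/s, while `PeakGBps(5600, 2)` gives 89.6 GB/s. A slower memory kit lowers that ceiling, so it is worth plugging in the numbers for the actual target machine.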

Core Solution

Building a high-performance vector search engine in .NET requires three architectural decisions: contiguous memory allocation, hardware-accelerated comparison kernels, and a lightweight integration layer. Each decision targets a specific bottleneck in the traditional stack.

Step 1: Contiguous Memory Layout

Jagged arrays destroy spatial locality. Instead, allocate a single contiguous buffer for all embeddings. For workloads exceeding available RAM, back the buffer with a memory-mapped file. This approach eliminates object headers, guarantees sequential memory access, and allows the OS to handle paging transparently.

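using System;

public sealed class TensorBuffer : IDisposable
{
    private readonly float[] _flatData;   // all rows stored back to back
    private readonly int _dimensions;
    private readonly int _count;

    public TensorBuffer(int count, int dimensions)
    {
        _count = count;
        _dimensions = dimensions;
        // checked: fail loudly instead of silently wrapping when count * dimensions exceeds int.MaxValue
        _flatData = new float[checked(count * dimensions)];
    }

    public Span<float> GetRow(int index)
    {
        // Reject negative or out-of-range rows before computing the offset
        if ((uint)index >= (uint)_count)
            throw new ArgumentOutOfRangeException(nameof(index));

        int offset = index * _dimensions;
        return new Span<float>(_flatData, offset, _dimensions);
    }

    // The array is managed memory, so there is nothing to release; Dispose only zeroes the data.
    public void Dispose() => Array.Clear(_flatData);
}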
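For the memory-mapped variant mentioned above, one possible shape is sketched below. It is not taken from the original project: `MappedTensorBuffer` and its members are hypothetical names, the file layout (raw little-endian floats, row after row, no header) is an assumption, and the code needs `AllowUnsafeBlocks` enabled.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public sealed unsafe class MappedTensorBuffer : IDisposable
{
    private readonly MemoryMappedFile _file;
    private readonly MemoryMappedViewAccessor _view;
    private readonly byte* _base;        // start of the mapped region
    private readonly int _dimensions;
    private readonly int _count;

    public MappedTensorBuffer(string path, int count, int dimensions)
    {
        _count = count;
        _dimensions = dimensions;
        long bytes = (long)count * dimensions * sizeof(float);

        // Opens the file, creating it and growing it to `bytes` if necessary.
        _file = MemoryMappedFile.CreateFromFile(path, FileMode.OpenOrCreate, null, bytes);
        _view = _file.CreateViewAccessor(0, bytes);

        byte* p = null;
        _view.SafeMemoryMappedViewHandle.AcquirePointer(ref p);
        _base = p + _view.PointerOffset;
    }

    // Returns a span directly over the mapped pages; no copy is made.
    public Span<float> GetRow(int index)
    {
        if ((uint)index >= (uint)_count)
            throw new ArgumentOutOfRangeException(nameof(index));
        return new Span<float>(_base + (long)index * _dimensions * sizeof(float), _dimensions);
    }

    public void Dispose()
    {
        _view.SafeMemoryMappedViewHandle.ReleasePointer();
        _view.Dispose();
        _file.Dispose();
    }
}
```

Spans returned by `GetRow` must not be used after `Dispose`, because the underlying mapping is gone at that point. Rows that are not resident in memory are paged in by the operating system on first touch, so the first scan after a cold start will be slower than later ones.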

Step 2: SIMD-Optimized Comparison Kernel

Cosine similarity between normalized vectors reduces to a dot product. A scalar loop performs one multiply-add per iteration. With System.Runtime.Intrinsics, each 256-bit instruction processes eight floats at once. Four-way loop unrolling keeps four independent fused multiply-add (FMA) chains in flight, so the out-of-order core can overlap them instead of waiting on a single accumulator.

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class SimdSearchEngine
{
    // Requires AllowUnsafeBlocks and .NET 7 or later (Vector256.Load, Vector256.Sum).
    public static unsafe float ComputeDotProduct(ReadOnlySpan<float> query, ReadOnlySpan<float> candidate)
    {
        // The pointer loop below has no bounds checks, so validate lengths up front.
        if (query.Length != candidate.Length)
            throw new ArgumentException("Vectors must have the same length.", nameof(candidate));

        // The kernel issues FMA instructions, so check FMA support rather than AVX2.
        if (!Fma.IsSupported)
            return ComputeScalar(query, candidate);

        fixed (float* pQ = query)
        fixed (float* pC = candidate)
        {
            // Four independent accumulators, eight lanes each
            var acc0 = Vector256<float>.Zero;
            var acc1 = Vector256<float>.Zero;
            var acc2 = Vector256<float>.Zero;
            var acc3 = Vector256<float>.Zero;

            int i = 0;
            int limit = query.Length - 32;

            // Main loop: 32 floats per iteration
            while (i <= limit)
            {
                acc0 = Fma.MultiplyAdd(Vector256.Load(pQ + i),      Vector256.Load(pC + i),      acc0);
                acc1 = Fma.MultiplyAdd(Vector256.Load(pQ + i + 8),  Vector256.Load(pC + i + 8),  acc1);
                acc2 = Fma.MultiplyAdd(Vector256.Load(pQ + i + 16), Vector256.Load(pC + i + 16), acc2);
                acc3 = Fma.MultiplyAdd(Vector256.Load(pQ + i + 24), Vector256.Load(pC + i + 24), acc3);
                i += 32;
            }

            var combined = Vector256.Add(Vector256.Add(acc0, acc1), Vector256.Add(acc2, acc3));
            // Horizontal sum over all eight lanes (summing only elements 0..3 would drop half the result)
            float result = Vector256.Sum(combined);

            // Handle remaining elements
            for (; i < query.Length; i++)
                result += pQ[i] * pC[i];

            return result;
        }
    }

    // Scalar fallback for CPUs without FMA support
    private static float ComputeScalar(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        float sum = 0f;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }
}
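Between the x86-specific kernel and the plain scalar fallback there is room for a portable middle tier. The sketch below uses `System.Numerics.Vector<T>`, which the JIT maps to whatever SIMD width the machine offers (including Arm NEON). It is an illustrative alternative, not part of the original implementation, and `PortableDotProduct` is a hypothetical name.

```csharp
using System;
using System.Numerics;

public static class PortableDotProduct
{
    public static float Compute(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length)
            throw new ArgumentException("Vectors must have the same length.", nameof(b));

        int width = Vector<float>.Count;   // lanes per register on this machine
        var acc = Vector<float>.Zero;
        int i = 0;

        // Vectorized body: `width` floats per iteration
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<float>(a.Slice(i, width));
            var vb = new Vector<float>(b.Slice(i, width));
            acc += va * vb;                // lane-wise multiply, then accumulate
        }

        // Dot with a vector of ones sums all lanes of the accumulator.
        float sum = Vector.Dot(acc, Vector<float>.One);

        // Scalar tail for lengths that are not a multiple of `width`
        for (; i < a.Length; i++)
            sum += a[i] * b[i];

        return sum;
    }
}
```

This version stays in safe code, so it keeps bounds checks on the slices and will generally be slower than the unrolled FMA kernel. It is mainly useful as a reference implementation to test the unsafe kernel against.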

Step 3: Standard I/O Integration Layer

External APIs introduce serialization overhead and network latency. For AI agents, a Model Context Protocol (MCP) server running over standard input/output provides a low-latency, zero-dependency communication channel. The host process reads newline-delimited JSON-RPC requests from stdin and writes responses to stdout, enabling direct integration with Claude Desktop, Cursor, or custom orchestration frameworks.

using System.IO;
using System.Threading;
using System.Threading.Tasks;

public class McpProtocolBridge
{
    private readonly TensorBuffer _store;
    private readonly Stream _input;
    private readonly Stream _output;

    public McpProtocolBridge(TensorBuffer store, Stream input, Stream output)
    {
        _store = store;
        _input = input;
        _output = output;
    }

    public async Task RunAsync(CancellationToken ct)
    {
        using var reader = new StreamReader(_input);
        using var writer = new StreamWriter(_output) { AutoFlush = true };

        while (!ct.IsCancellationRequested)
        {
            var line = await reader.ReadLineAsync(ct);
            if (line is null) break;                        // end of stream: the client closed stdin
            if (string.IsNullOrWhiteSpace(line)) continue;  // skip blank lines

            var response = ProcessRequest(line);
            await writer.WriteLineAsync(response);
        }
    }

    private string ProcessRequest(string json)
    {
        // Placeholder: parse JSON-RPC, route to search/add methods, return serialized result.
        // A real handler must echo the request's own "id" instead of the hard-coded 1 below.
        return "{\"jsonrpc\":\"2.0\",\"id\":1,\"result\":{\"status\":\"ok\"}}";
    }
}
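The request handler in the bridge is left as a stub. A minimal sketch of what the parsing and routing step might look like with `System.Text.Json` follows. The class name `JsonRpcRouter` is hypothetical, the method names are borrowed from the configuration template later in the article, and the sketch covers only single requests: it does not implement the full MCP message set, batches, or notifications.

```csharp
using System;
using System.Text.Json;

public static class JsonRpcRouter
{
    // Parses one JSON-RPC 2.0 request line and returns one response line.
    public static string Handle(string json)
    {
        try
        {
            using JsonDocument doc = JsonDocument.Parse(json);
            JsonElement root = doc.RootElement;
            if (root.ValueKind != JsonValueKind.Object)
                return Error(null, -32600, "Invalid Request");

            // Echo the caller's id so it can match the response to its request.
            object? id = root.TryGetProperty("id", out JsonElement idElement) ? idElement.Clone() : null;

            string? method = root.TryGetProperty("method", out JsonElement m) && m.ValueKind == JsonValueKind.String
                ? m.GetString()
                : null;

            return method switch
            {
                // Hypothetical method names; real handlers would call into the vector store here.
                "search_vectors" or "add_vector" =>
                    JsonSerializer.Serialize(new { jsonrpc = "2.0", id, result = new { status = "ok" } }),
                _ => Error(id, -32601, "Method not found"),
            };
        }
        catch (JsonException)
        {
            // -32700 is the JSON-RPC 2.0 code for unparseable input.
            return Error(null, -32700, "Parse error");
        }
    }

    private static string Error(object? id, int code, string message) =>
        JsonSerializer.Serialize(new { jsonrpc = "2.0", id, error = new { code, message } });
}
```

Serializing through `JsonSerializer` rather than concatenating strings keeps each response on a single line with proper escaping, which matters because the stdio transport treats a newline as the message boundary.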

Architecture Rationale

Flat allocation lets the hardware prefetcher work efficiently. The scan reads 64-byte cache lines sequentially, minimizing L1/L2 cache misses.
fixed pointers obtained from Span<T> remove bounds checks from the hot path, while the span-based signature keeps the API boundary safe, provided the lengths are validated before the loop.
4-way FMA unrolling with independent accumulators hides FMA latency. Each accumulator forms its own dependency chain, so the core can issue a new FMA while earlier ones are still completing.
stdio MCP removes TCP handshake overhead, TLS negotiation, and HTTP framing. The client launches the search engine as a child process and communicates over pipes, which typically keeps transport latency in the microsecond range.

Pitfall Guide

1. Jagged Array Storage

Explanation: Using float[][] creates thousands of small heap objects. The GC must track each reference, and memory fragmentation destroys spatial locality. Fix: Allocate a single contiguous float[] or use MemoryMarshal.CreateSpan over a pre-allocated buffer. Access rows via offset arithmetic.

2. Scalar Dot Product Loops

Explanation: Traditional for loops process one float per iteration. Modern CPUs can execute multiple FMA instructions per cycle, but scalar code leaves execution ports idle. Fix: Use System.Runtime.Intrinsics with loop unrolling. Always include a scalar fallback for architectures without AVX2/FMA support.

3. Ignoring Vector Normalization

Explanation: The dot product equals cosine similarity only when both vectors have unit length. If embeddings are not normalized, the dot product returns magnitude-weighted scores, which can break ranking logic. Fix: Normalize vectors during ingestion or apply L2 normalization in the search kernel. Pre-normalization is faster for read-heavy workloads.
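A minimal sketch of ingestion-time L2 normalization is shown below. `VectorNormalizer` is a hypothetical helper, and rejecting all-zero vectors is a design assumption rather than something the article prescribes.

```csharp
using System;

public static class VectorNormalizer
{
    // Scales `vector` in place to unit L2 length. Returns false, leaving the data
    // untouched, for an all-zero vector, which has no direction to preserve.
    public static bool NormalizeInPlace(Span<float> vector)
    {
        double sumSquares = 0.0;   // accumulate in double to limit rounding error
        foreach (float v in vector)
            sumSquares += (double)v * v;

        if (sumSquares == 0.0)
            return false;

        float inverseNorm = (float)(1.0 / Math.Sqrt(sumSquares));
        for (int i = 0; i < vector.Length; i++)
            vector[i] *= inverseNorm;

        return true;
    }
}
```

Applying this to a row span before it is written into the flat buffer means the search kernel can treat every dot product as a cosine score without further work.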

4. Unsafe Pointer Leaks

Explanation: A fixed statement pins its target for as long as the block executes. Wide or long-running fixed blocks keep objects pinned, which hinders GC compaction and can fragment the heap, and any pointer that escapes the block dangles once the object is free to move. Fix: Limit fixed scope to the tightest possible block. Prefer Span<T> and ref struct patterns that enforce stack-only lifetimes.

5. Blocking stdio MCP Streams

Explanation: A synchronous read from stdin blocks the calling thread until a line arrives, so in a single-threaded server no other work can proceed while it waits. Fix: Use StreamReader.ReadLineAsync() with cancellation tokens, and treat a null result as end of input. Implement backpressure by buffering responses if the agent consumes output slower than it's produced.

6. Over-Allocation in the Hot Path

Explanation: Creating new arrays, lists, or strings during search allocates on the managed heap and eventually triggers garbage collections. Even small allocations compound under high concurrency. Fix: Reuse buffers via ArrayPool<T> or pre-allocate result containers. Return ValueTask or ref struct results to avoid heap pressure.
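As a sketch of the buffer-reuse advice, the example below rents a scratch array from `ArrayPool<float>.Shared` for per-row scores instead of allocating one per search. `PooledScoring` is a hypothetical name, and the scalar inner loop is only a stand-in for the SIMD kernel.

```csharp
using System;
using System.Buffers;

public static class PooledScoring
{
    // Scores every row of a flat buffer against `query` and returns the index
    // of the best-scoring row, or -1 if there are no rows.
    public static int BestRow(ReadOnlySpan<float> flat, int dimensions, ReadOnlySpan<float> query)
    {
        int rows = flat.Length / dimensions;
        float[] rented = ArrayPool<float>.Shared.Rent(rows);   // may be longer than `rows`
        try
        {
            Span<float> scores = rented.AsSpan(0, rows);
            for (int r = 0; r < rows; r++)
            {
                ReadOnlySpan<float> row = flat.Slice(r * dimensions, dimensions);
                float dot = 0f;
                for (int i = 0; i < dimensions; i++)
                    dot += row[i] * query[i];
                scores[r] = dot;
            }

            int best = -1;
            for (int r = 0; r < rows; r++)
                if (best < 0 || scores[r] > scores[best])
                    best = r;
            return best;
        }
        finally
        {
            // Always hand the buffer back, even if scoring throws.
            ArrayPool<float>.Shared.Return(rented);
        }
    }
}
```

A rented array can be larger than requested and may contain stale data from a previous user, which is why the code slices it to `rows` and overwrites every element before reading.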

7. Hardware Capability Blindness

Explanation: Assuming AVX2 or FMA is always available causes PlatformNotSupportedException on older CPUs or certain cloud VMs. Fix: Check Avx2.IsSupported and Fma.IsSupported at startup. Route to optimized or fallback kernels dynamically. Log capability detection for observability.
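One way to structure the startup check is to decide on a kernel once and log the outcome. The sketch below is an assumption about how this could look: `KernelSelector` and its `Kernel` enum are hypothetical, and the middle tier presumes a portable `Vector<T>` kernel exists alongside the FMA and scalar ones.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics.X86;

public static class KernelSelector
{
    public enum Kernel { Fma256, PortableVector, Scalar }

    // Decide once at startup which dot-product kernel the process should use.
    public static Kernel Detect()
    {
        if (Fma.IsSupported) return Kernel.Fma256;                        // x86 with AVX and FMA
        if (Vector.IsHardwareAccelerated) return Kernel.PortableVector;   // other SIMD hardware
        return Kernel.Scalar;
    }

    public static Kernel DetectAndLog()
    {
        Kernel kernel = Detect();
        // Written to stderr because, with a stdio transport, stdout carries protocol messages.
        Console.Error.WriteLine(
            $"SIMD kernel: {kernel} (Fma={Fma.IsSupported}, Avx2={Avx2.IsSupported}, " +
            $"Vector<float>.Count={Vector<float>.Count})");
        return kernel;
    }
}
```

The `IsSupported` properties are safe to read on any platform and simply return false where the instruction set is missing, so this check itself never throws.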

Production Bundle

Action Checklist

Validate embedding dimensions match your model (e.g., 1536 for text-embedding-3-small)
Pre-normalize all vectors during ingestion to avoid runtime L2 calculations
Implement hardware capability detection before initializing the SIMD kernel
Use ArrayPool<float> for temporary query buffers to eliminate GC pressure
Configure stdio MCP with async stream readers and cancellation token propagation
Benchmark memory bandwidth utilization against your target hardware's theoretical limit
Add structured logging for cache hit rates, SIMD fallback triggers, and search latency percentiles

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Local development / prototyping	In-memory flat buffer	Fastest iteration, zero infrastructure, instant restarts	$0 (CPU/RAM only)
High-throughput RAG pipeline	Memory-mapped file + SIMD kernel	Persists across restarts, handles datasets larger than RAM, keeps latency low while the hot pages stay resident	Moderate (disk I/O)
Multi-tenant cloud deployment	Containerized vector service	Isolation, scaling, managed backups, RBAC	High (cloud compute + licensing)

Configuration Template

{
  "vectorStore": {
    "dimensions": 1536,
    "capacity": 100000,
    "storageMode": "InMemory",
    "normalization": "PreIngest",
    "simdFallback": true
  },
  "mcpBridge": {
    "transport": "Stdio",
    "maxConcurrentRequests": 4,
    "timeoutMs": 5000,
    "tools": ["add_vector", "search_vectors", "get_metadata"]
  },
  "performance": {
    "enableHardwareIntrinsics": true,
    "bufferPoolSize": 1024,
    "logLatencyPercentiles": [50, 95, 99]
  }
}
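A template like this could be bound to typed options with `System.Text.Json`. The sketch below covers only the `vectorStore` section; the option classes and the validation rule are hypothetical, not part of the original project.

```csharp
using System;
using System.Text.Json;

public sealed class VectorStoreOptions
{
    public int Dimensions { get; set; }
    public int Capacity { get; set; }
    public string StorageMode { get; set; } = "InMemory";
    public string Normalization { get; set; } = "PreIngest";
    public bool SimdFallback { get; set; } = true;
}

public sealed class AppOptions
{
    public VectorStoreOptions VectorStore { get; set; } = new();
}

public static class ConfigLoader
{
    // Web defaults map camelCase JSON names onto PascalCase properties.
    private static readonly JsonSerializerOptions Options = new(JsonSerializerDefaults.Web);

    public static AppOptions Parse(string json)
    {
        AppOptions options = JsonSerializer.Deserialize<AppOptions>(json, Options)
            ?? throw new InvalidOperationException("Configuration is empty.");

        VectorStoreOptions store = options.VectorStore;
        if (store.Dimensions <= 0 || store.Capacity <= 0)
            throw new InvalidOperationException("dimensions and capacity must be positive.");

        // A single float[] cannot hold more than int.MaxValue elements.
        if ((long)store.Dimensions * store.Capacity > int.MaxValue)
            throw new InvalidOperationException("capacity * dimensions exceeds one array.");

        return options;
    }
}
```

Sections that have no matching property, such as `mcpBridge` and `performance` here, are ignored by the deserializer by default, so the loader can grow one section at a time.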

Quick Start Guide

Initialize the buffer: Create a TensorBuffer with your target capacity and embedding dimensions. Ensure vectors are L2-normalized before insertion.
Wire the search kernel: Call SimdSearchEngine.ComputeDotProduct with your query span and each candidate span. Track top-K results using a min-heap or sorted list.
Attach the MCP bridge: Instantiate McpProtocolBridge with Console.OpenStandardInput() and Console.OpenStandardOutput(). Start RunAsync with a CancellationToken and keep the process alive until it completes.
Test with an agent: Configure your AI client to point to the compiled executable. Verify add_vector and search_vectors JSON-RPC calls return correctly formatted responses within sub-10ms windows.
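The top-K tracking mentioned in the steps above can be sketched with `PriorityQueue<TElement, TPriority>` (.NET 6 and later), used as a min-heap that always holds the K best rows seen so far. `TopKSearch` is a hypothetical name, and the scalar `Dot` helper stands in for the SIMD kernel so the example stays self-contained.

```csharp
using System;
using System.Collections.Generic;

public static class TopKSearch
{
    // Returns up to k (row, score) pairs in descending score order.
    public static (int Row, float Score)[] Search(
        ReadOnlySpan<float> flat, int dimensions, ReadOnlySpan<float> query, int k)
    {
        if (k <= 0) throw new ArgumentOutOfRangeException(nameof(k));
        if (dimensions <= 0 || query.Length != dimensions)
            throw new ArgumentException("Query length must equal dimensions.", nameof(query));

        int rows = flat.Length / dimensions;
        var heap = new PriorityQueue<int, float>();   // lowest score is dequeued first

        for (int r = 0; r < rows; r++)
        {
            float score = Dot(flat.Slice(r * dimensions, dimensions), query);
            if (heap.Count < k)
                heap.Enqueue(r, score);
            else if (heap.TryPeek(out _, out float worst) && score > worst)
                heap.EnqueueDequeue(r, score);        // replace the current worst entry
        }

        // Dequeue yields ascending scores, so fill the result array from the back.
        var results = new (int Row, float Score)[heap.Count];
        for (int i = results.Length - 1; i >= 0; i--)
        {
            heap.TryDequeue(out int row, out float score);
            results[i] = (row, score);
        }
        return results;
    }

    // Plain scalar dot product; a placeholder for the SIMD kernel.
    private static float Dot(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        float sum = 0f;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }
}
```

The heap never grows beyond K entries, so the extra work per row is a comparison against the current worst score and, occasionally, a heap update. For a latency-critical server the queue and result array could be reused across searches rather than allocated per call.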
