I built a zero-dependency C# Vector Database that saturates DDR5 RAM bandwidth
Current Situation Analysis
Modern AI pipelines, particularly Retrieval-Augmented Generation (RAG) and autonomous agent memory systems, rely heavily on vector similarity search. The industry standard response to this requirement has converged on three patterns: managed cloud vector services, containerized heavyweights written in Rust or Go, and high-level language wrappers that abstract away the underlying math. While these options reduce initial development friction, they introduce significant runtime overhead that becomes problematic at scale.
The core misunderstanding lies in treating vector search as a complex distributed systems problem rather than a linear algebra operation. A vector database is fundamentally a dense, two-dimensional matrix of floating-point numbers paired with a tight comparison loop. When developers reach for containerized solutions or managed wrappers, they inherit network serialization costs, inter-process communication latency, and dependency trees that can exceed 50MB. More critically, they often ignore how the host runtime manages memory.
Consider a standard RAG workload: 100,000 document chunks, each embedded using OpenAI's text-embedding-3-small model (1,536 dimensions). This yields 153.6 million floating-point values, occupying roughly 614 MB of RAM at 4 bytes each. If stored as a jagged array (float[][]), the .NET runtime allocates 100,000 separate array objects, each with its own object header. This fractures cache locality, forces the garbage collector to track 100,000 long-lived references, and can introduce unpredictable pause times during high-throughput search operations. The overhead isn't in the algorithm; it's in the memory layout and runtime abstraction layers.
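The arithmetic behind that workload can be sketched as follows. The class and method names are hypothetical, and the per-array overhead (about 24 bytes of header plus an 8-byte reference in the outer array) is an assumption that holds for a typical 64-bit CLR rather than a guaranteed constant.

```csharp
using System;

public static class FootprintEstimate
{
    // Assumption: 64-bit CLR, where each array object carries roughly 24 bytes
    // of header and length data, and each reference in the outer array is 8 bytes.
    private const long ArrayOverheadBytes = 24;
    private const long ReferenceBytes = 8;

    // Raw payload size of count x dimensions floats stored contiguously.
    public static long FlatBytes(long count, long dimensions)
        => count * dimensions * sizeof(float);

    // Extra bytes a float[][] layout spends on per-row bookkeeping.
    public static long JaggedOverheadBytes(long count)
        => count * (ArrayOverheadBytes + ReferenceBytes);
}
```

For 100,000 rows the bookkeeping is only a few megabytes. Under these assumptions, the larger cost of the jagged layout is the 100,000 separately tracked objects and the loss of contiguity, not the raw byte count.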
WOW Moment: Key Findings
When the memory layout is flattened, bounds checking is eliminated in the hot path, and hardware intrinsics are leveraged correctly, a managed runtime can bypass traditional bottlenecks and approach physical hardware limits. The following comparison illustrates the performance delta between a conventional containerized approach and a zero-dependency, SIMD-optimized .NET implementation.
| Approach | Search Latency (100k/1536d) | Memory Bandwidth Utilization | GC Pressure | Dependency Footprint |
|---|---|---|---|---|
| Containerized (Docker/Python) | 45–120 ms | 15–25 GB/s | High (interpreter + runtime) | 50–200 MB |
| Optimized .NET (Flat/SIMD) | 7.15 ms | ~85 GB/s | Near-zero (pre-allocated) | 0 MB |
The optimized implementation scans the entire 153.6 million float dataset (about 614 MB) in 7.15 milliseconds. At that throughput, the compute kernel demands roughly 85 to 86 GB/s of memory bandwidth, which is close to the theoretical ceiling of dual-channel DDR5 on current desktop platforms (89.6 GB/s for DDR5-5600, for example). The bottleneck shifts from software overhead to silicon limitations, which suggests that with careful memory management and instruction-level parallelism, vector search can run entirely in-process without external services.
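The bandwidth figure above can be reproduced with a few lines of arithmetic. This is a sketch with hypothetical helper names. The DDR5-5600 speed used in the usage note is an assumption for illustration, since the article does not state the memory configuration of the test machine.

```csharp
public static class BandwidthEstimate
{
    // Bytes scanned per second, expressed in GB/s (1 GB = 1e9 bytes).
    public static double RequiredGBps(long floatCount, double elapsedMs)
        => floatCount * (double)sizeof(float) / (elapsedMs / 1000.0) / 1e9;

    // Theoretical peak: transfers per second x 8 bytes per 64-bit channel x channel count.
    public static double TheoreticalPeakGBps(int megaTransfersPerSec, int channels)
        => megaTransfersPerSec * 1e6 * 8 * channels / 1e9;
}
```

Calling RequiredGBps(153_600_000, 7.15) gives about 85.9 GB/s, and TheoreticalPeakGBps(5600, 2) gives 89.6 GB/s, so the scan would be running at roughly 96% of the theoretical peak for that assumed memory speed.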
Core Solution
Building a high-performance vector search engine in .NET requires three architectural decisions: contiguous memory allocation, hardware-accelerated comparison kernels, and a lightweight integration layer. Each decision targets a specific bottleneck in the traditional stack.
Step 1: Contiguous Memory Layout
Jagged arrays destroy spatial locality. Instead, allocate a single contiguous buffer for all embeddings. For workloads exceeding available RAM, back the buffer with a memory-mapped file. This approach eliminates object headers, guarantees sequential memory access, and allows the OS to handle paging transparently.
public sealed class TensorBuffer : IDisposable
{
    private readonly float[] _flatData;
    private readonly int _dimensions;
    private readonly int _count;

    public TensorBuffer(int count, int dimensions)
    {
        _count = count;
        _dimensions = dimensions;
        _flatData = new float[count * dimensions];
    }

    public Span<float> GetRow(int index)
    {
        int offset = index * _dimensions;
        return new Span<float>(_flatData, offset, _dimensions);
    }

    public void Dispose() => Array.Clear(_flatData);
}
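The memory-mapped option mentioned in this step is not shown in the class above. A minimal sketch of a file-backed variant, using System.IO.MemoryMappedFiles, might look like this. The class name MappedTensorBuffer and its methods are hypothetical, and this version copies each row through ReadArray and WriteArray to stay in safe code.

```csharp
using System;
using System.IO;
using System.IO.MemoryMappedFiles;

public sealed class MappedTensorBuffer : IDisposable
{
    private readonly MemoryMappedFile _file;
    private readonly MemoryMappedViewAccessor _view;
    private readonly int _dimensions;

    public MappedTensorBuffer(string path, int count, int dimensions)
    {
        _dimensions = dimensions;
        long bytes = (long)count * dimensions * sizeof(float);

        // Creates the file if needed and grows it to the requested capacity.
        _file = MemoryMappedFile.CreateFromFile(path, FileMode.OpenOrCreate, null, bytes);
        _view = _file.CreateViewAccessor(0, bytes);
    }

    // Copies one row of floats into the mapped region.
    public void WriteRow(int index, float[] values)
        => _view.WriteArray(RowOffset(index), values, 0, _dimensions);

    // Copies one row out of the mapped region into a caller-owned buffer.
    public void ReadRow(int index, float[] destination)
        => _view.ReadArray(RowOffset(index), destination, 0, _dimensions);

    private long RowOffset(int index) => (long)index * _dimensions * sizeof(float);

    public void Dispose()
    {
        _view.Dispose();
        _file.Dispose();
    }
}
```

The per-row copy keeps the sketch simple but gives up the zero-copy property. A hot-path implementation would more likely acquire a raw pointer from the view's SafeMemoryMappedViewHandle and wrap it in a Span<float>, at the cost of unsafe code.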
Step 2: SIMD-Optimized Comparison Kernel
Cosine similarity between normalized vectors reduces to a dot product. A scalar loop forces the CPU to fetch, decode, and execute instructions sequentially. By leveraging System.Runtime.Intrinsics, we can process multiple floats per cycle. Four-way loop unrolling feeds the out-of-order execution pipeline with independent Fused-Multiply-Add (FMA) operations, maximizing throughput.
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class SimdSearchEngine
{
    public static unsafe float ComputeDotProduct(ReadOnlySpan<float> query, ReadOnlySpan<float> candidate)
    {
        if (query.Length != candidate.Length)
            throw new ArgumentException("Vectors must have the same length.");

        // Fma.MultiplyAdd needs FMA support, which is reported separately from AVX2.
        if (!Avx2.IsSupported || !Fma.IsSupported)
            return ComputeScalar(query, candidate);

        fixed (float* pQ = query)
        fixed (float* pC = candidate)
        {
            // Four independent accumulators avoid a single serial dependency chain.
            var acc0 = Vector256<float>.Zero;
            var acc1 = Vector256<float>.Zero;
            var acc2 = Vector256<float>.Zero;
            var acc3 = Vector256<float>.Zero;

            int i = 0;
            int limit = query.Length - 32;
            while (i <= limit)
            {
                acc0 = Fma.MultiplyAdd(Vector256.Load(pQ + i), Vector256.Load(pC + i), acc0);
                acc1 = Fma.MultiplyAdd(Vector256.Load(pQ + i + 8), Vector256.Load(pC + i + 8), acc1);
                acc2 = Fma.MultiplyAdd(Vector256.Load(pQ + i + 16), Vector256.Load(pC + i + 16), acc2);
                acc3 = Fma.MultiplyAdd(Vector256.Load(pQ + i + 24), Vector256.Load(pC + i + 24), acc3);
                i += 32;
            }

            // Horizontal sum across all eight lanes (Vector256.Sum requires .NET 7+).
            var combined = Vector256.Add(Vector256.Add(acc0, acc1), Vector256.Add(acc2, acc3));
            float result = Vector256.Sum(combined);

            // Handle remaining elements
            for (; i < query.Length; i++)
                result += pQ[i] * pC[i];

            return result;
        }
    }

    private static float ComputeScalar(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        float sum = 0f;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }
}
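A dot product kernel only scores one candidate, so a search still needs a scan that keeps the best K rows. The following is a minimal sketch of that scan over a flat buffer using a min-heap (PriorityQueue, available from .NET 6). The TopKSearch class is hypothetical, and it uses a plain scalar dot product so the block stands alone; the SIMD kernel would take its place in practice.

```csharp
using System;
using System.Collections.Generic;

public static class TopKSearch
{
    // Returns up to k (row index, score) pairs, best score first.
    public static List<(int Index, float Score)> Search(
        float[] flatData, int dimensions, ReadOnlySpan<float> query, int k)
    {
        // Min-heap keyed on score: the root is always the weakest kept candidate.
        var heap = new PriorityQueue<int, float>();
        int count = flatData.Length / dimensions;

        for (int row = 0; row < count; row++)
        {
            var candidate = new ReadOnlySpan<float>(flatData, row * dimensions, dimensions);
            float rowScore = Dot(query, candidate);

            if (heap.Count < k)
                heap.Enqueue(row, rowScore);
            else if (heap.TryPeek(out _, out float worst) && rowScore > worst)
                heap.EnqueueDequeue(row, rowScore); // evict the current weakest
        }

        var results = new List<(int Index, float Score)>(heap.Count);
        while (heap.TryDequeue(out int bestRow, out float bestScore))
            results.Add((bestRow, bestScore));
        results.Reverse(); // the heap drains weakest-first
        return results;
    }

    // Scalar stand-in for the SIMD kernel.
    private static float Dot(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        float sum = 0f;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }
}
```

The heap never holds more than K entries, so the scan's extra memory stays constant regardless of how many rows are stored.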
Step 3: Standard I/O Integration Layer
External APIs introduce serialization overhead and network latency. For AI agents, a Model Context Protocol (MCP) server running over standard input/output provides a low-latency communication channel with no network stack or extra dependencies. The host process reads newline-delimited JSON-RPC requests from stdin and writes responses to stdout, enabling direct integration with Claude Desktop, Cursor, or custom orchestration frameworks.
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public class McpProtocolBridge
{
    private readonly TensorBuffer _store;
    private readonly Stream _input;
    private readonly Stream _output;

    public McpProtocolBridge(TensorBuffer store, Stream input, Stream output)
    {
        _store = store;
        _input = input;
        _output = output;
    }

    public async Task RunAsync(CancellationToken ct)
    {
        using var reader = new StreamReader(_input);
        using var writer = new StreamWriter(_output) { AutoFlush = true };

        while (!ct.IsCancellationRequested)
        {
            // ReadLineAsync(CancellationToken) requires .NET 7+.
            var line = await reader.ReadLineAsync(ct);
            if (line is null) break;                        // stdin closed: stop instead of spinning
            if (string.IsNullOrWhiteSpace(line)) continue;  // skip blank lines

            var response = ProcessRequest(line);
            await writer.WriteLineAsync(response);
        }
    }

    private string ProcessRequest(string json)
    {
        // Parse JSON-RPC, route to search/add methods, return serialized result
        return "{\"jsonrpc\":\"2.0\",\"id\":1,\"result\":{\"status\":\"ok\"}}";
    }
}
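ProcessRequest above is a placeholder that always answers with id 1. As a rough sketch of what the parsing and routing step could look like with System.Text.Json, the following handler echoes the caller's id and returns a JSON-RPC error for unknown methods. The JsonRpcRouter class and the "ping" method are hypothetical and only illustrate the shape of the exchange.

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.Json;

public static class JsonRpcRouter
{
    public static string Handle(string json)
    {
        using var doc = JsonDocument.Parse(json); // throws JsonException on malformed input
        JsonElement root = doc.RootElement;
        var method = root.TryGetProperty("method", out var m) ? m.GetString() : null;

        using var buffer = new MemoryStream();
        using (var writer = new Utf8JsonWriter(buffer))
        {
            writer.WriteStartObject();
            writer.WriteString("jsonrpc", "2.0");

            // Echo the caller's id unchanged (it may be a number or a string).
            writer.WritePropertyName("id");
            if (root.TryGetProperty("id", out var id)) id.WriteTo(writer);
            else writer.WriteNullValue();

            if (method == "ping") // hypothetical method name
            {
                writer.WriteStartObject("result");
                writer.WriteString("status", "ok");
                writer.WriteEndObject();
            }
            else
            {
                writer.WriteStartObject("error");
                writer.WriteNumber("code", -32601); // JSON-RPC 2.0 "Method not found"
                writer.WriteString("message", "Method not found");
                writer.WriteEndObject();
            }

            writer.WriteEndObject();
        }

        return Encoding.UTF8.GetString(buffer.ToArray());
    }
}
```

In MCP itself, tool invocations arrive wrapped in a tools/call request with the tool name in the parameters, so a real bridge would likely route on that rather than on a bare method name. A production handler would also catch JsonException and answer with a parse error instead of letting it propagate.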
Architecture Rationale
- Flat allocation guarantees cache-line prefetching works efficiently. The CPU loads 64-byte blocks sequentially, minimizing L1/L2 cache misses.
- fixed pointers with Span<T> eliminate bounds checking in the hot path while maintaining memory safety guarantees at the API boundary.
- 4-way FMA unrolling matches the execution port layout of modern x86 cores. Independent accumulators prevent pipeline stalls caused by data dependencies.
- stdio MCP removes TCP handshake overhead, TLS negotiation, and HTTP framing. The agent and search engine share the same process lifecycle or communicate via pipe, reducing latency to microseconds.
Pitfall Guide
1. Jagged Array Storage
Explanation: Using float[][] creates thousands of small heap objects. The GC must track each reference, and memory fragmentation destroys spatial locality.
Fix: Allocate a single contiguous float[] or use MemoryMarshal.CreateSpan over a pre-allocated buffer. Access rows via offset arithmetic.
2. Scalar Dot Product Loops
Explanation: Traditional for loops process one float per iteration. Modern CPUs can execute multiple FMA instructions per cycle, but scalar code leaves execution ports idle.
Fix: Use System.Runtime.Intrinsics with loop unrolling. Always include a scalar fallback for architectures without AVX2/FMA support.
3. Ignoring Vector Normalization
Explanation: Cosine similarity requires unit vectors. If embeddings are not normalized, the dot product returns magnitude-weighted scores, breaking ranking logic.
Fix: Normalize vectors during ingestion or apply L2 normalization in the search kernel. Pre-normalization is faster for read-heavy workloads.
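A minimal sketch of ingestion-time L2 normalization is shown below. The VectorNormalizer name is hypothetical, and leaving zero vectors unchanged is an assumption about the desired behavior rather than something the article specifies.

```csharp
using System;

public static class VectorNormalizer
{
    // Scales the vector in place to unit L2 length; zero vectors are left unchanged.
    public static void NormalizeInPlace(Span<float> vector)
    {
        // Accumulate in double to limit rounding error over long vectors.
        double sumSquares = 0.0;
        for (int i = 0; i < vector.Length; i++)
            sumSquares += (double)vector[i] * vector[i];

        if (sumSquares == 0.0) return;

        float inverseNorm = (float)(1.0 / Math.Sqrt(sumSquares));
        for (int i = 0; i < vector.Length; i++)
            vector[i] *= inverseNorm;
    }
}
```

Once every stored vector and every query has passed through a step like this, the dot product and the cosine similarity are the same number, which is what lets the search kernel skip the division.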
4. Unsafe Pointer Leaks
Explanation: A fixed statement pins its target for as long as the block runs. Wrapping long-running work in one fixed block keeps that memory pinned throughout, which prevents GC compaction around it and can cause heap fragmentation.
Fix: Limit fixed scope to the tightest possible block. Prefer Span<T> and ref struct patterns that enforce stack-only lifetimes.
5. Blocking stdio MCP Streams
Explanation: Reading from stdin synchronously in a single thread blocks the entire process. AI agents expect asynchronous, non-blocking responses.
Fix: Use StreamReader.ReadLineAsync() with cancellation tokens. Implement backpressure by buffering responses if the agent consumes output slower than it's produced.
6. Over-Allocation in the Hot Path
Explanation: Creating new arrays, lists, or strings during search triggers GC allocations. Even small allocations compound under high concurrency.
Fix: Reuse buffers via ArrayPool<T> or pre-allocate result containers. Return ValueTask or ref struct results to avoid heap pressure.
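As an illustration of the pooling fix, the sketch below rents a scratch buffer for per-row scores and returns it in a finally block. The PooledScoring class is hypothetical, and the inline scalar dot product is a stand-in for the SIMD kernel.

```csharp
using System;
using System.Buffers;

public static class PooledScoring
{
    // Scores every row against the query using a rented scratch buffer and
    // returns the index of the best row, or -1 if there are no rows.
    public static int BestRow(float[] flatData, int dimensions, ReadOnlySpan<float> query)
    {
        int count = flatData.Length / dimensions;
        float[] rented = ArrayPool<float>.Shared.Rent(count); // may be longer than count
        try
        {
            // Only the first `count` slots are ours to use.
            Span<float> scores = rented.AsSpan(0, count);
            for (int row = 0; row < count; row++)
            {
                var candidate = new ReadOnlySpan<float>(flatData, row * dimensions, dimensions);
                float sum = 0f;
                for (int i = 0; i < dimensions; i++) sum += query[i] * candidate[i];
                scores[row] = sum;
            }

            int best = -1;
            float bestScore = float.NegativeInfinity;
            for (int row = 0; row < count; row++)
                if (scores[row] > bestScore) { bestScore = scores[row]; best = row; }
            return best;
        }
        finally
        {
            // Always hand the array back, even if scoring throws.
            ArrayPool<float>.Shared.Return(rented);
        }
    }
}
```

Rented arrays can be larger than requested and may contain stale data, which is why the sketch slices to the exact length and writes every slot before reading it.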
7. Hardware Capability Blindness
Explanation: Assuming AVX2 or FMA is always available causes PlatformNotSupportedException on older CPUs or certain cloud VMs.
Fix: Check Avx2.IsSupported and Fma.IsSupported at startup. Route to optimized or fallback kernels dynamically. Log capability detection for observability.
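One way to apply this fix is to pick the kernel once at startup and hand callers a delegate. The sketch below assumes .NET 7 or later for Vector256.Create(ReadOnlySpan<T>) and Vector256.Sum. The DotKernel delegate and KernelSelector class are hypothetical names, and the vector path is a simplified single-accumulator loop rather than the unrolled kernel from Step 2.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public delegate float DotKernel(ReadOnlySpan<float> a, ReadOnlySpan<float> b);

public static class KernelSelector
{
    // Chosen once at startup; callers invoke the delegate without re-checking.
    public static DotKernel Select()
    {
        if (Avx2.IsSupported && Fma.IsSupported) return DotAvx2Fma;
        return DotScalar;
    }

    public static float DotScalar(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        float sum = 0f;
        for (int i = 0; i < a.Length; i++) sum += a[i] * b[i];
        return sum;
    }

    // Must only be called when Avx2.IsSupported and Fma.IsSupported are both true.
    public static float DotAvx2Fma(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        var acc = Vector256<float>.Zero;
        int i = 0;
        for (; i <= a.Length - 8; i += 8)
        {
            // Each Create call reads the first eight floats of the slice.
            acc = Fma.MultiplyAdd(Vector256.Create(a.Slice(i)), Vector256.Create(b.Slice(i)), acc);
        }

        float sum = Vector256.Sum(acc);
        for (; i < a.Length; i++) sum += a[i] * b[i]; // scalar tail
        return sum;
    }
}
```

Logging which branch Select took at startup gives the observability the fix calls for, and it makes silent fallbacks on older cloud VMs visible in the first line of the log.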
Production Bundle
Action Checklist
- Validate embedding dimensions match your model (e.g., 1536 for text-embedding-3-small)
- Pre-normalize all vectors during ingestion to avoid runtime L2 calculations
- Implement hardware capability detection before initializing the SIMD kernel
- Use ArrayPool<float> for temporary query buffers to eliminate GC pressure
- Configure stdio MCP with async stream readers and cancellation token propagation
- Benchmark memory bandwidth utilization against your target hardware's theoretical limit
- Add structured logging for cache hit rates, SIMD fallback triggers, and search latency percentiles
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Local development / prototyping | In-memory flat buffer | Fastest iteration, zero infrastructure, instant restarts | $0 (CPU/RAM only) |
| High-throughput RAG pipeline | Memory-mapped file + SIMD kernel | Persists across restarts, handles >RAM datasets, maintains low latency | Moderate (disk I/O) |
| Multi-tenant cloud deployment | Containerized vector service | Isolation, scaling, managed backups, RBAC | High (cloud compute + licensing) |
Configuration Template
{
  "vectorStore": {
    "dimensions": 1536,
    "capacity": 100000,
    "storageMode": "InMemory",
    "normalization": "PreIngest",
    "simdFallback": true
  },
  "mcpBridge": {
    "transport": "Stdio",
    "maxConcurrentRequests": 4,
    "timeoutMs": 5000,
    "tools": ["add_vector", "search_vectors", "get_metadata"]
  },
  "performance": {
    "enableHardwareIntrinsics": true,
    "bufferPoolSize": 1024,
    "logLatencyPercentiles": [50, 95, 99]
  }
}
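The article does not show how this template is consumed. One possible sketch, using System.Text.Json and positional records, binds only the vectorStore section and validates it. The record and class names here are hypothetical, and sections the records do not model (mcpBridge, performance) are simply ignored by the deserializer.

```csharp
using System.Text.Json;

public sealed record VectorStoreOptions(
    int Dimensions, int Capacity, string StorageMode, string Normalization, bool SimdFallback);

public sealed record EngineOptions(VectorStoreOptions VectorStore);

public static class ConfigLoader
{
    // Case-insensitive matching lets camelCase JSON keys bind to PascalCase members.
    private static readonly JsonSerializerOptions Options =
        new() { PropertyNameCaseInsensitive = true };

    public static EngineOptions Parse(string json)
    {
        var parsed = JsonSerializer.Deserialize<EngineOptions>(json, Options);
        if (parsed?.VectorStore is null)
            throw new JsonException("Missing vectorStore section.");
        if (parsed.VectorStore.Dimensions <= 0 || parsed.VectorStore.Capacity <= 0)
            throw new JsonException("dimensions and capacity must be positive.");
        return parsed;
    }
}
```

Validating dimensions and capacity up front means a bad configuration fails at startup rather than as an out-of-range access during the first search.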
Quick Start Guide
- Initialize the buffer: Create a TensorBuffer with your target capacity and embedding dimensions. Ensure vectors are L2-normalized before insertion.
- Wire the search kernel: Call SimdSearchEngine.ComputeDotProduct with your query span and each candidate span. Track top-K results using a min-heap or sorted list.
- Attach the MCP bridge: Instantiate McpProtocolBridge with Console.OpenStandardInput() and Console.OpenStandardOutput(). Run RunAsync() in a background task.
- Test with an agent: Configure your AI client to point to the compiled executable. Verify that add_vector and search_vectors JSON-RPC calls return correctly formatted responses in under 10 ms.
