I Built a Neural Network Engine in C# That Runs in Your Browser - No ONNX Runtime, No JavaScript Bridge, No Native Binaries
Cross-Platform GPU Inference in C#: Transpiling CIL to Web Shaders for Client-Side AI
Current Situation Analysis
Running machine learning models directly in the browser has historically required choosing between fragmented toolchains. Developers typically rely on JavaScript-based runtimes like ONNX Runtime Web, precompiled WebAssembly binaries, or heavy native bridges. Each approach introduces trade-offs: JavaScript runtimes often lack direct GPU compute access, precompiled WASM modules bloat bundle sizes and complicate memory management, and native bridges break the sandboxed web environment.
The .NET ecosystem possesses mature GPU compute libraries capable of transpiling intermediate language (CIL) into hardware-specific shaders. However, browser support was long considered impractical. The primary obstacles were shader generation complexity, WebAssembly memory constraints, and the absence of a unified tensor API that could abstract across WebGL, WebGPU, and WASM SIMD without sacrificing performance.
This gap matters because client-side inference is no longer a niche requirement. Privacy regulations, latency sensitivity, and offline-first architectures demand that models run on the user's device. When inference happens server-side, data must traverse the network, introducing exfiltration risks and infrastructure costs. When it happens client-side, the challenge shifts to efficient resource utilization, cross-backend compatibility, and deterministic memory management.
Existing tooling reflects this friction. ONNX Runtime Web has documented limitations around WebGPU device sharing that complicate multi-model sessions. Browser GPU compute remains split across WGSL (WebGPU), GLSL (WebGL2), and WASM SIMD/threads, forcing developers to maintain separate kernel implementations or accept performance penalties. The result is a landscape where client-side AI is either locked into a single runtime, requires JavaScript interop layers, or demands native binary distribution.
WOW Moment: Key Findings
The breakthrough lies in treating C# as a universal compute language rather than a platform-specific one. By transpiling CIL into target-specific shaders at runtime, a single codebase can target six distinct execution environments without rewriting kernels or managing native binaries. The following comparison illustrates the architectural shift:
| Approach | Backend Coverage | Memory Overhead | Data Privacy | Deployment Complexity |
|---|---|---|---|---|
| JavaScript/ONNX Runtime Web | WebGPU, WebGL, WASM | High (JS heap + runtime) | Client-side (via JavaScript runtime) | Moderate (runtime bundling) |
| Precompiled WASM Binaries | WASM SIMD, Threads | Medium (static modules) | Client-side | High (build toolchain, memory limits) |
| C# CIL Transpilation | WebGPU, WebGL, WASM, CUDA, OpenCL, CPU | Low (direct GPU buffers) | Fully client-side | Low (single NuGet package) |
This finding matters because it decouples model logic from execution hardware. Developers no longer need to choose between browser compatibility and desktop performance. The same inference pipeline that runs on a mobile browser via WebGL can execute on a datacenter GPU via CUDA or OpenCL, with identical kernel code and memory semantics. It enables progressive enhancement: fallback to WASM SIMD when WebGPU is unavailable, scale to CUDA when deployed server-side, and maintain zero data exfiltration across all paths.
Core Solution
The architecture rests on three pillars: CIL-to-shader transpilation, a unified tensor memory model, and runtime backend resolution. Each component addresses a specific bottleneck in client-side ML deployment.
Step 1: CIL-to-Shader Transpilation Pipeline
Instead of shipping precompiled binaries or relying on JavaScript bridges, the engine intercepts .NET CIL instructions and translates them into target-specific shader code. The transpiler maps C# control flow, arithmetic operations, and memory access patterns to WGSL compute shaders, GLSL vertex/fragment shaders with transform feedback, or WASM SIMD functions.
A kernel written in C# uses a coordinate index and a tensor view to read/write data. The transpiler extracts dimension metadata, generates thread mapping logic, and emits the appropriate shader dialect. This eliminates the need to maintain separate kernel implementations for each backend.
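    // ActivationFactor is assumed to be a constant defined alongside the kernel.
    public static void ProcessFeatureMap(ComputeIndex index, FeatureBuffer<float> source, FeatureBuffer<float> destination)
    {
        // Each invocation handles one element; the index supplies its NCHW coordinates.
        int channel = index.Channel;
        int height = index.Height;
        int width = index.Width;
        int batch = index.Batch;
        // Read one value, scale it, and write it to the same position in the output.
        float value = source.Read(batch, channel, height, width);
        destination.Write(batch, channel, height, width, value * ActivationFactor);
    }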
The transpiler recognizes ComputeIndex as a thread coordinate and FeatureBuffer<T> as a contiguous GPU allocation. It generates WGSL @compute entry points, GLSL layout(location) bindings, or WASM function signatures depending on the active backend. Shape metadata travels with the buffer struct, removing the need to pass dimensions as separate scalar arguments.
Step 2: Unified Tensor Memory Model
GPU memory management in browsers is constrained by WebAssembly heap limits and WebGL texture size restrictions. The solution introduces a three-tier tensor abstraction that mirrors host-side and device-side semantics:
- Tensor<T>: Host-side descriptor providing zero-copy slicing, reshaping, and sub-tensor views. Does not own memory.
- DeviceTensor<T>: Disposable wrapper around a GPU allocation. Manages lifetime and provides async host-to-device/device-to-host transfers.
- KernelView<T>: Blittable struct passed directly into transpiled kernels. Contains inline shape descriptors and memory pointers.
This separation prevents accidental CPU-GPU syncs and ensures deterministic disposal. When a pipeline completes, DeviceTensor<T> releases the underlying GPU buffer immediately, avoiding memory leaks that commonly occur in long-running browser sessions.
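    // Resolve the best available backend, then load the model onto it.
    using var runtime = ComputeRuntime.Create(BackendPreference.Auto);
    using var pipeline = InferencePipeline.Load(runtime, "models/segmentation.onnx");
    // Upload the NCHW input; each using declaration releases its GPU buffer when the scope exits.
    using var inputBuffer = DeviceTensor<float>.FromHost(runtime, imageData, shape: [1, 3, 256, 256]);
    using var results = await pipeline.ExecuteAsync(inputBuffer);
    // Copy back to the host once, at the end of the inference cycle.
    var mask = results["output_mask"];
    await mask.CopyToHostAsync(pixelArray);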
The BackendPreference.Auto resolver checks for WebGPU support, falls back to WebGL2, then WASM SIMD, and finally CPU. This hierarchy ensures maximum compatibility without manual configuration.
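The fallback order described here can be sketched as a simple priority scan. This is a minimal illustration rather than the engine's actual resolver: BackendKind, BackendResolver, and the isAvailable probe are hypothetical names, and real capability detection (for example, requesting a WebGPU adapter) would sit behind the probe.

```csharp
using System;

// Hypothetical backend identifiers, listed in the fallback order from the text.
public enum BackendKind { WebGpu, WebGl2, WasmSimd, Cpu }

public static class BackendResolver
{
    private static readonly BackendKind[] Priority =
    {
        BackendKind.WebGpu, BackendKind.WebGl2, BackendKind.WasmSimd
    };

    // Returns the first backend the probe reports as available.
    // CPU is treated as always present, so the method never fails.
    public static BackendKind Resolve(Func<BackendKind, bool> isAvailable)
    {
        foreach (BackendKind candidate in Priority)
        {
            if (isAvailable(candidate))
                return candidate;
        }
        return BackendKind.Cpu;
    }
}
```

Keeping the probe as a delegate means the ordering logic can be unit tested without a browser.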
Step 3: Architecture Decisions and Rationale
Why transpile CIL instead of shipping WASM? Precompiled WASM requires a separate build pipeline, complicates debugging, and locks developers into a specific compiler version. CIL transpilation happens at runtime, allowing dynamic optimization, backend-specific code generation, and seamless updates via standard package managers.
Why embed shape metadata in the buffer struct? Passing dimensions as separate parameters increases kernel signature complexity and introduces synchronization overhead. Embedding D0..D3 fields directly in KernelView<T> allows the transpiler to generate bounds-checked access patterns without runtime parameter marshaling.
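To make the shape-embedding idea concrete, here is a rough sketch of a blittable descriptor with D0..D3 fields and a bounds-checked offset calculation. ShapeDescriptor and its members are hypothetical; the actual KernelView<T> layout is not shown in this article, and a row-major (NCHW-style) layout is assumed.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical shape block: four ints in sequential layout, so it can be
// copied to a GPU uniform or storage buffer without marshaling.
[StructLayout(LayoutKind.Sequential)]
public readonly struct ShapeDescriptor
{
    public readonly int D0, D1, D2, D3;

    public ShapeDescriptor(int d0, int d1, int d2, int d3)
    {
        D0 = d0; D1 = d1; D2 = d2; D3 = d3;
    }

    public int Length => D0 * D1 * D2 * D3;

    // Row-major offset with bounds checks; the unsigned casts reject
    // negative indices and too-large indices in a single comparison.
    public int Offset(int i0, int i1, int i2, int i3)
    {
        if ((uint)i0 >= (uint)D0 || (uint)i1 >= (uint)D1 ||
            (uint)i2 >= (uint)D2 || (uint)i3 >= (uint)D3)
            throw new IndexOutOfRangeException("Index lies outside the tensor shape.");
        return ((i0 * D1 + i1) * D2 + i2) * D3 + i3;
    }
}
```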
Why deterministic disposal? Browser GPU contexts have strict memory limits. Leaked buffers cause silent failures or context loss. Implementing IDisposable on DeviceTensor<T> ensures GPU resources are released immediately after inference, preventing heap fragmentation and enabling stable long-running sessions.
Pitfall Guide
1. Ignoring WebAssembly Memory Boundaries
Explanation: A WASM module's linear memory starts at a fixed initial size and can grow only up to its declared maximum. Large models or unbounded tensor allocations trigger out-of-memory exceptions that crash the worker thread.
Fix: Stream weights in chunks, use overlapping tile processing for large inputs, and grow memory (WebAssembly.Memory.grow()) only when necessary. Monitor heap usage with WebAssembly.Memory.buffer.byteLength.
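One way to stream weights in chunks is to read through a single pooled buffer and hand each slice to an upload callback, so peak heap usage stays near the chunk size. This is a sketch under assumptions: WeightStreamer and the upload delegate are hypothetical, and a browser build would more likely use ReadAsync on the response stream.

```csharp
using System;
using System.Buffers;
using System.IO;

public static class WeightStreamer
{
    // Reads the source in pieces of at most chunkBytes and passes each piece
    // to the upload callback. The buffer is reused, so the callback must not
    // hold on to the memory after it returns.
    public static long StreamInChunks(Stream source, int chunkBytes, Action<ReadOnlyMemory<byte>> upload)
    {
        if (chunkBytes <= 0)
            throw new ArgumentOutOfRangeException(nameof(chunkBytes));

        byte[] buffer = ArrayPool<byte>.Shared.Rent(chunkBytes);
        long total = 0;
        try
        {
            int read;
            while ((read = source.Read(buffer, 0, chunkBytes)) > 0)
            {
                upload(buffer.AsMemory(0, read));
                total += read;
            }
        }
        finally
        {
            ArrayPool<byte>.Shared.Return(buffer);
        }
        return total;
    }
}
```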
2. Synchronous GPU-CPU Data Transfers
Explanation: Blocking the main thread to read GPU results causes UI jank and violates browser performance budgets.
Fix: Use async transfer methods (CopyToHostAsync) and pipeline computation with rendering. Keep GPU-CPU syncs at the end of inference cycles, never inside kernel loops.
3. Hardcoding Tensor Dimensions in Kernels
Explanation: Embedding fixed sizes in kernel logic breaks dynamic batching and prevents model reuse across different input resolutions.
Fix: Pass shape descriptors via the tensor view struct. Use index arithmetic (idx % width, idx / (width * height)) to compute coordinates dynamically.
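The index arithmetic can be written out in full as follows. This sketch assumes a row-major NCHW layout; IndexMath and Unflatten are illustrative names, not part of the engine.

```csharp
public static class IndexMath
{
    // Splits a flat element index into (batch, channel, row, column)
    // for a row-major NCHW tensor, using only runtime shape values.
    public static (int Batch, int Channel, int Y, int X) Unflatten(
        int idx, int channels, int height, int width)
    {
        int x = idx % width;
        int y = (idx / width) % height;
        int channel = (idx / (width * height)) % channels;
        int batch = idx / (width * height * channels);
        return (batch, channel, y, x);
    }
}
```

Because every dimension arrives as a parameter, the same kernel body works for any batch size or input resolution.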
4. Assuming Uniform Backend Capabilities
Explanation: WebGL2 lacks compute shaders, WebGPU requires explicit adapter selection, and WASM SIMD is unavailable in older browser versions. Assuming feature parity causes runtime failures.
Fix: Implement capability detection before initialization. Provide fallback paths and validate shader compilation success before dispatching workloads.
5. Leaking Accelerator Resources
Explanation: Forgetting to dispose DeviceTensor<T> instances leaves GPU buffers allocated. Browsers eventually reject new allocations or lose the GPU context.
Fix: Wrap all GPU allocations in using statements. Implement a resource pool for frequently reused tensors, but ensure pool cleanup on session termination.
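A resource pool with guaranteed cleanup might look roughly like this. ResourcePool<T> is a hypothetical helper, and FakeGpuBuffer is a stand-in for DeviceTensor<T> so the sketch runs on its own. The pool is not thread-safe, which is assumed to be acceptable for a single-threaded browser runtime.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for a GPU allocation; it only records whether it was disposed.
public sealed class FakeGpuBuffer : IDisposable
{
    public bool IsDisposed { get; private set; }
    public void Dispose() => IsDisposed = true;
}

// Reuses idle resources and disposes everything it still holds on teardown.
public sealed class ResourcePool<T> : IDisposable where T : class, IDisposable
{
    private readonly Func<T> _factory;
    private readonly Stack<T> _idle = new Stack<T>();
    private bool _disposed;

    public ResourcePool(Func<T> factory) => _factory = factory;

    public T Rent()
    {
        if (_disposed) throw new ObjectDisposedException(nameof(ResourcePool<T>));
        return _idle.Count > 0 ? _idle.Pop() : _factory();
    }

    // Items returned after the pool is disposed are released immediately,
    // so late returns cannot leak.
    public void Return(T item)
    {
        if (_disposed) { item.Dispose(); return; }
        _idle.Push(item);
    }

    public void Dispose()
    {
        if (_disposed) return;
        _disposed = true;
        while (_idle.Count > 0) _idle.Pop().Dispose();
    }
}
```

Disposing the pool when the session ends covers idle buffers; buffers still rented at that point are released when they are returned.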
6. Overlooking Shader Compilation Caching
Explanation: Transpiling CIL to shaders on every page load adds 200-500ms latency. Browsers do not cache dynamically generated shader strings.
Fix: Serialize the transpiled shaders to IndexedDB or OPFS. Validate cache keys using a hash of the CIL method signature and target backend.
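A cache key along these lines could be derived as shown below. This is only a sketch: ShaderCacheKey is a hypothetical helper, the signature string format is assumed, and SHA256.HashData and Convert.ToHexString require .NET 5 or later.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class ShaderCacheKey
{
    // Hashes the backend name and method signature into a stable hex key.
    // The newline separator keeps ("ab", "c") distinct from ("a", "bc").
    public static string Compute(string methodSignature, string backend)
    {
        byte[] payload = Encoding.UTF8.GetBytes(backend + "\n" + methodSignature);
        return Convert.ToHexString(SHA256.HashData(payload));
    }
}
```

A signature alone does not change when the kernel body is edited, so including a hash of the method body or a package version in the key would help invalidate stale entries.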
7. Mishandling Floating-Point Precision
Explanation: WebGL and WebGPU default to 32-bit floats, but some models expect 16-bit precision for memory efficiency. Mixing precision causes silent numerical drift.
Fix: Explicitly declare tensor precision during allocation. Use f16 in WGSL (which requires the shader-f16 feature) or mediump precision qualifiers in GLSL ES when supported, and validate numerical consistency across backends with tolerance thresholds.
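A tolerance check of this kind can be prototyped on the host with System.Half (available from .NET 5). PrecisionCheck is a hypothetical helper, and suitable thresholds depend on the model; the code does not prescribe values.

```csharp
using System;

public static class PrecisionCheck
{
    // Simulates storing a value at 16-bit precision and reading it back.
    public static float RoundTripHalf(float value) => (float)(Half)value;

    // True when every pair differs by at most absTol + relTol * |expected|.
    // A NaN difference fails the comparison, so NaNs are never "close".
    public static bool AllClose(ReadOnlySpan<float> actual, ReadOnlySpan<float> expected,
                                float absTol, float relTol)
    {
        if (actual.Length != expected.Length) return false;
        for (int i = 0; i < actual.Length; i++)
        {
            float diff = MathF.Abs(actual[i] - expected[i]);
            if (!(diff <= absTol + relTol * MathF.Abs(expected[i])))
                return false;
        }
        return true;
    }
}
```

Running a reference input through two backends and comparing the outputs with AllClose gives an early signal when a precision mismatch starts to drift.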
Production Bundle
Action Checklist
- Verify backend capability detection before runtime initialization
- Implement async GPU-CPU transfer patterns to avoid main thread blocking
- Cache transpiled shaders in OPFS to eliminate cold-start latency
- Use using scopes for all DeviceTensor<T> allocations to prevent GPU memory leaks
- Stream large model weights in chunks to respect WASM heap limits
- Validate numerical precision consistency across WebGPU, WebGL, and WASM backends
- Implement tile-based processing for inputs exceeding backend texture size limits
- Add fallback hierarchy: WebGPU → WebGL2 → WASM SIMD → CPU
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Single-model PWA with offline support | WASM SIMD + OPFS caching | No GPU dependency, predictable memory, works on all modern browsers | Low (no GPU driver requirements) |
| Multi-model browser session with real-time rendering | WebGPU + async transfer pipeline | Direct compute access, texture-to-canvas zero-copy, concurrent dispatch | Medium (requires WebGPU-enabled browsers) |
| Server-side batch inference | CUDA/OpenCL + DeviceTensor<T> pooling | Maximum throughput, deterministic memory management, scales to multi-GPU | High (requires NVIDIA/AMD hardware) |
| Legacy browser support (2017+) | WebGL2 + transform feedback | Broad compatibility, no experimental flags, stable shader pipeline | Low (performance penalty vs WebGPU) |
Configuration Template
    public static class InferenceConfig
    {
        // Builds a runtime with an explicit fallback order, a persistent shader cache, and a memory cap.
        public static ComputeRuntime CreateOptimizedRuntime()
        {
            var builder = new RuntimeBuilder()
                .SetBackendPriority(Backend.WebGPU, Backend.WebGL2, Backend.WasmSimd, Backend.Cpu)
                .EnableShaderCache(CacheLocation.Opfs, "shader-cache-v1")
                .SetMemoryLimit(Megabytes: 512)
                .EnableAsyncTransfers(true);
            return builder.Build();
        }

        // Clamps the tile to the source size and uses a quarter of the model input as overlap.
        public static TensorShape CalculateTileDimensions(int sourceWidth, int sourceHeight, int modelInput)
        {
            int tileX = Math.Min(modelInput, sourceWidth);
            int tileY = Math.Min(modelInput, sourceHeight);
            int overlap = modelInput / 4;
            return new TensorShape(tileX, tileY, overlap);
        }
    }
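Given a tile size and overlap like those computed by CalculateTileDimensions, the tile positions along one axis could be derived as in the sketch below. Tiling and TileOrigins are hypothetical names; the article does not show how the engine places tiles. The last tile is aligned to the edge here so that every pixel is covered without padding.

```csharp
using System;
using System.Collections.Generic;

public static class Tiling
{
    // Returns the start offsets of overlapping tiles along one axis.
    // Consecutive tiles advance by (tileSize - overlap); the final tile is
    // placed flush with the end of the source.
    public static List<int> TileOrigins(int sourceLength, int tileSize, int overlap)
    {
        if (tileSize <= 0 || overlap < 0 || overlap >= tileSize)
            throw new ArgumentOutOfRangeException(nameof(overlap), "Require 0 <= overlap < tileSize.");

        var origins = new List<int>();
        if (sourceLength <= tileSize)
        {
            origins.Add(0); // a single tile covers the whole axis
            return origins;
        }

        int stride = tileSize - overlap;
        for (int start = 0; start + tileSize < sourceLength; start += stride)
            origins.Add(start);

        origins.Add(sourceLength - tileSize);
        return origins;
    }
}
```

Applying the same function to width and height and taking the cross product of the two lists yields the full tile grid.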
Quick Start Guide
- Initialize the runtime: Call InferenceConfig.CreateOptimizedRuntime() to auto-detect the best available backend and configure shader caching.
- Load the model: Use InferencePipeline.Load(runtime, "path/to/model.onnx") to parse the graph and allocate internal buffers.
- Prepare input data: Wrap pixel arrays in DeviceTensor<float>.FromHost(runtime, data, shape) to transfer data to GPU memory asynchronously.
- Execute inference: Call await pipeline.ExecuteAsync(inputBuffer) to dispatch the workload. Results return as disposable tensor maps.
- Render or process output: Use await result.CopyToHostAsync(destinationArray) only when CPU access is required, or bind GPU buffers directly to canvas/WebGL contexts for zero-copy rendering.
