Cross-Platform GPU Inference in C#: Transpiling CIL to Web Shaders for Client-Side AI

Current Situation Analysis

Running machine learning models directly in the browser has historically meant choosing among fragmented toolchains. Developers typically rely on JavaScript-based runtimes like ONNX Runtime Web, precompiled WebAssembly binaries, or heavy native bridges. Each approach introduces trade-offs: JavaScript runtimes reach the GPU only through backend-specific paths and add an interop layer, precompiled WASM modules bloat bundle sizes and complicate memory management, and native bridges break the sandboxed web environment.

The .NET ecosystem possesses mature GPU compute libraries capable of transpiling intermediate language (CIL) into hardware-specific shaders. However, browser support was long considered impractical. The primary obstacles were shader generation complexity, WebAssembly memory constraints, and the absence of a unified tensor API that could abstract across WebGL, WebGPU, and WASM SIMD without sacrificing performance.

This gap matters because client-side inference is no longer a niche requirement. Privacy regulations, latency sensitivity, and offline-first architectures demand that models run on the user's device. When inference happens server-side, data must traverse the network, introducing exfiltration risks and infrastructure costs. When it happens client-side, the challenge shifts to efficient resource utilization, cross-backend compatibility, and deterministic memory management.

The friction shows up in practice. ONNX Runtime Web has documented limitations around WebGPU device sharing that complicate multi-model sessions. Browser GPU compute remains split across WGSL (WebGPU), GLSL (WebGL2), and WASM SIMD/threads, forcing developers to maintain separate kernel implementations or accept performance penalties. The result is a landscape where client-side AI is either locked into a single runtime, requires JavaScript interop layers, or demands native binary distribution.

WOW Moment: Key Findings

The breakthrough lies in treating C# as a universal compute language rather than a platform-specific one. By transpiling CIL into target-specific shaders at runtime, a single codebase can target six distinct execution environments without rewriting kernels or managing native binaries. The following comparison illustrates the architectural shift:

Approach	Backend Coverage	Memory Overhead	Data Privacy	Deployment Complexity
JavaScript/ONNX Runtime Web	WebGPU, WebGL, WASM	High (JS heap + runtime)	Client-side	Moderate (runtime bundling)
Precompiled WASM Binaries	WASM SIMD, Threads	Medium (static modules)	Client-side	High (build toolchain, memory limits)
C# CIL Transpilation	WebGPU, WebGL, WASM, CUDA, OpenCL, CPU	Low (direct GPU buffers)	Fully client-side	Low (single NuGet package)

This finding matters because it decouples model logic from execution hardware. Developers no longer need to choose between browser compatibility and desktop performance. The same inference pipeline that runs on a mobile browser via WebGL can execute on a datacenter GPU via CUDA or OpenCL, with identical kernel code and memory semantics. It enables progressive enhancement: fall back to WASM SIMD when WebGPU is unavailable, scale to CUDA when deployed server-side, and keep user data on the device across every client-side path.

Core Solution

The architecture rests on three pillars: CIL-to-shader transpilation, a unified tensor memory model, and runtime backend resolution. Each component addresses a specific bottleneck in client-side ML deployment.

Step 1: CIL-to-Shader Transpilation Pipeline

Instead of shipping precompiled binaries or relying on JavaScript bridges, the engine intercepts .NET CIL instructions and translates them into target-specific shader code. The transpiler maps C# control flow, arithmetic operations, and memory access patterns to WGSL compute shaders, GLSL vertex/fragment shaders with transform feedback, or WASM SIMD functions.

A kernel written in C# uses a coordinate index and a tensor view to read/write data. The transpiler extracts dimension metadata, generates thread mapping logic, and emits the appropriate shader dialect. This eliminates the need to maintain separate kernel implementations for each backend.

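// ActivationFactor is assumed to be a constant defined alongside the kernel.
public static void ProcessFeatureMap(ComputeIndex index, FeatureBuffer<float> source, FeatureBuffer<float> destination)
{
    // Each invocation handles one element; its coordinates come from the thread index.
    int channel = index.Channel;
    int height = index.Height;
    int width = index.Width;
    int batch = index.Batch;

    // Read one element, scale it, and write it to the same position in the output.
    float value = source.Read(batch, channel, height, width);
    destination.Write(batch, channel, height, width, value * ActivationFactor);
}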

The transpiler recognizes ComputeIndex as a thread coordinate and FeatureBuffer<T> as a contiguous GPU allocation. It generates WGSL @compute entry points, GLSL layout(location) bindings, or WASM function signatures depending on the active backend. Shape metadata travels with the buffer struct, removing the need to pass dimensions as separate scalar arguments.
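
To make the transpilation target more concrete, the sketch below hand-writes the kind of WGSL a transpiler might emit for an element-wise scale kernel similar to ProcessFeatureMap. It is not the engine's actual output: WgslSketch, EmitScaleKernel, and the binding layout are assumptions made for illustration, and a real transpiler would derive this text from the method's CIL rather than from a template. It uses C# 11 raw string literals.

```csharp
using System.Globalization;

public static class WgslSketch
{
    // Hypothetical helper: returns WGSL source for destination[i] = source[i] * factor.
    public static string EmitScaleKernel(float factor, int workgroupSize = 64)
    {
        // Invariant culture keeps the decimal separator a '.' regardless of locale.
        string f = factor.ToString("0.0######", CultureInfo.InvariantCulture);

        return $$"""
            @group(0) @binding(0) var<storage, read> source: array<f32>;
            @group(0) @binding(1) var<storage, read_write> destination: array<f32>;

            @compute @workgroup_size({{workgroupSize}})
            fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
                let i = gid.x;
                // Guard threads that fall past the end of the buffer.
                if (i >= arrayLength(&source)) { return; }
                destination[i] = source[i] * {{f}};
            }
            """;
    }
}
```

The bounds guard matters because the dispatch size is rounded up to a multiple of the workgroup size, so some invocations have no element to process.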

Step 2: Unified Tensor Memory Model

GPU memory management in browsers is constrained by WebAssembly heap limits and WebGL texture size restrictions. The solution introduces a three-tier tensor abstraction that mirrors host-side and device-side semantics:

Tensor<T>: Host-side descriptor providing zero-copy slicing, reshaping, and sub-tensor views. Does not own memory.
DeviceTensor<T>: Disposable wrapper around a GPU allocation. Manages lifetime and provides async host-to-device/device-to-host transfers.
KernelView<T>: Blittable struct passed directly into transpiled kernels. Contains inline shape descriptors and memory pointers.

This separation prevents accidental CPU-GPU syncs and makes disposal deterministic. Disposing a DeviceTensor<T> releases the underlying GPU buffer immediately rather than waiting for garbage collection, avoiding the memory leaks that commonly occur in long-running browser sessions.

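// imageData and pixelArray are caller-supplied buffers (input pixels and output mask).
using var runtime = ComputeRuntime.Create(BackendPreference.Auto);
using var pipeline = InferencePipeline.Load(runtime, "models/segmentation.onnx");

// Upload the NCHW input; the using declaration frees the GPU buffer at scope exit.
using var inputBuffer = DeviceTensor<float>.FromHost(runtime, imageData, shape: [1, 3, 256, 256]);
using var results = await pipeline.ExecuteAsync(inputBuffer);

// Read back only the tensor that is needed on the CPU.
var mask = results["output_mask"];
await mask.CopyToHostAsync(pixelArray);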

The BackendPreference.Auto resolver checks for WebGPU support, falls back to WebGL2, then WASM SIMD, and finally CPU. This hierarchy ensures maximum compatibility without manual configuration.
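
The fallback order can be sketched as a pure function over whatever a capability probe reports. The Backend values mirror those used in the configuration template later in this article; BackendResolver and its Resolve method are hypothetical names, and the probing itself (for example via JavaScript interop) is deliberately left out.

```csharp
using System;
using System.Collections.Generic;

public enum Backend { WebGPU, WebGL2, WasmSimd, Cpu }

public static class BackendResolver
{
    // Priority order taken from the text: WebGPU, WebGL2, WASM SIMD, CPU.
    private static readonly Backend[] Priority =
        { Backend.WebGPU, Backend.WebGL2, Backend.WasmSimd, Backend.Cpu };

    // 'available' is whatever a capability probe reported; how that probe
    // works is outside this sketch.
    public static Backend Resolve(IReadOnlySet<Backend> available)
    {
        foreach (var candidate in Priority)
        {
            if (available.Contains(candidate))
                return candidate;
        }
        // CPU is treated as always present, so it is the final fallback.
        return Backend.Cpu;
    }
}
```

Keeping resolution separate from detection makes the ordering easy to unit test without a browser.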

Step 3: Architecture Decisions and Rationale

Why transpile CIL instead of shipping WASM? Precompiled WASM requires a separate build pipeline, complicates debugging, and locks developers into a specific compiler version. CIL transpilation happens at runtime, allowing dynamic optimization, backend-specific code generation, and seamless updates via standard package managers.

Why embed shape metadata in the buffer struct? Passing dimensions as separate parameters increases kernel signature complexity and introduces synchronization overhead. Embedding D0..D3 fields directly in KernelView<T> allows the transpiler to generate bounds-checked access patterns without runtime parameter marshaling.
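
As an illustration of what embedding D0..D3 could look like, here is a minimal blittable struct with a bounds-checked, row-major offset calculation. KernelViewSketch, BaseOffset, and OffsetOf are assumed names for this example; the real KernelView<T> layout is not shown in this article and may differ.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical layout: four dimensions plus a base offset, all blittable ints.
[StructLayout(LayoutKind.Sequential)]
public readonly struct KernelViewSketch
{
    public readonly int D0, D1, D2, D3; // e.g. batch, channel, height, width
    public readonly int BaseOffset;     // element offset into the backing buffer

    public KernelViewSketch(int d0, int d1, int d2, int d3, int baseOffset = 0)
    {
        D0 = d0; D1 = d1; D2 = d2; D3 = d3;
        BaseOffset = baseOffset;
    }

    // Row-major flat index with a bounds check on every coordinate.
    // The unsigned comparison rejects negative indices in the same test.
    public int OffsetOf(int i0, int i1, int i2, int i3)
    {
        if ((uint)i0 >= (uint)D0 || (uint)i1 >= (uint)D1 ||
            (uint)i2 >= (uint)D2 || (uint)i3 >= (uint)D3)
            throw new IndexOutOfRangeException();

        return BaseOffset + ((i0 * D1 + i1) * D2 + i2) * D3 + i3;
    }
}
```

Because the struct holds only fixed-size integers, it can be copied into a uniform or storage buffer byte for byte, which is what makes the shape available inside the kernel without extra arguments.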

Why deterministic disposal? Browser GPU contexts have strict memory limits. Leaked buffers cause silent failures or context loss. Implementing IDisposable on DeviceTensor<T> ensures GPU resources are released immediately after inference, preventing heap fragmentation and enabling stable long-running sessions.
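
The disposal pattern itself is standard .NET. The sketch below uses a stand-in allocator (FakeDevice, a hypothetical type that only counts bytes) so it runs without a GPU; DeviceTensorSketch is likewise an illustrative name, not the library's DeviceTensor<T>.

```csharp
using System;

// Stand-in for a GPU allocator so the example runs without a device.
public sealed class FakeDevice
{
    public long LiveBytes { get; private set; }
    public void Allocate(long bytes) => LiveBytes += bytes;
    public void Free(long bytes) => LiveBytes -= bytes;
}

public sealed class DeviceTensorSketch : IDisposable
{
    private readonly FakeDevice _device;
    private readonly long _bytes;
    private bool _disposed;

    public DeviceTensorSketch(FakeDevice device, int elementCount)
    {
        _device = device;
        _bytes = (long)elementCount * sizeof(float);
        _device.Allocate(_bytes);
    }

    public void Dispose()
    {
        if (_disposed) return;   // Dispose must be safe to call more than once.
        _disposed = true;
        _device.Free(_bytes);    // Released at a known point, not at GC time.
    }
}
```

With a using statement or declaration, the buffer is returned as soon as the scope ends, even if inference throws.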

Pitfall Guide

1. Ignoring WebAssembly Memory Boundaries

Explanation: A WASM module starts with an initial linear memory size and can grow only up to its declared maximum. Large models or unbounded tensor allocations trigger out-of-memory errors that crash the worker thread. Fix: Stream weights in chunks, process large inputs as overlapping tiles, and grow memory (WebAssembly.Memory.prototype.grow() on the JavaScript side) only when necessary. Monitor heap usage by reading buffer.byteLength on the module's WebAssembly.Memory instance.
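
A minimal sketch of chunked streaming, assuming the weights arrive as a Stream and that some upload callback copies each chunk to the device. WeightStreaming and StreamInChunksAsync are hypothetical names; only standard .NET stream APIs are used.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class WeightStreaming
{
    // Reads 'source' in fixed-size chunks and hands each chunk to 'upload'.
    // The buffer is reused, so peak managed memory stays near chunkSize and
    // 'upload' must finish with the chunk before its task completes.
    public static async Task<long> StreamInChunksAsync(
        Stream source, int chunkSize, Func<ReadOnlyMemory<byte>, Task> upload)
    {
        var buffer = new byte[chunkSize];
        long total = 0;
        int read;
        while ((read = await source.ReadAsync(buffer.AsMemory(0, chunkSize))) > 0)
        {
            await upload(buffer.AsMemory(0, read));
            total += read;
        }
        return total;
    }
}
```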

2. Synchronous GPU-CPU Data Transfers

Explanation: Blocking the main thread to read GPU results causes UI jank and violates browser performance budgets. Fix: Use async transfer methods (CopyToHostAsync) and pipeline computation with rendering. Keep GPU-CPU syncs at the end of inference cycles, never inside kernel loops.

3. Hardcoding Tensor Dimensions in Kernels

Explanation: Embedding fixed sizes in kernel logic breaks dynamic batching and prevents model reuse across different input resolutions. Fix: Pass shape descriptors via the tensor view struct and compute coordinates from them at runtime, for example x = idx % width, y = (idx / width) % height, and plane = idx / (width * height).
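
The index arithmetic can be written out in full for a row-major NCHW layout. IndexMath and Unflatten are illustrative names; the shape values are ordinary parameters here, standing in for fields read from the tensor view.

```csharp
public static class IndexMath
{
    // Decomposes a flat row-major (NCHW) index using runtime shape values.
    public static (int n, int c, int h, int w) Unflatten(
        int idx, int channels, int height, int width)
    {
        int w = idx % width;                          // column within a row
        int h = (idx / width) % height;               // row within a plane
        int c = (idx / (width * height)) % channels;  // plane within a sample
        int n = idx / (width * height * channels);    // sample within the batch
        return (n, c, h, w);
    }
}
```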

4. Assuming Uniform Backend Capabilities

Explanation: WebGL2 lacks compute shaders, WebGPU requires requesting an adapter and device before any work can be dispatched, WASM SIMD is missing from older browsers, and WASM threads need cross-origin isolation. Assuming feature parity causes runtime failures. Fix: Implement capability detection before initialization. Provide fallback paths and validate shader compilation success before dispatching workloads.

5. Leaking Accelerator Resources

Explanation: Forgetting to dispose DeviceTensor<T> instances leaves GPU buffers allocated. Browsers eventually reject new allocations or lose the GPU context. Fix: Wrap all GPU allocations in using statements. Implement a resource pool for frequently reused tensors, but ensure pool cleanup on session termination.

6. Overlooking Shader Compilation Caching

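Explanation: Transpiling CIL to shaders on every page load adds 200-500ms latency, and browsers do not cache dynamically generated shader strings. Fix: Persist the transpiled shader source to IndexedDB or OPFS so that later loads skip transpilation; WebGL and WebGPU do not expose compiled shader binaries to the page, so the generated text is what can be stored. Key each entry by a hash of the kernel's CIL and the target backend so that a changed method body invalidates the cache.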
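
One way to build such a cache key is to hash the method's IL bytes together with its name and the backend. This sketch uses standard reflection and hashing APIs; ShaderCacheKey is a hypothetical helper, and it assumes method bodies are still available at runtime, which may not hold under trimming or AOT compilation (the sketch then falls back to hashing the name and backend only).

```csharp
using System;
using System.Reflection;
using System.Security.Cryptography;
using System.Text;

public static class ShaderCacheKey
{
    // Builds a key from the method's IL bytes plus the backend name, so a
    // changed kernel body or a different target produces a different key.
    public static string For(MethodInfo kernel, string backend)
    {
        // GetMethodBody can return null (abstract methods, stripped IL).
        byte[] il = kernel.GetMethodBody()?.GetILAsByteArray()
                    ?? Array.Empty<byte>();
        byte[] tag = Encoding.UTF8.GetBytes(
            $"{kernel.DeclaringType?.FullName}.{kernel.Name}|{backend}|");

        byte[] payload = new byte[tag.Length + il.Length];
        tag.CopyTo(payload, 0);
        il.CopyTo(payload, tag.Length);

        return Convert.ToHexString(SHA256.HashData(payload));
    }
}
```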

7. Mishandling Floating-Point Precision

Explanation: WebGL and WebGPU shaders compute in 32-bit floats by default, but some models expect 16-bit precision for memory efficiency. Mixing precision causes silent numerical drift. Fix: Explicitly declare tensor precision during allocation. Use f16 in WGSL when the adapter supports the shader-f16 feature and precision qualifiers such as mediump in GLSL ES, and validate numerical consistency across backends with tolerance thresholds.
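
A tolerance check of this kind might look like the following. PrecisionCheck, AllClose, and the default thresholds are assumptions for illustration and should be tuned per model; RoundTripHalf uses the built-in System.Half type to simulate storing a value in 16 bits.

```csharp
using System;

public static class PrecisionCheck
{
    // True when every pair of elements agrees within absTol + relTol * |expected|.
    public static bool AllClose(
        ReadOnlySpan<float> expected, ReadOnlySpan<float> actual,
        float absTol = 1e-5f, float relTol = 1e-3f)
    {
        if (expected.Length != actual.Length) return false;
        for (int i = 0; i < expected.Length; i++)
        {
            float limit = absTol + relTol * MathF.Abs(expected[i]);
            // The negated form also rejects NaN, which fails every comparison.
            if (!(MathF.Abs(expected[i] - actual[i]) <= limit)) return false;
        }
        return true;
    }

    // Rounds a value through 16-bit storage to simulate an f16 backend.
    public static float RoundTripHalf(float value) => (float)(Half)value;
}
```

Running the same input through two backends and comparing outputs with AllClose turns silent drift into a test failure.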

Production Bundle

Action Checklist

Verify backend capability detection before runtime initialization
Implement async GPU-CPU transfer patterns to avoid main thread blocking
Cache transpiled shaders in OPFS to eliminate cold-start latency
Use using scopes for all DeviceTensor<T> allocations to prevent GPU memory leaks
Stream large model weights in chunks to respect WASM heap limits
Validate numerical precision consistency across WebGPU, WebGL, and WASM backends
Implement tile-based processing for inputs exceeding backend texture size limits
Add fallback hierarchy: WebGPU → WebGL2 → WASM SIMD → CPU

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Single-model PWA with offline support	WASM SIMD + OPFS caching	No GPU dependency, predictable memory, works on all modern browsers	Low (no GPU driver requirements)
Multi-model browser session with real-time rendering	WebGPU + async transfer pipeline	Direct compute access, texture-to-canvas zero-copy, concurrent dispatch	Medium (requires WebGPU-enabled browsers)
Server-side batch inference	CUDA/OpenCL + DeviceTensor<T> pooling	Maximum throughput, deterministic memory management, scales to multi-GPU	High (requires NVIDIA/AMD hardware)
Legacy browser support (2017+)	WebGL2 + transform feedback	Broad compatibility, no experimental flags, stable shader pipeline	Low (performance penalty vs WebGPU)

Configuration Template

public static class InferenceConfig
{
    public static ComputeRuntime CreateOptimizedRuntime()
    {
        var builder = new RuntimeBuilder()
            .SetBackendPriority(Backend.WebGPU, Backend.WebGL2, Backend.WasmSimd, Backend.Cpu)
            .EnableShaderCache(CacheLocation.Opfs, "shader-cache-v1")
            .SetMemoryLimit(Megabytes: 512)
            .EnableAsyncTransfers(true);

        return builder.Build();
    }

    public static TensorShape CalculateTileDimensions(int sourceWidth, int sourceHeight, int modelInput)
    {
        int tileX = Math.Min(modelInput, sourceWidth);
        int tileY = Math.Min(modelInput, sourceHeight);
        int overlap = modelInput / 4;
        return new TensorShape(tileX, tileY, overlap);
    }
}
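
To show how the tile size and overlap above might be used, the sketch below computes tile start offsets along one axis. Tiling and TileOrigins are hypothetical names, and pinning the last tile to the edge is one possible policy, not necessarily the engine's.

```csharp
using System;
using System.Collections.Generic;

public static class Tiling
{
    // Start offsets along one axis so tiles of 'tile' pixels cover 'length',
    // with neighbouring tiles sharing at least 'overlap' pixels.
    public static List<int> TileOrigins(int length, int tile, int overlap)
    {
        if (tile <= 0 || overlap < 0 || overlap >= tile)
            throw new ArgumentException("Require tile > 0 and 0 <= overlap < tile.");

        var origins = new List<int>();
        if (length <= tile) { origins.Add(0); return origins; }

        int stride = tile - overlap;
        for (int start = 0; start + tile < length; start += stride)
            origins.Add(start);

        // The final tile is pinned to the edge so no pixels are left uncovered.
        origins.Add(length - tile);
        return origins;
    }
}
```

Applying the same function to width and height gives the 2D grid of tiles; overlapping regions can then be blended or cropped when the outputs are stitched together.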

Quick Start Guide

1. Initialize the runtime: Call InferenceConfig.CreateOptimizedRuntime() to auto-detect the best available backend and configure shader caching.
2. Load the model: Use InferencePipeline.Load(runtime, "path/to/model.onnx") to parse the graph and allocate internal buffers.
3. Prepare input data: Wrap pixel arrays in DeviceTensor<float>.FromHost(runtime, data, shape) to upload the data to GPU memory.
4. Execute inference: Call await pipeline.ExecuteAsync(inputBuffer) to dispatch the workload. Results return as disposable tensor maps.
5. Render or process output: Use await result.CopyToHostAsync(destinationArray) only when CPU access is required, or bind GPU buffers directly to canvas/WebGL contexts for zero-copy rendering.
