Speech, search, and Stable Diffusion — calling HuggingFace from C#
Native AOT and Free-Threaded Python: Embedding ML Inference Directly in .NET
Current Situation Analysis
The integration of machine learning inference into .NET applications has historically forced developers into a compromise between performance, deployment complexity, and ecosystem access. Python dominates the ML landscape with libraries like Hugging Face transformers, diffusers, and whisper. However, bridging these capabilities into a .NET runtime introduces significant friction.
The industry typically resorts to one of three patterns, each carrying distinct technical debt:
- Model Conversion to ONNX: This approach requires converting PyTorch or TensorFlow models to the Open Neural Network Exchange format. While effective for standard architectures, conversion often fails for newer diffusion pipelines, complex attention mechanisms, or models with custom C++ extensions. The conversion process itself becomes a maintenance burden, requiring version synchronization between the Python training stack and the ONNX runtime.
- Microservice Architecture: Running Python inference in a separate process or container isolates the runtimes but introduces network latency, serialization overhead, and operational complexity. In high-throughput scenarios, the network hop and JSON marshaling can dominate the total request latency, negating the performance gains of optimized models.
- Existing Interop Libraries: Tools like `pythonnet` provide a bridge but lack support for .NET Native AOT compilation, preventing the creation of single-file, self-contained binaries. Furthermore, as CPython evolves toward free-threading (PEP 703), many interop layers struggle with the removal of the Global Interpreter Lock (GIL), leading to race conditions and memory safety issues in concurrent workloads.
These constraints leave .NET developers without a viable path to embed state-of-the-art ML models directly into native binaries while maintaining concurrency safety and deployment simplicity.
WOW Moment: Key Findings
Recent advancements in C API interop and dependency management enable a fourth pattern: In-Process Embedding with Native AOT and Free-Thread Support. This approach eliminates the network hop, supports single-binary deployment, and leverages PEP 703 for true parallelism in Python inference.
The following comparison highlights the operational advantages of embedding ML inference directly within the .NET process using a modern interop layer:
| Approach | Latency Overhead | Deployment Unit | Native AOT | Free-Thread (PEP 703) | Model Coverage |
|---|---|---|---|---|---|
| ONNX Runtime | Low | Single Binary | ✅ Yes | ✅ Yes | ⚠️ Limited (Conversion required) |
| Microservice | High (Network + Serial) | Container/Process | ❌ No | ✅ Yes | ✅ Full |
| pythonnet | Low | Shared Library | ❌ No | ⚠️ Experimental | ✅ Full |
| Embedded Interop | Low | Single Binary | ✅ Yes | ✅ Yes | ✅ Full |
Why this matters: The embedded approach delivers the model coverage of a microservice with the latency and deployment profile of ONNX, while adding support for Native AOT and free-threaded Python. This enables .NET applications to run complex inference workloads—such as Stable Diffusion or Whisper—inside a single, trimmed binary that scales efficiently across multiple cores without the GIL bottleneck.
Core Solution
The solution relies on a lightweight interop library that calls CPython via the C API, manages dependencies declaratively, and enforces isolation for concurrent execution. The architecture prioritizes three principles:
- Declarative Dependency Management: Python environments are provisioned automatically using `uv`, eliminating manual virtual environment setup and ensuring reproducible builds.
- Boundary Minimization: Large data structures (tensors, images, audio buffers) remain in the Python heap. Only structured metadata crosses the boundary via JSON, reducing marshaling overhead.
- Isolation by Default: Each inference call operates within an isolated namespace, preventing variable collisions and ensuring thread safety in free-threaded environments.
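The isolation principle can be sketched on the Python side with plain `exec` and a per-call namespace dictionary. `run_isolated` below is an illustrative helper, not the library's actual API: it shows why two calls that both define `result` cannot interfere.

```python
import json

def run_isolated(script: str, variables: dict) -> dict:
    """Execute a script in a fresh namespace; return the JSON-safe 'result'."""
    namespace = dict(variables)           # private namespace for this call only
    exec(script, namespace)               # no access to other calls' globals
    # Round-trip through JSON mirrors the metadata-only boundary
    return json.loads(json.dumps(namespace["result"]))

# Two calls with colliding variable names do not interfere
a = run_isolated("result = {'value': x * 2}", {"x": 21})
b = run_isolated("result = {'value': x + 1}", {"x": 1})
print(a["value"], b["value"])  # 42 2
```

Each call pays the small cost of building a fresh dictionary in exchange for eliminating an entire class of cross-call state bugs.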
Implementation Strategy
The implementation follows a builder pattern to configure the Python project, initialize the runtime, and execute inference scripts. The library handles the lifecycle of the Python interpreter and provides a safe handle for execution.
Example 1: Text Summarization
This example demonstrates embedding a transformer model for text summarization. The .NET side passes raw text, and the Python side returns a structured summary object.
```csharp
using DotNetPy;
using DotNetPy.Uv;

// Configure the Python environment with required dependencies
using var project = PythonProject.CreateBuilder()
    .WithProjectName("ml-summarizer")
    .WithPythonVersion("==3.13.*")
    .AddDependencies(
        "transformers==4.45.0",
        "torch>=2.4,<2.6",
        "sentencepiece")
    .Build();

await project.InitializeAsync();
var executor = project.GetExecutor();

// Load the model once during initialization
executor.Execute(@"
from transformers import pipeline
summarizer = pipeline('summarization', model='facebook/bart-large-cnn')
");

var inputText = @"
The .NET ecosystem has evolved significantly with the introduction of Native AOT.
Developers can now compile C# applications to native machine code, resulting in
smaller binaries and faster startup times. This advancement is particularly beneficial
for cloud-native applications and microservices where resource efficiency is critical.
";

// Execute inference and capture the result
using var result = executor.ExecuteAndCapture(@"
output = summarizer(input_text, max_length=130, min_length=30, do_sample=False)
summary_text = output[0]['summary_text']
result = {'summary': summary_text, 'length': len(summary_text)}
", new Dictionary<string, object?> { { "input_text", inputText } });

// Parse the JSON result in .NET
var summary = result!.GetString("summary");
var length = result.GetInt32("length");
Console.WriteLine($"Summary ({length} chars): {summary}");
```
Example 2: Audio Classification
This example shows how to handle audio data without crossing the boundary. The audio file is processed entirely within Python, and only the classification label and confidence score are returned.
```csharp
var executor = project.GetExecutor();

// Initialize the audio classification pipeline
executor.Execute(@"
from transformers import AutoProcessor, Wav2Vec2ForSequenceClassification
import torch

processor = AutoProcessor.from_pretrained('facebook/wav2vec2-base-960h')
model = Wav2Vec2ForSequenceClassification.from_pretrained('facebook/wav2vec2-base-960h')
");

var audioFilePath = "/path/to/audio.wav";

// Run classification; audio bytes stay in Python
using var classification = executor.ExecuteAndCapture(@"
import librosa

speech, sr = librosa.load(audio_file, sr=16000)
inputs = processor(speech, sampling_rate=sr, return_tensors='pt')
logits = model(**inputs).logits
predicted_class_id = logits.argmax(-1).item()
score = logits.softmax(-1).max().item()
result = {'label': model.config.id2label[predicted_class_id], 'score': score}
", new Dictionary<string, object?> { { "audio_file", audioFilePath } });

var label = classification!.GetString("label");
var score = classification.GetDouble("score");
Console.WriteLine($"Class: {label}, Confidence: {score:P2}");
```
Example 3: Image Captioning
For vision models, the pattern remains consistent. The image is loaded and processed in Python, and the generated caption is returned as metadata.
```csharp
executor.Execute(@"
from transformers import BlipForConditionalGeneration, BlipProcessor
import torch

processor = BlipProcessor.from_pretrained('Salesforce/blip-image-captioning-base')
model = BlipForConditionalGeneration.from_pretrained('Salesforce/blip-image-captioning-base')
");

var imagePath = "/path/to/image.jpg";

using var captionResult = executor.ExecuteAndCapture(@"
from PIL import Image

image = Image.open(image_path).convert('RGB')
inputs = processor(image, return_tensors='pt')
out = model.generate(**inputs)
caption = processor.decode(out[0], skip_special_tokens=True)
result = {'caption': caption}
", new Dictionary<string, object?> { { "image_path", imagePath } });

var caption = captionResult!.GetString("caption");
Console.WriteLine($"Caption: {caption}");
```
Architecture Decisions
- `uv` Integration: Using `uv` for dependency resolution ensures fast, deterministic provisioning of Python environments. This avoids the overhead of `pip` and guarantees that the exact versions specified in the builder are used.
- JSON Boundary: Returning results as JSON documents allows the .NET side to parse structured data efficiently using `System.Text.Json`. This avoids the complexity of marshaling complex Python objects and keeps the interop layer thin.
- Isolation Factory: The `Python.CreateIsolated()` method creates a new execution context with a unique namespace. This is critical for PEP 703 compatibility, as it prevents race conditions on shared globals when multiple threads execute Python code concurrently.
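A minimal sketch of what the JSON boundary carries, assuming results are reduced to plain primitives (strings, numbers, booleans) before crossing — the same payload shape that `System.Text.Json` parses on the .NET side:

```python
import json

# Shape of a result the interop layer would hand back: primitives only.
# Tensors or model outputs must be reduced first (e.g., .item(), str()).
result = {"summary": "Native AOT shrinks binaries.", "length": 28}
payload = json.dumps(result)

# The .NET side sees exactly this text and parses it into typed accessors
roundtrip = json.loads(payload)
print(roundtrip["length"])  # 28
```

Keeping the contract to JSON primitives means the interop layer never needs per-type marshaling logic, which is also what keeps it compatible with Native AOT trimming.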
Pitfall Guide
Integrating Python into .NET requires careful attention to runtime behavior and memory management. The following pitfalls are common in production environments:
Global Namespace Collisions
- Explanation: In non-isolated executors, variables defined in one call (e.g., `result`, `data`) persist in the `__main__` namespace. Concurrent calls can overwrite these variables, leading to incorrect results or crashes.
- Fix: Always use `Python.CreateIsolated()` for concurrent workloads. This ensures each executor has a private namespace, eliminating collisions.
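The fix can be reproduced in plain Python: each worker below executes its script against a private namespace dictionary instead of a shared `__main__`. `worker_isolated` is a hypothetical stand-in for what an isolated executor does per call:

```python
import threading

def worker_isolated(x, out, idx):
    ns = {}                                  # private namespace per call
    exec("result = x * x", {"x": x}, ns)     # 'result' lands in ns, not globals
    out[idx] = ns["result"]

# With a shared namespace, all four threads would race on the same
# 'result' key; with per-call namespaces each answer is preserved.
out = [None] * 4
threads = [threading.Thread(target=worker_isolated, args=(i, out, i))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(out)  # [0, 1, 4, 9]
```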
Serializing Large Payloads
- Explanation: Attempting to pass large tensors, images, or audio buffers across the .NET-Python boundary via JSON or byte arrays causes significant memory pressure and latency.
- Fix: Keep large data in the Python heap. Pass only file paths or references, and return structured metadata (e.g., labels, scores, file paths) across the boundary.
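A rough illustration of the cost difference, using a plain Python list as a stand-in for a tensor and a hypothetical file path as the reference that crosses instead:

```python
import json

# Anti-pattern: shipping a large buffer across the boundary as JSON text.
buffer = [0.125] * 1_000_000
heavy = json.dumps(buffer)            # megabytes of text, built eagerly

# Pattern: keep the buffer in the Python heap, pass only a reference
# and small metadata across the boundary.
light = json.dumps({"path": "/tmp/audio.wav", "samples": len(buffer)})

print(len(heavy) // len(light))       # heavy payload is thousands of times larger
```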
AOT Trimming Issues
- Explanation: Native AOT compilation may trim unused code, including reflection metadata or P/Invoke signatures required by the interop library. This can cause runtime failures in the published binary.
- Fix: Use `[DynamicDependency]` attributes or trimming configuration files to ensure the AOT compiler retains necessary symbols. Test the AOT build early in the development cycle.
Refcount Leaks in Free-Threaded Mode
- Explanation: PEP 703 changes the reference counting mechanism to a split structure. Interop libraries that do not handle this correctly may leak memory or crash due to race conditions on reference counts.
- Fix: Use a library version that explicitly supports PEP 703 and handles the split refcount layout. Ensure `SafeHandle` implementations correctly manage reference counts in `ReleaseHandle`.
ThreadPool Starvation
- Explanation: Long-running Python inference calls can block .NET ThreadPool threads, reducing the application's ability to handle concurrent requests.
- Fix: Offload inference calls to dedicated threads (e.g., `TaskCreationOptions.LongRunning`) rather than the default ThreadPool, so request-handling threads remain available. Consider using a custom scheduler for CPU-bound inference workloads.
Dependency Version Drift
- Explanation: Mismatched versions of Python packages can lead to import errors or runtime exceptions, especially when upgrading the interop library or Python version.
- Fix: Pin dependency versions in the builder configuration. Use `uv`'s lock file feature to ensure reproducible builds across environments.
Ignoring GIL Removal Implications
- Explanation: Developers accustomed to the GIL may assume Python code is thread-safe by default. With free-threading, shared state must be explicitly protected.
- Fix: Design Python scripts to be stateless or use thread-safe data structures. Rely on the isolation factory to prevent shared state issues.
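A minimal sketch of the explicit protection that free-threaded builds make mandatory rather than incidental: a shared counter guarded by a `threading.Lock`, which stays correct whether or not a GIL is present.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    global counter
    for _ in range(n):
        with lock:          # explicit protection; never rely on the GIL
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 — deterministic because of the lock
```

Under PEP 703 the unlocked version of this loop is a genuine data race; the same discipline applies to any module-level state shared across inference calls.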
Production Bundle
Action Checklist
- Isolate Executors: Use `Python.CreateIsolated()` for all concurrent inference calls to prevent namespace collisions.
- Verify AOT Compatibility: Test Native AOT builds early and use `[DynamicDependency]` to preserve interop symbols.
- Minimize Boundary Traffic: Ensure large data (tensors, images) stays in Python; return only JSON metadata.
- Pin Dependencies: Use exact version constraints in the builder to avoid drift and ensure reproducibility.
- Monitor Memory: Track Python heap usage and ensure `SafeHandle` implementations correctly release references.
- Test Free-Threaded Builds: Validate the application with `python3.13t` or `python3.14t` to ensure PEP 703 compatibility.
- Offload CPU Work: Use `Task.Run` or dedicated threads for inference to avoid ThreadPool starvation.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Low Latency, Single Binary | Embedded Interop | Eliminates network hop; supports AOT; minimal deployment footprint. | Low (No infra overhead) |
| Complex Model, No Conversion | Embedded Interop | Full model coverage without ONNX conversion; supports latest architectures. | Medium (Python runtime size) |
| High Concurrency, Free-Thread | Embedded Interop | PEP 703 support enables true parallelism; isolation prevents races. | Low (Efficient resource use) |
| Legacy System, No Python | Microservice | Isolates Python dependency; easier to manage in heterogeneous environments. | High (Infra + Latency) |
| Strict Security, No External | Embedded Interop | Runs entirely in-process; no data leaves the application boundary. | Low (Secure by design) |
Configuration Template
Use this template to configure a Python project with dependency management and isolation:
```csharp
using DotNetPy;
using DotNetPy.Uv;

public async Task<InferenceExecutor> CreateExecutorAsync()
{
    var project = PythonProject.CreateBuilder()
        .WithProjectName("production-inference")
        .WithPythonVersion("==3.13.*")
        .AddDependencies(
            "transformers==4.45.0",
            "torch>=2.4,<2.6")
        .Build();

    await project.InitializeAsync();

    // Return an isolated executor for thread-safe usage
    return Python.CreateIsolated();
}
```
Quick Start Guide
- Install the Package: `dotnet add package DotNetPy --version 0.6.0`
- Configure the Project: Use the builder pattern to specify Python version and dependencies.
- Initialize and Execute: Call `InitializeAsync()` to provision the environment, then use `Execute()` and `ExecuteAndCapture()` to run inference.
- Enable Isolation: For concurrent workloads, use `Python.CreateIsolated()` to create thread-safe executors.
- Publish as Native AOT: Build with `dotnet publish -c Release -r win-x64 /p:PublishAot=true` to generate a single binary.
