ation, streaming enumeration, and batch boundary enforcement.
Phase 1: Boundary Capture and Service Delegation
ASP.NET Core automatically binds a CancellationToken parameter on minimal API endpoints or MVC controller actions to HttpContext.RequestAborted, which is signaled when the client disconnects or the request is aborted. Treat this token as the source of truth for the request lifecycle. The endpoint handler should not run the AI workflow itself; it should pass the token to an application service that owns the workflow.
```csharp
public static class InferenceEndpoints
{
    public static void MapInferenceRoutes(this IEndpointRouteBuilder app)
    {
        app.MapPost("/api/v1/inference/chat", async (
            ChatRequest payload,
            IInferenceOrchestrator orchestrator,
            CancellationToken requestToken) =>
        {
            var result = await orchestrator.ExecuteAsync(payload, requestToken);
            return Results.Ok(result);
        });
    }
}
```
Architecture Rationale: Separating the HTTP boundary from the inference logic prevents endpoint handlers from becoming orchestration monoliths. The endpoint remains a thin transport layer, while the service layer manages retrieval, prompt construction, and model invocation. The token flows unchanged, preserving the cooperative contract.
Phase 2: Service Propagation and Pipeline Composition
The orchestrator must forward the token to every downstream dependency: retrieval services, embedding generators, and the inference client. Never substitute CancellationToken.None when delegating. Doing so breaks the chain and guarantees orphaned work.
```csharp
// Implements IInferenceOrchestrator so the endpoint and the DI registration can resolve it.
public sealed class InferenceOrchestrator(
    IQueryTransformer queryTransformer,
    IVectorStoreConnector vectorStore,
    IInferenceEngine inferenceEngine) : IInferenceOrchestrator
{
    public async Task<InferenceResult> ExecuteAsync(
        ChatRequest payload,
        CancellationToken lifecycleToken)
    {
        // Every awaited stage receives the caller's token.
        var transformedQuery = await queryTransformer
            .TransformAsync(payload.Query, lifecycleToken);
        var contextChunks = await vectorStore
            .RetrieveAsync(transformedQuery, payload.TopK, lifecycleToken);

        // Prompt assembly is synchronous, so it takes no token.
        var assembledPrompt = PromptBuilder.Compose(
            payload.Query, contextChunks, payload.SystemInstructions);

        var completion = await inferenceEngine
            .GenerateAsync(assembledPrompt, lifecycleToken);
        return new InferenceResult(completion.Text, completion.Usage);
    }
}
```
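Architecture Rationale: Each async call receives the same token, so cancellation reaches whichever stage is currently running. If the client disconnects during vector search, RetrieveAsync throws OperationCanceledException and the inference engine is never invoked, provided the vector store client honors the token. If the request is aborted during prompt assembly (for example, when a shutting-down host tears down the connection), the generation call fails fast on the already-cancelled token. From a lifecycle perspective the pipeline becomes a single cancellable unit, although cancellation does not roll back work that has already completed.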
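The short-circuit behavior described above can be checked without any web framework. The following is a minimal sketch, not the article's real services: PropagationDemo, its two stage methods, and the delays are hypothetical stand-ins, and it assumes each stage honors the token it receives.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class PropagationDemo
{
    // Hypothetical stand-in for a slow vector search.
    private static async Task<string> RetrieveAsync(List<string> log, CancellationToken token)
    {
        log.Add("retrieve:start");
        // Task.Delay throws TaskCanceledException (an OperationCanceledException) when the token fires.
        await Task.Delay(TimeSpan.FromSeconds(5), token);
        log.Add("retrieve:done");
        return "context";
    }

    // Hypothetical stand-in for the model call.
    private static async Task<string> GenerateAsync(string prompt, List<string> log, CancellationToken token)
    {
        log.Add("generate:start");
        await Task.Delay(TimeSpan.FromMilliseconds(10), token);
        return "completion for " + prompt;
    }

    // Runs both stages with one token and records which stages were entered.
    public static async Task<IReadOnlyList<string>> RunAsync(TimeSpan cancelAfter)
    {
        var log = new List<string>();
        // The timer-based source simulates a client disconnect partway through retrieval.
        using var cts = new CancellationTokenSource(cancelAfter);
        try
        {
            var context = await RetrieveAsync(log, cts.Token);
            await GenerateAsync(context, log, cts.Token);
        }
        catch (OperationCanceledException)
        {
            log.Add("cancelled");
        }
        return log;
    }
}
```

Calling PropagationDemo.RunAsync(TimeSpan.FromMilliseconds(50)) records the start of retrieval and the cancellation, and never records "generate:start", because the exception leaves the pipeline before the second stage is reached.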
Phase 3: Streaming Enumeration and Write Loop Synchronization
Streaming responses introduce a unique risk: the backend continues producing tokens while the frontend has already abandoned the connection. IAsyncEnumerable requires explicit cancellation checks during enumeration. Additionally, response writes should respect the same token to fail fast when the underlying transport closes.
```csharp
public static class StreamingEndpoints
{
    public static void MapStreamingRoutes(this IEndpointRouteBuilder app)
    {
        app.MapGet("/api/v1/inference/stream", async (
            string query,
            IInferenceEngine inferenceEngine,
            HttpResponse transport,
            CancellationToken lifecycleToken) =>
        {
            transport.ContentType = "text/event-stream";
            var tokenStream = inferenceEngine
                .GenerateStreamAsync(query, lifecycleToken);
            await foreach (var chunk in tokenStream
                .WithCancellation(lifecycleToken))
            {
                if (string.IsNullOrWhiteSpace(chunk.Content))
                    continue;
                var formatted = $"data: {chunk.Content}\n\n";
                await transport.WriteAsync(formatted, lifecycleToken);
                await transport.Body.FlushAsync(lifecycleToken);
            }
        });
    }
}
```
Architecture Rationale: .WithCancellation() hands the token to the enumerator, which covers streams whose producing API does not accept a token parameter directly. Here the token is also passed to GenerateStreamAsync, so the generator and the enumeration site observe the same signal. Passing the token to WriteAsync and FlushAsync lets the response write fail promptly once the client disconnects, instead of buffering data for a dead socket. The enumeration loop becomes the primary cancellation checkpoint.
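On the producing side, an async iterator only sees the token passed to .WithCancellation() if one of its parameters is marked [EnumeratorCancellation]. The sketch below illustrates that wiring under simplified assumptions: StreamingDemo and its chunk format are hypothetical, and the "client disconnect" is simulated by cancelling after a fixed number of chunks.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

public static class StreamingDemo
{
    // Hypothetical generator that yields numbered chunks until it is cancelled.
    // [EnumeratorCancellation] routes the token from WithCancellation() into this
    // parameter even though the caller passes no argument here.
    public static async IAsyncEnumerable<string> GenerateStreamAsync(
        [EnumeratorCancellation] CancellationToken token = default)
    {
        for (var i = 0; ; i++)
        {
            await Task.Delay(TimeSpan.FromMilliseconds(5), token);
            yield return $"chunk-{i}";
        }
    }

    // Consumes the stream and cancels after `limit` chunks, as a disconnect would.
    // Returns the number of chunks received before the stream stopped.
    public static async Task<int> ConsumeAsync(int limit)
    {
        using var cts = new CancellationTokenSource();
        var received = 0;
        try
        {
            await foreach (var chunk in GenerateStreamAsync().WithCancellation(cts.Token))
            {
                received++;
                if (received == limit)
                    cts.Cancel();
            }
        }
        catch (OperationCanceledException)
        {
            // Expected: the generator observed the token at its next await.
        }
        return received;
    }
}
```

Without the attribute, the generator in this sketch would never see the token and the loop would keep producing chunks after cancellation, which is the failure mode this phase is meant to prevent.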
Phase 4: Batch Boundary Enforcement for Ingestion
Embedding generation and vector indexing are typically batch-oriented. Processing thousands of chunks in a single loop without cancellation checks creates a shutdown bottleneck. Batch boundaries provide natural safe points to evaluate the token.
```csharp
public sealed class IngestionPipeline(
    IEmbeddingProvider embeddingProvider,
    IVectorIndexManager indexManager)
{
    public async Task ProcessBatchAsync(
        IEnumerable<RawDocument> documents,
        CancellationToken lifecycleToken)
    {
        const int batchSize = 64;
        var batches = documents.Chunk(batchSize);
        foreach (var batch in batches)
        {
            lifecycleToken.ThrowIfCancellationRequested();
            var texts = batch.Select(d => d.Content).ToArray();
            var vectors = await embeddingProvider
                .GenerateAsync(texts, lifecycleToken);
            await indexManager.UpsertAsync(batch, vectors, lifecycleToken);
        }
    }
}
```
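Architecture Rationale: ThrowIfCancellationRequested() at the loop boundary halts the pipeline before it initiates the next expensive API call. Because the same token is also passed to GenerateAsync and UpsertAsync, a batch that is already in flight can still be interrupted, for example after embedding but before the upsert, so the upsert should be idempotent if ingestion must resume cleanly. This pattern aligns with graceful shutdown semantics and bounds partially processed work to a single batch.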
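The boundary check can be exercised in isolation. The sketch below is a simplified, synchronous illustration rather than the ingestion pipeline itself: BatchDemo is a hypothetical name, the per-batch work is a placeholder, and the shutdown signal is simulated by cancelling after a chosen number of batches. It assumes .NET 6 or later for Enumerable.Chunk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

public static class BatchDemo
{
    // Processes items in fixed-size batches and checks the token before each batch.
    // Returns how many batches completed, which a caller could use as a resume point.
    public static int ProcessBatches(IEnumerable<int> items, int batchSize, int cancelAfterBatches)
    {
        using var cts = new CancellationTokenSource();
        var completed = 0;
        try
        {
            foreach (var batch in items.Chunk(batchSize))
            {
                // Safe point: no batch work has started yet.
                cts.Token.ThrowIfCancellationRequested();

                _ = batch.Sum(); // placeholder for embedding generation and upsert
                completed++;

                // Simulated shutdown signal arriving while the loop is running.
                if (completed == cancelAfterBatches)
                    cts.Cancel();
            }
        }
        catch (OperationCanceledException)
        {
            // The loop stopped at a batch boundary.
        }
        return completed;
    }
}
```

With 1,000 items and a batch size of 64 there are 16 batches in total; cancelling after the third leaves the remaining 13 untouched, and the returned count identifies where a later run could pick up.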
Pitfall Guide
1. The Token Drop at Service Boundaries
Explanation: Developers capture the token at the endpoint but forget to pass it when calling into domain services or infrastructure layers. The AI pipeline continues executing after the client disconnects.
Fix: Treat CancellationToken as a required method parameter across all async boundaries. Use static analysis (for example, the .NET analyzer rule CA2016, which flags calls that fail to forward an available token) or code reviews to enforce propagation.
2. Treating Cancellation as a Timeout Mechanism
Explanation: CancellationToken does not measure elapsed time; it signals external intent, such as a client disconnect or host shutdown. Conflating the lifecycle token with a CancelAfter() deadline leads to premature termination of valid long-running inferences and makes it hard to tell a timeout from a disconnect.
Fix: Enforce duration limits with a dedicated mechanism, such as Polly's timeout policy, and reserve the request CancellationToken for lifecycle and user-intent signaling.
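When a resilience library is not in play, one way to keep a duration limit separate from the lifecycle signal is a linked CancellationTokenSource. The sketch below is an illustration under that assumption; TimeoutDemo, Outcome, and RunWithTimeoutAsync are hypothetical names, and only standard library calls are used.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class TimeoutDemo
{
    public enum Outcome { Completed, TimedOut, CallerCancelled }

    // Runs `work` under a deadline without mutating or replacing the lifecycle token.
    public static async Task<Outcome> RunWithTimeoutAsync(
        Func<CancellationToken, Task> work,
        TimeSpan timeout,
        CancellationToken lifecycleToken)
    {
        // The linked source fires when either the caller cancels or the deadline passes.
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(lifecycleToken);
        linked.CancelAfter(timeout);
        try
        {
            await work(linked.Token);
            return Outcome.Completed;
        }
        catch (OperationCanceledException) when (lifecycleToken.IsCancellationRequested)
        {
            // The original token fired: client disconnect or shutdown.
            return Outcome.CallerCancelled;
        }
        catch (OperationCanceledException)
        {
            // Only the deadline fired.
            return Outcome.TimedOut;
        }
    }
}
```

Checking the original token in the exception filter is what keeps the two signals distinguishable, so a timeout can be logged or retried differently from a user abandoning the request.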
3. Streaming Without Enumeration Checks
Explanation: Consuming IAsyncEnumerable without .WithCancellation() or without passing the token to the underlying generator allows the loop to run until completion, regardless of client state.
Fix: Always chain .WithCancellation(lifecycleToken) on the enumeration site, and ensure the generator method accepts and respects the token.
4. Ignoring Batch Boundaries in Ingestion
Explanation: Processing embeddings or vector upserts in a tight loop without checkpointing cancellation forces the application to wait for the entire batch to finish during shutdown.
Fix: Insert ThrowIfCancellationRequested() at batch boundaries. Align batch sizes with API rate limits and memory constraints.
5. Overriding with CancellationToken.None
Explanation: Some developers pass CancellationToken.None to SDK methods to avoid OperationCanceledException noise, effectively disabling cancellation for that call.
Fix: Never override with None. Handle OperationCanceledException at the boundary layer and translate it to appropriate HTTP status codes (e.g., 499 or 204).
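The translation step can be sketched independently of the web framework. The example below is a simplified stand-in for a boundary layer such as middleware or an exception filter: BoundaryDemo and InvokeAsync are hypothetical names, and the integer return value stands in for the HTTP status code that would be set on the response. Note that 499 is a non-standard code popularized by nginx for "client closed request".

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BoundaryDemo
{
    // Non-standard status code used by nginx for "client closed request".
    public const int ClientClosedRequest = 499;

    // Invokes the handler with the request token and maps the outcome to a status code.
    public static async Task<int> InvokeAsync(
        Func<CancellationToken, Task> handler,
        CancellationToken requestAborted)
    {
        try
        {
            await handler(requestAborted);
            return 200;
        }
        catch (OperationCanceledException) when (requestAborted.IsCancellationRequested)
        {
            // Only cancellations caused by the request token are translated here;
            // any other OperationCanceledException propagates as a genuine failure.
            return ClientClosedRequest;
        }
    }
}
```

Filtering on the request token keeps expected disconnects out of the error logs while still surfacing cancellations that originate elsewhere, so there is no need to silence them by passing CancellationToken.None.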
6. Assuming Remote Providers Honor Local Tokens Instantly
Explanation: Sending a cancellation signal to a third-party LLM API does not guarantee immediate billing cessation or compute termination. Network latency and provider architecture dictate actual halt times.
Fix: Design for eventual consistency. Log cancellation events, track partial usage, and implement idempotent retry logic for critical workflows.
7. Blocking Threads with .Result or .Wait()
Explanation: Calling .Result or .Wait() on async AI SDK methods blocks a thread-pool thread for the duration of the call. Under load this starves the pool, delays the continuations that would observe cancellation, and can deadlock in environments that have a synchronization context.
Fix: Use await end to end. ConfigureAwait(false) does not make a blocking call safe; if a synchronous caller is truly unavoidable, isolate it at the outermost edge and accept the lifecycle trade-offs.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Real-time chat completion | Propagate token through single inference call | Prevents billing for disconnected sessions | High (direct API credit savings) |
| Streaming token delivery | Enumerate with .WithCancellation() + write loop sync | Stops transport buffering and generation | Medium-High (reduces compute & bandwidth) |
| RAG retrieval pipeline | Chain token through query rewrite, vector search, prompt assembly | Halts expensive retrieval before inference | Medium (reduces vector DB & embedding costs) |
| Batch embedding ingestion | Check token at batch boundaries (64-128 chunks) | Enables graceful shutdown without partial state | Low-Medium (optimizes deployment windows) |
| Background tool orchestration | Use CancellationTokenSource with timeout + lifecycle token | Prevents runaway agent loops | High (avoids infinite tool-call recursion) |
Configuration Template
```csharp
// Program.cs / Startup configuration
builder.Services.AddScoped<IInferenceOrchestrator, InferenceOrchestrator>();
builder.Services.AddScoped<IQueryTransformer, SemanticQueryTransformer>();
builder.Services.AddScoped<IVectorStoreConnector, MilvusVectorConnector>();
builder.Services.AddScoped<IInferenceEngine, OpenAIInferenceAdapter>();
builder.Services.AddScoped<IEmbeddingProvider, CohereEmbeddingProvider>();
builder.Services.AddScoped<IVectorIndexManager, WeaviateIndexManager>();

// Graceful shutdown integration
builder.Services.AddHostedService<IngestionBackgroundWorker>();

// Distributed tracing correlation
builder.Services.AddOpenTelemetry()
    .WithTracing(b => b
        .AddSource("AI.Pipeline.*")
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation());

// Endpoint registration
var app = builder.Build();
app.MapInferenceRoutes();
app.MapStreamingRoutes();
app.Run();
```
Quick Start Guide
- Identify Boundaries: Locate all HTTP endpoints, background workers, and message handlers that initiate AI workflows. Add CancellationToken parameters to their signatures.
- Wire Propagation: Update service interfaces to accept CancellationToken. Replace all CancellationToken.None usages with the passed token. Verify propagation through retrieval, embedding, and inference layers.
- Secure Streaming: Convert streaming endpoints to use IAsyncEnumerable. Append .WithCancellation(lifecycleToken) to the enumeration loop. Pass the token to WriteAsync and FlushAsync calls.
- Batch Safeguards: Refactor ingestion loops to process documents in fixed-size chunks. Insert ThrowIfCancellationRequested() at the start of each iteration. Align chunk sizes with provider rate limits.
- Validate Shutdown: Run integration tests that trigger IHostApplicationLifetime.StopApplication() during active inference. Confirm that operations halt within 5 seconds, traces reflect cancellation, and no orphaned API calls complete.