yered architecture: baseline metric collection, targeted trace capture, code-level instrumentation, and telemetry export. The following implementation uses native CLI tools, System.Diagnostics.DiagnosticSource, and OpenTelemetry for production-safe profiling.
Step 1: Establish Baseline Metrics
Before profiling, capture runtime health indicators. dotnet-counters provides real-time sampling without attaching a profiler.
dotnet-counters monitor --process-id <PID> --counters System.Runtime
Key counters to track: % Time in GC, Gen 0/1/2 GC Count, Allocated Bytes/Sec, ThreadPool Completed Work Items Count, Exception Count.
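When attaching the CLI is not an option, roughly the same indicators can be read from inside the process. The following is a minimal sketch, assuming .NET 5 or later; the `RuntimeBaseline` type and its `Snapshot` record are hypothetical names introduced here, and the values approximate the System.Runtime counters rather than reproduce them exactly.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of the original article): reads in-process
// values that roughly correspond to the System.Runtime counters listed above.
public static class RuntimeBaseline
{
    public readonly record struct Snapshot(
        int Gen0Collections,
        int Gen1Collections,
        int Gen2Collections,
        long TotalAllocatedBytes,
        double GcPauseTimePercentage,
        long CompletedWorkItems);

    // Captures one point-in-time reading; diff two snapshots to get rates.
    public static Snapshot Capture()
    {
        return new Snapshot(
            GC.CollectionCount(0),
            GC.CollectionCount(1),
            GC.CollectionCount(2),
            GC.GetTotalAllocatedBytes(precise: false),
            GC.GetGCMemoryInfo().PauseTimePercentage,
            ThreadPool.CompletedWorkItemCount);
    }
}
```

Taking one snapshot before and one after a load window and subtracting the fields gives per-window deltas that can be compared across runs.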
Step 2: Capture Execution Traces
Use dotnet-trace for sampling-based CPU profiling. Sampling is production-safe and avoids the overhead of method-level tracing.
dotnet-trace collect --process-id <PID> --providers Microsoft-Windows-DotNETRuntime:0x14C80000:4 --output profile.nettrace
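The keyword mask 0x14C80000 selects which runtime event categories are recorded, and the trailing 4 sets the Informational level. CPU stack samples themselves are emitted by the separate Microsoft-DotNETCore-SampleProfiler provider, so include that provider if the trace needs sampled call stacks. Limit collection to 30–60 seconds during peak load to minimize disk I/O and memory buffering.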
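The `--providers` argument follows a `Name:Keywords:Level` shape, with multiple providers separated by commas. As a small sketch of that format only, the hypothetical `ProviderSpec` helper below builds the string from its parts; it does not validate that a keyword mask is meaningful for a given provider.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: composes the text passed to dotnet-trace --providers.
public static class ProviderSpec
{
    // Formats one provider as Name:0xKEYWORDS:Level.
    public static string Format(string name, ulong keywords, int level) =>
        $"{name}:0x{keywords:X}:{level}";

    // Joins several provider specs with commas for a single --providers argument.
    public static string Join(IEnumerable<string> specs) => string.Join(",", specs);
}
```

For example, `ProviderSpec.Format("Microsoft-Windows-DotNETRuntime", 0x14C80000, 4)` yields `Microsoft-Windows-DotNETRuntime:0x14C80000:4`, the value used in the command above.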
Step 3: Instrument Critical Paths
For business-logic hot paths, integrate DiagnosticSource and Activity to create distributed traces that correlate with profiling data.
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class ProfilingInstrumentation
{
    private static readonly ActivitySource Source = new("Compass.PerfProfile");

    // Returns Task<byte[]> so the result of ExecuteCoreAsync reaches the caller.
    public static async Task<byte[]> ProcessPayloadAsync(byte[] payload, CancellationToken ct)
    {
        // StartActivity returns null when no listener samples this source, hence the ?. calls.
        using var activity = Source.StartActivity("ProcessPayload");
        activity?.SetTag("payload.size", payload.Length);
        var sw = Stopwatch.StartNew();
        try
        {
            // Hot path execution
            var result = await ExecuteCoreAsync(payload, ct);
            activity?.SetTag("operation.status", "success");
            return result;
        }
        catch (Exception ex)
        {
            activity?.SetTag("operation.status", "failure");
            activity?.SetTag("error.message", ex.Message);
            throw;
        }
        finally
        {
            // Recorded on both the success and failure paths.
            activity?.SetTag("duration.ms", sw.ElapsedMilliseconds);
        }
    }

    private static async Task<byte[]> ExecuteCoreAsync(byte[] payload, CancellationToken ct)
    {
        // Simulated work
        await Task.Delay(50, ct);
        return payload;
    }
}
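An ActivitySource produces nothing until something listens to it: `StartActivity` returns null when no listener samples the source. For local verification without a telemetry backend, an in-process `ActivityListener` can collect finished activities. This is a minimal sketch; `ActivityCapture` and its `Listen` method are hypothetical names, not part of the article's code.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical helper: collects stopped activities from one source for inspection.
public static class ActivityCapture
{
    // Registers a listener for sourceName and returns it; dispose it to stop listening.
    public static ActivityListener Listen(string sourceName, List<Activity> stopped)
    {
        var listener = new ActivityListener
        {
            // Only subscribe to the requested source.
            ShouldListenTo = source => source.Name == sourceName,
            // Ask for full data on every activity (no sampling).
            Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData,
            // Called when an activity is stopped or disposed.
            ActivityStopped = activity => { lock (stopped) { stopped.Add(activity); } }
        };
        ActivitySource.AddActivityListener(listener);
        return listener;
    }
}
```

Calling `ActivityCapture.Listen("Compass.PerfProfile", list)` before invoking the instrumented method lets a test assert on tags such as `payload.size` directly from the collected `Activity` objects.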
Step 4: Export to Telemetry Backend
Pipe diagnostic events to OpenTelemetry for centralized analysis. This decouples profiling from local CLI usage and enables historical trend tracking.
using System;
using OpenTelemetry;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

// Keep the provider alive for the lifetime of the app and dispose it on shutdown to flush pending spans.
var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("compass-api"))
    .AddSource("Compass.PerfProfile") // must match the ActivitySource name
    .AddOtlpExporter(options =>
    {
        options.Endpoint = new Uri("http://otel-collector:4317");
        options.Protocol = OpenTelemetry.Exporter.OtlpExportProtocol.Grpc;
    })
    .Build();
Architecture Decisions and Rationale
- Sampling over Tracing: Method-level tracing forces JIT to deoptimize and disables inlining. Sampling relies on OS timers and preserves native execution characteristics.
- Agentless Collection: CLI-based collection (dotnet-trace, dotnet-counters) requires no runtime modifications, making it safe for containerized and isolated environments.
- Async-Aware Instrumentation: Activity.Current flows across await boundaries with the execution context, so child activities keep the correct parent even when a continuation resumes on a different thread, without blocking the thread pool.
- Cloud-Native Export: OTLP export enables correlation with infrastructure metrics, allowing engineers to distinguish between application-level bottlenecks and platform constraints.
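The async-aware point can be checked directly: an activity started after an `await` should still report the activity that was current before the `await` as its parent. The sketch below assumes .NET 5 or later; the `AsyncFlowDemo` type and the `Demo.AsyncFlow` source name are hypothetical and exist only for this demonstration.

```csharp
using System.Diagnostics;
using System.Threading.Tasks;

public static class AsyncFlowDemo
{
    // Hypothetical source name used only for this demonstration.
    private static readonly ActivitySource Source = new("Demo.AsyncFlow");

    // Returns the inner activity's ParentId and the outer activity's Id.
    public static async Task<(string InnerParentId, string OuterId)> RunAsync()
    {
        // Without a listener StartActivity returns null, so register one that samples everything.
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == "Demo.AsyncFlow",
            Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData
        };
        ActivitySource.AddActivityListener(listener);

        using var outer = Source.StartActivity("Outer");
        await Task.Delay(10); // the continuation may resume on a different thread-pool thread
        using var inner = Source.StartActivity("Inner"); // parent is taken from Activity.Current
        return (inner?.ParentId, outer?.Id);
    }
}
```

If the two returned values match, the parent link survived the `await`, which is what lets a trace backend stitch the spans into one tree.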
Pitfall Guide
1. Profiling Debug or No-Inlining Builds
Debug builds disable JIT optimizations, disable tiered compilation, and alter memory layout. Results are non-representative of production. Always profile Release builds with PublishTrimmed and PublishSingleFile disabled during analysis to preserve symbol resolution.
2. Ignoring JIT Warmup and Tiered Compilation
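.NET uses tiered compilation: methods are first compiled with minimal optimization, then recompiled once they cross call-count thresholds. Profiling during the first 10–30 seconds of runtime therefore captures unoptimized code paths. Collect baselines after warmup. DOTNET_TieredPGO=1 does not shorten warmup; it enables dynamic profile-guided optimization for the optimized tier. For a controlled experiment, DOTNET_TieredCompilation=0 makes the JIT emit fully optimized code on first call, at the cost of slower startup.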
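One way to check whether a measurement window still overlaps JIT activity is to count how many methods the JIT compiled while measuring. The sketch below assumes .NET 6 or later, where `System.Runtime.JitInfo` is available; `WarmupMeasurement` is a hypothetical helper, and a zero count is only a hint that warmup was long enough, since background recompilation can still happen later.

```csharp
using System;
using System.Diagnostics;
using System.Runtime;

// Hypothetical helper: discards a warmup phase, then times a fixed number of iterations.
public static class WarmupMeasurement
{
    public static (TimeSpan Elapsed, int MethodsJitted) Measure(Action action, TimeSpan warmup, int iterations)
    {
        var clock = Stopwatch.StartNew();
        // Warmup phase: run at least once and until the warmup duration has passed; results are discarded.
        do { action(); } while (clock.Elapsed < warmup);

        int jitBefore = JitInfo.GetCompiledMethodCount();
        clock.Restart();
        for (int i = 0; i < iterations; i++) action();
        clock.Stop();

        // A non-zero count means methods were still being compiled during the measured window.
        return (clock.Elapsed, JitInfo.GetCompiledMethodCount() - jitBefore);
    }
}
```

If `MethodsJitted` is consistently above zero, lengthening the warmup before collecting a trace is a reasonable first adjustment.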
3. Misinterpreting Sampling Data as Exact Execution Time
Sampling records stack traces at fixed intervals (the interval is profiler-specific, typically on the order of milliseconds). A method appearing in 30% of samples does not mean it consumes exactly 30% of CPU time; it indicates proportional weight, with an error that shrinks as the number of samples grows. Use relative comparison across runs, not absolute time calculations.
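To make "proportional weight" concrete, the uncertainty of a sample share can be approximated with the standard binomial margin of error. This is a rough sketch that assumes independent samples, which real stack samples are not, so treat the result as an optimistic lower bound; `SamplingEstimate` is a hypothetical helper.

```csharp
using System;

public static class SamplingEstimate
{
    // Approximate 95% margin of error for an observed sample share,
    // using the normal approximation to the binomial distribution.
    public static double MarginOfError95(double share, int totalSamples)
    {
        if (totalSamples <= 0) throw new ArgumentOutOfRangeException(nameof(totalSamples));
        if (share < 0 || share > 1) throw new ArgumentOutOfRangeException(nameof(share));
        return 1.96 * Math.Sqrt(share * (1 - share) / totalSamples);
    }
}
```

With 3,000 samples, a method seen in 30% of them has a margin of about 1.6 percentage points; with only 300 samples the margin grows to about 5.2 points, which is why short traces should not be used to rank methods with similar weights.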
4. Profiling Without Isolating GC Pressure
High CPU usage in .NET often masks GC activity. GC Heap Size, Gen 2 Promotions, and LOH Allocations must be captured alongside CPU samples. Allocation-heavy code may show low CPU but trigger frequent blocking GC pauses, degrading p99 latency.
5. Instrumenting Hot Paths Synchronously
Creating Activity instances or logging inside tight loops introduces allocation and lock contention. Use Activity.Current?.IsAllDataRequested checks, batch telemetry, or leverage Counter/Histogram types from System.Diagnostics.Metrics for high-frequency metrics.
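For counts and size distributions on a hot path, instruments from System.Diagnostics.Metrics avoid creating one Activity per iteration. The following is a minimal sketch assuming .NET 6 or later; the meter name, instrument names, and the `HotPathMetrics` type are hypothetical.

```csharp
using System;
using System.Diagnostics.Metrics;

public static class HotPathMetrics
{
    // Meter and instrument names are hypothetical.
    private static readonly Meter AppMeter = new("Compass.PerfProfile.Metrics");
    private static readonly Counter<long> Processed =
        AppMeter.CreateCounter<long>("payloads.processed");
    private static readonly Histogram<double> PayloadBytes =
        AppMeter.CreateHistogram<double>("payload.size", unit: "By");

    // Safe to call in a tight loop: measurements are only forwarded to
    // listeners that have enabled these instruments.
    public static void RecordPayload(int sizeInBytes)
    {
        Processed.Add(1);
        PayloadBytes.Record(sizeInBytes);
    }
}
```

Aggregation happens in whatever listener or exporter subscribes to the meter, so the loop itself only pays for the `Add` and `Record` calls.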
6. Forgetting to Disable Profiling After Diagnosis
Continuous collection consumes disk space, increases memory pressure, and skews load test results. Implement feature flags or environment-based toggles to disable tracing in production after root cause resolution.
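An environment-based toggle can be as simple as deciding at startup whether to register a listener at all. The sketch below is one possible shape, not a prescribed one; the `COMPASS_PROFILING_ENABLED` variable name and the `ProfilingToggle` type are hypothetical.

```csharp
using System;
using System.Diagnostics;

public static class ProfilingToggle
{
    // Hypothetical variable name; use whatever your deployment tooling sets.
    public const string EnvVar = "COMPASS_PROFILING_ENABLED";

    // Treats "1" and "true" (any casing) as enabled; everything else, including null, as disabled.
    public static bool IsEnabled(string rawValue) =>
        rawValue == "1" || string.Equals(rawValue, "true", StringComparison.OrdinalIgnoreCase);

    // Returns a registered listener when profiling is enabled, otherwise null.
    // With no listener, StartActivity returns null and instrumented code does almost no work.
    public static ActivityListener TryEnable(string sourceName)
    {
        if (!IsEnabled(Environment.GetEnvironmentVariable(EnvVar))) return null;

        var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == sourceName,
            Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllData
        };
        ActivitySource.AddActivityListener(listener);
        return listener;
    }
}
```

Disposing the returned listener turns collection off again without redeploying, which fits the "disable after diagnosis" advice above.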
7. Treating Allocation Counts as the Sole Metric
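Raw allocation numbers are misleading. 10 MB of small, short-lived objects that die in Gen 0 is cheap; 10 MB that survives into Gen 2 makes later full collections more expensive, and a single 10 MB array skips Gen 0 entirely because objects of roughly 85,000 bytes or more go to the Large Object Heap. Focus on Allocation Rate, Gen 2 GC Frequency, and Pinned Object Count rather than total bytes allocated.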
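Where an allocation lands matters as much as its size. As a small illustration, assuming default GC settings, the hypothetical `GenerationProbe` below asks the runtime which generation a freshly allocated small array and a freshly allocated large array belong to; large objects are placed on the Large Object Heap, which the runtime reports as the oldest generation.

```csharp
using System;

public static class GenerationProbe
{
    // Returns the generation reported for a small and a large array right after allocation.
    public static (int Small, int Large) AllocateAndInspect()
    {
        var small = new byte[1_000];      // well under the large-object threshold
        var large = new byte[1_000_000];  // well over the threshold, allocated on the Large Object Heap
        return (GC.GetGeneration(small), GC.GetGeneration(large));
    }
}
```

The small array normally reports generation 0, while the large one reports `GC.MaxGeneration`, so it is only reclaimed by the expensive full collections that this pitfall warns about.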
Best Practices:
- Run multiple collection windows to account for variance.
- Correlate profiling data with infrastructure metrics (CPU throttling, network I/O, disk latency).
- Use dotnet-gcdump or PerfView for deep GC analysis when memory pressure is suspected.
- Validate findings with A/B load tests before refactoring.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Local debugging | Visual Studio Profiler / dotnet-trace (CPU Sampling) | Fast iteration, IDE integration, symbol resolution | Low (dev machine only) |
| Production latency spike | dotnet-trace sampling + dotnet-counters | Minimal overhead, captures real traffic patterns, safe for live nodes | Low (temporary collection) |
| Memory leak investigation | Allocation profiling + GC heap dump (dotnet-gcdump) | Object retention paths, Gen 2 promotion tracking, root object identification | Medium (requires dump analysis tooling) |
| Continuous optimization | OpenTelemetry + Continuous Profiling Agent | Long-tail latency detection, trend analysis, automated alerting | Medium-High (collector infrastructure) |
Configuration Template
appsettings.Profiling.json
{
  "OpenTelemetry": {
    "ServiceName": "compass-api",
    "Exporter": {
      "Endpoint": "http://otel-collector:4317",
      "Protocol": "Grpc",
      "BatchExportProcessor": {
        "MaxQueueSize": 2048,
        "ScheduledDelayMilliseconds": 5000,
        "ExporterTimeoutMilliseconds": 30000
      }
    },
    "Sampling": {
      "Probability": 0.1,
      "ParentBased": true
    }
  },
  "DiagnosticSource": {
    "Sources": [ "Compass.PerfProfile" ],
    "IsEnabled": true,
    "MaxActivityDepth": 12
  }
}
dotnet-trace-profile.json (Custom sampling profile)
{
  "providers": [
    {
      "name": "Microsoft-Windows-DotNETRuntime",
      "requestKey": "EventKey",
      "keywords": "0x14C80000",
      "logLevel": 4,
      "filterData": ""
    },
    {
      "name": "System.Runtime",
      "requestKey": "EventKey",
      "keywords": "0x0",
      "logLevel": 5,
      "filterData": ""
    }
  ]
}
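Usage: the --profile option of dotnet-trace takes the name of a built-in profile rather than a file path, so pass the providers from this file explicitly: dotnet-trace collect --process-id <PID> --providers Microsoft-Windows-DotNETRuntime:0x14C80000:4,System.Runtime:0x0:5 --output trace.nettrace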
Quick Start Guide
- Install CLI tools (dotnet tool install accepts one package per invocation): dotnet tool install -g dotnet-trace; dotnet tool install -g dotnet-counters; dotnet tool install -g dotnet-gcdump
- Identify target process: run dotnet-trace ps or dotnet-counters ps to list running .NET processes.
- Collect baseline: dotnet-counters monitor --process-id <PID> --counters System.Runtime --refresh-interval 5
- Capture trace: dotnet-trace collect --process-id <PID> --providers Microsoft-Windows-DotNETRuntime:0x14C80000:4 --duration 00:00:30 --output profile.nettrace
- Analyze: Open the .nettrace file in PerfView or the Visual Studio Diagnostics Tool. Filter by the System.Runtime and Microsoft-Windows-DotNETRuntime providers to isolate CPU and GC behavior.