The harder part: self-hosted inference
Running your own models means paying for GPU time whether you're using it or not. The number that matters: utilization rate.
A single H100 at $2-3/hour needs to be generating tokens >60% of the time to beat API pricing at scale. Below 30% utilization, you'd have been better off on an API. Most self-hosted deployments I see run at 15-25% because of traffic spikes and idle standby capacity.
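To see why utilization dominates, it helps to convert the hourly rate into a cost per busy hour. This is a minimal sketch, assuming a $2.50/hour midpoint of the $2-3 range above; the function name is illustrative, not from any library.

```python
def cost_per_busy_hour(hourly_rate: float, utilization: float) -> float:
    """Effective price of one hour in which the GPU is actually generating tokens."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # You pay for every wall-clock hour, but only the utilized fraction does work.
    return hourly_rate / utilization


HOURLY_RATE = 2.50  # assumption: midpoint of the $2-3/hour range

for utilization in (0.20, 0.30, 0.60):
    busy_cost = cost_per_busy_hour(HOURLY_RATE, utilization)
    print(f"{utilization:.0%} utilization: ${busy_cost:.2f} per busy hour")
```

Under that assumed rate, a productive hour at 20% utilization costs three times what it costs at 60%, which is the gap the API has to beat.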
The break-even math:
- API: pay per token, zero fixed cost
- Self-hosted: ~$2k/month per GPU all-in; the first ~2M tokens each month are effectively "paying off the fixed cost"
- Breakeven: ~4-5M tokens/month per GPU, assuming 60% utilization
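The comparison above reduces to one formula: fixed monthly cost divided by what you save per token by not paying the API. A rough sketch follows. The function and parameter names are hypothetical, and the marginal self-hosting cost parameter is an added assumption that defaults to zero.

```python
def breakeven_tokens_per_month(
    fixed_monthly_cost: float,
    api_price_per_million: float,
    self_host_marginal_per_million: float = 0.0,
) -> float:
    """Monthly token volume at which a self-hosted GPU costs the same as the API.

    fixed_monthly_cost: all-in dollars per GPU per month.
    api_price_per_million: blended API dollars per million tokens.
    self_host_marginal_per_million: assumed extra self-hosting dollars per
        million tokens (power, egress); zero if already in the fixed cost.
    """
    saving_per_million = api_price_per_million - self_host_marginal_per_million
    if saving_per_million <= 0:
        raise ValueError("self-hosting cannot break even at these prices")
    return fixed_monthly_cost / saving_per_million * 1_000_000
```

Plug in your own all-in GPU cost and the blended per-token rate you actually pay. The result is only as good as those two inputs, and it assumes the GPU can physically serve that volume at your utilization.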
The hidden constraint: hardware availability
This is the part most infrastructure analyses miss. In 2026, GPU lead times are still 12-18 months for new deployments. H200s are shipping but allocated. The secondary market for A100s is active, but prices haven't dropped as much as expected, because demand from inference workloads has replaced training demand.
What this means for your deployment plan: if you need GPUs in a quarter, you're renting. If you're renting, you can't amortize hardware cost. If you can't amortize, you're at the mercy of spot pricing, which has swung 40% in a single month twice this year already.
The one number to track
For any AI feature, track cost per completed interaction, not cost per token. Token counts hide the real metric: how many tokens does your average user interaction consume?
A chatbot using Claude Sonnet 4.6 ($3/M input, $15/M output) averaging 2,000 tokens per conversation with a typical 70/30 input/output split costs roughly $0.013 per conversation. At 100k conversations/month, that's $1,300, significant enough that a 10% improvement in token efficiency pays for an engineer's time.
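The arithmetic behind that figure, as a short sketch. The prices, token count and split come from the example above; the function name is illustrative.

```python
def cost_per_interaction(
    total_tokens: int,
    input_share: float,
    input_price_per_m: float,
    output_price_per_m: float,
) -> float:
    """Dollar cost of one interaction, given a token count and input/output split."""
    input_tokens = total_tokens * input_share
    output_tokens = total_tokens - input_tokens
    return (
        input_tokens * input_price_per_m + output_tokens * output_price_per_m
    ) / 1_000_000


# 2,000 tokens, 70% input, $3/M input, $15/M output
per_conversation = cost_per_interaction(2_000, 0.70, 3.0, 15.0)
monthly = per_conversation * 100_000

print(f"${per_conversation:.4f} per conversation, ${monthly:,.0f} per month")
# prints: $0.0132 per conversation, $1,320 per month
```

Output tokens are 30% of the volume but about two thirds of the bill, so trimming response length is usually the first place to look.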
Most teams don't know their average cost per interaction. That's the first number worth instrumenting; without it, you can't tell whether optimization matters or not.
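One way to instrument this is a small in-process tracker fed with the token counts your provider reports for each call. This is a sketch only: the class and method names are hypothetical, and it assumes that spend on abandoned interactions should still count against the completed ones.

```python
from dataclasses import dataclass


@dataclass
class InteractionCostTracker:
    input_price_per_m: float   # dollars per million input tokens
    output_price_per_m: float  # dollars per million output tokens
    total_cost: float = 0.0
    completed_count: int = 0

    def record(self, input_tokens: int, output_tokens: int, completed: bool = True) -> float:
        """Add one interaction's spend; return the cost of that interaction."""
        cost = (
            input_tokens * self.input_price_per_m
            + output_tokens * self.output_price_per_m
        ) / 1_000_000
        self.total_cost += cost  # abandoned interactions are still billed
        if completed:
            self.completed_count += 1
        return cost

    def cost_per_completed_interaction(self) -> float:
        """Total spend divided by the number of interactions that finished."""
        if self.completed_count == 0:
            raise ValueError("no completed interactions recorded yet")
        return self.total_cost / self.completed_count


tracker = InteractionCostTracker(input_price_per_m=3.0, output_price_per_m=15.0)
tracker.record(1_400, 600)                  # finished conversation
tracker.record(1_400, 600)                  # finished conversation
tracker.record(700, 300, completed=False)   # abandoned halfway, still paid for
print(f"${tracker.cost_per_completed_interaction():.4f} per completed interaction")
```

Dividing total spend by completed interactions, rather than by all interactions, makes abandoned sessions show up as a higher number instead of hiding in the average.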
Summary
- Measure your cache-hit ratio (30-50% is typical; anything below 20% means expensive redundant computation)
- Track cost per completed interaction, not per token
- Know your self-host breakeven point (~4-5M tokens/month per GPU)
- Assume 12-month lead times for hardware and plan accordingly
- Spot GPU pricing can swing 40% in a month; don't build on spot for production
The infrastructure layer is the part of AI most developers treat as someone else's problem. It isn't. The teams that understand their cost-per-interaction will build features that survive the margin compression that's coming.