st by ~87% while maintaining functional agent behavior.
- Capping context window growth prevents exponential cache write accumulation.
- The sweet spot lies in using max_tokens as the strict output upper bound, combined with a conservative prompt token estimate. The gate then never approves a call that could overshoot (no false negatives), so the budget ceiling holds as long as the prompt estimate errs high.
Core Solution
The architecture relies on four decoupled components: a running token tally, an externalized pricing configuration, a predictive budget gate, and a wired agent loop. Each component maintains strict separation of concerns to ensure maintainability and accurate cost accounting.
1. Token Accounting
Track the four distinct token types returned by Anthropic's API. Never merge them into a single counter.
from dataclasses import dataclass, field


@dataclass
class Usage:
    input_tokens: int = 0
    output_tokens: int = 0
    cache_read_input_tokens: int = 0
    cache_creation_input_tokens: int = 0

    def add(self, other: "Usage") -> None:
        self.input_tokens += other.input_tokens
        self.output_tokens += other.output_tokens
        self.cache_read_input_tokens += (
            other.cache_read_input_tokens
        )
        self.cache_creation_input_tokens += (
            other.cache_creation_input_tokens
        )
2. Pricing Configuration
Treat pricing multipliers as external configuration. Load them from environment variables or config files to survive provider rate updates.
@dataclass
class ModelRates:
    # USD per million tokens, one rate per billing category.
    input_per_m: float
    output_per_m: float
    cache_read_per_m: float
    cache_write_per_m: float


def usd_for(usage: Usage, rates: ModelRates) -> float:
    # Rates are quoted per million tokens, hence the final division.
    return (
        usage.input_tokens * rates.input_per_m
        + usage.output_tokens * rates.output_per_m
        + usage.cache_read_input_tokens
        * rates.cache_read_per_m
        + usage.cache_creation_input_tokens
        * rates.cache_write_per_m
    ) / 1_000_000
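One way to keep the rates out of agent code is to read them from the environment at startup. The sketch below is only an illustration: the AGENT_RATE_ prefix, the variable names, and the rates_from_env helper are assumptions, and the numbers are placeholders rather than real provider prices. ModelRates is repeated so the snippet runs on its own.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRates:
    # Same fields as the ModelRates dataclass above (USD per million tokens).
    input_per_m: float
    output_per_m: float
    cache_read_per_m: float
    cache_write_per_m: float


def rates_from_env(env=None, prefix: str = "AGENT_RATE_") -> ModelRates:
    """Build ModelRates from a mapping of environment variables.

    The variable names are hypothetical. A missing variable raises KeyError,
    so a misconfigured deployment fails at startup instead of silently
    pricing tokens with a stale default.
    """
    env = os.environ if env is None else env

    def read(name: str) -> float:
        return float(env[prefix + name])

    return ModelRates(
        input_per_m=read("INPUT_PER_M"),
        output_per_m=read("OUTPUT_PER_M"),
        cache_read_per_m=read("CACHE_READ_PER_M"),
        cache_write_per_m=read("CACHE_WRITE_PER_M"),
    )


# Placeholder values for illustration only; not real provider prices.
example_env = {
    "AGENT_RATE_INPUT_PER_M": "3.0",
    "AGENT_RATE_OUTPUT_PER_M": "15.0",
    "AGENT_RATE_CACHE_READ_PER_M": "0.3",
    "AGENT_RATE_CACHE_WRITE_PER_M": "3.75",
}
example_rates = rates_from_env(example_env)
```

Passing the mapping explicitly keeps the loader easy to test; in production the default of os.environ applies.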
3. Predictive Budget Gate
The gate performs a worst-case estimation before every API call. It compares the projected cost against remaining budget and raises an exception if the threshold would be breached.
class BudgetExceeded(Exception):
    pass


class TokenBudget:
    def __init__(
        self,
        ceiling_usd: float,
        rates: ModelRates,
    ):
        self.ceiling_usd = ceiling_usd
        self.rates = rates
        self.usage = Usage()  # running tally for this conversation

    def remaining_usd(self) -> float:
        return self.ceiling_usd - usd_for(
            self.usage, self.rates
        )

    def check(
        self,
        prompt_tokens_estimate: int,
        max_tokens: int,
    ) -> None:
        # Worst case: the full prompt estimate plus max_tokens of output.
        worst_case = Usage(
            input_tokens=prompt_tokens_estimate,
            output_tokens=max_tokens,
        )
        worst_usd = usd_for(worst_case, self.rates)
        if worst_usd > self.remaining_usd():
            raise BudgetExceeded(
                f"next call could cost "
                f"${worst_usd:.4f}, only "
                f"${self.remaining_usd():.4f} left"
            )

    def record(self, usage: Usage) -> None:
        # Call after each response with the usage the API reported.
        self.usage.add(usage)
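To make the gate's arithmetic concrete, here is a small worked example of the same worst-case comparison. The worst_case_usd helper and the rates are illustrative assumptions (placeholder numbers, not real prices); the formula mirrors what check computes from the prompt estimate and max_tokens.

```python
def worst_case_usd(
    prompt_tokens_estimate: int,
    max_tokens: int,
    input_per_m: float,
    output_per_m: float,
) -> float:
    """Upper bound on one call: full prompt estimate plus max_tokens of output."""
    return (
        prompt_tokens_estimate * input_per_m + max_tokens * output_per_m
    ) / 1_000_000


# Placeholder rates in USD per million tokens (illustrative, not real prices).
INPUT_PER_M = 3.0
OUTPUT_PER_M = 15.0

remaining_usd = 0.05  # what is left of the conversation budget

small_call = worst_case_usd(2_000, 1024, INPUT_PER_M, OUTPUT_PER_M)
large_call = worst_case_usd(20_000, 1024, INPUT_PER_M, OUTPUT_PER_M)

# The gate allows a call only if its worst case fits in the remaining budget.
print(f"small call: ${small_call:.5f}, allowed: {small_call <= remaining_usd}")
print(f"large call: ${large_call:.5f}, allowed: {large_call <= remaining_usd}")
```

With these placeholder numbers the 2,000-token prompt fits and the 20,000-token prompt is refused. Note that check prices the whole prompt at the fresh-input rate; if prompt caching is enabled, pricing the estimate at the higher of the input and cache-write rates would be the more conservative choice.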
4. Agent Loop Integration
Wire the gate into the execution loop. Estimate prompt tokens conservatively (len(serialised_messages) // 3 for JSON-encoded payloads) or use client.messages.count_tokens for precision. Always budget for max_tokens as the output upper bound.
import json
import anthropic

MODEL = "claude-sonnet-4-5"
MAX_STEPS = 10
PER_CONVO_USD = 0.50
PER_CALL_MAX_TOKENS = 1024

client = anthropic.Anthropic()


def run_agent(prompt, tools, dispatch, rates):
    # A fresh budget per conversation; never share it across sessions.
    budget = TokenBudget(PER_CONVO_USD, rates)
    messages = [{"role": "user", "content": prompt}]
    for step in range(1, MAX_STEPS + 1):
        # Gate first: raises BudgetExceeded before any request is sent.
        prompt_estimate = _rough_tokens(messages)
        budget.check(
            prompt_estimate,
            PER_CALL_MAX_TOKENS,
        )
        resp = client.messages.create(
            model=MODEL,
            max_tokens=PER_CALL_MAX_TOKENS,
            tools=tools,
            messages=messages,
        )
        # Record what the API actually billed, not the estimate.
        budget.record(_usage_from(resp))
        if resp.stop_reason == "end_turn":
            return _final_text(resp), budget.usage
        if resp.stop_reason !
Pitfall Guide
- Aggregating All Token Types into One Counter: Cache reads, cache writes, and fresh input carry distinct pricing multipliers. Summing them into a single tokens field breaks the USD math and causes systematic mispricing.
- Post-Call Budget Verification: Checking costs after the API response returns means the overspend has already been charged. The budget gate must execute before the network request is dispatched.
- Hardcoding Pricing Multipliers: Provider rates and cache pricing tiers change frequently. Embedding rates in agent logic creates brittle code that fails silently when pricing updates. Externalize to config.
- Underestimating Output Length: Assuming the model will return a short reply when max_tokens is set high guarantees budget miscalculation. Always calculate worst-case cost using the authorized max_tokens limit.
- Sharing Budget State Across Conversations: Token budgets are scoped to a single conversation. Leaking Usage or TokenBudget instances across sessions causes premature aborts or false cost accumulation.
- Ignoring Cache Write Costs: cache_creation_input_tokens are priced higher than fresh input. Failing to track and price cache writes separately leads to compounding financial drift in long-running agents.
- Optimistic Prompt Estimation: A prompt estimate that runs low lets the gate approve calls that overshoot the budget. Prefer a conservative over-estimate (e.g., len(serialised_messages) // 3), or accept the extra round trip of count_tokens when precision matters.
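The two estimation options in the last bullet can be combined in one helper: ask the API for a count, and fall back to the rough heuristic if that request fails. This is a sketch under assumptions: estimate_prompt_tokens is a hypothetical name, and the fallback policy (degrade to the heuristic on any error) is one possible choice, not something the text prescribes. It uses client.messages.count_tokens and reads input_tokens from its result.

```python
import json


def estimate_prompt_tokens(client, model, messages, tools=None) -> int:
    """Return a prompt token estimate for the budget gate.

    Tries the token-counting endpoint first; if that call fails, falls
    back to the len(json) // 3 heuristic over messages and tools.
    """
    kwargs = {"model": model, "messages": messages}
    if tools:
        kwargs["tools"] = tools  # only send tools when there are any
    try:
        return client.messages.count_tokens(**kwargs).input_tokens
    except Exception:
        # Assumption: any failure (network, rate limit) degrades to the
        # heuristic rather than blocking the agent.
        payload = json.dumps(
            {"messages": messages, "tools": tools or []},
            default=str,
        )
        return len(payload) // 3
```

In the loop, this would replace the call that produces prompt_estimate, at the price of one extra request per step.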
Deliverables
- Blueprint: Cost-Capped Agent Architecture (Python/Anthropic SDK). Decoupled token accounting, externalized pricing config, predictive gate class, and loop integration pattern.
- Checklist: Pre-Deployment Budget Gate Validation. Verify token field separation, confirm max_tokens bounds, validate config loading, test BudgetExceeded exception handling, and simulate worst-case cache write scenarios.
- Configuration Template: ModelRates JSON/YAML structure. Environment-ready pricing multipliers for input, output, cache read, and cache write tokens per million, with version pinning and fallback defaults.