d handlers. This prevents mid-sentence bugs and enables tool-use orchestration.
3. Token Metering: A TokenMeter utility calculates costs using provider-specific pricing tiers. Logging usage on every response ensures budget visibility from day one.
4. Schema Optimization Awareness: The client accepts a compactSchema flag that strips unnecessary whitespace and type hints from tool definitions, reducing input token bloat.
Implementation
// interfaces.ts
export interface MessagePayload {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

export interface LLMRequestConfig {
  model: string;
  messages: MessagePayload[];
  maxTokens: number;
  temperature?: number;
  stopSequences?: string[];
}

export interface LLMResponse {
  id: string;
  model: string;
  content: string;
  stopReason: 'end_turn' | 'max_tokens' | 'tool_use' | 'stop_sequence';
  usage: { inputTokens: number; outputTokens: number; };
}

export interface PricingTier {
  inputPerMillion: number;
  outputPerMillion: number;
}
// client.ts
import { LLMRequestConfig, LLMResponse, PricingTier } from './interfaces';

export class NeuralEndpointClient {
  private readonly endpoint: string;
  private readonly authHeader: string;
  private readonly pricing: PricingTier;

  constructor(endpoint: string, apiKey: string, pricing: PricingTier) {
    this.endpoint = endpoint;
    this.authHeader = `Bearer ${apiKey}`;
    this.pricing = pricing;
  }

  public async dispatch(config: LLMRequestConfig): Promise<LLMResponse> {
    // Translate the camelCase config into the snake_case wire payload.
    const payload = {
      model: config.model,
      messages: config.messages,
      max_tokens: config.maxTokens,
      temperature: config.temperature ?? 0.7,
      stop: config.stopSequences ?? null,
    };

    const response = await fetch(this.endpoint, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': this.authHeader,
      },
      body: JSON.stringify(payload),
    });

    if (!response.ok) {
      throw new Error(`LLM API failure: ${response.status} ${response.statusText}`);
    }

    const raw = await response.json();
    return this.parseWireFormat(raw);
  }

  private parseWireFormat(raw: any): LLMResponse {
    const choice = raw.choices?.[0];
    if (!choice) throw new Error('Invalid response structure: missing choices array');
    return {
      id: raw.id,
      model: raw.model,
      content: choice.message?.content ?? '',
      stopReason: this.normalizeStopReason(choice.finish_reason),
      usage: {
        inputTokens: raw.usage?.prompt_tokens ?? 0,
        outputTokens: raw.usage?.completion_tokens ?? 0,
      },
    };
  }

  // Assumes an OpenAI-compatible wire format, where finish_reason is reported as
  // 'stop', 'length', or 'tool_calls'. Values already in the normalized
  // vocabulary pass through; anything else is treated as a natural end of turn.
  private normalizeStopReason(reason: unknown): LLMResponse['stopReason'] {
    switch (reason) {
      case 'length':
      case 'max_tokens':
        return 'max_tokens';
      case 'tool_calls':
      case 'tool_use':
        return 'tool_use';
      case 'stop_sequence':
        return 'stop_sequence';
      default:
        return 'end_turn';
    }
  }

  public calculateCost(input: number, output: number): number {
    const inputCost = (input / 1_000_000) * this.pricing.inputPerMillion;
    const outputCost = (output / 1_000_000) * this.pricing.outputPerMillion;
    return inputCost + outputCost;
  }
}
// orchestrator.ts
import { LLMRequestConfig, MessagePayload } from './interfaces';
import { NeuralEndpointClient } from './client';

export class ConversationOrchestrator {
  private history: MessagePayload[] = [];
  private client: NeuralEndpointClient;
  // Stored for context window management; not yet enforced in this example.
  private readonly contextLimit: number;

  constructor(client: NeuralEndpointClient, contextLimit: number = 8192) {
    this.client = client;
    this.contextLimit = contextLimit;
  }

  public async processTurn(userInput: string, systemInstruction: string): Promise<string> {
    this.history.push({ role: 'user', content: userInput });

    // The API is stateless, so the full history is resent on every turn.
    const payload: LLMRequestConfig = {
      model: 'gpt-4o-mini',
      messages: [
        { role: 'system', content: systemInstruction },
        ...this.history,
      ],
      maxTokens: 512,
    };

    const result = await this.client.dispatch(payload);

    // Cost tracking & logging
    const turnCost = this.client.calculateCost(result.usage.inputTokens, result.usage.outputTokens);
    console.log(`[COST] Turn completed: $${turnCost.toFixed(4)} | Tokens: ${result.usage.inputTokens} in / ${result.usage.outputTokens} out`);

    // Stop reason routing
    switch (result.stopReason) {
      case 'max_tokens':
        console.warn('[TRUNCATION] Response hit hard limit. Consider increasing maxTokens or streaming.');
        break;
      case 'tool_use':
        console.info('[TOOL] Model requested external execution. Delegate to function router.');
        break;
      case 'stop_sequence':
        console.debug('[STOP] Matched custom termination string.');
        break;
      case 'end_turn':
      default:
        break;
    }

    this.history.push({ role: 'assistant', content: result.content });
    return result.content;
  }
}
Why This Architecture Works
- Explicit max_tokens handling: The client treats the limit as a hard boundary. Production systems should pair this with streaming or iterative continuation if full responses are required.
- State reconstruction visibility: The history array grows predictably. Teams can implement sliding windows or semantic summarization to cap input costs.
- Cost calculation isolation: By decoupling pricing logic, you can swap tiers, implement internal chargebacks, or trigger budget alerts without touching the HTTP layer.
- Stop reason as control flow: Instead of parsing text for completion signals, the API provides deterministic routing. This eliminates regex-based heuristics and reduces latency.
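To make the cost effect of resending history concrete, here is a rough projection, assuming every turn adds a fixed number of user and reply tokens. The function name `projectConversationCost` and its parameters are hypothetical, and real conversations will vary turn by turn.

```typescript
// Hypothetical helper: projects token usage for a conversation in which the
// full history is resent on every turn, as the orchestrator does.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

function projectConversationCost(
  turns: number,
  systemTokens: number,
  userTokensPerTurn: number,   // assumed constant per turn
  replyTokensPerTurn: number,  // assumed constant per turn
  pricing: Pricing,
): { inputTokens: number; outputTokens: number; cost: number } {
  let inputTokens = 0;
  let outputTokens = 0;
  let historyTokens = 0;

  for (let turn = 0; turn < turns; turn++) {
    historyTokens += userTokensPerTurn;           // new user message joins the history
    inputTokens += systemTokens + historyTokens;  // system prompt + full history are billed as input
    outputTokens += replyTokensPerTurn;
    historyTokens += replyTokensPerTurn;          // assistant reply joins the history
  }

  const cost =
    (inputTokens / 1_000_000) * pricing.inputPerMillion +
    (outputTokens / 1_000_000) * pricing.outputPerMillion;
  return { inputTokens, outputTokens, cost };
}

// Three turns, 100-token system prompt, 50-token questions, 150-token replies.
const projection = projectConversationCost(3, 100, 50, 150, {
  inputPerMillion: 0.15,
  outputPerMillion: 0.6,
});
console.log(projection.inputTokens, projection.outputTokens);
```

Under these assumptions the billed input grows from 150 tokens on the first turn to 550 on the third, which is why an uncapped history comes to dominate spend in long conversations.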
Pitfall Guide
1. Ignoring stop_reason Branching
Explanation: Developers often read only the content field and assume the response is complete. When max_tokens or stop_sequence triggers, the output is truncated. Shipping this causes silent data loss and broken UI states.
Fix: Always switch on stop_reason. Implement continuation logic for max_tokens (e.g., append a "continue" prompt) and delegate tool_use to a function router.
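A continuation loop along these lines is one way to handle the max_tokens case. This is a minimal sketch: `DispatchFn`, `completeWithContinuation`, the wording of the continue prompt, and the `maxRounds` cap are all assumptions made for illustration, and the types are redeclared locally so the block stands alone.

```typescript
// Hypothetical sketch of a max_tokens continuation loop.
type Role = 'system' | 'user' | 'assistant';
interface Msg { role: Role; content: string; }
interface Reply {
  content: string;
  stopReason: 'end_turn' | 'max_tokens' | 'tool_use' | 'stop_sequence';
}
// Stand-in for a client call such as dispatch(); injected so it can be faked.
type DispatchFn = (messages: Msg[]) => Promise<Reply>;

async function completeWithContinuation(
  dispatch: DispatchFn,
  messages: Msg[],
  maxRounds = 3, // assumed cap so a run of truncated replies cannot loop forever
): Promise<{ text: string; truncated: boolean }> {
  const working = [...messages]; // never mutate the caller's history
  let text = '';

  for (let round = 0; round < maxRounds; round++) {
    const reply = await dispatch(working);
    text += reply.content;

    // Any stop reason other than max_tokens ends the loop; tool_use still
    // needs its own handling by the caller.
    if (reply.stopReason !== 'max_tokens') {
      return { text, truncated: false };
    }

    // Feed the partial answer back and ask the model to resume.
    working.push({ role: 'assistant', content: reply.content });
    working.push({ role: 'user', content: 'Continue exactly where you left off.' });
  }

  // Still truncated after maxRounds; let the caller decide what to show.
  return { text, truncated: true };
}
```

Each extra round resends the growing message list, so the cap on rounds is also a cap on input cost.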
2. Assuming Word Count Equals Token Count
Explanation: Tokenizers split on subword patterns, not spaces. A single long word such as "Unbelievable" can consume several tokens, depending on the tokenizer, and punctuation and structural characters in code and JSON often consume tokens of their own. Budgeting by word count therefore underestimates usage.
Fix: Use the provider's tokenizer library for precise counts, or apply the English rule of thumb of roughly 4 characters (about 0.75 words) per token. Run non-English payloads through the tokenizer before deployment.
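The rule of thumb can be written as a small estimator. This is only a budgeting heuristic for English text, not a substitute for the provider's tokenizer, and the name `estimateTokens` is hypothetical.

```typescript
// Rough English-only estimate: ~4 characters per token or ~0.75 words per
// token, whichever yields the larger (more conservative) figure.
function estimateTokens(text: string): number {
  const byChars = text.length / 4;
  const words = text.trim().split(/\s+/).filter(Boolean).length;
  const byWords = words / 0.75;
  return Math.ceil(Math.max(byChars, byWords));
}

console.log(estimateTokens('hello world')); // 3
```

Taking the maximum of the two estimates errs toward over-budgeting, which is the safer direction for cost planning.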
3. Treating the API as Stateful
Explanation: The endpoint holds zero conversation memory. If you only send the latest user message, the model loses all prior context. This breaks multi-turn workflows.
Fix: Maintain a client-side message array. Prepend system instructions and append each turn. Implement context window management (sliding window, summarization, or priority injection) to prevent input bloat.
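A sliding window is the simplest of these strategies. The sketch below assumes a message-count limit rather than a token budget; `buildWindowedMessages` and `ChatMessage` are hypothetical names introduced for this example.

```typescript
// Hypothetical sliding-window builder: keeps the system instruction plus the
// most recent messages, counted by message rather than by token.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

function buildWindowedMessages(
  systemInstruction: string,
  history: ChatMessage[],
  maxHistoryLength: number,
): ChatMessage[] {
  // slice(-0) would return the whole array, so guard the zero case explicitly.
  const recent = maxHistoryLength > 0 ? history.slice(-maxHistoryLength) : [];

  // Drop leading assistant messages so the window opens on a user turn.
  const firstUser = recent.findIndex((m) => m.role === 'user');
  const aligned = firstUser > 0 ? recent.slice(firstUser) : recent;

  return [{ role: 'system', content: systemInstruction }, ...aligned];
}
```

Counting messages keeps the example short; a production version would more likely trim against a token budget derived from the model's context window.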
4. Setting max_tokens as a Soft Target
Explanation: The parameter is a hard ceiling. The model stops generating exactly at the limit, regardless of sentence boundaries. This is not a suggestion; it's a circuit breaker.
Fix: Set conservative limits for UI rendering, or use streaming with client-side buffering. If full responses are required, implement a continuation loop that checks stop_reason === 'max_tokens' and resends with an appended prompt.
5. Verbose Tool Schemas
Explanation: Tool definitions are included in the input payload on every request. Overly verbose schemas with excessive descriptions, nested types, or redundant examples inflate input tokens and costs.
Fix: Compact schemas by removing whitespace, using concise type hints, and referencing external documentation instead of embedding full examples. Validate schema size against your input budget.
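One possible shape for schema compaction is sketched below. It assumes JSON-Schema-style tool definitions; the `compactSchema` function, the 80-character description cap, and the decision to drop `examples` arrays are illustrative choices, not a provider requirement.

```typescript
// Hypothetical compaction pass over a JSON-Schema-style tool definition.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

function compactSchema(schema: Json, maxDescription = 80): Json {
  if (Array.isArray(schema)) {
    return schema.map((item) => compactSchema(item, maxDescription));
  }
  if (schema === null || typeof schema !== 'object') {
    return schema; // primitives pass through unchanged
  }

  const out: { [key: string]: Json } = {};
  for (const key of Object.keys(schema)) {
    const value = schema[key];
    // Drop embedded example lists. Only arrays are removed, so a tool
    // parameter that happens to be named "examples" (an object) survives.
    if (key === 'examples' && Array.isArray(value)) continue;
    // Cap long free-text descriptions.
    if (key === 'description' && typeof value === 'string') {
      out[key] = value.length > maxDescription ? value.slice(0, maxDescription) : value;
      continue;
    }
    out[key] = compactSchema(value, maxDescription);
  }
  return out;
}

// JSON.stringify without an indent argument emits no extra whitespace.
function serializeTool(schema: Json): string {
  return JSON.stringify(compactSchema(schema));
}
```

Truncating descriptions mid-sentence can hurt tool selection quality, so it is worth comparing model behavior before and after compaction rather than trimming blindly.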
6. Non-English Token Inflation Blindness
Explanation: Japanese, Hindi, Arabic, and other non-Latin scripts often require 2–4× more tokens than equivalent English text. Global applications that price based on English baselines will experience severe cost overruns.
Fix: Implement language-aware pricing multipliers. Detect input language early and adjust max_tokens or budget thresholds accordingly. Consider routing low-resource languages to specialized models.
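A crude multiplier can be derived from the share of non-Latin characters in the input. Everything here is an assumption for illustration: the function names, the 3× figure (picked from the middle of the 2–4× range above), and the code-point cutoff used in place of real language detection.

```typescript
// Assumed multiplier for text that is entirely non-Latin; the article cites a
// 2–4× range, so 3 is a midpoint, not a measured value.
const NON_LATIN_MULTIPLIER = 3;

// Heuristic: treat any character beyond Latin Extended-B (U+024F) as
// non-Latin and scale the multiplier by its share of the visible characters.
function tokenInflationMultiplier(text: string): number {
  let visible = 0;
  let nonLatin = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code <= 0x20) continue; // skip spaces and control characters
    visible++;
    if (code > 0x024f) nonLatin++;
  }
  if (visible === 0) return 1;
  return 1 + (nonLatin / visible) * (NON_LATIN_MULTIPLIER - 1);
}

// Scales an English-baseline token budget for the detected script mix.
function adjustedMaxTokens(baseMaxTokens: number, text: string): number {
  return Math.ceil(baseMaxTokens * tokenInflationMultiplier(text));
}

console.log(adjustedMaxTokens(512, 'hello world')); // 512
```

Because the cutoff also catches symbols and typographic punctuation, measured ratios from the provider's tokenizer should replace this heuristic once real traffic is available.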
7. Missing Usage Logging from Day One
Explanation: Teams delay implementing token tracking, assuming costs will remain low. By the time surprise invoices arrive, historical data is lost, making root-cause analysis impossible.
Fix: Log usage.inputTokens and usage.outputTokens on every response. Aggregate by endpoint, model, and feature. Set up automated alerts at 70% and 90% of monthly budget thresholds.
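The aggregation and alerting described above might look like the following in-memory ledger. `UsageLedger` and `UsageRecord` are hypothetical names, the 70% and 90% thresholds come from the fix above, and a real deployment would persist records rather than hold them in memory.

```typescript
// Hypothetical in-memory ledger: aggregates cost per feature and model, and
// reports each budget threshold once, at the moment it is first crossed.
interface UsageRecord {
  feature: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  cost: number; // already computed, e.g. by calculateCost
}

class UsageLedger {
  private readonly totals = new Map<string, number>();
  private readonly fired = new Set<number>();
  private readonly monthlyBudget: number;
  private readonly thresholds: number[];
  private spent = 0;

  constructor(monthlyBudget: number, thresholds: number[] = [0.7, 0.9]) {
    this.monthlyBudget = monthlyBudget;
    this.thresholds = thresholds;
  }

  // Returns the thresholds newly crossed by this record (often an empty list).
  record(entry: UsageRecord): number[] {
    const key = `${entry.feature}:${entry.model}`;
    this.totals.set(key, (this.totals.get(key) ?? 0) + entry.cost);
    this.spent += entry.cost;

    const crossed: number[] = [];
    for (const threshold of this.thresholds) {
      if (!this.fired.has(threshold) && this.spent >= threshold * this.monthlyBudget) {
        this.fired.add(threshold);
        crossed.push(threshold);
      }
    }
    return crossed;
  }

  costFor(feature: string, model: string): number {
    return this.totals.get(`${feature}:${model}`) ?? 0;
  }
}
```

Returning the crossed thresholds instead of logging inside the ledger leaves the choice of alert channel to the caller.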
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume conversational UI | Client-side history + sliding window + streaming | Prevents input bloat, reduces latency, maintains context relevance | Input costs drop 30–50% via context trimming |
| One-off document analysis | Single-turn request + high max_tokens | No state management needed; full context fits in one payload | Predictable cost; output dominates spend |
| Tool-calling agent | Explicit stop_reason: tool_use routing + schema compaction | Deterministic function execution; reduces input token waste | Tool schema optimization saves 15–25% input cost |
| Multi-language support | Language detection + token inflation multiplier + routing | Prevents budget overruns from non-Latin tokenization | Cost variance stabilizes; avoids 2–4× spikes |
| Real-time chat with budget constraints | Streaming + client-side buffering + hard max_tokens | Balances UX responsiveness with cost control | Output costs capped; prevents runaway generation |
Configuration Template
// config/llm.config.ts
import { PricingTier } from '../interfaces';
export const PROVIDER_CONFIG = {
endpoint: 'https://api.provider.com/v1/chat/completions',
apiKey: process.env.LLM_API_KEY ?? '',
pricing: {
inputPerMillion: 0.15, // $0.15 per 1M input tokens
outputPerMillion: 0.60, // $0.60 per 1M output tokens (4x multiplier)
} as PricingTier,
defaults: {
model: 'provider-chat-v2',
maxTokens: 512,
temperature: 0.7,
contextWindow: 8192,
streaming: false,
},
safety: {
maxHistoryLength: 20, // Sliding window limit
costAlertThreshold: 0.8, // 80% of monthly budget
truncationFallback: 'continue_prompt', // Strategy for max_tokens hits
},
logging: {
enabled: true,
format: 'json',
fields: ['id', 'model', 'inputTokens', 'outputTokens', 'cost', 'stopReason'],
},
};
Quick Start Guide
- Initialize the client: Import NeuralEndpointClient and pass your endpoint, API key, and pricing tier from the configuration template.
- Create an orchestrator: Instantiate ConversationOrchestrator with the client and a context window limit. This manages history and turn processing.
- Run a test turn: Call processTurn('Hello, analyze this request.', 'You are a concise technical assistant.'). Verify console logs show token counts, cost calculation, and stop_reason routing.
- Validate truncation handling: Temporarily set maxTokens: 10 and observe the warning log. Confirm your UI or downstream logic handles the cutoff without crashing.
- Deploy with monitoring: Enable structured logging for usage fields. Configure your observability platform to track input/output token ratios and trigger alerts at 70% budget utilization.
Understanding the raw LLM API contract transforms AI development from experimental guesswork into deterministic engineering. By treating statelessness as a feature, respecting token economics, and routing on explicit stop signals, you build systems that are cost-predictable, failure-resilient, and production-ready from day one.