Implement a token bucket algorithm at the request pipeline level. This approach allows controlled burst capacity while enforcing a sustained throughput ceiling that matches your GPU's parallel processing limits.
```typescript
import { PipelineContext } from "@juspay/neurolink";

class ConcurrencyGate {
  private availableTokens: number;
  private lastRefillTimestamp: number;
  private readonly maxBurst: number;
  private readonly sustainedRate: number;

  constructor(maxBurst: number, sustainedRate: number) {
    this.maxBurst = maxBurst;
    this.sustainedRate = sustainedRate;
    this.availableTokens = maxBurst;
    this.lastRefillTimestamp = Date.now();
  }

  // Replenish tokens in proportion to elapsed time, capped at the burst size.
  private refill(): void {
    const now = Date.now();
    const elapsedSeconds = (now - this.lastRefillTimestamp) / 1000;
    const newTokens = elapsedSeconds * this.sustainedRate;
    this.availableTokens = Math.min(this.maxBurst, this.availableTokens + newTokens);
    this.lastRefillTimestamp = now;
  }

  // Consume one token if available; reject otherwise.
  tryAcquire(): boolean {
    this.refill();
    if (this.availableTokens >= 1) {
      this.availableTokens -= 1;
      return true;
    }
    return false;
  }
}

// Burst capacity of 12 requests, refilling at a sustained 3 requests/second.
const localGate = new ConcurrencyGate(12, 3);

const concurrencyMiddleware = {
  identifier: "local-concurrency-guard",
  executionOrder: 100,
  interceptRequest: async (ctx: PipelineContext): Promise<PipelineContext> => {
    if (!localGate.tryAcquire()) {
      // Deterministic error code that downstream routing logic can catch.
      throw new Error("GATEWAY_CONCURRENCY_EXCEEDED");
    }
    return ctx;
  }
};
```
Architecture Rationale: The middleware executes before provider dispatch. By throwing a deterministic error code, we create a catchable signal that downstream routing logic can interpret. The token bucket algorithm is preferred over fixed-window counters because it smooths traffic spikes without abrupt rejection cliffs, matching the bursty nature of user interactions.
Step 2: Latency Budgets and Capability-Matched Fallback
Queue saturation is only one failure mode. Thermal throttling, VRAM fragmentation, and model context window limits can cause requests to hang indefinitely. A latency budget enforces a hard deadline, triggering fallback when local inference exceeds acceptable response times.
```typescript
import { InferencePipeline, FallbackStrategy, PipelineResult, ProviderConfig } from "@juspay/neurolink";

const primaryConfig: ProviderConfig = {
  provider: "ollama",
  model: "llama3.1",
  priority: 1,
  middleware: [concurrencyMiddleware]
};

const fallbackConfig: ProviderConfig[] = [
  {
    provider: "anthropic",
    model: "claude-3-5-haiku-20241022",
    priority: 2,
    apiKey: process.env.ANTHROPIC_API_KEY
  },
  {
    provider: "openai",
    model: "gpt-4o-mini",
    priority: 3,
    apiKey: process.env.OPENAI_API_KEY
  }
];

const gateway = new InferencePipeline({
  primary: primaryConfig,
  fallbackChain: fallbackConfig,
  fallbackStrategy: FallbackStrategy.ON_ERROR_OR_TIMEOUT,
  globalTimeout: 8000 // hard 8s latency budget before fallback triggers
});

async function routeInference(userPrompt: string): Promise<PipelineResult> {
  try {
    return await gateway.execute({ input: { text: userPrompt } });
  } catch (routingError: any) {
    if (routingError.message === "GATEWAY_CONCURRENCY_EXCEEDED") {
      console.warn("Local queue saturated. Escalating to cloud fallback.");
      // Skip the saturated primary entirely to avoid a retry loop.
      return gateway.execute({
        input: { text: userPrompt },
        bypassPrimary: true
      });
    }
    throw routingError;
  }
}
```
Architecture Rationale: Fallback models must match the capability tier of the local model, not the price tier. Routing overflow from an 8B-class local model to claude-3-5-sonnet or gpt-4o adds cost without a quality payoff for standard tasks. claude-3-5-haiku-20241022 and gpt-4o-mini provide comparable reasoning depth for overflow scenarios while maintaining predictable pricing. The bypassPrimary flag prevents retry loops when the concurrency gate has already rejected the request.
Step 3: Economic Guardrails and Telemetry
"Free" local inference becomes expensive when fallback triggers cascade. The onFinish lifecycle hook provides a deterministic point to calculate spend, track provider distribution, and enforce budget thresholds.
```typescript
// Per-1K-token pricing for the fallback models (USD).
const PRICING_MATRIX: Record<string, { inputPerK: number; outputPerK: number }> = {
  "claude-3-5-haiku-20241022": { inputPerK: 0.0008, outputPerK: 0.004 },
  "gpt-4o-mini": { inputPerK: 0.00015, outputPerK: 0.0006 }
};

// Assumed observability hooks -- wire these to your metrics and alerting stack.
declare const metrics: { track: (event: string, tags: Record<string, string>) => void };
declare const opsAlert: (message: string) => void;

let sessionSpend = 0;
const SPEND_THRESHOLD = 5.0;

const economicMiddleware = {
  identifier: "spend-tracker",
  executionOrder: 200,
  onComplete: (result: PipelineResult, meta: any) => {
    // Local inference (ollama) has no entry in the matrix, so it costs $0.
    const pricing = PRICING_MATRIX[meta.model] ?? { inputPerK: 0, outputPerK: 0 };
    const promptCost = ((meta.usage?.promptTokens ?? 0) / 1000) * pricing.inputPerK;
    const completionCost = ((meta.usage?.completionTokens ?? 0) / 1000) * pricing.outputPerK;
    const requestCost = promptCost + completionCost;
    sessionSpend += requestCost;
    console.log(
      `[TELEMETRY] provider=${meta.provider} | model=${meta.model} | ` +
      `tokens=${meta.usage?.totalTokens ?? 0} | cost=$${requestCost.toFixed(6)} | ` +
      `session=$${sessionSpend.toFixed(4)}`
    );
    if (meta.provider !== "ollama") {
      metrics.track("gateway.fallback_triggered", { provider: meta.provider });
    }
    if (sessionSpend > SPEND_THRESHOLD) {
      opsAlert(`Cloud spend threshold breached: $${sessionSpend.toFixed(2)}`);
    }
  }
};
```
Architecture Rationale: Cost tracking must occur after generation completes to capture actual token consumption, not estimates. Logging provider distribution reveals whether your local instance is undersized. A fallback rate exceeding 25% indicates hardware constraints or misconfigured concurrency limits. The middleware pipeline ensures telemetry executes regardless of which provider satisfies the request.
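The 25% figure is only useful if the rate is actually computed somewhere. Here is a minimal sketch of fallback-rate tracking; the counter names and the reporting interval are illustrative assumptions, not part of the NeuroLink API.

```typescript
// Illustrative fallback-rate counters; call recordProvider(meta.provider)
// from the onComplete hook above to keep everything in one place.
let totalRequests = 0;
let fallbackRequests = 0;

export function recordProvider(provider: string): void {
  totalRequests += 1;
  if (provider !== "ollama") fallbackRequests += 1;
}

export function fallbackRate(): number {
  return totalRequests === 0 ? 0 : fallbackRequests / totalRequests;
}

// Surface an undersized local instance using the 25% heuristic above.
setInterval(() => {
  const rate = fallbackRate();
  if (rate > 0.25) {
    console.warn(`[TELEMETRY] fallback rate ${(rate * 100).toFixed(1)}% - local instance may be undersized`);
  }
}, 60_000);
```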
Pitfall Guide
1. Capability Mismatch in Fallback Routing
Explanation: Routing local overflow to premium cloud models (e.g., claude-3-5-sonnet or gpt-4o) inflates costs without improving output quality for standard tasks.
Fix: Maintain a capability matrix. Map local models to cloud equivalents with similar parameter counts and benchmark scores. Use haiku or mini variants for overflow.
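A sketch of such a matrix, assuming llama3.1 in its 8B and 70B tags as the local tiers; the pairings are rough parameter-count and benchmark matches, not vendor guidance.

```typescript
// Illustrative capability matrix mapping local models to cloud equivalents.
const CAPABILITY_MATRIX: Record<string, string[]> = {
  "llama3.1": ["claude-3-5-haiku-20241022", "gpt-4o-mini"],   // 8B-class
  "llama3.1:70b": ["claude-3-5-sonnet-20241022", "gpt-4o"]    // 70B-class
};

function overflowCandidates(localModel: string): string[] {
  return CAPABILITY_MATRIX[localModel] ?? [];
}
```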
2. Aggressive Timeout Thresholds
Explanation: Setting global timeouts below 3 seconds for complex reasoning tasks causes unnecessary fallbacks, increasing cloud spend and fragmenting context windows.
Fix: Align timeouts with task complexity. Interactive chat: 4-6s. Document analysis: 8-12s. Batch processing: 15-30s. Use per-provider deadlines via Promise.race when strict isolation is required.
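A minimal sketch of a per-provider deadline via Promise.race; withDeadline is a hypothetical helper, not a NeuroLink API.

```typescript
// Reject with a labeled error if the work outlives its budget.
function withDeadline<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`DEADLINE_EXCEEDED:${label}`)), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}

// Usage: a 6s budget for interactive chat, 12s for document analysis.
// const result = await withDeadline(gateway.execute({ input: { text: prompt } }), 6000, "ollama");
```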
3. Stateful Cost Tracking in Stateless Deployments
Explanation: Storing sessionSpend in module-level variables fails in serverless or horizontally scaled environments. Each replica maintains independent counters, breaking budget enforcement.
Fix: Offload cost accumulation to Redis or a distributed counter service. Use atomic increment operations and centralized alerting thresholds.
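A sketch using ioredis, assuming a reachable Redis instance; the key name and environment variable are illustrative.

```typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL ?? "redis://localhost:6379");

async function recordSpend(requestCost: number): Promise<void> {
  // INCRBYFLOAT is atomic, so concurrent replicas never lose an update.
  const total = parseFloat(await redis.incrbyfloat("gateway:session_spend", requestCost));
  if (total > SPEND_THRESHOLD) {
    opsAlert(`Cloud spend threshold breached: $${total.toFixed(2)}`);
  }
}
```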
4. Ignoring Streaming Fallback Complexity
Explanation: Timeout and fallback logic breaks when applied to streaming responses. Partial tokens already delivered cannot be recalled, causing duplicate content or broken UI states.
Fix: Implement stream-aware fallback. Buffer initial tokens locally, then switch to cloud streaming only if latency exceeds budget before the first token arrives. Use AbortController to cancel local streams cleanly.
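A sketch of a first-token-budget fallback; streamLocal and streamCloud stand in for whatever streaming producers your stack exposes and are not NeuroLink APIs.

```typescript
// Race the first local token against the budget; switch providers only
// while nothing has been delivered to the UI.
async function* streamWithFallback(
  prompt: string,
  firstTokenBudgetMs: number,
  streamLocal: (p: string, signal: AbortSignal) => AsyncIterable<string>,
  streamCloud: (p: string) => AsyncIterable<string>
): AsyncIterable<string> {
  const abort = new AbortController();
  const local = streamLocal(prompt, abort.signal)[Symbol.asyncIterator]();

  const first = await Promise.race([
    local.next(),
    new Promise<"timeout">((resolve) => setTimeout(() => resolve("timeout"), firstTokenBudgetMs))
  ]);

  if (first === "timeout") {
    abort.abort();              // cancel the local stream cleanly
    yield* streamCloud(prompt); // safe: no tokens were delivered yet
    return;
  }

  if (!first.done) yield first.value;
  // After the first token, stay local; switching now would duplicate content.
  for await (const token of { [Symbol.asyncIterator]: () => local }) yield token;
}
```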
5. Thermal Throttling Blind Spots
Explanation: Consumer GPUs degrade performance under sustained load. Latency increases non-linearly as VRAM temperature rises, but standard metrics show "healthy" GPU utilization.
Fix: Monitor GPU temperature and clock speeds alongside latency. Implement dynamic concurrency reduction when thermal thresholds are breached. Schedule heavy workloads during off-peak hours or rotate across multiple inference nodes.
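A sketch of a thermal watchdog, assuming an NVIDIA GPU with nvidia-smi on the PATH; the 83°C threshold and poll interval are illustrative.

```typescript
import { execFile } from "node:child_process";
import { promisify } from "node:util";

const run = promisify(execFile);

// Query the current GPU temperature in degrees Celsius.
async function gpuTemperatureC(): Promise<number> {
  const { stdout } = await run("nvidia-smi", [
    "--query-gpu=temperature.gpu",
    "--format=csv,noheader,nounits"
  ]);
  return parseInt(stdout.trim(), 10);
}

// Poll and warn; hook this into your gate to shed load when the card runs hot.
setInterval(async () => {
  const temp = await gpuTemperatureC();
  if (temp >= 83) {
    console.warn(`[THERMAL] GPU at ${temp}C - reduce concurrency or shift load`);
  }
}, 15_000);
```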
6. Queue Starvation vs. Backpressure
Explanation: Rejecting requests outright under load creates poor user experience. Accepting all requests causes queue saturation and OOM crashes.
Fix: Implement HTTP 429 responses with Retry-After headers. Use exponential backoff on the client side. Maintain a small request buffer (3-5 requests) to absorb micro-spikes without blocking.
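A sketch of the HTTP edge, assuming an Express server in front of the gateway; the route path and Retry-After value are illustrative.

```typescript
import express from "express";

const app = express();
app.use(express.json());

app.post("/v1/infer", async (req, res) => {
  try {
    // Call the pipeline directly so gate rejections surface as 429s
    // instead of silently escalating to cloud.
    const result = await gateway.execute({ input: { text: req.body.prompt } });
    res.json(result);
  } catch (err: any) {
    if (err.message === "GATEWAY_CONCURRENCY_EXCEEDED") {
      // Tell clients to back off instead of hammering a saturated queue.
      res.set("Retry-After", "2").status(429).json({ error: "queue_saturated" });
      return;
    }
    res.status(500).json({ error: "inference_failed" });
  }
});
```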
7. Missing Circuit Breaker Integration
Explanation: When cloud fallback providers experience outages, repeated fallback attempts waste time and tokens, compounding latency.
Fix: Wrap fallback chains in a circuit breaker pattern. Track consecutive failures per provider. Open the circuit after 3-5 failures, halving retry attempts until health checks restore connectivity.
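A minimal per-provider breaker sketch; the failure threshold and cooldown are illustrative assumptions.

```typescript
// Track consecutive failures per provider; open the circuit for a cooldown
// window once the threshold is crossed.
class ProviderBreaker {
  private consecutiveFailures = 0;
  private openUntil = 0;

  constructor(private readonly failureThreshold = 4, private readonly cooldownMs = 30_000) {}

  allowRequest(): boolean {
    return Date.now() >= this.openUntil;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openUntil = Date.now() + this.cooldownMs; // open the circuit
      this.consecutiveFailures = 0;
    }
  }
}

const breakers = new Map<string, ProviderBreaker>([
  ["anthropic", new ProviderBreaker()],
  ["openai", new ProviderBreaker()]
]);
```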
Production Bundle
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Low traffic, single GPU | Local-only with strict concurrency gate | Avoids cloud dependency; predictable latency | $0.00 |
| Medium traffic, variable load | Local + Haiku/Mini fallback with 8s timeout | Balances cost and reliability; handles spikes | +$0.0015 per 1k tokens |
| High traffic, strict SLA | Distributed Ollama cluster + premium fallback | Ensures availability; meets p99 targets | +$0.004 per 1k tokens |
| Batch processing, async | Local-only with extended timeout (15-30s) | Maximizes free compute; no user-facing latency | $0.00 |
| Multi-tenant SaaS | Gateway with per-tenant concurrency quotas | Prevents noisy neighbor issues; isolates spend | Variable per tenant |
Configuration Template
```typescript
import { InferencePipeline, FallbackStrategy } from "@juspay/neurolink";

// concurrencyMiddleware and economicMiddleware are defined in the steps above.
export const productionGateway = new InferencePipeline({
  primary: {
    provider: "ollama",
    model: "llama3.1",
    priority: 1,
    middleware: [concurrencyMiddleware]
  },
  fallbackChain: [
    {
      provider: "anthropic",
      model: "claude-3-5-haiku-20241022",
      priority: 2,
      apiKey: process.env.ANTHROPIC_API_KEY
    },
    {
      provider: "openai",
      model: "gpt-4o-mini",
      priority: 3,
      apiKey: process.env.OPENAI_API_KEY
    }
  ],
  fallbackStrategy: FallbackStrategy.ON_ERROR_OR_TIMEOUT,
  globalTimeout: 8000,
  middleware: [economicMiddleware]
});

export async function handleInference(prompt: string) {
  try {
    return await productionGateway.execute({ input: { text: prompt } });
  } catch (err: any) {
    if (err.message === "GATEWAY_CONCURRENCY_EXCEEDED") {
      // Gate rejection: route straight to the fallback chain.
      return productionGateway.execute({
        input: { text: prompt },
        bypassPrimary: true
      });
    }
    throw err;
  }
}
```
Quick Start Guide
- Install SDK & Configure Providers: Initialize your project with the inference SDK. Set environment variables for cloud API keys. Verify local Ollama instance is running and model is pulled.
- Deploy Concurrency Gate: Instantiate the token bucket limiter with values matching your GPU's parallel batch capacity. Attach it as the first middleware in your pipeline.
- Attach Telemetry Hook: Register the cost tracking middleware. Configure your metrics backend to ingest gateway.fallback_triggered events and spend thresholds.
- Validate Fallback Routing: Simulate traffic spikes using a load testing tool or the burst sketch after this list. Verify that requests exceeding concurrency limits route to cloud providers without blocking. Confirm latency budgets trigger fallback correctly.
- Monitor & Tune: Observe fallback rates and p99 latency over 24 hours. Adjust concurrency limits if fallback exceeds 20%. Tighten or relax timeouts based on task complexity and user feedback.
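A quick burst sketch for the validation step, assuming the handleInference export from the configuration template; the burst size is illustrative.

```typescript
// Fire a burst larger than the gate's capacity and confirm nothing blocks:
// overflow should escalate to cloud rather than queue indefinitely.
async function simulateBurst(n = 20): Promise<void> {
  const results = await Promise.allSettled(
    Array.from({ length: n }, (_, i) => handleInference(`probe ${i}`))
  );
  const fulfilled = results.filter((r) => r.status === "fulfilled").length;
  console.log(`burst=${n} fulfilled=${fulfilled} rejected=${n - fulfilled}`);
}

simulateBurst().catch(console.error);
```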