ever hardcode provider SDKs in business logic.
2. Semantic Caching: Implement caching based on embedding similarity, not exact string matches, to serve repeated intents instantly.
3. Structured Outputs: Enforce JSON schema validation on all model outputs to prevent parsing errors and enable type-safe downstream processing.
4. Evaluation Pipeline: Integrate automated evaluation (accuracy, toxicity, latency) into the CI/CD pipeline, not just manual testing.
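To make the semantic caching item concrete, here is a minimal in-memory sketch of a similarity-based lookup. The names `SemanticCache` and `cosineSimilarity` and the 0.85 default threshold are illustrative assumptions, not part of any particular library, and embeddings are assumed to be computed elsewhere and passed in as plain number arrays.

```typescript
// Hypothetical in-memory semantic cache (illustration only).
// Embeddings are assumed to come from whatever embedding model you use.
type SemanticCacheEntry<T> = { embedding: number[]; value: T };

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Embedding dimensions differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SemanticCache<T> {
  private entries: SemanticCacheEntry<T>[] = [];
  private threshold: number;

  constructor(threshold: number = 0.85) {
    this.threshold = threshold;
  }

  set(embedding: number[], value: T): void {
    this.entries.push({ embedding, value });
  }

  // Returns the value of the most similar entry at or above the threshold,
  // or undefined when nothing is close enough.
  get(embedding: number[]): T | undefined {
    let best: SemanticCacheEntry<T> | undefined;
    let bestScore = -Infinity;
    for (const entry of this.entries) {
      const score = cosineSimilarity(embedding, entry.embedding);
      if (score >= this.threshold && score > bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best?.value;
  }
}
```

A linear scan like this only suits small caches; a production setup would more plausibly delegate the nearest-neighbour lookup to a vector index.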
Step-by-Step Implementation
1. Define the Routing Interface
Create a type-safe interface for model interactions that includes cost controls and fallback logic.
// types.ts
export interface ModelConfig {
  provider: 'openai' | 'anthropic' | 'local';
  modelId: string;
  maxTokens: number;
  temperature: number;
  costPer1kTokens: number;
}

export interface RoutingOptions {
  primary: ModelConfig;
  fallbacks: ModelConfig[];
  maxCostPerRequest: number;
  timeoutMs: number;
  requireStructuredOutput: boolean;
}

export interface AIResponse<T = unknown> {
  data: T;
  modelUsed: string;
  cost: number;
  latencyMs: number;
  cached: boolean;
}
2. Implement the AI Router
The router manages the lifecycle: cache check, cost estimation, provider call, and validation.
// ai-router.ts
import { RedisCache } from './cache';
import { Providers } from './providers';
import type { AIResponse, ModelConfig, RoutingOptions } from './types';
import { validateJson } from './validators';

export class AIRouter {
  private cache: RedisCache;

  constructor() {
    this.cache = new RedisCache();
  }

  async route<T>(
    prompt: string,
    options: RoutingOptions,
    schema?: object
  ): Promise<AIResponse<T>> {
    const startTime = Date.now();

    // 1. Semantic Cache Lookup
    const cacheKey = await this.cache.generateKey(prompt);
    const cachedResult = await this.cache.get<T>(cacheKey);
    // Compare against null/undefined so falsy cached values (0, '', false) still count as hits.
    if (cachedResult != null) {
      return {
        data: cachedResult,
        modelUsed: 'cache',
        cost: 0,
        latencyMs: Date.now() - startTime,
        cached: true,
      };
    }

    // 2. Cost Guardrail (estimated against the primary model only)
    const estimatedCost = this.estimateCost(prompt, options.primary);
    if (estimatedCost > options.maxCostPerRequest) {
      throw new Error('Cost guardrail exceeded');
    }

    // 3. Provider Execution with Fallbacks
    const allModels = [options.primary, ...options.fallbacks];
    for (const model of allModels) {
      try {
        const response = await this.executeWithTimeout<T>(
          model,
          prompt,
          schema,
          options.timeoutMs
        );

        // 4. Validation
        if (options.requireStructuredOutput && schema) {
          const isValid = validateJson(response.data, schema);
          if (!isValid) {
            console.warn(`Validation failed for model ${model.modelId}`);
            continue; // Trigger fallback
          }
        }

        // 5. Cache Storage
        await this.cache.set(cacheKey, response.data, { ttl: 3600 });
        return response;
      } catch (error) {
        console.error(`Model ${model.modelId} failed:`, error);
        // Continue to next fallback
      }
    }
    throw new Error('All models failed or validation rejected output');
  }

  private async executeWithTimeout<T>(
    model: ModelConfig,
    prompt: string,
    schema: object | undefined,
    timeout: number
  ): Promise<AIResponse<T>> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeoutPromise = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error('Timeout')), timeout);
    });
    try {
      return await Promise.race([
        Providers.call(model, prompt, schema),
        timeoutPromise,
      ]);
    } finally {
      // Clear the timer so a fast provider response does not leave it pending.
      clearTimeout(timer);
    }
  }

  private estimateCost(prompt: string, model: ModelConfig): number {
    // Rough estimate: about 4 characters per token, input tokens only.
    const tokenEstimate = prompt.length / 4;
    return (tokenEstimate * model.costPer1kTokens) / 1000;
  }
}
3. RAG Integration for Context
For SaaS products requiring domain accuracy, integrate a Retrieval-Augmented Generation pipeline.
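// rag-service.ts
// EmbeddingModel and VectorDB are project-local wrappers; the import paths below are assumptions.
import { EmbeddingModel } from './embeddings';
import { VectorDB } from './vector-db';

export class RAGService {
  async augmentPrompt(userQuery: string, tenantId: string): Promise<string> {
    // 1. Embed query
    const queryEmbedding = await EmbeddingModel.encode(userQuery);

    // 2. Vector Search with tenant isolation
    const context = await VectorDB.search({
      embedding: queryEmbedding,
      filter: { tenantId },
      topK: 5,
      minScore: 0.75,
    });

    // 3. Construct prompt (context may be empty if no chunk reaches minScore)
    const contextBlock = context.map((c) => c.text).join('\n---\n');
    return [
      `Context: ${contextBlock}`,
      `Question: ${userQuery}`,
      'Instructions: Answer based strictly on the context. If unknown, state that.',
    ].join('\n');
  }
}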
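The `VectorDB.search` call above is not defined in this article, so the following is a sketch of the behaviour its parameters suggest: filter by tenant first, score by cosine similarity, drop hits below `minScore`, and keep the best `topK`. The `StoredChunk` shape and the `searchChunks` and `cosineScore` functions are hypothetical; a real vector database would do this server-side.

```typescript
// Hypothetical in-memory stand-in for a tenant-scoped vector search.
interface StoredChunk {
  tenantId: string;
  text: string;
  embedding: number[];
}

interface ChunkSearchParams {
  embedding: number[];
  filter: { tenantId: string };
  topK: number;
  minScore: number;
}

function cosineScore(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

function searchChunks(
  store: StoredChunk[],
  params: ChunkSearchParams
): Array<{ text: string; score: number }> {
  return store
    // Tenant filter runs before scoring so other tenants' data is never ranked.
    .filter((chunk) => chunk.tenantId === params.filter.tenantId)
    .map((chunk) => ({
      text: chunk.text,
      score: cosineScore(params.embedding, chunk.embedding),
    }))
    .filter((hit) => hit.score >= params.minScore)
    .sort((x, y) => y.score - x.score)
    .slice(0, params.topK);
}
```

Applying the tenant filter before ranking, rather than after, means a bug in scoring or truncation cannot surface another tenant's chunk.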
Architecture Rationale
- TypeScript: Enforces contracts between the AI layer and business logic, reducing runtime errors caused by malformed model outputs.
- Fallback Chain: Ensures high availability. If the primary model is rate-limited or degrades, the system seamlessly shifts to a secondary model.
- Semantic Caching: Can reduce API calls by 40-60% in SaaS workloads where users repeat intents with slight phrasing variations; the actual hit rate depends on traffic patterns and the similarity threshold.
- Tenant Isolation: Critical for multi-tenant SaaS. RAG pipelines must enforce strict data boundaries to prevent cross-tenant data leakage.
Pitfall Guide
Common Mistakes
- Treating LLMs as Deterministic Functions
  - Mistake: Writing unit tests that assert exact string outputs.
  - Reality: Models are stochastic. Tests must assert structural validity, semantic similarity, or constraint satisfaction, not exact matches.
  - Fix: Use evaluation frameworks that score outputs against rubrics rather than exact equality.
- Ignoring Cost Volatility
  - Mistake: Passing raw user input directly to models without length checks or summarization.
  - Reality: Malicious or verbose users can trigger massive token consumption, spiking costs.
  - Fix: Implement input sanitization, pre-flight token counting, and cost caps per request and per tenant.
- Prompt Injection Vulnerabilities
  - Mistake: Concatenating user input directly into system prompts.
  - Reality: Attackers can inject instructions that override system behavior, exfiltrate data, or perform unauthorized actions.
  - Fix: Use structured input separation, output validation, and dedicated guardrail models to detect injection patterns.
- Context Window Mismanagement
  - Mistake: Dumping entire documents into the context window.
  - Reality: This increases cost and latency, and dilutes attention (the lost-in-the-middle effect).
  - Fix: Implement chunking strategies, retrieval-based context injection, and summary compression for long conversations.
- Skipping Evaluation Metrics
  - Mistake: Relying on developer intuition for quality.
  - Reality: Model updates can silently degrade performance. Without metrics, regressions go unnoticed.
  - Fix: Establish a golden dataset and run automated evaluations on every model version change. Track accuracy, hallucination rate, and latency.
- Vendor Lock-in via SDK Dependency
  - Mistake: Importing provider-specific SDKs throughout the codebase.
  - Reality: Switching providers or adding fallbacks requires massive refactoring.
  - Fix: Abstract all provider interactions behind a unified interface. Use the Router pattern described in Core Solution.
- Poor UX for Latency
  - Mistake: Blocking the UI until generation completes.
  - Reality: AI generation can take seconds. Users perceive this as slowness.
  - Fix: Implement streaming responses, skeleton loaders, and progressive disclosure of results.
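The cost-volatility fix (length checks, pre-flight token counting, and caps per request and per tenant) can be sketched as a single guard that runs before any provider call. Everything here is an assumption for illustration: the `PreflightLimits` fields, the `TenantSpend` tracker, and the same rough 4-characters-per-token heuristic the router uses.

```typescript
// Hypothetical pre-flight guard; limits and names are illustrative.
interface PreflightLimits {
  maxInputChars: number;
  maxCostPerRequest: number;
  maxCostPerTenant: number; // budget for whatever window the caller resets on
}

type PreflightResult =
  | { ok: true; estimatedCost: number }
  | { ok: false; reason: string };

class TenantSpend {
  private spent = new Map<string, number>();

  get(tenantId: string): number {
    return this.spent.get(tenantId) ?? 0;
  }

  add(tenantId: string, cost: number): void {
    this.spent.set(tenantId, this.get(tenantId) + cost);
  }
}

function preflightCheck(
  prompt: string,
  tenantId: string,
  costPer1kTokens: number,
  limits: PreflightLimits,
  spend: TenantSpend
): PreflightResult {
  if (prompt.length > limits.maxInputChars) {
    return { ok: false, reason: 'input too long' };
  }
  // Rough estimate: about 4 characters per token, input tokens only.
  const estimatedTokens = Math.ceil(prompt.length / 4);
  const estimatedCost = (estimatedTokens * costPer1kTokens) / 1000;
  if (estimatedCost > limits.maxCostPerRequest) {
    return { ok: false, reason: 'request cost cap exceeded' };
  }
  if (spend.get(tenantId) + estimatedCost > limits.maxCostPerTenant) {
    return { ok: false, reason: 'tenant budget exceeded' };
  }
  return { ok: true, estimatedCost };
}
```

The caller would record actual spend with `spend.add(...)` after each successful request; the estimate ignores output tokens, so real caps should leave headroom.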
Best Practices from Production
- Structured Outputs: Always request JSON and validate against a schema. This enables reliable parsing and integration with existing data models.
- Feedback Loops: Implement thumbs-up/down mechanisms to capture user feedback. Use this data to retrain prompts or fine-tune models.
- Observability: Log every AI interaction with metadata: model used, tokens, cost, latency, and success/failure status. Correlate this with business metrics.
- Human-in-the-Loop: For high-stakes actions, design workflows where AI suggests and human confirms. This builds trust and reduces risk.
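As a sketch of the observability practice, the record below carries the metadata listed above (model, tokens, cost, latency, success) and is rolled up into the numbers one would typically alert on. The `InteractionLog` shape and the `summarizeLogs` and `percentileOf` helpers are hypothetical, and the percentile uses the simple nearest-rank method.

```typescript
// Hypothetical per-request log record and a small aggregation helper.
interface InteractionLog {
  tenantId: string;
  modelUsed: string;
  tokens: number;
  cost: number;
  latencyMs: number;
  success: boolean;
}

interface LogSummary {
  totalCost: number;
  failureRate: number;
  latencyP95Ms: number;
}

// Nearest-rank percentile: the smallest sample with at least p% of
// samples at or below it. Returns 0 for an empty input.
function percentileOf(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

function summarizeLogs(logs: InteractionLog[]): LogSummary {
  const totalCost = logs.reduce((sum, log) => sum + log.cost, 0);
  const failures = logs.filter((log) => !log.success).length;
  return {
    totalCost,
    failureRate: logs.length === 0 ? 0 : failures / logs.length,
    latencyP95Ms: percentileOf(logs.map((log) => log.latencyMs), 95),
  };
}
```

Grouping the same summary by `tenantId` or `modelUsed` is a natural next step for correlating AI spend with business metrics.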
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High Throughput, Low Complexity | Small Model + Semantic Cache | Low latency, sufficient accuracy for simple tasks, high cache hit rate. | Low |
| Critical Accuracy, Complex Reasoning | Large Model + RAG + Structured Output | Minimizes hallucinations, leverages domain context, ensures reliable parsing. | High |
| Budget Constrained / Edge Deployment | Quantized Open Source Model | Eliminates API costs, data privacy, runs on commodity hardware. | Medium (Compute) |
| Multi-Region Global SaaS | Regional Model Routing + Edge Cache | Reduces latency by routing to nearest provider, complies with data residency. | Medium |
| Regulated Industry (Finance/Health) | Guardrails + Human-in-the-Loop + Audit Logs | Ensures compliance, safety, and traceability of AI decisions. | High |
Configuration Template
Copy this configuration to bootstrap your AI routing and evaluation setup.
# ai-config.yaml
router:
  default:
    primary:
      provider: openai
      model: gpt-4o-mini
      max_tokens: 1024
      temperature: 0.1
    fallbacks:
      - provider: anthropic
        model: claude-3-haiku
        max_tokens: 1024
    cost_cap_per_request: 0.05
    timeout_ms: 3000
    require_structured_output: true
cache:
  enabled: true
  provider: redis
  semantic_similarity_threshold: 0.85
  ttl_seconds: 3600
evaluation:
  golden_dataset_path: ./eval/golden-set.json
  metrics:
    - accuracy
    - hallucination_rate
    - latency_p95
  auto_run_on_deploy: true
security:
  prompt_injection_detection: true
  max_input_length: 4096
  tenant_isolation: true
Quick Start Guide
- Initialize Project:
  npm install @codcompass/ai-sdk redis zod
- Configure Router:
  Create ai-config.yaml using the template above. Set environment variables for API keys.
- Implement Service:
  import { AIRouter } from '@codcompass/ai-sdk';
  const router = new AIRouter();
  // Example usage: the type argument tells TypeScript the shape of response.data
  const response = await router.route<{ summary: string }>(
    "Summarize the key risks in this document.",
    { /* options */ },
    { type: "object", properties: { summary: { type: "string" } } }
  );
  console.log(response.data.summary);
- Run Evaluation:
  npx codcompass-ai eval --config ai-config.yaml
  Verify metrics meet thresholds before deploying to production.
- Monitor:
  Integrate the router's telemetry with your observability stack. Set alerts for cost spikes, latency degradation, and validation failures.
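The "verify metrics meet thresholds" step in Run Evaluation could look roughly like the sketch below, which scores outputs against constraints from a golden dataset instead of exact strings and fails the deploy when accuracy drops too low. The `GoldenCase` shape, the helper names, and the synchronous `generate` callback are assumptions made for brevity; a real model call would be asynchronous, and this is not how the `codcompass-ai` CLI is known to work internally.

```typescript
// Hypothetical golden-set scoring with a deploy gate.
interface GoldenCase {
  input: string;
  requiredPhrases: string[]; // constraints instead of one exact expected string
}

// A case passes when every required phrase appears, ignoring letter case.
function passesGoldenCase(output: string, goldenCase: GoldenCase): boolean {
  const normalized = output.toLowerCase();
  return goldenCase.requiredPhrases.every((phrase) =>
    normalized.includes(phrase.toLowerCase())
  );
}

// `generate` stands in for the model call; it is synchronous here only
// to keep the sketch short.
function goldenAccuracy(
  cases: GoldenCase[],
  generate: (input: string) => string
): number {
  if (cases.length === 0) return 0;
  const passed = cases.filter((c) => passesGoldenCase(generate(c.input), c)).length;
  return passed / cases.length;
}

// Throws so a CI job exits non-zero when the threshold is missed.
function assertMinAccuracy(accuracy: number, minAccuracy: number): void {
  if (accuracy < minAccuracy) {
    throw new Error(`Accuracy ${accuracy} is below threshold ${minAccuracy}`);
  }
}
```

Phrase matching is a deliberately crude rubric; semantic-similarity or model-graded scoring can be swapped in behind the same `passesGoldenCase` signature.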
This article provides the architectural foundation for building robust, cost-effective, and reliable AI-powered SaaS products. Adherence to these patterns ensures scalability and maintainability in production environments.