eady RAG (Retrieval-Augmented Generation) pipeline for an AI triage system, written in TypeScript. The architecture prioritizes observability, cost control, and fallback resilience.
Step 1: Data Ingestion & Chunking Strategy
Raw documents must be normalized, chunked, and embedded before storage. Chunking should respect semantic boundaries, not arbitrary character counts.
import { createHash } from 'crypto';

interface Chunk {
  id: string;
  content: string;
  metadata: { source: string; timestamp: string };
  embedding?: number[];
}

export function chunkBySemanticBoundary(text: string, maxTokens = 512): Chunk[] {
  // Split after sentence-ending punctuation; drop empty fragments so blank input yields no chunks.
  const sentences = text.split(/(?<=[.!?])\s+/).filter((s) => s.trim().length > 0);
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let tokenCount = 0;

  // Emit the buffered sentences as one chunk whose ID is the SHA-256 of its content.
  const flush = (): void => {
    if (current.length === 0) return;
    const content = current.join(' ');
    chunks.push({
      id: createHash('sha256').update(content).digest('hex'),
      content,
      metadata: { source: 'ingestion', timestamp: new Date().toISOString() }
    });
    current = [];
    tokenCount = 0;
  };

  for (const sentence of sentences) {
    // Whitespace word count is only a rough proxy for model tokens.
    const tokens = sentence.trim().split(/\s+/).length;
    // A single sentence longer than maxTokens still becomes its own oversized chunk.
    if (tokenCount + tokens > maxTokens) flush();
    current.push(sentence);
    tokenCount += tokens;
  }
  flush();
  return chunks;
}
Architecture Decision: Semantic chunking reduces retrieval noise. Fixed-length chunking can cut sentences and ideas in half, leaving the model to guess at the missing context, which invites hallucinated connections. Hash-based IDs enable idempotent upserts and vector store deduplication.
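To make the idempotency point concrete, here is a minimal sketch that uses an in-memory `Map` as a stand-in for the vector store. `InMemoryStore` and `contentId` are hypothetical names for this illustration, and a small FNV-1a hash replaces SHA-256 only to keep the example dependency-free; a real pipeline would keep the SHA-256 IDs from the chunker.

```typescript
// Stand-in for SHA-256 (assumption for this sketch): a deterministic 32-bit FNV-1a hash.
function contentId(content: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < content.length; i++) {
    hash ^= content.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// Hypothetical in-memory stand-in for a vector store keyed by chunk ID.
class InMemoryStore {
  private rows = new Map<string, string>();

  // Insert if the ID is new, overwrite otherwise. Returns true when a row was newly inserted.
  upsert(content: string): boolean {
    const id = contentId(content);
    const isNew = !this.rows.has(id);
    this.rows.set(id, content);
    return isNew;
  }

  get size(): number {
    return this.rows.size;
  }
}

const store = new InMemoryStore();
const docs = [
  'Refunds take 5 days.',
  'Reset your password in Settings.',
  'Refunds take 5 days.' // duplicate content maps to the same ID
];
const insertedCount = docs.filter((d) => store.upsert(d)).length;

// Re-running ingestion over the same documents adds nothing new.
const reinsertedCount = docs.filter((d) => store.upsert(d)).length;
console.log(insertedCount, reinsertedCount, store.size);
```

Because the ID is a pure function of the content, a retried or repeated ingestion job overwrites existing rows instead of creating duplicates.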
Step 2: Vector Storage & Retrieval Pipeline
Production systems require async embedding generation, batching, and similarity search with fallbacks.
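import { pgVector } from './db'; // pgvector or managed vector DB client
// generateEmbedding() and keywordFallback() are assumed to be provided elsewhere in the project.

export async function upsertChunks(chunks: Chunk[], embeddingModel: string): Promise<void> {
  // Embed and upsert all chunks concurrently; allSettled lets every chunk finish before failures are reported.
  const settled = await Promise.allSettled(
    chunks.map(async (chunk) => {
      const embedding = await generateEmbedding(chunk.content, embeddingModel);
      chunk.embedding = embedding;
      return pgVector.upsert({
        id: chunk.id,
        content: chunk.content,
        embedding,
        metadata: chunk.metadata
      });
    })
  );
  // Surface failures instead of silently discarding rejected promises.
  const failed = settled.filter((r) => r.status === 'rejected').length;
  if (failed > 0) {
    throw new Error(`${failed} of ${chunks.length} chunk upserts failed`);
  }
}

export async function retrieveContext(
  query: string,
  topK = 5,
  embeddingModel = 'text-embedding-3-small' // must match the model used at ingestion
): Promise<string[]> {
  try {
    const queryEmbedding = await generateEmbedding(query, embeddingModel);
    const results = await pgVector.similaritySearch(queryEmbedding, topK, { minScore: 0.72 });
    if (results.length > 0) {
      return results.map((r) => r.content);
    }
  } catch (err) {
    // Embedding or vector search failed; degrade to the fallback below.
    console.warn('Vector retrieval failed, using keyword fallback', err);
  }
  // Fallback to keyword search or cached deterministic response
  return keywordFallback(query);
}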
Architecture Decision: Embedding models are decoupled from retrieval logic to allow model rotation, provided queries and stored chunks are always embedded with the same model. A minimum similarity threshold prevents low-confidence retrievals from polluting context. Keyword fallback lets the pipeline degrade gracefully when vector search returns no results or fails.
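The effect of the minimum similarity threshold can be sketched in-process. In the pipeline above the scoring happens inside the vector store, so this is only an illustration of the idea; `cosineSimilarity` and `searchWithThreshold` are hypothetical helpers, and the toy two-dimensional embeddings are made up for the example.

```typescript
interface ScoredChunk {
  content: string;
  score: number;
}

// Cosine similarity between two equal-length vectors; returns 0 if either vector has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same dimension');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every candidate, drop those below minScore, then keep the topK best.
function searchWithThreshold(
  query: number[],
  candidates: Array<{ content: string; embedding: number[] }>,
  topK = 5,
  minScore = 0.72
): ScoredChunk[] {
  return candidates
    .map((c) => ({ content: c.content, score: cosineSimilarity(query, c.embedding) }))
    .filter((c) => c.score >= minScore)
    .sort((x, y) => y.score - x.score)
    .slice(0, topK);
}

const hits = searchWithThreshold([1, 0], [
  { content: 'billing FAQ', embedding: [0.9, 0.1] },
  { content: 'release notes', embedding: [0.1, 0.9] }
]);
console.log(hits.map((h) => h.content));
```

Without the threshold, a plain top-k query would return both chunks even though the second is nearly orthogonal to the query. With it, an empty result becomes an explicit signal to take the fallback path.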
Step 3: Orchestration, Routing & Guardrails
Production AI requires model routing, output validation, and circuit breakers.
import { z } from 'zod';

const TriageResponseSchema = z.object({
  category: z.enum(['billing', 'technical', 'account', 'general']),
  confidence: z.number().min(0).max(1),
  summary: z.string().max(200),
  requiresHuman: z.boolean()
});

export type TriageResponse = z.infer<typeof TriageResponseSchema>;

export async function orchestrateTriage(
  userQuery: string,
  context: string[]
): Promise<TriageResponse> {
  const prompt = buildPrompt(userQuery, context);
  // Tiered routing: the small model handles classification; a larger model is reserved for generation.
  const classification = await callModel('gpt-4o-mini', prompt, { temperature: 0.1 });
  // Assumption: callModel may return raw JSON text, so decode it before validating.
  let candidate: unknown = classification;
  if (typeof classification === 'string') {
    try {
      candidate = JSON.parse(classification);
    } catch {
      candidate = undefined; // malformed JSON fails validation below
    }
  }
  const parsed = TriageResponseSchema.safeParse(candidate);
  if (!parsed.success || parsed.data.confidence < 0.75) {
    // Fallback to rule-based triage or human queue
    return deterministicFallback(userQuery);
  }
  return parsed.data;
}
Architecture Decision: Structured output parsing via Zod prevents schema drift. Low-confidence triggers bypass generation and route to deterministic logic. Tiered routing reduces cost by 60–70% for classification-heavy workloads.
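This step names circuit breakers as a requirement, but the orchestration code does not show one. A minimal sketch follows, assuming a breaker that opens after a number of consecutive failures and permits requests again after a cooldown. `CircuitBreaker` and its default thresholds are hypothetical, and timestamps are passed in explicitly so the behaviour is easy to test.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold = 3, cooldownMs = 30000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // closed: normal operation; open: reject calls; half-open: cooldown elapsed, calls allowed again.
  state(now: number = Date.now()): BreakerState {
    if (this.openedAt === null) return 'closed';
    return now - this.openedAt >= this.cooldownMs ? 'half-open' : 'open';
  }

  // Check before calling the model; when false, go straight to the fallback.
  canRequest(now: number = Date.now()): boolean {
    return this.state(now) !== 'open';
  }

  // A success closes the breaker and resets the failure count.
  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  // Opens (or re-opens) the breaker once consecutive failures reach the threshold.
  recordFailure(now: number = Date.now()): void {
    this.failures += 1;
    if (this.failures >= this.failureThreshold) this.openedAt = now;
  }
}

const breaker = new CircuitBreaker(2, 1000);
breaker.recordFailure(0);
breaker.recordFailure(10); // second consecutive failure opens the breaker at t=10
console.log(breaker.state(500), breaker.state(1010));
```

In `orchestrateTriage`, one way to use this would be to check `canRequest()` before calling the model and return the deterministic fallback when it is false, recording a success or failure after each model call.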
Step 4: Evaluation & Observability
AI products require continuous evaluation, not one-off benchmarking.
export async function runEvaluationBatch(
  dataset: Array<{ query: string; expected: TriageResponse }>,
  pipeline: (q: string) => Promise<TriageResponse>
) {
  // Run every example through the pipeline and compare it with the expected label.
  const results = await Promise.all(
    dataset.map(async ({ query, expected }) => {
      const actual = await pipeline(query);
      return {
        query,
        expected,
        actual,
        categoryMatch: actual.category === expected.category,
        confidenceDelta: Math.abs(actual.confidence - expected.confidence)
      };
    })
  );
  // Guard against an empty dataset, which would otherwise report NaN.
  const accuracy =
    results.length > 0 ? results.filter((r) => r.categoryMatch).length / results.length : 0;
  console.log(`Evaluation accuracy: ${(accuracy * 100).toFixed(2)}%`);
  return results;
}
Architecture Decision: Evaluation runs against versioned datasets, not live traffic. Metrics track category accuracy, confidence drift, and fallback frequency. Results feed into CI/CD gates before prompt or model updates.
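The three metrics named here can be aggregated and compared against a stored baseline before a change is merged. The sketch below is one possible shape for that check. `EvalRow.usedFallback` is a hypothetical flag (the evaluation function above does not record it), and the tolerance of 0.02 is an arbitrary placeholder.

```typescript
interface EvalRow {
  categoryMatch: boolean;
  confidenceDelta: number;
  usedFallback: boolean; // hypothetical: true when the pipeline took the deterministic path
}

interface EvalSummary {
  accuracy: number;
  meanConfidenceDelta: number;
  fallbackRate: number;
}

// Aggregate per-example results into the three tracked metrics.
function summarizeEval(rows: EvalRow[]): EvalSummary {
  if (rows.length === 0) throw new Error('Evaluation dataset is empty');
  const n = rows.length;
  return {
    accuracy: rows.filter((r) => r.categoryMatch).length / n,
    meanConfidenceDelta: rows.reduce((sum, r) => sum + r.confidenceDelta, 0) / n,
    fallbackRate: rows.filter((r) => r.usedFallback).length / n
  };
}

// Return the list of regressions relative to the baseline; an empty list means the gate passes.
function findRegressions(current: EvalSummary, baseline: EvalSummary, tolerance = 0.02): string[] {
  const problems: string[] = [];
  if (current.accuracy < baseline.accuracy - tolerance) problems.push('accuracy regressed');
  if (current.meanConfidenceDelta > baseline.meanConfidenceDelta + tolerance) {
    problems.push('confidence drift increased');
  }
  if (current.fallbackRate > baseline.fallbackRate + tolerance) {
    problems.push('fallback rate increased');
  }
  return problems;
}

const evalSummary = summarizeEval([
  { categoryMatch: true, confidenceDelta: 0.25, usedFallback: false },
  { categoryMatch: true, confidenceDelta: 0.25, usedFallback: false },
  { categoryMatch: true, confidenceDelta: 0.5, usedFallback: false },
  { categoryMatch: false, confidenceDelta: 0, usedFallback: true }
]);
const baselineSummary: EvalSummary = { accuracy: 0.8, meanConfidenceDelta: 0.25, fallbackRate: 0.25 };
const regressions = findRegressions(evalSummary, baselineSummary);
console.log(evalSummary, regressions);
```

A CI job could fail the build whenever the returned list is non-empty, and store the new summary as the baseline once a change is accepted.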
Pitfall Guide
- Skipping Evaluation Baselines: Teams deploy without measuring accuracy, latency, or cost before and after changes. This creates blind optimization. Production AI requires a frozen evaluation set that runs on every prompt, model, or pipeline change. Without it, improvements are anecdotal and regressions go undetected.
- Ignoring Token Economics: Long-context windows reduce retrieval precision and inflate costs. Teams pass entire documents instead of semantic chunks. Production systems enforce token budgets, compress context, and use summarization pipelines for historical data. Unbounded context destroys cost predictability.
- Hardcoding Prompts Without Versioning: Prompts are code. Editing them directly in production creates unreproducible states. Version every prompt with metadata: model, temperature, date, author. Store in Git or a prompt registry. Rollback capability is non-negotiable.
- No Fallback or Human-in-the-Loop Path: Probabilistic systems will fail. Without deterministic fallbacks, users experience broken flows. Implement confidence thresholds, schema validation, and explicit routing to human agents or rule-based logic when AI uncertainty exceeds operational limits.
- Treating Vector Search as Semantic Truth: Embeddings capture statistical similarity, not intent. High cosine scores do not guarantee relevance. Production pipelines combine vector search with metadata filtering, recency weighting, and cross-encoder reranking. Relying solely on top-k similarity causes context pollution.
- Ignoring Data Drift: User queries, domain terminology, and knowledge bases evolve. Static embeddings degrade over time. Implement scheduled re-embedding, query distribution monitoring, and concept drift detection. Trigger pipeline retraining when similarity distributions shift beyond thresholds.
- Over-Engineering the AI Layer Before Validating Value: Teams build complex RAG pipelines, fine-tuning loops, and custom eval frameworks before proving user value. Start with a minimal viable AI feature: deterministic fallback, single model, basic retrieval. Validate retention, resolution rate, and cost per interaction before scaling complexity.
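As an illustration of the token-budget advice in the "Ignoring Token Economics" pitfall, the sketch below trims a ranked list of retrieved chunks to a fixed context budget. `estimateTokens` and `fitToBudget` are hypothetical helpers. The four-characters-per-token estimate is a rough heuristic for English text, not a tokenizer, so a real system would likely swap in the model's own tokenizer.

```typescript
// Rough estimate (assumption): about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Walk chunks in ranked order and keep each one that still fits; chunks that would overflow are skipped.
function fitToBudget(
  rankedChunks: string[],
  maxContextTokens: number
): { kept: string[]; usedTokens: number } {
  const kept: string[] = [];
  let usedTokens = 0;
  for (const chunk of rankedChunks) {
    const cost = estimateTokens(chunk);
    if (usedTokens + cost > maxContextTokens) continue;
    kept.push(chunk);
    usedTokens += cost;
  }
  return { kept, usedTokens };
}

const ranked = [
  'a'.repeat(40), // ~10 tokens
  'b'.repeat(100), // ~25 tokens, too large for the remaining budget
  'c'.repeat(20) // ~5 tokens
];
const budgeted = fitToBudget(ranked, 16);
console.log(budgeted.kept.length, budgeted.usedTokens);
```

Skipping an oversized chunk rather than stopping at it lets smaller, lower-ranked chunks still use the remaining budget; stopping at the first overflow is the stricter alternative if rank order matters more than coverage.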
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High throughput, low latency SLA (<500ms) | Small model routing + semantic cache + async embedding | Reduces inference time and token spend; cache hits bypass model calls | 60–75% reduction vs uniform large model |
| Compliance-heavy domain (finance, healthcare) | Structured output + guardrails + human fallback on confidence < 0.8 | Ensures auditability, prevents unvalidated generation, meets regulatory standards | 20–30% increase due to validation overhead and human routing |
| Budget-constrained MVP | Single model + deterministic fallback + keyword search | Minimizes infrastructure complexity; validates product-market fit before scaling | Lowest initial cost; scales predictably with traffic |
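The semantic cache in the first row can be sketched as a small in-memory structure: a lookup is a hit when a stored query embedding is similar enough to the incoming one and the entry has not expired. `SemanticCache`, the 0.95 similarity cutoff, and the toy embeddings are assumptions for illustration; a production cache would more likely live in a shared store.

```typescript
interface CacheEntry {
  embedding: number[];
  response: string;
  expiresAt: number;
}

// Cosine similarity; returns 0 if either vector has zero norm.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  private readonly ttlMs: number;
  private readonly minSimilarity: number;

  constructor(ttlMs: number, minSimilarity: number) {
    this.ttlMs = ttlMs;
    this.minSimilarity = minSimilarity;
  }

  set(embedding: number[], response: string, now: number = Date.now()): void {
    this.entries.push({ embedding, response, expiresAt: now + this.ttlMs });
  }

  // Return the response of the most similar unexpired entry, or undefined on a miss.
  get(embedding: number[], now: number = Date.now()): string | undefined {
    this.entries = this.entries.filter((e) => e.expiresAt > now); // evict expired entries
    let best: CacheEntry | undefined;
    let bestScore = this.minSimilarity;
    for (const entry of this.entries) {
      const score = cosine(embedding, entry.embedding);
      if (score >= bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best ? best.response : undefined;
  }
}

const cache = new SemanticCache(3600 * 1000, 0.95); // one-hour TTL
cache.set([1, 0], 'cached answer', 0);
const nearHit = cache.get([0.99, 0.05], 1000);
const farMiss = cache.get([0, 1], 1000);
const expired = cache.get([1, 0], 3600 * 1000 + 1);
console.log(nearHit, farMiss, expired);
```

A hit skips the model call entirely, which is where the cost and latency savings in the table come from; the similarity cutoff trades hit rate against the risk of serving an answer to a subtly different question.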
Configuration Template
// ai-pipeline.config.ts
export const AI_CONFIG = {
  models: {
    classifier: { id: 'gpt-4o-mini', maxTokens: 150, temperature: 0.1 },
    generator: { id: 'gpt-4o', maxTokens: 1024, temperature: 0.3 },
    embedding: { id: 'text-embedding-3-small', dimensions: 1536 }
  },
  routing: {
    confidenceThreshold: 0.75,
    fallbackToDeterministic: true,
    maxRetries: 2,
    retryBackoffMs: 1000
  },
  retrieval: {
    topK: 5,
    minSimilarity: 0.72,
    chunkMaxTokens: 512,
    semanticCacheTTL: 3600 // seconds
  },
  guardrails: {
    enableSchemaValidation: true,
    toxicityCheck: true,
    outputMaxLength: 2000,
    requireHumanOnLowConfidence: true
  },
  observability: {
    traceEnabled: true,
    metricsEndpoint: '/metrics/ai',
    logLevel: 'info',
    evaluationDatasetPath: './data/eval/triage-v1.json'
  }
};
Quick Start Guide
- Initialize Environment: Copy ai-pipeline.config.ts into your project root. Set OPENAI_API_KEY, VECTOR_DB_URL, and EVAL_DATASET_PATH in your .env file.
- Seed Vector Store: Run the chunking and embedding pipeline against your knowledge base. Execute upsertChunks() to populate the vector index with deduplicated, hashed entries.
- Start Orchestration Service: Deploy the routing and guardrail layer using the provided TypeScript template. Mount health checks at /health and metrics at /metrics/ai.
- Validate & Ship: Run the evaluation batch against your frozen dataset. Confirm accuracy exceeds your threshold, P95 latency stays within SLA, and fallback triggers remain below 15%. Deploy to production with circuit breakers and cache TTLs active.
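The final validation step can be expressed as a single pre-deploy check. The sketch below assumes latency samples and fallback counts have already been collected from an evaluation run. `percentile`, `readyToShip`, and the specific accuracy and latency limits are hypothetical; only the 15% fallback ceiling comes from the step above.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('No samples');
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

interface ReleaseMetrics {
  accuracy: number;
  latenciesMs: number[];
  fallbackCount: number;
  totalRequests: number;
}

interface ReleaseLimits {
  minAccuracy: number;
  p95LatencyMs: number;
  maxFallbackRate: number;
}

// Collect every violated limit so the report shows all problems at once.
function readyToShip(m: ReleaseMetrics, limits: ReleaseLimits): { ok: boolean; reasons: string[] } {
  const reasons: string[] = [];
  const p95 = percentile(m.latenciesMs, 95);
  const fallbackRate = m.totalRequests === 0 ? 0 : m.fallbackCount / m.totalRequests;
  if (m.accuracy < limits.minAccuracy) reasons.push('accuracy below threshold');
  if (p95 > limits.p95LatencyMs) reasons.push('P95 latency exceeds SLA');
  if (fallbackRate >= limits.maxFallbackRate) reasons.push('fallback rate too high');
  return { ok: reasons.length === 0, reasons };
}

const limits: ReleaseLimits = { minAccuracy: 0.85, p95LatencyMs: 500, maxFallbackRate: 0.15 };

// 19 fast requests and 1 slow one: the 95th percentile is still a fast request.
const healthyLatencies = new Array<number>(19).fill(100).concat([900]);
const healthy = readyToShip(
  { accuracy: 0.9, latenciesMs: healthyLatencies, fallbackCount: 2, totalRequests: 20 },
  limits
);

// 18 fast requests and 2 slow ones: the 95th percentile now lands on a slow request.
const slowLatencies = new Array<number>(18).fill(100).concat([900, 900]);
const degraded = readyToShip(
  { accuracy: 0.9, latenciesMs: slowLatencies, fallbackCount: 4, totalRequests: 20 },
  limits
);
console.log(healthy, degraded);
```

Returning the list of reasons rather than a bare boolean makes the gate's output usable directly in a deployment log or CI summary.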