eets a minimum relevance threshold. Otherwise, the system rewrites the query or falls back to a safe response.
4. Deterministic Retry Limits: Unbounded loops are a production risk. We enforce a maximum attempt counter and a clear exit path to a fallback generator.
Implementation
// types.ts

// Shared state threaded through every orchestration phase.
export interface RAGState {
  originalQuery: string;
  currentQuery: string;
  selectedKB: 'product' | 'ops' | 'faq' | 'none';
  retrievedDocs: string[];
  qualityScore: number;
  attempts: number;
  finalAnswer: string;
  path: string[];
}

export interface LLMClient {
  invoke(prompt: string, input: string): Promise<string>;
}

export interface VectorStore {
  search(query: string, topK: number): Promise<string[]>;
}

// orchestrator.ts
import { RAGState, LLMClient, VectorStore } from './types';

const QUALITY_THRESHOLD = 0.6;
const MAX_RETRIES = 2;

export class KnowledgeOrchestrator {
  constructor(
    private llm: LLMClient,
    private stores: Record<string, VectorStore>,
  ) {}

  async execute(query: string): Promise<RAGState> {
    const state: RAGState = {
      originalQuery: query,
      currentQuery: query,
      selectedKB: 'none',
      retrievedDocs: [],
      qualityScore: 0,
      attempts: 0,
      finalAnswer: '',
      path: ['start'],
    };

    // Phase 1: Retrieval Decision
    const needsRetrieval = await this.evaluateIntent(state);
    if (!needsRetrieval) {
      state.path.push('skip_retrieval');
      state.finalAnswer = await this.generateDirect(state);
      return state;
    }

    // Phase 2: Multi-KB Routing
    state.selectedKB = await this.routeQuery(state);
    state.path.push(`route_to_${state.selectedKB}`);

    // Phase 3: Retrieval & Quality Gating Loop
    // attempts starts at 0, so this allows one initial retrieval plus up to MAX_RETRIES rewrites.
    while (state.attempts <= MAX_RETRIES) {
      state.retrievedDocs = await this.fetchContext(state);
      state.qualityScore = await this.evaluateQuality(state);
      state.path.push(`attempt_${state.attempts}_score_${state.qualityScore.toFixed(2)}`);

      if (state.qualityScore >= QUALITY_THRESHOLD) {
        break; // Quality sufficient
      }

      if (state.attempts < MAX_RETRIES) {
        state.currentQuery = await this.rewriteQuery(state);
        state.attempts++;
        state.path.push('rewrite_and_retry');
      } else {
        state.path.push('max_retries_reached');
        break;
      }
    }

    // Phase 4: Generation
    state.finalAnswer = await this.generateWithContext(state);
    return state;
  }

  private async evaluateIntent(state: RAGState): Promise<boolean> {
    // Prompt lines stay flush left so no indentation leaks into the template literal.
    const prompt = `Determine if the following question requires external documentation.
Return ONLY "true" or "false".
Criteria for true: proprietary pricing, internal deployment steps, account policies, specific service limits.
Criteria for false: general programming syntax, arithmetic, common knowledge, standard algorithms.
Question: ${state.originalQuery}`;
    const result = await this.llm.invoke(prompt, '');
    return result.toLowerCase().includes('true');
  }

  private async routeQuery(state: RAGState): Promise<'product' | 'ops' | 'faq'> {
    const prompt = `Select the most appropriate knowledge base for this query.
Options: product, ops, faq.
Return ONLY the option name.
product: features, pricing, supported models, security certifications.
ops: deployment, troubleshooting, monitoring, backups, infrastructure.
faq: accounts, passwords, refunds, invoices, API keys, billing.
Question: ${state.originalQuery}`;
    const raw = await this.llm.invoke(prompt, '');
    // Match whole words only, so a reply containing "stops" or "drops" is not read as "ops".
    const match = raw.toLowerCase().match(/\b(product|ops|faq)\b/);
    // Default to 'product' when the reply names no known knowledge base.
    return match ? (match[1] as 'product' | 'ops' | 'faq') : 'product';
  }

  private async fetchContext(state: RAGState): Promise<string[]> {
    const store = this.stores[state.selectedKB];
    // No store registered for this KB: return no context instead of throwing.
    if (!store) return [];
    return store.search(state.currentQuery, 3);
  }

  private async evaluateQuality(state: RAGState): Promise<number> {
    const context = state.retrievedDocs.join('\n');
    const prompt = `Rate the relevance of the following context to the user's question on a scale of 0.0 to 1.0.
Return ONLY a decimal number.
Question: ${state.originalQuery}
Context: ${context}`;
    const scoreStr = await this.llm.invoke(prompt, '');
    const parsed = parseFloat(scoreStr);
    // Unparseable replies count as 0.0; valid scores are clamped to [0.0, 1.0].
    return isNaN(parsed) ? 0.0 : Math.min(1.0, Math.max(0.0, parsed));
  }

  private async rewriteQuery(state: RAGState): Promise<string> {
    // Include the last search query so a second rewrite does not repeat the first.
    const prompt = `The previous retrieval failed to find relevant documents.
Rewrite the following question into a more specific search query.
Preserve the original intent but add domain-specific keywords.
Output ONLY the rewritten query.
Previous search query: ${state.currentQuery}
Question: ${state.originalQuery}`;
    return this.llm.invoke(prompt, '');
  }

  private async generateDirect(state: RAGState): Promise<string> {
    return this.llm.invoke(
      'You are a technical assistant. Answer directly without external references.',
      state.originalQuery,
    );
  }

  private async generateWithContext(state: RAGState): Promise<string> {
    const context = state.retrievedDocs.join('\n');
    const prompt = `Answer the question using ONLY the provided reference material.
If the context does not contain sufficient information, state that explicitly.
Reference: ${context}
Question: ${state.originalQuery}`;
    return this.llm.invoke(prompt, '');
  }
}
Architectural Rationale
- Typed State Object: Prevents implicit data flow bugs. Every phase mutates a single source of truth, making it trivial to log execution paths and debug routing failures.
- Separation of Concerns: evaluateIntent, routeQuery, and evaluateQuality are isolated. This allows you to swap LLM providers, adjust thresholds, or inject cached routing decisions without touching the retrieval logic.
- Quality-Driven Loop: The retry mechanism only triggers when relevance drops below 0.6. This prevents unnecessary rewrites for marginally relevant results while ensuring vague queries get a second chance before generation.
- Explicit Fallback: When MAX_RETRIES is exhausted, the system proceeds to generation with whatever context exists. This avoids deadlocks and guarantees a response, which is critical for user-facing interfaces.
Pitfall Guide
1. The "All-Context" Fallacy
Explanation: Developers often inject every retrieved document into the prompt, assuming more context improves accuracy. In reality, irrelevant documents increase cognitive load on the model and raise hallucination rates.
Fix: Implement strict top-k limits and relevance scoring. Only inject documents that pass the quality gate. Use chunk-level filtering rather than document-level injection.
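The fix above can be sketched as a chunk-level filter that runs before anything reaches the prompt. This is a minimal illustration, assuming each chunk carries a relevance score in [0, 1]; the `ScoredChunk` type and `selectChunks` helper are hypothetical, since the VectorStore interface in this article returns plain strings.

```typescript
// Hypothetical shape: a retrieved chunk with a relevance score in [0, 1].
interface ScoredChunk {
  text: string;
  score: number;
}

// Keep only chunks at or above minScore, then cap the result at topK,
// highest score first.
function selectChunks(chunks: ScoredChunk[], minScore: number, topK: number): ScoredChunk[] {
  return chunks
    .filter((c) => c.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

// Illustrative data only.
const selected = selectChunks(
  [
    { text: 'Refund policy', score: 0.82 },
    { text: 'Office hours', score: 0.21 },
    { text: 'Invoice schedule', score: 0.67 },
    { text: 'Billing contacts', score: 0.64 },
  ],
  0.6,
  2,
);
```

Filtering before the top-k cap means a low-scoring chunk never fills a slot just because fewer than k good chunks exist.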
2. Routing Prompt Drift
Explanation: Single-turn routing prompts without boundary examples cause the LLM to misclassify queries based on superficial keyword matches (e.g., routing "company invoice" to infrastructure instead of billing).
Fix: Include explicit boundary definitions and few-shot examples in the routing prompt. Calibrate the prompt with 5–10 misclassified queries from your logs.
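One way to fold boundary examples into the routing prompt is sketched below. The `RoutingExample` type, the `buildRoutingPrompt` helper, and the sample queries are all hypothetical; in practice the examples would come from misclassified queries in your own logs.

```typescript
type KB = 'product' | 'ops' | 'faq';

interface RoutingExample {
  query: string;
  kb: KB;
}

// Hypothetical boundary cases; replace with misclassified queries from your logs.
const BOUNDARY_EXAMPLES: RoutingExample[] = [
  { query: 'Where can I download my company invoice?', kb: 'faq' },
  { query: 'How do I rotate an API key?', kb: 'faq' },
  { query: 'Backups fail after the latest deployment', kb: 'ops' },
  { query: 'Which security certifications do you hold?', kb: 'product' },
];

// Build a few-shot routing prompt that ends with an open "Answer:" slot.
function buildRoutingPrompt(question: string, examples: RoutingExample[]): string {
  const shots = examples
    .map((e) => `Question: ${e.query}\nAnswer: ${e.kb}`)
    .join('\n\n');
  return [
    'Select the most appropriate knowledge base for this query.',
    'Answer with exactly one word: product, ops, or faq.',
    'product: features, pricing, supported models, security certifications.',
    'ops: deployment, troubleshooting, monitoring, backups, infrastructure.',
    'faq: accounts, passwords, refunds, invoices, API keys, billing.',
    '',
    shots,
    '',
    `Question: ${question}`,
    'Answer:',
  ].join('\n');
}

const routingPrompt = buildRoutingPrompt('Why was my card charged twice?', BOUNDARY_EXAMPLES);
```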
3. Hardcoded Quality Thresholds
Explanation: Using a fixed 0.6 threshold across all domains ignores semantic density differences. Technical runbooks may naturally score lower on vector similarity but still contain the correct answer.
Fix: Maintain domain-specific thresholds. Run a calibration dataset against your vector store to establish baseline scores per KB. Adjust thresholds based on empirical recall/precision curves.
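A calibration pass could look roughly like the sketch below. It assumes you have a labeled set of (score, relevant) pairs per KB and picks the threshold that maximizes F1; the `CalibrationSample` type, the helper names, and the sample scores are illustrative assumptions, and F1 is only one possible criterion for reading a recall/precision curve.

```typescript
// Hypothetical calibration record: a quality score plus a human relevance label.
interface CalibrationSample {
  score: number;
  relevant: boolean;
}

// F1 of the rule "accept when score >= threshold" over the labeled samples.
function f1At(samples: CalibrationSample[], threshold: number): number {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (const s of samples) {
    const accepted = s.score >= threshold;
    if (accepted && s.relevant) tp += 1;
    else if (accepted && !s.relevant) fp += 1;
    else if (!accepted && s.relevant) fn += 1;
  }
  const denom = 2 * tp + fp + fn;
  return denom === 0 ? 0 : (2 * tp) / denom;
}

// Try every observed score as a candidate threshold and keep the lowest one
// with the best F1. Returns 0 when there are no samples.
function calibrateThreshold(samples: CalibrationSample[]): number {
  const candidates = Array.from(new Set(samples.map((s) => s.score))).sort((a, b) => a - b);
  let bestThreshold = 0;
  let bestF1 = -1;
  for (const t of candidates) {
    const f1 = f1At(samples, t);
    if (f1 > bestF1) {
      bestF1 = f1;
      bestThreshold = t;
    }
  }
  return bestThreshold;
}

// Illustrative scores for a single KB; not measured data.
const opsSamples: CalibrationSample[] = [
  { score: 0.35, relevant: false },
  { score: 0.45, relevant: false },
  { score: 0.52, relevant: true },
  { score: 0.58, relevant: true },
  { score: 0.7, relevant: true },
];
const opsThreshold = calibrateThreshold(opsSamples);
```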
4. Silent Retry Loops
Explanation: Without explicit attempt tracking and exit conditions, quality gating can trigger infinite rewrite cycles, exhausting rate limits and inflating costs.
Fix: Enforce MAX_RETRIES at the state level. Log every rewrite iteration. Implement circuit-breaker logic that bypasses retrieval entirely after consecutive failures.
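The circuit-breaker idea can be sketched as a small counter that opens after several consecutive failed requests and closes again after a cool-down. The class name, the failure limit, and the cool-down are assumptions for illustration; timestamps are passed in explicitly so the logic stays easy to test.

```typescript
// Hypothetical breaker: opens after `failureLimit` consecutive failures and
// lets requests through again once `coolDownMs` has elapsed.
class RetrievalCircuitBreaker {
  private consecutiveFailures = 0;
  private openedAt: number | null = null;
  private readonly failureLimit: number;
  private readonly coolDownMs: number;

  constructor(failureLimit: number, coolDownMs: number) {
    this.failureLimit = failureLimit;
    this.coolDownMs = coolDownMs;
  }

  // Call when a request passes the quality gate.
  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  // Call when a request exhausts its retries without passing the gate.
  recordFailure(now: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureLimit) {
      this.openedAt = now;
    }
  }

  // While open, callers should skip retrieval and return the fallback response.
  isOpen(now: number): boolean {
    if (this.openedAt === null) return false;
    return now - this.openedAt < this.coolDownMs;
  }
}

const breaker = new RetrievalCircuitBreaker(3, 1000);
breaker.recordFailure(0);
breaker.recordFailure(1);
breaker.recordFailure(2);
const openDuringCoolDown = breaker.isOpen(500);
const openAfterCoolDown = breaker.isOpen(1500);
```

After the cool-down the next failure reopens the breaker immediately, because the failure count is only reset by a success.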
5. Context Window Bloat from Rewrites
Explanation: Accumulating rewritten queries and failed retrieval attempts in the prompt history degrades performance and increases token spend.
Fix: Keep the generation prompt isolated from the orchestration state. Pass only the final selected context and original query to the generator. Never include routing metadata in the system prompt.
6. Ignoring Retrieval Latency in SLAs
Explanation: Agentic RAG introduces multiple LLM calls (intent, routing, quality, rewrite). Teams often underestimate the cumulative latency impact.
Fix: Implement parallel execution where possible (e.g., route and intent evaluation can sometimes be merged). Cache routing decisions for recurring query patterns. Set explicit timeout boundaries for each phase.
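A per-phase timeout boundary might be sketched with Promise.race, as below. The `withTimeout` helper and the `demo` function are hypothetical; note that this only stops waiting on the call and does not cancel the underlying request.

```typescript
// Reject if `work` does not settle within `ms` milliseconds.
// This stops waiting but does not cancel the underlying request.
function withTimeout<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms}ms`)), ms);
  });
  return Promise.race([work, timeout]).finally(() => {
    // Clear the timer so a fast call leaves nothing pending.
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Usage sketch: a fast routing call and a stand-in for a stalled quality check.
async function demo(): Promise<string[]> {
  const outcomes: string[] = [];
  outcomes.push(await withTimeout(Promise.resolve('ops'), 50, 'routing'));
  try {
    await withTimeout(new Promise<string>(() => {}), 10, 'quality');
  } catch (err) {
    outcomes.push((err as Error).message);
  }
  return outcomes;
}
```

Each orchestrator phase could be wrapped this way with its own budget, so one stalled LLM call cannot consume the whole request's latency allowance.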
7. Missing Fallback Semantics
Explanation: When quality gating fails and retries are exhausted, the system may generate a confident but incorrect answer, damaging user trust.
Fix: Program explicit uncertainty responses. If qualityScore < 0.4 after max retries, return a structured fallback: "I couldn't locate specific documentation for this. Please verify the query or contact support."
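The fallback rule above can be expressed as a small decision function that runs between the gating loop and generation. This is a sketch under the assumption that attempts equals MAX_RETRIES once retries are exhausted, as in the orchestrator shown earlier; `GateOutcome`, `NextStep`, and `decideNextStep` are hypothetical names.

```typescript
const MAX_RETRIES = 2;
const FALLBACK_SCORE = 0.4;
const FALLBACK_MESSAGE =
  "I couldn't locate specific documentation for this. Please verify the query or contact support.";

// Hypothetical summary of the gating loop's result.
interface GateOutcome {
  qualityScore: number;
  attempts: number;
}

type NextStep = { kind: 'generate' } | { kind: 'fallback'; message: string };

// Return the structured fallback only when retries are exhausted and the
// final score is still below the fallback cutoff; otherwise generate as usual.
function decideNextStep(outcome: GateOutcome): NextStep {
  const retriesExhausted = outcome.attempts >= MAX_RETRIES;
  if (retriesExhausted && outcome.qualityScore < FALLBACK_SCORE) {
    return { kind: 'fallback', message: FALLBACK_MESSAGE };
  }
  return { kind: 'generate' };
}

const weakResult = decideNextStep({ qualityScore: 0.3, attempts: 2 });
const marginalResult = decideNextStep({ qualityScore: 0.5, attempts: 2 });
```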
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume general queries | Agentic intent skip | Avoids unnecessary retrieval hops and token spend | ↓ 40-55% token cost |
| Strict compliance/audit requirements | Static pipeline with full context logging | Ensures deterministic, traceable retrieval for every query | ↑ 20-30% storage/compute |
| Multi-tenant KBs with overlapping domains | Agentic routing + quality gating | Prevents cross-domain contamination and improves answer precision | Neutral (offset by latency gains) |
| Low-latency UI (<500ms target) | Cached routing + parallel retrieval | Reduces sequential LLM calls; uses precomputed intent mappings | ↓ 30% latency, ↑ cache infra cost |
| Highly dynamic documentation | Agentic rewrite + adaptive thresholds | Handles vague queries and evolving terminology without manual prompt updates | ↑ 10-15% LLM calls, ↑ accuracy |
Configuration Template
// config.ts
export const RAGConfig = {
  thresholds: {
    product: 0.65,
    ops: 0.55,
    faq: 0.60,
    globalFallback: 0.40,
  },
  limits: {
    maxRetries: 2,
    topK: 3,
    maxContextTokens: 4000,
    timeoutMs: 3000,
  },
  routing: {
    product: ['pricing', 'features', 'models', 'security', 'certifications'],
    ops: ['deployment', 'docker', 'monitoring', 'backup', 'troubleshooting'],
    faq: ['account', 'password', 'refund', 'invoice', 'billing', 'api_key'],
  },
  fallback: {
    enabled: true,
    message: 'Unable to locate relevant documentation. Please refine your query or contact support.',
    logLevel: 'warn',
  },
};
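As a rough sketch of how the per-KB thresholds in this template might replace the single QUALITY_THRESHOLD constant, the gate can look up the cutoff for the routed knowledge base. The `RoutedKB` type and `passesGate` helper are assumptions, and the threshold values are copied from the template above so the snippet stands alone.

```typescript
type RoutedKB = 'product' | 'ops' | 'faq';

// Values copied from the configuration template; inlined to keep the sketch self-contained.
const thresholds: Record<RoutedKB, number> = {
  product: 0.65,
  ops: 0.55,
  faq: 0.6,
};

// A retrieval passes the gate when its score meets the cutoff for its own KB.
function passesGate(kb: RoutedKB, qualityScore: number): boolean {
  return qualityScore >= thresholds[kb];
}

// The same score can pass in one domain and trigger a rewrite in another.
const opsPasses = passesGate('ops', 0.58);
const productPasses = passesGate('product', 0.58);
```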
Quick Start Guide
- Initialize State & Dependencies: Instantiate the KnowledgeOrchestrator with your LLM client and vector store mappings. Ensure each KB has a dedicated index.
- Calibrate Routing Prompts: Run 20–30 representative queries through the routing node. Log misclassifications and inject boundary examples into the routing prompt until accuracy exceeds 85%.
- Set Quality Thresholds: Execute a baseline retrieval sweep across all KBs. Record average similarity scores and set domain-specific thresholds in the configuration file.
- Deploy with Telemetry: Wrap the orchestrator in a tracing middleware. Log path, attempts, qualityScore, and latency for every execution. Set up alerts for retry rates exceeding 15%.
- Validate Fallback Behavior: Intentionally submit vague or out-of-domain queries. Verify that the system respects MAX_RETRIES, triggers the fallback message, and does not generate hallucinated content.
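The retry-rate alert from the telemetry step could be computed along the lines of the sketch below. It assumes a retry means any execution with attempts greater than zero; the `ExecutionRecord` type, the helper names, and the sample records are illustrative, not output from a real deployment.

```typescript
// Hypothetical telemetry record, mirroring the fields the guide says to log.
interface ExecutionRecord {
  path: string[];
  attempts: number;
  qualityScore: number;
  latencyMs: number;
}

const RETRY_ALERT_RATE = 0.15;

// Fraction of executions that needed at least one rewrite.
function retryRate(records: ExecutionRecord[]): number {
  if (records.length === 0) return 0;
  const retried = records.filter((r) => r.attempts > 0).length;
  return retried / records.length;
}

function shouldAlert(records: ExecutionRecord[]): boolean {
  return retryRate(records) > RETRY_ALERT_RATE;
}

// Illustrative records only.
const sampleRecords: ExecutionRecord[] = [
  { path: ['start', 'route_to_faq'], attempts: 0, qualityScore: 0.81, latencyMs: 420 },
  { path: ['start', 'route_to_ops'], attempts: 1, qualityScore: 0.62, latencyMs: 910 },
  { path: ['start', 'skip_retrieval'], attempts: 0, qualityScore: 0, latencyMs: 180 },
  { path: ['start', 'route_to_product'], attempts: 0, qualityScore: 0.74, latencyMs: 390 },
];
const sampleRetryRate = retryRate(sampleRecords);
```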