t.
3. Deterministic Fallback Triggers: When evaluation scores fall below a threshold, the system does not guess. It executes predefined recovery paths: query reformulation, external search fallback, or multi-hop decomposition.
4. State Persistence: Each pipeline stage writes its output to a shared context object. This enables debugging, audit trails, and conditional branching without losing intermediate reasoning steps.
Implementation (TypeScript)
The following implementation demonstrates a state-driven orchestrator. It uses a typed state machine pattern to manage routing, evaluation, and fallback logic. The structure is framework-agnostic but mirrors the node-based execution model popularized by LangGraph.
// RelevanceScorer, ExternalSearchFallback, ResponseSynthesizer, VectorStore, and
// OrchestratorConfig are assumed to be defined elsewhere in your codebase.

// Scores below this value trigger the fallback path.
const FALLBACK_THRESHOLD = 0.65;

interface PipelineState {
  query: string;
  route: 'direct' | 'single' | 'iterative';
  retrievedDocs: string[];
  relevanceScore: number;
  fallbackTriggered: boolean;
  finalResponse: string;
  metadata: Record<string, unknown>;
}

const VALID_ROUTES: ReadonlyArray<PipelineState['route']> = ['direct', 'single', 'iterative'];

class RetrievalOrchestrator {
  private readonly evaluator: RelevanceScorer;
  private readonly fallbackEngine: ExternalSearchFallback;
  private readonly generator: ResponseSynthesizer;
  private readonly vectorStore: VectorStore;

  constructor(config: OrchestratorConfig) {
    this.evaluator = new RelevanceScorer(config.evaluationModel);
    this.fallbackEngine = new ExternalSearchFallback(config.fallbackProvider);
    this.generator = new ResponseSynthesizer(config.generationModel);
    // Assumption: the config supplies a ready-to-use vector store client.
    this.vectorStore = config.vectorStore;
  }

  async execute(initialQuery: string): Promise<PipelineState> {
    const state: PipelineState = {
      query: initialQuery,
      route: 'direct',
      retrievedDocs: [],
      relevanceScore: 0,
      fallbackTriggered: false,
      finalResponse: '',
      metadata: { startedAt: Date.now() }
    };

    // Stage 1: Query Classification
    state.route = await this.classifyQuery(initialQuery);

    // Stage 2: Conditional Retrieval (direct queries skip retrieval entirely)
    if (state.route !== 'direct') {
      state.retrievedDocs = await this.fetchDocuments(initialQuery, state.route);

      // Stage 3: Relevance Evaluation
      state.relevanceScore = await this.evaluator.score(state.query, state.retrievedDocs);

      // Stage 4: Fallback or Refinement
      if (state.relevanceScore < FALLBACK_THRESHOLD) {
        state.fallbackTriggered = true;
        state.retrievedDocs = await this.fallbackEngine.execute(initialQuery);
        // Re-score so downstream stages see the quality of the fallback results.
        state.relevanceScore = await this.evaluator.score(state.query, state.retrievedDocs);
      }
    }

    // Stage 5: Generation
    state.finalResponse = await this.generator.synthesize(
      state.query,
      state.retrievedDocs,
      { route: state.route, score: state.relevanceScore }
    );
    state.metadata.completedAt = Date.now();
    return state;
  }

  private async classifyQuery(query: string): Promise<PipelineState['route']> {
    const prompt = `Analyze the following query and return ONLY one of: direct, single, iterative.
Criteria:
- direct: Parametric knowledge, simple facts, no external context needed.
- single: Requires one retrieval pass, straightforward factual lookup.
- iterative: Multi-hop reasoning, complex synthesis, or ambiguous phrasing.
Query: "${query}"`;
    const response = await this.callLLM(prompt);
    const candidate = response.trim().toLowerCase();
    // Validate the model output instead of casting it blindly.
    // Defaulting to 'single' on malformed output is an assumption; choose what suits your domain.
    const match = VALID_ROUTES.find(route => route === candidate);
    return match ?? 'single';
  }

  private async fetchDocuments(query: string, route: PipelineState['route']): Promise<string[]> {
    // Iterative queries get a larger candidate pool than single-pass queries.
    const limit = route === 'iterative' ? 8 : 4;
    // Simulated vector retrieval with metadata filtering
    const results = await this.vectorStore.search(query, { limit, filters: { recency: 'last_90_days' } });
    return results.map(doc => doc.content);
  }

  private async callLLM(prompt: string): Promise<string> {
    // Placeholder for an LLM API call. Until it is replaced, every query is classified as 'direct'.
    return 'direct';
  }
}
Why This Structure Works
- Decoupled Stages: Each phase (routing, fetching, scoring, fallback, generation) operates independently. This enables unit testing, metric tracking, and hot-swapping of components without rewriting the entire pipeline.
- Explicit Thresholds: The 0.65 relevance score acts as a circuit breaker. Instead of silently passing low-quality chunks to the generator, the system triggers a deterministic recovery path.
- Route-Aware Fetching: Iterative queries fetch more documents per pass (8 instead of 4), while direct queries bypass retrieval entirely. This aligns compute allocation with actual information needs.
- State Transparency: The PipelineState object carries metadata through every stage. Production monitoring can track route distribution, fallback frequency, and score distributions without instrumenting each service separately.
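To make the monitoring point concrete, here is a minimal sketch of aggregating finished runs into route counts, fallback rate, and mean relevance. `RunRecord` and `summarizeRuns` are hypothetical names introduced for illustration; `RunRecord` is just the subset of PipelineState fields that the summary needs.

```typescript
// Hypothetical subset of PipelineState used for monitoring.
interface RunRecord {
  route: 'direct' | 'single' | 'iterative';
  relevanceScore: number;
  fallbackTriggered: boolean;
}

interface RunSummary {
  routeCounts: Record<RunRecord['route'], number>;
  fallbackRate: number;  // fraction of retrieval runs that fell back
  meanRelevance: number; // mean final score over retrieval runs only
}

function summarizeRuns(runs: RunRecord[]): RunSummary {
  const routeCounts: RunSummary['routeCounts'] = { direct: 0, single: 0, iterative: 0 };
  let retrievalRuns = 0;
  let fallbacks = 0;
  let scoreSum = 0;

  for (const run of runs) {
    routeCounts[run.route] += 1;
    // Direct runs skip retrieval, so they never score or fall back.
    if (run.route === 'direct') continue;
    retrievalRuns += 1;
    scoreSum += run.relevanceScore;
    if (run.fallbackTriggered) fallbacks += 1;
  }

  return {
    routeCounts,
    fallbackRate: retrievalRuns === 0 ? 0 : fallbacks / retrievalRuns,
    meanRelevance: retrievalRuns === 0 ? 0 : scoreSum / retrievalRuns,
  };
}
```

Excluding direct runs from the averages keeps their placeholder score of 0 from dragging the mean down.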
Pitfall Guide
1. Similarity Threshold Myopia
Explanation: Relying solely on cosine similarity or distance metrics to filter documents ignores semantic utility. A chunk can score 0.92 similarity while containing zero actionable information for the specific query.
Fix: Implement a secondary evaluation layer that scores documents against the query using an LLM or cross-encoder. Treat vector similarity as a pre-filter, not a final verdict.
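A minimal sketch of that two-stage filter follows. The `scoreUtility` callback stands in for whatever LLM or cross-encoder call you use; it, the `Candidate` shape, and the default floors (0.72 and 0.65, borrowed from the configuration values in this article) are assumptions for illustration.

```typescript
interface Candidate {
  id: string;
  content: string;
  similarity: number; // vector similarity reported by the store
}

// Hypothetical stand-in for an LLM or cross-encoder relevance call.
type UtilityScorer = (query: string, content: string) => Promise<number>;

async function twoStageFilter(
  query: string,
  candidates: Candidate[],
  scoreUtility: UtilityScorer,
  similarityFloor = 0.72,
  utilityFloor = 0.65,
): Promise<Candidate[]> {
  // Stage 1: cheap pre-filter on vector similarity.
  const shortlisted = candidates.filter(c => c.similarity >= similarityFloor);

  // Stage 2: score only the survivors with the more expensive evaluator.
  const scored = await Promise.all(
    shortlisted.map(async c => ({ candidate: c, utility: await scoreUtility(query, c.content) })),
  );

  // Keep useful documents, best first.
  return scored
    .filter(s => s.utility >= utilityFloor)
    .sort((a, b) => b.utility - a.utility)
    .map(s => s.candidate);
}
```

Running the expensive scorer only on the shortlist keeps the extra cost proportional to the number of plausible candidates rather than the full result set.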
2. Evaluation Prompt Drift
Explanation: Relevance scoring prompts often lack explicit output schemas or grounding instructions. The evaluator begins returning subjective judgments or hallucinated scores, corrupting the fallback logic.
Fix: Enforce structured outputs (JSON schema) for all evaluation steps. Define the fields explicitly: score: number (0-1), reason: string, grounded_citations: string[]. Validate responses against the schema before proceeding.
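One way to enforce this without a schema library is a hand-written guard that rejects anything malformed. The sketch below assumes the evaluator returns a raw JSON string with the three fields named above; `parseEvaluation` is a hypothetical helper, and returning null (rather than throwing) is a design choice so the caller can decide whether to retry or fall back.

```typescript
interface Evaluation {
  score: number;
  reason: string;
  grounded_citations: string[];
}

// Returns a validated Evaluation, or null when the payload is malformed.
function parseEvaluation(raw: string): Evaluation | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== 'object' || data === null) return null;

  const { score, reason, grounded_citations } = data as Record<string, unknown>;

  // Score must be a finite number inside [0, 1].
  if (typeof score !== 'number' || !Number.isFinite(score) || score < 0 || score > 1) return null;
  if (typeof reason !== 'string') return null;
  if (!Array.isArray(grounded_citations)) return null;

  const citations: string[] = [];
  for (const item of grounded_citations) {
    if (typeof item !== 'string') return null;
    citations.push(item);
  }
  return { score, reason, grounded_citations: citations };
}
```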
3. Recursive Loop Traps
Explanation: Agentic or iterative pipelines can enter infinite refinement cycles when the evaluator consistently returns ambiguous scores. The system rewrites queries, fetches again, and repeats until token limits are exhausted.
Fix: Implement a hard iteration cap (typically 3-5 cycles). Track query entropy or document overlap between cycles. If overlap exceeds 80% or the iteration limit is reached, force a fallback to external search or return a confidence-weighted partial response.
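The stop condition can be sketched as a small pure function. Here overlap is measured as Jaccard similarity over document IDs, which is one reasonable reading of "document overlap" rather than the only one; `overlapRatio`, `shouldStop`, and the defaults (3 iterations, 0.8 tolerance) are illustrative names and values.

```typescript
// Jaccard overlap between the document IDs of two retrieval cycles.
function overlapRatio(previous: string[], current: string[]): number {
  const prev = new Set(previous);
  const curr = new Set(current);
  let shared = 0;
  curr.forEach(id => {
    if (prev.has(id)) shared += 1;
  });
  const unionSize = prev.size + curr.size - shared;
  // Two empty result sets are treated as identical.
  return unionSize === 0 ? 1 : shared / unionSize;
}

type StopReason = 'continue' | 'iteration_cap' | 'stagnant_results';

function shouldStop(
  iteration: number, // 1-based count of completed refinement cycles
  previousIds: string[],
  currentIds: string[],
  maxIterations = 3,
  overlapTolerance = 0.8,
): StopReason {
  if (iteration >= maxIterations) return 'iteration_cap';
  // Refinement that keeps returning the same documents is not making progress.
  if (overlapRatio(previousIds, currentIds) > overlapTolerance) return 'stagnant_results';
  return 'continue';
}
```

Any result other than 'continue' would hand control to the fallback path or return a partial response.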
4. Token Budget Ignorance
Explanation: Advanced RAG pipelines multiply LLM calls (routing, evaluation, fallback, generation). Without explicit token accounting, costs scale unpredictably and context windows overflow.
Fix: Define a strict token budget per pipeline stage. Use chunk compression techniques (e.g., selective extraction, summary condensation) before generation. Monitor cumulative token usage and implement graceful degradation when thresholds are approached.
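A rough sketch of per-query token accounting is shown below. It assumes the caller already knows how many tokens each call consumed (for example from the provider's usage metadata); the `TokenBudget` class, the stage names, and the 80% degradation ratio are hypothetical.

```typescript
type Stage = 'routing' | 'evaluation' | 'fallback' | 'generation';
type BudgetStatus = 'ok' | 'degrade' | 'exhausted';

class TokenBudget {
  private used = 0;
  private readonly perStage = new Map<Stage, number>();
  private readonly total: number;
  private readonly degradeRatio: number;

  constructor(total: number, degradeRatio = 0.8) {
    this.total = total;
    this.degradeRatio = degradeRatio;
  }

  // Record tokens consumed by one call in the given stage.
  record(stage: Stage, tokens: number): void {
    this.used += tokens;
    this.perStage.set(stage, (this.perStage.get(stage) ?? 0) + tokens);
  }

  usedBy(stage: Stage): number {
    return this.perStage.get(stage) ?? 0;
  }

  remaining(): number {
    return Math.max(0, this.total - this.used);
  }

  // 'degrade' signals that later stages should compress context or skip optional calls.
  status(): BudgetStatus {
    if (this.used >= this.total) return 'exhausted';
    if (this.used >= this.total * this.degradeRatio) return 'degrade';
    return 'ok';
  }
}
```

Checking `status()` before each stage lets the pipeline compress chunks or skip a refinement pass instead of overflowing the context window.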
5. Stale Vector Index Blindness
Explanation: Vector databases treat all embeddings equally regardless of creation date. Outdated API references, deprecated configuration guides, or superseded policies rank alongside current documentation.
Fix: Inject temporal metadata into vector documents. Apply recency weighting during retrieval or implement a time-decay function in the scoring layer. Regularly run archival jobs to downrank or remove documents older than a defined freshness threshold.
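As an illustration of a time-decay function, the sketch below multiplies similarity by an exponential factor with a configurable half-life. The 90-day default merely echoes the recency filter used earlier and is an assumption, as is the `decayedScore` name; tune the half-life to how quickly your content goes stale.

```typescript
const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Halves a document's effective score every `halfLifeDays` of age.
function decayedScore(
  similarity: number,
  createdAt: Date,
  now: Date,
  halfLifeDays = 90,
): number {
  // Clamp at zero so documents dated in the future are not boosted.
  const ageDays = Math.max(0, (now.getTime() - createdAt.getTime()) / MS_PER_DAY);
  return similarity * Math.pow(0.5, ageDays / halfLifeDays);
}
```

With these defaults, a document that is 90 days old keeps half of its similarity score and one that is 180 days old keeps a quarter.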
6. Over-Agentic Routing
Explanation: Granting the LLM full autonomy to choose tools, rewrite queries, and chain retrievals introduces non-determinism. Debugging becomes far harder, and latency spikes unpredictably.
Fix: Constrain agentic behavior within a predefined action space. Use structured routing nodes that limit tool selection to approved endpoints. Log every agent decision with its reasoning trace to enable post-hoc analysis and safety auditing.
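A constrained action space can be as simple as an allow-list plus a decision log. In the sketch below the action names, the `AgentDecision` shape, and the choice to downgrade unknown actions to 'finish' are all assumptions for illustration, not a prescribed policy.

```typescript
// Hypothetical approved actions; anything else is rejected.
const ALLOWED_ACTIONS = ['vector_search', 'web_search', 'rewrite_query', 'finish'] as const;
type Action = (typeof ALLOWED_ACTIONS)[number];

interface AgentDecision {
  action: string;    // raw action name proposed by the model
  reasoning: string; // model-supplied justification
}

interface DecisionLogEntry {
  step: number;
  requested: string;
  executed: Action;
  reasoning: string;
}

function isAllowed(action: string): action is Action {
  return (ALLOWED_ACTIONS as readonly string[]).indexOf(action) !== -1;
}

// Maps a proposed action onto the allow-list and records what happened.
function constrainDecision(step: number, decision: AgentDecision, log: DecisionLogEntry[]): Action {
  const requested = decision.action;
  // Unknown actions end the loop instead of being executed.
  const executed: Action = isAllowed(requested) ? requested : 'finish';
  log.push({ step, requested, executed, reasoning: decision.reasoning });
  return executed;
}
```

Logging both the requested and the executed action makes it easy to spot, after the fact, how often the model tries to step outside the approved set.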
7. Missing Citation Grounding
Explanation: Self-correcting pipelines improve factual accuracy but often omit explicit source attribution. Users cannot verify whether the response derives from retrieved context or parametric knowledge.
Fix: Require the generation stage to output inline citations mapping claims to specific document IDs or chunks. Implement a post-generation verification step that flags unsupported assertions and triggers a regeneration with stricter grounding constraints.
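The verification step can start as a purely lexical check before graduating to an LLM-based one. The sketch below assumes inline citations of the form `[doc-id]` and a naive sentence split on terminal punctuation followed by whitespace; both conventions, and the `findUnsupportedSentences` name, are assumptions. It confirms only that each sentence cites a retrieved document, not that the document actually supports the claim.

```typescript
// Returns sentences that cite nothing, or cite an ID that was never retrieved.
function findUnsupportedSentences(response: string, knownDocIds: string[]): string[] {
  const known = new Set(knownDocIds);

  // Naive split: terminal punctuation followed by whitespace or end of text.
  const sentences = response
    .split(/[.!?]+(?:\s+|$)/)
    .map(s => s.trim())
    .filter(s => s.length > 0);

  const unsupported: string[] = [];
  for (const sentence of sentences) {
    const matches: string[] = sentence.match(/\[([^\]]+)\]/g) || [];
    const citedIds = matches.map(m => m.slice(1, -1)); // strip the brackets
    const grounded = citedIds.length > 0 && citedIds.every(id => known.has(id));
    if (!grounded) unsupported.push(sentence);
  }
  return unsupported;
}
```

A non-empty result is the signal to regenerate with stricter grounding instructions.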
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Internal FAQ / Simple Facts | Direct Routing + Parametric Generation | Bypasses retrieval entirely, minimizing latency and token usage | Low (-60% vs naive) |
| Product Documentation Lookup | Single-Shot RAG + Cross-Encoder Scoring | One retrieval pass with strict relevance filtering balances speed and accuracy | Medium (+15% vs naive) |
| Multi-Hop Research / Complex Analysis | Iterative/Agentic RAG with Fallback | Requires multiple refinement cycles and external search integration for high-fidelity synthesis | High (+40% vs naive) |
| Regulated / Compliance Queries | Self-RAG with Strict Citation Grounding | Forces model to verify support tokens and output verifiable source mappings | Medium (+20% vs naive) |
| Multi-Lingual / Low-Resource Languages | HyDE-Style Hypothetical Embedding + Contriever | Bridges query-document vocabulary gaps without requiring language-specific fine-tuning | Medium (+25% vs naive) |
Configuration Template
pipeline:
  routing:
    enabled: true
    model: "gpt-4o-mini"
    schema: "direct | single | iterative"
    max_tokens: 128
  retrieval:
    provider: "vector_store"
    default_limit: 4
    iterative_limit: 8
    recency_filter: "last_90_days"
    similarity_threshold: 0.72
  evaluation:
    model: "gpt-4o-mini"
    output_format: "json"
    schema:
      score: "number (0-1)"
      reason: "string"
      grounded_chunks: "string[]"
    fallback_threshold: 0.65
  fallback:
    enabled: true
    provider: "web_search"
    max_iterations: 3
    overlap_tolerance: 0.80
  generation:
    model: "gpt-4o"
    temperature: 0.2
    require_citations: true
    citation_format: "inline"
    max_output_tokens: 1024
  monitoring:
    track_routes: true
    track_fallbacks: true
    token_budget_per_query: 4096
    alert_on_drift: true
Quick Start Guide
- Initialize the State Machine: Deploy the RetrievalOrchestrator class with your preferred vector provider and LLM endpoints. Configure the routing schema and evaluation thresholds to match your domain complexity.
- Seed the Evaluation Layer: Create a small labeled dataset of query-document pairs. Fine-tune or prompt-engineer the relevance scorer to output structured JSON scores. Validate against held-out examples before production rollout.
- Configure Fallback Boundaries: Set iteration caps, token budgets, and overlap tolerances. Test the fallback trigger with deliberately ambiguous queries to ensure the system degrades gracefully instead of looping.
- Instrument Observability: Attach logging to each pipeline stage. Track route distribution, average relevance scores, fallback frequency, and generation latency. Set alerts for score drift or fallback rate spikes.
- Deploy with Canary Routing: Route 10% of production traffic through the new pipeline. Compare hallucination rates, latency, and cost against your baseline. Gradually increase traffic once metrics stabilize within acceptable thresholds.
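For the canary step, one common approach is deterministic bucketing: hash a stable request key and send a fixed percentage of buckets to the new pipeline. The sketch below uses an FNV-1a-style 32-bit hash over UTF-16 code units; `fnv1a`, `routeToCanary`, and the choice of key (a user or session ID) are assumptions for illustration.

```typescript
// FNV-1a-style 32-bit hash over the string's UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// True when the request should go through the new pipeline.
function routeToCanary(requestKey: string, canaryPercent: number): boolean {
  // The same key always lands in the same bucket (0-99).
  return fnv1a(requestKey) % 100 < canaryPercent;
}
```

Because the bucket depends only on the key, a given user stays on the same pipeline as the percentage is raised from 10 upward, which keeps before-and-after comparisons clean.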