ouples retrieval quality from user phrasing habits.
Core Solution
Building a production-ready query transformation pipeline requires separating concerns: intent analysis, vector generation, retrieval execution, and result consolidation. The following TypeScript implementation demonstrates a modular architecture that supports all three strategies while maintaining strict control over latency, deduplication, and fallback behavior.
Architecture Rationale
- Strategy Isolation: Each transformation method operates as an independent module. This enables runtime routing based on query complexity, latency budgets, or domain constraints.
- Parallel Retrieval: Multi-query and decomposition strategies benefit from concurrent vector searches. Sequential execution introduces unnecessary latency.
- Deterministic Deduplication: Merging results from multiple vectors inevitably produces duplicates. A content-hash based deduplication step ensures the context window isn't polluted.
- Embedding Alignment: HyDE requires the same embedding model used for documents. Mixing models breaks cosine similarity assumptions.
Implementation
import { createHash } from 'node:crypto';
import { EmbeddingModel, VectorStore, Document } from './types';

// Minimal contract the engine needs from an LLM client.
interface TextGenerator {
  generate(prompt: string): Promise<string>;
}

interface QueryTransformationConfig {
  maxVariants: number;
  topK: number;
  enableParallelRetrieval: boolean;
  deduplicationThreshold: number; // not consulted by the exact-match deduplication below
}

interface TransformationResult {
  documents: Document[];
  strategy: string;
  latencyMs: number;
}

class QueryTransformationEngine {
  constructor(
    private readonly llm: TextGenerator,
    private readonly embedder: EmbeddingModel,
    private readonly vectorStore: VectorStore,
    private readonly config: QueryTransformationConfig
  ) {}

  async executeMultiQuery(originalQuery: string): Promise<TransformationResult> {
    const start = performance.now();
    const prompt = `Generate ${this.config.maxVariants} distinct phrasings of the following query.
Each phrasing must target the same core intent but use different terminology or structure.
Output only the queries, separated by newlines. No numbering or commentary.
Query: ${originalQuery}`;

    const variants = (await this.llm.generate(prompt))
      .split('\n')
      .map(q => q.trim())
      .filter(q => q.length > 0);

    const merged = this.deduplicate(await this.retrieveAll(variants));
    return {
      documents: merged.slice(0, this.config.topK),
      strategy: 'multi_query',
      latencyMs: performance.now() - start
    };
  }

  async executeHyDE(originalQuery: string): Promise<TransformationResult> {
    const start = performance.now();
    const prompt = `Construct a concise, technically accurate hypothetical answer to the following question.
This response will be embedded for vector retrieval and must mirror the style and terminology of domain documentation.
Length: 80-120 words. Do not include disclaimers or meta-commentary.
Question: ${originalQuery}`;

    // Embed the hypothetical answer with the same model that embedded the documents.
    const hypotheticalAnswer = await this.llm.generate(prompt);
    const queryVector = await this.embedder.embed(hypotheticalAnswer);
    const documents = await this.vectorStore.searchByVector(queryVector, this.config.topK);
    return {
      documents,
      strategy: 'hyde',
      latencyMs: performance.now() - start
    };
  }

  async executeDecomposition(originalQuery: string): Promise<TransformationResult> {
    const start = performance.now();
    const prompt = `Decompose the following complex query into independent sub-questions.
Each sub-question must be self-contained and retrievable without external context.
Output one sub-question per line. No numbering, no explanations.
Query: ${originalQuery}`;

    const subQueries = (await this.llm.generate(prompt))
      .split('\n')
      .map(q => q.trim())
      .filter(q => q.length > 0);

    const merged = this.deduplicate(await this.retrieveAll(subQueries));
    return {
      documents: merged.slice(0, this.config.topK),
      strategy: 'decomposition',
      latencyMs: performance.now() - start
    };
  }

  // Runs one search per query and interleaves the result lists by rank, so the
  // final slice(0, topK) draws from every query instead of only the first one.
  // Assumes similaritySearch resolves to Document[].
  private async retrieveAll(queries: string[]): Promise<Document[]> {
    const search = (q: string): Promise<Document[]> =>
      this.vectorStore.similaritySearch(q, this.config.topK);

    let perQuery: Document[][];
    if (this.config.enableParallelRetrieval) {
      perQuery = await Promise.all(queries.map(search));
    } else {
      // Sequential path: each search starts only after the previous one finishes.
      perQuery = [];
      for (const q of queries) perQuery.push(await search(q));
    }

    const interleaved: Document[] = [];
    const longest = Math.max(0, ...perQuery.map(r => r.length));
    for (let rank = 0; rank < longest; rank++) {
      for (const results of perQuery) {
        if (rank < results.length) interleaved.push(results[rank]);
      }
    }
    return interleaved;
  }

  // Keeps the first occurrence of each distinct chunk, preserving order.
  private deduplicate(documents: Document[]): Document[] {
    const seen = new Set<string>();
    return documents.filter(doc => {
      const hash = this.computeContentHash(doc.content);
      if (seen.has(hash)) return false;
      seen.add(hash);
      return true;
    });
  }

  private computeContentHash(content: string): string {
    // Hash the full normalized text. Truncating a base64 encoding would compare
    // only the first few bytes and merge unrelated chunks with the same opening.
    const normalized = content.toLowerCase().replace(/\s+/g, ' ').trim();
    return createHash('sha256').update(normalized).digest('hex');
  }
}
Why This Architecture Works
- Parallel Execution: Promise.all keeps multi-query and decomposition from paying sequential API latency, so the slowest search, not the sum of all searches, sets the retrieval time. This assumes the vector store handles concurrent requests well.
- Deterministic Deduplication: Content hashing prevents duplicate chunks from consuming context window tokens. This is critical when multiple vectors retrieve overlapping document segments.
- Strategy Agnosticism: The engine returns a uniform TransformationResult interface. Downstream generators don't need to know which transformation was applied. This enables seamless A/B testing and routing logic.
- Embedding Consistency: HyDE explicitly reuses the document embedder. Mixing models introduces distribution shift that breaks cosine similarity assumptions.
Pitfall Guide
1. Blind Strategy Stacking
Explanation: Combining HyDE, multi-query, and decomposition in a single pipeline multiplies LLM calls and retrieval operations. On a 100k document corpus, this can push latency past 3 seconds and inflate API costs by 400%.
Fix: Implement a query router that selects a single strategy based on intent complexity. Use lightweight classification (e.g., regex + small LLM) to route before transformation.
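A minimal sketch of the regex stage of such a router, written in TypeScript. The strategy names, patterns, and word-count limits are illustrative assumptions rather than tuned values, and the small-LLM stage mentioned above is left out entirely.

```typescript
// Hypothetical strategy labels: the three transformations plus a no-op baseline.
type Strategy = 'naive' | 'multi_query' | 'hyde' | 'decomposition';

// Assumed heuristics; adjust the patterns and limits to your own traffic.
const MULTI_PART = /\b(and|versus|vs|compare|difference between)\b/i;
const QUESTION_FORM = /^(how|why|what|when|which|explain)\b/i;

function routeQuery(query: string): Strategy {
  const text = query.trim();
  const wordCount = text.split(/\s+/).filter(w => w.length > 0).length;

  // Very short queries rarely benefit from an extra LLM call.
  if (wordCount <= 3) return 'naive';
  // Long queries that join several concepts are candidates for decomposition.
  if (MULTI_PART.test(text) && wordCount >= 8) return 'decomposition';
  // Direct questions map well onto a hypothetical answer.
  if (QUESTION_FORM.test(text)) return 'hyde';
  // Everything else gets rephrased to smooth over lexical variance.
  return 'multi_query';
}
```

Because exactly one strategy is returned per query, the number of LLM calls stays fixed at one regardless of how many strategies are enabled.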
2. HyDE Domain Drift
Explanation: The hypothetical answer generator may produce technically plausible but factually misaligned content if the prompt lacks domain constraints. The resulting vector drifts into irrelevant clusters.
Fix: Inject domain-specific terminology and style guidelines into the HyDE prompt. Add a validation step that checks embedding similarity against a known positive anchor before retrieval.
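The anchor check can be sketched as a cosine-similarity gate. The function names and the 0.6 default threshold below are assumptions for illustration; the anchors are expected to be embeddings of known in-domain passages produced by the same model as the HyDE vector.

```typescript
// Cosine similarity between two equal-length vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('Vectors must be non-empty and of equal length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Returns true when the HyDE vector is close to at least one known-good anchor.
// minSimilarity is an assumed starting point, not a recommended value.
function isOnDomain(
  hydeVector: number[],
  anchorVectors: number[][],
  minSimilarity = 0.6
): boolean {
  return anchorVectors.some(
    anchor => cosineSimilarity(hydeVector, anchor) >= minSimilarity
  );
}
```

When the check fails, one option is to fall back to embedding the original query instead of the hypothetical answer.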
3. Decomposition Over-Fragmentation
Explanation: Breaking a query into too many sub-questions dilutes retrieval focus. The context window fills with marginally relevant snippets, reducing signal-to-noise ratio.
Fix: Cap decomposition at 3 sub-queries. Enforce a minimum semantic weight threshold. Merge closely related sub-queries before retrieval.
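One way to approximate this is to drop sub-queries that overlap heavily with one already kept, then stop at the cap. The sketch below uses token-level Jaccard overlap as a cheap stand-in for semantic similarity; the function names and the 0.6 threshold are assumptions, and it discards near-duplicates rather than rewriting them into a merged question.

```typescript
// Lowercased alphanumeric tokens of a sub-query.
function tokenSet(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/[^a-z0-9]+/).filter(t => t.length > 0));
}

// Jaccard overlap: shared tokens divided by total distinct tokens.
function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  a.forEach(token => {
    if (b.has(token)) shared++;
  });
  return shared / (a.size + b.size - shared);
}

// Keeps at most maxSubQueries sub-queries, skipping any that overlap an
// already-kept one by overlapThreshold or more. Both defaults are assumed values.
function pruneSubQueries(
  subQueries: string[],
  maxSubQueries = 3,
  overlapThreshold = 0.6
): string[] {
  const kept: { text: string; tokens: Set<string> }[] = [];
  for (const candidate of subQueries) {
    if (kept.length >= maxSubQueries) break;
    const tokens = tokenSet(candidate);
    if (tokens.size === 0) continue;
    const redundant = kept.some(k => jaccard(k.tokens, tokens) >= overlapThreshold);
    if (!redundant) kept.push({ text: candidate, tokens });
  }
  return kept.map(k => k.text);
}
```

Token overlap misses paraphrases that share no words; comparing sub-query embeddings would catch those at the cost of extra embedding calls.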
4. Deduplication Blindness
Explanation: Merging results from multiple vectors without deduplication causes duplicate chunks to occupy context tokens. The generator receives redundant information, increasing hallucination risk.
Fix: Always apply content-hash deduplication before context assembly. Preserve original metadata to track which strategy retrieved each chunk.
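The provenance part of this fix can be sketched as a merge that records every strategy that surfaced a chunk. The `RetrievedChunk` and `MergedChunk` shapes are hypothetical, and the sketch keys on the normalized text itself rather than a hash of it, which gives the same grouping for exact duplicates.

```typescript
// Hypothetical shapes: a retrieved chunk tagged with the strategy that found it.
interface RetrievedChunk {
  id: string;
  content: string;
  strategy: string;
}

interface MergedChunk {
  id: string;
  content: string;
  retrievedBy: string[];
}

// Collapses duplicate chunks while remembering which strategies retrieved them.
function mergeWithProvenance(chunks: RetrievedChunk[]): MergedChunk[] {
  const byContent = new Map<string, MergedChunk>();
  for (const chunk of chunks) {
    // Same normalization idea as content hashing: case and whitespace insensitive.
    const key = chunk.content.toLowerCase().replace(/\s+/g, ' ').trim();
    const existing = byContent.get(key);
    if (existing === undefined) {
      byContent.set(key, {
        id: chunk.id,
        content: chunk.content,
        retrievedBy: [chunk.strategy]
      });
    } else if (existing.retrievedBy.indexOf(chunk.strategy) === -1) {
      existing.retrievedBy.push(chunk.strategy);
    }
  }
  // Map preserves insertion order, so first-seen chunks stay first.
  return Array.from(byContent.values());
}
```

Chunks retrieved by more than one strategy are a useful signal in their own right and can be ranked ahead of single-strategy hits.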
5. Latency Neglect in Sequential Pipelines
Explanation: Running transformations sequentially (LLM → embed → retrieve → LLM → embed → retrieve) creates compounding latency. User experience degrades rapidly above 1.5s.
Fix: Parallelize independent retrieval calls. Cache embedding results for identical queries. Use streaming responses where applicable.
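Caching embeddings for identical queries can be sketched as a thin wrapper around whatever embedding call is in use. `EmbedFn`, `withEmbeddingCache`, and the `maxEntries` default are hypothetical names and values; eviction here is simple first-in, first-out.

```typescript
// Assumed signature for an embedding call.
type EmbedFn = (text: string) => Promise<number[]>;

// Wraps an embedding function so identical inputs share one in-flight or
// completed request. maxEntries is an assumed bound on cache size.
function withEmbeddingCache(embed: EmbedFn, maxEntries = 1000): EmbedFn {
  const cache = new Map<string, Promise<number[]>>();
  return (text: string): Promise<number[]> => {
    const hit = cache.get(text);
    if (hit !== undefined) return hit;

    // Cache the promise itself so concurrent callers do not trigger duplicate calls.
    // Failed requests are evicted so a later call can retry.
    const pending = embed(text).catch((err: unknown) => {
      cache.delete(text);
      throw err;
    });

    if (cache.size >= maxEntries) {
      // Map iterates in insertion order, so the first key is the oldest entry.
      const oldest = cache.keys().next().value;
      if (oldest !== undefined) cache.delete(oldest);
    }
    cache.set(text, pending);
    return pending;
  };
}
```

The cache key is the exact input string, so normalizing queries before embedding (trimming, collapsing whitespace) raises the hit rate.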
6. Embedding Model Mismatch
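Explanation: Using different embedding models for queries and documents breaks the shared vector space assumption. Cosine similarity between vectors from different models is not meaningful, because the two models do not place related text in the same region of the space.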
Fix: Enforce a single embedding model across the entire pipeline. Validate model consistency during deployment checks.
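A deployment check along these lines might compare the model recorded when the index was built against the one configured for queries. The `EmbeddingSpec` shape and the idea of storing it alongside the index are assumptions; many vector stores let you attach this kind of metadata, but the mechanism varies.

```typescript
// Hypothetical record of which model produced a set of vectors.
interface EmbeddingSpec {
  model: string;
  dimensions: number;
}

// Throws if the query-time embedder does not match the one used at index time.
function assertEmbeddingCompatibility(
  indexSpec: EmbeddingSpec,
  querySpec: EmbeddingSpec
): void {
  if (indexSpec.dimensions !== querySpec.dimensions) {
    throw new Error(
      `Embedding dimension mismatch: index=${indexSpec.dimensions}, query=${querySpec.dimensions}`
    );
  }
  // Equal dimensions are not enough: two models can share a size but not a space.
  if (indexSpec.model !== querySpec.model) {
    throw new Error(
      `Embedding model mismatch: index=${indexSpec.model}, query=${querySpec.model}`
    );
  }
}
```

Running this once at startup turns a silent relevance regression into an immediate, visible failure.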
7. Fixed Top-K Rigidity
Explanation: Hardcoding k=5 ignores result density. Sparse queries return low-confidence matches, while dense queries return redundant top results.
Fix: Implement dynamic k based on similarity score thresholds. Drop results below a confidence floor. Scale k proportionally to corpus size.
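The confidence-floor part of this fix can be sketched as a filter followed by a cap. `ScoredDocument` and the option names are hypothetical, scores are assumed to be similarities where higher is better, and scaling k with corpus size is not covered here.

```typescript
// Hypothetical search hit carrying a similarity score (higher is better).
interface ScoredDocument {
  id: string;
  score: number;
}

interface DynamicKOptions {
  minScore: number; // confidence floor; assumed to be tuned per embedding model
  maxK: number;     // upper bound on how many results reach the context window
}

// Keeps only results at or above the floor, best first, capped at maxK.
function selectDynamicK(
  results: ScoredDocument[],
  options: DynamicKOptions
): ScoredDocument[] {
  return results
    .filter(r => r.score >= options.minScore) // filter copies, so the input is not mutated
    .sort((a, b) => b.score - a.score)
    .slice(0, options.maxK);
}
```

An empty result is a valid outcome: it signals that retrieval found nothing trustworthy, which the generator can handle explicitly instead of answering from weak context.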
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Single-concept, conversational query | Multi-Query Variant | Stabilizes retrieval across lexical variance without distribution shift | +1 LLM call, +3 retrievals |
| Technical documentation, formal domain | HyDE | Bridges question-answer semantic gap, maximizes precision & faithfulness | +1 LLM call, +1 embedding |
| Multi-hop, synthesis-heavy query | Query Decomposition | Isolates independent intents, maximizes recall for complex topics | +1 LLM call, +2-3 retrievals |
| Low-latency requirement (<800ms) | Naive + Metadata Filter | Avoids transformation overhead; relies on index quality | Baseline |
| High-accuracy compliance domain | HyDE + Strict Validation | Ensures domain alignment before retrieval; reduces hallucination risk | +1 LLM call, +1 embedding, +validation step |
Configuration Template
// query-routing.config.ts
export const QueryRoutingConfig = {
  strategies: {
    multiQuery: {
      enabled: true,
      maxVariants: 3,
      topK: 5,
      parallel: true,
      latencyBudgetMs: 1200,
      fallback: 'naive'
    },
    hyde: {
      enabled: true,
      topK: 5,
      domainPrompt: 'Use precise technical terminology. Mirror documentation style.',
      latencyBudgetMs: 1000,
      fallback: 'multiQuery'
    },
    decomposition: {
      enabled: true,
      maxSubQueries: 3,
      topK: 4,
      parallel: true,
      latencyBudgetMs: 1500,
      fallback: 'hyde'
    }
  },
  routing: {
    complexityThreshold: 0.7, // LLM confidence score
    latencyCircuitBreaker: 1800, // milliseconds
    deduplication: {
      algorithm: 'content-hash',
      enabled: true
    }
  }
};
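The `fallback` fields in the template form a chain (decomposition to hyde to multiQuery to naive). A minimal sketch of resolving that chain follows; `resolveFallbackChain` is a hypothetical helper, and it assumes any name without its own entry, such as 'naive', is a terminal strategy.

```typescript
// Only the fields this helper reads; the real entries carry more settings.
interface StrategySettings {
  enabled: boolean;
  fallback: string;
}

// Returns the ordered list of strategies to try, starting at `start` and
// following `fallback` links. Disabled strategies are skipped; cycles stop the walk.
function resolveFallbackChain(
  strategies: Record<string, StrategySettings>,
  start: string
): string[] {
  const chain: string[] = [];
  const visited = new Set<string>();
  let current: string | undefined = start;

  while (current !== undefined && !visited.has(current)) {
    visited.add(current);
    const settings: StrategySettings | undefined = strategies[current];
    if (settings === undefined) {
      // No entry for this name: treat it as a terminal strategy.
      chain.push(current);
      break;
    }
    if (settings.enabled) chain.push(current);
    current = settings.fallback;
  }
  return chain;
}
```

A router can walk the returned list in order, moving to the next entry whenever a strategy exceeds its `latencyBudgetMs` or fails.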
Quick Start Guide
- Install Dependencies: Add your vector store client, LLM provider SDK, and embedding model wrapper to your project. Ensure all components share the same authentication context.
- Initialize the Engine: Instantiate QueryTransformationEngine with your LLM, embedder, vector store, and a QueryTransformationConfig whose values come from the configuration template above. Verify embedding model consistency.
- Deploy the Router: Implement a lightweight intent classifier (regex + small LLM or embedding similarity) to route incoming queries to the appropriate strategy. Apply latency budgets and fallback logic.
- Validate & Monitor: Run a benchmark set of 50-100 representative queries. Track context recall, precision, and latency. Adjust topK, deduplication thresholds, and routing weights based on observed performance. Integrate RAGAS metrics into your CI/CD pipeline for continuous validation.
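For the benchmark step, a rough summary of recall, precision, and latency can be computed from labeled queries before wiring in a full evaluation framework. The sketch below is an ID-based approximation and not the RAGAS definition of these metrics; `BenchmarkCase` and `summarize` are hypothetical names, and retrieved IDs are assumed to be unique within a case.

```typescript
// One benchmark query with its labeled relevant chunks and what was retrieved.
interface BenchmarkCase {
  query: string;
  relevantIds: string[];
  retrievedIds: string[]; // assumed to contain no duplicates
  latencyMs: number;
}

interface BenchmarkSummary {
  meanRecall: number;
  meanPrecision: number;
  p95LatencyMs: number;
}

function summarize(cases: BenchmarkCase[]): BenchmarkSummary {
  if (cases.length === 0) throw new Error('Benchmark set is empty');

  let recallSum = 0;
  let precisionSum = 0;
  for (const c of cases) {
    const relevant = new Set(c.relevantIds);
    const hits = c.retrievedIds.filter(id => relevant.has(id)).length;
    // Recall: share of relevant chunks that were retrieved.
    recallSum += relevant.size === 0 ? 0 : hits / relevant.size;
    // Precision: share of retrieved chunks that were relevant.
    precisionSum += c.retrievedIds.length === 0 ? 0 : hits / c.retrievedIds.length;
  }

  // Nearest-rank 95th percentile over ascending latencies.
  const latencies = cases.map(c => c.latencyMs).sort((a, b) => a - b);
  const p95Index = Math.ceil(0.95 * latencies.length) - 1;

  return {
    meanRecall: recallSum / cases.length,
    meanPrecision: precisionSum / cases.length,
    p95LatencyMs: latencies[p95Index]
  };
}
```

Running the same benchmark set once per strategy gives a like-for-like comparison to drive the routing weights.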