ursive splitting strategy to maintain semantic coherence.
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';
import { pipeline } from '@xenova/transformers';

interface DocumentSegment {
  id: string;
  content: string;
  sourceFile: string;
  chunkIndex: number;
}

export class IngestionEngine {
  private splitter: RecursiveCharacterTextSplitter;
  private embedder: any;

  constructor() {
    // Separators are tried in order: paragraphs, lines, sentences, words, characters.
    this.splitter = new RecursiveCharacterTextSplitter({
      chunkSize: 800,
      chunkOverlap: 120,
      separators: ['\n\n', '\n', '. ', ' ', '']
    });
  }

  // Must be awaited once before vectorize() is called.
  async initializeEmbedder() {
    this.embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
  }

  async processFile(filePath: string, rawText: string): Promise<DocumentSegment[]> {
    const segments = await this.splitter.createDocuments([rawText]);
    return segments.map((seg, idx) => ({
      id: `${filePath}_seg_${idx}`,
      content: seg.pageContent,
      sourceFile: filePath,
      chunkIndex: idx
    }));
  }

  async vectorize(segments: DocumentSegment[]): Promise<number[][]> {
    if (!this.embedder) throw new Error('Embedder not initialized');
    const outputs = await Promise.all(
      segments.map(seg => this.embedder(seg.content, { pooling: 'mean', normalize: true }))
    );
    // Each output wraps a typed array; copy it into a plain number[].
    return outputs.map(out => Array.from(out.data as Float32Array));
  }
}
Architecture Rationale:
- RecursiveCharacterTextSplitter prioritizes natural language boundaries over arbitrary character counts, reducing context fragmentation.
- @xenova/transformers runs inference locally through ONNX Runtime (WebAssembly in the browser), eliminating external API dependencies and embedding costs.
- Mean pooling with L2 normalization ensures cosine similarity calculations remain stable across queries.
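To make the pooling and normalization step concrete, here is a minimal sketch of what the `pooling: 'mean'` and `normalize: true` options amount to conceptually. The function names are illustrative and not part of any library; the real pipeline performs this on tensors internally.

```typescript
// Average token-level vectors into one sentence vector (mean pooling).
function meanPool(tokenVectors: number[][]): number[] {
  const dim = tokenVectors[0].length;
  const pooled = new Array<number>(dim).fill(0);
  for (const vec of tokenVectors) {
    for (let i = 0; i < dim; i++) pooled[i] += vec[i];
  }
  return pooled.map(v => v / tokenVectors.length);
}

// Scale a vector to unit length (L2 normalization).
function l2Normalize(vec: number[]): number[] {
  const norm = Math.sqrt(vec.reduce((sum, v) => sum + v * v, 0));
  return norm === 0 ? vec.slice() : vec.map(v => v / norm);
}

// For unit-length vectors, the dot product equals cosine similarity.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}
```

Because every stored vector has unit length, similarity scores depend only on direction, not on how long the source text was.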
Phase 2: Vector Storage & Retrieval
Vectors are persisted in a local vector database. ChromaDB is selected for its zero-configuration Docker deployment, native TypeScript client, and efficient HNSW indexing.
import { ChromaClient, Collection } from 'chromadb';

export class VectorStoreAdapter {
  private client: ChromaClient;
  private collection: Collection | null = null;

  constructor() {
    this.client = new ChromaClient({ path: 'http://localhost:8000' });
  }

  async initializeCollection(name: string) {
    this.collection = await this.client.getOrCreateCollection({ name });
  }

  async upsertSegments(segments: DocumentSegment[], embeddings: number[][]) {
    if (!this.collection) throw new Error('Collection not initialized');
    // upsert replaces entries whose ids already exist instead of treating them as new additions.
    await this.collection.upsert({
      ids: segments.map(s => s.id),
      documents: segments.map(s => s.content),
      metadatas: segments.map(s => ({ source: s.sourceFile, chunk: s.chunkIndex })),
      embeddings: embeddings
    });
  }

  async queryContext(queryVector: number[], topK: number = 4): Promise<any[]> {
    if (!this.collection) throw new Error('Collection not initialized');
    const results = await this.collection.query({
      queryEmbeddings: [queryVector],
      nResults: topK
    });
    // Results are grouped per query; index 0 holds the hits for the single query vector.
    return results.documents[0].map((doc: string | null, idx: number) => ({
      text: doc ?? '',
      source: results.metadatas[0][idx]?.source,
      chunk: results.metadatas[0][idx]?.chunk
    }));
  }
}
Architecture Rationale:
- HNSW indexing provides approximately logarithmic search complexity, which typically keeps retrieval latency under 50ms for datasets up to 100K segments.
- Metadata attachment enables source tracing, which is critical for auditability and user trust.
- Wipe-and-replace ingestion is intentionally chosen over incremental updates to prevent orphaned vectors and ensure consistency. For documentation sets under 50MB, full reindexing completes in seconds.
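The wipe-and-replace strategy can be sketched as follows. The two interfaces are a minimal structural subset that the sketch assumes the vector store client satisfies (`deleteCollection` and `getOrCreateCollection` taking a `{ name }` object); `reindexFromScratch` is a hypothetical helper, not part of the adapter above.

```typescript
// Assumed minimal client surface; declared locally so the sketch is self-contained.
interface CollectionLike {
  add(batch: { ids: string[]; documents: string[]; embeddings: number[][] }): Promise<void>;
}
interface VectorClientLike {
  deleteCollection(args: { name: string }): Promise<void>;
  getOrCreateCollection(args: { name: string }): Promise<CollectionLike>;
}

async function reindexFromScratch(
  client: VectorClientLike,
  name: string,
  ids: string[],
  documents: string[],
  embeddings: number[][]
): Promise<void> {
  try {
    // Drop everything first so no vector from a previous run can survive.
    await client.deleteCollection({ name });
  } catch {
    // First run: there is no collection to delete yet.
  }
  const collection = await client.getOrCreateCollection({ name });
  await collection.add({ ids, documents, embeddings });
}
```

Queries that arrive between the delete and the final `add` will see an empty collection, which is the trade-off this approach accepts for simplicity.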
Phase 3: Grounded Generation & Streaming
The final phase assembles the prompt, routes the request through a model gateway, and streams the response token-by-token to keep perceived latency low.
import OpenAI from 'openai';

export class GenerationOrchestrator {
  private llmClient: OpenAI;

  constructor(apiKey: string) {
    // OpenRouter exposes an OpenAI-compatible endpoint, so the OpenAI SDK is reused.
    this.llmClient = new OpenAI({
      baseURL: 'https://openrouter.ai/api/v1',
      apiKey: apiKey
    });
  }

  // Synchronous: an async method cannot declare a plain string return type.
  buildPrompt(query: string, context: any[]): string {
    const contextBlock = context
      .map((c, i) => `[${i + 1}] (${c.source} #${c.chunk})\n${c.text}`)
      .join('\n\n');
    return `You are a technical documentation assistant.
Answer the user's question strictly using the provided context.
If the context does not contain sufficient information, respond with:
"I cannot find a definitive answer in the available documentation."
Context:
${contextBlock}
Question: ${query}`;
  }

  async streamResponse(prompt: string, model: string = 'meta-llama/llama-3.1-8b-instruct') {
    const stream = await this.llmClient.chat.completions.create({
      model,
      messages: [{ role: 'user', content: prompt }],
      stream: true,
      temperature: 0.1,
      max_tokens: 1024
    });
    return stream;
  }
}
Architecture Rationale:
- OpenRouter abstracts model routing, allowing seamless switching between gpt-4o-mini, claude-3.5-sonnet, or open-weight models via environment configuration.
- Temperature is locked to 0.1 to minimize creative deviation and enforce factual grounding.
- Streaming decouples network latency from user experience, delivering tokens as they are generated rather than waiting for full completion.
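The stream returned by streamResponse still has to be consumed by the caller. A minimal sketch of that loop follows; `StreamChunk` is a locally declared structural subset of a streamed chat-completion chunk, and `consumeStream` is a hypothetical helper name.

```typescript
// Only the fields this sketch reads from each streamed chunk.
interface StreamChunk {
  choices: Array<{ delta?: { content?: string | null } }>;
}

async function consumeStream(
  stream: AsyncIterable<StreamChunk>,
  onToken: (token: string) => void
): Promise<string> {
  let full = '';
  for await (const chunk of stream) {
    // Some chunks carry no text (role headers, finish markers); skip them.
    const token = chunk.choices[0]?.delta?.content ?? '';
    if (token) {
      onToken(token); // forward immediately so the UI can render it
      full += token;
    }
  }
  return full; // complete answer, useful for logging or verification
}
```

In a CLI, `onToken` could write to standard output; in a web handler it would typically push each token to the client over a server-sent event or similar channel.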
Pitfall Guide
1. Naive Character Splitting Breaks Structured Content
Explanation: Splitting purely by character count frequently severs code blocks, markdown tables, or JSON payloads mid-line, rendering retrieved segments useless.
Fix: Configure the splitter to respect structural delimiters. Use MarkdownTextSplitter for documentation-heavy repos, or implement a pre-processing step that isolates fenced code blocks before applying recursive splitting.
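The pre-processing step could look roughly like this. It is a sketch under the assumption that code is delimited by triple-backtick fences at the start of a line; `isolateCodeBlocks` and the `Block` type are illustrative names, not part of the pipeline above.

```typescript
type Block = { kind: 'code' | 'prose'; text: string };

// Built at runtime so the marker never appears literally in this sample.
const FENCE = '`'.repeat(3);

function isolateCodeBlocks(markdown: string): Block[] {
  const blocks: Block[] = [];
  let buffer: string[] = [];
  let inCode = false;

  const flush = (kind: Block['kind']): void => {
    const text = buffer.join('\n');
    if (text.trim().length > 0) blocks.push({ kind, text });
    buffer = [];
  };

  for (const line of markdown.split('\n')) {
    const isFence = line.trim().indexOf(FENCE) === 0;
    if (isFence && !inCode) {
      flush('prose'); // close the prose run before the fence opens
      buffer.push(line);
      inCode = true;
    } else if (isFence && inCode) {
      buffer.push(line);
      flush('code'); // the closing fence ends the code block
      inCode = false;
    } else {
      buffer.push(line);
    }
  }
  flush(inCode ? 'code' : 'prose'); // an unterminated fence is kept as code
  return blocks;
}
```

Prose blocks can then go through the recursive splitter, while each code block is stored as a single segment so it is never cut mid-statement.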
2. Embedding Model Language Mismatch
Explanation: Xenova/all-MiniLM-L6-v2 is optimized for English semantics. Feeding Indonesian, Japanese, or mixed-language documentation degrades retrieval recall by 30-40%.
Fix: Swap to Xenova/multilingual-e5-small or sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2. Always re-embed the entire collection after switching models, as dimensionality and vector space alignment change.
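One way to make the re-embedding rule hard to forget is to record which model produced the stored vectors and compare it at startup. The sketch below assumes a hypothetical `EmbeddingProfile` record kept alongside the collection; neither the type nor the function exists in the pipeline above.

```typescript
interface EmbeddingProfile {
  model: string;     // e.g. the configured embedding model name
  dimension: number; // length of one embedding vector
}

// Returns true when stored vectors cannot be mixed with newly generated ones.
function needsReembedding(stored: EmbeddingProfile | null, current: EmbeddingProfile): boolean {
  if (stored === null) return false; // empty collection: nothing to invalidate
  return stored.model !== current.model || stored.dimension !== current.dimension;
}
```

The model name is compared as well as the dimension because two models can emit vectors of the same length that still live in incompatible vector spaces.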
3. Token Budget Blowout from Over-Retrieval
Explanation: Requesting too many segments (topK > 6) or using oversized chunks inflates the prompt, pushing costs upward and increasing the chance the model ignores earlier context.
Fix: Cap topK at 3–5. Implement dynamic context window management that calculates remaining tokens before generation. Trim or summarize lower-relevance segments if the budget is exceeded.
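A simple form of budget management is to keep ranked segments only while they fit. The sketch below assumes segments arrive ordered by relevance and uses a rough heuristic of about four characters per token, which is an approximation for English text rather than a real tokenizer; all names are illustrative.

```typescript
interface RankedSegment {
  text: string;
  source: string;
}

// Assumption: ~4 characters per token. Swap in a real tokenizer for accuracy.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

function fitToBudget(
  segments: RankedSegment[],
  maxContextTokens: number,
  topK: number = 4
): RankedSegment[] {
  const kept: RankedSegment[] = [];
  let used = 0;
  for (const seg of segments.slice(0, topK)) {
    const cost = estimateTokens(seg.text);
    // Stop at the first segment that would overflow; lower-ranked ones are dropped too.
    if (used + cost > maxContextTokens) break;
    kept.push(seg);
    used += cost;
  }
  return kept;
}
```

This variant trims rather than summarizes. Summarizing the overflow would need an extra model call and is left out here.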
4. Silently Skipped Files During Ingestion
Explanation: The ingestion pipeline skips unsupported file types without warnings. Users assume all files are indexed, leading to false negatives during queries.
Fix: Add explicit validation logging. Generate an ingestion manifest that lists processed files, skipped files, and rejection reasons. Expose this manifest in the UI or admin dashboard.
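A manifest of this kind can be quite small. The sketch below assumes the supported types are the .md and .txt files mentioned in the quick start steps; `ManifestEntry` and `buildManifest` are hypothetical names.

```typescript
interface ManifestEntry {
  file: string;
  status: 'processed' | 'skipped';
  reason?: string;
}

// Assumption: only these extensions are ingested.
const SUPPORTED_EXTENSIONS = ['.md', '.txt'];

function buildManifest(files: string[]): ManifestEntry[] {
  return files.map((file): ManifestEntry => {
    const dotIndex = file.lastIndexOf('.');
    const ext = dotIndex >= 0 ? file.slice(dotIndex).toLowerCase() : '';
    if (SUPPORTED_EXTENSIONS.indexOf(ext) !== -1) {
      return { file, status: 'processed' };
    }
    // Record why the file was rejected so the gap is visible to users.
    const reason = ext ? `unsupported extension ${ext}` : 'no file extension';
    return { file, status: 'skipped', reason };
  });
}
```

Serializing the result to JSON after each ingestion run gives the UI or an admin dashboard something concrete to display.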
5. Stale Knowledge from Wipe-Only Ingestion
Explanation: Full reindexing is simple but inefficient for large, frequently updated repositories. Teams may delay updates to avoid downtime, resulting in outdated answers.
Fix: Implement versioned collections (e.g., docs_v1, docs_v2). Route queries to the latest stable version while building the next in parallel. Swap pointers atomically once validation passes.
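The pointer swap can be illustrated with a small in-process holder. This is a sketch only: `CollectionPointer` is a hypothetical class, and a deployment with several server processes would need to keep the active name in shared storage instead of in memory.

```typescript
class CollectionPointer {
  private active: string;

  constructor(initial: string) {
    this.active = initial;
  }

  // Name of the collection that queries should currently use.
  current(): string {
    return this.active;
  }

  // Promote the candidate only if validation passes. The swap itself is a
  // single assignment, so readers see either the old or the new name.
  async promote(
    candidate: string,
    validate: (name: string) => Promise<boolean>
  ): Promise<boolean> {
    if (!(await validate(candidate))) return false;
    this.active = candidate;
    return true;
  }
}
```

A validation callback might run a handful of known queries against the candidate collection and check that each returns at least one result.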
6. LLM Ignores Grounding Instructions
Explanation: Even with strict prompts, models may default to pre-training knowledge when context is ambiguous or poorly formatted.
Fix: Enforce low temperature, use explicit negative constraints ("Do not use external knowledge"), and implement a post-generation verification step that cross-references cited sources with retrieved segments.
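A lightweight verification pass can check that every citation in the answer points at a segment that was actually retrieved. The sketch assumes the model cites sources with the bracketed numbers that buildPrompt assigns to context entries ([1], [2], ...); the prompt shown earlier does not explicitly request this, so the citation format is an assumption. `findUnsupportedCitations` is a hypothetical helper.

```typescript
// Returns citation numbers that do not correspond to any retrieved segment.
function findUnsupportedCitations(answer: string, retrievedCount: number): number[] {
  const cited = new Set<number>();
  const pattern = /\[(\d+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    cited.add(Number(match[1]));
  }
  return Array.from(cited)
    .filter(n => n < 1 || n > retrievedCount)
    .sort((a, b) => a - b);
}
```

A non-empty result is a signal to discard or flag the answer. An empty result only shows that the citations are in range, not that the cited text supports the claim.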
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Local development / prototyping | Local embeddings + ChromaDB + OpenRouter | Zero infrastructure cost, instant iteration, full control | $0.00 infrastructure, ~$0.03/query |
| Team internal knowledge base | Managed vector DB (e.g., Weaviate Cloud) + OpenRouter | Higher availability, built-in backups, concurrent access | ~$15–$30/mo DB, ~$0.04/query |
| Production scale (>10K docs) | Distributed vector store + model routing gateway + caching layer | Handles concurrent queries, reduces embedding redundancy, ensures SLA | ~$80–$150/mo infra, ~$0.02/query (cached) |
Configuration Template
# .env.local
OPENROUTER_API_KEY=sk-or-v1-xxxxxxxxxxxxxxxx
LLM_MODEL=meta-llama/llama-3.1-8b-instruct
EMBEDDING_MODEL=Xenova/all-MiniLM-L6-v2
CHROMA_HOST=http://localhost:8000
COLLECTION_NAME=internal_docs_v1
MAX_CHUNK_SIZE=800
CHUNK_OVERLAP=120
TOP_K_RESULTS=4
# docker-compose.yml
version: '3.8'
services:
  chroma:
    image: chromadb/chroma:latest
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma
    environment:
      - ANONYMIZED_TELEMETRY=false
      - ALLOW_RESET=true
volumes:
  chroma_data:
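The environment variables in the template need to be parsed and validated somewhere. A minimal sketch follows, assuming the variable names and default values from the .env template; `RagConfig` and `loadConfig` are hypothetical names, and the API key is deliberately left out so it is never copied into a config object that might be logged.

```typescript
interface RagConfig {
  llmModel: string;
  embeddingModel: string;
  chromaHost: string;
  collectionName: string;
  maxChunkSize: number;
  chunkOverlap: number;
  topK: number;
}

function loadConfig(env: Record<string, string | undefined>): RagConfig {
  // Parse an integer variable, falling back to the template default when unset.
  const int = (key: string, fallback: number): number => {
    const raw = env[key];
    if (raw === undefined || raw === '') return fallback;
    const parsed = Number.parseInt(raw, 10);
    if (Number.isNaN(parsed)) throw new Error(`${key} must be an integer, got "${raw}"`);
    return parsed;
  };

  const config: RagConfig = {
    llmModel: env.LLM_MODEL || 'meta-llama/llama-3.1-8b-instruct',
    embeddingModel: env.EMBEDDING_MODEL || 'Xenova/all-MiniLM-L6-v2',
    chromaHost: env.CHROMA_HOST || 'http://localhost:8000',
    collectionName: env.COLLECTION_NAME || 'internal_docs_v1',
    maxChunkSize: int('MAX_CHUNK_SIZE', 800),
    chunkOverlap: int('CHUNK_OVERLAP', 120),
    topK: int('TOP_K_RESULTS', 4),
  };

  // An overlap as large as the chunk itself would prevent the splitter from advancing.
  if (config.chunkOverlap >= config.maxChunkSize) {
    throw new Error('CHUNK_OVERLAP must be smaller than MAX_CHUNK_SIZE');
  }
  return config;
}
```

In a Node.js process this would typically be called once at startup as `loadConfig(process.env)`, so that a malformed value fails fast instead of surfacing during ingestion.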
Quick Start Guide
- Initialize the environment: Run npm install to pull dependencies, then copy .env.example to .env and populate the OpenRouter key.
- Launch the vector store: Execute docker compose up -d to start ChromaDB. Verify connectivity by hitting http://localhost:8000/api/v1/heartbeat.
- Index your documentation: Place .md or .txt files in the designated input directory and run the ingestion script. Monitor console output for segment counts and embedding dimensions.
- Start the application: Run npm run dev and navigate to http://localhost:3000. Submit a query to validate retrieval, grounding, and streaming behavior.