ionale behind each design decision.
Path 1: Retrieval-Augmented Generation Pipeline
RAG architectures separate knowledge storage from generation. The pipeline ingests documents, chunks them, embeds them, stores them in a vector index, and retrieves relevant segments during inference.
import { randomUUID } from "node:crypto";
import { QdrantClient } from "@qdrant/js-client-rest";
import { createOpenAI } from "@ai-sdk/openai";
import { generateText } from "ai";

const COLLECTION = "enterprise_knowledge_v1";
const vectorStore = new QdrantClient({ url: process.env.QDRANT_URL });
const openai = createOpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Payload stored alongside each vector.
type ChunkMetadata = {
  docId: string;
  chunkIndex: number;
  rawText: string;
};

async function buildRetrievalIndex(documents: Array<{ id: string; content: string }>) {
  await vectorStore.createCollection(COLLECTION, {
    vectors: { size: 1536, distance: "Cosine" } // size must match the embedding dimension
  });
  // Split each document into chunks of up to ~500 characters that end on whitespace.
  const chunks: ChunkMetadata[] = documents.flatMap((doc) =>
    (doc.content.match(/.{1,500}(?:\s|$)/g) ?? []).map((rawText, chunkIndex) => ({
      docId: doc.id,
      chunkIndex,
      rawText
    }))
  );
  // Embedding calls are async, so resolve them with Promise.all
  // instead of awaiting inside a synchronous callback.
  const points = await Promise.all(
    chunks.map(async (payload) => ({
      id: randomUUID(), // Qdrant point IDs must be unsigned integers or UUIDs
      vector: await generateEmbedding(payload.rawText),
      payload
    }))
  );
  await vectorStore.upsert(COLLECTION, { points });
}

async function generateEmbedding(text: string): Promise<number[]> {
  const response = await fetch("https://api.openai.com/v1/embeddings", {
    method: "POST",
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}`, "Content-Type": "application/json" },
    body: JSON.stringify({ model: "text-embedding-3-small", input: text })
  });
  if (!response.ok) {
    throw new Error(`Embedding request failed with status ${response.status}`);
  }
  const data = await response.json();
  return data.data[0].embedding;
}

async function queryKnowledgeBase(userQuestion: string): Promise<string> {
  const queryVector = await generateEmbedding(userQuestion);
  const results = await vectorStore.search(COLLECTION, {
    vector: queryVector,
    limit: 5,
    with_payload: true
  });
  // Join the retrieved chunks with a visible separator so the model can tell them apart.
  const contextBlocks = results.map((hit) => hit.payload?.rawText as string).join("\n---\n");
  const { text } = await generateText({
    model: openai("gpt-4o"),
    prompt: `You are a technical assistant. Answer using ONLY the provided context. If the context lacks sufficient information, state that explicitly.\n\nContext:\n${contextBlocks}\n\nQuestion: ${userQuestion}`
  });
  return text;
}
Architectural Rationale:
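- Chunking Strategy: The regex-based splitter caps each chunk at roughly 500 characters and breaks on whitespace, so words are never cut in half. It counts characters rather than tokens and adds no overlap between chunks; a token-aware splitter with overlap reduces semantic fragmentation further.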
- Vector Store Selection: Qdrant provides payload filtering and fast cosine similarity search, enabling metadata-aware retrieval (e.g., filtering by document version or department).
- Prompt Grounding: Explicit constraints (ONLY the provided context) reduce hallucination drift. The separator (---) helps the model distinguish between independent retrieved segments.
- Embedding Model: text-embedding-3-small balances dimensionality (1536) with cost efficiency. Larger dimensions increase storage overhead without proportional retrieval gains for most enterprise text.
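Overlap between neighbouring chunks is one way to keep a sentence that straddles a boundary retrievable from either side. The following is a minimal sketch in Python, assuming character-based size limits and word-level overlap; the function name `chunk_with_overlap` and its parameters are hypothetical and not part of the pipeline above.

```python
def chunk_with_overlap(text: str, max_chars: int = 500, overlap_words: int = 2) -> list[str]:
    """Split text on whitespace into chunks that share their trailing words."""
    chunks: list[str] = []
    current: list[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last few words forward so neighbouring chunks share context.
            current = current[-overlap_words:] if overlap_words else []
        current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for piece in chunk_with_overlap("alpha beta gamma delta epsilon zeta", max_chars=16, overlap_words=1):
        print(piece)
```

A chunk can exceed `max_chars` slightly when the carried-over words plus the next word are longer than the limit, so a production splitter would likely enforce the limit in tokens with the embedding model's own tokenizer.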
Path 2: Parameter-Efficient Fine-Tuning Pipeline
Fine-tuning modifies model weights to internalize behavioral patterns. Production deployments rarely use full parameter updates due to compute costs and catastrophic forgetting risks. Instead, Low-Rank Adaptation (LoRA) or Quantized LoRA (QLoRA) trains small low-rank adapter matrices on selected projection layers while the base weights stay frozen.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
import datasets
import torch

MODEL_NAME = "mistralai/Mistral-7B-v0.1"
MAX_SEQ_LENGTH = 2048
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05

# Load the base model with 4-bit quantized weights; dtype=None lets Unsloth pick fp16 or bf16.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL_NAME,
    max_seq_length=MAX_SEQ_LENGTH,
    dtype=None,
    load_in_4bit=True,
)

# Attach LoRA adapters to the attention and feed-forward projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
)

def format_instruction(sample):
    # Append the EOS token so the model learns where a response ends.
    return (
        f"<|system|>You are a structured output generator.\n"
        f"<|user|>{sample['input']}\n"
        f"<|assistant|>{sample['output']}{tokenizer.eos_token}"
    )

dataset = datasets.load_dataset("json", data_files="training_pairs.jsonl")["train"]
dataset = dataset.map(lambda x: {"text": format_instruction(x)})

# Match mixed-precision flags to the hardware instead of forcing fp16.
bf16_ok = torch.cuda.is_bf16_supported()

# The keyword names below follow older TRL releases; newer releases may expect
# some of them (such as dataset_text_field) on SFTConfig instead.
# No custom collator is passed: SFTTrainer's default builds labels from the text field.
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    args=TrainingArguments(
        learning_rate=2e-4,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch size of 16 per device
        num_train_epochs=3,
        fp16=not bf16_ok,
        bf16=bf16_ok,
        logging_steps=10,
        output_dir="./lora_output",
        optim="paged_adamw_8bit",
    ),
)
trainer.train()

# Merge the adapters into the base weights and save a standalone 16-bit model.
model.save_pretrained_merged("./final_model", tokenizer, save_method="merged_16bit")
Architectural Rationale:
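- 4-Bit Quantization + LoRA: Storing the frozen base weights in 4-bit precision cuts their memory footprint by roughly 75% relative to 16-bit, usually with little loss in adaptation quality. Targeting both attention and feed-forward projections lets the adapters influence contextual reasoning as well as output generation patterns.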
- Gradient Checkpointing: Trades compute for memory, enabling larger batch sizes on consumer or mid-tier cloud GPUs.
- Optimizer Choice: paged_adamw_8bit keeps optimizer state in 8-bit precision and pages it to CPU memory when the GPU runs short, which reduces out-of-memory failures during memory spikes.
- Merged Export: Saves the adapted weights back into the base model architecture, eliminating runtime LoRA loading overhead during inference.
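The size of the adaptation can be checked with simple arithmetic: a LoRA adapter of rank r on a weight matrix with input dimension d_in and output dimension d_out adds r × (d_in + d_out) trainable parameters. The sketch below applies this to the seven target modules listed above, using the published Mistral-7B-v0.1 layer shapes; the helper name `lora_trainable_params` and the rounded base parameter count are illustrative assumptions.

```python
# Mistral-7B-v0.1 shapes: hidden size 4096, key/value width 1024 (8 KV heads x 128),
# feed-forward width 14336, 32 decoder layers.
HIDDEN, KV_DIM, FFN, LAYERS = 4096, 1024, 14336, 32

# (input_dim, output_dim) for each projection that receives an adapter.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, FFN),
    "up_proj": (HIDDEN, FFN),
    "down_proj": (FFN, HIDDEN),
}

def lora_trainable_params(rank: int) -> int:
    """Each adapter adds two matrices: (d_in x rank) and (rank x d_out)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in PROJECTIONS.values())
    return per_layer * LAYERS


if __name__ == "__main__":
    trainable = lora_trainable_params(16)
    base_params = 7.24e9  # approximate total parameter count of the base model
    print(f"trainable adapter parameters: {trainable:,}")
    print(f"share of base model: {trainable / base_params:.2%}")
```

With rank 16 this comes to 41,943,040 trainable parameters, roughly 0.6% of the base model, which is why optimizer state and gradients stay small even though all seven projection types are adapted.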
Pitfall Guide
Production deployments frequently fail due to architectural mismatches rather than implementation bugs. The following pitfalls represent recurring failure modes observed in enterprise LLM systems.
| Pitfall | Explanation | Fix |
|---|---|---|
| Context Window Saturation | Retrieving too many chunks or using oversized documents pushes prompts past model limits, truncating critical instructions or causing silent failures. | Implement dynamic chunk ranking with a hard token budget. Use a secondary LLM call to summarize retrieved segments before injection. |
| Fine-Tuning on Noisy Labels | Training on inconsistent formatting, contradictory examples, or unverified facts teaches the model to replicate errors. Hallucinations become structural rather than contextual. | Enforce strict data validation pipelines. Use LLM-as-judge scoring to filter low-confidence pairs. Maintain a golden dataset for regression testing. |
| Treating RAG as a Knowledge Base | Assuming retrieval replaces model understanding leads to poor query formulation. The model cannot reason across disjointed chunks without explicit bridging instructions. | Add a reasoning step in the prompt template. Use hybrid search (BM25 + dense embeddings) to capture both keyword and semantic intent. |
| Ignoring Retrieval Latency Budgets | Vector search, embedding generation, and prompt assembly add 100–400ms per request. Real-time chat interfaces degrade noticeably without caching or async prefetching. | Cache embeddings for frequent queries. Precompute vector indices during off-peak hours. Use streaming responses to mask retrieval delay. |
| Over-Fine-Tuning for Factual Recall | Model weights are poor storage mechanisms for volatile data. Updating fine-tuned models weekly incurs GPU costs and risks catastrophic forgetting of base capabilities. | Reserve fine-tuning for style, schema, and domain syntax. Route factual queries through RAG. Maintain a clear separation of concerns. |
| Prompt Injection in RAG | Retrieved documents may contain adversarial instructions that override system prompts, especially when ingesting user-generated content or external APIs. | Sanitize retrieved text before injection. Use delimiter-based prompt framing. Implement output validation layers to detect instruction leakage. |
| Cost Misalignment | RAG token expansion and fine-tuning GPU hours scale non-linearly. Teams often underestimate prompt token costs or training iteration cycles. | Profile token usage per query. Implement tiered routing: lightweight models for formatting, larger models for complex reasoning. Track cost per successful response. |
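The hard token budget suggested for context window saturation can be sketched as a greedy selection over ranked chunks. This is one possible approach, not a prescribed implementation: `RetrievedChunk`, `estimate_tokens`, and `select_within_budget` are hypothetical names, and the four-characters-per-token estimate is a rough assumption that a real tokenizer should replace.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    text: str
    score: float  # similarity score from the vector store; higher is better


def estimate_tokens(text: str) -> int:
    # Crude heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def select_within_budget(chunks: list[RetrievedChunk], max_context_tokens: int) -> list[RetrievedChunk]:
    """Keep the highest-scoring chunks whose combined size fits the budget."""
    selected: list[RetrievedChunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost > max_context_tokens:
            continue  # this chunk would overflow; a smaller, lower-ranked one may still fit
        selected.append(chunk)
        used += cost
    return selected


if __name__ == "__main__":
    hits = [
        RetrievedChunk("a" * 400, 0.9),
        RetrievedChunk("b" * 400, 0.8),
        RetrievedChunk("c" * 40, 0.7),
    ]
    kept = select_within_budget(hits, max_context_tokens=150)
    print([round(c.score, 1) for c in kept])
```

Skipping an oversized chunk instead of stopping at it lets smaller chunks fill the remaining budget; stopping at the first overflow is the stricter alternative when rank order matters more than coverage.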
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Product documentation updated weekly | RAG | Instant sync avoids retraining cycles; citations remain traceable | Low (vector storage + prompt tokens) |
| Strict JSON output for API integration | Fine-Tuning | Weight adaptation enforces schema consistency without prompt engineering fragility | Medium (GPU hours or API fine-tune fee) |
| Customer support with volatile policies | RAG + Fine-Tuning | Fine-tune for tone/template; RAG injects current policy documents | High (combined infrastructure) |
| Internal code review assistant | Fine-Tuning | Learns repository-specific patterns and review conventions from historical PRs | Medium-High (requires curated PR→review pairs) |
| Compliance-heavy legal Q&A | RAG | Grounded retrieval enables source citation and audit trails; reduces liability | Low-Medium (depends on corpus size) |
Configuration Template
Use this template to scaffold a hybrid routing system that directs queries to the appropriate processing path based on metadata flags.
# llm_router_config.yaml
routing:
  default_model: "gpt-4o"
  fallback_model: "claude-sonnet-4-6"
  paths:
    retrieval:
      enabled: true
      vector_db: "qdrant"
      collection: "enterprise_docs_v2"
      top_k: 5
      max_context_tokens: 3000
      embedding_model: "text-embedding-3-small"
      cache_ttl_seconds: 3600
    adaptation:
      enabled: true
      # Directory written by save_pretrained_merged in the fine-tuning script
      model_path: "./final_model"
      max_tokens: 1024
      temperature: 0.1
      repetition_penalty: 1.1
      stop_sequences: ["</response>", "```"]
evaluation:
  metrics: ["latency_p95", "token_cost", "hallucination_score", "format_compliance"]
  alert_thresholds:
    latency_p95_ms: 800
    hallucination_score: 0.85
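A minimal sketch of how a router might consume this template, assuming the YAML has already been parsed into a dictionary with `paths` nested under `routing`, and that each request carries boolean metadata flags. The flag names `needs_current_facts` and `needs_strict_format` and the function `select_paths` are hypothetical; the template itself does not define them.

```python
# Stand-in for the parsed config; only the keys the router reads are shown.
CONFIG = {
    "routing": {
        "default_model": "gpt-4o",
        "paths": {
            "retrieval": {"enabled": True, "top_k": 5, "max_context_tokens": 3000},
            "adaptation": {"enabled": True, "temperature": 0.1},
        },
    }
}


def select_paths(flags: dict[str, bool], config: dict) -> list[str]:
    """Return the processing paths a request should pass through, in order."""
    paths = config["routing"]["paths"]
    wanted = {
        "retrieval": flags.get("needs_current_facts", False),   # volatile knowledge -> RAG
        "adaptation": flags.get("needs_strict_format", False),  # schema or style -> fine-tuned model
    }
    return [
        name
        for name, needed in wanted.items()
        if needed and paths.get(name, {}).get("enabled", False)
    ]


if __name__ == "__main__":
    print(select_paths({"needs_current_facts": True}, CONFIG))
    print(select_paths({"needs_current_facts": True, "needs_strict_format": True}, CONFIG))
    print(select_paths({}, CONFIG))  # empty list: fall through to default_model
```

An empty result means neither path applies and the request can go straight to `default_model`; disabling a path in the config removes it from every route without touching the calling code.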
Quick Start Guide
- Initialize the retrieval index: Run the chunking and embedding script against your document repository. Verify vector store connectivity and payload structure.
- Prepare fine-tuning data: Format input-output pairs into JSONL. Validate consistency using a sampling script. Remove outliers before training.
- Deploy the routing layer: Configure the YAML template with your API keys and model paths. Start the inference server with async request handling.
- Run baseline evaluation: Submit 50 test queries across factual, structural, and hybrid categories. Measure latency, token usage, and output compliance.
- Iterate on thresholds: Adjust top_k, context token limits, and temperature values based on evaluation results. Enable caching for high-frequency query patterns.
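The baseline evaluation step can be sketched with two small helpers: one for 95th-percentile latency and one for output compliance. This is an illustration under stated assumptions: the function names are hypothetical, the percentile uses the nearest-rank method, and "compliance" is taken to mean that the output parses as JSON, which may be narrower or broader than a given deployment requires.

```python
import json


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile: the smallest value covering 95% of samples."""
    if not values:
        raise ValueError("p95 requires at least one sample")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def format_compliance(outputs: list[str]) -> float:
    """Fraction of outputs that parse as valid JSON."""
    if not outputs:
        return 0.0
    valid = 0
    for output in outputs:
        try:
            json.loads(output)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(outputs)


if __name__ == "__main__":
    latencies_ms = [float(i) for i in range(1, 101)]
    print(p95(latencies_ms))
    print(format_compliance(['{"a": 1}', "not json", "[]", '{"b":']))
```

Comparing the `p95` result against `latency_p95_ms` from the configuration template gives a simple pass/fail signal for each evaluation run.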
Engineering LLM applications requires treating knowledge injection and behavioral adaptation as distinct system layers. RAG handles volatility and grounding. Fine-tuning enforces structure and style. Routing queries to the correct path based on data characteristics and output requirements prevents architectural overreach, controls inference costs, and maintains predictable quality at scale.