Fine-tuning vs RAG: a decision framework with examples

By Codcompass Team·2026-05-24·8 min read

LLM Architecture Patterns: Optimizing for Knowledge, Behavior, and Cost

Current Situation Analysis

Engineering teams frequently treat Retrieval-Augmented Generation (RAG) and Fine-Tuning as competing strategies, leading to architectural missteps that inflate costs or degrade user experience. This binary framing ignores the fundamental technical distinction: RAG modifies the context window at inference time, while Fine-Tuning modifies the model weights during training.

This misunderstanding causes two common failures:

The Knowledge Fine-Tune Trap: Teams attempt to inject dynamic facts (e.g., current pricing, recent security advisories) via fine-tuning. Since weights are static post-training, the model cannot access information beyond its cutoff, resulting in hallucinations or stale responses.
The Style Retrieval Fallacy: Teams rely on RAG to enforce output structure or tone. Retrieval surfaces content, but it cannot reliably teach a model to consistently output specific JSON schemas or adopt a distinct voice without behavioral training.

The industry overlooks that these techniques are orthogonal. The decision is not "RAG or Fine-Tuning," but rather a mapping of constraints—data volatility, latency budgets, format strictness, and corpus availability—to the appropriate mechanism. Production systems often require a hybrid approach, yet teams delay this realization until they hit scaling walls.

Data from production deployments indicates that RAG introduces a structural latency overhead of 50–200ms due to vector search and context processing. Conversely, fine-tuning reduces per-query token consumption by enabling shorter prompts and smaller models, but incurs significant setup costs. For example, training a gpt-4o-mini model on 10,000 examples costs approximately $40, while a RAG query on a 500-token input with 300 tokens of context consumes 800 tokens, costing roughly $0.008 at standard rates ($0.01/1k tokens). A fine-tuned model handling the same query might use only 400 tokens, reducing per-query cost to $0.004.
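The figures above imply a break-even point, which can be worked out directly. The sketch below is illustrative only: the `CostInputs` shape and `breakEvenQueries` name are hypothetical, and it assumes both approaches are billed at the same per-token rate, which may not hold for a given provider.

```typescript
// Illustrative figures taken from the paragraph above; real prices vary by model and provider.
interface CostInputs {
  setupCostUsd: number;      // one-time fine-tuning cost
  ragTokensPerQuery: number; // tokens per query with retrieved context
  ftTokensPerQuery: number;  // tokens per query with the fine-tuned model
  usdPer1kTokens: number;    // assumed identical rate for both approaches
}

// Number of queries after which the setup cost is paid back by per-query savings.
// Returns Infinity when fine-tuning saves no tokens.
function breakEvenQueries(c: CostInputs): number {
  const savedTokens = c.ragTokensPerQuery - c.ftTokensPerQuery;
  if (savedTokens <= 0) return Infinity;
  const savingPerQueryUsd = (savedTokens / 1000) * c.usdPer1kTokens;
  return c.setupCostUsd / savingPerQueryUsd;
}

const queries = breakEvenQueries({
  setupCostUsd: 40,
  ragTokensPerQuery: 800,
  ftTokensPerQuery: 400,
  usdPer1kTokens: 0.01,
});
console.log(Math.round(queries)); // 10000
```

Under these assumptions, the $40 training cost is recovered after roughly 10,000 queries, since each query saves about $0.004.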

WOW Moment: Key Findings

The following comparison reveals the trade-off surface. The critical insight is that Hybrid architectures dominate in regulated environments where both factual grounding and behavioral consistency are mandatory, despite higher complexity.

Strategy	Inference Latency	Per-Query Cost	Knowledge Freshness	Format Strictness	Best Use Case
RAG	+50–200ms	Higher (Context overhead)	Instant (Index update)	Low/Medium	Dynamic Q&A, Wikis
Fine-Tuning	Base	Lower (Optimized prompt)	Stale (Retrain required)	High	Classification, Style
Hybrid	+50–200ms	Medium	Instant	High	Enterprise Assistants, Compliance

Why this matters: The Hybrid approach allows you to decouple knowledge management from behavior enforcement. You can update your vector store instantly to reflect new regulations while maintaining a fine-tuned model that guarantees the output adheres to a strict JSON schema required by downstream systems. This pattern is essential for security operations centers (SOCs) and legal tech, where response structure is as critical as response accuracy.

Core Solution

Implementing the correct pattern requires a systematic evaluation of your constraints followed by targeted implementation. Below are production-grade TypeScript patterns for each approach.

1. RAG Implementation: Context Injection

RAG pipelines must prioritize retrieval accuracy and context management. The following pattern uses a class-based KnowledgeRetriever to encapsulate vector operations, ensuring clean separation between indexing and inference.

Architecture Decisions:

Model Selection: text-embedding-3-small is used for embeddings due to its cost-efficiency and performance on general domains. gpt-4o-mini serves as the inference engine for cost-effective generation.
Cosine Similarity: Implemented directly for transparency, though production systems may offload this to a vector database.
Context Limiting: The limit parameter of search (the top-k cutoff) caps how many documents reach the prompt, which helps prevent context window overflow and controls token costs.

import { OpenAI } from 'openai';

interface Document {
  id: string;
  content: string;
  metadata?: Record<string, string>;
  // Populated by buildVectorStore; absent on raw input documents.
  embedding?: number[];
}

interface SearchResult {
  score: number;
  document: Document;
}

export class KnowledgeRetriever {
  private client: OpenAI;

  constructor(apiKey: string) {
    this.client = new OpenAI({ apiKey });
  }

  async generateEmbedding(text: string): Promise<number[]> {
    const response = await this.client.embeddings.create({
      model: 'text-embedding-3-small',
      input: text,
    });
    return response.data[0].embedding;
  }

  async buildVectorStore(documents: Document[]): Promise<Document[]> {
    const enrichedDocs = await Promise.all(
      documents.map(async (doc) => ({
        ...doc,
        embedding: await this.generateEmbedding(doc.content),
      }))
    );
    return enrichedDocs;
  }

  async search(query: string, store: Document[], limit: number = 3): Promise<SearchResult[]> {
    const queryVector = await this.generateEmbedding(query);
    
    const scored = store.map((doc) => {
      const score = this.cosineSimilarity(queryVector, doc.embedding!);
      return { score, document: doc };
    });

    return scored
      .sort((a, b) => b.score - a.score)
      .slice(0, limit);
  }

  private cosineSimilarity(vecA: number[], vecB: number[]): number {
    const dotProduct = vecA.reduce((sum, val, i) => sum + val * vecB[i], 0);
    const normA = Math.sqrt(vecA.reduce((sum, val) => sum + val * val, 0));
    const normB = Math.sqrt(vecB.reduce((sum, val) => sum + val * val, 0));
    return dotProduct / (normA * normB);
  }
}

// Usage
const retriever = new KnowledgeRetriever(process.env.OPENAI_API_KEY!);
const docs: Document[] = [
  { id: 'reg-1', content: 'NIS2 applies to medium and large entities across 18 sectors.' },
  { id: 'reg-2', content: 'Fines for essential entities can reach €10M or 2% of turnover.' },
];

const store = await retriever.buildVectorStore(docs);
const results = await retriever.search('What are the penalties under NIS2?', store);

// Construct prompt with retrieved context
const contextBlock = results.map((r, i) => `[Source ${i + 1}] ${r.document.content}`).join('\n');
const prompt = `Sources:\n${contextBlock}\n\nQuestion: What are the penalties under NIS2?`;
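As a quick sanity check on the similarity measure used for ranking, the standalone sketch below recomputes cosine similarity on small hand-picked vectors. It is a variant written for illustration, not the class method itself: it adds a length check and returns 0 for zero-norm vectors, a case where the plain formula divides zero by zero and yields NaN.

```typescript
// Standalone variant for illustration; the guards are additions, not part of the class above.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // A zero vector has no direction; treat it as unrelated rather than returning NaN.
  return denom === 0 ? 0 : dot / denom;
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1 (same direction)
console.log(cosineSimilarity([1, 0], [0, 1])); // 0 (orthogonal)
console.log(cosineSimilarity([0, 0], [1, 2])); // 0 (zero-vector guard)
```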

2. Fine-Tuning Implementation: Behavioral Alignment

Fine-tuning requires rigorous dataset preparation. The following pattern demonstrates how to serialize training data and initiate a job. Note that the inference API remains identical to base models; only the model identifier changes.

Architecture Decisions:

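JSONL Format: The fine-tuning API expects training data as a JSONL file, with one JSON-encoded example per line.
Response Format: json_object mode makes the model return syntactically valid JSON, complementing the behavioral training that teaches the expected fields and values.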
Temperature: Set to 0 for deterministic classification tasks.

import fs from 'fs';
import { OpenAI } from 'openai';

interface TrainingExample {
  messages: Array<{ role: 'system' | 'user' | 'assistant'; content: string }>;
}

export class ModelTrainer {
  private client: OpenAI;

  constructor(apiKey: string) {
    this.client = new OpenAI({ apiKey });
  }

  async prepareDataset(examples: TrainingExample[], outputPath: string): Promise<string> {
    const jsonlContent = examples.map((ex) => JSON.stringify(ex)).join('\n');
    fs.writeFileSync(outputPath, jsonlContent);
    return outputPath;
  }

  async initiateTrainingJob(datasetPath: string, baseModel: string = 'gpt-4o-mini'): Promise<string> {
    const fileStream = fs.createReadStream(datasetPath);
    const uploadedFile = await this.client.files.create({
      file: fileStream,
      purpose: 'fine-tune',
    });

    const job = await this.client.fineTuning.jobs.create({
      training_file: uploadedFile.id,
      model: baseModel,
    });

    return job.id;
  }
}

// Inference with Fine-Tuned Model
async function classifyEvent(eventText: string, fineTunedModelId: string) {
  const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  
  const response = await client.chat.completions.create({
    model: fineTunedModelId, // e.g., ft:gpt-4o-mini:org:classifier:xyz
    messages: [
      // JSON mode expects the word "JSON" to appear somewhere in the messages.
      { role: 'system', content: 'Classify the event as: phishing, malware, brute_force, exfiltration. Respond in JSON.' },
      { role: 'user', content: eventText },
    ],
    temperature: 0,
    response_format: { type: 'json_object' },
  });

  return JSON.parse(response.choices[0].message.content!);
}
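JSON mode yields parseable JSON but does not by itself constrain which keys or labels come back, so downstream code may still want a guard. The sketch below is one way to do that; the `category` field name and the `parseClassification` helper are assumptions made for illustration, since the article does not specify the output shape.

```typescript
const EVENT_CATEGORIES = ['phishing', 'malware', 'brute_force', 'exfiltration'] as const;
type EventCategory = (typeof EVENT_CATEGORIES)[number];

// Hypothetical output shape: { "category": "<label>" }.
// Throws if the text is not JSON, not an object, or carries an unknown label.
function parseClassification(raw: string): EventCategory {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error('Model output is not valid JSON');
  }
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('Model output is not a JSON object');
  }
  const category = (parsed as Record<string, unknown>).category;
  if (typeof category !== 'string' || !EVENT_CATEGORIES.includes(category as EventCategory)) {
    throw new Error(`Unexpected category: ${String(category)}`);
  }
  return category as EventCategory;
}

console.log(parseClassification('{"category":"malware"}')); // malware
```

Rejected outputs can be logged and used as candidates for the next round of training examples.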

3. Hybrid Architecture: The Enterprise Standard

For systems requiring both dynamic knowledge and strict behavior, compose the patterns above. The retrieval step feeds context into the fine-tuned model.

async function hybridPipeline(
  query: string, 
  retriever: KnowledgeRetriever, 
  store: Document[], 
  modelId: string
) {
  // 1. Retrieve context
  const results = await retriever.search(query, store, 3);
  const context = results.map(r => r.document.content).join('\n');

  // 2. Generate with fine-tuned model
  const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  const response = await client.chat.completions.create({
    model: modelId,
    messages: [
      { role: 'system', content: 'Answer using the provided context. Output JSON only.' },
      { role: 'user', content: `Context: ${context}\n\nQuery: ${query}` },
    ],
    response_format: { type: 'json_object' },
  });

  return response.choices[0].message.content;
}

Pitfall Guide

Production deployments often fail due to subtle misalignments between technique and requirement. Avoid these common errors:

The "Knowledge Fine-Tune" Trap
- Explanation: Attempting to keep model facts current via fine-tuning. Weights are frozen once training ends, so anything that changes afterwards stays stale until the next retrain.
- Fix: Use RAG for any data that changes. Reserve fine-tuning for static behavioral patterns.
Latency Blindness
- Explanation: Ignoring the 50–200ms overhead introduced by vector search and context processing in RAG. This can violate p95 latency SLAs in real-time applications.
- Fix: Profile retrieval latency. If latency is critical, consider caching embeddings, using approximate nearest neighbor (ANN) indexes, or falling back to fine-tuning if knowledge is static.
Cost Myopia
- Explanation: Focusing only on per-query costs and ignoring setup expenses. Fine-tuning has high upfront costs ($40+ for training) but lower per-query costs. RAG has low setup but higher per-query costs due to context tokens.
- Fix: Calculate Total Cost of Ownership (TCO) based on projected query volume. High-volume applications may benefit from fine-tuning amortization.
Data Quality Neglect
- Explanation: Feeding noisy documents into RAG or low-quality examples into fine-tuning. Garbage in, garbage out applies to both.
- Fix: Implement data curation pipelines. For RAG, chunk documents intelligently and remove duplicates. For fine-tuning, ensure examples cover edge cases and follow the desired output format strictly.
Hybrid Over-Engineering
- Explanation: Implementing a hybrid architecture when RAG alone suffices. This adds complexity and cost without measurable benefit.
- Fix: Start with RAG. Only add fine-tuning when you observe consistent failures in format adherence or style, despite prompt engineering.
Context Window Overflow
- Explanation: Retrieving too many chunks or overly long documents, exceeding the model's context limit or degrading performance.
- Fix: Limit top_k results. Use chunking strategies that preserve semantic boundaries. Monitor token usage and implement truncation logic.
Ignoring Model Capabilities
- Explanation: Assuming all models support fine-tuning or handle long contexts equally.
- Fix: Verify model support. Use gpt-4o-mini for cost-effective fine-tuning and RAG. Ensure embedding models match the domain of your text.
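The truncation logic suggested under Context Window Overflow could be sketched as a token budget applied to ranked chunks. This is a minimal sketch under stated assumptions: `RankedChunk`, `estimateTokens`, and `fitToBudget` are hypothetical names, and the four-characters-per-token estimate is a rough heuristic for English text that a real tokenizer should replace.

```typescript
interface RankedChunk {
  content: string;
  score: number;
}

// Rough heuristic (assumption): about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keeps the highest-scoring chunks whose estimated tokens fit within the budget.
function fitToBudget(chunks: RankedChunk[], maxContextTokens: number): RankedChunk[] {
  const sorted = [...chunks].sort((a, b) => b.score - a.score);
  const kept: RankedChunk[] = [];
  let used = 0;
  for (const chunk of sorted) {
    const cost = estimateTokens(chunk.content);
    if (used + cost > maxContextTokens) continue; // skip chunks that would overflow
    kept.push(chunk);
    used += cost;
  }
  return kept;
}
```

Skipping an oversized chunk rather than stopping at it lets smaller, lower-ranked chunks still use the remaining budget; stopping at the first overflow is an equally reasonable policy if strict rank order matters more.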

Production Bundle

Action Checklist

Audit Data Volatility: Determine how frequently your knowledge base updates. If it changes weekly or more often, RAG is mandatory.
Define Latency SLA: Set the acceptable p95 latency and measure against it. If <100ms is required and knowledge is static, prioritize fine-tuning.
Calculate TCO: Estimate query volume and compare RAG token costs vs. fine-tuning setup costs.
Prototype RAG: Build a minimal RAG pipeline first. Measure retrieval accuracy and format adherence.
Evaluate Format Consistency: If RAG fails to produce consistent JSON or tone, prepare a fine-tuning dataset.
Implement Hybrid if Needed: Combine RAG for context and fine-tuning for behavior when both are required.
Monitor Drift: Track retrieval relevance and model performance over time. Retrain or re-index as needed.
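For the latency item in the checklist, a p95 figure can be computed from recorded request timings. The sketch below uses the nearest-rank method; the `percentile` helper and the sample values are hypothetical and only meant to show the calculation.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samplesMs: number[], p: number): number {
  if (samplesMs.length === 0) throw new Error('No samples');
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]');
  const sorted = [...samplesMs].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Hypothetical timings in milliseconds: 10, 20, ..., 200.
const timings = Array.from({ length: 20 }, (_, i) => (i + 1) * 10);
const p95 = percentile(timings, 95);
console.log(p95); // 190

// Compare against a budget such as the maxLatencyMs value of a pipeline config.
const withinSla = p95 <= 500;
console.log(withinSla); // true
```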

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Dynamic Pricing Bot	RAG	Prices change daily; RAG allows instant updates.	Low setup, higher per-query.
JSON API Classifier	Fine-Tuning	Strict format required; knowledge is static.	Medium setup, lower per-query.
Legal Compliance Assistant	Hybrid	Needs current regulations (RAG) and strict citation format (FT).	High setup, medium per-query.
Internal Wiki Search	RAG	Large corpus of docs; format is flexible.	Low setup, moderate per-query.
Security Event Triage	Hybrid	Threat intel updates (RAG); MITRE ATT&CK mapping requires consistency (FT).	High setup, medium per-query.
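The matrix above can be condensed into a small rule of thumb. The sketch below is a deliberate simplification with hypothetical names (`Requirements`, `recommendStrategy`); it looks only at knowledge volatility and format strictness and ignores latency and cost, which a real decision would also weigh.

```typescript
type Strategy = 'rag' | 'fine-tune' | 'hybrid';

interface Requirements {
  knowledgeChanges: boolean; // facts change after deployment
  strictFormat: boolean;     // output must follow a fixed schema or voice
}

// Maps the two main constraints to a starting strategy.
// Defaults to RAG when neither constraint applies, in line with "start with RAG".
function recommendStrategy(req: Requirements): Strategy {
  if (req.knowledgeChanges && req.strictFormat) return 'hybrid';
  if (req.strictFormat) return 'fine-tune';
  return 'rag';
}

console.log(recommendStrategy({ knowledgeChanges: true, strictFormat: false })); // rag
console.log(recommendStrategy({ knowledgeChanges: false, strictFormat: true })); // fine-tune
console.log(recommendStrategy({ knowledgeChanges: true, strictFormat: true }));  // hybrid
```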

Configuration Template

Use this TypeScript configuration to parameterize your pipeline logic, enabling easy switching between strategies based on environment or feature flags.

export interface PipelineConfig {
  strategy: 'rag' | 'fine-tune' | 'hybrid';
  model: {
    inference: string;
    embedding: string;
    fineTunedId?: string;
  };
  retrieval: {
    topK: number;
    minScore: number;
  };
  constraints: {
    maxLatencyMs: number;
    enforceJson: boolean;
  };
}

export const defaultConfig: PipelineConfig = {
  strategy: 'rag',
  model: {
    inference: 'gpt-4o-mini',
    embedding: 'text-embedding-3-small',
  },
  retrieval: {
    topK: 3,
    minScore: 0.75,
  },
  constraints: {
    maxLatencyMs: 500,
    enforceJson: false,
  },
};
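The retrieval settings in this template only take effect if the pipeline applies them. As a sketch of how topK and minScore might be used on scored search results, consider the helper below; `selectResults` and the local `Scored` type are hypothetical and written to be self-contained rather than tied to the classes shown earlier.

```typescript
interface RetrievalSettings {
  topK: number;
  minScore: number;
}

interface Scored<T> {
  score: number;
  document: T;
}

// Drops results below minScore, then keeps the topK highest-scoring ones.
function selectResults<T>(results: Scored<T>[], settings: RetrievalSettings): Scored<T>[] {
  return results
    .filter((r) => r.score >= settings.minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, settings.topK);
}

const candidates: Scored<string>[] = [
  { score: 0.5, document: 'weak match' },
  { score: 0.9, document: 'strong match' },
  { score: 0.8, document: 'good match' },
];
const selected = selectResults(candidates, { topK: 3, minScore: 0.75 });
console.log(selected.map((r) => r.document)); // [ 'strong match', 'good match' ]
```

An empty result after filtering is a useful signal in its own right: the pipeline can answer that no relevant source was found instead of passing weak context to the model.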

Quick Start Guide

Define Constraints: List your requirements for knowledge freshness, latency, and output format.
Build RAG Prototype: Implement the KnowledgeRetriever pattern with a sample dataset. Test retrieval accuracy.
Evaluate Behavior: Run queries and check if the output meets format and style requirements. If not, proceed to step 4.
Prepare Fine-Tuning Data: Create a JSONL dataset with 50+ high-quality examples demonstrating the desired behavior.
Train and Test: Initiate a fine-tuning job using ModelTrainer. Once complete, test the fine-tuned model with and without RAG to determine the optimal strategy.
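Before uploading a dataset, a few structural checks can catch problems early. The sketch below is an assumption-laden example, not an official validator: `validateExample` and `validateDataset` are hypothetical helpers, the rules reflect common chat-format expectations, and the default of 50 comes from the guideline above.

```typescript
type Role = 'system' | 'user' | 'assistant';

interface ChatMessage {
  role: Role;
  content: string;
}

interface TrainingExample {
  messages: ChatMessage[];
}

// Returns human-readable problems for one example; an empty list means these checks passed.
function validateExample(ex: TrainingExample, index: number): string[] {
  const problems: string[] = [];
  if (ex.messages.length === 0) {
    problems.push(`#${index}: no messages`);
    return problems;
  }
  if (!ex.messages.some((m) => m.role === 'user')) {
    problems.push(`#${index}: no user message`);
  }
  const last = ex.messages[ex.messages.length - 1];
  if (last.role !== 'assistant') {
    problems.push(`#${index}: last message should be the assistant's target output`);
  }
  if (ex.messages.some((m) => m.content.trim() === '')) {
    problems.push(`#${index}: empty message content`);
  }
  return problems;
}

// Checks dataset size and every example; minExamples defaults to the 50+ guideline.
function validateDataset(examples: TrainingExample[], minExamples = 50): string[] {
  const problems: string[] = [];
  if (examples.length < minExamples) {
    problems.push(`only ${examples.length} examples, expected at least ${minExamples}`);
  }
  examples.forEach((ex, i) => problems.push(...validateExample(ex, i)));
  return problems;
}
```

These checks cover structure only; whether the examples actually demonstrate the desired behavior still needs human review.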
