- Model Selection: text-embedding-3-small is used for embeddings because it is inexpensive and performs well on general-domain text. gpt-4o-mini serves as the inference engine for the same cost reasons.
- Cosine Similarity: Implemented directly for transparency, though production systems may offload this to a vector database.
- Context Limiting: The `top_k` parameter (the `limit` argument of `search` in the code) caps how many documents reach the prompt, which prevents context window overflow and controls token costs.
```typescript
import { OpenAI } from 'openai';

interface Document {
  id: string;
  content: string;
  metadata?: Record<string, string>;
  // Populated by buildVectorStore; absent on raw input documents.
  embedding?: number[];
}

interface SearchResult {
  score: number;
  document: Document;
}

export class KnowledgeRetriever {
  private client: OpenAI;

  constructor(apiKey: string) {
    this.client = new OpenAI({ apiKey });
  }

  // Embed a single string with text-embedding-3-small.
  async generateEmbedding(text: string): Promise<number[]> {
    const response = await this.client.embeddings.create({
      model: 'text-embedding-3-small',
      input: text,
    });
    return response.data[0].embedding;
  }

  // Attach an embedding to every document (one API call per document).
  async buildVectorStore(documents: Document[]): Promise<Document[]> {
    return Promise.all(
      documents.map(async (doc) => ({
        ...doc,
        embedding: await this.generateEmbedding(doc.content),
      }))
    );
  }

  // Score every stored document against the query and keep the `limit` best matches.
  async search(query: string, store: Document[], limit: number = 3): Promise<SearchResult[]> {
    const queryVector = await this.generateEmbedding(query);
    const scored = store
      .filter((doc) => doc.embedding !== undefined) // skip documents that were never embedded
      .map((doc) => ({
        score: this.cosineSimilarity(queryVector, doc.embedding!),
        document: doc,
      }));
    return scored.sort((a, b) => b.score - a.score).slice(0, limit);
  }

  private cosineSimilarity(vecA: number[], vecB: number[]): number {
    const dotProduct = vecA.reduce((sum, val, i) => sum + val * vecB[i], 0);
    const normA = Math.sqrt(vecA.reduce((sum, val) => sum + val * val, 0));
    const normB = Math.sqrt(vecB.reduce((sum, val) => sum + val * val, 0));
    // Guard against zero vectors, which would otherwise produce NaN.
    if (normA === 0 || normB === 0) return 0;
    return dotProduct / (normA * normB);
  }
}

// Usage (top-level await requires an ES module)
const retriever = new KnowledgeRetriever(process.env.OPENAI_API_KEY!);
const docs: Document[] = [
  { id: 'reg-1', content: 'NIS2 applies to medium and large entities across 18 sectors.' },
  { id: 'reg-2', content: 'Fines for essential entities can reach €10M or 2% of turnover.' },
];
const store = await retriever.buildVectorStore(docs);
const results = await retriever.search('What are the penalties under NIS2?', store);

// Construct prompt with retrieved context
const contextBlock = results.map((r, i) => `[Source ${i + 1}] ${r.document.content}`).join('\n');
const prompt = `Sources:\n${contextBlock}\n\nQuestion: What are the penalties under NIS2?`;
```
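To see how the ranking step behaves without calling the API, the sketch below applies the same cosine-similarity formula to hand-written three-dimensional vectors. The vectors, the document IDs, and the `rankBySimilarity` helper are illustrative assumptions; real embeddings have far more dimensions.

```typescript
// Toy stand-in for the retriever's ranking step; no API calls involved.
interface ToyDoc {
  id: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, val, i) => sum + val * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, val) => sum + val * val, 0));
  const normB = Math.sqrt(b.reduce((sum, val) => sum + val * val, 0));
  if (normA === 0 || normB === 0) return 0; // zero vectors carry no direction
  return dot / (normA * normB);
}

// Hypothetical helper mirroring the search method: score, sort, slice.
function rankBySimilarity(
  query: number[],
  docs: ToyDoc[],
  limit: number
): { id: string; score: number }[] {
  return docs
    .map((doc) => ({ id: doc.id, score: cosineSimilarity(query, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}

const toyDocs: ToyDoc[] = [
  { id: 'scope', embedding: [1, 0, 0] },
  { id: 'fines', embedding: [0, 1, 0] },
  { id: 'mixed', embedding: [1, 1, 0] },
];

// The query points mostly along the second axis, so 'fines' ranks first,
// followed by 'mixed'; 'scope' is cut off by the limit of 2.
const ranked = rankBySimilarity([0.1, 0.9, 0], toyDocs, 2);
```

Because cosine similarity ignores vector length, 'mixed' still scores well despite having a larger norm than the query.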
2. Fine-Tuning Implementation: Behavioral Alignment
Fine-tuning requires rigorous dataset preparation. The following pattern demonstrates how to serialize training data and initiate a job. Note that the inference API remains identical to base models; only the model identifier changes.
Architecture Decisions:
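- JSONL Format: Required by the fine-tuning API; each line of the training file is one JSON-encoded example.
- Response Format: Enforcing `json_object` constrains the model to emit valid JSON, complementing the behavioral training. JSON mode also requires the prompt itself to mention JSON.
- Temperature: Set to 0 so that classification output is as repeatable as possible.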
```typescript
import fs from 'fs';
import { OpenAI } from 'openai';

interface TrainingExample {
  messages: Array<{ role: 'system' | 'user' | 'assistant'; content: string }>;
}

export class ModelTrainer {
  private client: OpenAI;

  constructor(apiKey: string) {
    this.client = new OpenAI({ apiKey });
  }

  // Serialize examples as JSONL: one JSON object per line.
  async prepareDataset(examples: TrainingExample[], outputPath: string): Promise<string> {
    const jsonlContent = examples.map((ex) => JSON.stringify(ex)).join('\n');
    await fs.promises.writeFile(outputPath, jsonlContent);
    return outputPath;
  }

  // Upload the dataset and start a fine-tuning job; returns the job ID.
  // Check the list of fine-tunable models first: a dated snapshot name may be
  // required in place of the 'gpt-4o-mini' alias.
  async initiateTrainingJob(datasetPath: string, baseModel: string = 'gpt-4o-mini'): Promise<string> {
    const fileStream = fs.createReadStream(datasetPath);
    const uploadedFile = await this.client.files.create({
      file: fileStream,
      purpose: 'fine-tune',
    });
    const job = await this.client.fineTuning.jobs.create({
      training_file: uploadedFile.id,
      model: baseModel,
    });
    return job.id;
  }
}

// Inference with Fine-Tuned Model
async function classifyEvent(eventText: string, fineTunedModelId: string) {
  const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  const response = await client.chat.completions.create({
    model: fineTunedModelId, // e.g., ft:gpt-4o-mini:org:classifier:xyz
    messages: [
      {
        role: 'system',
        // JSON mode rejects requests whose messages never mention JSON.
        content:
          'Classify the event as one of: phishing, malware, brute_force, exfiltration. Respond with a JSON object.',
      },
      { role: 'user', content: eventText },
    ],
    temperature: 0,
    response_format: { type: 'json_object' },
  });
  const content = response.choices[0].message.content;
  if (content === null) throw new Error('Model returned no content');
  return JSON.parse(content);
}
```
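Since fine-tuning depends on dataset quality, it can help to validate examples before serializing them. The following is a minimal sketch, not the API's own validation: the checks (non-empty content, assistant message last), the helper names, and the `category` key in the sample answers are assumptions made for illustration.

```typescript
type Role = 'system' | 'user' | 'assistant';

interface ChatMessage {
  role: Role;
  content: string;
}

interface TrainingExample {
  messages: ChatMessage[];
}

// Returns a list of problems; an empty list means the example passed these checks.
function validateExample(example: TrainingExample): string[] {
  const problems: string[] = [];
  if (example.messages.length === 0) {
    problems.push('no messages');
    return problems;
  }
  if (example.messages.some((m) => m.content.trim() === '')) {
    problems.push('empty message content');
  }
  const last = example.messages[example.messages.length - 1];
  if (last.role !== 'assistant') {
    problems.push('last message should be the assistant answer to learn from');
  }
  return problems;
}

// Serialize only after every example passes; one JSON object per line.
function toJsonl(examples: TrainingExample[]): string {
  examples.forEach((example, index) => {
    const problems = validateExample(example);
    if (problems.length > 0) {
      throw new Error(`Example ${index}: ${problems.join('; ')}`);
    }
  });
  return examples.map((example) => JSON.stringify(example)).join('\n');
}

// Sample data; the "category" key is a hypothetical output schema.
const sampleExamples: TrainingExample[] = [
  {
    messages: [
      { role: 'system', content: 'Classify the event. Respond with a JSON object.' },
      { role: 'user', content: 'Multiple failed SSH logins from one IP.' },
      { role: 'assistant', content: '{"category":"brute_force"}' },
    ],
  },
  {
    messages: [
      { role: 'system', content: 'Classify the event. Respond with a JSON object.' },
      { role: 'user', content: 'Email asks the user to confirm credentials via a link.' },
      { role: 'assistant', content: '{"category":"phishing"}' },
    ],
  },
];

const jsonl = toJsonl(sampleExamples);
```

Failing fast on the first invalid example keeps a malformed file from being uploaded and billed as a training run.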
3. Hybrid Architecture: The Enterprise Standard
For systems requiring both dynamic knowledge and strict behavior, compose the patterns above. The retrieval step feeds context into the fine-tuned model.
```typescript
// Reuses KnowledgeRetriever, Document, and the OpenAI import from the sections above.
async function hybridPipeline(
  query: string,
  retriever: KnowledgeRetriever,
  store: Document[],
  modelId: string
) {
  // 1. Retrieve context
  const results = await retriever.search(query, store, 3);
  const context = results.map((r) => r.document.content).join('\n');

  // 2. Generate with fine-tuned model
  const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
  const response = await client.chat.completions.create({
    model: modelId,
    messages: [
      { role: 'system', content: 'Answer using the provided context. Output JSON only.' },
      { role: 'user', content: `Context: ${context}\n\nQuery: ${query}` },
    ],
    response_format: { type: 'json_object' },
  });
  return response.choices[0].message.content;
}
```
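Both `classifyEvent` and `hybridPipeline` hand back model-generated JSON. JSON mode constrains syntax, not schema, so a caller may want to check the shape before trusting it. The sketch below assumes the classifier returns an object with a `category` key; that key and the `parseClassification` helper are hypothetical, and the label set is taken from the system prompt above.

```typescript
const CATEGORIES = ['phishing', 'malware', 'brute_force', 'exfiltration'] as const;
type Category = (typeof CATEGORIES)[number];

interface Classification {
  category: Category;
}

// Parse raw message content and verify it against the expected shape.
function parseClassification(raw: string | null): Classification {
  if (raw === null) throw new Error('Model returned no content');

  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (e) {
    throw new Error('Model output is not valid JSON');
  }

  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('Model output is not a JSON object');
  }

  const category = (parsed as Record<string, unknown>).category;
  if (typeof category !== 'string' || !CATEGORIES.some((c) => c === category)) {
    throw new Error(`Unexpected category: ${String(category)}`);
  }
  return { category: category as Category };
}

const parsedOk = parseClassification('{"category":"malware"}');
```

Rejecting unknown labels at this boundary keeps a drifting model from silently feeding bad values into downstream triage logic.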
Pitfall Guide
Production deployments often fail due to subtle misalignments between technique and requirement. Avoid these common errors:
- The "Knowledge Fine-Tune" Trap
  - Explanation: Attempting to update the model's facts via fine-tuning. Fine-tuning shapes behavior far more reliably than it injects facts, and whatever the model does absorb is frozen when training ends, so it cannot pick up information added afterwards.
  - Fix: Use RAG for any data that changes. Reserve fine-tuning for static behavioral patterns.
- Latency Blindness
  - Explanation: Ignoring the 50–200ms overhead introduced by vector search and context processing in RAG. This can violate p95 latency SLAs in real-time applications.
  - Fix: Profile retrieval latency. If latency is critical, consider caching embeddings, using approximate nearest neighbor (ANN) indexes, or falling back to fine-tuning if knowledge is static.
- Cost Myopia
  - Explanation: Focusing only on per-query costs and ignoring setup expenses. Fine-tuning has high upfront costs ($40+ for training) but lower per-query costs. RAG has low setup but higher per-query costs due to context tokens.
  - Fix: Calculate Total Cost of Ownership (TCO) based on projected query volume. High-volume applications may benefit from fine-tuning amortization.
- Data Quality Neglect
  - Explanation: Feeding noisy documents into RAG or low-quality examples into fine-tuning. Garbage in, garbage out applies to both.
  - Fix: Implement data curation pipelines. For RAG, chunk documents intelligently and remove duplicates. For fine-tuning, ensure examples cover edge cases and follow the desired output format strictly.
- Hybrid Over-Engineering
  - Explanation: Implementing a hybrid architecture when RAG alone suffices. This adds complexity and cost without measurable benefit.
  - Fix: Start with RAG. Only add fine-tuning when you observe consistent failures in format adherence or style, despite prompt engineering.
- Context Window Overflow
  - Explanation: Retrieving too many chunks or overly long documents, exceeding the model's context limit or degrading performance.
  - Fix: Limit `top_k` results. Use chunking strategies that preserve semantic boundaries. Monitor token usage and implement truncation logic.
- Ignoring Model Capabilities
  - Explanation: Assuming all models support fine-tuning or handle long contexts equally.
  - Fix: Verify that the chosen model supports fine-tuning and check its context limit before committing to it. Use `gpt-4o-mini` for cost-effective fine-tuning and RAG. Ensure embedding models match the domain of your text.
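The truncation logic mentioned under Context Window Overflow can be sketched as a token budget applied to ranked chunks. This is a rough illustration: the four-characters-per-token estimate is a heuristic assumed here for English text, and `fitToBudget` is a hypothetical helper. A production system would count tokens with the model's actual tokenizer.

```typescript
// Rough heuristic (assumption): about 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep chunks in ranked order until the next one would exceed the budget.
function fitToBudget(rankedChunks: string[], maxContextTokens: number): string[] {
  const kept: string[] = [];
  let used = 0;
  for (const chunk of rankedChunks) {
    const cost = estimateTokens(chunk);
    if (used + cost > maxContextTokens) break; // stop at the first chunk that does not fit
    kept.push(chunk);
    used += cost;
  }
  return kept;
}

// Three chunks of roughly 100 estimated tokens each, best match first.
const chunks = ['a'.repeat(400), 'b'.repeat(400), 'c'.repeat(400)];

// With a budget of 250 tokens only the first two chunks are kept.
const selected = fitToBudget(chunks, 250);
```

Stopping at the first chunk that does not fit preserves ranking order; skipping it and trying smaller chunks further down is an alternative that trades relevance for fuller use of the budget.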
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Dynamic Pricing Bot | RAG | Prices change daily; RAG allows instant updates. | Low setup, higher per-query. |
| JSON API Classifier | Fine-Tuning | Strict format required; knowledge is static. | Medium setup, lower per-query. |
| Legal Compliance Assistant | Hybrid | Needs current regulations (RAG) and strict citation format (FT). | High setup, medium per-query. |
| Internal Wiki Search | RAG | Large corpus of docs; format is flexible. | Low setup, moderate per-query. |
| Security Event Triage | Hybrid | Threat intel updates (RAG); MITRE ATT&CK mapping requires consistency (FT). | High setup, medium per-query. |
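The Cost Impact column can be made concrete with a break-even calculation. The sketch below is illustrative only: the $40 setup figure echoes the training cost quoted in the Pitfall Guide, while the per-query prices and the `CostProfile` and `breakEvenQueries` names are assumptions invented for the example.

```typescript
interface CostProfile {
  setupUsd: number; // one-off cost, such as a training run
  perQueryUsd: number; // marginal cost of a single request
}

function totalCost(profile: CostProfile, queries: number): number {
  return profile.setupUsd + profile.perQueryUsd * queries;
}

// Smallest query count at which the high-setup option is no more expensive
// than the low-setup one. Returns Infinity if it never catches up.
function breakEvenQueries(lowSetup: CostProfile, highSetup: CostProfile): number {
  const setupGap = highSetup.setupUsd - lowSetup.setupUsd;
  const perQuerySaving = lowSetup.perQueryUsd - highSetup.perQueryUsd;
  if (setupGap <= 0) return 0; // no extra upfront cost to recover
  if (perQuerySaving <= 0) return Infinity; // no per-query saving to recover it with
  return Math.ceil(setupGap / perQuerySaving);
}

// Made-up per-query prices; substitute measured values from your own traffic.
const rag: CostProfile = { setupUsd: 0, perQueryUsd: 0.002 };
const fineTuned: CostProfile = { setupUsd: 40, perQueryUsd: 0.0005 };

// 40 / (0.002 - 0.0005) is about 26,667 queries.
const breakEven = breakEvenQueries(rag, fineTuned);
```

Below the break-even volume the low-setup option is cheaper; above it, the upfront cost is amortized and the high-setup option wins.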
Configuration Template
Use this TypeScript configuration to parameterize your pipeline logic, enabling easy switching between strategies based on environment or feature flags.
```typescript
export interface PipelineConfig {
  strategy: 'rag' | 'fine-tune' | 'hybrid';
  model: {
    inference: string;
    embedding: string;
    fineTunedId?: string;
  };
  retrieval: {
    topK: number;
    minScore: number;
  };
  constraints: {
    maxLatencyMs: number;
    enforceJson: boolean;
  };
}

export const defaultConfig: PipelineConfig = {
  strategy: 'rag',
  model: {
    inference: 'gpt-4o-mini',
    embedding: 'text-embedding-3-small',
  },
  retrieval: {
    topK: 3,
    minScore: 0.75,
  },
  constraints: {
    maxLatencyMs: 500,
    enforceJson: false,
  },
};
```
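One way the `retrieval` and `strategy` fields might be consumed is sketched below. The helper names (`selectContext`, `needsRetrieval`, `needsFineTunedModel`) and the local types are hypothetical; the template itself does not prescribe how the settings are applied.

```typescript
interface RetrievalSettings {
  topK: number;
  minScore: number;
}

interface ScoredChunk {
  score: number;
  content: string;
}

// Drop weak matches first, then keep the topK strongest.
function selectContext(results: ScoredChunk[], settings: RetrievalSettings): ScoredChunk[] {
  return results
    .filter((r) => r.score >= settings.minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, settings.topK);
}

type Strategy = 'rag' | 'fine-tune' | 'hybrid';

// Retrieval runs for 'rag' and 'hybrid'.
function needsRetrieval(strategy: Strategy): boolean {
  return strategy !== 'fine-tune';
}

// A fine-tuned model ID is needed for 'fine-tune' and 'hybrid'.
function needsFineTunedModel(strategy: Strategy): boolean {
  return strategy !== 'rag';
}

const candidates: ScoredChunk[] = [
  { score: 0.6, content: 'weak match' },
  { score: 0.9, content: 'best match' },
  { score: 0.76, content: 'borderline match' },
  { score: 0.8, content: 'good match' },
];

// With topK = 2 and minScore = 0.75, only the 0.9 and 0.8 chunks survive.
const contextChunks = selectContext(candidates, { topK: 2, minScore: 0.75 });
```

Filtering by `minScore` before slicing means a query with no strong matches yields an empty context, which the caller can treat as a signal to answer "not found" instead of padding the prompt with irrelevant text.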
Quick Start Guide
1. Define Constraints: List your requirements for knowledge freshness, latency, and output format.
2. Build RAG Prototype: Implement the `KnowledgeRetriever` pattern with a sample dataset. Test retrieval accuracy.
3. Evaluate Behavior: Run queries and check if the output meets format and style requirements. If not, proceed to step 4.
4. Prepare Fine-Tuning Data: Create a JSONL dataset with 50+ high-quality examples demonstrating the desired behavior.
5. Train and Test: Initiate a fine-tuning job using `ModelTrainer`. Once complete, test the fine-tuned model with and without RAG to determine the optimal strategy.
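For the "test retrieval accuracy" step, a simple starting metric is the hit rate at k: the share of test questions whose expected document appears among the top k results. The sketch below assumes a small hand-labeled evaluation set; the `EvalCase` shape, the `hitRateAtK` helper, and the retrieved ID lists are hypothetical stand-ins for real retriever output.

```typescript
interface EvalCase {
  question: string;
  expectedDocId: string; // the document a correct retrieval should surface
  retrievedDocIds: string[]; // IDs returned by the retriever, best match first
}

// Fraction of cases whose expected document appears in the top k results.
function hitRateAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter((c) =>
    c.retrievedDocIds.slice(0, k).some((id) => id === c.expectedDocId)
  ).length;
  return hits / cases.length;
}

// Hand-written example results, not real retriever output.
const evalCases: EvalCase[] = [
  {
    question: 'What are the penalties under NIS2?',
    expectedDocId: 'reg-2',
    retrievedDocIds: ['reg-2', 'reg-1'],
  },
  {
    question: 'Which entities does NIS2 cover?',
    expectedDocId: 'reg-1',
    retrievedDocIds: ['reg-2', 'reg-1'],
  },
];

const hitAt1 = hitRateAtK(evalCases, 1); // 0.5: only the first case hits at rank 1
const hitAt2 = hitRateAtK(evalCases, 2); // 1: both expected documents are in the top 2
```

A large gap between the hit rate at 1 and at the configured `topK` suggests the right documents are being found but ranked too low.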