or vector operations and inline AI execution. The TimescaleDB distribution provides production-ready extensions without additional orchestration.
-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS ai CASCADE;
The vector extension handles high-dimensional storage and similarity search. The ai extension (pgai) exposes Ollama as a native SQL interface, allowing embedding and generation to execute within the database engine. This removes network serialization, connection pooling complexity, and cross-service fault tolerance concerns.
Step 2: Schema Design & Vectorization Strategy
Design a single source-of-truth table. Avoid separating raw text from embeddings unless you have strict compliance requirements requiring data partitioning.
CREATE TABLE knowledge_store (
doc_id TEXT PRIMARY KEY,
raw_text TEXT NOT NULL,
doc_vector VECTOR(768) NOT NULL,
indexed_at TIMESTAMPTZ DEFAULT NOW()
);
For ingestion, you have two patterns. Inline embedding works for batch loads or low-frequency updates. The vectorizer pattern is mandatory for production-scale continuous ingestion.
-- Inline embedding at insert time
INSERT INTO knowledge_store (doc_id, raw_text, doc_vector)
VALUES (
'sys-arch-001',
'Stateless microservices require external session stores for horizontal scaling.',
ai.ollama_embed('nomic-embed-text', 'Stateless microservices require external session stores for horizontal scaling.')::vector
);
-- Production vectorizer configuration
SELECT ai.create_vectorizer(
'knowledge_store'::regclass,
embedding => ai.embedding_ollama('nomic-embed-text', 768),
chunking => ai.chunking_recursive_character_text_splitter('raw_text', 512, 64),
scheduling => ai.scheduling_continuous()
);
The vectorizer spawns a background worker that watches the base table, chunks text, generates embeddings, and maintains the vector column asynchronously. This prevents insert latency spikes and ensures the embedding pipeline scales independently of write throughput.
Step 3: Unified Retrieval & Generation
Combine semantic search and LLM inference into a single database round trip. The <=> operator computes cosine distance. Subtracting it from 1 yields a similarity score suitable for ranking.
-- Context retrieval
SELECT doc_id, raw_text,
1 - (doc_vector <=> ai.ollama_embed('nomic-embed-text', $1)::vector) AS similarity
FROM knowledge_store
ORDER BY doc_vector <=> ai.ollama_embed('nomic-embed-text', $1)::vector
LIMIT 5;
Pass the retrieved context directly to the generation function. pgai returns structured JSON; extract the payload using PostgreSQL operators.
-- Inline generation
SELECT ai.ollama_generate('gemma2:9b',
'Context:' || E'\n' || $1 || E'\n\nQuestion: ' || $2 || E'\nAnswer:'
)->>'response' AS generated_output;
Wrapping these in a PL/pgSQL function or a Python adapter standardizes the interface and enables consistent prompt templating.
Step 4: Drift Detection Pipeline
Attach a monitoring layer that captures every inference cycle. The ragvitals library provides a dimension-aware detector that compares live traces against a rolling baseline.
from ragvitals import (
Detector, EmbeddingDrift, InMemorySink, JudgeDrift,
QueryDistribution, ResponseQuality, RetrievalRelevance, Trace
)
from datetime import datetime
class SemanticMonitor:
def __init__(self, k: int = 5):
self.detector = Detector(
dimensions=[
QueryDistribution(),
RetrievalRelevance(metric="hit_rate", k=k),
EmbeddingDrift(),
ResponseQuality(score_keys=["faithfulness", "relevance"]),
JudgeDrift(score_key="faithfulness")
],
sinks=[InMemorySink(window_size=1000)]
)
self.k = k
def calibrate(self, reference_embeddings: list, baseline_scores: dict):
self.detector.dimensions[0].set_reference(reference_embeddings)
self.detector.dimensions[2].set_reference(reference_embeddings)
self.detector.dimensions[4].set_reference(baseline_scores)
def record(self, query: str, query_vec: list, retrieved: list,
answer: str, judge_scores: dict, metadata: dict):
trace = Trace(
timestamp=datetime.utcnow(),
query=query,
query_embedding=query_vec,
retrieval_scores=[item[2] for item in retrieved],
relevance_labels=[1] + [0] * (len(retrieved) - 1),
response=answer,
judge_scores=judge_scores,
metadata=metadata
)
self.detector.ingest(trace)
def generate_report(self) -> dict:
return self.detector.report()
Initialize the monitor once per deployment cycle. Feed it traces from every production query. The detector maintains statistical baselines and flags deviations across each dimension independently. When you swap gemma2:9b for llama3.1:8b, only ResponseQuality and JudgeDrift will trigger. When you update the corpus, QueryDistribution and RetrievalRelevance will shift. The architecture turns component changes into measurable signals.
Pitfall Guide
1. Embedding Dimension Mismatch
Explanation: Hardcoding vector dimensions in the schema while using an embedder that outputs a different size causes silent truncation or insertion failures.
Fix: Query the embedder's output dimension programmatically before schema creation. Use ai.embedding_ollama('model-name', dimension) to enforce consistency. Validate dimensions during vectorizer initialization.
2. Synchronous Generation Blocking
Explanation: Running LLM generation inside the same transaction as retrieval blocks database connections while the model generates tokens. This collapses connection pools under concurrent load.
Fix: Decouple retrieval and generation. Fetch context synchronously, then dispatch generation to an async worker pool or message queue. Use pgai for retrieval only, and route generation through a dedicated inference service when QPS exceeds local GPU capacity.
3. Uninitialized Baselines
Explanation: Feeding live traces into the detector without setting reference distributions causes immediate false positives. The system flags normal variance as drift because it lacks a baseline.
Fix: Run a calibration phase with 500-1000 representative queries before enabling alerts. Store the reference embeddings and score distributions in a separate configuration table. Update baselines only during scheduled maintenance windows.
4. Cosine Similarity for Structured Data
Explanation: Pure vector search struggles with exact keyword matching, code syntax, or contractual clauses where semantic proximity is less important than lexical precision.
Fix: Implement hybrid search. Combine pgvector cosine similarity with pg_search or a dedicated Elasticsearch shard for BM25 scoring. Weight the results using a reciprocal rank fusion algorithm before passing to the generator.
5. Missing Multi-Tenant Isolation
Explanation: Storing all documents in a single table exposes cross-tenant data during retrieval. Vector indexes do not enforce access control.
Fix: Enable PostgreSQL Row-Level Security (RLS). Add a tenant_id column and create policies that filter queries by session context. Alternatively, use schema-per-tenant for strict compliance requirements. Never rely on application-layer filtering alone.
6. Prompt Template Hardcoding
Explanation: Embedding prompt strings directly in SQL functions or Python code makes versioning impossible. A/B testing prompt variations requires code deployments.
Fix: Store prompts in a prompt_registry table with versioning, effective dates, and rollback capabilities. Fetch the active template at runtime. Log which prompt version was used in the drift metadata for correlation analysis.
7. Treating Drift as Binary Failure
Explanation: Alerting on any deviation causes alert fatigue. Minor statistical noise triggers pages that require no action.
Fix: Implement threshold bands. Use warn for deviations within 1.5 standard deviations and degraded for deviations beyond 3 standard deviations. Correlate drift spikes with deployment timestamps before triggering incident response.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| < 100k documents, single tenant | Inline pgai embedding + retrieval | Lowest operational overhead, single round trip | Minimal infrastructure cost |
| > 1M documents, high write volume | Vectorizer background worker + pgvectorscale | Prevents insert latency, scales to billions of vectors | Moderate storage cost, negligible compute |
| Strict compliance / multi-tenant | Schema-per-tenant + RLS policies | Guarantees data isolation, simplifies audit trails | Higher operational complexity |
| Code/contract retrieval | Hybrid BM25 + vector search | Preserves lexical precision while capturing semantics | Requires additional search index maintenance |
| High QPS generation | Async worker pool + local Ollama | Prevents DB connection exhaustion, enables GPU batching | Requires dedicated inference hardware |
Configuration Template
-- Production schema with metadata tracking
CREATE TABLE document_index (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
source_text TEXT NOT NULL,
embedding VECTOR(768) NOT NULL,
tenant_id TEXT NOT NULL,
chunk_index INT DEFAULT 0,
created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Enable row-level security
ALTER TABLE document_index ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON document_index
USING (tenant_id = current_setting('app.current_tenant', true));
-- Vectorizer setup for continuous sync
SELECT ai.create_vectorizer(
'document_index'::regclass,
embedding => ai.embedding_ollama('nomic-embed-text', 768),
chunking => ai.chunking_recursive_character_text_splitter('source_text', 1024, 128),
scheduling => ai.scheduling_continuous(interval '30 seconds')
);
# Production drift monitor initialization
from ragvitals import Detector, EmbeddingDrift, JsonlSink, JudgeDrift, QueryDistribution, ResponseQuality, RetrievalRelevance
def initialize_monitor(tenant: str, output_dir: str = "./drift_logs"):
sink = JsonlSink(filepath=f"{output_dir}/{tenant}_drift.jsonl")
monitor = Detector(
dimensions=[
QueryDistribution(),
RetrievalRelevance(metric="hit_rate", k=5),
EmbeddingDrift(),
ResponseQuality(score_keys=["faithfulness", "relevance"]),
JudgeDrift(score_key="faithfulness")
],
sinks=[sink]
)
return monitor
Quick Start Guide
- Launch PostgreSQL with extensions: Run
timescale/timescaledb-ha:pg17-all via Docker, expose port 5432, and execute CREATE EXTENSION vector; CREATE EXTENSION ai CASCADE;
- Pull local models: Execute
ollama pull nomic-embed-text, ollama pull gemma2:9b, and ollama pull llama3.1:8b on the inference host
- Create schema & vectorizer: Run the production schema SQL, configure the vectorizer to watch your base table, and set the chunking strategy to match your document structure
- Calibrate & ingest: Load 500 representative documents, generate reference embeddings, initialize the drift detector with baseline scores, and begin routing production queries through the unified retrieval-generation pipeline
- Monitor & iterate: Review the daily drift report, correlate dimension shifts with deployment events, and adjust retrieval thresholds or prompt templates based on isolated metric changes