
Building Observable RAG Pipelines Inside PostgreSQL: A Local-First Architecture

Fully open-source RAG with pgvector + pgai + Ollama, and ragvitals watching for drift

By Codcompass Team · 9 min read


Current Situation Analysis

Shipping a Retrieval-Augmented Generation (RAG) system to production is rarely the challenge. Keeping it stable after the first model swap, corpus expansion, or embedder upgrade is. Teams consistently encounter silent degradation: retrieval recall drops, generation faithfulness erodes, and query intent distributions shift. Because these changes are semantic rather than infrastructural, traditional observability stacks miss them entirely. Latency, error rates, and throughput dashboards remain green while the actual intelligence pipeline deteriorates.

The root cause is architectural fragmentation. Most RAG implementations stitch together separate services: a vector database for storage, an external embedding API, a cloud LLM endpoint, and a third-party evaluation framework. Each hop introduces latency, serialization overhead, and blind spots. When a metric degrades, engineers spend days tracing whether the fault lies in the embedding space, the retrieval algorithm, the prompt template, or the generative model. Without a unified baseline, component isolation is guesswork.

This problem is systematically overlooked because teams treat RAG as a static function rather than a dynamic system. Evaluation is often performed offline during development, then abandoned in production. The industry lacks lightweight, open-source tooling that bridges the gap between database operations and semantic monitoring. The result is a reliance on user complaints or manual spot-checks to detect drift.

Data from production deployments shows that changing a single component produces predictable, isolated shifts across specific evaluation dimensions. Swapping an embedder alters query and embedding distributions but leaves generation quality stable. Replacing the LLM impacts faithfulness and relevance scores while retrieval metrics remain flat. A multi-dimensional monitoring layer that tracks these shifts independently transforms RAG from a black box into a diagnosable system.

WOW Moment: Key Findings

The critical insight is that semantic drift is not monolithic. By tracking five distinct dimensions simultaneously, you can isolate exactly which layer of the pipeline changed and quantify the impact. The following table maps component modifications to their observable effects across the monitoring dimensions:

| Component Changed | Query Distribution | Embedding Drift | Retrieval Relevance | Response Quality | Judge Drift |
|---|---|---|---|---|---|
| Swap Embedder | 🔴 Shift | 🔴 Shift | 🟡 Minor | 🟢 Stable | 🟢 Stable |
| Swap Generator | 🟢 Stable | 🟢 Stable | 🟢 Stable | 🔴 Shift | 🔴 Shift |
| Update Corpus | 🔴 Shift | 🟡 Minor | 🔴 Shift | 🟡 Minor | 🟢 Stable |
| Modify Prompt | 🟢 Stable | 🟢 Stable | 🟢 Stable | 🔴 Shift | 🔴 Shift |
| Baseline (Normal) | 🟢 Stable | 🟢 Stable | 🟢 Stable | 🟢 Stable | 🟢 Stable |

This isolation capability matters because it eliminates diagnostic ambiguity. Instead of rolling back an entire stack when quality drops, you can pinpoint whether the retriever needs tuning, the embedder requires re-calibration, or the generator is hallucinating. The system becomes self-auditing. Engineers gain confidence to iterate on models and data because every change is immediately measurable against a rolling baseline. This transforms RAG operations from reactive firefighting to proactive optimization.

Core Solution

The architecture centers on three principles: consolidation, locality, and continuous evaluation. By embedding AI operations directly into PostgreSQL, eliminating external API dependencies, and attaching a lightweight drift detector to every inference call, you create a pipeline that is both operationally simple and semantically transparent.

Step 1: Infrastructure Initialization

Start with a PostgreSQL instance extended for vector operations and inline AI execution. The TimescaleDB distribution provides production-ready extensions without additional orchestration.

-- Enable required extensions
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS ai CASCADE;

The vector extension handles high-dimensional storage and similarity search. The ai extension (pgai) exposes Ollama as a native SQL interface, allowing embedding and generation to execute within the database engine. This removes network serialization, connection pooling complexity, and cross-service fault tolerance concerns.
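
Before building on them, a quick sanity check against the standard PostgreSQL catalog confirms both extensions are active:

-- Verify both extensions are installed and note their versions
SELECT extname, extversion
FROM pg_extension
WHERE extname IN ('vector', 'ai');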

Step 2: Schema Design & Vectorization Strategy

Design a single source-of-truth table. Avoid separating raw text from embeddings unless strict compliance rules require data partitioning.

CREATE TABLE knowledge_store (
    doc_id       TEXT PRIMARY KEY,
    raw_text     TEXT NOT NULL,
    doc_vector   VECTOR(768) NOT NULL,
    indexed_at   TIMESTAMPTZ DEFAULT NOW()
);

For ingestion, you have two patterns. Inline embedding works for batch loads or low-frequency updates. The vectorizer pattern is mandatory for production-scale continuous ingestion.

-- Inline embedding at insert time
INSERT INTO knowledge_store (doc_id, raw_text, doc_vector)
VALUES (
    'sys-arch-001',
    'Stateless microservices require external session stores for horizontal scaling.',
    ai.ollama_embed('nomic-embed-text', 'Stateless microservices require external session stores for horizontal scaling.')::vector
);

-- Production vectorizer configuration
SELECT ai.create_vectorizer(
    'knowledge_store'::regclass,
    embedding => ai.embedding_ollama('nomic-embed-text', 768),
    chunking => ai.chunking_recursive_character_text_splitter('raw_text', 512, 64),
    scheduling => ai.scheduling_continuous()
);

The vectorizer spawns a background worker that watches the base table, chunks text, generates embeddings, and maintains the vector column asynchronously. This prevents insert latency spikes and ensures the embedding pipeline scales independently of write throughput.
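
If your pgai release ships the vectorizer status view (an assumption worth verifying against your installed version), you can confirm the worker is keeping up with the backlog:

-- Inspect vectorizer backlog; view name and columns may vary by pgai version
SELECT * FROM ai.vectorizer_status;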

Step 3: Unified Retrieval & Generation

Combine semantic search and LLM inference into a single database round trip. The <=> operator computes cosine distance. Subtracting it from 1 yields a similarity score suitable for ranking.

-- Context retrieval
SELECT doc_id, raw_text,
       1 - (doc_vector <=> ai.ollama_embed('nomic-embed-text', $1)::vector) AS similarity
FROM knowledge_store
ORDER BY doc_vector <=> ai.ollama_embed('nomic-embed-text', $1)::vector
LIMIT 5;

Pass the retrieved context directly to the generation function. pgai returns structured JSON; extract the payload using PostgreSQL operators.

-- Inline generation
SELECT ai.ollama_generate('gemma2:9b', 
    'Context:' || E'\n' || $1 || E'\n\nQuestion: ' || $2 || E'\nAnswer:'
)->>'response' AS generated_output;

Wrapping these in a PL/pgSQL function or a Python adapter standardizes the interface and enables consistent prompt templating.
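
As a minimal sketch of that wrapper, assuming the knowledge_store schema above (the function name, prompt shape, and chunk separator are illustrative; only the pgai calls already demonstrated are used):

-- Combined retrieval + generation behind one callable
CREATE OR REPLACE FUNCTION rag_answer(question TEXT, top_k INT DEFAULT 5)
RETURNS TEXT AS $$
DECLARE
    context_text TEXT;
BEGIN
    -- Concatenate the top-k chunks into a single context block
    SELECT string_agg(raw_text, E'\n---\n')
    INTO context_text
    FROM (
        SELECT raw_text
        FROM knowledge_store
        ORDER BY doc_vector <=> ai.ollama_embed('nomic-embed-text', question)::vector
        LIMIT top_k
    ) AS hits;

    -- Generate inline and unwrap pgai's JSON payload
    RETURN ai.ollama_generate(
        'gemma2:9b',
        'Context:' || E'\n' || context_text || E'\n\nQuestion: ' || question || E'\nAnswer:'
    )->>'response';
END;
$$ LANGUAGE plpgsql;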

Step 4: Drift Detection Pipeline

Attach a monitoring layer that captures every inference cycle. The ragvitals library provides a dimension-aware detector that compares live traces against a rolling baseline.

from ragvitals import (
    Detector, EmbeddingDrift, InMemorySink, JudgeDrift,
    QueryDistribution, ResponseQuality, RetrievalRelevance, Trace
)
from datetime import datetime

class SemanticMonitor:
    def __init__(self, k: int = 5):
        self.detector = Detector(
            dimensions=[
                QueryDistribution(),
                RetrievalRelevance(metric="hit_rate", k=k),
                EmbeddingDrift(),
                ResponseQuality(score_keys=["faithfulness", "relevance"]),
                JudgeDrift(score_key="faithfulness")
            ],
            sinks=[InMemorySink(window_size=1000)]
        )
        self.k = k

    def calibrate(self, reference_embeddings: list, baseline_scores: dict):
        self.detector.dimensions[0].set_reference(reference_embeddings)
        self.detector.dimensions[2].set_reference(reference_embeddings)
        self.detector.dimensions[4].set_reference(baseline_scores)

    def record(self, query: str, query_vec: list, retrieved: list, 
               answer: str, judge_scores: dict, metadata: dict):
        trace = Trace(
            timestamp=datetime.utcnow(),
            query=query,
            query_embedding=query_vec,
            retrieval_scores=[item[2] for item in retrieved],
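            # Placeholder labels assuming the top hit is relevant;
            # substitute ground-truth relevance labels when available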
            relevance_labels=[1] + [0] * (len(retrieved) - 1),
            response=answer,
            judge_scores=judge_scores,
            metadata=metadata
        )
        self.detector.ingest(trace)

    def generate_report(self) -> dict:
        return self.detector.report()

Initialize the monitor once per deployment cycle. Feed it traces from every production query. The detector maintains statistical baselines and flags deviations across each dimension independently. When you swap gemma2:9b for llama3.1:8b, only ResponseQuality and JudgeDrift will trigger. When you update the corpus, QueryDistribution and RetrievalRelevance will shift. The architecture turns component changes into measurable signals.
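
A hypothetical calibration-and-record cycle, reusing the SemanticMonitor class above (the vectors and judge scores are placeholder values; in production they come from ai.ollama_embed and your LLM judge):

# Placeholder vectors stand in for real embeddings from ai.ollama_embed
baseline_vectors = [[0.0] * 768 for _ in range(500)]

monitor = SemanticMonitor(k=5)
monitor.calibrate(
    reference_embeddings=baseline_vectors,
    baseline_scores={"faithfulness": [0.91, 0.93, 0.89]},  # judge scores on the baseline set
)

# One production cycle: query -> retrieve -> generate -> judge -> record
monitor.record(
    query="How do stateless services scale horizontally?",
    query_vec=[0.0] * 768,
    retrieved=[("sys-arch-001", "Stateless microservices require...", 0.87)],
    answer="They externalize session state to a shared store.",
    judge_scores={"faithfulness": 0.92, "relevance": 0.88},
    metadata={"generator": "gemma2:9b", "prompt_version": "v3"},
)

print(monitor.generate_report())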

Pitfall Guide

1. Embedding Dimension Mismatch

Explanation: Hardcoding vector dimensions in the schema while using an embedder that outputs a different size causes silent truncation or insertion failures. Fix: Query the embedder's output dimension programmatically before schema creation. Use ai.embedding_ollama('model-name', dimension) to enforce consistency. Validate dimensions during vectorizer initialization.
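
A one-line probe (vector_dims is a standard pgvector helper) catches mismatches before any DDL runs:

-- Confirm the embedder's output dimension; nomic-embed-text should report 768
SELECT vector_dims(ai.ollama_embed('nomic-embed-text', 'dimension probe')::vector);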

2. Synchronous Generation Blocking

Explanation: Running LLM generation inside the same transaction as retrieval blocks database connections while the model generates tokens. This collapses connection pools under concurrent load. Fix: Decouple retrieval and generation. Fetch context synchronously, then dispatch generation to an async worker pool or message queue. Use pgai for retrieval only, and route generation through a dedicated inference service when QPS exceeds local GPU capacity.
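
A minimal decoupling sketch, assuming the official Ollama Python client (pool size and function names are illustrative):

# Retrieval happens in the DB call; generation is pushed to a worker pool so
# the database connection returns to the pool immediately
from concurrent.futures import Future, ThreadPoolExecutor

import ollama

generation_pool = ThreadPoolExecutor(max_workers=4)

def dispatch_generation(context: str, question: str) -> Future:
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # Submitted work runs outside any database transaction
    return generation_pool.submit(
        lambda: ollama.generate(model="gemma2:9b", prompt=prompt)["response"]
    )

# Usage: fetch context via SQL first, then
#   future = dispatch_generation(context_text, user_question)
#   answer = future.result(timeout=60)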

3. Uninitialized Baselines

Explanation: Feeding live traces into the detector without setting reference distributions causes immediate false positives. The system flags normal variance as drift because it lacks a baseline. Fix: Run a calibration phase with 500-1000 representative queries before enabling alerts. Store the reference embeddings and score distributions in a separate configuration table. Update baselines only during scheduled maintenance windows.
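
A minimal shape for that configuration table (the schema is an assumption, not a ragvitals requirement):

-- Reference data for detector calibration; refresh only in maintenance windows
CREATE TABLE drift_baselines (
    baseline_id   SERIAL PRIMARY KEY,
    dimension     TEXT NOT NULL,        -- e.g. 'embedding_drift'
    reference     JSONB NOT NULL,       -- reference embeddings or score distribution
    calibrated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);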

4. Cosine Similarity for Structured Data

Explanation: Pure vector search struggles with exact keyword matching, code syntax, or contractual clauses where semantic proximity is less important than lexical precision. Fix: Implement hybrid search. Combine pgvector cosine similarity with pg_search or a dedicated Elasticsearch shard for BM25 scoring. Weight the results using a reciprocal rank fusion algorithm before passing to the generator.
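
A fusion sketch using PostgreSQL's built-in full-text search as a stand-in for BM25 (pg_search or Elasticsearch would supply the lexical leg the same way; 60 is the conventional RRF damping constant):

-- Reciprocal rank fusion over vector and lexical rankings
WITH vec AS (
    SELECT doc_id,
           ROW_NUMBER() OVER (
               ORDER BY doc_vector <=> ai.ollama_embed('nomic-embed-text', $1)::vector
           ) AS rnk
    FROM knowledge_store
    ORDER BY rnk
    LIMIT 20
),
lex AS (
    SELECT doc_id,
           ROW_NUMBER() OVER (
               ORDER BY ts_rank(to_tsvector('english', raw_text),
                                plainto_tsquery('english', $1)) DESC
           ) AS rnk
    FROM knowledge_store
    WHERE to_tsvector('english', raw_text) @@ plainto_tsquery('english', $1)
    ORDER BY rnk
    LIMIT 20
)
SELECT doc_id, SUM(1.0 / (60 + rnk)) AS rrf_score
FROM (SELECT * FROM vec UNION ALL SELECT * FROM lex) fused
GROUP BY doc_id
ORDER BY rrf_score DESC
LIMIT 5;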

5. Missing Multi-Tenant Isolation

Explanation: Storing all documents in a single table exposes cross-tenant data during retrieval. Vector indexes do not enforce access control. Fix: Enable PostgreSQL Row-Level Security (RLS). Add a tenant_id column and create policies that filter queries by session context. Alternatively, use schema-per-tenant for strict compliance requirements. Never rely on application-layer filtering alone.

6. Prompt Template Hardcoding

Explanation: Embedding prompt strings directly in SQL functions or Python code makes versioning impossible. A/B testing prompt variations requires code deployments. Fix: Store prompts in a prompt_registry table with versioning, effective dates, and rollback capabilities. Fetch the active template at runtime. Log which prompt version was used in the drift metadata for correlation analysis.
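
A minimal registry sketch (table and column names are illustrative):

-- Versioned prompt storage with rollback via the active flag
CREATE TABLE prompt_registry (
    prompt_id    SERIAL PRIMARY KEY,
    name         TEXT NOT NULL,
    version      INT NOT NULL,
    template     TEXT NOT NULL,
    active       BOOLEAN NOT NULL DEFAULT FALSE,
    effective_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    UNIQUE (name, version)
);

-- Fetch the active template at runtime; log the version in trace metadata
SELECT version, template
FROM prompt_registry
WHERE name = 'rag-answer' AND active
ORDER BY version DESC
LIMIT 1;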

7. Treating Drift as Binary Failure

Explanation: Alerting on any deviation causes alert fatigue. Minor statistical noise triggers pages that require no action. Fix: Implement threshold bands. Treat deviations under 1.5 standard deviations as normal noise, flag warn between 1.5 and 3 standard deviations, and flag degraded beyond 3. Correlate drift spikes with deployment timestamps before triggering incident response.
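
A sketch of those bands over a per-dimension z-score:

# Cutoffs mirror the 1.5 / 3.0 standard deviation guidance above
def classify_drift(z_score: float) -> str:
    if abs(z_score) < 1.5:
        return "ok"        # normal statistical noise, no action
    if abs(z_score) < 3.0:
        return "warn"      # log and correlate with recent deployments
    return "degraded"      # escalate to incident response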

Production Bundle

Action Checklist

  • Initialize PostgreSQL with vector and ai extensions on a dedicated instance
  • Define vector dimensions matching the chosen embedder before table creation
  • Configure the vectorizer for continuous ingestion instead of inline embedding
  • Implement hybrid search (BM25 + cosine) for structured or keyword-heavy corpora
  • Enable Row-Level Security if supporting multiple tenants or departments
  • Calibrate the drift detector with a representative query baseline before go-live
  • Decouple generation from retrieval using async workers for high-throughput workloads
  • Version prompt templates in a database registry and log usage in trace metadata

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| < 100k documents, single tenant | Inline pgai embedding + retrieval | Lowest operational overhead, single round trip | Minimal infrastructure cost |
| > 1M documents, high write volume | Vectorizer background worker + pgvectorscale | Prevents insert latency, scales to billions of vectors | Moderate storage cost, negligible compute |
| Strict compliance / multi-tenant | Schema-per-tenant + RLS policies | Guarantees data isolation, simplifies audit trails | Higher operational complexity |
| Code/contract retrieval | Hybrid BM25 + vector search | Preserves lexical precision while capturing semantics | Requires additional search index maintenance |
| High QPS generation | Async worker pool + local Ollama | Prevents DB connection exhaustion, enables GPU batching | Requires dedicated inference hardware |

Configuration Template

-- Production schema with metadata tracking
CREATE TABLE document_index (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    source_text     TEXT NOT NULL,
    embedding       VECTOR(768) NOT NULL,
    tenant_id       TEXT NOT NULL,
    chunk_index     INT DEFAULT 0,
    created_at      TIMESTAMPTZ DEFAULT NOW()
);

-- Enable row-level security
ALTER TABLE document_index ENABLE ROW LEVEL SECURITY;
CREATE POLICY tenant_isolation ON document_index
    USING (tenant_id = current_setting('app.current_tenant', true));

-- Vectorizer setup for continuous sync
SELECT ai.create_vectorizer(
    'document_index'::regclass,
    embedding => ai.embedding_ollama('nomic-embed-text', 768),
    chunking => ai.chunking_recursive_character_text_splitter('source_text', 1024, 128),
    scheduling => ai.scheduling_continuous(interval '30 seconds')
);
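
To exercise the policy, scope the session to a tenant before querying ('acme' is an illustrative tenant_id):

-- RLS filters every subsequent query in this session
SET app.current_tenant = 'acme';
SELECT id, source_text FROM document_index LIMIT 3;

Pair the schema with a per-tenant drift monitor:
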
# Production drift monitor initialization
from ragvitals import Detector, EmbeddingDrift, JsonlSink, JudgeDrift, QueryDistribution, ResponseQuality, RetrievalRelevance

def initialize_monitor(tenant: str, output_dir: str = "./drift_logs"):
    sink = JsonlSink(filepath=f"{output_dir}/{tenant}_drift.jsonl")
    
    monitor = Detector(
        dimensions=[
            QueryDistribution(),
            RetrievalRelevance(metric="hit_rate", k=5),
            EmbeddingDrift(),
            ResponseQuality(score_keys=["faithfulness", "relevance"]),
            JudgeDrift(score_key="faithfulness")
        ],
        sinks=[sink]
    )
    return monitor

Quick Start Guide

  1. Launch PostgreSQL with extensions: Run timescale/timescaledb-ha:pg17-all via Docker, expose port 5432, and execute CREATE EXTENSION vector; CREATE EXTENSION ai CASCADE;
  2. Pull local models: Execute ollama pull nomic-embed-text, ollama pull gemma2:9b, and ollama pull llama3.1:8b on the inference host
  3. Create schema & vectorizer: Run the production schema SQL, configure the vectorizer to watch your base table, and set the chunking strategy to match your document structure
  4. Calibrate & ingest: Load 500 representative documents, generate reference embeddings, initialize the drift detector with baseline scores, and begin routing production queries through the unified retrieval-generation pipeline
  5. Monitor & iterate: Review the daily drift report, correlate dimension shifts with deployment events, and adjust retrieval thresholds or prompt templates based on isolated metric changes