Difficulty: Intermediate · Read Time: 10 min

Agents that monitor themselves: a self-auditing RAG on Tiger's Agentic Postgres

By Codcompass Team · 10 min read

In-Database Self-Correction: Building Autonomous RAG Observability with SQL and MCP

Current Situation Analysis

Modern RAG architectures suffer from a fundamental observability gap. Engineers typically treat retrieval and generation as isolated pipeline stages, while monitoring is bolted on as an external service or batch notebook. This creates a fragmented feedback loop: inference traces are shipped to a separate observability platform, evaluated hours or days later, and the results never influence the live agent's decision-making. The system flies blind between deployments, unable to detect retrieval degradation, embedding drift, or generation hallucination until users report failures or data scientists run weekly evaluations.

This problem is frequently overlooked because teams assume observability requires dedicated infrastructure. The industry standard involves exporting logs to Prometheus, Datadog, or custom ETL pipelines, then building dashboards that require human interpretation. This approach introduces three critical failures:

  1. Latency mismatch: External polling or webhook-based alerting operates on minute-to-hour scales, while RAG degradation can occur within seconds of a corpus update or traffic spike.
  2. Context fragmentation: The audit stack lacks direct access to the vector index, the raw corpus, and the inference state. Correlating a drop in faithfulness with a specific document chunk requires cross-system joins that are expensive and brittle.
  3. Credential and deployment sprawl: Running a retriever, a generator, and a monitor across three separate services multiplies configuration overhead, network hops, and failure domains.

The overlooked alternative is co-locating the audit logic with the data. Postgres extensions like pgvector and pgai already handle embedding storage, similarity search, and model inference. By extending the database with statistical audit functions, the agent can query its own health as a native SQL operation. Benchmarks demonstrate that calculating a multi-dimensional drift report over 5,000 inference traces executes in approximately 12 milliseconds. This latency profile makes per-request or per-batch self-auditing viable without impacting user-facing response times. The database becomes the runtime control plane, not just a storage layer.

WOW Moment: Key Findings

The architectural shift from external monitoring to in-database self-auditing produces measurable improvements across deployment complexity, latency, and operational alignment. The following comparison contrasts traditional external observability pipelines with co-located SQL-based self-correction.

| Approach | End-to-End Latency Overhead | Infrastructure Footprint | Context Alignment | Alert Freshness |
|---|---|---|---|---|
| External Observability Stack | 150–400 ms (network + serialization + dashboard polling) | 3+ services (log shipper, metrics DB, alerting engine) | Low (requires ETL to join traces with corpus) | Hourly/Daily (batch evals) |
| In-Database Self-Auditing | 8–15 ms (direct SQL execution over indexed tables) | 1 service (Postgres + extensions) | High (native joins between logs, vectors, and documents) | Real-time (per-request or per-batch) |

This finding matters because it transforms observability from a passive reporting mechanism into an active control loop. When the agent can execute a SELECT to evaluate its own retrieval hit rate, generation faithfulness, and embedding stability, it gains the ability to:

  • Trigger circuit breakers before degradation impacts users
  • Dynamically route to fallback generators or curated knowledge bases
  • Adjust retrieval parameters (top-k, similarity thresholds) based on live statistical signals
  • Eliminate the "Monday morning notebook" evaluation cycle entirely

The enabling factor is not a new algorithm, but a structural decision: keeping the audit logic, the inference data, and the corpus in the same execution context.
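
To make the control loop concrete: the self-check really is just a query. A minimal, illustrative example, assuming the inference_audit_log table defined in Step 1 below:

```sql
-- Illustrative self-check over the last hour of traces (schema from Step 1; no external tooling involved)
SELECT
    avg(CASE WHEN array_length(matched_ids, 1) > 0 THEN 1.0 ELSE 0.0 END) AS retrieval_hit_rate,
    avg((validation_metrics->>'faithfulness')::float)                     AS mean_faithfulness,
    count(*)                                                              AS trace_count
FROM inference_audit_log
WHERE execution_ts >= now() - interval '1 hour';
```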

Core Solution

Building a self-auditing RAG agent requires three components: an append-only audit log, a statistical drift engine implemented as SQL functions, and an agent loop that consumes these functions via MCP (Model Context Protocol). The implementation below uses pgai for model inference, pgvector for similarity search, and Ollama-hosted models (gemma2:9b for generation, llama3.1:8b for validation).

Step 1: Schema Design for Audit Traces

The foundation is a structured log that captures every inference cycle. Unlike traditional logging, this table stores vector embeddings, retrieval metadata, and LLM-as-judge scores in a single row to enable statistical aggregation.

CREATE TABLE inference_audit_log (
    trace_id       bigserial PRIMARY KEY,
    execution_ts   timestamptz NOT NULL DEFAULT now(),
    user_query     text NOT NULL,
    query_vector   vector(768),
    matched_ids    text[],
    match_scores   double precision[],
    generated_text text,
    validation_metrics jsonb,
    runtime_context jsonb
);

-- Optimized for temporal window scans
CREATE INDEX idx_audit_ts_desc ON inference_audit_log (execution_ts DESC);
CREATE INDEX idx_audit_metrics ON inference_audit_log USING gin (validation_metrics);

Architecture Rationale:

  • jsonb for validation_metrics allows flexible scoring schemas (e.g., {"faithfulness": 0.91, "relevance": 0.87}) without schema migrations.
  • The descending timestamp index ensures window scans (WHERE execution_ts >= now() - interval '24 hours') use an index range scan instead of a full sequential scan.
  • GIN index on metrics enables future filtering by specific score thresholds.
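
For concreteness, here is a hypothetical trace insert showing what a single row might contain. The values, document IDs, and scores are illustrative, and the embedding is omitted for brevity:

```sql
-- Hypothetical audit trace; all values are illustrative
INSERT INTO inference_audit_log
    (user_query, query_vector, matched_ids, match_scores, generated_text, validation_metrics, runtime_context)
VALUES (
    'How do I rotate an API key?',
    NULL,  -- in practice: the 768-dimensional embedding, e.g. '[0.012, -0.094, ...]'::vector
    ARRAY['doc_42#chunk_3', 'doc_17#chunk_1'],
    ARRAY[0.89, 0.81]::double precision[],
    'To rotate the key, open Settings > API Keys and ...',
    '{"faithfulness": 0.91, "relevance": 0.87}'::jsonb,
    '{"generator": "gemma2:9b", "judge": "llama3.1:8b"}'::jsonb
);
```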

Step 2: Statistical Drift Engine

The core audit logic compares a recent observation window against a historical baseline. The implementation uses z-score classification with a constant-baseline fallback to handle systems with near-zero variance.

CREATE OR REPLACE FUNCTION system_health_snapshot(
    observation_hours int DEFAULT 24,
    baseline_days int DEFAULT 7
)
RETURNS TABLE (
    metric_category text,
    severity_level text,
    current_avg double precision,
    baseline_avg double precision,
    sample_count int,
    diagnostic_note text
)
LANGUAGE sql STABLE AS $$
WITH observation_window AS (
    SELECT *
    FROM inference_audit_log
    WHERE execution_ts >= now() - make_interval(hours => observation_hours)
),
historical_baseline AS (
    SELECT *
    FROM inference_audit_log
    WHERE execution_ts >= now() - make_interval(days => baseline_days)
      AND execution_ts < now() - make_interval(hours => observation_hours)
),
faithfulness_calc AS (
    SELECT
        avg((validation_metrics->>'faithfulness')::float) AS obs_mean,
        count(*) AS obs_n,
        (SELECT avg((validation_metrics->>'faithfulness')::float) FROM historical_baseline) AS base_mean,
        (SELECT stddev_samp((validation_metrics->>'faithfulness')::float) FROM historical_baseline) AS base_std
    FROM observation_window
),
retrieval_calc AS (
    SELECT
        avg(CASE WHEN array_length(matched_ids, 1) > 0 THEN 1.0 ELSE 0.0 END)::double precision AS obs_mean,
        count(*) AS obs_n,
        (SELECT avg(CASE WHEN array_length(matched_ids, 1) > 0 THEN 1.0 ELSE 0.0 END)::double precision FROM historical_baseline) AS base_mean,
        (SELECT stddev_samp(CASE WHEN array_length(matched_ids, 1) > 0 THEN 1.0 ELSE 0.0 END)::double precision FROM historical_baseline) AS base_std
    FROM observation_window
)
SELECT
    'Generation.Faithfulness' AS metric_category,
    CASE
        WHEN base_std > 0.001 AND (obs_mean - base_mean) / base_std <= -3.0 THEN 'critical'
        WHEN base_std > 0.001 AND (obs_mean - base_mean) / base_std <= -2.0 THEN 'warning'
        -- constant-baseline fallback: relative drop instead of z-score
        WHEN base_std <= 0.001 AND base_mean > 0.5 AND (obs_mean - base_mean) / base_mean <= -0.25 THEN 'critical'
        ELSE 'stable'
    END AS severity_level,
    obs_mean,
    base_mean,
    obs_n::int,
    format('Observed %s vs Baseline %s (n=%s)', round(obs_mean::numeric, 3), round(base_mean::numeric, 3), obs_n)
FROM faithfulness_calc
UNION ALL
SELECT
    'Retrieval.HitRate',
    CASE
        WHEN base_std > 0.001 AND (obs_mean - base_mean) / base_std <= -3.0 THEN 'critical'
        WHEN base_std > 0.001 AND (obs_mean - base_mean) / base_std <= -2.0 THEN 'warning'
        ELSE 'stable'
    END,
    obs_mean,
    base_mean,
    obs_n::int,
    format('Observed %s vs Baseline %s (n=%s)', round(obs_mean::numeric, 3), round(base_mean::numeric, 3), obs_n)
FROM retrieval_calc;
$$;


Why This Structure:

  • CTEs separate window and baseline calculations cleanly, avoiding correlated subqueries that degrade performance.
  • The base_std > 0.001 threshold prevents division by zero and handles stable systems where z-scores become meaningless.
  • The constant-baseline fallback (base_std <= 0.001) uses a relative percentage drop instead of standard deviations, ensuring alerts still fire when historical variance is negligible.
  • Marking the function STABLE tells the planner it does not modify data and returns consistent results within a single statement, so calls can be optimized inside larger query plans.
Step 3: Agent Execution Loop

The agent consumes the audit function via MCP tools. The loop integrates retrieval, generation, validation, logging, and self-assessment into a single transactional context.

```typescript
import { createMcpClient } from '@mcp/client';
import { pgai } from '@timescale/pgai';

class SelfAuditingRagAgent {
  private db: any;
  private mcp: any;
  private auditCounter = 0;
  private readonly AUDIT_INTERVAL = 5;

  constructor(connectionConfig: any) {
    // connectionConfig.client is expected to expose query(), e.g. a node-postgres Pool
    this.db = connectionConfig.client;
    this.mcp = createMcpClient(connectionConfig.mcpEndpoint);
  }

  async processQuery(userInput: string): Promise<string> {
    // 1. Embed and retrieve
    const queryEmbedding = await pgai.ollamaEmbed('nomic-embed-text', userInput);
    const results = await this.db.query(`
      SELECT id, content, 
             1 - (embedding <=> $1::vector) AS similarity
      FROM knowledge_base
      ORDER BY embedding <=> $1::vector
      LIMIT 3
    `, [queryEmbedding]);

    // 2. Generate response
    const prompt = this.buildRagPrompt(userInput, results.rows);
    const generation = await pgai.ollamaGenerate('gemma2:9b', prompt);
    const responseText = generation.response;

    // 3. Validate with judge model
    const rubric = this.buildValidationRubric(userInput, responseText, results.rows);
    const validation = await pgai.ollamaGenerate('llama3.1:8b', rubric);
    const metrics = this.parseJudgeOutput(validation.response);

    // 4. Append audit trace
    await this.db.query(`
      INSERT INTO inference_audit_log 
        (user_query, query_vector, matched_ids, match_scores, generated_text, validation_metrics, runtime_context)
      VALUES ($1, $2, $3, $4, $5, $6, $7)
    `, [
      userInput,
      queryEmbedding,
      results.rows.map(r => r.id),
      results.rows.map(r => r.similarity),
      responseText,
      JSON.stringify(metrics),
      JSON.stringify({ generator: 'gemma2:9b', judge: 'llama3.1:8b' })
    ]);

    // 5. Self-audit (rate-limited)
    this.auditCounter++;
    if (this.auditCounter >= this.AUDIT_INTERVAL) {
      await this.performSelfAudit();
      this.auditCounter = 0;
    }

    return responseText;
  }

  private async performSelfAudit(): Promise<void> {
    const healthReport = await this.mcp.callTool('system_health_snapshot', {
      observation_hours: 1,
      baseline_days: 7
    });

    const criticalMetrics = healthReport.filter((r: any) => r.severity_level === 'critical');
    if (criticalMetrics.length > 0) {
      await this.triggerCircuitBreaker(criticalMetrics);
    }
  }

  private async triggerCircuitBreaker(metrics: any[]): Promise<void> {
    // Route to fallback generator or curated FAQ
    console.warn('System degradation detected:', metrics);
    // Implementation: switch retrieval strategy, enable strict mode, or queue human review
  }
}
```

Architecture Decisions:

  • MCP exposes system_health_snapshot() as a native tool, eliminating custom API wrappers.
  • Rate limiting (AUDIT_INTERVAL = 5) prevents audit noise while maintaining statistical significance.
  • Circuit breaker logic is decoupled from the main query path, allowing graceful degradation without blocking user requests.
  • All operations execute within the same Postgres instance, removing network serialization overhead and credential management complexity.

Pitfall Guide

1. Unbounded Self-Auditing

Explanation: Agents that call the health check after every inference generate excessive SQL load and pollute conversation traces with audit metadata. Fix: Implement a deterministic rate limiter (should_audit_now()) that triggers audits only after N actions or at fixed time intervals. Use a monotonically increasing counter or timestamp threshold.
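
A minimal sketch of such a limiter as a SQL function. The audit_schedule bookkeeping table, the function name, and the 50-trace / 5-minute defaults are illustrative assumptions, not part of the design above:

```sql
-- Hypothetical bookkeeping table: one row recording when the last audit ran and up to which trace
CREATE TABLE IF NOT EXISTS audit_schedule (
    last_audit_ts       timestamptz NOT NULL DEFAULT now(),
    last_audit_trace_id bigint      NOT NULL DEFAULT 0
);
INSERT INTO audit_schedule DEFAULT VALUES;  -- seed the single bookkeeping row (run once)

-- Deterministic rate limiter: audit only after N new traces or after a fixed time gap
CREATE OR REPLACE FUNCTION should_audit_now(
    min_traces int DEFAULT 50,
    min_gap    interval DEFAULT interval '5 minutes'
)
RETURNS boolean
LANGUAGE sql STABLE AS $$
    SELECT (now() - s.last_audit_ts) >= min_gap
        OR coalesce((SELECT max(trace_id) FROM inference_audit_log), 0) - s.last_audit_trace_id >= min_traces
    FROM audit_schedule s;
$$;
```

After each audit, the agent would advance last_audit_ts and last_audit_trace_id so the counter resets.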

2. Constant Baseline Fragility

Explanation: When historical variance approaches zero, z-score calculations become unstable or divide by near-zero values, causing false positives or silent failures. Fix: Add a variance threshold (base_std > 0.001) and fall back to relative percentage drop ((obs - base) / base <= -0.25). This maintains alert sensitivity while avoiding mathematical singularities.

3. Judge Metric Complexity in SQL

Explanation: LLM-as-judge evaluations often require stable reference sets, multi-dimensional rubrics, and cross-referencing with ground truth. Implementing this purely in SQL leads to verbose, unmaintainable queries. Fix: Delegate complex judge logic to plpython3u or an external microservice. Keep SQL focused on statistical aggregation (averages, variances, hit rates) and store pre-computed judge scores in jsonb.

4. Missing Temporal Indexes

Explanation: Window scans over append-only logs without a descending timestamp index force sequential scans, degrading audit latency from ~12 ms to hundreds of milliseconds. Fix: Add a descending timestamp index (CREATE INDEX idx_ts_desc ON inference_audit_log (execution_ts DESC)). For high-volume systems, implement monthly partitioning on the timestamp column to keep index depth shallow.
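
A sketch of the partitioned variant using declarative monthly range partitioning. The table name and the January 2025 partition are illustrative; in practice partitions would be pre-created by a scheduled job or a tool such as pg_partman:

```sql
-- Declarative monthly range partitioning on execution_ts (names and dates are illustrative)
CREATE TABLE inference_audit_log_partitioned (
    trace_id           bigserial,
    execution_ts       timestamptz NOT NULL DEFAULT now(),
    user_query         text NOT NULL,
    query_vector       vector(768),
    matched_ids        text[],
    match_scores       double precision[],
    generated_text     text,
    validation_metrics jsonb,
    runtime_context    jsonb,
    PRIMARY KEY (trace_id, execution_ts)   -- the partition key must be part of the primary key
) PARTITION BY RANGE (execution_ts);

-- One partition per month keeps each local index shallow
CREATE TABLE inference_audit_log_2025_01
    PARTITION OF inference_audit_log_partitioned
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');

-- Partitioned index: Postgres creates a matching index on each partition
CREATE INDEX idx_audit_part_ts_desc ON inference_audit_log_partitioned (execution_ts DESC);
```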

5. Static Threshold Hardcoding

Explanation: Using fixed severity bands (-2.0, -3.0) across all metrics ignores domain-specific variance. A retrieval hit rate of 0.85 might be critical for medical queries but acceptable for casual chat. Fix: Parameterize thresholds per metric category. Store baseline expectations in a configuration table and join during audit execution. Allow dynamic calibration based on SLA requirements.
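
A sketch of that configuration join, assuming a drift_scores view or CTE that already exposes a per-metric z-score, relative drop, and sample count (not defined above); audit_thresholds is the table shown later in the Configuration Template:

```sql
-- Severity classification driven by the audit_thresholds configuration table
-- rather than hard-coded -2.0 / -3.0 bands; drift_scores is an assumed view/CTE
SELECT d.metric_category,
       CASE
           WHEN d.sample_count < t.min_sample_size           THEN 'insufficient_data'
           WHEN d.zscore <= t.critical_zscore                THEN 'critical'
           WHEN d.zscore <= t.warning_zscore                 THEN 'warning'
           WHEN d.relative_drop >= t.relative_drop_threshold THEN 'critical'
           ELSE 'stable'
       END AS severity_level
FROM drift_scores d
JOIN audit_thresholds t USING (metric_category);
```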

6. Ignoring Seasonal Query Patterns

Explanation: Daily traffic spikes, weekend lulls, or periodic batch updates skew baselines, causing false drift alerts when normal cyclical patterns occur. Fix: Implement day-of-week weighting in baseline calculations. Use rolling windows that align with business cycles, or apply exponential moving averages to smooth periodic noise.
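
A minimal sketch of a day-of-week-aligned baseline; the 28-day lookback is an illustrative assumption:

```sql
-- Baseline restricted to the same day of week as today, so Monday traffic is
-- compared against previous Mondays instead of a flat 7-day average
SELECT avg((validation_metrics->>'faithfulness')::float) AS dow_baseline
FROM inference_audit_log
WHERE execution_ts >= now() - interval '28 days'
  AND execution_ts <  now() - interval '24 hours'
  AND extract(dow FROM execution_ts) = extract(dow FROM now());
```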

7. Silent Fallback Routing

Explanation: When the agent switches to a fallback generator or stricter retrieval strategy, the transition often goes unlogged, making post-incident analysis impossible. Fix: Explicitly record routing decisions in runtime_context. Include fields like fallback_triggered: true, reason: 'faithfulness_degradation', and previous_generator: 'gemma2:9b'.
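
A sketch of recording such a decision directly on the triggering trace; the trace id is a placeholder:

```sql
-- Append the routing decision to the runtime_context of the trace that triggered it
UPDATE inference_audit_log
SET runtime_context = runtime_context || jsonb_build_object(
        'fallback_triggered', true,
        'reason', 'faithfulness_degradation',
        'previous_generator', 'gemma2:9b'
    )
WHERE trace_id = 9876;  -- placeholder id of the triggering trace
```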

Production Bundle

Action Checklist

  • Define audit schema with jsonb metrics and descending timestamp index
  • Implement drift function with z-score classification and constant-baseline fallback
  • Configure MCP tool exposure for audit functions
  • Set deterministic audit rate limiting (interval or counter-based)
  • Parameterize severity thresholds per metric category
  • Implement circuit breaker with explicit fallback routing and audit logging
  • Add monthly partitioning for high-volume trace tables
  • Validate audit latency under load (target < 20ms for 10k rows)

Decision Matrix

| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| Low-volume RAG (< 1k queries/day) | Single-table audit log with daily baseline | Simplicity outweighs optimization needs | Minimal (standard Postgres) |
| High-volume RAG (> 50k queries/day) | Monthly partitioning + materialized baseline views | Prevents index bloat and maintains sub-20 ms audit latency | Moderate (partition management overhead) |
| Multi-domain RAG (medical, legal, casual) | Parameterized thresholds per domain | Domain-specific SLAs require different sensitivity bands | Low (configuration table join) |
| Strict compliance requirements | External audit export + in-DB real-time loop | Satisfies regulatory retention while maintaining live control | High (dual-write architecture) |
| Edge deployment (limited compute) | Batch audit every N minutes instead of per-request | Reduces CPU overhead while preserving drift detection | Low (delayed but functional) |

Configuration Template

-- Audit schema setup
CREATE TABLE inference_audit_log (
    trace_id       bigserial PRIMARY KEY,
    execution_ts   timestamptz NOT NULL DEFAULT now(),
    user_query     text NOT NULL,
    query_vector   vector(768),
    matched_ids    text[],
    match_scores   double precision[],
    generated_text text,
    validation_metrics jsonb,
    runtime_context jsonb
);

CREATE INDEX idx_audit_ts_desc ON inference_audit_log (execution_ts DESC);
CREATE INDEX idx_audit_metrics ON inference_audit_log USING gin (validation_metrics);

-- Threshold configuration
CREATE TABLE audit_thresholds (
    metric_category text PRIMARY KEY,
    warning_zscore float NOT NULL DEFAULT -2.0,
    critical_zscore float NOT NULL DEFAULT -3.0,
    relative_drop_threshold float NOT NULL DEFAULT 0.25,
    min_sample_size int NOT NULL DEFAULT 10
);

INSERT INTO audit_thresholds (metric_category, warning_zscore, critical_zscore, relative_drop_threshold, min_sample_size)
VALUES 
  ('Generation.Faithfulness', -2.0, -3.0, 0.25, 10),
  ('Retrieval.HitRate', -2.0, -3.0, 0.20, 15);

Quick Start Guide

  1. Initialize the database: Provision a Postgres instance with pgvector and pgai extensions. Create the inference_audit_log table and indexes using the configuration template.
  2. Deploy the audit function: Execute the system_health_snapshot() SQL definition, then verify execution time with EXPLAIN ANALYZE SELECT * FROM system_health_snapshot(1, 7);
  3. Configure MCP exposure: Register the audit function as an MCP tool. Test tool invocation with a sample observation window.
  4. Integrate agent loop: Implement the TypeScript/Python agent loop with rate-limited audit calls. Add circuit breaker logic for critical severity responses.
  5. Validate end-to-end: Run 50 test queries, trigger artificial degradation (e.g., modify retrieval top-k), and confirm the audit function detects the shift within the configured window (a SQL sketch of this check follows).
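
A sketch of that end-to-end check using synthetic traces rather than a live top-k change; the row count and scores are illustrative:

```sql
-- Inject synthetic low-faithfulness traces, then confirm the snapshot flags Generation.Faithfulness
INSERT INTO inference_audit_log
    (user_query, matched_ids, match_scores, generated_text, validation_metrics, runtime_context)
SELECT 'synthetic degradation probe ' || g,
       ARRAY['doc_1#chunk_1'],
       ARRAY[0.42]::double precision[],
       'deliberately weak answer',
       '{"faithfulness": 0.35, "relevance": 0.40}'::jsonb,
       '{"synthetic": true}'::jsonb
FROM generate_series(1, 50) AS g;

SELECT * FROM system_health_snapshot(1, 7);
```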