scores in a single row to enable statistical aggregation.
CREATE TABLE inference_audit_log (
trace_id bigserial PRIMARY KEY,
execution_ts timestamptz NOT NULL DEFAULT now(),
user_query text NOT NULL,
query_vector vector(768),
matched_ids text[],
match_scores double precision[],
generated_text text,
validation_metrics jsonb,
runtime_context jsonb
);
-- Optimized for temporal window scans
CREATE INDEX idx_audit_ts_desc ON inference_audit_log (execution_ts DESC);
CREATE INDEX idx_audit_metrics ON inference_audit_log USING gin (validation_metrics);
Architecture Rationale:
jsonb for validation_metrics allows flexible scoring schemas (e.g., {"faithfulness": 0.91, "relevance": 0.87}) without schema migrations.
- Descending timestamp index ensures window scans (
WHERE execution_ts >= now() - interval '24 hours') use index-only scans.
- GIN index on metrics enables future filtering by specific score thresholds.
Step 2: Statistical Drift Engine
The core audit logic compares a recent observation window against a historical baseline. The implementation uses z-score classification with a constant-baseline fallback to handle systems with near-zero variance.
CREATE OR REPLACE FUNCTION system_health_snapshot(
observation_hours int DEFAULT 24,
baseline_days int DEFAULT 7
)
RETURNS TABLE (
metric_category text,
severity_level text,
current_avg double precision,
baseline_avg double precision,
sample_count int,
diagnostic_note text
)
LANGUAGE sql STABLE AS $$
WITH observation_window AS (
SELECT *
FROM inference_audit_log
WHERE execution_ts >= now() - make_interval(hours => observation_hours)
),
historical_baseline AS (
SELECT *
FROM inference_audit_log
WHERE execution_ts >= now() - make_interval(days => baseline_days)
AND execution_ts < now() - make_interval(hours => observation_hours)
),
faithfulness_calc AS (
SELECT
avg((validation_metrics->>'faithfulness')::float) AS obs_mean,
count(*) AS obs_n,
(SELECT avg((validation_metrics->>'faithfulness')::float) FROM historical_baseline) AS base_mean,
(SELECT stddev_samp((validation_metrics->>'faithfulness')::float) FROM historical_baseline) AS base_std
FROM observation_window
),
retrieval_calc AS (
SELECT
avg(CASE WHEN array_length(matched_ids, 1) > 0 THEN 1.0 ELSE 0.0 END) AS obs_mean,
count(*) AS obs_n,
(SELECT avg(CASE WHEN array_length(matched_ids, 1) > 0 THEN 1.0 ELSE 0.0 END) FROM historical_baseline) AS base_mean,
(SELECT stddev_samp(CASE WHEN array_length(matched_ids, 1) > 0 THEN 1.0 ELSE 0.0 END) FROM historical_baseline) AS base_std
FROM observation_window
)
SELECT
'Generation.Faithfulness' AS metric_category,
CASE
WHEN base_std > 0.001 AND (obs_mean - base_mean) / base_std <= -3.0 THEN 'critical'
WHEN base_std > 0.001 AND (obs_mean - base_mean) / base_std <= -2.0 THEN 'warning'
WHEN base_std <= 0.001 AND base_mean > 0.5 AND (obs_mean - base_mean) / base_mean <= -0.25 THEN 'critical'
ELSE 'stable'
END AS severity_level,
obs_mean, base_mean, obs_n,
format('Observed %.3f vs Baseline %.3f (n=%d)', obs_mean, base_mean, obs_n)
FROM faithfulness_calc
UNION ALL
SELECT
'Retrieval.HitRate',
CASE
WHEN base_std > 0.001 AND (obs_mean - base_mean) / base_std <= -3.0 THEN 'critical'
WHEN base_std > 0.001 AND (obs_mean - base_mean) / base_std <= -2.0 THEN 'warning'
ELSE 'stable'
END,
obs_mean, base_mean, obs_n,
format('Observed %.3f vs Baseline %.3f (n=%d)', obs_mean, base_mean, obs_n)
FROM retrieval_calc;
$$;
Why This Structure:
- CTEs separate window and baseline calculations cleanly, avoiding correlated subqueries that degrade performance.
- The
base_std > 0.001 threshold prevents division-by-zero and handles stable systems where z-scores become meaningless.
- The constant-baseline fallback (
base_std <= 0.001) uses relative percentage drop instead of standard deviations, ensuring alerts fire even when historical variance is negligible.
STABLE volatility tells Postgres the function won't modify data and can be optimized within query plans.
Step 3: Agent Execution Loop
The agent consumes the audit function via MCP tools. The loop integrates retrieval, generation, validation, logging, and self-assessment into a single transactional context.
import { createMcpClient } from '@mcp/client';
import { pgai } from '@timescale/pgai';
class SelfAuditingRagAgent {
private db: any;
private mcp: any;
private auditCounter = 0;
private readonly AUDIT_INTERVAL = 5;
constructor(connectionConfig: any) {
this.db = connectionConfig;
this.mcp = createMcpClient(connectionConfig.mcpEndpoint);
}
async processQuery(userInput: string): Promise<string> {
// 1. Embed and retrieve
const queryEmbedding = await pgai.ollamaEmbed('nomic-embed-text', userInput);
const results = await this.db.query(`
SELECT id, content,
1 - (embedding <=> $1::vector) AS similarity
FROM knowledge_base
ORDER BY embedding <=> $1::vector
LIMIT 3
`, [queryEmbedding]);
// 2. Generate response
const prompt = this.buildRagPrompt(userInput, results.rows);
const generation = await pgai.ollamaGenerate('gemma2:9b', prompt);
const responseText = generation.response;
// 3. Validate with judge model
const rubric = this.buildValidationRubric(userInput, responseText, results.rows);
const validation = await pgai.ollamaGenerate('llama3.1:8b', rubric);
const metrics = this.parseJudgeOutput(validation.response);
// 4. Append audit trace
await this.db.query(`
INSERT INTO inference_audit_log
(user_query, query_vector, matched_ids, match_scores, generated_text, validation_metrics, runtime_context)
VALUES ($1, $2, $3, $4, $5, $6, $7)
`, [
userInput,
queryEmbedding,
results.rows.map(r => r.id),
results.rows.map(r => r.similarity),
responseText,
JSON.stringify(metrics),
JSON.stringify({ generator: 'gemma2:9b', judge: 'llama3.1:8b' })
]);
// 5. Self-audit (rate-limited)
this.auditCounter++;
if (this.auditCounter >= this.AUDIT_INTERVAL) {
await this.performSelfAudit();
this.auditCounter = 0;
}
return responseText;
}
private async performSelfAudit(): Promise<void> {
const healthReport = await this.mcp.callTool('system_health_snapshot', {
observation_hours: 1,
baseline_days: 7
});
const criticalMetrics = healthReport.filter((r: any) => r.severity_level === 'critical');
if (criticalMetrics.length > 0) {
await this.triggerCircuitBreaker(criticalMetrics);
}
}
private async triggerCircuitBreaker(metrics: any[]): Promise<void> {
// Route to fallback generator or curated FAQ
console.warn('System degradation detected:', metrics);
// Implementation: switch retrieval strategy, enable strict mode, or queue human review
}
}
Architecture Decisions:
- MCP exposes
system_health_snapshot() as a native tool, eliminating custom API wrappers.
- Rate limiting (
AUDIT_INTERVAL = 5) prevents audit noise while maintaining statistical significance.
- Circuit breaker logic is decoupled from the main query path, allowing graceful degradation without blocking user requests.
- All operations execute within the same Postgres instance, removing network serialization overhead and credential management complexity.
Pitfall Guide
1. Unbounded Self-Auditing
Explanation: Agents that call the health check after every inference generate excessive SQL load and pollute conversation traces with audit metadata.
Fix: Implement a deterministic rate limiter (should_audit_now()) that triggers audits only after N actions or at fixed time intervals. Use a monotonically increasing counter or timestamp threshold.
2. Constant Baseline Fragility
Explanation: When historical variance approaches zero, z-score calculations become unstable or divide by near-zero values, causing false positives or silent failures.
Fix: Add a variance threshold (base_std > 0.001) and fall back to relative percentage drop ((obs - base) / base <= -0.25). This maintains alert sensitivity while avoiding mathematical singularities.
3. Judge Metric Complexity in SQL
Explanation: LLM-as-judge evaluations often require stable reference sets, multi-dimensional rubrics, and cross-referencing with ground truth. Implementing this purely in SQL leads to verbose, unmaintainable queries.
Fix: Delegate complex judge logic to plpython3u or an external microservice. Keep SQL focused on statistical aggregation (averages, variances, hit rates) and store pre-computed judge scores in jsonb.
4. Missing Temporal Indexes
Explanation: Window scans over append-only logs without descending timestamp indexes force sequential scans, degrading audit latency from ~12ms to hundreds of milliseconds.
Fix: Create CREATE INDEX idx_ts_desc ON table (ts DESC). For high-volume systems, implement monthly partitioning on the timestamp column to keep index depth shallow.
5. Static Threshold Hardcoding
Explanation: Using fixed severity bands (-2.0, -3.0) across all metrics ignores domain-specific variance. A retrieval hit rate of 0.85 might be critical for medical queries but acceptable for casual chat.
Fix: Parameterize thresholds per metric category. Store baseline expectations in a configuration table and join during audit execution. Allow dynamic calibration based on SLA requirements.
6. Ignoring Seasonal Query Patterns
Explanation: Daily traffic spikes, weekend lulls, or periodic batch updates skew baselines, causing false drift alerts when normal cyclical patterns occur.
Fix: Implement day-of-week weighting in baseline calculations. Use rolling windows that align with business cycles, or apply exponential moving averages to smooth periodic noise.
7. Silent Fallback Routing
Explanation: When the agent switches to a fallback generator or stricter retrieval strategy, the transition often goes unlogged, making post-incident analysis impossible.
Fix: Explicitly record routing decisions in runtime_context. Include fields like fallback_triggered: true, reason: 'faithfulness_degradation', and previous_generator: 'gemma2:9b'.
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|
| Low-volume RAG (< 1k queries/day) | Single-table audit log with daily baseline | Simplicity outweighs optimization needs | Minimal (standard Postgres) |
| High-volume RAG (> 50k queries/day) | Monthly partitioning + materialized baseline views | Prevents index bloat and maintains sub-20ms audit latency | Moderate (partition management overhead) |
| Multi-domain RAG (medical, legal, casual) | Parameterized thresholds per domain | Domain-specific SLAs require different sensitivity bands | Low (configuration table join) |
| Strict compliance requirements | External audit export + in-DB real-time loop | Satisfies regulatory retention while maintaining live control | High (dual-write architecture) |
| Edge deployment (limited compute) | Batch audit every N minutes instead of per-request | Reduces CPU overhead while preserving drift detection | Low (delayed but functional) |
Configuration Template
-- Audit schema setup
CREATE TABLE inference_audit_log (
trace_id bigserial PRIMARY KEY,
execution_ts timestamptz NOT NULL DEFAULT now(),
user_query text NOT NULL,
query_vector vector(768),
matched_ids text[],
match_scores double precision[],
generated_text text,
validation_metrics jsonb,
runtime_context jsonb
);
CREATE INDEX idx_audit_ts_desc ON inference_audit_log (execution_ts DESC);
CREATE INDEX idx_audit_metrics ON inference_audit_log USING gin (validation_metrics);
-- Threshold configuration
CREATE TABLE audit_thresholds (
metric_category text PRIMARY KEY,
warning_zscore float NOT NULL DEFAULT -2.0,
critical_zscore float NOT NULL DEFAULT -3.0,
relative_drop_threshold float NOT NULL DEFAULT 0.25,
min_sample_size int NOT NULL DEFAULT 10
);
INSERT INTO audit_thresholds (metric_category, warning_zscore, critical_zscore, relative_drop_threshold, min_sample_size)
VALUES
('Generation.Faithfulness', -2.0, -3.0, 0.25, 10),
('Retrieval.HitRate', -2.0, -3.0, 0.20, 15);
Quick Start Guide
- Initialize the database: Provision a Postgres instance with
pgvector and pgai extensions. Create the inference_audit_log table and indexes using the configuration template.
- Deploy the audit function: Execute
system_health_snapshot() SQL definition. Verify execution time with EXPLAIN ANALYZE SELECT * FROM system_health_snapshot(1, 7);
- Configure MCP exposure: Register the audit function as an MCP tool. Test tool invocation with a sample observation window.
- Integrate agent loop: Implement the TypeScript/Python agent loop with rate-limited audit calls. Add circuit breaker logic for
critical severity responses.
- Validate end-to-end: Run 50 test queries, trigger artificial degradation (e.g., modify retrieval top-k), and confirm the audit function detects the shift within the configured window.