
How to diagnose where your RAG agent fabricates: an open-source A/B eval workflow with cross-lab blind judges

By Frank Brsrk


Current Situation Analysis

The fundamental failure mode in production RAG systems is not retrieval latency or embedding drift, but retrieval-gap fabrication. When a query exceeds the coverage of the knowledge base, standard RAG agents do not admit uncertainty. Instead, they pattern-match plausible answers from partial context, leveraging their RLHF alignment to prioritize "helpfulness" over factual grounding. This manifests as confident hallucination in safety-critical domains (e.g., dietary restrictions, compliance, medical triage).

Traditional debugging approaches fail because:

  1. Manual spot-checking is statistically blind: Engineers only test queries they anticipate, missing edge-case fabrication patterns.
  2. Single-model self-evaluation is biased: Using the same model family for generation and evaluation creates correlated failure modes and confirmation bias.
  3. Prompt-level constraints decay: Instructions like "do not hallucinate" or "only answer if supported" are routinely overridden by the model's completion bias when context is sparse.
  4. No isolation of variables: Without a controlled A/B harness, it is impossible to determine whether a stack improvement actually moves the needle or merely shifts the failure mode.

The pattern is universal across customer support, sales enablement, and internal Q&A. If you are shipping a RAG agent, fabrication is occurring on a subset of queries; the missing piece is a deterministic, cross-lab evaluation workflow that isolates the exact failure vector.

WOW Moment: Key Findings

The eval workflow isolates a single variable (runtime harness injection) against a baseline producer. Cross-lab blind judging eliminates family bias, while deterministic aggregation prevents meta-fabrication. The following table summarizes the reference run across five high-friction query types:

  Approach                      Citation Accuracy   Groundedness   Honesty/Uncertainty
  Baseline (Standard RAG)       2.1                 1.8            1.4
  Augmented (Runtime Harness)   3.9                 4.2            4.6

Key Findings:

  • Safety Compliance: 3 of 4 cross-lab judges agreed the harness correctly refused to certify unverified claims, while the baseline produced confident but ungrounded lists.
  • Signature/Entity Traps: The baseline consistently hallucinated authoritative labels (e.g., "chef's signature") when metadata was absent. The harness explicitly named the absence.
  • Rubric Calibration Edge Cases: In highly constrained domains (e.g., egg-allergen queries on desserts), the harness may structurally lose on specificity due to conservative refusal. This is a rubric calibration signal, not a harness failure.
  • Sweet Spot: The architecture achieves maximum diagnostic value when judges span independent model families and the synthesizer is strictly isolated from raw judge rows.

Core Solution

The architecture enforces deterministic evaluation through parallel isolation, cross-lab judging, and statistical aggregation.

Architecture Flow:

  1. Parallel Producers: Two identical agents (same model, same retrieval config) process each test question. Only the augmented producer receives a runtime tool harness.
  2. Anonymization & Evidence Injection: A formatter Code node merges outputs, anonymizes them as A and B, and inlines full retrieved chunks as verifiable evidence. Judges never know which side contains the harness (a minimal sketch follows this list).
  3. Cross-Lab Blind Judging: Four independent judges from different labs score each pair using a strict 5-dimension rubric. Cross-family design prevents correlated hallucination patterns.
  4. Deterministic Aggregation: A Code node computes per-judge totals, cross-judge agreement rates, per-dimension deltas, and hero artifacts. No LLM touches raw scores.
  5. Synthesizer Isolation: The final findings document is generated by a synthesizer agent that receives only aggregated statistics, eliminating the path for meta-output fabrication.
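
For step 2, here is a minimal sketch of the anonymizer/formatter Code node. The input field names (question_id, question, producer, answer, retrieved_chunks) are assumptions standing in for the blueprint's actual schema; since both producers share a retrieval config, either side's chunks can serve as the shared evidence block.

  // n8n Code node (mode: Run Once for All Items) — a minimal sketch.
  // Assumes the two producer branches were merged upstream and each item
  // carries { question_id, question, producer: 'baseline' | 'augmented',
  // answer, retrieved_chunks }. Adapt field names to your own workflow.
  const byQuestion = {};
  for (const item of $input.all()) {
    const j = item.json;
    (byQuestion[j.question_id] ??= {})[j.producer] = j;
  }

  return Object.entries(byQuestion).map(([qid, pair]) => {
    // Flip a coin per question so judges cannot learn a fixed A/B mapping.
    const baselineIsA = Math.random() < 0.5;
    const a = baselineIsA ? pair.baseline : pair.augmented;
    const b = baselineIsA ? pair.augmented : pair.baseline;
    return {
      json: {
        question_id: qid,
        // The mapping is retained for de-anonymization during aggregation,
        // but is never included in the judge prompt.
        mapping: {
          A: baselineIsA ? 'baseline' : 'augmented',
          B: baselineIsA ? 'augmented' : 'baseline',
        },
        judge_payload: {
          question: a.question,
          answer_A: a.answer,
          answer_B: b.answer,
          evidence: a.retrieved_chunks, // full chunks inlined as verifiable evidence
        },
      },
    };
  });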

Judge Rubric Output (Strict JSON):
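
The exact schema is workflow-specific; the sketch below reconstructs a plausible judge output from the dimensions discussed above (citation accuracy, groundedness, honesty/uncertainty, specificity, safety compliance). Treat the field names as assumptions to adapt, not the blueprint's canonical schema.

  {
    "winner": "A",
    "scores": {
      "A": { "citation_accuracy": 4, "groundedness": 4, "honesty_uncertainty": 5, "specificity": 3, "safety_compliance": 5 },
      "B": { "citation_accuracy": 2, "groundedness": 2, "honesty_uncertainty": 1, "specificity": 4, "safety_compliance": 2 }
    },
    "rationale": "A cites chunk_ids for every claim and explicitly names missing evidence; B asserts unverified allergen details."
  }

Pinning judge temperature to 0 and rejecting any response that fails JSON parsing keeps the downstream aggregation step deterministic.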


Harness Implementation Example: The reference harness (Ejentum Logic API) injects runtime directives rather than static prompt instructions:

Amplify: acknowledgment that absence of evidence is not evidence of absence.
Suppress: confident denial without an exhaustive check; definitive negation from absence of knowledge; shallow agreement without examining the underlying pattern.

This approach prevents discipline decay across long context windows and can be swapped with any HTTP/MCP/framework-native tool.
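
The same idea in generic middleware form might look like the sketch below. This is not the Ejentum Logic API surface; the endpoint URL and response shape are hypothetical placeholders for whatever HTTP/MCP tool you wire in.

  // Re-inject runtime directives on every call, so constraints are
  // re-asserted per turn instead of decaying inside a long static
  // system prompt. Endpoint and payload shape are assumptions.
  async function withRuntimeDirectives(messages) {
    const res = await fetch('https://harness.example.com/directives', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ domain: 'safety-compliance' }),
    });
    const { amplify, suppress } = await res.json(); // assumed: arrays of strings

    // Append directives as the final system message so they sit closest
    // to the generation step in the context window.
    return [
      ...messages,
      {
        role: 'system',
        content: `Amplify: ${amplify.join('; ')}\nSuppress: ${suppress.join('; ')}`,
      },
    ];
  }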

Pitfall Guide

  1. Same-Family Judge Bias: Using judges from the same model family as producers introduces correlated failure modes and inflated agreement. Always enforce cross-lab/cross-family judge selection and disclose known limitations.
  2. Rubric Calibration Drift: Judges may interpret dimensions differently across runs or query types. Maintain strict JSON schemas, pin temperature to 0, and periodically recalibrate dimension definitions against ground-truth edge cases.
  3. Harness Prompt Decay: Wiring evaluation logic directly into system prompts causes discipline to degrade as context windows expand. Use runtime tool injection or middleware hooks to re-inject constraints per call.
  4. Small Sample Size Noise: Reference runs with n=5 questions are statistically noisy. Always expand to 50+ production-representative queries before drawing architectural conclusions or shipping changes.
  5. Aggregator Dimension Mismatch: Changing rubric dimensions without updating the deterministic aggregator code breaks per-dimension delta calculations. Version-control aggregator logic alongside rubric definitions (see the aggregator sketch after this list).
  6. Synthesizer Stat Fabrication: Allowing the synthesizer LLM to process raw judge rows enables it to hallucinate metrics or rewrite verdicts. Always feed only pre-computed aggregated stats to the synthesizer.
  7. Vector Store Schema Assumptions: Chunk schemas are not universal across platforms. Ensure chunk_id, category, name, and description mappings align with your target vector store's metadata filters and embedding pipeline.
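
To make pitfalls 5 and 6 concrete, here is a hedged sketch of a deterministic aggregator Code node that fails loudly on rubric/aggregator mismatch and emits only pre-computed stats for the synthesizer. The dimension names are assumptions mirroring the rubric discussed earlier.

  // Deterministic aggregator sketch (n8n Code node). Dimension names live
  // in one version-controlled constant shared with the rubric prompt, so a
  // rubric change that isn't mirrored here fails loudly, not silently.
  const DIMENSIONS = [
    'citation_accuracy', 'groundedness', 'honesty_uncertainty',
    'specificity', 'safety_compliance',
  ]; // assumed names; keep in lockstep with the judge rubric

  const rows = $input.all().map((i) => i.json); // one row per judge per question
  const deltas = Object.fromEntries(DIMENSIONS.map((d) => [d, 0]));
  let aWins = 0;

  for (const row of rows) {
    for (const d of DIMENSIONS) {
      if (!(d in row.scores.A) || !(d in row.scores.B)) {
        throw new Error(`Rubric/aggregator mismatch on dimension "${d}"`);
      }
      deltas[d] += row.scores.A[d] - row.scores.B[d];
    }
    if (row.winner === 'A') aWins += 1;
  }

  for (const d of DIMENSIONS) deltas[d] /= rows.length;

  // Only these aggregated stats — never raw judge rows — reach the synthesizer.
  return [{
    json: {
      n_rows: rows.length,
      agreement_rate: Math.max(aWins, rows.length - aWins) / rows.length, // majority share
      per_dimension_mean_delta: deltas, // A minus B, averaged over judge rows
    },
  }];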

Deliverables

  • Blueprint: Platform-agnostic n8n workflow JSON (credentials stripped) with parallel producer routing, anonymizer merge, cross-lab judge fan-out, and deterministic aggregation pipeline.
  • Checklist: Adaptation matrix for KB replacement, test query selection criteria, judge family mapping, rubric validation steps, and aggregator configuration verification.
  • Configuration Templates:
    • Standalone JavaScript Code nodes (anonymizer, formatter, aggregator, synthesizer)
    • Framework-agnostic system prompts (judge rubric, synthesizer instructions)
    • Reference KB schema (menu_kb.json with 49 chunks) and embedding loader script (an example chunk follows below)
    • Runtime harness directive templates for safety/compliance domains
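
For orientation, a single chunk in the reference KB might look like the example below. The field names follow the chunk_id/category/name/description mapping noted in the pitfall guide; the values are invented placeholders, not entries from the shipped menu_kb.json.

  {
    "chunk_id": "dessert-012",
    "category": "dessert",
    "name": "Flourless Chocolate Torte",
    "description": "Contains egg and dairy; prepared in a shared kitchen. No gluten-containing ingredients listed."
  }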