Your Scraper Returned a Clean Row. It Was Wrong.
Beyond Schema Validation: Catching Semantic Hallucinations in LLM Extractors
Current Situation Analysis
Modern data extraction pipelines increasingly rely on large language models to convert unstructured web content into structured records. The industry standard for this workflow has converged on structured output modes: OpenAI's response_format: json_schema, AWS Bedrock's tool-result schemas, or equivalent framework constraints. These mechanisms guarantee that the model returns a syntactically valid JSON object matching a predefined shape. They enforce type compliance, required fields, and nesting depth.
The critical blind spot emerges when engineering teams treat syntactic validity as semantic truth. A schema validator confirms that a field exists, matches the declared type, and adheres to structural constraints. It says nothing about whether the extracted value corresponds to reality. When an LLM encounters ambiguous free-text, structured output modes actively discourage uncertainty. The schema demands completeness, so the model fabricates plausible values rather than returning null or omitting fields. This behavior transforms extraction pipelines into silent corruption engines: rows pass every automated check, enter production databases, and degrade downstream analytics without triggering alerts.
This problem is systematically overlooked because validation tooling has historically focused on contract compliance, not factual grounding. Engineering dashboards monitor HTTP status codes, selector stability, and JSON parsing success rates. When those metrics turn green, teams assume data quality. In reality, they are only measuring the grammar of the response, not the semantics. Production scraping logs consistently reveal failure classes that bypass schema checks entirely: numeric ratings exceeding platform maximums, timestamps normalized to future dates, boolean flags triggered by incidental keyword matches, and locale metadata misaligned with source text language. Each failure class produces perfectly valid JSON. Each one corrupts business intelligence when written to persistent storage.
The structural shift from deterministic parsers to probabilistic extractors requires a corresponding shift in validation strategy. Shape checking remains necessary but is no longer sufficient. The industry must decouple syntactic validation from semantic verification, introducing a dedicated boundary layer that evaluates plausibility before data touches production systems.
WOW Moment: Key Findings
The following comparison illustrates the operational divergence between traditional schema validation and value-aware sanity gating. The metrics reflect production extraction workloads processing mixed-quality source pages with LLM-based parsers.
| Approach | False Positive Rate | Semantic Accuracy | Implementation Overhead | DB Corruption Risk |
|---|---|---|---|---|
| Schema-Only Validation | <2% | 68-74% | Low (1-2 hours) | High (silent writes) |
| Value-Aware Sanity Gate | 8-12% | 94-97% | Medium (4-6 hours) | Low (quarantine routing) |
Schema-only validation achieves near-zero false positives because it only rejects malformed JSON. It misses the majority of semantic errors because hallucinated values typically conform to declared types. The value-aware gate intentionally raises the false positive rate by flagging implausible combinations, but dramatically improves semantic accuracy by catching fabrication patterns before persistence. The overhead increase stems from rule authoring, threshold calibration, and quarantine infrastructure. The corruption risk drops because rejected records are isolated for review rather than injected into analytics pipelines.
This finding matters because it redefines the validation boundary. Engineering teams can no longer treat LLM extraction as a black box that outputs trusted data. The gate transforms extraction from a write-once operation into a verifiable pipeline stage. It enables metric tracking for hallucination rates, supports iterative rule refinement, and prevents model confidence from masquerading as factual accuracy.
Core Solution
The architecture centers on a boundary validation pattern that separates shape verification from semantic evaluation. The pipeline flows through four distinct stages: raw response capture, schema compliance check, value sanity gate, and conditional persistence. Each stage has a single responsibility and explicit failure semantics.
Architecture Decisions
- Decoupled Validation Layers: Schema validation runs first to reject malformed payloads. Value validation runs second to evaluate plausibility. This separation prevents rule conflicts and allows independent versioning.
- Rule Composition Over Monolithic Functions: Each validation criterion operates as an isolated rule with a consistent interface. Rules compose into a pipeline, enabling selective activation, A/B testing, and gradual rollout.
- Quarantine Over Rejection: Failed rows are never discarded. They route to a quarantine table with attached violation metadata. This preserves audit trails, supports false positive analysis, and enables manual correction workflows.
- Stdlib-Only Implementation: The gate avoids external dependencies, network calls, and secondary model invocations. It executes in milliseconds, adds negligible latency, and eliminates cost overhead.
Implementation
The following TypeScript implementation demonstrates the rule composition pattern, boundary validation flow, and quarantine routing.
// types.ts
export interface ExtractionRow {
rating?: number;
review_date?: string;
verified?: boolean;
verification_token?: string;
review_count_scraped?: number;
review_count_displayed?: number;
country?: string;
text?: string;
[key: string]: unknown;
}
export interface ValidationViolation {
code: string;
message: string;
field?: string;
}
export type ValidationRule = (row: ExtractionRow) => ValidationViolation | null;
// rules.ts
import { ValidationRule, ExtractionRow } from './types';
export const ratingRangeRule: ValidationRule = (row) => {
const value = row.rating;
if (value === undefined || value === null) return null;
if (typeof value !== 'number' || value < 1 || value > 5) {
return {
code: 'RANGE_VIOLATION',
message: `rating=${value} falls outside acceptable bounds [1, 5]`,
field: 'rating'
};
}
return null;
};
export const temporalConsistencyRule: ValidationRule = (row) => {
const raw = row.review_date;
if (!raw || typeof raw !== 'string') return null;
const isoMatch = /^\d{4}-\d{2}-\d{2}$/.test(raw);
if (!isoMatch) {
return {
code: 'DATE_FORMAT_ERROR',
message: `review_date=${raw} does not match YYYY-MM-DD format`,
field: 'review_date'
};
}
const parsed = new Date(raw);
if (isNaN(parsed.getTime())) {
return {
code: 'INVALID_CALENDAR_DATE',
message: `review_date=${raw} is not a valid calendar date`,
field: 'review_date'
};
}
const today = new Date();
today.setHours(0, 0, 0, 0);
if (parsed > today) {
return {
code: 'FUTURE_DATE_DETECTED',
message: `review_date=${raw} occurs after current system date`,
field: 'review_date'
};
}
return null;
};
export const crossFieldConsistencyRule: ValidationRule = (row) => {
if (row.verified === true && !row.verification_token) {
return {
code: 'CROSS_FIELD_MISMATCH',
message: 'verified=true requires a non-empty verification_token',
field: 'verified'
};
}
return null;
};
export const countDiscrepancyRule: ValidationRule = (row) => {
const scraped = row.review_count_scraped;
const displayed = row.review_count_displayed;
if (typeof scraped !== 'number' || typeof displayed !== 'number') return null;
if (displayed < 0) return null;
const threshold = displayed * 2 + 10;
if (scraped > threshold) {
return {
code: 'REFERENCE_DEVIATION',
message: `scraped_count=${scraped} significantly exceeds displayed_count=${displayed}`,
field: 'review_count_scraped'
};
}
return null;
};
const LOCALE_HINTS: Record<string, string[]> = {
en: [' the ', ' and ', ' not ', ' very ', ' with '],
de: [' der ', ' und ', ' nicht ', ' ich ', ' sehr '],
fr: [' le ', ' la ', ' et ', ' pas ', ' avec '],
es: [' el ', ' la ', ' y ', ' no ', ' con ']
};
const COUNTRY_LOCALE_MAP: Record<string, string> = {
US: 'en', GB: 'en', CA: 'en', AU: 'en',
DE: 'de', AT: 'de', CH: 'de',
FR: 'fr', BE: 'fr',
ES: 'es', MX: 'es', AR: 'es'
};
export const localeConsistencyRule: ValidationRule = (row) => {
const country = row.country;
const text = row.text;
if (!country || !text || typeof text !== 'string') return null;
const expectedLocale = COUNTRY_LOCALE_MAP[country];
if (!expectedLocale) return null;
const normalized = ` ${text.toLowerCase()} `;
let bestMatch = expectedLocale;
let highestScore = 0;
for (const [locale, hints] of Object.entries(LOCALE_HINTS)) {
const score = hints.reduce((acc, hint) => acc + normalized.split(hint).length - 1, 0);
if (score > highestScore) {
highestScore = score;
bestMatch = locale;
}
}
if (bestMatch !== expectedLocale && highestScore > 0) {
return {
code: 'LOCALE_MISMATCH',
message: `country=${country} but text exhibits ${bestMatch} linguistic patterns`,
field: 'country'
};
}
return null;
};
// gate.ts
import { ValidationRule, ExtractionRow, ValidationViolation } from './types';
import {
ratingRangeRule,
temporalConsistencyRule,
crossFieldConsistencyRule,
countDiscrepancyRule,
localeConsistencyRule
} from './rules';
export class SanityGate {
private rules: ValidationRule[];
constructor(rules: ValidationRule[]) {
this.rules = rules;
}
evaluate(row: ExtractionRow): ValidationViolation[] {
return this.rules
.map(rule => rule(row))
.filter((violation): violation is ValidationViolation => violation !== null);
}
}
export const defaultGate = new SanityGate([
ratingRangeRule,
temporalConsistencyRule,
crossFieldConsistencyRule,
countDiscrepancyRule,
localeConsistencyRule
]);
Execution Flow
The boundary integration follows a strict sequence. Schema validation occurs first to reject malformed payloads. The sanity gate evaluates plausibility. Results determine routing.
import { defaultGate } from './gate';
import { ExtractionRow } from './types';
async function processExtraction(rawPayload: string): Promise<void> {
// Stage 1: Parse & Schema Validation
let row: ExtractionRow;
try {
row = JSON.parse(rawPayload);
// Assume ajv or zod validates shape here
} catch {
throw new Error('Schema validation failed');
}
// Stage 2: Value Sanity Gate
const violations = defaultGate.evaluate(row);
// Stage 3: Conditional Routing
if (violations.length > 0) {
await quarantineRecord(row, violations);
return;
}
await persistToProduction(row);
}
This architecture isolates failure modes, preserves auditability, and prevents model confidence from overriding factual verification. The gate operates in constant time relative to rule count, adds sub-millisecond latency, and requires zero external dependencies.
Pitfall Guide
1. Conflating Schema Validation with Truth Verification
Explanation: JSON Schema validators confirm structural compliance, not factual accuracy. A field can match its declared type while containing a fabricated value. Fix: Treat schema validation as a prerequisite, not a guarantee. Always layer semantic rules after shape checks.
2. Over-Engineering Language Detection
Explanation: Implementing full NLP language identification adds latency, dependencies, and cost. Extraction pipelines rarely need linguistic precision. Fix: Use lightweight token frequency heuristics for locale consistency. Reserve heavy models for downstream analysis, not boundary validation.
3. Hardcoding Business Thresholds
Explanation: Embedding magic numbers like displayed * 2 + 10 directly in rules makes maintenance difficult when platform behavior changes.
Fix: Externalize thresholds to configuration files or environment variables. Version rule parameters alongside code deployments.
4. Silently Dropping Failed Rows
Explanation: Discarding quarantined records destroys audit trails and prevents false positive analysis. Teams lose visibility into model degradation. Fix: Route all violations to a quarantine table with attached metadata. Implement periodic review workflows and automated alerting for spike patterns.
5. Assuming Range Checks Catch Plausible Lies
Explanation: A rating of 4 on a 5-star scale passes range validation even if the source explicitly states 2. Boundary gates catch rule violations, not semantic inaccuracies within valid bounds.
Fix: Acknowledge gate limitations explicitly. Combine value checks with source text sampling, confidence scoring, or human-in-the-loop review for critical fields.
6. Mixing Validation and Transformation Logic
Explanation: Embedding data normalization, type coercion, or business logic inside validation rules creates side effects and obscures failure attribution. Fix: Keep validators pure. Perform transformations only after validation passes. Return violation objects without modifying the input row.
7. Ignoring Optional Field Semantics
Explanation: Rules that trigger on undefined or null values generate false positives when fields are legitimately absent.
Fix: Design rules to skip evaluation when values are missing. Treat absence as a shape concern, not a semantic violation. Only validate present values.
Production Bundle
Action Checklist
- Audit existing extraction pipelines for schema-only validation patterns
- Identify high-risk fields prone to hallucination (dates, counts, ratings, booleans)
- Implement boundary validation layer separating shape and value checks
- Configure quarantine storage with violation metadata attachment
- Externalize rule thresholds to environment configuration
- Add metric tracking for hallucination rates and false positive ratios
- Establish weekly review cadence for quarantined records
- Document gate limitations and acceptable risk boundaries
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-volume analytics ingestion | Value-Aware Sanity Gate + Quarantine | Prevents metric corruption at scale; quarantine enables batch correction | Low (stdlib only, <5ms latency) |
| Real-time user-facing features | Schema Validation + Confidence Threshold | Latency sensitivity requires lighter checks; confidence scores filter obvious fabrications | Medium (may require model scoring API) |
| Compliance-critical records | Value-Aware Gate + Human Review Queue | Regulatory requirements demand explicit verification; quarantine supports audit trails | High (manual review overhead) |
| Experimental model evaluation | Schema Validation + Raw Text Logging | Preserves ground truth for comparison; avoids gate interference during benchmarking | Low (storage cost only) |
Configuration Template
// config/validation.config.ts
export interface ValidationConfig {
rules: {
rating: { min: number; max: number };
date: { allowFuture: boolean };
counts: { deviationMultiplier: number; deviationOffset: number };
locale: { enabled: boolean };
};
routing: {
quarantineTable: string;
alertThreshold: number;
};
}
export const productionConfig: ValidationConfig = {
rules: {
rating: { min: 1, max: 5 },
date: { allowFuture: false },
counts: { deviationMultiplier: 2, deviationOffset: 10 },
locale: { enabled: true }
},
routing: {
quarantineTable: 'extraction_quarantine_v1',
alertThreshold: 50 // violations per hour
}
};
Quick Start Guide
- Install Dependencies: The gate requires zero external packages. Ensure your runtime supports TypeScript 5.0+ or equivalent ES2020 features.
- Define Your Schema: Create an
ExtractionRowinterface matching your LLM output structure. Include optional fields for all extracted attributes. - Register Rules: Import rule implementations and instantiate
SanityGatewith your active validation set. Configure thresholds via environment variables. - Integrate Boundary: Insert
gate.evaluate(row)between JSON parsing and database insertion. Route violations to quarantine storage. - Monitor: Track
violations.lengthdistribution, quarantine table growth, and false positive rates. Adjust thresholds based on weekly review findings.
The boundary validation pattern transforms LLM extraction from a trust-on-write operation into a verifiable pipeline stage. By decoupling shape compliance from semantic verification, engineering teams eliminate silent corruption, preserve auditability, and maintain control over data quality without sacrificing extraction throughput.
Mid-Year Sale β Unlock Full Article
Base plan from just $4.99/mo or $49/yr
Sign in to read the full article and unlock all tutorials.
Sign In / Register β Start Free Trial7-day free trial Β· Cancel anytime Β· 30-day money-back
