I built the first open benchmark for federal contracting AI. Here's what it shows about frontier LLMs.
Mitigating Hallucination in Regulated Text Extraction: A Multi-Task Benchmark for Federal Procurement
Current Situation Analysis
In regulated domains like federal contracting, entity extraction errors are not mere inconveniences; they are operational failures. When an AI system extracts Federal Acquisition Regulation (FAR) or Defense Federal Acquisition Regulation Supplement (DFARS) clause numbers from a solicitation, hallucinating a non-existent clause (e.g., FAR 52.999-99) can trigger compliance rejections, disqualifying a bid before evaluation. Despite the high stakes, the industry lacks standardized benchmarks to measure this specific failure mode. Commercial procurement tools process solicitations using natural language processing, yet none publish reproducible metrics on hallucination rates. Academic efforts remain fragmented, often focusing on narrow sub-tasks like Section 508 compliance rather than end-to-end extraction reliability.
The absence of measurement has allowed significant reliability gaps to persist in frontier models. Independent evaluation reveals that even top-tier models exhibit non-trivial hallucination rates when extracting canonical clause numbers from real regulatory text. GPT-4o hallucinates clause numbers in approximately 11% of predictions on authentic FAR text, while Claude Haiku 4.5 hallucinates in over 32% of cases. These rates are unacceptable for automated compliance workflows. The industry has prioritized aggregate accuracy metrics like F1 score while neglecting the specific error mode that causes the most damage: the invention of authoritative references.
Key Findings
A rigorous evaluation of multi-task extraction models exposes a critical divergence between aggregate performance and hallucination resistance. Benchmarked against a curated dataset of real federal solicitations and structured regulatory text, a specialized 150-million-parameter model demonstrates a superior trade-off profile compared to frontier systems: the largest frontier model still leads on F1, but the faster frontier tiers introduce hallucination risks that the specialized compact model more than halves.
The following comparison isolates performance on the "Real FAR" slice of the benchmark—text sourced exclusively from the Electronic Code of Federal Regulations, eliminating any bias from synthetic data generation.
| Model | Scale | Entity F1 (Real FAR) | Hallucination Rate | Inference Latency Profile |
|---|---|---|---|---|
| Claude Sonnet 4.6 | Frontier | 0.984 | 0.0% | High / Cloud-dependent |
| GPT-4o | Frontier | 0.937 | 11.0% | High / Cloud-dependent |
| Claude Haiku 4.5 | Frontier | 0.804 | 32.1% | Medium / Cloud-dependent |
| Specialized 150M | 149M | 0.800 | 13.8% | Low / On-prem capable |
Why this matters: The specialized 150M model matches Claude Haiku's F1 score on real text but reduces the hallucination rate by more than half. For organizations requiring on-premises deployment, predictable latency, and auditable extraction, the compact model represents a Pareto-optimal solution. It achieves competitive extraction accuracy while significantly lowering the probability of fabricating regulatory citations. Furthermore, this model can be trained in under five minutes on a consumer-grade GPU, democratizing access to domain-specific reliability without reliance on external API providers.
Core Solution
Building a reliable extraction system for heterogeneous regulatory data requires a multi-task architecture with explicit supervision routing. Federal procurement data arrives from disparate sources: SAM.gov provides solicitation metadata (titles, notice types, set-asides), while the eCFR provides raw clause text. A single model must learn from these varied signals without allowing one task to dominate the gradient updates or dilute the signal for others.
The solution employs a shared encoder with task-specific heads and a dynamic task masking mechanism. This approach allows the model to train jointly on all available data while ensuring that each example only updates the heads relevant to its source.
Architecture Decisions
- Shared Encoder: A ModernBERT-base encoder processes all inputs, capturing structural patterns common to both solicitations and regulatory text.
- Task Heads: Four distinct heads handle the extraction tasks:
- Notice Classification: Softmax head for eight SAM.gov notice types.
- NAICS Prediction: Softmax head for twenty top-level sectors.
- Set-Aside Identification: Sigmoid heads for multi-label classification (SBA, SDVOSB, WOSB, etc.).
- Clause Extraction: Token-level BIO tagging head for entity recognition.
- Task Masking: A boolean mask per record indicates which tasks are supervised. SAM metadata records mask out the clause extraction head; raw clause text records mask out classification heads. This prevents gradient noise and ensures efficient learning from sparse supervision.
Implementation Example
The following TypeScript-style pseudocode illustrates the core training loop with task masking. Tensor handling and the optimizer step are schematic rather than tied to a particular framework; the structure of the masking logic is what matters.
import { ModernBERTModel, TaskHead, MultiTaskLoss } from 'gov-nlp-core';

interface TrainingRecord {
  input_ids: number[];
  task_mask: Record<string, boolean>; // keys: notice_type, naics, set_aside, clause_extraction
  labels: Record<string, any>;
}

class GovContractMultiTaskModel {
  private encoder: ModernBERTModel;
  private heads: Map<string, TaskHead>;

  constructor() {
    this.encoder = new ModernBERTModel('base');
    this.heads = new Map([
      ['notice_type', new TaskHead('classification', 8)],
      ['naics', new TaskHead('classification', 20)],
      ['set_aside', new TaskHead('multi_label', 7)],
      ['clause_extraction', new TaskHead('token_tagging', 3)], // BIO
    ]);
  }

  forward(record: TrainingRecord) {
    const hidden_states = this.encoder.encode(record.input_ids);
    const losses: Record<string, number> = {};
    let active_tasks = 0;

    for (const [task_name, head] of this.heads.entries()) {
      if (!record.task_mask[task_name]) continue;
      const logits = head.predict(hidden_states);
      losses[task_name] = head.compute_loss(logits, record.labels[task_name]);
      active_tasks++;
    }

    // Aggregate loss only over active tasks to prevent signal dilution
    const total_loss = active_tasks > 0
      ? Object.values(losses).reduce((a, b) => a + b, 0) / active_tasks
      : 0;
    return { loss: total_loss, losses };
  }
}

// Usage in training loop (backward() and optimizer.step() are schematic autograd calls)
const model = new GovContractMultiTaskModel();
const batch: TrainingRecord[] = load_batch_from_sam_and_ecfr();
for (const record of batch) {
  const result = model.forward(record);
  result.loss.backward();
  optimizer.step();
}
Rationale: The task masking strategy is critical. Without it, records lacking clause text would produce undefined gradients for the extraction head, or require padding with dummy labels that could confuse the model. By explicitly gating gradient flow, the model learns robust representations from limited data. The use of ModernBERT-base provides strong priors for structural pattern recognition, which is essential for identifying clause number formats like 52.219-9.
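To make the masking concrete, here is a minimal sketch of how records from the two sources could be assembled. The helper names and field values are illustrative, not part of any existing library; they reuse the TrainingRecord shape defined above.
// Illustrative record construction: SAM.gov metadata supervises the classification
// heads, while raw eCFR clause text supervises only the extraction head.
function build_sam_record(tokens: number[], labels: Record<string, any>): TrainingRecord {
  return {
    input_ids: tokens,
    task_mask: { notice_type: true, naics: true, set_aside: true, clause_extraction: false },
    labels,
  };
}

function build_ecfr_record(tokens: number[], bio_tags: number[]): TrainingRecord {
  return {
    input_ids: tokens,
    task_mask: { notice_type: false, naics: false, set_aside: false, clause_extraction: true },
    labels: { clause_extraction: bio_tags },
  };
}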
Pitfall Guide
The "Self-Grading" Bias
- Explanation: Evaluating a model on synthetic data generated by the same model family introduces severe bias. If Claude models are tested on text authored by Claude, performance metrics will be artificially inflated.
- Fix: Always separate evaluation data by provenance. Report metrics on "Real" slices distinct from "Synthetic" slices. Audit the source of every test record.
F1 Score Blindness
- Explanation: High F1 scores can mask dangerous hallucination rates. A model may correctly extract most clauses but invent a few critical ones, leading to compliance failures.
- Fix: Track hallucination rate as a primary metric. Define hallucination as any predicted entity that does not exist in the canonical reference corpus. Optimize for low hallucination even if it slightly reduces F1.
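A minimal sketch of that metric, assuming the canonical clause list has already been loaded into a set of strings (the example values below are illustrative):
// Hallucination rate: fraction of predicted clause entities absent from the canonical corpus.
function hallucination_rate(predicted: string[], canonical: Set<string>): number {
  if (predicted.length === 0) return 0;
  const fabricated = predicted.filter((clause) => !canonical.has(clause.trim()));
  return fabricated.length / predicted.length;
}

// Example: one fabricated clause out of three predictions yields a rate of 0.33.
const canonical_clauses = new Set(['52.219-9', '52.204-21']);
console.log(hallucination_rate(['52.219-9', '52.204-21', '52.999-99'], canonical_clauses));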
Structural Pattern Overfitting
- Explanation: Clause numbers follow a rigid pattern (52.<digits>-<digits>). Models may learn to regex-match this pattern without understanding context, failing on edge cases or novel clause structures.
- Fix: Include adversarial examples in the training set that mimic clause patterns but are not valid clauses, as in the sketch below. Validate generalization on out-of-distribution text.
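One way to generate such negatives is to perturb real clause numbers into strings that match the surface pattern but are absent from the canonical corpus. A minimal sketch, where the numeric ranges are assumptions chosen only to produce plausible-looking identifiers:
// Generate near-miss clause strings that look valid but are not in the canonical corpus.
function make_adversarial_clauses(canonical: Set<string>, count: number): string[] {
  const negatives: string[] = [];
  while (negatives.length < count) {
    const part = 200 + Math.floor(Math.random() * 53);   // middle segment, e.g. 204, 219, 252
    const suffix = 1 + Math.floor(Math.random() * 99);   // trailing segment after the dash
    const candidate = `52.${part}-${suffix}`;
    if (!canonical.has(candidate)) negatives.push(candidate);
  }
  return negatives;
}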
API Quota Data Scarcity
- Explanation: Public APIs like SAM.gov often have rate limits. Interrupted data collection can result in incomplete records (e.g., titles without descriptions), degrading model performance on classification tasks.
- Fix: Implement robust data logging and quota management. Use synthetic augmentation sparingly for rare classes, but prioritize real data. Plan training runs around quota cycles.
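A minimal sketch of a backoff wrapper around whatever page-fetching function the collector uses; the fetch_page callback and retry parameters are assumptions, not SAM.gov's actual interface:
// Retry a fetch with exponential backoff so rate-limited runs degrade gracefully.
async function fetch_with_backoff<T>(
  fetch_page: () => Promise<T>,
  max_retries = 5,
  base_delay_ms = 1000,
): Promise<T> {
  for (let attempt = 0; attempt < max_retries; attempt++) {
    try {
      return await fetch_page();
    } catch (err) {
      // Log and back off; partial collections stay auditable.
      const delay = base_delay_ms * 2 ** attempt;
      console.warn(`Fetch failed (attempt ${attempt + 1}), retrying in ${delay} ms`, err);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw new Error('Exceeded retry budget; resume collection in the next quota window.');
}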
Rare Class Imbalance
- Explanation: Set-aside types like HUBZone or EDWOSB appear infrequently in solicitations. Models trained on real data alone may fail to recognize these classes.
- Fix: Apply targeted synthetic augmentation for rare classes only. Ensure synthetic examples are diverse and validated against regulatory definitions.
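A minimal sketch of targeted augmentation: only classes below a frequency threshold receive synthetic records. The generate_synthetic callback is hypothetical, and its output would still need validation against the regulatory definitions.
// Add synthetic records only for set-aside classes that fall below a minimum count.
function augment_rare_classes(
  records: TrainingRecord[],
  class_counts: Record<string, number>,
  min_count: number,
  generate_synthetic: (class_name: string, n: number) => TrainingRecord[],
): TrainingRecord[] {
  const augmented = [...records];
  for (const [class_name, count] of Object.entries(class_counts)) {
    if (count < min_count) {
      augmented.push(...generate_synthetic(class_name, min_count - count));
    }
  }
  return augmented;
}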
Ignoring Heterogeneous Supervision
- Explanation: Training a single model on mixed data sources without masking can cause tasks with abundant data to dominate, starving tasks with sparse supervision.
- Fix: Implement task masking or dynamic loss weighting. Ensure each task receives proportional gradient updates based on available supervision.
Latency vs. Accuracy Trade-off Neglect
- Explanation: Frontier models offer high accuracy but introduce latency and cost constraints that may be prohibitive for high-volume processing or on-premises deployment.
- Fix: Benchmark compact models alongside frontier models. Evaluate the Pareto frontier of accuracy, hallucination rate, latency, and cost. Select the model that meets operational constraints.
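A minimal sketch of a Pareto filter over candidate models, assuming each candidate carries measured F1, hallucination rate, and latency (the Candidate shape is illustrative):
interface Candidate {
  name: string;
  f1: number;                 // higher is better
  hallucination_rate: number; // lower is better
  latency_ms: number;         // lower is better
}

// Keep only candidates that no other candidate dominates on all three axes.
function pareto_front(candidates: Candidate[]): Candidate[] {
  return candidates.filter((a) =>
    !candidates.some((b) =>
      b !== a &&
      b.f1 >= a.f1 &&
      b.hallucination_rate <= a.hallucination_rate &&
      b.latency_ms <= a.latency_ms &&
      (b.f1 > a.f1 || b.hallucination_rate < a.hallucination_rate || b.latency_ms < a.latency_ms),
    ),
  );
}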
Production Bundle
Action Checklist
- Define Hallucination Metric: Establish a canonical reference corpus and define hallucination as any predicted entity not found in the corpus.
- Audit Data Provenance: Tag every record with its source. Separate evaluation sets by source to detect bias.
- Implement Task Masking: Use a multi-task architecture with explicit masking to handle heterogeneous data sources efficiently.
- Measure Real-World Performance: Report metrics on "Real" data slices, excluding synthetic test records.
- Benchmark Compact Models: Evaluate 100M-200M parameter models for cost-effective, low-latency deployment.
- Handle Rare Classes: Use targeted synthetic augmentation for underrepresented categories.
- Validate Structural Generalization: Test models on adversarial examples to prevent pattern overfitting.
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| High-Volume Compliance | Specialized 150M Model | Low latency, on-prem capable, reduced hallucination vs. mid-tier frontier models. | Low (GPU inference) |
| Zero-Hallucination Tolerance | Claude Sonnet 4.6 | Highest F1 and zero hallucination on real text. | High (API costs) |
| Budget-Constrained R&D | Specialized 150M Model | Trainable on consumer GPU in minutes; no API dependency. | Negligible |
| Rare Set-Aside Detection | Augmented Multi-Task Model | Synthetic augmentation improves recall for rare classes. | Medium (Augmentation cost) |
| Audit & Traceability | On-Prem Specialized Model | Full control over data and model weights; no external data leakage. | Medium (Infrastructure) |
Configuration Template
Use this configuration to initialize a multi-task training run with task masking and hallucination tracking; a loading sketch follows the template.
model:
  base: "modernbert-base"
  heads:
    - name: "notice_type"
      type: "classification"
      num_labels: 8
    - name: "naics"
      type: "classification"
      num_labels: 20
    - name: "set_aside"
      type: "multi_label"
      num_labels: 7
    - name: "clause_extraction"
      type: "token_tagging"
      num_labels: 3

training:
  task_masking: true
  epochs: 6
  batch_size: 32
  learning_rate: 2e-5
  metrics:
    - "entity_f1"
    - "hallucination_rate"
    - "macro_f1"

data:
  sources:
    - "sam.gov"
    - "ecfr_title_48"
  split:
    train: 0.7
    val: 0.15
    test: 0.15
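A minimal sketch of wiring this file into the model class defined earlier, assuming the config is parsed with a YAML library such as js-yaml; the RunConfig typing is an assumption about the template's shape, not an existing schema.
import * as fs from 'fs';
import * as yaml from 'js-yaml';
import { TaskHead } from 'gov-nlp-core';

interface HeadConfig { name: string; type: string; num_labels: number; }
interface RunConfig { model: { base: string; heads: HeadConfig[] }; }

// Parse the YAML template and build one TaskHead per configured head.
const config = yaml.load(fs.readFileSync('config.yaml', 'utf8')) as RunConfig;
const heads = new Map(
  config.model.heads.map((h): [string, TaskHead] => [h.name, new TaskHead(h.type, h.num_labels)]),
);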
Quick Start Guide
- Prepare Data: Download solicitations from SAM.gov and parse eCFR Title 48 XML. Align records with task masks based on source (a sketch for building the canonical clause set follows this list).
- Initialize Model: Load a ModernBERT-base encoder and attach task-specific heads. Configure task masking logic.
- Train: Run training on a single GPU. Monitor hallucination rate on the validation set.
- Evaluate: Test on the "Real FAR" slice. Verify hallucination rate and F1 score.
- Deploy: Export the model for on-premises inference. Integrate with your procurement workflow.
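As a closing note on the data-preparation step, here is a minimal sketch of assembling the canonical clause set from already-extracted regulatory text; the regex and the plain-text input are simplifying assumptions rather than the full XML parsing logic.
// Build the canonical clause set used by the hallucination metric.
// Pattern: FAR Part 52 clause identifiers of the form 52.<digits>-<digits>.
function extract_canonical_clauses(texts: string[]): Set<string> {
  const pattern = /\b52\.\d{3}-\d+\b/g;
  const clauses = new Set<string>();
  for (const text of texts) {
    for (const match of text.match(pattern) ?? []) {
      clauses.add(match);
    }
  }
  return clauses;
}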
