What 500 Curated Failure Pairs Actually Fix: A Breakdown Across 3 Seeds
Current Situation Analysis
Small-scale DPO/RLHF experiments frequently suffer from signal dilution and evaluation blindness. Traditional approaches rely on either synthetic bug generation or massive preference datasets (50k+ samples), both of which introduce critical failure modes:
- Synthetic failures lack distributional fidelity: Artificially injected bugs do not reflect the model's actual confidence gaps or reasoning boundaries, causing the preference signal to train on artifacts rather than genuine capability gaps.
- Aggregate metrics mask failure mode shifts: Pass@1 or pass@k scores collapse diverse failure types into a single scalar. A +3% gain could stem from algorithmic reasoning improvements, formatting cleanup, or even distribution collapse (e.g., refusal spikes or new syntax errors).
- Mismatched quality bars break contrastive learning: Pairing model outputs against external "ideal" answers introduces domain shift. The model learns to mimic a different distribution rather than correcting its own failure modes.
- Single-seed evaluation creates false confidence: Coding benchmarks like HumanEval (164 problems) are small enough that integer-count ties and seed variance produce misleading deltas. Without multi-seed validation, improvements are statistically indistinguishable from noise.
WOW Moment: Key Findings
The experiment isolates the impact of 500 curated preference pairs where both chosen and rejected sides originate from the same internal validation pipeline. This ensures the contrastive signal trains on "honest failure vs. honest success" rather than external idealization.
| Approach | Pass@1 (Mean) | NAME_ERROR Count | ASSERTION_FAIL Count |
|---|---|---|---|
| Base (Qwen2.5-Coder-3B) | 80.48% | 6.00 | 23.00 |
| DPO (500 curated pairs) | 83.94% ± 0.35 | 3.67 ± 0.58 | 18.67 ± 0.58 |
Key Findings & Sweet Spot:
- NAME_ERROR drops 39%: The cleanest signal. The model learns to inline missing stdlib imports (`math.ceil`, `re.match`) and resolve undefined helpers. This directly maps to IDFU's rejected pool, which heavily features missing-import patterns from partial-knowledge hallucination (an illustrative sketch follows these findings).
- ASSERTION_FAIL drops 19%: Half of this gain is behavioral, not algorithmic. The base model frequently appended self-tests (`if this fails, raise`) that crashed the test harness at import time. DPO learned to output clean, library-ready functions, allowing the harness to execute properly.
- Real algorithmic gain: ~+2.4 pp out of the +3.46 pp aggregate. The remainder is output formatting stabilization.
- Null results are informative: TYPE_ERROR remains fixed at 1 failure across all seeds. DPO did not degrade, improve, or shift this specific failure mode, confirming stable distribution alignment.
- Sweet spot: 500 pairs at conservative hyperparameters hit the same performance ballpark as large-scale runs (~+3% pass@k) while maintaining computational feasibility for single-GPU workflows. Diminishing returns likely exist beyond this curation threshold, but the small-data regime proves highly efficient.
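To make the missing-import pattern behind the NAME_ERROR drop concrete, here is a purely hypothetical before/after pair (the function and prompt are invented for illustration, not taken from IDFU or HumanEval):

```python
# Rejected-style completion: uses math.ceil but never imports math, so the
# test harness raises NameError the first time the function is called.
def chunk_count(n_items, chunk_size):
    return math.ceil(n_items / chunk_size)


# Chosen-style completion: the stdlib import is inlined, making the function
# "library-ready" so the harness can import and execute it cleanly.
import math

def chunk_count(n_items, chunk_size):
    return math.ceil(n_items / chunk_size)
```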
Core Solution
Architecture & Data Pipeline
- Base Model: Qwen2.5-Coder-3B-Instruct (92-language training corpus)
- Training Method: DPO via TRL with LoRA
- Data Composition: 500 preference pairs from the IDFU dataset. Both chosen and rejected sides are generated and verified by the same internal pipeline against identical validation gates. This parity ensures the contrastive signal targets the model's own failure distribution rather than external idealization (a hypothetical record sketch follows this list).
- Evaluation Protocol: HumanEval pass@1 across three random seeds (42, 123, 7). Full exception-type breakdown with manual delta inspection to separate algorithmic gains from behavioral artifacts.
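As a rough illustration of pipeline parity, a single preference record might look like the sketch below. The `prompt`/`chosen`/`rejected` keys follow TRL's DPO convention, while the record content and the bookkeeping fields are assumptions rather than the published IDFU schema:

```python
# One hypothetical IDFU-style preference record. Both completions come from the
# same model and the same generation pipeline; the only difference is whether
# they passed the shared validation gate (unit-test execution).
pair = {
    "prompt": "Write a function that returns the n-th Fibonacci number.",
    "chosen": "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a\n",
    "rejected": "def fib(n):\n    return fib(n - 1) + fib(n - 2)\n",  # no base case -> fails validation
    # Assumed bookkeeping fields: both sides were checked against the identical
    # test suite, so the contrastive signal reflects the model's own failure
    # distribution rather than an external reference answer.
    "chosen_passed_tests": True,
    "rejected_passed_tests": False,
}
```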
Hyperparameter Configuration
LoRA: r=16, alpha=32, dropout=0.05
Target modules: q/k/v/o/gate/up/down_proj (Qwen standard)
DPO: beta=0.1
Optimizer: learning_rate=5e-5
Batch: size=1, grad_accum=4 (effective batch 4)
Epochs: 3 (= 375 optimizer steps for 500 pairs)
Quantization: 4-bit NF4 + bf16 compute (bitsandbytes)
gradient_checkpointing=True
max_length=2048, max_prompt_length=512
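The configuration above could be assembled roughly as follows with TRL, PEFT, and bitsandbytes. This is a minimal sketch, not the authors' training script: argument names shift between TRL versions (e.g. `processing_class` vs. `tokenizer`), and the dataset split/columns are assumed to be TRL-style `prompt`/`chosen`/`rejected`.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import DPOConfig, DPOTrainer

MODEL = "Qwen/Qwen2.5-Coder-3B-Instruct"

# 4-bit NF4 weights with bf16 compute, matching the quantization line above.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(MODEL, quantization_config=bnb, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL)

# LoRA on the standard Qwen projection modules.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Conservative DPO settings from the configuration block above.
args = DPOConfig(
    output_dir="dpo-idfu-500",
    beta=0.1,
    learning_rate=5e-5,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch size 4
    num_train_epochs=3,              # 375 optimizer steps for 500 pairs
    max_length=2048,
    max_prompt_length=512,
    gradient_checkpointing=True,
    bf16=True,
    logging_steps=25,
)

# Split name is an assumption; any prompt/chosen/rejected dataset works here.
train_ds = load_dataset("namakoo/idfu-verified-code", split="train")

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    processing_class=tokenizer,  # older TRL versions take `tokenizer=` instead
    peft_config=lora,
)
trainer.train()
```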
Implementation Decisions
- Conservative hyperparameters with no sweep or exotic tricks to isolate the effect of data curation.
- 4-bit NF4 quantization with gradient checkpointing enables full DPO training on a single RTX 4060 (~3-4 hours per seed).
- Multi-seed evaluation prevents overfitting to HumanEval's small problem space. Integer-count ties are expected; consistent directional deltas across seeds confirm signal validity.
- Manual tracing of newly-passing problems reveals whether improvements stem from logic correction or harness-compatible output formatting.
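A minimal sketch of the multi-seed bookkeeping and manual-tracing prep described above, assuming the eval harness writes one `{task_id: passed}` JSON file per seed (that file layout is an assumption, not a fixed format):

```python
import json
from statistics import mean, stdev

SEEDS = [42, 123, 7]

def load_results(path):
    """Load a {task_id: passed_bool} mapping produced by the eval harness (assumed format)."""
    with open(path) as f:
        return json.load(f)

def pass_at_1(results):
    return 100.0 * sum(results.values()) / len(results)

base = {s: load_results(f"results/base_seed{s}.json") for s in SEEDS}
dpo = {s: load_results(f"results/dpo_seed{s}.json") for s in SEEDS}

# Report mean ± std over seeds instead of a single-seed point estimate.
for name, runs in [("base", base), ("dpo", dpo)]:
    scores = [pass_at_1(r) for r in runs.values()]
    print(f"{name}: {mean(scores):.2f} ± {stdev(scores):.2f} pass@1 over {len(SEEDS)} seeds")

# Tasks that fail under base but pass under DPO in every seed: these are the
# candidates for manual tracing (logic correction vs. formatting stabilization).
newly_passing = [
    task for task in base[SEEDS[0]]
    if all(not base[s][task] and dpo[s][task] for s in SEEDS)
]
print("newly passing across all seeds:", sorted(newly_passing))
```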
Pitfall Guide
- Over-reliance on Aggregate Scores: Pass@1 collapses failure mode distributions. Always break down results by exception type (NAME_ERROR, ASSERTION_FAIL, etc.) to distinguish algorithmic reasoning gains from formatting or behavioral shifts (a minimal breakdown sketch follows this list).
- Synthetic Bug Injection: Artificially generated failures do not reflect the model's actual confidence boundaries. Use "honest failures" captured during the same generation pipeline to ensure the preference signal targets real capability gaps.
- Ignoring Behavioral Artifacts: DPO may fix harness crashes (e.g., embedded self-tests) rather than core logic. Manually inspect deltas to quantify the split between algorithmic improvement and output stabilization.
- Single-Seed Evaluation on Small Benchmarks: HumanEval's 164 problems cause high variance and integer-count ties. Run at least three seeds and report confidence intervals to separate signal from statistical noise.
- Mismatched Quality Bars in Preference Pairs: Pairing model outputs against external "ideal" answers introduces distribution shift. Both chosen and rejected sides must originate from the same validation pipeline to maintain contrastive signal integrity.
- Distribution Collapse Blindness: Curated DPO can cause refusal spikes or new failure categories. Monitor unchanged categories (e.g., TYPE_ERROR) and track total failure counts to ensure the model isn't degrading in unmeasured dimensions.
- Assuming Linear Scaling from Curation: Small-data efficiency does not guarantee large-data performance. Curation yields diminishing returns; transition to volume-based training once the curation threshold is reached.
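The breakdown sketch referenced in the first pitfall: count failures per exception type and diff base vs. DPO so that new failure categories and total-count regressions surface explicitly. The per-failure record with an `exception` field is an assumed logging format, and the example records are hypothetical:

```python
from collections import Counter

def failure_breakdown(failures):
    """Count failures by exception label, e.g. NAME_ERROR or ASSERTION_FAIL (assumed labels)."""
    return Counter(f["exception"] for f in failures)

def compare(base_failures, dpo_failures):
    base_counts = failure_breakdown(base_failures)
    dpo_counts = failure_breakdown(dpo_failures)
    # Take the union of categories so *new* failure modes introduced by DPO
    # show up, not just improvements in the categories already tracked.
    for exc in sorted(set(base_counts) | set(dpo_counts)):
        delta = dpo_counts[exc] - base_counts[exc]
        print(f"{exc:>16}: base={base_counts[exc]:3d}  dpo={dpo_counts[exc]:3d}  delta={delta:+d}")
    print(f"{'TOTAL':>16}: base={sum(base_counts.values()):3d}  dpo={sum(dpo_counts.values()):3d}")

# Hypothetical failure records for illustration only.
base = [{"task": "HumanEval/12", "exception": "NAME_ERROR"}] * 6 + \
       [{"task": "HumanEval/33", "exception": "ASSERTION_FAIL"}] * 23
dpo = [{"task": "HumanEval/12", "exception": "NAME_ERROR"}] * 4 + \
      [{"task": "HumanEval/33", "exception": "ASSERTION_FAIL"}] * 19
compare(base, dpo)
```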
Deliverables
- Blueprint: IDFU DPO Curation & Evaluation Blueprint — Covers the chosen/rejected pipeline parity philosophy, multi-seed evaluation protocol, exception-type breakdown methodology, and manual delta inspection workflow for distinguishing algorithmic vs. behavioral gains.
- Checklist: Pre-Training Validation & Post-Training Audit Checklist — Verifies pipeline parity for preference pairs, confirms absence of synthetic injections, validates multi-seed eval harness setup, ensures gradient checkpointing/quantization compatibility, and mandates manual tracing of newly-passing problems.
- Configuration Templates: Production-ready TRL DPO trainer config, LoRA targeting specification for Qwen architectures, bitsandbytes NF4 quantization setup, and Python exception categorization script for automated HumanEval failure breakdown. Full dataset sample (100 rows) and complete 500-pair set available at huggingface.co/datasets/namakoo/idfu-verified-code.
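Loading the published sample with the `datasets` library might look like this minimal sketch; only the repository path comes from the article, so the split and column names should be verified against the actual dataset card:

```python
from datasets import load_dataset

# Repository path is from the article; the split name is an assumption.
ds = load_dataset("namakoo/idfu-verified-code", split="train")
print(ds)            # inspect the actual schema before wiring it into training
print(ds[0].keys())  # expected to expose prompt/chosen/rejected-style fields
```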