
What 500 Curated Failure Pairs Actually Fix: A Breakdown Across 3 Seeds

By Codcompass Team · 4 min read

Current Situation Analysis

Small-scale DPO/RLHF experiments frequently suffer from signal dilution and evaluation blindness. Traditional approaches rely on either synthetic bug generation or massive preference datasets (50k+ samples), both of which introduce critical failure modes:

  • Synthetic failures lack distributional fidelity: Artificially injected bugs do not reflect the model's actual confidence gaps or reasoning boundaries, causing the preference signal to train on artifacts rather than genuine capability gaps.
  • Aggregate metrics mask failure mode shifts: Pass@1 or pass@k scores collapse diverse failure types into a single scalar. A +3% gain could stem from algorithmic reasoning improvements, formatting cleanup, or even distribution collapse (e.g., refusal spikes or new syntax errors).
  • Mismatched quality bars break contrastive learning: Pairing model outputs against external "ideal" answers introduces domain shift. The model learns to mimic a different distribution rather than correcting its own failure modes.
  • Single-seed evaluation creates false confidence: Coding benchmarks like HumanEval (164 problems) are small enough that integer-count ties and seed variance produce misleading deltas. Without multi-seed validation, improvements are statistically indistinguishable from noise.
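The single-seed problem is easy to quantify: on a 164-problem benchmark, flipping one problem moves pass@1 by 1/164 ≈ 0.61 pp, and binomial sampling noise alone spans several points. A minimal sketch of that arithmetic (plain binomial standard error, not any benchmark's official tooling):

```python
import math

def pass_at_1_resolution(n_problems: int) -> float:
    """Smallest possible pass@1 change: a single problem flipping."""
    return 100.0 / n_problems

def binomial_stderr_pp(p: float, n_problems: int) -> float:
    """Standard error of a pass@1 estimate, in percentage points."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n_problems)

n = 164                                   # HumanEval size
print(pass_at_1_resolution(n))            # ≈ 0.61 pp per problem
print(binomial_stderr_pp(0.8048, n))      # ≈ 3.1 pp at a base pass@1 of 80.48%
```

With a ±3.1 pp standard error on a single run, a +3% delta is indistinguishable from noise, which is why the breakdown below averages over 3 seeds.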

Key Findings

The experiment isolates the impact of 500 curated preference pairs where both chosen and rejected sides originate from the same internal validation pipeline. This ensures the contrastive signal trains on "honest failure vs. honest success" rather than external idealization.
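A minimal sketch of that curation step, assuming each prompt has several sampled completions already labeled by the internal validation pipeline (the `Sample` type and field names are illustrative, not from the experiment):

```python
from dataclasses import dataclass

@dataclass
class Sample:
    prompt: str
    completion: str
    passed: bool   # verdict from the internal validation pipeline

def build_pairs(samples_by_prompt):
    """Pair an honest success against an honest failure from the SAME model.

    Both sides come from the same sampling run, so the contrastive signal
    stays on-distribution instead of imitating an external 'ideal' answer.
    """
    pairs = []
    for prompt, samples in samples_by_prompt.items():
        chosen = [s for s in samples if s.passed]
        rejected = [s for s in samples if not s.passed]
        if chosen and rejected:   # need both sides from this model
            pairs.append({
                "prompt": prompt,
                "chosen": chosen[0].completion,
                "rejected": rejected[0].completion,
            })
    return pairs
```

Prompts where the model only succeeds or only fails are dropped, since they cannot yield an on-distribution contrast.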

| Approach | Pass@1 (Mean) | NAME_ERROR Count | ASSERTION_FAIL Count |
|---|---|---|---|
| Base (Qwen2.5-Coder-3B) | 80.48% | 6.00 | 23.00 |
| DPO (500 curated pairs) | 83.94% ± 0.35 | 3.67 ± 0.58 | 18.67 ± 0.58 |
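The relative drops quoted below follow directly from the table:

```python
def rel_drop(before: float, after: float) -> float:
    """Relative reduction, as a percentage of the 'before' count."""
    return 100.0 * (before - after) / before

name_error = rel_drop(6.00, 3.67)        # ≈ 38.8% -> "drops 39%"
assertion_fail = rel_drop(23.00, 18.67)  # ≈ 18.8% -> "drops 19%"
delta_pp = 83.94 - 80.48                 # ≈ +3.46 pp aggregate gain
```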

Key Findings & Sweet Spot:

  • NAME_ERROR drops 39%: The cleanest signal. The model learns to inline missing stdlib imports (math.ceil, re.match) and resolve undefined helpers. This directly maps to IDFU's rejected pool, which heavily features missing-import patterns from partial-knowledge hallucination.
  • ASSERTION_FAIL drops 19%: Half of this gain is behavioral, not algorithmic. The base model frequently appended self-tests (if this fails, raise) that crashed the test harness at import time. DPO learned to output clean, library-ready functions, allowing the harness to execute properly.
  • Real algorithmic gain: ~+2.4 pp out of the +3.46 pp aggregate. The remainder is output formatting stabilization.
  • Null results are informative: TYPE_ERROR remains fixed at 1 failure across all seeds. DPO did not degrade, improve, or shift this specific failure mode, confirming stable distribution alignment.
  • Sweet spot: 500 pairs at conservative hyperparameters land in the same performance ballpark as large-scale preference datasets (50k+ samples).
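The per-category counts above come from bucketing harness failures by exception type. A simplified in-process sketch of that bucketing (a real harness would sandbox execution in a subprocess; labels match the ones used in the table):

```python
def classify_failure(code: str, test: str) -> str:
    """Run a completion plus its test and bucket the outcome.

    Returns 'PASS', 'NAME_ERROR', 'TYPE_ERROR', 'ASSERTION_FAIL', or 'OTHER'.
    A self-test appended to `code` that raises at import time is attributed
    to the completion, matching the harness behavior described above.
    """
    env = {}
    try:
        exec(code, env)   # import-time crashes (e.g. appended self-tests) land here
        exec(test, env)   # runtime errors inside the function surface here
        return "PASS"
    except NameError:
        return "NAME_ERROR"      # e.g. math.ceil used without `import math`
    except TypeError:
        return "TYPE_ERROR"
    except AssertionError:
        return "ASSERTION_FAIL"  # wrong answer on a test case
    except Exception:
        return "OTHER"

# A missing import only surfaces when the test calls the function:
print(classify_failure("def f(x): return math.ceil(x)",
                       "assert f(1.2) == 2"))   # NAME_ERROR
```

Keeping exception-type buckets per seed, instead of a single pass@1 scalar, is what makes shifts like the NAME_ERROR drop visible at all.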
