Why your LLM gives everything an 8/10 (and the rubric fix that worked)

Calibrating LLM Judges: Dual-Criterion Anchors for Reliable Automated Scoring

Current Situation Analysis

Automated content curation, moderation, and quality gating increasingly rely on LLM-as-judge architectures. The premise is straightforward: feed the model a piece of content, ask it to score it against a rubric, and trigger downstream actions based on the output. In practice, these systems consistently fail at scale due to a structural phenomenon known as score bunching. Instead of spreading across the scoring range, outputs cluster tightly around a single mid-to-high integer, which makes threshold-based automation useless.

Engineering teams routinely misdiagnose this behavior. The default assumption is that the model lacks capability, prompting upgrades to larger parameter counts or iterative prompt engineering focused on tone adjustments like "be critical," "apply strict standards," or "avoid leniency." These interventions consistently fail because they target the wrong variable. Instruction-tuned models are optimized via RLHF to be helpful, agreeable, and constructive. Asking them to act as harsh evaluators directly conflicts with their alignment training. The model will generate coherent, well-reasoned justifications for mid-tier scores while refusing to cross into high-tier territory, regardless of prompt phrasing or model size.

The root cause is rubric design, not model capability. When threshold-crossing anchors rely on single-dimensional descriptors (e.g., "high quality," "original insight," "substantive contribution"), they create a catch-all bucket for anything that isn't clearly poor. Competent but unremarkable content, derivative summaries, and genuinely novel work all satisfy the same vague criteria, forcing the model to compress them into a single score. Without explicit structural requirements that force differentiation, the LLM defaults to its alignment bias: generous scoring for anything that passes basic coherence checks.

Live deployment data confirms this pattern. In a production voting agent evaluating technical community posts, a single-criterion rubric scored 17 out of 22 substantive submissions identically at 8/10. The upvote threshold was set at 9/10. The result was a 0% action rate. The model's reasoning traces were technically accurate, but the scoring distribution provided zero signal for automation. The problem wasn't that the model couldn't distinguish quality; it was that the rubric provided no mechanical way to force that distinction into the output space.

WOW Moment: Key Findings

The breakthrough occurs when rubric anchors are redesigned to require multiple independent criteria for threshold-crossing scores. By splitting a single descriptive anchor into two mandatory conditions, the model is forced to evaluate orthogonal dimensions of the content. This structural change immediately breaks score bunching and produces a usable distribution without changing the underlying model or adding prompt verbosity.

Approach	Upvote Rate	Score Distribution Entropy	False Positive Rate	Calibration Effort
Single-Criterion Anchors	0.0%	Low (heavy clustering at 8/10)	N/A (threshold never met)	High (constant prompt tweaking)
Dual-Criterion Anchors	22.7%	High (clean spread across 6–9)	<5%	Low (threshold adjustment only)
Multi-Agent Voting	18.2%	Medium	~12%	Very High (coordination overhead)

This finding matters because it shifts the engineering paradigm from prompt optimization to rubric architecture. When scores bunch, the solution isn't to ask the model to try harder; it's to change what the model is required to find. Dual-criterion anchors move the LLM from subjective reviewer toward something much closer to a feature extractor. The model no longer guesses at "quality"; it checks for the presence of specific, verifiable attributes. This enables reliable automation, reduces false positives, and makes threshold tuning a configuration change rather than a prompt rewrite.

Core Solution

Building a calibrated LLM judge requires three architectural decisions: decoupling judgment from policy execution, implementing dual-criterion anchors for threshold scores, and enforcing explicit failure modes. The following implementation demonstrates how to structure this in TypeScript.

Step 1: Decouple Judgment from Policy Execution

The most critical safety net is separating what the model outputs from what the system does. LLMs frequently mislabel their own recommendations due to prompt ambiguity or alignment bias. Never trust the model's action label. Instead, require a structured integer score and apply business logic in code.

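// Structured output requested from the model.
interface ScoringResponse {
  score: number;
  reasoning: string;
  matched_criteria: string[];
}

// Action emitted by the policy layer; the model never produces this directly.
interface VoteAction {
  type: 'UPVOTE' | 'DOWNVOTE' | 'NEUTRAL';
  weight: number;
}

class PolicyEnforcer {
  private readonly upvoteThreshold: number;
  private readonly downvoteThreshold: number;

  constructor(upvote: number, downvote: number) {
    this.upvoteThreshold = upvote;
    this.downvoteThreshold = downvote;
  }

  public executeAction(judgment: ScoringResponse): VoteAction {
    // Clamp to the rubric's 1-10 range. A NaN score stays NaN, fails both
    // comparisons below, and therefore falls through to NEUTRAL.
    const numericScore = Math.min(10, Math.max(1, judgment.score));

    if (numericScore >= this.upvoteThreshold) {
      return { type: 'UPVOTE', weight: 1 };
    }
    if (numericScore <= this.downvoteThreshold) {
      return { type: 'DOWNVOTE', weight: -1 };
    }
    return { type: 'NEUTRAL', weight: 0 };
  }
}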

This separation ensures that rubric adjustments, threshold changes, and policy updates never require prompt modifications. The model's only responsibility is to return a calibrated integer based on explicit criteria. The code handles all downstream consequences.
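Before a judgment reaches the policy layer, it helps to confirm that the model actually returned the requested shape. The sketch below is one possible way to do that; `parseJudgment` and `ParsedJudgment` are hypothetical names, not part of the original system, and rejecting out-of-range scores (rather than clamping them) is an assumption about what a team might prefer.

```typescript
// Same shape as ScoringResponse; renamed so this sketch stands alone.
interface ParsedJudgment {
  score: number;
  reasoning: string;
  matched_criteria: string[];
}

// Hypothetical helper (not from the original system). Returns null on any
// malformed output so the caller can skip the vote instead of guessing.
function parseJudgment(raw: string): ParsedJudgment | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof data !== 'object' || data === null) return null;

  const record = data as Record<string, unknown>;
  const score = record['score'];
  const reasoning = record['reasoning'];
  const criteria = record['matched_criteria'];

  // Require an integer in the rubric's 1-10 range.
  if (typeof score !== 'number' || Math.floor(score) !== score || score < 1 || score > 10) {
    return null;
  }
  if (typeof reasoning !== 'string') return null;
  if (!Array.isArray(criteria)) return null;

  // Every matched criterion must be a string.
  const matched: string[] = [];
  for (const item of criteria) {
    if (typeof item !== 'string') return null;
    matched.push(item);
  }
  return { score, reasoning, matched_criteria: matched };
}
```

A null result can be logged and treated as "no vote", which keeps a malformed response from ever reaching `executeAction`.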

Step 2: Implement Dual-Criterion Anchors

Single-criterion anchors fail because they lack mechanical differentiation. Dual-criterion anchors require two independent conditions to be met simultaneously for a score to cross the threshold. This forces the model to evaluate orthogonal dimensions, breaking the compression effect.

interface RubricAnchor {
  score: number;
  requiredCriteria: string[];
  description: string;
}

const THRESHOLD_ANCHORS: RubricAnchor[] = [
  {
    score: 9,
    requiredCriteria: ['named_concept', 'reproducible_artifact'],
    description: 'Introduces a clearly named mechanism or pattern AND provides a verifiable artifact (code, schema, benchmark, or working demo) that allows independent validation.'
  },
  {
    score: 8,
    requiredCriteria: ['named_concept', 'concrete_reference'],
    description: 'Introduces a clearly named mechanism or pattern AND includes at least one concrete reference (file path, commit hash, link, specific metric, or primary datum).'
  },
  {
    score: 7,
    requiredCriteria: ['substantive_content'],
    description: 'On-topic and well-reasoned, but missing either a named concept or a concrete reference. Includes competent summaries, derivative analysis, or internal status updates.'
  }
];

The prompt template must explicitly instruct the model to evaluate each criterion independently and cap the score if requirements are unmet.

const JUDGMENT_PROMPT = `
Evaluate the following submission against the scoring rubric.
Return a JSON object with: score (1-10), reasoning, and matched_criteria.

RULES:
- Score 9 requires BOTH a named concept AND a reproducible artifact.
- Score 8 requires BOTH a named concept AND a concrete reference.
- If only one condition is met, cap the score at 7.
- Do not infer missing criteria. If a requirement is absent, explicitly state it.
- Ignore tone, writing style, and subjective quality. Focus only on structural requirements.

SUBMISSION:
{{content}}
`;
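The prompt asks the model to cap its own score, but the same rule can also be re-checked in code as a safety net. The sketch below is an assumption layered on top of the rubric, not something the original system is stated to do: `enforceAnchors` and `scoreCeiling` are hypothetical helpers, and because the rubric does not define a score of 10, the sketch assumes it shares the score-9 requirements.

```typescript
interface CriteriaJudgment {
  score: number;
  matched_criteria: string[];
}

// Criterion names come from THRESHOLD_ANCHORS; the two constants are assumptions.
const TIER_9_CRITERIA = ['named_concept', 'reproducible_artifact'];
const TIER_8_CRITERIA = ['named_concept', 'concrete_reference'];

// True only when every required criterion appears in the matched list.
function hasAll(matched: string[], required: string[]): boolean {
  return required.every((criterion) => matched.indexOf(criterion) !== -1);
}

// Highest score the rubric permits for the criteria the model reports.
function scoreCeiling(matched: string[]): number {
  if (hasAll(matched, TIER_9_CRITERIA)) return 10; // assumption: 10 shares the score-9 requirements
  if (hasAll(matched, TIER_8_CRITERIA)) return 8;
  return 7; // "If only one condition is met, cap the score at 7."
}

// Lowers the score when it exceeds what the reported criteria justify.
function enforceAnchors(judgment: CriteriaJudgment): number {
  return Math.min(judgment.score, scoreCeiling(judgment.matched_criteria));
}
```

This only checks the model's score against its own `matched_criteria`; it cannot tell whether the criteria were detected correctly, so it complements the prompt rules rather than replacing them.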

Step 3: Define Explicit Failure Modes

Competent but shallow content will always default to mid-tier scores unless the rubric explicitly defines where it should land. Add disqualifiers that cap scores regardless of writing quality or topic relevance.

const SCORE_CAPS = [
  { condition: 'internal_status_report', maxScore: 7, reason: 'Lacks novel conceptual contribution' },
  { condition: 'philosophical_framing', maxScore: 7, reason: 'No falsifiable claim or concrete handle' },
  { condition: 'third_party_aggregation', maxScore: 6, reason: 'News roundup without original analysis' },
  { condition: 'generic_structure', maxScore: 7, reason: 'Bullet-point summary with closing question' },
  { condition: 'unsubstantiated_claim', maxScore: 5, reason: 'No link, hash, file, or reproducible step provided' }
];

These caps are injected into the system prompt as hard constraints. They prevent the model from rewarding well-written but structurally shallow content with threshold-crossing scores.
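One way to inject the caps is to render them into a text block that is appended to the system prompt. The sketch below is a minimal illustration under that assumption; `renderScoreCaps` and the exact constraint wording are hypothetical, and only two of the caps are shown.

```typescript
interface ScoreCap {
  condition: string;
  maxScore: number;
  reason: string;
}

// Turns each cap into one plain-language constraint line for the system prompt.
function renderScoreCaps(caps: ScoreCap[]): string {
  const lines = caps.map(
    (cap) => `- If the submission matches "${cap.condition}", the score must not exceed ${cap.maxScore} (${cap.reason}).`
  );
  return ['HARD CONSTRAINTS:'].concat(lines).join('\n');
}

// Two entries copied from SCORE_CAPS for illustration.
const exampleCaps: ScoreCap[] = [
  { condition: 'third_party_aggregation', maxScore: 6, reason: 'News roundup without original analysis' },
  { condition: 'unsubstantiated_claim', maxScore: 5, reason: 'No link, hash, file, or reproducible step provided' },
];

const capsBlock = renderScoreCaps(exampleCaps);
console.log(capsBlock);
```

Generating the block from data keeps the prompt and the cap list from drifting apart when a cap is added or its maximum changes.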

Architecture Rationale

The decision to use dual-criterion anchors over single-criterion or multi-agent approaches stems from three factors:

Determinism over Subjectivity: Single-criterion rubrics force the model to guess at qualitative boundaries. Dual-criterion rubrics convert qualitative judgment into binary feature detection. The model checks for presence/absence, not degree of quality.
Calibration Efficiency: When scores bunch, adjusting the threshold in code is instantaneous. Rewriting prompts to force differentiation requires iterative dry runs, version tracking, and unpredictable model behavior.
Cost and Latency: Multi-agent voting or chain-of-thought refinement adds compute overhead and increases variance. A single model with a structurally sound rubric produces consistent outputs at baseline latency.

The architecture treats the LLM as a structured data extractor rather than an autonomous decision-maker. This aligns with production best practices for AI systems: isolate model variance, enforce policy in code, and design rubrics that map directly to measurable attributes.

Pitfall Guide

1. Trusting the Model's Action Label

Explanation: LLMs frequently output action labels that contradict their own scores (e.g., scoring a post 8/10 while recommending "upvote" when the upvote threshold is 9). This stems from prompt ambiguity and alignment training that encourages agreeable language. Fix: Never parse vote_recommendation or action fields. Extract only the integer score and apply thresholds in code. Treat the model as a feature extractor, not a policy engine.

2. Single-Dimensional Threshold Anchors

Explanation: Anchors like "high quality" or "original insight" lack mechanical differentiation. They create a catch-all bucket that compresses competent content into a single score. Fix: Require two independent criteria for threshold-crossing scores. Force the model to evaluate orthogonal dimensions (e.g., named concept + concrete reference). If one is missing, cap the score.

3. Prompting for "Strictness" Instead of Criteria

Explanation: Instructions like "be harsh," "apply high standards," or "avoid leniency" conflict with RLHF alignment. The model will generate coherent reasoning around the same bunched score regardless of tone instructions. Fix: Replace tone modifiers with structural requirements. Define exactly what must be present to cross each threshold. Let the rubric enforce strictness, not the prompt's attitude.

4. Ignoring Dry-Run Distribution Analysis

Explanation: Deploying a rubric without analyzing score distribution leads to silent failures. Bunching at a single score indicates undifferentiated anchors, not model incompetence. Fix: Run dry evaluations on a representative dataset before production. Calculate distribution entropy. If >60% of scores cluster at one integer, split that anchor into dual criteria and add explicit failure modes.

5. Missing Explicit Disqualifiers

Explanation: Without defined failure modes, the model defaults to generous scoring for anything that passes basic coherence. Competent but shallow content wins by default. Fix: Add explicit score caps for known shallow patterns (status reports, news roundups, philosophical framing, unsubstantiated claims). Inject these as hard constraints in the system prompt.

6. Hardcoding Thresholds in Prompts

Explanation: Embedding threshold logic in the prompt (e.g., "upvote if score >= 9") couples business policy with model behavior. Changing thresholds requires prompt rewrites and re-validation. Fix: Keep thresholds in configuration. The model returns a score; the policy layer applies the threshold. This enables A/B testing, per-community tuning, and instant recalibration.

7. Neglecting Score Distribution Monitoring

Explanation: Rubric performance degrades over time as content patterns shift. Without monitoring, score bunching returns silently, breaking automation. Fix: Track score distribution entropy, threshold hit rates, and false positive rates in production. Set alerts when clustering exceeds 50% at a single score. Trigger rubric versioning when drift is detected.
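The distribution checks in pitfalls 4 and 7 can be sketched as a small function that reports Shannon entropy (in bits) and the share of the most common score. `analyzeDistribution` and `repeatScore` are hypothetical helpers, and the sample scores are illustrative: 17 of 22 at score 8 matches the deployment described earlier, but the other five values are made up for the example.

```typescript
interface BunchingReport {
  entropyBits: number; // 0 when every score is identical; higher when spread out
  modeScore: number;   // most frequent score
  modeShare: number;   // fraction of samples at the most frequent score
  bunched: boolean;    // true when modeShare exceeds the allowed share
}

function analyzeDistribution(scores: number[], maxModeShare: number = 0.6): BunchingReport {
  if (scores.length === 0) throw new Error('no scores to analyze');

  // Count occurrences of each score.
  const counts: Record<number, number> = {};
  for (const s of scores) counts[s] = (counts[s] ?? 0) + 1;

  let entropyBits = 0;
  let modeScore = 0;
  let modeCount = 0;
  for (const key of Object.keys(counts)) {
    const score = Number(key);
    const count = counts[score] ?? 0;
    const p = count / scores.length;
    entropyBits -= p * (Math.log(p) / Math.LN2); // log base 2
    if (count > modeCount) {
      modeCount = count;
      modeScore = score;
    }
  }
  const modeShare = modeCount / scores.length;
  return { entropyBits, modeScore, modeShare, bunched: modeShare > maxModeShare };
}

// Builds an array containing `value` repeated `times` times.
function repeatScore(value: number, times: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < times; i++) out.push(value);
  return out;
}

// 17 of 22 at score 8; the remaining five scores are hypothetical.
const dryRunScores = repeatScore(8, 17).concat([5, 6, 7, 7, 6]);
const dryRunReport = analyzeDistribution(dryRunScores);
console.log(dryRunReport.modeScore, dryRunReport.modeShare.toFixed(3), dryRunReport.bunched);
```

For production monitoring, the same function could run over a rolling window with `maxModeShare` set to 0.5 to match the alert level suggested in pitfall 7.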

Production Bundle

Action Checklist

Decouple judgment from policy: Extract only integer scores; apply thresholds in code.
Design dual-criterion anchors: Require two independent conditions for threshold-crossing scores.
Define explicit failure modes: Add score caps for status reports, aggregations, and unsubstantiated claims.
Run dry-run calibration: Evaluate 20-50 samples; calculate distribution entropy before deployment.
Version your rubrics: Store rubric definitions in configuration, not prompts. Enable rollback.
Monitor distribution drift: Track score clustering and threshold hit rates; alert on >50% bunching.
Isolate model variance: Use structured JSON output; validate schema before policy execution.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
High-volume content filtering	Dual-criterion single model	Deterministic feature extraction; low latency; consistent thresholds	Low (baseline compute)
Subjective quality assessment	Multi-agent voting with rubric alignment	Reduces individual model bias; captures nuanced disagreement	High (3x compute overhead)
Rapid prototyping / MVP	Single-criterion with code-enforced threshold	Faster iteration; acceptable for low-stakes automation	Low (but high calibration effort)
Compliance / regulatory gating	Dual-criterion + explicit disqualifiers	Hard constraints prevent false positives; auditable criteria	Medium (requires rubric versioning)
Community-specific curation	Per-community threshold configuration	Same rubric, different policy layers; no prompt rewrites	Low (configuration-only change)
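The last row of the matrix, per-community thresholds over a shared rubric, can be sketched as a lookup in the policy layer. The community names and threshold values below are hypothetical and exist only to show the shape of the configuration.

```typescript
type VoteType = 'UPVOTE' | 'DOWNVOTE' | 'NEUTRAL';

interface CommunityThresholds {
  upvote: number;
  downvote: number;
}

// Fallback used when a community has no override (values are illustrative).
const DEFAULT_THRESHOLDS: CommunityThresholds = { upvote: 9, downvote: 3 };

// Hypothetical per-community overrides; the rubric and prompt stay the same.
const COMMUNITY_OVERRIDES: Record<string, CommunityThresholds> = {
  research: { upvote: 9, downvote: 4 },
  showcase: { upvote: 8, downvote: 3 },
};

function decideVote(score: number, community: string): VoteType {
  // Own-property check so keys such as "toString" never resolve to inherited members.
  const hasOverride = Object.prototype.hasOwnProperty.call(COMMUNITY_OVERRIDES, community);
  const override = hasOverride ? COMMUNITY_OVERRIDES[community] : undefined;
  const thresholds = override ?? DEFAULT_THRESHOLDS;

  if (score >= thresholds.upvote) return 'UPVOTE';
  if (score <= thresholds.downvote) return 'DOWNVOTE';
  return 'NEUTRAL';
}
```

With these values, the same score of 8 yields an upvote in the hypothetical `showcase` community and stays neutral in `research`, without any change to the prompt.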

Configuration Template

// rubric.config.ts
export const QualityRubric = {
  model: 'qwen3.6:27b',
  temperature: 0.1,
  maxTokens: 512,
  outputSchema: {
    score: 'number',
    reasoning: 'string',
    matched_criteria: 'string[]'
  },
  thresholds: {
    upvote: 8,
    downvote: 3,
    neutral: { min: 4, max: 7 }
  },
  anchors: [
    {
      score: 9,
      required: ['named_concept', 'reproducible_artifact'],
      description: 'Novel mechanism/pattern + verifiable artifact (code, schema, benchmark)'
    },
    {
      score: 8,
      required: ['named_concept', 'concrete_reference'],
      description: 'Novel mechanism/pattern + concrete reference (hash, link, metric, datum)'
    },
    {
      score: 7,
      required: ['substantive_content'],
      description: 'On-topic, well-reasoned, but missing named concept or concrete reference'
    }
  ],
  scoreCaps: [
    { condition: 'internal_status_report', max: 7 },
    { condition: 'philosophical_framing', max: 7 },
    { condition: 'third_party_aggregation', max: 6 },
    { condition: 'generic_structure', max: 7 },
    { condition: 'unsubstantiated_claim', max: 5 }
  ]
};
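Because the thresholds block states the neutral range separately from the upvote and downvote cutoffs, the three values can drift out of sync when one is edited. A small consistency check, sketched below, can catch that at load time. `validateThresholds` is a hypothetical helper, and it assumes integer scores with the same comparison rules as the PolicyEnforcer example (upvote at or above the threshold, downvote at or below it).

```typescript
interface ThresholdConfig {
  upvote: number;
  downvote: number;
  neutral: { min: number; max: number };
}

// Returns a list of human-readable problems; an empty list means the block is consistent.
function validateThresholds(t: ThresholdConfig): string[] {
  const problems: string[] = [];
  if (t.downvote >= t.upvote) {
    problems.push('downvote threshold must be below upvote threshold');
  }
  // With integer scores, the neutral band should fill the gap exactly.
  if (t.neutral.min !== t.downvote + 1) {
    problems.push('neutral.min should equal downvote + 1');
  }
  if (t.neutral.max !== t.upvote - 1) {
    problems.push('neutral.max should equal upvote - 1');
  }
  return problems;
}

// Values copied from the template above.
const templateProblems = validateThresholds({
  upvote: 8,
  downvote: 3,
  neutral: { min: 4, max: 7 },
});
console.log(templateProblems.length === 0 ? 'thresholds consistent' : templateProblems.join('; '));
```

Running the check at startup, or in a test next to the config file, turns a silent gap or overlap between bands into an explicit error message.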

Quick Start Guide

Initialize the policy layer: Create a PolicyEnforcer class that accepts upvote/downvote thresholds and maps integer scores to actions. Never parse model action labels.
Define dual-criterion anchors: Write rubric definitions requiring two independent conditions for scores 8 and 9. Add explicit score caps for known shallow patterns.
Run dry calibration: Execute the rubric against 20-50 representative samples. Calculate score distribution entropy. If >60% cluster at one integer, split that anchor and add missing disqualifiers.
Deploy with monitoring: Enable score distribution tracking, threshold hit rate logging, and drift alerts. Store rubric versions in configuration for instant rollback or A/B testing.
Iterate on distribution, not prompts: When automation fails, analyze the score distribution. Adjust thresholds in code or split anchors structurally. Avoid tone-based prompt modifications.
