ncy.
- Prompt optimization loops: A/B testing becomes statistically reliable because the evaluation metric itself does not fluctuate.
Core Solution
Building a production-ready rubric engine requires separating rule definition, weight management, context injection, and score aggregation. The architecture must be deterministic, type-safe, and extensible without modifying core logic.
Architecture Decisions
- Rules as Pure Functions: Each criterion is a callable that accepts the response string and an optional context dictionary. This keeps rules stateless and testable in isolation.
- Weight Normalization at Initialization: Weights must sum to 1.0. Validation occurs once during construction, not during scoring, to avoid runtime overhead.
- Context-Aware Evaluation: Many rules depend on the original prompt, user intent, or system configuration. Passing context explicitly prevents hardcoding and enables dynamic rule behavior.
- Deterministic Aggregation: The final score is the sum of the weights of the rules that pass. No randomness, no sampling, no external API calls.
- Extensible Scoring Modes: While boolean pass/fail is standard, the architecture should support partial scoring (e.g., 0.5 for 2/4 required terms) without breaking the core contract.
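The partial-scoring mode in the last bullet could look roughly like the sketch below. It is independent of the engine's classes and assumes an evaluator may return a fraction in [0, 1] instead of a bool. The names fraction_of_terms, aggregate, and the required_terms context key are hypothetical, introduced only for illustration.

```python
from typing import Any, Callable, Dict, List, Tuple, Union

# Assumption: an evaluator may return a bool or a fraction in [0.0, 1.0].
RuleResult = Union[bool, float]
Evaluator = Callable[[str, Dict[str, Any]], RuleResult]


def fraction_of_terms(text: str, ctx: Dict[str, Any]) -> float:
    """Return the fraction of required terms that appear in the text."""
    terms = ctx.get("required_terms", [])
    if not terms:
        return 0.0  # nothing configured: fail closed instead of passing silently
    lowered = text.lower()
    hits = sum(1 for term in terms if term.lower() in lowered)
    return hits / len(terms)


def aggregate(rules: List[Tuple[float, Evaluator]], text: str, ctx: Dict[str, Any]) -> float:
    """Weighted sum where each rule contributes weight * credit, credit clamped to [0, 1]."""
    total = 0.0
    for weight, evaluator in rules:
        credit = float(evaluator(text, ctx))  # True -> 1.0, False -> 0.0
        total += weight * min(max(credit, 0.0), 1.0)
    return round(total, 4)


rules = [(0.6, fraction_of_terms), (0.4, lambda text, ctx: len(text) <= 100)]
ctx = {"required_terms": ["refund", "invoice", "order", "email"]}
score = aggregate(rules, "Your refund for this order is on its way.", ctx)
print(score)  # 0.6 * 0.5 (2 of 4 terms) + 0.4 * 1.0 = 0.7
```

Boolean rules keep working unchanged because True and False convert to 1.0 and 0.0, so the pass/fail contract is a special case of the fractional one.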
Implementation
from __future__ import annotations

import re
from collections import namedtuple
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

# Structured return value: overall score, per-rule results, and run metadata.
EvaluationOutcome = namedtuple("EvaluationOutcome", ["score", "rule_results", "metadata"])


@dataclass(frozen=True)
class ScoringRule:
    name: str
    weight: float
    evaluator: Callable[[str, Dict[str, Any]], bool]
    description: str = ""

    def __post_init__(self):
        if not 0.0 <= self.weight <= 1.0:
            raise ValueError(f"Weight for rule '{self.name}' must be between 0.0 and 1.0")


@dataclass
class OutputRubric:
    rules: List[ScoringRule]
    _weight_sum: float = field(init=False, repr=False)

    def __post_init__(self):
        # Validate once at construction so scoring never has to re-check weights.
        total = sum(r.weight for r in self.rules)
        if abs(total - 1.0) > 1e-3:
            raise ValueError(f"Rule weights must sum to 1.0. Current sum: {total:.4f}")
        self._weight_sum = total

    def evaluate(self, response: str, context: Optional[Dict[str, Any]] = None) -> EvaluationOutcome:
        ctx = context or {}
        # Values are booleans, plus "<rule>_error" strings for rules that raised.
        rule_results: Dict[str, Any] = {}
        weighted_score = 0.0
        for rule in self.rules:
            try:
                passed = rule.evaluator(response, ctx)
            except Exception as e:
                # A rule that raises counts as failed; the error text is kept for debugging.
                passed = False
                rule_results[f"{rule.name}_error"] = str(e)
            rule_results[rule.name] = passed
            if passed:
                weighted_score += rule.weight
        return EvaluationOutcome(
            score=round(weighted_score, 4),
            rule_results=rule_results,
            metadata={"context_keys": list(ctx.keys()), "rule_count": len(self.rules)},
        )
Usage Example: E-Commerce Product Recommendation Validation
def contains_price_format(text: str, ctx: Dict) -> bool:
    # Matches prices such as $348.00 or $1,299.
    return bool(re.search(r'\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?', text))


def matches_category(text: str, ctx: Dict) -> bool:
    target = ctx.get("target_category", "").lower()
    # An empty target would match any text, so a missing category fails the rule.
    return bool(target) and target in text.lower()


def avoids_hedging_language(text: str, ctx: Dict) -> bool:
    hedging_terms = ["maybe", "probably", "might be", "i think", "as an ai"]
    return not any(term in text.lower() for term in hedging_terms)


def respects_length_constraint(text: str, ctx: Dict) -> bool:
    max_len = ctx.get("max_length", 300)
    return len(text.strip()) <= max_len


recommendation_rubric = OutputRubric([
    ScoringRule("price_format", 0.30, contains_price_format, "Must include formatted price"),
    ScoringRule("category_match", 0.35, matches_category, "Must reference target category"),
    ScoringRule("tone_direct", 0.20, avoids_hedging_language, "No hedging or AI disclaimers"),
    ScoringRule("length_compliance", 0.15, respects_length_constraint, "Stays within configured limit"),
])

result = recommendation_rubric.evaluate(
    response="The Sony WH-1000XM5 headphones are priced at $348.00 and deliver industry-leading noise cancellation for travel.",
    context={"target_category": "headphones", "max_length": 250},
)
print(f"Quality Score: {result.score:.2f}")
# Output: Quality Score: 1.00
Why This Architecture Works in Production
- Zero external dependencies: The scoring engine runs entirely in-process. No network calls, no rate limits, no provider outages.
- Explicit failure modes: Rule evaluation errors are caught and recorded in rule_results without crashing the pipeline. This prevents a single malformed regex from aborting the whole evaluation run, although the affected rule still counts as failed.
- Context-driven flexibility: Rules adapt to dynamic inputs (user tier, product category, SLA requirements) without duplicating rubric definitions.
- Type safety and immutability: Frozen dataclasses prevent accidental mutation of rules at runtime. Named tuples enforce structured return values for downstream consumers.
- Observability ready: The metadata field and rule_results dictionary can be flattened into OpenTelemetry attributes, enabling per-rule pass rates, score distributions, and drift detection in monitoring dashboards.
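As a rough illustration of the observability point, the sketch below flattens an outcome into dotted keys with primitive values, which is the shape most telemetry attribute APIs expect. The to_flat_attributes helper and the "rubric." key prefix are assumptions made for this example, and no telemetry SDK is imported; the resulting dictionary would still have to be attached to a span or log record by whatever client is in use.

```python
from collections import namedtuple
from typing import Dict, Union

# Mirrors the EvaluationOutcome shape used in this article.
EvaluationOutcome = namedtuple("EvaluationOutcome", ["score", "rule_results", "metadata"])

AttrValue = Union[str, bool, int, float]


def to_flat_attributes(outcome: EvaluationOutcome, prefix: str = "rubric") -> Dict[str, AttrValue]:
    """Flatten an outcome into dotted keys that hold only primitive values."""
    attrs: Dict[str, AttrValue] = {f"{prefix}.score": float(outcome.score)}
    for name, value in outcome.rule_results.items():
        # Booleans stay booleans; anything else (such as error text) becomes a string.
        attrs[f"{prefix}.rule.{name}"] = value if isinstance(value, bool) else str(value)
    for key, value in outcome.metadata.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)  # containers become one string
        elif not isinstance(value, (str, bool, int, float)):
            value = str(value)
        attrs[f"{prefix}.meta.{key}"] = value
    return attrs


outcome = EvaluationOutcome(
    score=0.65,
    rule_results={"price_format": True, "category_match": False},
    metadata={"context_keys": ["target_category", "max_length"], "rule_count": 4},
)
attributes = to_flat_attributes(outcome)
for key, value in attributes.items():
    print(f"{key} = {value!r}")
```

Keeping one attribute per rule is what makes per-rule pass rates a simple group-by in a dashboard.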
Pitfall Guide
1. Over-Weighting Trivial Checks
Explanation: Assigning high weights to length or punctuation checks creates a false sense of quality. A response can score 0.95 by being the right length and containing a dollar sign while completely missing the user's intent.
Fix: Anchor weights to business impact. Use historical data to identify which structural properties correlate with user satisfaction or conversion. Start with equal weights, then adjust based on A/B test outcomes, not intuition.
2. Ignoring Context Dependency
Explanation: Hardcoding thresholds (e.g., len(text) > 100) fails when requirements change across user segments, locales, or product lines.
Fix: Always pass context to evaluators. Make thresholds configurable via environment variables or feature flags. Validate that context keys exist before evaluation to prevent silent failures.
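One possible way to validate context keys before a rule runs is a small decorator, sketched below. The requires_context decorator and MissingContextError class are hypothetical names, not part of the engine shown earlier.

```python
import functools
from typing import Any, Callable, Dict, Iterable

Evaluator = Callable[[str, Dict[str, Any]], bool]


class MissingContextError(KeyError):
    """Raised when a rule is evaluated without a context key it depends on."""


def requires_context(keys: Iterable[str]) -> Callable[[Evaluator], Evaluator]:
    """Decorator that checks required context keys before running the rule."""
    required = tuple(keys)

    def decorator(evaluator: Evaluator) -> Evaluator:
        @functools.wraps(evaluator)
        def wrapper(text: str, ctx: Dict[str, Any]) -> bool:
            missing = [key for key in required if key not in ctx]
            if missing:
                raise MissingContextError(
                    f"{evaluator.__name__} is missing context keys: {missing}"
                )
            return evaluator(text, ctx)

        return wrapper

    return decorator


@requires_context(["max_length"])
def respects_length_constraint(text: str, ctx: Dict[str, Any]) -> bool:
    # No default here: the threshold must come from the caller's context.
    return len(text.strip()) <= ctx["max_length"]


print(respects_length_constraint("short answer", {"max_length": 50}))  # True
try:
    respects_length_constraint("short answer", {})
except MissingContextError as exc:
    print(f"caught: {exc}")
```

Inside the rubric's evaluate method, the raised error would be caught by the existing try/except and recorded as a failed rule with an error entry, so a missing key becomes visible instead of silently falling back to a default.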
3. Confusing Structural Validation with Factual Accuracy
Explanation: Rubrics verify format, not truth. A response can perfectly match all rules while containing hallucinated data.
Fix: Treat rubric scores as a quality gate, not a correctness guarantee. Pair deterministic scoring with semantic validation only when necessary. Document this boundary explicitly in team runbooks to prevent misaligned expectations.
4. Neglecting Weight Normalization
Explanation: Weights that don't sum to 1.0 produce uninterpretable scores. A sum of 0.8 caps the maximum score at 0.8, breaking threshold logic downstream.
Fix: Enforce normalization at initialization. Add CI checks that validate rubric configurations before deployment. Log warnings when weights drift due to manual edits.
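A CI check for rubric configurations might look like the following sketch. It assumes each configuration has already been parsed into a dictionary with a "rules" list of name/weight entries; validate_rubric_config is a hypothetical helper name.

```python
import math
from typing import Any, Dict, List


def validate_rubric_config(config: Dict[str, Any], tolerance: float = 1e-3) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config is valid."""
    problems: List[str] = []
    rules = config.get("rules", [])
    if not rules:
        return ["rubric defines no rules"]
    names = [rule.get("name") for rule in rules]
    if len(set(names)) != len(names):
        problems.append("rule names must be unique")
    valid_weights = []
    for rule in rules:
        weight = rule.get("weight")
        if not isinstance(weight, (int, float)) or not 0.0 <= weight <= 1.0:
            problems.append(f"rule {rule.get('name')!r} has invalid weight {weight!r}")
        else:
            valid_weights.append(weight)
    total = sum(valid_weights)
    if not math.isclose(total, 1.0, abs_tol=tolerance):
        problems.append(f"weights sum to {total:.4f}, expected 1.0")
    return problems


good = {"rules": [{"name": "a", "weight": 0.7}, {"name": "b", "weight": 0.3}]}
bad = {"rules": [{"name": "a", "weight": 0.5}, {"name": "b", "weight": 0.3}]}
print(validate_rubric_config(good))  # []
print(validate_rubric_config(bad))   # ['weights sum to 0.8000, expected 1.0']
```

A test runner could call this for every rubric file in the repository and fail the build when the returned list is not empty.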
5. Using Rubrics for Open-Ended Generation
Explanation: Creative writing, brainstorming, and exploratory tasks lack fixed structural requirements. Applying rigid rules suppresses model creativity and yields artificially low scores.
Fix: Reserve rubrics for task-completion, classification, summarization, and structured extraction. Use human review or LLM-as-judge only for open-ended domains where semantic nuance matters more than format.
6. Failing to Version Rubrics Alongside Prompts
Explanation: Changing a prompt without updating the rubric creates evaluation drift. New prompt behaviors may pass old rules or fail new ones unpredictably.
Fix: Treat rubrics as infrastructure code. Store them in version control alongside prompt templates. Run rubric compatibility tests during prompt review cycles. Tag rubric versions in deployment manifests.
7. Hardcoding Thresholds Without Baseline Calibration
Explanation: Setting a gate at score >= 0.80 without historical data leads to false blocks or silent degradation.
Fix: Run rubrics in shadow mode for 2–4 weeks before enforcing gates. Collect score distributions, identify natural baselines, and set thresholds at the 10th–15th percentile of historical performance. Adjust dynamically as model versions change.
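The percentile-based calibration can be sketched with the standard library alone. The calibrate_threshold function is a hypothetical helper, and the score list below is made-up illustration data, not a measurement.

```python
import statistics
from typing import List


def calibrate_threshold(scores: List[float], percentile: int = 10) -> float:
    """Return the score at the given percentile of shadow-mode history."""
    if len(scores) < 2:
        raise ValueError("need at least two historical scores to calibrate")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    # quantiles(n=100) returns the 99 cut points for percentiles 1 through 99.
    # The "inclusive" method keeps every cut point within the observed min and max.
    cut_points = statistics.quantiles(scores, n=100, method="inclusive")
    return round(cut_points[percentile - 1], 4)


# Illustrative shadow-mode scores (invented for this example).
shadow_scores = [0.55, 0.65, 0.70, 0.80, 0.85, 0.85, 0.90, 1.00, 1.00, 1.00]
threshold_p10 = calibrate_threshold(shadow_scores, percentile=10)
threshold_p15 = calibrate_threshold(shadow_scores, percentile=15)
print(f"p10 gate: {threshold_p10}, p15 gate: {threshold_p15}")
```

Re-running the calibration after each model or prompt change keeps the gate tied to observed behavior instead of a fixed guess.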
Production Bundle
Action Checklist
Decision Matrix
| Scenario | Recommended Approach | Why | Cost Impact |
|---|---|---|---|
| CI/CD Regression Testing | Deterministic Rubric | Zero variance, sub-10ms latency, blocks bad merges | $0 marginal cost |
| Runtime Response Gating | Rubric + Threshold Routing | Fast enough for user-facing latency budgets | Negligible compute |
| A/B Prompt Comparison | Deterministic Rubric | Eliminates evaluation noise, enables statistical significance | $0 marginal cost |
| Semantic Fact-Checking | LLM-as-Judge or Human Review | Rubrics cannot verify truthfulness | High token cost |
| Creative/Exploratory Tasks | Human Review or LLM Judge | Structural rules suppress nuance and creativity | Variable |
| Multi-Model Routing | Rubric Score + Fallback Chain | Enables deterministic model selection based on quality | Low compute |
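The threshold-routing and fallback-chain rows can be illustrated with a short sketch. Everything here is hypothetical: route_with_fallback, the stubbed generators standing in for model calls, and toy_scorer standing in for a real rubric score.

```python
from typing import Callable, List, Tuple

Generator = Callable[[str], str]  # prompt -> response (stand-in for a model call)
Scorer = Callable[[str], float]   # response -> rubric score in [0, 1]


def route_with_fallback(
    prompt: str,
    generators: List[Tuple[str, Generator]],
    scorer: Scorer,
    threshold: float,
) -> Tuple[str, str, float]:
    """Try each generator in order; return the first response that clears the threshold."""
    best_response, best_score = "", -1.0
    for name, generate in generators:
        response = generate(prompt)
        score = scorer(response)
        if score >= threshold:
            return name, response, score
        if score > best_score:
            best_response, best_score = response, score
    # Nothing cleared the gate: hand the best attempt to human review.
    return "human_review", best_response, best_score


def toy_scorer(response: str) -> float:
    # Stand-in for a rubric: passes only when a dollar price is present.
    return 1.0 if "$" in response else 0.0


chain = [
    ("primary_model", lambda prompt: "Maybe try some headphones."),
    ("fallback_model", lambda prompt: "These headphones cost $99.00."),
]
route, response, score = route_with_fallback("Recommend headphones", chain, toy_scorer, threshold=0.75)
print(route, score)  # fallback_model 1.0
```

Because the scorer is deterministic, the same responses always take the same route, which keeps routing decisions reproducible in tests.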
Configuration Template
# rubric_config.yaml
version: "1.0"
name: "customer_support_routing"
threshold: 0.75
rules:
  - name: "contains_ticket_id"
    weight: 0.30
    pattern: "TICKET-\\d{6}"
    description: "Must reference valid ticket format"
  - name: "matches_department"
    weight: 0.35
    allowed_values: ["billing", "technical", "account", "shipping"]
    description: "Must classify into valid department"
  - name: "avoids_ai_disclaimers"
    weight: 0.20
    forbidden_phrases: ["as an ai", "i'm an ai", "language model"]
    description: "No model identity references"
  - name: "length_within_bounds"
    weight: 0.15
    min_chars: 20
    max_chars: 250
    description: "Concise but complete classification"
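A template like this still needs code that turns each entry into a check. The sketch below shows one way to do that, under stated assumptions: the dictionary mirrors what yaml.safe_load from PyYAML would return for the template (written inline so the example has no third-party dependency), and the mapping from keys such as pattern or allowed_values to checks is an interpretation of the template, not a documented schema. build_evaluator and score_response are hypothetical names.

```python
import re
from typing import Any, Callable, Dict, Tuple

Evaluator = Callable[[str], bool]


def build_evaluator(rule: Dict[str, Any]) -> Evaluator:
    """Map one config entry to a check, based on which keys it declares."""
    if "pattern" in rule:
        regex = re.compile(rule["pattern"])
        return lambda text: bool(regex.search(text))
    if "allowed_values" in rule:
        allowed = [value.lower() for value in rule["allowed_values"]]
        return lambda text: any(value in text.lower() for value in allowed)
    if "forbidden_phrases" in rule:
        forbidden = [phrase.lower() for phrase in rule["forbidden_phrases"]]
        return lambda text: not any(phrase in text.lower() for phrase in forbidden)
    if "min_chars" in rule or "max_chars" in rule:
        low = rule.get("min_chars", 0)
        high = rule.get("max_chars", float("inf"))
        return lambda text: low <= len(text.strip()) <= high
    raise ValueError(f"rule {rule.get('name')!r} declares no recognized check")


def score_response(config: Dict[str, Any], text: str) -> Tuple[float, bool]:
    """Return (score, passed_gate) for a response under the given config."""
    score = sum(rule["weight"] for rule in config["rules"] if build_evaluator(rule)(text))
    score = round(score, 4)
    return score, score >= config["threshold"]


# Inline equivalent of the parsed template (descriptions omitted for brevity).
config = {
    "threshold": 0.75,
    "rules": [
        {"name": "contains_ticket_id", "weight": 0.30, "pattern": r"TICKET-\d{6}"},
        {"name": "matches_department", "weight": 0.35,
         "allowed_values": ["billing", "technical", "account", "shipping"]},
        {"name": "avoids_ai_disclaimers", "weight": 0.20,
         "forbidden_phrases": ["as an ai", "i'm an ai", "language model"]},
        {"name": "length_within_bounds", "weight": 0.15, "min_chars": 20, "max_chars": 250},
    ],
}

print(score_response(config, "TICKET-123456: routing to billing for a duplicate charge."))
print(score_response(config, "As an AI, I cannot classify this."))
```

Keeping the key-to-check mapping in one function means new rule types can be added without touching the scoring loop.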
Quick Start Guide
- Install dependencies: Run pip install pyyaml (for config loading) and set up your evaluation module.
- Define rules: Create pure functions for each quality dimension. Pass context explicitly. Avoid side effects.
- Initialize rubric: Load rules and weights. Validate normalization. Store configuration in version control.
- Run in shadow mode: Execute rubrics on production traffic without blocking. Collect score distributions for 2 to 4 weeks.
- Enforce gates: Set threshold at historical 10th percentile. Route low scores to fallback or human review. Monitor drift monthly.