How to Brier-grade your own ML option-pricing forecasts in 40 lines of Python

By Codcompass Team·2026-05-27·8 min read

The Calibration Loop: Implementing Brier-Grade Scoring for Probabilistic Option Models

Current Situation Analysis

In quantitative finance, a persistent gap exists between model development and model maintenance. Teams invest significant effort training machine learning models to output probability distributions or point estimates for option pricing, yet the feedback loop required to validate these outputs is frequently absent. This omission stems from a misunderstanding of what constitutes a falsifiable signal.

Most practitioners evaluate models by comparing predicted prices against current market marks. This approach is fundamentally flawed for two reasons:

Market Noise: The market price reflects liquidity premiums, bid-ask spreads, and transient sentiment, making it a noisy ground truth.
Unfalsifiability: A price prediction cannot be verified until the contract expires. Until then, a divergence between model and market could be model error or market mispricing, and the two cannot be told apart.

The established approach to probabilistic validation comes from fields such as meteorology and sabermetrics, where forecasts are graded using proper scoring rules, which reward reporting honest probabilities. The Brier score is a widely used proper scoring rule for binary outcomes: the mean squared difference between the forecast probability and the 0/1 outcome. For option models, the probability of finishing in-the-money (prob_itm) offers a clean, binary resolution event. At expiration, a call is in-the-money if the underlying closes above the strike, and a put if it closes below. This binary nature eliminates ambiguity, allowing for precise calibration measurement.
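As a minimal illustration of the scoring rule described above, the sketch below computes a Brier score over a handful of forecasts. The function name `brier_score` and the sample numbers are illustrative assumptions, not output from any real model.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    if not probs:
        raise ValueError("at least one forecast is required")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Hypothetical prob_itm forecasts and whether each contract finished ITM.
forecasts = [0.9, 0.3, 0.5, 0.1]
finished_itm = [1, 0, 1, 0]
score = brier_score(forecasts, finished_itm)  # (0.01 + 0.09 + 0.25 + 0.01) / 4
```

A score of 0 is perfect and 1 is the worst possible. A forecaster who always says 50% scores exactly 0.25, which makes a useful reference point when reading your own numbers.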

Without a logging infrastructure, prob_itm forecasts remain ephemeral. Implementing a robust logging and grading pipeline transforms these forecasts from static outputs into actionable performance data, enabling teams to detect calibration drift, identify regime-specific failures, and quantify model honesty.

WOW Moment: Key Findings

The following comparison illustrates why shifting from price-delta tracking to Brier-grade scoring on prob_itm provides superior model governance.

Evaluation Method	Falsifiability	Resolution Clarity	Calibration Insight	Actionability
Price Delta Tracking	Low	Continuous/Ambiguous	None	Low
Brier Score on Prob-ITM	High	Binary/Definitive	Granular (Bin-level)	High

Why this matters: Price delta tracking tells you if your model is "close" to the market, but it cannot tell you if your probabilities are reliable. A model can have a low mean absolute error on prices while being severely overconfident in its tail risk estimates. Brier scoring on prob_itm isolates the probability calibration. It reveals whether a forecast of "30% chance ITM" actually results in ITM outcomes 30% of the time. This granularity allows you to recalibrate specific strike regimes or expiration buckets rather than applying blunt adjustments to the entire model.

Core Solution

The implementation requires three distinct components: a client to retrieve forecasts, a storage layer to persist predictions, and an evaluator to compute scores upon resolution. The architecture prioritizes type safety, separation of concerns, and auditability.

Architecture Decisions

Data Classes for Schema Enforcement: Using dataclasses ensures that forecast records maintain a consistent structure, preventing schema drift as the system evolves.
Decoupled Storage: The logger writes to a structured format (CSV for prototyping, Parquet/DB for production) with immutable append semantics. This preserves the state of the forecast at the moment of generation.
Resolution Logic Isolation: The logic determining realized outcomes is separated from the logging logic. This allows the grading engine to run asynchronously after expiration without blocking the inference pipeline.
Focus on prob_itm: While the model may output fair value, prob_itm is the primary metric for calibration. Fair value accuracy is secondary to probability honesty in risk management contexts.
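One way the schema-enforcement idea could look in practice is sketched below: a frozen dataclass that rejects malformed records at creation time, plus a header check to run before grading. The names `ValidatedForecast` and `check_header` are hypothetical, and the validation rules are assumptions. The column list matches the CSV header used by the logger in this article.

```python
import csv
import io
from dataclasses import dataclass

EXPECTED_COLUMNS = [
    "logged_at", "symbol", "strike", "expiration", "option_type",
    "model_price", "prob_itm", "data_date",
    "realized_spot", "realized_itm", "brier_loss",
]


@dataclass(frozen=True)  # frozen: a record cannot be mutated after creation
class ValidatedForecast:
    symbol: str
    strike: float
    option_type: str
    prob_itm: float

    def __post_init__(self) -> None:
        # Assumed rules: only calls and puts, probabilities in [0, 1].
        if self.option_type not in ("call", "put"):
            raise ValueError(f"unknown option_type: {self.option_type!r}")
        if not 0.0 <= self.prob_itm <= 1.0:
            raise ValueError(f"prob_itm out of range: {self.prob_itm}")
        if self.strike <= 0:
            raise ValueError(f"strike must be positive: {self.strike}")


def check_header(csv_text: str) -> None:
    """Raise if the first CSV row does not match the expected column list."""
    header = next(csv.reader(io.StringIO(csv_text)), None)
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected log schema: {header}")
```

Failing loudly on a bad record or an unexpected header is a deliberate choice here: a silently misparsed column would corrupt every calibration number computed downstream.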

Implementation

The following logic is implemented in Python for the ecosystem compatibility required by the Helium MCP endpoint. The code uses typed dataclasses and a class-based structure to encapsulate state and behavior.

1. Forecast Client and Logger

import requests
import csv
import logging
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

logger = logging.getLogger(__name__)

@dataclass
class OptionContract:
    symbol: str
    strike: float
    expiration: str
    option_type: str  # 'call' or 'put'

@dataclass
class ForecastSnapshot:
    contract: OptionContract
    model_price: float
    prob_itm: float
    data_date: str
    logged_at: str
    realized_spot: Optional[float] = None
    realized_itm: Optional[int] = None
    brier_loss: Optional[float] = None

class HeliumForecastClient:
    ENDPOINT = "https://heliumtrades.com/mcp_option_price/"
    
    def fetch(self, contract: OptionContract) -> ForecastSnapshot:
        params = {
            "symbol": contract.symbol,
            "strike": contract.strike,
            "expiration": contract.expiration,
            "option_type": contract.option_type
        }
        try:
            response = requests.get(self.ENDPOINT, params=params, timeout=10)
            response.raise_for_status()
            payload = response.json()
            
            return ForecastSnapshot(
                contract=contract,
                model_price=payload["predicted_price"],
                prob_itm=payload["prob_itm"],
                data_date=payload["options_data_date"],
                logged_at=datetime.now(timezone.utc).isoformat()
            )
        except requests.RequestException as e:
            logger.error(f"Failed to fetch forecast for {contract.symbol}: {e}")
            raise

class CalibrationLogger:
    def __init__(self, log_path: Path):
        self.log_path = log_path
        self._initialize_file()

    def _initialize_file(self):
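        if not self.log_path.exists():
            # Create the parent directory if it is missing so the first write cannot fail.
            self.log_path.parent.mkdir(parents=True, exist_ok=True)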
            with self.log_path.open("w", newline="") as f:
                writer = csv.writer(f)
                writer.writerow([
                    "logged_at", "symbol", "strike", "expiration", "option_type",
                    "model_price", "prob_itm", "data_date",
                    "realized_spot", "realized_itm", "brier_loss"
                ])

    def record(self, snapshot: ForecastSnapshot):
        with self.log_path.open("a", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([
                snapshot.logged_at,
                snapshot.contract.symbol,
                snapshot.contract.strike,
                snapshot.contract.expiration,
                snapshot.contract.option_type,
                snapshot.model_price,
                snapshot.prob_itm,
                snapshot.data_date,
                "", "", ""
            ])
        logger.info(f"Logged forecast: {snapshot.contract.symbol} {snapshot.contract.strike} {snapshot.contract.option_type}")

2. Brier Scorer and Calibration Analyzer

from pathlib import Path

import numpy as np
import pandas as pd

class CalibrationAnalyzer:
    def __init__(self, log_path: Path):
        self.df = pd.read_csv(log_path)

    def resolve_contracts(self):
        """Compute realized ITM and Brier loss for resolved contracts."""
        resolved = self.df[self.df["realized_spot"].notna()].copy()
        
        # Determine ITM status based on option type.
        # Strict inequalities: a contract that settles exactly at the strike is not ITM.
        is_call = resolved["option_type"] == "call"
        is_itm_call = resolved["realized_spot"] > resolved["strike"]
        is_itm_put = resolved["realized_spot"] < resolved["strike"]
        
        resolved["realized_itm"] = np.where(
            is_call, 
            is_itm_call.astype(int), 
            is_itm_put.astype(int)
        )
        
        # Brier Loss: (Forecast Probability - Realized Outcome)^2
        resolved["brier_loss"] = (
            resolved["prob_itm"] - resolved["realized_itm"]
        ) ** 2
        
        return resolved

    def compute_calibration_histogram(self, resolved_df: pd.DataFrame, bins=5):
        """
        Groups forecasts by probability bins and compares mean forecast 
        against mean realized frequency.
        """
        resolved_df = resolved_df.copy()
        resolved_df["prob_bin"] = pd.cut(
            resolved_df["prob_itm"], 
            bins=np.linspace(0, 1, bins + 1), 
            include_lowest=True
        )
        
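        # observed=False keeps empty bins in the output and makes the
        # categorical grouping behavior explicit across pandas versions.
        calibration = resolved_df.groupby("prob_bin", observed=False).agg(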
            mean_forecast=("prob_itm", "mean"),
            realized_frequency=("realized_itm", "mean"),
            count=("brier_loss", "count")
        ).reset_index()
        
        return calibration

    def report_metrics(self, resolved_df: pd.DataFrame):
        mean_brier = resolved_df["brier_loss"].mean()
        return {
            "contracts_graded": len(resolved_df),
            "mean_brier_loss": mean_brier,
            "brier_std": resolved_df["brier_loss"].std()
        }

Usage Workflow:

Inference: Call HeliumForecastClient.fetch() and pass the result to CalibrationLogger.record().
Resolution: After expiration, update the CSV with realized_spot (e.g., via a nightly job querying market data).
Grading: Instantiate CalibrationAnalyzer, run resolve_contracts(), and inspect the calibration histogram.
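The resolution step is the one piece of the workflow not covered by the classes above. A minimal sketch follows, assuming closing prices arrive as a dictionary keyed by `(symbol, expiration)` and that expirations are ISO dates such as `2026-06-26`. The function name `fill_realized_spot` and the `closing_prices` mapping are hypothetical; how you source the prices is up to your market data feed.

```python
import csv
from datetime import date
from pathlib import Path
from typing import Dict, Tuple


def fill_realized_spot(
    log_path: Path,
    closing_prices: Dict[Tuple[str, str], float],
    today: date,
) -> int:
    """Fill realized_spot for expired, unresolved rows; return rows updated."""
    with log_path.open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    if fieldnames is None:
        raise ValueError(f"{log_path} has no header row")

    updated = 0
    for row in rows:
        expired = date.fromisoformat(row["expiration"]) < today
        key = (row["symbol"], row["expiration"])
        if expired and row["realized_spot"] == "" and key in closing_prices:
            row["realized_spot"] = closing_prices[key]
            updated += 1

    # Write to a temporary file first, then swap it in, so a crash
    # mid-write cannot leave a truncated log behind.
    tmp_path = log_path.with_suffix(".tmp")
    with tmp_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    tmp_path.replace(log_path)
    return updated
```

Note that rewriting the file trades away strict append-only semantics. If auditability matters more than convenience, an alternative is to keep resolutions in a separate file and join the two at grading time.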

Pitfall Guide

Pitfall	Explanation	Fix
Survivorship Bias in Logging	Logging only contracts where the model predicts high probability or high profit. This skews calibration metrics toward "easy" cases.	Log every forecast request unconditionally. The grading set must represent the full distribution of model outputs.
Rate Limit Exhaustion	The Helium endpoint enforces a limit of 50 calls per IP per day. Aggressive polling or backtesting without caching will trigger 429 errors.	Implement a local cache for repeated requests. Prioritize contracts based on liquidity or risk exposure. Batch requests where possible.
Confusing Calibration with Sharpness	A model can be well-calibrated (probabilities match frequencies) but uninformative (all probabilities are 50%). Conversely, a sharp model may be miscalibrated.	Track both metrics. Use the Brier score decomposition to separate calibration error from resolution (sharpness). Aim for high sharpness without sacrificing calibration.
Ignoring Bin-Level Miscalibration	Relying solely on mean Brier loss masks local failures. A model might be accurate at 50% probability but overconfident at 90%.	Always generate the calibration histogram. Investigate bins where `mean_forecast` diverges significantly from `realized_frequency`.
Schema Drift in Storage	Adding columns to the log file without versioning breaks downstream parsers and grading scripts.	Use a schema registry or versioned file naming (e.g., `calibration_v2.csv`). Validate schema on read.
Timezone Inconsistency	Mixing UTC and local timestamps in logs causes misalignment when joining with market data.	Store all timestamps in UTC ISO-8601 format. Convert to local time only for display purposes.
Price vs. Probability Conflation	Assuming a low Brier score implies the model's price predictions are accurate.	Brier score measures probability honesty, not price accuracy. Maintain a separate tracking mechanism for price delta metrics.
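The Brier score decomposition mentioned in the table can be sketched as follows. This is the standard Murphy decomposition computed over equal-width bins. The function name `brier_decomposition` and the binning scheme are assumptions, and the identity `brier = reliability - resolution + uncertainty` is exact only when all forecasts within a bin are equal; otherwise it is an approximation.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def brier_decomposition(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 5
) -> Dict[str, float]:
    """Split the Brier score into reliability, resolution, and uncertainty."""
    n = len(probs)
    if n == 0 or n != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    base_rate = sum(outcomes) / n

    # Group (forecast, outcome) pairs into equal-width probability bins.
    bins: Dict[int, List[Tuple[float, int]]] = defaultdict(list)
    for p, o in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, o))

    reliability = 0.0  # calibration error: lower is better
    resolution = 0.0   # spread of bin hit rates around the base rate: higher is better
    for members in bins.values():
        k = len(members)
        mean_forecast = sum(p for p, _ in members) / k
        hit_rate = sum(o for _, o in members) / k
        reliability += k * (mean_forecast - hit_rate) ** 2
        resolution += k * (hit_rate - base_rate) ** 2

    return {
        "reliability": reliability / n,
        "resolution": resolution / n,
        "uncertainty": base_rate * (1 - base_rate),
        "brier": sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / n,
    }
```

A forecaster who always says 50% on a balanced set of outcomes gets reliability 0 and resolution 0: perfectly calibrated and entirely uninformative, which is the case the table warns about.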

Production Bundle

Action Checklist

Instrument Logging: Integrate CalibrationLogger into the inference pipeline to capture all prob_itm forecasts.
Define Resolution Source: Establish a reliable data feed for realized underlying prices at expiration.
Automate Grading: Schedule a daily job to run CalibrationAnalyzer on resolved contracts.
Monitor Rate Limits: Implement alerting if API usage approaches the 50 calls/day threshold.
Review Calibration Plots: Schedule weekly reviews of the calibration histogram to detect drift.
Set Thresholds: Define acceptable Brier loss bounds per strike regime to trigger model retraining.
Backup Logs: Ensure log files are version-controlled or backed up to prevent data loss.

Decision Matrix

Scenario	Recommended Approach	Why	Cost Impact
Prototyping / Low Volume	CSV + Pandas	Zero infrastructure overhead; rapid iteration.	Low
High Volume / Multi-Model	PostgreSQL + Airflow	Concurrency, schema enforcement, and orchestration.	Medium
Rate-Limited API Usage	Local Cache + Batching	Prevents 429 errors; maximizes utility of free tier.	Low
Real-Time Monitoring	Stream to Dashboard	Immediate visibility into calibration drift.	High
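For the rate-limited row, a local cache with a daily call budget might look like the sketch below. The class name `DailyBudgetCache` is hypothetical, and counting calls per local calendar day is an assumption: the article does not say when the endpoint's 50-call window resets.

```python
from datetime import date
from typing import Callable, Dict, Generic, Hashable, Optional, TypeVar

T = TypeVar("T")


class DailyBudgetCache(Generic[T]):
    """Cache results per key and refuse new upstream calls past a daily budget."""

    def __init__(self, fetch: Callable[[Hashable], T], max_calls_per_day: int = 50):
        self._fetch = fetch
        self._max_calls = max_calls_per_day
        self._cache: Dict[Hashable, T] = {}
        self._calls_today = 0
        self._day: Optional[date] = None

    def get(self, key: Hashable, today: Optional[date] = None) -> T:
        today = today or date.today()
        if today != self._day:
            # New calendar day: reset the counter and drop stale forecasts.
            self._day = today
            self._calls_today = 0
            self._cache.clear()
        if key in self._cache:
            return self._cache[key]
        if self._calls_today >= self._max_calls:
            raise RuntimeError("daily call budget exhausted; not calling upstream")
        # Count the call before making it, so failed requests also use budget.
        self._calls_today += 1
        self._cache[key] = self._fetch(key)
        return self._cache[key]
```

Keys need to be hashable, so a tuple such as `(symbol, strike, expiration, option_type)` is a natural choice. The `OptionContract` dataclass in this article is not frozen and therefore cannot be used as a dictionary key directly.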

Configuration Template

Use this YAML configuration to parameterize the logging and grading pipeline.

forecasting:
  endpoint: "https://heliumtrades.com/mcp_option_price/"
  timeout_seconds: 10
  rate_limit:
    max_calls_per_day: 50
    cooldown_seconds: 60

storage:
  log_path: "/data/calibration_logs/option_forecasts.csv"
  schema_version: "1.0"
  backup_enabled: true

grading:
  resolution_delay_hours: 24
  calibration_bins: 5
  alert_threshold_brier: 0.15
  alert_threshold_bin_divergence: 0.10

logging:
  level: "INFO"
  format: "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
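The two alert thresholds in the `grading` section could be applied roughly as sketched below. The article does not define their exact semantics, so this assumes `alert_threshold_brier` caps the mean Brier loss and `alert_threshold_bin_divergence` caps the absolute gap between a bin's mean forecast and its realized frequency. The function name `find_alerts` is hypothetical.

```python
from typing import Dict, List, Optional

# Defaults mirroring the `grading` section of the template above.
DEFAULT_GRADING = {
    "alert_threshold_brier": 0.15,
    "alert_threshold_bin_divergence": 0.10,
}


def find_alerts(
    mean_brier: float,
    calibration_rows: List[Dict[str, float]],
    grading: Optional[Dict[str, float]] = None,
) -> List[str]:
    """Return human-readable alerts for any breached threshold."""
    grading = grading or DEFAULT_GRADING
    alerts: List[str] = []
    if mean_brier > grading["alert_threshold_brier"]:
        alerts.append(f"mean Brier loss {mean_brier:.3f} exceeds threshold")
    for row in calibration_rows:
        # Absolute gap between what was forecast and what actually happened.
        gap = abs(row["mean_forecast"] - row["realized_frequency"])
        if gap > grading["alert_threshold_bin_divergence"]:
            alerts.append(
                f"bin near {row['mean_forecast']:.2f} is off by {gap:.2f}"
            )
    return alerts
```

The rows can be produced from the calibration histogram, for example with `calibration.to_dict("records")`. Empty bins carry NaN values, which never exceed the threshold and so are skipped without special handling.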

Quick Start Guide

Install Dependencies:
```
pip install requests pandas numpy
```

Initialize Logger:

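from pathlib import Path
# Named calibration_log so it does not shadow the module-level `logger`
# that CalibrationLogger.record() uses for its own log messages.
calibration_log = CalibrationLogger(Path("forecasts.csv"))

Log a Forecast:

client = HeliumForecastClient()
contract = OptionContract("AAPL", 180.0, "2026-06-26", "call")
snapshot = client.fetch(contract)
calibration_log.record(snapshot)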

Resolve and Grade: After expiration, update the CSV with the realized spot price, then run:

analyzer = CalibrationAnalyzer(Path("forecasts.csv"))
resolved = analyzer.resolve_contracts()
print(analyzer.report_metrics(resolved))
print(analyzer.compute_calibration_histogram(resolved))

Iterate: Use the calibration histogram to identify miscalibrated bins and refine model training or post-processing adjustments.
