Your model predicts 70% win probability. The team wins 58% of the time when you say 70%. That gap is a calibration problem, and it will silently destroy your profitability.
Accuracy measures how often you are right. Calibration measures whether your probabilities mean what they say. For sports betting, calibration determines whether you make money. See scikit-learn's calibration guide for the standard techniques.
Why Calibration Beats Accuracy
Two models evaluate the same game, where the market price is 65 cents:
- Model A says 78%. Well-calibrated — when it says 78%, the team wins ~78% of the time. Real edge: 13 cents.
- Model B says 78%. Poorly calibrated — when it says 78%, the team wins ~66%. Actual edge: 1 cent, below fees.
Both trigger a buy signal. Model A profits. Model B loses money on every trade despite identical raw output. The danger is that Model B may have a better AUC — it just cannot price contracts correctly.
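The edge arithmetic above can be sketched in a few lines (the helper name is ours, not from any library):

```python
def expected_profit_per_contract(true_prob, price):
    """Expected profit for one $1 binary contract bought at `price`.

    Win: collect $1, netting 1 - price. Lose: forfeit the price paid.
    Algebraically this simplifies to true_prob - price."""
    return true_prob * (1 - price) - (1 - true_prob) * price

# Model A: the 78% claim is real
print(round(expected_profit_per_contract(0.78, 0.65), 2))  # 0.13
# Model B: the "78%" is really worth 66%
print(round(expected_profit_per_contract(0.66, 0.65), 2))  # 0.01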
Measuring Calibration: Brier Score
```python
import numpy as np

def brier_score(y_true, y_prob):
    """Lower is better. 0 = perfect, 0.25 = coin flip."""
    return np.mean((y_prob - y_true) ** 2)
```
Brier score captures both discrimination and calibration. For sports betting, below 0.20 is solid. Below 0.18 is excellent.
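A quick sanity check of those thresholds, using the `brier_score` function just defined (repeated here so the snippet runs standalone, with made-up outcomes):

```python
import numpy as np

def brier_score(y_true, y_prob):
    return np.mean((y_prob - y_true) ** 2)

y_true = np.array([1, 0, 1, 0])

# Constant 50% predictions on a 50/50 slate: the coin-flip baseline
print(brier_score(y_true, np.full(4, 0.5)))                 # 0.25

# Confident and correct predictions score far lower
print(brier_score(y_true, np.array([0.9, 0.1, 0.8, 0.2])))  # ≈ 0.025
```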
Measuring Calibration: ECE
Expected Calibration Error directly measures the gap between predicted probability and observed frequency:
```python
def expected_calibration_error(y_true, y_prob, n_bins=10):
    bin_edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        if i == n_bins - 1:
            # Close the last bin on the right so predictions of exactly 1.0 are counted
            mask = (y_prob >= bin_edges[i]) & (y_prob <= bin_edges[i + 1])
        else:
            mask = (y_prob >= bin_edges[i]) & (y_prob < bin_edges[i + 1])
        if mask.sum() == 0:
            continue
        bin_conf = y_prob[mask].mean()  # average predicted probability in the bin
        bin_acc = y_true[mask].mean()   # observed win frequency in the bin
        ece += mask.sum() * abs(bin_acc - bin_conf)
    return ece / len(y_true)
```
ECE below 0.03 means your probabilities are within 3 percentage points of reality. Above 0.05, your edge calculations are systematically wrong.
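A quick check that ECE flags overconfidence (the function is repeated so the snippet runs standalone; the data is synthetic and illustrative):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    bin_edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        if i == n_bins - 1:
            mask = (y_prob >= bin_edges[i]) & (y_prob <= bin_edges[i + 1])
        else:
            mask = (y_prob >= bin_edges[i]) & (y_prob < bin_edges[i + 1])
        if mask.sum() == 0:
            continue
        ece += mask.sum() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece / len(y_true)

# Overconfident model: always says 80%, but teams win only ~60% of the time
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.60).astype(float)
y_prob = np.full(10_000, 0.8)
print(expected_calibration_error(y_true, y_prob))  # ≈ 0.20, far above the 0.05 danger line
```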
Visualizing It
A reliability diagram plots predicted probability vs observed frequency. Perfectly calibrated models fall on the diagonal. If your curve bows above, you are under-confident (your 60% predictions win 70%). If below, you are over-confident. Both are fixable.
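scikit-learn's `calibration_curve` computes the points for a reliability diagram. The data below is synthetic, stretched away from 0.5 to simulate an over-confident model; swap in your own probabilities:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(42)
true_p = rng.uniform(0.3, 0.7, 5_000)
y_true = (rng.random(5_000) < true_p).astype(int)
y_prob = np.clip(true_p + (true_p - 0.5) * 0.8, 0, 1)  # over-confident: pushed away from 0.5

frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)

plt.plot(mean_pred, frac_pos, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfect calibration")
plt.xlabel("Mean predicted probability")
plt.ylabel("Observed win frequency")
plt.legend()
plt.savefig("reliability.png")
```

An over-confident model's curve sits below the diagonal at the high end and above it at the low end.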
Fixing Calibration: Isotonic Regression
Isotonic regression learns a monotonic mapping from raw model output to calibrated probabilities:
```python
from sklearn.isotonic import IsotonicRegression

# CRITICAL: fit on validation set, never training data
calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_probs_val, y_val)

# Apply to new predictions
calibrated_probs = calibrator.predict(raw_probs_test)
```
If you calibrate on training data, you overfit the calibration curve. Always use a held-out validation set.
Platt Scaling: The Alternative
For smaller datasets, Platt scaling fits a logistic curve:
```python
from sklearn.linear_model import LogisticRegression

platt = LogisticRegression()
platt.fit(raw_probs_val.reshape(-1, 1), y_val)
calibrated = platt.predict_proba(raw_probs_test.reshape(-1, 1))[:, 1]
```
Platt works well for SVMs and neural nets but can be too rigid for tree-based models. For sports betting, isotonic regression is usually the better choice.
Our Calibration Pipeline
At ZenHodl, calibration is a first-class step:
1. Train the win probability (WP) model on training data (seasons 2020-2024)
2. Generate raw probabilities on a validation set (early 2025)
3. Fit an isotonic calibrator on the validation set
4. Apply the calibrator to live predictions
5. Monitor ECE weekly — retrain if it drifts above 0.04
The calibrator adds roughly 0.5 cents of edge per trade. Across thousands of trades, that compounds into meaningful profit.
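The pipeline can be sketched end to end on synthetic data (a `GradientBoostingClassifier` stands in for the actual WP model; the arrays and split points are placeholders):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.isotonic import IsotonicRegression

# Synthetic stand-in for real game features and outcomes
rng = np.random.default_rng(7)
X = rng.normal(size=(6_000, 5))
y = (rng.random(6_000) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

# Chronological split: train, validation (for calibration), live
X_train, y_train = X[:4_000], y[:4_000]
X_val, y_val = X[4_000:5_000], y[4_000:5_000]
X_live = X[5_000:]

model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# The calibrator sees only held-out validation probabilities
raw_val = model.predict_proba(X_val)[:, 1]
calibrator = IsotonicRegression(out_of_bounds="clip").fit(raw_val, y_val)

# Every live prediction passes through the calibrator before pricing
live_probs = calibrator.predict(model.predict_proba(X_live)[:, 1])
```

Weekly monitoring then runs the `expected_calibration_error` check against realized outcomes.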
Common Mistakes
Calibrating on training data. The model already fits the training data, so the calibration curve learns nothing useful.
Ignoring domain shift. A calibrator trained on 2023 NBA data may not transfer to 2026 March Madness. Different sports and seasons require recalibration.
Confusing calibration with discrimination. A model predicting 50% for every game is perfectly calibrated — and useless. You need both.
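The constant-predictor point can be seen numerically (`roc_auc_score` measures discrimination):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([1, 0] * 500)    # 50% base rate
constant = np.full(1_000, 0.5)     # always predict 50%

print(abs(constant.mean() - y_true.mean()))  # 0.0 -> perfectly calibrated
print(roc_auc_score(y_true, constant))       # 0.5 -> zero discrimination
```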
The Bottom Line
In sports betting, your probabilities are your prices. If you say 78% and the market says 65%, you are buying at 65 cents something you value at 78 cents. If your 78% is actually worth 66%, you overpaid. Calibration makes your probabilities trustworthy.
Part of the ZenHodl blog. We write about sports analytics, prediction markets, and building trading bots with Python.