
Calibration Beats Accuracy: The NBA Model Bug That Lost Money at 65% Win Rate

2026-04-07 calibration ml debugging nba intermediate

In sports betting, accuracy is the wrong metric. Our NBA bot had 65% accuracy on backtests and was losing money in live trading. The fix wasn't a better model: it was a six-line bug fix that cut calibration error by 44%.

This is the story.

The Two Numbers That Matter

There are two ways to measure a prediction model:

Accuracy asks: did you predict the winner? If your model says "home team wins" with 51% probability and they win, that's accurate. If it says "home team wins" with 99% probability and they win, that's also accurate. Same answer, very different signals.

Calibration asks: when you predicted 70%, did the team actually win 70% of the time? When you predicted 90%, did they win 90% of the time? A perfectly calibrated model's predictions match reality across the entire probability range.

For a betting strategy, calibration matters more than accuracy. You're not betting on every game — you're only betting when your edge over the market exceeds your costs. If your "70% prediction" is actually a 55% reality, every trade you make at that confidence level is a loser, even if your overall accuracy looks fine.

How We Measure Calibration

The standard metric is Expected Calibration Error (ECE), which complements the Brier score. Here's how it works:

  1. Bin predictions into 10 buckets (0-10%, 10-20%, ..., 90-100%)
  2. For each bucket, compare the average predicted probability to the actual win rate
  3. Take the weighted average of the absolute differences
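The three steps above can be sketched in a few lines of NumPy. This is an illustrative implementation, not our production code:

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average gap between predicted probability and actual
    win rate, computed over equal-width probability bins."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so that 1.0 lands somewhere.
        if hi < 1.0:
            in_bin = (probs >= lo) & (probs < hi)
        else:
            in_bin = (probs >= lo) & (probs <= hi)
        if not in_bin.any():
            continue
        weight = in_bin.mean()  # fraction of all predictions in this bin
        gap = abs(probs[in_bin].mean() - outcomes[in_bin].mean())
        ece += weight * gap
    return ece
```

A model that says 75% and wins 3 of 4 such games scores ECE = 0; one that says 90% but wins only half the time scores ECE = 0.4.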

A perfectly calibrated model has ECE = 0. Anything under 0.05 is excellent. Above 0.10 is concerning. Our NBA bot was at 0.103.

What 0.103 meant in practice: when the model said "70% probability," the actual win rate was closer to 60%. When it said "85%," reality was 75%. The model was systematically overconfident, which meant every "edge" we detected was 10 cents smaller than we thought.
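A quick worked example of what that does to a single trade. The prices here are hypothetical; the roughly 10-point overconfidence gap is the one described above:

```python
# Hypothetical trade: model says 70%, market asks 62 cents per $1 contract.
model_prob = 0.70
market_price = 0.62
apparent_edge = model_prob - market_price  # 0.08 -> looks like +8 cents

# But the model is overconfident: true win rate at "70%" is closer to 60%.
true_prob = 0.60
expected_value = true_prob * 1.00 - market_price  # $1 payout if win, 62c cost
# Every "8-cent edge" trade actually loses about 2 cents on average.
```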

Finding the Bug

We were puzzled. The NBA model used the same architecture as NCAAMB (which had ECE 0.022 — well calibrated). Same features, same training pipeline, same hyperparameters. Why was NBA so different?

We added logging to count the Elo teams loaded at each stage of training. The output:

Stage 1 (raw data): 42 NBA teams
Stage 2 (after split): 0 NBA teams
Stage 3 (training): 0 NBA teams

The Elo dictionary was being lost during a DataFrame split operation. The model had been training with elo_diff = 0.0 for every game. Every prediction was made without knowing which team was better.

The reason 65% accuracy was still possible: score and time remaining alone are pretty predictive. A team up 15 with 5 minutes left wins most of the time regardless of which team it is. But without Elo, the model couldn't distinguish situations where a stronger team has a higher comeback probability than a weaker team in the same score state. The error showed up exactly where calibration matters most: in the 60-80% prediction range.

The Fix

Six lines of code:

# Module-level cache to preserve Elo dict through DataFrame transformations
_ELO_CACHE = {}

def compute_elo_ratings(df, sport):
    # ... existing computation ...
    _ELO_CACHE[sport] = elo_dict  # preserve before returning DataFrame
    return enriched_df

def get_cached_elo(sport):
    return _ELO_CACHE.get(sport, {})
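For illustration, here is how a downstream training stage might consume the cached ratings instead of trusting them to survive the DataFrame split. The helper, column names, and 1500 default rating are hypothetical, not our pipeline's actual code:

```python
import pandas as pd

def add_elo_diff(games_df, elo, default=1500.0):
    """Attach elo_diff using an explicitly passed ratings dict,
    e.g. elo = get_cached_elo("nba"). Hypothetical helper."""
    if not elo:
        # Guard against exactly the failure mode in this post.
        raise ValueError("Empty Elo dict -- were ratings computed first?")
    out = games_df.copy()
    out["elo_diff"] = [
        elo.get(home, default) - elo.get(away, default)
        for home, away in zip(out["home_team"], out["away_team"])
    ]
    return out
```

The key design point is that the ratings travel through an explicit argument (or the module-level cache), so a split or reindex of the DataFrame can no longer silently zero them out.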

After retraining with the fix:

Metric                 Before    After     Change
Elo teams              0         42        +42
Brier score            0.197     0.181     -8.1%
ECE                    0.103     0.057     -44.6%
Accuracy               65.2%     66.8%     +1.6 pts
Backtest win rate      60.1%     69.3%     +9.2 pts
Backtest cents/trade   -2.4c     +6.2c     +8.6c

Notice that accuracy barely changed (65.2% → 66.8%, essentially noise). But the backtested trading performance jumped from losing money to making +6.2 cents per trade.

That's the calibration effect. The model was already accurate enough — it just needed to know how confident to be.

Why Calibration Is Different

For a Kaggle competition, you optimize accuracy or AUC. For sports betting, you optimize calibration.

The reason is simple: betting strategies only act on predictions above a confidence threshold. If your threshold is "bet when fair_prob > market_price + 8 cents," you're only making decisions on a small subset of games. Your accuracy on the full dataset is irrelevant. What matters is your calibration on the games where you actually bet.

A model that's 70% accurate overall but perfectly calibrated will outperform a model that's 75% accurate but systematically overconfident — because the overconfident model will bet on games where the real edge is negative.
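As a sketch, the thresholded decision rule described above is tiny. The function name is illustrative; the 8-cent threshold is the one from this post:

```python
def should_bet(fair_prob, market_price, min_edge=0.08):
    """Bet only when the model's fair probability beats the market
    price by more than the cost threshold (8 cents here)."""
    return fair_prob - market_price > min_edge

should_bet(0.72, 0.60)  # True: 12-cent apparent edge clears the bar
should_bet(0.72, 0.66)  # False: 6 cents is below the threshold
```

With calibrated probabilities, the trades this rule selects carry real edge; with overconfident ones, many of them are negative expected value.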

How to Build Calibrated Models

Three techniques work:

Isotonic regression. Take your raw model predictions, plot them against actual outcomes on a held-out calibration set, and fit a monotonic step function to map raw predictions to calibrated probabilities. We use this for XGBoost models. See scikit-learn's calibration guide.
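A minimal isotonic-calibration sketch with scikit-learn, on toy data. In real use the fit happens on a held-out calibration set, never on training data:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy calibration set: raw scores vs. actual outcomes.
raw_scores = np.array([0.10, 0.30, 0.50, 0.70, 0.80, 0.90, 0.95])
outcomes   = np.array([0,    0,    1,    0,    1,    1,    1])

# Fit a monotonic step function mapping raw score -> calibrated probability.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_scores, outcomes)

calibrated = iso.predict([0.6, 0.9])  # monotone, clipped to [0, 1]
```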

Spline calibration. Same idea but uses smooth splines instead of step functions. Better when you have enough calibration data. We use this for logistic regression.

Platt scaling. Sigmoid fit on raw scores. Originally proposed in Platt 1999. Older technique, generally worse than isotonic for our use case but very fast to compute.
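A minimal Platt-scaling sketch: a logistic sigmoid fit on raw model scores, again on toy data rather than our pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy calibration set: raw (uncalibrated) scores and outcomes.
raw_scores = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
outcomes   = np.array([0, 0, 1, 0, 1, 1])

# Platt scaling fits p = sigmoid(A * s + B) on the raw scores.
platt = LogisticRegression()
platt.fit(raw_scores, outcomes)

calibrated = platt.predict_proba([[1.5]])[:, 1]
```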

Whichever method you use, the calibration step must happen on a separate calibration set, not the training set. Calibrating on training data gives you a model that's perfectly calibrated on data it's already memorized — not on new data.

The Tools to Detect This

Plot a calibration curve. For your held-out test set, bucket predictions into 10 bins, compute the average prediction and actual win rate per bucket, and plot them on the same chart. A well-calibrated model produces a 45-degree diagonal line. Systematic deviations show you exactly where your model is wrong.
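The curve points can be computed with scikit-learn's calibration_curve. Here is a sketch using simulated predictions that are well calibrated by construction, so the points should hug the diagonal:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
probs = rng.uniform(0.05, 0.95, size=2000)
# Simulate a calibrated model: each outcome occurs at its predicted rate.
outcomes = (rng.uniform(size=2000) < probs).astype(int)

# frac_pos = actual win rate per bin, mean_pred = average prediction per bin.
frac_pos, mean_pred = calibration_curve(outcomes, probs, n_bins=10)
# Plot mean_pred (x) vs frac_pos (y) against the 45-degree line;
# bins above the line are underconfident, bins below it overconfident.
```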

Track ECE alongside Brier score and AUC. If ECE drifts up while Brier stays flat, your model is becoming miscalibrated even though its overall accuracy is steady. This is a leading indicator that something is broken upstream — exactly the signal we missed before discovering the Elo bug.

Run a calibration audit on production models periodically. Pull recent predictions and outcomes, compute ECE on the last 30 days, compare to training-time calibration. If they diverge, something has changed in the live data distribution that your training set didn't capture.

The Takeaway

We didn't need a smarter model. We needed a working pipeline. The NBA bot's accuracy was fine all along — it was producing the wrong probabilities because it was missing 42 team ratings.

If your trading bot is losing money on backtests that look profitable, check your calibration before you change your model. A miscalibrated 75% model loses to a calibrated 65% model every time.


The NBA model and its calibration pipeline are taught in Module 3 of our course. Module 4 covers backtesting with proper time-split validation to catch calibration drift before it costs you money.
