
Win Probability Models for Sports Betting: The Math, The Code, and The Mistakes

2026-04-21 · model · win-probability · tutorial · python · machine-learning · reference

Win probability (WP) models are the foundation of sports prediction. ESPN has one. FiveThirtyEight had one. Every sportsbook uses one internally. And if you're building a prediction market bot, you need one too.

This is the complete guide — the math, the code, and the mistakes that make models look good in backtests but lose money in production.

What a WP Model Does

A WP model answers one question: given the current game state (score, time remaining, period, team strength), what's the probability the home team wins?

The input is a feature vector:

- Score differential: home_score - away_score
- Time remaining: seconds left in regulation
- Period/quarter/inning: where we are in the game
- Team strength: Elo ratings for both teams
- Sport-specific features: possession, down/distance (football), shots on goal (hockey), pitcher stats (baseball)

The output is a probability between 0 and 1. A well-calibrated model that outputs 0.72 means the home team wins 72% of the time in similar game states.

Step 1: Elo Ratings

Every WP model needs a measure of team strength. Elo ratings, invented by Arpad Elo for chess, are the simplest approach that works.

The expected win probability for team A against team B:

E(A) = 1 / (1 + 10^((R_B - R_A) / 400))

After a game, ratings update:

R_A_new = R_A + K × (actual - expected)

Where K is the learning rate (how fast ratings respond to new results). We tune K per sport:

| Sport  | K  | Home Advantage (Elo points) | Why                             |
|--------|----|-----------------------------|---------------------------------|
| NBA    | 20 | 100                         | High-scoring, frequent games    |
| NCAAMB | 20 | 100                         | Same as NBA, wider skill gap    |
| NFL    | 15 | 55                          | Few games, each matters more    |
| NHL    | 8  | 40                          | Low-scoring, more randomness    |
| MLB    | 8  | 24                          | 162 games/season, high variance |

Higher K means faster adaptation but more noise. Lower K means smoother ratings but slower to detect roster changes.

The margin-of-victory multiplier: For basketball and football, we scale K by the margin of victory. A 20-point blowout moves ratings more than a 1-point squeaker. This is disabled for hockey and baseball (low-scoring sports where margins are noise).

Season decay: Between seasons, we regress all ratings 50% toward the mean (1500). This prevents a team that was great last year from carrying an inflated rating into a rebuilding year.

Full implementation: Build an Elo Rating System from Scratch in Python

Step 2: Feature Engineering

Raw Elo difference plus score and time gets you 80% of the way. The remaining 20% comes from sport-specific features:

Basketball (NBA, NCAAMB):

- score_diff: home - away score
- time_fraction: fraction of game remaining (0 = game over, 1 = tip-off)
- elo_diff: home Elo - away Elo
- total_score: combined score (proxy for game pace)
- score_diff_x_elo: interaction term (a 10-point lead by a strong team is more secure)
- score_diff_x_tf: interaction term (a 10-point lead with 2 minutes left vs 20 minutes)
- pace_diff, ortg_diff, drtg_diff: team efficiency metrics

Football (NFL, CFB): all of the above, plus:

- down, distance, yard_line: current play state
- possession_home: 1 if the home team has the ball

Hockey (NHL):

- shots_on_goal_diff: SOG differential
- power_play_state: 5v5, 5v4, 4v5, etc.
- empty_net: whether the trailing team has pulled their goalie

Baseball (MLB):

- is_home_batting: which team is batting (alternates by half-inning)
- sp_era_diff, sp_whip_diff, sp_k9_diff: starting pitcher stats
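Assembling the basketball features above is mostly arithmetic on game state. A sketch, where the `GameState` container is an illustrative assumption (feature names follow the list; the efficiency metrics are omitted for brevity):

```python
from dataclasses import dataclass

@dataclass
class GameState:
    home_score: int
    away_score: int
    seconds_left: int
    game_seconds: int      # e.g. 2880 for NBA regulation
    home_elo: float
    away_elo: float

def basketball_features(s: GameState) -> dict:
    """Core basketball feature vector: raw state plus interaction terms."""
    score_diff = s.home_score - s.away_score
    time_fraction = s.seconds_left / s.game_seconds  # 1 = tip-off, 0 = final buzzer
    elo_diff = s.home_elo - s.away_elo
    return {
        "score_diff": score_diff,
        "time_fraction": time_fraction,
        "elo_diff": elo_diff,
        "total_score": s.home_score + s.away_score,
        "score_diff_x_elo": score_diff * elo_diff,     # lead security by strength
        "score_diff_x_tf": score_diff * time_fraction,  # lead size vs time left
    }
```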

What doesn't work: We tested and rejected rest days, travel distance, referee assignments, weather, and player-level stats. None improved backtested trading performance despite improving raw accuracy. The reason: markets already price in these factors. Your model only makes money on information the market doesn't have.

Step 3: Model Architecture

We use different model architectures per sport:

Logistic Regression + Spline Calibration: Best for high-scoring sports (basketball). Outputs well-calibrated probabilities by default. Spline post-processing smooths the calibration curve.

XGBoost + Isotonic Calibration: Best for low-scoring sports (hockey, baseball). Gradient boosting captures non-linear interactions (a 1-0 lead in the 3rd period of hockey is very different from a 1-0 lead in the 1st). Isotonic regression fixes the raw probability outputs.

Split-Phase Model: Separate models for early game (high uncertainty) and late game (low uncertainty), blended by time fraction. Works well for basketball where the dynamics change dramatically between the 1st quarter and final 2 minutes.
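As a sketch, the split-phase blend can be as simple as a linear ramp on time fraction; the linear weighting here is an assumption, since the article doesn't specify the blending function:

```python
def blended_wp(p_early: float, p_late: float, time_fraction: float) -> float:
    """Blend early-game and late-game model outputs by time fraction.

    time_fraction: 1.0 at tip-off, 0.0 at the final buzzer.
    """
    w_early = time_fraction  # trust the early-game model more early on
    return w_early * p_early + (1.0 - w_early) * p_late
```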

Ensemble: Weighted combination of all three, with weights determined by which model trades best on held-out data. The ensemble provides the most robust predictions across all game states.

The model choice matters less than calibration. A well-calibrated logistic regression outperforms a miscalibrated XGBoost every time. See Calibration Beats Accuracy for the full explanation.

Step 4: Training Protocol

The most common mistake in sports ML is data leakage — using future information to predict past outcomes. We prevent this with strict time-split validation:

Training set:   2020-21 through 2023-24 seasons
Calibration set: First half of 2024-25
Test set:       Second half of 2024-25 through 2025-26

Never random-split sports data. Random splits mix games from the same season, allowing the model to "learn" team strength from future games. Time-split ensures the model only ever sees data from before the prediction date.

Calibration happens on a separate set. Calibrating on training data gives you a model that's perfectly calibrated on data it has memorized. The calibration set must be data the model has never seen during training.

Test on the most recent data. Your test set should be the most recent season because that's the closest approximation to the distribution you'll trade on. A model that works on 2021 data but fails on 2025 data is useless.
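A sketch of the time split with pandas; the `date` column name and exact season-boundary dates are illustrative:

```python
import pandas as pd

def time_split(games: pd.DataFrame):
    """Strict time-split: no future game ever informs a past prediction."""
    games = games.sort_values("date")
    train = games[games["date"] < "2024-10-01"]                    # through 2023-24
    calib = games[games["date"].between("2024-10-01", "2025-01-31")]  # 1st half 2024-25
    test  = games[games["date"] > "2025-01-31"]                    # most recent data
    return train, calib, test
```

The model is fit on `train`, the calibrator on `calib`, and both are evaluated only on `test`.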

Step 5: Calibration

The step most people skip — and the one that determines whether you make money.

Expected Calibration Error (ECE): Bin predictions into 10 buckets (0-10%, 10-20%, ..., 90-100%). For each bucket, compare average predicted probability to actual win rate. The weighted average of the absolute differences is your ECE.

Brier Score: The mean squared error of probability predictions. Lower is better. Measures both accuracy and calibration simultaneously.
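Both metrics are a few lines of NumPy. A minimal sketch following the 10-bucket definition above:

```python
import numpy as np

def brier_score(p: np.ndarray, y: np.ndarray) -> float:
    """Mean squared error of probability predictions (lower is better)."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return float(np.mean((p - y) ** 2))

def expected_calibration_error(p, y, n_bins: int = 10) -> float:
    """Weighted average |avg predicted - actual win rate| over equal-width bins."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            weight = mask.mean()  # fraction of predictions in this bucket
            ece += weight * abs(p[mask].mean() - y[mask].mean())
    return ece
```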

Our production models achieve:

- NCAAMB: Brier 0.145, ECE 0.022
- NBA: Brier 0.127, ECE 0.057
- NHL: Brier 0.157, ECE 0.034
- MLB: Brier 0.152, ECE 0.011

The 7 Mistakes That Lose Money

Mistake 1: Training on market prices. If you train on Polymarket/sportsbook prices, your model learns to agree with the market. It can never find mispricings because its training signal IS the market. Train on ESPN game data only.

Mistake 2: Random train/test splits. See above. Always time-split.

Mistake 3: Ignoring calibration. A 65% accurate model can lose money if it's not calibrated. When it says 75%, the team needs to actually win 75% of the time — not 62%.

Mistake 4: Too many features. 50 features overfit to training data. 6-14 features generalize to new seasons. Start simple, only add features that improve trading profit (not just accuracy).

Mistake 5: Not accounting for execution costs. A 5-cent edge looks profitable. After 2-cent fees and 3 cents of average slippage, nothing is left, and anything worse than average puts you underwater. Backtest with realistic costs from day one.
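The arithmetic, in cents per contract (numbers from the example above):

```python
def net_edge(gross_edge: float, fees: float, slippage: float) -> float:
    """Edge remaining after execution costs, in cents per contract."""
    return gross_edge - fees - slippage

# 5-cent model edge, 2-cent fees, 3-cent average slippage: nothing left
breakeven = net_edge(5, 2, 3)
```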

Mistake 6: Backtesting with market prices you can't actually trade at. The ask price in your historical data might have had 8 contracts of depth. Your 100-contract order would have moved the price 10 cents. Model slippage.

Mistake 7: Not dampening extreme predictions. A raw model output of 95% probably isn't 95% in reality. Dampening (shrinking toward 50%) prevents overconfident bets. We use output = 50 + (raw - 50) × 0.80 for most sports.
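On the 0-1 probability scale the same formula reads:

```python
def dampen(p: float, factor: float = 0.80) -> float:
    """Shrink a raw probability toward 0.5: output = 0.5 + (raw - 0.5) * factor."""
    return 0.5 + (p - 0.5) * factor
```

A raw 95% output becomes 86%, which is usually closer to the truth for overconfident models.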

From Model to Money

A WP model is necessary but not sufficient. The model is the foundation; execution is what makes it profitable.


This methodology is taught step-by-step across 6 modules in our course. Module 1 covers data collection, Module 2 covers Elo ratings, Module 3 covers WP model training and calibration, Module 4 covers backtesting, Module 5 covers live execution, and Module 6 covers deployment. Live results from these exact models at zenhodl.net/results.

