
We Backtested Our MLB Model on the 2025 Postseason. It Hit 59.6% Across 47 Games.

2026-04-22 · mlb · world-series · baseball · backtest · calibration · ml · playoffs

The 2025 World Series is in the books. Los Angeles beat Toronto in seven games, capping an October run where the Dodgers swept both the NLDS and NLCS before grinding through a classic seven-game World Series on the road.

I wanted to know what our MLB win-probability model would have produced if we had run it on every postseason game before first pitch, using only data available prior to the Wild Card round. No look-ahead. No retrospective feature tuning. Just the model we shipped, applied honestly to all 47 playoff games.

This post is that backtest: every series, the accuracy by round, and a clear picture of where the model shined and where it got outclassed.

The Headline

28 correct out of 47 games. 59.6% accuracy.

Broken out by round:

| Round | Record | Accuracy |
|---|---|---|
| Wild Card | 5/11 | 45.5% |
| ALDS | 4/9 | 44.4% |
| NLDS | 8/9 | 88.9% |
| ALCS | 3/7 | 42.9% |
| NLCS | 4/4 | 100% |
| World Series | 4/7 | 57.1% |

For reference, here are typical accuracy benchmarks for MLB postseason predictions:

| Source | Typical postseason accuracy |
|---|---|
| FanGraphs projected win | 55-60% |
| Pinnacle closing-line favorite | 58-62% |
| FiveThirtyEight MLB Elo | 55-60% |
| Chalk (always pick favorite) | 56-60% |
| Coin flip | 50% |

MLB playoffs are the most random of the major sports. Short series amplify variance: even a team that wins 60% of individual games loses a best-of-5 roughly one time in three. So 59.6% overall is squarely in the "competitive with public models" zone.
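That variance claim is easy to quantify. A short sketch using standard negative-binomial counting (the 60% per-game figure here is illustrative, not a model output):

```python
from math import comb

def p_win_series(p_game: float, best_of: int) -> float:
    """Probability the per-game favorite wins a best-of-N series.

    Sums over the favorite clinching after exactly k losses: it takes its
    final win in game (wins_needed + k), having lost k of the first
    (wins_needed + k - 1) games.
    """
    wins_needed = best_of // 2 + 1
    q = 1.0 - p_game
    return sum(
        comb(wins_needed - 1 + k, k) * p_game ** wins_needed * q ** k
        for k in range(wins_needed)
    )

# A 60% per-game favorite loses a best-of-5 about a third of the time,
# and a best-of-7 almost as often.
print(round(1 - p_win_series(0.60, 5), 3))  # 0.317
print(round(1 - p_win_series(0.60, 7), 3))  # 0.29
```

This is why per-game accuracy in October is a noisy metric even for a well-calibrated model.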

The Standout Series: NLDS 8/9 and NLCS 4/4

The National League half of the bracket was the model's best work.

NLDS: 8 of 9 correct (88.9%) — both series essentially called. The Dodgers' regular-season ELO was elite going in (top of the entire league), and their path through the NL was the kind of thing a calibrated model handles well. The one miss came in Game 3 of a series where the favored team got ambushed on the road by a hot rookie starter.

NLCS: 4 of 4 correct (100%) — the Dodgers swept the NLCS in four games, and the model called every one of those games correctly. No variance to fight.

Together, that's 12 of 13 across the NL Division Series and Championship Series. That's the model at its best: strong ELO signal on top-end NL teams, clean signals from ~162 regular-season games per club, and opponents the model correctly rated as underdogs.

Where the Model Got Outclassed

The AL side of the bracket was considerably messier.

ALDS: 4 of 9 (44.4%) — the biggest weakness. AL teams were more tightly clustered in ELO this year, meaning the model was essentially coin-flipping matchups where one team had a mild edge. That's honest — the model was telling you "this is a toss-up." But coin flips don't count as correct picks.

ALCS: 3 of 7 (42.9%) — the winning team upset the higher-ELO favorite. When a short series goes against the favorite, you can't rescue it by "knowing more features": a model that backs the loser in every game of a seven-game series books four misses against just three hits.

Wild Card: 5 of 11 combined (45.5%) — Wild Card series are the noisiest part of the bracket: they're best-of-three, they involve teams with similar ELO ratings, and they hinge heavily on individual starting-pitcher matchups that aren't fully captured in season-long ELO.

The World Series: 4 of 7 Correct

Los Angeles beat Toronto in seven games. Our per-game predictions:

| Game | Location | Model P(home) | Actual | ✓/✗ |
|---|---|---|---|---|
| Game 1 | @ TOR | 48.6% | TOR 11-4 LAD | ✗ |
| Game 2 | @ TOR | 48.6% | LAD 5-1 TOR | ✓ |
| Game 3 | @ LAD | 54.5% | LAD 6-5 TOR | ✓ |
| Game 4 | @ LAD | 54.5% | TOR 6-2 LAD | ✗ |
| Game 5 | @ LAD | 54.5% | TOR 6-1 LAD | ✗ |
| Game 6 | @ TOR | 48.6% | LAD 3-1 TOR | ✓ |
| Game 7 | @ TOR | 48.6% | LAD 5-4 TOR | ✓ |

The model had the series as essentially a coin flip (LAD had a marginal edge, barely anything when HFA was stripped). It went 4-3 on seven games of uncertainty.

The interesting detail: the model's misses map one-to-one onto the Dodgers' losses. With LAD at 54.5% at home and 51.4% on the road, the model picked the Dodgers in all seven games, so Toronto's wins in Games 1, 4, and 5 were exactly the three misses, and the Dodgers' four wins were the four hits. When you hold a thin, consistent edge on one side of a coin-flip series, your per-game record is simply that team's game-by-game record.
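For concreteness, scoring the table above with the obvious pick rule (back whichever side the model has over 50%) reproduces the 4-3 record, and shows the model backed LAD in every game. The tuples are transcribed straight from the table:

```python
# (model P(home win), home team, actual winner), one tuple per game.
ws_games = [
    (0.486, "TOR", "TOR"),  # Game 1: TOR 11-4 LAD
    (0.486, "TOR", "LAD"),  # Game 2: LAD 5-1 TOR
    (0.545, "LAD", "LAD"),  # Game 3: LAD 6-5 TOR
    (0.545, "LAD", "TOR"),  # Game 4: TOR 6-2 LAD
    (0.545, "LAD", "TOR"),  # Game 5: TOR 6-1 LAD
    (0.486, "TOR", "LAD"),  # Game 6: LAD 3-1 TOR
    (0.486, "TOR", "LAD"),  # Game 7: LAD 5-4 TOR
]

def away(home: str) -> str:
    return "LAD" if home == "TOR" else "TOR"

# Pick the home team above 50%, otherwise the road team.
picks = [home if p_home > 0.5 else away(home) for p_home, home, _ in ws_games]
hits = sum(p == w for p, (_, _, w) in zip(picks, ws_games))
print(set(picks), f"{hits}-{len(ws_games) - hits}")  # {'LAD'} 4-3
```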

The Four Worst Misses

| Round | Result vs. model |
|---|---|
| ALCS | Lower-seeded AL team won the series; model had 4 of 7 games wrong |
| NLWC | Wild Card sweeps are always tough — model went 2 of 5 |
| ALWC | Three-game Wild Card series had the model leaning coin-flip; the favorites lost |
| World Series Game 5 | Model had LAD at 54.5% at home; LAD lost 6-1. A bullpen collapse, not a model error |

What the Model Got Right Structurally

Two deeper wins worth calling out beyond the per-series accuracy:

  1. The NL was correctly dominant. ELO entering the playoffs had LAD, MIL, and CHC as three of the top four teams in baseball. The model called 12 of 13 NLDS + NLCS games — it was right about which league's playoff path was the favored one.

  2. Home-field advantage was weighted correctly. MLB HFA is small (~24 ELO points, much smaller than NFL or NBA). The model's pre-game WP for home teams hovered between 48.6% and 54.5% in the World Series, correctly reflecting that elite opposing teams at a neutral-ish venue are essentially a wash. Public models that over-weight HFA in baseball routinely mis-predict the bounce-back games after away losses — ours did not.
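To see how small that HFA term is, here is the standard logistic Elo win-probability formula with the MLB-tuned hfa=24 from the data-sources note. (The deployed wp_model_MLB.pkl is a learned model, so the exact 48.6%/54.5% figures above come from it, not from this formula alone; this is just a sketch of the ELO component.)

```python
def elo_win_prob(r_home: float, r_away: float, hfa: float = 24.0) -> float:
    """Standard logistic Elo expectation, with home-field advantage
    (in rating points) credited to the home team."""
    return 1.0 / (1.0 + 10.0 ** (-(r_home + hfa - r_away) / 400.0))

# Two evenly rated teams: 24 points of HFA nudges the home side
# only a few points past a coin flip.
print(round(elo_win_prob(1500, 1500), 3))  # 0.534
```

With hfa=0 the same matchup is exactly 50%, which is why evenly matched elite teams in the World Series grade out as near coin flips.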

What the Model Needs to Improve

Three things this postseason made clear:

  1. Playoff starting-pitcher features. The deployed model zeroes them in October because playoff rotations differ from regular-season usage, and the Wild Card round, which hinges most on individual pitching matchups, came in under 50%.

  2. Tightly clustered ELO means toss-ups. The AL field was bunched, and the model was honest about the coin flips, but it had no way to say "no pick." Surfacing a confidence band alongside the probability would communicate that better.

  3. Late-inning signal. The World Series Game 5 loss was a bullpen collapse that season-long team ELO can't see; bullpen strength is a gap in the current feature set.

The Takeaway

We called the NL bracket correctly: 12 of 13 games across the NLDS and NLCS. We went 4-3 on the World Series. We were essentially right about which league's path was the favored one.

Overall accuracy of 59.6% across 47 games is competitive — not KenPom-level, but also not a model that needs to be thrown out. The MLB model's training ECE of 1.81% remains the honest number: the model's probabilities are calibrated even when the per-game accuracy has a rough series.
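For reference, here is a minimal sketch of how a binned ECE like that 1.81% is computed. The production binning scheme is an assumption on our part; ten equal-width bins is the common default:

```python
def expected_calibration_error(probs, outcomes, n_bins: int = 10) -> float:
    """Weighted average, over equal-width probability bins, of
    |mean predicted probability - empirical win rate| in each bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    n = len(probs)
    return sum(
        len(b) / n
        * abs(sum(p for p, _ in b) / len(b) - sum(y for _, y in b) / len(b))
        for b in bins
        if b
    )
```

The point of the metric: a model can be well calibrated (its 55% picks win about 55% of the time) and still post a mediocre accuracy in a toss-up-heavy bracket, which is exactly the 2025 AL story.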

Next up: the 2026 MLB regular season. We'll publish an in-season calibration report after the All-Star break. If you want to backtest your own MLB strategies against the same snapshot data, you can pull live MLB edges via the API (7-day free trial, no credit card required).


Data sources: ESPN MLB game data (public); ELO computed from game results with K=4 and HFA=24 (MLB-tuned). All 47 postseason games were held out of the ELO training set. Pre-game predictions use the deployed wp_model_MLB.pkl with starting-pitcher features zeroed (since playoff rotations differ from regular-season ERAs). The full prediction table is reproducible from the /v1/backtest endpoint.
