Super Bowl LX is over. Seattle beat New England 29-13 on February 8, 2026. A lot of futures tickets turned into trash.
I wanted to know what our NFL win-probability model would have produced if we had run it on every 2025-26 playoff game before kickoff, using only data available prior to Wild Card Weekend. No peeking. No retroactive feature work. Just the model we shipped, applied honestly to a full 13-game postseason.
This post is the backtest. Every pick, every miss, the per-round accuracy, and a clear read on what the model is good at and where it gets beaten.
The Headline
9 correct out of 13 games. 69.2% accuracy.
The round-by-round split:
- Wild Card (6 games): 4/6 (66.7%)
- Divisional (4 games): 3/4 (75.0%)
- Conference Championships (2 games): 1/2 (50.0%)
- Super Bowl LX: 1/1 (100%)
For reference, here are public benchmarks for NFL playoff accuracy:
| Source | Typical playoff accuracy |
|---|---|
| FiveThirtyEight Elo | 62-66% |
| ESPN FPI playoff picks | 60-68% |
| "Chalk" (always pick higher seed) | 63-68% |
| Public expert average | 55-60% |
| Pinnacle closing-line favorites | 65-70% |
Our 69.2% is in the Pinnacle-closing-line zone. That's a competitive number, especially on a 13-game postseason where variance is high and home-field advantage is weaker than in the regular season.
The Super Bowl Call
Before Super Bowl LX at SoFi Stadium (a neutral site), the model gave New England only a 35.3% chance to win against Seattle. Public perception had the game closer: the closing moneyline had Seattle as a short favorite, and expert polls were split.
The model's case was structural: Seattle was clearly the stronger team by ELO going in, and neutral-site games give neither team the home-advantage bonus. That stripped the "Patriots at SoFi" narrative and let the pure team-strength numbers talk.
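For intuition, here is a minimal sketch of that neutral-site logic, assuming the classic logistic ELO win-probability formula and the +55 ELO home-field adjustment from the data-sources note at the end of the post. The deployed wp_model_NFL.pkl is a trained model, not this closed form, so its probabilities won't match exactly; the ratings below are made up to illustrate the mechanics.

```python
import math

HFA_ELO = 55  # home-field adjustment from the footnote; zeroed on neutral sites

def win_prob_home(elo_home: float, elo_away: float, neutral_site: bool) -> float:
    """Classic logistic ELO win probability with an optional home-field bump.

    An approximation for intuition only; the deployed model is a trained
    classifier, not this closed form.
    """
    diff = elo_home - elo_away + (0 if neutral_site else HFA_ELO)
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# Illustrative ratings: a ~105-point Seattle edge is what this formula needs
# to reproduce the published 35.3% for New England on a neutral field.
print(win_prob_home(1550, 1655, neutral_site=True))   # ~0.353 for NE
print(win_prob_home(1550, 1655, neutral_site=False))  # ~0.43 if HFA were wrongly applied
```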
Seattle won 29-13. Final Seahawks-Patriots margin was 16. The model had it right.
The Full 13-Game Breakdown
Every pick, every outcome, cleanly:
| Round | Matchup (home vs away) | Model P(home win) | Final (home-away) | Correct? |
|---|---|---|---|---|
| NFC Wild Card | CAR vs LAR | 32.5% | 31-34 | ✓ |
| NFC Wild Card | CHI vs GB | 40.3% | 31-27 | ✗ |
| AFC Wild Card | JAX vs BUF | 38.1% | 24-27 | ✓ |
| NFC Wild Card | PHI vs SF | 65.0% | 19-23 | ✗ |
| AFC Wild Card | NE vs LAC | 54.4% | 16-3 | ✓ |
| AFC Wild Card | PIT vs HOU | 43.6% | 6-30 | ✓ |
| AFC Divisional | DEN vs BUF | 51.6% | 33-30 | ✓ |
| NFC Divisional | SEA vs SF | 69.8% | 41-6 | ✓ |
| AFC Divisional | NE vs HOU | 46.0% | 28-16 | ✗ |
| NFC Divisional | CHI vs LAR | 38.1% | 17-20 | ✓ |
| AFC Championship | DEN vs NE | 70.6% | 7-10 | ✗ |
| NFC Championship | SEA vs LAR | 61.7% | 31-27 | ✓ |
| Super Bowl LX | NE vs SEA | 35.3% | 13-29 | ✓ |
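For anyone who wants to check the arithmetic, including the confidence-threshold record discussed below, here's a short snippet that recomputes the record; the data is just the table above transcribed into Python.

```python
# (round, home, away, P(home), home score, away score), transcribed from the table.
GAMES = [
    ("WC",  "CAR", "LAR", 0.325, 31, 34),
    ("WC",  "CHI", "GB",  0.403, 31, 27),
    ("WC",  "JAX", "BUF", 0.381, 24, 27),
    ("WC",  "PHI", "SF",  0.650, 19, 23),
    ("WC",  "NE",  "LAC", 0.544, 16,  3),
    ("WC",  "PIT", "HOU", 0.436,  6, 30),
    ("DIV", "DEN", "BUF", 0.516, 33, 30),
    ("DIV", "SEA", "SF",  0.698, 41,  6),
    ("DIV", "NE",  "HOU", 0.460, 28, 16),
    ("DIV", "CHI", "LAR", 0.381, 17, 20),
    ("CC",  "DEN", "NE",  0.706,  7, 10),
    ("CC",  "SEA", "LAR", 0.617, 31, 27),
    ("SB",  "NE",  "SEA", 0.353, 13, 29),
]

def pick_correct(p_home, home_pts, away_pts):
    # The model's pick is the home team iff P(home) > 50%.
    return (p_home > 0.5) == (home_pts > away_pts)

hits = sum(pick_correct(p, h, a) for _, _, _, p, h, a in GAMES)
print(f"overall: {hits}/{len(GAMES)} = {hits / len(GAMES):.1%}")  # 9/13 = 69.2%

for rnd in ("WC", "DIV", "CC", "SB"):
    rows = [g for g in GAMES if g[0] == rnd]
    w = sum(pick_correct(p, h, a) for _, _, _, p, h, a in rows)
    print(f"{rnd}: {w}/{len(rows)}")  # 4/6, 3/4, 1/2, 1/1

# Record on games where the model's pick probability cleared 60%:
confident = [g for g in GAMES if max(g[3], 1 - g[3]) > 0.60]
w = sum(pick_correct(p, h, a) for _, _, _, p, h, a in confident)
print(f">60% picks: {w}-{len(confident) - w}")  # 6-2
```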
The Biggest Misses
Four wrong picks. Here's what the model got wrong and why it mattered:
AFC Championship: DEN vs NE (model had DEN 70.6%)
The model's worst call. Denver hosted New England with a ~200-point ELO edge; add home-field advantage, and the model liked the Broncos at better than 70% to win. New England won 10-7 in a defensive slog. The game turned on one red-zone stop and one missed field goal, neither of which any pregame model could have predicted.
This is the classic "right in the long run, wrong on this one sample" problem. A team that wins 70% of the time still loses 30% of the time, and in a single Conference Championship game, you got the 30%.
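One way to quantify that: under the model's own probabilities, how many correct picks should we have expected? Summing the pick-side probability for each game (transcribed from the table above) gives the answer.

```python
# The model's probability on its own pick for each of the 13 games,
# i.e. max(P(home), 1 - P(home)), read off the prediction table.
pick_probs = [0.675, 0.597, 0.619, 0.650, 0.544, 0.564, 0.516,
              0.698, 0.540, 0.619, 0.706, 0.617, 0.647]

expected_correct = sum(pick_probs)  # each pick contributes its win probability
print(f"expected correct: {expected_correct:.2f} of 13, actual: 9")  # ~7.99
```

By the model's own lights, roughly 8 correct picks was par for this postseason, so 9 is a slightly lucky draw, not evidence that 69.2% is the model's true hit rate.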
NFC Wild Card: PHI vs SF (model had PHI 65.0%)
Philadelphia hosted San Francisco and was favored by pretty much everyone, including the model. SF won 23-19. The game was closer to a coin flip than the model realized: San Francisco's starting QB had returned from a mid-season injury, and that return hadn't fully filtered into the ELO, because our base ELO doesn't currently incorporate QB-specific adjustments.
AFC Divisional: NE vs HOU (model had NE only 46.0%)
Houston was the road team but strong on ELO, so the model leaned Houston. NE won 28-16. The miss wasn't catastrophic, since the model was close to a coin flip, but it lands in the "wrong by a small margin" bucket.
NFC Wild Card: CHI vs GB (model had CHI 40.3%)
Chicago hosted Green Bay but was the weaker team by ELO going in, so the model leaned Green Bay. Chicago won 31-27 in a shootout. These divisional rivalry games are notoriously hard to predict: familiarity compresses edges.
Where the Model Was Most Confident and Right
The strongest call of the postseason: Seattle 69.8% over San Francisco in the NFC Divisional round. Seattle won 41-6. A 35-point blowout in a game the model flagged as a clear edge.
The model was also in its sweet spot on the Seattle NFC Championship call (61.7% over the Rams) and the Super Bowl call (64.7% on Seattle, the flip side of New England's 35.3% as the designated home team).
If you had blindly bet the moneyline on every game where the model's pick probability exceeded 60%, you would have gone 6-2 across the postseason (the snippet under the prediction table recomputes this): cashing on LAR-CAR, BUF-JAX, SEA-SF, LAR-CHI, SEA-LAR, and SEA-NE, and losing on PHI-SF and DEN-NE.
What the Model Got Right Structurally
Two deeper wins worth flagging:
- Neutral-site adjustment on the Super Bowl. Most casual predictors forget to strip home-field advantage in the Super Bowl because the game is still played at a specific stadium. For NE's SB showdown, the model correctly gave them no HFA bump, which pushed their probability down to a fair 35% rather than an inflated 45%. That's the difference between the right answer and a miscalibrated one.
- The road-underdog picks. Of the 7 cases where the model picked the listed road team over the home team, it was right 5 times (LAR over CAR in the Wild Card, BUF over JAX, HOU over PIT, LAR over CHI in the Divisional, and SEA over NE at the neutral-site Super Bowl). That's a strong signal that the ELO-based pre-game features are doing real work, not just rubber-stamping the home team.
What the Model Needs to Improve
Three things this backtest tells us need work for next season:
- QB-specific ELO adjustments. The PHI-SF miss was largely because a mid-season QB injury return wasn't fully reflected. We need a player-level adjustment that decays when a starter returns from IR.
- Division rivalry damping. Division games should have compressed edges (teams know each other too well). We currently don't model that, and the CHI-GB and CHI-LAR Divisional games both moved the wrong direction.
- Tournament ECE sample-size note. Tournament-only ECE on the 13 playoff games came in at 28.9%, but that's small-sample noise (see the calibration sketch below). The more meaningful calibration metric is the training ECE of 9.35%, which is where we need to keep pushing. Our NCAAMB model is at 2.21%, so there's clearly room to improve NFL.
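For readers who want to see why 13 games can't support a stable ECE, here's a minimal sketch of the computation over the playoff slate, assuming ten equal-width bins; the exact figure shifts with binning choices, which is part of the problem at this sample size.

```python
import numpy as np

# P(home) and home-win outcomes for the 13 playoff games, from the table above.
p_home = np.array([0.325, 0.403, 0.381, 0.650, 0.544, 0.436, 0.516,
                   0.698, 0.460, 0.381, 0.706, 0.617, 0.353])
home_won = np.array([0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0])

def ece(probs, outcomes, n_bins=10):
    """Expected calibration error with equal-width bins (one common convention)."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs > lo) & (probs <= hi)
        if in_bin.any():
            gap = abs(probs[in_bin].mean() - outcomes[in_bin].mean())
            total += in_bin.mean() * gap  # bin weight x calibration gap
    return total

print(f"{ece(p_home, home_won):.1%}")  # ~29% on 13 games: mostly noise
```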
The Takeaway
We hit the Super Bowl call. We went 9-of-13 on the postseason. We were on the right side of every game where the model's pick probability cleared 60%, except two (PHI-SF and DEN-NE). The model earned a passing grade this cycle.
But the NFL model's calibration needs work, and we're open about that. Our sports coverage is only as useful as the probabilities are honest — and at 9.35% ECE, NFL is our weakest sport by calibration. Next season's model will incorporate player-level QB adjustments and rivalry damping, and we'll publish the updated ECE before kickoff.
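We haven't settled on a functional form for the QB adjustment yet, but here's a sketch of the kind of thing we mean; the shape and constants below are hypothetical, not a shipped feature. The idea: apply a full ELO penalty while the starter is out, then decay it over his first few games back.

```python
import math

def qb_elo_adjustment(penalty_elo: float, games_since_return, tau: float = 2.0) -> float:
    """Hypothetical QB adjustment (not the shipped feature): full ELO penalty
    while the starter is out, decaying exponentially once he returns.
    tau is the decay scale in games; games_since_return is None while he's out."""
    if games_since_return is None:
        return -penalty_elo
    return -penalty_elo * math.exp(-games_since_return / tau)

print(qb_elo_adjustment(90, None))  # -90.0 while the starter is out
print(qb_elo_adjustment(90, 2))     # ~-33.1 two games after his return
```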
Next up: Super Bowl LXI (2027) preseason futures. We'll publish team-by-team championship probabilities in July once we've finalized offseason adjustments. If you want to backtest your own playoff strategies against the same snapshot data, you can pull live NFL edges via the API — 7-day free trial, no credit card required.
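A pull might look something like the sketch below; only the /v1/backtest path comes from this post, so treat the base URL, auth header, query parameters, and response shape as placeholder assumptions and check the API docs for the real ones.

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder host, not the real one

# Hypothetical request: auth scheme, parameters, and response shape are assumed.
resp = requests.get(
    f"{BASE_URL}/v1/backtest",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params={"league": "NFL", "season": "2025-26"},
    timeout=10,
)
resp.raise_for_status()
for game in resp.json():  # assumed: a list of per-game prediction rows
    print(game)
```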
Data sources: ESPN NFL game data (public); ELO computed from game results with a basketball-style margin-of-victory (MoV) multiplier; home-field adjustment = +55 ELO (zeroed for the neutral-site Super Bowl). All 13 playoff games were held out of the ELO training set. Pre-game predictions use the deployed wp_model_NFL.pkl. The full prediction table is reproducible from the /v1/backtest endpoint.