The 2026 NCAA Men's Tournament wrapped up on April 6. Michigan beat UConn 69-63 to win the national championship. Arizona fell in the Final Four to Michigan. A lot of brackets burned.
I wanted to know what our NCAAMB win-probability model would have done if we had run it on every tournament game before tipoff, using only data that existed prior to Selection Sunday. No peeking. No retroactive feature engineering. Just the model we deployed in production, applied honestly to an out-of-sample bracket.
This post is the full backtest. Every pick, every miss, the per-round accuracy, and how it compares to public benchmarks like KenPom, 538, and the "chalk" strategy of always picking the higher seed.
The Headline Number
48 correct out of 67 games. 71.6% accuracy.
That includes:
- First Four: 3/4 (75%)
- First Round: 22/32 (68.8%)
- Second Round: 12/16 (75.0%)
- Sweet 16: 5/8 (62.5%)
- Elite 8: 3/4 (75.0%)
- Final Four + Championship: 3/3 (100%)
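The per-round splits reconcile exactly with the headline number; a quick sanity check:

```python
# Per-round results from the post; each entry is (correct, games).
rounds = {
    "First Four":        (3, 4),
    "First Round":       (22, 32),
    "Second Round":      (12, 16),
    "Sweet 16":          (5, 8),
    "Elite 8":           (3, 4),
    "Final Four + Chip": (3, 3),
}

correct = sum(c for c, _ in rounds.values())
games   = sum(g for _, g in rounds.values())
print(correct, games, round(100 * correct / games, 1))  # 48 67 71.6
```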
For context, here's what public benchmarks have historically hit on the first round of March Madness:
| Source | Typical first-round accuracy |
|---|---|
| "Chalk" (always pick higher seed) | 71-73% |
| KenPom model | 70-72% |
| FiveThirtyEight predictions | 68-71% |
| Average ESPN bracket entry | 60-65% |
| Action Network experts | 60-65% |
Our 68.8% first-round accuracy sits just below the KenPom range and squarely inside FiveThirtyEight's. The overall 71.6% across the full tournament is solid, especially given that the later rounds pit high-caliber teams against each other, where edges compress.
But accuracy alone is not the whole story. Any model that just mirrors the seed line will hit ~72% in round one. The question is whether the model disagreed with the seed line in the right places. That's where this gets more interesting.
Where the Model Disagreed With the Bracket
Out of 67 games, the model went against the seed line 19 times, picking the nominal underdog (the team listed as away in the bracket) to win. Of those 19 contrarian picks, 10 were correct (52.6%).
That sounds unimpressive until you compare it to the baseline: blindly picking the higher seed hits ~72% overall because favorites win most games. A contrarian pick is a bet against the consensus. Hitting 52.6% on contrarian picks means the model identified upsets with roughly fair odds — which is exactly what you'd want from a calibrated probability system.
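One way to make that concrete: chalk takes the opposite side of every contrarian pick, so on that 19-game subset the comparison is head-to-head, and the model won it 10-9:

```python
# Numbers from the post: 19 anti-seed-line picks, 10 of which won.
contrarian_picks = 19
contrarian_hits  = 10   # games where the model's underdog actually won

# Chalk always takes the favorite, so on those same 19 games chalk
# wins exactly when the underdog lost:
chalk_hits = contrarian_picks - contrarian_hits   # 9

model_acc = contrarian_hits / contrarian_picks    # ~0.526
chalk_acc = chalk_hits / contrarian_picks         # ~0.474
print(round(model_acc, 3), round(chalk_acc, 3))   # 0.526 0.474
```

On the games where the model and the seed line agreed, both score identically, so this subset is the entire difference between the two strategies.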
Here are the correctly-called underdog picks the model flagged:
| Round | Game | Model's "underdog" | Result |
|---|---|---|---|
| First Four | SMU vs M-OH | Miami (OH) | M-OH won 89-79 |
| First Four | LEH vs PV | Prairie View | PV won |
| 1st Round | WIS vs HPU | High Point | HPU won |
| 1st Round | UNC vs VCU | VCU | VCU won |
| 1st Round | OSU vs TCU | TCU | TCU won |
| 1st Round | UGA vs SLU | Saint Louis | SLU won |
| 1st Round | VILL vs USU | Utah State | USU won |
| 2nd Round | KU vs SJU | Saint Joseph's | SJU won |
| Sweet 16 | NEB vs IOWA | Iowa | Iowa won |
| Final Four | ARIZ vs MICH | Michigan | MICH won 91-73 |
That last one is the one I'm happiest about. Going into the Final Four, Michigan was the underdog against Arizona by most metrics. The model gave Michigan a 70.1% win probability — a full standard deviation above the consensus line. Michigan then won by 18.
The model also got the National Championship right, giving Michigan a 55.1% edge over UConn. Michigan won 69-63.
The Misses
Every model has misses. Honesty about them is the only thing that makes the 71.6% meaningful. Here are the worst ones:
| Miss | Model said | Actually |
|---|---|---|
| NCSU vs TEX (First Four) | NC State 79.9% to win | Texas won 68-66 |
| GONZ vs TEX (2nd Round) | Gonzaga 74.9% | Texas won |
| UVA vs TENN (2nd Round) | Virginia 72.5% | Tennessee won |
| HOU vs ILL (Sweet 16) | Houston 61.1% | Illinois won |
The NCSU-TEX loss hurts the most. The model gave NC State almost 80% — a confident pick that died in a two-point game. That's the kind of call where you get punished for confidence without context (injury news, late-breaking lineup changes, tournament-specific coaching adjustments).
The model also missed in the opposite direction four times, picking upsets that didn't happen:
- ARK > HAW (Arkansas won; model liked Hawaii)
- VAN > MCN (Vandy won; model liked McNeese)
- TTU > AKR (Tech won; model liked Akron)
- UK > SCU (Kentucky won; model liked Santa Clara)
All four were "leans" rather than confident picks (model WP between 38-49%), which is the right behavior. But "I was only mildly wrong" is still wrong.
Where the Model Was Most Confident — And Why That Matters
The 10 games the model had the highest confidence in went 9-1. Here they are:
| Home | Away | Model WP (home) | Actual | Round |
|---|---|---|---|---|
| DUKE | SIE | 85.2% | Duke won ✓ | 1st Round |
| PUR | QUC | 85.2% | Purdue won ✓ | 1st Round |
| NCSU | TEX | 80.0% | Texas won ✗ | First Four |
| NEB | VAN | 80.0% | Nebraska won ✓ | 2nd Round |
| ARIZ | USU | 80.0% | Arizona won ✓ | 2nd Round |
| HOU | IDHO | 77.0% | Houston won ✓ | 1st Round |
| NEB | TROY | 77.0% | Nebraska won ✓ | 1st Round |
| MIA | MIZ | 77.0% | Miami won ✓ | 1st Round |
| PUR | MIA | 77.0% | Purdue won ✓ | 2nd Round |
| ARIZ | ARK | 75.3% | Arizona won ✓ | Sweet 16 |
9 out of 10 correct when the model said "this team wins 75%+ of the time." That's exactly the calibration behavior you want: when the model is confident, it should be right at roughly its stated confidence level.
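A quick reliability check on just that bucket, using the table above: the empirical hit rate (90%) against the mean stated confidence (~79%), which is the right side of the line to miss on with a sample this small.

```python
# The ten highest-confidence calls from the table: (model WP for the pick, pick won?)
top10 = [
    (0.852, True),   # DUKE over SIE
    (0.852, True),   # PUR over QUC
    (0.800, False),  # NCSU over TEX -- the one miss
    (0.800, True),   # NEB over VAN
    (0.800, True),   # ARIZ over USU
    (0.770, True),   # HOU over IDHO
    (0.770, True),   # NEB over TROY
    (0.770, True),   # MIA over MIZ
    (0.770, True),   # PUR over MIA
    (0.753, True),   # ARIZ over ARK
]

hits = sum(won for _, won in top10)
mean_conf = sum(wp for wp, _ in top10) / len(top10)
print(hits / len(top10), round(mean_conf, 3))  # 0.9 0.794
```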
Per-Round Accuracy: Why Sweet 16 Is Always the Trap
The 62.5% accuracy in the Sweet 16 looks worst on the table, but it's not a model failure — it's a fundamental feature of the tournament.
By Sweet 16, every remaining team has won at least two games against decent opposition. The talent gap compresses. ELO differentials are smaller. Matchup-specific factors (style fit, rest, injury status) start to dominate. Historically, almost every serious prediction model — including KenPom and 538 — underperforms its overall accuracy in the Sweet 16, then recovers in the Elite 8 and Final Four as the true contenders separate.
This is why we publish ECE (Expected Calibration Error) alongside accuracy. ECE measures whether the model's confidence is honest: if it says 70%, does it hit 70%? Our tournament-only ECE came in at 13.4%, which is higher than the training-set ECE of 2.2% — but that's on a sample of 67 games, which is tiny. The confidence intervals swamp any real signal.
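For readers who want the metric concretely: ECE bins predictions by stated confidence and takes the frequency-weighted average gap between each bin's mean confidence and its empirical hit rate. A minimal equal-width-bin sketch (the 10-bin default is an assumption; the post doesn't publish its binning scheme):

```python
def ece(probs, outcomes, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins.

    probs    -- predicted win probabilities in [0, 1]
    outcomes -- 1 if the predicted event happened, else 0
    """
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # clamp so p == 1.0 lands in the top bin
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = 0.0
    for bucket in bins:
        if bucket:
            conf = sum(p for p, _ in bucket) / len(bucket)
            hit  = sum(y for _, y in bucket) / len(bucket)
            total += (len(bucket) / n) * abs(hit - conf)
    return total

# Two occupied bins, each 0.35 off its stated confidence -> ECE of 0.35.
print(round(ece([0.85, 0.85, 0.65, 0.65], [1, 0, 1, 1]), 4))  # 0.35
```

A perfectly calibrated set of predictions scores 0; on 67 games, single results move the number by whole percentage points, which is why the 13.4% figure carries wide error bars.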
What We're Taking From This
Three takeaways that are directly informing the next model version:
- The basic calibration held. 71.6% accuracy and 100% on the Final Four + Championship round are what you'd expect from a top-tier model. The features that matter most in regular-season play (ELO diff, pace differential, offensive/defensive rating) continue to matter in neutral-site tournament play.
- Confident picks were reliable. 9-out-of-10 on the top 10 highest-confidence calls. If you're subscribing to a win-probability API, you want to know that the model's 80% calls actually win 80% of the time. Ours did.
- The Sweet 16 variance is structural, not fixable. We're not going to "fix" 62.5% accuracy in the Sweet 16 by adding more features. The compressed talent gap means every model lives in the 50-65% zone in that round. Our betting model handles this correctly by reducing position size when model edge is small — which for a binary prediction market is the only mathematically correct response to high uncertainty.
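The post doesn't spell out the sizing rule, but the standard family of rules with this shrink-when-uncertain property is fractional Kelly. A sketch (the quarter-Kelly `scale=0.25` is an assumption, not our production parameter):

```python
def kelly_fraction(p, decimal_odds, scale=0.25):
    """Fractional Kelly stake for a binary bet: f* = (p*b - q) / b,
    with b = decimal_odds - 1 and q = 1 - p. Scaled down (quarter-Kelly
    here) and floored at zero, so small or negative edges get small or
    zero size automatically."""
    b = decimal_odds - 1.0
    q = 1.0 - p
    f_star = (p * b - q) / b
    return max(0.0, scale * f_star)

# A real edge gets real size; a coin-flip lean gets almost nothing.
print(round(kelly_fraction(0.70, 2.0), 4))  # 0.1   -> 10% of bankroll
print(round(kelly_fraction(0.52, 2.0), 4))  # 0.01  -> 1% of bankroll
```

At even odds, a 52% lean stakes one-tenth of what a 70% call does, which is exactly the behavior the Sweet 16 demands.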
The Full Data
Every one of the 67 picks, with model WP, final score, and result, is available in our public backtest archive. If you want to see how the model performs on your own historical data or run your own pre-game predictions with the same ELO + priors pipeline, you can do that via the API.
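If you want to script it, a hypothetical sketch of pulling the backtest file over HTTP — the `/v1/backtest` path and the `zenhodl.net` host come from this post, but the query parameters, auth header, and response shape are assumptions; check the API docs:

```python
import json
import urllib.request

BASE = "https://zenhodl.net"
# Query parameters below are illustrative guesses, not documented values.
BACKTEST_URL = f"{BASE}/v1/backtest?league=NCAAMB&season=2026"

def fetch_backtest(api_key):
    """Fetch the per-game backtest rows (model WP, final score, result)."""
    req = urllib.request.Request(
        BACKTEST_URL,
        headers={"Authorization": f"Bearer {api_key}"},  # auth scheme assumed
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```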
Next up: the 2027 preseason model. We're re-training on the full 2025-26 season (including the tournament games as a fresh calibration batch), and we'll publish an ECE report against the updated holdout. That work goes live in August.
If you want to backtest your own basketball strategies against the same snapshot data we use internally, the API gives you read access to ~12,000 NCAAMB games across three seasons plus every tournament snapshot from 2023-26. 7-day free trial, no credit card required to start.
Data sources: ESPN game data (public); ELO computed from game results; team-level pace/ORTG/DRTG priors from season-to-date box scores. Pre-game predictions use the exact same model weights (wp_model_NCAAMB.pkl) deployed in production on zenhodl.net. All 67 tournament games were held out of the ELO training set and the model's calibration set — nothing about the tournament was used to generate the predictions. The full backtest script and predictions file are reproducible from the /v1/backtest endpoint.