
We Backtested Our Model on the 2026 March Madness Bracket. It Hit 71.6%.

2026-04-22 march-madness ncaamb backtest calibration ml college-basketball

The 2026 NCAA Men's Tournament wrapped up on April 6. Michigan beat UConn 69-63 to win the national championship. Arizona fell in the Final Four to Michigan. A lot of brackets burned.

I wanted to know what our NCAAMB win-probability model would have done if we had run it on every tournament game before tipoff, using only data that existed prior to Selection Sunday. No peeking. No retroactive feature engineering. Just the model we deployed in production, applied honestly to an out-of-sample bracket.
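That discipline is easy to state and easy to violate. In code it reduces to a timestamp check on every feature row. Here's a minimal sketch of the idea; the cutoff date, field names, and values are illustrative, not our actual pipeline:

```python
# Sketch of the no-peeking rule: every feature feeding a prediction must be
# timestamped before Selection Sunday. All values here are illustrative.
from datetime import date

SELECTION_SUNDAY = date(2026, 3, 15)  # assumed cutoff, for illustration

def assert_no_leakage(feature_rows):
    """Reject any feature row computed on or after the cutoff."""
    for row in feature_rows:
        if row["as_of"] >= SELECTION_SUNDAY:
            raise ValueError(f"leaked feature: {row}")
    return len(feature_rows)

rows = [
    {"team": "MICH", "elo": 1712.4, "as_of": date(2026, 3, 14)},
    {"team": "UCONN", "elo": 1698.1, "as_of": date(2026, 3, 13)},
]
print(assert_no_leakage(rows))  # both rows predate the cutoff
```

Running a check like this before generating predictions is what makes "out-of-sample" a verifiable property rather than a promise.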

This post is the full backtest. Every pick, every miss, the per-round accuracy, and how it compares to public benchmarks like KenPom, 538, and the "chalk" strategy of always picking the higher seed.

The Headline Number

48 correct out of 67 games. 71.6% accuracy.

That figure covers every game from the First Four through the National Championship.

For context, here's what public benchmarks have historically hit on the first round of March Madness:

| Source | Typical first-round accuracy |
| --- | --- |
| "Chalk" (always pick higher seed) | 71-73% |
| KenPom model | 70-72% |
| FiveThirtyEight predictions | 68-71% |
| Average ESPN bracket entry | 60-65% |
| Action Network experts | 60-65% |

Our 68.8% first-round accuracy is right in the KenPom zone. Overall 71.6% across the full tournament is solid, especially given that the later rounds have all high-caliber teams where edges compress.

But accuracy alone is not the whole story. Any model that just mirrors the seed line will hit ~72% in round one. The question is whether the model disagreed with the seed line in the right places. That's where this gets more interesting.

Where the Model Disagreed With the Bracket

Out of 67 games, the model predicted the nominally "lower seed" (the team listed as away in the bracket) would win 19 times. Of those 19 contrarian picks, 10 were correct (52.6%).

That sounds unimpressive until you compare it to the relevant baseline: since blindly picking the higher seed hits ~72%, underdogs win only about 28% of these games. A contrarian pick is a bet against the consensus, so hitting 52.6% on contrarian picks means more than half of the model's upset calls came through, far above the base rate at which upsets actually happen. That's exactly what you'd want from a calibrated probability system.
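The chalk/contrarian split can be scored with a few lines of bookkeeping. A minimal sketch, where the pick tuples and their values are made up for illustration (our real archive has 67 rows):

```python
# Sketch: split picks into chalk vs contrarian buckets and score each.
# Each tuple is (model_home_wp, home_is_favorite, home_won); the data
# below is illustrative, not the real 67-game archive.

def bucket_accuracy(picks):
    chalk, contrarian = [], []
    for home_wp, home_is_favorite, home_won in picks:
        model_picks_home = home_wp >= 0.5
        correct = model_picks_home == home_won
        # A pick is contrarian when the model sides against the favorite.
        if model_picks_home == home_is_favorite:
            chalk.append(correct)
        else:
            contrarian.append(correct)
    return (sum(chalk) / len(chalk) if chalk else None,
            sum(contrarian) / len(contrarian) if contrarian else None)

demo = [
    (0.62, True, True),    # chalk pick, correct
    (0.41, True, False),   # contrarian pick, correct (underdog won)
    (0.55, True, False),   # chalk pick, wrong
    (0.35, True, True),    # contrarian pick, wrong
]
print(bucket_accuracy(demo))  # -> (0.5, 0.5)
```

The key design point is that the two buckets have different baselines: chalk accuracy should beat ~72% to add value, while contrarian accuracy only needs to beat the underdog base rate.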

Here are the correctly-called underdog picks the model flagged:

| Round | Game | Model's "underdog" | Result |
| --- | --- | --- | --- |
| First Four | SMU vs M-OH | Miami (OH) | M-OH won 89-79 |
| First Four | LEH vs PV | Prairie View | PV won |
| 1st Round | WIS vs HPU | High Point | HPU won |
| 1st Round | UNC vs VCU | VCU | VCU won |
| 1st Round | OSU vs TCU | TCU | TCU won |
| 1st Round | UGA vs SLU | Saint Louis | SLU won |
| 1st Round | VILL vs USU | Utah State | USU won |
| 2nd Round | KU vs SJU | Saint Joseph's | SJU won |
| Sweet 16 | NEB vs IOWA | Iowa | Iowa won |
| Final Four | ARIZ vs MICH | Michigan | MICH won 91-73 |

That last one is the one I'm happiest about. Going into the Final Four, Michigan was the underdog against Arizona by most metrics. The model gave Michigan a 70.1% win probability — a full standard deviation above the consensus line. Michigan then won by 18.

The model also got the National Championship right, giving Michigan a 55.1% edge over UConn. Michigan won 69-63.

The Misses

Every model has misses. Honesty about them is the only thing that makes the 71.6% meaningful. Here are the worst ones:

| Miss | Model said | Actually |
| --- | --- | --- |
| NCSU vs TEX (First Four) | NC State 79.9% to win | Texas won 68-66 |
| GONZ vs TEX (2nd Round) | Gonzaga 74.9% | Texas won |
| UVA vs TENN (2nd Round) | Virginia 72.5% | Tennessee won |
| HOU vs ILL (Sweet 16) | Houston 61.1% | Illinois won |

The NCSU-TEX loss hurts the most. The model gave NC State almost 80% — a confident pick that died in a two-point game. That's the kind of call where you get punished for confidence without context (injury news, late-breaking lineup changes, tournament-specific coaching adjustments).

The model also missed in the opposite direction four times, picking upsets that didn't happen.

All four were "leans" rather than confident picks (model WP between 38% and 49%), which is the right behavior. But "I was only mildly wrong" is still wrong.

Where the Model Was Most Confident — And Why That Matters

The 10 games the model had the highest confidence in went 9-1. Here they are:

| Home | Away | Model WP (home) | Actual | Round |
| --- | --- | --- | --- | --- |
| DUKE | SIE | 85.2% | Duke won ✓ | 1st Round |
| PUR | QUC | 85.2% | Purdue won ✓ | 1st Round |
| NCSU | TEX | 80.0% | Texas won ✗ | First Four |
| NEB | VAN | 80.0% | Nebraska won ✓ | 2nd Round |
| ARIZ | USU | 80.0% | Arizona won ✓ | 2nd Round |
| HOU | IDHO | 77.0% | Houston won ✓ | 1st Round |
| NEB | TROY | 77.0% | Nebraska won ✓ | 1st Round |
| MIA | MIZ | 77.0% | Miami won ✓ | 1st Round |
| PUR | MIA | 77.0% | Purdue won ✓ | 2nd Round |
| ARIZ | ARK | 75.3% | Arizona won ✓ | Sweet 16 |

9 out of 10 correct when the model said "this team wins 75%+ of the time." That's exactly the calibration behavior you want: when the model is confident, it should be right at roughly its stated confidence level.
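One way to sanity-check that claim is to compare the realized hit rate against the average stated confidence for every pick above a threshold. A minimal sketch, using illustrative probabilities rather than the actual archive:

```python
# Sketch: do high-confidence calls hit at roughly their stated rate?
# Each tuple is (model WP for the pick, whether the pick won); the
# numbers below are illustrative, not the real 67-game file.

def hit_rate_above(games, threshold):
    confident = [(wp, won) for wp, won in games if wp >= threshold]
    if not confident:
        return None, None
    realized = sum(won for _, won in confident) / len(confident)
    stated = sum(wp for wp, _ in confident) / len(confident)
    return realized, stated

games = [(0.85, True), (0.85, True), (0.80, False), (0.80, True),
         (0.80, True), (0.77, True), (0.77, True), (0.77, True),
         (0.77, True), (0.75, True)]
realized, stated = hit_rate_above(games, 0.75)
print(f"realized {realized:.0%} vs stated {stated:.1%}")
```

With only ten games in a bucket, a realized rate a few points above or below the stated average is entirely within noise; the comparison only becomes binding over many tournaments.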

Per-Round Accuracy: Why Sweet 16 Is Always the Trap

The Sweet 16 was our worst round at 62.5% accuracy, but that's not a model failure; it's a fundamental feature of the tournament.

By the Sweet 16, every remaining team has won at least two games against decent opposition. The talent gap compresses. ELO differentials are smaller. Matchup-specific factors (style fit, rest, injury status) start to dominate. Historically, almost every serious prediction model, including KenPom and 538, underperforms its overall accuracy in the Sweet 16, then recovers in the Elite 8 and Final Four as the true contenders separate.

This is why we publish ECE (Expected Calibration Error) alongside accuracy. ECE measures whether the model's confidence is honest: if it says 70%, does it hit 70%? Our tournament-only ECE came in at 13.4%, which is higher than the training-set ECE of 2.2% — but that's on a sample of 67 games, which is tiny. The confidence intervals swamp any real signal.
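The standard way to compute ECE is to bin predictions by confidence, then average the gap between each bin's accuracy and its mean confidence, weighted by bin size. A minimal sketch; the ten probabilities and outcomes below are made up for illustration, not drawn from the tournament file:

```python
# Sketch of an ECE calculation: bin predictions by confidence, then take
# the size-weighted average of |accuracy - confidence| per bin.

def expected_calibration_error(probs, outcomes, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[idx].append((p, y))
    ece, n = 0.0, len(probs)
    for b in bins:
        if not b:
            continue
        conf = sum(p for p, _ in b) / len(b)
        acc = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(acc - conf)
    return ece

probs = [0.92, 0.84, 0.81, 0.72, 0.72, 0.63, 0.63, 0.56, 0.52, 0.44]
outcomes = [1, 1, 1, 1, 0, 1, 0, 1, 0, 0]
print(round(expected_calibration_error(probs, outcomes), 3))
```

Note the sample-size problem baked into this metric: with 67 games spread across ten bins, several bins hold a handful of games each, so individual bin accuracies are extremely noisy. That's why the 13.4% tournament-only figure shouldn't be read against the 2.2% training-set figure at face value.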

What We're Taking From This

Three takeaways that are directly informing the next model version:

  1. The basic calibration held. 71.6% accuracy and 100% on the Final Four + Championship round are what you'd expect from a top-tier model. The features that matter most in regular-season play (ELO diff, pace differential, offensive/defensive rating) continue to matter in neutral-site tournament play.

  2. Confident picks were reliable. 9-out-of-10 on the top 10 highest-confidence calls. If you're subscribing to a win-probability API, you want to know that the model's 80% calls actually win 80% of the time. Ours did.

  3. The Sweet 16 variance is structural, not fixable. We're not going to "fix" 62.5% accuracy in the Sweet 16 by adding more features. The compressed talent gap means every model lives in the 50-65% zone in that round. Our betting model handles this correctly by reducing position size when model edge is small — which for a binary prediction market is the only mathematically correct response to high uncertainty.
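The sizing logic behind takeaway 3 follows the Kelly criterion for a binary market: if a yes-share costs `price` and pays 1, the full-Kelly stake is `(p - price) / (1 - price)`, which shrinks toward zero as the model's edge compresses. A sketch under those standard assumptions; the function, its parameters, and the fractional multiplier are illustrative, not the production betting model:

```python
# Sketch: Kelly staking for a binary prediction market, with a fractional
# multiplier to damp positions further. All values are illustrative.

def kelly_fraction(p_model, price, multiplier=0.25):
    """Fraction of bankroll to stake on a yes-share costing `price`
    that pays 1 if the event happens."""
    if not 0 < price < 1:
        raise ValueError("price must be in (0, 1)")
    edge = p_model - price
    if edge <= 0:
        return 0.0  # no positive edge: stake nothing
    full_kelly = edge / (1 - price)
    return multiplier * full_kelly

# A wide edge sizes a real position; a compressed Sweet-16 edge sizes
# almost nothing, which is the behavior described above.
print(kelly_fraction(0.70, 0.55))
print(kelly_fraction(0.57, 0.55))
```

Because the stake is proportional to the edge, uncertainty about the edge translates directly into smaller positions, with no separate "be careful in the Sweet 16" rule needed.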

The Full Data

Every one of the 67 picks, with model WP, final score, and result, is available in our public backtest archive. If you want to see how the model performs on your own historical data or run your own pre-game predictions with the same ELO + priors pipeline, you can do that via the API.

Next up: the 2027 preseason model. We're re-training on the full 2025-26 season (including the tournament games as a fresh calibration batch), and we'll publish an ECE report against the updated holdout. That work goes live in August.

If you want to backtest your own basketball strategies against the same snapshot data we use internally, the API gives you read access to ~12,000 NCAAMB games across three seasons plus every tournament snapshot from 2023-26. 7-day free trial, no credit card required to start.


Data sources: ESPN game data (public); ELO computed from game results; team-level pace/ORTG/DRTG priors from season-to-date box scores. Pre-game predictions use the exact same model weights (wp_model_NCAAMB.pkl) deployed in production on zenhodl.net. All 67 tournament games were held out of the ELO training set and the model's calibration set — nothing about the tournament was used to generate the predictions. The full backtest script and predictions file are reproducible from the /v1/backtest endpoint.
