The 2026 NCAA Men's Tournament wrapped up on April 6. Michigan beat UConn 69-63 to win the national championship. Arizona fell in the Final Four to Michigan. A lot of brackets burned.
I wanted to know what our NCAAMB win-probability model would have done if we had run it on every tournament game before tipoff, using only data that existed prior to Selection Sunday. No peeking. No retroactive feature engineering. Just the model we deployed in production, applied honestly to an out-of-sample bracket.
This post is the full backtest. Every pick, every miss, the per-round accuracy, and how it compares to public benchmarks like KenPom, 538, and the "chalk" strategy of always picking the higher seed.
The Headline Number
48 correct out of 67 games. 71.6% accuracy.
That includes:
- First Four: 3/4 (75%)
- First Round: 22/32 (68.8%)
- Second Round: 12/16 (75.0%)
- Sweet 16: 5/8 (62.5%)
- Elite 8: 3/4 (75.0%)
- Final Four + Championship: 3/3 (100%)
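The per-round splits reconcile exactly with the headline number; a quick sanity check:

```python
# Per-round results from the post; each entry is (correct, games).
rounds = {
    "First Four":        (3, 4),
    "First Round":       (22, 32),
    "Second Round":      (12, 16),
    "Sweet 16":          (5, 8),
    "Elite 8":           (3, 4),
    "Final Four + Chip": (3, 3),
}

correct = sum(c for c, _ in rounds.values())
games   = sum(g for _, g in rounds.values())
print(correct, games, round(100 * correct / games, 1))  # 48 67 71.6
```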
For context, here's what public benchmarks have historically hit on the first round of March Madness:
| Source | Typical first-round accuracy |
|---|---|
| "Chalk" (always pick higher seed) | 71-73% |
| KenPom model | 70-72% |
| FiveThirtyEight predictions | 68-71% |
| Average ESPN bracket entry | 60-65% |
| Action Network experts | 60-65% |
Our 68.8% first-round accuracy sits just below the KenPom range and squarely inside FiveThirtyEight's. The overall 71.6% across the full tournament is solid, especially given that the later rounds pit high-caliber teams against each other, where edges compress.
But accuracy alone is not the whole story. Any model that just mirrors the seed line will hit ~72% in round one. The question is whether the model disagreed with the seed line in the right places. That's where this gets more interesting.
Where the Model Disagreed With the Bracket
Out of 67 games, the model went against the seed line 19 times, picking the nominal underdog (the team listed as away in the bracket) to win. Of those 19 contrarian picks, 10 were correct (52.6%).
That sounds unimpressive until you compare it to the baseline: blindly picking the higher seed hits ~72% overall because favorites win most games. A contrarian pick is a bet against the consensus. Hitting 52.6% on contrarian picks means the model identified upsets with roughly fair odds — which is exactly what you'd want from a calibrated probability system.
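One way to make that concrete: chalk takes the opposite side of every contrarian pick, so on that 19-game subset the comparison is head-to-head, and the model won it 10-9:

```python
# Numbers from the post: 19 anti-seed-line picks, 10 of which won.
contrarian_picks = 19
contrarian_hits  = 10   # games where the model's underdog actually won

# Chalk always takes the favorite, so on those same 19 games chalk
# wins exactly when the underdog lost:
chalk_hits = contrarian_picks - contrarian_hits   # 9

model_acc = contrarian_hits / contrarian_picks    # ~0.526
chalk_acc = chalk_hits / contrarian_picks         # ~0.474
print(round(model_acc, 3), round(chalk_acc, 3))   # 0.526 0.474
```

On the games where the model and the seed line agreed, both score identically, so this subset is the entire difference between the two strategies.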
Here are the correctly-called underdog picks the model flagged:
| Round | Game | Model's "underdog" | Result |
|---|---|---|---|
| First Four | SMU vs M-OH | Miami (OH) | M-OH won 89-79 |
| First Four | LEH vs PV | Prairie View | PV won |
| 1st Round | WIS vs HPU | High Point | HPU won |
| 1st Round | UNC vs VCU | VCU | VCU won |
| 1st Round | OSU vs TCU | TCU | TCU won |
| 1st Round | UGA vs SLU | Saint Louis | SLU won |
| 1st Round | VILL vs USU | Utah State | USU won |
| 2nd Round | KU vs SJU | Saint Joseph's | SJU won |
| Sweet 16 | NEB vs IOWA | Iowa | Iowa won |
| Final Four | ARIZ vs MICH | Michigan | MICH won 91-73 |
That last one is the one I'm happiest about. Going into the Final Four, Michigan was the underdog against Arizona by most metrics. The model gave Michigan a 70.1% win probability — a full standard deviation above the consensus line. Michigan then won by 18.
The model also got the National Championship right, giving Michigan a 55.1% edge over UConn. Michigan won 69-63.
The Misses
Every model has misses. Honesty about them is the only thing that makes the 71.6% meaningful. Here are the worst ones:
| Miss | Model said | Actually |
|---|---|---|
| NCSU vs TEX (First Four) | NC State 79.9% to win | Texas won 68-66 |
| GONZ vs TEX (2nd Round) | Gonzaga 74.9% | Texas won |
| UVA vs TENN (2nd Round) | Virginia 72.5% | Tennessee won |
| HOU vs ILL (Sweet 16) | Houston 61.1% | Illinois won |
The NCSU-TEX loss hurts the most. The model gave NC State almost 80% — a confident pick that died in a two-point game. That's the kind of call where you get punished for confidence without context (injury news, late-breaking lineup changes, tournament-specific coaching adjustments).
The model also missed in the opposite direction four times, picking upsets that didn't happen:
- ARK > HAW (Arkansas won; model liked Hawaii)
- VAN > MCN (Vandy won; model liked McNeese)
- TTU > AKR (Tech won; model liked Akron)
- UK > SCU (Kentucky won; model liked Santa Clara)
All four were "leans" rather than confident picks (model WP between 38-49%), which is the right behavior. But "I was only mildly wrong" is still wrong.
Where the Model Was Most Confident — And Why That Matters
The 10 games the model had the highest confidence in went 9-1. Here they are:
| Home | Away | Model WP (home) | Actual | Round |
|---|---|---|---|---|
| DUKE | SIE | 85.2% | Duke won ✓ | 1st Round |
| PUR | QUC | 85.2% | Purdue won ✓ | 1st Round |
| NCSU | TEX | 80.0% | Texas won ✗ | First Four |
| NEB | VAN | 80.0% | Nebraska won ✓ | 2nd Round |
| ARIZ | USU | 80.0% | Arizona won ✓ | 2nd Round |
| HOU | IDHO | 77.0% | Houston won ✓ | 1st Round |
| NEB | TROY | 77.0% | Nebraska won ✓ | 1st Round |
| MIA | MIZ | 77.0% | Miami won ✓ | 1st Round |
| PUR | MIA | 77.0% | Purdue won ✓ | 2nd Round |
| ARIZ | ARK | 75.3% | Arizona won ✓ | Sweet 16 |
9 out of 10 correct when the model said "this team wins 75%+ of the time." That's exactly the calibration behavior you want: when the model is confident, it should be right at roughly its stated confidence level.
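A quick reliability check on just that bucket, using the table above: the empirical hit rate (90%) against the mean stated confidence (~79%), which is the right side of the line to miss on with a sample this small.

```python
# The ten highest-confidence calls from the table: (model WP for the pick, pick won?)
top10 = [
    (0.852, True),   # DUKE over SIE
    (0.852, True),   # PUR over QUC
    (0.800, False),  # NCSU over TEX -- the one miss
    (0.800, True),   # NEB over VAN
    (0.800, True),   # ARIZ over USU
    (0.770, True),   # HOU over IDHO
    (0.770, True),   # NEB over TROY
    (0.770, True),   # MIA over MIZ
    (0.770, True),   # PUR over MIA
    (0.753, True),   # ARIZ over ARK
]

hits = sum(won for _, won in top10)
mean_conf = sum(wp for wp, _ in top10) / len(top10)
print(hits / len(top10), round(mean_conf, 3))  # 0.9 0.794
```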
Per-Round Accuracy: Why Sweet 16 Is Always the Trap
The 62.5% accuracy in the Sweet 16 looks worst on the table, but it's not a model failure — it's a fundamental feature of the tournament.
By Sweet 16, every remaining team has won at least two games against decent opposition. The talent gap compresses. ELO differentials are smaller. Matchup-specific factors (style fit, rest, injury status) start to dominate. Historically, almost every serious prediction model — including KenPom and 538 — underperforms its overall accuracy in the Sweet 16, then recovers in the Elite 8 and Final Four as the true contenders separate.
This is why we publish ECE (Expected Calibration Error) alongside accuracy. ECE measures whether the model's confidence is honest: if it says 70%, does it hit 70%? Our tournament-only ECE came in at 13.4%, which is higher than the training-set ECE of 2.2% — but that's on a sample of 67 games, which is tiny. The confidence intervals swamp any real signal.
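For readers who want the metric concretely: ECE bins predictions by stated confidence and takes the frequency-weighted average gap between each bin's mean confidence and its empirical hit rate. A minimal equal-width-bin sketch (the 10-bin default is an assumption; the post doesn't publish its binning scheme):

```python
def ece(probs, outcomes, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins.

    probs    -- predicted win probabilities in [0, 1]
    outcomes -- 1 if the predicted event happened, else 0
    """
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        # clamp so p == 1.0 lands in the top bin
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = 0.0
    for bucket in bins:
        if bucket:
            conf = sum(p for p, _ in bucket) / len(bucket)
            hit  = sum(y for _, y in bucket) / len(bucket)
            total += (len(bucket) / n) * abs(hit - conf)
    return total

# Two occupied bins, each 0.35 off its stated confidence -> ECE of 0.35.
print(round(ece([0.85, 0.85, 0.65, 0.65], [1, 0, 1, 1]), 4))  # 0.35
```

A perfectly calibrated set of predictions scores 0; on 67 games, single results move the number by whole percentage points, which is why the 13.4% figure carries wide error bars.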
What We're Taking From This
Three takeaways that are directly informing the next model version:
- The basic calibration held. 71.6% accuracy and 100% on the Final Four + Championship round are what you'd expect from a top-tier model. The features that matter most in regular-season play (ELO diff, pace differential, offensive/defensive rating) continue to matter in neutral-site tournament play.
- Confident picks were reliable. 9-out-of-10 on the top 10 highest-confidence calls. If you're subscribing to a win-probability API, you want to know that the model's 80% calls actually win 80% of the time. Ours did.
- The Sweet 16 variance is structural, not fixable. We're not going to "fix" 62.5% accuracy in the Sweet 16 by adding more features. The compressed talent gap means every model lives in the 50-65% zone in that round. Our betting model handles this correctly by reducing position size when model edge is small — which for a binary prediction market is the only mathematically correct response to high uncertainty.
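The post doesn't spell out the sizing rule, but the standard family of rules with this shrink-when-uncertain property is fractional Kelly. A sketch (the quarter-Kelly `scale=0.25` is an assumption, not our production parameter):

```python
def kelly_fraction(p, decimal_odds, scale=0.25):
    """Fractional Kelly stake for a binary bet: f* = (p*b - q) / b,
    with b = decimal_odds - 1 and q = 1 - p. Scaled down (quarter-Kelly
    here) and floored at zero, so small or negative edges get small or
    zero size automatically."""
    b = decimal_odds - 1.0
    q = 1.0 - p
    f_star = (p * b - q) / b
    return max(0.0, scale * f_star)

# A real edge gets real size; a coin-flip lean gets almost nothing.
print(round(kelly_fraction(0.70, 2.0), 4))  # 0.1   -> 10% of bankroll
print(round(kelly_fraction(0.52, 2.0), 4))  # 0.01  -> 1% of bankroll
```

At even odds, a 52% lean stakes one-tenth of what a 70% call does, which is exactly the behavior the Sweet 16 demands.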
The Full Data
Every one of the 67 picks, with model WP, final score, and result, is available in our public backtest archive. If you want to see how the model performs on your own historical data or run your own pre-game predictions with the same ELO + priors pipeline, you can do that via the API.
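If you want to script it, a hypothetical sketch of pulling the backtest file over HTTP — the `/v1/backtest` path and the `zenhodl.net` host come from this post, but the query parameters, auth header, and response shape are assumptions; check the API docs:

```python
import json
import urllib.request

BASE = "https://zenhodl.net"
# Query parameters below are illustrative guesses, not documented values.
BACKTEST_URL = f"{BASE}/v1/backtest?league=NCAAMB&season=2026"

def fetch_backtest(api_key):
    """Fetch the per-game backtest rows (model WP, final score, result)."""
    req = urllib.request.Request(
        BACKTEST_URL,
        headers={"Authorization": f"Bearer {api_key}"},  # auth scheme assumed
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```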
Next up: the 2027 preseason model. We're re-training on the full 2025-26 season (including the tournament games as a fresh calibration batch), and we'll publish an ECE report against the updated holdout. That work goes live in August.
If you want to backtest your own basketball strategies against the same snapshot data we use internally, the API gives you read access to ~12,000 NCAAMB games across three seasons plus every tournament snapshot from 2023-26. 7-day free trial, no credit card required to start.
Data sources: ESPN game data (public); ELO computed from game results; team-level pace/ORTG/DRTG priors from season-to-date box scores. Pre-game predictions use the exact same model weights (wp_model_NCAAMB.pkl) deployed in production on zenhodl.net. All 67 tournament games were held out of the ELO training set and the model's calibration set — nothing about the tournament was used to generate the predictions. The full backtest script and predictions file are reproducible from the /v1/backtest endpoint.