
Comparing Polymarket Odds to ZenHodl Win Probabilities Across 5,000 Games

2026-05-11 polymarket calibration validation results prediction-markets

When we say our model has edge, the only honest test is whether it disagrees with the market in the right direction often enough to make money. Calibration is necessary. Beating the closing line is sufficient.

This post walks through the head-to-head: 5,000+ resolved games where ZenHodl published a fair probability and Polymarket had a closing market price. We measure agreement, disagreement, and which side is right when they diverge.

The Setup

For every game in the comparison, we have three things: ZenHodl's published fair probability, Polymarket's closing market price, and the resolved outcome.

The dataset spans roughly 14 months across NBA, NHL, MLB, NCAAMB, NCAAWB, CFB, NFL, soccer, and tennis. Esports (CS2, LoL) are excluded from this cut because their Polymarket sample sizes are too small to be statistically interesting.

Per-sport game counts range from roughly 200 to 1,800. NCAAMB is the largest sample (1,800+ games) because of season volume; NFL is the smallest meaningful sample (220 games).

Where the Model and Market Agree

The first finding is that ZenHodl and Polymarket agree most of the time. Across the 5,000+ games, the two probabilities differ by less than 5 percentage points on roughly 65% of contracts. The model is not picking different winners than the market. It is mostly producing the same probabilities, plus or minus noise.
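To make the agreement measure concrete, here is a minimal sketch in Python. The arrays and the 5-point threshold are illustrative stand-ins, not our production data:

```python
import numpy as np

# Hypothetical arrays: one entry per resolved contract.
model_p = np.array([0.62, 0.30, 0.55, 0.48, 0.71])   # model fair probabilities
market_p = np.array([0.60, 0.41, 0.53, 0.50, 0.58])  # market closing prices

# A contract counts as "agreement" when the two probabilities
# differ by less than 5 percentage points.
agree = np.abs(model_p - market_p) < 0.05
agreement_rate = agree.mean()
print(f"agreement rate: {agreement_rate:.0%}")  # 3 of 5 agree here -> 60%
```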

This is what you would expect from any well-calibrated model on a relatively efficient market. The market price is itself an aggregator of many forecasts; if your model is well-built, it lands close to the market consensus. The disagreement is the interesting part.

Where They Disagree, and Who Wins

In the 35% of contracts where the two probabilities differ by 5 points or more, we get to test which estimate is closer to reality. The standard test is Brier score — squared error of probability versus observed outcome — averaged across the disagreement set.
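A minimal sketch of that comparison, with made-up probabilities and outcomes standing in for the real dataset:

```python
import numpy as np

# Hypothetical data: probabilities and resolved outcomes (1 = yes side won).
model_p  = np.array([0.70, 0.25, 0.60, 0.45])
market_p = np.array([0.58, 0.35, 0.72, 0.47])
outcome  = np.array([1,    0,    1,    0   ])

# Restrict to the disagreement set: |model - market| >= 5 points.
mask = np.abs(model_p - market_p) >= 0.05

def brier(p, y):
    """Mean squared error of forecast probability vs observed 0/1 outcome."""
    return np.mean((p - y) ** 2)

model_brier = brier(model_p[mask], outcome[mask])
market_brier = brier(market_p[mask], outcome[mask])
print(f"model {model_brier:.4f} vs market {market_brier:.4f}")
```

Lower is better; a model Brier below the market Brier on the disagreement set is the evidence of edge.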

Across the disagreement set, ZenHodl's Brier score is lower than Polymarket's by a small but consistent margin, and the size of the gap varies by sport.

Every sport has a positive gap. The gap is small — in the 0.003 to 0.008 range. But it is consistent in direction. On the contracts where the model disagrees with the market, the model is right slightly more often than the market is.

That gap is the source of the trading edge. It is not a wide gap. It is not a "we are smarter than everyone" gap. It is the kind of small, persistent edge that compounds when you size positions correctly with quarter-Kelly.
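For readers unfamiliar with quarter-Kelly sizing on a binary contract: buying at price c a share that pays $1 on a yes resolution gives a full-Kelly fraction of (p - c) / (1 - c), and quarter-Kelly stakes a quarter of that. A sketch under those assumptions (the function name and numbers are ours, not production code):

```python
def quarter_kelly_stake(model_p: float, market_price: float, bankroll: float) -> float:
    """Dollar stake for buying a binary contract at `market_price` that pays
    $1 on yes, given the model's fair probability `model_p`.

    Full Kelly for this payoff structure is f* = (p - c) / (1 - c);
    we size at a quarter of that, and never go negative (no shorting here).
    """
    edge = model_p - market_price
    if edge <= 0:
        return 0.0
    full_kelly = edge / (1.0 - market_price)
    return 0.25 * full_kelly * bankroll

# Example: model says 62%, market price 55 cents, $10,000 bankroll.
stake = quarter_kelly_stake(0.62, 0.55, 10_000)
```

The fractional sizing is what lets a 0.003-0.008 Brier gap compound instead of getting wiped out by variance.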

Where the Model Loses to the Market

Not every disagreement is a model win. There are buckets where the market beats us cleanly.

The most consistent loser bucket is high-edge tennis. On contracts where ZenHodl's probability is more than 20 cents away from Polymarket's, the market is right more often than we are. The model is producing extreme probabilities in regions where it does not have enough training data to be trusted, and the market knows better. We cap the maximum edge our tennis bot will trade at 20 cents specifically because of this finding.

The second loser bucket is late-game NHL with empty-net situations. Our NHL model treats the empty net as a feature, but Polymarket's pricing on those contracts incorporates information from the orderbook (who is loading up on which side) that our model does not have. We have considered exiting NHL positions when the empty net comes out and the market shifts more than 5 cents; the backtest is suggestive, but the live sample is still too small to commit.
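The exit rule under consideration reduces to a simple predicate. This is an illustration of the idea, not a committed implementation, and the names are hypothetical:

```python
def should_exit_nhl(empty_net: bool, entry_price: float, current_price: float,
                    threshold: float = 0.05) -> bool:
    """Candidate exit rule: close the position once the opposing goalie is
    pulled AND the market has moved more than `threshold` (5 cents) from our
    entry. Prices are in dollars per share."""
    return empty_net and abs(current_price - entry_price) > threshold

# Goalie pulled, market moved 8 cents from entry -> exit signal fires.
exit_now = should_exit_nhl(empty_net=True, entry_price=0.60, current_price=0.52)
```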

The third is an MLB late-inning bullpen pattern. Our MLB model handles starting-pitcher quality but does not have great per-reliever data. Polymarket prices late-game contracts with the specific reliever in for the save factored in, and the model is sometimes wrong about it. Tracking this is on the roadmap.

What 5,000 Games Tell Us About Market Efficiency

The headline conclusion: Polymarket is mostly efficient, and the inefficiency that does exist is in specific buckets that require domain knowledge to identify.

The market is right on aggregate. The market is right on extreme edges (where the model is overconfident). The market is also right on situations the model has not been trained on enough.

Where the model wins is the middle of the edge distribution — disagreements of 8 to 18 cents in regions of the feature space the model has good coverage on. That is also exactly where our edge-band filter constrains the bots to trade.
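The edge-band filter described above reduces to a one-line check. The 8-18 cent bounds mirror the numbers in this post, but treat the function itself as a sketch:

```python
def in_edge_band(model_p: float, market_price: float,
                 min_edge: float = 0.08, max_edge: float = 0.18) -> bool:
    """Trade only when |model - market| falls inside the band where the
    Brier comparison shows the model beating the market (8-18 cents here).
    Below the band the edge is noise; above it the model is usually wrong."""
    edge = abs(model_p - market_price)
    return min_edge <= edge <= max_edge

in_edge_band(0.70, 0.58)  # 12-cent disagreement: inside the band
in_edge_band(0.95, 0.58)  # 37-cent disagreement: too extreme, skip
```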

Implications for Building Your Own

If you are training a sports prediction model with the goal of trading on Polymarket, three takeaways from this comparison:

Brier score versus the closing market should be your primary validation metric. Win rate is too noisy, and P&L lags too far behind. Brier-versus-market on a disagreement subset converges fast and tells you immediately whether your model has marginal edge over the market consensus.

Cap the maximum edge you trade. The largest disagreements between your model and the market are usually the model being wrong. We learned this with tennis. The pattern likely applies to your model too — verify with the Brier comparison restricted to high-edge buckets.
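One way to run that high-edge Brier check is to bucket contracts by the size of the model-market disagreement and compare Brier scores inside each bucket. A sketch with illustrative bucket edges:

```python
import numpy as np

def brier_gap_by_bucket(model_p, market_p, outcome,
                        edges=(0.05, 0.10, 0.20, 1.0)):
    """Market Brier minus model Brier inside each |model - market| bucket.
    A positive gap means the model beats the market in that bucket; a
    negative gap at the high-edge end is the signal to cap your max edge."""
    model_p, market_p, outcome = map(np.asarray, (model_p, market_p, outcome))
    diff = np.abs(model_p - market_p)
    gaps, lo = {}, 0.0
    for hi in edges:
        m = (diff >= lo) & (diff < hi)
        if m.any():
            model_brier = np.mean((model_p[m] - outcome[m]) ** 2)
            market_brier = np.mean((market_p[m] - outcome[m]) ** 2)
            gaps[(lo, hi)] = market_brier - model_brier
        lo = hi
    return gaps
```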

Identify the buckets where you lose to the market and patch them. The empty-net NHL situation is fixable with better features. The MLB bullpen situation is fixable with better data. The tennis high-edge situation may not be fixable — sometimes you just cap and move on.

The Bottom Line

5,000 games is enough to know whether your model has edge. It is not enough to know whether you have a giant edge — and you almost certainly do not. The edge that survives 5,000-game validation is small, consistent, and concentrated in specific buckets. That is what real prediction-market alpha looks like in 2026.

The full per-sport breakdown of model versus market is published on our validation page. The methodology for the Brier comparison is reproducible from the predictions API plus a Polymarket data dump.


Live published win probabilities for 11 sports at zenhodl.net/v1/try. The full Brier-versus-market validation methodology is detailed at zenhodl.net/methodology. Seven-day free API trial at zenhodl.net/pricing.
