
Why Your 70% Confidence Should Actually Mean 70%: Calibrated Probabilities in Prediction Markets

2026-04-22 · calibration · probability · prediction-markets · polymarket · machine-learning · intermediate

A model that says "Lakers win at 70%" can still be useless. The question isn't "was it right about who won." The question is: across the next 100 similar games where it said 70%, did that side actually win about 70?

That's calibration. It's the property most sports models don't have, most data vendors don't test for, and most traders don't realize is the reason their backtest beat the market but their live account didn't.

At ZenHodl, we trade Polymarket live across 11 sports. Our calibration work is the reason the live trading feed shows real edges, not vanity win rates. Here's the full picture — what calibration is, why it's the quiet killer of sports models, and how to fix yours.

The 70% Thought Experiment

Imagine two models. Both predict NBA games. Both are 55% accurate (they pick the winner 55% of the time).

Model A says every game has a 55% chance for the favorite. Always. Same number.

Model B says some games are 80% favorites, some are 65%, some are 45%. It varies by matchup.

Both are equally accurate. But only one tells you anything useful. Model A is useless for trading — you can't bet on a market where your "edge" is always the same regardless of context. Model B might be tradeable, if its 80%-claimed games actually win 80% of the time.

That "if" is calibration. Being accurate (predicting the winner) is not the same as being calibrated (being right about how confident to be). A model can have perfect accuracy with terrible calibration, and vice versa. And only one of those properties makes you money at prediction markets.
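The split between accuracy and calibration is easy to see with synthetic data. The sketch below builds a contrived oracle (all numbers made up) that always knows the winner but hedges every prediction to 55%: its accuracy is perfect, its calibration is terrible.

```python
import numpy as np

rng = np.random.default_rng(0)
outcomes = rng.integers(0, 2, size=1000)  # 1 = home team won

# A contrived oracle: it always knows the winner, but reports
# P(home win) = 0.55 when home won and 0.45 when home lost.
probs = np.where(outcomes == 1, 0.55, 0.45)

picks = (probs > 0.5).astype(int)
accuracy = (picks == outcomes).mean()

# Calibration check on the "55%" bucket:
bucket_freq = outcomes[probs == 0.55].mean()
print(accuracy, bucket_freq)
```

Accuracy comes out at 100%, but the "55%" bucket wins 100% of the time: a 45-point calibration gap from a perfectly accurate model.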

What Calibration Actually Means

Calibration is a property you can measure with one sentence:

When a model predicts probability p, the event happens with frequency p.

If you group all the games your model called "70% home win," you should see home teams winning 70% of the time in that group. Not 60%. Not 80%. Seventy.

Graph that property across all probability buckets, and you get a reliability diagram:

Perfectly calibrated model:
  Predicted prob   Actual freq
      0.10             0.10
      0.20             0.20
      0.30             0.30
      ...              ...
      0.90             0.90

  ← diagonal line from (0,0) to (1,1)

Typical ML model trained on accuracy:
  Predicted prob   Actual freq
      0.10             0.19   ← overconfident at low probs
      0.20             0.26
      0.30             0.34
      0.50             0.51   ← fine in the middle
      0.70             0.62   ← overconfident at high probs
      0.90             0.74   ← badly overconfident

Most models overestimate their high-probability claims. When they say "90%," they're actually right about 74% of the time. That 16-point gap is where your backtest fools you.
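A reliability table like the one above takes only a few lines to compute. The sketch below fakes an overconfident model by pushing probabilities away from 0.5 (the 1.4 factor, seed, and sample size are arbitrary) and prints predicted-vs-actual per bucket:

```python
import numpy as np

rng = np.random.default_rng(1)
true_p = rng.uniform(0.05, 0.95, size=20_000)
outcomes = (rng.random(20_000) < true_p).astype(int)

# An overconfident model: probabilities pushed away from 0.5.
pred = np.clip(0.5 + 1.4 * (true_p - 0.5), 0.01, 0.99)

# Bucket on the interior edges 0.1, 0.2, ..., 0.9 -> bucket index 0..9.
bins = np.digitize(pred, np.linspace(0.0, 1.0, 11)[1:-1])
table = [
    (pred[bins == b].mean(), outcomes[bins == b].mean())
    for b in range(10) if (bins == b).any()
]
for p_hat, freq in table:
    print(f"predicted {p_hat:.2f}   actual {freq:.2f}")
```

The extreme buckets show the signature gap: the high buckets land below the diagonal (overconfident), the low buckets land above it.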

How to Measure Calibration — ECE

Expected Calibration Error (ECE) is the standard metric. It's the weighted average of the gap between what the model predicted and what actually happened across probability buckets. Lower is better. Zero is perfect.

import numpy as np

def expected_calibration_error(predicted_probs, actuals, n_bins=10):
    """
    predicted_probs: array of model probabilities, shape (n,)
    actuals: array of 0/1 outcomes, shape (n,)
    """
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = bin_edges[i], bin_edges[i + 1]
        if i == n_bins - 1:
            # include p == 1.0 in the top bucket
            mask = (predicted_probs >= lo) & (predicted_probs <= hi)
        else:
            mask = (predicted_probs >= lo) & (predicted_probs < hi)
        if mask.sum() == 0:
            continue
        bin_pred = predicted_probs[mask].mean()
        bin_actual = actuals[mask].mean()
        bin_weight = mask.sum() / len(predicted_probs)
        ece += bin_weight * abs(bin_pred - bin_actual)
    return ece

A well-calibrated sports model has ECE under 3% (3 percentage points). An uncalibrated one is often above 10%.

For reference, here are ECE values for five of the 11 sports we track (as of April 2026, measured on a seeded historical buffer of 84,960 predictions):

MLB:     ECE 9.4%    ← weak (near the 10% uncalibrated line)
NHL:     ECE 4.1%    ← good (low-scoring sport = stable prior)
CS2:     ECE 4.9%    ← good on training, drifts in live (more on this below)
ATP:     ECE 9.6%    ← weak (surface-dependent)
LOL:     ECE 1.0%    ← looks great on paper, but...

The LoL number is interesting. It's gorgeous on historical data. It's terrible on our live trades (live ECE is often 24-28 percentage points). That gap is the heart of the next section.

Why Training ECE ≠ Live ECE (Adverse Selection)

Here's the trap: a model can be beautifully calibrated on random samples and catastrophically miscalibrated on the subset of samples you bet on.

This is called adverse selection, and it's baked into the structure of prediction markets.

Your bot only places a trade when the model disagrees with the market. That's the whole point — you're paid for the disagreement. But the market is not random. Polymarket prices are made by informed sharp players, bookmakers, and prop desks. When you think the market is wrong, it's usually because they saw something you didn't. Consciously or not, the bot is selecting for the hardest-to-predict cases.

Concretely: if your model trained to 4% ECE on random games, when you filter down to "games where my model disagrees with the market by 20+ cents," your live ECE on that subset could be 25%. The model isn't broken. The selection bias is eating it.
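The effect is easy to reproduce in simulation. In the sketch below (noise levels, threshold, and seed are all made up, not our production setup), a sharp market and a noisier model both see the same games; the model only "trades" big disagreements, taking whichever side it thinks is cheap. It stays well calibrated overall and falls apart on exactly the subset it bets:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 100_000
true_p = rng.uniform(0.05, 0.95, size=n)
outcomes = (rng.random(n) < true_p).astype(int)

# The market is sharp (small noise); our model is noisier.
market = np.clip(true_p + rng.normal(0.0, 0.02, n), 0.01, 0.99)
model = np.clip(true_p + rng.normal(0.0, 0.10, n), 0.01, 0.99)

def ece(probs, actuals, n_bins=10):
    bins = np.digitize(probs, np.linspace(0, 1, n_bins + 1)[1:-1])
    return sum(
        (bins == b).mean() * abs(probs[bins == b].mean() - actuals[bins == b].mean())
        for b in range(n_bins) if (bins == b).any()
    )

# Trade only 15+ cent disagreements, on the side the model likes.
buys = model - market > 0.15    # model thinks YES is cheap
sells = market - model > 0.15   # model thinks NO is cheap
side_prob = np.concatenate([model[buys], 1 - model[sells]])
side_out = np.concatenate([outcomes[buys], 1 - outcomes[sells]])

overall_ece = ece(model, outcomes)
bet_ece = ece(side_prob, side_out)
print(f"overall ECE: {overall_ece:.3f}, bet-subset ECE: {bet_ece:.3f}")
```

The overall number stays small while the bet-subset number blows up, with no change to the model at all — only to which samples you condition on.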

This is why we maintain a live recalibrator at ZenHodl. Every resolved trade updates a rolling isotonic regression per sport. If the live subset of trades drifts away from baseline, the recalibrator notices within 25–50 trades and adjusts the effective fair price without touching the underlying model weights.

You can watch this in action on zenhodl.net/live — every ORDER event has a fair_c value that's already been adjusted for live calibration drift.

How to Fix Miscalibration

Once you've measured the problem, there are three practical ways to fix it, in increasing order of flexibility:

1. Platt Scaling

Fit a logistic regression from model output → actual outcome. Two parameters, fast to fit, works on small samples.

import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_calibrate(raw_probs, actuals):
    X = raw_probs.reshape(-1, 1)
    lr = LogisticRegression()
    lr.fit(X, actuals)
    return lambda p: lr.predict_proba(np.array(p).reshape(-1, 1))[:, 1]

Good when your miscalibration is roughly monotonic and your sample is small (< 500 resolved predictions).

2. Isotonic Regression

Non-parametric. Fits a monotonic step function from predicted probability to actual frequency. Scales to any sample size.

from sklearn.isotonic import IsotonicRegression

def isotonic_calibrate(raw_probs, actuals):
    ir = IsotonicRegression(out_of_bounds="clip")
    ir.fit(raw_probs, actuals)
    return lambda p: ir.predict(np.asarray(p))

This is what we use in production. Flexible enough to handle S-curve miscalibration that Platt scaling can't capture.
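Whichever calibrator you pick, check it on a held-out split rather than the data it was fit on. A minimal sketch on synthetic data (the 1.5 overconfidence factor and sample sizes are made up):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(3)
true_p = rng.uniform(0.05, 0.95, size=20_000)
outcomes = (rng.random(20_000) < true_p).astype(int)

# Overconfident raw model: probabilities pushed toward the extremes.
raw = np.clip(0.5 + 1.5 * (true_p - 0.5), 0.01, 0.99)

def ece(probs, actuals, n_bins=10):
    bins = np.digitize(probs, np.linspace(0, 1, n_bins + 1)[1:-1])
    return sum(
        (bins == b).mean() * abs(probs[bins == b].mean() - actuals[bins == b].mean())
        for b in range(n_bins) if (bins == b).any()
    )

# Fit on one half, evaluate on the other.
half = 10_000
ir = IsotonicRegression(out_of_bounds="clip")
ir.fit(raw[:half], outcomes[:half])
calibrated = ir.predict(raw[half:])

before = ece(raw[half:], outcomes[half:])
after = ece(calibrated, outcomes[half:])
print(f"held-out ECE before: {before:.3f}, after: {after:.3f}")
```

The held-out ECE drops sharply after the isotonic fit; if it doesn't on your data, the miscalibration is probably drifting over time, which is what the next pattern addresses.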

3. Rolling Calibration Buffer

The above two are static fits. In live markets, you want calibration that updates as outcomes arrive. Keep a FIFO buffer of the last N predictions and actuals, and refit isotonic regression on each new datapoint:

from collections import deque
from sklearn.isotonic import IsotonicRegression
import numpy as np

class LiveRecalibrator:
    def __init__(self, buffer_size=500, refit_every=25):
        self.buffer = deque(maxlen=buffer_size)
        self.refit_every = refit_every
        self.ir = None
        self._updates_since_refit = 0

    def record(self, predicted_prob, outcome):
        self.buffer.append((predicted_prob, outcome))
        self._updates_since_refit += 1
        if self._updates_since_refit >= self.refit_every and len(self.buffer) >= 50:
            self._refit()
            self._updates_since_refit = 0

    def _refit(self):
        probs, outcomes = zip(*self.buffer)
        self.ir = IsotonicRegression(out_of_bounds="clip")
        self.ir.fit(np.array(probs), np.array(outcomes))

    def adjust(self, raw_prob):
        if self.ir is None:
            return raw_prob
        return float(self.ir.predict(np.array([raw_prob]))[0])

This is exactly the pattern we run per-sport at ZenHodl. Buffer size 500 is a good balance between responsiveness and noise. Refit every 25 new outcomes means the recalibrator adapts to drift within a trading day, not a month.

Why This Matters More in Prediction Markets than Sportsbooks

Sportsbooks let you bet at fixed odds. Prediction markets like Polymarket trade continuously — every bet moves the price. That has two consequences for calibration:

  1. You can be right about the model and still lose. If your model says "70% home win" and Polymarket says "55% home win," you'd buy home at 55c. If the market then drifts to 45% before the game, you're carrying -10c of closing line value (CLV), regardless of whether the home team wins.

  2. Your calibration affects your sizing, not just your side. A Kelly-sized bet against a 55c market needs the true probability to be 58c+ to be profitable after spread. If your model says 70c but it's actually calibrated to 61c, you're over-sizing by a huge multiple.

At sportsbook odds, a miscalibrated model just loses a little less. At prediction market prices, a miscalibrated model blows up position size.
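The sizing effect is worth making concrete. For a binary market at price c, the net odds on a YES share are b = (1-c)/c, and the standard Kelly fraction for believed probability p is f* = p - (1-p)/b. A sketch using the numbers from point 2 (illustrative, not our production sizing):

```python
def kelly_fraction(p, c):
    """Kelly stake (fraction of bankroll) for buying YES at price c
    when you believe the true probability is p."""
    b = (1 - c) / c            # net odds: profit per dollar staked if YES wins
    return max(0.0, p - (1 - p) / b)

price = 0.55
claimed = kelly_fraction(0.70, price)     # the model's claimed edge
calibrated = kelly_fraction(0.61, price)  # the calibrated reality
print(claimed, calibrated)
```

The claimed 70c edge implies staking about a third of bankroll; the calibrated 61c implies about 0.13. The miscalibrated model isn't slightly wrong — it sizes roughly 2.5× too large on every trade.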

How We Use Calibration at ZenHodl

All of the above is running live on our production bots right now.

The calibration work is why we don't have to sell you a "69.8% historical win rate" we can't reproduce. The live numbers match the backtest because the model is calibrated to reality, not to the pre-trade distribution.

The One-Line Takeaway

If you're building a sports prediction model and you only track one thing besides accuracy, track ECE. If it's above 5% overall, don't trade on it yet. If it's fine overall but above 5% on the subset of plays you actually bet, the problem isn't your model. It's adverse selection, and you need a live recalibrator.

Try It Live

You can see every eval, order, and fill from our calibrated bots streaming in real time at zenhodl.net/live. Full methodology is on the results page. If you want the calibrated probabilities as a JSON API for your own strategy, the $49/mo Starter plan includes the REST API + the full bot course that walks through this entire calibration pipeline.

No backtest promises. Just the scoreboard, updated live.
