Our Methodology

How we build independent win probability models for prediction markets

Core live platform coverage: NBA, NCAAMB, NCAAWB, CFB, NFL, NHL, MLB. Additional live markets, research, and course extensions are called out separately below.

Current Proof Snapshot

59.3% win rate across 361 validation trades in 10 sports

This Methodology page explains the model design; the Validation page shows the current exported backtest snapshot, and the Results page shows live production outcomes.

Backtest snapshot exported 2026-04-13.

ZenHodl Weekly

One weekly email with live results, one modeling insight, and product updates. A short note for builders, traders, and researchers following the model.

Why independent probabilities matter

Most sports analytics services derive "fair value" by devigging sportsbook lines — averaging Pinnacle, FanDuel, and DraftKings odds. This gives you a consensus probability that tracks the market by construction. It's useful for sports betting (finding +EV against soft books), but it cannot find prediction market mispricings — because the output already agrees with the market.

ZenHodl models are trained on game state only: score differential, seconds remaining, period, Elo ratings, and sport-specific features. No odds, no lines, no market prices are used as inputs. This makes our output genuinely independent from the market — when our fair probability diverges from the Polymarket ask price, that divergence is a real signal, not noise.

The tradeoff: our models can have a worse Brier score than market-derived models in absolute forecasting terms. But independence is what creates trading value when the output is validated against real market prices. See the validation page for the latest exported backtest snapshot and live results for current production performance.

|               | Market-Derived                | ZenHodl (Independent)                 |
|---------------|-------------------------------|---------------------------------------|
| Inputs        | Sportsbook odds/lines         | Score, time, Elo only                 |
| Output        | Tracks market by construction | Genuinely independent                 |
| Brier Score   | Better                        | Worse                                 |
| Trading Value | Zero (agrees with market)     | Possible when independently validated |
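The divergence signal in the table's last row reduces to simple arithmetic. A minimal sketch (the function name and the cents-on-the-YES-side convention are our illustration, not ZenHodl's published API):

```python
def edge_cents(fair_prob: float, ask_cents: float) -> float:
    """Edge between an independent fair probability and the market
    ask for a YES share priced in cents (0-100). Positive means the
    market is asking less than the model's fair value."""
    return fair_prob * 100.0 - ask_cents

# Model says 62% fair while the ask sits at 55c: about 7c of edge.
```

A market-derived model cannot produce this number by construction: its fair probability is built from the same prices it is compared against, so the edge is always near zero.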

Data pipeline

We scrape ESPN's play-by-play API across the core sports we model. Each game produces hundreds of snapshots — one per score change or significant event.

Dataset scale: 10k+ games, millions of snapshots, across multiple sports and multiple seasons.

Sports: NBA, NCAAMB, NCAAWB, CFB, NFL, NHL, MLB

Data is stored as Apache Parquet files. One row = one game state (score, period, clock, ESPN WP, outcome label).
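The one-row-per-state layout can be sketched as follows (field names and the example game ID are illustrative, not the production schema):

```python
from dataclasses import dataclass, asdict

@dataclass
class GameState:
    """One snapshot row: score, period, clock, ESPN WP, outcome label."""
    game_id: str
    sport: str
    home_score: int
    away_score: int
    period: int
    seconds_remaining: int
    espn_wp: float   # ESPN's live home win probability at this state
    home_won: int    # outcome label, backfilled once the game resolves

row = GameState("401585601", "NBA", 54, 49, 2, 1450, 0.61, 1)
record = asdict(row)
# A list of such records becomes one Parquet file,
# e.g. via pandas.DataFrame(records).to_parquet("nba_2024.parquet").
```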

Module 1 of our course teaches you to build this exact scraper.

Feature engineering

Each game state is featurized with 13–16 variables. We deliberately keep the feature set small — overfitting to noise destroys trading value.

| Feature           | Sports  | Description                                |
|-------------------|---------|--------------------------------------------|
| score_diff        | All     | home_score − away_score                    |
| seconds_remaining | All     | Total game seconds left                    |
| period            | All     | Current period/half/inning                 |
| time_fraction     | All     | Fraction of game elapsed (0→1)             |
| elo_diff          | All     | Home Elo − Away Elo                        |
| pregame_wp        | All     | ESPN pre-game win probability (fixed prior) |
| score_diff_x_tf   | All     | Lead × time elapsed (interaction)          |
| score_diff_sq     | All     | Lead² (quadratic, captures blowouts)       |
| is_home_batting   | MLB     | 1 if home team is batting                  |
| down, distance    | CFB/NFL | Football situation                         |
| yard_line         | CFB/NFL | Field position                             |
| possession_home   | CFB/NFL | 1 if home has the ball                     |
| pace features     | NBA     | total_score, ortg_diff, drtg_diff          |
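The shared block of that table can be expressed as a single featurizer. A minimal sketch (the production code adds the sport-specific columns on top):

```python
def shared_features(home_score, away_score, seconds_remaining,
                    period, total_seconds, home_elo, away_elo,
                    pregame_wp):
    """Cross-sport feature block: raw state plus the two engineered
    terms (interaction and quadratic) from the table above."""
    score_diff = home_score - away_score
    time_fraction = 1.0 - seconds_remaining / total_seconds
    return {
        "score_diff": score_diff,
        "seconds_remaining": seconds_remaining,
        "period": period,
        "time_fraction": time_fraction,
        "elo_diff": home_elo - away_elo,
        "pregame_wp": pregame_wp,
        "score_diff_x_tf": score_diff * time_fraction,  # a late lead matters more
        "score_diff_sq": score_diff ** 2,               # blowouts saturate
    }
```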

Model architecture

We use sport-specific models — no one-size-fits-all approach. The sections below distinguish the current live platform from adjacent research and course material.

Current live platform — 11 sports

The live platform's core coverage is NBA, NCAAMB, NCAAWB, CFB, NFL, NHL, and MLB; the soccer, esports, and tennis models described below complete the eleven. Each sport gets a sport-specific model, post-hoc isotonic calibration (ECE ≤ 0.002), and a four-layer prediction pipeline: base model, team/player overlays, isotonic calibration, and live rolling recalibration.

Basketball & Football (NBA, NCAAMB, NCAAWB, CFB, NFL)

Split-Phase XGBoost with 16 features including team offensive/defensive ratings (ORtg, DRtg), pace, momentum (scoring runs over last 2 and 5 minutes), Elo ratings, and interaction terms (score_diff × time_fraction). NBA Brier: 0.124, ECE: 0.002. Trained on 5,285 games (2021–2026) with walk-forward season-based splits. Post-hoc isotonic recalibration on the 2024–25 season.

Injury overlay: 58 NBA star players tracked in real-time via ESPN's injury API (10-minute cache). Each player has a pre-computed impact factor (e.g., Jokic: 10%, LeBron: 8%, Curry: 9%). When a player is OUT, the model subtracts their impact; QUESTIONABLE halves the adjustment. Total cap: ±15%.
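The overlay arithmetic can be sketched as follows. The three impact factors come from the text above; the function shape, status strings, and sign convention are our illustration:

```python
# Pre-computed impact factors (from the text; the live table has 58 players)
IMPACT = {"Nikola Jokic": 0.10, "LeBron James": 0.08, "Stephen Curry": 0.09}
CAP = 0.15  # total adjustment capped at +/-15%

def injury_adjustment(home_statuses, away_statuses):
    """Net home win-probability adjustment from star availability.
    Each argument maps player name -> 'OUT' or 'QUESTIONABLE'.
    OUT subtracts the full impact, QUESTIONABLE half; injuries on the
    away side push the home probability up."""
    def team_penalty(statuses):
        total = 0.0
        for player, status in statuses.items():
            impact = IMPACT.get(player, 0.0)
            if status == "OUT":
                total += impact
            elif status == "QUESTIONABLE":
                total += impact / 2
        return total
    net = team_penalty(away_statuses) - team_penalty(home_statuses)
    return max(-CAP, min(CAP, net))
```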

Model selection uses a hybrid criterion: near-best Brier score AND best trading value (c/trade) on a held-out backtest window. This prevents selecting models that look good on calibration metrics but produce worse P&L.
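That hybrid criterion can be sketched in a few lines (the 0.002 Brier tolerance and the candidate-dict layout are our assumptions, not published values):

```python
def select_model(candidates, brier_tolerance=0.002):
    """Hybrid selection: among candidates whose Brier score is within
    a tolerance of the best, pick the one with the highest backtest
    value per trade (cents). Calibration gates entry; P&L decides."""
    best_brier = min(c["brier"] for c in candidates)
    near_best = [c for c in candidates if c["brier"] <= best_brier + brier_tolerance]
    return max(near_best, key=lambda c: c["cents_per_trade"])
```

Note how a candidate with excellent P&L but a clearly worse Brier score never enters the pool, which is the point: trading value alone would select overfit models.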

Hockey (NHL)

XGBoost with 17 features including all basketball features plus hockey-specific metrics: power play %, penalty kill %, save %, faceoff win %, and penalty minutes differential. NHL Brier: 0.157 (improved 23.6% from 0.205), ECE: 0.002. Trained on 4,225 games.

Injury overlay: 44 NHL skaters tracked via ESPN (goalies excluded — handled by the separate goalie quality adjustment). Star impact factors: McDavid 10%, Draisaitl 8%, MacKinnon 9%, etc.

Goalie & shot overlay: Starting goalie detection adjusts for goalie quality. Expected goals (xG), Corsi, and power play state provide real-time shot quality signals. Combined cap: ±12%.

Baseball (MLB)

Ensemble model with starting pitcher ERA, WHIP, and K/9 as base features. MLB Brier: 0.151, ECE: 0.002. Three post-model overlays stack on top of the base prediction.

Soccer (Bundesliga, EPL, La Liga, Serie A, Ligue 1)

XGBoost ML model trained on 2,949 matches with 56,000 in-game snapshots. Replaced the original analytical Poisson model (9.4% Brier improvement). Features: score differential, time fraction, Elo difference, total goals, second-half flag, and league-specific effects (one-hot encoded). Isotonic calibration on the last 20% of matches. AUC: 0.830.

Esports (CS2, LoL)

Both esports are live on the platform with dedicated XGBoost models and active trading.

CS2: 15 features including round differential, Elo, map-specific win rates (per team per map), recent form, head-to-head record, and Elo momentum. Economy data from bo3.gg and HLTV scorebot provides equipment value and consecutive-loss tracking as post-model overlays. Entry restricted to underdogs (20–42c) with 20c+ minimum edge after a live P&L audit found the model overperforms on underdog lines and underperforms on favorites.
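That audit-derived restriction reduces to a small entry gate. A sketch (the cents-on-the-YES-side convention and function name are our illustration):

```python
def cs2_entry_allowed(ask_cents: float, fair_prob: float) -> bool:
    """Only underdog asks (20-42c) with at least 20c of model edge
    survive the filter; favorite-side edges are discarded."""
    edge = fair_prob * 100.0 - ask_cents
    return 20 <= ask_cents <= 42 and edge >= 20
```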

LoL: 20 features including gold differential, kills, towers, dragons (absolute count + soul eligibility + soul obtained), barons, inhibitor advantage, Elo, series score, and best-of format. Dragon soul features are critical — the 4th dragon grants a massive team-wide buff that shifts win probability by 15–25%. Brier: 0.123, AUC: 0.861. Trained on 21,636 snapshots across 4,200 games.

Tennis (ATP, WTA)

Hierarchical analytical model: player-specific serve win rates (by surface) → point probability → game probability → set probability → match probability. Serve rates are computed from 13,174 ATP matches (2020–2024) using the Sackmann dataset, giving us player-specific rates for hard, clay, and grass surfaces instead of tour averages. Elo ratings are surface-aware (4,606 players). The live recalibrator auto-activates after 50 resolved trades.
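The point → game step of that hierarchy has a well-known closed form: sum the ways to win at love, 15, and 30, then add the probability of reaching deuce and winning two clear points. A sketch (the per-point serve rates themselves come from the Sackmann data, not shown here):

```python
def game_win_prob(p: float) -> float:
    """Probability the server wins a game given per-point serve win
    probability p. Deuce win probability is p^2 / (p^2 + q^2)."""
    q = 1.0 - p
    before_deuce = p**4 * (1 + 4*q + 10*q**2)   # wins at love, 15, 30
    deuce = 20 * p**3 * q**3 * (p**2 / (p**2 + q**2))
    return before_deuce + deuce
```

The same pattern repeats upward: game probabilities feed set probabilities, which feed the match probability, so a small per-point serve edge compounds at every level.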

Post-model overlay stack (all sports)

Every prediction passes through a four-layer correction pipeline after the base model:

  1. Post-hoc isotonic calibration: Fitted on a held-out calibration season. Brings ECE from 0.02–0.07 down to ≤0.002 across all sports. When we say 70%, teams win 70% of the time.
  2. Live rolling recalibration: Automatically refits an isotonic correction every 25 resolved trades per sport, using a 500-sample rolling buffer persisted to disk. Catches calibration drift from rule changes, roster turnover, or seasonal patterns without manual intervention.
  3. Dynamic position sizing: Rolling 50-trade performance per sport determines a size multiplier (0.25x–1.5x). Hot sports (>55% WR + positive P&L) get sized up; cold sports (<40% WR) get sized down. Recomputed every 10 minutes.
  4. Meta-model edge filter: A Gradient Boosting classifier trained on 359 resolved trades predicts which detected edges are real vs. noise. Filters out 28% of losing trades before entry, improving average P&L per trade from +3.9c to +5.6c on backtest.
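Layers 1 and 2 are both isotonic fits. The core of an isotonic calibrator is the pool-adjacent-violators (PAV) algorithm, which can be sketched in a few lines; this is a generic PAV implementation, not ZenHodl's code:

```python
def fit_isotonic(probs, outcomes):
    """Fit a non-decreasing map from raw model probabilities to
    empirical win rates via pool-adjacent-violators. Returns a
    step-function calibrator."""
    pairs = sorted(zip(probs, outcomes))
    blocks = []  # each block: [sum_of_outcomes, count, x_min, x_max]
    for x, y in pairs:
        blocks.append([float(y), 1, x, x])
        # merge while the newest block's mean violates monotonicity
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n, lo, hi = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][3] = hi
    steps = [(lo, hi, s / n) for s, n, lo, hi in blocks]

    def calibrate(p):
        for lo, hi, v in steps:
            if p <= hi:
                return v
        return steps[-1][2]
    return calibrate
```

In the live layer, refitting this map every 25 resolved trades on a rolling buffer is what absorbs calibration drift without touching the base model.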

Spread/Total

Spread and total models require a different setup from moneyline markets: regression on remaining margin or total, then a distributional layer to convert that forecast into cover/over probabilities.
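The distributional layer can be sketched with a Normal error on the margin forecast. The sigma value and the sign convention for the home line are our assumptions, not ZenHodl's published parameters:

```python
import math

def cover_prob(pred_margin: float, home_line: float, sigma: float = 10.0) -> float:
    """P(home covers): model the final home margin as
    Normal(pred_margin, sigma). With the home line quoted as e.g.
    -4.5 (home favored), home covers when margin + line > 0,
    i.e. margin > 4.5."""
    z = (pred_margin + home_line) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

The same construction converts a total forecast into an over probability by replacing the margin with the predicted combined score and the line with the posted total.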

Backtesting methodology

Our backtests are designed to avoid the common mistakes that inflate results.

What we tried that doesn't work

We believe in showing failures alongside successes.

Choose your next step

Read the proof, inspect live results, or go straight to the live platform.