Our Methodology

How we build independent win probability models for prediction markets

Core live platform coverage: NBA, NCAAMB, NCAAWB, CFB, NFL, NHL, MLB. Additional live markets, research, and course extensions are called out separately below.

Current Proof Snapshot

59.3% win rate across 361 validation trades in 10 sports

This Methodology page explains the model design; the Validation page shows the current exported backtest snapshot, and the Results page shows live production outcomes.

Backtest snapshot exported 2026-04-13.

ZenHodl Weekly

One weekly email with live results, one modeling insight, and product updates. A short note for builders, traders, and researchers following the model.

Why independent probabilities matter

Most sports analytics services derive "fair value" by devigging sportsbook lines — averaging Pinnacle, FanDuel, and DraftKings odds. This gives you a consensus probability that tracks the market by construction. It's useful for sports betting (finding +EV against soft books), but it cannot find prediction market mispricings — because the output already agrees with the market.

ZenHodl models are trained on game state only: score differential, seconds remaining, period, Elo ratings, and sport-specific features. No odds, no lines, no market prices are used as inputs. This makes our output genuinely independent from the market — when our fair probability diverges from the Polymarket ask price, that divergence is a real signal, not noise.

The tradeoff: our models can have a worse Brier score than market-derived models in absolute forecasting terms. But independence is what creates trading value when the output is validated against real market prices. See the validation page for the latest exported backtest snapshot and live results for current production performance.

|               | Market-Derived                | ZenHodl (Independent)                 |
|---------------|-------------------------------|---------------------------------------|
| Inputs        | Sportsbook odds/lines         | Score, time, Elo only                 |
| Output        | Tracks market by construction | Genuinely independent                 |
| Brier Score   | Better                        | Worse                                 |
| Trading Value | Zero (agrees with market)     | Possible when independently validated |
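The divergence signal in the table's last row reduces to simple arithmetic. A minimal sketch (the function name and the cents-on-the-YES-side convention are our illustration, not ZenHodl's published API):

```python
def edge_cents(fair_prob: float, ask_cents: float) -> float:
    """Edge between an independent fair probability and the market
    ask for a YES share priced in cents (0-100). Positive means the
    market is asking less than the model's fair value."""
    return fair_prob * 100.0 - ask_cents

# Model says 62% fair while the ask sits at 55c: about 7c of edge.
```

A market-derived model cannot produce this number by construction: its fair probability is built from the same prices it is compared against, so the edge is always near zero.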

Data pipeline

We scrape ESPN's play-by-play API across the core sports we model. Each game produces hundreds of snapshots — one per score change or significant event.

Dataset scale: 10k+ games, millions of snapshots, across multiple sports and multiple seasons.

Sports: NBA, NCAAMB, NCAAWB, CFB, NFL, NHL, MLB

Data is stored as Apache Parquet files. One row = one game state (score, period, clock, ESPN WP, outcome label).
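The one-row-per-state layout can be sketched as follows (field names and the example game ID are illustrative, not the production schema):

```python
from dataclasses import dataclass, asdict

@dataclass
class GameState:
    """One snapshot row: score, period, clock, ESPN WP, outcome label."""
    game_id: str
    sport: str
    home_score: int
    away_score: int
    period: int
    seconds_remaining: int
    espn_wp: float   # ESPN's live home win probability at this state
    home_won: int    # outcome label, backfilled once the game resolves

row = GameState("401585601", "NBA", 54, 49, 2, 1450, 0.61, 1)
record = asdict(row)
# A list of such records becomes one Parquet file,
# e.g. via pandas.DataFrame(records).to_parquet("nba_2024.parquet").
```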

Module 1 of our course teaches you to build this exact scraper.

Feature engineering

Each game state is featurized with 13–16 variables. We deliberately keep the feature set small — overfitting to noise destroys trading value.

| Feature           | Sports  | Description                                |
|-------------------|---------|--------------------------------------------|
| score_diff        | All     | home_score − away_score                    |
| seconds_remaining | All     | Total game seconds left                    |
| period            | All     | Current period/half/inning                 |
| time_fraction     | All     | Fraction of game elapsed (0→1)             |
| elo_diff          | All     | Home Elo − Away Elo                        |
| pregame_wp        | All     | ESPN pre-game win probability (fixed prior) |
| score_diff_x_tf   | All     | Lead × time elapsed (interaction)          |
| score_diff_sq     | All     | Lead² (quadratic, captures blowouts)       |
| is_home_batting   | MLB     | 1 if home team is batting                  |
| down, distance    | CFB/NFL | Football situation                         |
| yard_line         | CFB/NFL | Field position                             |
| possession_home   | CFB/NFL | 1 if home has the ball                     |
| pace features     | NBA     | total_score, ortg_diff, drtg_diff          |
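The shared block of that table can be expressed as a single featurizer. A minimal sketch (the production code adds the sport-specific columns on top):

```python
def shared_features(home_score, away_score, seconds_remaining,
                    period, total_seconds, home_elo, away_elo,
                    pregame_wp):
    """Cross-sport feature block: raw state plus the two engineered
    terms (interaction and quadratic) from the table above."""
    score_diff = home_score - away_score
    time_fraction = 1.0 - seconds_remaining / total_seconds
    return {
        "score_diff": score_diff,
        "seconds_remaining": seconds_remaining,
        "period": period,
        "time_fraction": time_fraction,
        "elo_diff": home_elo - away_elo,
        "pregame_wp": pregame_wp,
        "score_diff_x_tf": score_diff * time_fraction,  # a late lead matters more
        "score_diff_sq": score_diff ** 2,               # blowouts saturate
    }
```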

Model architecture

We use sport-specific models — no one-size-fits-all approach. The sections below distinguish the current live platform from adjacent research and course material.

Current live platform — 11 sports

The live platform's core coverage is NBA, NCAAMB, NCAAWB, CFB, NFL, NHL, and MLB; the soccer, esports, and tennis models described below complete the eleven. Each sport gets a sport-specific model, post-hoc isotonic calibration (ECE ≤ 0.002), and a four-layer prediction pipeline: base model, team/player overlays, isotonic calibration, and live rolling recalibration.

Basketball & Football (NBA, NCAAMB, NCAAWB, CFB, NFL)

Split-Phase XGBoost with 16 features including team offensive/defensive ratings (ORtg, DRtg), pace, momentum (scoring runs over last 2 and 5 minutes), Elo ratings, and interaction terms (score_diff × time_fraction). NBA Brier: 0.124, ECE: 0.002. Trained on 5,285 games (2021–2026) with walk-forward season-based splits. Post-hoc isotonic recalibration on the 2024–25 season.

Injury overlay: 58 NBA star players tracked in real-time via ESPN's injury API (10-minute cache). Each player has a pre-computed impact factor (e.g., Jokic: 10%, LeBron: 8%, Curry: 9%). When a player is OUT, the model subtracts their impact; QUESTIONABLE halves the adjustment. Total cap: ±15%.
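The overlay arithmetic can be sketched as follows. The three impact factors come from the text above; the function shape, status strings, and sign convention are our illustration:

```python
# Pre-computed impact factors (from the text; the live table has 58 players)
IMPACT = {"Nikola Jokic": 0.10, "LeBron James": 0.08, "Stephen Curry": 0.09}
CAP = 0.15  # total adjustment capped at +/-15%

def injury_adjustment(home_statuses, away_statuses):
    """Net home win-probability adjustment from star availability.
    Each argument maps player name -> 'OUT' or 'QUESTIONABLE'.
    OUT subtracts the full impact, QUESTIONABLE half; injuries on the
    away side push the home probability up."""
    def team_penalty(statuses):
        total = 0.0
        for player, status in statuses.items():
            impact = IMPACT.get(player, 0.0)
            if status == "OUT":
                total += impact
            elif status == "QUESTIONABLE":
                total += impact / 2
        return total
    net = team_penalty(away_statuses) - team_penalty(home_statuses)
    return max(-CAP, min(CAP, net))
```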

Model selection uses a hybrid criterion: near-best Brier score AND best trading value (c/trade) on a held-out backtest window. This prevents selecting models that look good on calibration metrics but produce worse P&L.
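That hybrid criterion can be sketched in a few lines (the 0.002 Brier tolerance and the candidate-dict layout are our assumptions, not published values):

```python
def select_model(candidates, brier_tolerance=0.002):
    """Hybrid selection: among candidates whose Brier score is within
    a tolerance of the best, pick the one with the highest backtest
    value per trade (cents). Calibration gates entry; P&L decides."""
    best_brier = min(c["brier"] for c in candidates)
    near_best = [c for c in candidates if c["brier"] <= best_brier + brier_tolerance]
    return max(near_best, key=lambda c: c["cents_per_trade"])
```

Note how a candidate with excellent P&L but a clearly worse Brier score never enters the pool, which is the point: trading value alone would select overfit models.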

Hockey (NHL)

XGBoost with 17 features including all basketball features plus hockey-specific metrics: power play %, penalty kill %, save %, faceoff win %, and penalty minutes differential. NHL Brier: 0.157 (improved 23.6% from 0.205), ECE: 0.002. Trained on 4,225 games.

Injury overlay: 44 NHL skaters tracked via ESPN (goalies excluded — handled by the separate goalie quality adjustment). Star impact factors: McDavid 10%, Draisaitl 8%, MacKinnon 9%, etc.

Goalie & shot overlay: Starting goalie detection adjusts for goalie quality. Expected goals (xG), Corsi, and power play state provide real-time shot quality signals. Combined cap: ±12%.

Baseball (MLB)

Ensemble model with starting pitcher ERA, WHIP, and K/9 as base features. MLB Brier: 0.151, ECE: 0.002. Three post-model overlays stack on top of the base prediction.

Soccer (Bundesliga, EPL, La Liga, Serie A, Ligue 1)

XGBoost ML model trained on 2,949 matches with 56,000 in-game snapshots. Replaced the original analytical Poisson model (9.4% Brier improvement). Features: score differential, time fraction, Elo difference, total goals, second-half flag, and league-specific effects (one-hot encoded). Isotonic calibration on the last 20% of matches. AUC: 0.830.

Esports (CS2, LoL)

Both esports are live on the platform with dedicated XGBoost models and active trading.

CS2: 15 features including round differential, Elo, map-specific win rates (per team per map), recent form, head-to-head record, and Elo momentum. Economy data from bo3.gg and HLTV scorebot provides equipment value and consecutive-loss tracking as post-model overlays. Entry restricted to underdogs (20–42c) with 20c+ minimum edge after a live P&L audit found the model overperforms on underdog lines and underperforms on favorites.
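That audit-derived restriction reduces to a small entry gate. A sketch (the cents-on-the-YES-side convention and function name are our illustration):

```python
def cs2_entry_allowed(ask_cents: float, fair_prob: float) -> bool:
    """Only underdog asks (20-42c) with at least 20c of model edge
    survive the filter; favorite-side edges are discarded."""
    edge = fair_prob * 100.0 - ask_cents
    return 20 <= ask_cents <= 42 and edge >= 20
```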

LoL: 20 features including gold differential, kills, towers, dragons (absolute count + soul eligibility + soul obtained), barons, inhibitor advantage, Elo, series score, and best-of format. Dragon soul features are critical — the 4th dragon grants a massive team-wide buff that shifts win probability by 15–25%. Brier: 0.123, AUC: 0.861. Trained on 21,636 snapshots across 4,200 games.

Tennis (ATP, WTA)

Hierarchical analytical model: player-specific serve win rates (by surface) → point probability → game probability → set probability → match probability. Serve rates are computed from 13,174 ATP matches (2020–2024) using the Sackmann dataset, giving us player-specific rates for hard, clay, and grass surfaces instead of tour averages. Elo ratings are surface-aware (4,606 players). The live recalibrator auto-activates after 50 resolved trades.
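The point → game step of that hierarchy has a well-known closed form: sum the ways to win at love, 15, and 30, then add the probability of reaching deuce and winning two clear points. A sketch (the per-point serve rates themselves come from the Sackmann data, not shown here):

```python
def game_win_prob(p: float) -> float:
    """Probability the server wins a game given per-point serve win
    probability p. Deuce win probability is p^2 / (p^2 + q^2)."""
    q = 1.0 - p
    before_deuce = p**4 * (1 + 4*q + 10*q**2)   # wins at love, 15, 30
    deuce = 20 * p**3 * q**3 * (p**2 / (p**2 + q**2))
    return before_deuce + deuce
```

The same pattern repeats upward: game probabilities feed set probabilities, which feed the match probability, so a small per-point serve edge compounds at every level.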

Post-model overlay stack (all sports)

Every prediction passes through a four-layer correction pipeline after the base model:

  1. Post-hoc isotonic calibration: Fitted on a held-out calibration season. Brings ECE from 0.02–0.07 down to ≤0.002 across all sports. When we say 70%, teams win 70% of the time.
  2. Live rolling recalibration: Automatically refits an isotonic correction every 25 resolved trades per sport, using a 500-sample rolling buffer persisted to disk. Catches calibration drift from rule changes, roster turnover, or seasonal patterns without manual intervention.
  3. Dynamic position sizing: Rolling 50-trade performance per sport determines a size multiplier (0.25x–1.5x). Hot sports (>55% WR + positive P&L) get sized up; cold sports (<40% WR) get sized down. Recomputed every 10 minutes.
  4. Meta-model edge filter: A Gradient Boosting classifier trained on 359 resolved trades predicts which detected edges are real vs. noise. Filters out 28% of losing trades before entry, improving average P&L per trade from +3.9c to +5.6c on backtest.
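Layers 1 and 2 are both isotonic fits. The core of an isotonic calibrator is the pool-adjacent-violators (PAV) algorithm, which can be sketched in a few lines; this is a generic PAV implementation, not ZenHodl's code:

```python
def fit_isotonic(probs, outcomes):
    """Fit a non-decreasing map from raw model probabilities to
    empirical win rates via pool-adjacent-violators. Returns a
    step-function calibrator."""
    pairs = sorted(zip(probs, outcomes))
    blocks = []  # each block: [sum_of_outcomes, count, x_min, x_max]
    for x, y in pairs:
        blocks.append([float(y), 1, x, x])
        # merge while the newest block's mean violates monotonicity
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            s, n, lo, hi = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
            blocks[-1][3] = hi
    steps = [(lo, hi, s / n) for s, n, lo, hi in blocks]

    def calibrate(p):
        for lo, hi, v in steps:
            if p <= hi:
                return v
        return steps[-1][2]
    return calibrate
```

In the live layer, refitting this map every 25 resolved trades on a rolling buffer is what absorbs calibration drift without touching the base model.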

Spread/Total

Spread and total models require a different setup from moneyline markets: regression on remaining margin or total, then a distributional layer to convert that forecast into cover/over probabilities.
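The distributional layer can be sketched with a Normal error on the margin forecast. The sigma value and the sign convention for the home line are our assumptions, not ZenHodl's published parameters:

```python
import math

def cover_prob(pred_margin: float, home_line: float, sigma: float = 10.0) -> float:
    """P(home covers): model the final home margin as
    Normal(pred_margin, sigma). With the home line quoted as e.g.
    -4.5 (home favored), home covers when margin + line > 0,
    i.e. margin > 4.5."""
    z = (pred_margin + home_line) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
```

The same construction converts a total forecast into an over probability by replacing the margin with the predicted combined score and the line with the posted total.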

Backtesting methodology

Our backtests are designed to avoid the common mistakes that inflate results.

What we tried that doesn't work

We believe in showing failures alongside successes.

Choose your next step

Read the proof, inspect live results, or go straight to the live platform.