Version 1.1 — Revised April 2026. Supersedes the v1.0 ("Final, April 2026") release. Section 5.3 (Closing Line Value) has been rewritten to report results on n=686 measured trades — fulfilling the explicit commitment in the v1.0 paper to publish CLV after accumulating 100+ trades with closing-price data. Section 8 (Conclusion) updated to reflect the new evidence and to reference the live companion dataset published at https://zenhodl.net/clv. Citation metadata at the end of the paper has been updated to v1.1. No methodology, model, or §5.1 backtest claims have been changed; those remain identical to v1.0.
ZenHodl Research | April 2026 Version 1.1 (Revised April 2026)
Abstract
This paper presents a complete pipeline for systematic trading on blockchain-based sports prediction markets. We describe the data acquisition, model training, calibration, signal generation, and execution components, then evaluate performance on both historical and live data. Our primary contribution is demonstrating that calibrated machine learning win probability models, when combined with real-time game state processing and automated execution, can identify and exploit short-term pricing inefficiencies in these markets.
For each of 7 major sports, we train logistic regression and gradient-boosted tree ensembles on 41,000+ historical games, apply isotonic regression calibration, and use the calibrated probabilities to detect positive expected value opportunities against live market prices. Each game produces multiple evaluation snapshots at different game states, yielding hundreds of thousands of training examples.
On a 2025-26 season backtest of 2,625 trades graded against real Polymarket bid/ask prices, the system achieves a 69.8% win rate with +2.4 cents net profit per trade after an estimated 3.5 cents in execution costs. Live trading since March 2026 shows 90 bot-attributed trades, 88 of which had resolved at write time, with a 62.5% win rate on the resolved sample (55/88) and +$67.59 net P&L; the live sample is too small for strong statistical conclusions.
We discuss model architecture, calibration methodology, execution cost modeling, and the limitations of both backtested and live results.
1. Introduction
1.1 Prediction Market Efficiency
Prediction markets aggregate information through trading to produce probability estimates for future events. Theory suggests these markets should be approximately efficient, with prices reflecting the true probability of outcomes (Wolfers & Zitzewitz, 2004; Arrow et al., 2008).
However, efficiency is not instantaneous. During live sporting events, new information arrives continuously through score changes, momentum shifts, and game clock progression. Market participants process this information at varying speeds, creating brief windows where prices lag the true state of the game.
1.2 The Information Latency Hypothesis
Our core hypothesis is that during live games, there exists a 15-60 second window after significant game events (score changes, period transitions, possession changes) where prediction market prices have not fully adjusted to the new game state. This latency arises from:
- Human processing delay: Most market participants watch games on television with inherent broadcast delay
- Attention fragmentation: Participants monitoring multiple games cannot react to all simultaneously
- Asymmetric information integration: Score changes are immediately observable, but their probabilistic implications require computation
A machine learning model that processes game state features in real-time can compute updated win probabilities faster than the median market participant, capturing the information premium during this adjustment window.
1.3 Related Work
Foundational work on sports probability and market efficiency informs our approach. Stern (1991) established the statistical framework for modeling win probability as a function of in-game state variables, providing the foundation that underlies modern win probability models including ours. Sauer (1998) surveyed the economics of wagering markets comprehensively, documenting both the surprising efficiency of traditional betting markets and the specific conditions under which inefficiencies persist. Wolfers and Zitzewitz (2004) provided an influential overview of prediction markets and their information aggregation properties.
Empirical studies have documented specific inefficiencies exploitable by quantitative approaches. Borghesi (2007) demonstrated persistent biases in NFL betting markets related to home-field advantage and weather effects, providing evidence that even mature sports betting markets are not fully efficient. Croxson and Reade (2014) examined in-play betting market efficiency around goal arrivals in soccer, finding rapid but not instantaneous price adjustment. Kaunitz, Zhong, and Kreiner (2017) showed that systematic exploitation of closing line value in traditional sportsbooks can yield positive returns, though bookmaker countermeasures limit scalability.
Most relevant to our work, Page (2012) documented systematic biases in prediction markets during live events, showing that market participants tend to underreact to information that shifts probabilities toward extreme values — precisely the type of inefficiency our model targets during live game state changes. Our contribution extends this literature by applying calibrated ML models specifically to prediction markets with on-chain settlement (Polymarket), where the combination of thin liquidity, retail-dominated participation, and continuous in-game information flow creates a setting where the information latency hypothesis is most likely to hold.
2. Data and Methodology
2.1 Data Sources
Game State Data: We poll ESPN's public API every 5 seconds for all live games across 7 sports: NBA, NFL, NHL, MLB, NCAAMB (men's college basketball), NCAAWB (women's college basketball), and CFB (college football). For each game, we extract:
- Score differential
- Period/quarter/inning
- Time remaining (seconds)
- Possession (football only)
- Down, distance, yard line (football only)
- Starting pitcher ERA, WHIP, K/9 (baseball only)
- Power play/penalty kill status (hockey only)
Elo Ratings: We maintain continuously-updated Elo ratings (Glickman, 1999) for all teams using a K-factor of 20, home-court advantage of 50 points, and 50% seasonal regression. Ratings are computed from historical results and updated after each completed game.
Market Prices: Real-time bid/ask prices from Polymarket's WebSocket feed, with additional venue coverage from Kalshi and OddsAPI (DraftKings, FanDuel, BetMGM) for multi-venue comparison.
Training Data: 41,000+ historical games across all 7 sports, spanning the 2020-21 through 2025-26 seasons. Each game produces multiple evaluation snapshots at different game states, yielding hundreds of thousands of training examples. Each snapshot contains game state features paired with the actual binary outcome (home team win/loss).
2.2 Model Architecture
For each sport, we train an ensemble of two model classes:
-
Logistic Regression with Natural Spline Features: Provides a well-calibrated baseline with interpretable coefficients. Spline transformations on
score_diffandtime_fractioncapture non-linear relationships (e.g., a 10-point lead means different things in the first quarter versus the fourth). -
Gradient-Boosted Trees (XGBoost): Captures complex feature interactions. Trained with max_depth=4, learning_rate=0.1, n_estimators=200, and regularization (lambda=1.0, alpha=0.1) to prevent overfitting.
Both models are post-hoc calibrated using isotonic regression on a held-out calibration set. The final ensemble weights are determined by minimizing Brier score on the calibration set.
We assessed model sensitivity to key hyperparameters. Brier scores are stable within +/-0.005 across max_depth in {3, 4, 5} and learning_rate in {0.05, 0.1, 0.2} for all sports. The ensemble weights between logistic regression and XGBoost were determined by minimizing Brier score on the calibration set, with typical weights of 40-60% XGBoost depending on the sport.
Feature Engineering:
- score_diff: Home score minus away score
- time_fraction: Fraction of game remaining (1.0 = start, 0.0 = end)
- score_diff_x_tf: Interaction term capturing how score leads change in importance over time
- score_diff_sq: Squared score differential for non-linear response
- elo_diff: Pre-game Elo rating difference
- Sport-specific features as described in Section 2.1
2.3 Temporal Split Methodology
We use strict temporal splits at the season level. For sports with 3+ seasons of data, the oldest seasons are used for training, the second-newest season for calibration, and the most recent season for testing. For sports with fewer seasons, we use a 60/20/20 chronological split within the available data. No information from the calibration or test sets is available during model training.
Elo ratings are computed in a walk-forward manner, using only games completed before the current evaluation point.
2.4 Calibration
Calibration is critical for our application. A model that is discriminative but poorly calibrated will overestimate or underestimate true probabilities, leading to systematic trading errors.
We use isotonic regression calibration (Zadrozny & Elkan, 2002) because it makes no parametric assumptions about the calibration function. We measure calibration quality using:
- Expected Calibration Error (ECE): Weighted average of absolute calibration error across probability bins
- Brier Score: Proper scoring rule that measures both discrimination and calibration
- Reliability Diagrams: Visual assessment of calibration across the probability range
2.5 Uncertainty Quantification
Each model includes an uncertainty estimate based on calibration-error-based uncertainty bands. For each time-fraction bucket, we compute the average absolute calibration error on the held-out calibration set. This provides an empirical estimate of model uncertainty that varies by game state — wider early in games when outcomes are less determined, and narrower late in games with large score differentials.
For each prediction, we provide:
- A point estimate of win probability
- A confidence interval width that varies by game state
- Early-game predictions have wider intervals (more uncertainty)
- Late-game predictions with large score differentials have narrower intervals
This uncertainty estimate informs position sizing: we trade smaller when uncertainty is high and larger when the model is confident.
3. Model Performance
3.1 Overall Metrics
| Sport | Brier Score | ROC-AUC | ECE | Training Games |
|---|---|---|---|---|
| NCAAWB | 0.110 | 0.913 | 0.033 | 11,581 |
| CFB | 0.122 | 0.904 | 0.015 | 2,411 |
| NBA | 0.139 | 0.890 | 0.106 | 5,285 |
| NCAAMB | 0.145 | 0.868 | 0.022 | 12,285 |
| MLB | 0.154 | 0.856 | 0.018 | 4,413 |
| NFL | 0.155 | 0.864 | 0.055 | 1,140 |
| NHL | 0.205 | 0.739 | 0.034 | 4,225 |
NCAAWB achieves the lowest Brier score (best calibrated predictions), while NHL has the highest (hardest to predict due to the low-scoring, high-variance nature of hockey). NBA shows the highest ECE (0.106), indicating room for calibration improvement despite strong discrimination.
3.2 Calibration Analysis
All models except NBA show ECE below 0.055, indicating that when the model says a team has a 70% chance of winning, they win approximately 70% of the time. The NBA model's higher ECE suggests the probabilities are systematically miscalibrated, likely due to the high-variance nature of NBA in-game scoring runs.
3.3 Uncertainty Tables
Each model includes a lookup table mapping game-state time fractions to expected uncertainty widths. For example, in NBA:
- Early game (75%+ remaining): Uncertainty width 0.081 (high)
- Mid game (25-75% remaining): Width 0.040-0.060
- Late game (<25% remaining): Width 0.024 (low)
These widths inform the confidence level assigned to each trade signal.
4. Edge Detection and Execution
4.1 Signal Generation
For each live game with a matched Polymarket market, we compute:
edge_c = fair_wp_c - market_ask_c
Where fair_wp_c is the model's fair win probability in cents (0-100) and market_ask_c is the current Polymarket ask price.
A trade signal is generated when:
- edge_c >= min_edge (sport-specific threshold, typically 5-8 cents)
- fair_wp_c is between 55 and 95 cents (avoid extreme probabilities)
- Market spread is less than 6 cents (liquidity filter)
- Market price data is less than 30 seconds old (freshness filter)
- The model's uncertainty width is below a sport-specific threshold
4.2 Execution
Trades are placed as Fill-or-Kill (FOK) orders on Polymarket's Central Limit Order Book (CLOB) via the Polygon blockchain. Key execution parameters:
- Slippage tolerance: 2 cents
- Maximum entry price: 78 cents
- Position sizing: Kelly criterion at quarter-Kelly with maximum bet caps
- Concurrent position limit: 8 positions maximum
4.3 Execution Cost Model
Our backtest applies the following execution costs: - Taker fee: 2.0 cents per contract (Polymarket standard) - Slippage estimate: 1.0 cent (based on observed fill quality) - Latency penalty: 0.5 cents (price movement during 3-5 second execution) - Total estimated cost: 3.5 cents per trade
Important limitation: These are estimates. Actual execution costs vary with market depth, time of day, and competing market makers. The backtest assumes sufficient liquidity at the quoted ask price, which may not always hold.
4.4 Multi-Venue Comparison
The system simultaneously monitors prices across Polymarket, Kalshi, DraftKings, FanDuel, and BetMGM. When the model identifies an edge, it reports which venue offers the best price, enabling optimal execution routing.
5. Results
5.1 Backtest Results (2025-26 Season)
| Metric | Value |
|---|---|
| Total trades | 2,625 |
| Win rate | 69.8% |
| Raw gross profit per trade | +5.9c |
| Execution costs (slippage + latency) | -1.5c |
| Taker fee | -2.0c |
| Net profit per trade | +2.4c |
| Total net P&L | +$62.69 (computed trade-by-trade; the 2.4c average is rounded) |
These backtest results were generated using backtest_moneyline_wp.py, which uses real Polymarket bid/ask prices from enriched market snapshots. The backtest is graded "semi-realistic" — it uses actual market prices but assumes execution at the quoted ask with estimated slippage, without modeling market depth or queue position.
By Sport:
| Sport | Trades | Win Rate | Gross c/Trade (before execution costs) |
|---|---|---|---|
| NCAAMB | 1,237 | 76.6% | +9.3c |
| NCAAWB | 864 | 66.2% | +2.2c |
| NFL | 286 | 58.0% | -1.0c |
| NBA | 238 | 61.3% | +3.9c |
Net per-trade profit after 3.5c execution costs is +2.4c in aggregate. Individual sport net figures vary based on entry price distribution. Per-sport figures are rounded to one decimal place from trade-by-trade computation and do not sum precisely to the aggregate gross of +5.9c, which is computed directly from the full trade log.
NHL, MLB, and CFB models are trained and deployed but produced zero qualifying trades in the 2025-26 backtest period due to insufficient Polymarket market coverage or liquidity for these sports during the evaluation window.[^1]
NFL is the only sport with negative expected value in the backtest, likely due to a smaller training sample (1,140 games) and the NFL model's previously identified temporal split issue (since corrected).
NCAAMB accounts for 47% of all trades and the majority of backtest profit. This concentration means the strategy's overall profitability is heavily dependent on continued edge in college basketball markets. Diversification across sports reduces this risk but the current backtest does not demonstrate broad profitability across all target sports.
[^1]: The 4 sports shown (NCAAMB, NCAAWB, NFL, NBA) account for all 2,625 trades. NHL, MLB, and CFB had zero qualifying trades during this period.
5.1.1 Statistical Significance
With 2,625 trades and a 69.8% win rate, the 95% confidence interval for the true win rate is 68.0%-71.5% (Wilson score interval). The net profit of +2.4c per trade has a standard deviation of approximately 45c per trade (reflecting the binary nature of hold-to-settlement outcomes). The t-statistic for mean P&L vs. zero is:
t = 2.4 / (45 / sqrt(2625)) = 2.73 (p < 0.01)
This is statistically significant at the 1% level, rejecting the null hypothesis that the strategy has zero expected profit after costs.
We also compute the profit factor (total gross winnings divided by total gross losses, computed trade-by-trade from the backtest log) = 1.42, indicating that winning trades generate 42% more total profit than losing trades consume. The maximum consecutive losing streak in the backtest was 8 trades. The probability of an 8-trade losing streak given a 69.8% win rate is (1-0.698)^8 = 0.0000692 (approximately 7 in 100,000). The probability of observing at least one such streak within 2,625 trades is approximately 1 - (1 - 0.0000692)^2618 = 16.6%, indicating this drawdown is well within expected statistical bounds.
For the live trading period, with 88 resolved trades and a 62.5% win rate, we cannot reject the null hypothesis that the true live win rate equals the backtest rate of 69.8% (z = -1.49, p = 0.14, two-sided). This means the observed live performance is statistically consistent with the backtest expectations.
For discrete binary-outcome strategies, the per-trade information ratio is more informative than a traditional annualized Sharpe ratio. The per-trade information ratio is: IR = mean_pnl / std_pnl = 2.4c / 45c = 0.053. Scaled by the square root of 2,625 trades, this yields a strategy-level z-score of 2.73 (consistent with the t-test above). For comparison to conventional Sharpe ratios, assuming 180 active trading days and approximately 14.6 trades/day, the annualized daily Sharpe is approximately 0.52. This figure is computed from the realized daily P&L time series, where daily returns aggregate a variable number of trades (creating non-iid daily observations due to event clustering, particularly during tournament periods). This falls in the range of viable but not exceptional quantitative strategies. Maximum drawdown in the backtest was approximately $15 (from peak equity), representing about 25% of the total P&L.
5.1.2 Multiple Testing Considerations
We train and evaluate 7 sport-specific models. Four produced qualifying trades in the backtest period. We do not apply a formal multiple testing correction (e.g., Bonferroni) because: (1) the 7 models target distinct sports with independent market microstructure, (2) we report results for all 4 active sports including the unprofitable NFL, not just the best-performing subset, and (3) the aggregate result across all 2,625 trades is our primary claim, not any individual sport result.
Nevertheless, we acknowledge the look-elsewhere effect: with 7 models tested, the probability that at least one appears profitable by chance is elevated. The NCAAMB result (+9.3c/trade, 76.6% WR on 1,237 trades) is individually significant (t > 5), but readers should weight the aggregate result more heavily than any single sport.
5.1.3 Edge Stability Over Time
To assess whether the detected edge is stable or decaying, we examine backtest performance by month within the 2025-26 season. The following figures are estimated from the seasonal backtest and should be interpreted as approximate:
| Period | Trades | Win Rate | Net c/Trade |
|---|---|---|---|
| Oct-Nov 2025 | ~600 | 71.2% | +3.1c |
| Dec-Jan 2025-26 | ~700 | 70.5% | +2.8c |
| Feb-Mar 2026 | ~800 | 69.1% | +2.0c |
| Mar-Apr 2026 | ~525 | 68.3% | +1.6c |
Standard errors on the per-period win rates range from +/-1.4pp to +/-2.0pp (depending on sample size). The observed decline of 2.9 percentage points from the first to last period is approximately 1.1 standard errors (using the standard error of the difference between two independent proportions), which is suggestive but not statistically significant at conventional levels. A linear regression of win rate on time period yields a negative slope but with only four observations, the trend is not distinguishable from flat performance.
The point estimates suggest a possible declining trend consistent with gradual market efficiency improvement, though the evidence is not conclusive. This pattern suggests the edge may have a half-life of approximately 6-12 months, after which model retraining and strategy adaptation are necessary. We emphasize that this trend analysis is based on a single season and should not be extrapolated.
5.1.4 Comparison to Baselines
To contextualize the model's performance, we compare against two naive baselines:
-
Random entry baseline: Buying random moneyline contracts at market prices yields an expected return of approximately -2.0c per trade (the taker fee), confirming that the market is not offering free edge to uninformed participants.
-
ESPN WP baseline: Using ESPN's proprietary win probability directly (without our ML model) as the fair value estimate and trading when ESPN WP diverges from market price by 8c+ yields approximately +0.8c per trade net — positive but substantially below our calibrated model's +2.4c. This suggests that the ML model's calibration layer adds meaningful value beyond raw ESPN WP.
These baselines confirm that (a) the market does extract costs from uninformed traders and (b) the model's edge comes from superior probability calibration, not simply from using publicly available ESPN data.
5.2 Live Trading Results (March-April 2026)
| Metric | Value |
|---|---|
| Total bot-attributed trades | 90 |
| Resolved | 88 |
| Win rate | 62.5% |
| Net P&L | +$67.59 |
| Open positions | 2 |
By Bot/Sport:
| Bot | Trades | Record | P&L |
|---|---|---|---|
| Moneyline WP (NBA/MLB/NCAA) | 35 | 22W-12L | +$27.26 |
| CS2 (Counter-Strike) | 28 | 13W-14L | +$0.42 |
| Tennis (ATP/WTA) | 14 | 10W-4L | +$33.23 |
| LoL (League of Legends) | 12 | 9W-3L | +$6.18 |
| Soccer (EPL/LIGUE1) | 1 | 1W-0L | +$0.50 |
Position sizing differs between backtest and live trading. The backtest assumes $1 per contract (1 share at the quoted ask price). Live trading uses variable position sizes averaging approximately $1-5 per trade depending on the sport and confidence level. To enable comparison, we report edge in cents per contract (c/trade) rather than total dollar P&L. The backtest net edge of +2.4c/trade and the live net edge of approximately +7.9c/trade (95% CI: approximately -3c to +19c, reflecting the wide uncertainty inherent in 88 trades) suggest live execution may be capturing more favorable entries, though the small live sample size makes this comparison preliminary.
All live trades are executed on the Polygon blockchain and are publicly verifiable. The live results represent a filtered view of bot-attributed trades starting March 9, 2026, excluding manual trades and backfilled wallet transactions.
Live trading covers additional sports beyond those described in Section 2. CS2 uses an Elo + binomial series model with HLTV live game data. LoL uses an Elo + binomial series model with LoLEsports API data. Tennis uses a hierarchical point-game-set-match probability model with ATP/WTA Elo ratings. Soccer uses a Poisson goal model with Elo-adjusted scoring rates. Full model descriptions for these sports are outside the scope of this paper.[^2]
[^2]: The backtest in Section 5.1 covers only the 7 ESPN-based sports. The live results include esports and tennis models that use different data sources and model architectures.
The live win rate (62.5%) is 7.3 percentage points lower than the backtest win rate (69.8%). Several factors may explain this gap: (1) the live sports mix includes esports (CS2, LoL, Tennis) which are not part of the backtest, (2) real execution quality may be worse than the estimated costs, (3) the live period may represent different market conditions than the backtest period, and (4) with only 88 resolved trades, the 95% confidence interval on the live win rate is approximately 52-73%, meaning the difference may not be statistically significant.
5.3 Closing Line Value (CLV)
CLV measures whether the system consistently buys at prices below where the market eventually settles. A consistent positive CLV is widely treated as evidence of genuine edge rather than variance, because the metric resolves on each trade independently of whether the bet eventually wins or loses (Kaunitz, Zhong, & Kreiner, 2017).
We compute CLV per trade as closing_price_c − entry_price_c, where the closing price is the last non-terminal Polymarket midpoint observed before the market converges to its 0/100 resolution, and the entry price is the ask we paid at trade-open time. Coverage is n=686 measured trades out of 1,415 total settled trades (48.5%) after a one-shot historical backfill (April 2026, this revision) that pulled Polymarket price history for every settled trade with sufficient market data — up from n≈20 at the original publication.
Headline finding (read with the mechanical-bias caveat below before quoting). Across all 686 measured trades:
| Subset | n | Settled | Won | WR | 95% CI (Wilson) |
|---|---|---|---|---|---|
clv_c > 0 (beat the close) |
350 | 350 | 310 | 88.6% | [84.8%, 91.5%] |
clv_c ≤ 0 (lost the close) |
336 | 336 | 38 | 11.3% | [8.4%, 15.1%] |
The split is statistically significant by any reasonable test (the confidence intervals do not overlap and the difference is ~77 percentage points on n>300 per arm).
Important caveat (read before per-sport interpretation). The 88.6%/11.3% split is partly mechanical for live in-play markets: a trade that loses tends to converge toward 0c at settlement, which by construction produces negative CLV; a trade that wins converges toward 100c. The split is therefore an upper bound on the model's standalone forecasting edge in live markets and should not be read as "the model wins 89% of the time when it has edge." This caveat applies primarily to the live in-play sports in our dataset — NHL, MLB, NBA, CS2, LOL, and Soccer — where the bot enters during the game and the closing price is observed near settlement. The caveat applies more weakly to tennis (ATP, WTA) and to NCAAMB/NCAAWB at the entry points we use, where most positions are taken pre-match or very early in the match: the mechanical-convergence argument has less weight there because the time-to-close window is longer and post-entry score events have less direct influence on the closing price. A cleaner version of this measurement using micro-CLV at fixed time horizons (T+60s and T+180s, which de-couple CLV from eventual outcome) is in development; preliminary infrastructure is in place but the sample is not yet large enough to report separately.
Per-sport breakdown. The table below covers all 686 measured trades and is exhaustive. Per-sport 95% CIs on mean CLV use the standard error of the mean (SE = std(clv_c) / √n) with a z-based normal approximation (multiplier 1.96). A more conservative Student-t multiplier on n−1 degrees of freedom would widen the CIs slightly — by ~4% at n=31 (NBA), ~2% at n=91 (ATP, NHL), and below 1% at n=146 (CS2) — but does not change the verdict on any per-sport row. For n<10 the normal approximation is suspect and the row is flagged as preliminary. The "+CLV%" column is the fraction of trades in each sport with clv_c > 0, with a Wilson 95% CI.
| Sport | n | Mean CLV (c) | 95% CI on mean | +CLV% | Wilson CI on +CLV% | Overall WR | Verdict |
|---|---|---|---|---|---|---|---|
| ATP | 91 | +5.16 | [−2.06, +12.38] | 48.4% | [38.4%, 58.5%] | 49.5% | +CLV (mean CI overlaps zero) |
| LOL | 78 | +3.44 | [−3.28, +10.17] | 55.1% | [44.1%, 65.7%] | 48.7% | +CLV (mean CI overlaps zero) |
| NHL | 91 | +1.08 | [−6.04, +8.21] | 58.2% | [48.0%, 67.8%] | 62.6% | breakeven (small +CLV, CI overlaps zero) |
| MLB | 155 | −2.04 | [−8.09, +4.01] | 55.5% | [47.6%, 63.1%] | 60.6% | anomaly: high WR / negative mean CLV (see below) |
| Soccer | 35 | −2.12 | [−15.07, +10.83] | 42.9% | [28.0%, 59.1%] | 37.1% | −CLV (mean CI overlaps zero) |
| WTA | 41 | −2.67 | [−12.91, +7.56] | 43.9% | [29.9%, 59.0%] | 41.5% | −CLV (mean CI overlaps zero) |
| CS2 | 146 | −5.46 | [−10.70, −0.22] | 43.2% | [35.4%, 51.3%] | 40.4% | −CLV (mean CI excludes zero) |
| NBA | 31 | −7.87 | [−19.08, +3.34] | 51.6% | [34.8%, 68.0%] | 48.4% | −CLV (mean CI overlaps zero, n small) |
| NCAAMB | 4 | +17.75 | [+7.31, +28.19] | 100% | [51.0%, 100%] | 100% | preliminary (n<10) |
| NCAAWB | 4 | +6.82 | [−33.46, +47.11] | 75.0% | [30.1%, 95.4%] | 75.0% | preliminary (n<10) |
| TENNIS * | 5 | −5.12 | [−21.34, +11.10] | 40.0% | [11.8%, 76.9%] | 40.0% | legacy tag, n<10 |
| (untagged) | 5 | −4.04 | [−37.24, +29.16] | 60.0% | [23.1%, 88.2%] | 20.0% | legacy / unsorted |
| Total | 686 |
* The TENNIS row is the legacy combined ATP+WTA tag from before per-tour separation was introduced in the production system. Going forward, tennis trades are tagged as ATP or WTA directly. The (untagged) row is unresolved data-tagging from the earliest pipeline version. Both are kept in the table for completeness; neither is used to drive operational decisions.
Reading the table honestly. Of the eight sports with n ≥ 30, only CS2 has a 95% mean-CLV confidence interval that excludes zero (CS2: [−10.70, −0.22], n=146, implied two-sided p ≈ 0.04). All other sports — including ATP, LOL, NHL on the +CLV side and MLB, Soccer, WTA, NBA on the −CLV side — have mean-CLV CIs that overlap zero at the unadjusted 5% level. After Bonferroni correction for 8 simultaneous tests, the per-test threshold tightens to α ≈ 0.625% (z ≈ 2.73). At that threshold, none of the per-sport mean-CLV estimates — including CS2's — survive as statistically distinct from zero: CS2's implied z-statistic is approximately 2.04, well below the corrected 2.73 cutoff. The pointwise CS2 result remains the strongest per-sport evidence in the dataset, but it should not be presented as multiple-testing-robust on the current sample. The point estimates and signs are still informative — they are the inputs to operational decisions below — but per-sport mean-CLV magnitudes should be read as preliminary signals rather than confirmed conclusions until sample sizes grow.
The +CLV-proportion column (Wilson CI on n_pos / n) tells a similar story: most sports' Wilson CIs include 50%, indicating no statistically clear directional signal at the per-sport level. The headline 88.6%/11.3% split in the previous subsection is computed across all measured trades, where the much larger pooled sample (n=686) overwhelms the per-sport noise — but as cautioned above, that headline is partly mechanical and not directly comparable to the per-sport CIs here.
Operational decisions vs statistical conclusions. We treat negative mean-CLV point estimates on n ≥ 30 as a sufficient operational signal to pause real-money trading on a sport, even when the 95% CI on mean CLV overlaps zero. That is a deliberate asymmetric loss function: continuing to trade a sport with a negative point estimate has a real expected loss, while pausing it costs only the option value of trading until the data clarifies.
The five sports with n ≥ 30 and a negative mean CLV point estimate are MLB, Soccer, WTA, CS2, and NBA. We have currently paused real-money trading on four of those five — NBA, CS2, WTA, Soccer. We have not paused MLB despite it meeting the same numerical criterion, because MLB is the largest-sample sport (n=155), has the highest overall WR (60.6%), and exhibits the high-WR / negative-CLV anomaly described separately above; we judge the appropriate response there to be the development of micro-CLV measurement (T+60s and T+180s, which de-couple CLV from settlement convergence) rather than an immediate pause that would forgo the largest active sample. This is a discretionary deviation from the rule that we disclose explicitly. The four-paused-sport list spans a range of statistical evidence — even CS2's mean-CLV result, the strongest in the dataset, does not survive Bonferroni correction (see "Reading the table honestly" above); the other three pauses (NBA, WTA, Soccer) are based on point estimates that are not significant even unadjusted. We are choosing to act on point estimates rather than wait for confirmation; that asymmetric-loss reasoning is the justification, not the statistics.
Their model paths continue to compute predictions so CLV measurement continues; that data is in the public dataset at https://zenhodl.net/clv. The explicit reactivation criteria for each paused sport are documented at https://zenhodl.net/clv/repair.
We acknowledge a logical tension here. The mechanical-bias caveat above warns that negative CLV in live in-play markets is partly an artifact of how losing trades converge to 0c at settlement; three of the four paused sports (NBA, CS2, Soccer) are live in-play. The caveat does not dissolve the operational case for pausing — point estimates this negative are unlikely to be entirely mechanical, and a −5.46c mean CLV with a CI excluding zero on n=146 for CS2 is not noise — but it does mean the magnitude of the "true" forecasting deficit is uncertain in those sports. The micro-CLV measurement at fixed time horizons (T+60s, T+180s) is designed to break this ambiguity; until it is at sample size, the paused-sport list should be read as "we are choosing to pause on a partially-mechanical metric while the cleaner metric matures."
Failure-mode taxonomy. Among the 336 trades with clv_c ≤ 0, we tag the dominant failure cause using deterministic rules (full classifier source: core/clv_failure_tagger.py in the public repository):
| Failure mode | n | % of −CLV | Description |
|---|---|---|---|
bad_model |
228 | 67.9% | model probability was wrong; market moved away from the model after entry |
bad_blend |
83 | 24.7% | model-vs-market disagreement was large at entry, the market direction was right |
bad_clv_label |
22 | 6.5% | trade won, but CLV is negative because the live in-play measurement was contaminated by post-entry score events |
bad_timing |
3 | <1% | trade won but at a worse-than-close entry; small effect |
The dominance of bad_model (especially in CS2, NBA, WTA per the per-sport tags in /clv/failures.json) is consistent with the per-sport pause decisions: where the failures concentrate in the model itself, edge-threshold tuning will not recover the lost CLV; the model needs to be retrained or replaced before the sport can come back online.
Live data. The full per-sport CLV scorecard, edge-bucket breakdown, and failure-mode taxonomy are published with weekly updates at https://zenhodl.net/clv, with JSON and CSV downloads (/clv.json, /clv.csv, /clv/failures.json) under CC BY 4.0. The repair-status workflow showing which paused sports are candidates to reactivate, and the explicit criteria each must satisfy, is at https://zenhodl.net/clv/repair.
Implication for the §5.1 backtest result. §5.1 reports +2.4c per trade net of estimated costs and +5.9c gross across 2,625 backtest trades. The +CLV vs −CLV WR split documented above is consistent with that backtest representing genuine forecasting edge rather than variance noise — but the per-sport breakdown indicates that the aggregate average masks meaningful divergence: some sports (ATP, LOL, possibly NHL) appear to be +EV at the closing line; others (NBA, CS2, WTA, Soccer) appear to be −EV at the closing line on the live sample. Future revisions of this paper should report per-sport backtest results aligned with the per-sport CLV split rather than the aggregate. We flag this as the most actionable methodological gap in the v1 backtest.
Limitations.
- Coverage. Polymarket's price-history API does not always return non-terminal closing prices for resolved markets, particularly in fast-resolving live markets that converge to 0c or 100c rapidly. Coverage is 48.5% of all settled trades; older trades and very short-resolution markets are underrepresented. We disclose this limitation rather than restrict the dataset to the convenient subset.
- Small-sample sports. Five sports in the table above have
n<35. Per-sport conclusions for those sports are explicitly preliminary. - Mechanical bias on live in-play markets. As noted in the headline finding, the 89%/11% split is partly an artifact of how live in-play markets converge at settlement. Micro-CLV measurement (price observed at fixed seconds-since-entry) is the cleaner long-run methodology and is in development.
- Multiple-testing correction. The per-sport mean-CLV CIs in the table are computed pointwise. With 8 primary sports tested simultaneously, a Bonferroni-corrected experiment-wise α of 0.625% per test (z ≈ 2.73) would require tighter intervals than the displayed 95% intervals. As discussed in the "Reading the table honestly" paragraph above, even before correction only CS2 has a 95% mean-CLV CI excluding zero (z ≈ 2.04, p ≈ 0.04 unadjusted), and CS2 does not survive Bonferroni correction either — its z-statistic falls short of the 2.73 cutoff. No per-sport mean-CLV conclusion in this paper is multiple-testing robust on the current sample; the per-sport results are presented as operational signals rather than confirmed statistical findings.
6. Limitations and Risks
6.1 Backtest vs Live Performance Gap
The backtest assumes execution at the quoted ask price with estimated slippage. Real execution may be worse due to: - Insufficient market depth at the quoted price - Price movement between signal detection and order fill - Queue position effects in the order book - The backtest grade is "semi-realistic" — it uses real market snapshots but optimistic execution assumptions
6.2 Small Live Sample
90 bot-attributed live trades over approximately one month — 88 resolved at write time — is insufficient for strong statistical conclusions. At a 62.5% win rate on the 88 resolved trades, the 95% confidence interval for the true win rate is approximately 52–73%. The system could be profitable, breakeven, or mildly unprofitable at the true rate.
6.3 Model Degradation
Market efficiency tends to improve over time as more participants adopt quantitative approaches. The edge we observe may decay as: - More automated trading systems enter prediction markets - Market makers improve their pricing algorithms - Information transmission speed increases
Models require periodic retraining (every 2-4 weeks) to incorporate new game data and adapt to changing market conditions.
6.4 Regime Change
Structural changes in sports (rule modifications, season format changes) or markets (fee structure changes, regulatory actions) can invalidate historical patterns. The system's reliance on ESPN game state data means it is vulnerable to API changes or outages.
6.5 Execution Risk
On-chain execution on Polygon introduces blockchain-specific risks including network congestion, gas price spikes, and smart contract vulnerabilities. The system uses Fill-or-Kill orders to limit adverse selection but cannot eliminate all execution risk.
6.6 Threats to Validity
Internal validity: The temporal train/test split mitigates look-ahead bias, but the choice of edge thresholds (5-8c) and model hyperparameters were informed by preliminary analysis on overlapping data. We did not perform a fully blind, pre-registered evaluation.
External validity: Results are specific to Polymarket's market microstructure during the 2025-26 season. Generalization to other prediction markets (Kalshi, PredictIt), other time periods, or other sports leagues is not established.
Construct validity: Our edge metric (fair_wp - market_ask) assumes the model's probability is the 'true' probability. If the model is systematically biased in a direction that happens to correlate with profitable trades, the reported edge is illusory.
7. System Architecture
The system consists of: 1. Data Pipeline: Async ESPN polling (5s intervals) + Polymarket WebSocket (real-time) + multi-venue OddsAPI polling 2. Model Layer: 7 sport-specific XGBoost/LR ensemble models with isotonic calibration 3. Signal Engine: Edge detection with configurable thresholds, uncertainty gates, and staleness checks 4. Execution Layer: Polymarket CLOB via py-clob-client with FOK orders on Polygon 5. Monitoring: Circuit breakers, feed quality scoring, warm-start gates, and daily reconciliation agents
The entire system runs on commodity hardware (8GB VPS for the API, local machine for trade execution) and processes all 7 sports simultaneously.
8. Conclusion
We document that calibrated machine learning models can identify short-term pricing inefficiencies in sports prediction markets, and that the magnitude of that edge varies sharply across sports. The system achieves a 69.8% backtest win rate across 2,625 trades (+2.4c net per trade) and a 62.5% live win rate on the 88 resolved trades from a 90-trade live sample. The §5.3 CLV analysis on n=686 measured trades provides further empirical support: trades that beat the closing line win 88.6% [84.8%, 91.5%] of the time, trades that lose the closing line win 11.3% [8.4%, 15.1%]. We caution that this split is partly mechanical for live in-play markets (see §5.3 limitations); a cleaner micro-CLV measurement is in development.
The primary contribution is not the model architecture (which uses standard ML techniques) but the complete pipeline: real-time data ingestion, calibrated probability estimation, multi-venue price comparison, and automated execution. The v1.1 revision adds per-sport CLV measurement and a transparent sport-pause workflow that acts on persistently negative CLV rather than waiting for P&L variance to resolve. Per-sport CLV is published openly under CC BY 4.0 at https://zenhodl.net/clv, with the operational repair workflow at https://zenhodl.net/clv/repair. Both pages compound with this paper as new trades resolve; we view them as the empirical extension of §5.3 and §6.
Key open questions for future research include:
- Per-sport edge persistence. Do the +CLV sports in §5.3 (ATP, LOL, NHL) maintain their edge across multiple seasons, or does it decay on the same timescale we see globally?
- Live in-play CLV measurement. §5.3 uses final-close CLV, which is partially mechanical for in-play markets. A micro-CLV measure at fixed seconds-since-entry (T+60s, T+180s) would isolate forecasting edge from score-event timing. Recording infrastructure is in place; the sample is not yet at reportable size.
- Failure-mode-driven model repair. §5.3's failure tagging (
bad_model/bad_blend/bad_timing/bad_clv_label) suggests that different sports fail in different ways. Whether per-failure-mode interventions (model retraining forbad_model-dominant sports vs blend tuning forbad_blend-dominant sports) produce measurable CLV improvement is the natural next experiment. - Earlier detection. Can per-sport CLV decline reliably anticipate model degradation earlier than rolling P&L?
- Multi-venue. Does multi-venue execution routing improve net returns versus single-venue trading?
While the results are encouraging, we caution against over-interpreting them. The backtest, though statistically significant in aggregate, covers a single season with execution cost estimates rather than realized costs, and §5.3 shows that the aggregate average masks meaningful per-sport divergence. Of the eight primary sports in the live CLV dataset, only CS2 has a 95% mean-CLV CI that excludes zero — and that result is negative, indicating the model is currently selling closing-line value rather than capturing it on that sport. The point estimates suggest ATP, LOL, and NHL sit on the positive side of zero (mean CLV +5.16c, +3.44c, +1.08c respectively, with CIs all overlapping zero on current sample sizes), while MLB, Soccer, WTA, and NBA sit on the negative side (mean CLV −2.04c, −2.12c, −2.67c, −7.87c respectively, all with CIs overlapping zero except CS2). MLB is a notable anomaly — it has the largest per-sport sample (n=155), the highest overall WR (60.6%), and yet a slightly negative mean CLV; this is the strongest argument for the micro-CLV measurement work described in §5.3, since the high-WR / negative-CLV combination is consistent with the model winning on score-state outcomes but paying too much at entry, which final-close CLV cannot fully separate.
The most honest summary of our findings is: there appears to be a real but sport-dependent edge in live sports prediction markets for systems that process information faster than the median participant. The magnitude varies sharply across sports, only one per-sport mean-CLV result currently has a confidence interval excluding zero (CS2, on the negative side), and the aggregate edge is likely to decay over time. The CLV-driven sport-triage workflow described in §5.3 is our attempt to act on that variation in public rather than averaging it away — even where the statistical evidence is preliminary, the operational discipline of pausing point-estimate-negative sports has a defensible asymmetric expected-value rationale.
References
- Arrow, K.J. et al. (2008). "The Promise of Prediction Markets." Science, 320(5878).
- Borghesi, R. (2007). "The Home Team Weather Advantage and Biases in the NFL Betting Market." Journal of Economics and Business, 59(4).
- Croxson, K. & Reade, J.J. (2014). "Information and Efficiency: Goal Arrival in Soccer Betting." The Economic Journal, 124(575).
- Glickman, M.E. (1999). "Parameter Estimation in Large Dynamic Paired Comparison Experiments." Applied Statistics, 48(3).
- Kaunitz, L., Zhong, S. & Kreiner, J. (2017). "Beating the Bookies with Their Own Numbers." arXiv:1710.02824.
- Page, L. (2012). "It Ain't Over Till It's Over: Yogi Berra Bias on Prediction Markets." Economics Bulletin, 32(2).
- Sauer, R.D. (1998). "The Economics of Wagering Markets." Journal of Economic Literature, 36(4).
- Stern, H.S. (1991). "On the Probability of Winning a Football Game." The American Statistician, 45(3).
- Wolfers, J. & Zitzewitz, E. (2004). "Prediction Markets." Journal of Economic Perspectives, 18(2).
- Zadrozny, B. & Elkan, C. (2002). "Transforming Classifier Scores into Accurate Multiclass Probability Estimates." KDD '02.
How to Cite This Paper
If you reference this work in your own research, please use one of the following citation formats:
APA:
Evans, C. (2026). Exploiting short-term inefficiencies in sports prediction markets using calibrated win probability models (Version 1.1). ZenHodl Research. https://zenhodl.net/whitepaper
MLA:
Evans, Coy. "Exploiting Short-Term Inefficiencies in Sports Prediction Markets Using Calibrated Win Probability Models." Version 1.1, ZenHodl Research, Apr. 2026, zenhodl.net/whitepaper.
Chicago:
Evans, Coy. 2026. "Exploiting Short-Term Inefficiencies in Sports Prediction Markets Using Calibrated Win Probability Models" (Version 1.1). ZenHodl Research. https://zenhodl.net/whitepaper.
BibTeX:
@article{evans2026sportspm,
title={Exploiting Short-Term Inefficiencies in Sports Prediction Markets Using Calibrated Win Probability Models},
author={Evans, Coy},
journal={ZenHodl Research},
year={2026},
month={April},
note={Version 1.1, revised April 2026},
url={https://zenhodl.net/whitepaper}
}
Disclaimer: Past performance does not guarantee future results. Sports prediction market trading involves risk of loss. This paper presents research findings, not investment advice. All backtest results include estimated execution costs but may not reflect actual trading conditions. Live results have limited sample sizes and should not be extrapolated.