A few weeks ago I promised myself I wouldn't add CFB to the ZenHodl signal feed until I had honest backtested numbers showing the model beat ESPN WP. I ran the backtest last week and the numbers were not honest. They were negative.
This post is the diagnosis, the fix, and the new numbers — including the ones that used to make me look smarter than I was.
## The Original Backtest: -2.1 Cents Per Trade
Default configuration. 2024 season as the held-out test set. ESPN WP + 0.5c half-spread as the market-price proxy. Minimum 8-cent edge required before we even considered a trade.
```
OVERALL: 728 trades | WR: 53.6% (390W/338L)
Total P&L: $-15.35 | Avg P&L: -2.1c/trade
```
A 53.6% win rate sounds fine until you remember that every trade costs ~3c in fees + spread. To break even on a binary prediction market, you need to be right a lot more often than you're wrong — something like 55-57% minimum just to cover costs. We were nowhere near that.
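The break-even arithmetic is worth making explicit. For a $1 binary contract bought at price p, a win pays (1 − p) and a loss forfeits p, so with an all-in cost c per trade the required win rate works out to w = p + c. A minimal sketch (the ~3c cost is from above; the entry prices are illustrative):

```python
def breakeven_winrate(price: float, cost: float) -> float:
    """Win rate needed to break even on a $1 binary contract.

    A win pays (1 - price), a loss forfeits price, and `cost` is the
    all-in fees + spread per trade. Solving
        w * (1 - price) - (1 - w) * price - cost = 0
    gives w = price + cost.
    """
    return price + cost


if __name__ == "__main__":
    for p in (0.50, 0.55, 0.60):
        print(f"entry {p:.2f}: break-even WR = {breakeven_winrate(p, 0.03):.1%}")
```

At a 50c entry this gives 53%; the 55-57% figure quoted above presumably reflects typical entries somewhat above 50c, or all-in costs above 3c once slippage is counted.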
Worse, when I looked at 2023 as an additional out-of-sample season, the "winning" subsets of the strategy (specific quarters, specific edge bands, specific fair-probability ranges) all collapsed back to 50.7%. Classic overfitting to a single year.
Before I started building excuses ("it's only 728 trades, small sample"), I stopped and asked the question I should have asked first: what is this backtest actually measuring?
## The Real Question Behind the Backtest
The backtest simulates: "when our model disagrees with ESPN Win Probability by 8+ cents, does betting the disagreement make money?"
ESPN's in-game Win Probability is famously state-only. It looks at the scoreboard, the clock, down and distance, and field position. It does not know whether you're watching Georgia play Vanderbilt or Vandy play UConn. Same score state, same clock state, same WP — regardless of who is on the field.
That was the clue. I stared at our CFB model's feature importance and saw the same pattern:
| Feature | Importance |
|---|---|
| score_diff | 34.4% |
| score_diff_x_tf | 19.2% |
| score_diff_sq | 11.9% |
| pregame_wp | 8.6% |
| elo_diff | 6.0% |
| pace_diff | 0.0% |
| ortg_diff | 0.0% |
| drtg_diff | 0.0% |
Three quarters of the model was score_diff and its derivatives — the exact same signal ESPN WP uses. The three team-quality features (pace_diff, ortg_diff, drtg_diff) that would have given us an information advantage over ESPN were sitting at 0.0% importance.
Not because the model didn't want them. Because they were all zero.
## The One-Line Bug: Missing Team Priors File
When we train a win-probability model, we attach per-team rolling averages (points per game, points allowed per game, plays per game) as features. The training code looks for a file called team_stats_priors_{sport}.parquet. For NBA, NHL, and college basketball, those files existed. For CFB, we had:
```
WARNING - No team_stats_priors_CFB.parquet found, team stats will be 0.0
```
We had been training CFB models for months with three features silently zeroed out. The model had every incentive to find any signal in those features, but there was nothing to find. So it learned to lean even harder on score-diff — which is exactly the signal ESPN WP already encodes.
We were running a score-diff smoother against another score-diff smoother. The two were always going to converge, and any edge we measured was going to be noise around zero.
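A cheap guard against this failure mode is to fail the training run outright when a feature column comes back constant, instead of logging a warning and moving on. A sketch (the function and its usage are mine, not the actual pipeline's):

```python
import pandas as pd


def assert_features_alive(df: pd.DataFrame, cols: list[str]) -> None:
    """Raise if any feature column is constant -- e.g. silently
    zero-filled because an upstream priors file was missing."""
    dead = [c for c in cols if df[c].nunique(dropna=False) <= 1]
    if dead:
        raise ValueError(f"Constant feature columns (likely missing data): {dead}")
```

Had something like this run before training, the missing priors file would have been a hard error on day one instead of a warning nobody read.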
## The Fix: Derive Priors From Existing Data
The good news: we didn't need new data. We already had 5 years of play-by-play in the training parquets (2020–2024), which contain final scores for 4,313 games and 256 FBS teams. That's enough to compute running per-team averages within each season, shifted by one game so the current game can't see its own outcome.
The script reads the parquets, extracts final scores per game, pivots to one row per (game, team), and computes cumulative averages excluding the current game:
```python
# Per-team, within-season running totals. shift(1) excludes the current
# game from its own prior, so there is no look-ahead leakage.
grp["cum_pts_for"] = grp["pts_for"].shift(1).cumsum()
grp["cum_pts_against"] = grp["pts_against"].shift(1).cumsum()
grp["cum_plays"] = grp["plays"].shift(1).cumsum()
```
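Stitched together, the whole derivation is only a groupby away. A sketch under my assumptions (the column names, the sort key, and the 3-game minimum-history threshold are inferred from the UGA table below, not confirmed by the actual script):

```python
import pandas as pd


def add_priors(games: pd.DataFrame, min_games: int = 3) -> pd.DataFrame:
    """One row per (game, team) in; leakage-free running averages out.

    Assumed columns: season, team, game_id, pts_for, pts_against, plays.
    """
    games = games.sort_values(["season", "team", "game_id"]).copy()
    g = games.groupby(["season", "team"])
    games["prior_games"] = g.cumcount()  # games already played this season
    for col in ("pts_for", "pts_against", "plays"):
        # shift(1) before cumsum so the current game's result is excluded
        cum = g[col].transform(lambda s: s.shift(1).cumsum())
        games[f"prior_{col}"] = (cum / games["prior_games"]).fillna(0.0)
    # Zero out priors until a team has enough history
    prior_cols = [f"prior_{c}" for c in ("pts_for", "pts_against", "plays")]
    games.loc[games["prior_games"] < min_games, prior_cols] = 0.0
    return games
```

Dividing the shifted cumulative sums by `prior_games` turns totals into per-game averages, which is why `prior_games` shows up as its own column in the output.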
UGA's first five games of the 2024 season show how the priors warm up:
| game_id | prior_pace | prior_ortg | prior_drtg | prior_games |
|---|---|---|---|---|
| 401628323 | 0.00 | 0.00 | 0.00 | 0 |
| 401628339 | 0.00 | 0.00 | 0.00 | 1 |
| 401628354 | 0.00 | 0.00 | 0.00 | 2 |
| 401628374 | 80.67 | 31.67 | 6.00 | 3 |
| 401628381 | 85.00 | 32.25 | 14.75 | 4 |
The first three games return 0.0 (not enough history yet — honest). From game 4 onward, the priors reflect what the team had actually done up to that point and nothing more. No leakage, no hindsight.
The output parquet is 8,626 rows (one per team per game), 256 teams, 5 seasons, 68.5% of rows with non-zero priors.
## Retraining: Three Dead Features Come Alive
Same model architecture, same training/calibration/test split (2020-22 train, 2023 cal, 2024 test). Only difference: the three team-stats features now have real values.
New feature importance:
| Feature | Before | After |
|---|---|---|
| score_diff | 34.4% | 27.5% |
| score_diff_x_tf | 19.2% | 23.7% |
| score_diff_sq | 11.9% | 15.4% |
| pregame_wp | 8.6% | 5.4% |
| drtg_diff | 0.0% | 3.3% |
| pace_diff | 0.0% | 3.1% |
| ortg_diff | 0.0% | 3.0% |
The three previously-dead features now account for 9.3% of the model's decision-making. score_diff's own weight drops from 34.4% to 27.5%; the score-diff family as a whole still dominates (65.5% before, 66.6% after), but it now shares signal with team quality.
## The New Backtest
Same command, same config, same held-out season. No changes other than the model itself.
```
BEFORE: 728 trades | WR: 53.6% | Avg P&L: -2.1c/trade (2024)
AFTER:  743 trades | WR: 71.2% | Avg P&L: +11.8c/trade (2024)
```
And the two-season cross-validation:
```
2023+2024: 1,475 trades | WR: 64.8% | Avg P&L: +6.0c/trade
2023:        732 trades | WR: 58.3% | Avg P&L:  0.0c/trade
2024:        743 trades | WR: 71.2% | Avg P&L: +11.8c/trade
```
Breakdown by edge bucket:
| Edge band | Trades | WR | Avg P&L |
|---|---|---|---|
| 5-10c | 338 | 65.7% | +2.1c |
| 10-15c | 573 | 66.3% | +4.5c |
| 15-20c | 298 | 65.8% | +8.4c |
| 20-30c | 266 | 59.4% | +11.3c |
Every edge bucket is profitable. Every fair-WP bucket is profitable. Both quarters (Q2 and Q3) are profitable. Both sides (home and away favorites) are profitable.
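For anyone reproducing this kind of breakdown, the bucketing is a one-liner with `pd.cut`. A sketch (the trades frame and its `edge`/`pnl` columns are hypothetical stand-ins for the backtest output, both in cents):

```python
import pandas as pd


def by_edge_band(trades: pd.DataFrame) -> pd.DataFrame:
    """Group trades into edge bands and report count, win rate, avg P&L.

    Assumes `edge` and `pnl` columns, both in cents.
    """
    bands = pd.cut(
        trades["edge"],
        bins=[5, 10, 15, 20, 30],
        labels=["5-10c", "10-15c", "15-20c", "20-30c"],
    )
    return trades.groupby(bands, observed=True).agg(
        trades=("pnl", "size"),
        wr=("pnl", lambda s: (s > 0).mean()),
        avg_pnl=("pnl", "mean"),
    )
```

The same groupby, pointed at quarter or fair-WP columns instead of edge bands, produces the other cuts mentioned above.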
This is what I wanted to see before shipping CFB.
## One Honest Caveat
The +11.8c/trade is measured against ESPN WP plus a 0.5c half-spread. That's a semi-realistic benchmark (see our backtest reliability grades), not the real Polymarket or Kalshi market price. The real market prices do reflect team quality — so when CFB moneyline trading goes live, the realized edge will be smaller than the benchmark number.
I don't know exactly how much smaller yet. We had only 108 real CFB market snapshots in our enriched data from the 2025 bowl season — not enough to run the full moneyline-bot methodology. As the 2026 season plays, we'll collect live-market data and rerun the backtest against it. That will be the next update.
## What This Means for Customers
- CFB is now wired into the API under `/v1/edges?sport=CFB`. The endpoint is ready; it will start returning signals when the 2026 season begins in August.
- Signals will not go into the live Polymarket trading rotation on day one — we want real-market calibration data from the first few weeks of games before risking capital against them.
- The track-record page will label CFB trades as "paper / semi-realistic" until the first real-market trades settle. When a trade shows up on the public ledger with a PolygonScan link, that's when we trust it.
If you build something against these signals, I'd love to know what you learn before the season kicks off. Every calibration result we get back makes the next iteration better.
## The Bigger Lesson
The model wasn't broken. The backtest wasn't broken. What was broken was a missing file that turned three features into dead weight. It sat there for months, giving us plausible-looking predictions that happened to replicate ESPN WP almost exactly.
Two takeaways for anyone building sports prediction models:
- If your feature importance report shows 0.0% on a feature you spent effort designing, find out why. It's almost never because the feature is useless. It's almost always because the data isn't there.
- Always ask what your benchmark knows. A backtest that measures you against a state-only baseline will miss any signal that depends on team quality. A backtest that measures you against a market that already prices in team quality will miss any signal that depends on game-state exploitation. Know which one you're running.
The fix took about 90 minutes end-to-end. Most of it was the diagnosis. The code was 100 lines.