We run five trading bots on Polymarket. Last week, we audited every one of them. Each had problems we didn't expect — and the fixes had nothing to do with model accuracy.
This is what we found.
CS2: 78% Backtest, 45% Live
The biggest gap of any bot we run. The backtest said 78.1% win rate at +32.8 cents per trade. Live performance was 45.5% win rate and -2.0 cents per trade across 33 resolved trades, all executed through the Polymarket CLOB API.
We dissected every losing trade and found five distinct problems:
Slippage wasn't modeled. The backtest assumes you fill at the ask. Live, our average slippage was 4.9 cents — and 5 trades had more than 10 cents of slippage. All 5 lost.
The underdog trap. Buying teams at 35-40 cents (heavy underdogs) had a 17% win rate. The model said 60% fair probability. The market said 35%. The market was right almost every time, because the model was using stale Elo ratings on tier-2 teams that had reshuffled rosters.
Low-edge noise. Trades with less than 10 cents of detected edge had a 38% win rate and lost 109 cents total. After slippage and the 2-cent fee, these "edges" were negative-EV.
High-entry asymmetric risk. Buying at 70-80 cents means roughly 25 cents of upside against 75 cents of downside. At those prices you need about a 75% win rate just to break even, so a model that wins 55% of the time bleeds money. Our 70-85 cent entry bucket lost 81 cents total.
Stale Elo ratings. The Elo difference bucket of +100 to +200 (our team rated significantly higher) had a 0% win rate on 5 trades. The Elo hadn't been rebuilt in months.
The fixes:
- Raised min_edge from 4c to 10c
- Raised min_fair from 50c to 55c
- Capped max_entry at 70c
- Added an 8-cent max_slippage gate
- Rebuilt Elo from 4,350 matches via bo3.gg (311 teams, up from 258)
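In code, those gates reduce to a handful of comparisons. A minimal sketch, assuming all prices are in cents; the function and field names are illustrative, not our production schema:

```python
# Post-audit CS2 gates (values from the fixes above).
GATES = dict(min_edge_c=10.0, min_fair_c=55.0, max_entry_c=70.0, max_slippage_c=8.0)

def passes_gates(fair_c: float, ask_c: float, est_slippage_c: float, g=GATES) -> bool:
    """True only if a candidate trade clears every gate; one failure blocks it."""
    edge_c = fair_c - ask_c
    return (edge_c >= g["min_edge_c"]                    # low-edge noise filter
            and fair_c >= g["min_fair_c"]                # underdog-trap filter
            and ask_c <= g["max_entry_c"]                # asymmetric-risk cap
            and est_slippage_c <= g["max_slippage_c"])   # slippage gate
```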
Applying these filters retroactively to the same 33 live trades: 11 would have passed. 6 wins, 5 losses, 55% win rate, +40 cents. The filters turned -67c into +40c by blocking 22 trades that collectively lost 107 cents.
NBA: The Missing Elo Bug
Our NBA bot had been running with zero Elo teams loaded. A bug in the training pipeline lost the Elo dictionary during a DataFrame split. The model still ran — it just got 0.0 for every team's Elo difference.
The fix was a one-line cache (_ELO_CACHE) that preserves the Elo dict through pipeline transformations (see the sketch below). After the fix:
- 42 NBA teams loaded with proper ratings
- Calibration improved 44% (ECE 0.103 → 0.057)
- Backtested at 69.3% win rate, +6.2 cents per trade across 218 trades
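Here's a minimal sketch of that cache pattern. The function and column names are illustrative assumptions; the idea is just that a module-level dict survives DataFrame splits that drop attached metadata.

```python
# Module-level cache: survives DataFrame splits/copies that drop metadata.
_ELO_CACHE: dict[str, float] = {}

def add_elo_features(df, elo_by_team: dict[str, float] | None = None):
    """Attach elo_diff, falling back to the cache when the dict was lost upstream."""
    global _ELO_CACHE
    if elo_by_team:
        _ELO_CACHE = elo_by_team  # refresh whenever real ratings are passed in
    home = df["home_team"].map(_ELO_CACHE)
    away = df["away_team"].map(_ELO_CACHE)
    # The original bug: an empty dict here meant elo_diff was 0.0 for every row.
    df["elo_diff"] = home.fillna(1500.0) - away.fillna(1500.0)
    return df
```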
We also added dynamic edge thresholds. Early in a game when model uncertainty is 18.7c, we require larger edges. Late in a game when uncertainty drops to 4c, we accept smaller edges.
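A sketch of the dynamic threshold, assuming a simple linear interpolation between the two uncertainty regimes. The uncertainty endpoints come from the numbers above; the required-edge endpoints (6c and 14c) are illustrative, not our tuned values.

```python
def required_edge_c(uncertainty_c: float,
                    unc_lo: float = 4.0, unc_hi: float = 18.7,
                    edge_lo: float = 6.0, edge_hi: float = 14.0) -> float:
    """Demand a larger edge when the model is less certain.
    Linearly maps uncertainty in [unc_lo, unc_hi] to a required edge
    in [edge_lo, edge_hi], clamped at both ends. All values in cents."""
    t = (uncertainty_c - unc_lo) / (unc_hi - unc_lo)
    t = min(max(t, 0.0), 1.0)
    return edge_lo + t * (edge_hi - edge_lo)

# Early game (uncertainty ~18.7c) -> require ~14c of edge
# Late game  (uncertainty ~4c)    -> accept  ~6c of edge
```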
MLB: The Silent Skip
The MLB model was being silently skipped at startup because of a stale SHA256 hash file. The model existed, the Elo ratings were there, but the API server's integrity check was rejecting it. No error, no warning — MLB just wasn't in the loaded sports list.
Fix: regenerated the hash, retrained with 96 new games from 2026 opening week (6,808 total games), and recomputed Elo for all 30 MLB teams. The bot went 5/5 on its first day live.
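The deeper fix is making the integrity check loud. A minimal sketch of the pattern, assuming a plain SHA256-of-file check; the names are illustrative:

```python
import hashlib
import logging
from pathlib import Path

log = logging.getLogger("model_loader")

def verify_model(model_path: Path, hash_path: Path) -> bool:
    """Compare the model file's SHA256 to the stored hash, and complain
    loudly on mismatch instead of silently dropping the sport."""
    expected = hash_path.read_text().strip()
    actual = hashlib.sha256(model_path.read_bytes()).hexdigest()
    if actual != expected:
        # This used to be a silent skip. Now it's impossible to miss.
        log.error("Model %s failed integrity check (expected %s..., got %s...). "
                  "If the model was retrained, regenerate the hash file.",
                  model_path.name, expected[:12], actual[:12])
        return False
    return True
```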
The 2026 opening week shifted Elo ratings significantly:
- Yankees: 1622 → 1588 (slow start)
- Brewers: 1545 → 1573 (strong start)
- Dodgers: 1630 → 1580
Early-season Elo shifts matter because the market hasn't fully adjusted yet.
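For context on why a handful of games can move a rating 30-50 points, here's the standard Elo update; the K-factor of 20 is an assumption, not our production value.

```python
def elo_update(r_winner: float, r_loser: float, k: float = 20.0):
    """One game of the standard Elo update: expected score from the
    logistic curve, then shift both ratings by K * (actual - expected)."""
    expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
    delta = k * (1.0 - expected_win)  # larger shift when the winner was the underdog
    return r_winner + delta, r_loser - delta

# A favorite losing a few early games sheds rating fast:
# each upset loss costs more than an expected win earns back.
```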
LoL: The Unused ML Pipeline
The LoL bot had a full XGBoost ML pipeline available — gold differential, kills, towers, dragons, barons — but was running in Elo-only mode. The training script existed, the data was there, the model just wasn't being loaded.
After enabling the ML model:
- Brier score improved 47.8% (0.256 → 0.134)
- AUC improved from 0.575 to 0.890
- Gold difference alone accounts for 39% of feature importance
A team with +5,000 gold at 20 minutes now correctly predicts 82% win probability instead of the Elo-only estimate of ~55%.
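A minimal sketch of this kind of pipeline; the feature names, label column, and hyperparameters are illustrative assumptions, not our exact training script.

```python
import xgboost as xgb
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.model_selection import train_test_split

# In-game state features at a snapshot; names are illustrative.
FEATURES = ["gold_diff", "kill_diff", "tower_diff",
            "dragon_diff", "baron_diff", "elo_diff", "game_minute"]

def train_win_model(df):
    """Train an XGBoost win-probability model on in-game snapshots."""
    X, y = df[FEATURES], df["won"]
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model = xgb.XGBClassifier(n_estimators=300, max_depth=4,
                              learning_rate=0.05, eval_metric="logloss")
    model.fit(X_tr, y_tr)
    p = model.predict_proba(X_te)[:, 1]
    print(f"Brier: {brier_score_loss(y_te, p):.3f}  AUC: {roc_auc_score(y_te, p):.3f}")
    return model
```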
Tennis: Wrong Serve Rates for Half the Matches
The tennis bot was using ATP serve rates for WTA matches. WTA hard court serve win rate is 58% vs ATP's 64% — a 6-point difference that compounds through the hierarchical model (point → game → set → match).
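To see the compounding concretely, here's the standard closed-form probability that the server wins a game given point-win probability p. This is textbook tennis math, not our full hierarchical model:

```python
def game_win_prob(p: float) -> float:
    """P(server wins the game) given point-win probability p:
    win 4-0, 4-1, or 4-2 outright, or reach deuce and win two in a row."""
    q = 1.0 - p
    from_deuce = (p * p) / (1.0 - 2.0 * p * q)
    return p**4 * (1 + 4*q + 10*q**2) + 20 * p**3 * q**3 * from_deuce

# game_win_prob(0.64) ≈ 0.81 (ATP)  vs  game_win_prob(0.58) ≈ 0.69 (WTA):
# a 6-point gap at the point level is already ~12 points at the game level,
# and it keeps widening through set and match.
```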
The fix: split the serve-rate tables by tour, and add set-dominance momentum (6-0 sweep = +4% win-probability boost, tiebreak = no adjustment, comeback = +2% bonus).
The Meta-Lesson
We expected model accuracy to be the bottleneck. It wasn't. Every model was fundamentally fine. The problems were all in execution, configuration, or silent failures:
- A stale hash file
- A lost variable in a DataFrame split
- A model file that wasn't being loaded
- Population-specific calibration that wasn't applied
- Filters tuned for backtest assumptions that didn't survive live trading
The fix pattern was the same for every bot: look at the live trades, segment by every variable you can think of (edge size, entry price, Elo difference, slippage, time of day), find the buckets that are losing money, and either fix the root cause or filter them out.
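In practice that segmentation is a few lines of pandas over the resolved-trade log. A sketch, with illustrative column names:

```python
import pandas as pd

def bucket_report(trades: pd.DataFrame, col: str, bins: list) -> pd.DataFrame:
    """Win rate and total PnL per bucket of one variable.
    Run this for every variable you track and look for the red buckets."""
    g = trades.groupby(pd.cut(trades[col], bins), observed=True)
    return pd.DataFrame({
        "trades": g.size(),
        "win_rate": g["won"].mean(),
        "pnl_c": g["pnl_c"].sum(),
    })

# e.g. the buckets that drove the CS2 fixes:
# bucket_report(trades, "entry_c", [0, 35, 50, 70, 85, 100])
# bucket_report(trades, "edge_c",  [0, 5, 10, 20, 100])
```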
You can't ship a backtest. You ship a system, and the system has to survive contact with reality.
All five bots are taught in our course. Live results across every sport are public on our results page with on-chain transaction verification.