# We Retrained 5 Trading Bots in One Week — What Changed

## The Problem

We run 5 automated trading bots on Polymarket across 8 sports. They were all making money in backtests. Then we looked at the live results.

CS2 bot: 45% win rate, losing money. NBA bot: Elo ratings were empty — 0 teams loaded. MLB model was being silently skipped due to a stale hash file. LoL bot was running Elo-only despite having a full ML pipeline available. Tennis bot had no WTA-specific calibration.

In one week, we audited every bot, found the gaps between backtest and live performance, and fixed them. This is what we learned.

## CS2: The Biggest Gap (78% Backtest → 45% Live)

The CS2 bot had the widest gap between backtested and live performance of any bot we've built.

Backtest: 32 trades, 78.1% win rate, +32.8 cents per trade. Tested against real Polymarket snapshot prices. Looked great.

Live: 33 trades, 45.5% win rate, -2.0 cents per trade. Losing money.

We dissected every live trade and found five problems:

**Problem 1: Slippage wasn't modeled.** The backtest assumes you fill at the ask price. Live, our average slippage was 4.9 cents, with 5 trades exceeding 10 cents. Those 5 high-slippage trades were all losses. When you're targeting 4-cent edges and paying 10+ cents in slippage, you're guaranteed to lose.

**Problem 2: The underdog trap.** We were buying teams priced at 35-40 cents — heavy underdogs. Our Elo model said they had a 60% chance, the market said 35%. The market was right. Our 0-40 cent entry bucket had a 17% win rate. The model was overconfident on teams it didn't have enough data about.

**Problem 3: Low-edge noise.** Trades with less than 10 cents of detected edge had a 38% win rate and lost 109 cents total. These thin edges get completely eaten by slippage and execution latency in live trading. In the backtest, they work because there's no slippage.

**Problem 4: High-entry asymmetric risk.** Buying at 70-80 cents means your upside is 20-30 cents but your downside is 70-80 cents. Even at a 55% win rate this is negative expected value: at a 75-cent entry, 0.55 × 25 − 0.45 × 75 = −20 cents per trade. Our 70-85 cent entry bucket was -81 cents.

**Problem 5: Stale Elo ratings.** The Elo difference bucket of +100 to +200 (our team rated significantly higher) had a 0% win rate on 5 trades. The Elo ratings hadn't been updated in months, and mid-tier CS2 rosters shuffle constantly.

**The fixes:** We raised the minimum edge from 4 cents to 10 cents. Set the minimum model fair price to 55 cents (filters out underdog bets). Capped maximum entry at 70 cents (eliminates asymmetric risk). Added a maximum slippage gate of 8 cents (rejects any fill worse than 8 cents above the signal price). Rebuilt Elo ratings from 4,350 matches, covering 311 teams.
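For concreteness, here's a minimal sketch of how the four gates compose, with all prices in cents. The names and signature are illustrative, not the production code:

```python
# Entry gates for the CS2 bot, as described above. All prices in cents.

MIN_EDGE = 10        # was 4: thin edges get eaten by slippage
MIN_FAIR_PRICE = 55  # model fair price floor: filters out underdog bets
MAX_ENTRY = 70       # entry cap: avoids asymmetric risk at high prices
MAX_SLIPPAGE = 8     # reject any fill worse than 8 cents above the signal

def passes_gates(fair_price: float, signal_price: float, fill_price: float) -> bool:
    """Return True only if a candidate trade clears every gate."""
    edge = fair_price - signal_price      # detected edge at signal time
    slippage = fill_price - signal_price  # execution cost vs. the signal
    return (
        edge >= MIN_EDGE
        and fair_price >= MIN_FAIR_PRICE
        and fill_price <= MAX_ENTRY
        and slippage <= MAX_SLIPPAGE
    )
```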

**Retroactive test:** Applying these filters to the live trades retroactively turned -67 cents into +40 cents. The 22 blocked trades collectively lost 107 cents. The 11 that passed had a 55% win rate and positive P&L.

Then we went further. We replaced the HLTV data source (which kept getting blocked by Cloudflare) with bo3.gg, which provides live economy data — equipment values, buy phases, round-by-round scores. We built a combined model that uses map-specific CT win rates (Nuke 57%, Dust2 51%), a 4-tier economy system, and conditional dampening based on data quality.
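The shape of that combined model, as a rough sketch: a map-specific CT prior, shifted by an economy-tier adjustment, then dampened toward a coin flip when the live feed is incomplete. Only the Nuke and Dust2 CT rates below come from our data; the tier thresholds, adjustments, and linear dampening are placeholder assumptions.

```python
# Hypothetical sketch of the combined CS2 round-win model.

MAP_CT_WIN_RATE = {"nuke": 0.57, "dust2": 0.51}  # other maps omitted here

def economy_tier(equipment_value: int) -> str:
    """Classify a team's buy into one of 4 tiers (thresholds are assumed)."""
    if equipment_value < 5_000:
        return "eco"
    if equipment_value < 15_000:
        return "semi_buy"
    if equipment_value < 25_000:
        return "buy"
    return "full_buy"

TIER_ADJUST = {"eco": -0.20, "semi_buy": -0.08, "buy": 0.00, "full_buy": 0.05}

def ct_round_win_prob(map_name: str, ct_equipment: int, data_quality: float) -> float:
    """data_quality in [0, 1]: 1.0 = full bo3.gg economy feed, 0.0 = none."""
    prior = MAP_CT_WIN_RATE.get(map_name, 0.50)
    raw = prior + TIER_ADJUST[economy_tier(ct_equipment)]
    # Conditional dampening: shrink toward 50% as data quality degrades.
    return 0.5 + (raw - 0.5) * data_quality
```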

## NBA: The Missing Elo Bug

The NBA bot was running with 0 Elo teams loaded. The Elo ratings were being computed during training but lost during a DataFrame split operation. The model used `elo_diff` as a feature but was getting 0.0 for every game.

The fix was a one-line cache that preserves the Elo dictionary through the training pipeline. After the fix: 42 NBA teams with proper ratings, calibration improved 44% (ECE from 0.103 to 0.057), and the backtested trading results jumped to 69.3% win rate at +6.2 cents per trade across 218 trades.
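The pattern behind the fix, schematically (the column names and rolling-Elo helper here are illustrative): compute Elo as you iterate the games, then persist the final ratings dict somewhere the serving path can reach, before any split can drop it.

```python
import pandas as pd

ELO_CACHE: dict[str, float] = {}  # the fix: ratings survive the pipeline here

def compute_elo_features(df: pd.DataFrame, k: float = 20.0):
    """Roll Elo chronologically; return (df with pregame elo_diff, final ratings)."""
    ratings: dict[str, float] = {}
    diffs = []
    for row in df.itertuples():  # assumes columns: home, away, home_won
        h = ratings.setdefault(row.home, 1500.0)
        a = ratings.setdefault(row.away, 1500.0)
        diffs.append(h - a)  # pregame difference: this is the model feature
        exp_home = 1.0 / (1.0 + 10 ** ((a - h) / 400.0))
        ratings[row.home] = h + k * (row.home_won - exp_home)
        ratings[row.away] = a - k * (row.home_won - exp_home)
    return df.assign(elo_diff=diffs), ratings

def train(df: pd.DataFrame):
    df, ratings = compute_elo_features(df)
    ELO_CACHE.update(ratings)  # persist BEFORE the split that was losing them
    train_df = df.sample(frac=0.8, random_state=0)
    test_df = df.drop(train_df.index)
    return train_df, test_df
```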

We also added dynamic edge thresholds — early in a game when model uncertainty is 18.7 cents, we require larger edges. Late in a game when uncertainty drops to 4 cents, we accept smaller edges. This matches the natural confidence progression of in-game prediction.
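A sketch of that schedule, linearly interpolating between the two uncertainty anchors; the required edges at each anchor are assumed values, not the production numbers.

```python
def required_edge(uncertainty_cents: float) -> float:
    """Scale the minimum edge with in-game model uncertainty (all in cents)."""
    lo_u, hi_u = 4.0, 18.7        # late-game vs. early-game uncertainty anchors
    lo_edge, hi_edge = 5.0, 15.0  # assumed edge requirements at those anchors
    t = (uncertainty_cents - lo_u) / (hi_u - lo_u)
    t = min(max(t, 0.0), 1.0)     # clamp outside the anchor range
    return lo_edge + t * (hi_edge - lo_edge)
```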

## MLB: The Silent Skip

The MLB model was being silently skipped at startup because of a stale SHA256 hash file. The model existed, the Elo ratings were there, but the API server wasn't loading it. No error, no warning — just missing from the sports list.
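The guard now fails loudly instead. A minimal version of the check, with hypothetical paths and logging setup:

```python
import hashlib
import logging
from pathlib import Path

log = logging.getLogger("model_loader")

def load_model_bytes(model_path: Path, hash_path: Path) -> bytes | None:
    blob = model_path.read_bytes()
    actual = hashlib.sha256(blob).hexdigest()
    expected = hash_path.read_text().strip()
    if actual != expected:
        # The old code returned None here with no message, so the model just
        # vanished from the sports list. A stale hash is now impossible to miss.
        log.error("Model hash mismatch for %s: expected %s, got %s. "
                  "Regenerate %s after retraining.",
                  model_path, expected, actual, hash_path)
        return None
    return blob
```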

We regenerated the hash, retrained with 96 new games from the 2026 opening week (6,808 total games), and recomputed Elo for all 33 MLB teams. The model went live on April 4 and went 5 for 5 on its first day.

MLB Elo shifts from the 2026 opening week: Yankees dropped from 1622 to 1588 (slow start), Brewers climbed from 1545 to 1573 (strong start), Dodgers from 1630 to 1580. These early-season shifts matter because the market hasn't fully adjusted yet.

## LoL: The Unused ML Pipeline

The LoL bot had a full ML pipeline available — XGBoost trained on gold difference, kills, towers, dragons, and barons — but was running in Elo-only mode. The ML model improved Brier score by 47.8% (from 0.256 to 0.134) and AUC from 0.575 to 0.890.

The difference: a team with +5,000 gold at 20 minutes now correctly predicts 82% win probability instead of the Elo-only estimate of ~55%. Gold difference alone accounts for 39% of the model's feature importance.
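A minimal sketch of that pipeline, assuming frame-level training rows with the five features named above; the hyperparameters and column names are illustrative.

```python
import pandas as pd
import xgboost as xgb

FEATURES = ["gold_diff", "kill_diff", "tower_diff", "dragon_diff", "baron_diff"]

def train_lol_model(df: pd.DataFrame) -> xgb.XGBClassifier:
    """Fit an in-game win-probability model on frame-level snapshots."""
    model = xgb.XGBClassifier(
        n_estimators=300, max_depth=4, learning_rate=0.05,
        objective="binary:logistic",
    )
    model.fit(df[FEATURES], df["blue_won"])
    return model

# A +5,000 gold lead at 20 minutes, everything else even: the trained model
# maps this to ~0.82 where Elo alone said ~0.55.
snapshot = pd.DataFrame([{"gold_diff": 5000, "kill_diff": 0, "tower_diff": 0,
                          "dragon_diff": 0, "baron_diff": 0}])
# win_prob = model.predict_proba(snapshot)[0, 1]
```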

## Tennis: WTA Calibration

The tennis bot was using ATP serve rates for WTA matches. WTA hard court serve win rate is 58% vs ATP's 64%. This 6-point difference compounds through the hierarchical model (point → game → set → match) and was making the model overconfident on WTA favorites.
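The compounding is easiest to see at the first rung of the hierarchy. Below is the textbook closed form for point → game (not necessarily the bot's exact implementation): ATP's 64% per-point rate gives roughly an 81% hold rate, WTA's 58% gives roughly 69%, so a 6-point input gap becomes about a 12-point gap after one level.

```python
def p_hold_serve(p: float) -> float:
    """P(server wins the game) given per-point serve-win probability p."""
    q = 1.0 - p
    deuce = p**2 / (1.0 - 2.0 * p * q)  # win from deuce
    # Win to 0, to 15, to 30, or reach deuce (3-3) and win from there.
    return p**4 * (1.0 + 4.0 * q + 10.0 * q**2) + 20.0 * p**3 * q**3 * deuce

print(p_hold_serve(0.64))  # ~0.81 (ATP hard court)
print(p_hold_serve(0.58))  # ~0.69 (WTA hard court)
```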

We also added set dominance momentum: a 6-0 set win gives a +4% win probability boost (the winning player is clearly in form), a tiebreak set gives no adjustment (the players are evenly matched), and a come-from-behind set win gives a +2% bonus (momentum shift).
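As a sketch, those rules reduce to a small lookup (the signature and comeback flag are illustrative):

```python
def momentum_adjust(winner_games: int, loser_games: int,
                    came_from_behind: bool) -> float:
    """Win-probability adjustment for the player who took the last set."""
    if (winner_games, loser_games) == (6, 0):
        return 0.04  # 6-0: clearly in form
    if (winner_games, loser_games) == (7, 6):
        return 0.00  # tiebreak: evenly matched, no adjustment
    if came_from_behind:
        return 0.02  # comeback set win: momentum shift
    return 0.00
```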

## The Meta-Lesson

The gap between backtest and live trading is always larger than you expect, and it's always caused by something different than you'd guess.

We expected model accuracy to be the bottleneck. It wasn't. The models were fine. The problems were all in execution: slippage, data staleness, silent failures, missing calibration for specific subpopulations.

The fix pattern was the same for every bot: look at the actual live trades, segment by every variable you can think of (edge size, entry price, Elo difference, slippage, sport, time of day), find the buckets that are losing money, and either fix the root cause or filter them out.
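In pandas terms the segmentation pass is a few lines. This assumes a trade-log DataFrame with one row per live trade; the column names are hypothetical.

```python
import pandas as pd

def losing_buckets(trades: pd.DataFrame, col: str, bins) -> pd.DataFrame:
    """Bucket live trades by one variable and surface the buckets losing money."""
    groups = trades.groupby(pd.cut(trades[col], bins), observed=True)
    summary = groups["pnl_cents"].agg(["count", "sum", "mean"])
    summary["win_rate"] = groups["won"].mean()
    return summary[summary["sum"] < 0]

# e.g. losing_buckets(trades, "entry_price", bins=[0, 40, 55, 70, 85, 100]),
# then repeat for edge size, Elo difference, slippage, time of day, ...
```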

Five bots, five different problems, five different fixes. But one lesson: you can't ship a backtest. You ship a system, and the system has to survive contact with reality.

---

*Every bot described here is taught in our course at zenhodl.net/course. Live results are publicly auditable at zenhodl.net/results.*
