Every profitable-looking backtest has the same problem: it probably is not real. The gap between backtested returns and live returns in sports betting is enormous, and it always goes in one direction — your backtest is too optimistic.
This guide shows you how to build a backtest that does not lie to you using pandas and numpy.
The Minimum Viable Backtest
import pandas as pd
import numpy as np

def backtest(predictions, edge_threshold=0.08, max_bet_pct=0.05, bankroll=1000.0):
    """
    predictions must have: model_prob, market_price, outcome, timestamp
    """
    trades = []
    for _, row in predictions.iterrows():
        edge = row["model_prob"] - row["market_price"]
        if edge < edge_threshold:
            continue
        # Fractional-Kelly-style sizing, capped at max_bet_pct of bankroll
        bet = min(bankroll * max_bet_pct, bankroll * edge * 0.25)
        if bet < 1.0:
            continue
        # Binary payout: a win returns (1 - p) / p per unit staked; a loss costs the stake
        pnl = bet * (1 - row["market_price"]) / row["market_price"] if row["outcome"] else -bet
        bankroll += pnl
        # Record timestamp and bet size: the execution-cost and regime
        # checks later in this guide need both columns
        trades.append({"timestamp": row["timestamp"], "edge": edge, "bet": bet,
                       "pnl": pnl, "bankroll": bankroll})
    return pd.DataFrame(trades)
This skeleton tells you nothing useful by itself. The inputs — model probabilities, market prices, how you built the dataset — are where the biases hide.
Bias #1: Look-Ahead Bias
Look-ahead bias is using future information to make past decisions. It is the most destructive bias in sports backtesting.
How it happens: Training on the full dataset then backtesting on it. Using closing line prices instead of the price available at signal time. Computing Elo ratings with future games.
The fix: Walk-forward validation.
def walk_forward(data, features, train_days=365):
    results = []
    for test_date in sorted(data["date"].unique()):
        cutoff = test_date - pd.Timedelta(days=1)
        # Train strictly on the trailing window ending the day before test_date
        train = data[(data["date"] >= cutoff - pd.Timedelta(days=train_days))
                     & (data["date"] <= cutoff)]
        if len(train) < 100:
            continue
        model = train_model(train)  # train_model: your own fitting routine
        test = data[data["date"] == test_date].copy()
        test["model_prob"] = model.predict_proba(test[features])[:, 1]
        results.append(test)
    return pd.concat(results)
Train only on past data. No exceptions. If your backtest does not enforce this strictly, the results are meaningless.
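To enforce this mechanically rather than by discipline, you can add an assertion that fails loudly if any training row is on or after the test date. A minimal sketch; the `assert_no_leakage` helper is illustrative, not part of the pipeline above:

```python
import pandas as pd

def assert_no_leakage(train, test_date):
    """Fail fast if any training row is on or after the test date."""
    latest = pd.to_datetime(train["date"]).max()
    if latest >= pd.Timestamp(test_date):
        raise ValueError(f"Leakage: training data reaches {latest}, test date is {test_date}")

# Toy check: training data ends the day before the test date, so this passes silently.
train = pd.DataFrame({"date": pd.date_range("2024-01-01", "2024-01-30")})
assert_no_leakage(train, "2024-01-31")
```

Calling it once per fold inside the walk-forward loop costs almost nothing and turns a silent bias into a hard crash.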
Bias #2: Survivorship Bias
Your dataset only includes games that happened — not canceled games, postponements, or markets that were delisted.
def check_survivorship(raw_data, backtest_data):
    drop_rate = 1 - len(backtest_data) / len(raw_data)
    print(f"Drop rate: {drop_rate:.1%}")
    if drop_rate > 0.05:
        print("WARNING: >5% of games dropped — investigate.")
If dropped games are disproportionately unusual (overtime, weather delays), your backtest is biased toward "normal" games where your model performs best.
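To see whether the drops are systematic rather than random, compare a column's distribution in kept versus dropped rows. A sketch under the assumption that your raw data has a game identifier; `game_id` and the toy data are illustrative names, not from the pipeline above:

```python
import pandas as pd

def compare_dropped(raw, kept, key="game_id", col="outcome"):
    """Compare a column's mean in kept rows vs rows that were dropped."""
    dropped = raw[~raw[key].isin(kept[key])]
    return {"kept_mean": kept[col].mean(),
            "dropped_mean": dropped[col].mean(),
            "n_dropped": len(dropped)}

# Toy data: the dropped games win far more often than the kept ones,
# which would be a clear red flag in a real dataset.
raw = pd.DataFrame({"game_id": range(10),
                    "outcome": [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]})
kept = raw[raw["game_id"] < 6]
print(compare_dropped(raw, kept))
```

A large gap between `kept_mean` and `dropped_mean` means the missing games are not missing at random, and the drop rate alone understates the bias.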
Bias #3: Execution Assumptions
Your backtest assumes you can execute at the observed price. You cannot.
def add_execution_costs(trades, spread=0.03, fee=0.02, fill_rate=0.70):
    # Work on a copy so the caller's DataFrame is not mutated
    trades = trades.copy()
    # Randomly drop unfilled orders, then charge spread + fee on each filled bet
    trades["filled"] = np.random.random(len(trades)) < fill_rate
    filled = trades[trades["filled"]].copy()
    filled["pnl"] = filled["pnl"] - filled["bet"] * (spread + fee)
    return filled
Our execution quality research found that 99% of theoretical edge vanished under real execution constraints. If your backtest ignores spread, fees, slippage, and partial fills, multiply expected returns by 0.1 as a rough reality check.
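Flat costs understate the damage for larger bets, because bigger orders move thin markets more. A sketch of size-dependent slippage that could extend the cost model above; `base_slip` and `impact` are assumed parameter names, and the coefficients are placeholders you would calibrate from your own fill data:

```python
import pandas as pd

def add_slippage(trades, base_slip=0.005, impact=0.0001):
    """Size-dependent slippage sketch: a flat base cost plus a term
    that grows linearly with bet size."""
    out = trades.copy()
    slip = base_slip + impact * out["bet"]
    out["pnl"] = out["pnl"] - out["bet"] * slip
    return out

trades = pd.DataFrame({"bet": [10.0, 100.0], "pnl": [5.0, 50.0]})
print(add_slippage(trades)["pnl"].round(4).tolist())  # [4.94, 48.5]
```

The 10x larger bet loses 25x more to slippage here, which is the point: scaling up a strategy is not free.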
Bias #4: Overfitting the Threshold
You tried thresholds of 5, 6, 7, 8, 9, 10 cents and found 8 works best. By testing 6 values, you optimized on the test set. Bootstrap to check robustness:
def bootstrap_test(predictions, threshold, n=500):
    pnls = []
    for _ in range(n):
        # Resample games with replacement and re-run the backtest
        sample = predictions.sample(frac=1.0, replace=True)
        trades = backtest(sample, edge_threshold=threshold)
        pnls.append(trades["pnl"].sum() if len(trades) > 0 else 0)
    p_profit = np.mean(np.array(pnls) > 0)
    print(f"Threshold {threshold}: P(profit)={p_profit:.0%}")
If P(profit) is below 85%, the threshold is not robust. A strategy that only works at one exact setting is likely overfit.
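A complementary check is to sweep the neighboring thresholds and require the strategy to stay profitable across the whole band, not just at the winner. A sketch with a toy P&L function standing in for the full backtest; `sweep` and the quadratic stand-in are illustrative, not part of the pipeline above:

```python
def sweep(pnl_at, thresholds):
    """Report P&L at each threshold; an edge worth trading should be
    positive across a band of settings, not at one isolated value."""
    return {t: pnl_at(t) for t in thresholds}

# Toy stand-in for backtest P&L: a smooth hump peaking near 0.08.
pnl_at = lambda t: 100 - 4000 * (t - 0.08) ** 2
results = sweep(pnl_at, [0.05, 0.06, 0.07, 0.08, 0.09, 0.10])
robust = all(v > 0 for v in results.values())
print(robust)  # True: profitable across the whole band
```

If the real backtest produced a sharp spike at 0.08 with losses at 0.07 and 0.09 instead, that shape alone would be strong evidence of overfitting.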
Bias #5: Regime Changes
Markets evolve. Market makers get smarter. Liquidity shifts. Check stability across time:
def check_regimes(trades):
    trades["quarter"] = pd.to_datetime(trades["timestamp"]).dt.to_period("Q")
    print(trades.groupby("quarter")["pnl"].agg(["count", "sum", "mean"]))
If more than one quarter is deeply negative, the strategy may not be robust.
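The quarterly table can be condensed into a single number for quick comparison across strategies. A sketch of one crude stability score, the fraction of quarters with positive P&L; the helper name and toy data are illustrative:

```python
import pandas as pd

def regime_stability(trades):
    """Fraction of quarters with positive P&L: a crude stability score."""
    q = pd.to_datetime(trades["timestamp"]).dt.to_period("Q")
    quarterly = trades.groupby(q)["pnl"].sum()
    return (quarterly > 0).mean()

trades = pd.DataFrame({
    "timestamp": ["2024-01-15", "2024-04-10", "2024-07-20", "2024-10-05"],
    "pnl": [120.0, -30.0, 80.0, 45.0],
})
print(regime_stability(trades))  # 3 of 4 quarters positive -> 0.75
```

A score near 1.0 with modest quarterly swings is far more convincing than the same total P&L concentrated in one hot quarter.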
The Honest Pipeline
- Walk-forward split — train only on past data
- Generate predictions — model outputs probability per game
- Apply edge filter — only trade when disagreement exceeds threshold
- Add execution costs — spread, fees, slippage, partial fills
- Bootstrap — check profitability is robust to resampling
- Check regimes — verify consistency across time
preds = walk_forward(data, features)
raw_trades = backtest(preds, edge_threshold=0.08)
real_trades = add_execution_costs(raw_trades)
shrinkage = 1 - real_trades["pnl"].sum() / raw_trades["pnl"].sum()
print(f"Shrinkage: {shrinkage:.0%}")
Above 80% shrinkage: edge is probably not real. 50-80%: something is there but execution eats most of it. Below 50%: genuinely promising.
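If you run this comparison often, the cutoffs are worth encoding once so every strategy gets graded the same way. A tiny helper using the thresholds above; the function name and verdict strings are my own wording:

```python
def verdict(shrinkage):
    """Map the shrinkage cutoffs above to a one-line verdict."""
    if shrinkage > 0.80:
        return "edge probably not real"
    if shrinkage >= 0.50:
        return "real but execution eats most of it"
    return "genuinely promising"

print(verdict(0.9), "|", verdict(0.6), "|", verdict(0.3))
```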
The Honest Truth
Most sports betting backtests are worthless — not because people are dishonest, but because the biases are subtle and always make your strategy look better than it is. The backtest that survives all five checks above is rare. But when you find one, you have evidence, not hope.
Want to see a backtest that survived? Our results page shows real trades from a system built with these principles. The ZenHodl course teaches you to build and validate your own.