Every profitable-looking backtest has the same problem: it probably is not real. The gap between backtested returns and live returns in sports betting is enormous, and it always goes in one direction — your backtest is too optimistic.
This guide shows you how to build a backtest that does not lie to you using pandas and numpy.
The Minimum Viable Backtest
import pandas as pd
import numpy as np

def backtest(predictions, edge_threshold=0.08, max_bet_pct=0.05, bankroll=1000.0):
    """
    predictions must have: model_prob, market_price, outcome, timestamp
    """
    trades = []
    for _, row in predictions.iterrows():
        edge = row["model_prob"] - row["market_price"]
        if edge < edge_threshold:
            continue
        # Fractional-Kelly-style sizing, capped at max_bet_pct of bankroll
        bet = min(bankroll * max_bet_pct, bankroll * edge * 0.25)
        if bet < 1.0:
            continue
        # Binary payout: a win returns (1 - p) / p per unit staked; a loss costs the stake
        pnl = bet * (1 - row["market_price"]) / row["market_price"] if row["outcome"] else -bet
        bankroll += pnl
        # Record timestamp and bet size: the execution-cost and regime
        # checks later in this guide need both columns
        trades.append({"timestamp": row["timestamp"], "edge": edge, "bet": bet,
                       "pnl": pnl, "bankroll": bankroll})
    return pd.DataFrame(trades)
This skeleton tells you nothing useful by itself. The inputs — model probabilities, market prices, how you built the dataset — are where the biases hide.
Bias #1: Look-Ahead Bias
Look-ahead bias is using future information to make past decisions. It is the most destructive bias in sports backtesting.
How it happens: Training on the full dataset then backtesting on it. Using closing line prices instead of the price available at signal time. Computing Elo ratings with future games.
The fix: Walk-forward validation.
def walk_forward(data, features, train_days=365):
    results = []
    for test_date in sorted(data["date"].unique()):
        cutoff = test_date - pd.Timedelta(days=1)
        # Train strictly on the trailing window ending the day before test_date
        train = data[(data["date"] >= cutoff - pd.Timedelta(days=train_days))
                     & (data["date"] <= cutoff)]
        if len(train) < 100:
            continue
        model = train_model(train)  # train_model: your own fitting routine
        test = data[data["date"] == test_date].copy()
        test["model_prob"] = model.predict_proba(test[features])[:, 1]
        results.append(test)
    return pd.concat(results)
Train only on past data. No exceptions. If your backtest does not enforce this strictly, the results are meaningless.
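To enforce this mechanically rather than by discipline, you can add an assertion that fails loudly if any training row is on or after the test date. A minimal sketch; the `assert_no_leakage` helper is illustrative, not part of the pipeline above:

```python
import pandas as pd

def assert_no_leakage(train, test_date):
    """Fail fast if any training row is on or after the test date."""
    latest = pd.to_datetime(train["date"]).max()
    if latest >= pd.Timestamp(test_date):
        raise ValueError(f"Leakage: training data reaches {latest}, test date is {test_date}")

# Toy check: training data ends the day before the test date, so this passes silently.
train = pd.DataFrame({"date": pd.date_range("2024-01-01", "2024-01-30")})
assert_no_leakage(train, "2024-01-31")
```

Calling it once per fold inside the walk-forward loop costs almost nothing and turns a silent bias into a hard crash.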
Bias #2: Survivorship Bias
Your dataset only includes games that happened — not canceled games, postponements, or markets that were delisted.
def check_survivorship(raw_data, backtest_data):
    drop_rate = 1 - len(backtest_data) / len(raw_data)
    print(f"Drop rate: {drop_rate:.1%}")
    if drop_rate > 0.05:
        print("WARNING: >5% of games dropped — investigate.")
If dropped games are disproportionately unusual (overtime, weather delays), your backtest is biased toward "normal" games where your model performs best.
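To see whether the drops are systematic rather than random, compare a column's distribution in kept versus dropped rows. A sketch under the assumption that your raw data has a game identifier; `game_id` and the toy data are illustrative names, not from the pipeline above:

```python
import pandas as pd

def compare_dropped(raw, kept, key="game_id", col="outcome"):
    """Compare a column's mean in kept rows vs rows that were dropped."""
    dropped = raw[~raw[key].isin(kept[key])]
    return {"kept_mean": kept[col].mean(),
            "dropped_mean": dropped[col].mean(),
            "n_dropped": len(dropped)}

# Toy data: the dropped games win far more often than the kept ones,
# which would be a clear red flag in a real dataset.
raw = pd.DataFrame({"game_id": range(10),
                    "outcome": [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]})
kept = raw[raw["game_id"] < 6]
print(compare_dropped(raw, kept))
```

A large gap between `kept_mean` and `dropped_mean` means the missing games are not missing at random, and the drop rate alone understates the bias.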
Bias #3: Execution Assumptions
Your backtest assumes you can execute at the observed price. You cannot.
def add_execution_costs(trades, spread=0.03, fee=0.02, fill_rate=0.70):
    # Work on a copy so the caller's DataFrame is not mutated
    trades = trades.copy()
    # Randomly drop unfilled orders, then charge spread + fee on each filled bet
    trades["filled"] = np.random.random(len(trades)) < fill_rate
    filled = trades[trades["filled"]].copy()
    filled["pnl"] = filled["pnl"] - filled["bet"] * (spread + fee)
    return filled
Our execution quality research found that 99% of theoretical edge vanished under real execution constraints. If your backtest ignores spread, fees, slippage, and partial fills, multiply expected returns by 0.1 as a rough reality check.
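Flat costs understate the damage for larger bets, because bigger orders move thin markets more. A sketch of size-dependent slippage that could extend the cost model above; `base_slip` and `impact` are assumed parameter names, and the coefficients are placeholders you would calibrate from your own fill data:

```python
import pandas as pd

def add_slippage(trades, base_slip=0.005, impact=0.0001):
    """Size-dependent slippage sketch: a flat base cost plus a term
    that grows linearly with bet size."""
    out = trades.copy()
    slip = base_slip + impact * out["bet"]
    out["pnl"] = out["pnl"] - out["bet"] * slip
    return out

trades = pd.DataFrame({"bet": [10.0, 100.0], "pnl": [5.0, 50.0]})
print(add_slippage(trades)["pnl"].round(4).tolist())  # [4.94, 48.5]
```

The 10x larger bet loses 25x more to slippage here, which is the point: scaling up a strategy is not free.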
Bias #4: Overfitting the Threshold
You tried thresholds of 5, 6, 7, 8, 9, 10 cents and found 8 works best. By testing 6 values, you optimized on the test set. Bootstrap to check robustness:
def bootstrap_test(predictions, threshold, n=500):
    pnls = []
    for _ in range(n):
        # Resample games with replacement and re-run the backtest
        sample = predictions.sample(frac=1.0, replace=True)
        trades = backtest(sample, edge_threshold=threshold)
        pnls.append(trades["pnl"].sum() if len(trades) > 0 else 0)
    p_profit = np.mean(np.array(pnls) > 0)
    print(f"Threshold {threshold}: P(profit)={p_profit:.0%}")
If P(profit) is below 85%, the threshold is not robust. A strategy that only works at one exact setting is likely overfit.
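A complementary check is to sweep the neighboring thresholds and require the strategy to stay profitable across the whole band, not just at the winner. A sketch with a toy P&L function standing in for the full backtest; `sweep` and the quadratic stand-in are illustrative, not part of the pipeline above:

```python
def sweep(pnl_at, thresholds):
    """Report P&L at each threshold; an edge worth trading should be
    positive across a band of settings, not at one isolated value."""
    return {t: pnl_at(t) for t in thresholds}

# Toy stand-in for backtest P&L: a smooth hump peaking near 0.08.
pnl_at = lambda t: 100 - 4000 * (t - 0.08) ** 2
results = sweep(pnl_at, [0.05, 0.06, 0.07, 0.08, 0.09, 0.10])
robust = all(v > 0 for v in results.values())
print(robust)  # True: profitable across the whole band
```

If the real backtest produced a sharp spike at 0.08 with losses at 0.07 and 0.09 instead, that shape alone would be strong evidence of overfitting.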
Bias #5: Regime Changes
Markets evolve. Market makers get smarter. Liquidity shifts. Check stability across time:
def check_regimes(trades):
    trades["quarter"] = pd.to_datetime(trades["timestamp"]).dt.to_period("Q")
    print(trades.groupby("quarter")["pnl"].agg(["count", "sum", "mean"]))
If more than one quarter is deeply negative, the strategy may not be robust.
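The quarterly table can be condensed into a single number for quick comparison across strategies. A sketch of one crude stability score, the fraction of quarters with positive P&L; the helper name and toy data are illustrative:

```python
import pandas as pd

def regime_stability(trades):
    """Fraction of quarters with positive P&L: a crude stability score."""
    q = pd.to_datetime(trades["timestamp"]).dt.to_period("Q")
    quarterly = trades.groupby(q)["pnl"].sum()
    return (quarterly > 0).mean()

trades = pd.DataFrame({
    "timestamp": ["2024-01-15", "2024-04-10", "2024-07-20", "2024-10-05"],
    "pnl": [120.0, -30.0, 80.0, 45.0],
})
print(regime_stability(trades))  # 3 of 4 quarters positive -> 0.75
```

A score near 1.0 with modest quarterly swings is far more convincing than the same total P&L concentrated in one hot quarter.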
The Honest Pipeline
- Walk-forward split — train only on past data
- Generate predictions — model outputs probability per game
- Apply edge filter — only trade when disagreement exceeds threshold
- Add execution costs — spread, fees, slippage, partial fills
- Bootstrap — check profitability is robust to resampling
- Check regimes — verify consistency across time
preds = walk_forward(data, features)
raw_trades = backtest(preds, edge_threshold=0.08)
real_trades = add_execution_costs(raw_trades)
shrinkage = 1 - real_trades["pnl"].sum() / raw_trades["pnl"].sum()
print(f"Shrinkage: {shrinkage:.0%}")
Above 80% shrinkage: edge is probably not real. 50-80%: something is there but execution eats most of it. Below 50%: genuinely promising.
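If you run this comparison often, the cutoffs are worth encoding once so every strategy gets graded the same way. A tiny helper using the thresholds above; the function name and verdict strings are my own wording:

```python
def verdict(shrinkage):
    """Map the shrinkage cutoffs above to a one-line verdict."""
    if shrinkage > 0.80:
        return "edge probably not real"
    if shrinkage >= 0.50:
        return "real but execution eats most of it"
    return "genuinely promising"

print(verdict(0.9), "|", verdict(0.6), "|", verdict(0.3))
```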
The Honest Truth
Most sports betting backtests are worthless — not because people are dishonest, but because the biases are subtle and always make your strategy look better than it is. The backtest that survives all five checks above is rare. But when you find one, you have evidence, not hope.
Want to see a backtest that survived? Our results page shows real trades from a system built with these principles. The ZenHodl course teaches you to build and validate your own.