You backtested a strategy on three years of historical data. Sharpe 2.3, max drawdown 8%, win rate 68%. You deployed live. Two weeks later you're down 12% with a win rate of 49%. What happened?
The most likely answer: your backtest leaked future information or ignored real-world frictions. This post walks through the five most common pitfalls, with concrete Python examples of what to look for and how to fix each.
The rule: At the moment you make a trading decision in your backtest, the only information available must be what would have been available to you at that exact moment in real-time. Everything else is a leak.
1. Look-ahead bias

The most expensive bug: you computed a feature using data that wasn't available at the prediction moment.
Classic case: you want a "rolling 30-day volatility" feature. You use pandas:
```python
# BROKEN: centered rolling window sees FUTURE data
df['vol_30d'] = df['returns'].rolling(30, center=True).std()

# BROKEN: normalizing by the full-sample mean leaks future statistics
df['vol_norm'] = df['vol_30d'] / df['vol_30d'].mean()

# CORRECT: backward-looking window only, shifted so today's feature
# uses data through yesterday
df['vol_30d'] = df['returns'].rolling(30).std().shift(1)
```
The .shift(1) is critical. Rolling windows in pandas return the value at the end of the window, which is the current row — the same row you're using it to predict. You need yesterday's value for today's prediction.
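A tiny demonstration of the off-by-one, using made-up returns:

```python
import pandas as pd

returns = pd.Series([0.01, -0.02, 0.03, 0.01, -0.01])

# Without shift, the 3-period rolling std at row i includes row i itself --
# the very row whose outcome you are trying to predict.
leaky = returns.rolling(3).std()

# With shift(1), row i only sees rows up to i-1.
safe = returns.rolling(3).std().shift(1)

# safe at row 3 equals leaky at row 2: yesterday's value, today's feature.
assert safe.iloc[3] == leaky.iloc[2]
```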
In sports prediction, look-ahead often hides in "team stats." You want each game's feature to include team-level pace, offensive rating, defensive rating. If you compute those stats across a full season of games and then join them onto the game-level row, every early-season game has access to late-season statistics that a live bot couldn't know.
```python
# BROKEN: season-average stats include future games
team_stats = games.groupby('team')[['pace', 'ortg', 'drtg']].mean()
games = games.merge(team_stats, on='team')

# CORRECT: expanding mean of each team's PRIOR games only.
# transform keeps the shift and expanding mean inside each team's group,
# so one team's history never bleeds into another's.
games = games.sort_values(['team', 'date'])
for col in ['pace', 'ortg', 'drtg']:
    games[f'{col}_prior'] = (
        games.groupby('team')[col]
        .transform(lambda s: s.shift(1).expanding().mean())
    )
```
Detection test: split your data chronologically at some random date D. Rebuild features on data before D only. If any feature value for a row dated before D differs between the two builds, you have look-ahead.
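That test can be scripted. Here `build_features` is a hypothetical stand-in for whatever your pipeline does, assumed to take and return a DataFrame with a `date` column:

```python
import pandas as pd

def has_lookahead(df: pd.DataFrame, build_features, split_date) -> bool:
    """True if any feature for rows dated before split_date changes when
    the build only sees data before split_date."""
    full = build_features(df)
    full_before = full[full['date'] < split_date].reset_index(drop=True)
    truncated = build_features(df[df['date'] < split_date]).reset_index(drop=True)
    return not full_before.equals(truncated)

# A leaky feature (normalizing by the full-sample mean) gets caught;
# a point-in-time feature (expanding mean, shifted) passes.
df = pd.DataFrame({'date': pd.to_datetime(['2024-01-01', '2024-01-02',
                                           '2024-01-03', '2024-01-04']),
                   'x': [1.0, 2.0, 3.0, 4.0]})
leaky = lambda d: d.assign(f=d['x'] / d['x'].mean())
safe = lambda d: d.assign(f=d['x'].expanding().mean().shift(1))
assert has_lookahead(df, leaky, pd.Timestamp('2024-01-03'))
assert not has_lookahead(df, safe, pd.Timestamp('2024-01-03'))
```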
2. Survivorship bias

Your dataset only contains assets, teams, or markets that still exist. Delisted stocks, relegated soccer teams, closed futures contracts — they're missing. Your model trained on a filtered population that excludes the worst historical outcomes.
In sports betting, a common version: you trained on NBA teams active in 2024. Any team that relocated, rebranded, or had its roster fully turn over between 2020 and 2024 is implicitly "missing" historical games in your training set. The model learned patterns on teams that persisted — which is a biased sample.
Detection is uncomfortable: verify you have data on entities that no longer exist.
```python
# Check: which teams appear in 2020 but not in 2024, and vice versa?
teams_2020 = set(games[games['season'] == 2020]['team'].unique())
teams_2024 = set(games[games['season'] == 2024]['team'].unique())
dropped = teams_2020 - teams_2024
added = teams_2024 - teams_2020
print(f"Teams in 2020 no longer in 2024: {dropped}")
print(f"Teams in 2024 not in 2020: {added}")
```
For prediction markets, the survivorship equivalent is a dataset of resolved markets only. If your backtest excludes markets that never resolved or were invalidated, you're excluding trades a live bot would have made anyway — capital locked in a position until it's refunded, eating fees and opportunity cost along the way.
3. Frictionless execution

Your backtest assumed trades filled at the mid-price, instantly, with no slippage, no fees, no latency. Production traders fill at the ask (or worse), pay taker fees, and lose cents to latency on every trade.
Minimum execution cost model:

```python
SLIPPAGE_C = 1.0         # average slippage beyond the displayed ask, in cents
TAKER_FEE_C = 2.0        # exchange taker fee, in cents
LATENCY_PENALTY_C = 0.5  # adverse price movement during 3-5s execution

def realistic_entry_price(displayed_ask_c: float) -> float:
    """Effective fill price including execution costs, in cents."""
    return displayed_ask_c + SLIPPAGE_C + LATENCY_PENALTY_C

def pnl_per_share(entry_c: float, settlement_c: float,
                  sport_fee_c: float = 0.5) -> float:
    """Realized P&L per share after fees, in cents."""
    gross_pnl = settlement_c - realistic_entry_price(entry_c)
    return gross_pnl - TAKER_FEE_C - sport_fee_c
```
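Summing the assumed constants above shows how much edge friction consumes before the model's signal even matters:

```python
# Assumed cost constants from the sketch above, in cents
SLIPPAGE_C = 1.0
LATENCY_PENALTY_C = 0.5
TAKER_FEE_C = 2.0
SPORT_FEE_C = 0.5

# Round-trip friction per share: the minimum edge needed just to break even
total_friction_c = SLIPPAGE_C + LATENCY_PENALTY_C + TAKER_FEE_C + SPORT_FEE_C
print(total_friction_c)  # 4.0
```

Under these assumptions, a strategy whose backtest edge is below about 4 cents per share never had an edge at all.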
On 100-200 trade samples, the difference between "fills at mid" and "fills at ask + 3c" can flip a profitable strategy to a losing one. Always model execution cost, and err on the pessimistic side.
Tighter test: simulate fills using the actual order book depth, not just top-of-book. If you're buying 5 contracts at a price level that shows 2 contracts available, you eat 3 contracts at a worse level.
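A minimal sketch of a depth-aware fill, assuming the book is a list of `(price_in_cents, size)` ask levels sorted best-first:

```python
def avg_fill_price_c(asks: list[tuple[float, int]], qty: int) -> float:
    """Walk ask levels until qty is filled; return the average fill
    price in cents. Raises if the displayed book can't fill the order."""
    remaining, cost = qty, 0.0
    for price_c, size in asks:
        take = min(remaining, size)
        cost += take * price_c
        remaining -= take
        if remaining == 0:
            return cost / qty
    raise ValueError("insufficient displayed depth")

# Buying 5 contracts against a thin book: 2 @ 45c, 2 @ 46c, 1 @ 47c
print(avg_fill_price_c([(45, 2), (46, 2), (47, 10)], 5))  # 45.8
```

Top-of-book showed 45c, but the marginal contracts filled at 46c and 47c — nearly a cent of hidden slippage on a 5-lot.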
4. Model-selection leakage

You grid-searched min_edge over {3, 4, 5, 6, 7, 8, 9, 10} on your backtest and picked min_edge=7 because it had the highest Sharpe. Congratulations: you selected the threshold that best fits your backtest's noise.
Any threshold you tune on the backtest can't also be evaluated against the same backtest: your reported Sharpe is biased upward.

Three ways to handle it:

- Tune thresholds on a held-out validation period and report performance only on a final, untouched test period.
- Use walk-forward evaluation (see pitfall 5) so every reported number comes from data the parameters never saw.
- Bootstrap the per-trade P&L to get a confidence interval on Sharpe instead of a single, overfit point estimate:
```python
import numpy as np
from sklearn.utils import resample

def bootstrap_sharpe(pnl_per_trade: np.ndarray, n_boot: int = 1000) -> dict:
    """Bootstrap distribution of annualized Sharpe from per-trade P&L."""
    sharpes = []
    for _ in range(n_boot):
        sample = resample(pnl_per_trade)  # resample trades with replacement
        s = np.sqrt(252) * sample.mean() / sample.std() if sample.std() > 0 else 0
        sharpes.append(s)
    return {
        "mean": np.mean(sharpes),
        "p05": np.percentile(sharpes, 5),
        "p50": np.percentile(sharpes, 50),
        "p95": np.percentile(sharpes, 95),
    }
```

If the 5th percentile is still positive after realistic execution costs, the edge is less likely to be an artifact of one lucky sample.
5. Train/test contamination

You used train_test_split with shuffle=True on time-series data. Now your test set contains rows from the same games as your training set — just different minutes within those games. The model effectively memorized game outcomes.
```python
# BROKEN: random split on game-level rows leaks outcomes
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# CORRECT: split by game_id so all snapshots from one game land in one set
import hashlib

def game_bucket(game_id: str, n_buckets: int = 100) -> int:
    h = int(hashlib.md5(str(game_id).encode()).hexdigest(), 16)
    return h % n_buckets

test_games = {gid for gid in games['game_id'].unique()
              if game_bucket(gid) < 20}  # ~20% of games go to test
train = games[~games['game_id'].isin(test_games)]
test = games[games['game_id'].isin(test_games)]
```
Hashing on game_id is deterministic: you always get the same split given the same input. That's good for reproducibility but you still have to be careful — if game IDs correlate with time (monotonic counter), you can end up with all old games in train and new in test, or vice versa.
The gold standard for time-series trading is walk-forward validation: train on seasons 1..N, test on season N+1, then slide forward.
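The walk-forward schedule is simple to generate; here's a minimal sketch, assuming season labels sorted ascending:

```python
def walk_forward_splits(seasons: list[int]):
    """Yield (train_seasons, test_season) pairs: train on everything
    up to season N, test on season N+1, then slide forward."""
    for i in range(1, len(seasons)):
        yield seasons[:i], seasons[i]

for train, test in walk_forward_splits([2020, 2021, 2022, 2023, 2024]):
    print(train, '->', test)
# [2020] -> 2021
# [2020, 2021] -> 2022
# ... and so on through 2024
```

Every reported number then comes from a season the model never saw during training.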
Before you trust a backtest Sharpe, verify:

- Every feature is built only from data available at the decision moment (pitfall 1).
- The dataset includes delisted, relegated, and otherwise vanished entities (pitfall 2).
- Fills are modeled with slippage, fees, and latency, pessimistically (pitfall 3).
- No threshold or parameter was tuned on the same data that produced the reported metrics (pitfall 4).
- Train and test are separated by entity and by time, ideally walk-forward (pitfall 5).
The final test: paper-trade in production for a month. If the paper P&L converges with the backtest P&L to within reason, your backtest was honest. If not, one of these pitfalls is still hiding.
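One rough way to quantify "within reason" — an assumed metric, not a standard one — is the average gap between the two cumulative P&L curves:

```python
import numpy as np

def tracking_gap(backtest_pnl: np.ndarray, paper_pnl: np.ndarray) -> float:
    """Mean absolute distance between cumulative backtest and paper-trading
    P&L curves, truncated to the shorter series. Zero means perfect agreement."""
    n = min(len(backtest_pnl), len(paper_pnl))
    bt = np.cumsum(backtest_pnl[:n])
    live = np.cumsum(paper_pnl[:n])
    return float(np.mean(np.abs(bt - live)))

# Identical P&L streams agree perfectly:
print(tracking_gap(np.array([1.0, -0.5]), np.array([1.0, -0.5])))  # 0.0
```

A gap that stays within your modeled execution costs is consistent with an honest backtest; a growing gap points back at one of the five pitfalls.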
ZenHodl's sports prediction API publishes every live trade on-chain with real execution prices, real fees, and real latency — no backtest. Fair win probabilities for 11 sports, verifiable performance, REST API access.
See verified live results