You backtested a strategy on three years of historical data. Sharpe 2.3, max drawdown 8%, win rate 68%. You deployed live. Two weeks later you're down 12% with a win rate of 49%. What happened?
The most likely answer: your backtest leaked future information or ignored real-world frictions. This post walks through the five most common pitfalls, with concrete Python examples of what to look for and how to fix each.
The rule: At the moment you make a trading decision in your backtest, the only information available must be what would have been available to you at that exact moment in real-time. Everything else is a leak.
1. Look-ahead bias

The most expensive bug: you computed a feature using data that wasn't available at the prediction moment.
Classic case: you want a "rolling 30-day volatility" feature. You use pandas:
```python
# BROKEN: centered rolling window sees FUTURE data
df['vol_30d'] = df['returns'].rolling(30, center=True).std()

# BROKEN: normalizing by the full-sample mean leaks future statistics
df['vol_norm'] = df['vol_30d'] / df['vol_30d'].mean()

# CORRECT: backward-looking window only, shifted so today's feature
# uses data through yesterday
df['vol_30d'] = df['returns'].rolling(30).std().shift(1)
```
The .shift(1) is critical. Rolling windows in pandas return the value at the end of the window, which is the current row — the same row you're using it to predict. You need yesterday's value for today's prediction.
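A tiny demonstration of the off-by-one, using made-up returns:

```python
import pandas as pd

returns = pd.Series([0.01, -0.02, 0.03, 0.01, -0.01])

# Without shift, the 3-period rolling std at row i includes row i itself --
# the very row whose outcome you are trying to predict.
leaky = returns.rolling(3).std()

# With shift(1), row i only sees rows up to i-1.
safe = returns.rolling(3).std().shift(1)

# safe at row 3 equals leaky at row 2: yesterday's value, today's feature.
assert safe.iloc[3] == leaky.iloc[2]
```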
In sports prediction, look-ahead often hides in "team stats." You want each game's feature to include team-level pace, offensive rating, defensive rating. If you compute those stats across a full season of games and then join them onto the game-level row, every early-season game has access to late-season statistics that a live bot couldn't know.
```python
# BROKEN: season-average stats include future games
team_stats = games.groupby('team')[['pace', 'ortg', 'drtg']].mean()
games = games.merge(team_stats, on='team')

# CORRECT: expanding mean of each team's PRIOR games only.
# transform keeps the shift and expanding mean inside each team's group,
# so one team's history never bleeds into another's.
games = games.sort_values(['team', 'date'])
for col in ['pace', 'ortg', 'drtg']:
    games[f'{col}_prior'] = (
        games.groupby('team')[col]
        .transform(lambda s: s.shift(1).expanding().mean())
    )
```
Detection test: split your data chronologically at some random date D. Rebuild features on data before D only. If any feature value for a row dated before D differs between the two builds, you have look-ahead.
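That test can be scripted. Here `build_features` is a hypothetical stand-in for whatever your pipeline does, assumed to take and return a DataFrame with a `date` column:

```python
import pandas as pd

def has_lookahead(df: pd.DataFrame, build_features, split_date) -> bool:
    """True if any feature for rows dated before split_date changes when
    the build only sees data before split_date."""
    full = build_features(df)
    full_before = full[full['date'] < split_date].reset_index(drop=True)
    truncated = build_features(df[df['date'] < split_date]).reset_index(drop=True)
    return not full_before.equals(truncated)

# A leaky feature (normalizing by the full-sample mean) gets caught;
# a point-in-time feature (expanding mean, shifted) passes.
df = pd.DataFrame({'date': pd.to_datetime(['2024-01-01', '2024-01-02',
                                           '2024-01-03', '2024-01-04']),
                   'x': [1.0, 2.0, 3.0, 4.0]})
leaky = lambda d: d.assign(f=d['x'] / d['x'].mean())
safe = lambda d: d.assign(f=d['x'].expanding().mean().shift(1))
assert has_lookahead(df, leaky, pd.Timestamp('2024-01-03'))
assert not has_lookahead(df, safe, pd.Timestamp('2024-01-03'))
```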
2. Survivorship bias

Your dataset only contains assets, teams, or markets that still exist. Delisted stocks, relegated soccer teams, closed futures contracts — they're missing. Your model trained on a filtered population that excludes the worst historical outcomes.
In sports betting, a common version: you trained on NBA teams active in 2024. Any team that relocated, rebranded, or had its roster fully turn over between 2020 and 2024 is implicitly "missing" historical games in your training set. The model learned patterns on teams that persisted — which is a biased sample.
Detection is uncomfortable: verify you have data on entities that no longer exist.
```python
# Check: which teams appear in 2020 but not in 2024, and vice versa?
teams_2020 = set(games[games['season'] == 2020]['team'].unique())
teams_2024 = set(games[games['season'] == 2024]['team'].unique())
dropped = teams_2020 - teams_2024
added = teams_2024 - teams_2020
print(f"Teams in 2020 no longer in 2024: {dropped}")
print(f"Teams in 2024 not in 2020: {added}")
```
For prediction markets, the survivorship equivalent is a dataset of resolved markets only. If your backtest excludes markets that never resolved or were invalidated, you're excluding trades a live bot would have made anyway — capital locked in a position until it's refunded, eating fees and opportunity cost along the way.
3. Frictionless execution

Your backtest assumed trades filled at the mid-price, instantly, with no slippage, no fees, no latency. Production traders fill at the ask (or worse), pay taker fees, and lose cents to latency on every trade.
Minimum execution cost model:

```python
SLIPPAGE_C = 1.0         # average slippage beyond the displayed ask, in cents
TAKER_FEE_C = 2.0        # exchange taker fee, in cents
LATENCY_PENALTY_C = 0.5  # adverse price movement during 3-5s execution

def realistic_entry_price(displayed_ask_c: float) -> float:
    """Effective fill price including execution costs, in cents."""
    return displayed_ask_c + SLIPPAGE_C + LATENCY_PENALTY_C

def pnl_per_share(entry_c: float, settlement_c: float,
                  sport_fee_c: float = 0.5) -> float:
    """Realized P&L per share after fees, in cents."""
    gross_pnl = settlement_c - realistic_entry_price(entry_c)
    return gross_pnl - TAKER_FEE_C - sport_fee_c
```
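Summing the assumed constants above shows how much edge friction consumes before the model's signal even matters:

```python
# Assumed cost constants from the sketch above, in cents
SLIPPAGE_C = 1.0
LATENCY_PENALTY_C = 0.5
TAKER_FEE_C = 2.0
SPORT_FEE_C = 0.5

# Round-trip friction per share: the minimum edge needed just to break even
total_friction_c = SLIPPAGE_C + LATENCY_PENALTY_C + TAKER_FEE_C + SPORT_FEE_C
print(total_friction_c)  # 4.0
```

Under these assumptions, a strategy whose backtest edge is below about 4 cents per share never had an edge at all.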
On 100-200 trade samples, the difference between "fills at mid" and "fills at ask + 3c" can flip a profitable strategy to a losing one. Always model execution cost, and err on the pessimistic side.
Tighter test: simulate fills using the actual order book depth, not just top-of-book. If you're buying 5 contracts at a price level that shows 2 contracts available, you eat 3 contracts at a worse level.
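A minimal sketch of a depth-aware fill, assuming the book is a list of `(price_in_cents, size)` ask levels sorted best-first:

```python
def avg_fill_price_c(asks: list[tuple[float, int]], qty: int) -> float:
    """Walk ask levels until qty is filled; return the average fill
    price in cents. Raises if the displayed book can't fill the order."""
    remaining, cost = qty, 0.0
    for price_c, size in asks:
        take = min(remaining, size)
        cost += take * price_c
        remaining -= take
        if remaining == 0:
            return cost / qty
    raise ValueError("insufficient displayed depth")

# Buying 5 contracts against a thin book: 2 @ 45c, 2 @ 46c, 1 @ 47c
print(avg_fill_price_c([(45, 2), (46, 2), (47, 10)], 5))  # 45.8
```

Top-of-book showed 45c, but the marginal contracts filled at 46c and 47c — nearly a cent of hidden slippage on a 5-lot.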
4. Model-selection leakage

You grid-searched min_edge over {3, 4, 5, 6, 7, 8, 9, 10} on your backtest and picked min_edge=7 because it had the highest Sharpe. Congratulations: you selected the threshold that best fits your backtest's noise.
Any threshold you tune on the backtest can't also be evaluated against the same backtest: your reported Sharpe is biased upward.

Three ways to handle it:

- Tune thresholds on a held-out validation period and report performance only on a final, untouched test period.
- Use walk-forward evaluation (see pitfall 5) so every reported number comes from data the parameters never saw.
- Bootstrap the per-trade P&L to get a confidence interval on Sharpe instead of a single, overfit point estimate:
```python
import numpy as np
from sklearn.utils import resample

def bootstrap_sharpe(pnl_per_trade: np.ndarray, n_boot: int = 1000) -> dict:
    """Bootstrap distribution of annualized Sharpe from per-trade P&L."""
    sharpes = []
    for _ in range(n_boot):
        sample = resample(pnl_per_trade)  # resample trades with replacement
        s = np.sqrt(252) * sample.mean() / sample.std() if sample.std() > 0 else 0
        sharpes.append(s)
    return {
        "mean": np.mean(sharpes),
        "p05": np.percentile(sharpes, 5),
        "p50": np.percentile(sharpes, 50),
        "p95": np.percentile(sharpes, 95),
    }
```

If the 5th percentile is still positive after realistic execution costs, the edge is less likely to be an artifact of one lucky sample.
5. Train/test contamination

You used train_test_split with shuffle=True on time-series data. Now your test set contains rows from the same games as your training set — just different minutes within those games. The model effectively memorized game outcomes.
```python
# BROKEN: random split on game-level rows leaks outcomes
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# CORRECT: split by game_id so all snapshots from one game land in one set
import hashlib

def game_bucket(game_id: str, n_buckets: int = 100) -> int:
    h = int(hashlib.md5(str(game_id).encode()).hexdigest(), 16)
    return h % n_buckets

test_games = {gid for gid in games['game_id'].unique()
              if game_bucket(gid) < 20}  # ~20% of games go to test
train = games[~games['game_id'].isin(test_games)]
test = games[games['game_id'].isin(test_games)]
```
Hashing on game_id is deterministic: you always get the same split given the same input. That's good for reproducibility but you still have to be careful — if game IDs correlate with time (monotonic counter), you can end up with all old games in train and new in test, or vice versa.
The gold standard for time-series trading is walk-forward validation: train on seasons 1..N, test on season N+1, then slide forward.
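The walk-forward schedule is simple to generate; here's a minimal sketch, assuming season labels sorted ascending:

```python
def walk_forward_splits(seasons: list[int]):
    """Yield (train_seasons, test_season) pairs: train on everything
    up to season N, test on season N+1, then slide forward."""
    for i in range(1, len(seasons)):
        yield seasons[:i], seasons[i]

for train, test in walk_forward_splits([2020, 2021, 2022, 2023, 2024]):
    print(train, '->', test)
# [2020] -> 2021
# [2020, 2021] -> 2022
# ... and so on through 2024
```

Every reported number then comes from a season the model never saw during training.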
Before you trust a backtest Sharpe, verify:

- Every feature is built only from data available at the decision moment (pitfall 1).
- The dataset includes delisted, relegated, and otherwise vanished entities (pitfall 2).
- Fills are modeled with slippage, fees, and latency, pessimistically (pitfall 3).
- No threshold or parameter was tuned on the same data that produced the reported metrics (pitfall 4).
- Train and test are separated by entity and by time, ideally walk-forward (pitfall 5).
The final test: paper-trade in production for a month. If the paper P&L converges with the backtest P&L to within reason, your backtest was honest. If not, one of these pitfalls is still hiding.
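One rough way to quantify "within reason" — an assumed metric, not a standard one — is the average gap between the two cumulative P&L curves:

```python
import numpy as np

def tracking_gap(backtest_pnl: np.ndarray, paper_pnl: np.ndarray) -> float:
    """Mean absolute distance between cumulative backtest and paper-trading
    P&L curves, truncated to the shorter series. Zero means perfect agreement."""
    n = min(len(backtest_pnl), len(paper_pnl))
    bt = np.cumsum(backtest_pnl[:n])
    live = np.cumsum(paper_pnl[:n])
    return float(np.mean(np.abs(bt - live)))

# Identical P&L streams agree perfectly:
print(tracking_gap(np.array([1.0, -0.5]), np.array([1.0, -0.5])))  # 0.0
```

A gap that stays within your modeled execution costs is consistent with an honest backtest; a growing gap points back at one of the five pitfalls.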
ZenHodl's sports prediction API publishes every live trade on-chain with real execution prices, real fees, and real latency — no backtest. Fair win probabilities for 11 sports, verifiable performance, REST API access.
See verified live results