From Play-By-Play to Edge: Turning Game-State Snapshots Into Win Probabilities With ML

May 11, 2026 · 13 min read · XGBoost, Calibration, Feature Engineering, Python

The path from raw play-by-play data to a tradable win probability is not a single ML problem. It is four problems in a row: labeling, feature engineering, model selection, and calibration. Skip any of them and the resulting probability is either wrong or unactionable. This post walks through the full pipeline at the level of detail you would need to ship one yourself.

The labeling problem

The naive label is "did the home team win this game." Apply it to every play-by-play snapshot and you have a binary classification dataset.

This works, but it has a subtle issue: every snapshot from the same game gets the same label. A model trained on this dataset can memorize game-specific patterns that do not generalize. Worse, if you split the dataset by random row, snapshots from the same game can land in both train and test, leaking information.

Always split by game, not by row. Hold out entire games for the test set. The full case for game-level splits is in our walk-forward validation post.
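
A minimal sketch of a game-level split, assuming the snapshot frame is a pandas DataFrame df with a game_id column (the column name is illustrative):

from sklearn.model_selection import GroupShuffleSplit

# Hold out ~20% of games, not rows: every snapshot from a held-out game
# lands entirely in the test set, so nothing leaks across the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["game_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]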

An alternative labeling approach is the score margin at end-of-game minus the score margin at the snapshot. This produces a regression target rather than classification and gives the model more information to learn from. We use both: classification for moneyline, regression for spread/total. Same play-by-play snapshots, different labels.
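
As a sketch of both labels on the same snapshot frame, assuming columns home_score, away_score, final_home_score, and final_away_score (all column names are illustrative):

# Classification label for the moneyline model: did the home team win
df["home_win"] = (df["final_home_score"] > df["final_away_score"]).astype(int)

# Regression label for spread/total work: end-of-game margin minus the
# margin at the snapshot, i.e. how much scoring is still to come
df["margin_to_come"] = (
    df["final_home_score"] - df["final_away_score"]
) - (df["home_score"] - df["away_score"])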

Feature engineering: what the model needs to know

The base feature set for any sport win-probability model includes the score differential, the time remaining, and a pre-game team-strength estimate such as the Elo difference (the same quantities the interaction features below are built from).

That gets you a passable model. The features that meaningfully improve performance are sport-specific.

For basketball: pace (estimated possessions per 48 min), offensive rating diff, defensive rating diff, foul situation (free throws are a clock-killer in close games).

For hockey: shots on goal differential, power-play state, empty-net flag (massive late-game effect).

For football: down, distance, field position, possession, two-minute warning flag.

For soccer: red cards, elapsed minute, shot accumulation. The model is fundamentally a Poisson process, not an XGBoost.

For tennis: set count, game count, point count, tiebreak flag, server identity. The model is hierarchical (point → game → set → match), not flat.

Engineered interaction features

Beyond raw features, three engineered interactions consistently lift model performance:

# Score-time interaction: a 5-point lead with 2 minutes left is very different
# from a 5-point lead with 30 minutes left
df["score_diff_x_time_remaining"] = df["score_diff"] * df["time_remaining"]

# Squared score diff: closeness matters non-linearly
df["score_diff_sq"] = df["score_diff"] ** 2

# Score diff scaled by Elo gap: a 5-point lead from the underdog is more
# meaningful than a 5-point lead from the favorite
df["score_diff_x_elo"] = df["score_diff"] * df["elo_diff"]

These three features combined typically add 1-3 percentage points of AUC over the raw feature set. Worth the few lines of code.

Model selection

For most sports (basketball, hockey, football, baseball, esports), XGBoost with 300-500 trees, max depth 4-6, learning rate 0.03-0.08 is a strong default. Tune via cross-validation on the per-sport dataset.

from xgboost import XGBClassifier

# early_stopping_rounds and eval_metric are constructor arguments in
# xgboost >= 1.6; early stopping monitors the eval_set passed to fit() below.
clf = XGBClassifier(
    n_estimators=400,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="binary:logistic",
    eval_metric="logloss",
    early_stopping_rounds=20,
    random_state=42,
)
clf.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],  # held-out games, not held-out rows
    verbose=False,
)
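
For the cross-validation step, the folds should respect game boundaries just like the train/test split. A minimal sketch with GroupKFold, assuming X and y are arrays of snapshot features and labels and games holds one game id per row (all three names are illustrative):

from sklearn.model_selection import GroupKFold, cross_val_score

# One candidate configuration; repeat over a small grid and compare fold means.
# No early_stopping_rounds here, since cross_val_score passes no eval_set.
cv_clf = XGBClassifier(
    n_estimators=400,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    objective="binary:logistic",
    eval_metric="logloss",
    random_state=42,
)
scores = cross_val_score(
    cv_clf, X, y,
    groups=games,               # folds hold out whole games
    cv=GroupKFold(n_splits=5),
    scoring="neg_log_loss",
)
print(scores.mean())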

For soccer, the right architecture is a Poisson process model with team-specific attack/defense rates and league-strength normalization. XGBoost on soccer wastes the structure of the problem.
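
A minimal sketch of that structure, assuming the team-specific attack/defense rates have already been folded into an expected-goal rate for each side (the function name and arguments are illustrative):

import numpy as np
from scipy.stats import poisson

def soccer_outcome_probs(home_goal_rate, away_goal_rate, max_goals=10):
    """Home win / draw / away win from two independent Poisson goal rates."""
    goals = np.arange(max_goals + 1)
    p_home = poisson.pmf(goals, home_goal_rate)   # P(home scores k goals)
    p_away = poisson.pmf(goals, away_goal_rate)   # P(away scores k goals)
    joint = np.outer(p_home, p_away)              # joint score-line grid
    home_win = np.tril(joint, -1).sum()           # home goals > away goals
    draw = np.trace(joint)
    away_win = np.triu(joint, 1).sum()
    return home_win, draw, away_win

For an in-play estimate, the same grid is built from the goals expected in the remaining minutes and offset by the current score.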

For tennis, the right architecture is a hierarchical point-game-set-match model with surface-adjusted serve rates. The point-level Bernoulli rolls up to game-level via deuce dynamics, then to set-level, then to match-level.
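
To make the first rung of that hierarchy concrete, the point-to-game rollup has a closed form (a sketch; the set and match levels chain on top of it in the same spirit):

def game_win_prob(p):
    """P(server wins a game) given the per-point win probability p."""
    q = 1 - p
    # Win the game to 0, 15, or 30
    straight = p**4 * (1 + 4 * q + 10 * q**2)
    # Reach deuce (20 ways to split 3-3), then win the deuce cycle:
    # P(win from deuce) = p^2 / (1 - 2pq)
    deuce = 20 * p**3 * q**3 * (p**2 / (1 - 2 * p * q))
    return straight + deuce

# A server winning 65% of points holds serve about 83% of the time
print(game_win_prob(0.65))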

The general principle: match the model architecture to the natural structure of the sport. A flat XGBoost on tennis loses to a properly-built hierarchical model by 4-6 percentage points of accuracy.

Calibration

A trained classifier produces probabilities, but they are usually not well-calibrated. Train an isotonic regression on a held-out set to map raw probabilities to calibrated ones:

from sklearn.isotonic import IsotonicRegression
import numpy as np

raw_probs = clf.predict_proba(X_calib)[:, 1]
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_probs, y_calib)

# At inference
def predict_calibrated(features):
    raw = clf.predict_proba(features)[:, 1]
    return iso.transform(raw)

Always evaluate calibration with Expected Calibration Error after fitting:

def expected_calibration_error(y_true, y_prob, n_bins=15):
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    bins = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        # The right edge is inclusive for the last bin so probabilities of
        # exactly 1.0 are still counted.
        mask = (y_prob >= lo) & ((y_prob < hi) | (hi == 1.0))
        if mask.sum() == 0:
            continue
        bin_acc = y_true[mask].mean()
        bin_conf = y_prob[mask].mean()
        ece += (mask.sum() / len(y_true)) * abs(bin_acc - bin_conf)
    return ece
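
Run it on held-out games for both the raw and calibrated probabilities to confirm the isotonic step actually helped (X_test and y_test stand in for the game-level holdout from earlier):

raw_test = clf.predict_proba(X_test)[:, 1]
print("raw ECE:       ", expected_calibration_error(y_test, raw_test))
print("calibrated ECE:", expected_calibration_error(y_test, iso.transform(raw_test)))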

Aim for ECE under 0.05 in production. Above 0.07 and the probabilities are not safe to size positions against. Above 0.10 and the model is broken.

Live recalibration

Calibration drifts. The team strengths your model learned three months ago are no longer accurate. Player injuries, trades, and rule changes shift the underlying distributions.

Two countermeasures:

Periodic retraining on a fresh slice of recent data. We retrain the moneyline models weekly. This catches medium-term drift without making the model unstable.

Live recalibration via a rolling isotonic on the most recent N=500 outcomes per sport. This catches short-term drift without requiring a full retrain. The recalibrator runs in-process and adjusts the model's output before it reaches the trading layer.

from collections import deque
from sklearn.isotonic import IsotonicRegression

class RollingRecalibrator:
    """Rolling isotonic fit on the most recent `window` settled outcomes."""

    def __init__(self, window=500):
        self.window = window
        self.preds = deque(maxlen=window)    # probabilities we quoted
        self.actuals = deque(maxlen=window)  # 0/1 outcomes once games settle
        self._iso = None

    def record(self, pred, actual):
        self.preds.append(pred)
        self.actuals.append(actual)
        # Refit once there is enough history for isotonic to be meaningful
        if len(self.preds) >= 50:
            self._iso = IsotonicRegression(out_of_bounds="clip")
            self._iso.fit(list(self.preds), list(self.actuals))

    def adjust(self, raw_prob):
        # Pass-through until the first fit
        if self._iso is None:
            return raw_prob
        return float(self._iso.transform([raw_prob])[0])
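
Usage is two calls, sketched here with illustrative numbers: record when a game settles, adjust at inference time.

recal = RollingRecalibrator(window=500)

# After a game settles: log the probability we quoted and the outcome
recal.record(0.62, 1)

# At inference: pass the offline-calibrated probability through the roller
adjusted = recal.adjust(0.58)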

The combination of weekly retraining and live recalibration keeps ECE within bounds across the whole season.

End-to-end

Putting it together, a single inference path looks like:

def predict_win_probability(sport, game_state):
    # 1. Featurize
    features = featurize(sport, game_state)
    # 2. Base model
    raw_prob = SPORT_MODELS[sport].predict_proba([features])[:, 1][0]
    # 3. Calibrate (offline isotonic)
    calibrated = SPORT_CALIBRATORS[sport].transform([raw_prob])[0]
    # 4. Live recalibrate (rolling isotonic)
    final = RECALIBRATORS[sport].adjust(calibrated)
    return float(final)

Four steps. Each one matters. Skip the recalibrator and your model drifts. Skip the calibrator and Kelly sizing breaks. Skip the featurizer and the model never converges. Skip the base model and you have nothing to calibrate.

The bottom line

Going from play-by-play to a tradable win probability is a four-stage pipeline: label correctly (split by game), engineer features that match the sport's structure, choose a model architecture that respects the natural shape of the problem, and calibrate continuously. Each stage adds value. Each missing stage breaks something downstream.

The production version of this pipeline

ZenHodl runs win-probability models for 11 sports using exactly the pipeline described in this post. Per-sport ECE published. Free seven-day trial.

Try ZenHodl free