Your XGBoost classifier scores 1.8% Expected Calibration Error on the held-out test set. Six weeks into production, the same model is posting 21% ECE on live data. The model didn't change. The world did.
This is calibration drift — one of the most expensive silent failures in production machine learning. The model still returns a number between 0 and 1, predictions still correlate with outcomes, but the probabilities no longer match reality. If you're using those probabilities to make capital-allocation decisions — trading bots, credit scoring, ad auctions, insurance pricing — you're bleeding money at the tails.
This post covers how to detect drift quickly, apply online recalibration without a full retrain, and keep your model aligned with live distributions. All code is Python 3.10+, NumPy, and scikit-learn.
A well-calibrated classifier satisfies the property that among all instances where the model says "70% probability," roughly 70% actually happen. When drift occurs, that mapping breaks, and it typically breaks asymmetrically: the high-confidence buckets go overconfident first.
The core problem is that the feature distribution during inference has drifted from training. Your model learned P(outcome | features) under training's feature distribution. When feature distributions change — rule changes, seasonality, roster changes in sports, macro regime shifts in finance — the conditional relationship changes too.
Real example: In one of our sports win-probability models, training ECE was 0.2%. Six weeks into live trading, ECE measured across ~500 real trades was 21%. The 80-90% confidence bucket showed 44% actual wins versus 85% predicted.
Before you can fix drift, you need to quantify it. Two tools: Expected Calibration Error and the reliability diagram.
Bin predictions into N buckets (10 is standard). In each bucket, compare average predicted probability to average actual outcome. ECE is the weighted average of those differences.
```python
import numpy as np

def expected_calibration_error(y_true: np.ndarray,
                               y_proba: np.ndarray,
                               n_bins: int = 10) -> float:
    """
    Expected Calibration Error with equal-width bins.

    ECE = sum over bins of (n_bin / N) * |mean_pred - mean_actual|
    Returns a scalar in [0, 1]. Lower is better.
    """
    edges = np.linspace(0, 1, n_bins + 1)
    n = len(y_true)
    ece = 0.0
    for i in range(n_bins):
        # Close the last bin on the right so p == 1.0 is counted.
        hi = (y_proba <= edges[i + 1]) if i == n_bins - 1 else (y_proba < edges[i + 1])
        mask = (y_proba >= edges[i]) & hi
        if mask.sum() == 0:
            continue
        bin_pred = y_proba[mask].mean()
        bin_actual = y_true[mask].mean()
        ece += (mask.sum() / n) * abs(bin_pred - bin_actual)
    return float(ece)

# Example (y_live, p_live: outcomes and predictions collected in production)
ece = expected_calibration_error(y_live, p_live, n_bins=10)
print(f"Live ECE: {ece * 100:.2f}%")
```
ECE is one number. The reliability diagram tells you where the drift is concentrated.
```python
def reliability_diagram(y_true, y_proba, n_bins=10):
    edges = np.linspace(0, 1, n_bins + 1)
    print(f"{'Bucket':<10}{'N':>6}{'Predicted':>12}{'Actual':>10}{'Gap':>8}")
    for i in range(n_bins):
        # Close the last bin on the right so p == 1.0 is counted.
        hi = (y_proba <= edges[i + 1]) if i == n_bins - 1 else (y_proba < edges[i + 1])
        mask = (y_proba >= edges[i]) & hi
        c = int(mask.sum())
        if c < 10:
            continue  # too few samples for the bucket average to mean anything
        ap = y_proba[mask].mean() * 100
        at = y_true[mask].mean() * 100
        diff = ap - at
        flag = " <--- overconfident" if diff > 15 else ""
        bucket = f"{int(edges[i]*100)}-{int(edges[i+1]*100)}%"
        print(f"{bucket:<10}{c:>6}{ap:>11.1f}%{at:>9.1f}%{diff:>+8.1f}{flag}")
```
A healthy reliability diagram has all buckets within 3-5 percentage points of the diagonal. Anything worse than 10pp in a bucket where you size positions heavily is expensive.
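To make monitoring actionable, here is a minimal, self-contained drift-alert sketch. The `check_calibration_drift` name and the 5-point alert threshold are illustrative choices, not fixed conventions; tune the threshold to how expensive a miscalibrated tail is for you:

```python
import numpy as np

# Illustrative alert threshold: flag when live ECE exceeds 5 points.
ECE_ALERT = 0.05

def check_calibration_drift(y_recent, p_recent, n_bins=10,
                            threshold=ECE_ALERT):
    """Return (ece, drifted) computed over the most recent outcomes."""
    y = np.asarray(y_recent, dtype=float)
    p = np.asarray(p_recent, dtype=float)
    # Assign each prediction to an equal-width bin; clip so p == 1.0
    # lands in the top bin instead of falling off the edge.
    edges = np.linspace(0, 1, n_bins + 1)
    bins = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece, ece > threshold
```

Run this on the same rolling window you use for recalibration, and page a human (or gate position sizing) when `drifted` flips to true.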
The strategy: keep the base model frozen, fit a post-hoc calibrator on recent (prediction, outcome) pairs, and update the calibrator continuously as new outcomes arrive.
Start with a simple append-only buffer on disk. For sports prediction we use JSONL:
```python
import json
import time
from pathlib import Path
from typing import Optional

class RecalibrationBuffer:
    """Append-only buffer of (prediction, outcome) pairs per sport."""

    def __init__(self, path: Path, max_entries: int = 10_000):
        self.path = path
        self.max_entries = max_entries  # soft cap; compaction not shown here

    def record(self, sport: str, prediction: float, outcome: int,
               ts: Optional[float] = None) -> None:
        """Called at outcome resolution. Never blocks the request path."""
        entry = {
            "sport": sport,
            "pred": float(prediction),
            "outcome": int(outcome),
            "ts": ts or time.time(),
        }
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def load_recent(self, sport: str, n: int = 500):
        """Return last n (pred, outcome) pairs for this sport."""
        if not self.path.exists():
            return [], []
        preds, outcomes = [], []
        with open(self.path) as f:
            for line in f:
                try:
                    e = json.loads(line)
                except Exception:
                    continue  # tolerate a torn write at the tail of the file
                if e.get("sport") != sport:
                    continue
                preds.append(e["pred"])
                outcomes.append(e["outcome"])
        return preds[-n:], outcomes[-n:]
```
Record on resolution, not prediction. If you record at prediction time with a "pending" outcome, you need a reliable backfill path when the outcome arrives. Simpler: only append when the outcome is known.
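As a standalone sketch of that append-on-resolution flow (the `record` helper, the sport keys, and the temp-file path here are illustrative, not the production setup):

```python
import json
import os
import tempfile
import time
from pathlib import Path

fd, tmp = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
path = Path(tmp)

def record(sport, pred, outcome):
    # One line per *resolved* outcome; no pending state to backfill later.
    with open(path, "a") as f:
        f.write(json.dumps({"sport": sport, "pred": pred,
                            "outcome": outcome, "ts": time.time()}) + "\n")

record("nba", 0.72, 0)   # game resolved: model said 0.72, the team lost
record("nba", 0.55, 1)
record("mlb", 0.60, 1)   # other sports append to the same file

# Reading back filters by sport and preserves insertion order.
with open(path) as f:
    pairs = [(e["pred"], e["outcome"])
             for e in map(json.loads, f) if e["sport"] == "nba"]
```

Because every line is a complete record, the buffer survives process restarts for free, and a crashed writer can at worst leave one torn line at the tail.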
Isotonic regression is the right calibrator for most drift scenarios. It's non-parametric (fits any monotonic mapping), uses no distributional assumptions, and handles small sample sizes gracefully.
```python
from sklearn.isotonic import IsotonicRegression
from typing import Dict
import numpy as np

class LiveRecalibrator:
    """Per-sport isotonic regression that refits periodically from a buffer."""

    def __init__(self,
                 buffer: RecalibrationBuffer,
                 window: int = 500,
                 min_samples: int = 50,
                 refit_every_n: int = 10):
        self.buffer = buffer
        self.window = window
        self.min_samples = min_samples
        self.refit_every_n = refit_every_n
        self._models: Dict[str, IsotonicRegression] = {}
        self._fit_count: Dict[str, int] = {}

    def _refit(self, sport: str) -> None:
        preds, outcomes = self.buffer.load_recent(sport, n=self.window)
        if len(preds) < self.min_samples:
            return
        iso = IsotonicRegression(y_min=0.01, y_max=0.99, out_of_bounds="clip")
        iso.fit(np.asarray(preds), np.asarray(outcomes))
        self._models[sport] = iso

    def record_and_maybe_refit(self, sport: str, pred: float, outcome: int):
        self.buffer.record(sport, pred, outcome)
        count = self._fit_count.get(sport, 0) + 1
        self._fit_count[sport] = count
        if count % self.refit_every_n == 0:
            self._refit(sport)

    def adjust(self, sport: str, raw_pred: float) -> float:
        """Apply current calibrator. Returns raw_pred unchanged if no fit yet."""
        iso = self._models.get(sport)
        if iso is None:
            return raw_pred
        return float(iso.predict([raw_pred])[0])
```
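To see what the isotonic step actually does, here is a standalone sketch on synthetic data (independent of the classes above): a model that keeps saying 0.70-0.95 while the true hit rate across that whole range has fallen to about 55%:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(42)

# Synthetic drift: predictions cluster in 0.70-0.95, but outcomes are
# Bernoulli(0.55) regardless of what the model says.
preds = rng.uniform(0.70, 0.95, size=400)
outcomes = (rng.uniform(size=400) < 0.55).astype(int)

iso = IsotonicRegression(y_min=0.01, y_max=0.99, out_of_bounds="clip")
iso.fit(preds, outcomes)

# A raw 0.85 gets pulled down toward the observed hit rate.
adjusted = float(iso.predict([0.85])[0])
```

The fitted mapping is monotone, so relative ordering of predictions is preserved; only the probability scale moves.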
Key design decisions in this class:

- `window=500`: fit on a rolling window of recent outcomes so the calibrator tracks the current regime instead of averaging over stale ones.
- `min_samples=50`: below this, isotonic regression fits noise; the class keeps returning raw predictions until it has signal.
- `refit_every_n=10`: refitting on every resolved outcome is wasted work; every tenth resolution keeps the calibrator fresh at negligible cost.
- `y_min=0.01`, `y_max=0.99`, `out_of_bounds="clip"`: never emit exactly 0 or 1, and clip inputs outside the fitted range instead of raising.
In the prediction path, apply the calibrator as the last step before returning the probability to callers:
```python
def predict_wp(sport: str, features: np.ndarray) -> float:
    raw_pred = float(base_model.predict_proba(features)[0, 1])
    calibrated = recalibrator.adjust(sport, raw_pred)
    return calibrated

# In the outcome-resolution worker:
def on_game_resolved(sport: str, prediction: float, outcome: int):
    recalibrator.record_and_maybe_refit(sport, prediction, outcome)
```
Two paths in one codebase: the prediction path computes and calibrates, and the resolution path records outcomes. They share the buffer. If the recalibrator misbehaves, the prediction path must never break:
```python
import logging

logger = logging.getLogger(__name__)

def predict_wp_safe(sport: str, features: np.ndarray) -> float:
    raw_pred = float(base_model.predict_proba(features)[0, 1])
    try:
        return recalibrator.adjust(sport, raw_pred)
    except Exception:
        logger.exception("Recalibrator failed, returning raw prediction")
        return raw_pred
```
Online recalibration fixes calibration but not discrimination. If your model's AUC is dropping, you don't have a calibration problem — you have a model problem. Retrain.
| Symptom | Fix |
|---|---|
| ECE rising, AUC stable | Online recalibration |
| AUC dropping, ECE rising | Retrain on recent data |
| Structural break (regime change) | Retrain + add regime features |
| Data source changed upstream | Investigate data, likely retrain |
Track both metrics weekly. Recalibration is a bandage over a gradually shifting distribution; retraining replaces the model underneath when the bandage stops holding.
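As one way to wire that table into a weekly job, here is a hedged sketch; the `triage` function and its thresholds are illustrative, not from the codebase:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def triage(y, p, ece_limit=0.05, auc_floor=0.60, n_bins=10):
    """Weekly decision per the symptom table; thresholds are illustrative."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    auc = roc_auc_score(y, p)
    # Equal-width ECE, same definition as earlier in the post.
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = sum((bins == b).mean() * abs(p[bins == b].mean() - y[bins == b].mean())
              for b in range(n_bins) if (bins == b).any())
    if auc < auc_floor:
        return "retrain"        # discrimination is gone; recalibration can't help
    if ece > ece_limit:
        return "recalibrate"    # ranking is fine, probabilities are off
    return "healthy"
```

The ordering matters: check discrimination first, because recalibrating a model that can no longer rank outcomes just produces well-calibrated noise.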
ZenHodl runs rolling isotonic recalibrators for 11 sports in production. Access calibrated win probabilities, edge detection signals, and verified live P&L via a simple REST API — no data pipeline required.
See ZenHodl's live API