Your XGBoost classifier scores 1.8% Expected Calibration Error on the held-out test set. Six weeks into production, the same model is posting 21% ECE on live data. The model didn't change. The world did.
This is calibration drift — one of the most expensive silent failures in production machine learning. The model still returns a number between 0 and 1, predictions still correlate with outcomes, but the probabilities no longer match reality. If you're using those probabilities to make capital-allocation decisions — trading bots, credit scoring, ad auctions, insurance pricing — you're bleeding money at the tails.
This post covers how to detect drift quickly, apply online recalibration without a full retrain, and keep your model aligned with live distributions. All code is Python 3.10+, NumPy, and scikit-learn.
A well-calibrated classifier satisfies the property that among all instances where the model says "70% probability," roughly 70% actually happen. When drift occurs, that mapping breaks, and it typically breaks asymmetrically: the high-confidence buckets go overconfident first.
The core problem is that the feature distribution during inference has drifted from training. Your model learned P(outcome | features) under training's feature distribution. When feature distributions change — rule changes, seasonality, roster changes in sports, macro regime shifts in finance — the conditional relationship changes too.
Real example: In one of our sports win-probability models, training ECE was 0.2%. Six weeks into live trading, ECE measured across ~500 real trades was 21%. The 80-90% confidence bucket showed 44% actual wins versus 85% predicted.
Before you can fix drift, you need to quantify it. Two tools: Expected Calibration Error and the reliability diagram.
Bin predictions into N buckets (10 is standard). In each bucket, compare average predicted probability to average actual outcome. ECE is the weighted average of those differences.
```python
import numpy as np

def expected_calibration_error(y_true: np.ndarray,
                               y_proba: np.ndarray,
                               n_bins: int = 10) -> float:
    """
    Expected Calibration Error with equal-width bins.

    ECE = sum over bins of (n_bin / N) * |mean_pred - mean_actual|
    Returns a scalar in [0, 1]. Lower is better.
    """
    edges = np.linspace(0, 1, n_bins + 1)
    n = len(y_true)
    ece = 0.0
    for i in range(n_bins):
        # Close the last bin on the right so p == 1.0 is counted.
        hi = (y_proba <= edges[i + 1]) if i == n_bins - 1 else (y_proba < edges[i + 1])
        mask = (y_proba >= edges[i]) & hi
        if mask.sum() == 0:
            continue
        bin_pred = y_proba[mask].mean()
        bin_actual = y_true[mask].mean()
        ece += (mask.sum() / n) * abs(bin_pred - bin_actual)
    return float(ece)

# Example (y_live, p_live: outcomes and predictions collected in production)
ece = expected_calibration_error(y_live, p_live, n_bins=10)
print(f"Live ECE: {ece * 100:.2f}%")
```
ECE is one number. The reliability diagram tells you where the drift is concentrated.
```python
def reliability_diagram(y_true, y_proba, n_bins=10):
    edges = np.linspace(0, 1, n_bins + 1)
    print(f"{'Bucket':<10}{'N':>6}{'Predicted':>12}{'Actual':>10}{'Gap':>8}")
    for i in range(n_bins):
        # Close the last bin on the right so p == 1.0 is counted.
        hi = (y_proba <= edges[i + 1]) if i == n_bins - 1 else (y_proba < edges[i + 1])
        mask = (y_proba >= edges[i]) & hi
        c = int(mask.sum())
        if c < 10:
            continue  # too few samples for the bucket average to mean anything
        ap = y_proba[mask].mean() * 100
        at = y_true[mask].mean() * 100
        diff = ap - at
        flag = " <--- overconfident" if diff > 15 else ""
        bucket = f"{int(edges[i]*100)}-{int(edges[i+1]*100)}%"
        print(f"{bucket:<10}{c:>6}{ap:>11.1f}%{at:>9.1f}%{diff:>+8.1f}{flag}")
```
A healthy reliability diagram has all buckets within 3-5 percentage points of the diagonal. Anything worse than 10pp in a bucket where you size positions heavily is expensive.
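To make monitoring actionable, here is a minimal, self-contained drift-alert sketch. The `check_calibration_drift` name and the 5-point alert threshold are illustrative choices, not fixed conventions; tune the threshold to how expensive a miscalibrated tail is for you:

```python
import numpy as np

# Illustrative alert threshold: flag when live ECE exceeds 5 points.
ECE_ALERT = 0.05

def check_calibration_drift(y_recent, p_recent, n_bins=10,
                            threshold=ECE_ALERT):
    """Return (ece, drifted) computed over the most recent outcomes."""
    y = np.asarray(y_recent, dtype=float)
    p = np.asarray(p_recent, dtype=float)
    # Assign each prediction to an equal-width bin; clip so p == 1.0
    # lands in the top bin instead of falling off the edge.
    edges = np.linspace(0, 1, n_bins + 1)
    bins = np.clip(np.digitize(p, edges) - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece, ece > threshold
```

Run this on the same rolling window you use for recalibration, and page a human (or gate position sizing) when `drifted` flips to true.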
The strategy: keep the base model frozen, fit a post-hoc calibrator on recent (prediction, outcome) pairs, and update the calibrator continuously as new outcomes arrive.
Start with a simple append-only buffer on disk. For sports prediction we use JSONL:
```python
import json
import time
from pathlib import Path
from typing import Optional

class RecalibrationBuffer:
    """Append-only buffer of (prediction, outcome) pairs per sport."""

    def __init__(self, path: Path, max_entries: int = 10_000):
        self.path = path
        self.max_entries = max_entries  # soft cap; compaction not shown here

    def record(self, sport: str, prediction: float, outcome: int,
               ts: Optional[float] = None) -> None:
        """Called at outcome resolution. Never blocks the request path."""
        entry = {
            "sport": sport,
            "pred": float(prediction),
            "outcome": int(outcome),
            "ts": ts or time.time(),
        }
        with open(self.path, "a") as f:
            f.write(json.dumps(entry) + "\n")

    def load_recent(self, sport: str, n: int = 500):
        """Return last n (pred, outcome) pairs for this sport."""
        if not self.path.exists():
            return [], []
        preds, outcomes = [], []
        with open(self.path) as f:
            for line in f:
                try:
                    e = json.loads(line)
                except Exception:
                    continue  # tolerate a torn write at the tail of the file
                if e.get("sport") != sport:
                    continue
                preds.append(e["pred"])
                outcomes.append(e["outcome"])
        return preds[-n:], outcomes[-n:]
```
Record on resolution, not prediction. If you record at prediction time with a "pending" outcome, you need a reliable backfill path when the outcome arrives. Simpler: only append when the outcome is known.
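As a standalone sketch of that append-on-resolution flow (the `record` helper, the sport keys, and the temp-file path here are illustrative, not the production setup):

```python
import json
import os
import tempfile
import time
from pathlib import Path

fd, tmp = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
path = Path(tmp)

def record(sport, pred, outcome):
    # One line per *resolved* outcome; no pending state to backfill later.
    with open(path, "a") as f:
        f.write(json.dumps({"sport": sport, "pred": pred,
                            "outcome": outcome, "ts": time.time()}) + "\n")

record("nba", 0.72, 0)   # game resolved: model said 0.72, the team lost
record("nba", 0.55, 1)
record("mlb", 0.60, 1)   # other sports append to the same file

# Reading back filters by sport and preserves insertion order.
with open(path) as f:
    pairs = [(e["pred"], e["outcome"])
             for e in map(json.loads, f) if e["sport"] == "nba"]
```

Because every line is a complete record, the buffer survives process restarts for free, and a crashed writer can at worst leave one torn line at the tail.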
Isotonic regression is the right calibrator for most drift scenarios. It's non-parametric (fits any monotonic mapping), uses no distributional assumptions, and handles small sample sizes gracefully.
```python
from sklearn.isotonic import IsotonicRegression
from typing import Dict
import numpy as np

class LiveRecalibrator:
    """Per-sport isotonic regression that refits periodically from a buffer."""

    def __init__(self,
                 buffer: RecalibrationBuffer,
                 window: int = 500,
                 min_samples: int = 50,
                 refit_every_n: int = 10):
        self.buffer = buffer
        self.window = window
        self.min_samples = min_samples
        self.refit_every_n = refit_every_n
        self._models: Dict[str, IsotonicRegression] = {}
        self._fit_count: Dict[str, int] = {}

    def _refit(self, sport: str) -> None:
        preds, outcomes = self.buffer.load_recent(sport, n=self.window)
        if len(preds) < self.min_samples:
            return
        iso = IsotonicRegression(y_min=0.01, y_max=0.99, out_of_bounds="clip")
        iso.fit(np.asarray(preds), np.asarray(outcomes))
        self._models[sport] = iso

    def record_and_maybe_refit(self, sport: str, pred: float, outcome: int):
        self.buffer.record(sport, pred, outcome)
        count = self._fit_count.get(sport, 0) + 1
        self._fit_count[sport] = count
        if count % self.refit_every_n == 0:
            self._refit(sport)

    def adjust(self, sport: str, raw_pred: float) -> float:
        """Apply current calibrator. Returns raw_pred unchanged if no fit yet."""
        iso = self._models.get(sport)
        if iso is None:
            return raw_pred
        return float(iso.predict([raw_pred])[0])
```
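To see what the isotonic step actually does, here is a standalone sketch on synthetic data (independent of the classes above): a model that keeps saying 0.70-0.95 while the true hit rate across that whole range has fallen to about 55%:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(42)

# Synthetic drift: predictions cluster in 0.70-0.95, but outcomes are
# Bernoulli(0.55) regardless of what the model says.
preds = rng.uniform(0.70, 0.95, size=400)
outcomes = (rng.uniform(size=400) < 0.55).astype(int)

iso = IsotonicRegression(y_min=0.01, y_max=0.99, out_of_bounds="clip")
iso.fit(preds, outcomes)

# A raw 0.85 gets pulled down toward the observed hit rate.
adjusted = float(iso.predict([0.85])[0])
```

The fitted mapping is monotone, so relative ordering of predictions is preserved; only the probability scale moves.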
Key design decisions in this class:

- `window=500`: fit on a rolling window of recent outcomes so the calibrator tracks the current regime instead of averaging over stale ones.
- `min_samples=50`: below this, isotonic regression fits noise; the class keeps returning raw predictions until it has signal.
- `refit_every_n=10`: refitting on every resolved outcome is wasted work; every tenth resolution keeps the calibrator fresh at negligible cost.
- `y_min=0.01`, `y_max=0.99`, `out_of_bounds="clip"`: never emit exactly 0 or 1, and clip inputs outside the fitted range instead of raising.
In the prediction path, apply the calibrator as the last step before returning the probability to callers:
```python
def predict_wp(sport: str, features: np.ndarray) -> float:
    raw_pred = float(base_model.predict_proba(features)[0, 1])
    calibrated = recalibrator.adjust(sport, raw_pred)
    return calibrated

# In the outcome-resolution worker:
def on_game_resolved(sport: str, prediction: float, outcome: int):
    recalibrator.record_and_maybe_refit(sport, prediction, outcome)
```
Two paths in one codebase: the prediction path computes and calibrates, and the resolution path records outcomes. They share the buffer. If the recalibrator misbehaves, the prediction path must never break:
```python
import logging

logger = logging.getLogger(__name__)

def predict_wp_safe(sport: str, features: np.ndarray) -> float:
    raw_pred = float(base_model.predict_proba(features)[0, 1])
    try:
        return recalibrator.adjust(sport, raw_pred)
    except Exception:
        logger.exception("Recalibrator failed, returning raw prediction")
        return raw_pred
```
Online recalibration fixes calibration but not discrimination. If your model's AUC is dropping, you don't have a calibration problem — you have a model problem. Retrain.
| Symptom | Fix |
|---|---|
| ECE rising, AUC stable | Online recalibration |
| AUC dropping, ECE rising | Retrain on recent data |
| Structural break (regime change) | Retrain + add regime features |
| Data source changed upstream | Investigate data, likely retrain |
Track both metrics weekly. Recalibration is a bandage over a gradually shifting distribution; retraining replaces the model underneath when the bandage stops holding.
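As one way to wire that table into a weekly job, here is a hedged sketch; the `triage` function and its thresholds are illustrative, not from the codebase:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def triage(y, p, ece_limit=0.05, auc_floor=0.60, n_bins=10):
    """Weekly decision per the symptom table; thresholds are illustrative."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    auc = roc_auc_score(y, p)
    # Equal-width ECE, same definition as earlier in the post.
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = sum((bins == b).mean() * abs(p[bins == b].mean() - y[bins == b].mean())
              for b in range(n_bins) if (bins == b).any())
    if auc < auc_floor:
        return "retrain"        # discrimination is gone; recalibration can't help
    if ece > ece_limit:
        return "recalibrate"    # ranking is fine, probabilities are off
    return "healthy"
```

The ordering matters: check discrimination first, because recalibrating a model that can no longer rank outcomes just produces well-calibrated noise.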
ZenHodl runs rolling isotonic recalibrators for 11 sports in production. Access calibrated win probabilities, edge detection signals, and verified live P&L via a simple REST API — no data pipeline required.
See ZenHodl's live API