From data collection to model training to serving live win probabilities via REST — with working code.
Published April 12, 2026 · 8 min read
If you've ever wanted to predict the outcome of a live NBA game, an MLB matchup, or a CS2 esports match using real data, you're in the right place. In this guide, we'll walk through the core components of a sports prediction API — from data collection to model training to serving live predictions.
By the end, you'll understand the architecture behind systems like ZenHodl's prediction API, which serves calibrated win probabilities for 10 sports in real-time.
A prediction API takes in a game state (score, time remaining, team quality) and returns a probability:
GET /v1/games?sport=NBA
{
  "game_id": "nba_2026041201",
  "home_team": "BOS",
  "away_team": "MIA",
  "home_score": 58,
  "away_score": 45,
  "period": 3,
  "home_win_probability": 0.847,
  "model": "xgboost_v3_calibrated",
  "updated_at": "2026-04-12T02:30:00Z"
}
The key challenge isn't building the API endpoint — it's making the probability estimate accurate and well-calibrated. A calibrated model means: when it says 70%, the team actually wins ~70% of the time.
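That property can be sanity-checked by binning predictions and comparing each bin's mean predicted probability with the observed win rate. A minimal sketch:

```python
import numpy as np

def calibration_table(probs, outcomes, n_bins=10):
    """Bin predictions and compare mean predicted prob vs. actual win rate."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    # Assign each prediction to a probability bin: [0.0-0.1), [0.1-0.2), ...
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        rows.append({
            "bin": f"{b / n_bins:.1f}-{(b + 1) / n_bins:.1f}",
            "predicted": probs[mask].mean(),
            "actual": outcomes[mask].mean(),
            "n": int(mask.sum()),
        })
    return rows
```

In a well-calibrated model, `predicted` and `actual` track each other closely in every bin that has enough games.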
You need play-by-play or score-update snapshots with timestamps. The best free source in 2026 is the unofficial ESPN API:
import requests

def get_nba_scoreboard():
    """Fetch today's NBA games from ESPN's unofficial scoreboard endpoint."""
    url = "https://site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    data = resp.json()

    games = []
    for event in data.get("events", []):
        competition = event["competitions"][0]
        # Don't rely on array order; ESPN labels each competitor home/away.
        home = next(c for c in competition["competitors"] if c["homeAway"] == "home")
        away = next(c for c in competition["competitors"] if c["homeAway"] == "away")
        games.append({
            "game_id": event["id"],
            "home_team": home["team"]["abbreviation"],
            "away_team": away["team"]["abbreviation"],
            "home_score": int(home["score"]),
            "away_score": int(away["score"]),
            "period": competition.get("status", {}).get("period", 0),
            "clock": competition.get("status", {}).get("displayClock", ""),
        })
    return games
For historical data, you'll want 3-5 seasons. Sources include Basketball Reference for NBA, FanGraphs for MLB, and Jeff Sackmann's GitHub for tennis.
Raw scores aren't enough. You need features that capture game context:
def build_features(game_state, team_stats, elo_ratings):
    home = game_state["home_team"]
    away = game_state["away_team"]
    score_diff = game_state["home_score"] - game_state["away_score"]
    # 2880 = 48 minutes of NBA regulation, in seconds
    time_fraction = 1 - (game_state["seconds_remaining"] / 2880)
    return {
        "score_diff": score_diff,
        "seconds_remaining": game_state["seconds_remaining"],
        "period": game_state["period"],
        "time_fraction": time_fraction,
        "elo_diff": elo_ratings.get(home, 1500) - elo_ratings.get(away, 1500),
        # offensive/defensive rating differentials
        "ortg_diff": team_stats[home]["ortg"] - team_stats[away]["ortg"],
        "drtg_diff": team_stats[home]["drtg"] - team_stats[away]["drtg"],
        # interaction: a lead matters more as time runs out
        "score_diff_x_tf": score_diff * time_fraction,
        "score_diff_sq": score_diff ** 2,
    }
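The elo_diff feature assumes you maintain per-team Elo ratings. A minimal sketch of updating them from historical results, using the standard Elo formula (the K-factor of 20 is a common choice, not taken from the article):

```python
def update_elo(ratings, home, away, home_won, k=20.0):
    """Standard Elo update after one game; unseen teams start at 1500."""
    r_home = ratings.get(home, 1500.0)
    r_away = ratings.get(away, 1500.0)
    # Expected score of the home team from the rating gap
    expected_home = 1.0 / (1.0 + 10 ** ((r_away - r_home) / 400.0))
    actual = 1.0 if home_won else 0.0
    shift = k * (actual - expected_home)
    ratings[home] = r_home + shift
    ratings[away] = r_away - shift
    return ratings
```

Replay your historical games in chronological order through this function and the final dict is the `elo_ratings` input above.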
By XGBoost feature importance, score_diff, time_fraction, and their interaction (score_diff_x_tf) typically dominate: a 10-point lead means far more with two minutes left than with a full half to play.
XGBoost is the standard for tabular sports prediction:
import xgboost as xgb
from sklearn.metrics import brier_score_loss, roc_auc_score
from sklearn.isotonic import IsotonicRegression  # used in the calibration step

# X, y must be sorted chronologically so this index split is walk-forward
X_train, X_test = X[:split_idx], X[split_idx:]
y_train, y_test = y[:split_idx], y[split_idx:]

model = xgb.XGBClassifier(
    n_estimators=300, max_depth=5,
    learning_rate=0.05, subsample=0.8,
    colsample_bytree=0.8, eval_metric="logloss",
)
model.fit(X_train, y_train)

raw_probs = model.predict_proba(X_test)[:, 1]
print(f"Brier score: {brier_score_loss(y_test, raw_probs):.4f}")
print(f"ROC AUC: {roc_auc_score(y_test, raw_probs):.4f}")
Critical: use walk-forward splits, not random splits. Random splits leak future game information into training. Walk-forward means all training data is older than all test data.
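scikit-learn ships a walk-forward splitter, `TimeSeriesSplit`, which enforces exactly this property (the toy data below is illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy chronological data: rows must already be sorted oldest-first.
X = np.arange(20).reshape(-1, 1)
y = np.arange(20) % 2

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Walk-forward property: every training row precedes every test row.
    assert train_idx.max() < test_idx.min()
```

Each successive fold trains on a longer prefix of history and tests on the games that follow it, which mirrors how the model will actually be used in production.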
Raw XGBoost outputs are often overconfident. Isotonic regression fixes this:
# cal_probs / y_cal: raw model outputs and labels from a held-out
# calibration fold (still older than the final test set)
calibrator = IsotonicRegression(y_min=0.01, y_max=0.99, out_of_bounds="clip")
calibrator.fit(cal_probs, y_cal)

calibrated = calibrator.transform(raw_probs)
print(f"Brier AFTER calibration: {brier_score_loss(y_test, calibrated):.4f}")
After calibration, when your model says 70%, teams should actually win ~70% of the time. For reference, ZenHodl's production models achieve Expected Calibration Error of 0.002 across NBA, NHL, MLB, and LoL.
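Expected Calibration Error summarizes calibration in a single number: the bin-weighted average gap between predicted and observed win rates. A minimal binned implementation (a generic sketch, not ZenHodl's exact metric):

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average |mean predicted - observed win rate| over probability bins."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # Weight each bin's gap by the fraction of samples it contains
            ece += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return ece
```

Lower is better; a perfectly calibrated model scores 0.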
from fastapi import FastAPI
import pandas as pd
import pickle

app = FastAPI()

with open("wp_model_NBA.pkl", "rb") as f:
    model_data = pickle.load(f)
model = model_data["model"]
calibrator = model_data["calibrator"]

@app.get("/v1/predict")
def predict(home_team: str, away_team: str, home_score: int,
            away_score: int, period: int, seconds_remaining: int):
    features = build_features(...)  # dict of model features, as defined above
    X = pd.DataFrame([features])    # XGBoost expects a 2D input, not a raw dict
    raw = model.predict_proba(X)[0][1]
    calibrated = calibrator.transform([raw])[0]
    return {
        "home_team": home_team,
        "home_win_probability": round(float(calibrated), 4),
    }
A static model isn't enough for live prediction. You need overlays: real-time adjustments that stack on top of the base prediction and are capped at ±20% total adjustment.
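A minimal sketch of that capping logic, assuming overlays are additive probability adjustments (the overlay names and clamping bounds here are illustrative, not ZenHodl's actual implementation):

```python
def apply_overlays(base_prob, overlays, cap=0.20):
    """Add overlay adjustments to a base probability, capping the total
    shift at +/- cap and clamping the result to [0.01, 0.99]."""
    total = sum(overlays.values())
    total = max(-cap, min(cap, total))  # cap the combined adjustment
    return max(0.01, min(0.99, base_prob + total))

# e.g. hypothetical momentum and injury overlays on a 70% base estimate
p = apply_overlays(0.70, {"momentum": 0.05, "injury": -0.02})
```

Capping keeps a noisy overlay from overwhelming the calibrated base model, and the final clamp keeps the output a valid probability.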
ZenHodl's API serves real-time win probabilities for 10 sports with a 7-day free trial.