Your trading bot just lost three days in a row on one strategy. The model is bad, or the regime changed, or the data feed is corrupt; you don't know yet. By the time you notice, diagnose, and act, it may well have lost three more. That's the gap a circuit breaker closes.
The circuit breaker pattern, borrowed from electrical engineering and microservices, gives a system the ability to automatically suspend a failing component until the failure passes. In a trading bot, "failing component" is usually a single strategy or instrument that's bleeding while the rest of the portfolio is fine. You want to disable just that strategy, not the whole bot, and you want it to come back online automatically when conditions normalize — not next Tuesday after you remember to flip a switch.
This post walks through a production circuit breaker we run across 10 sports on automated prediction market bots. It's hysteresis-protected, fail-open, JSON-persisted, and roughly 200 lines of Python. All code is Python 3.10+ and uses only the standard library plus pandas (for the rolling P&L calculation); the tests additionally use pytest.
Without a circuit breaker, a bot in drawdown has only two states: trading and not trading. Switching between them requires a human noticing, deciding, and acting. That loop has a typical latency of 24-72 hours, during which a degraded strategy continues to lose money.
A daily P&L kill-switch (cap session losses at −$50, restart tomorrow) handles intra-day failures. The circuit breaker handles multi-day failures — the slower, more expensive class. The two are complementary: kill-switch is intra-session, breaker is multi-session.
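For reference, the intra-session kill-switch can be as small as the sketch below. Only the −$50 cap comes from the text; the class name `DailyKillSwitch`, its methods, and the reset on a new UTC calendar day are assumptions made for illustration, not the production implementation.

```python
from datetime import date, datetime, timezone
from typing import Optional


class DailyKillSwitch:
    """Hypothetical intra-session guard: stop trading once the day's loss hits a cap."""

    def __init__(self, max_daily_loss: float = 50.0):
        self.max_daily_loss = max_daily_loss
        self._day: Optional[date] = None
        self._pnl = 0.0

    def _roll(self, today: date) -> None:
        # A new calendar day resets the running P&L ("restart tomorrow").
        if today != self._day:
            self._day, self._pnl = today, 0.0

    def record(self, pnl: float, today: Optional[date] = None) -> None:
        """Add one realized P&L amount (in dollars) to today's total."""
        today = today or datetime.now(timezone.utc).date()
        self._roll(today)
        self._pnl += pnl

    def allow(self, today: Optional[date] = None) -> bool:
        """True while today's loss is still smaller than the cap."""
        today = today or datetime.now(timezone.utc).date()
        self._roll(today)
        return self._pnl > -self.max_daily_loss
```

Note that this guard has no memory across days, which is exactly why it cannot catch a strategy that loses a little every session.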
Three properties make the difference between a useful breaker and one that constantly false-fires:
1. Hysteresis. Disable threshold and re-enable threshold are different. Trip at −5% rolling 30d ROI; re-enable at >0% rolling 30d ROI. Without this gap, the breaker flaps on and off every time ROI wobbles across a single threshold, which for a marginal strategy is constantly. With it, the strategy has to actually recover before resuming.
2. Fail-open semantics. If the breaker can't read its state file, can't compute P&L, or hits any exception, it returns "allow trade." A breaker that fails closed on its own bug is worse than no breaker — it silently halts your entire system. Always default to permissive.
3. Minimum sample size. A strategy with 8 trades and −7% ROI over the last 30 days isn't necessarily broken — that's noise. Require at least 30 trades in the window before the breaker is allowed to trip. Otherwise you'll disable strategies for being unlucky on small samples.
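To make the hysteresis point concrete, here is a small sketch that counts status flips for a single-threshold breaker versus a two-threshold one. The ROI series is made up for illustration, and the function names are hypothetical; only the −5% and 0% thresholds come from the list above.

```python
def single_threshold(roi_series, threshold=0.0):
    """Blocked whenever ROI is below one threshold; no memory of the previous state."""
    return ["blocked" if roi < threshold else "active" for roi in roi_series]


def with_hysteresis(roi_series, trip=-0.05, recover=0.0):
    """Trip below `trip`; once blocked, stay blocked until ROI rises above `recover`."""
    status, out = "active", []
    for roi in roi_series:
        if status == "active" and roi < trip:
            status = "blocked"
        elif status == "blocked" and roi > recover:
            status = "active"
        out.append(status)
    return out


def count_flips(statuses):
    """Number of day-over-day status changes."""
    return sum(a != b for a, b in zip(statuses, statuses[1:]))


# Made-up rolling ROI values: hover around zero, dip hard, then recover.
roi = [0.01, -0.01, 0.02, -0.02, -0.06, -0.03, -0.01, 0.005, -0.004, 0.01]

print(count_flips(single_threshold(roi)))  # 6 flips
print(count_flips(with_hysteresis(roi)))   # 2 flips: one trip, one recovery
```

On this toy series the single threshold toggles six times in ten days, while the hysteresis version trips once on the real dip and re-enables once after ROI turns positive.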
Production note: we tuned these parameters by walking forward across a year of live trading data and asking, for each candidate (trip threshold, recovery threshold, min sample), how many real degradations would have been caught and how many healthy strategies would have been false-tripped. −5%/+0%/N=30 was the sweet spot.
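A sweep like the one described can be sketched as a replay over historical snapshots. Everything about the data shape here is an assumption: each history is a list of daily `(roi_30d, n_30d)` pairs with a hand-assigned `degraded` label, the scoring rule (most degradations caught, then fewest healthy days blocked) is one plausible choice, and the two toy histories are invented.

```python
from itertools import product


def simulate(snapshots, trip, recover, min_n):
    """Replay daily (roi, n) snapshots; return the number of days spent blocked."""
    blocked, blocked_days = False, 0
    for roi, n in snapshots:
        if n < min_n:
            blocked = False            # below min sample: "monitor", trades allowed
        elif blocked:
            blocked = not (roi > recover)
        else:
            blocked = roi < trip
        blocked_days += blocked
    return blocked_days


def sweep(histories, trips, recovers, min_ns):
    """Score each candidate on degradations caught and healthy days wrongly blocked."""
    results = []
    for trip, recover, min_n in product(trips, recovers, min_ns):
        caught = sum(
            simulate(h["snapshots"], trip, recover, min_n) > 0
            for h in histories if h["degraded"]
        )
        false_days = sum(
            simulate(h["snapshots"], trip, recover, min_n)
            for h in histories if not h["degraded"]
        )
        results.append({"trip": trip, "recover": recover, "min_n": min_n,
                        "caught": caught, "false_blocked_days": false_days})
    # Most degradations caught first, then fewest healthy days blocked.
    return sorted(results, key=lambda r: (-r["caught"], r["false_blocked_days"]))


# Invented histories: one real degradation, one healthy strategy with a noisy start.
histories = [
    {"degraded": True,  "snapshots": [(-0.02, 40), (-0.06, 45), (-0.08, 50), (-0.03, 52)]},
    {"degraded": False, "snapshots": [(-0.07, 8), (-0.04, 35), (0.01, 40), (0.02, 44)]},
]
best = sweep(histories, trips=[-0.03, -0.05], recovers=[0.0], min_ns=[5, 30])[0]
print(best)
```

On this toy input the top-ranked candidate is the −5% trip with a 30-trade minimum: it still catches the degradation but never blocks the healthy strategy.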
The breaker state needs to survive process restarts and be readable by humans (so you can see what's blocked and why). A flat JSON file is the simplest workable choice:
{
  "as_of": "2026-05-05T03:49:12Z",
  "as_of_unix": 1777952952,
  "sports": {
    "NBA": {"status": "active", "roi_30d": 0.034, "n_30d": 87, "tripped_at": null},
    "NHL": {"status": "blocked", "roi_30d": -0.071, "n_30d": 42, "tripped_at": "2026-05-03T03:49:08Z", "reason": "roi_below_threshold"},
    "MLB": {"status": "active", "roi_30d": 0.012, "n_30d": 134, "tripped_at": null},
    "NCAAMB": {"status": "active", "roi_30d": 0.058, "n_30d": 211, "tripped_at": null},
    "TENNIS": {"status": "active", "roi_30d": 0.021, "n_30d": 71, "tripped_at": null},
    "CS2": {"status": "monitor", "roi_30d": -0.018, "n_30d": 18, "tripped_at": null, "note": "below_min_n"}
  }
}
Three statuses: active (allow trades), blocked (reject trades), monitor (allow but flag — below min sample). The cron writes this file once per day; the bot reads it on every signal evaluation.
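Because the state is plain JSON, a status report is a few lines of code. The helper below is a hypothetical convenience, not part of the production breaker; it assumes the field names from the example file and lists every sport that is not plainly active.

```python
def summarize(state: dict) -> list:
    """One human-readable line per sport whose status is not 'active'."""
    lines = []
    for sport, s in sorted(state.get("sports", {}).items()):
        status = s.get("status", "active")
        if status == "active":
            continue
        roi = s.get("roi_30d", 0.0)
        n = s.get("n_30d", 0)
        why = s.get("reason") or s.get("note") or "no_reason_recorded"
        lines.append(f"{sport}: {status} (roi_30d={roi:+.1%}, n={n}, {why})")
    return lines


# Trimmed copy of the example state above.
example = {
    "sports": {
        "NBA": {"status": "active", "roi_30d": 0.034, "n_30d": 87},
        "NHL": {"status": "blocked", "roi_30d": -0.071, "n_30d": 42,
                "reason": "roi_below_threshold"},
        "CS2": {"status": "monitor", "roi_30d": -0.018, "n_30d": 18,
                "note": "below_min_n"},
    }
}
for line in summarize(example):
    print(line)
```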
The breaker has two halves: the daily evaluator (decides who's blocked, writes JSON) and the runtime check (reads JSON, returns allow/deny). Here's the runtime side — the hot path the bot calls before placing every order:
import json
import time
from pathlib import Path
from typing import Tuple

STATE_PATH = Path(__file__).resolve().parent / "sport_circuit_state.json"
STALE_AFTER_SECONDS = 60 * 60 * 30  # 30h: cron runs daily, allow some slack


class CircuitBreaker:
    """Reads JSON state, returns allow/deny per sport. Fail-open."""

    def __init__(self, state_path: Path = STATE_PATH):
        self.state_path = state_path
        self._cache = None
        self._cache_mtime = 0.0

    def _load(self) -> dict:
        """Return the parsed state, re-reading only when the file's mtime changes."""
        try:
            mtime = self.state_path.stat().st_mtime
            if self._cache is not None and mtime == self._cache_mtime:
                return self._cache
            with self.state_path.open() as f:
                state = json.load(f)
            self._cache, self._cache_mtime = state, mtime
            return state
        except (FileNotFoundError, json.JSONDecodeError, OSError):
            # Missing, unreadable, or corrupt file: report "no state" to the caller.
            return {}

    def check(self, sport: str) -> Tuple[bool, str]:
        """Return (allow, reason). Fail open on any error."""
        try:
            state = self._load()
            if not state:
                return True, "state_missing_fail_open"
            # Stale state == fail open: cron probably broke
            as_of = state.get("as_of_unix", 0)
            if time.time() - as_of > STALE_AFTER_SECONDS:
                return True, "state_stale_fail_open"
            sport_state = state.get("sports", {}).get(sport)
            if sport_state is None:
                return True, "sport_unknown_fail_open"
            status = sport_state.get("status", "active")
            if status == "blocked":
                return False, "circuit_breaker_blocked"
            # "active" and "monitor" both allow the trade; the status doubles as the reason.
            return True, status
        except Exception:
            return True, "exception_fail_open"
Notice the fail-open paths: missing file, stale state, missing sport, exception — all return True. The breaker can never accidentally block trading because of its own bug. The price you pay for this is that a broken breaker will silently let through trades you wanted to block. That's the right trade-off for a non-critical safety net; if you want hard guarantees, layer a second mechanism (kill-switch, manual review).
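One way to layer a second mechanism is to run several independent gates in sequence and let the first denial win, so a breaker that failed open is still backed by a guard that can say no. The wiring below is a sketch under that assumption; `first_denial` and the two stub gates are hypothetical names, not part of the production code.

```python
from typing import Callable, Iterable, Tuple

# A gate takes a sport and returns (allow, reason), like CircuitBreaker.check.
Gate = Callable[[str], Tuple[bool, str]]


def first_denial(gates: Iterable[Gate], sport: str) -> Tuple[bool, str]:
    """Run gates in order and stop at the first one that denies the trade."""
    for gate in gates:
        allow, reason = gate(sport)
        if not allow:
            return False, reason
    return True, "all_gates_passed"


# Stub gates for illustration only.
def breaker_gate(sport: str) -> Tuple[bool, str]:
    return True, "state_stale_fail_open"    # the breaker failed open


def kill_switch_gate(sport: str) -> Tuple[bool, str]:
    return False, "daily_loss_cap_hit"      # the independent guard still denies


print(first_denial([breaker_gate, kill_switch_gate], "NHL"))
# (False, 'daily_loss_cap_hit')
```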
The breaker should sit immediately before order submission, after all your signal logic. That way you keep generating signals (so your CLV log accumulates and you can later analyze what would have happened) but you don't trade them:
breaker = CircuitBreaker()


def evaluate_signal(candidate):
    # ... existing checks: edge, fair_prob, max_entry, etc ...
    if not basic_checks_pass(candidate):
        return None
    # Last gate before placing the order
    allow, reason = breaker.check(candidate.sport)
    if not allow:
        log_rejected_signal(candidate, reason=reason)
        return None
    return submit_order(candidate)
Logging rejected signals is important — it gives you the counterfactual. After a sport recovers, you can backtest: "what would my P&L have been if I'd kept trading?" If the breaker correctly avoided losses, the answer validates the design. If you would have made money, you tune the threshold.
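A counterfactual check over the rejected-signal log might look like the sketch below. It assumes binary contracts that settle at 100c on a win and 0c on a loss, ignores fees and slippage, and assumes each logged record carries `sport`, `entry_price_c`, `size`, and a `won` flag that is filled in once the market resolves. The `won` field and the function name are hypothetical.

```python
def counterfactual_pnl(rejected):
    """Dollar P&L the blocked signals would have produced, per sport."""
    totals = {}
    for sig in rejected:
        if sig.get("won") is None:       # market not resolved yet, skip
            continue
        entry = sig["entry_price_c"]
        # Per-contract result in cents: collect (100 - entry) on a win, lose the entry on a loss.
        per_contract_c = (100 - entry) if sig["won"] else -entry
        dollars = per_contract_c * sig["size"] / 100
        totals[sig["sport"]] = totals.get(sig["sport"], 0.0) + dollars
    return totals


# Invented rejected signals from a blocked period.
rejected = [
    {"sport": "NHL", "entry_price_c": 60, "size": 10, "won": False},
    {"sport": "NHL", "entry_price_c": 40, "size": 10, "won": True},
    {"sport": "NHL", "entry_price_c": 55, "size": 20, "won": False},
    {"sport": "NHL", "entry_price_c": 50, "size": 5, "won": None},
]
print(counterfactual_pnl(rejected))  # {'NHL': -11.0}
```

A negative total means the breaker avoided losses during the blocked period; a consistently positive one is the signal to revisit the thresholds.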
The other half of the breaker is the daily job that recomputes status. Read your trade log, slice the last 30 days, compute size-weighted ROI per sport, apply the trip/recovery rules, write the JSON.
import json
import time
import pandas as pd
from datetime import datetime, timezone, timedelta
from pathlib import Path

TRADES_PATH = Path("trades.jsonl")
# Relative path: must resolve to the same file the runtime check reads.
STATE_PATH = Path("sport_circuit_state.json")
ROI_TRIP = -0.05     # disable below -5% ROI
ROI_RECOVER = 0.0    # re-enable above 0% ROI
MIN_N = 30           # minimum trades in window
WINDOW_DAYS = 30


def load_trades() -> pd.DataFrame:
    """Read the JSONL trade log, skipping lines that fail to parse."""
    rows = []
    with TRADES_PATH.open() as f:
        for line in f:
            try:
                rows.append(json.loads(line))
            except json.JSONDecodeError:
                continue
    return pd.DataFrame(rows)


def evaluate_breaker():
    df = load_trades()
    df["ts"] = pd.to_datetime(df["ts"], utc=True, errors="coerce")
    df = df[df["resolved"].notna()]
    cutoff = datetime.now(timezone.utc) - timedelta(days=WINDOW_DAYS)
    # .copy() so the column assignments below don't write into a filtered view
    df = df[df["ts"] >= cutoff].copy()

    # Size-weighted ROI per sport
    df["dollar_pnl"] = df["pnl_c"] * df["size"] / 100
    df["dollar_size"] = df["entry_price_c"] * df["size"] / 100
    grouped = df.groupby("sport").agg(
        pnl=("dollar_pnl", "sum"),
        cost=("dollar_size", "sum"),
        n=("dollar_pnl", "count"),
    )
    grouped["roi_30d"] = grouped["pnl"] / grouped["cost"]

    # Read previous state for hysteresis
    try:
        with STATE_PATH.open() as f:
            prev = json.load(f).get("sports", {})
    except (FileNotFoundError, json.JSONDecodeError):
        prev = {}

    new_state = {}
    for sport, row in grouped.iterrows():
        prev_status = prev.get(sport, {}).get("status", "active")
        if row["n"] < MIN_N:
            # Too few trades to judge: allow trading but flag the sport
            new_state[sport] = {
                "status": "monitor", "roi_30d": float(row["roi_30d"]),
                "n_30d": int(row["n"]), "note": "below_min_n",
            }
            continue
        if prev_status == "blocked":
            # Hysteresis: only re-enable when above recovery threshold
            if row["roi_30d"] > ROI_RECOVER:
                status, reason = "active", "recovered"
            else:
                status, reason = "blocked", "still_below_threshold"
        else:
            if row["roi_30d"] < ROI_TRIP:
                status, reason = "blocked", "roi_below_threshold"
            else:
                status, reason = "active", "ok"
        new_state[sport] = {
            "status": status, "roi_30d": float(row["roi_30d"]),
            "n_30d": int(row["n"]), "reason": reason,
            # Stamp the time on a fresh trip; otherwise carry the previous value forward
            "tripped_at": (
                datetime.now(timezone.utc).isoformat()
                if status == "blocked" and prev_status != "blocked"
                else prev.get(sport, {}).get("tripped_at")
            ),
        }

    out = {
        "as_of": datetime.now(timezone.utc).isoformat(),
        "as_of_unix": int(time.time()),
        "sports": new_state,
    }
    # Write to a temp file, then rename, so the bot never reads a half-written state
    tmp = STATE_PATH.with_suffix(".tmp")
    with tmp.open("w") as f:
        json.dump(out, f, indent=2)
    tmp.replace(STATE_PATH)


if __name__ == "__main__":
    evaluate_breaker()
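A quick worked example shows why the evaluator weights ROI by size instead of averaging per-trade returns. The two trades are invented, and the reading of `pnl_c` as per-contract P&L in cents is an assumption inferred from the `pnl_c * size / 100` expression in the evaluator.

```python
# Two made-up trades at the same entry price but very different sizes.
trades = [
    {"entry_price_c": 50, "size": 100, "pnl_c": -10},  # big position, small loss per contract
    {"entry_price_c": 50, "size": 10, "pnl_c": 20},    # small position, larger win per contract
]

dollar_pnl = sum(t["pnl_c"] * t["size"] / 100 for t in trades)           # -10.00 + 2.00
dollar_cost = sum(t["entry_price_c"] * t["size"] / 100 for t in trades)  # 50.00 + 5.00

size_weighted_roi = dollar_pnl / dollar_cost
naive_mean_roi = sum(t["pnl_c"] / t["entry_price_c"] for t in trades) / len(trades)

print(f"size-weighted ROI: {size_weighted_roi:+.1%}")
print(f"naive mean ROI:    {naive_mean_roi:+.1%}")
```

The naive per-trade average comes out positive while the account actually lost $8 on $55 of exposure. The size-weighted figure tracks the dollars, which is what the breaker should react to.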
Schedule via cron, daily, after your data-aggregation jobs:
# 03:49 daily, 10 minutes after the CLV bucket aggregator
49 3 * * * cd /opt/yourapp && python3 sport_circuit_breaker.py >> /var/log/breaker.log 2>&1
Without this cron, the breaker is a no-op. The state file goes stale, the runtime check fails open on staleness, and your bot trades through any drawdown. We learned this the hard way after deploying the breaker code without the cron entry — everything looked fine until we noticed nothing was ever blocking, three weeks later. Always pair the runtime code with a cron-health alert.
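A cron-health check can reuse the `as_of_unix` field the evaluator already writes. The sketch below is one possible shape, not the production alerting: the function names, the 26-hour threshold, and the `alert` callback (a stand-in for whatever pager or chat hook you use) are all assumptions. The threshold is deliberately shorter than the runtime's 30-hour staleness cutoff so the alert fires before the breaker starts failing open.

```python
import json
import time
from pathlib import Path


def state_age_seconds(state_path: Path, now=None) -> float:
    """Seconds since the evaluator last wrote the state; inf if it is unreadable."""
    now = time.time() if now is None else now
    try:
        state = json.loads(state_path.read_text())
        return now - float(state["as_of_unix"])
    except (OSError, ValueError, KeyError, TypeError):
        # Missing file, invalid JSON, or a missing/invalid timestamp all count as "no state".
        return float("inf")


def check_cron_health(state_path: Path, max_age_seconds: float = 26 * 3600,
                      alert=print) -> bool:
    """Return True if the state is fresh; otherwise call `alert` and return False."""
    age = state_age_seconds(state_path)
    if age > max_age_seconds:
        alert(f"circuit breaker state is stale or unreadable: {state_path} (age={age:.0f}s)")
        return False
    return True
```

Run it from a scheduler that is independent of the evaluator's own cron entry, so a single broken crontab cannot silence both.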
Things we tried that didn't work:
Three tests we run in CI for the breaker:
import json
import time

# CircuitBreaker is the runtime class shown earlier; import it from wherever it lives in your project.
# tmp_path is pytest's built-in temporary-directory fixture.


def test_fail_open_on_missing_file(tmp_path):
    breaker = CircuitBreaker(state_path=tmp_path / "missing.json")
    allow, reason = breaker.check("NBA")
    assert allow is True
    assert "fail_open" in reason


def test_blocks_when_state_says_blocked(tmp_path):
    state = {
        "as_of_unix": int(time.time()),
        "sports": {"NBA": {"status": "blocked", "reason": "roi_below_threshold"}},
    }
    p = tmp_path / "s.json"
    p.write_text(json.dumps(state))
    breaker = CircuitBreaker(state_path=p)
    allow, reason = breaker.check("NBA")
    assert allow is False


def test_fail_open_on_stale_state(tmp_path):
    state = {
        "as_of_unix": int(time.time()) - 60 * 60 * 48,  # 48h old
        "sports": {"NBA": {"status": "blocked"}},
    }
    p = tmp_path / "s.json"
    p.write_text(json.dumps(state))
    breaker = CircuitBreaker(state_path=p)
    allow, reason = breaker.check("NBA")
    assert allow is True
    assert "stale" in reason
The stale-state test is the most important one to keep around. It's the one failure mode that's both common (cron fails silently for two days) and catastrophic if handled wrong (entire bot blocked on one bad cron run).
The full circuit breaker is roughly 200 lines of Python: 80 for the runtime check class, 100 for the cron evaluator, 20 for tests. It runs in production daily across 10 sports and has caught two material model degradations that would otherwise have eaten 5-7 days of P&L each before manual intervention.
The hardest part isn't writing it — it's choosing the parameters and committing to letting the breaker make decisions. Operators tend to override their own circuit breakers ("I think the strategy will recover, let me re-enable") and then watch the strategy keep losing. If you build the breaker, trust the breaker. Tune the thresholds offline; don't override them mid-drawdown.
ZenHodl's automated bots use this exact pattern across 10 sports. Per-sport breaker state, hysteresis, daily cron, fail-open semantics. The whole stack is part of the platform.