Model updates, new sports, API changes, and incidents. We log what shipped and what broke.
What changed. A focused operational pass: closed every remaining null-feature gap in the CLV meta-model shadow log, opened MLB trade volume by ~30%, and cleaned up disk/log hygiene that was slowly silting up.
elo_a/elo_b, size, score_diff, period, is_score_change for moneyline_wp (MLB/NHL/NBA), tennis_wp (ATP/WTA), and CS2: all five fields now populate per row with real values, not null.
match.team1_elo never existed; the real lookup is self.hltv_fetcher.get_team_elo(team)). Side-aware indexing now matches the cs2/moneyline convention so elo_a is always the bet side.
signal_engine.py was tagged with bot="api_signal_engine"; the shadow logger now skips API-server callers so they don't win the 60s dedup race with all-null rows.
max_entry_price_mlb_c: 78c → 82c (the 78-82c band is historically 85% WR / +7% ROI / +2.3c CLV on n=13).
min_entry_price_mlb_c: 25c → 22c (trade small PnL for more CLV training data).
min_period_nba: 3 → 4 (Q4 only). Last 8 NBA trades were all Q3 trades, all losses (0% WR, -$9.13, -100% ROI). Q4 still retains the 68.6% WR signal from the original 2026-04-13 diagnostic.
MemoryMax: 512M → 1.5G (was OOM-killed 3× overnight; now stable at ~350MB usage).
rotate_gate_log.sh rewrite: mv → cp + truncate. Old approach preserved hardlinks from deploy_code_only.sh's cp -al staging step, causing gzip to refuse compressing the rotated archive (1.1GB landed uncompressed). New approach breaks hardlinks at copy time. Deploy script now also sweeps staging dirs >1d old on every run.
/etc/logrotate.d/fairprob installed: daily rotation with copytruncate (preserves bots' open file descriptors), gzip after 1 day, keep 14 days. Pre-fix, unified_bot.log was 692MB and growing ~50MB/day.
tennis_wp.log, bot.log, lol_wp.log, cs2_wp.log): 725MB recovered.
Unclosed client session noise (1,268 warnings/24h) silenced via warnings.filterwarnings. Root cause was a cross-event-loop session reuse in espn_scores.py; added explicit loop-mismatch check that abandons stale sessions cleanly.
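The hardlink issue behind the rotate_gate_log.sh rewrite can be reproduced in a few lines. This is a Python sketch rather than the shell script itself; the function names and file names are hypothetical, and the second link only stands in for what a cp -al staging copy would leave behind. It shows that a rename keeps the shared inode (link count 2), while copy + truncate produces an archive with link count 1 and leaves the live file in place for open writers.

```python
import os
import shutil
import tempfile


def rotate_copy_truncate(live_path: str, archive_path: str) -> None:
    """Copy the live log to a fresh file, then empty the live log in place."""
    shutil.copyfile(live_path, archive_path)   # new inode, so link count is 1
    with open(live_path, "r+b") as fh:         # same inode stays open for writers
        fh.truncate(0)


def demo() -> tuple[int, int, int]:
    """Return (links after rename, links after copy+truncate, live size)."""
    with tempfile.TemporaryDirectory() as tmp:
        live = os.path.join(tmp, "gate.log")
        staged = os.path.join(tmp, "staged_gate.log")  # stand-in for a cp -al copy
        with open(live, "w") as fh:
            fh.write("line\n" * 3)
        os.link(live, staged)

        # Old approach: rename. The archive is still the hardlinked inode.
        moved = os.path.join(tmp, "moved.log")
        os.rename(live, moved)
        links_after_move = os.stat(moved).st_nlink
        os.rename(moved, live)                 # put it back for the second run

        # New approach: copy + truncate. The archive is an independent file.
        archive = os.path.join(tmp, "archive.log")
        rotate_copy_truncate(live, archive)
        return links_after_move, os.stat(archive).st_nlink, os.path.getsize(live)
```

One known trade-off of any copy-then-truncate scheme: lines written between the copy and the truncate can be lost, which is usually acceptable for diagnostic logs.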
Model retraining before fall seasons start. CFB / NCAAWB / NFL WP models fail predict on current sklearn (1.7.2) — pickles were saved on an older version where LogisticRegression.multi_class was an instance attribute. The bot's smoke test catches this and skips those sports, so no live impact (NFL/CFB/NCAAWB are all out of season). Retrain before September.
What changed. Pricing page got a structural overhaul to fix the funnel-too-narrow + middle-tier-confusion problems.
bundle_pro (10K req + dashboard) to api_pro (100K req + dashboard, same $149). Two SKUs at $149 with different request caps was a buyer-confusion trap. bundle_pro stays in the catalog for the legacy course-bundle webhook.
STRIPE_PRICE_API_ENTERPRISE is populated via setup_stripe.py, the CTA auto-flips from mailto to /billing/start/api_enterprise. No code redeploy required.
api/subscription_plans.py: added developer_free, api_starter_annual, api_pro_annual, api_enterprise_annual; rebuilt PRICING_PAGE_SECTIONS; added interval arg to build_pricing_sections.
api/auth.py: Starter monthly_cap 10K → 30K.
api/templates/pricing.html: 4-card grid, Free card with dashed border, Monthly/Annual toggle, signup CTA support, "Save 17%" hint on monthly cards.
api/app.py: /pricing?interval=year + /signup?plan=developer_free param wiring.
email_utils.py, homepage.html, docs.html, products.html, build_vs_buy.html, blog comparison post, Arabic localization.
Annual SKUs need Stripe prices created before they're sellable:
python3 -m api.setup_stripe --key sk_live_... --write-env reads the new plans from the catalog, creates the Stripe price objects, and writes the IDs back to api/.env. After that, annual self-serve works and Enterprise CTA auto-flips.
What shipped.
A binary CLV-prediction meta-model trained on 1,058 measured trades (XGBoost + isotonic calibration, AUC 0.616 on chronological holdout) running in shadow mode alongside every signal. Decisions logged to clv_gate_log.jsonl with mode tag meta_v0_shadow. Operator can promote individual sports to enforce mode via meta_model_config.json (60s hot-reload, no redeploy). MLB shows highest per-sport AUC (0.769); ATP/CS2/LOL show no signal yet so excluded from sports_with_signal.
feature_snapshot_logger systemd service captures pre-trade orderbook depth, rolling mid drift / volatility (1m/5m/15m) on 800+ subscribed Polymarket tokens.
_handle_price_change drop; now capturing ~7K prints/90s.
bot_self_state.py exposes per-sport recent PnL/CLV/slippage (5min cache from trades.jsonl).
.backups/gate_log_archive/).
retrain_with_backup.sh: auto-rollback if new pkl drops below 0.54 AUC.
validate_meta_shadow.py) joins shadow decisions to realized trade outcomes for counterfactual PnL measurement.
Walk-forward 4-fold CV gives mean AUC 0.553 (fold std 0.042), barely above the 0.55 gate. The model is a research instrument in shadow mode, not yet trusted to block trades. First validation verdict expected after Monday cron (06:15 UTC) joins ~30 settled trades. Per-sport submodel for MLB (n=204) didn't yet pass gate (CV 0.559, high fold variance 0.111).
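The auto-rollback rule in retrain_with_backup.sh reduces to a single threshold comparison. A minimal sketch in Python, assuming the backup pickle is written before retraining starts; the function names and verdict strings are hypothetical, and only the 0.54 AUC floor comes from the entry above.

```python
import shutil
from pathlib import Path


def retrain_verdict(new_auc: float, min_auc: float = 0.54) -> str:
    """Decide whether a freshly trained pickle may replace the live one."""
    return "keep_new" if new_auc >= min_auc else "rollback"


def apply_verdict(verdict: str, live_pkl: Path, backup_pkl: Path) -> None:
    """Restore the pre-retrain backup when the verdict is 'rollback'."""
    if verdict == "rollback":
        shutil.copyfile(backup_pkl, live_pkl)
```

Keeping the decision separate from the file operation makes the threshold easy to unit test without touching any model files.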
Public whitepaper. New page at /clv-evidence publishing the empirical 78-percentage-point CLV gap, Wilson 95% CIs, two-proportion Z-test (z = 24.27, p ≈ 10⁻¹³⁰), per-sport breakdown, selection-bias analysis, and a public reproducibility script. Snapshot pinned 2026-05-08 (n=950 in gap), passed five Codex CLI review passes. Raw data + verifier published at /api/trades.jsonl and /api/verify_clv_gap.py.
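For readers who want to sanity-check the statistics without fetching the verifier, the two tests named above are standard and short. This sketch is not the published /api/verify_clv_gap.py; the function names are ours, it uses a pooled standard error for the Z-test, and any counts you pass in are your own inputs rather than the snapshot's figures.

```python
import math


def wilson_interval(wins: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """Z statistic for H0: p1 == p2, using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

The Wilson interval is preferred over the plain normal approximation here because it stays inside [0, 1] and behaves sensibly for small per-sport samples.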
AttributeError: '_cal_table' firing every tick (~19K errors in log) — defensive init in SoccerWPModel.__init__ v12.81 patch.
Relabeled 732 bot=unknown_backfill trades (March 31 one-shot import) to mode=imported_legacy so open-position counts no longer inflate. 38 zombie trades (resolved=True, won=None, no game_id) confirmed unrecoverable from Polymarket CLOB (0 prints across 178 NCAA tokens).
Drift detection. Live trade ECE diverged from baseline buffer ECE on ATP (+18pp), CS2 (+14pp), LOL (+31pp), WTA (+14pp). Root-cause: adverse selection at the trade-firing layer, not a broken model — the recalibration buffer (n=5,550 WTA samples) showed model itself was within ±1pp calibration tolerance. The bot's selection logic was concentrating on the worst-calibrated snapshots.
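ECE (expected calibration error) is the metric compared between the live trades and the baseline buffer. A minimal sketch of the usual definition, assuming equal-width probability bins; the changelog does not say how the production tools bin, so the bin count and function name here are illustrative.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Sample-weighted mean of |avg predicted prob - realized win rate| per bin."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)   # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_p = sum(p for p, _ in bucket) / len(bucket)
        hit_rate = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / len(probs) * abs(mean_p - hit_rate)
    return ece
```

Running it once on the recalibration buffer and once on the bot-selected trades gives the two numbers whose difference is reported in percentage points above.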
LIVE_RECALIBRATOR_DISABLED=1 from .env (was set May 6, paused all auto-corrections).
New gate that blocks trades in cohorts where the model has been historically overconfident by >15pp on n≥20 trades. Independent of the existing edge-gate (Filter A) and CLV-gate (Filter B), so each can run at different enforcement levels. Bots wire after the existing edge check.
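The cohort rule for the new gate (overconfident by more than 15pp on at least 20 trades) can be sketched as a pure function. The CohortStats type and the blocks_trade name are hypothetical; how cohorts are defined and where the stats come from are not specified in this entry, so they are left to the caller.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortStats:
    n: int                  # resolved trades in the cohort
    mean_predicted: float   # average model win probability, 0..1
    win_rate: float         # realized win rate, 0..1


def blocks_trade(stats: CohortStats, max_gap_pp: float = 15.0, min_n: int = 20) -> bool:
    """Block only when the overconfidence gap is large AND the sample is big enough."""
    if stats.n < min_n:
        return False        # too thin to act on
    gap_pp = (stats.mean_predicted - stats.win_rate) * 100.0
    return gap_pp > max_gap_pp
```

Note the gap is signed: a cohort where the model is underconfident (win rate above prediction) never blocks.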
auto_pause_low_skill_sports.py detects sports where CLV-PROXY (directional skill metric) drops below 50% on n≥30 matches and writes a 14-day pause directive. A model with negative directional skill cannot be filtered into profitability — only paused. Auto-resumes when metric recovers.
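The pause rule itself is small. A sketch under stated assumptions: the directive is shown as a plain dict with hypothetical field names, and the real script's file format and resume logic are not reproduced here.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def pause_directive(sport: str, skill_pct: float, n_matches: int,
                    now: datetime) -> Optional[dict]:
    """Return a 14-day pause directive when directional skill is below 50% on n>=30."""
    if n_matches < 30 or skill_pct >= 50.0:
        return None
    return {
        "sport": sport,
        "reason": f"CLV-PROXY {skill_pct:.1f}% on n={n_matches}",
        "paused_until": (now + timedelta(days=14)).isoformat(),
    }
```

A caller would pass a timezone-aware timestamp, for example datetime.now(timezone.utc), so the expiry is unambiguous across hosts.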
Context.
v12.22-v12.25 progressively tightened then floodgated the entry filters (min_edge=0, min_fair=0, score-change-only). Live volume since the May 1 deploy was 0 fires — the max_edge=18c cap was catching qualifying signals by less than 1c (e.g. WTA Golubic at 18.7c). Tennis "between-sets" skip filter blocked the only other matches that had score-change events.
max_edge_c 18 → 25 (NBA already at 25, NHL/MLB/CS2/ATP/WTA/LOL all loosened).
skip_between_sets True → False. Existing comment claimed -1.5c/49.4% WR vs +25c/79.6% WR but no preserved script; treating as unverified.
require_score_change=True) and floodgate floors (min_edge=0, min_fair=0) from v12.24-v12.25 are unchanged.
Observe 7-14 days. Revert skip_between_sets to True if between-sets fires show WR < 45% or P&L < −$5. Tighten max_edge back to 18 if the [20-25c] edge band shows the historical falling-knife signature (WR < 40%).
Why this exists. v12.18 (4/30) opened max_edge to 25-30c per sport to test whether W17's bleed was a regime shift. Five days of live data (98 trades, 34.8% WR, −$61.79) plus a per-sport audit of the W14 (+$76) and W15 (+$11) profitable weeks confirmed it was not a regime shift — it was the gates.
ATP/WTA/MLB/NHL/LOL: max_edge_c 30 → 18.
WTA: min_fair_prob_c 65 → 70 (back to v12.7).
LOL: min_fair_prob_c 60 → 70 (block structural [60-70%] loser).
CS2: already 18 / 70 from v12.21.
NBA: stays suspended (min_edge_nba=100).
Not a "fix". The model is the same. The selection logic is the same. We're just blocking the [20-30c] falling-knife band that universally underperformed across sports (v12.20). Volume target reverts to W14/W15 levels: 200-900 trades/wk depending on sport availability.
Promote and call it stable if 7-day total: WR ≥ 55%, P&L ≥ +$10. Re-evaluate (don't loosen further) if WR < 50% AND P&L < −$15. The mistake was opening gates too aggressively in v12.18; we won't double down by loosening again on the first bad week.
What shipped.
Re-trained CS2 grid model (v6) with three new opening-duel features: fk_diff, fk_rate_diff, last5_fk_diff. Re-fetched 3,747 series from GRID with the firstKill field included (25% of training maps now carry the new signal). CS2 trading is re-enabled on the VPS with v12.21 tight gates.
v6 holdout Brier: 0.1595 — statistically tied with v5's 0.1592. firstKill features rank 16/17/18 in importance (combined 2.1%). The signal exists but correlates strongly with kill_diff (already 25% importance), so XGBoost extracts almost nothing independently. Calibration on holdout is excellent (max bucket error 1.7pp, ECE ≈ 0.8%) — but it was already excellent in v5.
Two reasons: (1) v6 is statistically equivalent — deploying it costs nothing and starts the firstKill signal flowing in case it matters in live distributions we can't see in holdout; (2) the real CS2 problem is selection bias, not model quality, and we needed to re-enable trading to test selection-side fixes.
max_edge_c: 35 → 18. Universal cross-sport pattern (v12.20): the [20-25c] edge band is a falling-knife — CS2 ran 31.4% WR there. 18c keeps the [15-20c] high-confidence band where v3 sign-agree was 86% but excludes the noise zone.
Other gates unchanged: min_fair_prob_c=70, entry [35,50]c, slippage 6c, BO3+ only, T1/T2 tournaments only.
Promote if: 30+ trades, WR ≥ 50%, total P&L ≥ −$5. Revert if: any 14d window with P&L < −$15 or WR < 35%. The model is the same; we are evaluating gate logic.
Why this exists. After today's gate-tuning loop, we did a structured "why do we lose?" analysis across all 1,368 resolved trades. Five strong patterns emerged. Documenting them here so the patterns survive the session and inform future tuning.
Score-change-triggered fills (75 trades): 66.7% WR, +$3.59. Polling-triggered fills (1,286 trades): 48.1% WR, −$34.96. 95% of losses are concentrated in polling fills. Score-change fills create new information windows where the model has time to re-evaluate; polling fills happen during market drift where the model is stale and the market is right.
[50-60%] band: 34.7% WR — the "death zone" of overconfident coinflips. [70-80%] band: 60.0% WR — sweet spot. [90%+] band: 76.0% WR — calibrated favorites. Pattern repeats per-sport: ATP, WTA, LOL, CS2 all bleed at [50-60%] and shine at [70-80%].
Universal across sports. ATP: 30.8% WR. NBA: 20.0% WR. CS2: 31.4% WR. LOL: 15.4% WR. Even MLB (where the model is best calibrated) shows weakness here. Above 25c the bot tends to win in some sports because the edge implies the model is responding to a real opportunity rather than noise — but this is sport-dependent.
Across every sport with enough CLV data, the gap between winners' CLV and losers' CLV is +45c to +65c. Winners average +25c CLV; losers average −28c CLV. Losses are systematic adverse-selection events: when the market re-prices against us after entry, we lose. This is consistent with the falling-knife thesis — the bot loses to informed flow.
Best hours: 03h UTC (overnight, 75% WR), 23h UTC (74.5% WR). Worst: 22h UTC (30.4% WR), 15h UTC (38.5% WR), 14h UTC (46.8% WR with 698 trades, −$18.87). Peak market hours (US morning to early afternoon) have the worst calibration. Likely explanation: more market makers and informed flow during US business hours.
Each sport tells a different story:
The bot wins when it reacts to genuine information events (score changes), in confidence ranges where the model is well-calibrated ([70-80%], [90%+]), with fill slippage between 0.5c and 2c, outside peak market hours. The bot loses when polling-triggered fills happen during market drift, with model output in the [50-60%] coinflip zone, often with edges in the [20-25c] band where the market is actively repricing. Losses are universally adverse-selection events: the CLV gap is +45-65c between winners and losers across every sport.
Why this is documented but not shipped. Today's session over-shipped on gate tuning based on insufficient per-bucket samples. The patterns above are real and replicable, but the right move is to OBSERVE v12.18+v12.19 for 1-2 weeks, accumulate clean data with the new gates and the orphan bugs fixed, then revisit. Premature filter rules anchored to thin samples are exactly what created today's circles.
Why this exists. Today's session shipped 8+ gate tightenings (v12.6 through v12.17), each individually justified by the recent week's calibration data. But the recent week (W17, Apr 21-27) was the bot's WORST week ever (38% WR, −$17.62). The gates were over-fit to W17. During W15 (Apr 7-13) the bot ran with much wider bands and was 60% WR / +$6.30. Either W15 was lucky or W17 was a regime shift. We can't know without fresh trades under wider gates.
What's running now.
Why we can run this experiment safely now. Today's bug fixes (v12.14/15) and discipline work give us safety nets that didn't exist this morning:
trades.jsonl reliably (no more orphans)
max_positions_per_match=1 safety is now actually enforced (was disabled by the same TypeError bug that orphaned the LOL trades)
What we're explicitly testing. Before today's session, the bot bled hard for 2+ weeks. We attributed that to gate-level miscalibration. But (a) Elo was 3 weeks stale and only refreshed today, (b) several model retrains shipped today (v12.9/12/13), (c) the orphan bugs were silently disabling the per-match safety. So the recent bleed has THREE possible causes: bad gates, stale models, or broken safeties. Today's other ships fixed (b) and (c). v12.18 tests whether that's enough WITHOUT the aggressive gates.
Decision criteria for next 7-14 days.
Honest framing. This is an experiment, not a confidence move. The walk-forward gate REJECTED MLB at 65 today (held-out P&L −$1.72 vs zero-filter −$1.07). The gate's data is W17-anchored, so it might be wrong about W15-band volumes. We accept up to ~$10-20 of expected loss over the next 14 days as the price of finding out empirically. If wrong, revert to v12.17. If right, the bot is back at its W15 productivity with today's calibration improvements baked in.
What this targets. Today's calibration deep-dive surfaced a near-universal pattern: ECE balloons when |edge_c| crosses ~20c. The bot detects "the market is wrong" but the market is right and pricing in within-game state the model doesn't see (the GIANTX iTero falling-knife earlier today, ATP [20-25c] band at 32% ECE, NBA [20-25c] at 61pp gap, etc). The fix is per-sport edge caps calibrated to the data.
Per-sport changes (all walk-forward validated).
Why the analysis chain led here. Edge-stratified ECE on 668 resolved trades showed:
Volume impact. Tightening from 25 to 15-22 cuts an estimated 15-30% of trades per sport. The cut volume is the worst-calibrated subset — the trades where the bot was bleeding hardest. Held-out P&L improved on every sport that passed the gate; the regress-on-ship gate auto-rejected MLB at 15 (kept at 20) and WTA at 15 (kept at 20).
Combined with v12.15. v12.15 restored the per-match position cap that was disabled by the LoLPosition TypeError. v12.16 caps the per-trade edge size. Together these are the two most important LOL-bleed mitigations: the bot can no longer fire 5x on the same match, AND the size of any individual fire is capped where the model has been most wrong.
What broke.
The LOL bot's _execute_entry() function constructs a LoLPosition dataclass passing fair_prob_c_raw=fair_c_raw as a kwarg. But LoLPosition has no such field, and fair_c_raw wasn't even in _execute_entry's scope (it's defined in the caller's scope). Every successful FOK fill since this code was added would raise TypeError immediately after the order returned, BEFORE: (1) the FILLED log line, (2) the log_trade() call to trades.jsonl, AND (3) the _entries_per_match[match_id] += 1 increment.
Symptom that surfaced it.
User noticed a Polymarket position of 36+ shares on GIANTX iTero vs UCAM Esports Club with cost ~$15.89, but trades.jsonl had ZERO records for this match. Five separate fills happened today (15:57, 16:21, 16:26, 16:31, 16:36) at declining prices (58c → 43c → 41c → 40c → 42c). Verified against Polymarket's data-api: every fill matches a bot POST /order "200 OK" within 7-12 seconds. Transaction hashes confirmed on-chain.
Why the bot fired 5 times on the same match.
max_positions_per_match=1 should have blocked fires 2-5. But because the TypeError happened BEFORE the entries-counter increment, the counter stayed at 0 forever. Each iteration passed the match_entries < 1 check and fired again. Only the per-signal cooldown (~5 min) throttled the cadence. The same bug both orphaned the trades AND disabled the per-match exposure safety.
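The failure mode is easy to reproduce in isolation. The sketch below uses stand-in names (Position for LoLPosition, execute_entry for _execute_entry, try_fire for the calling loop) and swallows the exception the way the surrounding handler effectively did; it is an illustration of the ordering bug, not the bot's code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Position:              # stand-in for LoLPosition before the fix
    match_id: str
    entry_price_c: float     # note: no fair_prob_c_raw field


entries_per_match = defaultdict(int)
trade_log = []


def execute_entry(match_id: str, price_c: float) -> None:
    # The order is assumed to have filled already at this point.
    pos = Position(match_id=match_id, entry_price_c=price_c,
                   fair_prob_c_raw=64.4)      # unknown kwarg raises TypeError
    trade_log.append(pos)                     # never reached
    entries_per_match[match_id] += 1          # never reached


def try_fire(match_id: str, price_c: float, max_positions_per_match: int = 1) -> bool:
    if entries_per_match[match_id] >= max_positions_per_match:
        return False                          # the safety that should have tripped
    try:
        execute_entry(match_id, price_c)
    except TypeError:
        pass                                  # fill stands, bookkeeping is lost
    return True


# Five attempts at the fill prices from the incident: all five get through.
fired = [try_fire("giantx-ucam", p) for p in (58, 43, 41, 40, 42)]
```

After the run, fired is five True values, trade_log is empty, and the per-match counter is still 0, which matches both symptoms in the incident: orphaned fills and a disabled exposure cap.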
What shipped.
fair_prob_c_raw: float = 0.0 field to LoLPosition dataclass (line 715)
fair_c_raw as parameter to _execute_entry() with default 0.0
fair_c_raw=fair_c_raw through to the helper
getattr(self.tracker, "_score_change_ts", dict()) on the score-change timestamp lookup (same pattern as v12.14 tennis fix)
Reconstruction.
All 5 GIANTX iTero orphans recovered from Polymarket's data-api (using the wallet's POLY_FUNDER address) and appended to trades.jsonl with reconstructed=true and the on-chain tx_hash for audit. Combined with the 5 tennis orphans recovered in v12.14, today's session has surfaced and recovered ~$27 in untracked but real fills.
Deeper finding (NOT fixed today): falling-knife on stale model state. The LOL model's fair_prob comes from series state (e.g. "1-0 in BO3") + Elo. It does not see within-game state (gold/kills/towers/dragons in the active game). When game 2 was going hard against GIANTX, the market priced GIANTX iTero from 58c down to 40c. The model didn't update — it still saw "1-0 series" and reported 64.4c fair. The bot computed widening edge (64.4 − 40 = 24.4c) and bought 4 more times. Even with v12.15 restoring the per-match safety, this divergence pattern is a structural model limitation. Future work: feed within-game state features (or a price-velocity guard that pauses buying when ask drops faster than X cents/minute).
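One possible shape for the price-velocity guard mentioned as future work. Nothing like this has shipped; the class name, the 5-minute window, and the threshold are all placeholders (the entry above leaves X unspecified), so treat it as a starting point for discussion.

```python
from collections import deque


class PriceVelocityGuard:
    """Pause buying while the ask is falling faster than a set rate."""

    def __init__(self, max_drop_c_per_min: float, window_s: float = 300.0):
        self.max_drop = max_drop_c_per_min
        self.window_s = window_s
        self._asks = deque()                  # (timestamp_s, ask_c), oldest first

    def observe(self, ts_s: float, ask_c: float) -> None:
        self._asks.append((ts_s, ask_c))
        # Drop observations that have aged out of the lookback window.
        while self._asks and ts_s - self._asks[0][0] > self.window_s:
            self._asks.popleft()

    def allow_buy(self) -> bool:
        if len(self._asks) < 2:
            return True                       # not enough history to judge
        (t0, a0), (t1, a1) = self._asks[0], self._asks[-1]
        minutes = (t1 - t0) / 60.0
        if minutes <= 0:
            return True
        drop_per_min = (a0 - a1) / minutes    # positive when the ask is falling
        return drop_per_min <= self.max_drop
```

The guard only compares the oldest and newest ask in the window, so a sharp drop followed by a full recovery inside the window would not block; a stricter variant could track the window maximum instead.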
Trust note.
Today's session also surfaced the existence of a Mac-side bot infrastructure (~/Library/LaunchAgents/net.zenhodl.bot.plist) running unified_bot.py --mode live without the --disable cs2,soccer flag the VPS uses. The Mac bot's order_version_mismatch errors during VPS bot activity were the smoking gun that two parties were hitting the same wallet. After tracing every fill via Polymarket data-api timestamps, every share is accounted for as bot activity (no unauthorized access). Recommend reviewing the Mac LaunchAgent configuration to align with the VPS execution model so we don't have two bots competing for the same wallet's order-counter.
What broke.
On 2026-04-28 a feature added is_score_change tracking to tennis trades. The new code accessed self._score_change_ts on the runner, but that attribute is owned by the TennisGameTracker (the polling object), not the runner. Every successful tennis fill since then raised AttributeError inside the trade-log block. The exception was caught (the FOK order had already filled, so the position existed on Polymarket), but the trade record never got written to trades.jsonl.
Symptom that surfaced it.
The user noticed zero recorded trades for 2026-04-29 and 2026-04-30 despite many EVAL events firing. Investigation found 5 orphan FILL log lines with no corresponding entry in trades.jsonl:
Total $11.14 of real fills that the bot's books didn't see. Settlement, P&L, and CLV were all going to miss them.
What shipped.
One-line fix: getattr(self.tracker, "_score_change_ts", {}).get(match_id, 0). Defensive lookup on the correct object so even if tracker hasn't been initialized yet, the trade-log block doesn't fail. Verified working on the 17:19 Haddad Maia fill that immediately followed the deploy.
Reconstruction.
All 5 orphan trades were reconstructed from log lines and appended to trades.jsonl with reconstructed=true and reconstructed_reason tagging so reconciliation tools can identify them. Token_id, entry_price, fair_prob, and edge_c came directly from the log; game_id for Sramkova was a placeholder since it wasn't in scope at FILL time.
Lesson.
The bot's "fill still valid" exception handler was correct (don't lose track of an already-filled order just because logging fails) but the silent logger.error didn't raise an alert anywhere. Next reliability work: surface these errors via the same alerter that fires on calibration drift, so a future "fill but no record" pattern triggers an immediate operator notification rather than waiting for the user to spot the trade gap.
Side findings during this investigation.
sport=ATP in recent records. The earlier 4/27 trade tagged her correctly as WTA. Bug to investigate: likely affects which gate the v12.7 floor applies (ATP min=60 vs WTA min=70). The 69.2c fair_prob on Haddad Maia would have been blocked under WTA but passed as ATP.
fair_prob_c=0, edge_c=0: the prediction pipeline returned no model output, but the trades fired anyway. All 3 lost; total impact −$22.20. Probably an artifact of bot startup before the prediction pipeline was fully wired.
What shipped.
Three new features added to the WTA model: rank_diff (p1_rank − p2_rank), rank_points_log_diff (log of WTA points), and age_diff (years). These come from the Sackmann row directly so they're already AS-OF the match. Final per-player snapshots are stored in the pickle for inference; defaults (rank=300, points=500, age=25) are used for unknown players.
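A sketch of how the three features could be derived from two player snapshots, with the stated defaults filling in unknown players. Two assumptions are ours: rank_points_log_diff is taken as log(p1 points) minus log(p2 points), and the dict keys are illustrative rather than the training script's column names.

```python
import math

# Defaults for unknown players, as stated in the entry above.
DEFAULTS = {"rank": 300, "points": 500, "age": 25.0}


def rank_features(p1: dict, p2: dict) -> dict:
    """Build the rank/points/age difference features for one match."""
    a = {**DEFAULTS, **p1}      # missing keys fall back to the defaults
    b = {**DEFAULTS, **p2}
    return {
        "rank_diff": a["rank"] - b["rank"],
        # Assumed form: difference of logs; points must be positive.
        "rank_points_log_diff": math.log(a["points"]) - math.log(b["points"]),
        "age_diff": a["age"] - b["age"],
    }
```

With both players unknown, every feature is 0, so the model falls back to its other inputs rather than seeing a spurious ranking gap.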
WTA result.
ATP rejection. Same features tested on ATP REGRESSED: Brier 0.1453→0.1474, ECE 1.98%→2.45%, max-bucket 6.75pp→12.02pp. ATP's Elo already correlates strongly with rank (men's tour has more match volume per player, so Elo converges to a stable estimate that captures the same signal as rank). Adding rank features creates redundancy that the model overfits, concentrating predictions in narrow bands. ATP keeps its v12.9 schema.
Cumulative WTA today. Pre-v12.9 baseline Brier 0.1590, ECE 3.27%. After v12.12 + v12.13: Brier 0.1563 (−0.0027 total), ECE 2.31% (−0.96pp). Same magnitude as ATP got in v12.9 alone — achieved via two compounding changes (overall serve_pct + rank features) instead of one.
Staleness caveat. Rank/age are snapshotted at training time (last match in 2024 data). At inference today, they're 16+ months stale. For top-10 players this matters less (rank churns slowly there); for journeymen it matters more. Refresh path: every retrain regenerates these snapshots. A future version could pull from the live ATP/WTA rankings JSON for fresher values, but for the current Brier lift, the snapshot is sufficient.
Pattern this confirms.
Today's session keeps surfacing the same lesson: ATP and WTA need different feature subsets. The TOUR_FEATURE_COLS dict introduced in v12.12 made this clean — ATP excludes rank features, WTA excludes surface-specific serve features. Both tours train from the same pipeline, with the model schema declared per-tour and stored in each pickle.
Why this shipped.
v12.9 deferred WTA because the [20-30%] bucket regressed from −6.8pp baseline to −12.1pp candidate. Today's hypothesis: the regression came from p1/p2_serve_pct_surf (20-match per-(player, surface) deque), not the overall serve features. WTA has materially smaller per-surface samples than ATP — women's tour shifts more between surfaces, so the deque mean is noisier.
Result.
Implementation.
Added TOUR_FEATURE_COLS dict to train_tennis_wp.py so each tour can declare its own feature subset. ATP keeps all 4 serve features (overall + surface-specific). WTA drops the 2 surface-specific ones. The pickle stores per-tour feature_cols, so inference reads the right schema automatically.
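Roughly, the per-tour schema looks like the sketch below. The four serve feature names are the ones introduced in v12.9; BASE_COLS and the feature_row helper are placeholders, since the rest of the real column list is not given in this changelog.

```python
# Placeholder for the shared pre-match features (Elo etc.); hypothetical name.
BASE_COLS = ["elo_diff"]

SERVE_OVERALL = ["p1_serve_pct_overall", "p2_serve_pct_overall"]
SERVE_SURF = ["p1_serve_pct_surf", "p2_serve_pct_surf"]

TOUR_FEATURE_COLS = {
    "ATP": BASE_COLS + SERVE_OVERALL + SERVE_SURF,   # keeps all 4 serve features
    "WTA": BASE_COLS + SERVE_OVERALL,                # drops the 2 surface-specific ones
}


def feature_row(tour: str, features: dict) -> list[float]:
    """Order a feature dict into the column layout declared for this tour."""
    return [features[col] for col in TOUR_FEATURE_COLS[tour]]
```

Because the ordered list is what gets stored alongside the model, inference can rebuild the exact input vector without knowing which experiment produced the pickle.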
Lesson. Today's session is a small clinic on feature-engineering discipline: serve_pct_overall and serve_pct_surf carry related but not interchangeable signal. ATP has enough surface-specific volume that the 20-match deque is fine. WTA doesn't, so the deque is noisy — and that noise concentrated in the worst possible bucket [20-30%]. Dropping just the noisy variant kept the entire signal lift while removing the regression.
What was tried.
Added p1_last_match_ret / p2_last_match_ret binary features to ATP pre-match XGB. The flag is set to 1 when a player retired (RET) in their immediately-previous match, otherwise 0. Hypothesis: a player who just retired is likely injured/under-recovered and should win less than Elo predicts in their next match.
Result on holdout 2024.
Why it failed. ATP RET rate is only 2.39% of matches in 2020-2024. With a binary flag, the feature is 0 in >95% of training rows; XGBoost couldn't extract reliable signal from such sparse positives. The hypothesis may still be correct, but a binary indicator carrying nearly all 0s isn't the right encoding. A weighted/decayed retirement history (e.g. exponentially-weighted RET count over last 5 matches, normalized by match minutes when available) would carry more density.
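For the next attempt, the decayed encoding suggested above might look like this. It is untested against holdout, the decay of 0.5 is an arbitrary starting value, and the match-minutes normalization is left out because that field is not always available.

```python
def ew_retirement_score(recent_ret_flags: list[int], decay: float = 0.5) -> float:
    """Exponentially weighted retirement share over the last 5 matches.

    recent_ret_flags is ordered most-recent-first; 1 means the match ended in RET.
    Returns a value in [0, 1]; players with no history score 0.0.
    """
    flags = recent_ret_flags[:5]
    weights = [decay ** i for i in range(len(flags))]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * f for w, f in zip(weights, flags)) / total
```

A retirement in the most recent match scores about 0.52, while one five matches back scores about 0.03, so the feature carries graded information instead of a single 0/1 bit. It will still be zero for most rows, so the walk-forward gate remains the arbiter.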
Action.
Reverted both tennis_wp_model_ATP.pkl and train_tennis_wp.py to v12.9. The rejection is documented as a comment in the training script so the next attempt knows not to retry the binary form.
Lesson. v12.4 was when we learned in-sample looking good doesn't mean ship-ready. v12.11 is a different lesson: a hypothesis that's physically plausible can still fail because of how it's encoded. Sparse-positive binary flags lose signal compared to denser continuous variants.
What this fixes.
Tier 3.3 (per-surface live calibration tables) was originally scheduled as a "data exists, just stratify it" task. Today we discovered the data doesn't exist: TennisPosition had no surface field, so the recalibrator's getattr(pos, "surface", "") always returned empty. All 11,595 historical tennis records in the recal buffer have no surface tag. Same for trades.jsonl — can't backfill.
What shipped today.
Added surface field to the TennisPosition dataclass and wired it at position creation (surface=match.surface or "Hard"). Going forward, every newly-opened ATP/WTA position will carry surface, the recalibrator will tag the buffer record, and per-surface tables become possible after data accumulation.
What's deferred.
Tier 3.3 phase 2 (the actual per-surface tables) waits until the buffer has ≥100 samples per (sport, surface). At current trade rate (~50 tennis trades/day, split across 3 surfaces), this is ~7-14 days for hard, longer for grass (no grass season currently). The validation gate already exists; only the stratifier needs to be added to auto_recalibrate.py when the data is ready.
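The readiness check for phase 2 is a Counter over (sport, surface) pairs compared against the 100-sample floor. A minimal sketch, assuming buffer records are dicts with sport and surface keys; the function name is hypothetical.

```python
from collections import Counter


def surfaces_ready(records, min_samples: int = 100) -> dict:
    """Map each (sport, surface) pair to whether it has enough buffer samples."""
    counts = Counter((r.get("sport", ""), r.get("surface", "")) for r in records)
    return {key: n >= min_samples for key, n in counts.items()}
```

Records written before the surface field existed show up under an empty surface string, which makes the size of the untagged backlog visible in the same report.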
Lesson learned.
"The data tags surface" turned out to mean "the recorder API accepts surface" — not that surface was actually being passed. Verifying assumptions early (today the very first sanity check was Counter[(sport, surface)] on the buffer) saves a lot of wasted effort downstream.
What shipped.
Four new features added to the ATP pre-match model: p1_serve_pct_surf, p2_serve_pct_surf, p1_serve_pct_overall, p2_serve_pct_overall. Each is a rolling mean of (1stWon + 2ndWon) / svpt from the player's prior matches (last 20 on this surface, last 30 overall), computed AS-OF the match date so there is no leakage. Surface-specific values back off to overall, which back off to a tour default of 0.62 for players with no history.
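The as-of rolling mean with back-off can be sketched with bounded deques. Assumptions in this sketch: the rolling value is the mean of per-match ratios (not a ratio of summed points), matches are fed in chronological order, and the class and method names are illustrative rather than taken from train_tennis_wp.py.

```python
from collections import defaultdict, deque

TOUR_DEFAULT = 0.62   # fallback for players with no history, per the entry above


class ServeTracker:
    def __init__(self):
        self.by_surface = defaultdict(lambda: deque(maxlen=20))  # last 20 on a surface
        self.overall = defaultdict(lambda: deque(maxlen=30))     # last 30 overall

    def features(self, player: str, surface: str) -> tuple[float, float]:
        """Return (serve_pct_surf, serve_pct_overall) from PRIOR matches only."""
        overall = self.overall[player]
        ov = sum(overall) / len(overall) if overall else TOUR_DEFAULT
        surf = self.by_surface[(player, surface)]
        sf = sum(surf) / len(surf) if surf else ov   # back off to overall
        return sf, ov

    def record(self, player: str, surface: str,
               first_won: int, second_won: int, svpt: int) -> None:
        """Call AFTER features() for the same match, so there is no leakage."""
        if svpt <= 0:
            return                                   # skip matches without serve stats
        pct = (first_won + second_won) / svpt
        self.by_surface[(player, surface)].append(pct)
        self.overall[player].append(pct)
```

The ordering contract (read features, then record the result) is what makes the values as-of the match date; reversing the two calls would leak the outcome into its own features.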
Why this works. Elo captures who wins, but two players with identical Elo on hard court can have very different game styles — a serve-bot wins more by holding, a returner wins more by breaking. Serve_won_pct lets the model encode that asymmetry. Surface stratification matters because hold rates differ ~10pp between hard and clay for many players (e.g. Djokovic hard 70.4% vs clay 65.6%).
Holdout 2024 results (ATP).
WTA: deferred. Same training pipeline ran on WTA: Brier 0.1590→0.1577, ECE 3.27%→2.62% (better overall) — BUT the [20-30%] bucket regressed from −6.8pp (n=285) to −12.1pp (n=111), exceeding the 8pp per-bucket ship-gate ceiling. Even though net metrics improved, a single localized regression on a small bucket can mean a few specific match types (e.g. underdog clay specialists) get systematically miscalibrated. The disciplined call: revert WTA to the pre-v12.9 model and ship ATP only. WTA will re-attempt with feature subset experiments before promotion.
Honesty about expected live impact. Pre-match calibration improvements are small in absolute terms (~0.0014 Brier) and the model is just one input to live trading; in-play state evolution dominates. We expect a modest but real lift in tournament-opening trades where pre-match priors carry the most weight. The alerter and shadow eval log will measure real-world impact over the next 14 days.
Why this shipped.
The v12.7 alerter (its first full day live) flagged three sports: MLB, CS2, and SOCCER. Each was triaged with the same playbook used for ATP/WTA — two_pop_calibration.py to confirm the bleed is real, then walk_forward_cli.py to validate any candidate fix on chronologically-held-out data before shipping.
MLB — min_fair_wp_mlb_c = 70 (was global default 55). The alerter caught a textbook bucket-drift signature: the [60-70%] band's gap drifted from -11.1pp baseline → +38.0pp recent (a 49.1pp swing). Recent 30 trades: 33% WR, −$6.38 P&L, ECE 32%. Edge-band stratification showed high-edge trades losing more (the v12.4-style selection-bias-amplified pattern). Walk-forward verdict on the held-out 30%: P&L −$1.07 → +$0.97, WR 58.5% → 75.0%, ECE 12.6% → 7.4%. Volume cut: −63%. The cut volume was net negative.
CS2 — min_fair_prob_c = 70 (raised from 60).
Different signature: not regime-shift drift, but persistent miscalibration across all bands. Baseline 267 trades had −$13.54 P&L and 39.8% ECE; recent 14d had 38.6% WR and ECE 27%. Per-band recent: [50-60%) 25% WR, [60-70%) 29% WR vs predicted 55-66%. Walk-forward held-out: P&L −$11.22 → −$3.32 (+$7.90), WR 40% → 52.5%. Still not winning, but halves the bleed. CS2 is currently --disabled in the unified bot — gate takes effect on re-enable.
SOCCER — deferred.
Discipline test: SOCCER had only 35 total resolved trades (n=18 recent). Walk-forward at min=70 returned INSUFFICIENT_DATA (only 3 held-out trades). Even min=50 "passed" the gate but ECE got worse 38.7%→50.1%. The disciplined call is to NOT ship a config change on n=3 evidence. SOCCER is also currently --disabled. The alerter will re-fire when more data accumulates and we'll re-evaluate then.
What this validates. The v12.7 monitoring infrastructure paid for itself in 24 hours: caught three real bleeds we would not have noticed without it. The walk-forward gate then prevented one of those (SOCCER) from being a hasty same-day fix on a too-thin sample. Selection-bias awareness + walk-forward + alerter together = the system catching its own drift and rejecting its own knee-jerk responses.
Why we shipped this. Same diagnostic methodology applied to WTA that we used for ATP in v12.6 surfaced two things at once: (1) WTA was bleeding badly and (2) the v12.6 ATP fix had inadvertently made WTA worse. WTA bot trades show 41% WR and ECE 25% on the bot-selected sample — vs 3.27% ECE on the representative-population holdout. Same selection-bias signature as ATP, but ~2x bigger.
The unintended-consequence bit.
v12.6 lowered the shared min_fair_prob_c from 63 → 60 to re-enable an ATP-profitable bucket. But that floor was shared between ATP and WTA. WTA's [60-63%) bucket is decisively LOSING (n=3, 33% WR, +29.5pp gap) — the OPPOSITE sign of ATP's [60-63%) which is profitable. The fix: split the config knob.
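One way to express a split knob is a per-tour key that falls back to the shared floor. The sketch below assumes a flat dict config and a hypothetical helper `min_fair_prob_for`; only the key names `min_fair_prob_c` and `min_fair_prob_c_wta` and the values 60 and 70 come from this entry.

```python
from typing import Mapping

# Values as described in this entry; the dict layout itself is an assumption.
CONFIG = {
    "min_fair_prob_c": 60.0,      # shared floor (ATP keeps this)
    "min_fair_prob_c_wta": 70.0,  # tour-specific override
}


def min_fair_prob_for(tour: str, config: Mapping[str, float]) -> float:
    """Return the tour-specific floor if one exists, else the shared floor."""
    override_key = f"min_fair_prob_c_{tour.lower()}"
    return config.get(override_key, config["min_fair_prob_c"])
```

With this shape, loosening the shared floor for one tour can no longer silently loosen it for the other once an override is present.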
What we shipped.
min_fair_prob_c_wta = 70.0 as a tour-specific override. ATP keeps its v12.6 value of 60. WTA gets a much higher floor because the entire [50-70%] range is bleeding (n=22 in [60-70%), 35.7% WR, +30pp gap).
Tier 1 monitoring infrastructure (the meta-fix). Today's session caught the WTA bleed only because we happened to look. Real fix is infrastructure that flags this without us looking. Three new tools shipped:
two_pop_calibration.py: the diagnostic that prevents the "live ECE looks bad → retrain the model" mistake. Always reports BOTH (A) representative-population ECE and (B) bot-selected ECE side by side, plus the gap (C). Decisions are made on (A); the gap shows selection-bias magnitude. Today's gaps: ATP +10.6pp, WTA +21.8pp. Both are selection bias; neither indicates a model retrain.
walk_forward_cli.py: before any threshold change ships, applies it retroactively to the trade log, splits chronologically (70/30), and reports whether the held-out tail regresses on P&L / WR / Brier. Returns PASS / REJECT exit code so it can gate deployment scripts. Caveat: only validates TIGHTENING (filters that drop trades); LOOSENING needs the shadow log (Tier 2.1).
live_calibration_alert.py: daily cron that flags BLEED (recent P&L below threshold), WIN_RATE_LOW (below 40%), and BUCKET_DRIFT (a bucket's calibration gap rose >10pp vs the prior baseline window). Selection-bias-aware: alerts on CHANGE, not on absolute ECE. Cron installed at 12:15 UTC daily.
First run of the alerter caught real signals across the fleet. We immediately surfaced 9 alerts in 6 sports. Notable: MLB recent WR is 33.3% (n=30) bleeding $-6.38 with a [60-70%] bucket gap drift of +49pp (was -11pp two weeks ago, now +38pp); CS2 recent WR 38.6% (n=57) bleeding $-5.96; SOCCER 27.8% WR (n=18). Whether each is real drift vs sample noise still needs investigation, but the infrastructure now flags them within 24 hours instead of 14 days.
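The BUCKET_DRIFT rule can be sketched as follows. This is an illustration, not the alerter's code: it assumes the gap is mean predicted probability minus realized win rate (which matches the sign of the MLB and WTA figures quoted here), and the function names and trade tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple

TradeRow = Tuple[float, bool]  # (predicted win probability, won)


def bucket_gap_pp(trades: List[TradeRow], lo: float, hi: float) -> Optional[float]:
    """Calibration gap in percentage points for predictions in [lo, hi)."""
    rows = [(p, won) for p, won in trades if lo <= p < hi]
    if not rows:
        return None
    mean_pred = sum(p for p, _ in rows) / len(rows)
    win_rate = sum(1 for _, won in rows if won) / len(rows)
    return 100.0 * (mean_pred - win_rate)


def bucket_drift_alert(baseline: List[TradeRow], recent: List[TradeRow],
                       lo: float, hi: float,
                       threshold_pp: float = 10.0) -> Optional[dict]:
    """Flag a bucket whose gap rose by more than threshold_pp vs baseline."""
    base_gap = bucket_gap_pp(baseline, lo, hi)
    recent_gap = bucket_gap_pp(recent, lo, hi)
    if base_gap is None or recent_gap is None:
        return None  # nothing to compare in one of the windows
    drift = recent_gap - base_gap
    return {
        "baseline_gap_pp": base_gap,
        "recent_gap_pp": recent_gap,
        "drift_pp": drift,
        "alert": drift > threshold_pp,
    }
```

Alerting on the change in gap, not its level, is what keeps a constant selection-bias offset from firing every day.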
What we did NOT ship today.
tennis_serve_rates.json into the pre-match XGB (the analytical Markov already uses it, the XGB doesn't); injury / retire-last-match features; per-surface calibration tables.
The honest meta-lesson from today. The bot's models are accurate (2-3% ECE on representative populations). The bot-selected sample's miscalibration is mostly selection bias from the trade filter, not model error. Most of the value going forward is in infrastructure that distinguishes the two and acts on the right one, not in retraining models. Today's three tools are that infrastructure. Yesterday we'd have responded to "WTA ECE is 25%" by retraining. Today we know to look at the gap, ship a config fix, and let the alerter watch the next 14 days.
Open questions for next pass: (1) MLB [60-70%] +49pp bucket drift — sample noise or regime shift, needs investigation. (2) Several sports show recent-period bleeds; some may need their own min_fair_prob/min_edge tuning like ATP/WTA just did. (3) Walk-forward CLI cannot yet validate floor LOOSENING (e.g. v12.6's 63→60 ATP change) until the shadow trade log lands — tier 2.1.
min_fair_prob_c 63 → 60: re-enable a profitable bucket the floor was blocking
Why we looked again. v12.5 rebuilt the elo data layer. Same day, the open question was whether to retrain the pre-match ATP XGB itself with calibration regularization (focal loss / Brier-augmented objective / etc.) to fix the 9.58% live ECE on 6,203 ATP recal-buffer samples. Before retraining, we asked an honest question: is the live miscalibration real model bias, or selection bias from the bot only logging predictions on trades it opened?
What the diagnostic showed.
What we did NOT ship.
What we did ship.
Per-bucket calibration on the 91 resolved ATP trades, focusing on what the existing min_fair_prob_c=63 floor blocks:
Lowered min_fair_prob_c from 63 → 60 to re-enable the [60-63)c bucket. The lossy [55-60) bucket remains blocked. Expected effect: +20-30% volume on the profitable subset, no change to the bleed.
Why this is the right call (and bigger than it looks). It's tempting to treat "retrain the model" as more rigorous than "tweak a config value." But the data says the model isn't broken — the floor was. A 1-line config change with empirical backing beats a multi-day rebuild whose ship gate it can't honestly clear. We measured first, then acted minimally.
Honest caveats: (1) bucket-level signals on n=20 have wide CIs — the [60-63) profit could compress as more data comes in, in which case we re-tighten. (2) Selection-bias diagnosis means future "live ECE looks bad" reports should always be cross-checked against representative-population ECE before any retrain decision. (3) ATP CLV is still negative; this change improves the trade mix on the margin, doesn't claim to fix structural market efficiency.
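To put caveat (1) in numbers, a Wilson score interval shows how little a 20-trade bucket pins down a win rate. The 13-of-20 record used in the usage note is a made-up example, not the bucket's actual result.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = wins / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width
```

For a hypothetical 13 wins in 20 trades, `wilson_interval(13, 20)` spans roughly 43% to 82%, so the same bucket is compatible with both a losing and a strongly winning true rate.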
Why we looked.
While diagnosing why ATP held-out CLV is negative at every max_edge_c setting (see v12.4 above), per-bin calibration on 6,203 ATP samples showed the model has classic favorite-longshot bias — predicts 75% but reality is 62%, predicts 25% but reality is 47%. The auto-recalibrator's validation gate REJECTED post-hoc fixes for ATP and WTA: isotonic, Platt, and Beta calibrators all left ≥3 buckets exceeding the ±8% tolerance. The miscalibration is non-monotonic and locally inconsistent — not patchable downstream.
What we found.
Tracing the prediction stack from ESPN tick to fair_c output, we discovered a bigger problem hiding in the data layout:
atp_matches_2025.csv on the VPS was 14 bytes: a previous auto-fetch had received a GitHub 404 and saved the response. Sackmann hasn't published 2025 yet, so this was technically benign (csv.DictReader silently skipped it), but it surfaced the broader staging issue.
The data/tennis/ directory the build script expects was empty: ATP CSVs lived next to the script in core/ and WTA CSVs were absent entirely (the elo file's WTA entries had been built from CSVs that no longer existed at any path on disk).
The elo build script (build_tennis_elo.py) only ingested main-tour matches, not Challenger or qualifying. That meant every player who plays primarily Challengers (which is most of the live ATP matches we see today, since the tour is between Madrid and Rome) defaulted to Elo 1500. Burruchaga, Forejtek, Kolar, Pellegrino, Korpatsch: all blank-slate.
What we shipped.
/opt/fairprob/data/tennis/: ATP main 2020-2024, ATP Challenger/qualifying 2020-2024, WTA main 2020-2024 (~16 MB total).
build_tennis_elo.py now also ingests atp_matches_qual_chall_*.csv (Challenger + qualifying main draws). WTA stays main-only because qual_itf would dilute with low-tier ITF noise.
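Why the missing Challenger files mattered can be seen in a toy Elo build. This is the textbook Elo update with an assumed K-factor of 32, not the logic in build_tennis_elo.py; the player names are placeholders.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

DEFAULT_ELO = 1500.0


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo win expectancy for player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def build_elo(matches: Iterable[Tuple[str, str]], k: float = 32.0) -> Dict[str, float]:
    """matches: (winner, loser) pairs in chronological order."""
    ratings: Dict[str, float] = defaultdict(lambda: DEFAULT_ELO)
    for winner, loser in matches:
        delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


main_tour = [("player_a", "player_b")]
challenger = [("player_c", "player_d"), ("player_c", "player_b")]

main_only = build_elo(main_tour)
with_challenger = build_elo(main_tour + challenger)
```

In `main_only`, `player_c` has no entry at all and any lookup falls back to 1500; in `with_challenger` the same player carries a rating earned from two wins.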
What we did NOT ship.
We attempted a v1-analytical vs v2-XGB head-to-head on the atp_v2_shadow.jsonl log to decide whether to promote the existing v2 in-play model to primary (the same pattern WTA used in v11.x). The shadow log only contains 11 distinct match_ids so far — all currently in-progress — so no finished-match outcomes can be derived. Promotion decision deferred until ~7 days of accumulated shadow data is available.
What this should do for ATP volume.
The user-facing question that started this thread was "how do we get more ATP trade volume correctly?" The honest answer today is: most of the way to fixing volume is fixing the data the model trains on. We've done that. The cap question (currently max_edge_c=20) is parked — we re-test it once the new elo has produced a few hundred new in-bot predictions and the recal buffer can be re-binned.
Open issues for next pass: (1) v2 promotion decision needs ≥30 finished ATP matches in shadow log; (2) the data/tennis/ directory still has no automated refresh cron — we manually fetched. Wiring a weekly Sackmann sync prevents this from rotting again.
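If the weekly sync gets wired, a payload check before writing to disk would stop an error body from being saved as a CSV again. A minimal sketch, assuming the Sackmann match files carry `winner_name` and `loser_name` columns and that a real season file is never smaller than about 1 KB; both thresholds are assumptions, and the function name is hypothetical.

```python
import csv
import io
from typing import Sequence


def looks_like_match_csv(payload: bytes,
                         required_columns: Sequence[str] = ("winner_name", "loser_name"),
                         min_bytes: int = 1024) -> bool:
    """Reject downloads that are too small or lack the expected header."""
    if len(payload) < min_bytes:
        return False
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        return False
    header = next(csv.reader(io.StringIO(text)), None)
    if header is None:
        return False
    return all(column in header for column in required_columns)
```

An error body such as `404: Not Found` is 14 bytes long, so the size check alone would have caught the file described above; the header check covers larger HTML error pages.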
What we tried (v12.3, same day).
Built a per-(sport, edge-band) CLV gate in core/clv_filter.py, populated clv_buckets.json from the 686 trades with measured CLV, and flipped the moneyline bot from shadow mode to enforce. Same gate logic was also wired directly into the CS2, LOL, soccer, and tennis bots and the MLB SCORE-FILTER was replaced with bucket-based logic. In-sample the gate looked good: gate-allowed trades had +5.56c better CLV than gate-blocked trades.
What killed it (walk-forward validation).
Before letting the gate run live for any length of time, we built core/clv_gate_validation.py to do the test the v1.1 paper had flagged as missing: a chronological 80/20 split where bucket means are computed from the older 80% only and the gate is then applied to the held-out newer 20%.
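The shape of that test, as a rough sketch and not the contents of clv_gate_validation.py: bucket means come only from the older slice, and the gate is scored on the newer slice. The 5c band width, the "block buckets with negative training mean" rule, and the treatment of unseen buckets as allowed are all assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import List, Optional, Tuple

Row = Tuple[float, str, float, float]  # (timestamp, sport, edge_c, clv_c)


def edge_band(edge_c: float, width_c: float = 5.0) -> int:
    """Index of the edge band, e.g. 6c falls in band 1 for 5c-wide bands."""
    return int(edge_c // width_c)


def held_out_gate_lift(trades: List[Row], train_frac: float = 0.8) -> Optional[float]:
    """Mean CLV of gate-allowed minus gate-blocked trades on the newest slice."""
    ordered = sorted(trades, key=lambda row: row[0])
    split = round(len(ordered) * train_frac)
    train, test = ordered[:split], ordered[split:]

    # Bucket means are fit on the older slice only.
    by_bucket = defaultdict(list)
    for _, sport, edge_c, clv_c in train:
        by_bucket[(sport, edge_band(edge_c))].append(clv_c)
    bucket_mean = {key: mean(values) for key, values in by_bucket.items()}

    allowed, blocked = [], []
    for _, sport, edge_c, clv_c in test:
        train_mean = bucket_mean.get((sport, edge_band(edge_c)), 0.0)
        (allowed if train_mean >= 0 else blocked).append(clv_c)

    if not allowed or not blocked:
        return None  # the gate made no distinction on the held-out slice
    return mean(allowed) - mean(blocked)
```

A positive number in-sample that turns zero or negative here is the overfitting signature.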
Verdict and revert. The in-sample number was overfitting. We pulled v12.3 the same day:
clv_gate_mode reverted from enforce → shadow (logs decisions, never blocks)
should_trade(mode="enforce") calls removed from CS2, LOL, soccer, and tennis bots
The gate infrastructure (clv_filter.py, clv_buckets.json, the moneyline shadow log) stays in place: it cost nothing to keep and gives us a re-validation target once we have a larger CLV sample and a better cross-validation methodology
Why we're publishing this.
We've said in the v1.1 whitepaper that we publish what didn't work. This is what that looks like in practice: a one-day round trip from "let's enforce the gate" to "the gate doesn't generalize, kill it." The validation script (clv_gate_validation.py) and the failure log are in the same place as the code that worked.
Public CLV (Closing Line Value) dashboard at /clv. Per-sport closing line value across every settled trade we've ever made. The headline finding from running the full backfill: trades that beat the close win ~89% of the time; trades that lost the close win ~11%. CLV is the single strongest leading indicator of forecasting edge that exists, and almost no prediction-market vendor publishes theirs. We do — every sport, with edge-bucket breakdowns, JSON / CSV download, CC BY 4.0.
backfill_clv.py --commit took CLV coverage from 9% → 48.5% across all 1,415 trades by pulling Polymarket price history for each token
clv_backfill runs Tuesdays 04:39 UTC; future trades get CLV automatically without manual intervention. Registered in admin_cron_health with an 8-day stale window
Dataset schema in the page head for Google Dataset Search
Sport triage based on CLV. The CLV data did what it was designed to do: it told us which sports had real edge and which were bleeding closing-line value into the market. Result: 5 of 8 measured sports have negative CLV. We responded by acting on the data:
Paused sports stay visible on /clv alongside active sports, but no new orders are placed
NBA paused via min_edge_nba=100 in moneyline_wp_bot.py; WTA paused via min_edge_c_wta=100 in tennis_wp_bot.py; CS2 + Soccer disabled via --disable cs2,soccer on the unified bot's systemd ExecStart
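For reference, the CLV figures in this entry can be read with the usual convention for a bought position: closing price minus entry price, in cents, so a positive value means the entry beat the close. The sketch below assumes that sign convention and a simple tuple layout; it is not the dashboard's code.

```python
from typing import Dict, List, Optional, Tuple

ClvRow = Tuple[float, float, bool]  # (entry price in cents, closing price in cents, won)


def clv_cents(entry_c: float, close_c: float) -> float:
    """Closing line value for a long position; positive means we beat the close."""
    return close_c - entry_c


def win_rate_by_clv_sign(trades: List[ClvRow]) -> Dict[str, Optional[float]]:
    """Win rate of trades that beat the close vs trades that lost to it."""
    beat = [won for entry, close, won in trades if clv_cents(entry, close) > 0]
    lost = [won for entry, close, won in trades if clv_cents(entry, close) < 0]

    def rate(results: List[bool]) -> Optional[float]:
        return sum(results) / len(results) if results else None

    return {"beat_close": rate(beat), "lost_close": rate(lost)}
```

Trades with exactly zero CLV are left out of both groups in this sketch.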
MLB SCORE-FILTER override + WS reconnect telemetry (carried from v12.0).
Found that the April 28 score-change-only filter was blocking 4,026 of 4,028 MLB EVAL events because it was sampled at the old 8c gate. Added a narrow override: polling-trigger MLB trades now allowed in the proven 5–8c band only (the bucket the v12.0 audit found was 71.9% WR / +$8.92). All other MLB edges still require a score-change trigger. WS reconnect logging now emits clean "WS RECONNECTED — resuming signals (downtime=Xs, N eval cycles skipped)" lines on recovery and forces a fresh ESPN poll on reconnect.
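The override rule reads as a small predicate. This sketch assumes two trigger labels and inclusive band edges, neither of which is confirmed here, and it covers only the trigger check, not the other gates a trade still has to pass.

```python
def mlb_trigger_allows(edge_c: float, trigger: str) -> bool:
    """Score-change triggers pass at any edge; polling only in the 5-8c band."""
    if trigger == "score_change":
        return True
    if trigger == "polling":
        return 5.0 <= edge_c <= 8.0  # band edges assumed inclusive
    return False
```

Keeping the override this narrow means the April 28 filter still applies everywhere outside the one bucket the audit supported.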
Research-first reframe. We're being honest about what ZenHodl is. The bot is one thing it does. The data the bot generates — calibrated probabilities, ECE per sport, on-chain pre-committed benchmarks, per-sport CLV across every trade — is a different and rarer thing. As of v12.1, we treat that data as a first-class product. Most prediction-market vendors don't measure CLV at all. The few that do, don't publish it. We do, and we use it to make trading decisions in public. If a sport's CLV stays negative, we pause that sport on this page until it doesn't. If we ever start trading a sport that has been bleeding CLV without good reason, you'll see it here.
Open issue: tennis WS subscription cap (Haddad Maia 19.6c edge missed because her token wasn't in the 1,498-token active set). Documented for next session — fix is smarter token rotation, not in scope today.
Transparency Index — expanded and publicly citable. Grew the index from 21 to 27 sports prediction sources and added two new dimensions (track record longevity, sport coverage breadth), bringing the rubric to 7 dimensions / 35 max points. Recalibrated several scores after a fresh audit; the result is that FiveThirtyEight (archived) now ranks #1 at 30/35 and ZenHodl ranks #2 at 29/35 — the index passes the "would you rank yourself first if you were honest" smell test.
Dataset schema in the page head: surfaces in Google Dataset Search / structured-data crawlers
/admin/transparency-index and never auto-apply
#src-fivethirtyeight); citation block; rank chips on every row
NBA Playoffs 2026 benchmark: production-hardened ahead of May 5 tipoff. The on-chain pre-committed benchmark at /benchmarks/nba-playoffs-2026 went through a hardening pass to make every claim in the manifest enforceable in code:
status field (polymarket_unavailable / zenhodl_unavailable) and surfaces in a new "Excluded games" section on the public scoreboard. Manifest's tie-handling rule is now provably applied
manifest.json SHA-256 to the on-chain receipt every render; green "Served file hash matches on-chain commit" badge confirms the manifest is byte-equal to what's on Polygon. Programmatic endpoint at /benchmarks/<slug>/hash-check.json
zenhodl_unavailable rows for that batch and the next cron tick retries
Build-vs-buy calculator: rebuilt for B2B decision-makers. The /build-vs-buy page got a full overhaul:
?rate=200&sports=8&tier=enterprise) so a buyer can lock numbers and forward to their CFO
Bot operations: gate audit + WS reconnect telemetry.
WS RECONNECTED — resuming signals (downtime=Xs, N eval cycles skipped during outage) line on recovery. Forces a fresh ESPN poll on reconnect so we don't trade off stale game data. Throttled the spam of "WS DISCONNECTED" warnings from once-per-second to once-per-30s. Next outage's postmortem is one grep away
Known issue: CLV coverage gap. A spot audit on April 29 surfaced that closing-line value (CLV) is recorded on only 9% of trades globally, with critical gaps: CS2 has 0% coverage on 341 trades and tennis has <5% across both tours. The closing-price polling job is only fully wired for the moneyline plugin (MLB at 31.8% coverage is the best-covered sport, with mean CLV of -5.5¢). Filling the CS2 / tennis CLV gap is the next operational priority. Without it we're flying blind on whether recent gate tightening is improving or hurting closing-line value, which is the single best leading indicator of edge erosion. Tracked publicly here so we can ship the fix in v12.1 and validate the v12.0 gate changes against it.
What happened.
From 2026-04-22 through 2026-04-24, ai_drift_monitor.py overwrote per-sport calibration tables with newly-fit isotonic regressions without a holdout-validation gate. The refit was mathematically valid on the training half but produced overconfident probabilities at inference, so the bot priced edges that weren't there.
Impact.
Approximately -$63 cumulative bot P&L over a 7-day window attributable to the corrupted tables, concentrated in NHL and MLB. The same calibration tables back our public win-probability API, so any consumer reading /v1/games or /v1/edges during the incident window saw the same overconfident probabilities. Discovered during a routine bleed audit on April 25.
Root cause. The post-hoc refit pipeline assumed any new isotonic fit was an improvement. There was no holdout split, no Brier-score check against the prior calibrator, and no per-bucket sanity range — three guards we'd been planning to add but hadn't shipped. A monotone fit on a tiny recent window is mathematically "correct" but pushes recent noise into the model.
Fix shipped April 25.
Built calibration_validator.py as a hard gate on every refit:
calibration_history.jsonl for audit
Verification. We replayed the validator against the historical refit attempts that caused the incident: 9 of 10 sports' refits would have been rejected by the new gate; only the LoL Platt fit passed. Going forward, any future drift in any of the 11 sports' calibrators is gated on the same validator.
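A holdout gate of this kind can be sketched in a few lines. This is an illustration built from the three guards named in the root cause (holdout data, Brier comparison against the prior calibrator, a per-output sanity range); the function names, the range bounds, and the strict-improvement rule are assumptions, not the behavior of calibration_validator.py.

```python
from typing import Callable, Sequence, Tuple

Calibrator = Callable[[float], float]


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def validate_refit(holdout_probs: Sequence[float], holdout_outcomes: Sequence[int],
                   prior: Calibrator, candidate: Calibrator,
                   sane_range: Tuple[float, float] = (0.02, 0.98)) -> Tuple[bool, str]:
    """Accept a refit only if it is sane and beats the prior on held-out data."""
    if not holdout_probs:
        return False, "no_holdout_data"
    new_probs = [candidate(p) for p in holdout_probs]
    low, high = sane_range
    if any(q < low or q > high for q in new_probs):
        return False, "output_out_of_range"
    old_probs = [prior(p) for p in holdout_probs]
    if brier_score(new_probs, holdout_outcomes) >= brier_score(old_probs, holdout_outcomes):
        return False, "brier_not_improved"
    return True, "ok"
```

Under this rule the default outcome is to keep the prior calibrator, which is the opposite of the pre-incident assumption that any new fit is an improvement.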
Followups.
Public CLV dashboard at /admin/clv now exposes per-sport closing-line value as the leading indicator of edge erosion (independent of P&L variance). Sport-level circuit breaker (sport_circuit_breaker.py, shadow mode through ~May 9) auto-disables any sport whose 30-day ROI drops below -5%, so a future calibration regression that escapes the validator gets pulled within a day instead of a week.
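The breaker's decision rule is simple enough to sketch. ROI is assumed here to be 30-day P&L divided by 30-day amount staked, and the return labels are hypothetical; sport_circuit_breaker.py may define both differently.

```python
def circuit_breaker_decision(pnl_30d: float, staked_30d: float,
                             roi_floor: float = -0.05,
                             shadow: bool = True) -> str:
    """Return 'keep', 'would_disable' (shadow mode) or 'disable'."""
    if staked_30d <= 0:
        return "keep"  # no activity in the window, nothing to judge
    roi = pnl_30d / staked_30d
    if roi >= roi_floor:
        return "keep"
    return "would_disable" if shadow else "disable"
```

In shadow mode the breaker only reports what it would have done, which gives a record to review before it is allowed to pull a sport on its own.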
surface and best_of for downstream analytics and review.
scikit-learn to match production and removed an inference error affecting MLB and CFB locally.
/admin/revenue with active MRR, at-risk MRR, collected cash, and Stripe-versus-crypto breakdowns.
/admin/support/mint-link to mint fresh prefilled Stripe links for blocked customers.
STRIPE_API_STARTER_NO_CARD_TRIAL_PCT to 0 to evaluate the billing flow against a single card-required configuration.
/activate flow for passwordless accounts and persisted activation milestones.
api.zenhodl.net to zenhodl.net and updated redirects, canonicals, sitemap, and emails.
/v1/model/performance: Brier score, ROC-AUC, ECE, accuracy, and full conformal calibration tables for all 11 sports
/v1/model/clv: live closing line value tracking. Measures how often our entry price beats the final market price
/v1/venues: real-time status of all connected data venues (Polymarket, Kalshi, DraftKings, FanDuel, etc.)
?venue=kalshi on /v1/games and /v1/edges to filter by specific venue
/v1/snapshots/{sport}/{date}: win probability archived every 30s during live games (Pro+ tier)
/v1/predictions/batch: bulk download predictions for up to 90 days (Pro+ tier)
/v1/webhooks: register URLs to receive edge signals in real-time via signed POST requests (HMAC-SHA256)
/course/rate with social proof widget on course page
/v1/predict/{sport}/live for live win probabilities and venue-aware edges.
/v1/predict/{sport}/pregame for scheduled-game pricing.
/v1/predict/{sport}/{game_id} for single-game model output.
/v1/fair-lines/{sport} with American-odds conversion.
/v1/usage with monthly request breakdowns and tier caps.
elo_power and enabled the new Elo logic for soccer.
--select-by-trading to train_wp_model.py.
/v1/games, /v1/edges, /v1/sports, and /v1/predictions.