Changelog

v12.86 May 12, 2026

Operational cleanup: log hygiene, meta-model enrichment finalized, MLB unlocked

What changed. A focused operational pass: closed every remaining null-feature gap in the CLV meta-model shadow log, widened the MLB entry-price band to raise trade volume by ~30%, and cleared the log and disk buildup that had been slowly accumulating.

Meta-model enrichment finalized

Fixed elo_a/elo_b, size, score_diff, period, is_score_change for moneyline_wp (MLB/NHL/NBA), tennis_wp (ATP/WTA), and CS2 — all five fields now populate per row with real values, not null.
Root cause of partial nulls: bots used wrong attribute names (match.team1_elo never existed; the real lookup is self.hltv_fetcher.get_team_elo(team)). Side-aware indexing now matches the cs2/moneyline convention so elo_a is always the bet side.
API signal_engine.py was tagged with bot="api_signal_engine"; the shadow logger now skips API-server callers so they don't win the 60s dedup race with all-null rows.
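
The dedup rule above can be sketched as follows. This is a minimal illustration, assuming the shadow logger keeps one slot per token_id for 60 seconds; the ShadowLogger class, the API_CALLERS set, and should_log are hypothetical names, and only the bot="api_signal_engine" tag and the 60s window come from these notes.

```python
import time

# Hypothetical set of caller tags that belong to the API server.
API_CALLERS = {"api_signal_engine"}
DEDUP_TTL_S = 60  # 60s dedup window, per the note above


class ShadowLogger:
    """Illustrative dedup: first accepted writer per token_id wins for 60s."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._last_logged = {}  # token_id -> timestamp of last accepted row

    def should_log(self, token_id, bot):
        # API-server callers carry no bot-side features, so reject them
        # before they can claim the dedup slot with an all-null row.
        if bot in API_CALLERS:
            return False
        now = self._clock()
        last = self._last_logged.get(token_id)
        if last is not None and now - last < DEDUP_TTL_S:
            return False
        self._last_logged[token_id] = now
        return True
```

Checking the caller tag before touching the dedup table is the point of the design: a rejected API row never starts the 60s window, so the next bot row for the same token still gets logged.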

MLB volume unlocked

max_entry_price_mlb_c: 78c → 82c (the 78-82c band is historically 85% WR / +7% ROI / +2.3c CLV on n=13).
min_entry_price_mlb_c: 25c → 22c (trade small PnL for more CLV training data).
Filter audit showed 1,598 MLB skips in 30min on min/max_entry alone — every signal was hitting the gate.

Risk controls

NBA min_period_nba: 3 → 4 (Q4 only). Last 8 NBA trades were all Q3 trades, all losses (0% WR, -$9.13, -100% ROI). Q4 still retains the 68.6% WR signal from the original 2026-04-13 diagnostic.

Operational hygiene

Feature logger systemd MemoryMax: 512M → 1.5G (was OOM-killed 3× overnight; now stable at ~350MB usage).
rotate_gate_log.sh rewrite: mv → cp + truncate. The old approach preserved hardlinks from deploy_code_only.sh's cp -al staging step, and gzip refuses to compress a file that has more than one link, so the rotated archive landed uncompressed (1.1GB). The new approach breaks hardlinks at copy time. The deploy script now also sweeps staging dirs >1d old on every run.
/etc/logrotate.d/fairprob installed: daily rotation with copytruncate (preserves bots' open file descriptors), gzip after 1 day, keep 14 days. Pre-fix, unified_bot.log was 692MB and growing ~50MB/day.
Deleted 4 orphan logs from pre-unified-bot era (tennis_wp.log, bot.log, lol_wp.log, cs2_wp.log): 725MB recovered.
aiohttp Unclosed client session noise (1,268 warnings/24h) silenced via warnings.filterwarnings. Root cause was a cross-event-loop session reuse in espn_scores.py; added explicit loop-mismatch check that abandons stale sessions cleanly.
Polymarket WS transient disconnect logging downgraded ERROR → WARNING (5+/day were false alarms; the reconnect loop handles them automatically).
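
For reference, the cp + truncate rotation described in the rotate_gate_log.sh item can be sketched in Python. This is an illustration of the technique, not the shell script itself; rotate_copy_truncate and its arguments are hypothetical names.

```python
import gzip
import os
import shutil


def rotate_copy_truncate(log_path, archive_path):
    """Copy the log to a fresh file, truncate the original, gzip the copy."""
    # copyfile writes a brand-new file, so the archive has link count 1
    # even if log_path itself is hardlinked from a staging directory.
    shutil.copyfile(log_path, archive_path)
    # Truncate in place so writers holding the file open (in append mode)
    # keep a valid handle and continue at the new end of file.
    with open(log_path, "r+b") as fh:
        fh.truncate(0)
    gz_path = archive_path + ".gz"
    with open(archive_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(archive_path)
    return gz_path
```

One known trade-off of any copy-then-truncate scheme: lines written between the copy and the truncate can be lost. For a gate log that is usually acceptable, but it is worth keeping in mind when reconciling counts.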

Backlog

Model retraining before fall seasons start. CFB / NCAAWB / NFL WP models fail at predict time on the current sklearn (1.7.2): the pickles were saved on an older version where LogisticRegression.multi_class was an instance attribute. The bot's smoke test catches this and skips those sports, so there is no live impact (NFL/CFB/NCAAWB are all out of season). Retrain before September.

v12.84 May 11, 2026

Pricing refresh: free Developer tier, Starter gets 30K req, annual plans, Enterprise self-serve

What changed. Pricing page got a structural overhaul to fix the funnel-too-narrow + middle-tier-confusion problems.

New free Developer tier — 500 requests/month, 5-minute data delay, all 11 sports. No card required. Sign up at /signup?plan=developer_free.
Starter gets 3× the request cap at the same $49/mo — 10K → 30K req/mo. Old 10K (~333/day) was too tight for multi-sport monitoring.
Pro tier disambiguation — the public Pro card was switched from bundle_pro (10K req + dashboard) to api_pro (100K req + dashboard, same $149). Two SKUs at $149 with different request caps was a buyer-confusion trap. bundle_pro stays in the catalog for the legacy course-bundle webhook.
Annual plans at 17% off (2 months free) for Starter / Pro / Enterprise. Toggle on /pricing via the Monthly/Annual switch.
Enterprise self-serve checkout — once STRIPE_PRICE_API_ENTERPRISE is populated via setup_stripe.py, the CTA auto-flips from mailto to /billing/start/api_enterprise. No code redeploy required.
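
The "17% off (2 months free)" figure follows from charging 10 of 12 months. A small worked check, using the $49 and $149 monthly prices stated above; the annual totals are derived here for illustration under that assumption and are not quoted from the Stripe catalog.

```python
def annual_price(monthly_usd, free_months=2):
    """Annual price when `free_months` of the 12 are not charged."""
    return monthly_usd * (12 - free_months)


def discount_pct(free_months=2):
    """Discount relative to paying the monthly price for 12 months."""
    return 100.0 * free_months / 12


for plan, monthly in {"Starter": 49, "Pro": 149}.items():
    print(plan, annual_price(monthly), f"{discount_pct():.1f}% off")
```

Two free months is 2/12 = 16.7%, which the pricing page rounds to 17%.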

Files touched

api/subscription_plans.py — added developer_free, api_starter_annual, api_pro_annual, api_enterprise_annual; rebuilt PRICING_PAGE_SECTIONS; added interval arg to build_pricing_sections.
api/auth.py — Starter monthly_cap 10K → 30K.
api/templates/pricing.html — 4-card grid, Free card with dashed border, Monthly/Annual toggle, signup CTA support, "Save 17%" hint on monthly cards.
api/app.py — /pricing?interval=year + /signup?plan=developer_free param wiring.
Copy bump (10K → 30K) in email_utils.py, homepage.html, docs.html, products.html, build_vs_buy.html, blog comparison post, Arabic localization.

Operator note

Annual SKUs need Stripe prices created before they're sellable: python3 -m api.setup_stripe --key sk_live_... --write-env reads the new plans from the catalog, creates the Stripe price objects, and writes the IDs back to api/.env. After that, annual self-serve works and Enterprise CTA auto-flips.

v12.83 May 11, 2026

CLV meta-model v0 deployed (shadow); end-to-end feature pipeline live

What shipped. A binary CLV-prediction meta-model trained on 1,058 measured trades (XGBoost + isotonic calibration, AUC 0.616 on chronological holdout) running in shadow mode alongside every signal. Decisions are logged to clv_gate_log.jsonl with mode tag meta_v0_shadow. The operator can promote individual sports to enforce mode via meta_model_config.json (60s hot-reload, no redeploy). MLB shows the highest per-sport AUC (0.769); ATP/CS2/LOL show no signal yet, so they are excluded from sports_with_signal.
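
A minimal sketch of how a 60s hot-reload of meta_model_config.json could work, assuming a simple time-based re-read. HotReloadConfig, mode_for, and the "enforce_sports" key are hypothetical; the notes only name the file, the 60s interval, and sports_with_signal.

```python
import json
import time


class HotReloadConfig:
    """Re-read a JSON config at most once per `ttl_s` seconds."""

    def __init__(self, path, ttl_s=60.0, clock=time.monotonic):
        self._path = path
        self._ttl_s = ttl_s
        self._clock = clock
        self._loaded_at = None
        self._data = {}

    def get(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl_s:
            try:
                with open(self._path) as fh:
                    self._data = json.load(fh)
            except (OSError, json.JSONDecodeError):
                # Keep the last good config if the file is missing or mid-write.
                pass
            self._loaded_at = now
        return self._data


def mode_for(cfg, sport):
    # "enforce_sports" is a hypothetical key used only for this sketch.
    if sport in cfg.get("enforce_sports", []):
        return "enforce"
    return "shadow"
```

Falling back to the last good config on a parse error means an operator edit that is half-written when the timer fires does not flip every sport back to a default.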

Feature pipeline (110+ candidate features now flowing)

Phase 1: standalone feature_snapshot_logger systemd service captures pre-trade orderbook depth, rolling mid drift / volatility (1m/5m/15m) on 800+ subscribed Polymarket tokens.
Day 2: trade-print flow (BUY/SELL volume, net pressure) via Polymarket WS — fixed silent _handle_price_change drop; now capturing ~7K prints/90s.
Day 3: multi-bookmaker via Odds API (Pinnacle + DK + FanDuel + BetMGM consensus), 5min refresh, 67 quotes/cycle.
Phase 3: ESPN event-reaction (seconds-since-score-change, event count) via async polling.
Day 4: bot_self_state.py exposes per-sport recent PnL/CLV/slippage (5min cache from trades.jsonl).
Days 5-7: sport-specific extractors for tennis (sets/games/surface from shadow log), MLB (Statcast pitch counts via free statsapi.mlb.com), CS2 (round-level state), basketball (late-game / one-possession).
Phase 5: line-movement velocity (Δ Pinnacle / Δ consensus over 5min/15min — sharp-money signal).
Phase 6: MLB Statcast (pitcher pitch count, exit velocity, leverage).
Phase 7: ESPN injury feed (per-team injury counts, status-change-in-24h).

Engineering hygiene

Idempotent shadow logging (60s TTL per token_id) — eliminates per-tick log noise (~99% reduction).
Gate-log rotation cron (daily 04:30 UTC, archives >500MB to .backups/gate_log_archive/).
retrain_with_backup.sh — auto-rollback if new pkl drops below 0.54 AUC.
Odds API 429 backoff (honors Retry-After header) + low-budget warnings.
Validation script (validate_meta_shadow.py) joins shadow decisions to realized trade outcomes for counterfactual PnL measurement.
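
The Retry-After handling in the Odds API backoff item can be sketched like this. The header may carry either a number of seconds or an HTTP date; retry_after_seconds, the 5s default, and the 300s cap are assumptions for illustration, not values from the bot.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def retry_after_seconds(header_value, now=None, default=5.0, cap=300.0):
    """Convert a Retry-After header into a bounded sleep duration in seconds."""
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():
        delay = float(value)  # delta-seconds form, e.g. "30"
    else:
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            return default
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        delay = (when - now).total_seconds()
    return max(0.0, min(delay, cap))
```

Clamping the result keeps a malformed or far-future header from stalling the 5min refresh cycle indefinitely.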

Honest framing

Walk-forward 4-fold CV gives mean AUC 0.553 (fold std 0.042) — barely above the 0.55 gate. The model is a research instrument in shadow mode, not yet trusted to block trades. First validation verdict expected after Monday cron (06:15 UTC) joins ~30 settled trades. Per-sport submodel for MLB (n=204) didn't yet pass gate (CV 0.559, high fold variance 0.111).

v12.82 May 11, 2026

/clv-evidence whitepaper public + soccer Layer 2 calibration restored

Public whitepaper. New page at /clv-evidence publishing the empirical 78-percentage-point CLV gap, Wilson 95% CIs, two-proportion Z-test (z = 24.27, p ≈ 10⁻¹³⁰), per-sport breakdown, selection-bias analysis, and a public reproducibility script. Snapshot pinned 2026-05-08 (n=950 in gap), passed five Codex CLI review passes. Raw data + verifier published at /api/trades.jsonl and /api/verify_clv_gap.py.
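
For readers who want to see the two statistics named above, here is a minimal sketch of a Wilson score interval and a pooled two-proportion z statistic. It is not the published /api/verify_clv_gap.py script, and whether the whitepaper pools the variance is an assumption; the function names are illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic for H0: p1 == p2."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se
```

As a sanity check with made-up counts, wilson_interval(50, 100) gives roughly (0.404, 0.596), and two groups with identical rates give z = 0.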

Soccer fixes

Fixed a silent AttributeError on '_cal_table' that fired every tick (~19K errors in the log), using the defensive init in SoccerWPModel.__init__ from the v12.81 patch.
Layer 2 calibration table now loads unconditionally after Poisson init (was only inside dead XGB branch).
Now scanning 207 soccer tokens across EPL / LALIGA / LIGUE1 / BUNDESLIGA / SERIEA cleanly.

Methodology + validation page cross-links

Brier comparison softened from "Better/Worse" to "Better by construction / Competitive in practice (NBA 0.124, ECE 0.002)".
Trading Value row links directly to the 78-pp CLV gap whitepaper.
Three-page explainer block on /methodology and /validation clarifies the difference between /validation (backtest), /results (live ledger), and /clv-evidence (empirical skill test).
/validation page CLV companion callout pinned to canonical numbers, links to whitepaper.

Data hygiene

Relabeled 732 bot=unknown_backfill trades (March 31 one-shot import) to mode=imported_legacy so open-position counts no longer inflate. 38 zombie trades (resolved=True, won=None, no game_id) confirmed unrecoverable from Polymarket CLOB (0 prints across 178 NCAA tokens).

v12.81 May 9-10, 2026

Live recalibrator re-enabled + ATP/WTA drift investigation

Drift detection. Live trade ECE diverged from baseline buffer ECE on ATP (+18pp), CS2 (+14pp), LOL (+31pp), WTA (+14pp). Root cause: adverse selection at the trade-firing layer, not a broken model. The recalibration buffer (n=5,550 WTA samples) showed the model itself was within ±1pp calibration tolerance; the bot's selection logic was concentrating on the worst-calibrated snapshots.
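
ECE (expected calibration error) is the metric being compared here. The notes do not say which binning the recalibrator uses, so the sketch below assumes ten equal-width bins; expected_calibration_error is an illustrative name.

```python
def expected_calibration_error(probs, outcomes, n_bins=10):
    """Equal-width-bin ECE: sample-weighted mean |avg predicted - observed rate|."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_p = sum(p for p, _ in bucket) / len(bucket)
        hit_rate = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / total * abs(avg_p - hit_rate)
    return ece
```

Running this once on the full recalibration buffer and once on only the snapshots that produced trades is one way to separate a miscalibrated model from adverse selection in which snapshots get traded.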

Recalibrator status

Removed LIVE_RECALIBRATOR_DISABLED=1 from .env (was set May 6, paused all auto-corrections).
Uncommented weekly recalibrate cron (Mon 09:07 UTC). First re-enabled run: all 10 sports within calibration tolerance.
WTA / ATP shadow log inspection: 10 live Italian Open matches tracked correctly.

v12.74 May 5, 2026

Filter C: calibration-gap pre-trade gate

New gate that blocks trades in cohorts where the model has been historically overconfident by >15pp on n≥20 trades. Independent of the existing edge-gate (Filter A) and CLV-gate (Filter B), so each can run at different enforcement levels. Bots wire it in after the existing edge check.
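
A minimal sketch of the Filter C rule, assuming a cohort is simply a list of past (fair_prob, won) pairs. How cohorts are defined is not stated in these notes, and calibration_gap_blocks is a hypothetical name.

```python
def calibration_gap_blocks(cohort_trades, max_gap_pp=15.0, min_n=20):
    """Return True if the cohort's history shows overconfidence above the gap.

    cohort_trades: list of (fair_prob, won) with fair_prob in [0, 1].
    """
    n = len(cohort_trades)
    if n < min_n:
        return False  # not enough history to block on
    avg_pred = sum(p for p, _ in cohort_trades) / n
    win_rate = sum(1 for _, won in cohort_trades if won) / n
    gap_pp = (avg_pred - win_rate) * 100  # positive means overconfident
    return gap_pp > max_gap_pp
```

The gate is one-sided by design in this sketch: an underconfident cohort (win rate above the predicted rate) is never blocked.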

Auto-pause for low-skill sports

auto_pause_low_skill_sports.py detects sports where CLV-PROXY (directional skill metric) drops below 50% on n≥30 matches and writes a 14-day pause directive. A model with negative directional skill cannot be filtered into profitability — only paused. Auto-resumes when metric recovers.
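
The pause rule can be sketched as below. This assumes CLV-PROXY is expressed as a hit rate over matches; the directive fields and the pause_directive name are hypothetical and do not reflect the actual file format written by auto_pause_low_skill_sports.py.

```python
from datetime import datetime, timedelta, timezone


def pause_directive(sport, clv_proxy_hits, n_matches, now=None,
                    threshold=0.50, min_n=30, pause_days=14):
    """Return a pause directive dict, or None if the sport stays active."""
    if n_matches < min_n:
        return None  # too few matches to judge directional skill
    rate = clv_proxy_hits / n_matches
    if rate >= threshold:
        return None
    now = now or datetime.now(timezone.utc)
    return {
        "sport": sport,
        "clv_proxy": round(rate, 4),
        "paused_until": (now + timedelta(days=pause_days)).isoformat(),
    }
```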

v12.26 May 1, 2026

Loosen max_edge to 25c & lift between-sets skip (experiment)

Context. v12.22-v12.25 progressively tightened the entry filters and then flood-gated them (min_edge=0, min_fair=0, score-change-only). Live volume since the May 1 deploy was 0 fires: the max_edge=18c cap was rejecting otherwise-qualifying signals that exceeded it by less than 1c (e.g. WTA Golubic at 18.7c). The tennis "between-sets" skip filter blocked the only other matches that had score-change events.

Changes

All sports: max_edge_c 18 → 25 (NBA already at 25, NHL/MLB/CS2/ATP/WTA/LOL all loosened).
Tennis: skip_between_sets True → False. The existing comment claimed -1.5c/49.4% WR vs +25c/79.6% WR, but the script behind those numbers was not preserved, so the claim is treated as unverified.
Score-change-only gate (require_score_change=True) and floodgate floors (min_edge=0, min_fair=0) from v12.24-v12.25 are unchanged.

Decision criteria

Observe 7-14 days. Revert skip_between_sets to True if between-sets fires show WR < 45% or P&L < −$5. Tighten max_edge back to 18 if the [20-25c] edge band shows the historical falling-knife signature (WR < 40%).

v12.22 April 30, 2026

Revert v12.18 flood-gates — back to W14/W15 levels

Why this exists. v12.18 (4/30) opened max_edge to 25-30c per sport to test whether W17's bleed was a regime shift. Five days of live data (98 trades, 34.8% WR, −$61.79) plus a per-sport audit of the W14 (+$76) and W15 (+$11) profitable weeks confirmed it was not a regime shift — it was the gates.

What W14/W15 audit found

MLB [60-70%] fair: 35 trades, 77.1% WR / +$15.05. Median edge 10c.
NHL [70-80%] fair: 20 trades, 75.0% WR / +$16.18. Median edge 14c.
ATP [60-70%] fair: 24 trades, 58.3% WR / +$12.31. Median edge 14c.
LOL [70-90%] fair: 15 trades, 80.0% WR / +$9.72. Median edge 19c.
LOL [60-70%] fair: 11 trades, 18.2% WR / −$5.47 — structural loser, not regime.

v12.22 gate revert

ATP/WTA/MLB/NHL/LOL: max_edge_c 30 → 18. WTA: min_fair_prob_c 65 → 70 (back to v12.7). LOL: min_fair_prob_c 60 → 70 (block structural [60-70%] loser). CS2: already 18 / 70 from v12.21. NBA: stays suspended (min_edge_nba=100).

What this is NOT

Not a "fix". The model is the same. The selection logic is the same. We're just blocking the [20-30c] falling-knife band that universally underperformed across sports (v12.20). Volume target reverts to W14/W15 levels: 200-900 trades/wk depending on sport availability.

Decision criteria

Promote and call it stable if 7-day total: WR ≥ 55%, P&L ≥ +$10. Re-evaluate (don't loosen further) if WR < 50% AND P&L < −$15. The mistake was opening gates too aggressively in v12.18; we won't double down by loosening again on the first bad week.

v12.21 April 30, 2026

CS2 v6 model + tight-gate re-enable

What shipped. Re-trained CS2 grid model (v6) with three new opening-duel features: fk_diff, fk_rate_diff, last5_fk_diff. Re-fetched 3,747 series from GRID with the firstKill field included (25% of training maps now carry the new signal). CS2 trading is re-enabled on the VPS with v12.21 tight gates.

Honest verdict: the rebuild was a wash

v6 holdout Brier: 0.1595 — statistically tied with v5's 0.1592. firstKill features rank 16/17/18 in importance (combined 2.1%). The signal exists but correlates strongly with kill_diff (already 25% importance), so XGBoost extracts almost nothing independently. Calibration on holdout is excellent (max bucket error 1.7pp, ECE ≈ 0.8%) — but it was already excellent in v5.

Why we shipped anyway

Two reasons: (1) v6 is statistically equivalent — deploying it costs nothing and starts the firstKill signal flowing in case it matters in live distributions we can't see in holdout; (2) the real CS2 problem is selection bias, not model quality, and we needed to re-enable trading to test selection-side fixes.

v12.21 tight gates

max_edge_c: 35 → 18. Universal cross-sport pattern (v12.20): the [20-25c] edge band is a falling-knife — CS2 ran 31.4% WR there. 18c keeps the [15-20c] high-confidence band where v3 sign-agree was 86% but excludes the noise zone.

Other gates unchanged: min_fair_prob_c=70, entry [35,50]c, slippage 6c, BO3+ only, T1/T2 tournaments only.

Decision criteria for next 7-14 days

Promote if: 30+ trades, WR ≥ 50%, total P&L ≥ −$5. Revert if: any 14d window with P&L < −$15 or WR < 35%. The model is the same; we are evaluating gate logic.
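
Written out as a check, the criteria above look like this. The function name and argument names are illustrative, and giving "revert" precedence over "promote" when both conditions hold is an assumption.

```python
def cs2_gate_verdict(n_trades, win_rate, pnl_usd, worst_14d_pnl_usd, worst_14d_wr):
    """Apply the v12.21 promote/revert thresholds; otherwise keep observing."""
    # Revert if any 14-day window breached either floor.
    if worst_14d_pnl_usd < -15.0 or worst_14d_wr < 0.35:
        return "revert"
    if n_trades >= 30 and win_rate >= 0.50 and pnl_usd >= -5.0:
        return "promote"
    return "observe"
```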

v12.20 ANALYSIS April 30, 2026

Why we lose: cross-sport and per-sport pattern analysis (1,368 resolved trades)

Why this exists. After today's gate-tuning loop, we did a structured "why do we lose?" analysis across all 1,368 resolved trades. Five strong patterns emerged. Documenting them here so the patterns survive the session and inform future tuning.

Cross-sport finding 1: score-change-triggered fills are ~20pp better

Score-change-triggered fills (75 trades): 66.7% WR, +$3.59. Polling-triggered fills (1,286 trades): 48.1% WR, −$34.96. 95% of losses are concentrated in polling fills. Score-change fills create new information windows where the model has time to re-evaluate; polling fills happen during market drift where the model is stale and the market is right.

Cross-sport finding 2: fair_prob has a U-shape

[50-60%] band: 34.7% WR — the "death zone" of overconfident coinflips. [70-80%] band: 60.0% WR — sweet spot. [90%+] band: 76.0% WR — calibrated favorites. Pattern repeats per-sport: ATP, WTA, LOL, CS2 all bleed at [50-60%] and shine at [70-80%].
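
A band breakdown like the one above can be reproduced with a small grouping helper. This is a sketch that assumes each trade is a (fair_prob_c, won) pair and that bands are half-open, [lo, hi); win_rate_by_band is an illustrative name, not the analysis script used for these numbers.

```python
from collections import defaultdict


def win_rate_by_band(trades, edges=(50, 60, 70, 80, 90)):
    """Group (fair_prob_c, won) pairs into bands and return {label: (n, WR)}."""
    stats = defaultdict(lambda: [0, 0])  # label -> [n, wins]
    for fair_c, won in trades:
        if fair_c < edges[0]:
            label = f"<{edges[0]}%"
        elif fair_c >= edges[-1]:
            label = f"[{edges[-1]}%+]"
        else:
            lo = max(e for e in edges if e <= fair_c)
            hi = min(e for e in edges if e > fair_c)
            label = f"[{lo}-{hi}%]"
        stats[label][0] += 1
        stats[label][1] += int(bool(won))
    return {k: (n, wins / n) for k, (n, wins) in stats.items()}
```

Returning n alongside the win rate matters here, since several of the per-sport bands in this analysis rest on small samples.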

Cross-sport finding 3: edge [20-25c] is the falling-knife band

Universal across sports. ATP: 30.8% WR. NBA: 20.0% WR. CS2: 31.4% WR. LOL: 15.4% WR. Even MLB (where the model is best calibrated) shows weakness here. Above 25c the bot tends to win in some sports because the edge implies the model is responding to a real opportunity rather than noise — but this is sport-dependent.

Cross-sport finding 4: CLV is the truth-teller

Across every sport with enough CLV data, the gap between winners' CLV and losers' CLV is +45c to +65c. Winners average +25c CLV; losers average −28c CLV. Losses are systematic adverse-selection events: when the market re-prices against us after entry, we lose. This is consistent with the falling-knife thesis — the bot loses to informed flow.

Cross-sport finding 5: hour-of-day matters

Best hours: 03h UTC (overnight, 75% WR), 23h UTC (74.5% WR). Worst: 22h UTC (30.4% WR), 15h UTC (38.5% WR), 14h UTC (46.8% WR with 698 trades, −$18.87). Peak market hours (US morning to early afternoon) have the worst calibration. Likely explanation: more market makers and informed flow during US business hours.

Per-sport table

Each sport tells a different story:

ATP (n=91, 49.5% WR, +$2.21): [60-70%] sweet spot at 60.8% WR / +$6.80. [50-60%] dies at 31% WR. CLV gap +49c. Profitable when filtered.
WTA (n=41, 41.5% WR, −$3.50): All bands except [70-80%] losing. [60-70%] is 39% WR. CLV gap +57c. Needs ≥70 floor — matches v12.7 design.
LOL (n=93, 40.9% WR, −$4.05): [60-70%] catastrophic at 27% WR. [70-90%] excellent at 60-74%. [20-25c] edge: 15% WR — the GIANTX falling-knife pattern. Strong fair_prob × edge interaction.
CS2 (n=333, 43.5% WR, −$18.78): Bleeding everywhere except [70-80%] and [90%+]. Slippage 0.5-2c → 53% WR; slippage 5c+ → 32% WR. Structurally broken; disabled.
MLB (n=215, 57.7% WR, −$1.74): Score-change 75% WR / +$2.91. Polling 55.9% WR / −$4.65. ALL fair bands ≥50% are profitable. CLV gap +64c. Best-calibrated sport — v12.18 floor revert is the right experiment here.
NHL (n=99, 60.6% WR, +$1.15): Score-change 66% WR / +$1.78. [15-20c] edge: 83.3% WR / +$3.76 (best band in any sport). Healthiest sport.
NBA (n=126, 45.2% WR, −$11.61): Heavy polling-fill bleeder. 94 of 126 trades have fair_prob_c=0 (logging anomaly — same family as the March 31 LOL anomalies). Real signal n is much smaller. Investigate logging bug before model-tuning.
NCAAMB (n=235, 52.8% WR, +$8.75): Same fair_prob_c=0 logging issue (231 of 235). Despite that, net positive. Out of season now. Separate logging bug to investigate next season.
SOCCER (n=35, 37.1% WR, −$2.58): All bands losing. CLV gap +47c (informed flow is real). Re-enabled in v12.18; small n — need 30+ more for direction.

The synthesized story

The bot wins when it reacts to genuine information events (score changes), in confidence ranges where the model is well-calibrated ([70-80%], [90%+]), with fill slippage between 0.5c and 2c, outside peak market hours. The bot loses when polling-triggered fills happen during market drift, with model output in the [50-60%] coinflip zone, often with edges in the [20-25c] band where the market is actively repricing. Losses are universally adverse-selection events: the CLV gap is +45-65c between winners and losers across every sport.

Actionable next steps (NOT shipped today — for after v12.18 observation)

Tighten gates on polling-triggered fills (e.g. require ≥15c edge or ≥65% fair_prob), keep gates loose on score-change-triggered fills
Investigate the fair_prob_c=0 logging anomaly in NBA / NCAAMB / NCAAWB — same family as the March 31 LOL anomaly
Per-sport hour-of-day filters — pause specific sports during their worst hours (e.g. NBA during peak US market hours)
Consider explicit slippage-aware sizing — tighten size when slippage exceeds 3c or refuse to fill at >5c slippage
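
The slippage-aware sizing idea in the last item could look like this. The 3c and 5c thresholds come from the list above; the 50% size reduction and the function name are assumptions, since nothing here has shipped.

```python
def slippage_adjusted_size(base_size, slippage_c, soft_c=3.0, hard_c=5.0,
                           reduced_fraction=0.5):
    """Return the share size to submit, or 0.0 to skip the fill entirely."""
    if slippage_c > hard_c:
        return 0.0  # refuse fills past the hard limit
    if slippage_c > soft_c:
        return base_size * reduced_fraction  # hypothetical 50% haircut
    return base_size
```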

Why this is documented but not shipped. Today's session over-shipped on gate tuning based on insufficient per-bucket samples. The patterns above are real and replicable, but the right move is to OBSERVE v12.18+v12.19 for 1-2 weeks, accumulate clean data with the new gates and the orphan bugs fixed, then revisit. Anchoring premature filter rules to thin samples is exactly what sent today's session in circles.

v12.18 EXPERIMENT April 30, 2026

Flood-gates opened — testing whether W17 was a regime shift or the new normal

Why this exists. Today's session shipped 8+ gate tightenings (v12.6 through v12.17), each individually justified by the recent week's calibration data. But the recent week (W17, Apr 21-27) was the bot's WORST week ever (38% WR, −$17.62). The gates were over-fit to W17. During W15 (Apr 7-13) the bot ran with much wider bands and was 60% WR / +$6.30. Either W15 was lucky or W17 was a regime shift. We can't know without fresh trades under wider gates.

What's running now.

ATP: floor 60 (unchanged), max_edge 20 → 30
WTA: floor 70 → 65, max_edge 25 → 30
MLB: floor 70 → 60, max_edge 22 → 30
NHL: max_edge 22 → 30
NBA: max_edge 20 → 25
LOL: max_edge 22 → 30
SOCCER: re-enabled (was disabled since 2026-04-23)
CS2: stays disabled — structurally broken across ALL bands historically (51% ECE [20-25c], 39% ECE [25c+], all post-hoc calibration candidates rejected today). Re-enabling adds volume with confidence of loss.

Why we can run this experiment safely now. Today's bug fixes (v12.14/15) and discipline work give us safety nets that didn't exist this morning:

Trade logging now actually writes to trades.jsonl reliably (no more orphans)
v12.15's max_positions_per_match=1 safety is now actually enforced (was disabled by the same TypeError bug that orphaned the LOL trades)
Daily live calibration alerter (v12.7) is running on cron — will fire if any sport bleeds >$5 in 14 days
Auto kill switch on daily losses already in place
Shadow eval log captures every evaluation, so we can compare what-the-bot-saw vs what-actually-happened on the loosened bands

What we're explicitly testing. Before today's session, the bot bled hard for 2+ weeks. We attributed that to gate-level miscalibration. But (a) Elo was 3 weeks stale and only refreshed today, (b) several model retrains shipped today (v12.9/12/13), (c) the orphan bugs were silently disabling the per-match safety. So the recent bleed has THREE possible causes: (1) bad gates, (2) stale models, or (3) broken safeties. Today's other ships addressed (2) and (3). v12.18 tests whether that's enough WITHOUT the aggressive gates.

Decision criteria for next 7-14 days.

Volume target: ~140-200 trades / week (W14-W15 levels)
WR target: >50% (W15 was 60%, W14 was 49%)
Per-bucket gate: max bucket error ≤ 8pp on resolved trades
Exit if any sport: bleeds > $10 in 14 days — the alerter will fire

Honest framing. This is an experiment, not a confidence move. The walk-forward gate REJECTED MLB at 65 today (held-out P&L −$1.72 vs zero-filter −$1.07). The gate's data is W17-anchored, so it might be wrong about W15-band volumes. We accept up to ~$10-20 of expected loss over the next 14 days as the price of finding out empirically. If wrong, revert to v12.17. If right, the bot is back at its W15 productivity with today's calibration improvements baked in.

v12.16 April 30, 2026

Per-sport max_edge_c tightened to kill the falling-knife band

What this targets. Today's calibration deep-dive surfaced a near-universal pattern: ECE balloons when |edge_c| crosses ~20c. The bot detects "the market is wrong", but the market is right and is pricing in within-game state the model doesn't see (the GIANTX iTero falling-knife earlier today, ATP [20-25c] band at 32% ECE, NBA [20-25c] at 61pp gap, etc.). The fix is per-sport edge caps calibrated to the data.

Per-sport changes (all walk-forward validated).

WTA max_edge_c_wta: 35 → 20
LOL max_edge_c: 35 → 18 (held-out P&L +$3.97)
MLB max_edge_mlb: 25 → 20 (held-out P&L +$1.18)
NHL max_edge_nhl: 25 → 22 (held-out P&L +$0.57)
NBA max_edge_nba: 25 → 15 (held-out P&L +$3.53)
ATP max_edge_c: already at 20 from 2026-04-25 work, unchanged
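
As a sketch, the per-sport caps listed above can be applied as a single lookup. The real config uses separate keys per sport (max_edge_c_wta, max_edge_mlb, and so on); the MAX_EDGE_C dict, the inclusive boundary, and the fail-closed default for unknown sports are assumptions for illustration.

```python
# Caps in cents, copied from the per-sport list above (ATP unchanged at 20).
MAX_EDGE_C = {"WTA": 20, "LOL": 18, "MLB": 20, "NHL": 22, "NBA": 15, "ATP": 20}


def passes_edge_cap(sport, fair_c, ask_c, caps=MAX_EDGE_C):
    """True if |fair - ask| is at or below the sport's cap."""
    edge_c = fair_c - ask_c
    cap = caps.get(sport)
    if cap is None:
        return False  # hypothetical fail-closed default for unknown sports
    return abs(edge_c) <= cap
```

For example, an LOL signal with fair 64.4c against a 40c ask has a 24.4c edge and would be rejected under the 18c cap.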

Why the analysis chain led here. Edge-stratified ECE on 668 resolved trades showed:

ATP: 5-20c sweet, 20-25c band 32% ECE (caught by existing v12.6 cap=20)
LOL: 5-15c roughly OK, [20c+] band 51.7% ECE — classic falling-knife
NBA: [20-25c] band has +61pp gap (model says 81%, actual 20%)
WTA: above 15c uniformly bad (small sample but unambiguous)

Volume impact. Tightening from 25 to 15-22 cuts an estimated 15-30% of trades per sport. The cut volume is the worst-calibrated subset — the trades where the bot was bleeding hardest. Held-out P&L improved on every sport that passed the gate; the regress-on-ship gate auto-rejected MLB at 15 (kept at 20) and WTA at 15 (kept at 20).

Combined with v12.15. v12.15 restored the per-match position cap that was disabled by the LoLPosition TypeError. v12.16 caps the edge at which any single trade may fire. Together these are the two most important LOL-bleed mitigations: the bot can no longer fire 5x on the same match, AND individual fires are blocked in the edge band where the model has been most wrong.

v12.15 BUG FIX April 30, 2026

LOL bot was orphaning fills AND its max-positions-per-match safety was broken by the same bug

What broke. The LOL bot's _execute_entry() function constructs a LoLPosition dataclass passing fair_prob_c_raw=fair_c_raw as a kwarg. But LoLPosition has no such field, and fair_c_raw wasn't even in _execute_entry's scope (it's defined in the caller's scope). Every successful FOK fill since this code was added would raise TypeError immediately after the order returned, BEFORE: (1) the FILLED log line, (2) the log_trade() call to trades.jsonl, AND (3) the _entries_per_match[match_id] += 1 increment.

Symptom that surfaced it. User noticed a Polymarket position of 36+ shares on GIANTX iTero vs UCAM Esports Club with cost ~$15.89, but trades.jsonl had ZERO records for this match. Five separate fills happened today (15:57, 16:21, 16:26, 16:31, 16:36) at mostly declining prices (58c → 43c → 41c → 40c → 42c). Verified against Polymarket's data-api: every fill matches a bot POST /order "200 OK" within 7-12 seconds. Transaction hashes confirmed on-chain.

Why the bot fired 5 times on the same match. max_positions_per_match=1 should have blocked fires 2-5. But because the TypeError happened BEFORE the entries-counter increment, the counter stayed at 0 forever. Each iteration passed the match_entries < 1 check and fired again. Only the per-signal cooldown (~5 min) throttled the cadence. The same bug both orphaned the trades AND disabled the per-match exposure safety.
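
The failure mode is easy to reproduce in isolation. The sketch below uses stand-in names (Position, execute_entry, entries_per_match) rather than the bot's real code, and hardcodes the extra kwarg value; it only demonstrates how an exception raised before the counter increment leaves the per-match cap permanently open.

```python
from dataclasses import dataclass


@dataclass
class Position:  # stand-in for LoLPosition, without fair_prob_c_raw
    match_id: str
    entry_c: float


entries_per_match = {"m1": 0}
trade_log = []


def execute_entry(match_id, entry_c):
    # In the real flow the order has already filled by this point.
    pos = Position(match_id=match_id, entry_c=entry_c, fair_prob_c_raw=64.4)
    trade_log.append(pos)             # never reached
    entries_per_match[match_id] += 1  # never reached


for price in (58, 43, 41):
    if entries_per_match["m1"] < 1:  # per-match cap check
        try:
            execute_entry("m1", price)
        except TypeError as exc:
            print("swallowed:", exc)

print(entries_per_match, trade_log)  # counter stays 0, log stays empty
```

All three iterations pass the cap check, because the dataclass constructor raises TypeError on the unexpected keyword before either bookkeeping line runs.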

What shipped.

Added fair_prob_c_raw: float = 0.0 field to LoLPosition dataclass (line 715)
Added fair_c_raw as parameter to _execute_entry() with default 0.0
Caller now passes fair_c_raw=fair_c_raw through to the helper
Defensive getattr(self.tracker, "_score_change_ts", dict()) on the score-change timestamp lookup (same pattern as v12.14 tennis fix)

Reconstruction. All 5 GIANTX iTero orphans recovered from Polymarket's data-api (using the wallet's POLY_FUNDER address) and appended to trades.jsonl with reconstructed=true and the on-chain tx_hash for audit. Combined with the 5 tennis orphans recovered in v12.14, today's session has surfaced and recovered ~$27 in untracked but real fills.

Deeper finding (NOT fixed today): falling-knife on stale model state. The LOL model's fair_prob comes from series state (e.g. "1-0 in BO3") + Elo. It does not see within-game state (gold/kills/towers/dragons in the active game). When game 2 was going hard against GIANTX, the market priced GIANTX iTero from 58c down to 40c. The model didn't update — it still saw "1-0 series" and reported 64.4c fair. The bot computed widening edge (64.4 − 40 = 24.4c) and bought 4 more times. Even with v12.15 restoring the per-match safety, this divergence pattern is a structural model limitation. Future work: feed within-game state features (or a price-velocity guard that pauses buying when ask drops faster than X cents/minute).
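
One possible shape for the price-velocity guard mentioned above, as a sketch only. The threshold X is left as a parameter because no value has been chosen; PriceVelocityGuard, the 5-minute window, and the method names are hypothetical.

```python
from collections import deque


class PriceVelocityGuard:
    """Pause buying when the ask falls faster than a cents-per-minute limit."""

    def __init__(self, max_drop_c_per_min, window_s=300.0):
        self.max_drop = max_drop_c_per_min
        self.window_s = window_s
        self._asks = deque()  # (timestamp_s, ask_c), oldest first

    def observe(self, ts_s, ask_c):
        self._asks.append((ts_s, ask_c))
        # Drop observations that have aged out of the window.
        while self._asks and ts_s - self._asks[0][0] > self.window_s:
            self._asks.popleft()

    def should_pause(self):
        if len(self._asks) < 2:
            return False
        (t0, a0), (t1, a1) = self._asks[0], self._asks[-1]
        minutes = (t1 - t0) / 60.0
        if minutes <= 0:
            return False
        drop_per_min = (a0 - a1) / minutes
        return drop_per_min > self.max_drop
```

A guard like this acts on the market's behavior rather than the model's output, so it can catch the case where the model's fair value is stale and the computed edge keeps widening as the ask falls.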

Trust note. Today's session also surfaced the existence of a Mac-side bot infrastructure (~/Library/LaunchAgents/net.zenhodl.bot.plist) running unified_bot.py --mode live without the --disable cs2,soccer flag the VPS uses. The Mac bot's order_version_mismatch errors during VPS bot activity were the smoking gun that two parties were hitting the same wallet. After tracing every fill via Polymarket data-api timestamps, every share is accounted for as bot activity (no unauthorized access). Recommend reviewing the Mac LaunchAgent configuration to align with the VPS execution model so we don't have two bots competing for the same wallet's order-counter.

v12.14 BUG FIX April 30, 2026

Tennis trade-log was orphaning fills since 2026-04-28

What broke. On 2026-04-28 a feature added is_score_change tracking to tennis trades. The new code accessed self._score_change_ts on the runner, but that attribute is owned by the TennisGameTracker (the polling object), not the runner. Every successful tennis fill since then raised AttributeError inside the trade-log block. The exception was caught (the FOK order had already filled, so the position existed on Polymarket), but the trade record never got written to trades.jsonl.

Symptom that surfaced it. The user noticed zero recorded trades for 2026-04-29 and 2026-04-30 despite many EVAL events firing. Investigation found 5 orphan FILL log lines with no corresponding entry in trades.jsonl:

2026-04-29 11:08 R. Sramkova 2.1 shares @ 58c ($1.22)
2026-04-30 16:10 Haddad Maia B. 4 shares @ 62c ($2.48)
2026-04-30 16:16 Haddad Maia B. 4 shares @ 62c ($2.48)
2026-04-30 16:30 Haddad Maia B. 4 shares @ 62c ($2.48)
2026-04-30 16:36 Haddad Maia B. 4 shares @ 62c ($2.48)

Total $11.14 of real fills that the bot's books didn't see. Settlement, P&L, and CLV tracking would all have missed them.

What shipped. One-line fix: getattr(self.tracker, "_score_change_ts", {}).get(match_id, 0). Defensive lookup on the correct object so even if tracker hasn't been initialized yet, the trade-log block doesn't fail. Verified working on the 17:19 Haddad Maia fill that immediately followed the deploy.
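
The one-line fix in context, as a minimal sketch. Tracker and Runner are stand-ins for TennisGameTracker and the tennis runner; only the getattr(...).get(...) expression is taken from the fix itself.

```python
class Tracker:
    """Stand-in for TennisGameTracker; may not have set _score_change_ts yet."""


class Runner:
    def __init__(self, tracker):
        self.tracker = tracker

    def score_change_ts(self, match_id):
        # Look up on the tracker (the owner), defaulting at both levels so the
        # trade-log block cannot raise if the dict or the key is missing.
        return getattr(self.tracker, "_score_change_ts", {}).get(match_id, 0)
```

With an uninitialized tracker the lookup returns 0 instead of raising, so the trade record still gets written.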

Reconstruction. All 5 orphan trades were reconstructed from log lines and appended to trades.jsonl with reconstructed=true and reconstructed_reason tagging so reconciliation tools can identify them. Token_id, entry_price, fair_prob, and edge_c came directly from the log; game_id for Sramkova was a placeholder since it wasn't in scope at FILL time.

Lesson. The bot's "fill still valid" exception handler was correct (don't lose track of an already-filled order just because logging fails) but the silent logger.error didn't raise an alert anywhere. Next reliability work: surface these errors via the same alerter that fires on calibration drift, so a future "fill but no record" pattern triggers an immediate operator notification rather than waiting for the user to spot the trade gap.

Side findings during this investigation.

Tour misclassification: Haddad Maia (a WTA player) is being tagged as sport=ATP in recent records. The earlier 4/27 trade tagged her correctly as WTA. Bug to investigate — likely affects which gate the v12.7 floor applies (ATP min=60 vs WTA min=70). The 69.2c fair_prob on Haddad Maia would have been blocked under WTA but passed as ATP.
March 31 LOL anomalies: 3 LOL trades on 2026-03-31 (JD Gaming 116.84 shares, two Forsaken handicap bets) all show fair_prob_c=0, edge_c=0 — the prediction pipeline returned no model output, but the trades fired anyway. All 3 lost; total impact −$22.20. Probably an artifact of bot startup before the prediction pipeline was fully wired.

v12.13 April 30, 2026

WTA pre-match adds rank/age features; ATP rejects them

What shipped. Three new features added to the WTA model: rank_diff (p1_rank − p2_rank), rank_points_log_diff (difference of log WTA ranking points), and age_diff (years). These come from the Sackmann row directly, so they're already AS-OF the match. Final per-player snapshots are stored in the pickle for inference; defaults (rank=300, points=500, age=25) are used for unknown players.
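
A sketch of how the three features could be built at inference time from the stored snapshots. The snapshot layout, the function name, and the exact form of rank_points_log_diff (log p1 points minus log p2 points) are assumptions; the defaults are the ones stated above.

```python
import math

# Defaults for unknown players, as stated above.
DEFAULT_SNAPSHOT = {"rank": 300, "points": 500, "age": 25.0}


def rank_age_features(p1, p2, snapshots):
    """Build the three WTA features from per-player snapshots."""
    s1 = snapshots.get(p1, DEFAULT_SNAPSHOT)
    s2 = snapshots.get(p2, DEFAULT_SNAPSHOT)
    return {
        "rank_diff": s1["rank"] - s2["rank"],
        # Assumed form: difference of log ranking points.
        "rank_points_log_diff": math.log(s1["points"]) - math.log(s2["points"]),
        "age_diff": s1["age"] - s2["age"],
    }
```

When both players are unknown, every feature comes out as zero, so the model falls back on its other inputs.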

WTA result.

Brier 0.1576 → 0.1563 (−0.0013, on top of v12.12's lift)
ECE 2.67% → 2.31%
AUC 0.8507 → 0.8517
Max bucket error: 6.08pp → 5.84pp

ATP rejection. Same features tested on ATP REGRESSED: Brier 0.1453→0.1474, ECE 1.98%→2.45%, max-bucket 6.75pp→12.02pp. ATP's Elo already correlates strongly with rank (men's tour has more match volume per player, so Elo converges to a stable estimate that captures the same signal as rank). Adding rank features creates redundancy that the model overfits, concentrating predictions in narrow bands. ATP keeps its v12.9 schema.

Cumulative WTA today. Pre-v12.9 baseline Brier 0.1590, ECE 3.27%. After v12.12 + v12.13: Brier 0.1563 (−0.0027 total), ECE 2.31% (−0.96pp). That is roughly twice the Brier lift ATP got in v12.9 alone (−0.0014), achieved via two compounding changes (overall serve_pct + rank features) instead of one.

Staleness caveat. Rank/age are snapshotted at training time (last match in 2024 data). At inference today, they're 16+ months stale. For top-10 players this matters less (rank churns slowly there); for journeymen it matters more. Refresh path: every retrain regenerates these snapshots. A future version could pull from the live ATP/WTA rankings JSON for fresher values, but for the current Brier lift, the snapshot is sufficient.

Pattern this confirms. Today's session keeps surfacing the same lesson: ATP and WTA need different feature subsets. The TOUR_FEATURE_COLS dict introduced in v12.12 made this clean — ATP excludes rank features, WTA excludes surface-specific serve features. Both tours train from the same pipeline, with the model schema declared per-tour and stored in each pickle.

v12.12 April 30, 2026

WTA pre-match XGB now uses overall serve_won_pct (subset of v12.9)

Why this shipped. v12.9 deferred WTA because the [20-30%] bucket regressed from −6.8pp baseline to −12.1pp candidate. Today's hypothesis: the regression came from p1/p2_serve_pct_surf (20-match per-(player, surface) deque), not the overall serve features. WTA has materially smaller per-surface samples than ATP — women's tour shifts more between surfaces, so the deque mean is noisier.

Result.

Brier 0.1590 → 0.1576 (−0.0014, same magnitude as ATP's v12.9 lift)
ECE 3.27% → 2.67%
AUC 0.8484 → 0.8507
Max bucket error: 6.78pp → 6.08pp (the [20-30%] bucket dropped from −6.8pp to −0.8pp)
All 10 buckets pass the ≤8pp ship gate

Implementation. Added TOUR_FEATURE_COLS dict to train_tennis_wp.py so each tour can declare its own feature subset. ATP keeps all 4 serve features (overall + surface-specific). WTA drops the 2 surface-specific ones. The pickle stores per-tour feature_cols, so inference reads the right schema automatically.
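
A minimal sketch of what a per-tour schema dict like this could look like. The four serve feature names come from v12.9; the elo_diff base column and the build_row helper are assumptions for illustration, not the contents of train_tennis_wp.py.

```python
# Base columns are an assumption for illustration; only the four serve
# feature names come from the changelog.
BASE_COLS = ["elo_diff"]
SERVE_OVERALL = ["p1_serve_pct_overall", "p2_serve_pct_overall"]
SERVE_SURF = ["p1_serve_pct_surf", "p2_serve_pct_surf"]

TOUR_FEATURE_COLS = {
    "ATP": BASE_COLS + SERVE_OVERALL + SERVE_SURF,  # all 4 serve features
    "WTA": BASE_COLS + SERVE_OVERALL,               # drops the 2 surface ones
}


def build_row(tour, features):
    """Select feature values in the order the tour's schema declares."""
    cols = TOUR_FEATURE_COLS[tour]
    missing = [c for c in cols if c not in features]
    if missing:
        raise KeyError(f"missing features for {tour}: {missing}")
    return [features[c] for c in cols]
```

Storing the chosen column list next to the model means training and inference cannot silently disagree about column order.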

Lesson. Today's session is a small clinic on feature-engineering discipline: serve_pct_overall and serve_pct_surf carry related but not interchangeable signal. ATP has enough surface-specific volume that the 20-match deque is fine. WTA doesn't, so the deque is noisy — and that noise concentrated in the worst possible bucket [20-30%]. Dropping just the noisy variant kept the entire signal lift while removing the regression.

v12.11 REJECTED April 30, 2026

Retirement-flag feature tested, rejected by ship gate (Tier 3.2)

What was tried. Added p1_last_match_ret / p2_last_match_ret binary features to ATP pre-match XGB. The flag is set to 1 when a player retired (RET) in their immediately-previous match, otherwise 0. Hypothesis: a player who just retired is likely injured/under-recovered and should win less than Elo predicts in their next match.

Result on holdout 2024.

Brier 0.1453 → 0.1457 (regressed by +0.0004)
ECE 1.98% → 2.07% (regressed by +0.09pp)
AUC 0.8735 → 0.8729 (regressed by −0.0006)
Two per-bucket gaps near 7pp ([30-40%] −7.1pp, [60-70%] −7.2pp), both larger than v12.9's max bucket error

Why it failed. ATP RET rate is only 2.39% of matches in 2020-2024. With a binary flag, the feature is 0 in >95% of training rows; XGBoost couldn't extract reliable signal from such sparse positives. The hypothesis may still be correct, but a binary indicator carrying nearly all 0s isn't the right encoding. A weighted/decayed retirement history (e.g. exponentially-weighted RET count over last 5 matches, normalized by match minutes when available) would carry more density.
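
One possible shape for the denser encoding suggested above, sketched under assumptions: the decay value of 0.5, the function name, and the input layout (most recent match first) are all hypothetical, and the minutes normalization is left out. This variant has not been tested against the ship gate.

```python
def decayed_ret_score(recent_ret_flags, decay=0.5, window=5):
    """Exponentially weighted retirement history over the last `window` matches.

    recent_ret_flags: iterable of 0/1, most recent match first, where 1 means
    the player retired (RET) in that match. Returns a value in [0, 1].
    """
    flags = list(recent_ret_flags)[:window]
    weights = [decay ** i for i in range(len(flags))]
    total = sum(weights)
    if total == 0:
        return 0.0  # no match history at all
    return sum(w * f for w, f in zip(weights, flags)) / total
```

A retirement in the most recent match scores higher than one five matches ago, so the feature carries recency information that a last-match-only binary flag discards.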

Action. Reverted both tennis_wp_model_ATP.pkl and train_tennis_wp.py to v12.9. The rejection is documented as a comment in the training script so the next attempt knows not to retry the binary form.

Lesson. v12.4 was when we learned in-sample looking good doesn't mean ship-ready. v12.11 is a different lesson: a hypothesis that's physically plausible can still fail because of how it's encoded. Sparse-positive binary flags lose signal compared to denser continuous variants.

v12.10 April 30, 2026

Tennis position surface tagging activated (Tier 3.3 phase 1)

What this fixes. Tier 3.3 (per-surface live calibration tables) was originally scheduled as a "data exists, just stratify it" task. Today we discovered the data doesn't exist: TennisPosition had no surface field, so the recalibrator's getattr(pos, "surface", "") always returned empty. All 11,595 historical tennis records in the recal buffer have no surface tag. Same for trades.jsonl — can't backfill.

What shipped today. Added surface field to the TennisPosition dataclass and wired it at position creation (surface=match.surface or "Hard"). Going forward, every newly-opened ATP/WTA position will carry surface, the recalibrator will tag the buffer record, and per-surface tables become possible after data accumulation.
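
A minimal sketch of the wiring described above. Only the surface field and the match.surface or "Hard" fallback come from this entry; the other fields, the Match class, and open_position are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Match:
    # Hypothetical stand-in for the live match object.
    match_id: str
    sport: str
    surface: Optional[str] = None


@dataclass
class TennisPosition:
    # Fields other than `surface` are placeholders for illustration.
    match_id: str
    sport: str
    surface: str = ""


def open_position(match):
    """Create a position, defaulting surface to "Hard" when the feed omits it."""
    return TennisPosition(
        match_id=match.match_id,
        sport=match.sport,
        surface=match.surface or "Hard",
    )
```

With the field present, a recalibrator call such as getattr(pos, "surface", "") returns the real tag instead of always falling through to the empty default.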

What's deferred. Tier 3.3 phase 2 (the actual per-surface tables) waits until the buffer has ≥100 samples per (sport, surface). At current trade rate (~50 tennis trades/day, split across 3 surfaces), this is ~7-14 days for hard, longer for grass (no grass season currently). The validation gate already exists; only the stratifier needs to be added to auto_recalibrate.py when the data is ready.

Lesson learned. "The data tags surface" turned out to mean "the recorder API accepts surface" — not that surface was actually being passed. Verifying assumptions early (today the very first sanity check was Counter[(sport, surface)] on the buffer) saves a lot of wasted effort downstream.
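
The sanity check mentioned above can be sketched in a few lines. The record layout (dicts with sport and surface keys) is an assumption about the buffer format; the function names are hypothetical.

```python
from collections import Counter


def surface_coverage(records):
    """Count buffer records per (sport, surface); missing surface counts as ""."""
    return Counter((r.get("sport", ""), r.get("surface", "")) for r in records)


def untagged_count(counts):
    """Number of records that carry no surface tag."""
    return sum(n for (_sport, surface), n in counts.items() if not surface)
```

If every key comes back with an empty surface, the stratification step has nothing to work with, which is the situation this entry describes.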

v12.9 April 30, 2026

ATP pre-match XGB now uses surface-stratified serve_won_pct (Tier 3.1)

What shipped. Four new features added to the ATP pre-match model: p1_serve_pct_surf, p2_serve_pct_surf, p1_serve_pct_overall, p2_serve_pct_overall. Each is a rolling mean of (1stWon + 2ndWon) / svpt from the player's prior matches (last 20 on this surface, last 30 overall), computed AS-OF the match date so there is no leakage. Surface-specific values back off to overall, which back off to a tour default of 0.62 for players with no history.
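
The rolling means and the back-off chain can be sketched as below. The window sizes (20 and 30), the 0.62 default, and the (1stWon + 2ndWon) / svpt formula come from the paragraph above; the ServeTracker class and its method names are hypothetical, and the real training script may be organized differently.

```python
from collections import defaultdict, deque

TOUR_DEFAULT = 0.62  # tour-wide fallback for players with no history


class ServeTracker:
    def __init__(self, surf_window=20, overall_window=30):
        self.surf = defaultdict(lambda: deque(maxlen=surf_window))
        self.overall = defaultdict(lambda: deque(maxlen=overall_window))

    def features(self, player, surface):
        """Return (serve_pct_surf, serve_pct_overall) from prior matches only.

        Call this BEFORE update() for the same match so the values are as-of
        the match date and never include the match being predicted.
        """
        ov = self.overall.get(player)  # .get does not create an entry
        overall = sum(ov) / len(ov) if ov else TOUR_DEFAULT
        sf = self.surf.get((player, surface))
        surf = sum(sf) / len(sf) if sf else overall  # back off to overall
        return surf, overall

    def update(self, player, surface, first_won, second_won, svpt):
        """Record one finished match's serve-points-won rate."""
        if svpt <= 0:
            return  # no serve stats recorded for this match
        pct = (first_won + second_won) / svpt
        self.surf[(player, surface)].append(pct)
        self.overall[player].append(pct)
```

Processing matches in date order and reading features before updating is what keeps the rolling means free of leakage.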

Why this works. Elo captures who wins, but two players with identical Elo on hard court can have very different game styles — a serve-bot wins more by holding, a returner wins more by breaking. Serve_won_pct lets the model encode that asymmetry. Surface stratification matters because serve performance differs between hard and clay for many players: hold rates can shift ~10pp, and serve-points-won shifts by several points (e.g. Djokovic hard 70.4% vs clay 65.6%).

Holdout 2024 results (ATP).

Brier 0.1467 → 0.1453 (−0.0014, beats +0.002 ship gate)
ECE 2.29% → 1.98%
AUC 0.871 → 0.874
All per-bucket gaps within ±7pp; largest is [40-50%] at −6.7pp (n=239), unchanged from baseline pattern
1016 players have direct serve_overall coverage; 1943 (player, surface) entries

WTA: deferred. Same training pipeline ran on WTA: Brier 0.1590→0.1577, ECE 3.27%→2.62% (better overall) — BUT the [20-30%] bucket regressed from −6.8pp (n=285) to −12.1pp (n=111), exceeding the 8pp per-bucket ship-gate ceiling. Even though net metrics improved, a single localized regression on a small bucket can mean a few specific match types (e.g. underdog clay specialists) get systematically miscalibrated. The disciplined call: revert WTA to the pre-v12.9 model and ship ATP only. WTA will re-attempt with feature subset experiments before promotion.

Honesty about expected live impact. Pre-match calibration improvements are small in absolute terms (~0.0014 Brier) and the model is just one input to live trading; in-play state evolution dominates. We expect a modest but real lift in tournament-opening trades where pre-match priors carry the most weight. The alerter and shadow eval log will measure real-world impact over the next 14 days.

v12.8 April 30, 2026

MLB + CS2 floors raised to 70 (alerter-driven, walk-forward-validated)

Why this shipped. The v12.7 alerter (its first full day live) flagged three sports: MLB, CS2, and SOCCER. Each was triaged with the same playbook used for ATP/WTA — two_pop_calibration.py to confirm the bleed is real, then walk_forward_cli.py to validate any candidate fix on chronologically-held-out data before shipping.

MLB — min_fair_wp_mlb_c = 70 (was global default 55). The alerter caught a textbook bucket-drift signature: the [60-70%] band's gap drifted from -11.1pp baseline → +38.0pp recent (a 49.1pp swing). Recent 30 trades: 33% WR, −$6.38 P&L, ECE 32%. Edge-band stratification showed high-edge trades losing more (the v12.4-style selection-bias-amplified pattern). Walk-forward verdict on the held-out 30%: P&L −$1.07 → +$0.97, WR 58.5% → 75.0%, ECE 12.6% → 7.4%. Volume cut: −63%. The cut volume was net negative.

CS2 — min_fair_prob_c = 70 (raised from 60). Different signature: not regime-shift drift, but persistent miscalibration across all bands. Baseline 267 trades had −$13.54 P&L and 39.8% ECE; recent 14d had 38.6% WR and ECE 27%. Per-band recent: [50-60%) 25% WR, [60-70%) 29% WR vs predicted 55-66%. Walk-forward held-out: P&L −$11.22 → −$3.32 (+$7.90), WR 40% → 52.5%. Still not winning, but it cuts the bleed by about 70%. CS2 is currently disabled on the unified bot (--disable), so the gate takes effect on re-enable.

SOCCER — deferred. Discipline test: SOCCER had only 35 total resolved trades (n=18 recent). Walk-forward at min=70 returned INSUFFICIENT_DATA (only 3 held-out trades). Even min=50 "passed" the gate, but ECE worsened from 38.7% to 50.1%. The disciplined call is to NOT ship a config change on n=3 evidence. SOCCER is also currently disabled on the unified bot (--disable). The alerter will re-fire when more data accumulates and we'll re-evaluate then.

What this validates. The v12.7 monitoring infrastructure paid for itself in 24 hours: caught three real bleeds we would not have noticed without it. The walk-forward gate then prevented one of those (SOCCER) from being a hasty same-day fix on a too-thin sample. Selection-bias awareness + walk-forward + alerter together = the system catching its own drift and rejecting its own knee-jerk responses.

v12.7 April 30, 2026

WTA floor → 70 + Tier 1 monitoring infrastructure (selection-bias-aware)

Why we shipped this. Same diagnostic methodology applied to WTA that we used for ATP in v12.6 surfaced two things at once: (1) WTA was bleeding badly and (2) the v12.6 ATP fix had inadvertently made WTA worse. WTA bot trades show 41% WR and ECE 25% on the bot-selected sample — vs 3.27% ECE on the representative-population holdout. Same selection-bias signature as ATP, but ~2x bigger.

The unintended-consequence bit. v12.6 lowered the shared min_fair_prob_c from 63 → 60 to re-enable an ATP-profitable bucket. But that floor was shared between ATP and WTA. WTA's [60-63%) bucket is decisively LOSING (n=3, 33% WR, +29.5pp gap) — the OPPOSITE sign of ATP's [60-63%) which is profitable. The fix: split the config knob.

What we shipped.

Added min_fair_prob_c_wta = 70.0 as a tour-specific override. ATP keeps its v12.6 value of 60. WTA gets a much higher floor because the entire [50-70%] range is bleeding (n=22 in [60-70%), 35.7% WR, +30pp gap).
Volume effect: WTA volume drops ~68% (from 41 to ~13 trades over the same 14-day window). That's a steep cut, but the cut volume was bleeding at −$3.35 net P&L. Stops bleeding before adding back signal.

Tier 1 monitoring infrastructure (the meta-fix). Today's session caught the WTA bleed only because we happened to look. Real fix is infrastructure that flags this without us looking. Three new tools shipped:

two_pop_calibration.py — the diagnostic that prevents the "live ECE looks bad → retrain the model" mistake. Always reports BOTH (A) representative-population ECE and (B) bot-selected ECE side by side, plus the gap (C). Decisions are made on (A); the gap shows selection-bias magnitude. Today's gaps: ATP +10.6pp, WTA +21.8pp — both selection bias, neither indicating model retrain.
walk_forward_cli.py — before any threshold change ships, applies it retroactively to the trade log, splits chronologically (70/30), and reports whether the held-out tail regresses on P&L / WR / Brier. Returns PASS / REJECT exit code so it can gate deployment scripts. Caveat: only validates TIGHTENING (filters that drop trades); LOOSENING needs the shadow log (Tier 2.1).
live_calibration_alert.py — daily cron that flags BLEED (recent P&L below threshold), WIN_RATE_LOW (below 40%), and BUCKET_DRIFT (a bucket's calibration gap rose >10pp vs the prior baseline window). Selection-bias-aware: alerts on CHANGE, not on absolute ECE. Cron installed at 12:15 UTC daily.
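
The chronological hold-out check that walk_forward_cli.py performs can be sketched roughly as follows. This is a simplified illustration under assumptions: the trade fields (ts, fair_prob_c, pnl), the function name, and the P&L-only verdict are hypothetical, while the actual tool also reports WR and Brier.

```python
def walk_forward_check(trades, keep, train_pct=70):
    """Apply a tightening filter to the newest (100 - train_pct)% of trades.

    trades: list of dicts with "ts" and "pnl" (assumed layout).
    keep:   predicate returning True for trades the candidate filter allows.
    """
    ordered = sorted(trades, key=lambda t: t["ts"])
    cut = len(ordered) * train_pct // 100  # integer math avoids float rounding
    held_out = ordered[cut:]
    kept = [t for t in held_out if keep(t)]
    pnl_before = sum(t["pnl"] for t in held_out)
    pnl_after = sum(t["pnl"] for t in kept)
    # Simplified verdict: the filtered tail must not lose more than the
    # unfiltered tail, and must still contain at least one trade.
    verdict = "PASS" if kept and pnl_after >= pnl_before else "REJECT"
    return {
        "verdict": verdict,
        "pnl_before": pnl_before,
        "pnl_after": pnl_after,
        "n_before": len(held_out),
        "n_after": len(kept),
    }
```

Because the check only drops trades that actually happened, it can evaluate a tightening but has nothing to say about trades a looser filter would have opened, which is the caveat noted for the CLI.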

First run of the alerter flagged signals across the fleet: 9 alerts in 6 sports. Notable: MLB recent WR is 33.3% (n=30), bleeding −$6.38, with a [60-70%] bucket gap drift of +49pp (was −11pp two weeks ago, now +38pp); CS2 recent WR 38.6% (n=57), bleeding −$5.96; SOCCER 27.8% WR (n=18). Whether each is real drift or sample noise still needs investigation, but the infrastructure now flags them within 24 hours instead of 14 days.

What we did NOT ship today.

Tier 2 work (deferred): shadow-trade log so we can validate LOOSENING configs and measure true bot-evaluation calibration; reconstructing the v2 in-play training script (currently the v2 lite pickle exists but its training pipeline does not).
Tier 3 work (deferred): injecting tennis_serve_rates.json into the pre-match XGB (the analytical Markov already uses it, the XGB doesn't); injury / retire-last-match features; per-surface calibration tables.
Auto-pause: the alerter flags but doesn't auto-pause. Selection-bias lessons argue against auto-pause without human review — today's WTA flagged correctly, but the MLB drift could still be sample noise.

The honest meta-lesson from today. The bot's models are accurate (2-3% ECE on representative populations). The bot-selected sample's miscalibration is mostly selection bias from the trade filter, not model error. Most of the value going forward is in infrastructure that distinguishes the two and acts on the right one, not in retraining models. Today's three tools are that infrastructure. Yesterday we'd have responded to "WTA ECE is 25%" by retraining. Today we know to look at the gap, ship a config fix, and let the alerter watch the next 14 days.

Open questions for next pass: (1) MLB [60-70%] +49pp bucket drift: sample noise or regime shift? Needs investigation. (2) Several sports show recent-period bleeds; some may need their own min_fair_prob/min_edge tuning like ATP/WTA just did. (3) The walk-forward CLI cannot validate floor LOOSENING (e.g. v12.6's 63→60 ATP change) until the shadow trade log (Tier 2.1) lands.

v12.6 April 30, 2026

ATP `min_fair_prob_c` 63 → 60 — re-enable a profitable bucket the floor was blocking

Why we looked again. v12.5 rebuilt the elo data layer. Same day, the open question was whether to retrain the pre-match ATP XGB itself with calibration regularization (focal loss / Brier-augmented objective / etc.) to fix the 9.58% live ECE on 6,203 ATP recal-buffer samples. Before retraining, we asked an honest question: is the live miscalibration real model bias, or selection bias from the bot only logging predictions on trades it opened?

What the diagnostic showed.

The current model's calibration on a representative 2024 holdout (n=3,076 Sackmann matches) is 2.29% ECE / 0.1467 Brier / 0.871 AUC — already excellent.
The 12.88% ECE we measured on bot-opened trades (n=91 resolved) comes from a far smaller (~34x fewer samples than the holdout), heavily filtered subset: the bot only logs predictions on trades that passed every gate (edge band, fair-prob band, position cap, Kelly). Those gates select for the predictions where the bot disagrees most with the market, which is exactly where the qualitative information the model can't see (injuries, recent form, motivation) lives.
97% of bot trades (88/91) are on the model's favorite. The "favorite-longshot bias" we measured is on a sample dominated by one side of the distribution.
Edge-stratified ECE is non-monotonic: 12.74% / 25.89% / 14.24% across the [8,12) / [12,16) / [16,∞)c edge bins. Real model bias would be smooth; alternating buckets is a noise signature.

What we did NOT ship.

No retrain. Calibration regularization on a model that's already 2.29% ECE on a representative population would shrink predictions toward 50% globally to "fix" a bias that's mostly selection — collapsing the [60-70%] bucket where the model is well-calibrated, in service of a [50-60%] bucket where the bias is mostly artifact. The retrain ship gate (holdout Brier < 0.1467) would also be hard to clear by enough margin to justify it.
No new features (β plan). Service-rate ingest, head-to-head records, and retirement flags would add 0.001-0.005 Brier each. Won't move the needle on ATP CLV the way the WTA v2 promotion did. Deferred.
No v2 promotion. ATP v2 in-play shadow log still has only ~11 distinct match_ids; not enough to validate.

What we did ship. Per-bucket calibration on the 91 resolved ATP trades, focusing on what the existing min_fair_prob_c=63 floor blocks:

[55-60)c bucket: n=23, pred 58.0c, actual 26.1% — gap +31.9pp. Disastrous. Stays blocked.
[60-63)c bucket: n=20, pred 60.9c, actual 65.0% — gap −4.1pp (model UNDER-confident). Profitable. The floor was blocking it anyway.
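
The per-bucket gap arithmetic, and the ECE figure quoted throughout this entry, can be sketched as below. The function names are hypothetical, equal-width binning is an assumption about how ECE is computed here, and the 6-of-23 win count is inferred from the reported 26.1% rather than stated in the log.

```python
def bucket_gap_pp(pred_probs, outcomes):
    """Mean predicted probability minus realized win rate, in percentage points.

    Positive means the model was over-confident in this bucket.
    """
    if not pred_probs:
        raise ValueError("empty bucket")
    mean_pred = sum(pred_probs) / len(pred_probs)
    win_rate = sum(outcomes) / len(outcomes)
    return 100.0 * (mean_pred - win_rate)


def expected_calibration_error(pred_probs, outcomes, n_bins=10):
    """Sample-weighted mean of |mean_pred - win_rate| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(pred_probs, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    n = len(pred_probs)
    ece = 0.0
    for b in bins:
        if b:
            gap = sum(p for p, _ in b) / len(b) - sum(y for _, y in b) / len(b)
            ece += len(b) / n * abs(gap)
    return ece


# The [55-60)c bucket: n=23, pred 58.0c, 6 wins inferred from 26.1%.
preds = [0.58] * 23
wins = [1] * 6 + [0] * 17
gap = bucket_gap_pp(preds, wins)  # close to the reported +31.9pp
```

The sign convention matches the two buckets above: a positive gap is over-confidence (the blocked bucket), a negative gap is under-confidence (the re-enabled one).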

Lowered min_fair_prob_c from 63 → 60 to re-enable the [60-63)c bucket. The lossy [55-60) bucket remains blocked. Expected effect: +20-30% volume on the profitable subset, no change to the bleed.

Why this is the right call (and bigger than it looks). It's tempting to treat "retrain the model" as more rigorous than "tweak a config value." But the data says the model isn't broken — the floor was. A 1-line config change with empirical backing beats a multi-day rebuild whose ship gate it can't honestly clear. We measured first, then acted minimally.

Honest caveats: (1) bucket-level signals on n=20 have wide CIs — the [60-63) profit could compress as more data comes in, in which case we re-tighten. (2) Selection-bias diagnosis means future "live ECE looks bad" reports should always be cross-checked against representative-population ECE before any retrain decision. (3) ATP CLV is still negative; this change improves the trade mix on the margin, doesn't claim to fix structural market efficiency.

v12.5 April 30, 2026

Tennis Elo dataset rebuild — +37% player coverage, ATP/WTA from-scratch refresh

Why we looked. While diagnosing why ATP held-out CLV is negative at every max_edge_c setting (see v12.4 below), per-bin calibration on 6,203 ATP samples showed the model has classic favorite-longshot bias: it predicts 75% where reality is 62%, and 25% where reality is 47%. The auto-recalibrator's validation gate REJECTED post-hoc fixes for ATP and WTA: isotonic, Platt, and Beta calibrators all left ≥3 buckets exceeding the ±8pp tolerance. The miscalibration is non-monotonic and locally inconsistent, so it is not patchable downstream.

What we found. Tracing the prediction stack from ESPN tick to fair_c output, we discovered a bigger problem hiding in the data layout:

atp_matches_2025.csv on the VPS was 14 bytes — a previous auto-fetch had received a GitHub 404 and saved the response. Sackmann hasn't published 2025 yet, so this was technically benign (csv.DictReader silently skipped it), but it surfaced the broader staging issue.
The data/tennis/ directory the build script expects was empty — ATP CSVs lived next to the script in core/ and WTA CSVs were absent entirely (the elo file's WTA entries had been built from CSVs that no longer existed at any path on disk).
The build script (build_tennis_elo.py) only ingested main-tour matches — not Challenger or qualifying. That meant every player who plays primarily Challengers (which is most of the live ATP matches we see today, since the tour is between Madrid and Rome) defaulted to Elo 1500. Burruchaga, Forejtek, Kolar, Pellegrino, Korpatsch — all blank-slate.
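
One way to keep an error body from being saved as a CSV is to validate the payload before writing it. The sketch below is an assumption-laden illustration, not the fetch script: the function name and the 100-byte size floor are hypothetical, and the required columns are the winner_name and loser_name headers used in the Sackmann match files.

```python
import csv
import io


def looks_like_match_csv(payload, required=("winner_name", "loser_name"), min_bytes=100):
    """Return True only if the payload parses as a CSV with the expected header."""
    if len(payload) < min_bytes:
        return False  # e.g. a short error body saved in place of data
    text = payload.decode("utf-8", errors="replace")
    header = csv.DictReader(io.StringIO(text)).fieldnames or []
    return all(col in header for col in required)


error_body = b"404: Not Found"  # a 14-byte body of the kind described above
```

Writing to a temporary file and renaming only after this check passes would leave the previous good file in place when a fetch fails.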

What we shipped.

Restored full data layout: 15 Sackmann CSVs at /opt/fairprob/data/tennis/ — ATP main 2020-2024, ATP Challenger/qualifying 2020-2024, WTA main 2020-2024 (~16 MB total).
Extended build_tennis_elo.py to also ingest atp_matches_qual_chall_*.csv (Challenger + qualifying main draws). WTA stays main-only because qual_itf would dilute with low-tier ITF noise.
Rebuilt elo from scratch: 4,606 players → 6,305 players (+37% coverage), 65,289 completed matches processed.
Spot-check: top ATP players gained ~170 Elo (more wins vs Challenger-tier opponents now in the system); top WTA players lost ~60 Elo (Sabalenka, Swiatek, Haddad Maia all came down). The downward shift on WTA favorites should reduce the favorite-overestimation bias on its own; we'll measure it once new recal-buffer samples accumulate.
First post-restart sanity check: Kolar Z. vs Forejtek J. (ATP Challenger live as we shipped) shifted from model says 41% to model says 50% — a coin-flip, which is what an even-rated Challenger match should be. Pre-rebuild, Kolar had no Elo data so the model arbitrarily disfavored him.
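
For intuition on why a missing rating skews a matchup, here is the standard Elo expected-score and update rule. The K-factor of 32 and the 400-point scale are textbook defaults used as assumptions; this entry does not state the parameters build_tennis_elo.py uses, and the live model feeds Elo into an XGB rather than using this formula directly.

```python
def elo_expected(r_a, r_b, scale=400.0):
    """Probability that player A beats player B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / scale))


def elo_update(r_a, r_b, a_won, k=32.0):
    """Return both players' ratings after one match (zero-sum update)."""
    delta = k * ((1.0 if a_won else 0.0) - elo_expected(r_a, r_b))
    return r_a + delta, r_b - delta
```

Two equally rated players come out at exactly 50%, while a player stuck at the 1500 default against an opponent with real match history is priced off a rating that reflects no information about him.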

What we did NOT ship. We attempted a v1-analytical vs v2-XGB head-to-head on the atp_v2_shadow.jsonl log to decide whether to promote the existing v2 in-play model to primary (the same pattern WTA used in v11.x). The shadow log only contains 11 distinct match_ids so far — all currently in-progress — so no finished-match outcomes can be derived. Promotion decision deferred until ~7 days of accumulated shadow data is available.

What this should do for ATP volume. The user-facing question that started this thread was "how do we get more ATP trade volume correctly?" The honest answer today is: most of the way to fixing volume is fixing the data the model trains on. We've done that. The cap question (currently max_edge_c=20) is parked — we re-test it once the new elo has produced a few hundred new in-bot predictions and the recal buffer can be re-binned.

Open issues for next pass: (1) v2 promotion decision needs ≥30 finished ATP matches in shadow log; (2) the data/tennis/ directory still has no automated refresh cron — we manually fetched. Wiring a weekly Sackmann sync prevents this from rotting again.

v12.4 April 30, 2026 REVERT

Per-bucket CLV gate failed walk-forward validation — pulled

What we tried (v12.3, same day). Built a per-(sport, edge-band) CLV gate in core/clv_filter.py, populated clv_buckets.json from the 686 trades with measured CLV, and flipped the moneyline bot from shadow mode to enforce. Same gate logic was also wired directly into the CS2, LOL, soccer, and tennis bots and the MLB SCORE-FILTER was replaced with bucket-based logic. In-sample the gate looked good: gate-allowed trades had +5.56c better CLV than gate-blocked trades.

What killed it (walk-forward validation). Before letting the gate run live for any length of time, we built core/clv_gate_validation.py to do the test the v1.1 paper had flagged as missing: a chronological 80/20 split where bucket means are computed from the older 80% only and the gate is then applied to the held-out newer 20%.

In-sample (Option A): 686 trades, gate-allow CLV − gate-block CLV = +5.56c
Out-of-sample (Option B, held-out 20%): gate-allow CLV − gate-block CLV = −12.30c — the gate's allow trades were worse than its block trades on data it had never seen
"no_opinion" pass-through trades (the buckets where the gate stayed silent) outperformed both other categories in the held-out window. Whenever the gate had an opinion, that opinion was on average wrong.
Only 8 buckets in the training set met n≥20 — not enough to ride through the regime change between the older and newer windows. Distribution drift > bucket signal.
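
The out-of-sample test can be sketched as follows. This is a simplified reconstruction under assumptions, not core/clv_gate_validation.py: the trade fields (ts, sport, edge_band, clv_c), the function name, and the allow-if-mean-CLV-positive rule are hypothetical; the 80/20 chronological split and the n≥20 bucket minimum come from the text above.

```python
from collections import defaultdict
from statistics import mean


def validate_clv_gate(trades, min_n=20, train_pct=80):
    """Fit bucket means on the oldest trades, score the gate on the newest."""
    ordered = sorted(trades, key=lambda t: t["ts"])
    cut = len(ordered) * train_pct // 100
    train, test = ordered[:cut], ordered[cut:]

    by_bucket = defaultdict(list)
    for t in train:
        by_bucket[(t["sport"], t["edge_band"])].append(t["clv_c"])
    # Only buckets with enough training samples get an opinion.
    bucket_mean = {k: mean(v) for k, v in by_bucket.items() if len(v) >= min_n}

    groups = {"allow": [], "block": [], "no_opinion": []}
    for t in test:
        m = bucket_mean.get((t["sport"], t["edge_band"]))
        if m is None:
            groups["no_opinion"].append(t["clv_c"])
        elif m > 0:  # assumed gate rule: allow buckets with positive mean CLV
            groups["allow"].append(t["clv_c"])
        else:
            groups["block"].append(t["clv_c"])
    return {k: (mean(v) if v else None) for k, v in groups.items()}
```

The figure to watch is allow minus block on the held-out group: if it is negative, the gate's historical opinions did not carry forward.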

Verdict and revert. The in-sample number was overfitting. We pulled v12.3 the same day:

Moneyline bot's clv_gate_mode reverted from enforce → shadow (logs decisions, never blocks)
MLB SCORE-FILTER restored to its v12.1 score-change-only logic
Hardcoded should_trade(mode="enforce") calls removed from CS2, LOL, soccer, and tennis bots
The gate machinery (clv_filter.py, clv_buckets.json, the moneyline shadow log) stays in place — it cost nothing to keep and gives us a re-validation target once we have a larger CLV sample and a better cross-validation methodology

Why we're publishing this. We've said in the v1.1 whitepaper that we publish what didn't work. This is what that looks like in practice: a one-day round trip from "let's enforce the gate" to "the gate doesn't generalize, kill it." The validation script (clv_gate_validation.py) and the failure log are in the same place as the code that worked.

v12.1 April 30, 2026

Public CLV scorecard, sport triage, research-first reframe

Public CLV (Closing Line Value) dashboard at /clv. Per-sport closing line value across every settled trade we've ever made. The headline finding from running the full backfill: trades that beat the close win ~89% of the time; trades that lost the close win ~11%. CLV is the single strongest leading indicator of forecasting edge that exists, and almost no prediction-market vendor publishes theirs. We do — every sport, with edge-bucket breakdowns, JSON / CSV download, CC BY 4.0.

One-shot historical backfill via backfill_clv.py --commit took CLV coverage from 9% → 48.5% across all 1,415 trades by pulling Polymarket price history for each token
Per-sport coverage jumps: ATP 2% → 99%, WTA 12% → 100%, LOL 4% → 86%, Soccer 11% → 97%, CS2 0% → 43%, MLB 32% → 71%, NHL 21% → 72%
Weekly cron added: clv_backfill runs Tuesdays 04:39 UTC; future trades get CLV automatically without manual intervention. Registered in admin_cron_health with an 8-day stale window
Public endpoints: /clv.json, /clv.csv, JSON-LD Dataset schema in the page head for Google Dataset Search
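
As a rough illustration of the metric behind the dashboard, CLV for a bought token can be taken as the closing price minus the entry price, in cents. That definition, the field names (entry_c, close_c, won), and the function names are assumptions for this sketch; the exact calculation in backfill_clv.py is not described here.

```python
def clv_cents(entry_price_c, closing_price_c):
    """CLV for a bought token: positive when the close is above the entry."""
    return closing_price_c - entry_price_c


def _rate(flags):
    """Win rate of a list of booleans, or None when the list is empty."""
    return sum(flags) / len(flags) if flags else None


def win_rate_by_clv_sign(trades):
    """Split settled trades by CLV sign and report the win rate of each side."""
    beat = [t["won"] for t in trades if clv_cents(t["entry_c"], t["close_c"]) > 0]
    lost = [t["won"] for t in trades if clv_cents(t["entry_c"], t["close_c"]) < 0]
    return {"beat_close": _rate(beat), "lost_close": _rate(lost)}
```

Grouping win rate by CLV sign is the comparison behind the headline split between trades that beat the close and trades that lost it.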

Sport triage based on CLV. The CLV data did what it was designed to do: it told us which sports had real edge and which were bleeding closing-line value into the market. Result: 5 of 8 measured sports have negative CLV. We responded by acting on the data:

Trading paused: NBA (mean CLV −7.9c), CS2 (−5.5c), WTA (−2.7c), Soccer (−2.1c). Predictions still compute and CLV continues to track for these sports — the data appears on /clv alongside active sports — but no new orders are placed
Still trading: NHL (CLV +0.5c, 63% WR), ATP (+5.2c), LOL (+3.4c), MLB (CLV neutral but 61% WR — kept on probation while we investigate the high-WR / negative-CLV anomaly)
NBA paused via min_edge_nba=100 in moneyline_wp_bot.py; WTA paused via min_edge_c_wta=100 in tennis_wp_bot.py; CS2 + Soccer disabled via --disable cs2,soccer on the unified bot's systemd ExecStart
WS subscription budget dropped from 1,498 tokens → 266 tokens since CS2/Soccer plugins no longer load their token sets — frees the cap for tokens we're actually trading

MLB SCORE-FILTER override + WS reconnect telemetry (carried from v12.0). Found that the April 28 score-change-only filter was blocking 4,026 of 4,028 MLB EVAL events because it was sampled at the old 8c gate. Added a narrow override: polling-trigger MLB trades now allowed in the proven 5–8c band only (the bucket the v12.0 audit found was 71.9% WR / +$8.92). All other MLB edges still require a score-change trigger. WS reconnect logging now emits clean "WS RECONNECTED — resuming signals (downtime=Xs, N eval cycles skipped)" lines on recovery and forces a fresh ESPN poll on reconnect.

Research-first reframe. We're being honest about what ZenHodl is. The bot is one thing it does. The data the bot generates — calibrated probabilities, ECE per sport, on-chain pre-committed benchmarks, per-sport CLV across every trade — is a different and rarer thing. As of v12.1, we treat that data as a first-class product. Most prediction-market vendors don't measure CLV at all. The few that do, don't publish it. We do, and we use it to make trading decisions in public. If a sport's CLV stays negative, we pause that sport on this page until it doesn't. If we ever start trading a sport that has been bleeding CLV without good reason, you'll see it here.

Open issue: tennis WS subscription cap (Haddad Maia 19.6c edge missed because her token wasn't in the 1,498-token active set). Documented for next session — fix is smarter token rotation, not in scope today.

v12.0 April 25–29, 2026

Transparency Index expansion, on-chain benchmark hardening, public dataset endpoints

Transparency Index — expanded and publicly citable. Grew the index from 21 to 27 sports prediction sources and added two new dimensions (track record longevity, sport coverage breadth), bringing the rubric to 7 dimensions / 35 max points. Recalibrated several scores after a fresh audit; the result is that FiveThirtyEight (archived) now ranks #1 at 30/35 and ZenHodl ranks #2 at 29/35 — the index passes the "would you rank yourself first if you were honest" smell test.

Public dataset endpoints: /transparency-index.json and /transparency-index.csv (CC BY 4.0, 5-min CDN cache, CORS open)
JSON-LD Dataset schema in the page head — surfaces in Google Dataset Search / structured-data crawlers
Public scorecard diff history at /transparency-index/history — every score change preserved in monthly snapshots
Monthly Claude auto-rerun (scheduled task) — suggested score updates surface for human review at /admin/transparency-index and never auto-apply
Sort / filter / search / side-by-side compare on the public table; per-row anchor links (#src-fivethirtyeight); citation block; rank chips on every row
Equal-weights fairness statement above the fold; the two new dimensions are areas where ZenHodl scores weakly — added because credible challengers raised them, not because they help our ranking

NBA Playoffs 2026 benchmark — production-hardened ahead of May 5 tipoff. The on-chain pre-committed benchmark at /benchmarks/nba-playoffs-2026 went through a hardening pass to make every claim in the manifest enforceable in code:

T-60 enforcement. Snapshot window moved from 30–90 minutes to 60–180 minutes before tip-off so every recorded prediction satisfies the manifest's "captured no later than T-60 minutes before tip-off" commitment
Tie-handling now visible. When either ZenHodl or Polymarket is unavailable at snapshot time, the row is recorded with a status field (polymarket_unavailable / zenhodl_unavailable) and surfaces in a new "Excluded games" section on the public scoreboard. Manifest's tie-handling rule is now provably applied
Live hash verification. Server-side compares the served manifest.json SHA-256 to the on-chain receipt every render; green "Served file hash matches on-chain commit" badge confirms the manifest is byte-equal to what's on Polygon. Programmatic endpoint at /benchmarks/<slug>/hash-check.json
Retry/backoff on Polymarket and ESPN fetches. Both APIs now have exponential-backoff (1s, 2s, 4s) retries on transient failures so a playoff API spike doesn't silently drop games
WPModel load now guarded. A corrupt model pickle no longer takes down all sports' snapshots — failed loads emit zenhodl_unavailable rows for that batch and the next cron tick retries
Small-N statistical honesty. Until n ≥ 10 settled games, the leaderboard hides bootstrap CI bounds and shows a "Preliminary" banner. Bootstrapped CIs at n=3 are statistical theatre; they're suppressed until they mean something
Reliability diagram. New Plotly chart on the scoreboard plots predicted probability vs observed home-win rate per bin, ZenHodl vs Polymarket overlaid — the canonical visualization for calibration, surfaces automatically once games start resolving
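
The retry schedule from the hardening list (1s, 2s, 4s) can be sketched with a small wrapper. The function name, the injectable sleep parameter, and the choice of which exceptions count as transient are assumptions for illustration; the real fetchers may classify failures differently.

```python
import time


def fetch_with_backoff(fetch, delays=(1.0, 2.0, 4.0), sleep=time.sleep,
                       retry_on=(ConnectionError, TimeoutError)):
    """Call fetch(); on a transient error wait 1s, 2s, 4s between retries.

    Makes at most len(delays) + 1 attempts. The last attempt's exception
    propagates so the caller can record the game as unavailable.
    """
    for delay in delays:
        try:
            return fetch()
        except retry_on:
            sleep(delay)
    return fetch()  # final attempt, no further waiting
```

Passing sleep as a parameter keeps the wrapper testable without real waiting, and letting the last failure propagate avoids silently dropping a game.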

Build-vs-buy calculator — rebuilt for B2B decision-makers. The /build-vs-buy page got a full overhaul:

Multi-tier comparison block (Starter $49 / Growth $149 / Enterprise $499) — selecting a tier swaps the savings panel and CTA target so there's no "send me to /pricing and figure it out" bait-and-switch
5-year cumulative cost line chart (Plotly) — DIY (red, widening) vs API (green, flat). The widening gap is the picture
Time-to-value hero stats — "5 min from signup to first calibrated prediction" vs "DIY median: 4–6 months to first production prediction"
Competitive matrix — DIY / open-source repo / contractor / other API vendor / ZenHodl across 7 dimensions with year-one cost row. Preempts the "I'll just hire a contractor for $5k" objection
Sticky bottom CTA bar: follows the user with a live savings number; entering an engineer rate below $100/hr surfaces a credibility warning
Shareable URL with encoded inputs (?rate=200&sports=8&tier=enterprise) so a buyer can lock numbers and forward to their CFO
FAQ section answering the actual objections ("why not a contractor", "why not an open-source repo", "is this calculator biased toward buy?") for long-tail SEO
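
The shareable-URL item above encodes the calculator inputs in the query string. A minimal sketch of that round trip using only the standard library follows; the helper names, the fallback defaults, and the validation rules are assumptions for illustration, and only the three parameter names and tier names come from the entry itself.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

TIERS = ("starter", "growth", "enterprise")
DEFAULTS = {"rate": 150, "sports": 1, "tier": "starter"}  # hypothetical fallbacks


def build_share_url(base: str, rate: int, sports: int, tier: str) -> str:
    """Encode calculator inputs as ?rate=...&sports=...&tier=..."""
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier}")
    return f"{base}?{urlencode({'rate': rate, 'sports': sports, 'tier': tier})}"


def _positive_int(raw: str, fallback: int) -> int:
    """Parse a positive integer, falling back on missing or malformed input."""
    try:
        value = int(raw)
    except ValueError:
        return fallback
    return value if value > 0 else fallback


def parse_share_url(url: str) -> dict:
    """Decode a shared link back into calculator inputs, tolerating bad values."""
    query = parse_qs(urlsplit(url).query)

    def first(key: str) -> str:
        return query.get(key, [""])[0]

    tier = first("tier") if first("tier") in TIERS else DEFAULTS["tier"]
    return {
        "rate": _positive_int(first("rate"), DEFAULTS["rate"]),
        "sports": _positive_int(first("sports"), DEFAULTS["sports"]),
        "tier": tier,
    }
```

Falling back to defaults rather than raising keeps a hand-edited or truncated link usable when it is forwarded.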

Bot operations — gate audit + WS reconnect telemetry.

MLB min_edge lowered 8¢ → 5¢ after a per-bucket audit. The 5–8¢ band has 71.9% WR / +$8.92 over 57 settled trades and was being blocked. Adverse-selection filter still catches the 10–21¢ directional-miss band. All other gates (NBA, NCAAMB, NCAAWB, NHL, ATP, WTA, CS2, LOL, soccer) audit-confirmed and held at current levels
WebSocket reconnect telemetry. When the Polymarket WS drops, the bot now logs disconnect duration + count of skipped eval cycles and emits a single clear WS RECONNECTED — resuming signals (downtime=Xs, N eval cycles skipped during outage) line on recovery. Forces a fresh ESPN poll on reconnect so we don't trade off stale game data. Throttled the spam of "WS DISCONNECTED" warnings from once-per-second to once-per-30s. Next outage's postmortem is one grep away
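
The reconnect bookkeeping described above (downtime, skipped eval cycles, warnings throttled to one per 30s) might look roughly like the sketch below. The class and method names are hypothetical, timestamps are passed in by the caller, and the actual log formatting and the forced ESPN poll are left to the caller.

```python
from typing import Optional


class ReconnectTelemetry:
    """Tracks one WebSocket outage: when it began, how many cycles it cost."""

    def __init__(self, warn_interval_s: float = 30.0) -> None:
        self.warn_interval_s = warn_interval_s
        self.down_since: Optional[float] = None
        self.last_warn_at: Optional[float] = None
        self.skipped_cycles = 0

    def cycle_while_down(self, now: float) -> bool:
        """Record one skipped eval cycle; return True when a warning is due."""
        if self.down_since is None:
            self.down_since = now  # first cycle of this outage
        self.skipped_cycles += 1
        due = (
            self.last_warn_at is None
            or now - self.last_warn_at >= self.warn_interval_s
        )
        if due:
            self.last_warn_at = now
        return due

    def reconnected(self, now: float) -> Optional[dict]:
        """Return an outage summary once, then reset; None if there was no outage."""
        if self.down_since is None:
            return None
        summary = {
            "downtime_s": now - self.down_since,
            "skipped_cycles": self.skipped_cycles,
        }
        self.down_since = None
        self.last_warn_at = None
        self.skipped_cycles = 0
        return summary
```

A caller would log the disconnect warning whenever `cycle_while_down` returns True, and on a non-None `reconnected` result emit the single recovery line and trigger a fresh game-data poll before resuming signals.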

Known issue — CLV coverage gap. A spot audit on April 29 surfaced that closing-line value (CLV) is recorded on only 9% of trades globally, with critical gaps: CS2 has 0% coverage on 341 trades and tennis has <5% across both tours. The closing-price polling job is only fully wired for the moneyline plugin (MLB at 31.8% coverage is the best-covered sport, with mean CLV of -5.5¢). Filling the CS2 / tennis CLV gap is the next operational priority — without it we're flying blind on whether recent gate tightening is improving or hurting closing-line value, which is the single best leading indicator of edge erosion. Tracked publicly here so we can ship the fix in v12.1 and validate the v12.0 gate changes against it
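
For reference, a coverage audit of this kind can be reproduced in a few lines. The sketch assumes each trade row is a mapping with a `sport` key and a `clv_c` key that is `None` when no closing price was captured; both field names are hypothetical, not the trade log's actual schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def clv_coverage(trades: Iterable[Mapping]) -> Dict[str, float]:
    """Fraction of trades per sport that have a recorded CLV value."""
    total = defaultdict(int)
    with_clv = defaultdict(int)
    for trade in trades:
        sport = trade["sport"]
        total[sport] += 1
        if trade.get("clv_c") is not None:  # hypothetical field: CLV in cents
            with_clv[sport] += 1
    return {sport: with_clv[sport] / total[sport] for sport in total}
```

Run over the full trade log, a sport with no closing-price polling shows up as 0.0, which is the gap this entry describes.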

Incident April 22–25, 2026

Calibration table corruption — silent degradation across all sports

What happened. From 2026-04-22 through 2026-04-24, ai_drift_monitor.py overwrote per-sport calibration tables with newly-fit isotonic regressions without a holdout-validation gate. The refit was mathematically valid on the training half but produced overconfident probabilities at inference, so the bot priced edges that weren't there.

Impact. Approximately -$63 cumulative bot P&L over a 7-day window attributable to the corrupted tables, concentrated in NHL and MLB. The same calibration tables back our public win-probability API, so any consumer reading /v1/games or /v1/edges during the incident window saw the same overconfident probabilities. Discovered during a routine bleed audit on April 25.

Root cause. The post-hoc refit pipeline assumed any new isotonic fit was an improvement. There was no holdout split, no Brier-score check against the prior calibrator, and no per-bucket sanity range — three guards we'd been planning to add but hadn't shipped. A monotone fit on a tiny recent window is mathematically "correct" but pushes recent noise into the model.

Fix shipped April 25. Built calibration_validator.py as a hard gate on every refit:

70/30 holdout split before any new calibrator is accepted
Three calibrator candidates (isotonic, Platt, Beta): the one with the lowest holdout Brier score wins and the others are rejected
Per-bucket sanity check at 8¢ tolerance — refits that move any predicted-probability bucket more than 8¢ are blocked pending operator review
All refit attempts (accepted and rejected) logged to calibration_history.jsonl for audit
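
The four gate steps above can be sketched as follows. This is an illustration under stated assumptions, not the contents of calibration_validator.py: candidates are passed in as hypothetical fitter callables, the split is a seeded random shuffle, buckets are ten equal-width bins checked at their midpoints, and the winner must also beat the prior calibrator's holdout Brier score.

```python
import json
import random
from typing import Callable, Dict, Optional, Sequence

Calibrator = Callable[[float], float]  # raw probability -> calibrated probability
Fitter = Callable[[Sequence[float], Sequence[int]], Calibrator]


def brier(cal: Calibrator, probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between calibrated probabilities and 0/1 outcomes."""
    return sum((cal(p) - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def validate_refit(
    probs: Sequence[float],
    outcomes: Sequence[int],
    fitters: Dict[str, Fitter],
    prior: Calibrator,
    tolerance: float = 0.08,
    holdout_frac: float = 0.30,
    seed: int = 0,
    history_path: Optional[str] = None,
) -> dict:
    """Fit each candidate on 70% of the rows and gate the winner on the other 30%."""
    idx = list(range(len(probs)))
    random.Random(seed).shuffle(idx)
    n_hold = round(len(idx) * holdout_frac)
    if n_hold == 0 or n_hold == len(idx):
        raise ValueError("not enough rows for a train/holdout split")
    hold, train = idx[:n_hold], idx[n_hold:]
    train_p, train_y = [probs[i] for i in train], [outcomes[i] for i in train]
    hold_p, hold_y = [probs[i] for i in hold], [outcomes[i] for i in hold]

    fitted = {name: fit(train_p, train_y) for name, fit in fitters.items()}
    scores = {name: brier(cal, hold_p, hold_y) for name, cal in fitted.items()}
    winner = min(scores, key=scores.get)  # lowest holdout Brier wins

    # Per-bucket sanity check: compare winner and prior at ten bucket midpoints.
    midpoints = [0.05 + 0.10 * i for i in range(10)]
    max_shift = max(abs(fitted[winner](m) - prior(m)) for m in midpoints)

    record = {
        "winner": winner,
        "holdout_brier": scores,
        "prior_brier": brier(prior, hold_p, hold_y),
        "max_bucket_shift": max_shift,
    }
    # Assumption: accept only if the winner beats the prior and stays in tolerance.
    record["accepted"] = (
        scores[winner] < record["prior_brier"] and max_shift <= tolerance
    )
    if history_path is not None:
        with open(history_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")  # one JSON object per attempt
    return record
```

Writing rejected attempts to the same history file as accepted ones is what makes a later replay possible.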

Verification. We replayed the validator against the historical refit attempts that caused the incident: 9 of 10 sports' refits would have been rejected by the new gate; only the LoL Platt fit passed. Going forward, every refit of any of the 11 sports' calibrators must pass the same validator.

Followups. Public CLV dashboard at /admin/clv now exposes per-sport closing-line value as the leading indicator of edge erosion (independent of P&L variance). Sport-level circuit breaker (sport_circuit_breaker.py, shadow mode through ~May 9) auto-disables any sport whose 30-day ROI drops below -5%, so a future calibration regression that escapes the validator gets pulled within a day instead of a week.
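
The circuit-breaker rule (disable a sport when its 30-day ROI falls below -5%) reduces to a short aggregation. The sketch below assumes ROI means total P&L divided by total amount staked over the window, and the row fields `settled_on`, `stake_usd`, and `pnl_usd` are hypothetical names rather than the schema sport_circuit_breaker.py uses.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Iterable, List, Mapping


def sports_to_disable(
    trades: Iterable[Mapping],
    today: date,
    window_days: int = 30,
    roi_floor: float = -0.05,
) -> List[str]:
    """Return sports whose ROI over the trailing window is below the floor."""
    window_start = today - timedelta(days=window_days)
    staked = defaultdict(float)
    pnl = defaultdict(float)
    for trade in trades:
        # Keep only trades settled inside the trailing window.
        if window_start < trade["settled_on"] <= today:
            staked[trade["sport"]] += trade["stake_usd"]
            pnl[trade["sport"]] += trade["pnl_usd"]
    return sorted(
        sport
        for sport in staked
        if staked[sport] > 0 and pnl[sport] / staked[sport] < roi_floor
    )
```

In shadow mode the returned list would only be logged, so its decisions can be compared against what an operator would have done.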

v11.0 April 14, 2026

Live Recalibration Overhaul — Runtime Fixes, Calibration, Backtests

Critical Runtime Fixes: Repaired the live recalibrator path, fixed settlement-time type errors, and updated tennis trade logging so those jobs no longer fail at runtime.
Recalibration Activated: Seeded recalibration history across all 11 supported sports so live calibration can run immediately instead of waiting on an empty buffer.
Calibration Improved: Major ECE improvements landed across tennis, CS2, and core US sports after the recalibration rollout and model cleanup.
Backtests Refreshed: Re-ran tennis, NFL, LoL, and CS2 backtests against the updated stack and disabled CFB after consistently negative results.
Tennis Model Tuning: Tightened dampening, raised WTA minimum edge, added a between-sets filter, and blended surface-specific Elo into match pricing.
Trade Logging Expanded: Tennis trades now store surface and best_of for downstream analytics and review.
Dependency Alignment: Updated local scikit-learn to match production and removed an inference error affecting MLB and CFB locally.

v10.0 April 8–10, 2026

Billing, Payments, Attribution, and Model Upgrades

NBA and NHL Retrained: Fixed missing team-stat inputs at inference, rebuilt the feature pipeline, and retrained both models with the intended data.
NHL Injury Overlay: Added a live skater injury adjustment layer sourced from ESPN to support pre-trade price corrections.
NBA Impact Table Expanded: Updated the star-impact map to cover more rotation players and current-season breakouts.
Live Recalibration: Added rolling isotonic recalibration per sport with periodic auto-refits and persisted history.
MLB Bullpen Overlay: Added inning-aware bullpen quality adjustments to late-game MLB pricing.
Momentum Kept: Confirmed momentum features still improve NBA model quality and retained them in the live stack.
Crypto Payments: Added NowPayments checkout, crypto discounts, email validation, and billing-event tracking across the payment flow.
Checkout Recovery: Enabled Stripe recovery links on expired sessions, shortened session expiry, and wired recovery emails into the webhook path.
Attribution Fixes: Restored non-zero conversion values, added enhanced conversion support, and captured ad click identifiers in first-party cookies.
Revenue Admin View: Added /admin/revenue with active MRR, at-risk MRR, collected cash, and Stripe-versus-crypto breakdowns.
Support Checkout Tool: Added /admin/support/mint-link to mint fresh prefilled Stripe links for blocked customers.
Middleware and CSP Cleanup: Replaced fragile middleware with ASGI implementations and tightened payment-related CSP allowlists.
IAB Flow Controls: Added override tracking and a configurable in-app-browser mode for Stripe checkout handling.
Cron Reliability: Added cron heartbeat logging and fixed broken campaign and retention jobs.
Billing Event Coverage: Expanded the billing ledger with crypto, recovery, IAB override, and support-link events.
Card-Required Baseline: Set STRIPE_API_STARTER_NO_CARD_TRIAL_PCT to 0 to evaluate the billing flow against a single card-required configuration.

v9.0 April 6, 2026

Model Intelligence Upgrade — Runtime Models and Homepage Cleanup

NBA Retrained: Switched to a tuned XGBoost stack, improved calibration, and added dynamic edge thresholds by game state.
NBA Injury Layer: Added ESPN-driven injury adjustments and a live star-impact table before edge calculation.
CS2 Combined Model: Rolled out a combined map-state and economy-aware CS2 model with confidence-weighted Elo support.
CS2 Entry Filters: Tightened edge and entry guards to cut overconfident and asymmetric fills.
LoL ML Model: Replaced the Elo-only path with the trained live model so match-state features now drive pricing.
WTA Calibration: Split ATP/WTA serve assumptions and added lightweight set-context adjustments.
MLB Load Fix: Regenerated the stale MLB model hash so all supported sports load correctly again.
Homepage Refresh: Simplified the hero, reduced CTA clutter, added clearer proof blocks, and refreshed product mockups.
CS2 Backtest Harness: Added a dedicated combined-model backtest using bo3.gg match history and bookmaker odds.

v8.0 April 4, 2026

Platform Overhaul — Infrastructure, Security, and Trading Stack

Unified Trading Process: Merged the live sport bots into one process with shared market feeds, portfolio-level controls, and a global circuit breaker.
CS2 bo3.gg Migration: Replaced HLTV with bo3.gg and moved CS2 onto a live economy-aware combined model.
Bot Safeguards: Added daily loss limits, streak cooldowns, and rolling win-rate gates, and fixed a cents-to-dollars conversion bug in the breaker.
Feed Quality Controls: Added freshness and confidence scoring, warm-start gating, and restart-safe execution queuing.
Activation Flow: Added a dedicated /activate flow for passwordless accounts and persisted activation milestones.
Course Ratings: Added purchase-gated ratings, review handling, and admin notification paths.
Security Hardening: Fixed unsubscribe auth, webhook SSRF checks, escaping, CSP, session cookies, and logout CSRF handling.
Backtest Integrity Pass: Regraded the public backtests, rebuilt LoL against real prices, fixed NFL split ordering, and synced public numbers to the corrected results.
State-Aware Pages: Personalized pricing, course, and recovery actions by user state.
Dashboard and Results UX: Added richer filtering, detail drawers, clearer status indicators, and better mobile behavior.
Monitoring Agents: Added automated reconciliation, entitlement, deployment, and strategy checks on cron.
Cache Pipeline: Improved stale pruning, sport normalization, active-market caching, and publish quality gates.
Purchase Fulfillment: Standardized confirmation emails, webhook account creation, and re-download access across products.

v7.0 April 4, 2026

CS2 Audit, Domain Migration, and Performance Cleanup

CS2 Audit and Refiltering: Tightened CS2 entry rules after live results exposed slippage, underdog bias, and thin-edge problems.
CS2 Rating Rebuild: Rebuilt the CS2 Elo base to expand team coverage and remove stale-rating failure cases.
Domain Migration: Moved the primary site from api.zenhodl.net to zenhodl.net and updated redirects, canonicals, sitemap, and emails.
Performance Pass: Added compression, long-lived asset caching, deferred scripts, and lighter proof media to improve page speed.
Claim Sync: Replaced stale public performance claims with the current backtest source of truth across product pages and emails.
Results Page Cleanup: Clarified ledger scope, reconciliation state, and excluded-row reporting.
Course Progress: Moved course progress from local storage into backend state and improved next-step guidance.
Trade Resolution Cron: Added scheduled settlement resolution through the Polymarket API.
Odds API Stability: Added retry backoff and request budgeting for sportsbook data polling.
Tennis Coverage: Added WTT tour support alongside ATP and WTA.
Copy Cleanup: Standardized account CTAs from “Sign up” to “Create Account”.

v6.0 March 30, 2026

Infrastructure Upgrade — CLV, Confidence Bands, Webhooks

Model Quality API: /v1/model/performance — Brier score, ROC-AUC, ECE, accuracy, and full conformal calibration tables for all 11 sports
CLV Tracking: /v1/model/clv — live closing line value tracking. Measures how often our entry price beats the final market price
Confidence Intervals: All prediction endpoints now include calibrated prediction bands from conformal prediction tables. Width narrows as games progress
Venue Status: /v1/venues — real-time status of all connected data venues (Polymarket, Kalshi, DraftKings, FanDuel, etc.)
Venue Filter: ?venue=kalshi on /v1/games and /v1/edges to filter by specific venue
Intraday WP Snapshots: /v1/snapshots/{sport}/{date} — win probability archived every 30s during live games (Pro+ tier)
Batch Predictions: /v1/predictions/batch — bulk download predictions for up to 90 days (Pro+ tier)
Webhook Push: /v1/webhooks — register URLs to receive edge signals in real-time via signed POST requests (HMAC-SHA256)
Course Rating System: Star ratings + written reviews at /course/rate with social proof widget on course page
NFL Model Fix: Corrected a season-ordering bug where the 2020-21 test set predated the training data (data leakage eliminated)
Bot Safety: Circuit breaker (daily loss limit + streak detection), bankroll depletion check, WS/ESPN staleness guards, performance dashboard every 10 min
Billing Fixes: Fixed free trial 402 error, added checkout.session.completed webhook handler, fixed 3 pricing page checkout links pointing to wrong tiers
Security Hardening: Timing-safe CSRF comparison, HTTPS-only session cookies, randomized download secrets, custom 404 page
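
For consumers of the webhook push above, checking an HMAC-SHA256 signature typically looks like the sketch below. It assumes the signature is the hex digest of the raw request body keyed with a shared secret; the encoding, the header that carries it, and the function names are assumptions to check against the actual webhook documentation.

```python
import hashlib
import hmac


def sign_payload(secret: bytes, body: bytes) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_signature(secret: bytes, body: bytes, received_hex: str) -> bool:
    """Compare signatures in constant time to avoid leaking timing information."""
    expected = sign_payload(secret, body)
    return hmac.compare_digest(expected, received_hex)
```

Verification should run on the exact bytes received, before any JSON parsing or re-serialization, because even a whitespace change alters the digest.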

v5.0 March 28, 2026

Multi-Venue Pricing and Prediction API

Multi-Venue Pricing: Added edge calculations across Polymarket, Kalshi, and sportsbook feeds.
Cross-Feed Matching: Added team normalization so ESPN, exchange, and sportsbook data resolve to the same game entities.
Best Venue Display: The dashboard now shows which venue has the best available price per side.
Venue Filtering: Edge views can now be filtered by venue.
Live Prediction API: Added /v1/predict/{sport}/live for live win probabilities and venue-aware edges.
Pregame API: Added /v1/predict/{sport}/pregame for scheduled-game pricing.
Game Detail API: Added /v1/predict/{sport}/{game_id} for single-game model output.
Fair Lines API: Added /v1/fair-lines/{sport} with American-odds conversion.
Usage Metering: Added /v1/usage with monthly request breakdowns and tier caps.
Sportsbook Handling: Added multiplicative vig removal, adaptive polling, and a Polymarket-only fallback when sportsbook keys are missing.
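
Multiplicative vig removal and American-odds conversion, both mentioned in this release, are standard arithmetic. A minimal sketch, with hypothetical function names:

```python
from typing import List, Sequence


def american_to_implied(odds: int) -> float:
    """Implied probability (vig included) from American odds."""
    if odds < 0:
        return -odds / (-odds + 100)
    return 100 / (odds + 100)


def remove_vig_multiplicative(implied: Sequence[float]) -> List[float]:
    """Scale implied probabilities so they sum to 1 (multiplicative method)."""
    total = sum(implied)
    return [p / total for p in implied]


def prob_to_american(p: float) -> int:
    """Convert a fair probability to American odds, rounded to an integer."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    if p >= 0.5:
        return round(-100 * p / (1 - p))
    return round(100 * (1 - p) / p)
```

For example, a market quoted at -110 on both sides implies about 52.4% each, which normalizes to 50% per side once the vig is removed.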

v4.0 March 24, 2026

Advanced Elo Engine — MOV, Dynamic K, Surface Ratings

MOV Elo: Added margin-of-victory weighting for basketball and football Elo with autocorrelation protection.
Dynamic K: New teams now converge faster while mature teams stabilize with lower K values.
Sport-Specific Rules: Kept MOV and dynamic K off for low-scoring sports where margin is mostly noise.
Backtest Lift: Moneyline backtests improved after the Elo upgrade, with the largest gains in NBA and NCAAMB.
Tennis Surface Elo: Added Hard, Clay, and Grass ratings with blended match pricing.
LoL Regional Initialization: Added regional Elo starting points for esports teams.
Soccer Elo Tuning: Reduced elo_power and enabled the new Elo logic for soccer.
Model Retraining: Retrained the core win-probability models against the updated Elo inputs.
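
To make the MOV and dynamic-K ideas concrete, here is one common way to write them. The changelog does not give the engine's formulas, so everything below is an assumption: the margin multiplier is the widely published log-margin form with an Elo-difference damping term (a usual way to counter autocorrelation), and the K schedule is a simple linear ramp with made-up constants.

```python
import math
from typing import Tuple


def expected_score(r_a: float, r_b: float) -> float:
    """Standard Elo win expectancy for side A."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def dynamic_k(games_played: int, k_new: float = 40.0, k_mature: float = 16.0,
              ramp_games: int = 30) -> float:
    """Hypothetical schedule: K decays linearly from k_new to k_mature."""
    frac = min(games_played, ramp_games) / ramp_games
    return k_new + (k_mature - k_new) * frac


def mov_multiplier(margin: float, winner_elo_diff: float) -> float:
    """Log-margin weight, damped when the winner was already the favorite."""
    return math.log(abs(margin) + 1) * 2.2 / (winner_elo_diff * 0.001 + 2.2)


def update(r_a: float, r_b: float, score_a: int, score_b: int,
           games_a: int, games_b: int) -> Tuple[float, float]:
    """Return updated ratings for both sides after one game."""
    margin = score_a - score_b
    outcome_a = 1.0 if margin > 0 else 0.0 if margin < 0 else 0.5
    winner_diff = (r_a - r_b) if margin >= 0 else (r_b - r_a)
    mult = mov_multiplier(margin, winner_diff) if margin != 0 else 1.0
    delta = mult * (outcome_a - expected_score(r_a, r_b))
    return r_a + dynamic_k(games_a) * delta, r_b - dynamic_k(games_b) * delta
```

Because each side applies its own K, a new team can move sharply after a result while its established opponent barely shifts.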

Data March 24, 2026

Market Data Store

Polymarket Archive: Added a large Parquet orderbook archive covering six sports.
Kalshi MLB Candles: Added full-season one-minute candle data for MLB markets.
Microstructure Pack: Added guardrail and spread-analysis datasets for market structure research.
Tennis Match Data: Added long-range ATP/WTA match and odds history.
Data Protection: Added watermarking, SHA-256 signing, and provenance cards across datasets.
Sample Previews: Added public previews at /samples.

v3.0 March 23, 2026

Model v3 — Football Features and Pregame Prior

Football Features: Added down, distance, yard line, and possession fields for CFB and NFL models.
Pregame Prior: Added an ESPN-based pregame win-probability prior.
Performance Lift: CFB and NFL backtests improved materially after the football feature expansion.
Trainer Option: Added --select-by-trading to train_wp_model.py.

New March 2026

CS2, LoL, and Tennis Models

CS2: Added a negative-binomial model with HLTV-derived Elo inputs.
LoL: Added a series model backed by lolesports data and regional initialization.
Tennis: Added a point-level model with surface-specific Elo.
Rollout State: Released all three in shadow mode before live trading.

New March 2026

ZenHodl API and Dashboard Launch

REST API: Launched /v1/games, /v1/edges, /v1/sports, and /v1/predictions.
Realtime Stream: Added a WebSocket feed for live updates.
Backtest Service: Added large-scale backtest querying as a product feature.
Edge Dashboard: Launched the auto-refreshing edge scanner UI.
Billing: Added Stripe checkout with a 7-day free trial.
Alerts: Added Discord edge notifications.

v2.0 February 2026

Spread/Total and Soccer Models

Spread Model: Added a regression-plus-CDF spread win-probability model.
Soccer Model: Added a Poisson-based soccer pricing model.
Backtests: Recorded positive early backtest results for both model families.
Polling: Added adaptive ESPN polling for faster live updates.
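
As an illustration of what a Poisson-based soccer price involves, the sketch below turns two expected-goals rates into home/draw/away probabilities. It assumes independent Poisson goal counts and a truncated score grid, which is the textbook form and not necessarily the model shipped here; the function names are hypothetical.

```python
import math
from typing import Tuple


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals given expected goals lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def match_probs(lam_home: float, lam_away: float,
                max_goals: int = 10) -> Tuple[float, float, float]:
    """Home/draw/away probabilities from independent Poisson goal counts."""
    home = draw = away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                home += p
            elif h == a:
                draw += p
            else:
                away += p
    total = home + draw + away  # renormalize the truncated score grid
    return home / total, draw / total, away / total
```

Live pricing would additionally condition on the current score and time remaining by scaling the two rates to the minutes left.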

v1.0 December 2025

Initial Moneyline Model

Core Models: Launched LR-plus-spline moneyline models for NBA, NCAAMB, NHL, and MLB.
NCAAMB Stack: Added an XGBoost-plus-isotonic calibration path for college basketball.
Elo Base: Added team Elo ratings across the supported leagues.
Feature Set: Started with a score, time, period, and Elo-driven live pricing model.
