# Sports Prediction Market Backtest Pack: Schema Manifest

- **Coverage:** Dec 28, 2025 → Jan 29, 2026 (33 days)
- **Sports:** NBA, NHL, NFL, MLB, NCAAMB, NCAAWB, NCAAFB
- **Format:** Parquet (zstd-compressed)
- **Files:** `poly_snapshots.parquet` (618 MB) + `kalshi_snapshots.parquet` (817 MB)
- **Total rows:** 136,109,483 (Polymarket: 25,651,600 · Kalshi: 110,457,883)
- **Load size:** ~1.4 GB on disk, ~6 GB in memory uncompressed

---

## `poly_snapshots.parquet`: Polymarket CLOB depth snapshots (25.7M rows)

Tick-level orderbook captures from Polymarket's CLOB API for sports prediction markets. Each row is one snapshot of one token's best bid/ask plus L2 depth at one moment in time.

| Column | Type | Description |
|---|---|---|
| `timestamp` | float64 | Unix epoch seconds (sub-second precision) |
| `datetime_utc` | datetime64[ns, UTC] | Same moment as a UTC datetime |
| `token_id` | string | Polymarket CLOB token ID (ERC-1155). Maps to one YES/NO outcome of one market |
| `condition_id` | string | Polymarket condition ID (groups the YES and NO tokens of the same market) |
| `event_slug` | string | URL slug of the event (e.g. `nba-lakers-celtics-2026-01-15`) |
| `event_title` | string | Human-readable title of the event |
| `question` | string | Outcome being traded (e.g. `Will Lakers win?`) |
| `sport` | string | NBA / NHL / NFL / MLB / NCAAMB / NCAAWB / NCAAFB |
| `market_type` | string | `moneyline` / `spread` / `total` / `prop` |
| `line` | float64 | Spread or total line (null for moneyline) |
| `yes_bid` | float64 | Best bid price in dollars (0.00–1.00) |
| `yes_ask` | float64 | Best ask price in dollars (0.00–1.00) |
| `mid` | float64 | (`yes_bid` + `yes_ask`) / 2 |
| `spread` | float64 | `yes_ask` − `yes_bid`, in dollars |
| `bid_size_top` | int64 | Size at best bid (shares) |
| `ask_size_top` | int64 | Size at best ask (shares) |
| `bid_depth_5c` | int64 | Cumulative bid shares within 5c of best bid |
| `ask_depth_5c` | int64 | Cumulative ask shares within 5c of best ask |
| `total_bid_depth` | int64 | All shares on the bid side of the book |
| `total_ask_depth` | int64 | All shares on the ask side of the book |
| `bid_levels` | int64 | Number of distinct bid price levels in the book |
| `ask_levels` | int64 | Number of distinct ask price levels in the book |
| `is_burst` | bool | True for high-frequency capture mode (during fast price moves) |

---

## `kalshi_snapshots.parquet`: Kalshi orderbook + game state (110.5M rows)

Tick-level snapshots from Kalshi's exchange API for sports event markets. Each row is joined with the matching game's live score, period, and time remaining. This column set is what makes the pack unique: Telonex does not have Kalshi, and Kingsets does not join game state.

| Column | Type | Description |
|---|---|---|
| `timestamp` | float64 | Unix epoch seconds (sub-second precision) |
| `datetime_utc` | datetime64[ns, UTC] | Same moment as a UTC datetime |
| `ticker` | string | Kalshi market ticker (e.g. `KXNBAGAME-26JAN15LALCELT-LAL`) |
| `title` | string | Human-readable title of the market |
| `game_id` | string | Underlying game identifier. Polymarket has no matching field, so pair it with `poly_snapshots` events as described under "Joining Polymarket ↔ Kalshi by game" |
| `home_team` | string | Home team three-letter code |
| `away_team` | string | Away team three-letter code |
| `home_score` | int64 | Live home score at this timestamp |
| `away_score` | int64 | Live away score at this timestamp |
| `score_diff` | int64 | `home_score` − `away_score` |
| `period` | int64 | Current period / quarter / inning (1-indexed) |
| `time_remaining` | string | Time left in the current period (`MM:SS`, `null` if not applicable) |
| `game_state` | string | `pregame` / `live` / `halftime` / `final` / `postponed` |
| `yes_bid` | float64 | Best YES bid in dollars (0.00–1.00) |
| `yes_ask` | float64 | Best YES ask in dollars (0.00–1.00) |
| `mid` | float64 | (`yes_bid` + `yes_ask`) / 2 |
| `spread` | float64 | `yes_ask` − `yes_bid`, in dollars |
| `spread_cents` | float64 | `spread` × 100 |
| `yes_bid_size` | int64 | Size at best YES bid (contracts) |
| `yes_ask_size` | int64 | Size at best YES ask (contracts) |
| `min_size` | int64 | Minimum trade size in this market |
| `open_interest` | int64 | Total contracts outstanding |
| `volume_24h` | int64 | Rolling 24-hour volume |
| `liquidity_score` | float64 | Kalshi's internal liquidity metric (0–100) |
| `tradable_now` | bool | True if the market is currently open for trading |
| `is_burst_mode` | bool | True for high-frequency capture mode |

---

## Joining Polymarket ↔ Kalshi by game

There is no direct `game_id` field on the Polymarket side (Polymarket uses event slugs). The intended cross-venue join is:

1. Filter `kalshi_snapshots` by `home_team`, `away_team`, and game date.
2. Filter `poly_snapshots` by `event_title` containing both team names and the same date.
3. As-of join the two streams on `timestamp` (Polars `join_asof` or pandas `merge_asof`).

The included quickstart notebook (`quickstart.py`) demonstrates this with a worked example for one NBA game.
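The three join steps can be sketched in pandas roughly as follows. This is a minimal illustration, not the contents of `quickstart.py`: the helper name `asof_join_game`, the team code `BOS`, the UTC time window used in place of a calendar date, the 5-second tolerance, and the tiny synthetic frames are all assumptions made for the example. Only the column names come from the schema above.

```python
import pandas as pd


def asof_join_game(poly, kalshi, home, away, home_name, away_name,
                   start, end, tolerance_s=5.0):
    """Attach the latest Kalshi snapshot (at most tolerance_s old) to each Polymarket row.

    start/end must be timezone-aware UTC timestamps bounding the game. A window
    is used here instead of a calendar date because US evening games cross
    midnight UTC (an assumption of this sketch).
    """
    start, end = pd.Timestamp(start), pd.Timestamp(end)

    # Step 1: Kalshi rows for this matchup inside the window.
    k = kalshi[
        (kalshi["home_team"] == home)
        & (kalshi["away_team"] == away)
        & kalshi["datetime_utc"].between(start, end)
    ]

    # Step 2: Polymarket rows whose event title mentions both teams.
    title = poly["event_title"]
    p = poly[
        title.str.contains(home_name, regex=False)
        & title.str.contains(away_name, regex=False)
        & poly["datetime_utc"].between(start, end)
    ]

    # Step 3: backward as-of join on the float epoch-seconds key.
    # merge_asof requires both inputs to be sorted by the key.
    return pd.merge_asof(
        p.sort_values("timestamp"),
        k.sort_values("timestamp"),
        on="timestamp",
        direction="backward",
        tolerance=tolerance_s,
        suffixes=("_poly", "_kalshi"),
    )


# Tiny synthetic frames (not real data) that mimic the relevant columns.
base = pd.Timestamp("2026-01-16T00:30:00Z").timestamp()


def demo_frame(offsets, **cols):
    ts = [base + o for o in offsets]
    return pd.DataFrame({
        "timestamp": ts,
        "datetime_utc": pd.to_datetime(ts, unit="s", utc=True),
        **cols,
    })


poly_demo = demo_frame([1.0, 3.0, 20.0],
                       event_title="Lakers vs. Celtics",
                       mid=[0.55, 0.56, 0.60])
kalshi_demo = demo_frame([0.0, 2.5, 4.0],
                         home_team="BOS", away_team="LAL",
                         mid=[0.54, 0.57, 0.58],
                         home_score=[10, 12, 12], away_score=[8, 8, 10])

joined = asof_join_game(poly_demo, kalshi_demo,
                        home="BOS", away="LAL",
                        home_name="Celtics", away_name="Lakers",
                        start="2026-01-16T00:00:00Z",
                        end="2026-01-16T04:00:00Z")
print(joined[["timestamp", "mid_poly", "mid_kalshi", "home_score"]])
```

Columns present in both files (`datetime_utc`, `yes_bid`, `yes_ask`, `mid`, `spread`) come out with `_poly` / `_kalshi` suffixes. Polymarket rows with no Kalshi snapshot inside the tolerance get nulls on the Kalshi side. On real data you would likely also narrow each side to a single `token_id` and a single `ticker` before joining, so that the same outcome is compared across venues.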

---

## What this data is good for

- **Cross-venue arbitrage backtests.** Same game, both venues, every tick. The only commercial dataset for this combination.
- **Score-reaction strategies.** Joined game state lets you measure how fast each venue reprices on goals/baskets/runs.
- **Microstructure research.** L2 depth, spread, size, and open interest, all timestamped.
- **Liquidity modeling.** `bid_depth_5c` / `ask_depth_5c` for fill-probability modeling.
- **Burst-mode detection.** The `is_burst` (Polymarket) and `is_burst_mode` (Kalshi) flags identify periods of high information flow.

## What this data is not

- **Not order flow.** There is no trade-by-trade tape; these are book snapshots at multi-second intervals. Polymarket's burst captures get to ~100ms during fast moves; Kalshi's typical resolution is 1-5s.
- **Not all markets.** Sports only. No politics, crypto, or general-prediction markets.
- **Not subscription / not updated.** This is a 33-day historical archive. New monthly drops are available separately.

## Provenance + capture methodology

- **Polymarket side:** Captured via the public Polymarket CLOB WebSocket (`wss://ws-subscriptions-clob.polymarket.com/ws/`) and L2 snapshots from `https://clob.polymarket.com/book`. Burst mode triggers when `|Δmid| > 2c` within 10s.
- **Kalshi side:** Captured via the public Kalshi WebSocket and REST orderbook endpoints. Game state is joined from each sport's scoring feed (NBA: stats.nba.com, NHL: api.nhle.com, NFL/NCAAFB: ESPN, MLB: statsapi.mlb.com, NCAAMB/NCAAWB: ESPN).
- **Sync accuracy:** Game state is matched to orderbook snapshots within ±1 second by `datetime_utc`. Verified post-hoc by spot-checking 50+ score-change moments per sport.

## Support

Questions about the schema or how to load the data: admin@zenhodl.net

A reply within 24 hours is guaranteed for the first 10 buyers (and typical thereafter).