Course Preview

First 8 cells of each module. Real teaching, real code — decide for yourself.

01 Scraping ESPN
45 cells · 23 code · 1159 lines

Module 1: Scraping ESPN Play-by-Play Data at Scale

**Build a Polymarket Prediction Bot from Scratch**

---

What We're Building

In this module, we build a production-grade scraper that pulls **game-state snapshots** from ESPN's public API. By the end, you'll have a dataset of hundreds of thousands of rows — each one a snapshot of a game at a specific moment (score, period, time remaining, ESPN's own win probability) — with the final outcome attached as a label.

This dataset is the foundation for everything else in the course: training win-probability models, calibrating edge thresholds, backtesting strategies, and ultimately running a live bot on Polymarket.

Why ESPN?

  • **Free** — No API key, no authentication, no rate-limit headers. Just HTTP GET requests.
  • **Real-time** — The same endpoints power ESPN's live scoreboard, so they update within seconds of real events.
  • **Comprehensive** — Covers NBA, NCAAMB, NHL, MLB, NFL, CFB, soccer leagues, and more.
  • **Win probability included** — For basketball and football, ESPN returns a `winprobability` array alongside the play-by-play, giving us a free baseline model to benchmark against.
What the Output Looks Like

    Each row in our final dataset represents one play/moment in a game:

    | game_id | sport | home_team | away_team | home_score | away_score | period | seconds_remaining | score_diff | time_fraction | espn_home_wp | home_wins |
    |---------|-------|-----------|-----------|------------|------------|--------|-------------------|------------|---------------|--------------|-----------|
    | 401584793 | NBA | BOS | MIA | 28 | 22 | 2 | 1380.0 | 6 | 0.479 | 0.712 | 1 |
    | 401584793 | NBA | BOS | MIA | 30 | 25 | 2 | 1320.0 | 5 | 0.458 | 0.695 | 1 |
    | 401584793 | NBA | BOS | MIA | 30 | 28 | 2 | 1260.0 | 2 | 0.437 | 0.621 | 1 |

    **Key columns:**

  • `score_diff` — Home score minus away score (positive = home leading)
  • `time_fraction` — Fraction of game remaining (1.0 = start, 0.0 = end)
  • `espn_home_wp` — ESPN's own win probability for the home team (our baseline)
  • `home_wins` — Ground truth label (1 = home won, 0 = away won)
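The derived columns above can be computed from the raw fields in a few lines. The sketch below is illustrative and is not the notebook's own implementation: `add_derived_columns` and `REGULATION_SECONDS` are hypothetical names, the 2,880-second NBA regulation length (4 × 12 minutes) ignores overtime, and the final score is made up for the example.

```python
# Hypothetical helper for illustration; not taken from the notebook.
REGULATION_SECONDS = {"NBA": 4 * 12 * 60}  # 2880 s of regulation; overtime ignored


def add_derived_columns(row: dict, final_home: int, final_away: int) -> dict:
    """Return a copy of `row` with score_diff, time_fraction and home_wins added."""
    total = REGULATION_SECONDS[row["sport"]]
    out = dict(row)
    # Positive when the home team leads
    out["score_diff"] = row["home_score"] - row["away_score"]
    # 1.0 at tip-off, 0.0 at the final buzzer
    out["time_fraction"] = round(row["seconds_remaining"] / total, 3)
    # Label: 1 if the home team won the game, else 0
    out["home_wins"] = int(final_home > final_away)
    return out


snapshot = {
    "sport": "NBA",
    "home_score": 28,
    "away_score": 22,
    "seconds_remaining": 1380.0,
}
# The final score here is invented purely to produce a label.
result = add_derived_columns(snapshot, final_home=110, final_away=102)
print(result["score_diff"], result["time_fraction"], result["home_wins"])
```

With these inputs the derived values match the first row of the table: a score difference of 6, a time fraction of 0.479, and a label of 1.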
    # ── Install required packages (run this cell first!) ──────────────────────────
    # Uncomment the line below and run if you haven't installed these yet:
    # !pip install aiohttp pandas numpy matplotlib pyarrow tqdm nest_asyncio
    

    How to use AI with this notebook

    **New to Python? No problem.** Every cell in this notebook is designed to work with AI coding assistants.

    If you get stuck on any cell:

    1. **Copy the cell** into Claude, ChatGPT, or any AI assistant

    2. **Ask:** "Explain this code line by line"

    3. **To customize:** "Help me modify this for soccer instead of NBA"

    4. **To debug:** Paste the error message and ask "How do I fix this?"

    5. **To extend:** "Add a feature that tracks home/away win streaks"

    Think of the AI as a patient tutor sitting next to you. The notebooks give you working code — the AI helps you understand and extend it.

    > **Pro tip:** If a cell is confusing, ask the AI: "Explain this to me like I've never written Python before." It will break down every line.

    Setup

    Install dependencies if needed:

    # Uncomment and run if you need to install packages
    # !pip install aiohttp pandas pyarrow nest_asyncio
    import aiohttp
    import asyncio
    import json
    import time
    from datetime import datetime, timedelta
    from typing import Dict, List, Optional, Tuple
    from pathlib import Path
    
    import pandas as pd
    import nest_asyncio
    
    # Allow running async code in Jupyter (which already has an event loop)
    nest_asyncio.apply()
    
    print("All imports OK")

    Why async?

    ESPN has thousands of games across multiple seasons. A normal `requests.get()` loop waits for each response before sending the next — painfully slow.

    **Async** sends multiple requests at once. Think of it like ordering 10 pizzas by calling 10 restaurants simultaneously, instead of calling one, waiting for delivery, then calling the next. We'll scrape 60,000+ games in minutes instead of hours.

    > **New to async?** Don't worry. Paste any async code cell into Claude or ChatGPT and ask "explain this line by line." The pattern is always the same: `async with session.get(url) as resp`.
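To make the "many requests at once" idea concrete, here is a minimal sketch of bounded concurrency using only the standard library. It is an assumption-laden illustration rather than the course's scraper: `fetch_all`, `fake_fetch`, and the `max_concurrency` default are hypothetical, and `fake_fetch` sleeps instead of touching the network so the example runs anywhere.

```python
import asyncio
from typing import Awaitable, Callable, List


async def fetch_all(
    urls: List[str],
    fetch_one: Callable[[str], Awaitable[dict]],
    max_concurrency: int = 10,
) -> List[dict]:
    """Run fetch_one over every URL, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url: str) -> dict:
        # The semaphore caps how many fetches run at the same time
        async with semaphore:
            return await fetch_one(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


async def fake_fetch(url: str) -> dict:
    """Stand-in for a real HTTP call: waits briefly, then echoes the URL."""
    await asyncio.sleep(0.01)
    return {"url": url}
```

In a real scraper, `fetch_one` would presumably wrap the `async with session.get(url) as resp` pattern mentioned above. Capping concurrency with a semaphore is one common way to avoid flooding a server; the right limit for ESPN is not stated in this preview.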

    ---

    1. ESPN API Discovery

    ESPN exposes a public JSON API that powers their website and mobile app. No authentication is needed — you just hit a URL and get JSON back.

    Scoreboard Endpoints

    Each sport has a scoreboard endpoint that returns **all games for a given day**:

    | Sport | Endpoint |
    |-------|----------|
    | NBA | `https://site.api.espn.com/apis/site/v2/sports/basketball/nba/scoreboard` |
    | NCAAMB | `https://site.api.espn.com/apis/site/v2/sports/basketball/mens-college-basketball/scoreboard?groups=50&limit=300` |
    | NHL | `https://site.api.espn.com/apis/site/v2/sports/hockey/nhl/scoreboard` |
    | NFL | `https://site.api.espn.com/apis/site/v2/sports/football/nfl/scoreboard` |
    | CFB | `https://site.api.espn.com/apis/site/v2/sports/football/college-football/scoreboard` |
    | MLB | `https://site.api.espn.com/apis/site/v2/sports/baseball/mlb/scoreboard` |

    Key Parameters
  • **`?dates=YYYYMMDD`** — Fetch a specific date's games (default = today)
  • **`?groups=50&limit=300`** — For NCAAMB, fetches Division I games (group 50) with a limit high enough to return all of them in one response
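The parameters above combine into one URL per sport per day. The following sketch shows one way to build those URLs for a date range. The helper names `scoreboard_url` and `daily_urls` are hypothetical, and only the base URL, the sport paths, and the `dates`, `groups`, and `limit` parameters come from the tables above.

```python
from datetime import date, timedelta
from typing import Iterator, Optional
from urllib.parse import urlencode

BASE = "https://site.api.espn.com/apis/site/v2/sports"


def scoreboard_url(sport_path: str, day: date, extra: Optional[dict] = None) -> str:
    """Build a scoreboard URL for one sport and one day (dates=YYYYMMDD)."""
    params = {"dates": day.strftime("%Y%m%d")}
    params.update(extra or {})  # e.g. {"groups": 50, "limit": 300} for NCAAMB
    return f"{BASE}/{sport_path}/scoreboard?{urlencode(params)}"


def daily_urls(
    sport_path: str, start: date, end: date, extra: Optional[dict] = None
) -> Iterator[str]:
    """Yield one scoreboard URL per day from start to end, inclusive."""
    day = start
    while day <= end:
        yield scoreboard_url(sport_path, day, extra)
        day += timedelta(days=1)


nba = scoreboard_url("basketball/nba", date(2024, 1, 15))
ncaamb = scoreboard_url(
    "basketball/mens-college-basketball",
    date(2024, 1, 15),
    {"groups": 50, "limit": 300},
)
print(nba)
print(ncaamb)
```

Generating the URL list up front keeps the date logic separate from the network code, so the same list can be passed to whatever fetch loop the scraper uses.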
    Summary Endpoints

    For a single game's play-by-play and win probability:

    ```
    https://site.api.espn.com/apis/site/v2/sports/basketball/nba/summary?event={game_id}
    ```
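As a rough sketch of how a summary response might be consumed, the code below fills in the `{game_id}` placeholder and pulls home win probabilities out of the `winprobability` array. The field names `playId` and `homeWinPercentage` are assumptions that should be checked against a real response, `home_wp_by_play` is a hypothetical helper, and `sample` is a hand-written stand-in payload, not real ESPN data.

```python
from typing import Dict

SUMMARY_URL = (
    "https://site.api.espn.com/apis/site/v2/sports/basketball/nba/summary"
    "?event={game_id}"
)


def home_wp_by_play(summary: dict) -> Dict[str, float]:
    """Map play id -> home win probability; empty if the array is missing."""
    # ASSUMPTION: each entry has "playId" and "homeWinPercentage" keys.
    # Entries lacking either key are skipped rather than raising.
    return {
        str(entry["playId"]): float(entry["homeWinPercentage"])
        for entry in summary.get("winprobability", [])
        if "playId" in entry and "homeWinPercentage" in entry
    }


# Hand-written stand-in payload for illustration only.
sample = {
    "winprobability": [
        {"playId": "1", "homeWinPercentage": 0.712},
        {"playId": "2", "homeWinPercentage": 0.695},
    ]
}

url = SUMMARY_URL.format(game_id="401584793")
wp = home_wp_by_play(sample)
print(url)
print(wp)
```

Using `.get("winprobability", [])` keeps the helper safe for sports where the array is not returned, since the module notes it is provided for basketball and football.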

    Let's start by fetching one day of NBA games to see the raw structure.

    This is 8 of 45 cells. The full module continues with hands-on exercises and working code.
