NCAA March Machine Learning Mania 2026 — Data Documentation
This pipeline collects, normalizes, and merges data from 11 distinct sources to produce a unified feature store for NCAA tournament prediction. Each team-season row contains 53 raw features spanning efficiency metrics, ranking consensus, box score aggregates, coaching history, and more.
The data covers 22 training seasons (2003-2025, excluding 2020 which had no tournament) with approximately 342 Division I teams per season. Sources range from stable Kaggle competition files (available since 1985) to live-scraped real-time data (prediction markets, injury reports) that exist only for the current season.
Key design principles:
*_live.csv pointer to the latest version. This enables historical auditing and debugging.
The pipeline has five stages:
src/fetch_*.py) scrape external sources via Playwright (BPI, Massey, injuries, brackets), REST APIs (NCAA games, odds), or Python libraries (kenpompy). Kaggle provides static competition files.
kenpom_2026-03-06_1742.csv) plus a *_live.csv symlink to the latest. Historical KenPom data lives in kenpom.db (SQLite); player stats in player_stats.db.
resolve_teams.py. Elo ratings are computed from all regular-season games since 1985. Injury reports are converted to AdjEM adjustments via the Win Shares replacement model.
build_feature_store.py joins all sources into one row per (Season, TeamID) with 53 columns. Output: team_season_features.csv (7,872 rows).
matchup_train.csv. The ensemble (70% LightGBM + 30% Logistic Regression) trains on these diffs. See Model Documentation for details.
Provides: AdjOE (adjusted offensive efficiency), AdjDE (adjusted defensive efficiency), AdjTempo, AdjEM (= AdjOE - AdjDE), national rankings, Luck, and SOS (strength of schedule).
Method: kenpompy Python library with authenticated login. Credentials stored in .env.
Why it matters: AdjEM is the single most predictive team-strength metric in college basketball. Tempo-adjusted efficiency margins remove pace effects that distort raw scoring stats. A team that scores 80 points per game but plays at a fast tempo is not necessarily better than one scoring 65 at a slow pace — AdjEM normalizes this to points per 100 possessions.
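The pace normalization described above can be sketched in a few lines. This is a minimal illustration of raw per-100-possession efficiency only; it does not reproduce KenPom's opponent adjustment, and the possession counts are hypothetical numbers chosen for the example.

```python
def per_100_possessions(points: float, possessions: float) -> float:
    """Convert a raw points total to points per 100 possessions."""
    return 100.0 * points / possessions


def efficiency_margin(off_eff: float, def_eff: float) -> float:
    """Margin in the AdjEM style: offensive minus defensive efficiency."""
    return off_eff - def_eff


# Hypothetical per-game numbers (possession counts are assumed, not real data).
fast_team = per_100_possessions(points=80, possessions=75)
slow_team = per_100_possessions(points=65, possessions=60)

print(f"fast-paced team: {fast_team:.1f} points per 100 possessions")
print(f"slow-paced team: {slow_team:.1f} points per 100 possessions")
```

With these assumed possession counts, the slower team is the more efficient offense even though it scores 15 fewer points per game.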
Provides: BPI power ranking (1-353).
Method: Playwright headless Chrome scrapes ESPN's JavaScript-rendered BPI page. Uses --disable-blink-features=AutomationControlled and custom user agent to avoid bot detection.
Why it matters: BPI uses a different methodology from KenPom (proprietary ESPN model vs. Pomeroy's tempo-free efficiency). When BPI and KenPom disagree about a team (bpi_kenpom_divergence), that divergence signal is the 5th most important feature in the model. It captures information that neither system alone provides.
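The exact definition of bpi_kenpom_divergence is not given in this document. The sketch below assumes it is a simple rank difference (BPI rank minus KenPom rank) and returns NaN when either rank is missing, so that a tree model with native NaN handling can still consume the column. The function name is hypothetical.

```python
import math
from typing import Optional


def rank_divergence(bpi_rank: Optional[int], kenpom_rank: Optional[int]) -> float:
    """Assumed definition: BPI rank minus KenPom rank.

    Negative values mean BPI rates the team higher (a smaller rank number)
    than KenPom does. Missing inputs propagate as NaN.
    """
    if bpi_rank is None or kenpom_rank is None:
        return math.nan
    return float(bpi_rank - kenpom_rank)


print(rank_divergence(12, 20))    # BPI is more optimistic than KenPom
print(rank_divergence(None, 20))  # season without BPI coverage
```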
bpi_rank and bpi_kenpom_divergence are null. LightGBM handles this natively.
Provides: Rankings from 56 independent computer rating systems including POM (Pomeroy), MOR (Massey), SAG (Sagarin), COL (Colley), DOL (Dolphin), WLK (Wolfe), AP (Associated Press poll), USA (Coaches poll), and 48 others.
Method: Playwright scrapes the masseyratings.com/ranks composite page and triggers its CSV export button. Only adds systems that already exist in the Kaggle baseline file to maintain consistency.
The model uses 8 selected systems to compute:
massey_mean_rank: average rank across the 8 systems (consensus strength)
massey_std_rank: standard deviation of ranks (system disagreement / "controversy")
massey_best_rank: best rank across the 8 systems (ceiling)
pom_rank: Pomeroy rank specifically (highest-quality single system)
Why it matters: Consensus across independent ranking systems is more robust than any single system. The massey_std_rank captures how "controversial" a team is. When systems disagree, there is genuine uncertainty about team strength, and the model can price that uncertainty into its predictions.
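A minimal sketch of how the four Massey features could be derived for one team. Two things are assumptions here: the document does not say which 8 systems are selected, so SELECTED_SYSTEMS simply reuses the eight codes named above, and the sample standard deviation is used because the document does not specify sample versus population. The ranks are made-up values.

```python
from statistics import mean, stdev

# Assumption: the 8 selected systems are not listed in the documentation;
# these are the eight system codes it happens to name.
SELECTED_SYSTEMS = ("POM", "MOR", "SAG", "COL", "DOL", "WLK", "AP", "USA")


def massey_features(ranks: dict) -> dict:
    """Aggregate one team's ordinal ranks across the selected systems."""
    used = [ranks[s] for s in SELECTED_SYSTEMS if s in ranks]
    return {
        "massey_mean_rank": mean(used) if used else None,
        # Sample standard deviation needs at least two systems.
        "massey_std_rank": stdev(used) if len(used) > 1 else None,
        "massey_best_rank": min(used) if used else None,  # lower rank is better
        "pom_rank": ranks.get("POM"),
    }


# Hypothetical ranks for a single team on a single ranking day.
example = {"POM": 3, "MOR": 5, "SAG": 4, "COL": 8,
           "DOL": 6, "WLK": 5, "AP": 2, "USA": 7}
feats = massey_features(example)
print(feats)
```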
Provides: Projected seeds and regions from 4 sources: ESPN (Joe Lunardi), CBS Sports, RotoWire, and TeamRankings.
Method: Playwright scrapes each source's bracket page, parses team names and seed projections, merges by team name using CANONICAL_NAMES mapping dict.
Used for: Pre-Selection Sunday field construction and region assignments. Before the official bracket is released, consensus bracketology determines which teams enter the simulation and which region they are placed in.
Provides: Championship probabilities from Kalshi (prediction market), Polymarket (prediction market), and 5 sharp sportsbooks via The Odds API.
Method: REST APIs. Kalshi and Polymarket have public endpoints; sportsbook odds are aggregated via The Odds API (requires API key). Sportsbook odds are devigged using Shin's method (removes the bookmaker's built-in margin to recover implied probabilities).
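The devigging step can be sketched as follows. This uses one common formulation of Shin's method, solving for the parameter z by bisection; it is an illustration, not the pipeline's own code. The function name, the use of decimal odds as input, and the example prices are assumptions.

```python
import math


def shin_devig(decimal_odds, tol=1e-12, max_iter=200):
    """Recover probabilities from decimal odds using Shin's method.

    Each price must be greater than 1.0. If the book has no overround,
    the raw implied probabilities are simply normalized.
    """
    raw = [1.0 / o for o in decimal_odds]   # implied probabilities with margin
    booksum = sum(raw)                      # greater than 1 when there is a margin
    if booksum <= 1.0:
        return [p / booksum for p in raw]
    a = [p * p / booksum for p in raw]

    def probs(z):
        # Numerically stable form of
        # (sqrt(z^2 + 4(1-z)a) - z) / (2(1-z)).
        return [2.0 * ai / (math.sqrt(z * z + 4.0 * (1.0 - z) * ai) + z)
                for ai in a]

    lo, hi = 0.0, 1.0                       # the probability sum decreases in z
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        if sum(probs(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return probs((lo + hi) / 2.0)


# Hypothetical three-outcome market.
fair = shin_devig([1.5, 3.0, 6.0])
print([round(p, 4) for p in fair], round(sum(fair), 6))
```

Unlike plain normalization, this approach takes proportionally more margin out of longshots than out of favorites, which matters for championship futures where most teams are priced as longshots.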
Used for: Post-hoc model-market comparison and identifying potential value picks. Market data is not used as model features — only for analysis after predictions are generated.
Provides: Player injury status (Out, Out For Season, Day-to-Day, Doubtful, etc.) for all D1 teams.
Method: Playwright scrapes RotoWire's college basketball injury report page, capturing the full table via JavaScript extraction.
Downstream processing: Injured players are fed into the Win Shares replacement model (src/adjust_injuries_ws.py):
Provides: Game scores and full box scores (FGM, FGA, FGM3, FGA3, FTM, FTA, OR, DR, Ast, TO, Stl, Blk, PF).
Method: REST API via ncaa-api.henrygd.me (free proxy for ncaa.com data). Fetches scoreboards by date, then box scores per game. Prioritizes Kaggle data — only fills gaps for dates beyond existing Kaggle coverage.
Rate limit: 5 req/sec hard limit. Script uses 0.3s delay between requests with retry-on-timeout logic.
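A minimal sketch of the throttle-and-retry pattern described above. The fetch callable, the retry count, and the function name are hypothetical; a real HTTP client raises its own timeout exception type, so the except clause would need adapting.

```python
import time


def fetch_with_retry(fetch, url, delay=0.3, max_retries=3, sleep=time.sleep):
    """Call fetch(url), pausing before every attempt and retrying on timeout.

    A 0.3 s pause keeps the request rate near 3.3 req/sec, under the
    5 req/sec limit. The final timeout is re-raised to the caller.
    """
    for attempt in range(1, max_retries + 1):
        sleep(delay)
        try:
            return fetch(url)
        except TimeoutError:
            if attempt == max_retries:
                raise


# Usage with a stand-in fetcher that times out twice before succeeding.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return {"ok": True}


waits = []
result = fetch_with_retry(flaky_fetch, "scoreboard-url", sleep=waits.append)
print(result, len(calls), waits)
```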
Why it matters: Kaggle updates its competition data periodically but not daily. Between updates, the NCAA API fills the gap so box score aggregates (four factors, win %, margin) stay current.
Provides: Per-player advanced stats (Win Shares, WS/40, BPM, PER), per-game averages (PPG, RPG, APG, MPG), and team totals.
Method: HTTP scrapes Sports Reference (BeautifulSoup parsing). Each team page is fetched and cached in SQLite.
Rate limit: 4 seconds between requests (strict Sports Reference requirement). On 429 responses the script backs off for 65 s × the attempt number. Bulk historical collection is very slow as a result.
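Because every uncached page costs at least 4 seconds, the cache lookup is worth doing before any request is made. The sketch below shows that fetch-once pattern with the standard sqlite3 module. The table schema, function name, and fetch callable are hypothetical; the actual layout of player_stats.db is not described here.

```python
import sqlite3


def get_team_page(conn, team, season, fetch):
    """Return the cached page for (team, season), fetching only on a miss."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "team TEXT, season INTEGER, html TEXT, PRIMARY KEY (team, season))"
    )
    row = conn.execute(
        "SELECT html FROM pages WHERE team = ? AND season = ?", (team, season)
    ).fetchone()
    if row is not None:
        return row[0]               # cache hit: no request, no delay
    html = fetch(team, season)      # cache miss: the rate-limited path
    conn.execute(
        "INSERT INTO pages (team, season, html) VALUES (?, ?, ?)",
        (team, season, html),
    )
    conn.commit()
    return html


# Usage with an in-memory database and a stand-in fetcher.
fetch_log = []


def fake_fetch(team, season):
    fetch_log.append((team, season))
    return f"<html>{team} {season}</html>"


conn = sqlite3.connect(":memory:")
first = get_team_page(conn, "connecticut", 2025, fake_fetch)
second = get_team_page(conn, "connecticut", 2025, fake_fetch)
print(first == second, len(fetch_log))
```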
Used for: Injury impact model (comparing injured player's WS/40 to team-specific replacement level) and player-level features (depth, usage, star dependency).
Files:
MTeams.csv, MSeasons.csv, MTeamSpellings.csv: team metadata and name variants
MNCAATourneySeeds.csv, MNCAATourneyCompactResults.csv: tournament structure and outcomes
MRegularSeasonCompactResults.csv, MRegularSeasonDetailedResults.csv: regular-season games (gap-filled by NCAA API)
MMasseyOrdinals.csv: baseline Massey data (gap-filled by fetch_massey.py)
MTeamCoaches.csv, MTeamConferences.csv: coaching and conference assignments
WNCAATourneyCompactResults.csv: women's tournament results (needed for competition scoring)
Why it matters: Kaggle data is the backbone of the pipeline. It provides the training labels (who won each tournament game) and the longest historical coverage. All other sources augment what Kaggle provides.
Computed from: MRegularSeasonCompactResults.csv dating back to 1985. Processes every regular-season game chronologically.
Parameters:
At each season boundary, ratings regress 25% toward the mean (1500). The margin-of-victory multiplier prevents blowout games from having disproportionate impact while still rewarding dominant wins.
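The season-boundary regression is fully specified above and is easy to sketch. The rest of the update rule is not: the K-factor and the margin-of-victory multiplier below are placeholders chosen for illustration (a logarithmic multiplier is one common way to damp blowouts), not the pipeline's actual parameters.

```python
import math

MEAN_RATING = 1500.0
CARRYOVER = 0.75   # ratings regress 25% toward the mean each season
K = 20.0           # hypothetical K-factor, not taken from the pipeline


def regress_to_mean(rating: float) -> float:
    """Apply the season-boundary regression toward 1500."""
    return MEAN_RATING + CARRYOVER * (rating - MEAN_RATING)


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo win expectancy for team A against team B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def mov_multiplier(margin: int) -> float:
    """Hypothetical multiplier: grows with margin, but sublinearly."""
    return math.log(abs(margin) + 1.0)


def update(winner: float, loser: float, margin: int):
    """Return the new (winner, loser) ratings after one game."""
    delta = K * mov_multiplier(margin) * (1.0 - expected_score(winner, loser))
    return winner + delta, loser - delta


print(regress_to_mean(1600.0))        # a strong team moves toward 1500
print(update(1500.0, 1500.0, 10))     # 10-point win between equal teams
print(update(1500.0, 1500.0, 30))     # 30-point win: larger, but not 3x
```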
Why it matters: Elo is the #1 feature by importance in the LightGBM model. It captures cumulative game-by-game performance across the entire season, providing a different signal from pre-tournament snapshots like KenPom. See Elo deep-dive for full documentation.
Provides:
MTeamCoaches.csv). Longer tenure correlates with program stability and recruiting continuity.
Team name resolution is the single biggest recurring complexity in this pipeline. Every external source uses different team names:
| Source | Example Name | Mapping Method | Approx. Overrides |
|---|---|---|---|
| KenPom | "Connecticut" | kenpompy internal IDs | built-in |
| ESPN BPI | "UConn Huskies" | ESPN_BPI_NAME_OVERRIDES dict | ~30 |
| Massey | "Connecticut" | resolve_teams.get_resolver() | fuzzy match |
| NCAA API | "CONN" | resolve_teams.get_resolver() | fuzzy match |
| RotoWire | "UConn" | ROTOWIRE_TO_KENPOM dict | ~40 |
| Brackets | "UConn" | CANONICAL_NAMES dict | ~80 |
| Odds/Markets | "Connecticut Huskies" | team_lookup_mm.csv | ~70 |
| Sports Ref | "connecticut" | KENPOM_TO_SREF dict | ~50 |
| Kaggle | TeamID 1163 | canonical target | — |
The central resolver is resolve_teams.py, which loads all known spellings from MTeamSpellings.csv (500+ name variants mapping to Kaggle TeamIDs) and supports fuzzy matching for names not explicitly listed. Newer fetchers (fetch_massey.py, fetch_ncaa_games.py) use resolve_teams.get_resolver(). Older fetchers have their own hardcoded mapping dicts, totaling 500+ explicit name overrides across all scripts.
resolve_teams.get_resolver() which handles fuzzy matching automatically.
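The lookup strategy described above (exact match on known spellings, then a fuzzy fallback) can be sketched with the standard library. This is not the implementation of resolve_teams.get_resolver(), whose internals are not shown here; make_resolver, the cutoff value, and the three-entry spelling table are hypothetical stand-ins for the contents of MTeamSpellings.csv.

```python
from difflib import get_close_matches


def make_resolver(spellings: dict, cutoff: float = 0.8):
    """Build a name -> TeamID resolver from a {spelling: TeamID} mapping."""
    table = {name.strip().lower(): team_id for name, team_id in spellings.items()}

    def resolve(name: str):
        key = name.strip().lower()
        if key in table:                       # exact match on a known spelling
            return table[key]
        close = get_close_matches(key, list(table), n=1, cutoff=cutoff)
        return table[close[0]] if close else None

    return resolve


# Illustrative sample only; the real spelling file has 500+ variants.
resolve = make_resolver({"Connecticut": 1163, "UConn": 1163, "Duke": 1181})

print(resolve("UConn"))          # exact match
print(resolve("conecticut"))     # typo caught by fuzzy matching
print(resolve("UConn Huskies"))  # too different: needs an explicit override
```

The last call returns None, which illustrates why sources that append mascots or use abbreviations still rely on explicit override dicts.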
build_feature_store.py joins all data sources by (Season, TeamID) to produce a single wide table with 53 feature columns per team-season.
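A minimal sketch of the join on (Season, TeamID), written with plain dictionaries so it runs without dependencies. The function name, column names, and values are hypothetical. The point is the left-join behavior: a source with no row for a key contributes nulls rather than dropping the team-season.

```python
def join_sources(keys, sources):
    """Left-join every source onto the base list of (Season, TeamID) keys.

    Each source is a (columns, table) pair, where table maps
    (Season, TeamID) to a dict of column values.
    """
    rows = []
    for season, team_id in keys:
        row = {"Season": season, "TeamID": team_id}
        for columns, table in sources:
            values = table.get((season, team_id), {})
            for col in columns:
                row[col] = values.get(col)   # None when the source has no row
        rows.append(row)
    return rows


# Hypothetical values: one source covers both seasons, the other only 2014.
keys = [(2013, 1163), (2014, 1163)]
kenpom = (["adj_em"], {(2013, 1163): {"adj_em": 12.5},
                       (2014, 1163): {"adj_em": 21.0}})
bpi = (["bpi_rank"], {(2014, 1163): {"bpi_rank": 9}})

features = join_sources(keys, [kenpom, bpi])
for row in features:
    print(row)
```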
team_season_features.csv (7,872 rows x 55 columns including Season and TeamID).
The heatmap shows the percentage of teams with non-null values for each feature in each season. Key observations:
Last update times for live data sources (as of page generation):
| Source | Last Updated | Age | File |
|---|---|---|---|
| KenPom | 2026-03-06 22:25 | 11 min ago | data/kenpom/kenpom_live.csv |
| ESPN BPI | 2026-03-06 22:26 | 11 min ago | data/bpi/bpi_2026.csv |
| Massey Ordinals | 2026-03-06 22:26 | 11 min ago | data/kaggle/MMasseyOrdinals.csv |
| Injuries | 2026-03-06 09:33 | 13 hours ago | data/injuries/injuries_live.csv |
| Odds | 2026-03-06 22:26 | 11 min ago | data/odds/odds_live.csv |
| Feature Store | 2026-03-06 22:27 | 10 min ago | data/features/team_season_features.csv |
| Matchup Training | 2026-03-06 22:27 | 10 min ago | data/features/matchup_train.csv |
| Bracketology | 2026-03-06 09:33 | 13 hours ago | data/brackets/bracketology_2026-03-06.csv |
run_pipeline.py orchestrates hourly updates for KenPom, BPI, Massey, and odds. Bracketology and injury reports are scraped on-demand. The feature store and matchup training data are regenerated after each data update cycle.
| Limitation | Impact | Mitigation |
|---|---|---|
| BPI only from 2014 | 11 of 22 training seasons have BPI data; 11 do not | LightGBM handles NaN natively. The model learns BPI's predictive value where available and ignores it where missing. |
| Injury data only from 2024 | Very limited training signal for injury adjustments (2-3 seasons) | Injury adjustments modify AdjEM directly rather than entering as a separate feature. The Win Shares replacement model is calibrated on player-level data, not tournament outcomes. |
| Sports Reference rate limits | Bulk historical player stat collection is extremely slow (4s/page + backoff) | SQLite cache (player_stats.db) persists data across runs. Once a team-season is fetched, it never needs re-fetching. |
| Kalshi 1% tick floor | Tail teams (true prob << 1%) all display 1.0%, distorting model-market comparisons | Filter out Kalshi comparisons at the 1% floor. Suppress signal for teams where CI upper bound < 1%. |
| Massey scraping fragility | masseyratings.com is a single-maintainer site; page structure can change | Playwright-based scraper with generous timeouts. Falls back gracefully if CSV export fails. |
| No women's KenPom/BPI | Cannot build equivalent features for women's tournament teams | None currently. This is the largest potential improvement area for competition scoring. |
| 2020 season (COVID) | No tournament was held; season excluded from training | Season is simply skipped. 22 training seasons instead of 23 for that year range. |
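The Kalshi mitigation in the table above amounts to two filters, sketched below. The field names (kalshi_prob, model_ci_upper) and the sample rows are hypothetical; only the 1% floor comes from the table.

```python
KALSHI_FLOOR = 0.01  # minimum displayable price (1% tick)


def comparable_rows(rows, floor=KALSHI_FLOOR):
    """Keep only rows where a model-vs-Kalshi comparison is meaningful.

    Drops teams priced at the floor (the true probability may be far lower)
    and teams whose model confidence-interval upper bound is below the floor.
    """
    return [
        r for r in rows
        if r["kalshi_prob"] > floor and r["model_ci_upper"] >= floor
    ]


# Hypothetical comparison rows.
rows = [
    {"team": "A", "kalshi_prob": 0.12, "model_ci_upper": 0.15},
    {"team": "B", "kalshi_prob": 0.01, "model_ci_upper": 0.02},    # at the floor
    {"team": "C", "kalshi_prob": 0.02, "model_ci_upper": 0.005},   # CI below 1%
]
kept = comparable_rows(rows)
print([r["team"] for r in kept])
```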