Data Sources & Collection Pipeline

NCAA March Machine Learning Mania 2026 — Data Documentation

At a glance:

- 11 data sources
- 53 raw features per team-season
- 22 training seasons (2003-2025)
- ~342 teams per season

1. Overview

This pipeline collects, normalizes, and merges data from 11 distinct sources to produce a unified feature store for NCAA tournament prediction. Each team-season row contains 53 raw features spanning efficiency metrics, ranking consensus, box score aggregates, coaching history, and more.

The data covers 22 training seasons (2003-2025, excluding 2020 which had no tournament) with approximately 342 Division I teams per season. Sources range from stable Kaggle competition files (available since 1985) to live-scraped real-time data (prediction markets, injury reports) that exist only for the current season.


2. Pipeline Architecture

(Figure: pipeline architecture diagram)

The pipeline has five stages:

1. Raw Sources: Fetcher scripts (src/fetch_*.py) scrape external sources via Playwright (BPI, Massey, injuries, brackets), REST APIs (NCAA games, odds), or Python libraries (kenpompy). Kaggle provides static competition files.
2. Storage: Each fetcher saves timestamped CSVs (e.g., kenpom_2026-03-06_1742.csv) plus a *_live.csv symlink to the latest. Historical KenPom data lives in kenpom.db (SQLite); player stats in player_stats.db.
3. Transform: Team names are normalized to Kaggle TeamIDs via resolve_teams.py. Elo ratings are computed from all regular-season games since 1985. Injury reports are converted to AdjEM adjustments via the Win Shares replacement model.
4. Feature Store: build_feature_store.py joins all sources into one row per (Season, TeamID) with 53 columns. Output: team_season_features.csv (7,872 rows).
5. Model: matchup differences (Team A - Team B) are computed for all historical tournament games, producing matchup_train.csv. The ensemble (70% LightGBM + 30% Logistic Regression) trains on these diffs. See Model Documentation for details.

3. Data Sources

(Figure: data source coverage timeline)

a. KenPom Ratings

Script: src/fetch_kenpom.py | Storage: data/kenpom/kenpom_live.csv + historical summary{YY}.csv | Coverage: 2003-2026 (~350 teams/season)

Provides: AdjOE (adjusted offensive efficiency), AdjDE (adjusted defensive efficiency), AdjTempo, AdjEM (= AdjOE - AdjDE), national rankings, Luck, and SOS (strength of schedule).

Method: kenpompy Python library with authenticated login. Credentials stored in .env.

Why it matters: AdjEM is the single most predictive team-strength metric in college basketball. Tempo-adjusted efficiency margins remove pace effects that distort raw scoring stats. A team that scores 80 points per game but plays at a fast tempo is not necessarily better than one scoring 65 at a slow pace — AdjEM normalizes this to points per 100 possessions.
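The tempo normalization above can be checked with a small worked example (the per-game numbers are hypothetical, and the helper function is illustrative, not part of the pipeline):

```python
def points_per_100(points, possessions):
    """Raw efficiency: points scored per 100 possessions."""
    return 100.0 * points / possessions

# Hypothetical per-game averages: a fast-paced team and a slow-paced team.
fast = points_per_100(points=80, possessions=75)   # 80 ppg at 75 poss/game
slow = points_per_100(points=65, possessions=60)   # 65 ppg at 60 poss/game

print(round(fast, 1))  # 106.7
print(round(slow, 1))  # 108.3
```

Despite scoring 15 fewer points per game, the slow team is actually the more efficient offense once pace is removed, which is exactly the distortion AdjEM corrects for.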

b. ESPN BPI

Script: src/fetch_bpi.py | Storage: data/bpi/bpi_{season}.csv | Coverage: 2014-2026 only

Provides: BPI power ranking (1-353).

Method: Playwright headless Chrome scrapes ESPN's JavaScript-rendered BPI page. Uses --disable-blink-features=AutomationControlled and custom user agent to avoid bot detection.

Why it matters: BPI uses a different methodology from KenPom (proprietary ESPN model vs. Pomeroy's tempo-free efficiency). When BPI and KenPom disagree about a team (bpi_kenpom_divergence), that divergence signal is the 5th most important feature in the model. It captures information that neither system alone provides.

Coverage gap: BPI only exists from 2014. For 2003-2013 (11 of 22 training seasons), bpi_rank and bpi_kenpom_divergence are null. LightGBM handles this natively.

c. Massey Ordinals

Script: src/fetch_massey.py | Storage: appended to data/kaggle/MMasseyOrdinals.csv | Coverage: 2003-2026 (56 ranking systems)

Provides: Rankings from 56 independent computer rating systems including POM (Pomeroy), MOR (Massey), SAG (Sagarin), COL (Colley), DOL (Dolphin), WLK (Wolfe), AP (Associated Press poll), USA (Coaches poll), and 48 others.

Method: Playwright scrapes the masseyratings.com/ranks composite page and triggers its CSV export button. Only adds systems that already exist in the Kaggle baseline file to maintain consistency.

The model uses 8 selected systems to compute consensus rank features, such as the across-system spread (massey_std_rank).

Why it matters: Consensus across independent ranking systems is more robust than any single system. The massey_std_rank captures how "controversial" a team is — when systems disagree, there is genuine uncertainty about team strength, and the model can price that uncertainty into its predictions.
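A sketch of how such consensus features can be computed from the long-format ordinals file. The three-system subset and the massey_mean_rank name are illustrative (the pipeline selects 8 systems); the column layout follows Kaggle's MMasseyOrdinals.csv:

```python
import pandas as pd

# Long-format ordinals as in Kaggle's MMasseyOrdinals.csv.
ordinals = pd.DataFrame({
    "Season":      [2024] * 6,
    "SystemName":  ["POM", "SAG", "MOR", "POM", "SAG", "MOR"],
    "TeamID":      [1163, 1163, 1163, 1211, 1211, 1211],
    "OrdinalRank": [1, 2, 1, 10, 30, 20],
})

SYSTEMS = ["POM", "SAG", "MOR"]  # the pipeline uses 8; 3 shown here
sel = ordinals[ordinals["SystemName"].isin(SYSTEMS)]

# One row per (Season, TeamID): consensus level and disagreement.
agg = sel.groupby(["Season", "TeamID"])["OrdinalRank"].agg(
    massey_mean_rank="mean", massey_std_rank="std")
print(agg)
```

Team 1163 is ranked almost identically everywhere (low std), while team 1211 draws ranks from 10 to 30, and that disagreement is the "controversy" signal the model prices in.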

d. Bracketology

Script: src/fetch_brackets.py | Storage: data/brackets/bracketology_YYYY-MM-DD.csv | Coverage: 2026 pre-tournament only

Provides: Projected seeds and regions from 4 sources: ESPN (Joe Lunardi), CBS Sports, RotoWire, and TeamRankings.

Method: Playwright scrapes each source's bracket page, parses team names and seed projections, merges by team name using CANONICAL_NAMES mapping dict.

Used for: Pre-Selection Sunday field construction and region assignments. Before the official bracket is released, consensus bracketology determines which teams enter the simulation and which region they are placed in.

e. Prediction Markets

Script: src/fetch_odds.py | Storage: data/odds/odds_live.csv + team_lookup_mm.csv | Coverage: 2026 only

Provides: Championship probabilities from Kalshi (prediction market), Polymarket (prediction market), and 5 sharp sportsbooks via The Odds API.

Method: REST APIs. Kalshi and Polymarket have public endpoints; sportsbook odds are aggregated via The Odds API (requires API key). Sportsbook odds are devigged using Shin's method (removes the bookmaker's built-in margin to recover implied probabilities).
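To illustrate why devigging is needed at all, here is a simple multiplicative devig on hypothetical odds. Note this is a rougher stand-in than Shin's method, which the pipeline actually uses (Shin shifts more of the margin onto longshots rather than rescaling uniformly):

```python
def implied_probs(decimal_odds):
    """Convert decimal odds to implied probabilities; their sum
    exceeds 1 by the bookmaker's built-in margin (the 'vig')."""
    return [1.0 / o for o in decimal_odds]

def devig_proportional(decimal_odds):
    """Multiplicative devig: rescale implied probabilities to sum to 1.
    A simplification; the pipeline uses Shin's method instead."""
    raw = implied_probs(decimal_odds)
    total = sum(raw)
    return [p / total for p in raw]

odds = [1.8, 2.2]                  # hypothetical two-way market
print(sum(implied_probs(odds)))    # > 1.0: the vig is still in
print(devig_proportional(odds))    # sums to exactly 1.0
```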

Used for: Post-hoc model-market comparison and identifying potential value picks. Market data is not used as model features — only for analysis after predictions are generated.

Quirk: Kalshi has a 1% minimum tick size. Tail teams (those with true probability well below 1%) all show 1.0% regardless of actual probability. This must be filtered out when comparing model vs. market.

f. Injury Reports

Script: src/fetch_injuries.py | Storage: data/injuries/injuries_live.csv | Coverage: current season only (snapshots from specific dates)

Provides: Player injury status (Out, Out For Season, Day-to-Day, Doubtful, etc.) for all D1 teams.

Method: Playwright scrapes RotoWire's college basketball injury report page, capturing the full table via JavaScript extraction.

Downstream processing: Injured players are fed into the Win Shares replacement model (src/adjust_injuries_ws.py):

  1. Match injured player (fuzzy name) to Sports Reference advanced stats
  2. Compare player's WS/40 to team-specific replacement level (linear regression on AdjEM)
  3. Convert the WS/40 gap to an AdjEM adjustment (x5 scaling factor), with a discount for games already missed
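The three steps above can be condensed into a sketch. The x5 scaling comes from the text; the linear games-missed discount and the function signature are assumed forms, since a team's rating has already partially absorbed an absence that began mid-season:

```python
def injury_adjem_delta(player_ws40, replacement_ws40,
                       games_missed, games_played, scale=5.0):
    """Convert a WS/40 gap to an AdjEM penalty (negative = team worse).

    scale=5.0 mirrors the x5 factor described in the docs; the linear
    discount for games already missed is an assumed functional form.
    """
    gap = player_ws40 - replacement_ws40
    remaining = max(0.0, 1.0 - games_missed / max(games_played, 1))
    return -scale * gap * remaining

# Hypothetical star (WS/40 = 0.25) vs replacement level (0.10),
# already out for 6 of 30 games played:
print(round(injury_adjem_delta(0.25, 0.10, games_missed=6, games_played=30), 2))
# -0.6
```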

g. NCAA Game Results

Script: src/fetch_ncaa_games.py | Storage: appended to Kaggle's MRegularSeasonDetailedResults.csv and MRegularSeasonCompactResults.csv | Coverage: gap-fills 2026 games

Provides: Game scores and full box scores (FGM, FGA, FGM3, FGA3, FTM, FTA, OR, DR, Ast, TO, Stl, Blk, PF).

Method: REST API via ncaa-api.henrygd.me (free proxy for ncaa.com data). Fetches scoreboards by date, then box scores per game. Prioritizes Kaggle data — only fills gaps for dates beyond existing Kaggle coverage.

Rate limit: 5 req/sec hard limit. Script uses 0.3s delay between requests with retry-on-timeout logic.

Why it matters: Kaggle updates its competition data periodically but not daily. Between updates, the NCAA API fills the gap so box score aggregates (four factors, win %, margin) stay current.

h. Player Stats

Script: src/fetch_player_stats.py | Storage: SQLite cache at data/player_stats.db | Coverage: on-demand per team

Provides: Per-player advanced stats (Win Shares, WS/40, BPM, PER), per-game averages (PPG, RPG, APG, MPG), and team totals.

Method: HTTP scrapes Sports Reference (BeautifulSoup parsing). Each team page is fetched and cached in SQLite.

Rate limit: 4 seconds between requests (strict Sports Reference requirement). 65s x attempt exponential backoff on 429 responses. Bulk historical collection is very slow as a result.
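The backoff policy can be sketched as follows. The fetch callable is injected so the retry logic is testable without the network; the function name and max-attempts cap are illustrative, not the script's actual API:

```python
import time

BASE_DELAY = 4.0    # seconds between Sports Reference requests
BACKOFF = 65.0      # backoff unit applied per attempt on HTTP 429

def fetch_with_backoff(fetch, url, max_attempts=4, sleep=time.sleep):
    """Call fetch(url); on a 429 response sleep 65s * attempt and retry.

    `fetch` is any callable returning an object with a .status_code,
    so the policy can be exercised with a stub in tests.
    """
    for attempt in range(1, max_attempts + 1):
        resp = fetch(url)
        if resp.status_code != 429:
            return resp
        sleep(BACKOFF * attempt)   # 65s, 130s, 195s, ...
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts: {url}")
```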

Used for: Injury impact model (comparing injured player's WS/40 to team-specific replacement level) and player-level features (depth, usage, star dependency).

i. Kaggle Static Data

No fetcher; provided by the competition | Storage: data/kaggle/ | Coverage: 1985-2026 (men), 1997-2026 (women)

Files: include MTeamSpellings.csv, MRegularSeasonCompactResults.csv, MRegularSeasonDetailedResults.csv, MMasseyOrdinals.csv, MTeamCoaches.csv, and MTeamConferences.csv, among others.

Why it matters: Kaggle data is the backbone of the pipeline. It provides the training labels (who won each tournament game) and the longest historical coverage. All other sources augment what Kaggle provides.

j. Elo Ratings (computed)

Computed within src/build_feature_store.py | No separate fetcher | Coverage: 1985-2026

Computed from: MRegularSeasonCompactResults.csv dating back to 1985. Processes every regular-season game chronologically.

Parameters:

$$K = 20 \qquad \text{MOV multiplier} = 0.8 \cdot \ln(\text{MOV} + 1) \qquad \text{HOME\_ADV} = 3 \qquad \text{CARRYOVER} = 0.75$$

At each season boundary, ratings regress 25% toward the mean (1500). The margin-of-victory multiplier prevents blowout games from having disproportionate impact while still rewarding dominant wins.
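A condensed sketch of the update rule these parameters imply. Treating HOME_ADV as an Elo-point shift inside the expectation is an assumption, and the function names are illustrative:

```python
import math

K, HOME_ADV, CARRYOVER, MEAN = 20.0, 3.0, 0.75, 1500.0

def expected(r_a, r_b, a_is_home=False, b_is_home=False):
    """Win probability for A on the logistic Elo curve; HOME_ADV is
    assumed here to enter as an Elo-point bonus for the home side."""
    diff = (r_a + HOME_ADV * a_is_home) - (r_b + HOME_ADV * b_is_home)
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def update(r_win, r_lose, mov, winner_home=False, loser_home=False):
    """Apply one game: change = K * 0.8*ln(MOV+1) * (1 - E_win).
    The winner gains exactly what the loser drops (zero-sum)."""
    e = expected(r_win, r_lose, winner_home, loser_home)
    delta = K * 0.8 * math.log(mov + 1) * (1.0 - e)
    return r_win + delta, r_lose - delta

def new_season(rating):
    """Season boundary: regress 25% toward the mean of 1500."""
    return CARRYOVER * rating + (1.0 - CARRYOVER) * MEAN
```

The log-shaped MOV multiplier means a 20-point win moves ratings only modestly more than a 10-point win, which is what keeps blowouts from dominating.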

Why it matters: Elo is the #1 feature by importance in the LightGBM model. It captures cumulative game-by-game performance across the entire season, providing a different signal from pre-tournament snapshots like KenPom. See Elo deep-dive for full documentation.

k. Coach & Conference (derived from Kaggle)

Derived in src/build_feature_store.py from MTeamCoaches.csv and MTeamConferences.csv | Coverage: 2003-2026

Provides:

4. Team Name Resolution

(Figure: team name resolution flow)

Team name resolution is the single biggest recurring complexity in this pipeline. Every external source uses different team names:

| Source       | Example Name          | Mapping Method                 | Approx. Overrides |
|--------------|-----------------------|--------------------------------|-------------------|
| KenPom       | "Connecticut"         | kenpompy internal IDs          | built-in          |
| ESPN BPI     | "UConn Huskies"       | ESPN_BPI_NAME_OVERRIDES dict   | ~30               |
| Massey       | "Connecticut"         | resolve_teams.get_resolver()   | fuzzy match       |
| NCAA API     | "CONN"                | resolve_teams.get_resolver()   | fuzzy match       |
| RotoWire     | "UConn"               | ROTOWIRE_TO_KENPOM dict        | ~40               |
| Brackets     | "UConn"               | CANONICAL_NAMES dict           | ~80               |
| Odds/Markets | "Connecticut Huskies" | team_lookup_mm.csv             | ~70               |
| Sports Ref   | "connecticut"         | KENPOM_TO_SREF dict            | ~50               |
| Kaggle       | TeamID 1163           | canonical target               |                   |

The central resolver is resolve_teams.py, which loads all known spellings from MTeamSpellings.csv (500+ name variants mapping to Kaggle TeamIDs) and supports fuzzy matching for names not explicitly listed. Newer fetchers (fetch_massey.py, fetch_ncaa_games.py) use resolve_teams.get_resolver(). Older fetchers have their own hardcoded mapping dicts, totaling 500+ explicit name overrides across all scripts.

When adding a new data source, the first task is always building a name mapping. The recommended approach is resolve_teams.get_resolver(), which handles fuzzy matching automatically.
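The resolver pattern can be sketched like this: exact lookup against the MTeamSpellings.csv variants first, then a fuzzy fallback. This is a simplification of resolve_teams.get_resolver(); the cutoff value and helper names are assumptions, and the spellings dict below is a toy subset (UConn's real Kaggle TeamID is 1163):

```python
import difflib

def make_resolver(spellings):
    """spellings: dict of lowercase name variant -> Kaggle TeamID,
    as loaded from MTeamSpellings.csv (500+ variants in practice)."""
    keys = list(spellings)

    def resolve(name, cutoff=0.85):
        key = name.strip().lower()
        if key in spellings:                 # exact known spelling
            return spellings[key]
        match = difflib.get_close_matches(key, keys, n=1, cutoff=cutoff)
        if match:                            # fuzzy fallback for typos
            return spellings[match[0]]
        raise KeyError(f"unresolved team name: {name!r}")

    return resolve

resolve = make_resolver({"connecticut": 1163, "uconn": 1163, "conn": 1163})
print(resolve("UCONN"))       # exact variant hit -> 1163
print(resolve("Conecticut"))  # misspelling, fuzzy match -> 1163
```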

5. Feature Store Construction

build_feature_store.py joins all data sources by (Season, TeamID) to produce a single wide table with 53 columns per team-season.

From Raw Data to Model Input

1. Load & join: KenPom ratings, Massey ordinals, BPI rankings, box score aggregates, Elo ratings, seeds, coach data, and conference data are each loaded, normalized to TeamIDs, and joined on (Season, TeamID).
2. Compute derived features: Four Factors (eFG%, TOV%, ORB%, FTR) from box score totals. BPI-KenPom divergence. AP poll trajectory (early vs. late season rank change). Seed-line historical win rates.
3. Output: team_season_features.csv (7,872 rows x 55 columns including Season and TeamID).
4. Matchup construction: For each historical tournament game, compute Team A - Team B for every feature. Convention: A = lower TeamID. This produces 44 model features per matchup (some raw features are dropped or combined).
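The matchup-diff convention can be sketched with a toy two-team feature store (feature values and the matchup_row helper are illustrative):

```python
import pandas as pd

# Toy feature store: one row per (Season, TeamID).
features = pd.DataFrame({
    "Season": [2024, 2024],
    "TeamID": [1163, 1301],
    "AdjEM":  [30.1, 18.4],
    "elo":    [1720.0, 1655.0],
}).set_index(["Season", "TeamID"])

def matchup_row(season, team_x, team_y):
    """Build one training row: Team A - Team B for every feature,
    where A is always the lower TeamID (the pipeline's convention)."""
    a, b = sorted((team_x, team_y))
    return features.loc[(season, a)] - features.loc[(season, b)]

# Order of arguments does not matter; A is still 1163.
print(matchup_row(2024, 1301, 1163))
# AdjEM diff is about 11.7, elo diff 65.0
```

Fixing A to the lower TeamID makes each matchup appear exactly once with a consistent sign, so the label is simply "did team A win".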

Feature Coverage by Season

(Figure: feature coverage heatmap)

The heatmap shows the percentage of teams with non-null values for each feature in each season.

How missing values are handled: LightGBM handles NaN natively by learning optimal split directions for missing values at each tree node. This means features with partial coverage (BPI, AP rank) can still contribute signal without imputation. The logistic regression model uses only features with near-complete coverage (AdjEM, seed, Massey, AdjOE, AdjDE).
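The coverage quantity behind the heatmap is easy to compute directly; a sketch on a toy slice of the feature store (values hypothetical):

```python
import pandas as pd

# Toy slice of the feature store; bpi_rank is missing before 2014.
df = pd.DataFrame({
    "Season":   [2010, 2010, 2024, 2024],
    "AdjEM":    [5.0, -2.0, 7.5, 1.0],
    "bpi_rank": [float("nan"), float("nan"), 12.0, 40.0],
})

# Fraction of teams with a non-null value, per season and feature:
# notna() gives booleans, and the mean of booleans is the coverage rate.
coverage = df.set_index("Season").notna().groupby(level="Season").mean()
print(coverage)
```

On this toy data, AdjEM shows 100% coverage in both seasons while bpi_rank shows 0% in 2010 and 100% in 2024, which is the 2014 coverage cliff the heatmap makes visible.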

6. Data Freshness

Last update times for live data sources (as of page generation):

| Source           | Last Updated     | Age          | File                                    |
|------------------|------------------|--------------|-----------------------------------------|
| KenPom           | 2026-03-06 22:25 | 11 min ago   | data/kenpom/kenpom_live.csv             |
| ESPN BPI         | 2026-03-06 22:26 | 11 min ago   | data/bpi/bpi_2026.csv                   |
| Massey Ordinals  | 2026-03-06 22:26 | 11 min ago   | data/kaggle/MMasseyOrdinals.csv         |
| Injuries         | 2026-03-06 09:33 | 13 hours ago | data/injuries/injuries_live.csv         |
| Odds             | 2026-03-06 22:26 | 11 min ago   | data/odds/odds_live.csv                 |
| Feature Store    | 2026-03-06 22:27 | 10 min ago   | data/features/team_season_features.csv  |
| Matchup Training | 2026-03-06 22:27 | 10 min ago   | data/features/matchup_train.csv         |
| Bracketology     | 2026-03-06 09:33 | 13 hours ago | data/brackets/bracketology_2026-03-06.csv |
Update schedule: During tournament season, run_pipeline.py orchestrates hourly updates for KenPom, BPI, Massey, and odds. Bracketology and injury reports are scraped on-demand. The feature store and matchup training data are regenerated after each data update cycle.

7. Coverage Gaps & Known Limitations

Women's teams (~364) have no feature pipeline. The Kaggle competition scores both men's (268 games) and women's (268 games) tournaments. Our feature pipeline only builds features for men's teams. Women's teams (TeamIDs 3xxx) receive default 0.5 predictions for all non-hardcoded matchups. For completed seasons, actual results are hardcoded to exact 0/1, so this gap only affects the current (unknown) season.
| Limitation | Impact | Mitigation |
|---|---|---|
| BPI only from 2014 | 11 of 22 training seasons have BPI data; 11 do not | LightGBM handles NaN natively. The model learns BPI's predictive value where available and ignores it where missing. |
| Injury data only from 2024 | Very limited training signal for injury adjustments (2-3 seasons) | Injury adjustments modify AdjEM directly rather than entering as a separate feature. The Win Shares replacement model is calibrated on player-level data, not tournament outcomes. |
| Sports Reference rate limits | Bulk historical player stat collection is extremely slow (4s/page + backoff) | SQLite cache (player_stats.db) persists data across runs. Once a team-season is fetched, it never needs re-fetching. |
| Kalshi 1% tick floor | Tail teams (true prob << 1%) all display 1.0%, distorting model-market comparisons | Filter out Kalshi comparisons at the 1% floor. Suppress signal for teams where the CI upper bound < 1%. |
| Massey scraping fragility | masseyratings.com is a single-maintainer site; page structure can change | Playwright-based scraper with generous timeouts. Falls back gracefully if CSV export fails. |
| No women's KenPom/BPI | Cannot build equivalent features for women's tournament teams | None currently. This is the largest potential improvement area for competition scoring. |
| 2020 season (COVID) | No tournament was held; season excluded from training | Season is simply skipped: 22 training seasons instead of 23 for that year range. |