Data Sources & Collection Pipeline

NCAA March Machine Learning Mania 2026 — Data Documentation

At a glance:

- 11 data sources
- 53 raw features per team-season
- 22 training seasons (2003-2025)
- ~342 teams per season

1. Overview

This pipeline collects, normalizes, and merges data from 11 distinct sources to produce a unified feature store for NCAA tournament prediction. Each team-season row contains 53 raw features spanning efficiency metrics, ranking consensus, box score aggregates, coaching history, and more.

The data covers 22 training seasons (2003-2025, excluding 2020 which had no tournament) with approximately 342 Division I teams per season. Sources range from stable Kaggle competition files (available since 1985) to live-scraped real-time data (prediction markets, injury reports) that exist only for the current season.


2. Pipeline Architecture

(Figure: pipeline architecture diagram)

The pipeline has five stages:

1. Raw Sources: Fetcher scripts (src/fetch_*.py) scrape external sources via Playwright (BPI, Massey, injuries, brackets), REST APIs (NCAA games, odds), or Python libraries (kenpompy). Kaggle provides static competition files.
2. Storage: Each fetcher saves timestamped CSVs (e.g., kenpom_2026-03-06_1742.csv) plus a *_live.csv symlink to the latest. Historical KenPom data lives in kenpom.db (SQLite); player stats in player_stats.db.
3. Transform: Team names are normalized to Kaggle TeamIDs via resolve_teams.py. Elo ratings are computed from all regular-season games since 1985. Injury reports are converted to AdjEM adjustments via the Win Shares replacement model.
4. Feature Store: build_feature_store.py joins all sources into one row per (Season, TeamID) with 53 columns. Output: team_season_features.csv (7,872 rows).
5. Model: matchup differences (Team A - Team B) are computed for all historical tournament games, producing matchup_train.csv. The ensemble (70% LightGBM + 30% Logistic Regression) trains on these diffs. See Model Documentation for details.

3. Data Sources

(Figure: data source coverage timeline)

a. KenPom Ratings

Script: src/fetch_kenpom.py | Storage: data/kenpom/kenpom_live.csv + historical summary{YY}.csv | Coverage: 2003-2026 (~350 teams/season)

Provides: AdjOE (adjusted offensive efficiency), AdjDE (adjusted defensive efficiency), AdjTempo, AdjEM (= AdjOE - AdjDE), national rankings, Luck, and SOS (strength of schedule).

Method: kenpompy Python library with authenticated login. Credentials stored in .env.

Why it matters: AdjEM is the single most predictive team-strength metric in college basketball. Tempo-adjusted efficiency margins remove pace effects that distort raw scoring stats. A team that scores 80 points per game but plays at a fast tempo is not necessarily better than one scoring 65 at a slow pace — AdjEM normalizes this to points per 100 possessions.
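The tempo normalization above can be checked with a small worked example (the per-game numbers are hypothetical, and the helper function is illustrative, not part of the pipeline):

```python
def points_per_100(points, possessions):
    """Raw efficiency: points scored per 100 possessions."""
    return 100.0 * points / possessions

# Hypothetical per-game averages: a fast-paced team and a slow-paced team.
fast = points_per_100(points=80, possessions=75)   # 80 ppg at 75 poss/game
slow = points_per_100(points=65, possessions=60)   # 65 ppg at 60 poss/game

print(round(fast, 1))  # 106.7
print(round(slow, 1))  # 108.3
```

Despite scoring 15 fewer points per game, the slow team is actually the more efficient offense once pace is removed, which is exactly the distortion AdjEM corrects for.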

b. ESPN BPI

Script: src/fetch_bpi.py | Storage: data/bpi/bpi_{season}.csv | Coverage: 2014-2026 only

Provides: BPI power ranking (1-353).

Method: Playwright headless Chrome scrapes ESPN's JavaScript-rendered BPI page. Uses --disable-blink-features=AutomationControlled and custom user agent to avoid bot detection.

Why it matters: BPI uses a different methodology from KenPom (proprietary ESPN model vs. Pomeroy's tempo-free efficiency). When BPI and KenPom disagree about a team (bpi_kenpom_divergence), that divergence signal is the 5th most important feature in the model. It captures information that neither system alone provides.

Coverage gap: BPI only exists from 2014. For 2003-2013 (11 of 22 training seasons), bpi_rank and bpi_kenpom_divergence are null. LightGBM handles this natively.

c. Massey Ordinals

Script: src/fetch_massey.py | Storage: appended to data/kaggle/MMasseyOrdinals.csv | Coverage: 2003-2026 (56 ranking systems)

Provides: Rankings from 56 independent computer rating systems including POM (Pomeroy), MOR (Massey), SAG (Sagarin), COL (Colley), DOL (Dolphin), WLK (Wolfe), AP (Associated Press poll), USA (Coaches poll), and 48 others.

Method: Playwright scrapes the masseyratings.com/ranks composite page and triggers its CSV export button. Only adds systems that already exist in the Kaggle baseline file to maintain consistency.

The model uses 8 selected systems to compute consensus rank features, such as the across-system spread (massey_std_rank).

Why it matters: Consensus across independent ranking systems is more robust than any single system. The massey_std_rank captures how "controversial" a team is — when systems disagree, there is genuine uncertainty about team strength, and the model can price that uncertainty into its predictions.
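A sketch of how such consensus features can be computed from the long-format ordinals file. The three-system subset and the massey_mean_rank name are illustrative (the pipeline selects 8 systems); the column layout follows Kaggle's MMasseyOrdinals.csv:

```python
import pandas as pd

# Long-format ordinals as in Kaggle's MMasseyOrdinals.csv.
ordinals = pd.DataFrame({
    "Season":      [2024] * 6,
    "SystemName":  ["POM", "SAG", "MOR", "POM", "SAG", "MOR"],
    "TeamID":      [1163, 1163, 1163, 1211, 1211, 1211],
    "OrdinalRank": [1, 2, 1, 10, 30, 20],
})

SYSTEMS = ["POM", "SAG", "MOR"]  # the pipeline uses 8; 3 shown here
sel = ordinals[ordinals["SystemName"].isin(SYSTEMS)]

# One row per (Season, TeamID): consensus level and disagreement.
agg = sel.groupby(["Season", "TeamID"])["OrdinalRank"].agg(
    massey_mean_rank="mean", massey_std_rank="std")
print(agg)
```

Team 1163 is ranked almost identically everywhere (low std), while team 1211 draws ranks from 10 to 30, and that disagreement is the "controversy" signal the model prices in.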

d. Bracketology

Script: src/fetch_brackets.py | Storage: data/brackets/bracketology_YYYY-MM-DD.csv | Coverage: 2026 pre-tournament only

Provides: Projected seeds and regions from 4 sources: ESPN (Joe Lunardi), CBS Sports, RotoWire, and TeamRankings.

Method: Playwright scrapes each source's bracket page, parses team names and seed projections, merges by team name using CANONICAL_NAMES mapping dict.

Used for: Pre-Selection Sunday field construction and region assignments. Before the official bracket is released, consensus bracketology determines which teams enter the simulation and which region they are placed in.

e. Prediction Markets

Script: src/fetch_odds.py | Storage: data/odds/odds_live.csv + team_lookup_mm.csv | Coverage: 2026 only

Provides: Championship probabilities from Kalshi (prediction market), Polymarket (prediction market), and 5 sharp sportsbooks via The Odds API.

Method: REST APIs. Kalshi and Polymarket have public endpoints; sportsbook odds are aggregated via The Odds API (requires API key). Sportsbook odds are devigged using Shin's method (removes the bookmaker's built-in margin to recover implied probabilities).
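To illustrate why devigging is needed at all, here is a simple multiplicative devig on hypothetical odds. Note this is a rougher stand-in than Shin's method, which the pipeline actually uses (Shin shifts more of the margin onto longshots rather than rescaling uniformly):

```python
def implied_probs(decimal_odds):
    """Convert decimal odds to implied probabilities; their sum
    exceeds 1 by the bookmaker's built-in margin (the 'vig')."""
    return [1.0 / o for o in decimal_odds]

def devig_proportional(decimal_odds):
    """Multiplicative devig: rescale implied probabilities to sum to 1.
    A simplification; the pipeline uses Shin's method instead."""
    raw = implied_probs(decimal_odds)
    total = sum(raw)
    return [p / total for p in raw]

odds = [1.8, 2.2]                  # hypothetical two-way market
print(sum(implied_probs(odds)))    # > 1.0: the vig is still in
print(devig_proportional(odds))    # sums to exactly 1.0
```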

Used for: Post-hoc model-market comparison and identifying potential value picks. Market data is not used as model features — only for analysis after predictions are generated.

Quirk: Kalshi has a 1% minimum tick size. Tail teams (those with true probability well below 1%) all show 1.0% regardless of actual probability. This must be filtered out when comparing model vs. market.

f. Injury Reports

Script: src/fetch_injuries.py | Storage: data/injuries/injuries_live.csv | Coverage: current season only (snapshots from specific dates)

Provides: Player injury status (Out, Out For Season, Day-to-Day, Doubtful, etc.) for all D1 teams.

Method: Playwright scrapes RotoWire's college basketball injury report page, capturing the full table via JavaScript extraction.

Downstream processing: Injured players are fed into the Win Shares replacement model (src/adjust_injuries_ws.py):

  1. Match injured player (fuzzy name) to Sports Reference advanced stats
  2. Compare player's WS/40 to team-specific replacement level (linear regression on AdjEM)
  3. Convert the WS/40 gap to an AdjEM adjustment (x5 scaling factor), with a discount for games already missed
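The three steps above can be condensed into a sketch. The x5 scaling comes from the text; the linear games-missed discount and the function signature are assumed forms, since a team's rating has already partially absorbed an absence that began mid-season:

```python
def injury_adjem_delta(player_ws40, replacement_ws40,
                       games_missed, games_played, scale=5.0):
    """Convert a WS/40 gap to an AdjEM penalty (negative = team worse).

    scale=5.0 mirrors the x5 factor described in the docs; the linear
    discount for games already missed is an assumed functional form.
    """
    gap = player_ws40 - replacement_ws40
    remaining = max(0.0, 1.0 - games_missed / max(games_played, 1))
    return -scale * gap * remaining

# Hypothetical star (WS/40 = 0.25) vs replacement level (0.10),
# already out for 6 of 30 games played:
print(round(injury_adjem_delta(0.25, 0.10, games_missed=6, games_played=30), 2))
# -0.6
```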

g. NCAA Game Results

Script: src/fetch_ncaa_games.py | Storage: appended to Kaggle's MRegularSeasonDetailedResults.csv and MRegularSeasonCompactResults.csv | Coverage: gap-fills 2026 games

Provides: Game scores and full box scores (FGM, FGA, FGM3, FGA3, FTM, FTA, OR, DR, Ast, TO, Stl, Blk, PF).

Method: REST API via ncaa-api.henrygd.me (free proxy for ncaa.com data). Fetches scoreboards by date, then box scores per game. Prioritizes Kaggle data — only fills gaps for dates beyond existing Kaggle coverage.

Rate limit: 5 req/sec hard limit. Script uses 0.3s delay between requests with retry-on-timeout logic.

Why it matters: Kaggle updates its competition data periodically but not daily. Between updates, the NCAA API fills the gap so box score aggregates (four factors, win %, margin) stay current.

h. Player Stats

Script: src/fetch_player_stats.py | Storage: SQLite cache at data/player_stats.db | Coverage: on-demand per team

Provides: Per-player advanced stats (Win Shares, WS/40, BPM, PER), per-game averages (PPG, RPG, APG, MPG), and team totals.

Method: HTTP scrapes Sports Reference (BeautifulSoup parsing). Each team page is fetched and cached in SQLite.

Rate limit: 4 seconds between requests (strict Sports Reference requirement). 65s x attempt exponential backoff on 429 responses. Bulk historical collection is very slow as a result.
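The backoff policy can be sketched as follows. The fetch callable is injected so the retry logic is testable without the network; the function name and max-attempts cap are illustrative, not the script's actual API:

```python
import time

BASE_DELAY = 4.0    # seconds between Sports Reference requests
BACKOFF = 65.0      # backoff unit applied per attempt on HTTP 429

def fetch_with_backoff(fetch, url, max_attempts=4, sleep=time.sleep):
    """Call fetch(url); on a 429 response sleep 65s * attempt and retry.

    `fetch` is any callable returning an object with a .status_code,
    so the policy can be exercised with a stub in tests.
    """
    for attempt in range(1, max_attempts + 1):
        resp = fetch(url)
        if resp.status_code != 429:
            return resp
        sleep(BACKOFF * attempt)   # 65s, 130s, 195s, ...
    raise RuntimeError(f"still rate-limited after {max_attempts} attempts: {url}")
```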

Used for: Injury impact model (comparing injured player's WS/40 to team-specific replacement level) and player-level features (depth, usage, star dependency).

i. Kaggle Static Data

No fetcher; provided by the competition | Storage: data/kaggle/ | Coverage: 1985-2026 (men), 1997-2026 (women)

Files: include MTeamSpellings.csv, MRegularSeasonCompactResults.csv, MRegularSeasonDetailedResults.csv, MMasseyOrdinals.csv, MTeamCoaches.csv, and MTeamConferences.csv, among others.

Why it matters: Kaggle data is the backbone of the pipeline. It provides the training labels (who won each tournament game) and the longest historical coverage. All other sources augment what Kaggle provides.

j. Elo Ratings (computed)

Computed within src/build_feature_store.py | No separate fetcher | Coverage: 1985-2026

Computed from: MRegularSeasonCompactResults.csv dating back to 1985. Processes every regular-season game chronologically.

Parameters:

$$K = 20 \qquad \text{MOV multiplier} = 0.8 \cdot \ln(\text{MOV} + 1) \qquad \text{HOME\_ADV} = 3 \qquad \text{CARRYOVER} = 0.75$$

At each season boundary, ratings regress 25% toward the mean (1500). The margin-of-victory multiplier prevents blowout games from having disproportionate impact while still rewarding dominant wins.
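A condensed sketch of the update rule these parameters imply. Treating HOME_ADV as an Elo-point shift inside the expectation is an assumption, and the function names are illustrative:

```python
import math

K, HOME_ADV, CARRYOVER, MEAN = 20.0, 3.0, 0.75, 1500.0

def expected(r_a, r_b, a_is_home=False, b_is_home=False):
    """Win probability for A on the logistic Elo curve; HOME_ADV is
    assumed here to enter as an Elo-point bonus for the home side."""
    diff = (r_a + HOME_ADV * a_is_home) - (r_b + HOME_ADV * b_is_home)
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def update(r_win, r_lose, mov, winner_home=False, loser_home=False):
    """Apply one game: change = K * 0.8*ln(MOV+1) * (1 - E_win).
    The winner gains exactly what the loser drops (zero-sum)."""
    e = expected(r_win, r_lose, winner_home, loser_home)
    delta = K * 0.8 * math.log(mov + 1) * (1.0 - e)
    return r_win + delta, r_lose - delta

def new_season(rating):
    """Season boundary: regress 25% toward the mean of 1500."""
    return CARRYOVER * rating + (1.0 - CARRYOVER) * MEAN
```

The log-shaped MOV multiplier means a 20-point win moves ratings only modestly more than a 10-point win, which is what keeps blowouts from dominating.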

Why it matters: Elo is the #1 feature by importance in the LightGBM model. It captures cumulative game-by-game performance across the entire season, providing a different signal from pre-tournament snapshots like KenPom. See Elo deep-dive for full documentation.

k. Coach & Conference (derived from Kaggle)

Derived in src/build_feature_store.py from MTeamCoaches.csv and MTeamConferences.csv | Coverage: 2003-2026

Provides:

4. Team Name Resolution

(Figure: team name resolution flow)

Team name resolution is the single biggest recurring complexity in this pipeline. Every external source uses different team names:

| Source       | Example Name          | Mapping Method                 | Approx. Overrides |
|--------------|-----------------------|--------------------------------|-------------------|
| KenPom       | "Connecticut"         | kenpompy internal IDs          | built-in          |
| ESPN BPI     | "UConn Huskies"       | ESPN_BPI_NAME_OVERRIDES dict   | ~30               |
| Massey       | "Connecticut"         | resolve_teams.get_resolver()   | fuzzy match       |
| NCAA API     | "CONN"                | resolve_teams.get_resolver()   | fuzzy match       |
| RotoWire     | "UConn"               | ROTOWIRE_TO_KENPOM dict        | ~40               |
| Brackets     | "UConn"               | CANONICAL_NAMES dict           | ~80               |
| Odds/Markets | "Connecticut Huskies" | team_lookup_mm.csv             | ~70               |
| Sports Ref   | "connecticut"         | KENPOM_TO_SREF dict            | ~50               |
| Kaggle       | TeamID 1163           | canonical target               |                   |

The central resolver is resolve_teams.py, which loads all known spellings from MTeamSpellings.csv (500+ name variants mapping to Kaggle TeamIDs) and supports fuzzy matching for names not explicitly listed. Newer fetchers (fetch_massey.py, fetch_ncaa_games.py) use resolve_teams.get_resolver(). Older fetchers have their own hardcoded mapping dicts, totaling 500+ explicit name overrides across all scripts.

When adding a new data source, the first task is always building a name mapping. The recommended approach is resolve_teams.get_resolver(), which handles fuzzy matching automatically.
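The resolver pattern can be sketched like this: exact lookup against the MTeamSpellings.csv variants first, then a fuzzy fallback. This is a simplification of resolve_teams.get_resolver(); the cutoff value and helper names are assumptions, and the spellings dict below is a toy subset (UConn's real Kaggle TeamID is 1163):

```python
import difflib

def make_resolver(spellings):
    """spellings: dict of lowercase name variant -> Kaggle TeamID,
    as loaded from MTeamSpellings.csv (500+ variants in practice)."""
    keys = list(spellings)

    def resolve(name, cutoff=0.85):
        key = name.strip().lower()
        if key in spellings:                 # exact known spelling
            return spellings[key]
        match = difflib.get_close_matches(key, keys, n=1, cutoff=cutoff)
        if match:                            # fuzzy fallback for typos
            return spellings[match[0]]
        raise KeyError(f"unresolved team name: {name!r}")

    return resolve

resolve = make_resolver({"connecticut": 1163, "uconn": 1163, "conn": 1163})
print(resolve("UCONN"))       # exact variant hit -> 1163
print(resolve("Conecticut"))  # misspelling, fuzzy match -> 1163
```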

5. Feature Store Construction

build_feature_store.py joins all data sources by (Season, TeamID) to produce a single wide table with 53 columns per team-season.

From Raw Data to Model Input

1. Load & join: KenPom ratings, Massey ordinals, BPI rankings, box score aggregates, Elo ratings, seeds, coach data, and conference data are each loaded, normalized to TeamIDs, and joined on (Season, TeamID).
2. Compute derived features: Four Factors (eFG%, TOV%, ORB%, FTR) from box score totals. BPI-KenPom divergence. AP poll trajectory (early vs. late season rank change). Seed-line historical win rates.
3. Output: team_season_features.csv (7,872 rows x 55 columns including Season and TeamID).
4. Matchup construction: For each historical tournament game, compute Team A - Team B for every feature. Convention: A = lower TeamID. This produces 44 model features per matchup (some raw features are dropped or combined).
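The matchup-diff convention can be sketched with a toy two-team feature store (feature values and the matchup_row helper are illustrative):

```python
import pandas as pd

# Toy feature store: one row per (Season, TeamID).
features = pd.DataFrame({
    "Season": [2024, 2024],
    "TeamID": [1163, 1301],
    "AdjEM":  [30.1, 18.4],
    "elo":    [1720.0, 1655.0],
}).set_index(["Season", "TeamID"])

def matchup_row(season, team_x, team_y):
    """Build one training row: Team A - Team B for every feature,
    where A is always the lower TeamID (the pipeline's convention)."""
    a, b = sorted((team_x, team_y))
    return features.loc[(season, a)] - features.loc[(season, b)]

# Order of arguments does not matter; A is still 1163.
print(matchup_row(2024, 1301, 1163))
# AdjEM diff is about 11.7, elo diff 65.0
```

Fixing A to the lower TeamID makes each matchup appear exactly once with a consistent sign, so the label is simply "did team A win".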

Feature Coverage by Season

(Figure: feature coverage heatmap)

The heatmap shows the percentage of teams with non-null values for each feature in each season.

How missing values are handled: LightGBM handles NaN natively by learning optimal split directions for missing values at each tree node. This means features with partial coverage (BPI, AP rank) can still contribute signal without imputation. The logistic regression model uses only features with near-complete coverage (AdjEM, seed, Massey, AdjOE, AdjDE).
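The coverage quantity behind the heatmap is easy to compute directly; a sketch on a toy slice of the feature store (values hypothetical):

```python
import pandas as pd

# Toy slice of the feature store; bpi_rank is missing before 2014.
df = pd.DataFrame({
    "Season":   [2010, 2010, 2024, 2024],
    "AdjEM":    [5.0, -2.0, 7.5, 1.0],
    "bpi_rank": [float("nan"), float("nan"), 12.0, 40.0],
})

# Fraction of teams with a non-null value, per season and feature:
# notna() gives booleans, and the mean of booleans is the coverage rate.
coverage = df.set_index("Season").notna().groupby(level="Season").mean()
print(coverage)
```

On this toy data, AdjEM shows 100% coverage in both seasons while bpi_rank shows 0% in 2010 and 100% in 2024, which is the 2014 coverage cliff the heatmap makes visible.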

6. Data Freshness

Last update times for live data sources (as of page generation):

| Source           | Last Updated     | Age          | File                                    |
|------------------|------------------|--------------|-----------------------------------------|
| KenPom           | 2026-03-06 22:25 | 11 min ago   | data/kenpom/kenpom_live.csv             |
| ESPN BPI         | 2026-03-06 22:26 | 11 min ago   | data/bpi/bpi_2026.csv                   |
| Massey Ordinals  | 2026-03-06 22:26 | 11 min ago   | data/kaggle/MMasseyOrdinals.csv         |
| Injuries         | 2026-03-06 09:33 | 13 hours ago | data/injuries/injuries_live.csv         |
| Odds             | 2026-03-06 22:26 | 11 min ago   | data/odds/odds_live.csv                 |
| Feature Store    | 2026-03-06 22:27 | 10 min ago   | data/features/team_season_features.csv  |
| Matchup Training | 2026-03-06 22:27 | 10 min ago   | data/features/matchup_train.csv         |
| Bracketology     | 2026-03-06 09:33 | 13 hours ago | data/brackets/bracketology_2026-03-06.csv |
Update schedule: During tournament season, run_pipeline.py orchestrates hourly updates for KenPom, BPI, Massey, and odds. Bracketology and injury reports are scraped on-demand. The feature store and matchup training data are regenerated after each data update cycle.

7. Coverage Gaps & Known Limitations

Women's teams (~364) have no feature pipeline. The Kaggle competition scores both men's (268 games) and women's (268 games) tournaments. Our feature pipeline only builds features for men's teams. Women's teams (TeamIDs 3xxx) receive default 0.5 predictions for all non-hardcoded matchups. For completed seasons, actual results are hardcoded to exact 0/1, so this gap only affects the current (unknown) season.
| Limitation | Impact | Mitigation |
|---|---|---|
| BPI only from 2014 | 11 of 22 training seasons have BPI data; 11 do not | LightGBM handles NaN natively. The model learns BPI's predictive value where available and ignores it where missing. |
| Injury data only from 2024 | Very limited training signal for injury adjustments (2-3 seasons) | Injury adjustments modify AdjEM directly rather than entering as a separate feature. The Win Shares replacement model is calibrated on player-level data, not tournament outcomes. |
| Sports Reference rate limits | Bulk historical player stat collection is extremely slow (4s/page + backoff) | SQLite cache (player_stats.db) persists data across runs. Once a team-season is fetched, it never needs re-fetching. |
| Kalshi 1% tick floor | Tail teams (true prob << 1%) all display 1.0%, distorting model-market comparisons | Filter out Kalshi comparisons at the 1% floor. Suppress signal for teams where the CI upper bound < 1%. |
| Massey scraping fragility | masseyratings.com is a single-maintainer site; page structure can change | Playwright-based scraper with generous timeouts. Falls back gracefully if CSV export fails. |
| No women's KenPom/BPI | Cannot build equivalent features for women's tournament teams | None currently. This is the largest potential improvement area for competition scoring. |
| 2020 season (COVID) | No tournament was held; season excluded from training | Season is simply skipped: 22 training seasons instead of 23 for that year range. |