Technical Reference — March Machine Learning Mania 2026
This model predicts the probability that Team A beats Team B for every possible NCAA tournament matchup. It combines a gradient-boosted decision tree (LightGBM, 44 features) with a logistic regression on 5 KenPom features in a 70%/30% weighted ensemble.
Key design principles:
The pipeline has five stages:
build_feature_store.py joins all sources into one row per (Season, TeamID) with 55 raw features. Outputs: team_season_features.csv and matchup_train.csv (1445 games, 2003-2025).

| Source | Collection Method | Features | Seasons | Notes |
|---|---|---|---|---|
| KenPom | kenpompy (Python API) | AdjEM, AdjOE, AdjDE, AdjTempo, ranks, Luck, SOS | 2003-2026 | Pre-tournament snapshots (historical) + live scrape (2026) |
| Massey Ordinals | masseyratings.com (Playwright) | 8 ranking systems: POM, MOR, SAG, COL, DOL, WLK, AP, USA | 2003-2026 | Kaggle baseline + daily gap-fill via web scraping |
| ESPN BPI | espn.com (Playwright) | BPI power ranking | 2014-2026 | JavaScript-rendered page extraction |
| Kaggle Box Scores | Kaggle competition data | FGM, FGA, FGM3, FGA3, FTM, FTA, OR, DR, Ast, TO, Stl, Blk | 2003-2026 | Detailed results + NCAA API gap-fill |
| Tournament Seeds | Kaggle + bracketology consensus | Seed number (1-16) | 2003-2026 | Kaggle official (historical) + ESPN/CBS/RotoWire consensus (2026) |
| Elo Ratings | Computed from Kaggle compact results | Custom Elo rating per team | 1985-2026 | K=20, MOV multiplier, 75% season carryover |
| Coach Data | Kaggle MTeamCoaches.csv | Tenure, tournament appearances | 2003-2026 | Consecutive years + historical tourney count |
| Conference | Kaggle + KenPom | Conference mean AdjEM | 2003-2026 | Average KenPom AdjEM of all conference teams |
| AP Poll | Massey Ordinals (AP system) | Poll trajectory (early vs late season rank) | 2003-2026 | Rank change from week 7 to pre-tournament |
| Injuries | RotoWire (Playwright) + Sports Ref | AdjEM adjustment via WS/40 replacement model | 2024-2026 | Player WS/40 vs team-specific replacement level |
| Prediction Markets | Kalshi, Polymarket, sportsbooks | Championship probabilities | 2026 only | Used for post-hoc model-market comparison, not as features |
The resolve_teams.py module handles fuzzy matching across all known spellings, mapping every name to a canonical Kaggle TeamID.
Features are organized into tiers by source and type. All enter the model as differences: feature_A - feature_B.
| # | Feature | Name | Tier | Gain | Description |
|---|---|---|---|---|---|
| 1 | elo_rating_diff | Elo Rating | Elo | 3,699 | Custom Elo from all reg-season games since 1985 |
| 2 | AdjEM_diff | Adjusted Efficiency Margin | KenPom | 3,190 | Offensive minus defensive efficiency, tempo-adjusted |
| 3 | massey_best_rank_diff | Massey Best Rank | Consensus | 3,139 | Best rank across 8 Massey systems |
| 4 | pom_rank_diff | KenPom Rank (POM) | Consensus | 2,943 | Pomeroy rank from Massey composite |
| 5 | bpi_kenpom_divergence_diff | BPI-KenPom Divergence | Consensus | 1,948 | BPI rank minus KenPom rank |
| 6 | massey_mean_rank_diff | Massey Mean Rank | Consensus | 918 | Average rank across 8 systems |
| 7 | AdjOE_diff | Adjusted Offensive Efficiency | KenPom | 906 | Points per 100 possessions, tempo-adjusted |
| 8 | sos_proxy_diff | Strength of Schedule | Box Scores | 782 | Mean opponents' win percentage |
| 9 | off_ftr_diff | Free Throw Rate | Four Factors | 717 | FTA / FGA ratio |
| 10 | win_pct_diff | Win Percentage | Momentum | 608 | Regular season win rate |
| 11 | opp_fg3_pct_diff | Opp. 3-Point % | Box Scores | 576 | Three-point shooting allowed |
| 12 | seed_line_win_rate_diff | Seed Line Win Rate | Structural | 552 | Historical win rate for this seed (1985-2002) |
| 13 | ast_rate_diff | Assist Rate | Box Scores | 534 | Assists per game |
| 14 | AdjDE_diff | Adjusted Defensive Efficiency | KenPom | 496 | Points allowed per 100 possessions |
| 15 | massey_std_rank_diff | Massey Rank Std Dev | Consensus | 483 | Disagreement across ranking systems |
| 16 | RankAdjDE_diff | KenPom AdjDE Rank | KenPom | 450 | National rank by defensive efficiency |
| 17 | orb_pct_diff | Offensive Rebound % | Box Scores | 432 | Share of available offensive rebounds |
| 18 | coach_tenure_diff | Coach Tenure | Coach | 356 | Consecutive years at current school |
| 19 | scoring_margin_std_diff | Scoring Margin Std Dev | Other | 329 | |
| 20 | ft_pct_diff | FT % | Other | 296 | |
| 21 | tov_rate_diff | Turnover Rate | Other | 285 | |
| 22 | log_seed_diff | Log(Seed) | Structural | 283 | Logarithmic seed; compresses high seeds |
| 23 | bpi_rank_diff | BPI Rank | Other | 262 | |
| 24 | ap_rank_diff | AP Rank | Other | 252 | |
| 25 | poll_trajectory_diff | Poll Trajectory | Other | 236 | |
| 26 | def_ftr_diff | Def. Free Throw Rate | Other | 236 | |
| 27 | scoring_margin_avg_diff | Scoring Margin Avg | Other | 232 | |
| 28 | last10_margin_diff | Last-10 Margin | Other | 211 | |
| 29 | AdjTempo_diff | AdjTempo | Other | 205 | |
| 30 | def_efg_pct_diff | Def. eFG% | Other | 188 | |
| 31 | opp_fg_pct_diff | Opp. FG% | Other | 184 | |
| 32 | fg_pct_diff | FG% | Other | 182 | |
| 33 | RankAdjEM_diff | KenPom AdjEM Rank | KenPom | 177 | National rank by efficiency margin |
| 34 | RankAdjOE_diff | KenPom AdjOE Rank | Other | 175 | |
| 35 | fg3_pct_diff | 3-Point % | Other | 166 | |
| 36 | matchup_off_def_sum | Matchup Off-Def Sum | Other | 153 | |
| 37 | off_orb_pct_diff | Off. Rebound % | Other | 145 | |
| 38 | off_efg_pct_diff | Off. eFG% | Other | 131 | |
| 39 | conf_mean_adjem_diff | Conf. Mean AdjEM | Other | 130 | |
| 40 | seed_product | Seed Product | Other | 116 | |
| 41 | coach_tourney_apps_diff | Coach Tourney Apps | Other | 96 | |
| 42 | three_rate_diff | Three-Point Rate | Other | 87 | |
| 43 | off_tov_pct_diff | Off. Turnover % | Other | 85 | |
| 44 | seed_diff | Seed | Other | 25 | |
Four Factors (Dean Oliver's framework): offensive and defensive versions of effective FG%, turnover rate, offensive rebound rate, and free throw rate. These are tempo-free — they measure quality independent of pace.
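As a sketch, the Four Factors can be computed from box-score totals like this; the possession count `poss` is passed in, since the pipeline's exact possession-estimation formula is not stated here:

```python
def four_factors(fgm, fga, fgm3, fta, orb, opp_drb, tov, poss):
    """Dean Oliver's Four Factors from raw box-score totals."""
    return {
        "efg_pct": (fgm + 0.5 * fgm3) / fga,  # effective FG%: a 3 counts 1.5x
        "tov_pct": tov / poss,                # turnovers per possession
        "orb_pct": orb / (orb + opp_drb),     # share of available off. rebounds
        "ftr": fta / fga,                     # free-throw rate (FTA / FGA)
    }
```

Because every factor is a rate rather than a total, a fast-paced team and a slow-paced team with the same shot quality score identically.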
Matchup interactions: how offensive vs. defensive the matchup is overall (symmetric, so invariant to team ordering).
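One plausible construction of matchup_off_def_sum with that symmetry property (the exact formula in the pipeline is an assumption; only the symmetry requirement comes from the doc):

```python
def matchup_off_def_sum(adj_oe_a, adj_de_a, adj_oe_b, adj_de_b):
    """Symmetric offense-vs-defense interaction: how high-scoring the
    matchup profiles to be overall. Swapping teams A and B leaves the
    value unchanged, so the feature cannot leak team ordering."""
    return (adj_oe_a - adj_de_b) + (adj_oe_b - adj_de_a)
```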
BPI-KenPom divergence: when BPI and KenPom disagree about a team's ranking, it signals that one system sees something the other doesn't. This feature ranks 5th by importance.
Elo rating: our custom Elo system (documented at /elo). Processes every regular-season game since 1985 with margin-of-victory scaling and 75% season carryover. Ranks #1 by feature importance.
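The per-game Elo update can be sketched as follows. K = 20 and the 75% season carryover come from the doc; the base rating of 1500 and the FiveThirtyEight-style margin-of-victory multiplier are assumptions standing in for the pipeline's exact formulas:

```python
import math

K = 20            # update step size (from the doc)
CARRYOVER = 0.75  # share of rating retained across seasons (from the doc)
BASE = 1500.0     # assumed base rating

def expected(r_a, r_b):
    """Standard Elo win expectancy for team A."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def mov_multiplier(margin, r_diff):
    """Margin-of-victory scaling (FiveThirtyEight-style; assumed form).
    Dampens the boost when a heavy favorite blows out a weak team."""
    return math.log(abs(margin) + 1) * (2.2 / (0.001 * r_diff + 2.2))

def update(r_winner, r_loser, margin):
    """Zero-sum rating update after one game."""
    mult = mov_multiplier(margin, r_winner - r_loser)
    delta = K * mult * (1 - expected(r_winner, r_loser))
    return r_winner + delta, r_loser - delta

def new_season(rating):
    """Regress 25% of the way back to the base rating between seasons."""
    return CARRYOVER * rating + (1 - CARRYOVER) * BASE
```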
The seed-only baseline uses a single feature, seed_diff. This is the minimum-viable model that any NCAA bracket predictor should beat. Log-loss: 0.566.
The KenPom LR baseline uses 5 features with GridSearchCV over regularization strength C.
This is the "strong baseline" — hard to beat because AdjEM alone is highly predictive. Log-loss: 0.547.
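A hedged sketch of that baseline on synthetic data; the feature count matches the doc, but the stand-in features, the C grid, and the CV setup here are assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

# Synthetic stand-in for the 5 KenPom difference features
# (e.g. AdjEM/AdjOE/AdjDE/AdjTempo diffs); real column names assumed.
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # regularization strengths to try
    scoring="neg_log_loss",                    # select C by log-loss
    cv=5,
)
grid.fit(X, y)
probs = grid.predict_proba(X)[:, 1]  # P(Team A beats Team B)
```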
The full LightGBM model uses all 44 features with Optuna-tuned hyperparameters:
| Parameter | Value |
|---|---|
max_depth | 3 |
num_leaves | 21 |
learning_rate | 0.0130 |
min_child_samples | 26 |
lambda_l1 | 0.0209 |
lambda_l2 | 0.0228 |
feature_fraction | 0.6373 |
bagging_fraction | 0.8234 |
bagging_freq | 2 |
n_estimators | 410 |
Log-loss: 0.538. The key hyperparameters are max_depth=3 (very shallow trees) and learning_rate=0.013 (very slow), which prevent overfitting on a small training set.
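The same configuration, expressed with the keyword names the lightgbm sklearn wrapper expects (the name mapping from lambda_l1 to reg_alpha etc. is standard; the training call itself is omitted):

```python
# Optuna-tuned LightGBM hyperparameters from the table above.
LGBM_PARAMS = {
    "max_depth": 3,               # very shallow trees
    "num_leaves": 21,
    "learning_rate": 0.0130,      # very slow learning
    "min_child_samples": 26,
    "reg_alpha": 0.0209,          # lambda_l1
    "reg_lambda": 0.0228,         # lambda_l2
    "colsample_bytree": 0.6373,   # feature_fraction
    "subsample": 0.8234,          # bagging_fraction
    "subsample_freq": 2,          # bagging_freq
    "n_estimators": 410,
    "objective": "binary",        # binary win/loss classification
}
```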
Leave-One-Season-Out (LOSO): for each of the 22 seasons (2003-2025, excluding 2020), train on all other seasons and predict the held-out season. This is more conservative than k-fold because it tests temporal generalization — the model must predict a season it has never seen.
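The LOSO loop maps directly onto scikit-learn's LeaveOneGroupOut with season as the group label; the data below is synthetic, purely to show the fold structure:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(1)
X = rng.normal(size=(220, 4))            # fake feature matrix
y = rng.integers(0, 2, size=220)         # fake win/loss labels
seasons = np.repeat(np.arange(2003, 2014), 20)  # 11 fake seasons, 20 games each

logo = LeaveOneGroupOut()
held_out = []
for train_idx, test_idx in logo.split(X, y, groups=seasons):
    # Train on all other seasons, predict the single held-out season.
    held_out.append(np.unique(seasons[test_idx]))
```

Each fold's test set is exactly one season the model never saw during training, which is what makes LOSO a test of temporal generalization.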
The final prediction blends LightGBM and the KenPom LR: p = 0.7 * p_LGBM + 0.3 * p_KenPom.
The optimal weight is found by grid search over LOSO predictions. The ensemble achieves 0.53651 log-loss — better than either model alone.
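The blend is a one-liner; the weight-selection step is sketched in the comment, assuming log_loss over the out-of-fold LOSO probabilities:

```python
import numpy as np

def blend(p_lgbm, p_kenpom, w: float = 0.70):
    """Weighted ensemble: w on LightGBM, (1 - w) on the KenPom LR."""
    return w * np.asarray(p_lgbm) + (1 - w) * np.asarray(p_kenpom)

# Weight selection sketch (names assumed), run once over LOSO predictions:
# best_w = min(np.arange(0, 1.01, 0.05),
#              key=lambda w: log_loss(y, blend(p_lgbm_oof, p_kenpom_oof, w)))
```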
| Model | LOSO Log-Loss | Games |
|---|---|---|
| Ensemble | 0.53651 | 1445 |
| LightGBM (tuned) | 0.53890 | 1445 |
| KenPom LR | 0.54697 | 1445 |
| Seed-only LR | 0.56645 | 1445 |
Injuries are incorporated via a Win Shares replacement model that adjusts KenPom AdjEM for missing players.
The discount factor accounts for games already missed: if a player has already been out for most of the season, their absence is already reflected in the team's stats (discount = 0.2 if >50% missed).
The adjustment is split: AdjOE gets half the impact, AdjDE gets the other half. This is conservative — in reality, losing a scorer mostly hurts offense, but we don't have enough data to model the split precisely.
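A sketch of the adjustment; the discount rule and the 50/50 OE/DE split come from the doc, but the functional form, the minutes-share term, and the scale constant are assumptions:

```python
def injury_adjem_delta(ws40, repl_ws40, minutes_share, pct_games_missed):
    """Estimate the AdjEM impact of losing one player.

    ws40: player's Win Shares per 40 minutes.
    repl_ws40: team-specific replacement-level WS/40.
    minutes_share: fraction of team minutes the player plays.
    pct_games_missed: fraction of the season already missed.
    """
    # If the player already missed most of the season, the team's stats
    # largely reflect their absence, so heavily discount the adjustment.
    discount = 0.2 if pct_games_missed > 0.5 else 1.0
    # Scale constant (10.0) converting WS/40 surplus to AdjEM points is assumed.
    delta = (ws40 - repl_ws40) * minutes_share * 10.0 * discount
    # Split the impact evenly: offense loses half, defense worsens by half
    # (AdjDE is points allowed, so a positive change is worse).
    return {"AdjOE": -delta / 2, "AdjDE": +delta / 2}
```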
The model generates championship probabilities via Monte Carlo simulation.
The simulation handles edge cases: play-in games, teams at the same seed competing for region slots, and the "bye" mechanism when fewer than 64 teams are in the field.
An experimental mode adds $N(0, \sigma)$ noise to each team's AdjEM per simulation, modeling the uncertainty in team ratings. This compresses the probability distribution: the favorite's probability decreases while underdogs gain. The notebook comparison model uses $\sigma = 2.5$.
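A toy version of the noisy simulation over a power-of-two field; for simplicity this advances whichever team has the higher perturbed rating, standing in for the model's pairwise win probabilities, and omits the play-in and bye handling described above:

```python
import numpy as np

def simulate_champion(adjem, n_sims=10_000, sigma=2.5, seed=0):
    """Estimate championship probabilities by repeated bracket simulation.

    Each run perturbs every team's rating with N(0, sigma) noise, then
    plays out a single-elimination bracket where the higher perturbed
    rating advances. Field size must be a power of two here.
    """
    rng = np.random.default_rng(seed)
    adjem = np.asarray(adjem, dtype=float)
    wins = np.zeros(len(adjem))
    for _ in range(n_sims):
        noisy = adjem + rng.normal(0.0, sigma, len(adjem))
        alive = np.arange(len(adjem))
        while len(alive) > 1:
            pairs = alive.reshape(-1, 2)  # pair off adjacent bracket slots
            alive = np.where(noisy[pairs[:, 0]] >= noisy[pairs[:, 1]],
                             pairs[:, 0], pairs[:, 1])
        wins[alive[0]] += 1
    return wins / n_sims
```

With larger sigma the favorite's share shrinks and long shots gain, which is exactly the compression effect described above.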
All metrics are computed out-of-sample via LOSO cross-validation on 1445 historical tournament games across 22 seasons:
Expected Calibration Error (ECE) = 0.014 means the model's predicted probabilities closely match actual outcomes. When the model says 70%, the team wins about 70% of the time.
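ECE can be computed as follows; the 10 equal-width bins are an assumption about the binning scheme:

```python
import numpy as np

def ece(probs, outcomes, n_bins=10):
    """Expected Calibration Error: frequency-weighted mean gap between
    average predicted probability and observed win rate, per bin."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(probs[mask].mean() - outcomes[mask].mean())
            total += mask.mean() * gap  # weight by bin occupancy
    return total
```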
The model performs best in early rounds, where seed advantages are largest, and degrades in later rounds, where the remaining teams are more evenly matched.
Which feature groups matter most? We remove each tier entirely and measure the log-loss impact:
| Tier Removed | # Features | Log-Loss | Impact |
|---|---|---|---|
| Consensus | 8 | 0.55085 | +11.5 mLL |
| KenPom | 9 | 0.54651 | +7.1 mLL |
| Elo | 1 | 0.54389 | +4.5 mLL |
| Structural | 4 | 0.54102 | +1.6 mLL |
| Conference | 1 | 0.54044 | +1.1 mLL |
| RegSeason | 9 | 0.53986 | +0.5 mLL |
| FourFactors | 9 | 0.53983 | +0.5 mLL |
| Momentum | 4 | 0.53961 | +0.2 mLL |
| Coach | 2 | 0.53960 | +0.2 mLL |
| Player | 8 | 0.53777 | -1.6 mLL |
The ablation also reveals significant redundancy: KenPom's AdjEM, Massey's POM rank, and Elo are highly correlated. Removing any one is partially compensated by the others. But removing the entire consensus tier (8 features spanning Massey systems, BPI, derived divergence, and poll trajectory) is devastating because it eliminates all cross-system signal.
We compare model championship probabilities to prediction market prices from Kalshi, Polymarket, and sportsbook sharp consensus.
The model's predictions are not blended with market data for the primary submission. Markets are used only for post-hoc comparison and identifying potential value picks. An optional market-blend submission uses 70% model + 30% market-implied pairwise probabilities.