NCAA Tournament Prediction Model

Technical Reference — March Machine Learning Mania 2026

Key figures: 0.53651 LOSO log-loss · 44 LightGBM features · 22 training seasons · 50K bracket simulations

1. Overview

This model predicts the probability that Team A beats Team B for every possible NCAA tournament matchup. It combines a gradient-boosted decision tree (LightGBM, 44 features) with a logistic regression on 5 KenPom-derived features in a 70%/30% weighted ensemble.

2. Architecture

Pipeline architecture

The pipeline has five stages:

1. Data Collection: Fetcher scripts scrape KenPom, ESPN BPI, Massey, RotoWire, and prediction markets. Each source has its own team name mapping.
2. Feature Store: build_feature_store.py joins all sources into one row per (Season, TeamID) with 55 raw features. Outputs team_season_features.csv.
3. Matchup Construction: For each historical tournament game, compute Team A - Team B differences for all features. Convention: A = lower TeamID. Output: matchup_train.csv (1445 games, 2003-2025).
4. Model Training: LOSO cross-validation, Optuna hyperparameter tuning, ensemble weight optimization. Final models trained on all data.
5. Prediction: Generate pairwise probabilities for 2026 tournament teams. Monte Carlo bracket simulation (50K draws) produces championship probabilities.
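Stage 3's diff construction can be sketched as follows. The column names and feature-store layout here are illustrative, not the real schema of the pipeline scripts:

```python
import pandas as pd

def build_matchup_rows(games: pd.DataFrame, feats: pd.DataFrame) -> pd.DataFrame:
    """Turn historical games into training rows of A-minus-B feature diffs.

    Assumes `games` has columns [Season, WTeamID, LTeamID] and `feats` is
    indexed by (Season, TeamID). Names are stand-ins for the real schema.
    """
    rows = []
    for g in games.itertuples(index=False):
        # Pipeline convention: Team A is always the lower TeamID.
        a, b = sorted([g.WTeamID, g.LTeamID])
        diff = (feats.loc[(g.Season, a)] - feats.loc[(g.Season, b)]).add_suffix("_diff")
        diff["label"] = 1 if a == g.WTeamID else 0  # did Team A win?
        rows.append(diff)
    return pd.DataFrame(rows)
```

Keeping A as the lower TeamID makes the row construction deterministic, so the label distribution stays roughly balanced rather than always being 1.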

3. Data Sources

Data source availability
| Source | Collection Method | Features | Seasons | Notes |
|---|---|---|---|---|
| KenPom | kenpompy (Python API) | AdjEM, AdjOE, AdjDE, AdjTempo, ranks, Luck, SOS | 2003-2026 | Pre-tournament snapshots (historical) + live scrape (2026) |
| Massey Ordinals | masseyratings.com (Playwright) | 8 ranking systems: POM, MOR, SAG, COL, DOL, WLK, AP, USA | 2003-2026 | Kaggle baseline + daily gap-fill via web scraping |
| ESPN BPI | espn.com (Playwright) | BPI power ranking | 2014-2026 | JavaScript-rendered page extraction |
| Kaggle Box Scores | Kaggle competition data | FGM, FGA, FGM3, FGA3, FTM, FTA, OR, DR, Ast, TO, Stl, Blk | 2003-2026 | Detailed results + NCAA API gap-fill |
| Tournament Seeds | Kaggle + bracketology consensus | Seed number (1-16) | 2003-2026 | Kaggle official (historical) + ESPN/CBS/RotoWire consensus (2026) |
| Elo Ratings | Computed from Kaggle compact results | Custom Elo rating per team | 1985-2026 | K=20, MOV multiplier, 75% season carryover |
| Coach Data | Kaggle MTeamCoaches.csv | Tenure, tournament appearances | 2003-2026 | Consecutive years + historical tourney count |
| Conference | Kaggle + KenPom | Conference mean AdjEM | 2003-2026 | Average KenPom AdjEM of all conference teams |
| AP Poll | Massey Ordinals (AP system) | Poll trajectory (early vs late season rank) | 2003-2026 | Rank change from week 7 to pre-tournament |
| Injuries | RotoWire (Playwright) + Sports Ref | AdjEM adjustment via WS/40 replacement model | 2024-2026 | Player WS/40 vs team-specific replacement level |
| Prediction Markets | Kalshi, Polymarket, sportsbooks | Championship probabilities | 2026 only | Used for post-hoc model-market comparison, not as features |

Team name resolution is the biggest recurring complexity. Each source uses different team names (e.g., "Connecticut" vs "UConn" vs "CONN"). The resolve_teams.py module handles fuzzy matching across all known spellings, mapping every name to a canonical Kaggle TeamID.

4. Feature Engineering

Feature Categories

Features are organized into tiers by source and type. All enter the model as differences: feature_A - feature_B.

Feature importance

Full Feature List (44 LightGBM features)

| # | Feature | Name | Tier | Gain | Description |
|---|---|---|---|---|---|
| 1 | elo_rating_diff | Elo Rating | Elo | 3,699 | Custom Elo from all reg-season games since 1985 |
| 2 | AdjEM_diff | Adjusted Efficiency Margin | KenPom | 3,190 | Offensive - defensive efficiency, tempo-adjusted |
| 3 | massey_best_rank_diff | Massey Best Rank | Consensus | 3,139 | Best rank across 8 Massey systems |
| 4 | pom_rank_diff | KenPom Rank (POM) | Consensus | 2,943 | Pomeroy rank from Massey composite |
| 5 | bpi_kenpom_divergence_diff | BPI-KenPom Divergence | Consensus | 1,948 | BPI rank minus KenPom rank |
| 6 | massey_mean_rank_diff | Massey Mean Rank | Consensus | 918 | Average rank across 8 systems |
| 7 | AdjOE_diff | Adjusted Offensive Efficiency | KenPom | 906 | Points per 100 possessions, tempo-adjusted |
| 8 | sos_proxy_diff | Strength of Schedule | Box Scores | 782 | Mean opponents' win percentage |
| 9 | off_ftr_diff | Free Throw Rate | Four Factors | 717 | FTA / FGA ratio |
| 10 | win_pct_diff | Win Percentage | Momentum | 608 | Regular season win rate |
| 11 | opp_fg3_pct_diff | Opp. 3-Point % | Box Scores | 576 | Three-point shooting allowed |
| 12 | seed_line_win_rate_diff | Seed Line Win Rate | Structural | 552 | Historical win rate for this seed (1985-2002) |
| 13 | ast_rate_diff | Assist Rate | Box Scores | 534 | Assists per game |
| 14 | AdjDE_diff | Adjusted Defensive Efficiency | KenPom | 496 | Points allowed per 100 possessions |
| 15 | massey_std_rank_diff | Massey Rank Std Dev | Consensus | 483 | Disagreement across ranking systems |
| 16 | RankAdjDE_diff | KenPom AdjDE Rank | KenPom | 450 | National rank by defensive efficiency |
| 17 | orb_pct_diff | Offensive Rebound % | Box Scores | 432 | Share of available offensive rebounds |
| 18 | coach_tenure_diff | Coach Tenure | Coach | 356 | Consecutive years at current school |
| 19 | scoring_margin_std_diff | Scoring Margin Std Dev | Other | 329 | |
| 20 | ft_pct_diff | FT% | Other | 296 | |
| 21 | tov_rate_diff | Turnover Rate | Other | 285 | |
| 22 | log_seed_diff | Log(Seed) | Structural | 283 | Logarithmic seed; compresses high seeds |
| 23 | bpi_rank_diff | BPI Rank | Other | 262 | |
| 24 | ap_rank_diff | AP Rank | Other | 252 | |
| 25 | poll_trajectory_diff | Poll Trajectory | Other | 236 | |
| 26 | def_ftr_diff | Def. Free Throw Rate | Other | 236 | |
| 27 | scoring_margin_avg_diff | Scoring Margin Avg | Other | 232 | |
| 28 | last10_margin_diff | Last-10 Margin | Other | 211 | |
| 29 | AdjTempo_diff | AdjTempo | Other | 205 | |
| 30 | def_efg_pct_diff | Def. eFG% | Other | 188 | |
| 31 | opp_fg_pct_diff | Opp. FG% | Other | 184 | |
| 32 | fg_pct_diff | FG% | Other | 182 | |
| 33 | RankAdjEM_diff | KenPom AdjEM Rank | KenPom | 177 | National rank by efficiency margin |
| 34 | RankAdjOE_diff | KenPom AdjOE Rank | Other | 175 | |
| 35 | fg3_pct_diff | 3-Point % | Other | 166 | |
| 36 | matchup_off_def_sum | Matchup Off/Def Sum | Other | 153 | |
| 37 | off_orb_pct_diff | Off. ORB% | Other | 145 | |
| 38 | off_efg_pct_diff | Off. eFG% | Other | 131 | |
| 39 | conf_mean_adjem_diff | Conference Mean AdjEM | Other | 130 | |
| 40 | seed_product | Seed Product | Other | 116 | |
| 41 | coach_tourney_apps_diff | Coach Tourney Appearances | Other | 96 | |
| 42 | three_rate_diff | Three-Point Rate | Other | 87 | |
| 43 | off_tov_pct_diff | Off. TOV% | Other | 85 | |
| 44 | seed_diff | Seed | Other | 25 | |

Key Derived Features

Four Factors (Dean Oliver's framework): offensive and defensive versions of effective FG%, turnover rate, offensive rebound rate, and free throw rate. These are tempo-free — they measure quality independent of pace.

$$\text{eFG\%} = \frac{\text{FGM} + 0.5 \times \text{FGM3}}{\text{FGA}} \qquad \text{TOV\%} = \frac{\text{TO}}{\text{FGA} - \text{ORB} + \text{TO} + 0.475 \times \text{FTA}}$$
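The formulas above can be computed from raw box-score totals in a few lines. This is a minimal sketch using the Kaggle column conventions (the opponent's defensive rebounds are also needed for ORB%; the function name is ours):

```python
def four_factors(fgm, fga, fgm3, fta, to, orb, opp_drb):
    """Offensive Four Factors from season box-score totals (Dean Oliver)."""
    efg = (fgm + 0.5 * fgm3) / fga                 # effective FG%
    tov = to / (fga - orb + to + 0.475 * fta)      # turnover rate (per possession est.)
    orb_pct = orb / (orb + opp_drb)                # share of available off. rebounds
    ftr = fta / fga                                # free throw rate
    return {"efg_pct": efg, "tov_rate": tov, "orb_pct": orb_pct, "ft_rate": ftr}
```

The defensive versions are the same formulas applied to opponents' totals; because every term is a ratio, the factors are tempo-free as the text notes.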

Matchup interactions: How offensive vs defensive the matchup is overall (symmetric — invariant to team ordering):

$$\text{matchup\_off\_def\_sum} = (\text{AdjOE}_A - \text{AdjDE}_B) + (\text{AdjOE}_B - \text{AdjDE}_A) \qquad \text{seed\_product} = \text{seed}_A \times \text{seed}_B$$

BPI-KenPom divergence: when BPI and KenPom disagree about a team's ranking, it signals that one system sees something the other doesn't. This feature ranks 5th by importance.

Elo rating: our custom Elo system (documented at /elo). Processes every regular-season game since 1985 with margin-of-victory scaling and 75% season carryover. Ranks #1 by feature importance.
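The Elo mechanics can be sketched as follows. The doc specifies K=20, margin-of-victory scaling, and 75% season carryover; the exact MOV formula below (a FiveThirtyEight-style form) is an assumption, since the doc does not state it:

```python
import math

K = 20            # update size (from the doc)
CARRYOVER = 0.75  # fraction of rating deviation kept between seasons
BASE = 1500.0

def expected(ra: float, rb: float) -> float:
    """Win probability implied by an Elo gap."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

def mov_multiplier(margin: int, ra: float, rb: float) -> float:
    # Margin-of-victory scaling; dampened when the favorite wins big
    # so blowouts by strong teams don't inflate ratings. Assumed form.
    return math.log(abs(margin) + 1) * 2.2 / ((ra - rb) * 0.001 + 2.2)

def update(ra: float, rb: float, margin: int) -> tuple[float, float]:
    """Team A beat Team B by `margin` points; return updated ratings."""
    delta = K * mov_multiplier(margin, ra, rb) * (1.0 - expected(ra, rb))
    return ra + delta, rb - delta

def new_season(rating: float) -> float:
    """75% carryover: regress a quarter of the way back to the baseline."""
    return BASE + CARRYOVER * (rating - BASE)
```

Running `update` over every compact result since 1985, with `new_season` applied at each season boundary, yields the pre-tournament ratings that feed elo_rating_diff.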

5. Model Training

Model 1: Seed-Only Logistic Regression (Floor Baseline)

A single feature: seed_diff. This is the minimum-viable model that any NCAA bracket predictor should beat. Log-loss: 0.566.

Model 2: KenPom Logistic Regression

5 features with GridSearchCV over regularization strength C:

$$P(A\ \text{wins}) = \sigma(\mathbf{w}^T \mathbf{x} + b) \qquad \mathbf{x} = [\text{AdjEM}_\Delta,\ \text{seed}_\Delta,\ \text{AdjOE}_\Delta,\ \text{AdjDE}_\Delta,\ \text{massey\_mean}_\Delta]$$

This is the "strong baseline" — hard to beat because AdjEM alone is highly predictive. Log-loss: 0.547.
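A sketch of Model 2's fit using scikit-learn, on synthetic stand-in data (the C grid and random labels are illustrative; the doc only says GridSearchCV over regularization strength):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in for the five KenPom-derived diffs named in the formula above.
rng = np.random.default_rng(0)
X = rng.normal(size=(1445, 5))   # [AdjEM, seed, AdjOE, AdjDE, massey_mean] diffs
y = (X[:, 0] + 0.1 * rng.normal(size=1445) > 0).astype(int)  # synthetic labels

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # regularization strength
    scoring="neg_log_loss",
    cv=5,
)
grid.fit(X, y)
best_lr = grid.best_estimator_
```

In the real pipeline X would be the matchup diffs and y the Team A win labels; everything else carries over unchanged.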

Model 3: LightGBM

44 features with Optuna-tuned hyperparameters:

| Parameter | Value |
|---|---|
| max_depth | 3 |
| num_leaves | 21 |
| learning_rate | 0.0130 |
| min_child_samples | 26 |
| lambda_l1 | 0.0209 |
| lambda_l2 | 0.0228 |
| feature_fraction | 0.6373 |
| bagging_fraction | 0.8234 |
| bagging_freq | 2 |
| n_estimators | 410 |

Log-loss: 0.538. The key hyperparameters are max_depth=3 (very shallow trees) and learning_rate=0.013 (very slow), which prevent overfitting on a small training set.
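For reference, the tuned configuration as a parameter dict. The objective key is our assumption, and the lightgbm import is left to the caller so the snippet stands alone:

```python
# Optuna-tuned LightGBM configuration from the table above.
LGB_PARAMS = {
    "objective": "binary",        # assumption: binary win/loss target
    "max_depth": 3,               # very shallow trees
    "num_leaves": 21,
    "learning_rate": 0.0130,      # very slow learning
    "min_child_samples": 26,
    "lambda_l1": 0.0209,
    "lambda_l2": 0.0228,
    "feature_fraction": 0.6373,   # column subsampling per tree
    "bagging_fraction": 0.8234,   # row subsampling
    "bagging_freq": 2,
    "n_estimators": 410,
}

# Usage (requires lightgbm):
#   model = lightgbm.LGBMClassifier(**LGB_PARAMS)
#   model.fit(X_train, y_train)
```

Note that max_depth=3 caps trees at 8 leaves, so num_leaves=21 is effectively inactive; the depth cap is what does the regularizing on this small dataset.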

Training Protocol

Leave-One-Season-Out (LOSO): for each of the 22 seasons (2003-2025, excluding 2020), train on all other seasons and predict the held-out season. This is more conservative than k-fold because it tests temporal generalization — the model must predict a season it has never seen.

Why not k-fold? Tournament dynamics change over time (expanding field, play-in games, evolving parity). LOSO ensures we measure how well the model predicts future tournaments, not how well it memorizes the distribution of all tournaments.
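The LOSO protocol reduces to a short harness. This is a generic sketch assuming arrays X, y and a per-game season vector; the model factory is a stand-in for either model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def loso_log_loss(X, y, seasons,
                  make_model=lambda: LogisticRegression(max_iter=1000)):
    """Hold out one season at a time; return pooled log-loss and predictions."""
    preds = np.empty(len(y), dtype=float)
    for s in np.unique(seasons):
        held_out = seasons == s
        model = make_model()                       # fresh model per fold
        model.fit(X[~held_out], y[~held_out])
        preds[held_out] = model.predict_proba(X[held_out])[:, 1]
    return log_loss(y, preds), preds
```

Pooling the out-of-fold predictions before scoring (rather than averaging per-season log-losses) weights each game equally, which matches how the headline 0.53651 is reported over all 1445 games.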

6. Ensemble

The final prediction blends LightGBM and KenPom LR:

$$P_{\text{ensemble}} = 0.70 \cdot P_{\text{LGB}} + 0.30 \cdot P_{\text{LR}} \qquad \text{clipped to } [0.01, 0.99]$$
Ensemble weight optimization

The optimal weight is found by grid search over LOSO predictions. The ensemble achieves 0.53651 log-loss — better than either model alone.
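The weight search over LOSO out-of-fold predictions can be sketched as follows (the grid step and function name are illustrative):

```python
import numpy as np

def best_ensemble_weight(p_lgb, p_lr, y, step=0.05):
    """Grid-search the blend weight w to minimize log-loss on OOF predictions."""
    def ll(p):
        p = np.clip(p, 0.01, 0.99)                 # same clipping as the doc
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    weights = np.arange(0.0, 1.0 + 1e-9, step)
    losses = [ll(w * p_lgb + (1 - w) * p_lr) for w in weights]
    return weights[int(np.argmin(losses))]
```

Because the weight is fit on the same LOSO predictions used for reporting, it is mildly optimistic; a coarse grid like this keeps the selection from overfitting to noise in the folds.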

Model comparison
| Model | LOSO Log-Loss | Games |
|---|---|---|
| Ensemble | 0.53651 | 1445 |
| LightGBM (tuned) | 0.53890 | 1445 |
| KenPom LR | 0.54697 | 1445 |
| Seed-only LR | 0.56645 | 1445 |

Why does ensembling help? LightGBM captures nonlinear interactions (e.g., "seed 12 teams with high tempo beat seed 5 teams with slow pace") while logistic regression provides smooth, well-calibrated probabilities. The LR acts as a regularizer — when LightGBM overreacts to noise in rare matchup types, the LR pulls predictions back toward the calibrated AdjEM baseline.


7. Injury Adjustments

Injuries are incorporated via a Win Shares replacement model that adjusts KenPom AdjEM for missing players:

1. Load injury reports from RotoWire (status: "Out" or "Out For Season").
2. Match each injured player to Sports Reference advanced stats via fuzzy name matching.
3. Compare the player's WS/40 to a team-specific replacement level (from linear regression on AdjEM).
4. Convert the WS/40 gap to an AdjEM adjustment and apply it to the feature store.

$$\text{Replacement WS/40} = 0.0193 + 0.0011 \times \text{AdjEM}$$

$$\Delta\text{AdjEM} = -(\text{Player WS/40} - \text{Replacement WS/40}) \times 5 \times \text{discount}$$

The discount factor accounts for games already missed: if a player has already been out for most of the season, their absence is already reflected in the team's stats (discount = 0.2 if >50% missed).

The adjustment is split: AdjOE gets half the impact, AdjDE gets the other half. This is conservative — in reality, losing a scorer mostly hurts offense, but we don't have enough data to model the split precisely.
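The formulas and the discount rule reduce to a few lines. The function name and example values are ours; the constants come from the equations above:

```python
def injury_adjem_delta(player_ws40: float, team_adjem: float,
                       frac_season_missed: float) -> float:
    """AdjEM adjustment for one injured player (WS/40 replacement model)."""
    # Team-specific replacement level from the regression in the doc.
    replacement_ws40 = 0.0193 + 0.0011 * team_adjem
    # Absence already reflected in season stats -> heavy discount.
    discount = 0.2 if frac_season_missed > 0.5 else 1.0
    return -(player_ws40 - replacement_ws40) * 5 * discount

# Example: a 0.180 WS/40 starter goes down for a +20 AdjEM team.
delta = injury_adjem_delta(player_ws40=0.180, team_adjem=20.0,
                           frac_season_missed=0.1)
# Split evenly: offense loses half, defense (where lower is better) absorbs half.
adj_oe_delta, adj_de_delta = delta / 2, -delta / 2
```

The sign convention matters: a negative delta lowers AdjOE and raises AdjDE, so the team's AdjEM falls by the full amount.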

8. Bracket Simulation

The model generates championship probabilities via Monte Carlo simulation:

1. Build field: identify tournament teams from Kaggle seeds or bracketology consensus (for 2026 before Selection Sunday).
2. Assign regions: either random shuffle per simulation or fixed from bracketology consensus.
3. Build bracket: standard NCAA seeding matchups (1v16, 8v9, 5v12, 4v13, 6v11, 3v14, 7v10, 2v15) within each region.
4. Simulate: for each of 50,000 draws, play out every game using the model's pairwise probabilities. Record advancement for each team at each round.
5. Aggregate: championship probability = fraction of simulations where the team won the title. Confidence intervals via parametric bootstrap.

The simulation handles edge cases: play-in games, teams at the same seed competing for region slots, and the "bye" mechanism when fewer than 64 teams are in the field.
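The core advancement loop can be sketched on a reduced 4-team bracket slice. The pairing order and the shape of the win_prob lookup are illustrative; the real simulator also handles play-ins, byes, and region assignment:

```python
import random

def simulate_region(teams, win_prob, n_sims=50_000, seed=42):
    """Monte Carlo title rates for a small single-elimination bracket.

    `teams` is listed in bracket order (adjacent entries meet in round 1);
    `win_prob[(a, b)]` is the model's P(a beats b).
    """
    rng = random.Random(seed)
    wins = {t: 0 for t in teams}
    for _ in range(n_sims):
        field = list(teams)
        while len(field) > 1:
            nxt = []
            for a, b in zip(field[::2], field[1::2]):
                # Look up P(a beats b), falling back to the mirrored pair.
                p = win_prob.get((a, b), 1.0 - win_prob.get((b, a), 0.5))
                nxt.append(a if rng.random() < p else b)
            field = nxt
        wins[field[0]] += 1
    return {t: w / n_sims for t, w in wins.items()}
```

With 50K draws, Monte Carlo error on a championship probability p is roughly sqrt(p(1-p)/50000), under 0.3 percentage points for any team.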

Optional: Rating Perturbation

An experimental mode adds $N(0, \sigma)$ noise to each team's AdjEM per simulation, modeling the uncertainty in team ratings. This compresses the probability distribution: the favorite's probability decreases while underdogs gain. The notebook comparison model uses $\sigma = 2.5$.

9. Validation

All metrics are computed out-of-sample via LOSO cross-validation on 1445 historical tournament games across 22 seasons:

Log-Loss: 0.536
Brier Score: 0.182
AUC-ROC: 0.801
ECE (Calibration): 0.014

Expected Calibration Error (ECE) = 0.014 means the model's predicted probabilities closely match actual outcomes. When the model says 70%, the team wins about 70% of the time.
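A minimal binned ECE computation. The doc does not specify the binning, so the 10 equal-width bins here are an assumption:

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE: bin-weighted gap between mean predicted prob and observed win rate."""
    probs, outcomes = np.asarray(probs), np.asarray(outcomes)
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(probs[mask].mean() - outcomes[mask].mean())
            ece += mask.mean() * gap   # weight by fraction of games in the bin
    return ece
```

An ECE of 0.014 therefore means the average (bin-weighted) gap between stated probability and realized frequency is about 1.4 percentage points.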

Context: a naive seed-based model achieves ~0.567 log-loss. The Kaggle competition uses Brier score (MSE) and includes both men's and women's tournaments, so top Brier scores are hard to compare directly to log-loss. Our 0.536 LOSO log-loss on genuinely unseen men's tournament games comfortably beats the seed baseline, and the model's Brier score (0.182) is in a competitive range.

Per-Round Performance

The model performs best in early rounds (where seed advantages are largest) and degrades in later rounds (where remaining teams are more evenly matched).

10. Feature Ablation

Which feature groups matter most? We remove each tier entirely and measure the log-loss impact:

Feature ablation
| Tier Removed | # Features | Log-Loss | Impact |
|---|---|---|---|
| Consensus | 8 | 0.55085 | +11.5 mLL |
| KenPom | 9 | 0.54651 | +7.1 mLL |
| Elo | 1 | 0.54389 | +4.5 mLL |
| Structural | 4 | 0.54102 | +1.6 mLL |
| Conference | 1 | 0.54044 | +1.1 mLL |
| RegSeason | 9 | 0.53986 | +0.5 mLL |
| FourFactors | 9 | 0.53983 | +0.5 mLL |
| Momentum | 4 | 0.53961 | +0.2 mLL |
| Coach | 2 | 0.53960 | +0.2 mLL |
| Player | 8 | 0.53777 | -1.6 mLL |

Key insight: Consensus rankings (+11.5 mLL) and KenPom metrics (+7.1 mLL) are by far the most important tiers. Elo alone accounts for +4.5 mLL — impressive for a single feature. Player-level features (depth, usage, star dependency) actually hurt the model (-1.6 mLL) due to sparse, noisy coverage and are excluded.

The ablation also reveals significant redundancy: KenPom's AdjEM, Massey's POM rank, and Elo are highly correlated. Removing any one is partially compensated by the others. But removing the entire consensus tier (8 features spanning Massey systems, BPI, derived divergence, and poll trajectory) is devastating because it eliminates all cross-system signal.

11. Market Comparison

We compare model championship probabilities to prediction market prices from Kalshi, Polymarket, and sportsbook sharp consensus.

The model's predictions are not blended with market data for the primary submission. Markets are used only for post-hoc comparison and identifying potential value picks. An optional market-blend submission uses 70% model + 30% market-implied pairwise probabilities.

Systematic bias pattern: if the model has a structural bias, it's toward rewarding defensive efficiency and underweighting market narratives. This could be correct on average, but if wrong, the errors will be correlated across all "value" picks (Houston, Illinois, etc. all fail together).