NCAA Tournament Prediction Model

Technical Reference — March Machine Learning Mania 2026

Key figures: 0.53651 LOSO log-loss · 44 LightGBM features · 22 training seasons · 50K bracket simulations

1. Overview

This model predicts the probability that Team A beats Team B for every possible NCAA tournament matchup. It combines a gradient-boosted decision tree (LightGBM, 44 features) with a logistic regression on 5 KenPom-derived features in a 70%/30% weighted ensemble.

2. Architecture

Pipeline architecture

The pipeline has five stages:

1. Data Collection: Fetcher scripts scrape KenPom, ESPN BPI, Massey, RotoWire, and prediction markets. Each source has its own team name mapping.
2. Feature Store: build_feature_store.py joins all sources into one row per (Season, TeamID) with 55 raw features. Outputs team_season_features.csv.
3. Matchup Construction: For each historical tournament game, compute Team A - Team B differences for all features. Convention: A = lower TeamID. Output: matchup_train.csv (1445 games, 2003-2025).
4. Model Training: LOSO cross-validation, Optuna hyperparameter tuning, ensemble weight optimization. Final models trained on all data.
5. Prediction: Generate pairwise probabilities for 2026 tournament teams. Monte Carlo bracket simulation (50K draws) produces championship probabilities.
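Stage 3's diff construction can be sketched as follows. The column names and feature-store layout here are illustrative, not the real schema of the pipeline scripts:

```python
import pandas as pd

def build_matchup_rows(games: pd.DataFrame, feats: pd.DataFrame) -> pd.DataFrame:
    """Turn historical games into training rows of A-minus-B feature diffs.

    Assumes `games` has columns [Season, WTeamID, LTeamID] and `feats` is
    indexed by (Season, TeamID). Names are stand-ins for the real schema.
    """
    rows = []
    for g in games.itertuples(index=False):
        # Pipeline convention: Team A is always the lower TeamID.
        a, b = sorted([g.WTeamID, g.LTeamID])
        diff = (feats.loc[(g.Season, a)] - feats.loc[(g.Season, b)]).add_suffix("_diff")
        diff["label"] = 1 if a == g.WTeamID else 0  # did Team A win?
        rows.append(diff)
    return pd.DataFrame(rows)
```

Keeping A as the lower TeamID makes the row construction deterministic, so the label distribution stays roughly balanced rather than always being 1.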

3. Data Sources

Data source availability
| Source | Collection Method | Features | Seasons | Notes |
|---|---|---|---|---|
| KenPom | kenpompy (Python API) | AdjEM, AdjOE, AdjDE, AdjTempo, ranks, Luck, SOS | 2003-2026 | Pre-tournament snapshots (historical) + live scrape (2026) |
| Massey Ordinals | masseyratings.com (Playwright) | 8 ranking systems: POM, MOR, SAG, COL, DOL, WLK, AP, USA | 2003-2026 | Kaggle baseline + daily gap-fill via web scraping |
| ESPN BPI | espn.com (Playwright) | BPI power ranking | 2014-2026 | JavaScript-rendered page extraction |
| Kaggle Box Scores | Kaggle competition data | FGM, FGA, FGM3, FGA3, FTM, FTA, OR, DR, Ast, TO, Stl, Blk | 2003-2026 | Detailed results + NCAA API gap-fill |
| Tournament Seeds | Kaggle + bracketology consensus | Seed number (1-16) | 2003-2026 | Kaggle official (historical) + ESPN/CBS/RotoWire consensus (2026) |
| Elo Ratings | Computed from Kaggle compact results | Custom Elo rating per team | 1985-2026 | K=20, MOV multiplier, 75% season carryover |
| Coach Data | Kaggle MTeamCoaches.csv | Tenure, tournament appearances | 2003-2026 | Consecutive years + historical tourney count |
| Conference | Kaggle + KenPom | Conference mean AdjEM | 2003-2026 | Average KenPom AdjEM of all conference teams |
| AP Poll | Massey Ordinals (AP system) | Poll trajectory (early vs late season rank) | 2003-2026 | Rank change from week 7 to pre-tournament |
| Injuries | RotoWire (Playwright) + Sports Ref | AdjEM adjustment via WS/40 replacement model | 2024-2026 | Player WS/40 vs team-specific replacement level |
| Prediction Markets | Kalshi, Polymarket, sportsbooks | Championship probabilities | 2026 only | Used for post-hoc model-market comparison, not as features |

Team name resolution is the biggest recurring complexity. Each source uses different team names (e.g., "Connecticut" vs "UConn" vs "CONN"). The resolve_teams.py module handles fuzzy matching across all known spellings, mapping every name to a canonical Kaggle TeamID.

4. Feature Engineering

Feature Categories

Features are organized into tiers by source and type. All enter the model as differences: feature_A - feature_B.

Feature importance

Full Feature List (44 LightGBM features)

| # | Feature | Name | Tier | Gain | Description |
|---|---|---|---|---|---|
| 1 | elo_rating_diff | Elo Rating | Elo | 3,699 | Custom Elo from all reg-season games since 1985 |
| 2 | AdjEM_diff | Adjusted Efficiency Margin | KenPom | 3,190 | Offensive - defensive efficiency, tempo-adjusted |
| 3 | massey_best_rank_diff | Massey Best Rank | Consensus | 3,139 | Best rank across 8 Massey systems |
| 4 | pom_rank_diff | KenPom Rank (POM) | Consensus | 2,943 | Pomeroy rank from Massey composite |
| 5 | bpi_kenpom_divergence_diff | BPI-KenPom Divergence | Consensus | 1,948 | BPI rank minus KenPom rank |
| 6 | massey_mean_rank_diff | Massey Mean Rank | Consensus | 918 | Average rank across 8 systems |
| 7 | AdjOE_diff | Adjusted Offensive Efficiency | KenPom | 906 | Points per 100 possessions, tempo-adjusted |
| 8 | sos_proxy_diff | Strength of Schedule | Box Scores | 782 | Mean opponents' win percentage |
| 9 | off_ftr_diff | Free Throw Rate | Four Factors | 717 | FTA / FGA ratio |
| 10 | win_pct_diff | Win Percentage | Momentum | 608 | Regular season win rate |
| 11 | opp_fg3_pct_diff | Opp. 3-Point % | Box Scores | 576 | Three-point shooting allowed |
| 12 | seed_line_win_rate_diff | Seed Line Win Rate | Structural | 552 | Historical win rate for this seed (1985-2002) |
| 13 | ast_rate_diff | Assist Rate | Box Scores | 534 | Assists per game |
| 14 | AdjDE_diff | Adjusted Defensive Efficiency | KenPom | 496 | Points allowed per 100 possessions |
| 15 | massey_std_rank_diff | Massey Rank Std Dev | Consensus | 483 | Disagreement across ranking systems |
| 16 | RankAdjDE_diff | KenPom AdjDE Rank | KenPom | 450 | National rank by defensive efficiency |
| 17 | orb_pct_diff | Offensive Rebound % | Box Scores | 432 | Share of available offensive rebounds |
| 18 | coach_tenure_diff | Coach Tenure | Coach | 356 | Consecutive years at current school |
| 19 | scoring_margin_std_diff | Scoring Margin Std Dev | Other | 329 | |
| 20 | ft_pct_diff | FT% | Other | 296 | |
| 21 | tov_rate_diff | Turnover Rate | Other | 285 | |
| 22 | log_seed_diff | Log(Seed) | Structural | 283 | Logarithmic seed; compresses high seeds |
| 23 | bpi_rank_diff | BPI Rank | Other | 262 | |
| 24 | ap_rank_diff | AP Rank | Other | 252 | |
| 25 | poll_trajectory_diff | Poll Trajectory | Other | 236 | |
| 26 | def_ftr_diff | Def. Free Throw Rate | Other | 236 | |
| 27 | scoring_margin_avg_diff | Scoring Margin Avg | Other | 232 | |
| 28 | last10_margin_diff | Last-10 Margin | Other | 211 | |
| 29 | AdjTempo_diff | AdjTempo | Other | 205 | |
| 30 | def_efg_pct_diff | Def. eFG% | Other | 188 | |
| 31 | opp_fg_pct_diff | Opp. FG% | Other | 184 | |
| 32 | fg_pct_diff | FG% | Other | 182 | |
| 33 | RankAdjEM_diff | KenPom AdjEM Rank | KenPom | 177 | National rank by efficiency margin |
| 34 | RankAdjOE_diff | KenPom AdjOE Rank | Other | 175 | |
| 35 | fg3_pct_diff | 3-Point % | Other | 166 | |
| 36 | matchup_off_def_sum | Matchup Off/Def Sum | Other | 153 | |
| 37 | off_orb_pct_diff | Off. ORB% | Other | 145 | |
| 38 | off_efg_pct_diff | Off. eFG% | Other | 131 | |
| 39 | conf_mean_adjem_diff | Conference Mean AdjEM | Other | 130 | |
| 40 | seed_product | Seed Product | Other | 116 | |
| 41 | coach_tourney_apps_diff | Coach Tourney Appearances | Other | 96 | |
| 42 | three_rate_diff | Three-Point Rate | Other | 87 | |
| 43 | off_tov_pct_diff | Off. TOV% | Other | 85 | |
| 44 | seed_diff | Seed | Other | 25 | |

Key Derived Features

Four Factors (Dean Oliver's framework): offensive and defensive versions of effective FG%, turnover rate, offensive rebound rate, and free throw rate. These are tempo-free — they measure quality independent of pace.

$$\text{eFG\%} = \frac{\text{FGM} + 0.5 \times \text{FGM3}}{\text{FGA}} \qquad \text{TOV\%} = \frac{\text{TO}}{\text{FGA} - \text{ORB} + \text{TO} + 0.475 \times \text{FTA}}$$
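The formulas above can be computed from raw box-score totals in a few lines. This is a minimal sketch using the Kaggle column conventions (the opponent's defensive rebounds are also needed for ORB%; the function name is ours):

```python
def four_factors(fgm, fga, fgm3, fta, to, orb, opp_drb):
    """Offensive Four Factors from season box-score totals (Dean Oliver)."""
    efg = (fgm + 0.5 * fgm3) / fga                 # effective FG%
    tov = to / (fga - orb + to + 0.475 * fta)      # turnover rate (per possession est.)
    orb_pct = orb / (orb + opp_drb)                # share of available off. rebounds
    ftr = fta / fga                                # free throw rate
    return {"efg_pct": efg, "tov_rate": tov, "orb_pct": orb_pct, "ft_rate": ftr}
```

The defensive versions are the same formulas applied to opponents' totals; because every term is a ratio, the factors are tempo-free as the text notes.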

Matchup interactions: How offensive vs defensive the matchup is overall (symmetric — invariant to team ordering):

$$\text{matchup\_off\_def\_sum} = (\text{AdjOE}_A - \text{AdjDE}_B) + (\text{AdjOE}_B - \text{AdjDE}_A) \qquad \text{seed\_product} = \text{seed}_A \times \text{seed}_B$$

BPI-KenPom divergence: when BPI and KenPom disagree about a team's ranking, it signals that one system sees something the other doesn't. This feature ranks 5th by importance.

Elo rating: our custom Elo system (documented at /elo). Processes every regular-season game since 1985 with margin-of-victory scaling and 75% season carryover. Ranks #1 by feature importance.
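The Elo mechanics can be sketched as follows. The doc specifies K=20, margin-of-victory scaling, and 75% season carryover; the exact MOV formula below (a FiveThirtyEight-style form) is an assumption, since the doc does not state it:

```python
import math

K = 20            # update size (from the doc)
CARRYOVER = 0.75  # fraction of rating deviation kept between seasons
BASE = 1500.0

def expected(ra: float, rb: float) -> float:
    """Win probability implied by an Elo gap."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))

def mov_multiplier(margin: int, ra: float, rb: float) -> float:
    # Margin-of-victory scaling; dampened when the favorite wins big
    # so blowouts by strong teams don't inflate ratings. Assumed form.
    return math.log(abs(margin) + 1) * 2.2 / ((ra - rb) * 0.001 + 2.2)

def update(ra: float, rb: float, margin: int) -> tuple[float, float]:
    """Team A beat Team B by `margin` points; return updated ratings."""
    delta = K * mov_multiplier(margin, ra, rb) * (1.0 - expected(ra, rb))
    return ra + delta, rb - delta

def new_season(rating: float) -> float:
    """75% carryover: regress a quarter of the way back to the baseline."""
    return BASE + CARRYOVER * (rating - BASE)
```

Running `update` over every compact result since 1985, with `new_season` applied at each season boundary, yields the pre-tournament ratings that feed elo_rating_diff.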

5. Model Training

Model 1: Seed-Only Logistic Regression (Floor Baseline)

A single feature: seed_diff. This is the minimum-viable model that any NCAA bracket predictor should beat. Log-loss: 0.566.

Model 2: KenPom Logistic Regression

5 features with GridSearchCV over regularization strength C:

$$P(A\ \text{wins}) = \sigma(\mathbf{w}^T \mathbf{x} + b) \qquad \mathbf{x} = [\text{AdjEM}_\Delta,\ \text{seed}_\Delta,\ \text{AdjOE}_\Delta,\ \text{AdjDE}_\Delta,\ \text{massey\_mean}_\Delta]$$

This is the "strong baseline" — hard to beat because AdjEM alone is highly predictive. Log-loss: 0.547.
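A sketch of Model 2's fit using scikit-learn, on synthetic stand-in data (the C grid and random labels are illustrative; the doc only says GridSearchCV over regularization strength):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Stand-in for the five KenPom-derived diffs named in the formula above.
rng = np.random.default_rng(0)
X = rng.normal(size=(1445, 5))   # [AdjEM, seed, AdjOE, AdjDE, massey_mean] diffs
y = (X[:, 0] + 0.1 * rng.normal(size=1445) > 0).astype(int)  # synthetic labels

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # regularization strength
    scoring="neg_log_loss",
    cv=5,
)
grid.fit(X, y)
best_lr = grid.best_estimator_
```

In the real pipeline X would be the matchup diffs and y the Team A win labels; everything else carries over unchanged.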

Model 3: LightGBM

44 features with Optuna-tuned hyperparameters:

| Parameter | Value |
|---|---|
| max_depth | 3 |
| num_leaves | 21 |
| learning_rate | 0.0130 |
| min_child_samples | 26 |
| lambda_l1 | 0.0209 |
| lambda_l2 | 0.0228 |
| feature_fraction | 0.6373 |
| bagging_fraction | 0.8234 |
| bagging_freq | 2 |
| n_estimators | 410 |

Log-loss: 0.538. The key hyperparameters are max_depth=3 (very shallow trees) and learning_rate=0.013 (very slow), which prevent overfitting on a small training set.
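For reference, the tuned configuration as a parameter dict. The objective key is our assumption, and the lightgbm import is left to the caller so the snippet stands alone:

```python
# Optuna-tuned LightGBM configuration from the table above.
LGB_PARAMS = {
    "objective": "binary",        # assumption: binary win/loss target
    "max_depth": 3,               # very shallow trees
    "num_leaves": 21,
    "learning_rate": 0.0130,      # very slow learning
    "min_child_samples": 26,
    "lambda_l1": 0.0209,
    "lambda_l2": 0.0228,
    "feature_fraction": 0.6373,   # column subsampling per tree
    "bagging_fraction": 0.8234,   # row subsampling
    "bagging_freq": 2,
    "n_estimators": 410,
}

# Usage (requires lightgbm):
#   model = lightgbm.LGBMClassifier(**LGB_PARAMS)
#   model.fit(X_train, y_train)
```

Note that max_depth=3 caps trees at 8 leaves, so num_leaves=21 is effectively inactive; the depth cap is what does the regularizing on this small dataset.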

Training Protocol

Leave-One-Season-Out (LOSO): for each of the 22 seasons (2003-2025, excluding 2020), train on all other seasons and predict the held-out season. This is more conservative than k-fold because it tests temporal generalization — the model must predict a season it has never seen.

Why not k-fold? Tournament dynamics change over time (expanding field, play-in games, evolving parity). LOSO ensures we measure how well the model predicts future tournaments, not how well it memorizes the distribution of all tournaments.
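The LOSO protocol reduces to a short harness. This is a generic sketch assuming arrays X, y and a per-game season vector; the model factory is a stand-in for either model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def loso_log_loss(X, y, seasons,
                  make_model=lambda: LogisticRegression(max_iter=1000)):
    """Hold out one season at a time; return pooled log-loss and predictions."""
    preds = np.empty(len(y), dtype=float)
    for s in np.unique(seasons):
        held_out = seasons == s
        model = make_model()                       # fresh model per fold
        model.fit(X[~held_out], y[~held_out])
        preds[held_out] = model.predict_proba(X[held_out])[:, 1]
    return log_loss(y, preds), preds
```

Pooling the out-of-fold predictions before scoring (rather than averaging per-season log-losses) weights each game equally, which matches how the headline 0.53651 is reported over all 1445 games.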

6. Ensemble

The final prediction blends LightGBM and KenPom LR:

$$P_{\text{ensemble}} = 0.70 \cdot P_{\text{LGB}} + 0.30 \cdot P_{\text{LR}} \qquad \text{clipped to } [0.01, 0.99]$$
Ensemble weight optimization

The optimal weight is found by grid search over LOSO predictions. The ensemble achieves 0.53651 log-loss — better than either model alone.
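The weight search over LOSO out-of-fold predictions can be sketched as follows (the grid step and function name are illustrative):

```python
import numpy as np

def best_ensemble_weight(p_lgb, p_lr, y, step=0.05):
    """Grid-search the blend weight w to minimize log-loss on OOF predictions."""
    def ll(p):
        p = np.clip(p, 0.01, 0.99)                 # same clipping as the doc
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    weights = np.arange(0.0, 1.0 + 1e-9, step)
    losses = [ll(w * p_lgb + (1 - w) * p_lr) for w in weights]
    return weights[int(np.argmin(losses))]
```

Because the weight is fit on the same LOSO predictions used for reporting, it is mildly optimistic; a coarse grid like this keeps the selection from overfitting to noise in the folds.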

Model comparison
| Model | LOSO Log-Loss | Games |
|---|---|---|
| Ensemble | 0.53651 | 1445 |
| LightGBM (tuned) | 0.53890 | 1445 |
| KenPom LR | 0.54697 | 1445 |
| Seed-only LR | 0.56645 | 1445 |

Why does ensembling help? LightGBM captures nonlinear interactions (e.g., "seed 12 teams with high tempo beat seed 5 teams with slow pace") while logistic regression provides smooth, well-calibrated probabilities. The LR acts as a regularizer — when LightGBM overreacts to noise in rare matchup types, the LR pulls predictions back toward the calibrated AdjEM baseline.


7. Injury Adjustments

Injuries are incorporated via a Win Shares replacement model that adjusts KenPom AdjEM for missing players:

1. Load injury reports from RotoWire (status: "Out" or "Out For Season").
2. Match each injured player to Sports Reference advanced stats via fuzzy name matching.
3. Compare the player's WS/40 to a team-specific replacement level (from linear regression on AdjEM).
4. Convert the WS/40 gap to an AdjEM adjustment and apply it to the feature store.

$$\text{Replacement WS/40} = 0.0193 + 0.0011 \times \text{AdjEM}$$

$$\Delta\text{AdjEM} = -(\text{Player WS/40} - \text{Replacement WS/40}) \times 5 \times \text{discount}$$

The discount factor accounts for games already missed: if a player has already been out for most of the season, their absence is already reflected in the team's stats (discount = 0.2 if >50% missed).

The adjustment is split: AdjOE gets half the impact, AdjDE gets the other half. This is conservative — in reality, losing a scorer mostly hurts offense, but we don't have enough data to model the split precisely.
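The formulas and the discount rule reduce to a few lines. The function name and example values are ours; the constants come from the equations above:

```python
def injury_adjem_delta(player_ws40: float, team_adjem: float,
                       frac_season_missed: float) -> float:
    """AdjEM adjustment for one injured player (WS/40 replacement model)."""
    # Team-specific replacement level from the regression in the doc.
    replacement_ws40 = 0.0193 + 0.0011 * team_adjem
    # Absence already reflected in season stats -> heavy discount.
    discount = 0.2 if frac_season_missed > 0.5 else 1.0
    return -(player_ws40 - replacement_ws40) * 5 * discount

# Example: a 0.180 WS/40 starter goes down for a +20 AdjEM team.
delta = injury_adjem_delta(player_ws40=0.180, team_adjem=20.0,
                           frac_season_missed=0.1)
# Split evenly: offense loses half, defense (where lower is better) absorbs half.
adj_oe_delta, adj_de_delta = delta / 2, -delta / 2
```

The sign convention matters: a negative delta lowers AdjOE and raises AdjDE, so the team's AdjEM falls by the full amount.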

8. Bracket Simulation

The model generates championship probabilities via Monte Carlo simulation:

1. Build field: identify tournament teams from Kaggle seeds or bracketology consensus (for 2026 before Selection Sunday).
2. Assign regions: either random shuffle per simulation or fixed from bracketology consensus.
3. Build bracket: standard NCAA seeding matchups (1v16, 8v9, 5v12, 4v13, 6v11, 3v14, 7v10, 2v15) within each region.
4. Simulate: for each of 50,000 draws, play out every game using the model's pairwise probabilities. Record advancement for each team at each round.
5. Aggregate: championship probability = fraction of simulations where the team won the title. Confidence intervals via parametric bootstrap.

The simulation handles edge cases: play-in games, teams at the same seed competing for region slots, and the "bye" mechanism when fewer than 64 teams are in the field.
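The core advancement loop can be sketched on a reduced 4-team bracket slice. The pairing order and the shape of the win_prob lookup are illustrative; the real simulator also handles play-ins, byes, and region assignment:

```python
import random

def simulate_region(teams, win_prob, n_sims=50_000, seed=42):
    """Monte Carlo title rates for a small single-elimination bracket.

    `teams` is listed in bracket order (adjacent entries meet in round 1);
    `win_prob[(a, b)]` is the model's P(a beats b).
    """
    rng = random.Random(seed)
    wins = {t: 0 for t in teams}
    for _ in range(n_sims):
        field = list(teams)
        while len(field) > 1:
            nxt = []
            for a, b in zip(field[::2], field[1::2]):
                # Look up P(a beats b), falling back to the mirrored pair.
                p = win_prob.get((a, b), 1.0 - win_prob.get((b, a), 0.5))
                nxt.append(a if rng.random() < p else b)
            field = nxt
        wins[field[0]] += 1
    return {t: w / n_sims for t, w in wins.items()}
```

With 50K draws, Monte Carlo error on a championship probability p is roughly sqrt(p(1-p)/50000), under 0.3 percentage points for any team.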

Optional: Rating Perturbation

An experimental mode adds $N(0, \sigma)$ noise to each team's AdjEM per simulation, modeling the uncertainty in team ratings. This compresses the probability distribution: the favorite's probability decreases while underdogs gain. The notebook comparison model uses $\sigma = 2.5$.

9. Validation

All metrics are computed out-of-sample via LOSO cross-validation on 1445 historical tournament games across 22 seasons:

Log-Loss: 0.536
Brier Score: 0.182
AUC-ROC: 0.801
ECE (Calibration): 0.014

Expected Calibration Error (ECE) = 0.014 means the model's predicted probabilities closely match actual outcomes. When the model says 70%, the team wins about 70% of the time.
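A minimal binned ECE computation. The doc does not specify the binning, so the 10 equal-width bins here are an assumption:

```python
import numpy as np

def expected_calibration_error(probs, outcomes, n_bins=10):
    """ECE: bin-weighted gap between mean predicted prob and observed win rate."""
    probs, outcomes = np.asarray(probs), np.asarray(outcomes)
    bins = np.clip((probs * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            gap = abs(probs[mask].mean() - outcomes[mask].mean())
            ece += mask.mean() * gap   # weight by fraction of games in the bin
    return ece
```

An ECE of 0.014 therefore means the average (bin-weighted) gap between stated probability and realized frequency is about 1.4 percentage points.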

Context: a naive seed-based model achieves ~0.567 log-loss. The Kaggle competition uses Brier score (MSE) and includes both men's and women's tournaments, so top Brier scores are hard to compare directly to log-loss. Our 0.536 LOSO log-loss on genuinely unseen men's tournament games comfortably beats the seed baseline, and the model's Brier score (0.182) is in a competitive range.

Per-Round Performance

The model performs best in early rounds (where seed advantages are largest) and degrades in later rounds (where remaining teams are more evenly matched).

10. Feature Ablation

Which feature groups matter most? We remove each tier entirely and measure the log-loss impact:

Feature ablation
| Tier Removed | # Features | Log-Loss | Impact |
|---|---|---|---|
| Consensus | 8 | 0.55085 | +11.5 mLL |
| KenPom | 9 | 0.54651 | +7.1 mLL |
| Elo | 1 | 0.54389 | +4.5 mLL |
| Structural | 4 | 0.54102 | +1.6 mLL |
| Conference | 1 | 0.54044 | +1.1 mLL |
| RegSeason | 9 | 0.53986 | +0.5 mLL |
| FourFactors | 9 | 0.53983 | +0.5 mLL |
| Momentum | 4 | 0.53961 | +0.2 mLL |
| Coach | 2 | 0.53960 | +0.2 mLL |
| Player | 8 | 0.53777 | -1.6 mLL |

Key insight: Consensus rankings (+11.5 mLL) and KenPom metrics (+7.1 mLL) are by far the most important tiers. Elo alone accounts for +4.5 mLL — impressive for a single feature. Player-level features (depth, usage, star dependency) actually hurt the model (-1.6 mLL) due to sparse, noisy coverage and are excluded.

The ablation also reveals significant redundancy: KenPom's AdjEM, Massey's POM rank, and Elo are highly correlated. Removing any one is partially compensated by the others. But removing the entire consensus tier (8 features spanning Massey systems, BPI, derived divergence, and poll trajectory) is devastating because it eliminates all cross-system signal.

11. Market Comparison

We compare model championship probabilities to prediction market prices from Kalshi, Polymarket, and sportsbook sharp consensus.

The model's predictions are not blended with market data for the primary submission. Markets are used only for post-hoc comparison and identifying potential value picks. An optional market-blend submission uses 70% model + 30% market-implied pairwise probabilities.

Systematic bias pattern: if the model has a structural bias, it's toward rewarding defensive efficiency and underweighting market narratives. This could be correct on average, but if wrong, the errors will be correlated across all "value" picks (Houston, Illinois, etc. all fail together).