📖 Introduction
In this project, we explore historical NCAA data to predict Purdue’s 2024 Tournament performance.
We walk through data preprocessing, dimensionality reduction, model training, and final probability outputs.
📂 Project Structure
marchmadness-prediction/
├── data/ # CSVs for training, validation, final predictions & raw misc
├── notebooks/ # Jupyter notebooks with analysis & code
├── reports/ # Final PDF reports: evaluation & modeling
└── slides/ # Presentation decks
🔍 Data Overview
We split our data into:
- training_data.csv: used to train models
- val_data.csv: validation set for hyperparameter tuning
- final_predictions.csv: hold-out set predictions
- misc_data/: auxiliary datasets used in feature engineering
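The train/validation split described above can be sketched as follows; the column names (`seed`, `adj_em`, `won_game`) are hypothetical stand-ins for the real team metrics, not the actual schema of `training_data.csv`:

```python
# Sketch of the train/validation split; columns are illustrative stand-ins.
import pandas as pd
from sklearn.model_selection import train_test_split

teams = pd.DataFrame({
    "seed": [1, 2, 4, 8, 11, 16],
    "adj_em": [30.1, 25.4, 18.9, 10.2, 5.5, -8.3],
    "won_game": [1, 1, 1, 0, 0, 0],
})

X = teams.drop(columns="won_game")
y = teams["won_game"]

# Hold out a third of the rows for validation, mirroring the
# training_data.csv / val_data.csv split.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=1 / 3, random_state=42, stratify=y
)
print(len(X_train), len(X_val))  # 4 2
```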
📊 2D & 3D PCA Visualizations
Dimensionality reduction helps us inspect variance structure and clustering by team metrics.
2D PCA
*Figure: 2D PCA of team performance features*
3D PCA
*Figure: 3D PCA of team performance features*
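A minimal sketch of the 2D projection behind these plots; the feature matrix here is random stand-in data rather than real team statistics:

```python
# Minimal PCA sketch; X is synthetic stand-in data, not real team stats.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))  # 64 "teams", 10 performance features

# Standardize first so no single metric dominates the components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
coords = pca.fit_transform(X_scaled)

print(coords.shape)  # (64, 2)
print(pca.explained_variance_ratio_.sum())
```

For the 3D view, the only change is `n_components=3`.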
🌟 Top 10 Feature Importances (Random Forest)
We trained a Random Forest classifier and extracted the top 10 most influential features.
*Figure: Random Forest top 10 feature importances*
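Extracting the top-10 importances can be sketched like this; the dataset and feature names are synthetic placeholders, not the project's real features:

```python
# Hedged sketch of top-10 feature importances; data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=200, n_features=15, n_informative=5, random_state=0
)
names = [f"feat_{i}" for i in range(X.shape[1])]

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Sort descending by impurity-based importance, keep the ten largest.
top10 = sorted(
    zip(names, rf.feature_importances_), key=lambda t: t[1], reverse=True
)[:10]
for name, imp in top10:
    print(f"{name}: {imp:.3f}")
```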
🏀 Predicted Win Probabilities
Finally, we generate predicted probabilities for each team in the hold-out tournament bracket.
*Figure: Predicted win probability for each team*
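The per-team probabilities come from `predict_proba`; a sketch with synthetic data standing in for the hold-out bracket features:

```python
# Sketch of per-team win probabilities via predict_proba; synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=150, n_features=8, random_state=1)
X_train, X_hold = X[:120], X[120:]
y_train = y[:120]

clf = RandomForestClassifier(random_state=1).fit(X_train, y_train)

# Column 1 is P(win), the probability of the positive class.
win_probs = clf.predict_proba(X_hold)[:, 1]
print(win_probs.shape)  # (30,)
```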
📈 Performance Summary
Across both the initial and fine‑tuned parameter sets, Logistic Regression consistently led in overall classification metrics, while Random Forest was the only model to predict Purdue’s actual tournament outcome exactly.
| Model | Initial Accuracy | Tuned Accuracy |
|---|---|---|
| Logistic Regression | 0.5641 | 0.6667 |
| SVM | 0.4615 | 0.5385 |
| KNN | 0.4300 | 0.4839 |
| Random Forest | 0.4146 | 0.4390 |
*(Reported scores are weighted averages across accuracy, precision, recall, and F1.)*
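The weighted-average metrics in the table can be computed with scikit-learn's `precision_recall_fscore_support`; the labels below are toy values, not the project's actual predictions:

```python
# Sketch of weighted-average metrics on toy multi-class predictions.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

acc = accuracy_score(y_true, y_pred)
# average="weighted" weights each class's metric by its support.
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(round(acc, 4), round(f1, 4))
```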
⚖️ Predicted vs. Actual Purdue Performance
| Model | Predicted Round | Implied # Wins | Actual Round | Actual # Wins |
|---|---|---|---|---|
| Logistic Regression | Final Four (Semis) | 4 | Championship (Final) | 5 |
| SVM | Final Four (Semis) | 4 | Championship (Final) | 5 |
| KNN | Final Four (Semis) | 4 | Championship (Final) | 5 |
| Random Forest | Championship Runner‑up | 5 | Championship (Final) | 5 |
- Logistic, SVM, and KNN each fell one round short of Purdue’s actual runner‑up finish (Semis vs. Final)
- Random Forest uniquely matched the true outcome, correctly predicting Purdue’s path to the championship game
🎯 Key Takeaways
- Business Success: All models met the “within 1‑round error” criterion, satisfying our primary objective.
- Metric Trade‑Offs: Despite Random Forest’s exact bracket prediction for Purdue, its overall accuracy (0.439) lagged behind simpler models, underscoring that the highest aggregate score doesn’t always align with specific business goals.
- Feature Insights: Across models, seed and efficiency differentials (AdjEM, AdjDE) consistently ranked among the top predictors, reaffirming their domain importance.
🔮 Next Steps
- Ensemble Stacking: Blend Logistic Regression, SVM, and RF via a meta‑learner to leverage both high‑accuracy and bracket‑precision signals.
- Probability Calibration: Apply Platt scaling or isotonic regression to improve alignment between predicted probabilities and observed frequencies.
- Game‑by‑Game Modeling: Explore a sequential model that predicts each matchup individually—potentially boosting overall tournament‑level accuracy.
- Live Updating: Integrate in‑season and in‑tournament feeds (injuries, momentum metrics) to dynamically adjust win probabilities round‑by‑round.
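The ensemble-stacking idea in the first bullet can be sketched with scikit-learn's `StackingClassifier`; the data and hyperparameters below are illustrative assumptions, not the project's tuned settings:

```python
# Hedged sketch of stacking LR, SVM, and RF under a logistic meta-learner.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("svm", SVC(probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    # Meta-learner blends the base models' cross-validated predictions.
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X, y)
print(round(stack.score(X, y), 3))
```

Probability calibration (the second bullet) could then wrap the stack in `CalibratedClassifierCV`.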
🔗 GitHub
You can explore the source code here:
👉 View on GitHub