Saul's Portfolio

📖 Introduction

In this project, we explore historical NCAA data to predict Purdue’s 2024 Tournament performance.
We walk through data preprocessing, dimensionality reduction, model training, and final probability outputs.

📂 Project Structure

marchmadness-prediction/
├── data/           # CSVs for training, validation, final predictions & raw misc
├── notebooks/      # Jupyter notebooks with analysis & code
├── reports/        # Final PDF reports: evaluation & modeling
└── slides/         # Presentation decks

🔍 Data Overview

We split our data into:

training_data.csv: used to train models
val_data.csv: validation set for hyperparameter tuning
final_predictions.csv: hold-out set predictions
misc_data/: auxiliary datasets used in feature engineering

📊 2D & 3D PCA Visualizations

Dimensionality reduction helps us inspect variance structure and clustering by team metrics.
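As a rough illustration of the projection step, here is a minimal PCA sketch using NumPy's SVD. The `pca_project` helper, the standardization step, and the synthetic `team_metrics` matrix are assumptions made for this example; they are not taken from the project notebooks, which may use a library implementation instead.

```python
import numpy as np


def pca_project(features: np.ndarray, n_components: int) -> tuple[np.ndarray, np.ndarray]:
    """Project rows of `features` onto the top principal components.

    Returns (projected points, explained-variance ratio per kept component).
    """
    # Standardize each column so large-scale stats do not dominate (an assumption).
    centered = features - features.mean(axis=0)
    std = centered.std(axis=0)
    std[std == 0] = 1.0  # constant columns stay at zero instead of dividing by 0
    scaled = centered / std

    # SVD of the standardized matrix: rows of vt are the principal axes.
    _, singular_values, vt = np.linalg.svd(scaled, full_matrices=False)
    projected = scaled @ vt[:n_components].T
    explained = singular_values**2 / np.sum(singular_values**2)
    return projected, explained[:n_components]


# Synthetic stand-in for a (teams x metrics) matrix; not real NCAA data.
rng = np.random.default_rng(0)
team_metrics = rng.normal(size=(64, 6))

points_2d, ratio_2d = pca_project(team_metrics, 2)
points_3d, ratio_3d = pca_project(team_metrics, 3)
print("variance explained (2D):", ratio_2d.sum())
print("variance explained (3D):", ratio_3d.sum())
```

The 2D and 3D views share their first two axes, so the 3D plot only adds the third component on top of the 2D one.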

2D PCA

[Figure: 2D PCA of team performance features]

3D PCA

[Figure: 3D PCA of team performance features]

🌟 Top 10 Feature Importances (Random Forest)

We trained a Random Forest classifier and extracted the top 10 most influential features.

[Figure: Random Forest top 10 feature importances]
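One way to extract such a ranking is sketched below with scikit-learn's `RandomForestClassifier` and its `feature_importances_` attribute. The data is synthetic, the hyperparameters are arbitrary, and the `noise_a` / `noise_b` column names are hypothetical; only seed, AdjEM, and AdjDE are named in this write-up.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature list; the project's real feature set is larger.
feature_names = ["seed", "AdjEM", "AdjDE", "noise_a", "noise_b"]

# Synthetic data: the label depends only on the first two columns.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, len(feature_names)))
y = (X[:, 1] - 0.5 * X[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# feature_importances_ holds impurity-based importances that sum to 1.
order = np.argsort(model.feature_importances_)[::-1]
top_k = 10  # slicing is safe even when fewer than 10 features exist
ranked = [(feature_names[i], float(model.feature_importances_[i])) for i in order[:top_k]]
for name, score in ranked:
    print(f"{name:10s} {score:.3f}")
```

Impurity-based importances can favor high-cardinality features, so a permutation-based check on the validation set may be a useful complement.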

🏀 Predicted Win Probabilities

Finally, we generate predicted probabilities for each team in the hold-out tournament bracket.

[Figure: Predicted win probability for each team]

📈 Performance Summary

Under both the initial and tuned hyperparameters, Logistic Regression led on overall classification metrics, while Random Forest was the only model to predict Purdue’s actual tournament outcome exactly.

Model	Initial Accuracy	Tuned Accuracy
Logistic Regression	0.5641	0.6667
SVM	0.4615	0.5385
KNN	0.4300	0.4839
Random Forest	0.4146	0.4390

(Weighted macro‑average across accuracy, precision, recall, F1)
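For reference, the four metrics named above can be computed from scratch as in the sketch below. It reports plain accuracy alongside macro-averaged precision, recall, and F1; how the project weights and combines them into a single score is not specified here, so that step is left out. The `classification_summary` helper and the toy labels are illustrative assumptions.

```python
from collections import Counter


def classification_summary(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def safe_div(num, den):
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        p = safe_div(tp[label], tp[label] + fp[label])
        r = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(p)
        recalls.append(r)
        f1s.append(safe_div(2 * p * r, p + r))

    n = len(labels)
    return {
        "accuracy": safe_div(sum(tp.values()), len(y_true)),
        "precision": safe_div(sum(precisions), n),  # macro average: every class counts equally
        "recall": safe_div(sum(recalls), n),
        "f1": safe_div(sum(f1s), n),
    }


# Toy labels (tournament wins per team); illustrative only, not project data.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
summary = classification_summary(y_true, y_pred)
print(summary)
```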

⚖️ Predicted vs. Actual Purdue Performance

Model	Predicted Round	Implied # Wins	Actual Round	Actual # Wins
Logistic Regression	Final Four (Semis)	4	Championship (Final)	5
SVM	Final Four (Semis)	4	Championship (Final)	5
KNN	Final Four (Semis)	4	Championship (Final)	5
Random Forest	Championship Runner‑up	5	Championship (Final)	5

Logistic Regression, SVM, and KNN each fell one round short of Purdue’s actual runner‑up finish (Semis vs. Final).
Random Forest uniquely matched the true outcome, correctly predicting Purdue’s path to the championship game.
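The round-to-wins arithmetic behind the table can be sketched as follows. The predicted and actual rounds come from the table above; the labels for earlier rounds, the `ALIASES` mapping, and the helper names are assumptions for illustration, and play-in games are ignored.

```python
# Rounds in the order a team can exit a 64-team bracket; list index = games won.
EXIT_ROUNDS = [
    "Round of 64",           # 0 wins
    "Round of 32",           # 1 win
    "Sweet 16",              # 2 wins
    "Elite Eight",           # 3 wins
    "Final Four (Semis)",    # 4 wins
    "Championship (Final)",  # 5 wins (lost the title game)
    "Champion",              # 6 wins
]

# The table uses two labels for the same finish.
ALIASES = {"Championship Runner-up": "Championship (Final)"}


def implied_wins(round_name: str) -> int:
    """Number of wins implied by exiting in the given round."""
    return EXIT_ROUNDS.index(ALIASES.get(round_name, round_name))


def round_error(predicted: str, actual: str) -> int:
    """Absolute difference in rounds between prediction and outcome."""
    return abs(implied_wins(predicted) - implied_wins(actual))


predictions = {
    "Logistic Regression": "Final Four (Semis)",
    "SVM": "Final Four (Semis)",
    "KNN": "Final Four (Semis)",
    "Random Forest": "Championship Runner-up",
}
actual = "Championship (Final)"

errors = {model: round_error(pred, actual) for model, pred in predictions.items()}
within_one_round = {model: err <= 1 for model, err in errors.items()}
print(errors)
print(within_one_round)
```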

🎯 Key Takeaways

Business Success:
All models met the “within 1‑round error” criterion, satisfying our primary objective.
Metric Trade‑Offs:
Although Random Forest predicted Purdue’s finish exactly, its overall accuracy (0.439) lagged behind the simpler models, underscoring that the highest aggregate score does not always align with a specific business goal.
Feature Insights:
Across models, seed and efficiency differentials (AdjEM, AdjDE) consistently ranked among the top predictors, reaffirming their domain importance.

🔮 Next Steps

Ensemble Stacking: Blend Logistic Regression, SVM, and RF via a meta‑learner to leverage both high‑accuracy and bracket‑precision signals.
Probability Calibration: Apply Platt scaling or isotonic regression to improve alignment between predicted probabilities and observed frequencies.
Game‑by‑Game Modeling: Explore a sequential model that predicts each matchup individually—potentially boosting overall tournament‑level accuracy.
Live Updating: Integrate in‑season and in‑tournament feeds (injuries, momentum metrics) to dynamically adjust win probabilities round‑by‑round.
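To make the calibration idea concrete, here is a simplified Platt-scaling sketch: it fits p = sigmoid(a * score + b) on held-out scores by plain gradient descent on log-loss. It omits the target smoothing and Newton-style optimizer of Platt's original method, and the scores, labels, learning rate, and epoch count are all illustrative assumptions rather than project values.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_platt(scores, labels, lr=0.1, epochs=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on mean log-loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log-loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


# Toy held-out scores (e.g. raw model margins) and outcomes; not project data.
scores = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 0, 1, 0, 1, 1]

a, b = fit_platt(scores, labels)
calibrated = [sigmoid(a * s + b) for s in scores]
print(f"a={a:.3f}, b={b:.3f}")
```

The parameters should be fitted on data the base model did not train on (for example, the validation split), otherwise the calibrated probabilities inherit the model's overconfidence.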

🔗 GitHub

You can explore the source code here:
👉 View on GitHub
