🏀

NBA Prediction

End-to-end ML pipeline predicting NBA game winners with 62.1% accuracy (2024 season) using gradient boosting and real-time data integration

62.1%
Model Accuracy
7.5 ppts
Above Baseline
2400+
Games Predicted

(2024-2025 NBA Season and Playoffs)


Machine Learning XGBoost Data Pipeline Streamlit Sports Analytics
NBA Prediction Dashboard
⚠️

The Challenge

Build an end-to-end machine learning pipeline to predict the winners of NBA games:

  • - Train a model to predict game winners
  • - Retrieve new data daily
  • - Perform feature engineering on the data
  • - Predict the probability of the home team winning
  • - Display the results online
💡

My Solution

Constructed a modular system that uses:

  • - XGBoost, Optuna, and Neptune.ai for model training and experiment tracking
  • - GitHub Actions, Selenium, and ScrapingAnt to scrape new games daily from NBA.com
  • - Pandas to engineer features and store them
  • - XGBoost and Sklearn to predict and calibrate winning probability
  • - Streamlit to deploy online

Technical Highlights

Key innovations and technical achievements in this project

🤖

Advanced Modeling

XGBoost and LightGBM with Optuna hyperparameter tuning for optimal performance, plus Scikit-learn for probability calibration
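A minimal sketch of the calibration step on synthetic data, using Scikit-learn's CalibratedClassifierCV. GradientBoostingClassifier is used here as a dependency-light stand-in for XGBoost; the real project calibrates its tuned XGBoost model.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier  # stand-in for XGBoost

# Synthetic binary-outcome data standing in for game features / home-win labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Isotonic calibration maps the raw booster scores onto honest probabilities,
# so a predicted 0.70 means the home team should win ~70% of such games.
base = GradientBoostingClassifier(random_state=0)
calibrated = CalibratedClassifierCV(base, method="isotonic", cv=3)
calibrated.fit(X, y)
probs = calibrated.predict_proba(X)[:, 1]
```

Calibration matters here because a win-probability dashboard is only useful if the displayed percentages are trustworthy, not just rank-ordered correctly.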

🔧

Feature Engineering

Created engineered features such as win streaks, losing streaks, home/away streaks, rolling averages over multiple time windows, and head-to-head stats
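Two of these features can be sketched in pandas on a made-up game log; the real feature set and column names will differ. Note the `shift(1)` before each rolling window, which ensures a row only sees games played *before* it (no leakage of the current game's result).

```python
import pandas as pd

# Hypothetical per-team game log, ordered by date.
log = pd.DataFrame({
    "team": ["BOS"] * 6,
    "won":  [1, 1, 0, 1, 1, 1],
    "pts":  [112, 120, 98, 115, 121, 109],
})

# Rolling scoring average over the previous 3 games.
log["pts_roll3"] = (
    log.groupby("team")["pts"].transform(lambda s: s.shift(1).rolling(3).mean())
)

def win_streak(won: pd.Series) -> pd.Series:
    """Consecutive wins entering each game (resets to 0 after any loss)."""
    prior = won.shift(1, fill_value=0)
    groups = (prior == 0).cumsum()  # new group starts after every loss
    return prior.groupby(groups).cumsum()

log["win_streak"] = log.groupby("team")["won"].transform(win_streak)
```

For the made-up log above, the streak entering each game is 0, 1, 2, then resets to 0 after the loss and climbs back to 2.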

📊

Data Pipeline

Automated ETL pipeline with error handling, data validation, and daily NBA.com scraping
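The error handling and validation can be illustrated with a small retry-plus-schema-check sketch. The function names, required columns, and the simulated flaky source below are all hypothetical, not the pipeline's actual code.

```python
import time

def fetch_with_retries(fetch, retries=3, backoff=0.0):
    """Retry a flaky fetch with simple backoff; re-raise after the last attempt."""
    for attempt in range(retries):
        try:
            return fetch()
        except ConnectionError:
            if attempt == retries - 1:
                raise
            time.sleep(backoff * (attempt + 1))

REQUIRED_COLUMNS = {"home_team", "away_team", "home_pts", "away_pts"}

def validate_games(rows):
    """Reject a scrape whose rows are missing expected fields."""
    for row in rows:
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            raise ValueError(f"scraped row missing columns: {sorted(missing)}")
    return rows

# Simulated flaky source: fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient scrape failure")
    return [{"home_team": "BOS", "away_team": "LAL", "home_pts": 110, "away_pts": 104}]

games = validate_games(fetch_with_retries(flaky))
```

Validating immediately after the scrape means a broken NBA.com page layout fails loudly in the daily job rather than silently corrupting the feature store.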

🎯

Model Validation

Cross-validation and test-set evaluation using stratified and time-series K-folds, along with SHAP for interpretability and comparison of feature importances between train and validation sets
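A minimal example of the time-series fold, using Scikit-learn's TimeSeriesSplit so each fold trains on earlier games and validates only on later ones, matching how the model is actually used in season. LogisticRegression and the synthetic data stand in for the real model and features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression  # stand-in model
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic chronologically ordered data standing in for game features/labels.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(scale=0.7, size=300) > 0).astype(int)

# Each successive fold extends the training window forward in time,
# so validation games always come after the games trained on.
tscv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(LogisticRegression(), X, y, cv=tscv, scoring="accuracy")
```

A plain shuffled K-fold would let the model "see the future" (train on April games, validate on November games), inflating accuracy relative to real deployment.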

📈

Experiment Tracking

Neptune.ai integration for comprehensive experiment logging and model comparison

🌐

Production Ready

GitHub Actions CI/CD, Streamlit deployment with monitoring dashboard

Technology Stack

Tools and frameworks used throughout the project lifecycle

🐍 Core ML/Data

Python Pandas NumPy

🚀 ML Models and Experimentation

XGBoost LightGBM Scikit-learn SHAP Optuna Neptune.ai

📊 Data & Scraping

BeautifulSoup Selenium ScrapingAnt

⚙️ MLOps and Deployment

Streamlit GitHub Actions

Results & Impact

Season Results (2024-2025 NBA Season and Playoffs)

62.1%
Test Accuracy
54.6%
Baseline (Home Wins)
NBA Model Accuracy
NBA Recent Games

(As you can see, the model struggles in the playoffs. The 7-day rolling averages swing more during the playoffs because teams play each other multiple times in a short span. This is an area of improvement for future iterations of the model.)

Lessons and Future Plans

This was my first big ML project, and I learned a lot. So much, in fact, that I am in the process of completely redoing the project from scratch. I plan to write more about this at some point, but there are a couple of big takeaways.

Key Issues

Accuracy

My model achieved an accuracy of 62.1% for the 2024 season. The baseline of "home team always wins" had an accuracy of 54.6%, but better models often achieve closer to 65%.

Last season, at least one expert at nflpickwatch.com achieved an accuracy of around 69.3%, and close to 100 experts topped 65%.

Reliability and Extensibility

Third-party services like Hopsworks, Neptune, and Streamlit would at times simply stop working. Sometimes they came back on their own, and sometimes I had to create a workaround. I abandoned Hopsworks altogether, and for Streamlit, I had to create a "light" dashboard-only repo.

Solutions

More Data, Feature Engineering, and Experimentation

Feature engineering is the key to improving any ML model's performance. NBA.com offers more advanced statistics that will be scraped to support feature engineering, and a better evaluation framework will make it faster to experiment and figure out which features work best.

Modular, OOP Architecture

A finer-grained OOP approach with abstract interfaces and dependency injection makes it easier to swap out problematic components and to add new capabilities. Alternatives and fallbacks can be more easily incorporated into the pipeline.
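A sketch of what that could look like: an abstract source interface with injected primary and fallback implementations. All class names here are hypothetical, not the redesign's actual API.

```python
from abc import ABC, abstractmethod


class GameSource(ABC):
    """Abstract interface so data sources can be swapped without touching the pipeline."""
    @abstractmethod
    def fetch(self) -> list[dict]: ...


class PrimarySource(GameSource):
    """Stands in for a third-party service that has gone down."""
    def fetch(self) -> list[dict]:
        raise ConnectionError("primary source down")


class FallbackSource(GameSource):
    """Alternate implementation the pipeline can fall back to."""
    def fetch(self) -> list[dict]:
        return [{"home_team": "BOS", "away_team": "LAL"}]


class Pipeline:
    def __init__(self, sources: list[GameSource]):
        # Dependency injection: the pipeline receives its sources rather
        # than constructing them, so swapping or reordering is one-line.
        self.sources = sources

    def run(self) -> list[dict]:
        for source in self.sources:
            try:
                return source.fetch()
            except ConnectionError:
                continue
        raise RuntimeError("all sources failed")


games = Pipeline([PrimarySource(), FallbackSource()]).run()
```

With this shape, dropping a service like Hopsworks means deleting one class and its injection site, rather than untangling calls scattered through the pipeline.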

GitHub View Repository