End-to-end ML pipeline predicting NBA game winners with 62.1% accuracy (2024 season) using gradient boosting and real-time data integration
(2024-2025 NBA Season and Playoffs)
Built an end-to-end machine learning pipeline to predict the winners of NBA games, constructed as a modular system.
Key innovations and technical achievements in this project
XGBoost and LightGBM with Optuna hyperparameter tuning for optimal performance, plus scikit-learn probability calibration
Engineered features such as win streaks, losing streaks, home/away streaks, rolling averages over multiple window lengths, and head-to-head stats
Automated ETL pipeline with error handling, data validation, and daily NBA.com scraping
Cross-validation and test-set evaluation using stratified and time-series K-folds, with SHAP for interpretability and comparison of feature importances between train and validation sets
Neptune.ai integration for comprehensive experiment logging and model comparison
GitHub Actions CI/CD, Streamlit deployment with monitoring dashboard
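To make the feature-engineering bullet concrete, here is a minimal pandas sketch of leakage-safe rolling and streak features. The column names (`team`, `game_date`, `won`, `points`) are hypothetical, chosen for illustration; they are not the project's actual schema.

```python
import pandas as pd

def add_rolling_features(games: pd.DataFrame, window: int = 7) -> pd.DataFrame:
    """Add leakage-safe rolling-average and win-streak features per team.

    Expects one row per (team, game_date) with a binary `won` column and a
    `points` column -- hypothetical names used for illustration only.
    """
    games = games.sort_values(["team", "game_date"]).copy()
    grp = games.groupby("team")

    # Rolling points average over the last `window` games; shift(1) excludes
    # the current game so each row only sees pre-tipoff information.
    games["points_rolling"] = grp["points"].transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )

    def streak(won: pd.Series) -> pd.Series:
        # Wins entering the game: count consecutive 1s, resetting after a loss.
        prior = won.shift(1, fill_value=0)
        blocks = (prior == 0).cumsum()  # a new block starts at each non-win
        return prior.groupby(blocks).cumsum()

    games["win_streak"] = grp["won"].transform(streak)
    return games
```

The `shift(1)`-before-`rolling` pattern is the important detail: it keeps the current game's result out of its own features, which is what prevents target leakage in a game-by-game prediction setting.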
Tools and frameworks used throughout the project lifecycle
Season Results (2024-2025 NBA Season and Playoffs)
(As you can see, the model struggles in the playoffs. The 7-day rolling averages swing more during the playoffs, as teams play each other multiple times in a short span. This is an area of improvement for future iterations of the model.)
This was my first big ML project and I learned a lot. So much, in fact, that I am in the process of completely redoing the project from scratch. I plan to write more about this at some point, but there are a couple of big takeaways.
My model achieved an accuracy of 62.1% for the 2024 season. The baseline of "home team always wins" had an accuracy of 54.6%, while better models often achieve closer to 65%.
Last season, at least one expert at nflpickwatch.com achieved an accuracy of around 69.3%, and close to 100 experts topped 65%.
Third-party services like Hopsworks, Neptune, and Streamlit would sometimes simply stop working. Sometimes they recovered on their own; other times I had to create a workaround. I abandoned Hopsworks altogether, and for Streamlit I had to create a "light" dashboard-only repo.
Feature engineering is the key to improving any ML model's performance. More advanced statistics available on NBA.com will be scraped to aid feature engineering, and a better evaluation framework will enable faster experimentation to determine which features work best.
A finer-grained OOP approach with abstract interfaces and dependency injection makes it easier to swap out problematic components and add new capabilities; alternates and fallbacks can be incorporated into the pipeline more easily.
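The "better evaluation framework" for time-ordered game data can be sketched with scikit-learn's TimeSeriesSplit, which trains each fold on past games and validates on later ones. This is an illustrative sketch, not the project's actual harness; `LogisticRegression` stands in for any trainer.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit

def time_series_cv_accuracy(X: np.ndarray, y: np.ndarray, n_splits: int = 5) -> float:
    """Chronological CV: every fold trains on past games and validates on
    later ones, so the score mimics the real prediction setting (no future
    games leak into training)."""
    scores = []
    for train_idx, val_idx in TimeSeriesSplit(n_splits=n_splits).split(X):
        model = LogisticRegression(max_iter=1000)  # stand-in for any trainer
        model.fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))
    return float(np.mean(scores))
```

Unlike stratified K-folds, this split never evaluates on games that precede the training data, which matters when features are rolling statistics over recent games.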
I am currently working on a completely redesigned version of this project with improved architecture, better feature engineering, and enhanced reliability.
./
├── configs/
│   ├── core/
│   │   ├── hyperparameters/
│   │   │   ├── catboost/
│   │   │   │   ├── baseline.json
│   │   │   │   └── current_best.json
│   │   │   ├── lightgbm/
│   │   │   │   ├── baseline.json
│   │   │   │   └── current_best.json
│   │   │   ├── pytorch/
│   │   │   │   └── baseline.json
│   │   │   ├── sklearn_histgradientboosting/
│   │   │   │   ├── baseline.json
│   │   │   │   └── current_best.json
│   │   │   ├── sklearn_logisticregression/
│   │   │   │   ├── baseline.json
│   │   │   │   └── current_best.json
│   │   │   ├── sklearn_randomforest/
│   │   │   │   ├── baseline.json
│   │   │   │   └── current_best.json
│   │   │   └── xgboost/
│   │   │       ├── baseline.json
│   │   │       └── current_best.json
│   │   ├── models/
│   │   │   ├── catboost_config.yaml
│   │   │   ├── lightgbm_config.yaml
│   │   │   ├── pytorch_config.yaml
│   │   │   ├── sklearn_histgradientboosting_config.yaml
│   │   │   ├── sklearn_logisticregression_config.yaml
│   │   │   ├── sklearn_randomforest_config.yaml
│   │   │   └── xgboost_config.yaml
│   │   ├── app_logging_config.yaml
│   │   ├── model_testing_config.yaml
│   │   ├── optuna_config.yaml
│   │   ├── preprocessing_config.yaml
│   │   └── visualization_config.yaml
│   └── nba/
│       ├── app_config.yaml
│       ├── data_access_config.yaml
│       ├── data_processing_config.yaml
│       ├── feature_engineering_config.yaml
│       └── webscraping_config.yaml
├── data/
│   ├── cumulative_scraped/
│   │   ├── games_advanced.csv
│   │   ├── games_four-factors.csv
│   │   ├── games_misc.csv
│   │   ├── games_scoring.csv
│   │   └── games_traditional.csv
│   ├── engineered/
│   │   └── engineered_features.csv
│   ├── newly_scraped/
│   │   ├── games_advanced.csv
│   │   ├── games_four-factors.csv
│   │   ├── games_misc.csv
│   │   ├── games_scoring.csv
│   │   ├── games_traditional.csv
│   │   ├── todays_games_ids.csv
│   │   └── todays_matchups.csv
│   ├── predictions/
│   │   ├── CatBoost_val_predictions.csv
│   │   ├── LGBM_oof_predictions.csv
│   │   ├── LGBM_val_predictions.csv
│   │   ├── lightgbm_val_predictions.csv
│   │   ├── SKLearn_HistGradientBoosting_val_predictions.csv
│   │   ├── SKLearn_RandomForest_val_predictions.csv
│   │   ├── XGBoost_oof_predictions.csv
│   │   ├── XGBoost_val_predictions.csv
│   │   └── xgboost_val_predictions.csv
│   ├── processed/
│   │   ├── column_mapping.json
│   │   ├── games_boxscores.csv
│   │   └── teams_boxscores.csv
│   ├── test_data/
│   │   ├── games_advanced.csv
│   │   ├── games_four-factors.csv
│   │   ├── games_misc.csv
│   │   ├── games_scoring.csv
│   │   └── games_traditional.csv
│   └── training/
│       ├── training_data.csv
│       └── validation_data.csv
├── docs/
│   ├── AI/
│   │   ├── config_reference.txt
│   │   ├── config_tree.txt
│   │   ├── directory_tree.txt
│   │   └── interfaces.md
│   ├── commentary/
│   │   └── Webscraping.md
│   ├── data/
│   │   ├── column_mapping.json
│   │   └── nba-boxscore-data-dictionary.md
│   ├── src/
│   │   └── readme.md
│   └── readme.md
├── hyperparameter_history/
│   ├── LGBM_history.json
│   └── XGBoost_history.json
├── notebooks/
│   ├── baseline.ipynb
│   └── eda.ipynb
├── src/
│   ├── ml_framework/
│   │   ├── core/
│   │   │   ├── app_file_handling/
│   │   │   │   ├── app_file_handler.py
│   │   │   │   └── base_app_file_handler.py
│   │   │   ├── app_logging/
│   │   │   │   ├── app_logger.py
│   │   │   │   └── base_app_logger.py
│   │   │   ├── config_management/
│   │   │   │   ├── base_config_manager.py
│   │   │   │   ├── config_manager.py
│   │   │   │   └── config_path.yaml
│   │   │   ├── error_handling/
│   │   │   │   ├── base_error_handler.py
│   │   │   │   ├── error_handler.py
│   │   │   │   └── error_handler_factory.py
│   │   │   └── common_di_container.py
│   │   ├── framework/
│   │   │   ├── data_access/
│   │   │   │   ├── base_data_access.py
│   │   │   │   └── csv_data_access.py
│   │   │   ├── data_classes/
│   │   │   │   ├── metrics.py
│   │   │   │   ├── preprocessing.py
│   │   │   │   └── training.py
│   │   │   └── base_data_validator.py
│   │   ├── model_testing/
│   │   │   ├── experiment_loggers/
│   │   │   │   ├── base_experiment_logger.py
│   │   │   │   ├── experiment_logger_factory.py
│   │   │   │   └── mlflow_logger.py
│   │   │   ├── hyperparams_managers/
│   │   │   │   ├── base_hyperparams_manager.py
│   │   │   │   └── hyperparams_manager.py
│   │   │   ├── hyperparams_optimizers/
│   │   │   │   ├── base_hyperparams_optimizer.py
│   │   │   │   ├── hyperparams_optimizer_factory.py
│   │   │   │   └── optuna_optimizer.py
│   │   │   ├── trainers/
│   │   │   │   ├── base_trainer.py
│   │   │   │   ├── catboost_trainer.py
│   │   │   │   ├── lightgbm_trainer.py
│   │   │   │   ├── pytorch_trainer.py
│   │   │   │   ├── sklearn_trainer.py
│   │   │   │   ├── trainer_factory.py
│   │   │   │   ├── trainer_utils.py
│   │   │   │   └── xgboost_trainer.py
│   │   │   ├── base_model_testing.py
│   │   │   ├── di_container.py
│   │   │   ├── main.py
│   │   │   └── model_tester.py
│   │   ├── preprocessing/
│   │   │   ├── base_preprocessor.py
│   │   │   └── preprocessor.py
│   │   ├── uncertainty/
│   │   │   └── uncertainty_calibrator.py
│   │   └── visualization/
│   │       ├── charts/
│   │       │   ├── base_chart.py
│   │       │   ├── chart_factory.py
│   │       │   ├── chart_types.py
│   │       │   ├── chart_utils.py
│   │       │   ├── feature_charts.py
│   │       │   ├── learning_curve_charts.py
│   │       │   ├── metrics_charts.py
│   │       │   ├── model_interpretation_charts.py
│   │       │   └── shap_charts.py
│   │       ├── exploratory/
│   │       │   ├── base_explorer.py
│   │       │   ├── correlation_explorer.py
│   │       │   ├── distribution_explorer.py
│   │       │   ├── team_performance_explorer.py
│   │       │   └── time_series_explorer.py
│   │       └── orchestration/
│   │           ├── base_chart_orchestrator.py
│   │           └── chart_orchestrator.py
│   └── nba_app/
│       ├── data_processing/
│       │   ├── base_data_processing_classes.py
│       │   ├── di_container.py
│       │   ├── main.py
│       │   └── process_scraped_NBA_data.py
│       ├── feature_engineering/
│       │   ├── base_feature_engineering.py
│       │   ├── di_container.py
│       │   ├── feature_engineer.py
│       │   ├── feature_selector.py
│       │   └── main.py
│       ├── webscraping/
│       │   ├── old/
│       │   │   ├── test_websraping.py
│       │   │   └── webscraping_old.py
│       │   ├── base_scraper_classes.py
│       │   ├── boxscore_scraper.py
│       │   ├── di_container.py
│       │   ├── main.py
│       │   ├── matchup_validator.py
│       │   ├── nba_scraper.py
│       │   ├── page_scraper.py
│       │   ├── readme.md
│       │   ├── schedule_scraper.py
│       │   ├── test.ipynb
│       │   ├── test_page_scraper.py
│       │   ├── utils.py
│       │   └── web_driver.py
│       └── data_validator.py
├── tests/
│   ├── ml_framework/
│   └── nba_app/
│       └── webscraping/
│           ├── test_boxscore_scraper.py
│           ├── test_integration.py
│           ├── test_nba_scraper.py
│           ├── test_page_scraper.py
│           └── test_schedule_scraper.py
├── directory_tree.txt
├── pyproject.toml
├── README.md
└── uv.lock
src/ml_framework vs src/nba_app enforces a strict one-way dependency (nba_app → ml_framework), keeping the platform generic and the domain isolated
src/ package layout with __init__.py enables clean imports and scalable growth across modules
src/ml_framework/core/config_management/, app_logging/, and error_handling/ each own one concern; nba_app partitions data_processing, feature_engineering, and webscraping
Trainers (src/ml_framework/model_testing/trainers/), experiment loggers (experiment_loggers/), and data access (framework/data_access/) are extended via factories without modifying existing code
Abstract base classes (BaseTrainer, BaseExperimentLogger, BaseDataAccess, BasePreprocessor) guarantee drop-in interchangeability of implementations
Small, focused interfaces (BaseTrainer, BaseModelTester, BaseHyperparamsOptimizer, BaseChart) prevent "fat" dependencies
di_container.py files in nba_app and ml_framework wire concrete implementations to abstractions
BaseExperimentLogger + experiment_logger_factory with mlflow_logger.py; easily swap to W&B or a custom tracker by adding a new implementation
trainer_factory orchestrates catboost_trainer.py, lightgbm_trainer.py, xgboost_trainer.py, sklearn_trainer.py, and pytorch_trainer.py, all behind BaseTrainer
BaseHyperparamsOptimizer with optuna_optimizer.py; the optimizer factory allows alternative HPO backends without touching call sites
BaseDataAccess with csv_data_access.py; swap in a DB-backed implementation transparently
BaseChart and chart_factory support consistent chart creation; new charts plug in without refactoring call sites
BaseWebDriver, BasePageScraper, and BaseNbaScraper in nba_app/webscraping allow switching drivers, parsers, or sources via config
configs/core and configs/nba separate platform defaults from NBA-specific settings
configs/core/hyperparameters tracks current_best.json and baseline.json per model
config_tree.txt documents the effective configuration namespace for full transparency and reproducibility
uv.lock pins dependencies; pyproject.toml consolidates build/runtime metadata
.gitignore/.gitattributes maintain clean VCS state; data/ directories are staged but not versioned
tests/ mirrors the package structure, with unit and integration tests for webscraping and room to expand for the platform core; modules under src/ enable targeted unit tests that were not possible with notebooks
src/ml_framework/core/app_logging provides rolling files and console output, fully configured from configs/core/app_logging_config.yaml
src/ml_framework/core/error_handling centralizes errors with factory-based resolution for consistent behavior
src/ml_framework/model_testing supports TimeSeriesSplit, OOF, validation-set testing, and SHAP integration, configured in configs/core/model_testing_config.yaml
hyperparameter_history/ preserves tuning runs for auditability and reproducibility
notebooks/ quarantines exploratory work; production code resides in src/
data/ is organized by lifecycle (newly_scraped, cumulative_scraped, processed, engineered, training, predictions), making pipelines explicit and debuggable
src/ml_framework/visualization provides base interfaces, orchestration, and specialized chart sets (feature importance, learning curves, metrics, SHAP), all toggled via configs/core/visualization_config.yaml
The design allows a new domain app (e.g., mlb_app) to reuse ml_framework unchanged
ABCs + factories + config-first wiring
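The "ABCs + factories" pattern described above can be sketched in a few lines. This is an illustrative toy, not the project's actual `base_trainer.py`/`trainer_factory.py` code; the class and method names are simplified stand-ins.

```python
from abc import ABC, abstractmethod

class BaseTrainer(ABC):
    """Illustrative stand-in for the project's BaseTrainer interface."""

    @abstractmethod
    def train(self, X, y) -> str: ...

class TrainerFactory:
    """Registry-based factory: new trainers plug in without touching call sites."""

    _registry: dict[str, type] = {}

    @classmethod
    def register(cls, name: str):
        def wrap(trainer_cls: type) -> type:
            cls._registry[name] = trainer_cls
            return trainer_cls
        return wrap

    @classmethod
    def create(cls, name: str) -> "BaseTrainer":
        return cls._registry[name]()

@TrainerFactory.register("xgboost")
class XGBoostTrainer(BaseTrainer):
    def train(self, X, y) -> str:
        # A real implementation would fit an xgboost model here.
        return f"trained xgboost on {len(X)} rows"
```

Adding a CatBoost or PyTorch trainer is then one new class plus one `@TrainerFactory.register(...)` decorator; the orchestrator keeps calling `TrainerFactory.create(name)` unchanged.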
Single Responsibility: each module has one reason to change.
src/ml_framework/core/config_management/ handles only config discovery/merge/access (BaseConfigManager, ConfigManager)
src/ml_framework/core/app_logging/ focuses on structured logging (BaseAppLogger, AppLogger) configured via configs/core/app_logging_config.yaml
src/ml_framework/core/error_handling/ centralizes error policies (BaseErrorHandler, ErrorHandler, ErrorHandlerFactory)
Open/Closed: extend via new classes/factories without modifying existing code.
Add new trainers in src/ml_framework/model_testing/trainers/ and register them via trainer_factory.py; no orchestrator changes
Extend experiment_logger_factory.py to support W&B or custom loggers alongside mlflow_logger.py
Add data-access backends (framework/data_access/) for DB, cloud, or parquet next to csv_data_access.py
Add a chart in visualization/charts/ and expose it via chart_factory.py; orchestrators don't change
Liskov Substitution: abstract base classes guarantee drop-in compatibility.
Any trainer (catboost_trainer.py, lightgbm_trainer.py, xgboost_trainer.py, sklearn_trainer.py, pytorch_trainer.py) can be swapped without breaking model_tester.py
Any experiment logger (mlflow_logger.py today) can be replaced by any logger honoring the same interface
Interface Segregation: small, focused interfaces prevent unnecessary coupling.
Dependency Inversion: high-level policies depend on abstractions, not concretions.
experiment_logger_factory.py, trainer_factory.py, hyperparams_optimizer_factory.py, and the di_container.py files supply concrete instances to code that only references base interfaces
configs/core/model_testing_config.yaml selects models, logging, and preprocessing; code reads the abstractions and resolves implementations at runtime
MLflow is the current experiment logger (model_testing/experiment_loggers/mlflow_logger.py), but it sits behind BaseExperimentLogger and experiment_logger_factory.py; swap to W&B or a homegrown tracker by adding one class and a factory entry
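The dependency-inversion point can be made concrete with a small sketch: orchestration code references only an abstract logger, and a config dict selects the concrete class at runtime. `InMemoryLogger` and `LOGGER_REGISTRY` are hypothetical names for illustration; the real project wires mlflow_logger.py through its factory and DI containers.

```python
from abc import ABC, abstractmethod

class BaseExperimentLogger(ABC):
    """Abstraction that orchestration code depends on, never a concrete logger."""

    @abstractmethod
    def log_metric(self, name: str, value: float) -> None: ...

class InMemoryLogger(BaseExperimentLogger):
    """Hypothetical stand-in for a concrete logger such as mlflow_logger.py."""

    def __init__(self) -> None:
        self.metrics: dict[str, float] = {}

    def log_metric(self, name: str, value: float) -> None:
        self.metrics[name] = value

# Factory table: supporting W&B would mean one new class and one new entry.
LOGGER_REGISTRY = {"memory": InMemoryLogger}

def run_experiment(config: dict) -> BaseExperimentLogger:
    # The concrete class is resolved from config at runtime; the calling code
    # only ever sees the BaseExperimentLogger interface (dependency inversion).
    logger = LOGGER_REGISTRY[config["experiment_logger"]]()
    logger.log_metric("accuracy", 0.621)
    return logger
```

Because `run_experiment` never names a concrete logger, swapping trackers is a config change plus one new registry entry, with no edits to the call sites.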