πŸ€

NBA Prediction

End-to-end ML pipeline predicting NBA game winners with 62.1% accuracy (2024-25 season) using gradient boosting and automated daily data integration

62.1%
Model Accuracy
7.5 ppts
Above Baseline
2400+
Games Predicted

(2024-2025 NBA Season and Playoffs)


Machine Learning XGBoost Data Pipeline Streamlit Sports Analytics
NBA Prediction Dashboard
⚠️

The Challenge

Build an end-to-end machine learning pipeline to predict the winners of NBA games:

  • Trains a model to predict game winners
  • Retrieves new data daily
  • Engineers features from the data
  • Predicts the probability of the home team winning
  • Displays the results online
💡

My Solution

Constructed a modular system that uses:

  • XGBoost, Optuna, and Neptune.ai for model training and experiment tracking
  • GitHub Actions, Selenium, and ScrapingAnt to scrape new games daily from NBA.com
  • Pandas to engineer and store features
  • XGBoost and scikit-learn to predict and calibrate win probabilities
  • Streamlit to deploy the results online

Technical Highlights

Key innovations and technical achievements in this project

🤖

Advanced Modeling

XGBoost and LightGBM with Optuna hyperparameter tuning, plus scikit-learn probability calibration
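As a hedged illustration of the calibration step only (not the project's actual code), scikit-learn's CalibratedClassifierCV can wrap any gradient-boosting classifier; the synthetic features and labels below are stand-ins for the real engineered game data:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in for engineered game features (X) and home-win labels (y)
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)

# Fit the booster and calibrate its probabilities via Platt scaling ("sigmoid")
model = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=0), method="sigmoid", cv=3
).fit(X, y)

home_win_proba = model.predict_proba(X)[:, 1]  # calibrated P(home team wins)
```

Calibration matters here because betting-style probability outputs are only useful if a predicted 60% actually wins about 60% of the time.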

🔧

Feature Engineering

Engineered features such as win/loss streaks, home/away streaks, rolling averages over multiple time windows, and head-to-head stats
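The streak and rolling-average features can be sketched in pandas; the column names and two-game window below are illustrative, not the project's actual schema:

```python
import pandas as pd

# Toy box scores in chronological order per team (hypothetical schema)
games = pd.DataFrame({
    "team": ["BOS", "BOS", "BOS", "LAL", "LAL", "LAL"],
    "pts":  [110, 98, 121, 102, 115, 99],
    "won":  [1, 0, 1, 1, 1, 0],
})
g = games.groupby("team")

# Rolling scoring average over the previous 2 games; shift(1) ensures each
# row only sees games played before it (no target leakage)
games["pts_avg_2"] = g["pts"].transform(lambda s: s.shift(1).rolling(2).mean())

def win_streak(s: pd.Series) -> pd.Series:
    # Consecutive wins entering each game, reset to zero on any loss
    prior = s.shift(1)
    runs = (prior != prior.shift()).cumsum()
    return prior.groupby(runs).cumsum().where(prior == 1, 0)

games["win_streak"] = g["won"].transform(win_streak)
```

The `shift(1)` before every window is the key design choice: it keeps each engineered feature strictly pre-game, which is what makes the later time-series validation honest.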

📊

Data Pipeline

Automated ETL pipeline with error handling, data validation, and daily NBA.com scraping
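A hedged sketch of the kind of validation gate such a pipeline might apply before merging a daily scrape; the column names and checks are hypothetical, not the project's actual validator:

```python
import pandas as pd

def validate_boxscores(df: pd.DataFrame) -> pd.DataFrame:
    """Reject a scrape that is structurally broken; deduplicate otherwise."""
    required = {"game_id", "team", "pts"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"scrape missing columns: {missing}")
    if df["pts"].lt(0).any():
        raise ValueError("scrape contains negative point totals")
    return df.drop_duplicates(subset=["game_id", "team"])

# A duplicated row (e.g. from a retried request) is silently collapsed
clean = validate_boxscores(pd.DataFrame({
    "game_id": [1, 1, 1],
    "team": ["BOS", "LAL", "LAL"],
    "pts": [110, 102, 102],
}))
```

Failing loudly on structural problems keeps a bad daily scrape from silently corrupting the cumulative dataset downstream.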

🎯

Model Validation

Cross-validation and test-set evaluation using stratified and time-series K-folds, plus SHAP for interpretability and train-vs-validation feature-importance comparisons
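Time-series K-folds keep every validation game strictly after its training games, which matters for schedule-ordered data. A minimal sketch with synthetic stand-in data (the real pipeline's features and model differ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in for chronologically ordered game features and outcomes
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 6))
y = (X[:, 0] > 0).astype(int)

# Each fold trains only on games that precede its validation window
scores = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[val_idx], model.predict(X[val_idx])))

mean_acc = float(np.mean(scores))
```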

📈

Experiment Tracking

Neptune.ai integration for comprehensive experiment logging and model comparison

🌐

Production Ready

GitHub Actions CI/CD, Streamlit deployment with monitoring dashboard

Technology Stack

Tools and frameworks used throughout the project lifecycle

🐍 Core ML/Data

Python Pandas NumPy

🚀 ML Models and Experimentation

XGBoost LightGBM Scikit-learn SHAP Optuna Neptune.ai

📊 Data & Scraping

BeautifulSoup Selenium ScrapingAnt

βš™οΈ MLOps and Deployment

Streamlit GitHub Actions

Results & Impact

Season Results (2024-2025 NBA Season and Playoffs)

62.1%
Test Accuracy
54.6%
Baseline (Home Wins)
NBA Model Accuracy
NBA Recent Games

(As the charts show, the model struggles in the playoffs: the 7-day rolling averages swing more sharply when teams face each other repeatedly in a short span. This is an area for improvement in future iterations of the model.)

Lessons and Future Plans

This was my first big ML project, and I learned a lot. So much, in fact, that I am in the process of completely redoing the project from scratch. I plan to write more about this at some point, but there are a couple of big takeaways.

Key Issues

Accuracy

My model achieved 62.1% accuracy for the 2024-25 season. The "home team always wins" baseline scored 54.6%, but better models often reach closer to 65%.

Last season, at least one expert at nflpickwatch.com reached roughly 69.3% accuracy, and close to 100 experts topped 65%.

Reliability and Extensibility

Third-party services like Hopsworks, Neptune, and Streamlit would sometimes simply stop working. Some recovered on their own; for others I had to build workarounds. I abandoned Hopsworks altogether, and for Streamlit I created a "light", dashboard-only repo.

Solutions

More Data, Feature Engineering, and Experimentation

Feature engineering is the key to improving the performance of any ML model. More advanced statistics available on NBA.com will be scraped to support feature engineering, and a better evaluation framework will enable faster experimentation to determine which features work best.

Modular, OOP Architecture

A finer-grained OOP approach with abstract interfaces and dependency injection makes it easier to swap out problematic components and to add new capabilities. Alternates and fallbacks can be incorporated into the pipeline more easily.
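The abstract-interface plus dependency-injection idea can be sketched as follows. The class and file names echo those in the V2 tree (BaseDataAccess, csv_data_access.py), but this is an illustrative toy, not the project's actual code:

```python
from abc import ABC, abstractmethod

class BaseDataAccess(ABC):
    """Abstract storage interface (mirrors the role of base_data_access.py)."""
    @abstractmethod
    def load(self, name: str) -> str: ...

class CSVDataAccess(BaseDataAccess):
    def load(self, name: str) -> str:
        return f"rows from {name}.csv"

class ParquetDataAccess(BaseDataAccess):
    def load(self, name: str) -> str:
        return f"rows from {name}.parquet"

class Pipeline:
    """Depends only on the abstraction; the concrete backend is injected."""
    def __init__(self, data_access: BaseDataAccess):
        self.data_access = data_access

    def run(self) -> str:
        return self.data_access.load("games")

# Swapping storage backends requires no change to Pipeline itself
result = Pipeline(CSVDataAccess()).run()
```

If a third-party backend starts failing, a fallback implementation can be injected at the composition root without touching any pipeline logic.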

🚧 Version 2 - In Development

I am currently working on a completely redesigned version of this project with improved architecture, better feature engineering, and enhanced reliability.

Architecture Highlights

Directory Structure

./
├── configs/
│   ├── core/
│   │   ├── hyperparameters/
│   │   │   ├── catboost/
│   │   │   │   ├── baseline.json
│   │   │   │   └── current_best.json
│   │   │   ├── lightgbm/
│   │   │   │   ├── baseline.json
│   │   │   │   └── current_best.json
│   │   │   ├── pytorch/
│   │   │   │   └── baseline.json
│   │   │   ├── sklearn_histgradientboosting/
│   │   │   │   ├── baseline.json
│   │   │   │   └── current_best.json
│   │   │   ├── sklearn_logisticregression/
│   │   │   │   ├── baseline.json
│   │   │   │   └── current_best.json
│   │   │   ├── sklearn_randomforest/
│   │   │   │   ├── baseline.json
│   │   │   │   └── current_best.json
│   │   │   └── xgboost/
│   │   │       ├── baseline.json
│   │   │       └── current_best.json
│   │   ├── models/
│   │   │   ├── catboost_config.yaml
│   │   │   ├── lightgbm_config.yaml
│   │   │   ├── pytorch_config.yaml
│   │   │   ├── sklearn_histgradientboosting_config.yaml
│   │   │   ├── sklearn_logisticregression_config.yaml
│   │   │   ├── sklearn_randomforest_config.yaml
│   │   │   └── xgboost_config.yaml
│   │   ├── app_logging_config.yaml
│   │   ├── model_testing_config.yaml
│   │   ├── optuna_config.yaml
│   │   ├── preprocessing_config.yaml
│   │   └── visualization_config.yaml
│   └── nba/
│       ├── app_config.yaml
│       ├── data_access_config.yaml
│       ├── data_processing_config.yaml
│       ├── feature_engineering_config.yaml
│       └── webscraping_config.yaml
├── data/
│   ├── cumulative_scraped/
│   │   ├── games_advanced.csv
│   │   ├── games_four-factors.csv
│   │   ├── games_misc.csv
│   │   ├── games_scoring.csv
│   │   └── games_traditional.csv
│   ├── engineered/
│   │   └── engineered_features.csv
│   ├── newly_scraped/
│   │   ├── games_advanced.csv
│   │   ├── games_four-factors.csv
│   │   ├── games_misc.csv
│   │   ├── games_scoring.csv
│   │   ├── games_traditional.csv
│   │   ├── todays_games_ids.csv
│   │   └── todays_matchups.csv
│   ├── predictions/
│   │   ├── CatBoost_val_predictions.csv
│   │   ├── LGBM_oof_predictions.csv
│   │   ├── LGBM_val_predictions.csv
│   │   ├── lightgbm_val_predictions.csv
│   │   ├── SKLearn_HistGradientBoosting_val_predictions.csv
│   │   ├── SKLearn_RandomForest_val_predictions.csv
│   │   ├── XGBoost_oof_predictions.csv
│   │   ├── XGBoost_val_predictions.csv
│   │   └── xgboost_val_predictions.csv
│   ├── processed/
│   │   ├── column_mapping.json
│   │   ├── games_boxscores.csv
│   │   └── teams_boxscores.csv
│   ├── test_data/
│   │   ├── games_advanced.csv
│   │   ├── games_four-factors.csv
│   │   ├── games_misc.csv
│   │   ├── games_scoring.csv
│   │   └── games_traditional.csv
│   └── training/
│       ├── training_data.csv
│       └── validation_data.csv
├── docs/
│   ├── AI/
│   │   ├── config_reference.txt
│   │   ├── config_tree.txt
│   │   ├── directory_tree.txt
│   │   └── interfaces.md
│   ├── commentary/
│   │   └── Webscraping.md
│   ├── data/
│   │   ├── column_mapping.json
│   │   └── nba-boxscore-data-dictionary.md
│   ├── src/
│   │   └── readme.md
│   └── readme.md
├── hyperparameter_history/
│   ├── LGBM_history.json
│   └── XGBoost_history.json
├── notebooks/
│   ├── baseline.ipynb
│   └── eda.ipynb
├── src/
│   ├── ml_framework/
│   │   ├── core/
│   │   │   ├── app_file_handling/
│   │   │   │   ├── app_file_handler.py
│   │   │   │   └── base_app_file_handler.py
│   │   │   ├── app_logging/
│   │   │   │   ├── app_logger.py
│   │   │   │   └── base_app_logger.py
│   │   │   ├── config_management/
│   │   │   │   ├── base_config_manager.py
│   │   │   │   ├── config_manager.py
│   │   │   │   └── config_path.yaml
│   │   │   ├── error_handling/
│   │   │   │   ├── base_error_handler.py
│   │   │   │   ├── error_handler.py
│   │   │   │   └── error_handler_factory.py
│   │   │   └── common_di_container.py
│   │   ├── framework/
│   │   │   ├── data_access/
│   │   │   │   ├── base_data_access.py
│   │   │   │   └── csv_data_access.py
│   │   │   ├── data_classes/
│   │   │   │   ├── metrics.py
│   │   │   │   ├── preprocessing.py
│   │   │   │   └── training.py
│   │   │   └── base_data_validator.py
│   │   ├── model_testing/
│   │   │   ├── experiment_loggers/
│   │   │   │   ├── base_experiment_logger.py
│   │   │   │   ├── experiment_logger_factory.py
│   │   │   │   └── mlflow_logger.py
│   │   │   ├── hyperparams_managers/
│   │   │   │   ├── base_hyperparams_manager.py
│   │   │   │   └── hyperparams_manager.py
│   │   │   ├── hyperparams_optimizers/
│   │   │   │   ├── base_hyperparams_optimizer.py
│   │   │   │   ├── hyperparams_optimizer_factory.py
│   │   │   │   └── optuna_optimizer.py
│   │   │   ├── trainers/
│   │   │   │   ├── base_trainer.py
│   │   │   │   ├── catboost_trainer.py
│   │   │   │   ├── lightgbm_trainer.py
│   │   │   │   ├── pytorch_trainer.py
│   │   │   │   ├── sklearn_trainer.py
│   │   │   │   ├── trainer_factory.py
│   │   │   │   ├── trainer_utils.py
│   │   │   │   └── xgboost_trainer.py
│   │   │   ├── base_model_testing.py
│   │   │   ├── di_container.py
│   │   │   ├── main.py
│   │   │   └── model_tester.py
│   │   ├── preprocessing/
│   │   │   ├── base_preprocessor.py
│   │   │   └── preprocessor.py
│   │   ├── uncertainty/
│   │   │   └── uncertainty_calibrator.py
│   │   └── visualization/
│   │       ├── charts/
│   │       │   ├── base_chart.py
│   │       │   ├── chart_factory.py
│   │       │   ├── chart_types.py
│   │       │   ├── chart_utils.py
│   │       │   ├── feature_charts.py
│   │       │   ├── learning_curve_charts.py
│   │       │   ├── metrics_charts.py
│   │       │   ├── model_interpretation_charts.py
│   │       │   └── shap_charts.py
│   │       ├── exploratory/
│   │       │   ├── base_explorer.py
│   │       │   ├── correlation_explorer.py
│   │       │   ├── distribution_explorer.py
│   │       │   ├── team_performance_explorer.py
│   │       │   └── time_series_explorer.py
│   │       └── orchestration/
│   │           ├── base_chart_orchestrator.py
│   │           └── chart_orchestrator.py
│   └── nba_app/
│       ├── data_processing/
│       │   ├── base_data_processing_classes.py
│       │   ├── di_container.py
│       │   ├── main.py
│       │   └── process_scraped_NBA_data.py
│       ├── feature_engineering/
│       │   ├── base_feature_engineering.py
│       │   ├── di_container.py
│       │   ├── feature_engineer.py
│       │   ├── feature_selector.py
│       │   └── main.py
│       ├── webscraping/
│       │   ├── old/
│       │   │   ├── test_websraping.py
│       │   │   └── webscraping_old.py
│       │   ├── base_scraper_classes.py
│       │   ├── boxscore_scraper.py
│       │   ├── di_container.py
│       │   ├── main.py
│       │   ├── matchup_validator.py
│       │   ├── nba_scraper.py
│       │   ├── page_scraper.py
│       │   ├── readme.md
│       │   ├── schedule_scraper.py
│       │   ├── test.ipynb
│       │   ├── test_page_scraper.py
│       │   ├── utils.py
│       │   └── web_driver.py
│       └── data_validator.py
├── tests/
│   ├── ml_framework/
│   └── nba_app/
│       └── webscraping/
│           ├── test_boxscore_scraper.py
│           ├── test_integration.py
│           ├── test_nba_scraper.py
│           ├── test_page_scraper.py
│           └── test_schedule_scraper.py
├── directory_tree.txt
├── pyproject.toml
├── README.md
└── uv.lock

Key Design Principles

Clean, layered architecture
  • src/ml_framework vs src/nba_app enforces a strict one-way dependency (nba_app → ml_framework), keeping the platform generic and the domain isolated
  • src/ package layout with __init__.py enables clean imports and scalable growth across modules
SOLID principles throughout
  • Single Responsibility: Modules like src/ml_framework/core/config_management/, app_logging/, error_handling/ each own one concern; nba_app partitions data_processing, feature_engineering, and webscraping
  • Open/Closed: Trainers (src/ml_framework/model_testing/trainers/), experiment loggers (experiment_loggers/), and data access (framework/data_access/) are extended via factories without modifying existing code
  • Liskov Substitution: Abstract base classes (e.g., BaseTrainer, BaseExperimentLogger, BaseDataAccess, BasePreprocessor) guarantee drop-in interchangeability of implementations
  • Interface Segregation: Focused interfaces (e.g., separate BaseTrainer, BaseModelTester, BaseHyperparamsOptimizer, BaseChart) prevent "fat" dependencies
  • Dependency Inversion: High-level flows depend on abstractions; concrete classes are resolved via di_container.py files in nba_app and ml_framework
Interchangeable, plug-and-play components
  • Experiment tracking: BaseExperimentLogger + experiment_logger_factory with mlflow_logger.py; easily swap to W&B or a custom tracker by adding a new implementation
  • Models/trainers: trainer_factory orchestrates catboost_trainer.py, lightgbm_trainer.py, xgboost_trainer.py, sklearn_trainer.py, and pytorch_trainer.py, all behind BaseTrainer
  • Hyperparameter search: BaseHyperparamsOptimizer with optuna_optimizer.py; optimizer factory allows alternative HPO backends without touching call sites
  • Data access: BaseDataAccess with csv_data_access.py; swap in a DB-backed implementation transparently
  • Visualization: BaseChart and chart_factory support consistent chart creation; new charts plug in without refactoring call sites
  • Web scraping: BaseWebDriver, BasePageScraper, and BaseNbaScraper in nba_app/webscraping allow switching drivers, parsers, or sources via config
Centralized, declarative configuration
  • configs/core and configs/nba separate platform defaults from NBA-specific settings
  • Fine-grained YAMLs for models, preprocessing, logging, Optuna, and visualization; hyperparameters are versioned in configs/core/hyperparameters with current_best.json and baseline.json per model
  • config_tree.txt documents the effective configuration namespace for full transparency and reproducibility
Reproducibility and environment parity
  • uv.lock pins dependencies; pyproject.toml consolidates build/runtime metadata
  • .gitignore/.gitattributes maintain clean VCS state; data/ directories are staged but not versioned
Testable, CI-friendly codebase
  • tests/ mirrors the package structure, with unit and integration tests for webscraping and room to expand across ml_framework
  • Smaller classes and functions under src/ enable targeted unit tests that were not possible with notebooks
Robust observability and error handling
  • Structured logging via src/ml_framework/core/app_logging with rolling files and console output, fully configured from configs/core/app_logging_config.yaml
  • Centralized error handling in src/ml_framework/core/error_handling with factory-based resolution for consistent behavior
End-to-end ML workflow discipline
  • Deterministic model testing in src/ml_framework/model_testing with TimeSeriesSplit, OOF, validation-set testing, and SHAP integration, configured in configs/core/model_testing_config.yaml
  • Hyperparameter history tracked in hyperparameter_history/ for auditability and reproducibility
Clear research-to-production path
  • notebooks/ quarantines exploratory work; production code resides in src/
  • data/ is organized by lifecycle (newly_scraped, cumulative_scraped, processed, engineered, training, predictions), making pipelines explicit and debuggable
Extensible visualization and interpretability
  • src/ml_framework/visualization provides base interfaces, orchestration, and specialized chart sets (feature importance, learning curves, metrics, SHAP), all toggled via configs/core/visualization_config.yaml
Domain expansion without rework
  • Three-tier architecture and one-way dependencies enable adding new sports apps (e.g., mlb_app) reusing ml_framework unchanged
GitHub: View Version 1 Repository | View Version 2 Repository

SOLID in Practice

ABCs + Factories + Config-first wiring

Single Responsibility Principle (SRP)

Each module has one reason to change.

  • Configuration: src/ml_framework/core/config_management/ handles only config discovery/merge/access (BaseConfigManager, ConfigManager)
  • Logging: src/ml_framework/core/app_logging/ focuses on structured logging (BaseAppLogger, AppLogger) configured via configs/core/app_logging_config.yaml
  • Error handling: src/ml_framework/core/error_handling/ centralizes error policies (BaseErrorHandler, ErrorHandler, ErrorHandlerFactory)
  • Domain slices: nba_app is split into data_processing, feature_engineering, and webscraping, each with their own DI containers and mains

Open/Closed Principle (OCP)

Extend via new classes/factories without modifying existing code.

  • Trainers: Add a new trainer under src/ml_framework/model_testing/trainers/ and register via trainer_factory.py, no orchestrator changes
  • Experiment logging: Implement BaseExperimentLogger and hook into experiment_logger_factory.py to support W&B or custom loggers alongside mlflow_logger.py
  • Data access: Implement BaseDataAccess (framework/data_access/) to add DB, cloud, or parquet backends next to csv_data_access.py
  • Charts: Add a new chart under visualization/charts/ and expose it via chart_factory.py; orchestrators don't change
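The register-in-a-factory pattern behind these bullets can be sketched as follows. The names echo BaseTrainer and trainer_factory.py from the tree, but this is an illustrative toy, not the project's actual implementation:

```python
from abc import ABC, abstractmethod

class BaseTrainer(ABC):
    @abstractmethod
    def train(self, X, y) -> str: ...

class XGBoostTrainer(BaseTrainer):
    def train(self, X, y) -> str:
        return "xgboost model"

class LightGBMTrainer(BaseTrainer):
    def train(self, X, y) -> str:
        return "lightgbm model"

# Registry maps config strings to concrete classes
_TRAINERS = {"xgboost": XGBoostTrainer, "lightgbm": LightGBMTrainer}

def trainer_factory(name: str) -> BaseTrainer:
    # Adding a new model = one class + one registry entry; call sites unchanged
    try:
        return _TRAINERS[name]()
    except KeyError:
        raise ValueError(f"unknown trainer: {name}")

model = trainer_factory("xgboost").train(None, None)
```

Because the orchestrator only ever calls `trainer_factory(config_value)`, supporting a new model family never requires editing existing, tested code paths.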

Liskov Substitution Principle (LSP)

Abstract base classes guarantee drop-in compatibility.

  • BaseTrainer subclasses (catboost_trainer.py, lightgbm_trainer.py, xgboost_trainer.py, sklearn_trainer.py, pytorch_trainer.py) can be swapped without breaking model_tester.py
  • BaseExperimentLogger implementations (mlflow_logger.py today) can be replaced by any logger honoring the same interface
  • BaseDataAccess and BasePreprocessor enforce method contracts so downstream code doesn't care about concrete types

Interface Segregation Principle (ISP)

Small, focused interfaces prevent unnecessary coupling.

  • Separate interfaces for training (BaseTrainer), model testing (BaseModelTester), HPO (BaseHyperparamsOptimizer), validation (BaseDataValidator), preprocessing (BasePreprocessor), and visualization (BaseChart, BaseChartOrchestrator)
  • Web layer uses granular abstractions: BaseWebDriver, BasePageScraper, BaseNbaScraper; scrapers don't inherit heavy methods they don't use

Dependency Inversion Principle (DIP)

High-level policies depend on abstractions, not concretions.

  • Factories and DI: experiment_logger_factory.py, trainer_factory.py, hyperparams_optimizer_factory.py, and di_container.py files supply concrete instances to code that only references base interfaces
  • Config-driven wiring: configs/core/model_testing_config.yaml selects models, logging, and preprocessing; code reads the abstractions and resolves implementations at runtime
  • One-way dependency rule: nba_app depends on ml_framework; ml_framework stays domain-agnostic, keeping policies at the top and details at the bottom

Bonus: Interchangeability by Design

  • Experiment tracker is MLflow today (model_testing/experiment_loggers/mlflow_logger.py) but it sits behind BaseExperimentLogger and experiment_logger_factory.py; swap to W&B or a homegrown tracker by adding one class and a factory entry
  • Trainers, HPO backends, data access layers, and chart types are all selected via configuration and factories, not hard-coded conditionals, making replacement low-risk and testable