This post introduces a project I worked on recently: using Supervised Learning techniques to predict new NBA All-Stars based on player statistics from previous seasons. The techniques explored are Pruned Decision Trees, K-Nearest Neighbors, Boosted Decision Trees, Artificial Neural Networks, and Support Vector Machines. Each of these will be covered in a subsequent blog post. This post introduces the problem and explains the general methodology.
Data, Data, Data
The dataset is merged from two sources: NBA All-Stars from 2000-2016 and season stats for every player from 2000-2016. Each row in the dataset contains a single player’s statistics over a season. There are 50 total features, including total points/assists/rebounds, field goal/free throw/3-point percentages, and more advanced stats derived from formulas (like PER, USAG%, etc.). A single column specifies whether that player made the All-Star game that season. Using this training data, the goal was to classify the players from the 2018/2019 season as All-Stars or non-All-Stars. One characteristic of this dataset made it an especially appealing choice to investigate: the data is heavily unbalanced. Of the 8069 player-season rows in the dataset, only 463 are All-Stars (about 5.7%), so the negative examples greatly outnumber the positive ones. This imbalance needed to be accounted for when calculating accuracy or plotting curves. See General Approach for more details.
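To make the imbalance concrete, here is a small sketch that works through the arithmetic using only the two counts quoted above. The "balanced" weight formula, n_samples / (n_classes * class_count), is the heuristic scikit-learn uses for its "balanced" class weights; the dictionary keys are just illustrative labels, not column names from the dataset.

```python
# Counts quoted in the post.
n_rows = 8069        # player-season rows in the dataset
n_all_stars = 463    # rows labeled as All-Star
n_non_all_stars = n_rows - n_all_stars

# Share of rows that are All-Stars.
positive_rate = n_all_stars / n_rows
print(f"All-Star share: {positive_rate:.2%}")

# A model that always predicts "not an All-Star" is right on every negative row,
# which is why plain accuracy is misleading on this data.
baseline_accuracy = n_non_all_stars / n_rows
print(f"Always-negative accuracy: {baseline_accuracy:.2%}")

# "Balanced" class weights: n_samples / (n_classes * class_count).
n_classes = 2
balanced_weights = {
    "non_all_star": n_rows / (n_classes * n_non_all_stars),
    "all_star": n_rows / (n_classes * n_all_stars),
}
print(balanced_weights)
```

With these weights, each class contributes the same total weight to a score, regardless of how many rows it has.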
General Approach
I used Python’s scikit-learn library to run the Supervised Learning algorithms, the pandas library for data manipulation and analysis, and matplotlib for data visualization and plotting. I followed the same general procedure for each of the five algorithms. My analysis included: a Model Complexity Curve (also called a Validation Curve) to help tune hyperparameters; a Learning Curve to estimate the minimum number of samples needed to learn the model and to investigate any issues due to high bias or high variance; a timing curve to investigate the runtime performance of training and testing; and, for some algorithms, an iteration curve to investigate accuracy as the number of iterations increased. At a high level, here is the general methodology I followed when analyzing each algorithm:
- Split 80% of the dataset into a training set and 20% into a testing set, using a stratified split
- Standardize features by removing the mean and scaling to unit variance
- On the 80% training set, use a Model Complexity curve to find the best hyperparameters for tuning the model (with 5-fold cross validation)
- On the same 80% training set and using the best estimator identified in Step 3, plot the Learning Curve for the model (with 5-fold cross validation, adding training data cumulatively in 20% increments)
- Calculate the training accuracy of the model on the 80% training set from Step 1
- Calculate the test accuracy of the model on the 20% of the dataset held out in Step 1
- Plot the iteration or timing curves, as necessary, depending on the model
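The timing curve in the last step can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the synthetic data from `make_classification`, the choice of `KNeighborsClassifier`, and the 20% increments are all assumptions made for the example.

```python
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 2000 rows, 20 features.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, y_pool = X[:1600], y[:1600]   # rows available for training
X_query = X[1600:]                    # fixed set of rows to predict on

fractions = np.linspace(0.2, 1.0, 5)  # 20%, 40%, ..., 100% of the pool
fit_times, predict_times = [], []

for frac in fractions:
    n = int(frac * len(X_pool))
    model = KNeighborsClassifier(n_neighbors=5)

    start = time.perf_counter()
    model.fit(X_pool[:n], y_pool[:n])
    fit_times.append(time.perf_counter() - start)

    start = time.perf_counter()
    model.predict(X_query)
    predict_times.append(time.perf_counter() - start)

for frac, fit_t, pred_t in zip(fractions, fit_times, predict_times):
    print(f"{frac:.0%} of training pool: fit {fit_t:.4f}s, predict {pred_t:.4f}s")
```

Plotting `fit_times` and `predict_times` against `fractions` with matplotlib gives the timing curve; the absolute numbers depend on the machine and the model.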
In Step 1, the initial split is done using stratified sampling. This ensures that both the training and test sets contain the same percentage of each class (for example, All-Stars and non-All-Stars) as the full dataset.
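As a rough illustration of the stratified split, the sketch below uses scikit-learn's `train_test_split` with its `stratify` argument. The data is a synthetic stand-in (1000 rows with 6% positives), chosen only to mimic the imbalance; it is not the NBA dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 1000 rows, 60 of them labeled positive.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:60] = 1

# stratify=y keeps the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("overall positive rate:", y.mean())
print("train positive rate:  ", y_train.mean())
print("test positive rate:   ", y_test.mean())
```

Without `stratify`, a random 20% split of such rare positives could easily end up with noticeably more or fewer All-Stars than the training set.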
Step 2 is done using scikit-learn’s StandardScaler so that each attribute has a mean of 0 and unit variance, which puts all attributes on the same scale. This matters for certain models like SVMs and k-NN, where an attribute with a large variance would otherwise dominate the objective function or the distance calculation. See the image below for an example of the StandardScaler applied to a few sample attributes from the NBA dataset.
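Here is a minimal sketch of the scaling step. The three columns are made-up values on very different scales (loosely in the spirit of total points, field goal percentage, and PER), and fitting the scaler on the training rows only is an assumption of this example rather than something stated above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)


def make_rows(n):
    """Hypothetical features on very different scales."""
    return np.column_stack([
        rng.normal(800, 400, size=n),    # something like total points
        rng.normal(0.45, 0.05, size=n),  # something like a shooting percentage
        rng.normal(15, 5, size=n),       # something like PER
    ])


X_train = make_rows(200)
X_test = make_rows(50)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on test rows

print("train means after scaling:", X_train_scaled.mean(axis=0).round(3))
print("train stds after scaling: ", X_train_scaled.std(axis=0).round(3))
```

The test rows will not have exactly zero mean and unit variance after the transform, which is expected: they are scaled with the training statistics.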
Steps 3 and 4 both use built-in functions to generate the Validation and Learning Curves. My experiments scored these curves with a custom sample-weight scoring function that weights each sample inversely proportional to its class frequency in the input data. This ensures that the score on the unbalanced dataset is not dominated by the majority class.
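One way to set this up is sketched below, using scikit-learn's `validation_curve` and `learning_curve` together with a hand-written scorer. The scorer name `balanced_accuracy_scorer`, the synthetic data, the decision tree, and the `max_depth` range are all assumptions for illustration; the scorer shows the general idea of balanced sample weights, not the exact function used in the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_sample_weight


def balanced_accuracy_scorer(estimator, X, y):
    """Accuracy with each sample weighted inversely to its class frequency."""
    weights = compute_sample_weight(class_weight="balanced", y=y)
    return accuracy_score(y, estimator.predict(X), sample_weight=weights)


# Synthetic, imbalanced stand-in data (assumption): roughly 10% positives.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Model Complexity (Validation) Curve over a hypothetical max_depth range.
depths = [1, 2, 4, 8]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths,
    cv=5, scoring=balanced_accuracy_scorer,
)
best_depth = depths[int(np.argmax(val_scores.mean(axis=1)))]
print("best max_depth by mean validation score:", best_depth)

# Learning Curve for the chosen setting, growing the training set in 20% steps.
sizes, lc_train_scores, lc_val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=best_depth, random_state=0), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5, scoring=balanced_accuracy_scorer,
)
print("training set sizes:", sizes)
print("mean validation score per size:", lc_val_scores.mean(axis=1).round(3))
```

Both functions return one row of scores per parameter value or training size and one column per fold, so averaging over `axis=1` gives the points to plot.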
Steps 5 and 6 use a similar “balanced” sample-weight scoring function to account for the unbalanced data. Additionally, my experiments used 5-fold cross validation to guard against overfitting to the training data, at the cost of increased run times.
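To show why the balanced weighting matters when reporting accuracy, here is a small sketch that compares plain and weighted accuracy. The helper name `weighted_accuracy`, the synthetic data, and the k-NN model are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_sample_weight


def weighted_accuracy(y_true, y_pred):
    """Accuracy with 'balanced' sample weights derived from y_true."""
    weights = compute_sample_weight(class_weight="balanced", y=y_true)
    return accuracy_score(y_true, y_pred, sample_weight=weights)


# Synthetic, imbalanced stand-in data (assumption): roughly 6% positives.
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.94, 0.06], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

scaler = StandardScaler().fit(X_train)
model = KNeighborsClassifier(n_neighbors=5).fit(scaler.transform(X_train), y_train)

train_acc = weighted_accuracy(y_train, model.predict(scaler.transform(X_train)))
test_acc = weighted_accuracy(y_test, model.predict(scaler.transform(X_test)))
print(f"balanced train accuracy: {train_acc:.3f}")
print(f"balanced test accuracy:  {test_acc:.3f}")

# A "model" that always predicts the majority class looks good on plain
# accuracy but scores only 0.5 once the classes are weighted equally.
always_negative = np.zeros_like(y_test)
plain_baseline = accuracy_score(y_test, always_negative)
balanced_baseline = weighted_accuracy(y_test, always_negative)
print(f"always-negative plain accuracy:    {plain_baseline:.3f}")
print(f"always-negative balanced accuracy: {balanced_baseline:.3f}")
```

The always-negative baseline is the useful sanity check here: any balanced score near 0.5 means the model is not really finding the minority class.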
This concludes a high-level introduction to my project of classifying NBA All-Stars. Subsequent posts will go into the results I observed for each algorithm.
NOTE: This project was completed as part of my Master’s coursework, specifically for the Machine Learning course.