Introduction

This post is the fifth post in the series on using various Supervised Learning techniques to classify NBA All-Stars. In a previous post, I introduced the problem, the data and the general approach I took for each algorithm. In this post, I will discuss how Neural Networks performed on this problem.

Neural Networks

Artificial Neural Networks (ANNs) learn a nonlinear function using three types of connected nodes: input, hidden and output. These nodes are connected in a manner similar to neurons in the human body. The algorithm takes input features in the leftmost input layer; each neuron in the hidden layer(s) computes a weighted linear summation of the previous layer's outputs and applies a non-linear activation function. Lastly, the output layer takes the values from the last hidden layer and transforms them into output values. These output values are compared against the actual known output values, and the error is propagated back through the layers so subsequent iterations can adjust the weights to continuously reduce it.

In scikit-learn, I used the MLPClassifier (Multilayer Perceptron), which uses the Backpropagation algorithm described above. I experimented with changing the alpha hyperparameter and used the default hidden layer configuration of one hidden layer with 100 units for the NBA dataset, since my classification problem was not too complex. alpha is a penalty term that constrains the size of the weights. Increasing alpha can reduce high variance in the model, resulting in a smoother decision boundary; decreasing alpha can help reduce high bias, resulting in a more complex, curvier decision boundary.
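
A minimal sketch of this setup (assuming X_train and y_train come from the stratified split described in the introduction post):

```python
from sklearn.neural_network import MLPClassifier

# One hidden layer of 100 units (scikit-learn's default); alpha is the
# L2 penalty term that constrains the size of the weights
clf = MLPClassifier(hidden_layer_sizes=(100,), alpha=1e-5, random_state=42)
clf.fit(X_train, y_train)
```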

Model Complexity Curve

See the figure below for the Model Complexity curve for varying values of alpha for the NBA problem. As alpha increases, the accuracy scores remain consistent, but once alpha exceeds 0.1 the accuracy declines, which shows that an alpha greater than 0.1 over-regularizes the weights and causes the model to underfit.

Model Complexity Curve

Learning Curve

For the learning curve, I used alpha = 10^-5 as the hyperparameter, since the model complexity curve showed essentially no change in validation accuracy for any alpha below 0.1. The learning curve shown below exhibits high variance (the validation accuracy is still increasing with more data), which suggests that giving this model more data or reducing the number of features might improve accuracy. The model also shows high bias, since the accuracy scores are relatively low; one option to reduce it would be to obtain new features from another source. The final training accuracy was 84.5597% and testing accuracy was 76.8251%. This model showed similar accuracy to k-NN but performed worse than Decision Trees or Boosting.

Learning Curve

Timing and Iteration Curves

The figure below shows the iteration and timing curves for Neural Networks. Here we see that we need at least 500 iterations of the algorithm to classify the NBA data with better than 50% accuracy. We also see that, since Neural Networks are an eager learner, more time is spent in training (O(n)) than during prediction (O(1)).
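
A sketch of how such an iteration curve can be produced (variable names are illustrative; each run simply caps the number of training iterations):

```python
from sklearn.neural_network import MLPClassifier

# Accuracy as a function of the number of training iterations;
# runs below convergence will emit a ConvergenceWarning, which is expected
for n_iter in (100, 250, 500, 1000, 2000):
    clf = MLPClassifier(hidden_layer_sizes=(100,), alpha=1e-5,
                        max_iter=n_iter, random_state=42)
    clf.fit(X_train, y_train)
    print(n_iter, clf.score(X_test, y_test))
```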

Iteration Curve
Timing Curve

Overall, Neural Networks did not perform too well for this problem. The final testing accuracy was lower than for Decision Trees. The model I chose used the default configuration of a single hidden layer with 100 units. In future experiments, I plan to modify the structure of the network to elicit better accuracy.

This concludes my investigation of Neural Networks to classify NBA All-Stars. The next post will cover Support Vector Machines and their performance on the same problem.

NOTE: This project was completed as part of my Master's coursework, specifically for the Machine Learning course.

Introduction

This post is the fourth post in the series on using various Supervised Learning techniques to classify NBA All-Stars. In a previous post, I introduced the problem, the data and the general approach I took for each algorithm. In this post, I will discuss how Boosted Decision Trees performed on this problem.

Boosting

Boosting is an ensemble learning technique that combines the predictions of several weak learners to produce a generalized prediction. Weak learners are models whose predictions are only slightly better than random guessing. The final prediction is usually made by a weighted majority vote of all estimators. For my experiments, I used a specific boosting algorithm, AdaBoost. AdaBoost iteratively applies the learning process to a re-weighted distribution of the training examples: at each step, the examples that were classified incorrectly are weighted higher for the next iteration, and the examples that were classified correctly are weighted lower. I used a single-level Decision Tree (also known as a decision stump) as the base estimator for AdaBoost, and I experimented with various numbers of estimators to identify the best estimator count for my model. One interesting observation was that using a Decision Tree of depth 3 or greater caused my boosting algorithm to underperform. After some research and reasoning, I discovered that the larger, more complex underlying decision tree was overfitting, causing boosting to also overfit. Using a less complex base estimator, one less likely to overfit, improved the performance of the boosted model.
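
A minimal sketch of this configuration (using the estimator count chosen in the next section; base_estimator was the parameter name in scikit-learn versions of this era):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# A decision stump (single-level tree) as the weak base estimator
stump = DecisionTreeClassifier(max_depth=1)
booster = AdaBoostClassifier(base_estimator=stump, n_estimators=18,
                             random_state=42)
booster.fit(X_train, y_train)
```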

Model Complexity Curve

See the figure below for the Model Complexity curve for Boosting. Here, we see that varying the n_estimators hyperparameter caused the training score to increase continuously while the validation score stayed relatively steady. This graph shows that overfitting becomes likely as the number of estimators grows. I chose 18 as the optimal number of estimators to use with the base estimator of a 1-level decision tree.

Model Complexity Curve

Learning Curve

In the Learning Curve plot below, we also see that the training and validation scores converge to similar values as the size of the training set increases. This shows that our model has low bias and low variance, so it generalizes well to new data. The final training accuracy was 92.3338% and testing accuracy was 87.7099%. While it performed better than k-NN, the Boosting model had around the same accuracy as a 3-level Decision Tree. This could be because of outliers in the data (All-Stars that didn't have good stats but were voted in due to popularity) whose importance Boosting keeps amplifying.

Learning Curve

Timing Curve

The last figure below shows the timing curves for training vs. prediction. The timing curve shows that Boosting is an eager learner, since prediction takes constant time while training time is O(n).

Timing Curve

Overall, Boosting performed well for this problem. The final testing accuracy was around the same as for Decision Trees, but Boosting was considerably slower without providing significant gains in accuracy. This led me to conclude that regular Decision Trees were sufficient for this problem and that boosting was not strictly necessary.

This concludes my investigation of Boosting to classify NBA All-Stars. The next post will cover Artificial Neural Networks and their performance on the same problem.

NOTE: This project was completed as part of my Master's coursework, specifically for the Machine Learning course.

Introduction

This post is the third post in the series on using various Supervised Learning techniques to classify NBA All-Stars. In a previous post, I introduced the problem, the data and the general approach I took for each algorithm. In this post, I will discuss how k-Nearest Neighbors performed on this problem.

k-Nearest Neighbors

k-Nearest Neighbors is an instance-based learning algorithm. Rather than constructing a generalized model of the classification problem, it stores all training samples and classifies the testing data by taking a majority vote of the k nearest neighbors to the query point. Overfitting occurs in k-NN when we use a k value of 1: each sample forms a neighborhood of its own, resulting in a model with low bias and high variance. As we increase k, we reduce the complexity of the model and thereby reduce overfitting. In my experiments for k-NN, I varied the k parameter and used uniform weights when computing the majority vote among the nearest neighbors.
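
A minimal sketch of this setup (k = 3 is the value chosen in the next section):

```python
from sklearn.neighbors import KNeighborsClassifier

# Majority vote over the k nearest neighbors, with uniform weights
knn = KNeighborsClassifier(n_neighbors=3, weights="uniform")
knn.fit(X_train, y_train)         # "training" just stores the samples
print(knn.score(X_test, y_test))  # prediction does the real work
```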

Model Complexity Curve

See the figure below for the Model Complexity curve for the NBA All-Stars problem, which shows that overfitting occurs when k=1 and decreases as k increases. The peak validation accuracy occurred at k=3, so I used this value for the learning curve and for computing testing accuracy. Increasing k decreases variance and increases bias, so in choosing k=3, I prioritized low bias (higher accuracy) over low variance. Note that this model is also relatively unaffected by the unbalanced nature of the NBA data, since it only considers a few of the closest samples for classification.

Model Complexity Curve

Learning Curve

The Learning Curve in the figure below shows that this model isn't as good at classifying NBA All-Stars as Decision Trees were. As we give the model more training examples, the accuracy scores continue to increase and there is a large gap between the training and validation scores. This shows that the model suffers from high variance, which means more training data or a larger k value might help improve the model's performance. The final training accuracy was 84.5761% and testing accuracy was 76.4208%. As expected, the accuracy is lower than that of Decision Trees.

Learning Curve

Timing Curve

The last figure below shows the timing curves for training vs. prediction. k-NN is a lazy learning model, which means that the algorithm spends a constant amount of time "learning" (really, just remembering) the training data and returns classifications in non-constant time. This is evident in the figure, which shows classification time increasing linearly with the amount of data while the training time remains constant.

Timing Curve

Overall, k-NN struggled to perform well for this problem. The final testing accuracy was lower than for Decision Trees. This is to be expected, since with a k value of 3 I'm only considering the three players in the same neighborhood with respect to season stats. Using a greater value of k is also not a good idea, since the model would underfit (validation accuracy declined for k > 3). This tradeoff also demonstrates the No Free Lunch Theorem: the structure utilized by this model is not well suited to the problem of finding NBA All-Stars. In the conclusion post, I will apply this model to the stats from 2018 to see how well the model does on recent data.

This concludes my investigation of k-NN to classify NBA All-Stars. The next post will cover Boosting and its performance on the same problem.

NOTE: This project was completed as part of my Master's coursework, specifically for the Machine Learning course.

Introduction

This post is the second post in the series on using various Supervised Learning techniques to classify NBA All-Stars. In the previous post, I introduced the problem, the data and the general approach I took for each algorithm. In this post, I will discuss how [Decision Trees](https://en.wikipedia.org/wiki/Decision_tree) performed on this problem.

Decision Trees

Decision Trees (DTs) classify samples by inferring simple decision rules from the data features. They build a tree structure where each internal node represents an if/then rule based on some attribute. Classification of new samples is done by traversing the tree and using the leaf node values. Some advantages of DTs are that they are easy to understand, implement and visualize. One big disadvantage is that they are prone to overfitting on the training data: the algorithm builds a tree that performs extremely well on the training data by capturing all of its noise and unwanted features. For my experiments, I avoided overfitting by pruning the tree (specifying max_depth). In scikit-learn's DecisionTreeClassifier class, I used Gini Impurity as the function to measure the quality of a split and specified a balanced class weighting measure. All other parameters used the defaults provided by scikit-learn.
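
A minimal sketch of this configuration (max_depth=3 is the depth chosen below):

```python
from sklearn.tree import DecisionTreeClassifier

# Gini impurity for split quality, balanced class weights for the
# unbalanced All-Star data, and max_depth as the pruning knob
tree = DecisionTreeClassifier(criterion="gini", class_weight="balanced",
                              max_depth=3, random_state=42)
tree.fit(X_train, y_train)
```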

3-level Decision Tree for NBA Data

Model Complexity Curve

The figure above shows the final 3-level DT generated for the dataset I supplied. I was surprised to see that only 5 of the 50 attributes were considered in the tree: PER (Player Efficiency Rating), FTA (Free Throws Attempted), PPG (Points Per Game), 2P (Two Pointers Made) and DWS (Defensive Win Shares). I decided on a tree with only 3 levels after generating the Model Complexity Curve; see the figure below for those results. The results in that graph demonstrate that a tree with a max_depth of 3 generated the highest cross-validation score. As we increase the depth of the tree, the training score continues to improve while the validation score declines. This demonstrates that adding more levels to the tree leads to overfitting on the training data, which causes the tree to not generalize well on unseen data.
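
One way to inspect which attributes a fitted tree actually uses (a sketch, assuming feature_names holds the 50 column names as a list of strings):

```python
from sklearn.tree import export_text

# Print the learned if/then rules with human-readable feature names
print(export_text(tree, feature_names=feature_names))
```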

Model Complexity Curve

Learning Curve

Next, I ran a Learning Curve experiment on the 3-level Decision Tree; the results are captured in the figure below. This graph shows that both the training and cross-validation scores converge to around the same value: 90-92% accuracy. This suggests that my Decision Tree model does not suffer from high bias or high variance and will generalize well to previously unseen testing data, which is confirmed in the final testing results. The final training accuracy was 92.4931% and testing accuracy was 89.2402%. Additionally, this model required only about 3000 samples to achieve an accuracy of around 90%, which shows that it generalizes well without requiring much more data.

Learning Curve

Timing Curve

The last figure below shows the timing curves for training vs. prediction. Decision Trees are an eager learning model, which means that the algorithm spends a linear amount of time learning the model and returns classifications in constant time. This is evident in the figure, which shows training time increasing linearly with the amount of data while the classification time remains constant.
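
A sketch of how these timings can be collected (assuming X_train and y_train support positional slicing, e.g. NumPy arrays):

```python
import time

# Measure training vs. prediction time as the training set grows
for frac in (0.2, 0.4, 0.6, 0.8, 1.0):
    n = int(frac * len(X_train))
    start = time.perf_counter()
    tree.fit(X_train[:n], y_train[:n])
    mid = time.perf_counter()
    tree.predict(X_test)
    end = time.perf_counter()
    print(f"n={n}: train {mid - start:.4f}s, predict {end - mid:.4f}s")
```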

Timing Curve

Overall, Decision Trees performed well for this problem. The final testing accuracy was higher than I expected. In the conclusion post, I will apply this model to the stats from 2018 to see how well the model does on recent data.

This concludes my investigation of Decision Trees to classify NBA All-Stars. The next post will cover k-Nearest Neighbors and its performance on the same problem.

NOTE: This project was completed as part of my Master's coursework, specifically for the Machine Learning course.

Introduction

This post is the introduction for a project I worked on recently. Namely, using Supervised Learning techniques to predict new NBA All-Stars based on player statistics from previous seasons. The techniques explored include: Pruned Decision Trees, K-Nearest Neighbors, Boosted Decision Trees, Artificial Neural Networks, and Support Vector Machines. Each of these will be covered in subsequent blog posts. This post is to introduce the problem and explain the general methodology.

Data, Data, Data

The dataset is merged from two sources: NBA All-Stars from 2000-2016 and season stats for every player from 2000-2016. Each row in the dataset contains a single player's statistics over a season. There are 50 total features, including total points/assists/rebounds, field goal/free throw/3-point percentages and more advanced stats derived from formulas (like PER, USAG%, etc.). A single column specifies whether that player made the All-Star game that season. Using this training data, the goal was to classify the players from the 2018/2019 season as All-Stars or non-All-Stars. This dataset has some interesting characteristics that made it an appealing choice to investigate. First, the data is heavily unbalanced: of the 8069 players in the dataset, only 463 are actually All-Stars (only ~5.7%). The number of negative examples greatly outweighs the number of positive examples. This imbalance needed to be accounted for when calculating accuracy or plotting curves. See General Approach for more details.
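
A sketch of the merge (file and column names here are hypothetical; the real sources may have been shaped differently):

```python
import pandas as pd

# Hypothetical file names for the two sources
stats = pd.read_csv("season_stats_2000_2016.csv")
allstars = pd.read_csv("all_stars_2000_2016.csv")

# Flag each player-season that appears in the All-Star list
key = stats[["player", "year"]].apply(tuple, axis=1)
allstar_keys = set(zip(allstars["player"], allstars["year"]))
stats["all_star"] = key.isin(allstar_keys)
```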

General Approach

I used Python's scikit-learn library for running the Supervised Learning algorithms, the pandas library for data manipulation and analysis, and matplotlib for data visualization and plotting. For each of the five algorithms, I used the same general procedure for my analysis: a Model Complexity Curve (also called a Validation Curve) to help tune hyperparameters; a Learning Curve to find the lower bound on the number of samples needed to learn the model and to investigate any issues due to high bias or high variance; a Timing Curve to investigate the runtime performance of training and testing; and, for some algorithms, an Iteration Curve to investigate accuracy as the number of iterations increased. At a high level, here is the general methodology I followed when analyzing each algorithm:

  1. Split the dataset into an 80% training set and a 20% testing set, using a stratified split
  2. Standardize features by removing the mean and scaling to unit variance
  3. Using the 80% training set, plot a Model Complexity curve to find the best hyperparameters for tuning the model (with 5-fold cross validation)
  4. Using the same 80% training set and the best estimator identified in Step 3, plot the Learning Curve for the model (with 5-fold cross validation, adding 20% of the data incrementally)
  5. Calculate the training accuracy of the model on the 80% training set from Step 1
  6. Calculate the test accuracy of the model on the 20% of the dataset held out in Step 1
  7. Plot the iteration or timing curves, as necessary, depending on the model

In Step 1, the initial split is done using stratified sampling. This ensures that both the training and test sets contain the same percentage of each class (All-Stars and non-All-Stars).
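
A minimal sketch of this split (assuming X holds the features and y the All-Star labels):

```python
from sklearn.model_selection import train_test_split

# Stratify on the All-Star label so both splits keep the same
# ~5.7% positive rate
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```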

Step 2 is done using scikit-learn's StandardScaler to ensure that each attribute's variance is of the same order and that the mean is centered around 0. This is required for certain models like SVMs and k-NN (to ensure that an attribute with a large variance doesn't dominate the objective function). See the image below for an example of the StandardScaler applied to a few sample attributes from the NBA dataset.
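
A sketch of this step (fitting on the training split only is the standard way to avoid leaking test-set statistics):

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data, then apply the same
# transformation to the test data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
```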

Steps 3 and 4 both use built-in functions for plotting the Validation and Learning Curves. My experiments used a custom sample-weight scoring function that automatically adjusts weights inversely proportional to class frequencies in the input data. This is done to ensure that the unbalanced dataset can be scored accurately.
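
A sketch of such a scorer and its use with the built-in curve helpers (clf stands for whichever estimator is under test; the parameter grid is illustrative):

```python
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import learning_curve, validation_curve
from sklearn.utils.class_weight import compute_sample_weight

def balanced_accuracy(y_true, y_pred):
    # Weight each sample inversely proportional to its class frequency
    w = compute_sample_weight("balanced", y_true)
    return accuracy_score(y_true, y_pred, sample_weight=w)

scorer = make_scorer(balanced_accuracy)

# Model Complexity (Validation) Curve over one hyperparameter, 5-fold CV
train_scores, val_scores = validation_curve(
    clf, X_train, y_train, param_name="max_depth",
    param_range=range(1, 11), cv=5, scoring=scorer)

# Learning Curve with 5-fold CV and incrementally larger training sizes
sizes, lc_train, lc_val = learning_curve(
    clf, X_train, y_train, train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0],
    cv=5, scoring=scorer)
```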

Steps 5 and 6 also used a similar “balanced” sample weight scoring function to account for unbalanced data. Additionally, my experiments used 5-fold cross validation to prevent overfitting on the training data, at the cost of increased run times.

StandardScaler applied to a few features

This concludes a high-level introduction to my project of classifying NBA All-Stars. Subsequent posts will go into the results I observed for each algorithm.

NOTE: This project was completed as part of my Master's coursework, specifically for the Machine Learning course.