I recently wrote a post for my company’s Engineering blog! See post here!

# Predicting NBA-All Stars Using Neural Networks (Part 4/6)

# Introduction

This post is the fifth post in the series on using various Supervised Learning techniques to classify NBA All-Stars. In a previous post, I introduced the problem, the data and the general approach I took for each algorithm. In this post, I will discuss how Neural Networks performed on this problem.

# Neural Networks

Artificial Neural Networks (ANNs) learn a nonlinear function using three types of connected nodes: input, hidden and output. These nodes are connected in a manner loosely similar to neurons in the human body. The algorithm takes input features in the leftmost input layer. Each neuron in the hidden layer(s) then transforms the output of the previous layer by computing a weighted linear summation and applying a non-linear activation function. Lastly, the output layer takes the values from the last hidden layer and transforms them into output values. These output values are compared against the actual known output values, and the error is propagated back through the layers so that subsequent iterations can adjust the weights to continuously reduce the error.
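To make the forward pass described above concrete, here is a minimal NumPy sketch of a network with a single hidden layer. The layer sizes, the ReLU and sigmoid activations, and the random weights are all assumptions chosen for illustration; this is not the code from my experiments, and it leaves out the backward pass entirely.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    """One forward pass through a network with a single hidden layer."""
    # Hidden layer: weighted linear summation, then a non-linear activation (ReLU).
    hidden = np.maximum(0.0, X @ W1 + b1)
    # Output layer: another weighted summation over the hidden activations.
    logits = hidden @ W2 + b2
    # Sigmoid squashes each output into a probability-like value in (0, 1).
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))          # 4 hypothetical players, 5 made-up features
W1 = rng.normal(size=(5, 3)) * 0.1   # input -> hidden weights (3 hidden units)
b1 = np.zeros(3)
W2 = rng.normal(size=(3, 1)) * 0.1   # hidden -> output weights
b2 = np.zeros(1)

probs = forward(X, W1, b1, W2, b2)   # one value per player
```

Training would compare `probs` against the known labels and push the error back through `W2` and `W1` to update them, which is the part scikit-learn handles internally.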

In scikit-learn, I used the MLPClassifier (Multilayer Perceptron), which uses the Backpropagation algorithm described above. I experimented with changing the alpha hyperparameter. I used the default hidden layer configuration of one hidden layer with 100 units for the NBA dataset, since my classification problem was not too complex. alpha is a penalty term that constrains the size of the weights. Increasing alpha can reduce high variance in the model, which results in fewer curves in the decision boundaries. Decreasing alpha can help reduce high bias in the model, resulting in a more complex, curvier decision boundary.
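As a rough illustration of how alpha constrains the weights, the sketch below fits `MLPClassifier` twice on synthetic data and compares the overall size of the learned weights. The dataset from `make_classification`, the two alpha values, `max_iter=500` and the helper name `total_weight_norm` are assumptions made for this example, not settings from my experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the NBA data (assumption: the real features differ).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

def total_weight_norm(alpha):
    """Fit an MLP with the given alpha and return the L2 norm of all its weights."""
    clf = MLPClassifier(hidden_layer_sizes=(100,), alpha=alpha,
                        max_iter=500, random_state=0)
    clf.fit(X, y)
    # coefs_ holds one weight matrix per layer.
    return np.sqrt(sum(np.sum(w ** 2) for w in clf.coefs_))

weak_penalty = total_weight_norm(1e-5)     # almost no regularization
strong_penalty = total_weight_norm(100.0)  # heavy regularization
print(f"alpha=1e-5: {weak_penalty:.2f}, alpha=100: {strong_penalty:.2f}")
```

The heavily penalized model should end up with noticeably smaller weights, which is what produces the smoother decision boundary.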

# Model Complexity Curve

See the figure below for the Model Complexity curve for varying values of alpha for the NBA problem. As alpha increases, the accuracy scores remain consistent, but once alpha exceeds 0.1 the accuracy declines. Since a larger alpha penalizes the weights more heavily, this shows that using an alpha greater than 0.1 will result in underfitting.
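A model complexity curve of this kind can be produced with scikit-learn's `validation_curve`. The sketch below is a minimal version under stated assumptions: the data is synthetic, and the alpha grid, `cv=3` and `max_iter=300` are placeholder choices rather than the values behind the figure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the NBA data (assumption).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

alphas = np.logspace(-5, 1, 4)  # 1e-5, 1e-3, 1e-1, 1e1
train_scores, val_scores = validation_curve(
    MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=0),
    X, y,
    param_name="alpha", param_range=alphas,
    cv=3, scoring="accuracy",
)

# Each row is one alpha value, each column one cross-validation fold.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
best_alpha = alphas[np.argmax(mean_val)]
```

Plotting `mean_train` and `mean_val` against `alphas` on a log-scaled x-axis gives the two curves discussed here.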

# Learning Curve

For the learning curve, I used alpha=10^-5 as the hyperparameter, since having a less complex model is preferable to a more complex one (Occam’s Razor). The learning curve shown below exhibits high variance (since the validation accuracy is still increasing with more data). This suggests that either giving this model more data or reducing the number of features might improve accuracy. The model also shows high bias, since the accuracy scores are relatively low. One option to reduce the high bias might be to get new features from another source. **The final training accuracy was 84.5597% and testing accuracy was 76.8251%**. This model's accuracy scores were similar to those of k-NN, but it performed worse than Decision Trees or Boosting.

# Timing and Iteration Curves

The figure below shows the iteration and timing curves for Neural Networks. Here we see that we need at least 500 iterations of the algorithm to classify data with better than 50% accuracy for the NBA problem. We also see that since Neural Networks are an eager learner, more time is spent in training (O(n) in the number of training samples) than in prediction (O(1) with respect to the training set size).

Overall, Neural Networks did not perform too well for this problem. The final testing accuracy was lower than for Decision Trees. The model I chose used the default configuration of a single hidden layer with 100 units. In future experiments, I plan to modify the structure of the network to elicit better accuracy.

This concludes my investigation of Neural Networks to classify NBA All-Stars. The next post will cover Support Vector Machines and their performance on the same problem.

NOTE: This project was completed as part of my Masters coursework. Specifically, for the Machine Learning course.

# Predicting NBA-All Stars Using Boosting (Part 3/6)

# Introduction

This post is the fourth post in the series on using various Supervised Learning techniques to classify NBA All-Stars. In a previous post, I introduced the problem, the data and the general approach I took for each algorithm. In this post, I will discuss how Boosted Decision Trees performed on this problem.

# Boosting

Boosting is an ensemble learning technique that combines the predictions of several weak learners to produce a generalized prediction. Weak learners are models whose predictions are only slightly better than choosing at random. The final prediction is usually made by a weighted majority vote of all estimators. For my experiments, I used a specific boosting algorithm, AdaBoost. AdaBoost iteratively applies a learning process to a weighted distribution of the training examples. At each step, the training examples that were classified incorrectly are weighted higher for the next iteration and the examples that were classified correctly are weighted lower.

For my experiments, I used a single-level Decision Tree (also known as a decision stump) as the base estimator for AdaBoost. I also experimented with various numbers of estimators to identify the best estimator count for my model. One interesting observation was that using a Decision Tree of greater depth (3 or greater) caused my boosting algorithm to underperform. After some research and reasoning, I concluded that the larger, more complex underlying decision tree was overfitting, causing boosting to also overfit. Using a less complex base estimator, one that was less likely to overfit, improved the performance of the boosted model.
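The setup described above can be sketched with scikit-learn's `AdaBoostClassifier` and a depth-1 `DecisionTreeClassifier`. The synthetic data, the 70/30 split and the cap of 50 estimators are assumptions for this example. The base estimator is passed positionally because the keyword for it was renamed between scikit-learn versions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the NBA data (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

# A decision stump: a tree that is allowed a single split.
stump = DecisionTreeClassifier(max_depth=1)

# Base estimator passed positionally to stay compatible across versions.
boosted = AdaBoostClassifier(stump, n_estimators=50, random_state=0)
boosted.fit(X_train, y_train)

# staged_score yields the validation accuracy after each boosting round,
# so one fit is enough to compare every estimator count up to the cap.
val_scores = list(boosted.staged_score(X_val, y_val))
best_n = val_scores.index(max(val_scores)) + 1
print(f"best validation accuracy with {best_n} estimators")
```

Swapping `max_depth=1` for a deeper tree in this sketch is an easy way to compare a stump against a more complex base estimator.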

# Model Complexity Curve

See Figure below for the Model Complexity curve for Boosting. Here, we see that varying the n_estimators hyperparameter caused the training curve to continuously increase while keeping the validation curve relatively steady. This graph shows that overfitting is likely as the number of estimators grows. I chose 18 as the optimal number of estimators to use with the base estimator of a 1-level decision tree.

# Learning Curve

In the Learning Curve plot below, we also see that the training and validation scores converge to similar values as the size of the training set increases. This shows that our model has low bias and low variance, so it generalizes well to new data. The final training accuracy was **92.3338%** and testing accuracy was **87.7099%**. While it performed better than k-NN, the Boosting model had around the same accuracy score as a 3-level Decision Tree. This could be because of outliers in the data (All-Stars who didn’t have good stats but were voted in due to popularity), to which Boosting assigns increasingly higher weight.

# Timing Curve

The last figure below shows the timing curves for training vs prediction. The timing curve shows that Boosting is an eager learner, since prediction takes constant time and training time is O(n).

Overall, Boosting performed well for this problem. The final testing accuracy was around the same as for Decision Trees. Boosting was considerably slower than Decision Trees without providing significant gains in accuracy. This led me to conclude that regular Decision Trees were sufficient for this problem and that boosting was not strictly necessary.

This concludes my investigation of Boosting to classify NBA All-Stars. The next post will cover Artificial Neural Networks and their performance on the same problem.

NOTE: This project was completed as part of my Masters coursework. Specifically, for the Machine Learning course.

# Predicting NBA-All Stars Using k-Nearest Neighbors (Part 2/6)

# Introduction

This post is the third post in the series on using various Supervised Learning techniques to classify NBA All-Stars. In a previous post, I introduced the problem, the data and the general approach I took for each algorithm. In this post, I will discuss how k-Nearest Neighbors performed on this problem.

# k-Nearest Neighbors

K-Nearest Neighbors is an instance-based learning algorithm. Rather than constructing a generalized model of the classification problem, it stores all training samples and classifies the testing data by taking a majority vote of the k nearest neighbors to the query point. Overfitting can occur in k-NN when we use a k value of 1. When k=1, each sample is in a neighborhood of its own and results in a model with low bias and high variance. As we increase k, we reduce the complexity in the model and we also reduce overfitting. In my experiments for k-NN, I varied the k parameter and used uniform weights for computing the majority votes for the nearest neighbors.
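The effect of k can be seen in a small sketch with scikit-learn's `KNeighborsClassifier` and uniform weights. The data is synthetic and the k values 1, 3 and 15 are arbitrary picks for this example, not the grid from my experiments.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the NBA data (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for k in (1, 3, 15):
    # Uniform weights: each of the k neighbors gets an equal vote.
    knn = KNeighborsClassifier(n_neighbors=k, weights="uniform")
    knn.fit(X_train, y_train)
    scores[k] = (knn.score(X_train, y_train), knn.score(X_val, y_val))

for k, (train_acc, val_acc) in scores.items():
    print(f"k={k}: train={train_acc:.3f}, validation={val_acc:.3f}")
```

With k=1 every training sample is its own nearest neighbor, so the training accuracy is perfect even though the validation accuracy is not, which is the overfitting signature described above.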

# Model Complexity Curve

See the figure below for the Model Complexity curve for the NBA All-Stars problem, which shows that overfitting occurs when k=1 and decreases as k increases. The peak validation accuracy occurred at k=3, so I used this value for the learning curve and for computing testing accuracy. Increasing k decreases variance and increases bias, so in choosing k=3, I prioritized low bias (higher accuracy) over low variance. Note that this model is also unaffected by the unbalanced nature of the NBA data, since it only considers a few closest samples for classification.

# Learning Curve

The Learning Curve in the figure below shows that this model isn’t as good at classifying NBA All-Stars as Decision Trees were. As we give the model more training examples, the accuracy scores continue to increase and there is a large gap between the training and validation scores. This shows that the model suffers from high variance, which means more training data or a larger k value will help improve the model’s performance. The final training accuracy was *84.5761%* and testing accuracy was *76.4208%*. As expected, the accuracy is lower than that of Decision Trees.

# Timing Curve

The last figure below shows the timing curves for training vs prediction. k-NN is a lazy learning model, which means that the algorithm spends a constant amount of time “learning” (or remembering) the model and returns classifications in non-constant time. This is evident in the figure, which shows classification time increasing linearly with the amount of data, while the training time remains constant.

Overall, k-NN struggled to perform well for this problem. The final testing accuracy was lower than for Decision Trees. This is to be expected since with a k value of 3, I’m only considering three players that are in the same neighborhood with respect to season stats. Using a greater value of k is also not a good idea, since validation accuracy peaked at k=3 and larger values increase bias. This tradeoff also demonstrates the No Free Lunch Theorem. Specifically, it demonstrates that the structure utilized by this model is not optimized for the problem of finding NBA All-Stars. In the conclusion post, I will apply this model to the stats from 2018 to see how well the model does on recent data.

This concludes my investigation of k-NN to classify NBA All-Stars. The next post will cover Boosting and its performance on the same problem.

NOTE: This project was completed as part of my Masters coursework. Specifically, for the Machine Learning course.

# Predicting NBA-All Stars Using Decision Trees (Part 1/6)

# Introduction

This post is the second post in the series on using various Supervised Learning techniques to classify NBA All-Stars. In the previous post, I introduced the problem, the data and the general approach I took for each algorithm. In this post, I will discuss how [Decision Trees](https://en.wikipedia.org/wiki/Decision_tree) performed on this problem.

# Decision Trees

Decision Trees (DTs) classify samples by inferring simple decision rules from the data features. They build a tree structure where each internal node represents an if/then rule based on some attribute. Classification of new samples is done by traversing the tree and using the leaf node values. Some advantages of DTs are that they are easy to understand, implement and visualize. One big disadvantage of DTs is that they are prone to overfitting on the training data; overfitting occurs when the algorithm builds a tree that performs really well on the training data, capturing all the noise and unwanted features. For my experiments, I avoided overfitting by pruning the tree (specifying max_depth). In scikit-learn’s DecisionTreeClassifier class, I used the Gini Impurity as the function to measure the quality of a split and specified a balanced class weighting measure. All other parameters used the defaults provided by scikit-learn.
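The configuration described above looks roughly like the sketch below. The synthetic, imbalanced 50-feature dataset and `max_depth=3` are assumptions for this example; the real NBA features are not reproduced here. The sketch also lists which feature indices the fitted tree actually splits on.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the NBA data (assumption):
# 50 features, roughly 10% positive class.
X, y = make_classification(n_samples=1000, n_features=50, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)

tree = DecisionTreeClassifier(
    criterion="gini",         # Gini Impurity measures the quality of a split
    max_depth=3,              # limit depth to keep the tree from overfitting
    class_weight="balanced",  # reweight classes inversely to their frequency
    random_state=0,
)
tree.fit(X, y)

# Features with non-zero importance are the ones used in at least one split.
used_features = np.flatnonzero(tree.feature_importances_)
print(f"depth={tree.get_depth()}, features used={used_features.tolist()}")
```

A tree of depth 3 has at most 7 internal nodes, so it can use at most 7 distinct features no matter how many are supplied.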

# Model Complexity Curve

The figure above shows the final 3-level DT generated for the dataset I supplied. I was surprised to see that only 5 of the 50 attributes were considered in the tree: PER (Player Efficiency Rating), FTA (Free Throws Attempted), PPG (Points Per Game), 2P (Two Pointers Made) and DWS (Defensive Win Shares). I decided on using a tree with only 3 levels after generating the Model Complexity Curve. See the figure below for those results. The results in that graph demonstrate that using a tree with a max_depth of 3 generated the highest cross-validation score. As we increase the depth of the tree, we see that the training score continues to improve while the validation score declines. This demonstrates that adding more levels to the tree leads to overfitting on the training data, which causes the tree to not generalize well on unseen data.

# Learning Curve

Next, I ran a Learning Curve experiment on the 3-level Decision Tree. The results are captured in the figure below. This graph shows that both the training and cross-validation scores converge to around the same value: 90-92% accuracy. This demonstrates that my Decision Tree model is ideal, in that it does not suffer from high bias or variance. Instead, this model will generalize well to previously-unseen testing data which is confirmed in the final testing results. The final training accuracy was **92.4931%** and testing accuracy was **89.2402%**. Additionally, this model required only about 3000 samples to achieve an accuracy of around 90%, which shows that this model generalizes well without requiring too much more data.
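A learning curve experiment like this one can be sketched with scikit-learn's `learning_curve`. The data below is synthetic, and the five training sizes and `cv=5` are assumed values for this example rather than the settings behind the figure.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the NBA data (assumption).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(max_depth=3, random_state=0),
    X, y,
    train_sizes=np.linspace(0.1, 1.0, 5),  # fractions of the training folds
    cv=5, scoring="accuracy",
)

# Rows are training-set sizes, columns are cross-validation folds.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
gap = mean_train - mean_val  # a small gap suggests low variance

for n, tr, va in zip(sizes, mean_train, mean_val):
    print(f"n={n}: train={tr:.3f}, validation={va:.3f}")
```

When the two averaged curves approach each other at a high accuracy, as described for the 3-level tree, the model is suffering from neither high bias nor high variance.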

# Timing Curve

The last figure below shows the timing curves for training vs prediction. Decision Trees are an eager learning model, which means that the algorithm spends a linear amount of time learning from the data and returns classifications in constant time. This is evident in the figure, which shows training time increasing linearly with the amount of data, while the classification time remains constant.

Overall, Decision Trees performed well for this problem. The final testing accuracy was higher than I expected. In the conclusion post, I will apply this model to the stats from 2018 to see how well the model does on recent data.

This concludes my investigation of Decision Trees to classify NBA All-Stars. The next post will cover k-Nearest Neighbors and its performance on the same problem.