Introduction

This is the third post in my series on using various Supervised Learning techniques to classify NBA All-Stars. In a previous post, I introduced the problem, the data, and the general approach I took for each algorithm. In this post, I will discuss how k-Nearest Neighbors performed on this problem.

k-Nearest Neighbors

k-Nearest Neighbors is an instance-based learning algorithm. Rather than constructing a generalized model of the classification problem, it stores all of the training samples and classifies the testing data by taking a majority vote of the k nearest neighbors to the query point. Overfitting occurs in k-NN when k=1: each training sample sits in a neighborhood of its own, which produces a model with low bias and high variance. As we increase k, we reduce the model's complexity and with it the overfitting. In my experiments for k-NN, I varied the k parameter and used uniform weights when computing the majority vote over the nearest neighbors.
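To make the setup concrete, here is a minimal sketch of that experiment, assuming scikit-learn's KNeighborsClassifier; load_nba_season_stats is a hypothetical placeholder for loading the season-stats features and All-Star labels described in the earlier post, and the split parameters are illustrative.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical loader for the per-season player stats and All-Star labels.
X, y = load_nba_season_stats()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Instance-based learner: fit() just stores the training samples, and
# predict() takes a majority vote (uniform weights) over the k nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=3, weights="uniform")
knn.fit(X_train, y_train)
print("Testing accuracy:", knn.score(X_test, y_test))
```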

Model Complexity Curve

The figure below shows the Model Complexity curve for the NBA All-Stars problem: overfitting occurs when k=1 and decreases as k increases. The peak validation accuracy I saw was at k=3, so I used this value as the basis for the learning curve and for computing testing accuracy. Increasing k decreases variance and increases bias, so in choosing k=3, I prioritized low bias (higher accuracy) over low variance. Note that this model is also unaffected by the unbalanced nature of the NBA data, since it only considers a few of the closest samples for classification.

Model Complexity Curve
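The curve itself can be generated along the lines of the following sketch, again assuming scikit-learn; plotting is omitted, and the 5-fold cross-validation and the range of k values are assumptions rather than settings stated above.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

k_values = np.arange(1, 21)
train_scores, val_scores = validation_curve(
    KNeighborsClassifier(weights="uniform"),
    X_train, y_train,
    param_name="n_neighbors",
    param_range=k_values,
    cv=5,                 # assumed 5-fold cross-validation
    scoring="accuracy",
)

# Overfitting shows up as a large train/validation gap at k=1 that shrinks
# as k grows; the k with the best mean validation accuracy is kept (k=3 here).
best_k = k_values[val_scores.mean(axis=1).argmax()]
print("Best k:", best_k)
```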

Learning Curve

The Learning Curve in the figure below shows that this model isn't as good at classifying NBA All-Stars as Decision Trees were. As the model is given more training examples, both accuracy scores continue to increase, and there is a large gap between the training and validation scores. This shows that the model suffers from high variance, which suggests that more training data or a larger k value could improve the model's performance. The final training accuracy was 84.5761% and the testing accuracy was 76.4208%. As expected, the accuracy is lower than that of Decision Trees.

Learning Curve
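A learning curve like the one above can be computed as in this sketch, assuming scikit-learn and the k=3, uniform-weights model chosen earlier; the grid of training sizes and the cross-validation folds are illustrative assumptions.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

train_sizes, train_scores, val_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=3, weights="uniform"),
    X_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 10),   # illustrative grid of training sizes
    cv=5,
    scoring="accuracy",
)

# A persistent gap between the training and validation curves, with both
# still rising, is the high-variance pattern described above.
print("Mean training accuracy:  ", train_scores.mean(axis=1))
print("Mean validation accuracy:", val_scores.mean(axis=1))
```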

Timing Curve

The last figure below shows the timing curves for training vs. prediction. k-NN is a lazy learning model: the algorithm spends a roughly constant amount of time "learning" (really just storing the training samples) and returns classifications in non-constant time. This is evident in the figure, which shows classification time increasing linearly with the amount of data while the training time remains constant.

Timing Curve
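The timing comparison can be reproduced with a loop along these lines, assuming scikit-learn; the subset fractions and the use of time.perf_counter are illustrative choices, not the exact harness behind the figure.

```python
import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

for frac in np.linspace(0.1, 1.0, 10):
    n = int(frac * len(X_train))
    knn = KNeighborsClassifier(n_neighbors=3, weights="uniform")

    start = time.perf_counter()
    knn.fit(X_train[:n], y_train[:n])   # lazy learner: essentially just stores the data
    fit_time = time.perf_counter() - start

    start = time.perf_counter()
    knn.predict(X_test)                 # the distance computations happen here
    predict_time = time.perf_counter() - start

    print(f"n={n}: fit={fit_time:.4f}s, predict={predict_time:.4f}s")
```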

Overall, k-NN struggled to perform well on this problem. The final testing accuracy was lower than that of Decision Trees. This is to be expected, since with a k value of 3 I'm only considering three players in the same neighborhood with respect to season stats. Using a greater value of k is also not a good idea, since the model complexity curve showed validation accuracy peaking at k=3; beyond that, the model underfits the data. This tradeoff also illustrates the No Free Lunch Theorem: the structure this model exploits is not well suited to the problem of finding NBA All-Stars. In the conclusion post, I will apply this model to the stats from 2018 to see how well it does on recent data.

This concludes my investigation of using k-NN to classify NBA All-Stars. The next post will cover Boosting and its performance on the same problem.

NOTE: This project was completed as part of my Master's coursework, specifically for the Machine Learning course.
