Introduction

This is the fourth post in the series on using various Supervised Learning techniques to classify NBA All-Stars. In a previous post, I introduced the problem, the data, and the general approach I took for each algorithm. In this post, I will discuss how Boosted Decision Trees performed on this problem.

Boosting

Boosting is an ensemble learning technique that combines the predictions of several weak learners to produce a stronger overall prediction. Weak learners are models whose predictions are only slightly better than choosing at random. The final prediction is usually made by a weighted majority vote of all estimators. For my experiments, I used a specific boosting algorithm, AdaBoost. AdaBoost iteratively re-applies the learning process to a weighted distribution over the training examples: at each step, the examples that were classified incorrectly are weighted higher for the next iteration, and the examples that were classified correctly are weighted lower.

I used a single-level Decision Tree (also known as a decision stump) as the base estimator for AdaBoost, and experimented with various numbers of estimators to identify the best estimator count for my model. One interesting observation was that using a Decision Tree of greater depth (3 levels or more) caused the boosted model to underperform. After some research and reasoning, I concluded that the larger, more complex underlying decision tree was overfitting, which caused the boosted ensemble to overfit as well. Using a less complex base estimator, one that was less likely to overfit, improved the performance of the boosted model.
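A minimal sketch of this setup with scikit-learn (X_train, y_train, X_test, and y_test are assumed to come from the stratified split described in the introduction post; note that newer scikit-learn versions name the base_estimator parameter estimator):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# A decision stump (1-level tree) serves as the weak base estimator
stump = DecisionTreeClassifier(max_depth=1)

# AdaBoost reweights misclassified samples at each iteration
boosted = AdaBoostClassifier(base_estimator=stump, n_estimators=18)
boosted.fit(X_train, y_train)
print(boosted.score(X_test, y_test))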

Model Complexity Curve

See the figure below for the Model Complexity curve for Boosting. Here, we see that increasing the n_estimators hyperparameter caused the training score to increase continuously while the validation score stayed relatively steady. This graph shows that overfitting becomes likely as the number of estimators grows. I chose 18 as the optimal number of estimators to use with the base estimator of a 1-level decision tree.

Model Complexity Curve
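For reference, here is a sketch of how the data behind such a curve can be computed with scikit-learn's validation_curve (the parameter range is illustrative, and X_train/y_train are assumed as before):

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Score the boosted model across a range of estimator counts
stump = DecisionTreeClassifier(max_depth=1)
train_scores, val_scores = validation_curve(
    AdaBoostClassifier(base_estimator=stump), X_train, y_train,
    param_name="n_estimators", param_range=np.arange(1, 51), cv=5)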

Learning Curve

In the Learning Curve plot below, we also see that the training and validation scores converge to similar values as the size of the training set increases. This shows that our model has low bias and low variance, so it generalizes well to new data. The final training accuracy was 92.3338% and testing accuracy was 87.7099%. While it performed better than k-NN, the Boosting model had around the same accuracy score as a 3-level Decision Tree. This could be because of outliers in the data (All-Stars who didn’t have good stats but were voted in due to popularity), whose importance Boosting amplifies through its reweighting scheme.

Learning Curve
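A sketch of how the learning curve data can be generated; the 20% increments mirror the methodology from the introduction post:

import numpy as np
from sklearn.model_selection import learning_curve

# boosted: the AdaBoost model constructed earlier
train_sizes, train_scores, val_scores = learning_curve(
    boosted, X_train, y_train,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)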

Timing Curve

The last figure below shows the timing curves for training vs. prediction. The timing curve shows that Boosting is an eager learner: training time grows linearly with the number of samples (O(n)), while prediction takes roughly constant time.

Timing Curve
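A minimal sketch of how these timings can be measured (assuming the boosted model and training data from above):

import time
import numpy as np

train_times, predict_times = [], []
for n in np.linspace(100, len(X_train), 10, dtype=int):
    start = time.perf_counter()
    boosted.fit(X_train[:n], y_train[:n])  # training cost grows with n
    train_times.append(time.perf_counter() - start)

    start = time.perf_counter()
    boosted.predict(X_train[:n])           # prediction is comparatively cheap
    predict_times.append(time.perf_counter() - start)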

Overall, Boosting performed well for this problem. The final testing accuracy was around the same as for Decision Trees, but Boosting was considerably slower without providing significant gains in accuracy. This led me to conclude that regular Decision Trees were sufficient for this problem and that boosting was not strictly necessary.

This concludes my investigation of Boosting to classify NBA All-Stars. The next post will cover Artificial Neural Networks and their performance on the same problem.

NOTE: This project was completed as part of my Master’s coursework, specifically for the Machine Learning course.

Introduction

This is the third post in the series on using various Supervised Learning techniques to classify NBA All-Stars. In a previous post, I introduced the problem, the data, and the general approach I took for each algorithm. In this post, I will discuss how k-Nearest Neighbors performed on this problem.

k-Nearest Neighbors

K-Nearest Neighbors is an instance-based learning algorithm. Rather than constructing a generalized model of the classification problem, it stores all training samples and classifies the testing data by taking a majority vote of the k nearest neighbors to the query point. Overfitting can occur in k-NN when we use a k value of 1. When k=1, each sample is in a neighborhood of its own and results in a model with low bias and high variance. As we increase k, we reduce the complexity in the model and we also reduce overfitting. In my experiments for k-NN, I varied the k parameter and used uniform weights for computing the majority votes for the nearest neighbors.
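A minimal sketch of this configuration with scikit-learn (k=3 matches the value I ultimately chose below; X_train, y_train, and X_test are assumed to be the standardized splits from the introduction post):

from sklearn.neighbors import KNeighborsClassifier

# Uniform weights: each of the k nearest neighbors gets an equal vote
knn = KNeighborsClassifier(n_neighbors=3, weights="uniform")
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)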

Model Complexity Curve

See the figure below for the Model Complexity curve for the NBA All-Stars problem, which shows that overfitting occurs when k=1 and decreases as k increases. The peak validation accuracy I saw was at k=3, so I used this as the basis for the learning curve and for computing testing accuracy. Increasing k decreases variance and increases bias, so in choosing a small k=3, I prioritized low bias (higher accuracy) over low variance. Note that this model is also largely unaffected by the unbalanced nature of the NBA data, since it only considers a few of the closest samples for classification.

Model Complexity Curve

Learning Curve

The Learning Curve in the figure below shows that this model isn’t as good at classifying NBA All-Stars as Decision Trees were. As we give the model more training examples, the accuracy scores continue to increase, and there is a large gap between the training and validation scores. This shows that the model suffers from high variance, which suggests that more training data or a larger k value would help improve the model’s performance. The final training accuracy was 84.5761% and testing accuracy was 76.4208%. As expected, the accuracy is lower than that of Decision Trees.

Learning Curve

Timing Curve

The last figure below shows the timing curves for training vs. prediction. k-NN is a lazy learning model, which means that the algorithm spends a constant amount of time “learning” (really, just storing) the model and returns classifications in non-constant time. This is evident in the figure, which shows classification time increasing linearly with the amount of data, while the training time remains constant.

Timing Curve

Overall, k-NN struggled to perform well for this problem. The final testing accuracy was lower than for Decision Trees. This is to be expected, since with a k value of 3, I’m only considering three players in the same neighborhood with respect to season stats. Using a much greater value of k is also not a good idea, since the model would underfit, smoothing over the local structure in the training data. This tradeoff also demonstrates the No Free Lunch Theorem. Specifically, it demonstrates that the structure utilized by this model is not optimized for the problem of finding NBA All-Stars. In the conclusion post, I will apply this model to the stats from 2018 to see how well the model does on recent data.

This concludes my investigation of k-NN to classify NBA All-Stars. The next post will cover Boosting and its performance on the same problem.

NOTE: This project was completed as part of my Master’s coursework, specifically for the Machine Learning course.

Introduction

This is the second post in the series on using various Supervised Learning techniques to classify NBA All-Stars. In the previous post, I introduced the problem, the data, and the general approach I took for each algorithm. In this post, I will discuss how [Decision Trees](https://en.wikipedia.org/wiki/Decision_tree) performed on this problem.

Decision Trees

Decision Trees (DTs) classify samples by inferring simple decision rules from the data features. They build a tree structure where each internal node represents an if/then rule based on some attribute. Classification of new samples is done by traversing the tree and using the leaf node values. Some advantages of DTs are that they are easy to understand, implement, and visualize. One big disadvantage of DTs is that they are prone to overfitting on the training data; overfitting occurs when the algorithm builds a tree that performs really well on the training data by capturing all of its noise and unwanted features. For my experiments, I avoided overfitting by pruning the tree (specifying max_depth). In scikit-learn’s DecisionTreeClassifier class, I used Gini Impurity as the function to measure the quality of a split and specified a balanced class weighting measure. All other parameters used the defaults provided by scikit-learn.
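In code, that configuration looks roughly like this (a sketch; max_depth=3 reflects the final tree discussed below, and X_train/y_train are assumed from the earlier split):

from sklearn.tree import DecisionTreeClassifier

# Gini impurity measures split quality; balanced class weights account
# for the rarity of All-Stars; max_depth prunes the tree
dt = DecisionTreeClassifier(criterion="gini",
                            class_weight="balanced",
                            max_depth=3)
dt.fit(X_train, y_train)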

3-level Decision Tree for NBA Data

Model Complexity Curve

The figure above shows the final 3-level DT generated for the dataset I supplied. I was surprised to see that only 5 of the 50 attributes were considered in the tree: PER (Player Efficiency Rating), FTA (Free Throws Attempted), PPG (Points Per Game), 2P (Two Pointers Made), and DWS (Defensive Win Shares). I decided on a tree with only 3 levels after generating the Model Complexity Curve; see the figure below for those results. The results in that graph demonstrate that a tree with a max_depth of 3 generated the highest cross-validation score. As we increase the depth of the tree, we see that the training score continues to improve while the validation score declines. This demonstrates that adding more levels to the tree leads to overfitting on the training data, which causes the tree to not generalize well on unseen data.

Model Complexity Curve

Learning Curve

Next, I ran a Learning Curve experiment on the 3-level Decision Tree. The results are captured in the figure below. This graph shows that both the training and cross-validation scores converge to around the same value: 90-92% accuracy. This demonstrates that my Decision Tree model is close to ideal, in that it does not suffer from high bias or high variance. Instead, this model should generalize well to previously-unseen testing data, which is confirmed in the final testing results. The final training accuracy was 92.4931% and testing accuracy was 89.2402%. Additionally, this model required only about 3000 samples to achieve an accuracy of around 90%, which shows that it generalizes well without requiring much more data.

Learning Curve

Timing Curve

The last figure below shows the timing curves for training vs. prediction. Decision Trees are an eager learning model, which means that the algorithm spends a linear amount of time learning the model and returns classifications in constant time. This is evident in the figure, which shows training time increasing linearly with the amount of data, while the classification time remains constant.

Timing Curve

Overall, Decision Trees performed well for this problem. The final testing accuracy was higher than I expected. In the conclusion post, I will apply this model to the stats from 2018 to see how well the model does on recent data.

This concludes my investigation of Decision Trees to classify NBA All-Stars. The next post will cover k-Nearest Neighbors and its performance on the same problem.

NOTE: This project was completed as part of my Master’s coursework, specifically for the Machine Learning course.

Introduction

This post is the introduction to a project I worked on recently: using Supervised Learning techniques to predict new NBA All-Stars based on player statistics from previous seasons. The techniques explored include Pruned Decision Trees, k-Nearest Neighbors, Boosted Decision Trees, Artificial Neural Networks, and Support Vector Machines. Each of these will be covered in subsequent blog posts. This post introduces the problem and explains the general methodology.

Data, Data, Data

The dataset is merged from two sources: NBA All-Stars from 2000-2016 and season stats for every player from 2000-2016. Each row in the dataset contains a single player’s statistics over a season. There are 50 total features, including total points/assists/rebounds, field goal/free throw/3-point percentages, and more advanced stats derived from formulas (like PER, USG%, etc.). A single column specifies whether that player made the All-Star game that season. Using this training data, the goal was to classify the players from the 2018/2019 season as All-Stars or non-All-Stars. This dataset has some interesting characteristics that made it an appealing choice to investigate. First, the data is heavily unbalanced: of the 8069 player-seasons in the dataset, only 463 are actually All-Stars (only ~5.7%). The number of negative examples greatly outweighs the number of positives. This imbalance needed to be accounted for when calculating accuracy or plotting curves. See General Approach for more details.
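As a quick illustration, the imbalance can be checked with pandas (the file and column names here are hypothetical placeholders for my merged dataset):

import pandas as pd

df = pd.read_csv("nba_player_seasons.csv")  # hypothetical file name
counts = df["all_star"].value_counts()      # hypothetical label column
print(counts / len(df))  # roughly 94.3% non-All-Stars vs. 5.7% All-Stars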

General Approach

I used Python’s scikit-learn library for running the Supervised Learning algorithms, the pandas library for data manipulation and analysis, and matplotlib for data visualization and plotting. For each of the five algorithms, I used the same general procedure for my analysis, which included:

  - a Model Complexity Curve (also called a Validation Curve) to help tune hyperparameters,
  - a Learning Curve to find the lower bound on the number of samples needed to learn the model and to investigate any issues due to high bias or high variance,
  - a Timing Curve to investigate the runtime performance of training and testing, and
  - for some algorithms, an Iteration Curve to investigate accuracy as the number of iterations increased.

At a high level, here is the general methodology I followed when analyzing each algorithm:

  1. Split 80% of the dataset into a training set and 20% into a testing set, using a stratified split
  2. Standardize features by removing the mean and scaling to unit variance
  3. Using the 80% training set, find the best hyperparameters for tuning the model via a Model Complexity Curve (with 5-fold cross validation)
  4. Using the same 80% training set and the best estimator identified in Step 3, plot the Learning Curve for the model (with 5-fold cross validation, using 20% of the data incrementally accumulated)
  5. Calculate the training accuracy of the model on the 80% of the dataset from Step 1
  6. Calculate the test accuracy of the model on the 20% of the dataset held out in Step 1
  7. Plot the iteration or timing curves, as necessary, depending on the model

In Step 1, the initial split is done using stratified sampling. This ensures that both the training and test sets contain the same percentage of each class (e.g., All-Stars and non-All-Stars).
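For example (a sketch, with X and y standing in for the feature matrix and labels):

from sklearn.model_selection import train_test_split

# stratify=y preserves the All-Star/non-All-Star ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)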

Step 2 is done using scikit-learn’s StandardScaler to ensure that each attribute’s variance is on the same order and that the mean is centered around 0. This is required for certain models like SVMs and k-NN (to ensure that an attribute with a large variance doesn’t dominate the objective function). See the image below for an example of the StandardScaler applied to a few sample attributes from the NBA dataset.
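In code, this looks roughly like the following (note that the scaler is fit on the training data only and then applied to both splits):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean/variance from training data
X_test = scaler.transform(X_test)        # reuse them on the test data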

Steps 3 and 4 both use built-in scikit-learn functions for computing the Validation and Learning Curves. My experiments used a custom sample-weight scoring function that automatically adjusts weights inversely proportional to class frequencies in the input data. This is done to ensure that the unbalanced dataset can be accurately scored.

Steps 5 and 6 also used a similar “balanced” sample weight scoring function to account for unbalanced data. Additionally, my experiments used 5-fold cross validation to prevent overfitting on the training data, at the cost of increased run times.
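A sketch of such a scorer, mirroring scikit-learn’s “balanced” class-weight heuristic (this is an assumed reconstruction, not my exact original code):

from sklearn.metrics import accuracy_score, make_scorer
from sklearn.utils.class_weight import compute_sample_weight

def balanced_accuracy(y_true, y_pred):
    # Weight each sample inversely proportional to its class frequency
    weights = compute_sample_weight("balanced", y_true)
    return accuracy_score(y_true, y_pred, sample_weight=weights)

scorer = make_scorer(balanced_accuracy)  # pass as scoring= to the curve functions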

StandardScaler applied to a few features

This concludes a high-level introduction to my project of classifying NBA All-Stars. Subsequent posts will go into the results I observed for each algorithm.

NOTE: This project was completed as part of my Master’s coursework, specifically for the Machine Learning course.

Introduction

I recently finished the Functional Programming Principles in Scala course on Coursera. Having primarily programmed in an imperative language (Java), I enjoyed all the goodies provided by a functional language. Overall, I enjoyed the class very much, and I recommend it to anyone looking for an introduction to functional programming. In this post, I will go through some of my favorite features of Scala, many of which are missing from Java.

Pattern Matching

Pattern matching is a powerful feature in Scala with no easy equivalent in Java. Java has a switch statement to choose an execution path based on some equality condition. Conventionally, it is used mostly to improve readability as a replacement for many if/else branches. The Java switch is limited to a few primitive types (int, byte, char, short) and String (as of Java 7). Scala overcomes this limitation with a match keyword that allows matches on complex types, including matching on types, collections, and even specific objects.

Type Matching

Here’s an example that shows pattern matching on types.

def printType(x: Any): String = x match {
  case x: Int => "I'm an integer: " + x.toString
  case y: Double => "I'm a double: " + y.toString
  case z: String => "I'm a String: " + z
  case _ => "IDK!"
}

Collection Matching

Pattern matching also works with Scala’s collections library. In this next snippet, we match on Lists of size 0, 1, 2, and more than 2.

def describeList(list: List[Any]): String = list match {
  case Nil => "Empty"
  case elem :: Nil => "One element"
  case elem1 :: elem2 :: Nil => "Two Elements"
  case _ => "More than Two elements"
}

Case class matching

Scala also provides a construct called a Case Class. Case classes can be used in pattern matching like so:

case class Team(city: String, teamName: String)

val teams = List(Team("Washington", "Wizards"), Team("New York", "Giants"))

for (team <- teams) {
  team match {
    case Team("Washington", "Wizards") => println("GO WIZ")
    case _ => println("Meh")
  }
}

Strong Typing

Scala features a strong type system: types are checked at compile time, the compiler infers types, and functions are first-class values with types of their own.

Similar to Java Generics, Scala supports classes parameterized with types. But unlike Java, Scala provides more expressive type bounds. For example, the bound A <: B in List[A <: B] requires that the element type A be a subtype of B. Similarly, List[A >: B] requires that A be a supertype of B.

The type system gets quite complicated. More details can be found here.

Syntactic Goodies

Scala provides many utilities that allow for more expressive code. While these features can be implemented in Java, Scala provides them out of the box. Here are a few goodies that I think are cool:

Tuples

Scala uses a tuple type to group a fixed number of values together without defining a class. A tuple can hold up to 22 elements. Here is an example of the syntax:

val tuple = ("A", 1, 342.1)
tuple._1 // use ._# to access element # of a tuple
tuple.getClass // tuple is of type Tuple3

Ranges

Scala provides a Range class to represent a range of integer values. until creates a range that excludes the end value, while to includes it. A by clause can define the step between consecutive values. Here’s an example:

val range = 0 until 5 // [0, 1, 2, 3, 4]
val range1 = 10 to 0 by -2  // [10, 8, 6, 4, 2, 0]

for-comprehension

Scala also provides a pseudo-DSL for manipulating collections. Similar to list comprehensions in Python, Scala provides a more readable method of calling map/flatMap/filter on collections. Here is a canonical example of this feature:

case class Book(author: String, title: String)

// Sample data so the snippet runs on its own
val books = List(Book("bob", "FP Basics"), Book("alice", "OOP Basics"))

for {
  book <- books
  if book.author startsWith "bob"
} yield book.title

//the above is equivalent to the following:

books.withFilter(book => book.author startsWith "bob").map(book => book.title)

Conclusion

After spending the last few weeks working with Scala, I have developed a fondness for the simplicity and expressiveness of functional languages. I am still a little overwhelmed by the number of features available in the Scala language. There are quite a few “Good Parts”, and this post covers only a small number of them. I am looking forward to exploring the new functional features of Java 8 with the new perspective I gained from studying Scala!