The dataset is merged from two sources: NBA All-Stars from 2000-2016 and season stats for every player from 2000-2016. Each row in the dataset contains a single player's statistics over a season. There are 50 total features, including total points/assists/rebounds, field goal/free throw/3-point percentages, and more advanced stats derived from formulas (like PER, USG%, etc.). A single column specifies whether that player made the All-Star game that season. Using this training data, the goal was to classify the players from the 2018/2019 season as All-Stars or non-All-Stars. This dataset has some interesting characteristics that made it an appealing choice to investigate. First, the data is heavily unbalanced: of the 8069 rows in the dataset, only 463 are actually All-Stars (only ~5.7%). The number of negative examples greatly outweighs the number of positive examples. This imbalance needed to be accounted for when calculating accuracy or plotting curves. See General Approach for more details.
I used Python's scikit-learn library for running the Supervised Learning algorithms. I used the pandas library for data manipulation and analysis and matplotlib for data visualization and plotting. For each of the five algorithms, I used the same general procedure for my analysis. My analysis included: a Model Complexity Curve (also called a Validation Curve) to help tune hyperparameters; a Learning Curve to find the lower bound on the number of samples needed to learn the model and to investigate any issues due to high bias or high variance; a timing curve to investigate the runtime performance of training and testing; and, for some algorithms, an iteration curve to investigate accuracy as the number of iterations increased. At a high level, here is the general methodology I followed when analyzing each algorithm:
Split 80% of the dataset into a training set and 20% into a testing set, using a stratified split
Standardize features by removing the mean and scaling to unit variance
Using the 80% training set, plot a Model Complexity Curve to find the best hyperparameters for the model (with 5-fold cross-validation)
Using the same 80% training set and the best estimator identified in Step 3, plot the Learning Curve for the model (with 5-fold cross-validation, increasing the training size in 20% increments)
Calculate the training accuracy of the model on the 80% of the dataset from Step 1
Calculate the test accuracy of the model on the 20% of the dataset held out in Step 1
Plot the iteration or timing curves, as necessary, depending on the model
In Step 1, the initial split is done using stratified sampling. This is done to ensure that the split contains the same percentage of each class (in this case, All-Stars and non-All-Stars) in both the training and test sets.
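In scikit-learn, that split looks something like this (the file and column names here are placeholders for the actual dataset):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Hypothetical file/column names for the merged All-Star + season-stats dataset
    df = pd.read_csv("nba_player_seasons_2000_2016.csv")
    X = df.drop(columns=["all_star"])
    y = df["all_star"]

    # Stratifying on the label keeps the same All-Star ratio in the train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)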
Step 2 is done using scikit-learn's StandardScaler to ensure that each attribute's variance is on the same order and that its mean is centered around 0. This is required for certain models like SVMs and k-NN (to ensure that an attribute with a large variance doesn't dominate the objective function). See the image below for an example of the StandardScaler applied to a few sample attributes from the NBA dataset.
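Continuing the sketch above, the scaler is fit on the training split only and then applied to both splits:

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)  # learn mean/variance from training data only
    X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test data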
Steps 3 and 4 both use built-in functions for plotting the Validation and Learning Curves. My experiments used a custom sample weight scoring function that automatically adjusts weights inversely proportional to class frequencies in the input data. This is done to ensure that the unbalanced dataset can be accurately scored.
Steps 5 and 6 also used a similar “balanced” sample weight scoring function to account for unbalanced data. Additionally, my experiments used 5-fold cross validation to prevent overfitting on the training data, at the cost of increased run times.
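As a rough sketch of what that balanced scoring and tuning could look like (using k-NN as a stand-in for whichever algorithm is being tuned; the exact weighting function in my experiments may differ slightly):

    from sklearn.metrics import accuracy_score, make_scorer
    from sklearn.model_selection import validation_curve
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.utils.class_weight import compute_sample_weight

    def weighted_accuracy(y_true, y_pred):
        # Weight each sample inversely proportional to its class frequency
        weights = compute_sample_weight("balanced", y_true)
        return accuracy_score(y_true, y_pred, sample_weight=weights)

    balanced_scorer = make_scorer(weighted_accuracy)

    # Model Complexity (Validation) Curve over one hyperparameter, with 5-fold CV
    train_scores, cv_scores = validation_curve(
        KNeighborsClassifier(), X_train_scaled, y_train,
        param_name="n_neighbors", param_range=list(range(1, 31)),
        cv=5, scoring=balanced_scorer)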
This concludes a high-level introduction to my project of classifying NBA All-Stars. Subsequent posts will go into the results I observed for each algorithm.
NOTE: This project was completed as part of my Master's coursework, specifically for the Machine Learning course.
I recently finished the Functional Programming Principles in Scala course on Coursera. Having primarily programmed in an imperative language (Java), I enjoyed all the goodies provided by a functional language. Overall, I enjoyed the class very much, and I recommend it to anyone looking for an introduction to functional programming. In this post, I will go through some of my favorite features of Scala, many of which are missing from Java.
Pattern matching is a powerful feature in Scala with no easy equivalent in Java. Java has a switch statement to choose an execution path based on some equality condition. Conventionally, it is used mostly to improve readability as a replacement for many if/else branches. The Java switch is limited to a few primitive types (int, byte, char, short), enums, and String (as of Java 7). Scala overcomes this limitation with the match keyword, which allows matches on complex types. These include matching on types, collections, and even specific objects.
Here’s an example that shows pattern matching on types.
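    def describe(x: Any): String = x match {
      case i: Int    => s"an Int: $i"
      case s: String => s"a String of length ${s.length}"
      case _: Double => "a Double"
      case _         => "something else entirely"
    }

    describe(42)       // "an Int: 42"
    describe("hoops")  // "a String of length 5"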
Pattern matching also works with Scala’s collections library. In this next snippet, we match on Lists of size 0, 1, 2, and more than 2.
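    def describe(list: List[Int]): String = list match {
      case Nil           => "an empty list"
      case x :: Nil      => s"a single element: $x"
      case x :: y :: Nil => s"exactly two elements: $x and $y"
      case x :: _        => s"more than two elements, starting with $x"
    }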
Case class matching
Scala also provides a construct called a Case Class. Case classes can be used in pattern matching like so:
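    abstract class Shape
    case class Circle(radius: Double) extends Shape
    case class Rectangle(width: Double, height: Double) extends Shape

    def area(shape: Shape): Double = shape match {
      case Circle(r)       => math.Pi * r * r   // binds the radius field
      case Rectangle(w, h) => w * h              // binds both fields at once
    }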
Scala features a strong type system. Types are checked at compile time, the compiler can infer types, and functions are first-class values with types of their own.
Similar to Java generics, Scala supports classes and methods parameterized with types. But unlike Java, Scala allows more expressive type bounds. For example, a type parameter declared as [A <: B] requires that A be a subtype of B (an upper bound), while [A >: B] requires that A be a supertype of B (a lower bound).
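A small sketch with some made-up classes:

    class Animal
    class Dog extends Animal
    class Puppy extends Dog

    // Upper bound: this method only accepts lists of Animals (or subtypes of Animal)
    def feed[A <: Animal](animals: List[A]): Int = animals.size

    // Lower bound: A must be a supertype of Puppy, so a Puppy can always be prepended
    def adopt[A >: Puppy](pup: Puppy, litter: List[A]): List[A] = pup :: litter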
The type system gets quite complicated. More details can be found here.
Scala provides many utilities that allow for more expressive code. While these features can be implemented in Java, Scala provides them out of the box. Here are a few goodies that I think are cool:
Scala uses a tuple type for grouping together a fixed number of values, possibly of different types. A tuple can hold up to 22 elements. Here is an example of the syntax:
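    val person = ("Alice", 28, true)     // a Tuple3: (String, Int, Boolean)
    val name = person._1                 // elements are accessed by position, starting at _1
    val (n, age, isHungry) = person      // a tuple can also be unpacked in one step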
Scala provides a Range class to represent a range of integer values starting at a from value and going up to (or including) a to value. until is used to exclude the end value of a range, while to is used to include it. A step value can also be used to define the increment between successive values. Here's an example:
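    val exclusive = 1 until 5     // 1, 2, 3, 4     (the end value is excluded)
    val inclusive = 1 to 5        // 1, 2, 3, 4, 5  (the end value is included)
    val odds      = 1 to 10 by 2  // 1, 3, 5, 7, 9  (with a step of 2)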
Scala also provides a pseudo-DSL for manipulating collections. Similar to list comprehensions in Python, Scala provides a more readable method of calling map/flatMap/filter on collections. Here is a canonical example of this feature:
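    // All pairs (i, j) with 1 <= j < i <= 4 whose sum is even
    val pairs = for {
      i <- 1 to 4
      j <- 1 until i
      if (i + j) % 2 == 0
    } yield (i, j)
    // pairs == Vector((3,1), (4,2))

    // The same thing written with flatMap/filter/map directly:
    val same = (1 to 4).flatMap(i => (1 until i).filter(j => (i + j) % 2 == 0).map(j => (i, j)))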
After spending the last few weeks working with Scala, I have developed a fondness for the simplicity and expressiveness of functional languages. I am still a little overwhelmed at the number of features available in the Scala language. There are quite a few “Good Parts” and this post glosses over a small number of them. I am looking forward to exploring the new functional features of Java 8 with the new perspective I gained from studying Scala!
I recently finished reading the popular Design Patterns book. As I was reading, I noticed that many of the examples used in the book were a bit hard to relate to. For several patterns, they used the example of building a WYSIWYG editor, something with which I don't have much experience. Many of the examples in the "Known Uses" section of each chapter were also a little outdated. As a way to help me understand these patterns in detail, I wanted to create my own examples, modeled after real-world situations. This post will go through a few patterns and will attempt to find similarities between abstract object-oriented design patterns and more tangible real-life examples.
Creational patterns deal with how objects are created. Controlling how objects are built can have a big impact on improving the design of a system. It can allow a system to be independent of how its objects are created.
One such creational pattern is the Builder pattern. This pattern separates HOW an object is constructed from its actual representation. This will allow us to use the same general construction process to create different representations.
One real-world example of this pattern occurs in the car manufacturing industry. When building a certain car according to specification, the manufacturer chooses several options about that car. These could include the color, the engine, and any additional features of the car. Here is the class diagram for this example:
In this example, the client (the manufacturer) interacts directly with the builder to manufacture cars. The advantage of using this pattern is that we can construct cars with different characteristics using the same builder, simply by specifying different parameters to it. We have very little coupling between the client using our builder and the product being built.
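A minimal sketch of this in Java (the class and option names are just illustrative):

    public class Car {
        private final String color;
        private final String engine;
        private final boolean sunroof;

        private Car(Builder builder) {
            this.color = builder.color;
            this.engine = builder.engine;
            this.sunroof = builder.sunroof;
        }

        public static class Builder {
            private String color = "black";
            private String engine = "V6";
            private boolean sunroof = false;

            public Builder color(String color) { this.color = color; return this; }
            public Builder engine(String engine) { this.engine = engine; return this; }
            public Builder sunroof(boolean sunroof) { this.sunroof = sunroof; return this; }

            // The same construction process produces differently configured cars
            public Car build() { return new Car(this); }
        }

        public static void main(String[] args) {
            Car commuter = new Car.Builder().build();
            Car sportsCar = new Car.Builder().color("red").engine("V8").sunroof(true).build();
        }
    }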
This category of patterns is concerned with how classes and objects are composed as part of a larger system. Several patterns in this category also help coordinate independently developed classes into a single structure.
One such structural pattern is the Facade pattern. Facade provides a single unified interface that encompasses several interfaces in a subsystem. It acts as a funnel to expose a single interface to many clients, and hide the actual subsystem that is responsible for doing the work requested. Here is what I mean:
From the above diagram, we can see that the Facade coordinates several subsystems behind the scenes, but gives the appearance of a single, simple interface to clients. One example of this is Amazon's 1-click ordering system. When ordering items, a customer is presented with a simple interface to purchase an item, but there are several complex processes running to enable this item to be purchased.
A Customer ordering an item for this system is spared the details about what goes on “under the hood”. The Facade takes care of coordinating the actions of the different processors in the subsystem. Another advantage of this pattern is that if a certain piece of the subsystem were to change, the Customer making the order would be unaffected and unaware.
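A bare-bones sketch of what this might look like (the subsystem classes are hypothetical stand-ins for the real processes):

    // Hypothetical pieces of the ordering subsystem
    class InventoryService { boolean reserve(String item) { return true; } }
    class PaymentProcessor { void charge(String customer, double amount) { } }
    class ShippingService { void ship(String item, String customer) { } }

    // The facade exposes one simple call and coordinates the subsystem behind it
    public class OneClickOrderFacade {
        private final InventoryService inventory = new InventoryService();
        private final PaymentProcessor payments = new PaymentProcessor();
        private final ShippingService shipping = new ShippingService();

        public void order(String customer, String item, double price) {
            if (inventory.reserve(item)) {
                payments.charge(customer, price);
                shipping.ship(item, customer);
            }
        }
    }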
This group of patterns involves the interaction between objects. These patterns aim to improve the distribution of responsibility between objects and increase the overall flexibility of the system.
The Strategy pattern was, to me, one of the most intuitive patterns in the book. Strategy is used to encapsulate an algorithm as an object so that implementations can be swapped easily. The client can choose the appropriate Strategy at run time based on factors that are not available at compile time. There is also the added benefit that the client is shielded from any complex logic, which can be outsourced to the Strategy class.
One delicious example of this pattern is with eating Oreos. As everyone knows, there are a myriad of ways to consume this Great American Cookie. If we consider the process of eating this cookie as an algorithm, we can model the Strategy pattern pretty easily:
Here we have three possible strategies for eating Oreos: dipping it in milk, eating the whole cookie at once, and (barbarically) splitting it apart into two pieces and licking the cream off. Our client is a hungry child who will be making a run-[ning around the house]time decision as to how to eat the cookie. Rather than being burdened by the complexity of figuring out how to eat an Oreo, our hungry child can simply offload that knowledge to Strategy classes that encapsulate the Oreo-eating algorithm.
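A small sketch in Java (the strategy and client names are made up, of course):

    // Each Oreo-eating algorithm is encapsulated behind a common interface
    interface OreoEatingStrategy {
        void eat();
    }

    class DunkInMilk implements OreoEatingStrategy {
        public void eat() { System.out.println("Dunk the cookie in milk, then eat it."); }
    }

    class EatWhole implements OreoEatingStrategy {
        public void eat() { System.out.println("Eat the whole cookie in one bite."); }
    }

    class TwistAndLick implements OreoEatingStrategy {
        public void eat() { System.out.println("Split it apart and lick the cream off."); }
    }

    // The client picks a strategy at run time and never sees the details
    public class HungryChild {
        private final OreoEatingStrategy strategy;

        HungryChild(OreoEatingStrategy strategy) { this.strategy = strategy; }

        void eatCookie() { strategy.eat(); }

        public static void main(String[] args) {
            new HungryChild(new DunkInMilk()).eatCookie();
        }
    }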
Over the past few weeks, I have been working with Apache ZooKeeper for a certain project at work. ZooKeeper is a pretty cool technology and I thought I would document what I’ve learned so far, for others looking for a short introduction and for myself to look back on in the future.
What is ZooKeeper?
“A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services”.
Informally, ZooKeeper is just a service that distributed systems can rely on for coordination. It allows developers to focus on implementing the core application logic, without having to worry about also implementing any coordination tasks required for the application. In today’s “big data” world, applications are usually individual, independent programs running on separate, physical machines. This sort of distribution further complicates the already-complicated problem of synchronizing shared data and state between processes. ZooKeeper provides a fast, highly available, fault tolerant, and distributed solution to help solve these synchronization issues.
Like a lot of innovative ideas these days, ZooKeeper has its roots at Google: it was inspired by Google's Chubby lock service. In that paper, Chubby is introduced as a lock service to "allow its clients to synchronize their activities and to agree on basic information about their environment". In the context of distributed applications, this information can control the cooperation between many competing resources. Here are a few examples of when ZooKeeper can be used:
An application in which a single worker machine needs to wait on a certain condition to be satisfied before starting or continuing progress. This is an example of the familiar Mutual Exclusion problem just at a larger, more distributed scale. For this scenario, ZooKeeper can be used as a lock to prevent concurrent execution.
An application with a master-slave architecture where a single master delegates work to several worker nodes (see HBase, for an example). In this example, we can use ZooKeeper to elect a leader and to keep track of the worker nodes to see “who’s doing what”.
How does ZooKeeper work?
ZooKeeper's data model is almost identical to that of a file system. ZooKeeper works with small data nodes (called znodes), which are organized hierarchically into a tree structure.
In the above example, we have the root node /, under which we have two logical namespaces, config and workers. These just serve to separate out the concerns of the application. The config namespace is used for centralized configuration management. In the leaf node, we can store data (up to 1 MB), which in our case contains a BLOCK_SIZE property. The workers namespace is used as a name service to locate worker nodes in our system.
ZooKeeper's nodes come in two flavors: persistent and ephemeral. The difference is simply that a persistent znode can only be deleted through an explicit delete call, whereas an ephemeral znode is deleted automatically once the client that created it disconnects. Nodes can also be sequential, which just means that each node is given a unique, monotonically increasing integer that is appended to its path.
The above image shows a basic overview of a typical ZooKeeper interaction. Many application processes open sessions to ZooKeeper via a client library (ZooKeeper provides bindings for Java and C). The ensemble of ZooKeeper servers is responsible for serving client requests. Writes are routed through a single leader and are atomically broadcast to the followers. Reads can be served from any follower. There is a small, bounded delay between when a write has been requested and when it is available to be read. This is the reason ZooKeeper is considered eventually consistent.
ZooKeeper is built on three main principles:
Replication - The data stored in a single znode is replicated across several servers in the ensemble. This ensures that there is not one single point of failure.
Order - Each update of data in ZooKeeper is assigned a global version number. Updates from a client are applied in the order in which they were sent.
Speed - ZooKeeper stores its data in-memory, which allows it to achieve high throughput and low latency.
ZooKeeper provides a very simple API for interacting with znodes. ZooKeeper also comes with a command-line interface that is useful for debugging and playing around with the different operations. Here are the main operations available in bin/zkCli.sh:
create [path] [data] - Creates a node for the given path and associates some data with it
delete [path] - Deletes a node with the given path
get [path] - Get the data associated with the node at this path
stat [path] - Gets the metadata of a node. This includes information about when the node was created, the version of the data, the length of the data etc. See documentation for more information about the Stat object.
set [path] [data] [version] - Update the data of a given node. This caller can specify the version that it knows about (returned from stat). If the version is not the latest, ZooKeeper rejects this update.
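Putting a few of these together, a quick zkCli.sh session against the example tree from earlier might look something like this (the data values are just placeholders):

    create /config ""
    create /config/BLOCK_SIZE "64"
    get /config/BLOCK_SIZE              (returns "64" along with the node's Stat metadata)
    set /config/BLOCK_SIZE "128" 0      (succeeds only if the node is still at version 0)
    delete /config/BLOCK_SIZE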
Recipes and Curator
As mentioned earlier, ZooKeeper is used to implement common synchronization primitives for distributed applications. To that end, ZooKeeper provides a few recipes, which are actual implementations of common synchronization mechanisms that a ZooKeeper user might need. Most of the time, an application’s interaction with ZooKeeper would be through these recipes.
In addition, Apache Curator, a library open-sourced by Netflix, provides a wrapper around the already-pretty-simple API of ZooKeeper. Curator provides support for ZooKeeper recipes as well as a fluent interface API that manages the complexity of connecting to ZooKeeper.
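To give a flavor of the API, here is a rough sketch of using Curator's InterProcessMutex recipe as a distributed lock (the connection string and lock path are placeholders):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.locks.InterProcessMutex;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class LockExample {
        public static void main(String[] args) throws Exception {
            // Connect to the ZooKeeper ensemble, retrying with exponential backoff
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "localhost:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            // The mutex recipe hides the znode bookkeeping needed for a distributed lock
            InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-lock");
            lock.acquire();
            try {
                // ... critical section ...
            } finally {
                lock.release();
                client.close();
            }
        }
    }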
In celebration of Super Bowl Sunday, this post will examine how HBase can be used to model NFL play-by-play data.
We will be using data from the 2002 season through the 2012 season. The source of the data is CSV files provided by the awesome people at Advanced NFL Stats. Each file contains data for a single season, and the format of the data is as follows:
The gameid column contains the date of the game and the two teams competing. The description column contains a short natural-language blurb about a certain play. The rest of the columns are self-explanatory. Here is a snippet of a sample file from the 2012 season (which also happens to contain Robert Griffin III's first plays in the NFL):
20120909_WAS@NO,1,-5,0,WAS,NO,,,69,B.Cundiff kicks 65 yards from WAS 35 to end zone Touchback.,0,0,2012
20120909_WAS@NO,1,60,0,NO,WAS,1,10,80,(15:00) D.Brees pass incomplete short left to M.Colston.,0,0,2012
20120909_WAS@NO,1,59,57,NO,WAS,2,10,80,(14:57) D.Brees pass incomplete short left to L.Moore [A.Carriker].,0,0,2012
20120909_WAS@NO,1,59,52,NO,WAS,3,10,80,(14:52) (Shotgun) D.Brees pass incomplete short middle (S.Bowen).,0,0,2012
20120909_WAS@NO,1,59,47,NO,WAS,4,10,80,(14:47) T.Morstead punts 59 yards to WAS 21 Center-J.Drescher. B.Banks pushed ob at WAS 32 for 11 yards (S.Shanle).,0,0,2012
20120909_WAS@NO,1,59,34,WAS,NO,1,10,68,(14:34) (Shotgun) R.Griffin pass short left to P.Garcon to WAS 32 for no gain (C.White). Pass -4 YAC 4,0,0,2012
20120909_WAS@NO,1,58,58,WAS,NO,2,10,68,(13:58) (Shotgun) R.Griffin right end to WAS 44 for 12 yards (S.Shanle).,0,0,2012
20120909_WAS@NO,1,58,25,WAS,NO,1,10,56,(13:25) (Shotgun) R.Griffin pass short left to P.Garcon to NO 44 for 12 yards (R.Harper). Pass -2 YAC 14,0,0,2012
20120909_WAS@NO,1,57,44,WAS,NO,1,10,44,(12:44) (Shotgun) R.Griffin pass short right to P.Garcon to NO 35 for 9 yards (C.Lofton; C.White).,0,0,2012
In a previous post, we looked at HBase's data model. One of the killer features built into HBase is the ability to store versioned data. Data that spans a time series can be modeled easily in HBase. NFL play-by-play data fits nicely with this model, since each play of a game can be considered a time slice that captures the state of the game at that instant. This means that we can replace the timestamp field with an integer from 0 to n - 1, where n is the total number of plays in the game.
Another reason for using HBase is its innate ability to scale for large amounts of data. We will be storing data for 11 seasons each with 269 games (regular season and playoffs) with around 175 plays for each game. This totals to around 517,825 data points that we will currently be storing. This may not seem like much, but consider that if we start adding data from games before 2002 and we continue to add data up to the current season, we will soon be glad that we chose a scalable data store like HBase.
We won't spend too much time discussing the HBase schema decisions. The HBase documentation does a good job of explaining the different schema design considerations. We will be using gameid (see above) as the row key. This will allow us to store the data for each game in a single row. We will use a single column family called 'play'. Our columns will be the different attributes of a single play (for example: yard line, offense score). As discussed before, we will use the play number as the timestamp for each column. Here is a diagram that more succinctly describes our schema:
Table creation & Data import
Now that we have looked at the data we will be storing and the schema we will follow, let's finally create our table and put some data in it. There are several APIs and libraries in different languages that allow us to interact with HBase. We will mostly be using the HBase shell, which is a JRuby IRB-based shell that allows us to run simple data management and manipulation commands. To import data from the CSV files into HBase, we will use JRuby, which is the "Ruby language on the JVM".
Let's start by creating the table. We will create a new table called 'nflplays', with a single column family called 'play' and the specification that we want to store ALL_VERSIONS (2147483647) of a column. This may seem like overkill, since we should never really have more than ~300 plays in a game, but I am doing it to illustrate how we can use constants from HConstants when defining table characteristics.
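In the HBase shell, that command looks roughly like this:

    create 'nflplays', {NAME => 'play', VERSIONS => org.apache.hadoop.hbase.HConstants::ALL_VERSIONS}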
Next, we use the describe command to verify the table we just created. We can see a few of the characteristics of this table, such as the fact that it is ENABLED and the settings of the 'play' column family we specified.
Now that we have our table created, we can finally start importing the data from the CSV files into HBase. For this, I've written up a crude script in JRuby that goes through each line and parses out the different elements. It looks roughly like this (the column names and positions are inferred from the CSV format above):
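    include Java

    import org.apache.hadoop.hbase.HBaseConfiguration
    import org.apache.hadoop.hbase.client.HTable
    import org.apache.hadoop.hbase.client.Put
    import org.apache.hadoop.hbase.util.Bytes

    # Column names/positions assumed from the CSV snippet above
    COLUMNS = ['gameid', 'qtr', 'min', 'sec', 'off', 'def', 'down', 'togo',
               'ydline', 'description', 'offscore', 'defscore', 'season']

    table = HTable.new(HBaseConfiguration.create, 'nflplays')
    play_numbers = Hash.new(0)  # per-game play counter, used as the version/timestamp

    File.open(ARGV[0]).each_line do |line|
      # Descriptions in this data set don't contain commas, so a plain split is enough
      fields = line.strip.split(',')
      gameid = fields[0]
      play_num = play_numbers[gameid]
      play_numbers[gameid] += 1

      put = Put.new(Bytes.toBytes(gameid))
      COLUMNS.each_with_index do |col, i|
        next if i == 0 || fields[i].nil? || fields[i].empty?
        # family 'play', qualifier = column name, timestamp = play number within the game
        put.add(Bytes.toBytes('play'), Bytes.toBytes(col), play_num, Bytes.toBytes(fields[i]))
      end
      table.put(put)
    end

    table.close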
A few things to notice:
With JRuby, we can import classes from Java using familiar import statements
Notice that each record is inserted with a Put object.
After running this script, we can view some of the data that has been inserted into HBase. First, we can count the number of rows we have inserted (I was running HBase on a micro EC2 instance and did not want to strain my already scarce resources, so I only imported the first 8 games from 2012).
We can scan all the rows that we have imported using the ‘scan’ command in the HBase shell. Here is a snippet of the output I get:
Here we see the last play for two games. Since I didn't specify how many versions to return, the scan only returns the most recent version (the last play of each game). We also see each column for a single row. Also notice that the timestamp here is the play number (see above for why this is the case). If we wanted to see only the play descriptions for a single game, we can use a query along these lines:
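    get 'nflplays', '20120909_WAS@NO', {COLUMN => 'play:description', VERSIONS => 200}

Setting VERSIONS high enough ensures we get every play back, not just the most recent one.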
Let's take a look at the Washington vs. New Orleans game: notice that some columns (like 'togo') only have 189 values, whereas the total number of plays is 190. This indicates that for play 190, we did not have a value for 'togo'. This illustrates the 'sparse' property we looked at in the previous post.
Analyzing the data
Now that we have our data in HBase, we can run a few queries on that data. Here is a simple Java program (a rough sketch against the HBase client API of the time) that scans the table for all plays that contain the word 'TOUCHDOWN' in the description. You can imagine how we can extend this program to create more complex queries, using different Filters.
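    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
    import org.apache.hadoop.hbase.filter.SubstringComparator;
    import org.apache.hadoop.hbase.filter.ValueFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class TouchdownScanner {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "nflplays");

            // Look at every version (i.e., every play) of the description column...
            Scan scan = new Scan();
            scan.addColumn(Bytes.toBytes("play"), Bytes.toBytes("description"));
            scan.setMaxVersions();
            // ...and keep only the values containing the word TOUCHDOWN
            scan.setFilter(new ValueFilter(CompareOp.EQUAL, new SubstringComparator("TOUCHDOWN")));

            ResultScanner scanner = table.getScanner(scan);
            for (Result result : scanner) {
                for (KeyValue kv : result.raw()) {
                    System.out.println(Bytes.toString(kv.getValue()));
                }
            }
            scanner.close();
            table.close();
        }
    }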