I recently finished the Functional Programming Principles in Scala course on Coursera. Having primarily programmed in an imperative language (Java), I enjoyed all the goodies provided by a functional language. Overall, I enjoyed the class very much, and I recommend it to anyone looking for an introduction to functional programming. In this post, I will go through some of my favorite features of Scala, many of which are missing from Java.
Pattern matching is a powerful feature in Scala with no easy equivalent in Java. Java has a switch statement to choose an execution path based on some equality condition. Conventionally, it is used mostly to improve readability as a replacement for long if/else chains. The Java switch is limited to a few primitive types (int, byte, char, short), enums, and (as of Java 7) String. Scala overcomes this limitation with a match keyword that allows matches on complex types. These include matching on types, collections, and even specific objects.
Here’s an example that shows pattern matching on types.
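For instance, a function like the following (the name and cases are my own, not from the course) can dispatch on the runtime type of its argument:

```scala
// Match on the runtime type of the argument
def describe(x: Any): String = x match {
  case i: Int    => s"an integer: $i"
  case s: String => s"a string of length ${s.length}"
  case d: Double => s"a double: $d"
  case _         => "something else"
}
```

Calling `describe(42)` selects the first case; `describe("hello")` selects the second.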
Pattern matching also works with Scala’s collections library. In this next snippet, we match on Lists of size 0, 1, 2, and more than 2.
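A sketch of what that looks like (the function name is mine):

```scala
// Match on the structure of a List: 0, 1, 2, or more elements
def sizeMessage[A](xs: List[A]): String = xs match {
  case Nil           => "empty list"
  case _ :: Nil      => "list of size 1"
  case _ :: _ :: Nil => "list of size 2"
  case _             => "list with more than 2 elements"
}
```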
Case class matching
Scala also provides a construct called a Case Class. Case classes can be used in pattern matching like so:
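Here is a small, hypothetical example (the shape names are my own):

```scala
sealed trait Shape
case class Circle(radius: Double) extends Shape
case class Rectangle(width: Double, height: Double) extends Shape

// Case classes can be deconstructed directly in match expressions;
// because the trait is sealed, the compiler can warn about non-exhaustive matches
def area(s: Shape): Double = s match {
  case Circle(r)       => math.Pi * r * r
  case Rectangle(w, h) => w * h
}
```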
Scala features a strong type system. Types are checked at compile time, the compiler infers types, and functions themselves have types (first-class functions).
Similar to Java Generics, Scala supports classes parameterized with types. But unlike Java, Scala provides the ability to express more complex type bounds. For example, the type List[A <: B] requires that the list be made up of elements whose type A is a subtype of B. Similarly, List[A >: B] requires that A be a supertype of B.
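To make these bounds concrete, here is a small sketch (the class names are hypothetical):

```scala
class Animal { def speak: String = "..." }
class Dog extends Animal { override def speak = "Woof" }
class Cat extends Animal { override def speak = "Meow" }

// Upper bound: A must be a subtype of Animal, so every element can speak
def speakAll[A <: Animal](xs: List[A]): List[String] = xs.map(_.speak)

// Lower bound: A must be a supertype of Dog, so a Dog can always be prepended
def withDog[A >: Dog](xs: List[A]): List[A] = new Dog :: xs
```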
The type system gets quite complicated. More details can be found here.
Scala provides many utilities that allow for more expressive code. While these features could be implemented in Java, Scala provides them out of the box. Here are a few goodies that I think are cool:
Scala uses tuple types for grouping data together without defining a dedicated class. A tuple can hold up to 22 elements. Here is an example of the syntax:
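A quick sketch (the values are mine, just for illustration):

```scala
val person: (String, Int) = ("Alice", 30) // sugar for Tuple2[String, Int]
val triple = ("x", 1, true)               // a Tuple3[String, Int, Boolean]

// Elements are accessed positionally, or extracted via pattern matching
val name = person._1
val (_, age) = person
```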
Scala provides a Range class to represent a range of integer values, starting at a from value and going up to (or including) a to value. until is used to exclude the end value of a range; to is used to include it. A step value can be used to define the increment between consecutive values. Here’s an example:
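The three forms side by side:

```scala
val exclusive = 1 until 5     // 1, 2, 3, 4 (end value excluded)
val inclusive = 1 to 5        // 1, 2, 3, 4, 5 (end value included)
val evens     = 2 to 10 by 2  // 2, 4, 6, 8, 10 (step of 2)
```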
Scala also provides a pseudo-DSL for manipulating collections. Similar to list comprehensions in Python, Scala provides a more readable method of calling map/flatMap/filter on collections. Here is a canonical example of this feature:
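A small example of a for-comprehension (the data is mine):

```scala
val xs = List(1, 2, 3, 4, 5)

// The squares of the even numbers in xs
val squaresOfEvens = for (x <- xs if x % 2 == 0) yield x * x
// equivalent to: xs.filter(_ % 2 == 0).map(x => x * x)
```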
After spending the last few weeks working with Scala, I have developed a fondness for the simplicity and expressiveness of functional languages. I am still a little overwhelmed at the number of features available in the Scala language. There are quite a few “Good Parts” and this post glosses over a small number of them. I am looking forward to exploring the new functional features of Java 8 with the new perspective I gained from studying Scala!
I recently finished reading the popular Design Patterns book. As I was reading, I noticed that many of the examples used in the book were a bit hard to relate to. For several patterns, they used the example of building a WYSIWYG editor, something with which I don’t have much experience. Many of the examples in the “Known Uses” section of each chapter were also a little outdated. As a way to help me understand these patterns in detail, I wanted to create my own examples, modeled after real-world situations. This post will go through a few patterns and will attempt to find similarities between abstract object-oriented design patterns and more tangible real-life examples.
Creational patterns deal with how objects are created. Controlling how objects are built can have a big impact on improving the design of a system. It can allow a system to be independent of how its objects are created.
One such creational pattern is the Builder pattern. This pattern separates HOW an object is constructed from its actual representation. This will allow us to use the same general construction process to create different representations.
One real-world example of this pattern occurs in the car manufacturing industry. When building a certain car according to specification, the manufacturer chooses several options about that car. These could include the color, the engine, and any additional features of the car. Here is the class diagram for this example:
In this example, the client (the manufacturer) interacts directly with the builder to manufacture cars. The advantage of using this pattern is that we can construct cars with different characteristics using the same builder, and by specifying different parameters to the builder. We have very little coupling between the client using our builder and the product being built.
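Here is a minimal sketch of such a builder (written in Scala; all names are hypothetical):

```scala
case class Car(color: String, engine: String, features: List[String])

// The builder accumulates construction choices; build() produces the representation
class CarBuilder {
  private var color = "white"
  private var engine = "V4"
  private var features = List.empty[String]

  def withColor(c: String): CarBuilder = { color = c; this }
  def withEngine(e: String): CarBuilder = { engine = e; this }
  def addFeature(f: String): CarBuilder = { features = f :: features; this }
  def build(): Car = Car(color, engine, features.reverse)
}

// The client (manufacturer) uses the same builder to produce different cars
val sportsCar = new CarBuilder().withColor("red").withEngine("V8").addFeature("spoiler").build()
```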
This category of patterns is concerned with how classes and objects are built and composed as part of a larger system. Several patterns in this category also help coordinate independently developed classes into a single structure.
One such structural pattern is the Facade pattern. Facade provides a single unified interface that encompasses several interfaces in a subsystem. It acts as a funnel to expose a single interface to many clients, and hide the actual subsystem that is responsible for doing the work requested. Here is what I mean:
From the above diagram, we can see that the Facade coordinates several subsystems behind the scenes, but gives the appearance of a single, simple interface to clients. One example of this is Amazon’s 1-click ordering system. When ordering items, a customer is presented with a simple interface to purchase an item, but there are several complex processes running to enable this item to be purchased.
A Customer ordering an item for this system is spared the details about what goes on “under the hood”. The Facade takes care of coordinating the actions of the different processors in the subsystem. Another advantage of this pattern is that if a certain piece of the subsystem were to change, the Customer making the order would be unaffected and unaware.
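A minimal sketch of this idea in code (Scala; the subsystem names are hypothetical, not Amazon's actual services):

```scala
// Hypothetical subsystems hidden behind the facade
class InventoryService { def reserve(item: String): Boolean = true }
class PaymentService   { def charge(amount: Double): Boolean = true }
class ShippingService  { def schedule(item: String): String = s"shipping $item" }

// The Facade exposes one simple operation and coordinates the subsystems
class OrderFacade(inv: InventoryService, pay: PaymentService, ship: ShippingService) {
  def oneClickOrder(item: String, price: Double): Boolean =
    inv.reserve(item) && pay.charge(price) && { ship.schedule(item); true }
}

val facade = new OrderFacade(new InventoryService, new PaymentService, new ShippingService)
```

The customer calls only `oneClickOrder`; if a subsystem changes, the call site is unaffected.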
This group of patterns involves the interaction between objects. These patterns aim to improve the distribution of responsibility between objects and increase the overall flexibility of the system.
The Strategy pattern was, to me, one of the most intuitive patterns in the book. Strategy is used to encapsulate an algorithm as an object to simplify interchangeability. The client can choose the appropriate Strategy at run-time based on factors that are not available at compile-time. There is also the added benefit that the client is shielded from any complex logic, which can be outsourced to the Strategy class.
One delicious example of this pattern is with eating Oreos. As everyone knows, there are a myriad of ways to consume this Great American Cookie. If we consider the process of eating this cookie as an algorithm, we can model the Strategy pattern pretty easily:
Here we have three possible strategies for eating Oreos: dipping it in milk, eating the whole cookie at once, and (barbarically) splitting it apart into two pieces and licking the cream off. Our client is a hungry child who will be making a run-[ning around the house]time decision as to how to eat the cookie. Rather than being burdened by the complexity of figuring out how to eat an Oreo, our hungry child can simply offload that knowledge to Strategy classes that encapsulate the Oreo-eating algorithm.
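A sketch of this setup in code (Scala; the names are mine):

```scala
// Each strategy encapsulates one Oreo-eating algorithm
trait EatingStrategy { def eat(cookie: String): String }
object DunkInMilk extends EatingStrategy   { def eat(c: String) = s"dunking $c in milk" }
object WholeCookie extends EatingStrategy  { def eat(c: String) = s"eating $c whole" }
object TwistAndLick extends EatingStrategy { def eat(c: String) = s"twisting $c apart and licking the cream" }

// The client picks a strategy at run-time and stays ignorant of the details
class HungryChild(var strategy: EatingStrategy) {
  def eatCookie(): String = strategy.eat("Oreo")
}
```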
Over the past few weeks, I have been working with Apache ZooKeeper for a certain project at work. ZooKeeper is a pretty cool technology and I thought I would document what I’ve learned so far, for others looking for a short introduction and for myself to look back on in the future.
#What is ZooKeeper?
“A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services”.
Informally, ZooKeeper is just a service that distributed systems can rely on for coordination. It allows developers to focus on implementing the core application logic, without having to worry about also implementing any coordination tasks required for the application. In today’s “big data” world, applications are usually individual, independent programs running on separate, physical machines. This sort of distribution further complicates the already-complicated problem of synchronizing shared data and state between processes. ZooKeeper provides a fast, highly available, fault tolerant, and distributed solution to help solve these synchronization issues.
Like a lot of innovative ideas these days, ZooKeeper traces its origins to Google, in the form of Google Chubby. In that paper, Chubby is introduced as a lock service to “allow its clients to synchronize their activities and to agree on basic information about their environment”. In the context of distributed applications, this information can control the cooperation between many competing resources. Here are a few examples of when ZooKeeper can be used:
An application in which a single worker machine needs to wait on a certain condition to be satisfied before starting or continuing progress. This is an example of the familiar Mutual Exclusion problem just at a larger, more distributed scale. For this scenario, ZooKeeper can be used as a lock to prevent concurrent execution.
An application with a master-slave architecture where a single master delegates work to several worker nodes (see HBase, for an example). In this example, we can use ZooKeeper to elect a leader and to keep track of the worker nodes to see “who’s doing what”.
#How does ZooKeeper work?
ZooKeeper’s data model is almost identical to that of a file system. ZooKeeper works with small data nodes (called znodes), which are organized hierarchically into a tree structure.
In the above example, we have the root node /, under which we have two logical namespaces, config and workers. These just serve to separate out the concerns of the application. The config namespace is used for centralized configuration management. In the leaf node, we can store data (up to 1MB), which in our case contains a BLOCK_SIZE property. The workers namespace is used as a name service to locate worker nodes in our system.
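As a rough illustration (not real ZooKeeper code), the hierarchy described above can be pictured as a mapping from paths to small blobs of data; the concrete paths and values below are hypothetical:

```scala
// Toy model of the znode tree: each path maps to its (small) data payload
val znodes: Map[String, Array[Byte]] = Map(
  "/"                 -> Array.emptyByteArray,
  "/config"           -> Array.emptyByteArray,
  "/config/app"       -> "BLOCK_SIZE=64MB".getBytes, // leaf holding config data
  "/workers"          -> Array.emptyByteArray,
  "/workers/worker-1" -> "host1:5555".getBytes       // name-service entry
)
```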
ZooKeeper’s nodes come in two flavors: persistent and ephemeral. The difference is simply that a persistent znode can only be deleted through an explicit delete call, whereas an ephemeral node is deleted once the client that created it has disconnected. Nodes can also be sequential. This just means that each node is given a unique, monotonically increasing integer which is appended to its path.
The above image shows a basic overview of a typical ZooKeeper interaction. Many application processes open sessions to ZooKeeper via a client library (ZooKeeper provides bindings for Java and C). The ensemble of ZooKeeper servers is responsible for serving client requests. Writes are routed through a single leader and are atomically broadcast to followers. Reads can be served from any follower. There is a small, bounded delay from when a write has been requested to when it is available to be read. This is the reason ZooKeeper is considered eventually consistent.
ZooKeeper is built on three main principles:
Replication - The data stored in a single znode is replicated across several servers in the ensemble. This ensures that there is not one single point of failure.
Order - Each update of data in ZooKeeper is assigned a global version number. Updates from a client are applied in the order in which they were sent.
Speed - ZooKeeper stores its data in-memory, which allows it to achieve high throughput and low latency.
ZooKeeper provides a very simple API for interacting with znodes. ZooKeeper also comes with a command-line interface that is useful for debugging and playing around with the different operations. Here are the main operations available in bin/zkCli.sh:
create [path] [data] - Creates a node for the given path and associates some data with it
delete [path] - Deletes a node with the given path
get [path] - Gets the data associated with the node at this path
stat [path] - Gets the metadata of a node. This includes information about when the node was created, the version of the data, the length of the data etc. See documentation for more information about the Stat object.
set [path] [data] [version] - Updates the data of a given node. The caller can specify the version that it knows about (returned from stat). If the version is not the latest, ZooKeeper rejects the update.
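To see how the version-checked set behaves, here is a toy in-memory model of these operations (this is not the real ZooKeeper API, just a sketch of the semantics):

```scala
// Toy model: each path maps to (data, version); set is rejected on a stale version
class ToyZk {
  private var store = Map.empty[String, (String, Int)]
  def create(path: String, data: String): Unit = store += (path -> (data, 0))
  def get(path: String): String = store(path)._1
  def stat(path: String): Int = store(path)._2
  def set(path: String, data: String, version: Int): Boolean = {
    val (_, v) = store(path)
    if (version != v) false // stale version: update rejected
    else { store += (path -> (data, v + 1)); true }
  }
  def delete(path: String): Unit = store -= path
}
```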
#Recipes and Curator
As mentioned earlier, ZooKeeper is used to implement common synchronization primitives for distributed applications. To that end, ZooKeeper provides a few recipes, which are actual implementations of common synchronization mechanisms that a ZooKeeper user might need. Most of the time, an application’s interaction with ZooKeeper would be through these recipes.
In addition, Apache Curator, a library open-sourced by Netflix, provides a wrapper around the already-pretty-simple API of ZooKeeper. Curator provides support for ZooKeeper recipes as well as a fluent interface API that manages the complexity of connecting to ZooKeeper.
In celebration of Super Bowl Sunday, this post will examine how HBase can be used to model NFL play-by-play data.
We will be using data from the 2002 season through the 2012 season. The source of the data is CSV files provided by the awesome people at Advanced NFL Stats. Each file contains data for a single season and the format of the data is as follows:
The gameid column contains the date of the game and the two teams competing. The description column contains a short natural language blurb about a certain play. The rest of the columns are self explanatory. Here is a snippet of a sample file from the 2012 season (which also happens to be Robert Griffin III’s first plays in the NFL):
20120909_WAS@NO,1,-5,0,WAS,NO,,,69,B.Cundiff kicks 65 yards from WAS 35 to end zone Touchback.,0,0,2012
20120909_WAS@NO,1,60,0,NO,WAS,1,10,80,(15:00) D.Brees pass incomplete short left to M.Colston.,0,0,2012
20120909_WAS@NO,1,59,57,NO,WAS,2,10,80,(14:57) D.Brees pass incomplete short left to L.Moore [A.Carriker].,0,0,2012
20120909_WAS@NO,1,59,52,NO,WAS,3,10,80,(14:52) (Shotgun) D.Brees pass incomplete short middle (S.Bowen).,0,0,2012
20120909_WAS@NO,1,59,47,NO,WAS,4,10,80,(14:47) T.Morstead punts 59 yards to WAS 21 Center-J.Drescher. B.Banks pushed ob at WAS 32 for 11 yards (S.Shanle).,0,0,2012
20120909_WAS@NO,1,59,34,WAS,NO,1,10,68,(14:34) (Shotgun) R.Griffin pass short left to P.Garcon to WAS 32 for no gain (C.White). Pass -4 YAC 4,0,0,2012
20120909_WAS@NO,1,58,58,WAS,NO,2,10,68,(13:58) (Shotgun) R.Griffin right end to WAS 44 for 12 yards (S.Shanle).,0,0,2012
20120909_WAS@NO,1,58,25,WAS,NO,1,10,56,(13:25) (Shotgun) R.Griffin pass short left to P.Garcon to NO 44 for 12 yards (R.Harper). Pass -2 YAC 14,0,0,2012
20120909_WAS@NO,1,57,44,WAS,NO,1,10,44,(12:44) (Shotgun) R.Griffin pass short right to P.Garcon to NO 35 for 9 yards (C.Lofton; C.White).,0,0,2012
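As a quick illustration of the format, here is a sketch (in Scala) of parsing one of these rows; the column positions are inferred from the sample above, the field names are my own, and it assumes the description contains no commas:

```scala
// A few fields from one play-by-play row (down can be empty, e.g. on kickoffs)
case class Play(gameid: String, qtr: Int, off: String, down: Option[Int], description: String)

def parsePlay(line: String): Play = {
  val f = line.split(",")
  Play(f(0), f(1).toInt, f(4), if (f(6).isEmpty) None else Some(f(6).toInt), f(9))
}
```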
In a previous post, we looked at HBase’s data model. One of the killer features built into HBase is the ability to store versioned data. Data that spans a certain time series can be modeled easily in HBase. NFL play-by-play data fits nicely with this model, since each play of a game can be considered a time slice that captures the state of the game at that instant. This means that we can replace the timestamp field with an integer from 0 to n - 1, where n is the total number of plays in the game.
Another reason for using HBase is its innate ability to scale for large amounts of data. We will be storing data for 11 seasons each with 269 games (regular season and playoffs) with around 175 plays for each game. This totals to around 517,825 data points that we will currently be storing. This may not seem like much, but consider that if we start adding data from games before 2002 and we continue to add data up to the current season, we will soon be glad that we chose a scalable data store like HBase.
We won’t spend too much time discussing the HBase schema decisions. The HBase documentation does a good job of explaining the different schema design considerations. We will be using gameid (see above) as the row key. This will allow us to store data for each game in a single row. We will use a single column family called ‘play’. Our columns will be the different attributes of a single play (for example: yard line, offense score). As discussed before, we will use the play number as the timestamp for each column. Here is a diagram that more succinctly describes our schema:
##Table creation & Data import
Now that we have looked at the data we will be storing and the schema we will follow, let’s finally create our table and put some data in it. There are several APIs and libraries for different languages that allow us to interact with HBase. We will mostly be using the HBase shell, which is a JRuby IRB-based shell that allows us to run simple data management and manipulation commands. To import data from the CSV files into HBase, we will use JRuby, which is the “Ruby Language on the JVM”.
Let’s start by creating the table. The below command creates a new table called ‘nflplays’, with a single column family called ‘play’ and the specification that we want to store ALL_VERSIONS (2147483647) of a column. This may seem like overkill, since we should never have more than ~300 plays in a game, but I am doing this to illustrate how we can use constants from HConstants when defining table characteristics.
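The command would look something like this in the HBase shell (a sketch; the exact syntax for referencing HConstants can vary with the shell version):

```ruby
create 'nflplays', {NAME => 'play', VERSIONS => org.apache.hadoop.hbase.HConstants::ALL_VERSIONS}
```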
Next, we use the describe command to verify that the table was created correctly. The output shows a few of the characteristics of the table, including whether it is ‘ENABLED’.
Now that we have our table created, we can finally start importing the data from the CSV files into HBase. For this, I’ve written up this crude script in JRuby that goes through each line and parses out the different elements. Here it is:
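Since the original script is not reproduced here, the following is a condensed, untested sketch of what it might look like (the file name, column list, and helper names are assumptions, written against the old 0.94-era HBase client API):

```ruby
include Java
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Put

def to_bytes(s)
  s.to_s.to_java_bytes
end

table = HTable.new(HBaseConfiguration.create, 'nflplays')
columns = %w(qtr min sec off def down togo ydline description offscore defscore season)
play_counts = Hash.new(0)

File.open('2012_nfl_pbp_data.csv').each do |line|
  fields = line.strip.split(',')
  gameid = fields[0]
  play_num = play_counts[gameid] += 1
  put = Put.new(to_bytes(gameid))
  columns.each_with_index do |col, i|
    value = fields[i + 1]
    # the play number doubles as the timestamp (see the schema discussion above)
    put.add(to_bytes('play'), to_bytes(col), play_num, to_bytes(value)) unless value.nil? || value.empty?
  end
  table.put(put)
end
```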
A few things to notice:
With JRuby, we can import classes from Java using familiar import statements
Notice that each record is inserted with a Put object.
After running this script, we can view some of the data that has been inserted into HBase. First, we can count the number of rows we have inserted. (I was running HBase on a micro EC2 instance, and since I did not want to strain my already scarce resources, I only imported the first 8 games from 2012.)
We can scan all the rows that we have imported using the ‘scan’ command in the HBase shell. Here is a snippet of the output I get:
Here we see the last play for two games. Since I didn’t specify how many versions to return, it only returns the most recent version (the last play of each game). We also see each column for a single row. Also notice that the timestamp here is the play number (see above for why this is the case). If we wanted to see only the play descriptions for a single game, we can use this query:
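Such a query might look like the following in the shell (a sketch; the row key and version count are illustrative):

```ruby
get 'nflplays', '20120909_WAS@NO', {COLUMN => 'play:description', VERSIONS => 200}
```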
Let’s take a look at the Washington vs New Orleans game. Notice that some columns (like ‘togo’) have only 189 values, whereas the total number of plays is 190. This indicates that for play 190, we did not have a value for ‘togo’. This illustrates the ‘sparse’ property we looked at in the previous post.
##Analyzing the data
Now that we have our data in HBase, we can run a few queries on that data. Here is a simple Java program that scans the table for all plays that contain the word ‘TOUCHDOWN’ in the description. You can imagine how we can extend this program to create more complex queries, using different Filters.
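Since the program itself is not shown here, the following is an untested sketch of what such a scan might look like against the old HBase client API (the class name is mine; the table and column names follow the schema above):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class TouchdownScanner {
    public static void main(String[] args) throws Exception {
        HTable table = new HTable(HBaseConfiguration.create(), "nflplays");
        Scan scan = new Scan();
        scan.addColumn(Bytes.toBytes("play"), Bytes.toBytes("description"));
        scan.setMaxVersions(); // look at every play, not just the latest
        // keep only cells whose value contains the word TOUCHDOWN
        scan.setFilter(new ValueFilter(CompareFilter.CompareOp.EQUAL,
                                       new SubstringComparator("TOUCHDOWN")));
        ResultScanner scanner = table.getScanner(scan);
        for (Result result : scanner) {
            System.out.println(Bytes.toString(result.value()));
        }
        scanner.close();
        table.close();
    }
}
```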
A “sparse, distributed, consistent, multi-dimensional, sorted map”
We will look at what each of these terms means below. HBase is based on Google’s BigTable and is currently an Apache top-level project. It provides random read/write access to data stored in HDFS (Hadoop Distributed File System), and it leverages the capabilities provided by Hadoop and HDFS. In a future post, we will look at the architecture of how HBase stores data. This post will be more of a high-level introduction to the data model used by HBase.
We will start by looking at what each of the terms in the above quote mean and understand the data model using terms that we are already familiar with.
At its core, HBase is a mapping of keys to values. It serves one of the most basic functions of a data store. It stores values, indexed by a key. It retrieves values, given a key.
HBase guarantees that each cell of data is stored lexicographically by its key. This allows for fast range queries (for example, we can ask HBase to return all values with keys from k1 to k4). In contrast, relational databases provide no such guarantee about the sort order of their values.
The key in HBase is actually made up of several parts: row key, column family, column, and timestamp. The timestamp is the killer feature of HBase. It provides a way to store several versions of a value, which makes it a good choice for storing time-series data. The key-value pair model looks like this now:
(row key, column family, column, timestamp) -> value
HBase is a sparse data store in that it stores nothing for empty/null values. There is no cell for a column without a value. In HBase, null values are free to store.
HBase is built for scale. Data stored in HBase can be spread over many physical machines and can store billions of cells. HBase sits on top of HDFS, which takes care of the distribution and replication of data. In addition to scalability, this “feature” provides protection against node failures.
HBase is strongly consistent. This means that reads will always return the last written and committed value for a key. HBase guarantees that all changes within the same row are atomic.
Now that we have broken down the canonical definition of HBase, let’s take a look at some of the important terms that describe how data is stored in HBase.
The highest level of organization is the Table. This term is similar to the relational definition of the term. We organize logically independent groups of data into Tables. The diagram below shows an empty Table (we will use this diagram to iteratively build our understanding of the different terms).
Each Table is made up of 1 or more Rows. Rows provide a logical grouping of cells. Row keys are lexicographically sorted; notice in the diagram below that ‘row-10’ comes before ‘row-2’. Row keys are just bytes, which allows us to use a variety of data types as keys. Each row holds the data for a certain entity. The definition of a Row in HBase is similar to its relational counterpart.
Each Row is made up of 1 or more Columns. Columns are arbitrary labels for attributes of a row. In contrast with RDBMS, columns do not need to be specified in advance. As soon as we PUT (insert) a row into HBase, that column is implicitly created. This allows HBase to be a “semi-structured” database by giving it the flexibility to add columns on the fly, rather than declaring them when the table is initially created.
Columns are grouped into Column Families. Column Families define storage attributes for their Columns (compression, number of versions, etc.). They must be declared when a Table is created, and their names must be printable characters. All elements of a Column Family are stored together on the file system. It is also important to keep the number of Column Families relatively small (we will see the reason for this in a future post).
At the intersection of a Row, a Column Family, and a Column is a Cell. Each Cell contains a value and a version (usually a timestamp). HBase allows the client to store many versions of a single Cell, so data that spans a time period can be modeled easily with HBase. Null values are not stored in Cells (see the “Sparse” section above).
#Putting it all together
Overall, the data model of HBase is a multi-dimensional key-value store. If you remember one thing from this post, it should be this:
(Table, RowKey, Family, Column, Timestamp) -> Value
Or, if you like to think in terms of Java generics:
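One plausible rendering as nested generic maps (a sketch; the type parameter names are illustrative, not an official definition):

```java
SortedMap<RowKey, SortedMap<ColumnFamily, SortedMap<Column, SortedMap<Timestamp, Value>>>>
```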