Introduction to the data

In celebration of Super Bowl Sunday, this post will examine how HBase can be used to model NFL play-by-play data.

We will be using data from the 2002 season through the 2012 season. The source of the data is CSV files provided by the awesome people at Advanced NFL Stats. Each file contains data for a single season, and the format of the data is as follows:

gameid|qtr|min|sec|off|def|down|togo|ydline|description|offscore|defscore|season

The gameid column contains the date of the game and the two teams competing. The description column contains a short natural-language blurb about the play. The rest of the columns are self-explanatory. Here is a snippet of a sample file from the 2012 season (which also happens to contain Robert Griffin III’s first plays in the NFL):

20120909_WAS@NO,1,-5,0,WAS,NO,,,69,B.Cundiff kicks 65 yards from WAS 35 to end zone Touchback.,0,0,2012
20120909_WAS@NO,1,60,0,NO,WAS,1,10,80,(15:00) D.Brees pass incomplete short left to M.Colston.,0,0,2012
20120909_WAS@NO,1,59,57,NO,WAS,2,10,80,(14:57) D.Brees pass incomplete short left to L.Moore [A.Carriker].,0,0,2012
20120909_WAS@NO,1,59,52,NO,WAS,3,10,80,(14:52) (Shotgun) D.Brees pass incomplete short middle (S.Bowen).,0,0,2012
20120909_WAS@NO,1,59,47,NO,WAS,4,10,80,(14:47) T.Morstead punts 59 yards to WAS 21 Center-J.Drescher. B.Banks pushed ob at WAS 32 for 11 yards (S.Shanle).,0,0,2012
20120909_WAS@NO,1,59,34,WAS,NO,1,10,68,(14:34) (Shotgun) R.Griffin pass short left to P.Garcon to WAS 32 for no gain (C.White). Pass -4 YAC 4,0,0,2012
20120909_WAS@NO,1,58,58,WAS,NO,2,10,68,(13:58) (Shotgun) R.Griffin right end to WAS 44 for 12 yards (S.Shanle).,0,0,2012
20120909_WAS@NO,1,58,25,WAS,NO,1,10,56,(13:25) (Shotgun) R.Griffin pass short left to P.Garcon to NO 44 for 12 yards (R.Harper). Pass -2 YAC 14,0,0,2012
20120909_WAS@NO,1,57,44,WAS,NO,1,10,44,(12:44) (Shotgun) R.Griffin pass short right to P.Garcon to NO 35 for 9 yards (C.Lofton; C.White).,0,0,2012
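
The gameid packs the game date and the two competing teams into one string. Here is a small Ruby sketch of pulling those parts apart (parse_gameid is a hypothetical helper, not part of the import script shown later):

```ruby
# Split a gameid like '20120909_WAS@NO' into its parts.
# In 'WAS@NO', WAS is the visiting team and NO is the home team.
def parse_gameid(gameid)
  date, matchup = gameid.split('_')
  away, home = matchup.split('@')
  { date: date, away: away, home: home }
end

game = parse_gameid('20120909_WAS@NO')
# game => {:date=>"20120909", :away=>"WAS", :home=>"NO"}
```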

Why HBase?

In a previous post, we looked at HBase’s data model. One of the killer features built into HBase is the ability to store versioned data, so data that spans a time series can be modeled easily. NFL play-by-play data fits nicely with this model, since each play of a game can be considered a time slice that captures the state of the game at that instant. This means that we can replace the timestamp field with the play number: an integer from 1 to n, where n is the total number of plays in the game.
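
The versioned-cell idea can be sketched in plain Ruby (a toy in-memory model, not the HBase API): one column keeps every play as a version, and a read without an explicit version returns the most recent one.

```ruby
# cells maps 'family:qualifier' => { version => value }
cells = Hash.new { |h, k| h[k] = {} }

plays = ['Kickoff', 'Pass incomplete', 'Touchdown']
plays.each_with_index do |desc, i|
  cells['play:description'][i + 1] = desc   # version = play number
end

# Reading without a version gives back the highest (latest) one.
latest_version = cells['play:description'].keys.max
cells['play:description'][latest_version]   # => "Touchdown"
```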

Another reason for using HBase is its innate ability to scale for large amounts of data. We will be storing 11 seasons, each with 269 games (regular season and playoffs) and around 175 plays per game. That comes to around 517,825 data points. This may not seem like much, but if we start adding data from games before 2002 and keep adding data through the current season, we will soon be glad we chose a scalable data store like HBase.
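
A quick back-of-the-envelope check of that estimate:

```ruby
seasons = 11
games_per_season = 269   # regular season + playoffs
plays_per_game = 175     # rough average
total = seasons * games_per_season * plays_per_game
total  # => 517825
```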

Schema

We won’t spend too much time discussing the HBase schema decisions; the HBase documentation does a good job of explaining the different schema design considerations. We will use the gameid (see above) as the row key, which lets us store all the data for a game in a single row. We will use a single column family called ‘play’. Our columns will be the different attributes of a single play (for example, yard line or offense score). As discussed above, we will use the play number as the timestamp for each column. Here is a diagram that more succinctly describes our schema:

Our schema for modeling NFL plays in HBase.
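
The same picture can be written down as a nested Ruby hash (a toy illustration of the logical layout, not how HBase stores bytes); the values come from the first two sample rows above:

```ruby
# row key -> 'family:qualifier' -> version -> value
schema_example = {
  '20120909_WAS@NO' => {                            # row key = gameid
    'play:ydline'   => { 1 => '69', 2 => '80' },    # version = play number
    'play:offscore' => { 1 => '0',  2 => '0' }
  }
}
schema_example['20120909_WAS@NO']['play:ydline'][2]  # => "80"
```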

Table creation & Data import

Now that we have looked at the data we will be storing and the schema we will follow, let’s finally create our table and put some data in it. There are several APIs and libraries in different languages for interacting with HBase. We will mostly be using the HBase shell, a JRuby IRB-based shell that lets us run simple data management and manipulation commands. To import data from the CSV files into HBase, we will use JRuby, which is the “Ruby Language on the JVM”.

Let’s start by creating the table. The command below creates a new table called ‘nflplays’ with a single column family called ‘play’, specifying that we want to store ALL_VERSIONS (2147483647) of a column. This may seem like overkill, since we should never have more than ~300 plays in a game, but I am doing this to illustrate how we can use constants from HConstants when defining table characteristics.

create 'nflplays', {NAME=> 'play', VERSIONS=>org.apache.hadoop.hbase.HConstants::ALL_VERSIONS}

Next, we use the describe command to verify the table we just created. The output shows a few of the table’s characteristics; the ‘ENABLED’ column confirms that the table is enabled and ready for use.

hbase(main):002:0> describe 'nflplays'
DESCRIPTION                                                                                                                                              ENABLED                                                                             
 'nflplays', {NAME => 'play', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '2147483647', COMPRESSION => ' true                                                                                
 NONE', MIN_VERSIONS => '0', TTL => '2147483647', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', IN_MEMORY => 'false', ENCODE_ON_DISK => 'true', B                                                                                     
 LOCKCACHE => 'true'}                                                                                                                                                                                                                        
1 row(s) in 1.5100 seconds

Now that we have our table created, we can finally start importing the data from the CSV files into HBase. For this, I’ve written up this crude script in JRuby that goes through each line and parses out the different elements. Here it is:

include Java
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Put

def jbytes( *args )
  args.map { |arg| arg.to_s.to_java_bytes }
end

def parse_row(row)
  map = {}
  values = row.split(',')
  map['gameid'] = values[0]
  map['qtr'] = values[1]
  map['min'] = values[2]
  map['sec'] = values[3]
  map['off'] = values[4]
  map['def'] = values[5]
  map['down'] = values[6]
  map['togo'] = values[7]
  map['ydline'] = values[8]
  #The csv file contains some weird characters at the Game over plays, this little hack is to avoid parsing that
  map['description'] = (map['qtr'].to_i >= 4 and map['min'].to_i == 0 and map['down'].empty? and map['togo'].empty?) ? "Game Over" : values[9]
  map['offscore'] = values[10]
  map['defscore'] = values[11]
  map['season'] = values[12]
  return map
end

def put_into_hbase(document, play_number)
  # @hbase is provided when this script is run from within the HBase shell
  table = HTable.new(@hbase.configuration, 'nflplays')
  table.setAutoFlush(false)
  document.each do |key, value|
    if !value.empty?
      rowkey = document['gameid'].to_java_bytes
      ts = play_number

      p = Put.new(rowkey, ts)
      p.add(*jbytes("play", key, value))
      table.put(p)

      puts play_number.to_s + ":" + key.to_s + ":" + value.to_s
    end
  end
  table.flushCommits()
end

count = 1
seen_before = {}
File.open('2012.csv', 'r:utf-8').each_line do |row|
  data = parse_row(row.strip!)
  if !seen_before.has_key?(data['gameid'])
    count = 1
    seen_before[data['gameid']] = true
  end

  put_into_hbase(data, count)
  count += 1
end
exit

A few things to notice:

  • With JRuby, we can import classes from Java using familiar import statements
  • Notice that each record is inserted with a Put object.
  • We set autoFlush to false, which buffers Puts in a client-side write buffer. We force a flush of the buffer using flushCommits(). See the documentation for more information.
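
A minimal sketch of the buffering behavior the last bullet describes (BufferedTable is a made-up stand-in, not the HBase client): puts accumulate locally and are sent in one batch on flush.

```ruby
class BufferedTable
  attr_reader :buffer, :flushed

  def initialize
    @buffer  = []   # client-side write buffer
    @flushed = []   # stands in for rows sent to the server
  end

  def put(p)
    @buffer << p            # no round trip yet
  end

  def flush_commits
    @flushed.concat(@buffer)  # one batched round trip
    @buffer.clear
  end
end

t = BufferedTable.new
3.times { |i| t.put("put-#{i}") }
t.buffer.size    # => 3 (nothing sent yet)
t.flush_commits
t.flushed.size   # => 3
```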

After running this script, we can view some of the data that has been inserted into HBase. First, we count the number of rows we have inserted. (I was running HBase on a micro EC2 instance and did not want to strain my already scarce resources, so I only imported the first 8 games from 2012.)

hbase(main):009:0> count 'nflplays'
8 row(s) in 0.0260 seconds

We can scan all the rows that we have imported using the ‘scan’ command in the HBase shell. Here is a snippet of the output I get:

ROW                                                          COLUMN+CELL                                                                                                                                                                     
 20120909_STL@DET                                            column=play:def, timestamp=158, value=DET                                                                                                                                       
 20120909_STL@DET                                            column=play:defscore, timestamp=158, value=27                                                                                                                                   
 20120909_STL@DET                                            column=play:description, timestamp=158, value=Game Over                                                                                                                         
 20120909_STL@DET                                            column=play:down, timestamp=157, value=1                                                                                                                                        
 20120909_STL@DET                                            column=play:gameid, timestamp=158, value=20120909_STL@DET                                                                                                                       
 20120909_STL@DET                                            column=play:min, timestamp=158, value=0                                                                                                                                         
 20120909_STL@DET                                            column=play:off, timestamp=158, value=STL                                                                                                                                       
 20120909_STL@DET                                            column=play:offscore, timestamp=158, value=23                                                                                                                                   
 20120909_STL@DET                                            column=play:qtr, timestamp=158, value=4                                                                                                                                         
 20120909_STL@DET                                            column=play:season, timestamp=158, value=2012                                                                                                                                   
 20120909_STL@DET                                            column=play:sec, timestamp=158, value=2                                                                                                                                         
 20120909_STL@DET                                            column=play:togo, timestamp=157, value=10                                                                                                                                       
 20120909_STL@DET                                            column=play:ydline, timestamp=158, value=61                                                                                                                                     
 20120909_WAS@NO                                             column=play:def, timestamp=190, value=WAS                                                                                                                                       
 20120909_WAS@NO                                             column=play:defscore, timestamp=190, value=40                                                                                                                                   
 20120909_WAS@NO                                             column=play:description, timestamp=190, value=Game Over                                                                                                                         
 20120909_WAS@NO                                             column=play:down, timestamp=189, value=2                                                                                                                                        
 20120909_WAS@NO                                             column=play:gameid, timestamp=190, value=20120909_WAS@NO                                                                                                                        
 20120909_WAS@NO                                             column=play:min, timestamp=190, value=0                                                                                                                                         
 20120909_WAS@NO                                             column=play:off, timestamp=190, value=NO                                                                                                                                        
 20120909_WAS@NO                                             column=play:offscore, timestamp=190, value=32                                                                                                                                   
 20120909_WAS@NO                                             column=play:qtr, timestamp=190, value=4                                                                                                                                         
 20120909_WAS@NO                                             column=play:season, timestamp=190, value=2012                                                                                                                                   
 20120909_WAS@NO                                             column=play:sec, timestamp=190, value=1                                                                                                                                         
 20120909_WAS@NO                                             column=play:togo, timestamp=189, value=10                                                                                                                                       
 20120909_WAS@NO                                             column=play:ydline, timestamp=190, value=39       

Here we see the last play of two games. Since I didn’t specify how many versions to return, the scan only returns the most recent version (the last play of each game). We also see each column of a single row. Notice that the timestamp here is the play number (see above for why this is the case). If we wanted to see only the play descriptions for a single game, we can use this query:

get 'nflplays', '20120909_STL@DET', {COLUMN => 'play:description', VERSIONS => 200}

Let’s take a look at the Washington vs. New Orleans game: notice that some columns (like ‘togo’) only have 189 versions, whereas the total number of plays is 190. This indicates that for play 190, we did not have a value for ‘togo’. This illustrates the ‘sparse’ property we looked at in the previous post.

Analyzing the data

Now that we have our data in HBase, we can run a few queries on that data. Here is a simple Java program that scans the table for all plays that contain the word ‘TOUCHDOWN’ in the description. You can imagine how we can extend this program to create more complex queries, using different Filters.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.Filter;
import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.filter.ValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class Touchdowns {
  public static void main(String[] args) throws IOException {
    Configuration config = HBaseConfiguration.create();
    HTable table = new HTable(config, "nflplays");

    Scan s = new Scan();
    s.setMaxVersions();
    s.addColumn(Bytes.toBytes("play"), Bytes.toBytes("description"));

    Filter tdFilter = new ValueFilter(CompareFilter.CompareOp.EQUAL, new RegexStringComparator(".*TOUCHDOWN.*"));
    s.setFilter(tdFilter);
 
    ResultScanner scanner = table.getScanner(s);
    try {
      for (Result result : scanner) {
        for (KeyValue kv : result.raw()) {
          System.out.println("KV: " + kv + ", Value: " + Bytes.toString(kv.getValue()));
        }
      }
    } 
    finally {
      scanner.close();
    }
  }
}
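
The ValueFilter above keeps only cells whose value matches the regular expression. The same selection can be expressed over a handful of made-up descriptions in plain Ruby:

```ruby
descriptions = [
  'R.Griffin pass deep left to P.Garcon for 88 yards TOUCHDOWN.',
  'D.Brees pass incomplete short left to M.Colston.',
  'T.Morstead punts 59 yards to WAS 21.'
]

# Keep only the plays whose description matches the regex.
touchdowns = descriptions.grep(/TOUCHDOWN/)
touchdowns.size  # => 1
```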
