What is HBase?

HBase is:

A “sparse, distributed, consistent, multi-dimensional, sorted map”

We will look at what each of these terms means below. HBase is based on Google’s BigTable and is currently an Apache top-level project. It provides random read/write access to data stored in HDFS (Hadoop Distributed File System), and it leverages the capabilities provided by Hadoop and HDFS. In a future post, we will look at the architecture of how HBase stores data. This post will be more of a high-level introduction to the data model used by HBase.

We will start by looking at what each of the terms in the above quote mean and understand the data model using terms that we are already familiar with.

“Map”

At its core, HBase is a mapping of keys to values. It serves one of the most basic functions of a data store. It stores values, indexed by a key. It retrieves values, given a key.

“Sorted”

HBase guarantees that cells are stored in lexicographic order of their keys. This allows for fast range queries (for example, we can ask HBase to return all values with keys from k1 to k4). In contrast, relational databases provide no such guarantee about the sort order of their rows.
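As a rough illustration (not from the original post), here is what such a range query looks like with the older 0.9x-era HBase Java client. The table name, row keys, and the conf configuration object are made-up assumptions:

HTable table = new HTable(conf, "mytable"); // conf is an existing HBase Configuration
Scan scan = new Scan(Bytes.toBytes("k1"), Bytes.toBytes("k4")); // start row inclusive, stop row exclusive
ResultScanner scanner = table.getScanner(scan);
for (Result result : scanner) {
  System.out.println(Bytes.toString(result.getRow())); // rows come back in sorted key order
}
scanner.close();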

“Multi-dimensional”

The key in HBase is actually made up of several parts: row key, column family, column and timestamp. The timestamp is a killer feature of HBase. It provides a way to store several versions of a cell, which makes it a good choice for storing time-series data. The key-value pair model looks like this now:

(row key, column family, column, timestamp) -> value
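As a sketch of what writing and reading versioned cells looks like with the 0.9x-era HBase client (the row, family, column, and timestamp values are made up, and table is assumed to be an existing HTable instance):

// Write one cell with an explicit timestamp
Put put = new Put(Bytes.toBytes("row-1"));
put.add(Bytes.toBytes("fam1"), Bytes.toBytes("col1"), 1372000000000L, Bytes.toBytes("value1"));
table.put(put);

// Read back up to 3 versions of each cell in the row
Get get = new Get(Bytes.toBytes("row-1"));
get.setMaxVersions(3);
Result result = table.get(get);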

“Sparse”

HBase is a sparse data store in that it stores nothing for empty/null values. There is no cell for a column without a value. In HBase, null values are free to store.

“Distributed”

HBase is built for scale. Data stored in HBase can be spread over many physical machines, and a single table can hold billions of cells. HBase sits on top of HDFS, which takes care of the distribution and replication of data. In addition to scalability, this “feature” provides protection against node failures.

“Consistent”

HBase is strongly consistent. This means that reads will always return the last written and committed value for a key. HBase guarantees that all changes within the same row are atomic.

Now that we have broken down the canonical definition of HBase, let’s take a look at some of the important terms that describe how data is stored in HBase.

Data Model

Table

The highest level of organization is the Table. The term is similar to its relational counterpart: we organize logically independent groups of data into Tables. The diagram below shows an empty Table (we will use this diagram to iteratively build our understanding of the different terms).

Logical representation of a Table in HBase. We will build on this representation as we look at more terms.

Rows

Each Table is made up of 1 or more Rows. Rows provide a logical grouping of cells. Row keys are sorted lexicographically; notice in the diagram below that ‘row-10’ comes before ‘row-2’. Row keys are just bytes, which allows us to use a variety of data types as the key. Each row holds the data for a certain entity. The definition of a Row in HBase is similar to its relational counterpart.

Logical representation of a Table with rows in HBase.

Columns

Each Row is made up of 1 or more Columns. Columns are arbitrary labels for attributes of a row. In contrast with an RDBMS, columns do not need to be specified in advance; as soon as we PUT (insert) a value for a new column into HBase, that column is implicitly created. This allows HBase to be a “semi-structured” database by giving it the flexibility to add columns on the fly, rather than declaring them when the table is initially created.

Logical representation of a Table with rows and columns in HBase. So far, our representation is similar to a relational database's table

Column Family

Columns are grouped into Column Families. Column Families define storage attributes for their Columns (compression, number of versions, etc). Column Families must be declared when a Table is created (see the sketch below the diagram), and their names must be printable characters. All elements of a column family are stored together on the file system. It is also important to limit the number of Column Families to a relatively small number (we will see the reason for this in a future post).

Logical representation of a Table with two column families. col1 belongs to fam1 and columns col2 and col3 belong to the family fam2.
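For illustration, here is a rough sketch of declaring column families at table-creation time using the 0.9x-era HBase admin API; the table and family names, and the conf object, are assumptions for this example:

HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor tableDescriptor = new HTableDescriptor("mytable");
tableDescriptor.addFamily(new HColumnDescriptor("fam1"));
tableDescriptor.addFamily(new HColumnDescriptor("fam2")); // keep the number of families small
admin.createTable(tableDescriptor);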

Cells

At the intersection of a Row, Column Family, and Column is a Cell. Each cell contains a value and a version (usually a timestamp). HBase allows the client to store many versions of a single cell, so data that spans a time period can be modeled easily with HBase. Null values are not stored in Cells (see the “Sparse” section above).

Logical representation of a Table with 7 non-empty cells. A few cells contain several versions. The large 'DO NOT ENTER' signs represent the fact that no storage space is wasted in storing NULL values. Those cells are not stored in HBase.

Putting it all together

Overall, the data model of HBase is a multi-dimensional key-value store. If you remember one thing from this post, it should be this:

(Table, RowKey, Family, Column, Timestamp) -> Value

Or, if you like to think in terms of Java generics:

Representing HBase's Data Model using Java Generics. The arrows associate each data structure with their HBase equivalent, from above.
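Roughly, the nesting that the diagram depicts can be written out like this (the type names below are conceptual placeholders for illustration, not actual HBase classes):

// A Table, pictured as nested sorted maps:
SortedMap<RowKey,                       // row key (sorted lexicographically)
    SortedMap<ColumnFamily,             // column family
        SortedMap<ColumnQualifier,      // column
            SortedMap<Timestamp, Value> // timestamp -> value
        >>> table;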

In this post, I will go through a quick overview of a few uses of Google’s Guava library. When I entered the professional world, I used Guava pretty heavily, and the following post includes some of the notes I took when learning about Guava’s many useful tools.

What is Guava?

Guava is an open-source library for the Java language that provides several utilities to make a programmer’s job easier. According to Jared Levy (one of Guava’s creators), “The library’s functionality simplifies your code so it’s easier to write, read, and maintain”. This post is designed to introduce some useful aspects of Guava and expose some gotchas that you might come across. Guava includes utilities for Collections, Concurrency, Primitives, Reflection, Comparison, I/O, Hashing, Networking, Strings, Math, in-memory caching, and various basic data types.

Useful utility classes

Preconditions: Used to validate assumptions at the start of methods or constructors (and fail-fast if they aren’t valid)

public Person(Name name) {
  this.name = Preconditions.checkNotNull(name); // throws NullPointerException if name is null
}
 
public void eat(Food food) {
  Preconditions.checkArgument(!food.isPoisonous(), "Food must not be poisonous"); // throws IllegalArgumentException if the check fails
  Preconditions.checkState(this.isHungry(), "Person must be hungry"); // throws IllegalStateException if the check fails
  ...
}

Objects: simplify writing equals(), toString(), hashCode() for a class

// Objects.equal provides null-sensitive comparison 
Objects.equal("a", "a"); // returns true
Objects.equal(null, "a"); // returns false
Objects.equal("a", null); // returns false
Objects.equal(null, null); // returns true
 
// Objects.hashCode provides an order-sensitive hashCode method for a variable number of args. Uses Arrays.hashCode
Objects.hashCode(person.getName(), person.getAge());
 
// Objects.toStringHelper provides an easy, clean way to implement a class' toString method,
// with overloaded add() methods for each primitive data type
Objects.toStringHelper("Person")
       .add("age", 43)
       .add("hungry", null)
       .omitNullValues()
       .toString(); // returns "Person{age=43}"

NOTE: Java 7 provides a new Objects class that provides similar functionality.

Strings: simplify String joining, splitting, matching

// Joiner provides utility methods to join pieces of text into a String
static final Joiner JOINER = Joiner.on("-").skipNulls();
return JOINER.join("Me", null, "Myself", "I"); // returns "Me-Myself-I"
 
// Splitter provides utility methods to split a given String into pieces of text split by a given separator
static final Splitter SPLITTER = Splitter.on(',').trimResults().omitEmptyStrings();
return SPLITTER.split("me,myself,,    I"); // returns an Iterable of ["me", "myself", "I"]
 
// CharMatcher has several common matchers and methods to [trim|replace|remove|retain|collapse] occurrences of a match in a String
String onlyDigits = CharMatcher.DIGIT.retainFrom(string); // keep only the digits
String noDigits = CharMatcher.JAVA_DIGIT.replaceFrom(string, "*"); // replace all digits with stars

Functional Programming: Functions and Predicates that can be used to simulate first-class functions in Java

// Function<F, T>: Override the apply method to provide a one-way transformation of F to T
static final Function<String, Integer> LENGTH = new Function<String, Integer>() {
  public Integer apply(String string) {
    return string.length();
  }
};
 
// Predicate<F>: Override the apply method to determine whether a given F satisfies the predicate
static final Predicate<String> ALL_CAPS = new Predicate<String>() {
  public boolean apply(String string) {
    return CharMatcher.JAVA_UPPER_CASE.matchesAllOf(string);
  }
};
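These Functions and Predicates become useful when combined with Guava’s collection utilities. Here is a minimal sketch (the sample list is made up; LENGTH and ALL_CAPS are the instances defined above):

List<String> words = Lists.newArrayList("HBASE", "guava", "LUCENE");
 
// Lazily apply the Function to every element
Iterable<Integer> lengths = Iterables.transform(words, LENGTH); // [5, 5, 6]
 
// Keep only the elements matching the Predicate
Iterable<String> shouting = Iterables.filter(words, ALL_CAPS); // [HBASE, LUCENE]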

IntMath/LongMath

Guava provides utility methods for performing int, long, BigInteger operations in conveniently named IntMath, LongMath, and BigIntegerMath classes. IntMath and LongMath provide overflow-checked add, subtract, multiply and divide methods. These methods fail-fast on overflow. Here are some examples of using the utility methods:

int x = 3;
int y = 45;
long l = 21520;
 
IntMath.factorial(x);     // returns 6
IntMath.gcd(x, y);        // returns 3
IntMath.isPowerOfTwo(x);  // returns false
IntMath.checkedAdd(x, y); // returns 48; throws ArithmeticException on overflow
LongMath.mod(l, 10);      // returns 0

Guava Pitfalls/Gotchas

  • Joiner/Splitter instances are always immutable. Configuration methods like skipNulls() return a new instance rather than modifying the one they are called on:
 
// This has no effect:
Joiner joiner = Joiner.on(',');
joiner.skipNulls(); // does nothing! The returned Joiner is discarded
return joiner.join("bad", null, "wrong"); // throws NullPointerException on the null
 
// Do this instead:
Joiner joiner = Joiner.on(',').skipNulls();
 
  • Guava has several methods that return views. A view is backed by the underlying collection, so changes to the underlying collection show up in the view (see the sketch below).
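A minimal sketch of the view behavior, reusing the LENGTH Function defined earlier (the sample data is made up):

List<String> names = Lists.newArrayList("alice", "bob");
List<Integer> lengths = Lists.transform(names, LENGTH); // a lazy view, not a copy: [5, 3]
 
names.add("carol");
System.out.println(lengths); // the view reflects the change: [5, 3, 5]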

In this post, I will go through a demo of using Lucene’s simple API for indexing and searching Tweets. We will be indexing Tweets from the Sentiment140 Tweet corpus. This dataset provides the following data points for each Tweet:

  1. the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
  2. the id of the tweet (2087)
  3. the date of the tweet (Sat May 16 23:58:44 UTC 2009)
  4. the query (lyx). If there is no query, then this value is NO_QUERY.
  5. the user that tweeted (robotickilldozr)
  6. the text of the tweet (Lyx is cool)

Here is an example of the tweets in the file:

"0","1468051743","Mon Apr 06 23:27:33 PDT 2009","NO_QUERY","someone81","Grr not down to go to school today"
"0","1468011579","Tue Dec 26 12:29:11 PDT 2008","NO_QUERY","some1else","Another tweet"

At this point, we can see that these columns cleanly map to Fields that Lucene will end up indexing. See the previous Introduction to Lucene post for more information about Fields.

Indexing

Creating the IndexWriter

Let’s start by creating our IndexWriter (see below). In our example, we will be using the local File System to store our index. Another option is to store our index in main memory (RAMDirectory). This option is suitable for smaller indexes, where response time is the highest priority.

IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_44, new KeywordAnalyzer());
IndexWriter indexWriter = new IndexWriter(FSDirectory.open(indexDir), config);

Configuring Analyzers

Notice that we also configured our IndexWriter with a KeywordAnalyzer. This Analyzer generates a single token for the entire string. We can choose among many different Analyzers, based on our use case. Other standard Analyzers include the StandardAnalyzer, WhitespaceAnalyzer, StopAnalyzer and SnowballAnalyzer. You could even implement your own Analyzer to suit your use case!

Adding Documents

Configuring Fields

The first step in adding Documents is configuring our FieldType. The two main options we care about are storing and indexing. Storing the value means that the actual value will also be kept in the index, which is useful when we want to output the value when searching; these stored values can be compressed for large documents. The other option is to index the value. We can turn off indexing for a field when we know that our Lucene search queries are not going to use that field for lookups.

In our case, we want to configure all the fields to be stored and indexed. So here is how we create our FieldType:

FieldType fieldType = new FieldType();
fieldType.setStored(true);
fieldType.setIndexed(true);

Creating Fields and adding to Documents

In this next code snippet, we create our Document object and the Fields that we are interested in.

Document document = new Document();

document.add(new Field(DATE, tweetInfo[2], fieldType));
document.add(new StringField(USER, tweetInfo[4], Field.Store.YES));
document.add(new StringField(TEXT, tweetInfo[5], Field.Store.YES));

indexWriter.addDocument(document);

At this point, our index is created and we can start querying it.

Searching

Lucene provides a very robust API for constructing and executing queries on our index. The documentation provides a good introduction to the search syntax. We will look at building various queries to search our Tweet index.

Creating the IndexSearcher

This section mirrors the above “Creating the IndexWriter” section. In order to be able to run search queries on our index, we need to make use of the IndexSearcher class. Each IndexSearcher requires a pointer to the location of our index in our File System (or in memory). We also provide it a DirectoryReader object that atomically opens and reads the index. Here is how to do that:

Directory directory = FSDirectory.open(indexDir);
DirectoryReader directoryReader = DirectoryReader.open(directory);
IndexSearcher indexSearcher = new IndexSearcher(directoryReader);

Building the query

Once we have our IndexSearcher, we need to create our Query. Here is one example using a TermQuery to find all Tweets based on a certain value of a given field. In this query, we are finding all tweets by a certain user:

Term term = new Term(USER, "scotthamilton");
Query query = new TermQuery(term);
TopDocs topDocs = indexSearcher.search(query, numHits);

The TopDocs object represents the results of our query. Its ScoreDocs represent the documents that matched the query, along with the score of each match. Once we have our results, we can print out each document’s data using the IndexSearcher:

ScoreDoc[] results = topDocs.scoreDocs;
for (int i = 0; i < results.length; i++) {
  int docId = results[i].doc;
  Document foundDocument = indexSearcher.doc(docId);
  System.out.println(foundDocument.get(USER) + " : " + foundDocument.get(TEXT));
}

To see the entire source code, visit this repository.

Solr

Apache Solr is an enterprise-level HTTP search server built on top of Lucene. Solr performs all the operations of Lucene and provides additional features that are not available in Lucene. Documents (in XML, JSON, CSV or binary format) are added to the index via an HTTP POST; search results (in XML, JSON, CSV or binary format) are returned via an HTTP GET. Here are a few of the features Solr provides:

  • Full text search - Allows for complex search queries
  • Logging - Provides logging for debugging and support purposes.
  • Near real time indexing - Search results are updated soon after a document has been indexed.
  • Faceted search - Allows search results to be categorized in sub-groups.
  • Geo-spatial search - searching based on geographic location
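To tie the POST/GET point above to code, here is a rough SolrJ sketch; the server URL, field names, and sample data are assumptions for illustration:

SolrServer solr = new HttpSolrServer("http://localhost:8983/solr");
 
// Add a document to the index (sent to Solr as an HTTP POST)
SolrInputDocument doc = new SolrInputDocument();
doc.addField("id", "1");
doc.addField("text", "Lyx is cool");
solr.add(doc);
solr.commit();
 
// Search the index (an HTTP GET under the hood)
QueryResponse response = solr.query(new SolrQuery("text:cool"));
System.out.println(response.getResults().getNumFound());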

SolrCloud

SolrCloud is a distributed version of Solr. It provides distributed indexing and searching for a large-scale, fault-tolerant Solr cluster. It uses Apache ZooKeeper for cluster configuration and management. When the data being indexed is too large for a single server, SolrCloud breaks it up into shards. Shards are split portions of the entire index and can be distributed across different servers in a cluster. When adding documents to the index, SolrCloud figures out the correct shard (and machine) to which each document belongs. SolrCloud provides additional features on top of Solr, including:

  • Automatic failover - if a single node goes down, its index is replicated on a different node using a backup
  • Maintaining consistency - updates to the index must be directed to the correct shard so that a single, consistent view of each document is maintained
  • Automatic shard partitioning - SolrCloud only needs to know the number of shards and it takes care of partitioning the index; it even forwards updates to the correct shard
  • Simple Configuration - SolrCloud uses ZooKeeper for configuring the cluster, which centralizes the configuration for the cluster.

Here is a diagram showing the SolrCloud architecture (source):

Solr Architecture

Elasticsearch vs SolrCloud

Elasticsearch is another enterprise search engine built on top of Apache Lucene. It is a competitor to SolrCloud; both add features to Lucene and provide an HTTP wrapper around Lucene through which documents can be indexed and searched. Here are a few differences between Elasticsearch and SolrCloud:

  • Solr uses Zookeeper for cluster configuration, while Elasticsearch uses an internal coordination mechanism for configuration
  • Both ES and SC use the concept of sharding (partitions of Lucene index).
  • Elasticsearch uses a JSON query syntax, while Solr uses a simple key/value pair query
  • Elasticsearch’s killer feature is the Percolator. This allows the user to register certain queries to generate an alert when documents are added that match that query. Description from the documentation: “Instead of sending docs, indexing them, and then running queries, one sends queries, registers them, and then sends docs and finds out which queries match that doc.”

See this for more information about the differences between SolrCloud and Elasticsearch.

Apache Lucene™ is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

What is Lucene?

Lucene is a library that allows the user to index textual data (Word & PDF documents, emails, webpages, tweets etc). It allows you to add search capabilities to your application. There are two main steps that Lucene performs:

  1. Create an index of documents you want to search.
  2. Parse query, search index, return results.

Indexing

Lucene uses an inverted index (mapping of a term to its metadata). This metadata includes information about which files contain this term, number of occurrences etc. The fundamental units of indexing in Lucene are the Document and Field classes:

  • A Document is a container that contains one or more Fields.
  • A Field stores the terms we want to index and search on. It stores a mapping of a key (name of the field) and a value (value of the field that we find in the content).

Here is a diagram describing the steps Lucene takes when indexing content (Source: Lucene in Action, Figure 2.1).

Indexing Overview

Extracting Text

In order to be able to index documents of various types, Lucene needs to be able to extract the text from the given document into a format that it can parse. Apache Tika is one framework that parses documents and extracts text content.

Analyze

This process filters and cleans up the text data. The text goes through several steps (for example: extracting words, removing common (stop) words, lowercasing words, etc.) that convert it into tokens that can be added to the index. The indexing process results in an inverted index being stored on the underlying filesystem. See below for an example of an inverted index.
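As a rough sketch of what analysis does (shown here with Lucene 4.x's StandardAnalyzer; the sample sentence is made up):

Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_44);
TokenStream tokenStream = analyzer.tokenStream("text", new StringReader("The Quick Brown Fox"));
CharTermAttribute termAttribute = tokenStream.addAttribute(CharTermAttribute.class);
 
tokenStream.reset();
while (tokenStream.incrementToken()) {
  System.out.println(termAttribute.toString()); // prints: quick, brown, fox ("the" is removed as a stop word)
}
tokenStream.end();
tokenStream.close();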


Write Index

Lucene uses an inverted index data structure for storing the Fields we want to search on. An inverted index uses the tokens as the lookup key to find the documents which contains that token. It maps the content to its location. The index can be physically stored as part of a Directory (either in a file system or in memory).

Below is an example of an inverted index. Logically, this represents the result of the indexing process.

Inverted Index

Searching

Once our documents are indexed, we will need to add search functionality. All queries of the index are done through the IndexSearcher. Given a search expression, we create a QueryParser, parse the expression into a Query, and search the index for results. The results are returned as TopDocs, which contain ScoreDocs, which in turn contain the document IDs and confidence scores of the results that match the query. The fundamental classes for searching are:

  • IndexSearcher - Provides “read-only” access to the index. Exposes several search methods that take in a Query object and return the top n “best” results as TopDocs. This class is the counterpart to the IndexWriter class used for creating/updating indexes.
  • Term - Basic unit for searching. Counterpart to the Field object used in indexing. We create a certain Field when indexing (for example: “Name” : “Chuck Norris”) and we use Terms in a TermQuery when searching. It contains the same mapping from the name of the field to the value.
  • Query - Lucene provides several types of Queries, including TermQuery, BooleanQuery, PrefixQuery, WildcardQuery, PhraseQuery, and FuzzyQuery. Each type of query provides a unique way of searching the index.
  • QueryParser - Parses a human-readable query (for example: “opower AND arlington”) into a Query object that can be used for searching (see the sketch after this list).
  • TopDocs - Container for pointers to the N best search results. Each ScoreDoc inside it contains a document ID and a confidence score.
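Here is a rough sketch of putting these classes together with Lucene 4.x; the field name "text" and the analyzer choice are assumptions, and indexSearcher is assumed to already exist, as shown earlier in this series:

QueryParser queryParser = new QueryParser(Version.LUCENE_44, "text", new StandardAnalyzer(Version.LUCENE_44));
Query query = queryParser.parse("opower AND arlington");
 
TopDocs topDocs = indexSearcher.search(query, 10); // top 10 results
for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
  Document document = indexSearcher.doc(scoreDoc.doc);
  System.out.println(document.get("text") + " (score: " + scoreDoc.score + ")");
}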

Here is an overview of the process described above:

Searching Process

This concludes a high-level introduction to Apache Lucene. In future posts, I will explore Solr and give an example of using Lucene in a real application. The inspiration for this series is derived from a meetup of the Washington D.C. Hadoop Users Group in which Douglas Cutting spoke about Lucene.

References

McCandless, Michael; Hatcher, Erik; Gospodnetić, Otis (2010). Lucene in Action, Second Edition. Manning.