What is HBase?

HBase is:

A “sparse, distributed, consistent, multi-dimensional, sorted map”

We will look at what each of these terms means below. HBase is modeled on Google’s Bigtable and is currently an Apache top-level project. It provides random read/write access to data stored in HDFS (Hadoop Distributed File System), building on the capabilities provided by Hadoop. In a future post, we will look at the architecture of how HBase stores data. This post is a high-level introduction to the data model used by HBase.

We will start by looking at what each of the terms in the above quote means, and then describe the data model using terms that we are already familiar with.

“Map”

At its core, HBase is a mapping of keys to values. It serves one of the most basic functions of a data store. It stores values, indexed by a key. It retrieves values, given a key.

“Sorted”

HBase guarantees that cells are stored in lexicographic order of their keys. This allows for fast range queries (for example, we can ask HBase to return all values with keys from k1 to k4). In contrast, relational databases generally provide no such guarantee about the sort order of their rows.
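To make the range query concrete, here is a minimal sketch that uses a plain `java.util.TreeMap` as a stand-in for a sorted key space. It is not the HBase client API; the class name `SortedRangeSketch`, the helper methods and the keys `k1` to `k5` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class SortedRangeSketch {
    // Returns the values whose keys fall in [from, to], in key order.
    static List<String> range(TreeMap<String, String> store, String from, String to) {
        // subMap only walks the requested slice of the sorted keys.
        return new ArrayList<>(store.subMap(from, true, to, true).values());
    }

    // Hypothetical sample data, inserted out of order on purpose.
    static TreeMap<String, String> sample() {
        TreeMap<String, String> store = new TreeMap<>();
        store.put("k5", "v5");
        store.put("k2", "v2");
        store.put("k4", "v4");
        store.put("k1", "v1");
        store.put("k3", "v3");
        return store;
    }

    public static void main(String[] args) {
        System.out.println(range(sample(), "k1", "k4")); // [v1, v2, v3, v4]
    }
}
```

Because the keys are kept sorted, the range lookup never has to inspect `k5`; an unsorted store would have to check every key.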

“Multi-dimensional”

The key in HBase is actually made up of several parts: row key, column family, column and timestamp. Timestamp is the killer feature of HBase. It provides a way to store several versions of a value, which makes HBase a good choice for storing time series data. The key-value pair model looks like this now:

(row key, column family, column, timestamp) -> value
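One way to picture this composite key is as a small Java class with a comparator. The sketch below is an in-memory illustration only: `CellKeySketch` and its fields are hypothetical names, and the ordering (row, family and column ascending, newest timestamp first) is stated as an assumption of the sketch rather than read from HBase's source.

```java
import java.util.Comparator;
import java.util.TreeMap;

final class CellKeySketch {
    final String row;
    final String family;
    final String column;
    final long timestamp;

    CellKeySketch(String row, String family, String column, long timestamp) {
        this.row = row;
        this.family = family;
        this.column = column;
        this.timestamp = timestamp;
    }

    // Assumed ordering: row, family, column ascending, then newest timestamp first.
    static final Comparator<CellKeySketch> ORDER =
        Comparator.comparing((CellKeySketch k) -> k.row)
            .thenComparing(k -> k.family)
            .thenComparing(k -> k.column)
            .thenComparing(
                Comparator.comparingLong((CellKeySketch k) -> k.timestamp).reversed());

    // Stores two versions of the same cell and returns the one that sorts first.
    static String newest() {
        TreeMap<CellKeySketch, String> cells = new TreeMap<>(ORDER);
        cells.put(new CellKeySketch("row-1", "fam1", "col1", 100L), "old");
        cells.put(new CellKeySketch("row-1", "fam1", "col1", 200L), "new");
        // The two keys differ only in timestamp, so both versions are kept.
        return cells.firstEntry().getValue();
    }
}
```

Calling `CellKeySketch.newest()` returns `"new"`: the two entries share a row, family and column, and the later timestamp sorts first.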

“Sparse”

HBase is a sparse data store in that it stores nothing for empty/null values. There is no cell for a column without a value. In HBase, null values are free to store.
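A nested map makes the "nothing is stored for nulls" idea easy to see. This is a sketch under simplifying assumptions (string keys and values, no families or timestamps), and `SparseTableSketch` is a hypothetical name, not part of HBase.

```java
import java.util.Map;
import java.util.TreeMap;

final class SparseTableSketch {
    // row key -> (column -> value); absent columns simply have no entry.
    private final Map<String, Map<String, String>> rows = new TreeMap<>();

    void put(String row, String column, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>()).put(column, value);
    }

    // Returns null when the cell was never written.
    String get(String row, String column) {
        Map<String, String> columns = rows.get(row);
        return columns == null ? null : columns.get(column);
    }

    // Number of cells that actually occupy storage.
    int storedCells() {
        int count = 0;
        for (Map<String, String> columns : rows.values()) {
            count += columns.size();
        }
        return count;
    }
}
```

If we write `col1` and `col2` for one row and only `col3` for another, `storedCells()` reports 3, not the 6 slots a fixed two-by-three grid would reserve.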

“Distributed”

HBase is built for scale. Data stored in HBase can be spread over many physical machines and can comprise billions of cells. HBase sits on top of HDFS, which takes care of the distribution and replication of data. In addition to scalability, this “feature” provides protection against node failures.

“Consistent”

HBase is strongly consistent. This means that reads will always return the last written and committed value for a key. HBase guarantees that all changes within the same row are atomic.

Now that we have broken down the canonical definition of HBase, let’s take a look at some of the important terms that describe how data is stored in HBase.

Data Model

Table

The highest level of organization is the Table. This term is similar to the relational definition of the term. We organize logically independent groups of data into Tables. The diagram below shows an empty Table (we will use this diagram to iteratively build our understanding of the different terms).

Logical representation of a Table in HBase. We will build on this representation as we look at more terms.

Rows

Each Table is made up of one or more Rows. Rows provide a logical grouping of cells. Row keys are lexicographically sorted. Notice in the diagram below that ‘row-10’ is before ‘row-2’. Row keys are just arrays of bytes, which allows us to use a variety of types of data as the key. Each row will hold the data for a certain entity. The definition of a Row in HBase is similar to its relational counterpart.

Logical representation of a Table with rows in HBase.
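The ‘row-10’ before ‘row-2’ ordering can be reproduced with a byte-wise comparison. The sketch below assumes Java 9 or later (for `Arrays.compareUnsigned`) and UTF-8 encoded keys; `RowOrderSketch` and `sortedKeys` are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

final class RowOrderSketch {
    // Sorts row keys as unsigned bytes and returns them in that order.
    static List<String> sortedKeys(String... keys) {
        Comparator<byte[]> unsignedBytes = Arrays::compareUnsigned;
        TreeMap<byte[], String> rows = new TreeMap<>(unsignedBytes);
        for (String key : keys) {
            rows.put(key.getBytes(StandardCharsets.UTF_8), key);
        }
        return new ArrayList<>(rows.values());
    }

    public static void main(String[] args) {
        // '1' sorts before '2', so "row-10" lands ahead of "row-2".
        System.out.println(sortedKeys("row-1", "row-2", "row-10")); // [row-1, row-10, row-2]
        // Zero-padding the number restores the numeric order.
        System.out.println(sortedKeys("row-10", "row-02", "row-01")); // [row-01, row-02, row-10]
    }
}
```

A common way to get numeric ordering out of a lexicographic sort is to pad numeric parts of the key to a fixed width, as the second call shows.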

Columns

Each Row is made up of one or more Columns. Columns are arbitrary labels for attributes of a row. In contrast with an RDBMS, columns do not need to be specified in advance. As soon as we PUT (insert) a value into a new column, that column is implicitly created. This allows HBase to be a “semi-structured” database by giving it the flexibility to add columns on the fly, rather than declaring them when the table is initially created.

Logical representation of a Table with rows and columns in HBase. So far, our representation is similar to a relational database's table.

Column Family

Columns are grouped into Column Families. Column Families define storage attributes for their Columns (compression, number of versions, etc.). Column Families must be declared when a Table is created, and their names must consist of printable characters. All elements of a column family are stored together on the file system. It is also important to keep the number of Column Families relatively small (we will see the reason for this in a future post).

Logical representation of a Table with two column families. col1 belongs to fam1 and columns col2 and col3 belong to the family fam2.
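The difference between families (declared up front) and columns (created on first write) can be sketched in a few lines. For brevity this models a single row only, and `FamilySchemaSketch` is a hypothetical class rather than anything from the HBase client library.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class FamilySchemaSketch {
    // family -> (column -> value); the set of families is fixed at construction.
    private final Map<String, Map<String, String>> families = new HashMap<>();

    FamilySchemaSketch(String... declaredFamilies) {
        for (String family : declaredFamilies) {
            families.put(family, new TreeMap<>());
        }
    }

    // Writes a value; the column is created implicitly, the family is not.
    void put(String family, String column, String value) {
        Map<String, String> columns = families.get(family);
        if (columns == null) {
            throw new IllegalArgumentException("Unknown column family: " + family);
        }
        columns.put(column, value);
    }

    // Columns that currently exist in a declared family.
    Set<String> columnsOf(String family) {
        return families.get(family).keySet();
    }
}
```

With families `fam1` and `fam2` declared, writing to `fam2:col2` and `fam2:col3` just works, while a write to an undeclared `fam3` is rejected.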

Cells

At the intersection of a Row, Column Family, and Column is a Cell. Each cell contains a value and a version (usually a timestamp). HBase allows the client to store many versions of a single cell, so data that spans a time period can be modeled easily with HBase. Null values are not stored in Cells (see “Sparse” section above).

Logical representation of a Table with 7 non-empty cells. A few cells contain several versions. The large 'DO NOT ENTER' signs represent the fact that no storage space is wasted in storing NULL values. Those cells are not stored in HBase.
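A single versioned cell can be modeled as a small sorted map from timestamp to value. This is a sketch, not HBase's storage format: `VersionedCellSketch`, `maxVersions` and `asOf` are names invented here, and the trimming policy (drop the oldest version once the limit is exceeded) is an assumption made for the example.

```java
import java.util.Map;
import java.util.TreeMap;

final class VersionedCellSketch {
    private final int maxVersions;
    // timestamp -> value, ordered from oldest to newest.
    private final TreeMap<Long, String> versions = new TreeMap<>();

    VersionedCellSketch(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    void put(long timestamp, String value) {
        versions.put(timestamp, value);
        // Drop the oldest versions once the limit is exceeded.
        while (versions.size() > maxVersions) {
            versions.pollFirstEntry();
        }
    }

    // Most recent value, or null if the cell was never written.
    String latest() {
        Map.Entry<Long, String> newest = versions.lastEntry();
        return newest == null ? null : newest.getValue();
    }

    // Newest value written at or before the given timestamp, or null.
    String asOf(long timestamp) {
        Map.Entry<Long, String> entry = versions.floorEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }
}
```

With `maxVersions` set to 2 and writes at timestamps 100, 200 and 300, `latest()` returns the value written at 300, `asOf(250)` returns the value written at 200, and `asOf(150)` returns null because the version at 100 has been trimmed.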

Putting it all together

Overall, the data model of HBase is a multi-dimensional key-value store. If you remember one thing from this post, it should be this:

(Table, RowKey, Family, Column, Timestamp) -> Value

Or, if you like to think in terms of Java generics:

Representing HBase's Data Model using Java Generics. The arrows associate each data structure with their HBase equivalent, from above.
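For readers without the figure, the nested-map view can be written out roughly as follows. This is one common way to express the idea and may differ in detail from the original diagram; `DataModelSketch` and its methods are hypothetical, and keys and values are simplified to `String` and `Long` where HBase itself works with bytes.

```java
import java.util.Comparator;
import java.util.SortedMap;
import java.util.TreeMap;

final class DataModelSketch {
    // Table: RowKey -> Family -> Column -> Timestamp -> Value
    final SortedMap<String,
            SortedMap<String,
                SortedMap<String,
                    SortedMap<Long, String>>>> table = new TreeMap<>();

    void put(String rowKey, String family, String column, long timestamp, String value) {
        table.computeIfAbsent(rowKey, k -> new TreeMap<>())
             .computeIfAbsent(family, k -> new TreeMap<>())
             // Timestamps are kept newest first within a cell.
             .computeIfAbsent(column, k -> new TreeMap<Long, String>(Comparator.reverseOrder()))
             .put(timestamp, value);
    }

    // Latest value of a cell, or null if the cell does not exist.
    String getLatest(String rowKey, String family, String column) {
        SortedMap<String, SortedMap<String, SortedMap<Long, String>>> families = table.get(rowKey);
        if (families == null) return null;
        SortedMap<String, SortedMap<Long, String>> columns = families.get(family);
        if (columns == null) return null;
        SortedMap<Long, String> versions = columns.get(column);
        if (versions == null || versions.isEmpty()) return null;
        return versions.get(versions.firstKey());
    }
}
```

Each level of nesting corresponds to one part of the key: the outer map is the Table keyed by RowKey, followed by Family, Column, and finally Timestamp mapping to the Value.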
