March 2016 – Robert Yokota

Recently I noticed that several NoSQL stores that claim to be multi-model data stores are implemented on top of a key-value layer. By using simple key-value pairs, such stores are able to support both documents and graphs.

A wide column store such as HBase seems like a more natural fit for a multi-model data store, since a key-value pair is just a row with a single column. There are many graph stores built on top of HBase, such as Zen, Titan, and S2Graph. However, I couldn’t find any document stores built on top of HBase. So I decided to see how hard it would be to create a document layer for HBase, which I call HDocDB.

Document databases tend to provide three different types of APIs: language-specific client APIs (MongoDB), REST APIs (CouchDB), and SQL-like APIs (Couchbase, Azure DocumentDB). For HDocDB, I decided to use a Java client API called OJAI.

One nice characteristic of HBase is that multiple operations on the same row can be performed atomically. If a document is stored in columns that all reside in the same row, then the document stays consistent even when modifications are made to several of the columns that comprise it. Many graph layers on top of HBase use a “tall table” model where edges for the same graph are stored in different rows. Since operations that span rows are not atomic in HBase, inconsistencies can arise in a graph, and these must be fixed by other means (batch jobs, MapReduce, etc.). By storing each document in a single row, situations involving inconsistent documents can be avoided.

To store a document in a single row, we use a strategy called “shredding” that was developed when researchers first attempted to store XML in relational databases. JSON is easier to store than XML, since it has no schema and document order does not need to be preserved except within arrays. For JSON, we use a technique called key-flattening that was developed for the research system Argo. When key-flattening is adapted to HBase, each scalar value in the document is stored as a separate column, with the column qualifier being the full JSON path to that value. This allows different parts of the document to be read and modified independently.
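To make key-flattening concrete, here is a minimal sketch in plain Java. It is not HDocDB's actual code: the class name `KeyFlattening`, the `flatten` method, and the path syntax (`a.b` for nested fields, `a[0]` for array elements) are assumptions chosen for illustration, and the document is modeled with ordinary `Map` and `List` values rather than OJAI types.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, for illustration only; not part of HDocDB or OJAI.
class KeyFlattening {

    /** Flattens a nested document into (JSON path -> scalar value) pairs. */
    static Map<String, Object> flatten(Map<String, Object> document) {
        // LinkedHashMap keeps the columns in document traversal order.
        Map<String, Object> columns = new LinkedHashMap<>();
        for (Map.Entry<String, Object> field : document.entrySet()) {
            flattenValue(field.getKey(), field.getValue(), columns);
        }
        return columns;
    }

    private static void flattenValue(String path, Object value,
                                     Map<String, Object> columns) {
        if (value instanceof Map<?, ?>) {
            // Nested object: extend the path with each field name.
            for (Map.Entry<?, ?> field : ((Map<?, ?>) value).entrySet()) {
                flattenValue(path + "." + field.getKey(), field.getValue(), columns);
            }
        } else if (value instanceof List<?>) {
            // Array: the index becomes part of the path, which preserves order.
            List<?> items = (List<?>) value;
            for (int i = 0; i < items.size(); i++) {
                flattenValue(path + "[" + i + "]", items.get(i), columns);
            }
        } else {
            // Scalar: one column, whose qualifier is the full path.
            columns.put(path, value);
        }
    }
}
```

Under these assumptions, a document such as `{"name": "Ann", "address": {"city": "Oslo"}, "tags": ["a", "b"]}` flattens to the four entries `name`, `address.city`, `tags[0]`, and `tags[1]`. In an HBase layer, each entry would become one column in the document's row. Note that this sketch emits nothing for empty objects or arrays, which a real implementation would need to represent somehow.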

For HDocDB, I also added basic support for global secondary indexes. The implementation is based on work by Hofhansl and Yates. For more sophisticated indexing, one can integrate HDocDB with Elasticsearch or Solr.

Since OJAI is integrated with Jackson, it is also easy to store plain Java objects in HDocDB. This means that HDocDB can also be seen as an object layer on top of HBase. We can now say HBase supports the following models:

Key-value
Wide column
Document (HDocDB)
Graph (Titan, Zen, S2Graph)
SQL (Phoenix)
Object (HDocDB)

So not only can HBase be seen as a solid CP store (as shown in a previous blog post), but it can also be seen as a solid multi-model store.


HBase as a Multi-Model Data Store