HBase Application Archetypes Redux

At Yammer, we’ve transitioned away from polyglot persistence to persistence consolidation. In a microservice architecture, the principle that each microservice should be responsible for its own data had led to a proliferation of different types of data stores at Yammer. This in turn led to multiple efforts to make sure that each data store could be easily used, monitored, operationalized, and maintained. In the end, we decided it would be more efficient, both architecturally and organizationally, to reduce the number of data store types in use at Yammer to as few as possible.

Today HBase is the primary data store for non-relational data at Yammer (we use PostgreSQL for relational data). Microservices are still responsible for their own data, but the data is segregated by cluster boundaries or mechanisms within the data store itself (such as HBase namespaces or PostgreSQL schemas).

HBase was chosen for a number of reasons, including its performance, scalability, reliability, its support for strong consistency, and its ability to support a wide variety of data models. At Yammer we have a number of services that rely on HBase for persistence in production:

Feedie, a feeds service
RoyalMail, an inbox service
Ocular, for tracking messages that a user has viewed
Streamie, for storing activity streams
Prankie, a ranking service with time-based decay
Authlog, for authorization audit trails
Spammie, for spam monitoring and blocking
Graphene, a generic graph modeling service

HBase is able to satisfy the persistence needs of several very different domains. Of course, there are some use cases for which HBase is not recommended, for example, when using raw HDFS would be more efficient, or when ad-hoc querying via SQL is preferred (although projects like Apache Phoenix can provide SQL on top of HBase).

Previously, Lars George and Jonathan Hsieh from Cloudera attempted to survey the most commonly occurring use cases for HBase, which they referred to as application archetypes. In their presentation, they categorized archetypes as either “good”, “bad”, or “maybe” when used with HBase. Below I present an augmented listing of their “good” archetypes, along with pointers to projects that implement them.

Entity

The Entity archetype is the most natural of the archetypes. HBase, being a wide column store, can represent the entity properties with individual columns. Projects like Apache Gora and HEntityDB support this archetype.

	Column Family: default
Row Key	Column: <property 1 name>	Column: <property 2 name>	…
<entity ID>	<property 1 value>	<property 2 value>	…

Entities can be also stored in the same manner as with a key-value store. In this case the entity would be serialized as a binary or JSON value in a single column.

	Column Family: default
Row Key	Column: body
<entity ID>	<entity blob>

SORTED COLLECTION

The Sorted Collection archetype is a generalization of the original Messaging archetype that was presented. In this archetype the entities are stored as binary or JSON values, with the column qualifier being the value of the sort key to use. For example, in a messaging feed, the column qualifier would be a timestamp or a monotonically increasing counter of some sort. The column qualifier can also be “inverted” (such as by subtracting a numeric ID from the maximum possible value) so that entities are stored in descending order.

	Column Family: default
Row Key	Column: <sort key 1 value>	Column: <sort key 2 value>	…
<collection ID>	<entity 1 blob>	<entity 2 blob>	…

Alternatively, each entity can be stored as a set of properties. This is similar to how Cassandra implements CQL. HEntityDB supports storing entity collections in this manner.

	Column Family: default
Row Key	Column: <sort key 1 value + property 1 name>	Column: <sort key 1 value + property 2 name>	…	Column: <sort key 2 value + property 1 name>	Column: <sort key 2 value + property 2 name>	…
<collection ID>	<property 1 of entity 1>	<property 2 of entity 1>	…	<property 1 of entity 2>	<property 2 of entity 2>	…

In order to access entities by some other value than the sort key, additional column families representing indices can be used.

	Column Family: sorted			Column Family: index
Row Key	Column: <sort key 1 value>	Column: <sort key 2 value>	…	Column: <index 1 value>	Column: <index 2 value>	…
<collection ID>	<entity 1 blob>	<entity 2 blob>	…	<entity 1 blob>	<entity 2 blob>	…

To prevent the collection from growing unbounded, a coprocessor can be used to trim the sorted collection during compactions. If index column families are used, the coprocessor would also remove corresponding entries from the index column families when trimming the sorted collection. At Yammer, both the Feedie and RoyalMail services use this technique. Both services also use server-side filters for efficient pagination of the sorted collection during queries.

DOCUMENT

Using a technique called key-flattening, a document can be shredded by storing each value in the document according to the path from the root to the name of the element containing the value. HDocDB uses this approach.

	Column Family: default
Row Key	Column: <property 1 path>	Column: <property 2 path>	…
<document ID>	<property 1 value>	<property 2 value>	…

The document can also be stored as a binary value, in which case support for Medium Objects (MOBs) can be used if the documents are large. This approach is described in the book Architecting HBase Applications.

	Column Family: default
Row Key	Column: body
<document ID>	<reference to MOB>

GRAPH

There are many ways to store a graph in HBase. One method is to use an adjacency list, where each vertex stores its neighbors in the same row. This is the approach taken in JanusGraph.

	Column Family: default
Row Key	Column: <edge 1 key>	Column: <edge 2 key>	…	Column: <property 1 name>	Column: <property 2 name>	…
<vertex ID>	<edge 1 properties>	<edge 2 properties>	…	<property 1 value>	<property 2 value>	…

In the table above, the edge key is actually comprised of a number of parts, including the label, direction, edge ID, and adjacent vertex ID.

Alternatively, a separate table to represent edges can be used, in which case the incident vertices are stored in the same row as an edge. This may scale better if the adjacency list is large, such as in a social network. This is the approach taken in both Zen and HGraphDB.

	Column Family: default
Row Key	Column: <property 1 name>	Column: <property 2 name>	…
<vertex ID>	<property 1 value>	<property 2 value>	…

	Column Family: default
Row Key	Column: fromVertex	Column: toVertex	Column: <property 1 name>	Column: <property 2 name>	…
<edge ID>	<vertex ID>	<vertex ID>	<property 1 value>	<property 2 value>	…

When storing edges in a separate table, additional index tables must be used to provide efficient access to the incident edges of a vertex. For example, the full list of tables in HGraphDB can be viewed here.

QUEUE

A queue can be modeled by using a row key comprised of the consumer ID and a counter. Both Cask and Box implement queues in this manner.

	Column Family: default
Row Key	Column: metadata	Column: body
<consumer ID + counter>	<message metadata>	<message body>

Cask also uses coprocessors for efficient scan filtering and queue trimming, and Apache Tephra for transactional queue processing.

METRICS

The Metrics archetype is a variant of the Entity archetype in which the column values are counters or some other aggregate.

	Column Family: default
Row Key	Column: <property 1 name>	Column: <property 2 name>	…
<entity ID>	<property 1 counter>	<property 2 counter>	…

HGraphDB is actually a combination of the Graph and Metrics archetypes, as arbitrary counters can be stored on either vertices or edges.

Update: For other projects that use HBase, see this list.

Robert Yokota

HBase Application Archetypes Redux

Entity

SORTED COLLECTION

DOCUMENT

GRAPH

QUEUE

METRICS

Like this:

Leave a ReplyCancel reply

Entity

SORTED COLLECTION

DOCUMENT

GRAPH

QUEUE

METRICS

Share this:

Like this:

Leave a ReplyCancel reply

Discover more from Robert Yokota