At Yammer, we’ve transitioned away from polyglot persistence to persistence consolidation. In a microservice architecture, the principle that each microservice should be responsible for its own data had led to a proliferation of different types of data stores at Yammer. This in turn led to multiple efforts to make sure that each data store could be easily used, monitored, operationalized, and maintained. In the end, we decided it would be more efficient, both architecturally and organizationally, to reduce the number of data store types in use at Yammer to as few as possible.
Today HBase is the primary data store for non-relational data at Yammer (we use PostgreSQL for relational data). Microservices are still responsible for their own data, but the data is segregated by cluster boundaries or mechanisms within the data store itself (such as HBase namespaces or PostgreSQL schemas).
HBase was chosen for a number of reasons, including its performance, scalability, reliability, its support for strong consistency, and its ability to support a wide variety of data models. At Yammer we have a number of services that rely on HBase for persistence in production:
- Feedie, a feeds service
- RoyalMail, an inbox service
- Ocular, for tracking messages that a user has viewed
- Streamie, for storing activity streams
- Prankie, a ranking service with time-based decay
- Authlog, for authorization audit trails
- Spammie, for spam monitoring and blocking
- Graphene, a generic graph modeling service
HBase is able to satisfy the persistence needs of several very different domains. Of course, there are some use cases for which HBase is not recommended, for example, when using raw HDFS would be more efficient, or when ad-hoc querying via SQL is preferred (although projects like Apache Phoenix can provide SQL on top of HBase).
Previously, Lars George and Jonathan Hsieh from Cloudera attempted to survey the most commonly occurring use cases for HBase, which they referred to as application archetypes. In their presentation, they categorized archetypes as either “good”, “bad”, or “maybe” when used with HBase. Below I present an augmented listing of their “good” archetypes, along with pointers to projects that implement them.
Entity
The Entity archetype is the most natural of the archetypes. HBase, being a wide column store, can represent the entity properties with individual columns. Projects like Apache Gora and HEntityDB support this archetype.
Column Family: default | |||
Row Key | Column: <property 1 name> | Column: <property 2 name> | … |
<entity ID> | <property 1 value> | <property 2 value> | … |
Entities can be also stored in the same manner as with a key-value store. In this case the entity would be serialized as a binary or JSON value in a single column.
Column Family: default | |
Row Key | Column: body |
<entity ID> | <entity blob> |
SORTED COLLECTION
The Sorted Collection archetype is a generalization of the original Messaging archetype that was presented. In this archetype the entities are stored as binary or JSON values, with the column qualifier being the value of the sort key to use. For example, in a messaging feed, the column qualifier would be a timestamp or a monotonically increasing counter of some sort. The column qualifier can also be “inverted” (such as by subtracting a numeric ID from the maximum possible value) so that entities are stored in descending order.
Column Family: default | |||
Row Key | Column: <sort key 1 value> | Column: <sort key 2 value> | … |
<collection ID> | <entity 1 blob> | <entity 2 blob> | … |
Alternatively, each entity can be stored as a set of properties. This is similar to how Cassandra implements CQL. HEntityDB supports storing entity collections in this manner.
Column Family: default | ||||||
Row Key | Column: <sort key 1 value + property 1 name> | Column: <sort key 1 value + property 2 name> | … | Column: <sort key 2 value + property 1 name> | Column: <sort key 2 value + property 2 name> | … |
<collection ID> | <property 1 of entity 1> | <property 2 of entity 1> | … | <property 1 of entity 2> | <property 2 of entity 2> | … |
In order to access entities by some other value than the sort key, additional column families representing indices can be used.
Column Family: sorted | Column Family: index | |||||
Row Key | Column: <sort key 1 value> | Column: <sort key 2 value> | … | Column: <index 1 value> | Column: <index 2 value> | … |
<collection ID> | <entity 1 blob> | <entity 2 blob> | … | <entity 1 blob> | <entity 2 blob> | … |
To prevent the collection from growing unbounded, a coprocessor can be used to trim the sorted collection during compactions. If index column families are used, the coprocessor would also remove corresponding entries from the index column families when trimming the sorted collection. At Yammer, both the Feedie and RoyalMail services use this technique. Both services also use server-side filters for efficient pagination of the sorted collection during queries.
DOCUMENT
Using a technique called key-flattening, a document can be shredded by storing each value in the document according to the path from the root to the name of the element containing the value. HDocDB uses this approach.
Column Family: default | |||
Row Key | Column: <property 1 path> | Column: <property 2 path> | … |
<document ID> | <property 1 value> | <property 2 value> | … |
The document can also be stored as a binary value, in which case support for Medium Objects (MOBs) can be used if the documents are large. This approach is described in the book Architecting HBase Applications.
Column Family: default | |
Row Key | Column: body |
<document ID> | <reference to MOB> |
GRAPH
There are many ways to store a graph in HBase. One method is to use an adjacency list, where each vertex stores its neighbors in the same row. This is the approach taken in JanusGraph.
Column Family: default | ||||||
Row Key | Column: <edge 1 key> | Column: <edge 2 key> | … | Column: <property 1 name> | Column: <property 2 name> | … |
<vertex ID> | <edge 1 properties> | <edge 2 properties> | … | <property 1 value> | <property 2 value> | … |
In the table above, the edge key is actually comprised of a number of parts, including the label, direction, edge ID, and adjacent vertex ID.
Alternatively, a separate table to represent edges can be used, in which case the incident vertices are stored in the same row as an edge. This may scale better if the adjacency list is large, such as in a social network. This is the approach taken in both Zen and HGraphDB.
Column Family: default | |||
Row Key | Column: <property 1 name> | Column: <property 2 name> | … |
<vertex ID> | <property 1 value> | <property 2 value> | … |
Column Family: default | |||||
Row Key | Column: fromVertex | Column: toVertex | Column: <property 1 name> | Column: <property 2 name> | … |
<edge ID> | <vertex ID> | <vertex ID> | <property 1 value> | <property 2 value> | … |
When storing edges in a separate table, additional index tables must be used to provide efficient access to the incident edges of a vertex. For example, the full list of tables in HGraphDB can be viewed here.
QUEUE
A queue can be modeled by using a row key comprised of the consumer ID and a counter. Both Cask and Box implement queues in this manner.
Column Family: default | ||
Row Key | Column: metadata | Column: body |
<consumer ID + counter> | <message metadata> | <message body> |
Cask also uses coprocessors for efficient scan filtering and queue trimming, and Apache Tephra for transactional queue processing.
METRICS
The Metrics archetype is a variant of the Entity archetype in which the column values are counters or some other aggregate.
Column Family: default | |||
Row Key | Column: <property 1 name> | Column: <property 2 name> | … |
<entity ID> | <property 1 counter> | <property 2 counter> | … |
HGraphDB is actually a combination of the Graph and Metrics archetypes, as arbitrary counters can be stored on either vertices or edges.
Update: For other projects that use HBase, see this list.