Graph Analytics on HBase with HGraphDB and Apache Flink Gelly

Previously, I’ve shown how both Apache Giraph and Apache Spark GraphFrames can be used to analyze graphs stored in HGraphDB.  In this blog I will show how yet another graph analytics framework, Apache Flink Gelly, can be used with HGraphDB.

First, some observations on how Giraph, GraphFrames, and Gelly differ.  Giraph runs on Hadoop MapReduce, while GraphFrames and Gelly run on Spark and Flink, respectively. MapReduce has been pivotal in launching the era of big data.  Two of its characteristics are the following:

  1. MapReduce has only 3 steps (the map, shuffle, and reduce steps)
  2. MapReduce processes data in batches

The fact that MapReduce utilizes only 3 steps has led to the development of workflow engines like Apache Oozie that can combine MapReduce jobs into more complex flows, represented by directed acyclic graphs.  Also, the fact that MapReduce performs only batch processing has led to the development of stream processing frameworks like Apache Storm, which is often combined with Hadoop MapReduce in what is referred to as a lambda architecture.

Later, dataflow engines such as Apache Spark and Apache Flink were developed to handle data processing as a single job, rather than several independent MapReduce jobs that need to be chained together.  However, while Spark is fundamentally a batch-oriented framework, Flink is fundamentally a stream-oriented framework.   Both try to unify batch and stream processing in different ways.  Spark provides stream processing by breaking data into micro-batches.  Flink posits that batch is a special case of streaming, and that stream-processing engines can handle batches better than batch-processing engines can handle streams.

Needless to say, users who have the requirement to process big data, including large graphs, have a plethora of unique and interesting options at their disposal today.

To use Apache Flink Gelly with HGraphDB, graph data first needs to be wrapped in Flink DataSets.  HGraphDB provides two classes, HBaseVertexInputFormat and HBaseEdgeInputFormat, that can be used to import the vertices and edges of a graph into DataSets.

As a demonstration, we can run one of the Gelly neighborhood examples on HGraphDB as follows.  First we create the graph in the example:

Vertex v1 = graph.addVertex(T.id, 1L);
Vertex v2 = graph.addVertex(T.id, 2L);
Vertex v3 = graph.addVertex(T.id, 3L);
Vertex v4 = graph.addVertex(T.id, 4L);
Vertex v5 = graph.addVertex(T.id, 5L);
v1.addEdge("e", v2, "weight", 0.1);
v1.addEdge("e", v3, "weight", 0.5);
v1.addEdge("e", v4, "weight", 0.4);
v2.addEdge("e", v4, "weight", 0.7);
v2.addEdge("e", v5, "weight", 0.3);
v3.addEdge("e", v4, "weight", 0.2);
v4.addEdge("e", v5, "weight", 0.9);

A vertex in Gelly consists of an ID and a value, whereas an edge in Gelly consists of the source vertex ID, the target vertex ID, and an optional value.  When using HBaseVertexInputFormat and HBaseEdgeInputFormat, the name of a property can be specified so that the value of that property in the HGraphDB vertex or edge becomes the value of the Gelly vertex or edge.  If no property name is specified, the value defaults to the ID of the vertex or edge.  Below we import the vertices using an instance of HBaseVertexInputFormat with no property name specified, and we import the edges using an instance of HBaseEdgeInputFormat with the property name specified as “weight”.

HBaseGraphConfiguration conf = graph.configuration();
ExecutionEnvironment env = 
    ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<Long, Long>> vertices = env.createInput(
    new HBaseVertexInputFormat<>(conf),
    TypeInformation.of(new TypeHint<Tuple2<Long, Long>>() {})
);
DataSet<Tuple3<Long, Long, Double>> edges = env.createInput(
    new HBaseEdgeInputFormat<>(conf, "weight"),
    TypeInformation.of(new TypeHint<Tuple3<Long, Long, Double>>() {})
);

Once we have the two DataSets, we can create a Gelly graph as follows:

Graph<Long, Long, Double> gelly = 
    Graph.fromTupleDataSet(vertices, edges, env);

Finally, running the neighborhood processing example is exactly the same as in the documentation:

DataSet<Tuple2<Long, Double>> minWeights = 
    gelly.reduceOnEdges(new SelectMinWeight(), EdgeDirection.OUT);

// user-defined function to select the minimum weight
static final class SelectMinWeight implements ReduceEdgesFunction<Double> {
    @Override
    public Double reduceEdges(
        Double firstEdgeValue, Double secondEdgeValue) {
        return Math.min(firstEdgeValue, secondEdgeValue);
    }
}
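
As with any Flink DataSet program, the transformations above are lazy; printing the result triggers execution of the job and should yield the minimum outgoing edge weight per vertex:

// triggers job execution and prints (vertex ID, minimum weight of outgoing edges)
minWeights.print();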

HGraphDB brings together several big data technologies in the Apache ecosystem in order to process large graphs. Graph data can be stored in Apache HBase, OLTP graph operations can be performed using Apache TinkerPop, and complex graph analytics can be performed using Apache Giraph, Apache Spark GraphFrames, or Apache Flink Gelly.

HBase Application Archetypes Redux

At Yammer, we’ve transitioned away from polyglot persistence to persistence consolidation. In a microservice architecture, the principle that each microservice should be responsible for its own data had led to a proliferation of different types of data stores at Yammer. This in turn led to multiple efforts to make sure that each data store could be easily used, monitored, operationalized, and maintained. In the end, we decided it would be more efficient, both architecturally and organizationally, to reduce the number of data store types in use at Yammer to as few as possible.

Today HBase is the primary data store for non-relational data at Yammer (we use PostgreSQL for relational data).  Microservices are still responsible for their own data, but the data is segregated by cluster boundaries or mechanisms within the data store itself (such as HBase namespaces or PostgreSQL schemas).

HBase was chosen for a number of reasons, including its performance, scalability, reliability, its support for strong consistency, and its ability to support a wide variety of data models.  At Yammer we have a number of services that rely on HBase for persistence in production:

  • Feedie, a feeds service
  • RoyalMail, an inbox service
  • Ocular, for tracking messages that a user has viewed
  • Streamie, for storing activity streams
  • Prankie, a ranking service with time-based decay
  • Authlog, for authorization audit trails
  • Spammie, for spam monitoring and blocking
  • Graphene, a generic graph modeling service

HBase is able to satisfy the persistence needs of several very different domains. Of course, there are some use cases for which HBase is not recommended, for example, when using raw HDFS would be more efficient, or when ad-hoc querying via SQL is preferred (although projects like Apache Phoenix can provide SQL on top of HBase).

Previously, Lars George and Jonathan Hsieh from Cloudera attempted to survey the most commonly occurring use cases for HBase, which they referred to as application archetypes.  In their presentation, they categorized archetypes as “good”, “bad”, or “maybe” when used with HBase. Below I present an augmented listing of their “good” archetypes, along with pointers to projects that implement them.

Entity

The Entity archetype is the most natural of the archetypes.  HBase, being a wide column store, can represent the entity properties with individual columns.  Projects like Apache Gora and HEntityDB support this archetype.

Column Family: default
Row Key      | Column: <property 1 name> | Column: <property 2 name>
<entity ID>  | <property 1 value>        | <property 2 value>

Entities can also be stored in the same manner as with a key-value store.  In this case the entity would be serialized as a binary or JSON value in a single column.

Column Family: default
Row Key      | Column: body
<entity ID>  | <entity blob>

Sorted Collection

The Sorted Collection archetype is a generalization of the Messaging archetype from the original presentation.  In this archetype the entities are stored as binary or JSON values, with the column qualifier holding the value of the sort key.  For example, in a messaging feed, the column qualifier would be a timestamp or a monotonically increasing counter of some sort.  The column qualifier can also be “inverted” (such as by subtracting a numeric ID from the maximum possible value) so that entities are stored in descending order.

Column Family: default
Row Key          | Column: <sort key 1 value> | Column: <sort key 2 value>
<collection ID>  | <entity 1 blob>            | <entity 2 blob>

Alternatively, each entity can be stored as a set of properties.  This is similar to how Cassandra implements CQL.  HEntityDB supports storing entity collections in this manner.

Column Family: default
Row Key          | Column: <sort key 1 value + property 1 name> | Column: <sort key 1 value + property 2 name> | Column: <sort key 2 value + property 1 name> | Column: <sort key 2 value + property 2 name>
<collection ID>  | <property 1 of entity 1>                     | <property 2 of entity 1>                     | <property 1 of entity 2>                     | <property 2 of entity 2>

In order to access entities by some other value than the sort key, additional column families representing indices can be used.

Column Families: sorted, index
Row Key          | sorted: <sort key 1 value> | sorted: <sort key 2 value> | index: <index 1 value> | index: <index 2 value>
<collection ID>  | <entity 1 blob>            | <entity 2 blob>            | <entity 1 blob>        | <entity 2 blob>

To prevent the collection from growing unbounded, a coprocessor can be used to trim the sorted collection during compactions.  If index column families are used, the coprocessor would also remove corresponding entries from the index column families when trimming the sorted collection.  At Yammer, both the Feedie and RoyalMail services use this technique.  Both services also use server-side filters for efficient pagination of the sorted collection during queries.
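
The exact filters used at Yammer are not described here, but as one hedged example, HBase ships with ColumnPaginationFilter, which can return a fixed-size page of columns for a row on the server side (the table and row key are assumed from the sketches above):

// fetch one page of 20 entities from the sorted collection, starting at column offset 0
Get get = new Get(Bytes.toBytes("feed-42"));
get.addFamily(Bytes.toBytes("default"));
get.setFilter(new ColumnPaginationFilter(20, 0));   // (limit, offset)
Result page = table.get(get);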

Document

Using a technique called key-flattening, a document can be shredded by storing each value in the document according to the path from the root to the name of the element containing the value.  HDocDB uses this approach.

Column Family: default
Row Key        | Column: <property 1 path> | Column: <property 2 path>
<document ID>  | <property 1 value>        | <property 2 value>

The document can also be stored as a binary value, in which case support for Medium Objects (MOBs) can be used if the documents are large.  This approach is described in the book Architecting HBase Applications.

Column Family: default
Row Key        | Column: body
<document ID>  | <reference to MOB>

Graph

There are many ways to store a graph in HBase.  One method is to use an adjacency list, where each vertex stores its neighbors in the same row.  This is the approach taken in JanusGraph.

Column Family: default
Row Key      | Column: <edge 1 key> | Column: <edge 2 key> | Column: <property 1 name> | Column: <property 2 name>
<vertex ID>  | <edge 1 properties>  | <edge 2 properties>  | <property 1 value>        | <property 2 value>

In the table above, the edge key is actually composed of several parts, including the label, direction, edge ID, and adjacent vertex ID.
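
A hedged sketch of composing such an edge key by concatenating its parts (the ordering and encoding here are purely illustrative, not JanusGraph's actual serialization):

// illustrative only: concatenate label, direction, edge ID, and adjacent vertex ID
long edgeId = 1L, adjacentVertexId = 2L;   // placeholder IDs
byte[] edgeKey = Bytes.add(
    Bytes.add(Bytes.toBytes("knows"), Bytes.toBytes("OUT")),
    Bytes.add(Bytes.toBytes(edgeId), Bytes.toBytes(adjacentVertexId)));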

Alternatively, a separate table to represent edges can be used, in which case the incident vertices are stored in the same row as an edge.   This may scale better if the adjacency list is large, such as in a social network.  This is the approach taken in both Zen and HGraphDB.

Vertex table:

Column Family: default
Row Key      | Column: <property 1 name> | Column: <property 2 name>
<vertex ID>  | <property 1 value>        | <property 2 value>

Edge table:

Column Family: default
Row Key    | Column: fromVertex | Column: toVertex | Column: <property 1 name> | Column: <property 2 name>
<edge ID>  | <vertex ID>        | <vertex ID>      | <property 1 value>        | <property 2 value>

When storing edges in a separate table, additional index tables must be used to provide efficient access to the incident edges of a vertex.  For example, the full list of tables in HGraphDB can be viewed here.

Queue

A queue can be modeled by using a row key comprised of the consumer ID and a counter.  Both Cask and Box implement queues in this manner.

Column Family: default
Row Key                  | Column: metadata   | Column: body
<consumer ID + counter>  | <message metadata> | <message body>

Cask also uses coprocessors for efficient scan filtering and queue trimming, and Apache Tephra for transactional queue processing.

Metrics

The Metrics archetype is a variant of the Entity archetype in which the column values are counters or some other aggregate.

Column Family: default
Row Key      | Column: <property 1 name> | Column: <property 2 name>
<entity ID>  | <property 1 counter>      | <property 2 counter>

HGraphDB is actually a combination of the Graph and Metrics archetypes, as arbitrary counters can be stored on either vertices or edges.

Update: For other projects that use HBase, see this list.
