Graph Analytics on HBase with HGraphDB and Giraph

HGraphDB is a client framework for HBase that provides a TinkerPop Graph API.  It also integrates with Apache Giraph, a graph compute engine that Facebook has shown to be massively scalable.  In this post we will show how to convert a sample Giraph computation that reads text files so that it works with HGraphDB instead.

In the Giraph quick start, SimpleShortestPathsComputation is used to show how to run a Giraph computation against a graph stored in a file in JSON form.  Here are the contents of the JSON file:

[0,0,[[1,1],[3,3]]]
[1,0,[[0,1],[2,2],[3,1]]]
[2,0,[[1,2],[4,4]]]
[3,0,[[0,3],[1,1],[4,4]]]
[4,0,[[3,4],[2,4]]]

Each line above has the format [fromVertexId, vertexValue, [[toVertexId, edgeValue],...]], where the edgeValue is the weight or cost of the edge that will be used for the path computation.
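
To make the line format concrete, here is a minimal plain-Java sketch that pulls one such line apart into a vertex ID, a vertex value, and a list of edges.  It is only an illustration and is not how Giraph's JsonLongDoubleFloatDoubleVertexInputFormat is implemented; the class name TinyGraphLine and its nested types are hypothetical, and the parser assumes plain decimal numbers with no exponent notation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Giraph or HGraphDB.
public class TinyGraphLine {

    /** One outgoing edge: target vertex ID and edge weight. */
    public static final class EdgeSpec {
        public final long target;
        public final float weight;

        EdgeSpec(long target, float weight) {
            this.target = target;
            this.weight = weight;
        }
    }

    /** One parsed line: vertex ID, vertex value, and outgoing edges. */
    public static final class VertexSpec {
        public final long id;
        public final double value;
        public final List<EdgeSpec> edges;

        VertexSpec(long id, double value, List<EdgeSpec> edges) {
            this.id = id;
            this.value = value;
            this.edges = edges;
        }
    }

    // Matches plain decimal numbers; exponent notation is not handled.
    private static final Pattern NUMBER = Pattern.compile("-?\\d+(?:\\.\\d+)?");

    /** Parses a line such as [1,0,[[0,1],[2,2],[3,1]]]. */
    public static VertexSpec parse(String line) {
        // Collect every number in order: id, value, then (target, weight) pairs.
        List<String> tokens = new ArrayList<>();
        Matcher m = NUMBER.matcher(line);
        while (m.find()) {
            tokens.add(m.group());
        }
        if (tokens.size() < 2 || tokens.size() % 2 != 0) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        long id = Long.parseLong(tokens.get(0));
        double value = Double.parseDouble(tokens.get(1));
        List<EdgeSpec> edges = new ArrayList<>();
        for (int i = 2; i < tokens.size(); i += 2) {
            edges.add(new EdgeSpec(
                    Long.parseLong(tokens.get(i)),
                    Float.parseFloat(tokens.get(i + 1))));
        }
        return new VertexSpec(id, value, edges);
    }

    public static void main(String[] args) {
        VertexSpec v = parse("[1,0,[[0,1],[2,2],[3,1]]]");
        System.out.println("vertex " + v.id + " has " + v.edges.size() + " edges");
    }
}
```

The field types mirror the Long/Double/Float naming in the input format used below: a long vertex ID, a double vertex value, and float edge weights.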

To run the example in the Giraph quick start, the following command line is used:

hadoop jar giraph-examples-1.3.0-SNAPSHOT-for-hadoop-2.5.1-jar-with-dependencies.jar \
    org.apache.giraph.GiraphRunner \
    org.apache.giraph.examples.SimpleShortestPathsComputation \
    -vif org.apache.giraph.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat \
    -vip /user/ryokota/input/tiny_graph.txt \
    -vof org.apache.giraph.io.formats.IdWithValueTextOutputFormat \
    -op /user/ryokota/output/shortestpaths \
    -w 1 -ca giraph.SplitMasterWorker=false

The results of the job will appear in a file under the output path (/user/ryokota/output/shortestpaths), with the following contents:

0 1.0
1 0.0
2 2.0
3 1.0
4 5.0
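
As a sanity check on these numbers, the same distances can be reproduced without Hadoop.  The sketch below is plain Java and does not use Giraph; it repeatedly relaxes the edges of the sample graph (a Bellman-Ford style loop rather than Giraph's message passing) until no distance improves.  The class name TinyGraphShortestPaths is hypothetical, and the source is assumed to be vertex 1, which the 0.0 distance for vertex 1 in the output implies.

```java
import java.util.Arrays;

// Hypothetical stand-alone check for illustration; not part of Giraph or HGraphDB.
public class TinyGraphShortestPaths {

    // Edges of the sample graph as {from, to, weight}, copied from the JSON file.
    static final int[][] EDGES = {
        {0, 1, 1}, {0, 3, 3},
        {1, 0, 1}, {1, 2, 2}, {1, 3, 1},
        {2, 1, 2}, {2, 4, 4},
        {3, 0, 3}, {3, 1, 1}, {3, 4, 4},
        {4, 3, 4}, {4, 2, 4},
    };

    /** Returns the shortest distance from source to every vertex. */
    static double[] shortestPaths(int vertexCount, int source) {
        double[] dist = new double[vertexCount];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        dist[source] = 0d;

        // Keep relaxing edges until a full pass makes no improvement.
        // All weights are non-negative, so the loop terminates.
        boolean changed = true;
        while (changed) {
            changed = false;
            for (int[] e : EDGES) {
                double candidate = dist[e[0]] + e[2];
                if (candidate < dist[e[1]]) {
                    dist[e[1]] = candidate;
                    changed = true;
                }
            }
        }
        return dist;
    }

    public static void main(String[] args) {
        double[] dist = shortestPaths(5, 1);
        for (int v = 0; v < dist.length; v++) {
            System.out.println(v + " " + dist[v]);
        }
    }
}
```

Run with source vertex 1, this prints the same five lines as the Giraph job.  Vertex 4, for instance, is reached most cheaply through vertex 3 (1 + 4 = 5) rather than through vertex 2 (2 + 4 = 6).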

Now let’s leave that example and consider the exact same graph stored in HGraphDB.  The graph above can be created in HGraphDB using the following statements.

        Vertex v0 = graph.addVertex(T.id, 0);
        Vertex v1 = graph.addVertex(T.id, 1);
        Vertex v2 = graph.addVertex(T.id, 2);
        Vertex v3 = graph.addVertex(T.id, 3);
        Vertex v4 = graph.addVertex(T.id, 4);
        v0.addEdge("e", v1, "weight", 1);
        v0.addEdge("e", v3, "weight", 3);
        v1.addEdge("e", v0, "weight", 1);
        v1.addEdge("e", v2, "weight", 2);
        v1.addEdge("e", v3, "weight", 1);
        v2.addEdge("e", v1, "weight", 2);
        v2.addEdge("e", v4, "weight", 4);
        v3.addEdge("e", v0, "weight", 3);
        v3.addEdge("e", v1, "weight", 1);
        v3.addEdge("e", v4, "weight", 4);
        v4.addEdge("e", v3, "weight", 4);
        v4.addEdge("e", v2, "weight", 4);

There is also a class called HBaseBulkLoader that can be used for more efficient creation of larger graphs.

Instead of using the JSON input format above, HGraphDB provides two input formats, HBaseVertexInputFormat and HBaseEdgeInputFormat, which will read from the vertices table and edges table in HBase, respectively.  To use these formats, the Giraph computation needs to be changed slightly.  Here is the original SimpleShortestPathsComputation:

public class SimpleShortestPathsComputation extends BasicComputation<LongWritable, DoubleWritable, FloatWritable, DoubleWritable> {
  ...
  @Override
  public void compute(
      Vertex<LongWritable, DoubleWritable, FloatWritable> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    if (getSuperstep() == 0) {
      vertex.setValue(new DoubleWritable(Double.MAX_VALUE));
    }
    double minDist = isSource(vertex) ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }
    if (minDist < vertex.getValue().get()) {
      vertex.setValue(new DoubleWritable(minDist));
      for (Edge<LongWritable, FloatWritable> edge : vertex.getEdges()) {
        double distance = minDist + edge.getValue().get();
        sendMessage(edge.getTargetVertexId(), new DoubleWritable(distance));
      }
    }
    vertex.voteToHalt();
  }
}

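And here is the version for HGraphDB.  The main changes are the HBaseComputation base class, the wrapped vertex type parameters, and the way the vertex value and the edge weight are accessed.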

public class SimpleShortestPathsComputation extends
        HBaseComputation<Long, DoubleWritable, FloatWritable, DoubleWritable> {
  ...
  @Override
  public void compute(
      Vertex<ObjectWritable<Long>, VertexValueWritable<DoubleWritable>, EdgeValueWritable<FloatWritable>> vertex,
      Iterable<DoubleWritable> messages) throws IOException {
    VertexValueWritable<DoubleWritable> vertexValue = vertex.getValue();
    if (getSuperstep() == 0) {
      vertexValue.setValue(new DoubleWritable(Double.MAX_VALUE));
    }
    double minDist = isSource(vertex) ? 0d : Double.MAX_VALUE;
    for (DoubleWritable message : messages) {
      minDist = Math.min(minDist, message.get());
    }
    if (minDist < vertexValue.getValue().get()) {
      vertexValue.setValue(new DoubleWritable(minDist));
      for (Edge<ObjectWritable<Long>, EdgeValueWritable<FloatWritable>> edge : vertex.getEdges()) {
        double distance = minDist + ((Number) edge.getValue().getEdge().property("weight").value()).doubleValue();
        sendMessage(edge.getTargetVertexId(), new DoubleWritable(distance));
      }
    }
    vertex.voteToHalt();
  }
}

The major difference is that when using HBaseVertexInputFormat, the “value” of a Giraph vertex is an instance of type VertexValueWritable, which is composed of an HBaseVertex and a Writable value.  Likewise, when using HBaseEdgeInputFormat, the “value” of a Giraph edge is an instance of type EdgeValueWritable, which is composed of an HBaseEdge and a Writable value.  The instances of HBaseVertex and HBaseEdge should be considered read-only and should only be used to obtain IDs and property values.

Running the above Giraph computation against HBase is similar to running the original example.  Note that we also have to customize IdWithValueTextOutputFormat to work properly with VertexValueWritable.

hadoop jar hgraphdb-0.4.4-SNAPSHOT-test-jar-with-dependencies.jar \
    org.apache.giraph.GiraphRunner \
    io.hgraphdb.giraph.examples.SimpleShortestPathsComputation \
    -vif io.hgraphdb.giraph.HBaseVertexInputFormat \
    -eif io.hgraphdb.giraph.HBaseEdgeInputFormat \
    -vof io.hgraphdb.giraph.examples.IdWithValueTextOutputFormat \
    -op /user/ryokota/output/shortestpaths \
    -w 1 -ca giraph.SplitMasterWorker=false \
    -ca hbase.zookeeper.quorum=127.0.0.1 \
    -ca zookeeper.znode.parent=/hbase-unsecure \
    -ca gremlin.hbase.namespace=testgraph \
    -ca hbase.mapreduce.edgetable=testgraph:edges \
    -ca hbase.mapreduce.vertextable=testgraph:vertices

As an alternative to using a text-based output format such as IdWithValueTextOutputFormat, HGraphDB provides two abstract output formats, HBaseVertexOutputFormat and HBaseEdgeOutputFormat, that can be used to modify the graph after a Giraph computation.  For example, the shortest path result for each vertex could be set as a property on the vertex by extending HBaseVertexOutputFormat and implementing the following method:

public abstract void writeVertex(HBaseBulkLoader writer, HBaseVertex vertex, Writable value);

As you can see, HGraphDB extends the functionality in Apache Giraph by making it quite easy to both read and write graphs stored in HBase when performing sophisticated graph analytics.
