Previously, I’ve shown how both Apache Giraph and Apache Spark GraphFrames can be used to analyze graphs stored in HGraphDB. In this blog I will show how yet another graph analytics framework, Apache Flink Gelly, can be used with HGraphDB.
First, some observations on how Giraph, GraphFrames, and Gelly differ. Giraph runs on Hadoop MapReduce, while GraphFrames and Gelly run on Spark and Flink, respectively. MapReduce has been pivotal in launching the era of big data. Two of its characteristics are the following:
- MapReduce has only three steps (map, shuffle, and reduce)
- MapReduce processes data in batches
The fact that MapReduce utilizes only 3 steps has led to the development of workflow engines like Apache Oozie that can combine MapReduce jobs into more complex flows, represented by directed acyclic graphs. Also, the fact that MapReduce performs only batch processing has led to the development of stream processing frameworks like Apache Storm, which is often combined with Hadoop MapReduce in what is referred to as a lambda architecture.
Later, dataflow engines such as Apache Spark and Apache Flink were developed to handle data processing as a single job, rather than several independent MapReduce jobs that need to be chained together. However, while Spark is fundamentally a batch-oriented framework, Flink is fundamentally a stream-oriented framework. Both try to unify batch and stream processing in different ways. Spark provides stream processing by breaking data into micro-batches. Flink posits that batch is a special case of streaming, and that stream-processing engines can handle batches better than batch-processing engines can handle streams.
Needless to say, users who have the requirement to process big data, including large graphs, have a plethora of unique and interesting options at their disposal today.
To use Apache Flink Gelly with HGraphDB, graph data first needs to be wrapped in Flink DataSets. HGraphDB provides two classes, HBaseVertexInputFormat and HBaseEdgeInputFormat, that can be used to import the vertices and edges of a graph into DataSets.
As a demonstration, we can run one of the Gelly neighborhood examples on HGraphDB as follows. First we create the graph in the example:
Vertex v1 = graph.addVertex(T.id, 1L);
Vertex v2 = graph.addVertex(T.id, 2L);
Vertex v3 = graph.addVertex(T.id, 3L);
Vertex v4 = graph.addVertex(T.id, 4L);
Vertex v5 = graph.addVertex(T.id, 5L);
v1.addEdge("e", v2, "weight", 0.1);
v1.addEdge("e", v3, "weight", 0.5);
v1.addEdge("e", v4, "weight", 0.4);
v2.addEdge("e", v4, "weight", 0.7);
v2.addEdge("e", v5, "weight", 0.3);
v3.addEdge("e", v4, "weight", 0.2);
v4.addEdge("e", v5, "weight", 0.9);
A vertex in Gelly consists of an ID and a value, whereas an edge in Gelly consists of a source vertex ID, a target vertex ID, and an optional value. When using HBaseVertexInputFormat and HBaseEdgeInputFormat, the name of a property can be specified so that the value of that property on the HGraphDB vertex or edge is associated with the Gelly vertex or edge. If no property name is specified, the value defaults to the ID of the vertex or edge. Below we import the vertices using an instance of HBaseVertexInputFormat with no property name specified, and we import the edges using an instance of HBaseEdgeInputFormat with the property name specified as "weight".
HBaseGraphConfiguration conf = graph.configuration();
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Tuple2<Long, Long>> vertices = env.createInput(
    new HBaseVertexInputFormat<>(conf),
    TypeInformation.of(new TypeHint<Tuple2<Long, Long>>() {})
);
DataSet<Tuple3<Long, Long, Double>> edges = env.createInput(
    new HBaseEdgeInputFormat<>(conf, "weight"),
    TypeInformation.of(new TypeHint<Tuple3<Long, Long, Double>>() {})
);
Once we have the two DataSets, we can create a Gelly graph as follows:
Graph<Long, Long, Double> gelly = Graph.fromTupleDataSet(vertices, edges, env);
Finally, running the neighborhood processing example is exactly the same as in the documentation:
DataSet<Tuple2<Long, Double>> minWeights =
    gelly.reduceOnEdges(new SelectMinWeight(), EdgeDirection.OUT);

// user-defined function to select the minimum weight
static final class SelectMinWeight implements ReduceEdgesFunction<Double> {
    @Override
    public Double reduceEdges(Double firstEdgeValue, Double secondEdgeValue) {
        return Math.min(firstEdgeValue, secondEdgeValue);
    }
}
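To see what this aggregation computes on the example graph, the semantics of reduceOnEdges with EdgeDirection.OUT can be sketched in plain Java, without Flink: for each vertex, reduce the values of its outgoing edges with Math::min. The class and method names below are illustrative, not part of Gelly or HGraphDB; the edge data mirrors the graph built above.

```java
import java.util.HashMap;
import java.util.Map;

public class MinOutWeight {

    // For each source vertex, fold the weights of its outgoing edges
    // with Math::min -- the same reduction SelectMinWeight performs.
    public static Map<Long, Double> minOutWeights(long[][] srcDst, double[] weights) {
        Map<Long, Double> min = new HashMap<>();
        for (int i = 0; i < weights.length; i++) {
            min.merge(srcDst[i][0], weights[i], Math::min);
        }
        return min;
    }

    public static void main(String[] args) {
        // Edges of the example graph: (source, target) plus parallel weights
        long[][] edges = {{1, 2}, {1, 3}, {1, 4}, {2, 4}, {2, 5}, {3, 4}, {4, 5}};
        double[] weights = {0.1, 0.5, 0.4, 0.7, 0.3, 0.2, 0.9};
        // Vertex 1 -> 0.1, vertex 2 -> 0.3, vertex 3 -> 0.2, vertex 4 -> 0.9;
        // vertex 5 has no outgoing edges, so it produces no result.
        System.out.println(minOutWeights(edges, weights));
    }
}
```

Note that, as in Gelly, a vertex with no outgoing edges (vertex 5 here) simply does not appear in the result.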
HGraphDB brings together several big data technologies in the Apache ecosystem in order to process large graphs. Graph data can be stored in Apache HBase, OLTP graph operations can be performed using Apache TinkerPop, and complex graph analytics can be performed using Apache Giraph, Apache Spark GraphFrames, or Apache Flink Gelly.