Kafka Graphs: Graph Analytics with Apache Kafka

As the 2018 Apache Kafka Report has shown, Kafka has become mission-critical to enterprises of all sizes around the globe. Although there are many similar technologies in the field today, none have the equivalent of the thriving ecosystem that has developed around Kafka. Frameworks like Kafka Connect, Kafka Streams, and KSQL have enabled a much wider variety of scenarios to be addressed by Kafka. We are witnessing the growth of an entire technology market, distributed streaming, that resembles how the relational database market grew to take hold of enterprises at the end of the last century.

Kafka Graphs is a new framework that extends Kafka Streams to provide distributed graph analytics. It provides both a library for graph transformations as well as a distributed platform for executing graph algorithms. Kafka Graphs was inspired by other platforms for graph analytics, such as Apache Flink Gelly, Apache Spark GraphX, and Apache Giraph, but unlike these other frameworks it does not require anything other than what is already provided by the Kafka abstraction funnel.

Graph Representation and Transformations

A graph in Kafka Graphs is represented by two tables from Kafka Streams, one for vertices and one for edges. The vertex table is comprised of an ID and a vertex value, while the edge table is comprised of a source ID, target ID, and edge value.

KTable<Long, Long> vertices = ...
KTable<Edge<Long>, Long> edges = ...
KGraph<Long, Long, Long> graph = new KGraph<>(
    vertices, 
    edges, 
    GraphSerialized.with(Serdes.Long(), Serdes.Long(), Serdes.Long())
);

Once a graph is created, graph transformations can be performed on it. For example, the following will compute the sum of the values of all incoming neighbors for each vertex.

graph.reduceOnNeighbors(new SumValues(), EdgeDirection.IN);

Pregel-Based Graph Algorithms

Kafka Graphs provides a number of graph algorithms based on the vertex-centric approach of Pregel. The vertex-centric approach allows a computation to “think like a vertex” so that it only need consider how the value of a vertex should change based on messages sent from other vertices. The following algorithms are provided by Kafka Graphs:

  1. Breadth-first search (BFS): given a source vertex, determines the minimum number of hops to reach every other vertex.
  2. Label propagation (LP): finds communities in a graph by propagating labels between neighbors.
  3. Local clustering coefficient (LCC): computes the degree of clustering for each vertex as determined by the ratio between the number of triangles a vertex closes with its neighbors to the maximum number of triangles it could close.
  4. Multiple-source shortest paths (MSSP): given a set of source vertices, finds the shortest paths from these vertices to all other vertices.
  5. PageRank (PR): measures the rank or popularity of each vertex by propagating influence between vertices.
  6. Single-source shortest paths (SSSP): given a source vertex, finds the shortest paths to all other vertices.
  7. Weakly connected components (WCC): determines the weakly connected component for each vertex.

For example, here is the implementation of the single-source shortest paths (SSSP) algorithm:

public final class SSSPComputeFunction 
  implements ComputeFunction<Long, Double, Double, Double> {

  public void compute(
    int superstep,
    VertexWithValue<Long, Double> vertex,
    Map<Long, Double> messages,
    Iterable<EdgeWithValue<Long, Double>> edges,
    Callback<Long, Double, Double> cb) {

    double minDistance = vertex.id().equals(srcVertexId)
      ? 0d : Double.POSITIVE_INFINITY;

    for (Double message : messages.values()) {
      minDistance = Math.min(minDistance, message);
    }

    if (minDistance < vertex.value()) {
      cb.setNewVertexValue(minDistance);
      for (EdgeWithValue<Long, Double> edge : edges) {
        double distance = minDistance + edge.value();
        cb.sendMessageTo(edge.target(), distance);
      }
    }
  }
}

Custom Pregel-based graph algorithms can also be added by implementing the ComputeFunction interface.

Distributed Graph Processing

Since Kafka Graphs is built on top of Kafka Streams, it is able to leverage the underlying partitioning scheme of Kafka Streams in order to support distributed graph processing. To facilitate running graph algorithms in a distributed manner, Kafka Streams provides a REST application for managing graph algorithm executions.

java -jar kafka-graphs-rest-app-0.1.0.jar \
  --kafka.graphs.bootstrapServers=localhost:9092 \
  --kafka.graphs.zookeeperConnect=localhost:2181

When multiple instantiations of the REST application are started on different hosts, all configured with the same Kafka and ZooKeeper servers, they will automatically coordinate with each other to partition the set of vertices when executing a graph algorithm. When a REST request is sent to one host, it will automatically proxy the request to the other hosts if necessary.

More information on the REST application can be found here.

Summary

Kafka Graphs is a new addition to the rapidly expanding ecosystem surrounding Apache Kafka. Kafka Graphs is still in its early stages, but please feel to try it and suggest improvements if you have a need to perform distributed graph analytics with Kafka.

Kafka Graphs: Graph Analytics with Apache Kafka