October 2018 – Robert Yokota

In previous posts, I had discussed how Kafka can be used for both stream-relational processing as well as graph processing. I will now show how Kafka can also be used to process Datalog programs.

The Datalog Language

Datalog ¹ is a declarative logic programming language that is used as the query language for databases such as Datomic and LogicBlox.² Both relational queries and graph-based queries can be expressed in Datalog, so in one sense it can be seen as a unifying or foundational database language. ³

Finite Model Theory

To mathematicians, both graphs and relations can be viewed as finite models. Finite model theory has been described as “the backbone of database theory”⁴. When SQL was developed, first-order logic was used as its foundation since it was seen as sufficient for expressing any database query that might ever be needed. Later theoreticians working in finite model theory were better able to understand the limits on the expressivity of first-order logic.

For example, Datalog was shown to be able to express queries involving transitive closure, which cannot be expressed in first-order logic nor in relational algebra, since relational algebra is equivalent to first-order logic.⁵ This led to the addition of the WITH RECURSIVE clause to SQL-99 in order to support transitive closure queries against relational databases.

Facts and Rules

Below is an example Datalog program.

  
parent(’Isabella’, ’Ella’).
parent(’Ella’, ’Ben’).
parent(’Daniel’, ’Ben’).
sibling(X, Y) :- parent(X, Z), parent(Y, Z), X̸ != Y.

The Datalog program above consists of 3 facts and 1 rule.⁶ A rule has a head and a body (separated by the :- symbol). The body is itself a conjunction of subgoals (separated by commas). The symbols X, Y, and Z are variables, while ‘Isabella’, ‘Ella’, ‘Ben’, and ‘Daniel’ are constants.

In the above program, the facts state that Isabella has a parent named Ella, Ella has a parent named Ben, and Daniel has a parent named Ben. The rule states that if there exist two people (X and Y) who are not the same (X != Y) and who have the same parent (Z), then they are siblings. Thus we can see that Ella and Daniel are siblings.

The initial facts of a Datalog program are often referred to as the extensional database (EDB), while the remaining rules define the intensional database (IDB). When evaluating a Datalog program, we use the extensional database as input to the rules in order to output the intensional database.

Datalog Evaluation

There are several ways to evaluate a Datalog program. One of the more straightforward algorithms is called semi-naive evaluation. In semi-naive evaluation, we use the most recently generated facts ( $\Delta_\text{old}$ ) to satisfy one subgoal, and all facts $F$ to satisfy the remaining subgoals, which generates a new set of facts ( $\Delta_\text{new}$ ). These new facts are then used for the next evaluation round, and we continue iteratively until no new facts are derived.⁷ The algorithm is shown below, where $\textbf{EVAL-INCR}$ is the method of satisfying subgoals just described.

$\line(1,0){500} \\ \texttt{// Input: } F \texttt{ (set of Datalog EDB facts),} \\ \texttt{//}\hspace{56pt} R \texttt{ (set of Datalog rules with non-empty body)} \\ \Delta_\text{old} := F \\ \textbf{while } \Delta_\text{old} \neq \emptyset \\ \indent\Delta_\text{new} := \textbf{EVAL-INCR}(R,F,\Delta_\text{old}) \\ \indent F := F \cup \Delta_\text{new} \\ \indent\Delta_\text{old} := \Delta_\text{new} \\ \texttt{output } F \\ \line(1,0){500}$

Semi-naive evaluation is essentially how incremental view maintenance is performed in relational databases.

So how might we use Kafka to evaluate Datalog programs? Well, it turns out that researchers have shown how to perform distributed semi-naive evaluation of Datalog programs on Pregel, Google’s framework for large-scale graph processing. Since Kafka Graphs supports Pregel, we can use Kafka Graphs to evaluate Datalog programs as well.

Datalog on Pregel

The general approach for implementing distributed semi-naive evaluation on Pregel is to treat each constant in the facts, such as ‘Ella’ and ‘Daniel’, as a vertex in a complete graph.⁸ The value of each vertex will be the complete set of rules $R$ as well as the subset of facts $F$ that reference the constant represented by the vertex. During each round of the semi-naive algorithm, a given vertex will resolve the subgoals with the facts that it has and then send facts or partial rules to other vertices if necessary. That means that each round will be comprised of one or more Pregel supersteps, since vertices receive messages that were sent in the previous superstep.

Beyond the straightforward approach described above⁹, many enhancements can be made, such as efficient grouping of vertices into super-vertices, rule rewriting, and support for recursive aggregates.¹⁰

Datalog Processing With Kafka Streams

Based on existing research, we can adapt the above approach to Kafka Graphs, which I call KDatalog.¹¹

Assuming our Datalog program is in a file named siblings.txt, we first use KDatalog to parse the program and construct a graph consisting of vertices for the constants in the program, with edges between every pair of vertices. As mentioned, the initial value of the vertex for a given constant $c$ will consist of the set of rules $R$ as well as the subset of facts $F$ that contain $c$ .

    
    StreamsBuilder builder = new StreamsBuilder();
    Properties producerConfig = ...
    FileReader reader = new FileReader("siblings.txt");
    KGraph<String, String, String> graph = 
        KDatalog.createGraph(builder, producerConfig, reader);

We can then use this graph with KDatalog’s implementation of distributed semi-naive evaluation on Pregel, which is called DatalogComputation. After the computation terminates, the output of the algorithm will be the union of the values at every vertex.

Since KDatalog uses Kafka Graphs, and Kafka Graphs is built with Kafka Streams, we are essentially using Kafka Streams as a distributed Datalog engine.

The Future of Datalog

Datalog has recently enjoyed somewhat of a resurgence, as it has found successful application in a wide variety of areas, including data integration, information extraction, networking, program analysis, security, and cloud computing.¹² Datalog has even been used to express conflict-free replicated data types (CRDTs) in distributed systems. If interest in Datalog continues to grow, perhaps Kafka will play an important role in helping enterprises use Datalog for their business needs.

Robert Yokota

Month: October 2018

KDatalog: Kafka as a Datalog Engine