Building A Graph Database Using Kafka

I previously showed how to build a relational database using Kafka. This time I’ll show how to build a graph database using Kafka. Just as with KarelDB, at the heart of our graph database will be the embedded key-value store, KCache.
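To give a feel for how a key-value store can sit on top of Kafka, here is a minimal sketch of a store whose source of truth is an append-only log. It is a toy model under my own assumptions, not KCache's API: a Python list stands in for the Kafka topic, a `None` value acts as a delete marker, and the class and method names are hypothetical.

```python
class LogBackedStore:
    """Toy key-value store whose source of truth is an append-only log."""

    def __init__(self, log=None):
        # The list stands in for a Kafka topic; each entry is a (key, value) record.
        self.log = log if log is not None else []
        self.cache = {}
        # Rebuild the in-memory view by replaying every record in order.
        for key, value in self.log:
            self._apply(key, value)

    def _apply(self, key, value):
        if value is None:
            # A None value acts as a tombstone, i.e. a delete.
            self.cache.pop(key, None)
        else:
            self.cache[key] = value

    def put(self, key, value):
        self.log.append((key, value))  # write to the log first
        self._apply(key, value)        # then update the local view

    def delete(self, key):
        self.put(key, None)

    def get(self, key):
        # Reads are served from the local view, never from the log.
        return self.cache.get(key)


store = LogBackedStore()
store.put("k1", "v1")
store.put("k1", "v2")
store.put("k2", "x")
store.delete("k2")

# A second instance that replays a copy of the same log reaches the same state.
replica = LogBackedStore(list(store.log))
```

The point of the sketch is that any instance that replays the log converges on the same contents, which is what lets an embedded store keep its durable state in Kafka.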

Kafka as a Graph Database

The graph database that I’m most familiar with is HGraphDB, a graph database that uses HBase as its backend. More specifically, it uses the HBase client API, which allows it to integrate with not only HBase, but also any other data store that implements the HBase client API, such as Google BigTable. This leads to an idea. Rather than trying to build a new graph database around KCache entirely from scratch, we can try to wrap KCache with the HBase client API.

HBase is an example of a wide column store, also known as an extensible record store. Like its predecessor BigTable, it allows any number of column values to be associated with a key, without requiring a schema. For this reason, a wide column store can also be seen as a two-dimensional key-value store.
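The "two-dimensional key-value store" view can be sketched in a few lines. This is only an illustration of the data model, not how HBase stores data; the `WideColumnTable` class and its methods are names I made up for the example.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy wide column table: row key -> {column name -> value}, with no schema."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row, column, value):
        # Any column name can be attached to any row.
        self.rows[row][column] = value

    def get(self, row, column=None):
        cells = self.rows.get(row, {})
        return dict(cells) if column is None else cells.get(column)

    def scan(self):
        # Rows come back sorted by key, each with its columns sorted by name.
        for row in sorted(self.rows):
            yield row, dict(sorted(self.rows[row].items()))


table = WideColumnTable()
table.put("row1", "cf:a", "value1")
table.put("row1", "cf:z", "another")  # a second column on the same row
table.put("row2", "cf:b", "value2")   # a row with a different set of columns
```

A lookup takes two coordinates, a row key and a column name, and different rows are free to carry different columns.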

I’ve implemented KStore as a wide column store (or extensible record store) abstraction for Kafka that relies on KCache under the covers. KStore implements the HBase client API, so it can be used wherever the HBase client API is supported.

Let’s try to use KStore with HGraphDB. After installing and starting the Gremlin console, we install KStore and HGraphDB.

$ ./bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: tinkerpop.tinkergraph

gremlin> :install org.apache.hbase hbase-client 2.2.1
gremlin> :install org.apache.hbase hbase-common 2.2.1
gremlin> :install org.apache.hadoop hadoop-common 3.1.2
gremlin> :install io.kstore kstore 0.1.0
gremlin> :install io.hgraphdb hgraphdb 3.0.0
gremlin> :plugin use io.hgraphdb
 

After we restart the Gremlin console, we configure HGraphDB with the KStore connection class and the Kafka bootstrap servers. We can then issue Gremlin commands against Kafka.

$ ./bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
plugin activated: io.hgraphdb
plugin activated: tinkerpop.tinkergraph

gremlin> cfg = new HBaseGraphConfiguration()\
......1> .set("hbase.client.connection.impl", "io.kstore.KafkaStoreConnection")\
......2> .set("kafkacache.bootstrap.servers", "localhost:9092")
==>io.hgraphdb.HBaseGraphConfiguration@41b0ae4c

gremlin> graph = new HBaseGraph(cfg)
==>hbasegraph[hbasegraph]

gremlin> g = graph.traversal()
==>graphtraversalsource[hbasegraph[hbasegraph], standard]

gremlin> v1 = g.addV('person').property('name','marko').next()
==>v[0371a1db-8768-4910-94e3-7516fc65dab3]

gremlin> v2 = g.addV('person').property('name','stephen').next()
==>v[3bbc9ce3-24d3-41cf-bc4b-3d95dbac6589]

gremlin> g.V(v1).addE('knows').to(v2).property('weight',2).iterate()
  

It works! HBaseGraph is now using Kafka as its storage backend.
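For readers wondering how vertices and edges can live in a wide column store at all, here is a rough sketch of one possible layout. It is an assumption for illustration and not HGraphDB's actual schema: the two dicts stand in for a vertex table and an edge table, and the column names (`label`, `out`, `in`, `p:<name>`) are hypothetical.

```python
import uuid

# Each "table" maps a row key to a dict of column -> value.
vertices = {}
edges = {}


def add_vertex(label, **props):
    vid = str(uuid.uuid4())
    # One row per vertex; each property becomes its own column.
    vertices[vid] = {"label": label, **{f"p:{k}": v for k, v in props.items()}}
    return vid


def add_edge(label, out_v, in_v, **props):
    eid = str(uuid.uuid4())
    # One row per edge; the endpoints are stored as ordinary columns.
    edges[eid] = {"label": label, "out": out_v, "in": in_v,
                  **{f"p:{k}": v for k, v in props.items()}}
    return eid


def out_neighbors(vid, label):
    # Full scan of the edge table; a real store would use an index instead.
    return [e["in"] for e in edges.values()
            if e["out"] == vid and e["label"] == label]


v1 = add_vertex("person", name="marko")
v2 = add_vertex("person", name="stephen")
add_edge("knows", v1, v2, weight=2)
names = [vertices[v]["p:name"] for v in out_neighbors(v1, "knows")]
```

Because the rows need no schema, a vertex or edge can carry whatever properties it happens to have, which is why the wide column model is a comfortable fit for property graphs.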

Kafka as a Document Database

Now that we have a wide column store abstraction for Kafka in the form of KStore, let’s see what else we can do with it. Another database that uses the HBase client API is HDocDB, a document database for HBase. To use KStore with HDocDB, first we need to set hbase.client.connection.impl in our hbase-site.xml as follows.

<configuration>
    <property>
        <name>hbase.client.connection.impl</name>
        <value>io.kstore.KafkaStoreConnection</value>
    </property>
    <property>
        <name>kafkacache.bootstrap.servers</name>
        <value>localhost:9092</value>
    </property>
</configuration>

Now we can issue MongoDB-like commands against Kafka, using HDocDB.

$ jrunscript -cp <hbase-conf-dir>:target/hdocdb-1.0.1.jar:../kstore/target/kstore-0.1.0.jar -f target/classes/shell/hdocdb.js -f -

nashorn> db.mycoll.insert( { _id: "jdoe", first_name: "John", last_name: "Doe" } )

nashorn> var doc = db.mycoll.find( { last_name: "Doe" } )[0]

nashorn> print(doc)
{"_id":"jdoe","first_name":"John","last_name":"Doe"}

nashorn> db.mycoll.update( { last_name: "Doe" }, { $set: { first_name: "Jim" } } )

nashorn> var doc = db.mycoll.find( { last_name: "Doe" } )[0]

nashorn> print(doc)
{"_id":"jdoe","first_name":"Jim","last_name":"Doe"}
  

Pretty cool, right?

Kafka as a Wide Column Store

Of course, there is no requirement to wrap KStore with another layer in order to use it. KStore can be used directly as a wide column store abstraction on top of Kafka. I’ve integrated KStore with the HBase Shell so that one can work directly with KStore from the command line.

$ ./kstore-shell.sh localhost:9092

hbase(main):001:0> create 'test', 'cf'
Created table test
Took 0.2328 seconds
=> Hbase::Table - test

hbase(main):003:0* list
TABLE
test
1 row(s)
Took 0.0192 seconds
=> ["test"]

hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
Took 0.1284 seconds

hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
Took 0.0113 seconds

hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
Took 0.0096 seconds

hbase(main):007:0> scan 'test'
ROW                                COLUMN+CELL
 row1                              column=cf:a, timestamp=1578763986780, value=value1
 row2                              column=cf:b, timestamp=1578763992567, value=value2
 row3                              column=cf:c, timestamp=1578763996677, value=value3
3 row(s)
Took 0.0233 seconds

hbase(main):008:0> get 'test', 'row1'
COLUMN                             CELL
 cf:a                              timestamp=1578763986780, value=value1
1 row(s)
Took 0.0106 seconds

hbase(main):009:0>

There’s no limit to the type of fun one can have with KStore. 🙂

Back to Graphs

Getting back to graphs, another popular graph database is JanusGraph, which is interesting because it has a pluggable storage layer. Some of the storage backends that it supports through this layer are HBase, Cassandra, and BerkeleyDB.

Of course, KStore can be used in place of HBase when configuring JanusGraph. Again, it’s simply a matter of configuring the KStore connection class in the JanusGraph configuration.

storage.hbase.ext.hbase.client.connection.impl: io.kstore.KafkaStoreConnection
storage.hbase.ext.kafkacache.bootstrap.servers: localhost:9092

However, we can do better when integrating JanusGraph with Kafka. JanusGraph can be integrated with any storage backend that supports a wide column store abstraction. When integrating with key-value stores such as BerkeleyDB, JanusGraph provides its own adapter for mapping a key-value store to a wide column store. Thus we can simply provide KCache to JanusGraph as a key-value store, and it will perform the mapping to a wide column store abstraction for us automatically.
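The idea behind such an adapter can be sketched as follows. This is a simplified illustration under my own assumptions, not JanusGraph's actual adapter or key encoding: the row key and column are combined into one composite key in an ordered key-value store, so that all columns of a row sit next to each other and can be read with a single range scan. The class names are hypothetical.

```python
import bisect


class OrderedKV:
    """Toy ordered key-value store: keys are kept sorted so ranges can be scanned."""

    def __init__(self):
        self.keys = []
        self.values = {}

    def put(self, key, value):
        if key not in self.values:
            bisect.insort(self.keys, key)
        self.values[key] = value

    def get(self, key):
        return self.values.get(key)

    def range(self, start, end):
        # Yield (key, value) pairs with start <= key < end, in key order.
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_left(self.keys, end)
        for key in self.keys[lo:hi]:
            yield key, self.values[key]


class WideColumnAdapter:
    """Presents (row, column) -> value on top of a flat ordered key-value store."""

    def __init__(self, kv):
        self.kv = kv

    def put(self, row, column, value):
        self.kv.put((row, column), value)  # composite key

    def get(self, row, column):
        return self.kv.get((row, column))

    def get_row(self, row):
        # All columns of a row are adjacent in key order, so one range scan
        # finds them. row + "\x00" is the smallest string greater than row.
        start, end = (row, ""), (row + "\x00", "")
        return {col: val for (_, col), val in self.kv.range(start, end)}


adapter = WideColumnAdapter(OrderedKV())
adapter.put("row1", "cf:b", "2")
adapter.put("row1", "cf:a", "1")
adapter.put("row10", "cf:a", "other row")
adapter.put("row0", "cf:a", "x")
```

Here `adapter.get_row("row1")` returns only the two `row1` columns, even though `row10` shares the same prefix, because the scan is bounded by the composite key rather than by a string prefix.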

I’ve implemented a new storage plugin for JanusGraph called janusgraph-kafka that does exactly this. Let’s try it out. After following the installation instructions for janusgraph-kafka, we can start the Gremlin console.

$ ./bin/gremlin.sh

         \,,,/
         (o o)
-----oOOo-(3)-oOOo-----
plugin activated: tinkerpop.server
plugin activated: tinkerpop.tinkergraph
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.utilities
plugin activated: janusgraph.imports

gremlin> graph = JanusGraphFactory.open('conf/janusgraph-kafka.properties')
==>standardjanusgraph[io.kcache.janusgraph.diskstorage.kafka.KafkaStoreManager:[127.0.0.1]]

gremlin> g = graph.traversal()
==>graphtraversalsource[standardjanusgraph[io.kcache.janusgraph.diskstorage.kafka.KafkaStoreManager:[127.0.0.1]], standard]

gremlin> v1 = g.addV('person').property('name','marko').next()
==>v[4320]

gremlin> v2 = g.addV('person').property('name','stephen').next()
==>v[4104]

gremlin> g.V(v1).addE('knows').to(v2).property('weight',2).iterate()
  

Works like a charm.

Summary

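In this and the previous post, I’ve shown how Kafka can be used as a relational database (with KarelDB), a graph database (with HGraphDB or JanusGraph), a document database (with HDocDB), and a wide column store (with KStore).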

I guess I could have titled this post “Building a Graph Database, Document Database, and Wide Column Store Using Kafka”, although that’s a bit long. In any case, hopefully I’ve shown that Kafka is a lot more versatile than most people realize.

