In past posts I had discussed the differences between Spark and Flink in terms of stream processing, with Spark treating streaming data as micro-batches, and Flink treating streaming data as a first-class citizen. Apache Beam is another streaming framework that has an interesting goal of generalizing other stream processing platforms.
Recently I joined Confluent, so I was interested in learning how Beam differs from Kafka Streams. Kafka Streams is a stream processing framework on top of Kafka that has both a functional DSL as well as a declarative SQL layer called KSQL. Confluent has an informative blog that compares Kafka Streams with Beam in great technical detail. From what I’ve gathered, one way to state the differences between the two systems is as follows:
- Kafka Streams is a stream-relational processing platform.
- Apache Beam is a stream-only processing platform.
A stream-relational processing platform has the following capabilities which are typically missing in a stream-only processing platform:
- Relations (or tables) are first-class citizens, i.e. each has an independent identity.
- Relations can be transformed into other relations.
- Relations can be queried in an ad-hoc manner.
For example, in Beam one can aggregate state using windowing, but this state cannot be queried in an ad-hoc manner. It can only be emitted back into the stream through the use of triggers. This is why Beam can be viewed as a stream-only system.
The capability to transform one relation into another is what gives relational algebra (and SQL) its power. Operations on relations such as selection, projection, and Cartesian product result in more relations, which can then be used for further operations. This is referred to as the closure property of relational algebra.
Ad-hoc querying allows the user to inject himself as a “sink” at various points in the stream. At these junctures, he can construct interactive queries to extract the information that he is really interested in. The user’s temporal presence becomes intertwined with the stream. This essentially gives the stream a more dynamic nature, something that can help the larger organization be more dynamic as well.
Jay Kreps has written a lot about stream-table (or stream-relation) duality, where streams and tables (or relations) can be considered as two ways of looking at the same data. This duality has also been explored in research projects like CQL, which was one of the first systems to provide support for both streams and relations1. In the CQL paper, the authors discuss how both stream-relational systems and stream-only systems are equally expressive. However, the authors also point out the ease-of-use of stream-relational systems over stream-only systems:
“…the dual approach results in more intuitive queries than the stream-only approach.“2
Modern day stream processing platforms attempt to unify the world of batch and stream processing. Batches are essentially treated as bounded streams. However, stream-relational processing platforms go beyond stream-only processing platforms by not only unifying batch and stream processing, but also unifying the world of streams and relations. This means that streams can be converted to relations and vice versa within the data flow of the framework itself (and not just at the endpoints of the flow).
Beam is meant to generalize stream-only systems. Streaming engines can plug into Beam by implementing a Beam Runner. In order to understand both Kafka Streams and Beam, I spent a couple of weekends writing a Beam Runner for Kafka Streams3. It provides a proof-of-concept of how the Beam Dataflow model can be implemented on top of Kafka Streams. However, it doesn’t make use of the ability of Kafka Streams to represent relations, nor does it use the native windowing functionality of Kafka Streams, so it is not as useful to someone who might want the full power of stream-relational functionality4.
I would like to dwell a bit more on stream-table duality as I feel it is a fundamental concept in computer science, not just in databases. As Jay Kreps has mentioned, stream-table duality has been explored extensively by CQL (2003). However, a similar duality was known by the functional programming community even earlier. Here is an interesting reference from over 30 years ago:
“Now we have seen that streams provide an alternative way to model objects with local state. We can model a changing quantity, such as the local state of some object, using a stream that represents the time history of successive states. In essence, we represent time explicitly, using streams, so that we decouple time in our simulated world from the sequence of events that take place during evaluation.”5
The above quote is from Structure and Interpretation of Computer Programs, a classic text in computer science. From a programming perspective, modeling objects with streams leads to a functional programming view of time, whereas modeling objects with state leads to an imperative programming view of time. So perhaps stream-table duality can also be called stream-state duality. Streams and state are alternative ways of modeling the temporality of objects in the external world. This is one of the key ideas behind event sourcing.
If one wants to get even more abstract, analogous concepts exist in philosophy, specifically within the metaphysics of time. For example,
“…eternalism and presentism are conflicting views about the nature of time. The eternalist claims that all times exist and thus that the objects present at all past, present, and future times exist. In contrast, the presentist argues that only the present exists, and thus that only those objects present now exist. As a result the eternalist, but not the presentist, can sincerely quantify over all times and objects existing at those times.”6
So a Beam proponent would be close to an eternalist, while a MySQL advocate would be close to a presentist. With Kafka Streams, you get to be both!
Now that I’ve been able to encompass all of temporal reality with this blog, let me presently return to the main point. Kafka Streams, as a stream-relational processing platform, provides first-class support for relations, relation transformations, and ad-hoc querying, which are all missing in a generalized stream-only framework like Beam7 . And as the world of relational databases has shown (for almost 50 years since Codd’s original paper), the use of relations, along with being able to transform and query them in an interactive fashion, can provide a critically important tool to enterprises that wish to be more responsive amidst changing business needs and circumstances.
- The relations in CQL actually differ from traditional relations in that they are time-varying.
- Arasu, Babu, and Widom. The CQL Continuous Query Language, p. 19, 2006.
- It follows the implementation of the other Beam Runners, especially the one for Samza, another stream-relational processing platform.
- A Kafka Stream reads data from a Kafka topic. Therefore, when using sources in the Dataflow model, data is first read from the source and then written to a topic before being converted by the Runner into a Kafka Stream. Normally one would use Kafka Connect to move bulk data into and out of Kafka. The Kafka Streams Runner also provides a KafkaStreamsIO class that can use topics as either sources or sinks directly.
- Abelson and Sussman. Structure and Interpretation of Computer Programs, MIT Press, p. 287, 1985.
- Haslanger and Kurtz. Persistence, MIT Press, p.16, 2006.
- Although the Beam authors state that at Google they’re adding ad-hoc querying of state from outside of Google Flume pipelines, and so hopefully this feature will one day appear in Beam as well. See Streaming Systems.