The Confluent Schema Registry recently added support for tags, metadata, and rules, which together support the concept of a data contract. Data quality rules in a data contract can be expressed using the Google Common Expression Language (CEL), while migration rules can be expressed using JSONata. In this article, I’ll provide some tips on understanding the capabilities of CEL in the context of data contract rules.
The CEL Type System
One of the most important aspects of CEL is that it is a strongly typed language. In CEL, all expressions have a well-defined type, and all operators and functions check that their arguments have the expected types. Consider the following CEL expression:
'hello' == 3 // raises an error
One would expect this expression to return false, but instead it raises an error. That’s because the CEL type checker ensures that both arguments to the equality operator have the same type. The CEL type system includes the usual built-in types (int, uint, double, bool, string, bytes, list, map, null_type), but also has two built-in types that deserve further commentary: type and dyn.
Every value in CEL has a type, which is also considered a value. That means that the type of a value can be used in expressions. Below we use the type function to compare the type of the value 'hello' with the type of the value 'world'.
type('hello') == type('world') // both sides evaluate to the 'string' type
The other type that deserves discussion is dyn. This type is the union of all other types, similar to Object in Java. A value can have its type converted to the dyn type using the dyn function. The dyn function can often be used to prevent the type checker from raising an error. For example, the following expression may raise an error because both the equality and conditional operators require arguments to have the same type.
value == null ? message.name + ' ' + message.lastName : value
However, the following expression using the dyn function will not raise an error.
dyn(value) == null ? message.name + ' ' + message.lastName : dyn(value)
Guards for Field-Level Rules
When defining a data contract rule, a CEL expression can be used at the message level or at the field level. Below we express a message-level rule of type CEL to check that the ssn field is not empty.
{ "ruleSet": { "domainRules": [ { "name": "checkSsn", "kind": "CONDITION", "type": "CEL", "mode": "WRITE", "expr": "message.ssn != ''" } ] } }
Message-level rules with type CEL are passed a variable named message, which represents the message being processed.
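Hand-written rule definitions make it easy to trip over JSON quoting. As a rough sketch, the rule set above could be assembled in Python and serialized with the standard json module. The helper name make_rule is hypothetical; only the field names (name, kind, type, mode, expr) and their values come from the rule shown above.

```python
import json


def make_rule(name, kind, rule_type, mode, expr):
    """Build one domain rule as a plain dict (hypothetical helper)."""
    return {"name": name, "kind": kind, "type": rule_type, "mode": mode, "expr": expr}


# The CEL expression uses single quotes, so it needs no escaping inside JSON.
check_ssn = make_rule("checkSsn", "CONDITION", "CEL", "WRITE", "message.ssn != ''")

rule_set = {"ruleSet": {"domainRules": [check_ssn]}}

# Serialize the rule set to JSON text.
print(json.dumps(rule_set, indent=2))
```

Building the structure as data and serializing it once means the CEL text never has to be escaped by hand.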
We could have instead expressed the above condition as a field-level rule using the CEL_FIELD rule type.
{ "ruleSet": { "domainRules": [ { "name": "checkSsn", "kind": "CONDITION", "type": "CEL_FIELD", "mode": "WRITE", "expr": "name == 'ssn' ; value != ''" } ] } }
Rules with type CEL_FIELD are executed for every field in a message. Such rules are passed the following variables:
- value: the field value
- fullName: the fully-qualified name of the field
- name: the field name
- typeName: the name of the field type, one of STRING, BYTES, INT, LONG, FLOAT, DOUBLE, BOOLEAN
- tags: tags that apply to the field
- message: the containing message
Note that the expr for a rule of type CEL_FIELD is of the following form, where the guard is an optional CEL expression preceding the CEL expression for the rule body.
<CEL expr for guard> ; <CEL expr for rule body>
Guards are useful for preventing the type checker from raising an error. Without a guard, the following CEL expression will raise an error for any field in the message that is not of type string, because the inequality operator requires that both arguments have the same type.
{ "ruleSet": { "domainRules": [ { "name": "checkSsn", "kind": "CONDITION", "type": "CEL_FIELD", "mode": "WRITE", "expr": "name == 'ssn' && value != ''" } ] } }
One could fix the above expression using the dyn function as shown below, but that is less obvious than using a guard.
{ "ruleSet": { "domainRules": [ { "name": "checkSsn", "kind": "CONDITION", "type": "CEL_FIELD", "mode": "WRITE", "expr": "name == 'ssn' && dyn(value) != ''" } ] } }
If we want to apply the rule body to all fields with the same type, we can use a guard that checks the typeName:
{ "ruleSet": { "domainRules": [ { "name": "checkSsn", "kind": "CONDITION", "type": "CEL_FIELD", "mode": "WRITE", "expr": "typeName == 'STRING' ; value != ''" } ] } }
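To make the role of the guard concrete, here is a toy Python model of how a guarded CEL_FIELD condition might be applied to each field. It is only a sketch: the guard and body are ordinary Python functions standing in for CEL expressions, the Field class and check_fields helper are hypothetical, and treating a field as passing when its guard is false is an assumption rather than documented behavior.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str
    type_name: str  # e.g. "STRING", "INT"
    value: object


def check_fields(fields, guard, body):
    """Return the names of fields that fail a guarded condition.

    The body runs only for fields where the guard holds, so it never
    sees a value of an unexpected type.
    """
    failures = []
    for field in fields:
        if not guard(field):
            continue  # assumption: skipped fields count as passing
        if not body(field):
            failures.append(field.name)
    return failures


fields = [
    Field("ssn", "STRING", ""),
    Field("name", "STRING", "Ada"),
    Field("age", "INT", 36),
]

# Mirrors the rule expression: typeName == 'STRING' ; value != ''
failed = check_fields(
    fields,
    guard=lambda f: f.type_name == "STRING",
    body=lambda f: f.value != "",
)
print(failed)  # ['ssn']
```

The integer field never reaches the body, which is the same effect the guard has on the type checker: the string comparison is only attempted where it makes sense.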
Checking for Empty or Missing Fields
Confluent Schema Registry supports schemas for Avro, Protobuf, and JSON Schema. When using CEL expressions, checking for an empty or missing field in an object may need to be performed differently for each of the corresponding schema types.
In Protobuf, a missing field is set to the default value for the field type. From the Protobuf documentation:
- For strings, the default value is the empty string.
- For bytes, the default value is empty bytes.
- For bools, the default value is false.
- For numeric types, the default value is zero.
- For enums, the default value is the first defined enum value, which must be 0.
- For message fields, the field is not set. Its exact value is language-dependent.
For Protobuf, a field with type message is the only type of field that might be null, depending on the language.
In Avro, a field can be null only if it has been declared as optional, that is, as a union of null and one or more other types.
Similarly, in JSON Schema, a field can be null if it has been declared as optional, that is, as a oneOf of null and one or more other types. Furthermore, any field can be missing unless it has been declared as required.
As an example, for Protobuf, to check that a string is empty or missing, we use the following expression at the message level.
message.ssn == ''
Alternatively, for Protobuf, one can use the has macro to test whether the field has been set to something other than its default value.
!has(message.ssn)
For Avro, we would use a rule like the following. Note that we need to use the dyn function since in the CEL type system, the null type is distinct from the other types.
message.ssn == '' || dyn(message.ssn) == null
For JSON, we would use a rule with the has macro, since fields can be missing in JSON.
!has(message.ssn) || message.ssn == '' || dyn(message.ssn) == null
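The three variants can be collected in one place. The sketch below is a small Python lookup that fills a field name into the expressions shown above; the function name and the schema type keys (PROTOBUF, AVRO, JSON) are illustrative choices, not part of any API.

```python
# CEL templates for "string field is empty or missing", keyed by schema type.
# The expressions are the ones discussed above; the keys are illustrative.
EMPTY_CHECKS = {
    "PROTOBUF": "message.{f} == ''",
    "AVRO": "message.{f} == '' || dyn(message.{f}) == null",
    "JSON": "!has(message.{f}) || message.{f} == '' || dyn(message.{f}) == null",
}


def empty_check(schema_type, field):
    """Return the CEL expression that tests whether a string field is empty or missing."""
    return EMPTY_CHECKS[schema_type].format(f=field)


print(empty_check("AVRO", "ssn"))
# message.ssn == '' || dyn(message.ssn) == null
```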
CEL String Literals
CEL supports several kinds of string literals. Quoted string literals can use either single-quotes or double-quotes.
message.name == 'hello' // equivalent to: message.name == "hello"
Since CEL expressions are embedded in JSON strings in data contract rules, single quotes are preferred because they do not need to be escaped.
A triple-quoted string is delimited by either three single-quotes or three double-quotes, and may contain newlines.
message.name == '''I'm happy for y'all'''
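Finally, a string preceded by the r or R character is a raw string, which does not interpret escape sequences. A raw string is useful for representing regular expressions.
message.ssn.matches(r'\d{3}-\d{2}-\d{4}')
When such an expression is embedded in the expr of a JSON rule definition, each backslash must be doubled (\\d) to satisfy JSON string escaping.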
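Backslashes need care because a rule's expr is a JSON string, and JSON applies its own escaping on top of CEL's. The Python sketch below assumes the intended pattern is \d{3}-\d{2}-\d{4}. It uses the standard json module to show how the backslashes appear in the serialized rule, and the standard re module as a stand-in to sanity-check the pattern (CEL regular expressions follow RE2 syntax, which agrees with Python for a pattern this simple).

```python
import json
import re

# The CEL source text: a raw string, so each backslash reaches the regex engine as written.
cel_expr = r"message.ssn.matches(r'\d{3}-\d{2}-\d{4}')"

# JSON escapes each backslash, so the serialized rule shows them doubled.
encoded = json.dumps({"expr": cel_expr})
print(encoded)
# {"expr": "message.ssn.matches(r'\\d{3}-\\d{2}-\\d{4}')"}

# Decoding the JSON gives back the original CEL text.
assert json.loads(encoded)["expr"] == cel_expr

# Sanity-check the pattern itself with Python's re module (a stand-in for CEL's matcher).
pattern = r"\d{3}-\d{2}-\d{4}"
print(bool(re.search(pattern, "123-45-6789")))  # True
print(bool(re.search(pattern, "123456789")))    # False
```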
Transformations using CEL Maps and Messages
So far, all examples using the CEL and CEL_FIELD rule types have used CEL expressions to represent conditions. Both rule types can also use CEL expressions to represent transformations. Below we use a rule of type CEL_FIELD to set an empty status value to the string 'unknown'.
{ "ruleSet": { "domainRules": [ { "name": "transformStatus", "kind": "TRANSFORM", "type": "CEL_FIELD", "mode": "WRITE", "expr": "name == 'status' ; value == '' ? 'unknown' : value" } ] } }
To use the CEL rule type as a message-level transformation, return a CEL map. For example, assume that the Order message has two fields, orderId and status, and we want to transform the status field as before. The following message-level rule of type CEL will return a CEL map with the desired result.
{ "ruleSet": { "domainRules": [ { "name": "transformStatus", "kind": "TRANSFORM", "type": "CEL", "mode": "WRITE", "expr": "{ 'orderId': message.orderId, 'status': message.status == '' ? 'unknown' : message.status }" } ] } }
If possible, the resulting CEL map will be automatically converted to an object of the appropriate schema type, either Avro, Protobuf, or JSON.
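In Python terms, the map expression in the rule above behaves like a function from one record to a new one. The sketch below is only an analogy, with a plain dict standing in for the Order message and a hypothetical function name; it is not how the rule is executed.

```python
def transform_status(message):
    """Mirror the CEL map expression: copy orderId, default an empty status."""
    return {
        "orderId": message["orderId"],
        "status": "unknown" if message["status"] == "" else message["status"],
    }


print(transform_status({"orderId": 1, "status": ""}))
# {'orderId': 1, 'status': 'unknown'}
print(transform_status({"orderId": 2, "status": "shipped"}))
# {'orderId': 2, 'status': 'shipped'}
```

Note that the map spells out every field of the result, including orderId, which is carried over unchanged.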
CEL also has first-class support for Protobuf messages. To return a Protobuf message, the expression M{f1: e1, f2: e2, ..., fN: eN} can be used, where M is the simple or qualified name of the message type. For example, if com.acme.Order is a Protobuf message type, the following rule can be used to return a Protobuf object.
{ "ruleSet": { "domainRules": [ { "name": "transformStatus", "kind": "TRANSFORM", "type": "CEL", "mode": "WRITE", "expr": "com.acme.Order{ orderId: message.orderId, status: message.status == '' ? 'unknown' : message.status }" } ] } }
Of course, returning a Protobuf object should only be done if the data contract has a Protobuf schema.