Adventures in Hardening HBase

When using HBase, it is often desirable to encrypt data in transit between an HBase client and an HBase server.  This might be the case, for example, when storing PII (Personally Identifiable Information) in HBase, or when running HBase in a multi-tenant cloud environment.

Transport encryption is often enabled by configuring HBase to use SASL with GSSAPI/Kerberos to provide data confidentiality and integrity on a per-connection basis.  However, the default implementation of GSSAPI/Kerberos does not seem to make use of AES-NI hardware acceleration.  In our testing, we have seen up to a 50% increase in P75 latencies for some of our HBase applications when using GSSAPI/Kerberos encryption versus no encryption.

One workaround is to bypass the encryption used by SASL and use an encryption library that can support AES-NI acceleration.  This effort has already been completed for HDFS (HDFS-6606) and is in progress for Hadoop (HADOOP-10768).  Based on some of this earlier work, similar changes can be made for HBase.

The way that the fix for HADOOP-10768 works is conceptually as follows.  If the Hadoop client has been configured to negotiate a cipher suite in place of the one negotiated by SASL, then the following actions will take place:

  • The client will send the server a set of cipher suites that it supports.
  • The server will negotiate a mutually acceptable cipher suite.
  • At the end of the SASL handshake, the server will generate a pair of encryption keys using the cipher suite and send them to the client via the secure SASL channel (see the sketch after this list).
  • The generated encryption keys, instead of the SASL layer, will be used to encrypt all subsequent traffic between the client and server.
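
The key-generation step can be illustrated with the standard JCE APIs.  The sketch below is conceptual only (the class and field names are mine, not those of the HADOOP-10768 patch); it assumes an AES-based suite and produces one key and one initialization vector per direction:

import java.security.GeneralSecurityException;
import java.security.SecureRandom;
import javax.crypto.KeyGenerator;

// Conceptual key material generated by the server at the end of the SASL
// handshake: one key and one IV per direction.  Illustrative only.
public class NegotiatedKeyMaterial {
  final byte[] inKey, outKey;
  final byte[] inIv, outIv;

  NegotiatedKeyMaterial(int keyLengthBits, int ivLengthBytes) throws GeneralSecurityException {
    KeyGenerator keyGen = KeyGenerator.getInstance("AES");
    keyGen.init(keyLengthBits);
    inKey = keyGen.generateKey().getEncoded();
    outKey = keyGen.generateKey().getEncoded();
    SecureRandom random = new SecureRandom();
    inIv = new byte[ivLengthBytes];
    outIv = new byte[ivLengthBytes];
    random.nextBytes(inIv);
    random.nextBytes(outIv);
  }
}

The server then sends this material to the client inside the SASL-protected channel, and both sides use it to encrypt all subsequent traffic.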

Originally I was hoping that the work for HADOOP-10768 would be easily portable to the HBase codebase.  It seems that some of the HBase code for SASL support originated from the corresponding Hadoop code, but has since diverged.  For example, when performing the SASL handshake, the Hadoop client and server use protocol buffers to wrap the SASL state and SASL token, whereas the HBase client and server do not use protocol buffers when passing this data.

Instead, in HBase, during the SASL handshake the client sends

  • The integer length of the SASL token
  • The bytes of the SASL token

whereas the server sends

  • An integer which is either 0 for success or 1 for failure
  • In the case of success,
    • The integer length of the SASL token
    • The bytes of the SASL token
  • In the case of failure,
    • A string representing the class of the Exception
    • A string representing an error message

There is one exception to the above scheme: if the server sends the special integer SWITCH_TO_SIMPLE_AUTH (represented as -88) in place of the length of the SASL token, the rest of the message is ignored and the client falls back to simple authentication instead of completing the SASL handshake.
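
In code, this exchange is just length-prefixed byte arrays written to a DataOutputStream and read back from a DataInputStream.  The following is a simplified sketch of the client side (the real HBase code handles more cases, and the stream plumbing is omitted here):

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.WritableUtils;

// A simplified sketch of one round of the client side of the handshake.
final class SaslHandshakeSketch {
  static final int SWITCH_TO_SIMPLE_AUTH = -88;

  // The client sends the token length followed by the token bytes.
  static void sendToken(DataOutputStream out, byte[] token) throws IOException {
    out.writeInt(token.length);
    out.write(token);
    out.flush();
  }

  // The client reads the server's status, then either a token or an error.
  static byte[] readServerReply(DataInputStream in) throws IOException {
    int status = in.readInt();                       // 0 for success, 1 for failure
    if (status != 0) {
      String exceptionClass = WritableUtils.readString(in);
      String errorMessage = WritableUtils.readString(in);
      throw new IOException(exceptionClass + ": " + errorMessage);
    }
    int len = in.readInt();
    if (len == SWITCH_TO_SIMPLE_AUTH) {
      return null;                                   // fall back to simple authentication
    }
    byte[] token = new byte[len];
    in.readFully(token);
    return token;
  }
}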

In order to adapt the fix for HADOOP-10768 for HBase, I decided to use another special integer called USE_NEGOTIATED_CIPHER (represented as -89) for messages related to cipher suite negotiation between client and server.  If the client is configured to negotiate a cipher suite, then at the beginning of the SASL handshake, in place of a message containing only the length and bytes of a SASL token, it will send a message of the form

  • USE_NEGOTIATED_CIPHER (-89)
  • A string representing the acceptable cipher suites
  • The integer length of the SASL token
  • The bytes of the SASL token

And at the end of the SASL handshake, the server will send one additional message of the form (both of these messages are sketched in code after the list below)

  • A zero for success
  • USE_NEGOTIATED_CIPHER (-89)
  • A string representing the negotiated cipher suite
  • A pair of encryption keys
  • A pair of initialization vectors
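
The sketch below illustrates how the two new messages might be framed, continuing the style of the earlier sketch.  The helper names and the exact encoding of the keys and IVs are my assumptions, not the actual HBASE-16633 code:

import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.WritableUtils;

// Illustrative framing of the two cipher-negotiation messages described above.
final class CipherNegotiationSketch {
  static final int USE_NEGOTIATED_CIPHER = -89;

  // Client: first handshake message when cipher negotiation is enabled.
  static void sendClientCiphers(DataOutputStream out, String cipherSuites, byte[] saslToken)
      throws IOException {
    out.writeInt(USE_NEGOTIATED_CIPHER);             // marker preceding the cipher list
    WritableUtils.writeString(out, cipherSuites);    // e.g. "AES/CTR/NoPadding"
    out.writeInt(saslToken.length);                  // the usual token length ...
    out.write(saslToken);                            // ... and token bytes
    out.flush();
  }

  // Client: reads the extra message the server sends at the end of the handshake.
  static void readNegotiatedCipher(DataInputStream in) throws IOException {
    int status = in.readInt();                       // 0 for success
    int marker = in.readInt();                       // USE_NEGOTIATED_CIPHER
    String cipherSuite = WritableUtils.readString(in);
    byte[] inKey = readByteArray(in);                // one encryption key per direction
    byte[] outKey = readByteArray(in);
    byte[] inIv = readByteArray(in);                 // one initialization vector per direction
    byte[] outIv = readByteArray(in);
    // ... construct the negotiated-cipher codec from cipherSuite, keys, and IVs ...
  }

  private static byte[] readByteArray(DataInputStream in) throws IOException {
    byte[] b = new byte[in.readInt()];
    in.readFully(b);
    return b;
  }
}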

We can turn on DEBUG logging for HBase to see what the client and server SASL negotiation normally looks like, without the custom cipher negotiation.  Here is the client:

Creating SASL GSSAPI client. Server's Kerberos principal name is XXXX
Have sent token of size 688 from initSASLContext.
Will read input token of size 108 for processing by initSASLContext
Will send token of size 0 from initSASLContext.
Will read input token of size 32 for processing by initSASLContext
Will send token of size 32 from initSASLContext.
SASL client context established. Negotiated QoP: auth-conf

And here is the server:

Kerberos principal name is XXXX
Created SASL server with mechanism = GSSAPI
Have read input token of size 688 for processing by saslServer.evaluateResponse()
Will send token of size 108 from saslServer.
Have read input token of size 0 for processing by saslServer.evaluateResponse()
Will send token of size 32 from saslServer.
Have read input token of size 32 for processing by saslServer.evaluateResponse()
SASL server GSSAPI callback: setting canonicalized client ID: XXXX
SASL server context established. Authenticated client: XXXX (auth:SIMPLE). Negotiated QoP is auth-conf

To enable custom cipher negotiation, we set the following HBase configuration parameters for both the client and server (in addition to the properties to enable Kerberos):

<property>
  <name>hbase.rpc.security.crypto.cipher.suites</name> 
  <value>AES/CTR/NoPadding</value>
</property>
<property>
  <name>hbase.rpc.protection</name>
  <value>privacy</value>
</property>
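
The same settings can also be applied programmatically on the client, which can be handy in tests.  A small sketch (the property names are the ones shown above; everything else is the standard HBase client API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

// Programmatic equivalent of the hbase-site.xml snippet above.
public class ClientCryptoConf {
  public static Configuration create() {
    Configuration conf = HBaseConfiguration.create();
    conf.set("hbase.rpc.security.crypto.cipher.suites", "AES/CTR/NoPadding");
    conf.set("hbase.rpc.protection", "privacy");
    // Kerberos must be enabled as well, e.g.:
    conf.set("hbase.security.authentication", "kerberos");
    return conf;
  }
}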

With the above configuration, here is the client (the new actions are the two cipher-related lines):

Creating SASL GSSAPI client. Server's Kerberos principal name is XXXX
Will send client ciphers: AES/CTR/NoPadding
Have sent token of size 651 from initSASLContext.
Will read input token of size 110 for processing by initSASLContext
Will send token of size 0 from initSASLContext.
Will read input token of size 65 for processing by initSASLContext
Will send token of size 65 from initSASLContext.
Client using cipher suite AES/CTR/NoPadding with server
SASL client context established. Negotiated QoP: auth-conf

And here is the server, when using custom cipher negotiation (again, the new actions are the two cipher-related lines):

Have read client ciphers: AES/CTR/NoPadding
Kerberos principal name is XXXX
Created SASL server with mechanism = GSSAPI
Have read input token of size 651 for processing by saslServer.evaluateResponse()
Will send token of size 110 from saslServer.
Have read input token of size 0 for processing by saslServer.evaluateResponse()
Will send token of size 65 from saslServer.
Have read input token of size 65 for processing by saslServer.evaluateResponse()
SASL server GSSAPI callback: setting canonicalized client ID: XXXX
Server using cipher suite AES/CTR/NoPadding with client
SASL server context established. Authenticated client: XXXX (auth:SIMPLE). Negotiated QoP is auth-conf

Once the cipher suite negotiation is complete, both the client and server will have created an instance of SaslCryptoCodec to perform the encryption.  The client will call SaslCryptoCodec.wrap()/unwrap() instead of SaslClient.wrap()/unwrap(), while the server will call SaslCryptoCodec.wrap()/unwrap() instead of SaslServer.wrap()/unwrap().  This is the same technique as used in HADOOP-10768.
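
Conceptually the codec is a thin wrapper around two AES/CTR ciphers, one per direction, initialized from the keys and IVs exchanged during the handshake.  The JCE-based sketch below only illustrates the wrap()/unwrap() contract; the actual codec delegates to an encryption library that can take advantage of AES-NI, and its details are in the HBASE-16633 patch:

import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

// A minimal stand-in for the negotiated-cipher codec: one cipher per direction.
public class SimpleCryptoCodec {
  private final Cipher encryptor;
  private final Cipher decryptor;

  public SimpleCryptoCodec(byte[] outKey, byte[] outIv, byte[] inKey, byte[] inIv)
      throws GeneralSecurityException {
    encryptor = Cipher.getInstance("AES/CTR/NoPadding");
    encryptor.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(outKey, "AES"), new IvParameterSpec(outIv));
    decryptor = Cipher.getInstance("AES/CTR/NoPadding");
    decryptor.init(Cipher.DECRYPT_MODE, new SecretKeySpec(inKey, "AES"), new IvParameterSpec(inIv));
  }

  // Used in place of SaslClient.wrap() / SaslServer.wrap().
  public byte[] wrap(byte[] data, int offset, int len) {
    return encryptor.update(data, offset, len);
  }

  // Used in place of SaslClient.unwrap() / SaslServer.unwrap().
  public byte[] unwrap(byte[] data, int offset, int len) {
    return decryptor.update(data, offset, len);
  }
}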

With the above code deployed to our production servers, we can compare the latencies of different encryption modes for one of our HBase applications.  (In order to run clients in different modes we have also patched our HBase servers with the fix for HBASE-14865.)  Below we show the P50, P75, and P95 latencies over a 12-hour period.  In each chart, the top line is an HBase client configured with GSSAPI/Kerberos encryption (higher is worse), the middle line is a client configured with accelerated encryption, and the bottom line is a client configured with no encryption.

[Chart: P50 latencies]

[Chart: P75 latencies]

[Chart: P95 latencies]

Also, here is the user CPU time for the three differently configured HBase clients (GSSAPI/Kerberos encryption, accelerated encryption, no encryption).

[Chart: user CPU time for the three client configurations]

We can see that accelerated encryption provides a significant performance improvement over GSSAPI/Kerberos encryption.  The changes I made to HBase in order to support accelerated encryption are available at HBASE-16633.


HBase as a Multi-Model Data Store

Recently I noticed that several NoSQL stores that claim to be multi-model data stores are implemented on top of a key-value layer. By using simple key-value pairs, such stores are able to support both documents and graphs.

A wide column store such as HBase seems like a more natural fit for a multi-model data store, since a key-value pair is just a row with a single column. There are many graph stores built on top of HBase, such as Zen, Titan, and S2Graph. However, I couldn’t find any document stores built on top of HBase. So I decided to see how hard it would be to create a document layer for HBase, which I call HDocDB.

Document databases tend to provide three different types of APIs. There are language-specific client APIs (MongoDB), REST APIs (CouchDB), and SQL-like APIs (Couchbase, Azure DocumentDB). For HDocDB, I decided to use a Java client library called OJAI.

One nice characteristic of HBase is that multiple operations to the same row can be performed atomically. If a document can be stored in columns that all reside in the same row, then the document can be kept consistent when modifications are made to different columns that comprise the document. Many graph layers on top of HBase use a “tall table” model where edges for the same graph are stored in different rows. Since operations which span rows are not atomic in HBase, inconsistencies can arise in a graph, which must be fixed using other means (batch jobs, map-reduce, etc.). By storing a single document using a single row, situations involving inconsistent documents can be avoided.
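
For example, all of the columns that make up a document can be written with a single Put, which HBase applies atomically within the row.  A small sketch using the standard HBase Java client (the table and column names are made up, and a 1.x-style client API is assumed):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// All columns of one document go into one Put, so the document is updated atomically.
public class AtomicDocumentWrite {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection();
         Table table = conn.getTable(TableName.valueOf("documents"))) {
      byte[] cf = Bytes.toBytes("c");
      Put put = new Put(Bytes.toBytes("doc-1"));
      put.addColumn(cf, Bytes.toBytes("first_name"), Bytes.toBytes("Ada"));
      put.addColumn(cf, Bytes.toBytes("last_name"), Bytes.toBytes("Lovelace"));
      put.addColumn(cf, Bytes.toBytes("age"), Bytes.toBytes(36));
      table.put(put);  // either all three columns are applied, or none
    }
  }
}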

To store a document in a single row, we use a strategy called “shredding” that was developed when researchers first attempted to store XML in relational databases. In the case of JSON, which is easier to store than XML (due to the lack of schema and no requirement for preserving document order except in the case of arrays), we use a technique called key-flattening that was developed for the research system Argo. When key-flattening is adapted to HBase, each scalar value in the document is stored as a separate column, with the column qualifier being the full JSON path to that value. This allows different parts of the document to be read and modified independently.
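
A toy version of key-flattening is easy to write down: walk the document and emit one (column qualifier, value) pair per scalar, where the qualifier is the path to that scalar.  The sketch below is illustrative only; HDocDB's actual encoding (type tags, array handling, and so on) is more involved:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Flattens a nested, JSON-like Map into (qualifier, value) pairs.
public class KeyFlattener {
  public static Map<String, Object> flatten(Map<String, Object> doc) {
    Map<String, Object> columns = new LinkedHashMap<>();
    flatten("", doc, columns);
    return columns;
  }

  @SuppressWarnings("unchecked")
  private static void flatten(String prefix, Object node, Map<String, Object> out) {
    if (node instanceof Map) {
      for (Map.Entry<String, Object> e : ((Map<String, Object>) node).entrySet()) {
        flatten(prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey(), e.getValue(), out);
      }
    } else if (node instanceof List) {
      List<?> list = (List<?>) node;
      for (int i = 0; i < list.size(); i++) {
        flatten(prefix + "[" + i + "]", list.get(i), out);  // array order is preserved via indices
      }
    } else {
      out.put(prefix, node);  // scalar: the qualifier is its full JSON path
    }
  }
}

For a document such as {"name": {"first": "Ada", "last": "Lovelace"}, "tags": ["x", "y"]}, this produces the qualifiers name.first, name.last, tags[0], and tags[1], each of which can be read or updated independently.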

For HDocDB, I also added basic support for global secondary indexes. The implementation is based on Hofhansl and Yates. For more sophisticated indexing, one can integrate HDocDB with ElasticSearch or Solr.
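
One common shape for such an index (not necessarily the exact scheme used by HDocDB or described in those posts) is an extra index table whose row key is the indexed value followed by the document id, so that a prefix scan on the value finds all matching documents.  A rough sketch, with made-up table names and key layout:

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Writes a global secondary index entry alongside the main document write.
public class SecondaryIndexWrite {
  static void indexField(Connection conn, String docId, String field, String value)
      throws Exception {
    try (Table index = conn.getTable(TableName.valueOf("documents_idx"))) {
      // Index row key: field=value, a zero-byte separator, then the document id.
      byte[] indexKey = Bytes.add(Bytes.toBytes(field + "=" + value + "\0"), Bytes.toBytes(docId));
      Put put = new Put(indexKey);
      put.addColumn(Bytes.toBytes("i"), Bytes.toBytes("doc"), Bytes.toBytes(docId));
      index.put(put);
    }
  }
}

Because the index row lives in a different table (and therefore a different row) than the document itself, the two writes are not atomic, which is exactly the kind of maintenance concern the indexing schemes referenced above are designed to address.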

Since OJAI is integrated with Jackson, it is also easy to store plain Java objects into HDocDB. This means that HDocDB can also be seen as an object layer on top of HBase. We can now say HBase supports the following models:

  • Key-value
  • Wide column
  • Document (HDocDB)
  • Graph (Titan, Zen, S2Graph)
  • SQL (Phoenix)
  • Object (HDocDB)

So not only can HBase be seen as a solid CP store (as shown in a previous blog), but it can also be seen as a solid multi-model store.


Call Me Maybe: HBase addendum

The following post originally appeared in the Yammer Engineering blog on September 10, 2014.

In a previous blog, I demonstrated some good results for HBase using an automated test framework called Jepsen. In fact, they may have seemed too good. HBase is designed for strong consistency, yet also seemed to exhibit extraordinary availability during a network partition. How was this possible? Apparently, HBase clients will retry operations when they fail. This behavior can be seen in the sample run below:

[Chart: request latencies during and after a network partition]

During the network partition, no requests are successful; after the partition is healed, requests are able to succeed, and the request latencies slowly decrease.

Here is a longer network partition showing much greater latencies:

[Chart: request latencies during a longer network partition]

In fact, if the network partition is long enough, the HBase client will start to report timeouts:

Timed out waiting for some tasks to complete!

In such cases, not all requests will be successfully processed. Here is a typical result:

0 unrecoverable timeouts
Collecting results.
Writes completed in 500.038 seconds

5000 total
4974 acknowledged
4974 survivors
all 4974 acked writes out of 5000 succeeded. :-)

An even longer network partition can lead the Jepsen test to run out of memory:

java.lang.OutOfMemoryError: unable to create new native thread

So HBase exhibits behavior typical of a CP system, and when facing a network partition, cannot achieve both consistency and availability. Sorry if I gave that impression.


Call Me Maybe: HBase

The following post originally appeared in the Yammer Engineering blog on September 10, 2014.

The Yammer architecture has evolved from a monolithic Rails application to a micro-service based architecture. Our services share many commonalities, including the use of Dropwizard and Metrics. However, as with many micro-service based architectures, each micro-service may have very different needs in terms of persistence. This has led to the adoption of polyglot persistence here at Yammer.

As such, engineers at Yammer are very interested in understanding and evaluating the differences between various databases. One particularly enlightening exposition has been the Call Me Maybe series by Aphyr. This series demonstrates how various databases behave in the presence of network partitions caused by an automated test framework called Jepsen(*). Some of the databases it covers are Postgres, Redis, MongoDB, Riak, Cassandra, and FoundationDB. I have often wondered how HBase would behave under Jepsen. Let’s try augmenting Jepsen to find out.

Jepsen actually consists of two parts. The first part is a provisioning framework written using salticid. This provisions a five-node Ubuntu cluster, where the nodes are referred to as n1, n2, n3, n4, and n5. It can then install and set up the desired database. The second part is a set of runtime tests written using Clojure. (Aphyr also has a great tutorial on Clojure called Clojure from the ground up.)

As I don’t have much experience with HBase, I decided to forego the salticid customizations and simply use Cloudera Manager to set up a five-node HBase cluster on EC2. I used CDH 5.1.2, which bundles hbase-0.98.1+cdh5.1.2+70. I then modified the /etc/hosts file on each of the nodes to add entries for n1 through n5 (with n1 being the ZooKeeper master). With that step done, it was time to write the actual tests, which are available here.

The first test I wrote, hbase-app, is based on one of the Postgres tests. It simply adds a single row for each number. While it is running, Jepsen uses iptables to simulate a network partition within the cluster. Let’s see how it behaves.

lein run hbase -n 2000
...
0 unrecoverable timeouts
Collecting results.
Writes completed in 200.047 seconds

2000 total
2000 acknowledged
2000 survivors
All 2000 writes succeeded. :-D

All 2000 writes succeeded and there was no data loss. So far, so good.

The second test I wrote, hbase-append-app, is based on one of the FoundationDB tests. It repeatedly writes to the same cell by attempting to append to a list stored as a blob while a network partition occurs. The test uses the checkAndPut call, which allows for atomic read-modify-write operations.

lein run hbase-append -n 2000
...
0 unrecoverable timeouts
Collecting results.
Writes completed in 200.05 seconds

2000 total
282 acknowledged
282 survivors
all 282 acked writes out of 2000 succeeded. :-)

This time not all writes succeeded, since all clients are attempting to write to the same cell, and the chance is high that another client will write to the cell between the read and write of a given client’s read-modify-write operation. However, no data loss occurred, as all 282 successful writes are apparent in the final result.
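
In terms of the HBase Java client API, each attempted append in this test boils down to a read-modify-write guarded by checkAndPut.  A sketch (the actual test is written in Clojure, a 1.x-style client API is assumed, and the table, family, and qualifier names here are made up):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Read the current blob, append an element, and write it back only if the cell
// has not changed in the meantime.
public class AppendViaCheckAndPut {
  static boolean tryAppend(Connection conn, byte[] row, String element) throws Exception {
    byte[] cf = Bytes.toBytes("f");
    byte[] qual = Bytes.toBytes("list");
    try (Table table = conn.getTable(TableName.valueOf("jepsen"))) {
      Result current = table.get(new Get(row).addColumn(cf, qual));
      byte[] oldValue = current.getValue(cf, qual);        // null if the cell does not exist yet
      String oldList = oldValue == null ? "" : Bytes.toString(oldValue);
      String newList = oldList.isEmpty() ? element : oldList + "," + element;
      Put put = new Put(row).addColumn(cf, qual, Bytes.toBytes(newList));
      // Atomic compare-and-set: succeeds only if the cell still equals oldValue.
      return table.checkAndPut(row, cf, qual, oldValue, put);
    }
  }
}

A return value of false is the "another client got there first" case described above, which is why only 282 of the 2000 attempted writes were acknowledged.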

The third test I wrote, hbase-isolation-app, is based on one of the Cassandra tests. It modifies two cells in the same row while a network partition occurs, to test if row updates are atomic.

lein run hbase-isolation -n 2000
...
0 unrecoverable timeouts
Collecting results.
()
Writes completed in 200.043 seconds

2000 total
2000 acknowledged
2000 survivors
All 2000 writes succeeded. :-D

Nice. Row updates are atomic as HBase claims. However, the above test modifies two cells in the same column family. Let’s try modifying two cells in different column families, but in the same row. From what I understand of HBase, cells in different column families are stored in different HBase “stores”. I ran a fourth test, hbase-isolation-multiple-cf-app, to see if it made a difference.

lein run hbase-isolation-multiple-cf -n 2000
...
0 unrecoverable timeouts
Collecting results.
()
Writes completed in 200.036 seconds

2000 total
2000 acknowledged
2000 survivors
All 2000 writes succeeded. :-D

Move along, nothing to see here.

Finally, HBase claims to have atomic counters. In Dynamo-based systems such as Cassandra and Riak, a counter needs to be a CRDT to behave properly. Let’s see how HBase counters behave. I used a fifth test, hbase-counter-app, that is also based on one of the Cassandra tests.

lein run hbase-counter -n 2000
...
0 unrecoverable timeouts
Collecting results.
Writes completed in 200.045 seconds

2000 total
2000 acknowledged
2000 survivors
All 2000 writes succeeded. :-D

No counter increments were lost.
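
Each write in this test presumably corresponds to an atomic, server-side increment of a single counter cell.  In the Java client API that looks roughly like the following (names are illustrative):

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Atomically increments a counter cell on the server side and returns the new value.
public class CounterIncrement {
  static long bump(Connection conn) throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("jepsen"))) {
      return table.incrementColumnValue(Bytes.toBytes("counter"),
          Bytes.toBytes("f"), Bytes.toBytes("value"), 1L);
    }
  }
}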

HBase performed well in all five of the above tests. The claims in its documentation concerning atomic row updates and counter operations held up. I’m looking forward to learning more about HBase, which so far appears to be a very solid database.

Update: see the addendum here.

(*) Aphyr is now working on Jepsen 2. This blog refers to the original Jepsen.
