Call Me Maybe: HBase addendum

The following post originally appeared in the Yammer Engineering blog on September 10, 2014.

In a previous blog, I demonstrated some good results for HBase using an automated test framework called Jepsen. In fact, they may have seemed too good. HBase is designed for strong consistency, yet also seemed to exhibit extraordinary availability during a network partition. How was this possible? Apparently, HBase clients will retry operations when they fail. This can be better seen during a sample run below:

Screen Shot 2014-09-10 at 3.01.10 PM

During the network partition, no requests are successful; after the partition is healed, requests are able to succeed, and the request latencies slowly decrease.

Here is a longer network partition showing much greater latencies:

Screen Shot 2014-09-10 at 2.54.25 PM

In fact, if the network partition is long enough, the HBase client will start to report timeouts:

Timed out waiting for some tasks to complete!

In such cases, not all requests will be successfully processed. Here is a typical result:

0 unrecoverable timeouts
Collecting results.
Writes completed in 500.038 seconds

5000 total
4974 acknowledged
4974 survivors
all 4974 acked writes out of 5000 succeeded. :-)

An even longer network partition can lead the Jepsen test to run out of memory:

java.lang.OutOfMemoryError: unable to create new native thread

So HBase exhibits behavior typical of a CP system, and when facing a network partition, cannot achieve both consistency and availability. Sorry if I gave that impression.

Call Me Maybe: HBase addendum

Leave a Reply