The following post originally appeared in the Yammer Engineering blog on September 10, 2014.
In a previous blog post, I demonstrated some good results for HBase using Jepsen, an automated distributed-systems test framework. In fact, they may have seemed too good: HBase is designed for strong consistency, yet it also seemed to exhibit extraordinary availability during a network partition. How was this possible? As it turns out, HBase clients retry operations when they fail. This behavior can be seen in the sample run below:
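The retry behavior can be sketched roughly as follows. This is not the actual HBase client code; the function and operation names are hypothetical, and the real client uses an escalating pause between attempts rather than the fixed pause shown here. The point is only that a partition which heals before retries are exhausted shows up to the caller as increased latency, not as an error:

```python
import time

def retry_until_success(op, pause=0.001, max_retries=10):
    """Retry op() until it succeeds, sleeping a fixed pause between
    attempts (a simplification of the client's escalating backoff)."""
    for attempt in range(max_retries):
        try:
            return op()
        except ConnectionError:
            time.sleep(pause)
    raise TimeoutError("exhausted retries")

# Simulate a partition that heals after three failed attempts: the
# caller never sees the failures, only a slower successful request.
attempts = {"n": 0}
def flaky_put():
    attempts["n"] += 1
    if attempts["n"] <= 3:
        raise ConnectionError("region server unreachable")
    return "ok"

result = retry_until_success(flaky_put)  # succeeds on the 4th attempt
```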
During the network partition, no requests succeed; after the partition heals, requests begin to succeed again, and request latencies slowly decrease.
Here is a longer network partition showing much greater latencies:
In fact, if the network partition is long enough, the HBase client will start to report timeouts:
Timed out waiting for some tasks to complete!
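The message above suggests that retries are bounded by an overall operation deadline. A minimal sketch of that behavior, again with hypothetical names rather than the client's real API, looks like this: if the partition outlasts the deadline, the operation is reported as timed out instead of being retried forever.

```python
import time

def retry_with_deadline(op, deadline_s, pause=0.001):
    """Retry op() until it succeeds or the total elapsed time
    exceeds deadline_s, at which point a timeout is raised."""
    start = time.monotonic()
    while True:
        try:
            return op()
        except ConnectionError:
            if time.monotonic() - start >= deadline_s:
                raise TimeoutError("Timed out waiting for some tasks to complete!")
            time.sleep(pause)

def always_partitioned():
    # Simulate a partition that never heals within the deadline.
    raise ConnectionError("region server unreachable")

try:
    retry_with_deadline(always_partitioned, deadline_s=0.01)
    outcome = "ok"
except TimeoutError:
    outcome = "timeout"
```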
In such cases, not all requests will be successfully processed. Here is a typical result:
0 unrecoverable timeouts
Collecting results.
Writes completed in 500.038 seconds
5000 total
4974 acknowledged
4974 survivors
all 4974 acked writes out of 5000 succeeded. :-)
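The check Jepsen applies to a result like this can be sketched as follows (the write IDs below are illustrative; only the counts come from the run above). The key distinction is that a write which was acknowledged but then lost is a consistency violation, while a write that was never acknowledged is merely an availability cost:

```python
def check_writes(total, acked, survivors):
    """Classify writes: 'lost' (acked but not durable) violates
    consistency; 'unacked' (never acknowledged) does not."""
    lost = acked - survivors
    unacked = total - acked
    return {"lost": lost, "unacked": unacked, "consistent": not lost}

# Counts from the run above: 5000 attempted, 4974 acknowledged,
# and all 4974 acknowledged writes survived.
total = set(range(5000))
acked = set(range(4974))
survivors = set(range(4974))
report = check_writes(total, acked, survivors)
```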
An even longer network partition can exhaust the test harness itself: with every client blocked in retries, Jepsen eventually cannot create new threads and runs out of memory:
java.lang.OutOfMemoryError: unable to create new native thread
So HBase exhibits behavior typical of a CP system: when facing a network partition, it preserves consistency but cannot also remain available. Sorry if the earlier results gave the opposite impression.