Tips On Writing Custom HBase Filters

Two of the most useful and powerful features of HBase are its support for server-side filters and coprocessors.  For example, custom filters can be used for efficient pagination, while custom coprocessors can be used to provide endpoints to provide efficient aggregation of data in HBase.  In addition, more sophisticated filters and coprocessors can be used to turn HBase into an entirely different data store, such as a JSON document store (HDocDB), a relational database (Phoenix), or others.

While working with custom filters, I ran into a couple of issues that I didn’t find documented elsewhere (perhaps I missed them), so I thought I’d jot them down here to benefit others.

First, when writing a custom filter, the cells passed to the filterKeyValue method are a superset of the cells that will be returned to the client.  The main reason for this is that even though a column family may be specified to retain only one version of a cell, multiple versions of the cell may still exist in the store because a compaction has not yet taken place, and the pruning of versions in the query result doesn’t happen until after filterKeyValue is called.  This actually took me by surprise, as I didn’t find it documented anywhere, and my initial mental model assumed that the pruning of versions would happen before this method was called.  (Update:  This has since been filed as HBASE-17125.)

The second tip is in regard to the filterRowCells method.  This method gives you the list of cells that have passed previous filter methods, and allows you to modify it before it is passed to the next phase of the filter pipeline.   For example, here is how the DependentColumnFilter in HBase uses this method to filter out cells that don’t have a matching timestamp.

  @Override
  public void filterRowCells(List<Cell> kvs) {
    Iterator<? extends Cell> it = kvs.iterator();
    Cell kv;
    while(it.hasNext()) {
      kv = it.next();
      if(!stampSet.contains(kv.getTimestamp())) {
        it.remove();
      }
    }
  }

However, when implementing filterRowCells, the Iterator.remove method should not be used. This is because the underlying list of cells is passed as an ArrayList, and Iterator.remove is an O(n) operation for instances of ArrayList.   As more and more elements are removed from within filterRowCells, the time complexity of this operation will begin to approach O(n2).   Instead, the Guava method Iterables.removeIf should be preferred (or Collection.removeIf, if you are using Java 8).

  @Override
  public void filterRowCells(List<Cell> kvs) {
    Iterables.removeIf(kvs, new Predicate<Cell>() {
      @Override
      public boolean apply(Cell kv) {
        return !stampSet.contains(kv.getTimestamp());
      }
    });
  }

The Iterables.removeIf method will check to see if the Iterable passed to it is an instance of RandomAccess (which is true for ArrayList), and if so, will remove all elements that pass the specified Predicate in total O(n) time (by making use of ArrayList.set).

One of our queries using a custom filter was passing tens of thousands of cells to filterRowCells and filtering a majority of the cells out using Iterator.remove.  After changing the custom filter to use Iterables.removeIf, the query time dropped from 800 ms to 250 ms.

Since HBase already uses the Iterables class from Guava, I’ve submitted HBASE-16893 and PHOENIX-3393 to change the filters in the HBase and Phoenix codebases to use Iterables.removeIf instead of Iterator.remove.

Tips On Writing Custom HBase Filters