Benchmarking hash functions

It turns out that for integer hash tables implementations the identity function (f(x)=x) is preferred in most cases. This is a trade-off between speed and desirable properties such as uniform distribution that works well for integers. In essence the speed of attempted lookups is so fast that we can afford the increased number of collisions a naive hash function will give us.

For byte arrays such as strings this is not the case, and instead the opposite is true. Optimising the lookup algorithm is less important, and the properties of the hash function much more so.

To choose a good hash function we need to look at both speed, how quickly the hash is calculated, and how likely collisions are to occur in a given use case. We will start with looking at speed, or data throughput, of some of the most interesting potential candidates.

The benchmark

In this micro-benchmark we will simply calculate the hash of arrays of different sizes, measuring the data throughput. A slightly stripped version of what we are doing can be seen below.

for (len = begin; len < end; len += inc)
    {
      t1 = ntime();
      for (i = 0; i < n; i ++)
        h = farmhash(in, len);
      t2 = ntime();
      t = (float) (t2 - t1) / 1000000000;
    }

System

The tests are done using a Macbook Pro (2,6 GHz Intel Core i7) running Mac OS X 10.10.1, and compiled with clang v3.6.0.

Candidates

  • std::hash – C++11 standard library hash function
  • MurmurHash3 – Well known and used hash function
  • SpookyV2 – Alternative to Murmurhash3
  • Google Cityhash – Google attempt to improve on Murmurhash3
  • Google Farmhash  – Recent Google successor to Cityhash
  • cfarmhash – C port of Google Farmhash Hash64

Results

“rate” is measured in MB/s (peak throughput being ~10 000 MB/s, or ~10 GB/s) and “size” in bytes.

hash-function-benchmark

 Observations

  • Throughput depends strongly on input size
  • Cityhash, Farmhash and the standard C++ library hash function perform similarly
  • SpookyV2 and Murmurhash3 perform similarly
  • As a side not all hash functions were compiled and linked into the benchmark binary, since Cityhash and Farmhash does not export a LTO enable library (linking against the exported library would result in a significant performance loss for both Cityhash and Farmhash)

Again, this is pure data throughput. How useful the hash function is in typical use case will be tested next.

Advertisements

2 thoughts on “Benchmarking hash functions”

  1. A few thoughts on your interesting post :

    – SpookyHash v2 should be about the same speed as Cityhash, and much faster than Murmurhash. So something went wrong. Reading the end comment, maybe it was compiled / linked differently from the faster versions ?

    – CityHash has 2 versions, one “portable” and one which uses special Intel instructions (called “crc”), which is faster but requires specific opcodes/intrinsics and is therefore not portable. Which one was used in your benchmark ? I suspect it’s the portable one, which would be good, but identify it clearly.

    – xxHash64 should also perform well in such comparison.

    – A fast hash is really easy to do. A fast hash with good properties, that’s another question. But hopefully, all the ones mentioned above are known to provide excellent distribution, so they are fair bet.
    std::hash, on the other hand, is implementation-dependent, and can be fairly poor, if not extremely poor, in some cases.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s