It turns out that for integer hash tables implementations the identity function (f(x)=x) is preferred in most cases. This is a trade-off between speed and desirable properties such as uniform distribution that works well for integers. In essence the speed of attempted lookups is so fast that we can afford the increased number of collisions a naive hash function will give us.

For byte arrays such as strings this is not the case, and instead the opposite is true. Optimising the lookup algorithm is less important, and the properties of the hash function much more so.

To choose a good hash function we need to look at both speed, how quickly the hash is calculated, and how likely collisions are to occur in a given use case. We will start with looking at speed, or data throughput, of some of the most interesting potential candidates.

# The benchmark

In this micro-benchmark we will simply calculate the hash of arrays of different sizes, measuring the data throughput. A slightly stripped version of what we are doing can be seen below.

for (len = begin; len < end; len += inc) { t1 = ntime(); for (i = 0; i < n; i ++) h = farmhash(in, len); t2 = ntime(); t = (float) (t2 - t1) / 1000000000; }

# System

The tests are done using a Macbook Pro (2,6 GHz Intel Core i7) running Mac OS X 10.10.1, and compiled with clang v3.6.0.

# Candidates

- std::hash – C++11 standard library hash function
- MurmurHash3 – Well known and used hash function
- SpookyV2 – Alternative to Murmurhash3
- Google Cityhash – Google attempt to improve on Murmurhash3
- Google Farmhash – Recent Google successor to Cityhash
- cfarmhash

# Results

“rate” is measured in MB/s (peak throughput being ~10 000 MB/s, or ~10 GB/s) and “size” in bytes.

# Observations

- Throughput depends strongly on input size
- Cityhash, Farmhash and the standard C++ library hash function perform similarly
- SpookyV2 and Murmurhash3 perform similarly
- As a side not all hash functions were compiled and linked into the benchmark binary, since Cityhash and Farmhash does not export a LTO enable library (linking against the exported library would result in a significant performance loss for both Cityhash and Farmhash)

Again, this is pure data throughput. How useful the hash function is in typical use case will be tested next.

Reblogged this on interface IReadable {.

A few thoughts on your interesting post :

– SpookyHash v2 should be about the same speed as Cityhash, and much faster than Murmurhash. So something went wrong. Reading the end comment, maybe it was compiled / linked differently from the faster versions ?

– CityHash has 2 versions, one “portable” and one which uses special Intel instructions (called “crc”), which is faster but requires specific opcodes/intrinsics and is therefore not portable. Which one was used in your benchmark ? I suspect it’s the portable one, which would be good, but identify it clearly.

– xxHash64 should also perform well in such comparison.

– A fast hash is really easy to do. A fast hash with good properties, that’s another question. But hopefully, all the ones mentioned above are known to provide excellent distribution, so they are fair bet.

std::hash, on the other hand, is implementation-dependent, and can be fairly poor, if not extremely poor, in some cases.