diff --git a/src/cuckoocache.h b/src/cuckoocache.h --- a/src/cuckoocache.h +++ b/src/cuckoocache.h @@ -210,6 +210,41 @@ * compute_hashes is convenience for not having to write out this expression * everywhere we use the hash values of an Element. * + * We need to map the 32-bit input hash onto a hash bucket in a range [0, + * size) in a manner which preserves as much of the hash's uniformity as + * possible. Ideally this would be done by bitmasking but the size is + * usually not a power of two. + * + * The naive approach would be to use a mod -- which isn't perfectly uniform + * but so long as the hash is much larger than size it is not that bad. + * Unfortunately, mod/division is fairly slow on ordinary microprocessors + * (e.g. 90-ish cycles on haswell, ARM doesn't even have an instruction for + * it.); when the divisor is a constant the compiler will do clever tricks + * to turn it into a multiply+add+shift, but size is a run-time value so the + * compiler can't do that here. + * + * One option would be to implement the same trick the compiler uses and + * compute the constants for exact division based on the size, as described + * in "{N}-bit Unsigned Division via {N}-bit Multiply-Add" by Arch D. + * Robison in 2005. But that code is somewhat complicated and the result is + * still slower than other options: + * + * Instead we treat the 32-bit random number as a Q32 fixed-point number in + * the range [0,1) and simply multiply it by the size. Then we just shift + * the result down by 32-bits to get our bucket number. The results has + * non-uniformity the same as a mod, but it is much faster to compute. More + * about this technique can be found at + * http://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/ + * + * The resulting non-uniformity is also more equally distributed which would + * be advantageous for something like linear probing, though it shouldn't + * matter one way or the other for a cuckoo table. + * + * The primary disadvantage of this approach is increased intermediate + * precision is required but for a 32-bit random number we only need the + * high 32 bits of a 32*32->64 multiply, which means the operation is + * reasonably fast even on a typical 32-bit processor. + * * @param e the element whose hashes will be returned * @returns std::array of deterministic hashes derived from e */