Thanks to Kent Ross for finding several of the details which I list here.

I think hash maps are a really interesting data structure. They’re probably the second most widely used data structure in modern programming languages, behind dynamic arrays. But they involve far more complicated design decisions than dynamic arrays do, so they’re probably the data structure where we can learn the most by looking at how they’re implemented in practice.

(I don’t mean to imply dynamic arrays don’t have any interesting subtleties. To start with, there’s a way to modify dynamic arrays so that their \(O(1)\) time to append is worst case instead of amortized. There’s also a fun way to make them so that they only waste \(O(\sqrt{n})\) space instead of \(O(n)\). This paper is about the latter; it explains the former in passing.)

In this post, I’ll briefly describe a few of the key ways that hash map implementations vary, and then I’ll do a review of the hash map implementations in Ruby, Python, Go, Rust, Java, C++, PHP, and C#, as well as some other influential implementations such as those in Google’s SparseHash package.

Key concepts in hash map design

Hash maps differ in their collision resolution strategies. The simplest strategy to implement is probably chaining. There’s a decent explanation of chaining here. When you’re implementing a hash map with chaining, you need to choose a maximum load factor: the average number of elements per bucket at which you resize the hash map. Most languages which use chaining seem to use a load factor that’s a bit less than 1. Another fun detail here is that you can use data structures that aren’t linked lists for your buckets. For example, Java switches buckets to be balanced trees instead of linked lists if they have more than 8 elements, in an effort to make the worst case runtime logarithmic instead of linear.
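To make the chaining scheme concrete, here’s a bare-bones sketch in Python (not any particular language’s real implementation): each bucket is a plain list, and the table doubles once the average number of elements per bucket crosses a maximum load factor.

    class ChainedMap:
        """A minimal chained hash map; a sketch, not a production design."""
        MAX_LOAD = 0.75  # resize once size / number-of-buckets exceeds this

        def __init__(self, nbuckets=8):
            self.buckets = [[] for _ in range(nbuckets)]
            self.size = 0

        def put(self, key, value):
            bucket = self.buckets[hash(key) % len(self.buckets)]
            for i, (k, _) in enumerate(bucket):
                if k == key:
                    bucket[i] = (key, value)   # overwrite an existing key
                    return
            bucket.append((key, value))
            self.size += 1
            if self.size / len(self.buckets) > self.MAX_LOAD:
                self._grow()

        def get(self, key):
            for k, v in self.buckets[hash(key) % len(self.buckets)]:
                if k == key:
                    return v
            raise KeyError(key)

        def _grow(self):
            old = self.buckets
            self.buckets = [[] for _ in range(2 * len(old))]
            for bucket in old:
                for k, v in bucket:            # rehash everything into the new table
                    self.buckets[hash(k) % len(self.buckets)].append((k, v))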

A wide variety of open addressing strategies are also used in practice: linear probing, quadratic probing, double hashing, and others.
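As a baseline, here’s a minimal sketch (in Python) of the simplest of these, linear probing. There’s no deletion or resizing, and it assumes the table always has at least one empty slot.

    EMPTY = object()   # sentinel marking an unused slot

    def lp_put(table, key, value):
        i = hash(key) % len(table)
        while table[i] is not EMPTY and table[i][0] != key:
            i = (i + 1) % len(table)       # linear probing: just try the next slot
        table[i] = (key, value)

    def lp_get(table, key):
        i = hash(key) % len(table)
        while table[i] is not EMPTY:
            k, v = table[i]
            if k == key:
                return v
            i = (i + 1) % len(table)
        raise KeyError(key)                # hit an empty slot: the key isn't here

    table = [EMPTY] * 8
    lp_put(table, "a", 1)
    assert lp_get(table, "a") == 1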

One exciting and relatively recent development in hash map implementations is Robin Hood hashing. Here’s a quote from a great article about it:

The clever trick is just this: when you probe for a position to insert a new element, if the probe length for the existing element is less than the current probe length for the element being inserted, swap the two elements and keep going.

That way elements that were inserted early and thus “lucked out” on their probe lengths, will gradually be moved away from their preferred slot as new elements come in that could make better use of that place in the table (hence the name - the insertion “takes from the rich”, i.e. the elements with low probe counts). It leads to an “evening out” of the probe lengths.

Why is low variance better? Well, with modern cache architectures a probe count of 1 isn’t really much faster than a probe count of 3, because the main cost is fetching the cache line, so the ideal scenario is for all elements to have the same probe count, and having that probe count fitting within one cache line.

It turns out that Robin Hood hashing has the same expected probe count as normal open addressing (about 2.55) - it just has substantially less variance, and therefore ends up with far fewer cache misses. Yes, there are fewer elements that can early out after 1 probe, but there are also far fewer elements that end up needing to fetch multiple cache lines because of long probe lengths.

Currently, Rust is the only mainstream language which uses Robin Hood hashing in its default hash map implementation.
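Here’s a rough sketch (in Python) of the insertion rule quoted above, on top of linear probing. It handles new keys only and assumes the table never fills up completely; it’s meant to illustrate the idea, not mirror any particular implementation.

    EMPTY = object()

    def probe_length(table, slot, key):
        """How far the key sitting at `slot` is from its preferred slot."""
        return (slot - hash(key)) % len(table)

    def rh_insert(table, key, value):
        i = hash(key) % len(table)
        dist = 0                             # probe length of the element in hand
        while table[i] is not EMPTY:
            resident_key, resident_value = table[i]
            resident_dist = probe_length(table, i, resident_key)
            if resident_dist < dist:
                # The resident is "richer" (closer to home): steal its slot and
                # carry on inserting the evicted element instead.
                table[i] = (key, value)
                key, value, dist = resident_key, resident_value, resident_dist
            i = (i + 1) % len(table)
            dist += 1
        table[i] = (key, value)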

Hash maps in practice

Python

Python’s dictionary implementation uses an open addressing scheme. It resizes when the hash map is 2/3 full. You can read the source code here.

Unlike most programming languages, Python doesn’t try to use a hash function which appears random. To quote the source:

Major subtleties ahead:  Most hash schemes depend on having a "good" hash
function, in the sense of simulating randomness.  Python doesn't:  its most
important hash functions (for ints) are very regular in common
cases:

  >>> [hash(i) for i in range(4)]
  [0, 1, 2, 3]

This isn't necessarily bad!  To the contrary, in a table of size 2**i, taking
the low-order i bits as the initial table index is extremely fast, and there
are no collisions at all for dicts indexed by a contiguous range of ints. So
this gives better-than-random behavior in common cases, and that's very
desirable.

OTOH, when collisions occur, the tendency to fill contiguous slices of the
hash table makes a good collision resolution strategy crucial.  Taking only
the last i bits of the hash code is also vulnerable:  for example, consider
the list [i << 16 for i in range(20000)] as a set of keys.  Since ints are
their own hash codes, and this fits in a dict of size 2**15, the last 15 bits
of every hash code are all 0:  they *all* map to the same table index.

But catering to unusual cases should not slow the usual ones, so we just take
the last i bits anyway.  It's up to collision resolution to do the rest.  If
we *usually* find the key we're looking for on the first try (and, it turns
out, we usually do -- the table load factor is kept under 2/3, so the odds
are solidly in our favor), then it makes best sense to keep the initial index
computation dirt cheap.
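The collision resolution the comment alludes to works roughly like this: start at the low-order bits, then keep folding higher-order bits of the hash into the index via a `perturb` value. Here’s a sketch of the recurrence from the dictobject.c comments (the exact point at which the shift happens has moved around between CPython versions):

    PERTURB_SHIFT = 5

    def probe_sequence(h, mask):
        """Yield the slots a CPython-style dict examines for hash value h
        in a table of size mask + 1 (always a power of two)."""
        perturb = h                # the hash, taken as non-negative (CPython treats it as unsigned)
        i = h & mask               # initial index: just the low-order bits
        while True:
            yield i
            perturb >>= PERTURB_SHIFT
            i = (i * 5 + perturb + 1) & mask

Once perturb has been shifted down to zero, the recurrence degenerates to i = (5*i + 1) mod 2**k, which on its own eventually visits every slot in the table.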

In Python 3.6, an additional layer of indirection was added, with the following reasoning:

The current memory layout for dictionaries is
unnecessarily inefficient.  It has a sparse table of
24-byte entries containing the hash value, key pointer,
and value pointer.

Instead, the 24-byte entries should be stored in a
dense table referenced by a sparse table of indices.

For example, the dictionary:

    d = {'timmy': 'red', 'barry': 'green', 'guido': 'blue'}

is currently stored as:

    entries = [['--', '--', '--'],
               [-8522787127447073495, 'barry', 'green'],
               ['--', '--', '--'],
               ['--', '--', '--'],
               ['--', '--', '--'],
               [-9092791511155847987, 'timmy', 'red'],
               ['--', '--', '--'],
               [-6480567542315338377, 'guido', 'blue']]

Instead, the data should be organized as follows:

    indices =  [None, 1, None, None, None, 0, None, 2]
    entries =  [[-9092791511155847987, 'timmy', 'red'],
                [-8522787127447073495, 'barry', 'green'],
                [-6480567542315338377, 'guido', 'blue']]

This significantly reduces memory usage. It also means that Python dictionaries are now ordered, which makes Hacker News (and me) unhappy.
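Lookups go through one extra hop, but the probing itself is unchanged: probe the small sparse indices table, and use whatever you find there to index the dense entries table. A simplified sketch (real CPython also compares the stored hash before doing a full key comparison, and perturbs the probe as described earlier):

    def compact_lookup(indices, entries, key):
        mask = len(indices) - 1            # index table size is a power of two
        i = hash(key) & mask
        while indices[i] is not None:      # None plays the role of an empty slot
            h, k, v = entries[indices[i]]  # hop through the dense entries table
            if k == key:
                return v
            i = (i + 1) & mask             # simplified probing; CPython perturbs too
        raise KeyError(key)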

Since version 3.3, Python dicts double in size when they resize; before 3.3, they quadrupled their capacity on each resize.

V8

Source here.

This uses open addressing and has a maximum load factor of 80%.

One interesting detail:

// To remove an entry we need to ensure that it does not create an empty
// entry that will cause the search for another entry to stop too soon. If all
// the entries between the entry to remove and the next empty slot have their
// initial position inside this interval, clearing the entry to remove will
// not break the search. If, while searching for the next empty entry, an
// entry is encountered which does not have its initial position between the
// entry to remove and the position looked at, then this entry can be moved to
// the place of the entry to remove without breaking the search for it. The
// entry made vacant by this move is now the entry to remove and the process
// starts over.
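In other words, instead of leaving tombstones behind, the table shifts later entries back into the hole so that every remaining entry stays reachable from its initial position. Here’s a sketch of that procedure for a linear-probing table (take it as the idea rather than V8’s code; it assumes the table always keeps at least one empty slot):

    def cyclically_between(i, r, j):
        """True if slot r lies in the cyclic interval (i, j]."""
        return (i < r <= j) if i < j else (r > i or r <= j)

    def remove_slot(table, home_of, i):
        """Vacate slot i without breaking searches. `home_of(entry)` returns
        the entry's initial (preferred) slot."""
        n = len(table)
        j = i
        while True:
            j = (j + 1) % n
            if table[j] is None:            # reached an empty slot: i can be cleared
                table[i] = None
                return
            r = home_of(table[j])
            if not cyclically_between(i, r, j):
                # A hole at i would cut off the search path from r to j, so move
                # this entry into i and repeat with its old slot as the new hole.
                table[i] = table[j]
                i = j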

Java

Java’s HashMap class uses chaining, but with a neat twist!

In Java 8, buckets use balanced trees instead of linked lists after a certain threshold is reached. This means HashMap starts by storing Entry objects in a linked list, but after the number of items in a bucket becomes larger than a certain threshold, the bucket switches from a linked list to a balanced tree, which improves the worst-case performance from O(n) to O(log n).

According to this page the threshold is 8 elements:

Well, this optimization is described in JEP-180. Basically when a bucket becomes too big (currently: TREEIFY_THRESHOLD = 8), HashMap dynamically replaces it with an ad-hoc implementation of tree map. This way rather than having pessimistic O(n) we get much better O(log n). How does it work? Well, previously entries with conflicting keys were simply appended to linked list, which later had to be traversed. Now HashMap promotes list into binary tree, using hash code as a branching variable. If two hashes are different but ended up in the same bucket, one is considered bigger and goes to the right. If hashes are equal (as in our case), HashMap hopes that the keys are Comparable, so that it can establish some order. This is not a requirement of HashMap keys, but apparently a good practice. If keys are not comparable, don’t expect any performance improvements in case of heavy hash collisions.

So if you implement Comparable properly, hash map retrieval is worst case \(O(\log n)\)! What an exciting world we live in!
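Roughly, a lookup inside a treeified bin is ordered by hash code, with key comparison only used to break ties. Here’s a sketch of that idea in Python, using a list kept sorted by hash code plus a binary search as a stand-in for Java’s actual red-black tree:

    import bisect

    def treeified_get(entries, hashes, h, key):
        """Look up `key` with hash code `h` in an oversized bucket.
        `entries` is a list of (hash, key, value) triples sorted by hash code,
        and `hashes` is the parallel sorted list of hash codes."""
        i = bisect.bisect_left(hashes, h)        # O(log n) jump to the first matching hash
        while i < len(entries) and entries[i][0] == h:
            if entries[i][1] == key:             # equal hashes still need an equality check
                return entries[i][2]
            i += 1
        return None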

The default load factor in Java hash maps is 0.75, and has been since at least Java version 5.

C++ STL

Chaining. Amusingly enough, this seems to be a direct result of a requirement in the C++ standard:

The Standard effectively mandates std::unordered_set and std::unordered_map implementations that use open hashing, which means an array of buckets, each of which holds the head of a logical (and typically actual) list. That requirement is subtle: it’s a consequence of the default max load factor being 1.0 and the guarantee that the table will not be rehashed unless grown beyond that load factor: that would be impractical without chaining, as the collisions with closed hashing become overwhelming as the load factor approaches 1

Linux hashtable

Linux has a fixed-size hashtable which uses chaining; it’s used internally in a bunch of places.

Google SparseHashMap and DenseHashMap

Quadratic probing. From the documentation:

This directory contains several hash-map implementations, similar in API to SGI’s hash_map class, but with different performance characteristics. sparse_hash_map uses very little space overhead, 1-2 bits per entry. dense_hash_map is very fast, particularly on lookup. (sparse_hash_set and dense_hash_set are the set versions of these routines.) On the other hand, these classes have requirements that may not make them appropriate for all applications.

All these implementations use a hashtable with internal quadratic probing. This method is space-efficient – there is no pointer overhead – and time-efficient for good hash functions.

The usage of these classes differ from SGI’s hash_map, and other hashtable implementations, in the following major ways:

1) dense_hash_map requires you to set aside one key value as the ‘empty bucket’ value, set via the set_empty_key() method. This MUST be called before you can use the dense_hash_map. It is illegal to insert any elements into a dense_hash_map whose key is equal to the empty-key.

2) For both dense_hash_map and sparse_hash_map, if you wish to delete elements from the hashtable, you must set aside a key value as the ‘deleted bucket’ value, set via the set_deleted_key() method. If your hash-map is insert-only, there is no need to call this method. If you call set_deleted_key(), it is illegal to insert any elements into a dense_hash_map or sparse_hash_map whose key is equal to the deleted-key.
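Concretely, the table stores only keys and values with no per-slot metadata, so emptiness and deletion have to be encoded as reserved key values. Here’s a sketch of a lookup along those lines; the triangular-number offsets are one common way to do quadratic probing in a power-of-two table, not necessarily the exact offsets SparseHash uses.

    EMPTY_KEY = "<empty>"      # reserved: the user must never insert this key
    DELETED_KEY = "<deleted>"  # reserved: marks a slot whose entry was erased

    def qp_find(keys, key):
        """Return the slot holding `key`, or None. Assumes len(keys) is a power
        of two and at least one slot still holds EMPTY_KEY."""
        mask = len(keys) - 1
        i = hash(key) & mask
        step = 0
        while keys[i] != EMPTY_KEY:
            if keys[i] == key:            # DELETED_KEY never equals a real key,
                return i                  # so deleted slots are simply probed past
            step += 1
            i = (i + step) & mask         # quadratic probing: offsets 1, 3, 6, 10, ...
        return None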

Ruby

According to the source, Ruby used chaining (with a maximum load factor of 5) until 2.4, when it followed Python’s lead and switched to a similar scheme with open addressing and two separate tables. As my colleague Kent points out, the Ruby hash table is basically the same as the Python one, except that Ruby misspells “perturb” as “perterb” for some reason, and shifts the hash code down 11 bits instead of 5 in each perturbation.

Rust

In an exciting change of pace, Rust uses “linear probing with Robin Hood bucket stealing.”

This had a super neat bug which led to a sequence of hash map insertions taking quadratic time.

To quote that Tumblr piece:

Surprisingly to me, the specific dynamics of Robin Hood hashing end up being relatively unimportant here; I believe that vanilla linear probing would exhibit similar behaviors. The key effect of Robin Hood hashing is just that it gives you confidence and/or hubris to push a table to 90% capacity, which greatly exacerbates the problem.

I’m glad that Rust is doing the trailblazing here. I think that long term, more languages should probably switch to some kind of Robin Hood hashing, and it’s nice that we’re working out these kinks now.

C#

C# uses open addressing with double hashing. That article also contains this hilarious tidbit:

In an overloaded form of the Hashtable’s constructor, you can specify a loadFactor value between 0.1 and 1.0. Realize, however, that whatever value you provide, it is scaled down 72%, so even if you pass in a value of 1.0 the Hashtable class’s actual loadFactor will be 0.72. The 0.72 was found by Microsoft to be the optimal load factor, so consider using the default 1.0 load factor value (which gets scaled automatically to 0.72). Therefore, you would be encouraged to use the default of 1.0 (which is really 0.72).

Also, that documentation consistently writes ‘rehasing’ when I am almost sure they mean ‘rehashing’. Apparently Ruby isn’t the only programming language which needs a copy editor.
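For reference, double hashing means the probe stride itself comes from a second hash of the key, so two keys that collide on their first slot usually diverge afterwards. A sketch of the probe sequence (the stride formula here is illustrative, not .NET’s exact one; it just has to be nonzero and relatively prime to the table size, which a prime-sized table guarantees):

    def double_hash_slots(h, table_size):
        """Yield the probe sequence for hash code h in a table of prime size."""
        stride = 1 + (h % (table_size - 1))   # "second hash": never zero, so we always advance
        i = h % table_size
        while True:
            yield i
            i = (i + stride) % table_size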

Golang

Golang uses chaining:

A map is just a hash table. The data is arranged into an array of buckets. Each bucket contains up to 8 key/value pairs. The low-order bits of the hash are used to select a bucket. Each bucket contains a few high-order bits of each hash to distinguish the entries within a single bucket.

If more than 8 keys hash to a bucket, we chain on extra buckets.

When the hashtable grows, we allocate a new array of buckets twice as big. Buckets are incrementally copied from the old bucket array to the new bucket array.

According to the same file, the table grows once the average number of entries per bucket exceeds 6.5.
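Here’s a sketch of a lookup in that layout, with Python standing in for Go’s runtime code and the field names purely illustrative: the low-order bits of the hash pick a bucket, and one byte of “top hash” per slot lets most non-matching slots be rejected without touching the key at all.

    BUCKET_SIZE = 8

    class Bucket:
        def __init__(self):
            self.tophash = []      # high byte of each entry's hash, parallel to entries
            self.entries = []      # up to 8 (key, value) pairs
            self.overflow = None   # chained bucket used once this one fills up

    def bucket_get(buckets, key):
        h = hash(key) & 0xFFFFFFFFFFFFFFFF      # treat the hash as an unsigned 64-bit value
        b = buckets[h & (len(buckets) - 1)]     # low-order bits select the bucket (power-of-two count)
        top = h >> 56                           # high-order byte distinguishes entries in a bucket
        while b is not None:
            for t, (k, v) in zip(b.tophash, b.entries):
                if t == top and k == key:       # cheap byte compare before the real key compare
                    return v
            b = b.overflow                      # more than 8 colliding keys: walk the overflow chain
        raise KeyError(key)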

PHP

PHP uses chaining, with the additional constraint that PHP’s semantics require hash tables to be ordered by default, which sparks controversy on HN.

PHP’s hash map implementation uses the identity function as the hash function for integers. This gives great performance when your hash keys are a contiguous sequence of integers. However, it also means that if you have a hash map with capacity 64 and you insert a bunch of numbers which all have the same remainder mod 64, they’ll all land in the same hash bucket, so each insert will take linear time. This blog post shows how inserting 65536 evil elements can take 30 seconds.
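A tiny sketch of why that attack works: with an identity hash and a power-of-two table, any set of keys sharing a remainder modulo the table size collides into a single bucket (the numbers here are illustrative, not PHP’s actual sizing or thresholds).

    table_size = 64
    keys = [i * table_size for i in range(1000)]        # distinct keys, all ≡ 0 (mod 64)
    buckets = [key & (table_size - 1) for key in keys]  # identity hash masked to the table
    assert all(b == 0 for b in buckets)                 # every key lands in bucket 0, so each
                                                        # insert rescans an ever-longer chain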

Other topics in hash maps

If you want to learn more about hash maps, here are some topics to look up: