Implementation Details Are Coming to Kill Your Experiment

Keeping Your Sanity

When it comes to data science projects, reproducibility is always nice to have, not only for the sake of science, but also to preserve one’s sanity – revisited experiments should make sense rather than leave one baffled. Moreover, if one intends to share their project, they should want to make sure their peers would get the same result using the same code and dataset. Here, we are not talking about trivial aspects of versioning and data management. Nor are we talking about the usage of random state control parameters, which we hope all users of data science libraries are familiar with. What we want to talk about is the grey area of overlooked implementation details, and highlight a case where the gensim library’s accidental exposure of a Python implementation detail almost made us lose our minds.


Gaslighting and gensim

One of our projects involved natural language processing (NLP). To perform the task we used the gensim library (v. 3.1.0), a Python library which provides implementations of various algorithms useful for NLP. We ran our experiments and collected results. Some time later we ran the experiments once again, using the same dataset and conditions. Surprisingly, the results of the two runs did not match, despite our efforts to control the experiment.

For unrelated reasons, the code was moved to another machine with a different environment. And, to some relief but even greater confusion, suddenly the results were reproducible! The only difference between the two environments was in their versions of Python: the first machine had Python 3.5.2 installed, the other had version 3.6.3. All libraries, including gensim, were at the same version. A version jump from v. 3.5 to v. 3.6 doesn’t suggest substantial changes – certainly nothing on the scale of Python’s notorious v. 2 to v. 3 leap – but the devil is in the details.


Chasing at Shadows

The gensim library includes a class called Dictionary that maps words to integer identifiers, collects word frequencies, and converts textual documents to vectors. These kinds of operations are common in NLP tasks, where the frequency of specific words can indicate their importance and topics of discussion. For example, the Latent Dirichlet Allocation topic model works with textual documents represented as vectors in which each value is an integer indicating the frequency of a specific word (the ‘bag of words’ encoding). Mapping words to integer identifiers also serves a very practical purpose: it reduces the memory needed to store enormous text corpora. It’s a valuable feature, but the words, identifiers and frequencies gensim stores in a Dictionary are all linked together in a complicated way that means any change to one can affect the others.

We made some careful investigations and soon had isolated the problem to the Dictionary class. Then we examined the source code and found that the functionality is fairly straightforward, except for one curious use of the dict collection type, Python’s implementation of the hash table data structure that maps arbitrary keys to a corresponding value. Due to their design, hash tables don’t have a ‘natural’ order of elements as linear collections such as lists do, and various implementation choices can mean that the order of elements within a hash table may be unpredictable. Our word-integer mappings were unpredictable, so perhaps dict was to blame? But attempts to force the issue were still difficult; recreating Dictionary instances gave us the consistency we sought, but between executions the results were very different.

The ordering of dict entries should never be relied upon due to the nature of hash tables, but this does not explain the non-determinism of dict. Why was this arbitrary order not consistent between executions?


Descent into Madness

As mentioned above, the dict in Python is implemented using the hash table data structure – the keys of a dict are hashed and the hashes are mapped to buckets in a dynamic table. Displaying the contents of a dict loops over the buckets. When iterating over a dict, the buckets are traversed in their hash order (skipping empty buckets).
Let’s look at this in closer detail. Since the hash table has less buckets than hashes, an operation is needed to map a number of hashes to a single bucket. Python uses the operation hash & (number_buckets-1) to ensure all hashes are mapped to an allocated bucket. For the small dicts we are going to construct in this example Python will allocate 8 buckets, hence the number of the bucket selected will be hash & 7. Using Python 2.7:

$ python2.7>>> hash("cat")&7, hash("dog")&77, 5  >>> {"cat": 1, "dog": 2}{'dog': 2, 'cat': 1}

The key "dog", being in bucket 5, appears before the key "cat", which was placed in bucket 7. Hash randomisation is already introduced, but disabled by default in Python 2.7, so the hashes are consistent between executions:

$ python2.7>>> hash("cat")&7, hash("dog")&77, 5  

And therefore so is the order of iteration over dict.

Since Python 3.3, hash randomisation was turned on by default. So, in one run we may get something like this:

$ python3.5>>> hash("cat")&7, hash("dog")&72, 3  >>> {"cat": 1, "dog": 2}{'cat': 1, 'dog': 2}

And in another we may get this:

$ python3.5>>> hash("cat")&7, hash("dog")&73, 1  >>> {"cat": 1, "dog": 2}{'dog': 2, 'cat': 1}

Since v. 3.3, the only way to get consistency in Python’s hashing is to set the PYTHONHASHSEED environment variable. In v. 3.6, the iteration order of Python dict changed yet again as it was made consistent with the order that elements are inserted into the collection. Thus, while the hash randomisation is enabled by default in v. 3.6, its effect on the dict iteration is obscured by the way it maintains the order of insertion. This means that our keys can be hashed like so:

$ python3.6>>> hash("cat")&7, hash("dog")&73, 7  >>> {'dog': 2, 'cat': 1}{'dog': 2, 'cat': 1}

As we can see, "cat" is placed into the earlier bucket, but "dog" was inserted first so it is the first entry returned when iterating over the dict. We can swap the insertion order and we still get the order preserving behaviour regardless of the hash values:

$ python3.6>>> hash("cat")&7, hash("dog")&76, 2  >>> {"cat": 1, "dog": 2}{'cat': 1, 'dog': 2}

Attitudes towards dict‘s hashing behaviour have varied, and the result is that the behaviour of any software that is affected by the order of iteration of dict entries may change depending on the version of Python used to run the software. Of course, the order of entries in a dict should never be relied upon, but much Python code uses dict and you may be using it without even knowing. Which brings us back to gensim.


Revealing the Killer

As previously mentioned, gensim‘s Dictionary class uses dict to perform a mapping between words and integer identifiers. The order of iteration over a dict determines which words are assigned which integers, so as the order changes so do the identifiers each word is assigned. This is actually a case where the usage of dict is incidental – many approaches can be used to map words to integers, but the gensim implementers have chosen dict iteration to achieve this. As such, the Dictionary class has the essential property of being deterministic only when the iteration order of dict is consistent.

In one way, this is great! Everyone performing experiments using gensim‘s Dictionary and the latest version of Python has reproducible experiments, if their data transformations are deterministic. But the corollary is that many experiments are needlessly non-deterministic, and therefore not repeatable. Worse, this issue can manifest in confusing ways; our v. 3.5 experiments, for example, were producing different topic models because the initial vectors were different in each execution, causing the algorithm to converge in different ways. Imagine our surprise at starting at this curious result of that complex algorithm only to eventually arrive at the humble dict.


The End… Or Is It?

The principle of reproducibility is essential to science, and all algorithms are fundamentally deterministic – even the hash functions dict depends on. Given the subtlety of some implementation details, it’s understandable that developers may get caught out. But the issue in gensim‘s case is more profound: data science libraries need to ensure determinism if there is to be any confidence in the results they produce. To this end, we’ve submitted a PR that adds this property to gensim‘s Dictionary by removing dict from the process of generating word identifiers and replacing it with a sorting policy that leaves no room for unpredictability. We are happy to report that the gensim maintainers have been receptive to this change, and we are hopeful that we can soon start using Dictionary with confidence that our experiments will be consistent.

But it makes you think… how long before we encounter the next source of unintentional non-determinism in a data science library?

Header image courtesy of Dean Hochman.


This blogpost was written in collaboration with Shannon Pace

Thanks to James Gardner and Andrew Simmons for reviewing this post and providing suggestions.