Wednesday, June 19, 2024

On the time complexity of functional collections

Clojure made functional collections popular.  Rich Hickey, its inventor, deserves a lot of credit for that.  However, he also propagated an inaccurate way of describing their time complexity on several common operations such as looking up a key in a map.  I don't know exactly what phrase he used at first, but I've seen people describe the time complexity of these operations as "near-constant" or "effectively constant", or sometimes shouting: "effectively constant".  He also seems to have originated the practice I see in the Clojure community of speaking as if the base of the logarithm mattered: "O(log32 n)".  (The "32" should be a subscript, but I don't see an affordance for subscripts in this Blogger UI.)

All of these locutions are wrong.  The only correct way to describe the time complexity of the operations in question is as "O(log n)" or "logarithmic time" ("log time" for short).  Time complexity describes how the time to perform the operation grows as the size of the input (in this case, the collection) grows without bound.  Because the Hash Array-Mapped Trie (HAMT) — the very clever data structure invented by Phil Bagwell — is a tree, the worst-case time to access an element in the tree must be proportional to the depth of the tree, which is proportional to the logarithm of the number of elements (provided that the tree is balanced, which it will be if the hash function is well distributed).  The base of the logarithm is the radix (branching factor) of the tree, which in Clojure's case is 32, but this has no bearing on its time complexity; as everyone knows, logarithms of different bases differ only by a constant factor, and big-O notation ignores constant factors.

I think part of what is going on here is a bit of confusion between the time complexity of an algorithm and its real-world performance.  Consider this sentence from Hickey's HOPL 2020 paper, A History of Clojure:

Performance was excellent, more akin to O(1) than the theoretical bounds of O(logN).

You don't find the time complexity of an algorithm by measurement, but by analyzing the algorithm.  While it's not 100% clear, this sentence certainly gives the impression that he didn't quite understand that.

Let me speculate a little.  The performance of a lookup on a map, implemented as an HAMT, using string keys, has two components: the time to hash the key, and the time to walk the HAMT, using the hash value, to find the map entry containing that key.  I'm going to guess that for the string keys that Rich tried in his testing, the tree-walking time was less than or comparable to the string-hashing time up to a depth of maybe 3 or 4, or maybe larger.  32^4 is 1,048,576, which might be larger than any map he tested; so it's entirely plausible that he just didn't test any collections large enough to see the logarithmic behavior emerge.

If that's right, it certainly speaks well for the performance of the HAMT design.  Let me acknowledge at this point that Rich also had a marketing problem to deal with: he had to convince potential Clojure users that its functional collections would not make their programs unusably slow.  O(1) or "near-constant" certainly sounds better than O(log n).  I can understand the temptation he faced.

But again: time complexity is about how the time grows as the size of the input grows without bound.  And clearly, in this case, there will be some point at which the tree-walking time will begin to be larger than the hashing time.  This will happen sooner for short keys than long ones, and soonest if the keys are integers hashed by the identity function (or maybe by folding a 64-bit integer into a 32-bit hash; probably one or two instructions).  But it will happen.

— That is, it will happen as long as the algorithm doesn't run out of hash bits.  Clojure uses a 32-bit hash; since each tree level consumes 5 bits, that gives it 6.4 levels.  As the tree starts to fill up, the number of collisions will begin to become significant.  I'm not sure what Clojure does with collisions.  Bagwell suggested rehashing to obtain more bits, but I don't know that Clojure does that; it might just do linear search over collision buckets.  In the latter case, the time complexity would actually be linear (O(n)) rather than logarithmic; the linear behavior won't begin to emerge until the map has billions of entries, but again, time complexity isn't about that.

The other point worth making here is that while time complexity is an important fact about the performance of an algorithm, it is not the only important fact.  The amount of time it takes on small instances can also matter; depending on the use case, it can be more important than the time complexity.  There are algorithms in the CS literature (called "galactic algorithms"; TIL!) which have state-of-the-art time complexity, but are not used in practice because their constant factors are too large (I guess in practice this means they have complicated initializations to perform before getting to the meat of the computation).

None of this is intended as a criticism of Hickey's choice of HAMTs for Clojure.  The only reason FSet doesn't use HAMTs is that I wasn't aware of their existence when I was writing it.  Probably I will rectify this at some point, though that's not a trivial thing to do because the change can't be perfectly compatible; FSet's trees are comparison-based, while HAMTs are hash-based, requiring a change to how user-defined classes are interfaced to the library.  Still, I expect HAMTs would be substantially faster in many applications.

No comments:

Post a Comment