Thursday, October 17, 2024

Comparison: FSet vs. Sycamore

[BULLETIN: Quicklisp now has the latest version of FSet.]

Sycamore, primarily by Neil Dantam, is a functional collections library that is built around the same weight-balanced binary tree data structure (with leaf vectors) that FSet uses.  While the README on that page comments briefly on the differences between Sycamore and FSet, I don't feel that it does FSet justice.  Here is my analysis.

Dantam claims that his library is 30% to 50% faster than FSet on common operations.  While I haven't done comprehensive micro-benchmarking, a couple of quick tests indicates that this claim is plausible.  A look through the internals of the implementation confirms that it is clean and tight, and I must commend him.  There may be some techniques in here that I could usefully borrow.

Most of the performance difference is necessitated by two design choices that were made differently in the two libraries.  One of these Dantam mentions in his comparison: FSet's use of a single, global ordering relation implemented as a CLOS generic function, vs. Sycamore's more standard choice of requiring a comparison function to be supplied when a collection is created.  The other one he doesn't mention: the fact that FSet supports a notion of equivalent-but-unequal values, which are values that are incomparable — there's no way, or at least no obvious way, to say which is less than the other, and yet we want to treat them as unequal.  The simplest example is the integer 1 and the single-float 1.0, which have equal numerical values (and cl:= returns true on them), but which are nonetheless not eql.  (I have a previous blog post that goes into a lot more detail about equality and comparison.)  Since Sycamore expects the user-supplied comparison function to return an integer that is negative, zero, or positive to indicate the ordering of its arguments, there's no encoding for the equivalent-but-unequal case, nor is there any of the code that would be required to handle that case.

Both of these decisions were driven by my goal for the FSet project.  I didn't just want to provide a functional collections library that could be called occasionally when one had a specific need for such a data structure.  My ambition was much grander: to make functional collections into a reasonable default choice for the vast majority of programming situations.  I wanted FSet users (including, of course, myself) to be able to use functional collections freely, with very little extra effort or thought.  While Lisp by itself reaches a little bit in this direction — lists can certainly be used functionally — lists used as functional collections run into severe time complexity problems as those collections get large.  I wanted the FSet collections to be as convenient and well-supported as lists, but without the time complexity issues.

— Or rather, I wanted them to be even more convenient than lists.  Before writing FSet, I had spent years working in a little-known proprietary language called Refine, which happened to be implemented on top of Common Lisp, so it was not unusual to switch between the two languages.  And I had noticed something.  In contrast to CL, with its several different predefined equality predicates and with its functions that take :test arguments to specify which one to use, Refine has a single notiion of equality.  The value space is cleanly divided between immutable types, which are compared by value — along with numbers, these include strings, sets, maps, and seqs — and mutable objects, which are always compared by identity.  And it worked!  I found I did not miss the ability to specify an equality predicate when performing an operation such as "union".  It was just never needed.  Get equality right at the language level, and the problem goes away.

Although FSet's compare generic function isn't just for equality — it also defines an ordering that is used by the binary trees — I thought it would probably turn out to be the case that a single global ordering, implemented as a generic function and therefore extensible, would be fine the vast majority of the time.  I think experience has borne this out.  And just as you can mix types in Lisp lists — say, numbers and symbols — without further thought, so you can have any combination of types in an FSet set, effortlessly.  (A project I'm currently working on actually takes considerable advantage of this capability.)

As for supporting equivalent-but-unequal values, this desideratum flows directly from the principle of least astonishment.  While it might not be too surprising for a set or map implementation to fail distinguish the integer 1 from the float 1.0, it certainly would be very surprising, and almost certainly a source of bugs in a compiler that used it, for it to fail to distinguish two uninterned symbols with the same name.  (I saw a macro expansion recently that contained two distinct symbols that both printed as #:NEW.  It happens.)  A compiler using Sycamore for a map on symbols would have to supply a comparison function that accounted for this; it couldn't just compare the package name and symbol name.  (You'd have to do something like keep a weak hash table mapping symbols to integers, assigned in the order in which the comparison function encountered them.  It's doable, but FSet protects you from this madness.)

Along with those deep semantic design choices, I've spent a lot of time on developing a wide and featureful API for FSet (an effort that's ongoing).  FSet has many features that Sycamore lacks, including:

  • seqs, a binary-tree sequence implementation that holds arbitrary Lisp objects (Sycamore ropes hold only characters, which is certainly an important special case, but why restrict ourselves?)
  • default values for maps and seqs (the value to return when the key is outside the domain is associated with the collection, not supplied at the call site; this turns out to be a significant convenience)
  • generic functions that operate on both lists and FSet collections, to shadow the CL builtins
  • the powerful map-union and map-intersection operations (I'll blog about these in the future)
  • more ways to iterate over the collections (the FSet tutorial has a good summary, about 3/4 of the way down)
  • speaking of the tutorial, FSet has lots more documentation

Let me digress slightly to give an example of how FSet makes programming more elegant and convenient.  Joe Marshall just put up a blog post comparing Go(lang) with Common Lisp, which is worth a read on its own; I'm just going to grab a code snippet from there to show a little bit of what programming with FSet is like.  Here's Joe's code:

 (defun collate (items &key (key #'identity) (test #'eql) (merger (merge-adjoin #'eql)) (default nil))
   (let ((table (make-hash-table :test test)))
     (dolist (item items table)
       (let ((k (funcall key item)))
         (setf (gethash k table) (funcall merger (gethash k table default) item))))))

 (defun merge-adjoin (test)
   (lambda (collection item)
     (adjoin item collection :test test)))

And here's what I would write using FSet:

 (defun collate (items &key (key #'identity))
   (let ((result (map :default (set))))
     (dolist (item items result)
       (includef (@ result (funcall key item)) item))))

(Well, I would probably move result outside the dolist form to make it clearer what the return value is, but let's go with Joe's stylistic choice here.)

For those who haven't used FSet: the form (map :default (set)) creates a map whose default is the empty set, meaning that lookups on that map will return the empty set if the key is not in the map.  This saves the includef form from having to handle that possibility.

My version makes assumptions, it's true, about how you want to collect the items with a given key; it doesn't give you other choices.  It could, but what would be the point?  It's already using a general set with better time complexity than lists, and saving you from having to write anything like merge-adjoin.  The extensible global equivalence relation means you're not going to need to supply a :test either.

I think the FSet-enhanced code is cleaner, more elegant, and therefore clearer than the plain-CL version.  Don't you agree?  Maybe you wouldn't say it's a huge improvement, okay, but it's a small example; in a larger codebase, I would argue, these small improvements add up.

* * * * *

To summarize: if you just want a library you can call in a few places for specific purposes, Sycamore might work better for you (but think hard if you're writing a comparator for symbols).  FSet can certainly be used that way, but it can be much more.  If you want to see one way in which Common Lisp can be made into a better language, without giving up anything that we love about it, I urge you to give FSet a try.

FSet has changed the way I write Lisp programs.  — an FSet user

(UPDATE: the magnitude of the performance difference between FSet and Sycamore surprised me, and inspired me to do some profiling of FSet.  It turned out that I could get a 20% speedup on one micro-benchmark simply by adding some inline declarations.  Mea culpa, mea culpa, mea maxima culpa; I should have done this years ago.   With that change, the generic function overhead appears to be the only significant cause of the remaining ~20% performance difference.  I tried creating a Sycamore set using a thin wrapper around fset:compare, and the resulting performance was very similar to that of FSet with its new inlines.)

Wednesday, September 4, 2024

Equality and Comparison in FSet

This post is somewhat prompted by a recent blog post by vindarel, about Common Lisp's various built-in equality predicates.  It is aleo related to Marco Antoniotti's CDR 8, Generic Equality and Comparison for Common Lisp, implemented by Charles Zhang; Alex Gutev's GENERIC-CL; and Henry Baker's well-known 1992 paper on equality.

Let me start by summarizing those designs.  CDR 8 proposes a generic equaity function equals, and a comparison function compare.  These are both CLOS generic functions intended to be user-extended, though they also have some predefined methods.  equals has several keyword parameters controlling its exact behavior.  One of these is case-sensitive, which controls string comparison.  Another is recursive, which controls its behavior on conses; if recursive is false (the default), conses are compared by eq, but if it's true, a tree comparison is done.  compare is specified to return one of the symbols <, >, =, or /= to indicate the relative order of its arguments; it also has keyword parameters such as case-sensitive and recursive

GENERIC-CL replaces many CL operations with CLOS generic functions, and also adds new ones.  It touches many parts of the language other than equality and comparison, but I'll leave those aside for now.  It has two generic equality functions: equalp, which, notwithstanding the name, is case-sensitive for characters and strings, and likep, which is case-insensitive.  It also has comparison predicates lessp etc., along with a compare function (implemented using lessp) that can return :less, :equal, or :greater.

Henry's paper makes some interesting arguments about how a Common Lisp equality predicate should behave; he makes these concrete by defining a novel predicate egal.  His most salient point, for my purposes, is that mutable objects, including vectors and conses, should always be compared with eq.  I will argue below that FSet adheres to the spirit of this desideratum even though not to its letter.

FSet advertises itself as a "set-theoretic" collections library, and as such, requires a well-defined notion of equality.  Also, since it is implemented using balanced binary trees, it requires an ordering function.  FSet defines a generic function compare with these properties:

  • It returns one of the symbols :less, :equal, :greater, or :unequal (:unequal is used in certain rare cases of values which are not equal but cannot be consistently ordered)
  • It implements a strict weak ordering, with an additional constraint: along with incomparability (indicated by either :equal or :unequal) being transitive, equality is also transitive by itself
  • It can compare any two Lisp objects; this is an element of FSet's design philosophy
  • Being a generic function, it is of course user-extensible

FSet's equality predicate is equal?, which simply calls compare and checks that the result is :equal.  Thus, the only step required to add a user-defined type to the FSet universe is to define a compare method for it.  FSet provides a few utilities to help with this, which I'll go into below.

The cases in which compare returns :unequal to indicate unequal-but-incomparable arguments include:

  • Numbers of equal value but different types; that is, = would return true on them, but eql would return false.  Example: the integer 1 and the float 1.0.
  • Distinct uninterned symbols (symbols whose symbol-package is nil) whose symbol-names are equal (by string=).
  • Objects of a type for which no specific compare method has been defined, and which are distinct according to eql.
  • If you create a package, rename it, then create a new package reusing the original name of the first package, the two packages compare :unequal.   (FSet holds on to the original name, to protect itself from the effects of rename-package, which could otherwise be distastrous.)  Also, two symbols with the same name, one from the old package and one from the new, also compare :unequal.
  • Aggregates which are being compared component-wise, in the case where none of the component-wise comparisons returns :less or :greater, and at least one of them returns :unequal.

If compare's default method is called with objects of different classes, it returns a result based solely on the classes; the contents of the objects are not examined.  Again, it is part of FSet's design philosophy to give you as much freedom as reasonably possible; this includes allowing you to have sets containing more than one kind of object.

(In general, FSet's built-in ordering has been chosen for performance, not for its likely usefulness to clients.  For example, compare on two strings of different lengths orders the shorter one first, ignoring the contents, because this requires only an O(1) operation.)

Comparison with equal

FSet's equal? on built-in CL types behaves almost identically to CL's equal, with the one difference that on vectors (other than bit-vectors), equal just calls eq, but equal? compares the contents.  (I just noticed that this is not true for multidimensional arrays, and have filed an FSet bug.)  (On bit-vectors, they both compare the contents.)

Comparison with CDR 8

There are noticeable similarities between FSet and the CDR 8 proposal; the latter not only includes a comparison function, but even provides for it to return /=, corresponding to FSet's :unequal, to indicate unequal but incomparable arguments.  But the idea that the behavior of equality and comparison could be modified via keyword parameters does not seem appropriate for FSet.  I think it would make FSet quite a bit harder to use, for little gain.  For example, FSet comparison on lists walks the lists, but CDR 8, by default, just calls eq on their heads; users would have to remember to pass :recursive t to get the behavior they probably expect.  FSet collections would have to remember which options they were created with, and if you tried, say, to take the union of two sets which used different options, you'd get an error.

Years of programming experience — not only with FSet but also with Refine, the little-known proprietary language that inspired FSet — have left me with the clear impression that having a single global equality predicate is a great simplification and very rarely limiting, provided it was defined properly to begin with.

I also note that FSet has more predefined methods for its comparison function (and therefore for its equality predicate) than are proposed in CDR 8.  In particular, CDR 8's default compare methods return /= in more cases (e.g. distinct symbols), which is not terribly useful, in my view; FSet tries to minimize its use of :unequal because its data structure code, in that case, has to fall back to using alists, which have much poorer time complexity than binary trees.  (OTOH, Marco seems to have overlooked the other cases listed above that arguably should be treated as unequal but incomparable.)

Comparison with GENERIC-CL

Again, there are noticeable similarities between FSet's and GENERIC-CL's equality predicates and comparison functions.  GENERIC-CL does have two different equality predicates, equalp and likep, but these have no parameters other than the objects to be compared; it does not follow the CDR 8 suggestion of specifying keyword parameters that alter their behavior.   Its equalp is very similar to FSet's equal?, but not quite identical; one difference is that it returns true when called on the integer 1 and the float 1.0, where both fset:equal? and cl:equal return false.

That normally-minor discrepancy is related to a larger deficiency: GENERIC-CL's comparison operator has no defined return value corresponding to :unequal, to indicate unequal-but-incomparable arguments.  That is, FSet and CDR 8 both recognize that comparison can't implement a total ordering over all possible pairs of objects, but GENERIC-CL overlooks this point.

There are other overlaps between FSet and GENERIC-CL, but I'll save an analysis of those for another time.

Comparison with EGAL

Henry is proposing an extension to Common Lisp, not an operator that can be written in portable CL.  This shows up in two ways: first, some of his sample code implementing egal requires access to implementation internals; second, he proposes a distinction between mutable and immutable vectors and strings that does not exist in CL.  The text also suggests adding an immutable cons type to CL, though the sample code doesn't mention this case.

I agree with Henry in principle: a mutable cons (or string, or vector) is a very different beast from an immutable one; as he puts it, "eq is correct for mutable cons cells and equal is correct for immutable cons cells".  CL would have been a better language, in principle, had conses been immutable, and immutable strings and vectors been available (if perhaps not the default).  But here I must invoke one of my favorite quips: "The difference between theory and practice is never great in theory, but in practice it can be very great indeed."  The key design goal of CL, to unify the Lisp community by providing a language into which existing programs in various Lisp dialects could easily be ported, demanded that conses remain mutable by default.  Adding immutable versions of these types was not, to my knowledge, a priority.

And as Henry himself points out, in the overwhelmingly most common usage pattern for these types, they are treated as immutable once fully constructed.  For example, a common idiom is for a function to build a list in reverse order, then pass it through nreverse before returning it; at that point, it is fully constructed, and it won't be modified thereafter.  Obviously, this is a generalization over real-world Lisp programs and won't always be true, but since Lisp encourages sharing of structure, I think Lisp programmers learn pretty early that they have to be very careful when mutating a list or string or vector that they can't easily prove they're holding the only pointer to (normally by virtue of having just created it).  Given that this is pretty close to being true in practice, and that comparing these aggregates by their contents is usually what people want to do when they use them as members of collections, it would seem odd for FSet to distinguish them by identity.

Also, there's the simple fact that for these built-in types, CL provides no portable way to order or hash them by identity.  Such functionality must exist internally for use by eq and eql hash tables, but the language does not expose any portable interface to it.

So in this case, both programming convenience and the hard constraints of implementability force a choice that is contrary to theoretical purity: FSet must compare these types by their contents.  The catch, of course, is that one must be careful, once having used a list or string or vector as an element of an FSet collection, never to modify it, lest one break the collection's ordering invariant.  But in practice, this rule doesn't seem at all onerous: if you found the object in the heap somewhere — as opposed to having just created it— don't mutate it.

When it comes to user-defined types, however, the situation is quite different.  It is easy for the programmer, defining a class intended for mutation, to arrange for FSet to distinguish objects of the class by their identity rather than their contents.  The recommended way to do this is to include a serial-number slot that is initialized, at object-creation time, to the next value from an integer sequence; then write a compare method that uses this slot.  (I'll show some examples shortly.)

So if the design of your program involves some pieces of mutable state that are placed in collections, my strong recommendation is that such state should never be implemented as a bare list or string or vector, but should always be wrapped in an instance of a user-defined class.  I believe this to be a good design principle in general, even when FSet is not involved, but it becomes imperative for programs using FSet.

Adding Support for User-Defined Classes

When adding FSet support for a user-defined class, the first question is whether instances of the class represent mutable objects or mathematical values.  If it's a mathematical value, it should be treated as immutable once constructed.  (Alas, CL provides no way to enforce immutability.)  In that case, it should be compared by content.  FSet provides a convenient macro compare-slots for this purpose.  Here's an example:

(defstruct frob
  position
  color)
 
(defmethod compare ((a frob) (b frob))
  (compare-slots a b #'frob-position #'frob-color))

This specifies that frobs shall be ordered first by position, then by color.  compare-slots handles the details for you, including the complications that arise if one of the slot value comparisons returns :unequal.

For standard classes, best performance is obtained by supplying slot names as quoted symbols rather than function-quoted accessor names:

(defclass directed-graph ()
  ((nodes :initarg :nodes :accessor digraph-nodes)
   (edges :initarg :edges :accessor diagraph-edges)))
 
(defmethod compare ((a directed-graph) (b directed-graph))
  (compare-slots a b 'nodes 'edges))

I am not sure whether to recommend the use of slot names for structure classes; the answer may depend on the implementation.  At least on SBCL, you're probably better off using accessor functions for structs.

(Actually, the functions supplied don't have to be accessors; you could compare by some computed value instead, if you wanted.  I haven't seen a use for this possibility in practice, though.)

Structure classes implementing mutable objects should do something like this:

(defvar *next-widget-serial-number* 0)
(defstruct widget
  (serial-number (incf *next-widget-serial-number*))
  ...)
 
(defmethod compare ((a widget) (b widget))
  (compare-slots a b #'widget-serial-number))

For standard classes implementing mutable objects, FSet provides an especially convenient solution: just include identity-ordering-mixin as a superclass:

(defclass thingy (identity-ordering-mixin ...) ...)

That's it!

More on FSet's Single Global Ordering

I sometimes get pushback, albeit mostly from people who haven't actually used FSet, about my design decision to have a single global ordering implemented by compare, rather than allowing collections to accept an ordering function when they are created.  Let me defend this decision a little bit.

Because the ordering is extensible by defining new methods on compare, a programmer can always force a non-default ordering by defining a wrapper type.  For example, if you have a map whose keys are strings and which you want to be maintained in lexicographic order, you can easily write a structure class to wrap the strings, and give that class a compare method that forces the strings to be compared lexicographically.  (FSet even helps you out by providing a generic function compare-lexicographically, which you can just call.)

That said, I believe the need to write wrapper classes arises very rarely.  It's needed only when there is a reason that a set or map needs to be continually maintained in the non-default order.  If the non-default ordering is needed only occasionally — say, when the collection is being printed — it's usually easier to convert it to a list at that point (see FSet's generic function convert, about which I should write another blog post) and then just call sort or stable-sort on it.

And there is a wonderful simplicity to having the ordering be global.  Ease of use is a very important design goal for FSet; collection-specific orderings would give the user another wrinkle to think about.  I just don't see that the benefits, which seem to me very small, would outweigh the cost in cognitive load.

Perhaps the best way to put it is that FSet is primarily intended for application programming, not systems programming.  The distinction is fuzzy, but broadly, if programmer productivity is more important to you than squeezing out the last few percent of performance, you're doing application programming, not systems programming.  This is not necessarily a distinction about the kind of program being written — there certainly are applications that have performance-sensitive parts — but rather, about the amount of knowledge, experience, and mental effort required to write it.  FSet is designed for general productivity, not necessarily for someone who needs maximal control to achieve maximal performance.

Monday, July 15, 2024

FSet 1.4.0 released (repost)

[Reposting so it will show up at the top of Planet Lisp]

Greetings FSet users,

For several years I was too busy to do much with Common Lisp, but having left my last job a few months ago, I am now working on a project in CL.  I'm using FSet, of course, and so I've been reminded that it needed some TLC; there were some bugs to fix, and the documentation was very old and possibly hard to find.  So I've put some time into it and prepared a new release.

The first thing I did was to review all the commits Paul Dietz made back in 2020.  These were more extensive than I had realized; he greatly expanded the test suite and fixed a number of bugs.  I have tried to thank him for his work, but he seems to have retired from GrammaTech; I have not been able to reach him.  If anyone is in touch with him. please convey my thanks.

One bug Paul noticed but didn't fix, probably because he thought someone might be depending on the current behavior, was that compare on maps and seqs was not comparing the default; if two maps or seqs had the same contents but different defaults, they would nonetheless be reported as equal.  There is indeed a chance of breaking existing code by fixing this, but I think it's small; in any case, I've decided to risk it — the behavior was clearly a bug.

The only other possibly breaking change I've made is to revamp the APIs of list-relation and query-registry.  I wrote these classes some time ago, specifically for the project I was working on (and have now resumed); they're not well documented, and I'll be surprised if anyone is using them, especially in the case of query-registry.  If I'm wrong and you are using them. post a comment and I'll explain how to convert your code, if it's not obvious.  (I did remove some methods from query-registry that I was no longer using; I can restore them if necessary.)

I've also collected the FSet documentation into one place, and freshened it a little.

As part of this work I have also updated Misc-Extensions, which contains some macros that I like to use (and are used in FSet).  In particular, I made some improvements to GMap, my iteration macro (we all have our own iteration macros, right?), and wrote a README for the system, that should make it a lot easier for people to see what's in it.

 

Wednesday, June 19, 2024

On the time complexity of functional collections

Clojure made functional collections popular.  Rich Hickey, its inventor, deserves a lot of credit for that.  However, he also propagated an inaccurate way of describing their time complexity on several common operations such as looking up a key in a map.  I don't know exactly what phrase he used at first, but I've seen people describe the time complexity of these operations as "near-constant" or "effectively constant", or sometimes shouting: "effectively constant".  He also seems to have originated the practice I see in the Clojure community of speaking as if the base of the logarithm mattered: "O(log32 n)".  (The "32" should be a subscript, but I don't see an affordance for subscripts in this Blogger UI.)

All of these locutions are wrong.  The only correct way to describe the time complexity of the operations in question is as "O(log n)" or "logarithmic time" ("log time" for short).  Time complexity describes how the time to perform the operation grows as the size of the input (in this case, the collection) grows without bound.  Because the Hash Array-Mapped Trie (HAMT) — the very clever data structure invented by Phil Bagwell — is a tree, the worst-case time to access an element in the tree must be proportional to the depth of the tree, which is proportional to the logarithm of the number of elements (provided that the tree is balanced, which it will be if the hash function is well distributed).  The base of the logarithm is the radix (branching factor) of the tree, which in Clojure's case is 32, but this has no bearing on its time complexity; as everyone knows, logarithms of different bases differ only by a constant factor, and big-O notation ignores constant factors.

I think part of what is going on here is a bit of confusion between the time complexity of an algorithm and its real-world performance.  Consider this sentence from Hickey's HOPL 2020 paper, A History of Clojure:

Performance was excellent, more akin to O(1) than the theoretical bounds of O(logN).

You don't find the time complexity of an algorithm by measurement, but by analyzing the algorithm.  While it's not 100% clear, this sentence certainly gives the impression that he didn't quite understand that.

Let me speculate a little.  The performance of a lookup on a map, implemented as an HAMT, using string keys, has two components: the time to hash the key, and the time to walk the HAMT, using the hash value, to find the map entry containing that key.  I'm going to guess that for the string keys that Rich tried in his testing, the tree-walking time was less than or comparable to the string-hashing time up to a depth of maybe 3 or 4, or maybe larger.  32^4 is 1,048,576, which might be larger than any map he tested; so it's entirely plausible that he just didn't test any collections large enough to see the logarithmic behavior emerge.

If that's right, it certainly speaks well for the performance of the HAMT design.  Let me acknowledge at this point that Rich also had a marketing problem to deal with: he had to convince potential Clojure users that its functional collections would not make their programs unusably slow.  O(1) or "near-constant" certainly sounds better than O(log n).  I can understand the temptation he faced.

But again: time complexity is about how the time grows as the size of the input grows without bound.  And clearly, in this case, there will be some point at which the tree-walking time will begin to be larger than the hashing time.  This will happen sooner for short keys than long ones, and soonest if the keys are integers hashed by the identity function (or maybe by folding a 64-bit integer into a 32-bit hash; probably one or two instructions).  But it will happen.

— That is, it will happen as long as the algorithm doesn't run out of hash bits.  Clojure uses a 32-bit hash; since each tree level consumes 5 bits, that gives it 6.4 levels.  As the tree starts to fill up, the number of collisions will begin to become significant.  I'm not sure what Clojure does with collisions.  Bagwell suggested rehashing to obtain more bits, but I don't know that Clojure does that; it might just do linear search over collision buckets.  In the latter case, the time complexity would actually be linear (O(n)) rather than logarithmic; the linear behavior won't begin to emerge until the map has billions of entries, but again, time complexity isn't about that.

The other point worth making here is that while time complexity is an important fact about the performance of an algorithm, it is not the only important fact.  The amount of time it takes on small instances can also matter; depending on the use case, it can be more important than the time complexity.  There are algorithms in the CS literature (called "galactic algorithms"; TIL!) which have state-of-the-art time complexity, but are not used in practice because their constant factors are too large (I guess in practice this means they have complicated initializations to perform before getting to the meat of the computation).

None of this is intended as a criticism of Hickey's choice of HAMTs for Clojure.  The only reason FSet doesn't use HAMTs is that I wasn't aware of their existence when I was writing it.  Probably I will rectify this at some point, though that's not a trivial thing to do because the change can't be perfectly compatible; FSet's trees are comparison-based, while HAMTs are hash-based, requiring a change to how user-defined classes are interfaced to the library.  Still, I expect HAMTs would be substantially faster in many applications.

Saturday, June 8, 2024

FSet 1.4.0 Released

Greetings FSet users,

For several years I was too busy to do much with Common Lisp, but having left my last job a few months ago, I am now working on a project in CL.  I'm using FSet, of course, and so I've been reminded that it needed some TLC; there were some bugs to fix, and the documentation was very old and possibly hard to find.  So I've put some time into it and prepared a new release.

The first thing I did was to review all the commits Paul Dietz made back in 2020.  These were more extensive than I had realized; he greatly expanded the test suite and fixed a number of bugs.  I have tried to thank him for his work, but he seems to have retired from GrammaTech; I have not been able to reach him.  If anyone is in touch with him. please convey my thanks.

One bug Paul noticed but didn't fix, probably because he thought someone might be depending on the current behavior, was that compare on maps and seqs was not comparing the default; if two maps or seqs had the same contents but different defaults, they would nonetheless be reported as equal.  There is indeed a chance of breaking existing code by fixing this, but I think it's small; in any case, I've decided to risk it — the behavior was clearly a bug.

The only other possibly breaking change I've made is to revamp the APIs of list-relation and query-registry.  I wrote these classes some time ago, specifically for the project I was working on (and have now resumed); they're not well documented, and I'll be surprised if anyone is using them, especially in the case of query-registry.  If I'm wrong and you are using them. post a comment and I'll explain how to convert your code, if it's not obvious.  (I did remove some methods from query-registry that I was no longer using; I can restore them if necessary.)

I've also collected the FSet documentation into one place, and freshened it a little.

As part of this work I have also updated Misc-Extensions, which contains some macros that I like to use (and are used in FSet).  In particular, I made some improvements to GMap, my iteration macro (we all have our own iteration macros, right?), and wrote a README for the system, that should make it a lot easier for people to see what's in it.

 

Sunday, June 2, 2024

Functional Collections in Programming Languages

As I was updating the documentation for FSet, my functional collections library for Common Lisp, I wondered about the history of functional collections in programming languages.  I have found a couple of interesting early examples, but before I present those, I want to address this question: exactly what does it mean for a collection type (or indeed, any type) to be "functional" in an imperative language with assignment?

When I speak of "functional" types in imperative languages, I mean more specifically types with value semantics as opposed to reference semantics.  The distinction is more subtle than that between mutable and immutable types, which it is often conflated with.  Let's consider for a moment a simple type that we all understand: integers in C.  For example:

int a = 2;
int b = a;
++a;

After execution of these statements, a will be 3 and b will be 2.  Note in particular that the assignment of a to b is a value assignment: it copies the value of a into b; it does not make b an alias of a.  If it had made b an alias of a, then b would also have been 3 afterwards.

For contrast, consider this example in Java:

int[] a = {15, 37};
int[] b = a;
a[0] = 42;

After this, both a and b will contain {42, 37}.  Here the assignment is a reference assignment: b doesn't just wind up with the same array value as a; instead, it gets aliased to a, so that changes made to either one are reflected in the other.

These observations suggest a definition of the distinction between value semantics and reference semantics: if a type has value semantics, then assignment of it does not cause aliasing, while with reference semantics, assignment does cause aliasing.

Now let's look at a C++ example:

std::string a = "foo";
std::string b = a;
a.insert(1, "x");

Particularly if you don't know C++, I invite you to puzzle over this for a moment.  What is the value of b?  In fact, it is "foo"; the assignment is value assignment, implemented by copying the contents of the string, so the a.insert operation affects only a.  Strings in C++ have value semantics, as indeed do the STL collection types (vector etc.).  Of course, in C++, you can create a reference or pointer to any type, so you can always get reference semantics when you want it, even for built-in types such as integers.

So now I have a couple of questions for you.  First, are C++ strings mutable?  Given that they have operations like insert, one would be inclined to call them mutable, don't you think?  Let's say that they are.  Okay, then are C/C++ integers also mutable?  We don't usually think of integers as being mutable; we usually think of an operation like ++a as assigning a new value to a, not as incrementing a mutable integer object.  But as we see here, integers and strings have the same behavior vis-a-vis assignment and modification; if we consider strings to be mutable, seems like we have to consider integers mutable as well.

But my purpose isn't really to get you to call integers mutable.  My point is, rather, that the mutable/immutable distinction doesn't capture everything that's going on here.  The more useful distinction is between value semantics and reference semantics.  When I speak of "functional collections", I mean that they have value semantics, not necessarily that they have nothing that looks like a mutating operation.

So the question I wondered about was, what were the first programming languages to provide collections with value semantics?

A case can be made that Lisp was the first, in 1958.  Although Lisp lists are mutable, they are usually treated as immutable once fully constructed.  For example, a function constructing and returning a list might call nreverse on it just before returning it, to destructively reverse it (a common idiom, because constructing a list using cons starts at the end); but usually, the caller of such a function, having received the returned list, would not subsequently mutate it.  Certainly there are and have always been exceptions, but my impression is that, even in Lisp's early years, significant bodies of code were written in which the vast majority of lists were treated functionally (once fully constructed).

But the first language in which collections were treated functionally by definition, rather than by usual convention, appears to have been APL in 1966.  Interestingly, given the very great difference the choice of value or reference semantics makes to the way in which one writes programs in a language, I can't find a clear statement in the APL manual that I'm looking at as to which semantics APL uses for its arrays.  It seems to be something that people expect to go without saying (one reason I'm writing about it).  But I downloaded and compiled GNU APL and tried it out, and, sure enough, it uses value semantics (the left arrow is the assignment operator; user input is in bold):

V←1 2 3 4
V
1 2 3 4
W←V
V[2]←7
V
1 7 3 4
W
1 2 3 4

Another early value-semantics language was SETL.  From Robert B. K. Dewar's The SETL Programming Language (1979):

One important point is that SETL treats tuples as values when it comes to assignments.  Consider the following sections of code:
abc := 12;
cde := abc;
abc := abc + 2; $ cde
still = 12

abc := [1,2,3];
cde := abc;
abc(2) := 0; $ cde
still = [1,2,3]

In SETL the two sequences have similar effects.  If you expected cde to change in the second
sequence, then study it carefully.  If not, then you have the correct idea already.

Here, just as I have done above, Dewar is demonstrating how the value semantics of SETL tuples is like that of integers.

SETL strongly influenced a little-known, proprietary language called Refine, which I worked in from 1987 to 2003.  Refine was originally developed at Kestrel Institute; it had functional collections and was implemented on top of Common Lisp.  It was my experience with Refine that motivated me to write FSet.

Other languages with functional collections appeared in the 1980s and 1990s, including the ML family (primarily Standard ML and OCaml) and Haskell.  No doubt there were others of which I am not aware.  But the language that has probably done the most to popularize functional collections is Rich Hickey's Clojure.

All of which brings me back to FSet.  FSet, of course, has value semantics:

FSET-USER> (isetq x (map ('a (seq 47 33)) ('b (seq 17 3 99))))
#{| (A #[ 47 33 ]) (B #[ 17 3 99 ]) |} ;
a map whose range values are seqs
FSET-USER> (isetq y x)
#{| (A #[ 47 33 ]) (B #[ 17 3 99 ]) |}
FSET-USER> (setf (@ (@ x 'a) 1) 1)     ;
sets element 1 of the seq for 'a
1
FSET-USER> x
#{| (A #[ 47 1 ]) (B #[ 17 3 99 ]) |}
FSET-USER> y
#{| (A #[ 47 33 ]) (B #[ 17 3 99 ]) |}

(Here isetq is an "interactive setq" that suppresses undefined-variable warnings some Lisp implementations issue when you interactively set a previously unknown variable.)  You can see that the update to x doesn't change y; but how does FSet manage this?  Common Lisp's setf macro was designed to support such cases.  In the CLHS sec. 5.1.2.2, we see:

A function form can be used as a place [the first operand of setf] if it falls into one of the following categories:
[...]
• A function call form whose first element is the name of any one of the functions in the next figure [which are ldb, mask-field, and getf], provided that the supplied argument to that function is in turn a place form; in this case, the new place has stored back into it the result of applying the supplied "update" function.

So for instance, you can update bitfields of integer variables:

CL-USER> (setq *print-base* 16)  ; hexadecimal output makes this easier to understand
10
CL-USER> (setq x #x1000)
1000
CL-USER> (setf (ldb (byte 8 4) x) #x32)
32
CL-USER> x
1320

Since it's an integer that's being updated here, clearly setf has the ability to handle updates to types with value semantics.  If you're curious how that works, see the CLHS sec. 5.1.1.2.