Thursday, October 17, 2024

Comparison: FSet vs. Sycamore

[BULLETIN: Quicklisp now has the latest version of FSet.]

Sycamore, primarily by Neil Dantam, is a functional collections library that is built around the same weight-balanced binary tree data structure (with leaf vectors) that FSet uses.  While the README on that page comments briefly on the differences between Sycamore and FSet, I don't feel that it does FSet justice.  Here is my analysis.

Dantam claims that his library is 30% to 50% faster than FSet on common operations.  While I haven't done comprehensive micro-benchmarking, a couple of quick tests indicates that this claim is plausible.  A look through the internals of the implementation confirms that it is clean and tight, and I must commend him.  There may be some techniques in here that I could usefully borrow.

Most of the performance difference is necessitated by two design choices that were made differently in the two libraries.  One of these Dantam mentions in his comparison: FSet's use of a single, global ordering relation implemented as a CLOS generic function, vs. Sycamore's more standard choice of requiring a comparison function to be supplied when a collection is created.  The other one he doesn't mention: the fact that FSet supports a notion of equivalent-but-unequal values, which are values that are incomparable — there's no way, or at least no obvious way, to say which is less than the other, and yet we want to treat them as unequal.  The simplest example is the integer 1 and the single-float 1.0, which have equal numerical values (and cl:= returns true on them), but which are nonetheless not eql.  (I have a previous blog post that goes into a lot more detail about equality and comparison.)  Since Sycamore expects the user-supplied comparison function to return an integer that is negative, zero, or positive to indicate the ordering of its arguments, there's no encoding for the equivalent-but-unequal case, nor is there any of the code that would be required to handle that case.

Both of these decisions were driven by my goal for the FSet project.  I didn't just want to provide a functional collections library that could be called occasionally when one had a specific need for such a data structure.  My ambition was much grander: to make functional collections into a reasonable default choice for the vast majority of programming situations.  I wanted FSet users (including, of course, myself) to be able to use functional collections freely, with very little extra effort or thought.  While Lisp by itself reaches a little bit in this direction — lists can certainly be used functionally — lists used as functional collections run into severe time complexity problems as those collections get large.  I wanted the FSet collections to be as convenient and well-supported as lists, but without the time complexity issues.

— Or rather, I wanted them to be even more convenient than lists.  Before writing FSet, I had spent years working in a little-known proprietary language called Refine, which happened to be implemented on top of Common Lisp, so it was not unusual to switch between the two languages.  And I had noticed something.  In contrast to CL, with its several different predefined equality predicates and with its functions that take :test arguments to specify which one to use, Refine has a single notiion of equality.  The value space is cleanly divided between immutable types, which are compared by value — along with numbers, these include strings, sets, maps, and seqs — and mutable objects, which are always compared by identity.  And it worked!  I found I did not miss the ability to specify an equality predicate when performing an operation such as "union".  It was just never needed.  Get equality right at the language level, and the problem goes away.

Although FSet's compare generic function isn't just for equality — it also defines an ordering that is used by the binary trees — I thought it would probably turn out to be the case that a single global ordering, implemented as a generic function and therefore extensible, would be fine the vast majority of the time.  I think experience has borne this out.  And just as you can mix types in Lisp lists — say, numbers and symbols — without further thought, so you can have any combination of types in an FSet set, effortlessly.  (A project I'm currently working on actually takes considerable advantage of this capability.)

As for supporting equivalent-but-unequal values, this desideratum flows directly from the principle of least astonishment.  While it might not be too surprising for a set or map implementation to fail distinguish the integer 1 from the float 1.0, it certainly would be very surprising, and almost certainly a source of bugs in a compiler that used it, for it to fail to distinguish two uninterned symbols with the same name.  (I saw a macro expansion recently that contained two distinct symbols that both printed as #:NEW.  It happens.)  A compiler using Sycamore for a map on symbols would have to supply a comparison function that accounted for this; it couldn't just compare the package name and symbol name.  (You'd have to do something like keep a weak hash table mapping symbols to integers, assigned in the order in which the comparison function encountered them.  It's doable, but FSet protects you from this madness.)

Along with those deep semantic design choices, I've spent a lot of time on developing a wide and featureful API for FSet (an effort that's ongoing).  FSet has many features that Sycamore lacks, including:

  • seqs, a binary-tree sequence implementation that holds arbitrary Lisp objects (Sycamore ropes hold only characters, which is certainly an important special case, but why restrict ourselves?)
  • default values for maps and seqs (the value to return when the key is outside the domain is associated with the collection, not supplied at the call site; this turns out to be a significant convenience)
  • generic functions that operate on both lists and FSet collections, to shadow the CL builtins
  • the powerful map-union and map-intersection operations (I'll blog about these in the future)
  • more ways to iterate over the collections (the FSet tutorial has a good summary, about 3/4 of the way down)
  • speaking of the tutorial, FSet has lots more documentation

Let me digress slightly to give an example of how FSet makes programming more elegant and convenient.  Joe Marshall just put up a blog post comparing Go(lang) with Common Lisp, which is worth a read on its own; I'm just going to grab a code snippet from there to show a little bit of what programming with FSet is like.  Here's Joe's code:

 (defun collate (items &key (key #'identity) (test #'eql) (merger (merge-adjoin #'eql)) (default nil))
   (let ((table (make-hash-table :test test)))
     (dolist (item items table)
       (let ((k (funcall key item)))
         (setf (gethash k table) (funcall merger (gethash k table default) item))))))

 (defun merge-adjoin (test)
   (lambda (collection item)
     (adjoin item collection :test test)))

And here's what I would write using FSet:

 (defun collate (items &key (key #'identity))
   (let ((result (map :default (set))))
     (dolist (item items result)
       (includef (@ result (funcall key item)) item))))

(Well, I would probably move result outside the dolist form to make it clearer what the return value is, but let's go with Joe's stylistic choice here.)

For those who haven't used FSet: the form (map :default (set)) creates a map whose default is the empty set, meaning that lookups on that map will return the empty set if the key is not in the map.  This saves the includef form from having to handle that possibility.

My version makes assumptions, it's true, about how you want to collect the items with a given key; it doesn't give you other choices.  It could, but what would be the point?  It's already using a general set with better time complexity than lists, and saving you from having to write anything like merge-adjoin.  The extensible global equivalence relation means you're not going to need to supply a :test either.

I think the FSet-enhanced code is cleaner, more elegant, and therefore clearer than the plain-CL version.  Don't you agree?  Maybe you wouldn't say it's a huge improvement, okay, but it's a small example; in a larger codebase, I would argue, these small improvements add up.

* * * * *

To summarize: if you just want a library you can call in a few places for specific purposes, Sycamore might work better for you (but think hard if you're writing a comparator for symbols).  FSet can certainly be used that way, but it can be much more.  If you want to see one way in which Common Lisp can be made into a better language, without giving up anything that we love about it, I urge you to give FSet a try.

FSet has changed the way I write Lisp programs.  — an FSet user

(UPDATE: the magnitude of the performance difference between FSet and Sycamore surprised me, and inspired me to do some profiling of FSet.  It turned out that I could get a 20% speedup on one micro-benchmark simply by adding some inline declarations.  Mea culpa, mea culpa, mea maxima culpa; I should have done this years ago.   With that change, the generic function overhead appears to be the only significant cause of the remaining ~20% performance difference.  I tried creating a Sycamore set using a thin wrapper around fset:compare, and the resulting performance was very similar to that of FSet with its new inlines.)