Wednesday, September 4, 2024

Equality and Comparison in FSet

This post is somewhat prompted by a recent blog post by vindarel, about Common Lisp's various built-in equality predicates.  It is aleo related to Marco Antoniotti's CDR 8, Generic Equality and Comparison for Common Lisp, implemented by Charles Zhang; Alex Gutev's GENERIC-CL; and Henry Baker's well-known 1992 paper on equality.

Let me start by summarizing those designs.  CDR 8 proposes a generic equaity function equals, and a comparison function compare.  These are both CLOS generic functions intended to be user-extended, though they also have some predefined methods.  equals has several keyword parameters controlling its exact behavior.  One of these is case-sensitive, which controls string comparison.  Another is recursive, which controls its behavior on conses; if recursive is false (the default), conses are compared by eq, but if it's true, a tree comparison is done.  compare is specified to return one of the symbols <, >, =, or /= to indicate the relative order of its arguments; it also has keyword parameters such as case-sensitive and recursive

GENERIC-CL replaces many CL operations with CLOS generic functions, and also adds new ones.  It touches many parts of the language other than equality and comparison, but I'll leave those aside for now.  It has two generic equality functions: equalp, which, notwithstanding the name, is case-sensitive for characters and strings, and likep, which is case-insensitive.  It also has comparison predicates lessp etc., along with a compare function (implemented using lessp) that can return :less, :equal, or :greater.

Henry's paper makes some interesting arguments about how a Common Lisp equality predicate should behave; he makes these concrete by defining a novel predicate egal.  His most salient point, for my purposes, is that mutable objects, including vectors and conses, should always be compared with eq.  I will argue below that FSet adheres to the spirit of this desideratum even though not to its letter.

FSet advertises itself as a "set-theoretic" collections library, and as such, requires a well-defined notion of equality.  Also, since it is implemented using balanced binary trees, it requires an ordering function.  FSet defines a generic function compare with these properties:

  • It returns one of the symbols :less, :equal, :greater, or :unequal (:unequal is used in certain rare cases of values which are not equal but cannot be consistently ordered)
  • It implements a strict weak ordering, with an additional constraint: along with incomparability (indicated by either :equal or :unequal) being transitive, equality is also transitive by itself
  • It can compare any two Lisp objects; this is an element of FSet's design philosophy
  • Being a generic function, it is of course user-extensible

FSet's equality predicate is equal?, which simply calls compare and checks that the result is :equal.  Thus, the only step required to add a user-defined type to the FSet universe is to define a compare method for it.  FSet provides a few utilities to help with this, which I'll go into below.

The cases in which compare returns :unequal to indicate unequal-but-incomparable arguments include:

  • Numbers of equal value but different types; that is, = would return true on them, but eql would return false.  Example: the integer 1 and the float 1.0.
  • Distinct uninterned symbols (symbols whose symbol-package is nil) whose symbol-names are equal (by string=).
  • Objects of a type for which no specific compare method has been defined, and which are distinct according to eql.
  • If you create a package, rename it, then create a new package reusing the original name of the first package, the two packages compare :unequal.   (FSet holds on to the original name, to protect itself from the effects of rename-package, which could otherwise be distastrous.)  Also, two symbols with the same name, one from the old package and one from the new, also compare :unequal.
  • Aggregates which are being compared component-wise, in the case where none of the component-wise comparisons returns :less or :greater, and at least one of them returns :unequal.

If compare's default method is called with objects of different classes, it returns a result based solely on the classes; the contents of the objects are not examined.  Again, it is part of FSet's design philosophy to give you as much freedom as reasonably possible; this includes allowing you to have sets containing more than one kind of object.

(In general, FSet's built-in ordering has been chosen for performance, not for its likely usefulness to clients.  For example, compare on two strings of different lengths orders the shorter one first, ignoring the contents, because this requires only an O(1) operation.)

Comparison with equal

FSet's equal? on built-in CL types behaves almost identically to CL's equal, with the one difference that on vectors (other than bit-vectors), equal just calls eq, but equal? compares the contents.  (I just noticed that this is not true for multidimensional arrays, and have filed an FSet bug.)  (On bit-vectors, they both compare the contents.)

Comparison with CDR 8

There are noticeable similarities between FSet and the CDR 8 proposal; the latter not only includes a comparison function, but even provides for it to return /=, corresponding to FSet's :unequal, to indicate unequal but incomparable arguments.  But the idea that the behavior of equality and comparison could be modified via keyword parameters does not seem appropriate for FSet.  I think it would make FSet quite a bit harder to use, for little gain.  For example, FSet comparison on lists walks the lists, but CDR 8, by default, just calls eq on their heads; users would have to remember to pass :recursive t to get the behavior they probably expect.  FSet collections would have to remember which options they were created with, and if you tried, say, to take the union of two sets which used different options, you'd get an error.

Years of programming experience — not only with FSet but also with Refine, the little-known proprietary language that inspired FSet — have left me with the clear impression that having a single global equality predicate is a great simplification and very rarely limiting, provided it was defined properly to begin with.

I also note that FSet has more predefined methods for its comparison function (and therefore for its equality predicate) than are proposed in CDR 8.  In particular, CDR 8's default compare methods return /= in more cases (e.g. distinct symbols), which is not terribly useful, in my view; FSet tries to minimize its use of :unequal because its data structure code, in that case, has to fall back to using alists, which have much poorer time complexity than binary trees.  (OTOH, Marco seems to have overlooked the other cases listed above that arguably should be treated as unequal but incomparable.)

Comparison with GENERIC-CL

Again, there are noticeable similarities between FSet's and GENERIC-CL's equality predicates and comparison functions.  GENERIC-CL does have two different equality predicates, equalp and likep, but these have no parameters other than the objects to be compared; it does not follow the CDR 8 suggestion of specifying keyword parameters that alter their behavior.   Its equalp is very similar to FSet's equal?, but not quite identical; one difference is that it returns true when called on the integer 1 and the float 1.0, where both fset:equal? and cl:equal return false.

That normally-minor discrepancy is related to a larger deficiency: GENERIC-CL's comparison operator has no defined return value corresponding to :unequal, to indicate unequal-but-incomparable arguments.  That is, FSet and CDR 8 both recognize that comparison can't implement a total ordering over all possible pairs of objects, but GENERIC-CL overlooks this point.

There are other overlaps between FSet and GENERIC-CL, but I'll save an analysis of those for another time.

Comparison with EGAL

Henry is proposing an extension to Common Lisp, not an operator that can be written in portable CL.  This shows up in two ways: first, some of his sample code implementing egal requires access to implementation internals; second, he proposes a distinction between mutable and immutable vectors and strings that does not exist in CL.  The text also suggests adding an immutable cons type to CL, though the sample code doesn't mention this case.

I agree with Henry in principle: a mutable cons (or string, or vector) is a very different beast from an immutable one; as he puts it, "eq is correct for mutable cons cells and equal is correct for immutable cons cells".  CL would have been a better language, in principle, had conses been immutable, and immutable strings and vectors been available (if perhaps not the default).  But here I must invoke one of my favorite quips: "The difference between theory and practice is never great in theory, but in practice it can be very great indeed."  The key design goal of CL, to unify the Lisp community by providing a language into which existing programs in various Lisp dialects could easily be ported, demanded that conses remain mutable by default.  Adding immutable versions of these types was not, to my knowledge, a priority.

And as Henry himself points out, in the overwhelmingly most common usage pattern for these types, they are treated as immutable once fully constructed.  For example, a common idiom is for a function to build a list in reverse order, then pass it through nreverse before returning it; at that point, it is fully constructed, and it won't be modified thereafter.  Obviously, this is a generalization over real-world Lisp programs and won't always be true, but since Lisp encourages sharing of structure, I think Lisp programmers learn pretty early that they have to be very careful when mutating a list or string or vector that they can't easily prove they're holding the only pointer to (normally by virtue of having just created it).  Given that this is pretty close to being true in practice, and that comparing these aggregates by their contents is usually what people want to do when they use them as members of collections, it would seem odd for FSet to distinguish them by identity.

Also, there's the simple fact that for these built-in types, CL provides no portable way to order or hash them by identity.  Such functionality must exist internally for use by eq and eql hash tables, but the language does not expose any portable interface to it.

So in this case, both programming convenience and the hard constraints of implementability force a choice that is contrary to theoretical purity: FSet must compare these types by their contents.  The catch, of course, is that one must be careful, once having used a list or string or vector as an element of an FSet collection, never to modify it, lest one break the collection's ordering invariant.  But in practice, this rule doesn't seem at all onerous: if you found the object in the heap somewhere — as opposed to having just created it— don't mutate it.

When it comes to user-defined types, however, the situation is quite different.  It is easy for the programmer, defining a class intended for mutation, to arrange for FSet to distinguish objects of the class by their identity rather than their contents.  The recommended way to do this is to include a serial-number slot that is initialized, at object-creation time, to the next value from an integer sequence; then write a compare method that uses this slot.  (I'll show some examples shortly.)

So if the design of your program involves some pieces of mutable state that are placed in collections, my strong recommendation is that such state should never be implemented as a bare list or string or vector, but should always be wrapped in an instance of a user-defined class.  I believe this to be a good design principle in general, even when FSet is not involved, but it becomes imperative for programs using FSet.

Adding Support for User-Defined Classes

When adding FSet support for a user-defined class, the first question is whether instances of the class represent mutable objects or mathematical values.  If it's a mathematical value, it should be treated as immutable once constructed.  (Alas, CL provides no way to enforce immutability.)  In that case, it should be compared by content.  FSet provides a convenient macro compare-slots for this purpose.  Here's an example:

(defstruct frob
  position
  color)
 
(defmethod compare ((a frob) (b frob))
  (compare-slots a b #'frob-position #'frob-color))

This specifies that frobs shall be ordered first by position, then by color.  compare-slots handles the details for you, including the complications that arise if one of the slot value comparisons returns :unequal.

For standard classes, best performance is obtained by supplying slot names as quoted symbols rather than function-quoted accessor names:

(defclass directed-graph ()
  ((nodes :initarg :nodes :accessor digraph-nodes)
   (edges :initarg :edges :accessor diagraph-edges)))
 
(defmethod compare ((a directed-graph) (b directed-graph))
  (compare-slots a b 'nodes 'edges))

I am not sure whether to recommend the use of slot names for structure classes; the answer may depend on the implementation.  At least on SBCL, you're probably better off using accessor functions for structs.

(Actually, the functions supplied don't have to be accessors; you could compare by some computed value instead, if you wanted.  I haven't seen a use for this possibility in practice, though.)

Structure classes implementing mutable objects should do something like this:

(defvar *next-widget-serial-number* 0)
(defstruct widget
  (serial-number (incf *next-widget-serial-number*))
  ...)
 
(defmethod compare ((a widget) (b widget))
  (compare-slots a b #'widget-serial-number))

For standard classes implementing mutable objects, FSet provides an especially convenient solution: just include identity-ordering-mixin as a superclass:

(defclass thingy (identity-ordering-mixin ...) ...)

That's it!

More on FSet's Single Global Ordering

I sometimes get pushback, albeit mostly from people who haven't actually used FSet, about my design decision to have a single global ordering implemented by compare, rather than allowing collections to accept an ordering function when they are created.  Let me defend this decision a little bit.

Because the ordering is extensible by defining new methods on compare, a programmer can always force a non-default ordering by defining a wrapper type.  For example, if you have a map whose keys are strings and which you want to be maintained in lexicographic order, you can easily write a structure class to wrap the strings, and give that class a compare method that forces the strings to be compared lexicographically.  (FSet even helps you out by providing a generic function compare-lexicographically, which you can just call.)

That said, I believe the need to write wrapper classes arises very rarely.  It's needed only when there is a reason that a set or map needs to be continually maintained in the non-default order.  If the non-default ordering is needed only occasionally — say, when the collection is being printed — it's usually easier to convert it to a list at that point (see FSet's generic function convert, about which I should write another blog post) and then just call sort or stable-sort on it.

And there is a wonderful simplicity to having the ordering be global.  Ease of use is a very important design goal for FSet; collection-specific orderings would give the user another wrinkle to think about.  I just don't see that the benefits, which seem to me very small, would outweigh the cost in cognitive load.

Perhaps the best way to put it is that FSet is primarily intended for application programming, not systems programming.  The distinction is fuzzy, but broadly, if programmer productivity is more important to you than squeezing out the last few percent of performance, you're doing application programming, not systems programming.  This is not necessarily a distinction about the kind of program being written — there certainly are applications that have performance-sensitive parts — but rather, about the amount of knowledge, experience, and mental effort required to write it.  FSet is designed for general productivity, not necessarily for someone who needs maximal control to achieve maximal performance.