This is an archive of the discontinued LLVM Phabricator instance.

[ADT] Introduce Disjoint Set Union structure
AbandonedPublic

Authored by mkazantsev on Nov 24 2017, 4:41 AM.

Details

Summary

This patch introduces a data structure - Disjoint Set Union - that allows to perform
operations on disjoint sets of elements:

  1. Check whether two elements X and Y belong to one set;
  2. Merge sets containing X and Y into one set.

If we have a transitive symmetrical equivalence function F, and we proved that F(X, Y)
and F(Y,Z) is true, then adding these two pairs to DSU will also allow us to prove that
F(Z, X) is true for cheap. One possible application of that is using it as cache for
comparators: if we proved that X compareTo Y == 0 and Y compareTo Z == 0, then
DSU can easily prove that X compareTo Z == 0.

Diff Detail

Event Timeline

mkazantsev created this revision.Nov 24 2017, 4:41 AM
sanjoy added inline comments.Nov 24 2017, 1:39 PM
include/llvm/ADT/DisjointSetUnion.h
42

You're copying T instances here -- why not take const T & instead?

(For SCEV * and Value * this does not matter, but e.g. people may eventually want to store std::string here).

84

Please avoid recursion here, unless you're certain this would be (say) less than 10 frames for all practical cases (in which case add an assert).

103

The pattern I've seen here is taking the map type as a template parameter instead of hardcoding it (with DenseMap as a default), so that folks can substitute SmallDenseMap etc.

105

How about just one DenseMap that maps T instances to a std::pair<T, int>?

mkazantsev added inline comments.Nov 26 2017, 8:50 PM
include/llvm/ADT/DisjointSetUnion.h
84

I'm pretty certain that the expected depth is effectively small, but I was also thinking to rewrite this with loop, so I'll do it.

105

That would consume as twice as much memory. We only store Rank for roots and Parent for non-roots, and this would have us store both for both.

mkazantsev added inline comments.Nov 26 2017, 8:56 PM
include/llvm/ADT/DisjointSetUnion.h
105

We could store a union, though...

mkazantsev added inline comments.Nov 26 2017, 9:35 PM
include/llvm/ADT/DisjointSetUnion.h
105

After giving it some thought, I don't think it's a good idea. Storing pairs is expensive due to reason I wrote above, storing unions doesn't allow us to understand where we should stop while traversing parents to find head. If we want to do so, we need an extra flag. I'd leave it as is unless we want to complicate this plece of logic without obvious benefits.

Two things.
First, to bikeshed this horribly:

Everywhere else in llvm we call this a union-find data structure.
Almost all papers you can reference these days do the same.
What you call head is called find roughly everywhere, and we should do the same here.

See, e.g.,
https://pdfs.semanticscholar.org/bbcf/76a84ee10348442ccb50ccdbfb288ede5cbb.pdf (hopcroft and ullman's analysis bounding it to log*)
https://dl.acm.org/citation.cfm?doid=62.2160 (Tarjan's bounding of union-find to inverse ackermann)
https://pdfs.semanticscholar.org/b716/349a3072afbede9f0fb8f561a8e0f297baf0.pdf (survey of data-structures used to solve disjoint-set-union problems)

Even wikipedia calls it find() :)

(we call it find in all of the llvm implementations)

Second:

EquivalenceClasses.h already implements this datastructure, but not the union by rank. It does do path compression.

We should not end up with two. I'm fine if we go with your implementation, but the end result (doesn't have to be in this patch) should be only one of these classes existing.

include/llvm/ADT/DisjointSetUnion.h
84

I would not bother making this non-recursive (I can't recall ever having seen a production implementation that is non-recursive)

Since you are doing union-by-rank, depth is limited to log(total number of items).
IE any root node of rank X must have >= 2**X items under it.

The worst case is if you only ever call find after all of the unions, and then you have log(n) recursion worst case here.For For LLVM ,it's probably bounded in the millions for most practical problems, so somewhere between 20-24 is my guess at worst case, assuming 16 million items.

If you call find, and intersperse the two, you will pretty much never get a depth > 5.
(amortized, it can only be >5 if you have more than 2**65535 items
non-amortized, very hard to say, but my guess is <= 10 for most problem types)

We also do this recursively in the other implementations of this datastructure we have (for example, EquivalenceClasses and AliasSetTracker).

I didn't know it that EquivalenceClasses already does it. After looking into the code I can agree that it's pretty much the same thing done there. I guess it's easier to add ranking heuristic in there than to duplicate this piece of logic.

I'll check if EquivalenceClasses solves the compile time problem for which I did this one. If it does, I think the better way will be to base my solution on it, and then possibly add the ranking in there.

Thanks for pointing it out! :)

mkazantsev abandoned this revision.Nov 27 2017, 3:28 AM

It seems that EquivalenceClasses does exactly what we need. Surprisingly, it didn't have unit tests so I've commited them as rL319018. I abandon this patch and will rewrite dependent patches so that they use EquivalenceClasses.

I also consider adding rank heuristics to EquivalenceClasses in future.

Abandoning this one.