This is an archive of the discontinued LLVM Phabricator instance.

Differential D40427

[ADT] Introduce Disjoint Set Union structure
Abandoned · Public

Authored by mkazantsev on Nov 24 2017, 4:41 AM.

Details

Reviewers

chandlerc
sanjoy
reames
anna

Summary

This patch introduces a data structure, Disjoint Set Union (DSU), that supports two
operations on disjoint sets of elements:

Check whether two elements X and Y belong to the same set;
Merge the sets containing X and Y into one set.

If F is a symmetric, transitive relation and we have proved that F(X, Y) and F(Y, Z)
hold, then adding these two pairs to the DSU lets us prove cheaply that F(Z, X) holds
as well. One possible application is a cache for comparators: once we have proved that
X compareTo Y == 0 and Y compareTo Z == 0, the DSU can easily prove that
X compareTo Z == 0.
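
The comparator-cache idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch's code: `EqualityCache`, `compareCached`, and the "equal modulo 10" comparison are hypothetical, `std::unordered_map` stands in for `DenseMap`, and path compression and ranking are left out for brevity.

```cpp
#include <cassert>
#include <unordered_map>

// Hypothetical stand-in for the patch's DisjointSetUnion, keyed by int.
class EqualityCache {
public:
  // Record that X and Y compared equal by linking their representatives.
  void recordEqual(int X, int Y) {
    int RX = find(X), RY = find(Y);
    if (RX != RY)
      Parent[RX] = RY;
  }

  // True if equality of X and Y follows from the pairs recorded so far.
  bool knownEqual(int X, int Y) const { return find(X) == find(Y); }

private:
  // Follow parent links until a vertex without a parent entry is reached.
  int find(int X) const {
    for (auto It = Parent.find(X); It != Parent.end(); It = Parent.find(X))
      X = It->second;
    return X;
  }

  std::unordered_map<int, int> Parent; // only non-representatives have entries
};

// Hypothetical expensive comparator: values are equal when they agree
// modulo 10. Calls counts how often the slow path actually runs.
inline int compareCached(int X, int Y, EqualityCache &Cache, int &Calls) {
  if (Cache.knownEqual(X, Y))
    return 0;
  ++Calls;
  int Result = (X % 10) - (Y % 10);
  if (Result == 0)
    Cache.recordEqual(X, Y);
  return Result;
}
```

After comparing (1, 11) and (11, 21) through the slow path, a later query for (21, 1) is answered from the cache without another slow comparison.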

Event Timeline

mkazantsev created this revision. Nov 24 2017, 4:41 AM

Herald added a subscriber: mgorny. Nov 24 2017, 4:41 AM

mkazantsev added a child revision: D40428: [SCEV][NFC] More efficient caching in CompareSCEVComplexity. Nov 24 2017, 4:49 AM

mkazantsev added a reviewer: anna.

sanjoy added inline comments. Nov 24 2017, 1:39 PM

include/llvm/ADT/DisjointSetUnion.h
42	You're copying `T` instances here -- why not take `const T &` instead? (For `SCEV *` and `Value *` this does not matter, but e.g. people may eventually want to store `std::string` here).
84	Please avoid recursion here, unless you're certain this would be (say) less than 10 frames for all practical cases (in which case add an assert).
103	The pattern I've seen here is taking the map type as a template parameter instead of hardcoding it (with `DenseMap` as a default), so that folks can substitute `SmallDenseMap` etc.
105	How about just one `DenseMap` that maps `T` instances to a `std::pair<T, int>`?
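
The first and third suggestions could look roughly like the sketch below. It is an assumption-laden illustration, not a revision of the patch: `UnionFindSketch` is a hypothetical name, and `std::unordered_map` is used as the default map type so the example compiles without LLVM (in the patch the default would be `DenseMap`, with `SmallDenseMap` as a possible substitute). Ranking and path compression are omitted.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <unordered_map>

// Sketch only: the map type is a template parameter with a default, so a
// caller can substitute another associative container.
template <typename T, typename MapT = std::unordered_map<T, T>>
class UnionFindSketch {
public:
  // Arguments are taken by const reference, so a T such as std::string is
  // not copied at the call boundary.
  bool isInSameSet(const T &X, const T &Y) const { return find(X) == find(Y); }

  void mergeSetsOf(const T &X, const T &Y) {
    T RX = find(X);
    T RY = find(Y);
    if (RX != RY)
      Parent[RX] = RY;
  }

private:
  // Returns the representative of X's set (no compression, for brevity).
  T find(const T &X) const {
    T Cur = X;
    for (auto It = Parent.find(Cur); It != Parent.end(); It = Parent.find(Cur))
      Cur = It->second;
    return Cur;
  }

  MapT Parent; // only non-representatives have entries
};
```

With this shape, `UnionFindSketch<std::string>` uses the default map, while `UnionFindSketch<int, std::map<int, int>>` swaps in an ordered map without touching the class.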

mkazantsev added inline comments. Nov 26 2017, 8:50 PM

include/llvm/ADT/DisjointSetUnion.h
84	I'm pretty certain that the expected depth is effectively small, but I was also thinking of rewriting this with a loop, so I'll do it.
105	That would consume twice as much memory. We only store Rank for roots and Parent for non-roots, and this would have us store both for both.
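
A loop-based version of the lookup at line 84 might look like the sketch below. The updated patch was never posted, so this is a guess at the shape rather than the author's code: `headIterative` is a hypothetical free function, `int` keys and `std::unordered_map` stand in for `T` and `DenseMap`, and, as in the patch, only non-head vertices have an entry in `Parent`.

```cpp
#include <cassert>
#include <unordered_map>

// Iterative replacement for a recursive head(): two passes over the path.
inline int headIterative(std::unordered_map<int, int> &Parent, int X) {
  // First pass: follow parent links until a vertex without a parent is found.
  int Head = X;
  for (auto It = Parent.find(Head); It != Parent.end(); It = Parent.find(Head))
    Head = It->second;
  // Second pass: re-point every vertex on the path directly at the head.
  while (X != Head) {
    auto It = Parent.find(X); // X is a non-head vertex, so the entry exists
    int Next = It->second;
    It->second = Head;
    X = Next;
  }
  return Head;
}
```

The second pass performs the same path compression as the recursive form, at the cost of walking the path twice instead of using the call stack.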

mkazantsev added inline comments. Nov 26 2017, 8:56 PM

include/llvm/ADT/DisjointSetUnion.h
105	We could store a union, though...

mkazantsev added inline comments. Nov 26 2017, 9:35 PM

include/llvm/ADT/DisjointSetUnion.h
105	After giving it some thought, I don't think it's a good idea. Storing pairs is expensive for the reason I wrote above, and storing unions doesn't let us tell where to stop while traversing parents to find the head. To do that, we would need an extra flag. I'd leave it as is unless we want to complicate this piece of logic without obvious benefits.
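
To make the "extra flag" point concrete, here is a small sketch of the single-map variant using a tagged union. It is illustrative only and assumes C++17: `ParentLink`, `SetSize`, `Node`, and `headOf` are hypothetical names, and `std::unordered_map` stands in for `DenseMap`. `std::variant` stores a discriminator next to the payload, and that discriminator is exactly the flag a raw union would lack.

```cpp
#include <cassert>
#include <unordered_map>
#include <variant>

struct ParentLink { int Parent; }; // stored for non-head vertices
struct SetSize { int Size; };      // stored for head vertices

// One map value per vertex; the variant's discriminator tells the lookup
// whether it is looking at a parent link or at a head's size.
using Node = std::variant<ParentLink, SetSize>;

inline int headOf(const std::unordered_map<int, Node> &Nodes, int X) {
  for (auto It = Nodes.find(X); It != Nodes.end(); It = Nodes.find(X)) {
    const ParentLink *Link = std::get_if<ParentLink>(&It->second);
    if (!Link)
      break; // reached a head that stores its size
    X = Link->Parent;
  }
  // A vertex with no entry at all is the head of its own one-element set.
  return X;
}
```

On typical implementations `sizeof(Node)` is larger than `sizeof(int)` because of the discriminator, so the per-entry saving over a pair is smaller than it first appears.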

Two things.
First, to bikeshed this horribly:

Everywhere else in LLVM we call this a union-find data structure.
Almost all papers you can reference these days do the same.
What you call head is called find roughly everywhere, and we should do the same here.

See, e.g.,
https://pdfs.semanticscholar.org/bbcf/76a84ee10348442ccb50ccdbfb288ede5cbb.pdf (Hopcroft and Ullman's analysis bounding it to log*)
https://dl.acm.org/citation.cfm?doid=62.2160 (Tarjan's bounding of union-find to inverse Ackermann)
https://pdfs.semanticscholar.org/b716/349a3072afbede9f0fb8f561a8e0f297baf0.pdf (survey of data structures used to solve disjoint-set-union problems)

Even Wikipedia calls it find() :)

(We call it find in all of the LLVM implementations.)

Second:

EquivalenceClasses.h already implements this data structure, but without union by rank. It does do path compression.

We should not end up with two. I'm fine if we go with your implementation, but the end result (doesn't have to be in this patch) should be only one of these classes existing.

include/llvm/ADT/DisjointSetUnion.h
84	I would not bother making this non-recursive (I can't recall ever having seen a production implementation that is non-recursive). Since you are doing union-by-rank, depth is limited to log(total number of items), i.e., any root node of rank X must have >= 2^X items under it. The worst case is if you only ever call find after all of the unions, and then you have log(n) recursion worst case here. For LLVM, it's probably bounded in the millions for most practical problems, so somewhere between 20-24 is my guess at worst case, assuming 16 million items. If you call find, and intersperse the two, you will pretty much never get a depth > 5. (Amortized, it can only be > 5 if you have more than 2^65535 items; non-amortized, very hard to say, but my guess is <= 10 for most problem types.) We also do this recursively in the other implementations of this data structure we have (for example, EquivalenceClasses and AliasSetTracker).
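
The logarithmic depth bound can be checked with a small experiment. The sketch below is not from the patch: `SizedForest` and `worstCaseDepth` are hypothetical names, elements are plain `int` indices, and path compression is deliberately left out so the tree depth stays observable. It merges trees of equal size only, which is the ordering that produces the deepest trees union by size allows.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Union by size without path compression, so tree depth can be observed.
struct SizedForest {
  std::vector<int> Parent, Size;

  explicit SizedForest(int N) : Parent(N), Size(N, 1) {
    for (int I = 0; I < N; ++I)
      Parent[I] = I; // every element starts as its own root
  }

  int find(int X) const {
    while (Parent[X] != X)
      X = Parent[X];
    return X;
  }

  void unite(int X, int Y) {
    int RX = find(X), RY = find(Y);
    if (RX == RY)
      return;
    if (Size[RX] > Size[RY])
      std::swap(RX, RY);
    Parent[RX] = RY; // the smaller tree goes under the larger one
    Size[RY] += Size[RX];
  }

  // Number of parent links between X and its root.
  int depth(int X) const {
    int D = 0;
    for (; Parent[X] != X; X = Parent[X])
      ++D;
    return D;
  }
};

// Adversarial order: always merge two trees of equal size.
// Returns the maximum depth reached for 2^K items.
inline int worstCaseDepth(int K) {
  int N = 1 << K;
  SizedForest F(N);
  for (int Step = 1; Step < N; Step *= 2)
    for (int I = 0; I + Step < N; I += 2 * Step)
      F.unite(I, I + Step);
  int Max = 0;
  for (int I = 0; I < N; ++I)
    if (F.depth(I) > Max)
      Max = F.depth(I);
  return Max;
}
```

For 2^K items this ordering reaches depth K and no more, which matches the estimate above: 16 million items is about 2^24, so a recursive find would need at most about 24 frames even before path compression helps.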

I didn't know that EquivalenceClasses already does this. After looking into the code, I agree that it does pretty much the same thing. I guess it's easier to add the ranking heuristic there than to duplicate this piece of logic.

I'll check whether EquivalenceClasses solves the compile-time problem for which I wrote this one. If it does, I think the better way will be to base my solution on it, and then possibly add the ranking there.

Thanks for pointing it out! :)

It seems that EquivalenceClasses does exactly what we need. Surprisingly, it didn't have unit tests, so I've committed them as rL319018. I am abandoning this patch and will rewrite the dependent patches so that they use EquivalenceClasses.

I am also considering adding the rank heuristic to EquivalenceClasses in the future.

Abandoning this one.

mkazantsev removed a child revision: D40428: [SCEV][NFC] More efficient caching in CompareSCEVComplexity. Nov 27 2017, 3:32 AM

Revision Contents

Path	Size
include/llvm/ADT/DisjointSetUnion.h	110 lines
unittests/ADT/CMakeLists.txt	1 line
unittests/ADT/DisjointSetUnionTest.cpp	99 lines

Diff 124161

include/llvm/ADT/DisjointSetUnion.h

//===- DisjointSetUnion.h - A structure for merging sets --------*- C++ -*-===//
//
//                     The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//===----------------------------------------------------------------------===//

#ifndef LLVM_ADT_DISJOINTSETUNION_H
#define LLVM_ADT_DISJOINTSETUNION_H

#include "llvm/ADT/DenseMap.h"
#include <cassert>
#include <utility>

namespace llvm {

// This data structure supports the following queries:
//
// 1) Given two elements X and Y, merge the set containing X and the set
//    containing Y into their union.
// 2) Check whether two given elements X and Y belong to the same set.
//
// Before any query of type 1) is made, every element belongs to its own set
// (containing only that element). Thus, the whole universe of entities of
// type T is a union of disjoint sets that can be merged.
//
// We represent each set as a tree with exactly one vertex that is reachable
// from all other vertices of the set. This vertex is called the head, and
// every edge goes from some vertex V to Parent[V]. Every vertex is either the
// head of its set or has an immediate parent vertex.
//
// To achieve good query complexity, we use two heuristics.
// First, we compress paths in trees (attaching all traversed nodes directly
// to the head) every time a head is looked up, which both query types do.
// Second, when merging trees during a query of type 1), we attach the head of
// the smaller tree to the bigger tree.
// Combined, these two heuristics give an amortized complexity that is close
// to O(1) for both queries.
template <typename T>
class DisjointSetUnion {
public:
  // Check whether or not X and Y belong to one set after merges made so far.
  bool isInSameSet(T X, T Y) {
    return head(X) == head(Y);
  }

  // Merge the set that contains X and the set that contains Y into a new set M,
  // which is a union of these sets.
  void mergeSetsOf(T X, T Y) {
    T H1 = head(X);
    T H2 = head(Y);
    if (H1 != H2) {
      // If there were two different sets, merge them. Use the ranking
      // heuristic: attach the smaller tree to the bigger one.
      int R1 = rank(H1);
      int R2 = rank(H2);
      // After the swap, H1 is the head of the smaller set.
      if (R1 > R2)
        std::swap(H1, H2);
      // Attach the head of the smaller set directly to the head of the bigger
      // one.
      Parent[H1] = H2;
      // H1 is no longer a head and will never become one. We don't have to
      // store a map entry for it.
      Rank.erase(H1);
      // Now the set of H2 contains all elements of the old sets of H1 and H2.
      Rank[H2] = R1 + R2;
    }
  }

  // Clear the information about merges and return this DSU to its initial
  // state, in which every vertex belongs to its own one-element set.
  void clear() {
    Parent.clear();
    Rank.clear();
  }

private:
  // For a given vertex X, return the head of its set.
  T head(T X) {
    auto It = Parent.find(X);
    if (It != Parent.end())
      // Head is Parent(Parent(...(Parent(X))...)). After we find it, we can
      // compress this path, making Head the immediate parent of X. We do so
      // recursively for the whole chain of parents.
      return Parent[X] = head(It->second);
    // Every vertex that doesn't have a parent is the head of its set.
    return X;
  }

  // Calculates the rank of a head vertex X. The rank of a head is the number
  // of elements in its set.
  int rank(T X) {
    assert(head(X) == X && "Attempt to calculate rank for non-head vertex?");
    auto It = Rank.find(X);
    if (It != Rank.end()) {
      assert(It->second > 1 && "Storing rank of one-element set in the map?");
      return It->second;
    }
    // This set consists of the sole element X.
    return 1;
  }

  // Stores immediate parents of vertices.
  DenseMap<T, T> Parent;
  // Memoizes calculated ranks of head vertices.
  DenseMap<T, int> Rank;
};

} // end namespace llvm

#endif // LLVM_ADT_DISJOINTSETUNION_H

unittests/ADT/CMakeLists.txt

[10 lines not shown]
 set(ADTSources
   BitVectorTest.cpp
   BreadthFirstIteratorTest.cpp
   BumpPtrListTest.cpp
   DAGDeltaAlgorithmTest.cpp
   DeltaAlgorithmTest.cpp
   DenseMapTest.cpp
   DenseSetTest.cpp
   DepthFirstIteratorTest.cpp
+  DisjointSetUnionTest.cpp
   FoldingSet.cpp
   FunctionRefTest.cpp
   HashingTest.cpp
   IListBaseTest.cpp
   IListIteratorTest.cpp
   IListNodeBaseTest.cpp
   IListNodeTest.cpp
   IListSentinelTest.cpp
[46 lines not shown]

unittests/ADT/DisjointSetUnionTest.cpp

//=== llvm/unittest/ADT/DisjointSetUnionTest.cpp - DSU structure tests ----===//
//
//                     The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//

#include "llvm/ADT/DisjointSetUnion.h"
#include "gtest/gtest.h"

using namespace llvm;

namespace llvm {

TEST(DisjointSetUnionTest, NoMerges) {
  DisjointSetUnion<int> DSU;
  // Before any sets are merged, check that every element belongs to its own
  // one-element set.
  for (int i = 0; i < 3; i++)
    for (int j = 0; j < 3; j++)
      if (i == j)
        EXPECT_TRUE(DSU.isInSameSet(i, j));
      else
        EXPECT_FALSE(DSU.isInSameSet(i, j));
}

TEST(DisjointSetUnionTest, SimpleMerge1) {
  DisjointSetUnion<int> DSU;
  // Check that once we merge (A, B), (B, C), (C, D), then all elements belong
  // to one set.
  DSU.mergeSetsOf(0, 1);
  DSU.mergeSetsOf(1, 2);
  DSU.mergeSetsOf(2, 3);
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      EXPECT_TRUE(DSU.isInSameSet(i, j));
}

TEST(DisjointSetUnionTest, SimpleMerge2) {
  DisjointSetUnion<int> DSU;
  // Check that once we merge (A, B), (C, D), (A, C), then all elements belong
  // to one set.
  DSU.mergeSetsOf(0, 1);
  DSU.mergeSetsOf(2, 3);
  DSU.mergeSetsOf(0, 2);
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      EXPECT_TRUE(DSU.isInSameSet(i, j));
}

TEST(DisjointSetUnionTest, Clear) {
  DisjointSetUnion<int> DSU;
  // Check that clear() returns the DSU to its initial state.
  DSU.mergeSetsOf(0, 1);
  DSU.mergeSetsOf(1, 2);
  DSU.clear();
  for (int i = 0; i < 3; i++)
    for (int j = 0; j < 3; j++)
      if (i == j)
        EXPECT_TRUE(DSU.isInSameSet(i, j));
      else
        EXPECT_FALSE(DSU.isInSameSet(i, j));
}

TEST(DisjointSetUnionTest, TwoSets) {
  DisjointSetUnion<int> DSU;
  // Form sets of odd and even numbers, check that we split them into these
  // two sets correctly.
  for (int i = 0; i < 30; i += 2)
    DSU.mergeSetsOf(0, i);
  for (int i = 1; i < 30; i += 2)
    DSU.mergeSetsOf(1, i);

  for (int i = 0; i < 30; i++)
    for (int j = 0; j < 30; j++)
      if (i % 2 == j % 2)
        EXPECT_TRUE(DSU.isInSameSet(i, j));
      else
        EXPECT_FALSE(DSU.isInSameSet(i, j));
}

TEST(DisjointSetUnionTest, MultipleSets) {
  DisjointSetUnion<int> DSU;
  // Split numbers from [0, 100) into sets so that values in the same set have
  // equal remainders (mod 17).
  for (int i = 0; i < 100; i++)
    DSU.mergeSetsOf(i % 17, i);

  for (int i = 0; i < 100; i++)
    for (int j = 0; j < 100; j++)
      if (i % 17 == j % 17)
        EXPECT_TRUE(DSU.isInSameSet(i, j));
      else
        EXPECT_FALSE(DSU.isInSameSet(i, j));
}

} // end namespace llvm