This is an archive of the discontinued LLVM Phabricator instance.

[llvm-exegesis] InstructionBenchmarkClustering::dbScan(): use llvm::SetVector<> instead of ILLEGAL std::unordered_set<>
ClosedPublic

Authored by lebedev.ri on Nov 10 2018, 12:10 PM.

Details

Summary

Test data: 500k lines of benchmark.yaml, 23 MB. (That is a subset of the actual uops benchmark I was trying to analyze!)
Old time:

$ time ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=100000 -benchmarks-file=/tmp/benchmarks.yaml -analysis-inconsistencies-output-file=/tmp/clusters.html &> /dev/null

real    0m24.884s
user    0m24.099s
sys     0m0.785s

New time:

$ time ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=100000 -benchmarks-file=/tmp/benchmarks.yaml -analysis-inconsistencies-output-file=/tmp/clusters.html &> /dev/null

real    0m10.469s
user    0m9.797s
sys     0m0.672s

So about -60% run time. And that isn't even the best part yet.

Old:

  • calls to allocation functions: 106560180 (yes, 107 *million* allocations.)
  • bytes allocated in total (ignoring deallocations): 12.17 GB

New:

  • calls to allocation functions: 3347676 (-96.86%) (just ~3.3 million)
  • bytes allocated in total (ignoring deallocations): 10.52 GB (~2GB less)

Two points I want to raise:

  • std::unordered_set<> should not have been used there in the first place. It is banned by https://llvm.org/docs/ProgrammersManual.html#other-set-like-container-options
  • There are no tests, so I'm not fully sure this is correct. Since it was an unordered set, I guess there are zero restrictions on the order, and anything will be ok?

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision.Nov 10 2018, 12:10 PM

I could be missing something, but I don't understand why ToProcess needs to be a set-like container since we're erasing elements as we go (ie the erased elements won't be duplicate checked on next insertion). We skip any that have been previously processed in the inner loop too, which seems like it's doing the same work the set would be doing.

The isNoise() check also looks odd, since if CurrentCluster.Id has id kNoise then it could push the same index into CurrentCluster.PointIndices an unspecified number of times depending on the order that the elements are inserted and removed from ToProcess, but if CurrentCluster.Id can't be kNoise then that's not relevant.

From the docs:

SetVector is an adapter class that defaults to using std::vector

so calling erase on the first element isn't going to be terribly efficient either.

I could be missing something, but I don't understand why ToProcess needs to be a set-like container since we're erasing elements as we go (ie the erased elements won't be duplicate checked on next insertion). We skip any that have been previously processed in the inner loop too, which seems like it's doing the same work the set would be doing.

Here I only improve the choice of the data structure, and don't make any algorithm-changing decisions.

I think the usage of Set is correct.
We will, at most, fail to deduplicate one element (the one we just removed) out of N returned by rangeQuery(Q).
But if we don't do any deduplication at all, then we will add all the N elements. And that sounds exponential.
Maybe it should be a custom SetVector, where the Set retains the knowledge about removed elements; I don't know yet.

The isNoise() check also looks odd, since if CurrentCluster.Id has id kNoise then it could push the same index into CurrentCluster.PointIndices an unspecified number of times depending on the order that the elements are inserted and removed from ToProcess, but if CurrentCluster.Id can't be kNoise then that's not relevant.

Interesting question, but irrelevant for the patch at hand.

From the docs:

SetVector is an adapter class that defaults to using std::vector

so calling erase on the first element isn't going to be terribly efficient either.

Well, I'd call -60% pretty efficient.
Further patches [do/will] improve upon that.

I think the usage of Set is correct.
We will, at most, fail to deduplicate one element (the one we just removed) out of N returned by rangeQuery(Q).
But if we don't do any deduplication at all, then we will add all the N elements. And that sounds exponential.

Ah I meant with deduplication via something like:

auto MiddleIt = ToProcess.insert(ToProcess.end(), Neighbors.begin(), Neighbors.end());
std::inplace_merge(ToProcess.begin(), MiddleIt, ToProcess.end());
ToProcess.erase(std::unique(ToProcess.begin(), ToProcess.end()), ToProcess.end());

when we're inserting elements, and then popping from the back of the vector at the beginning of the loop (since the iteration order for a std::unordered_set was undefined anyway). That does rely on the fact that rangeQuery() currently returns a sorted vector though.

Here I only improve the choice of the data structure, and don't make any algorithm-changing decisions.

I don't think the above changes the algorithmic behaviour, but I could be wrong.

Interesting question, but irrelevant for the patch at hand.

That's a valid point :)

I could be missing something, but I don't understand why ToProcess needs to be a set-like container since we're erasing elements as we go (ie the erased elements won't be duplicate checked on next insertion). We skip any that have been previously processed in the inner loop too, which seems like it's doing the same work the set would be doing.

The isNoise() check also looks odd, since if CurrentCluster.Id has id kNoise then it could push the same index into CurrentCluster.PointIndices an unspecified number of times depending on the order that the elements are inserted and removed from ToProcess, but if CurrentCluster.Id can't be kNoise then that's not relevant.

From the docs:

SetVector is an adapter class that defaults to using std::vector

so calling erase on the first element isn't going to be terribly efficient either.

Depends on how you use the "vector". The density of the elements gives it a big edge, especially paired with a slab allocator. If you can predict a rough upper bound on the amount of data you need to allocate, you can have an adapter class that simply emulates vector/queue semantics by using a window into that array: imagine the window starting at 0; 1000 inserts later it spans 1000 elements, and now removing something from the front is as simple as advancing the start of the window by 1-1000. If you're not using set semantics (I don't think you are, from the conversation yesterday), you can also gain a huge advantage copying data across such data structures.

@lebedev.ri, std::deque<> gave you a huge performance increase because it's essentially a combination of vectors and a slab allocator that usually allocates bigger slabs every time you exceed the capacity (IIRC), with a fancy iterator to hide crossing slab edges. Due to using slab allocation it also provides stable references as long as you only operate on the front and back of the queue (think tailq with an allocator/fancy iterator on top).

MaskRay requested changes to this revision.EditedNov 11 2018, 4:45 PM

When used to cluster points (measurements) this way with DBSCAN, the order in which points are processed does not matter. (But I would feel slightly better if the result were deterministic, i.e. tools/llvm-exegesis/lib/Analysis.cpp#L199 should sort PointIndices first).

I actually think a bitset (a vector<char> may be better) to track whether an element has been inserted, plus a pre-allocated vector (sized to the number of points, used as a stack), is the best option. That will be even faster than a DenseSet<size_t>.

This revision now requires changes to proceed.Nov 11 2018, 4:45 PM

Not related to this patch series but some complaints about the existing interface (sorry)... I feel it is sort of over-engineered, e.g.

  • kError (related to InstructionBenchmark::Error) is not used but it is defined as a special ClusterId value.
  • kNoise and kUndef can just be exposed as public members, and no static member functions noise() / error() are needed.
  • PerInstructionStats::min(). It is defined as numeric_limits<double>::min() but never modified.

When used to cluster points (measurements) this way with DBSCAN, the order in which points are processed does not matter.

Are you implying that it is incorrect to use anything but an unordered set here?
I can't just use DenseSet here; it ends up being at least one order of magnitude slower.
And we most certainly can't keep std::unordered_set<>.

(But I would feel slightly better if the result were deterministic, i.e. tools/llvm-exegesis/lib/Analysis.cpp#L199 should sort PointIndices first).

I actually think a bitset (a vector<char> may be better) to track whether an element has been inserted

Yes, that is likely, will investigate.

plus a pre-allocated vector (sized to the number of points, used as a stack) is the best option. That will be even faster than a DenseSet<size_t>.

courbet accepted this revision.Nov 19 2018, 2:25 AM

Two points I want to raise:

std::unordered_set<> should not have been used there in the first place. It is banned by https://llvm.org/docs/ProgrammersManual.html#other-set-like-container-options

Thanks Roman, I was not aware of that.

There are no tests, so I'm not fully sure this is correct.

There are basic tests in unittests/tools/llvm-exegesis/ClusteringTests.cpp.

Since it was an unordered set, I guess there are zero restrictions on the order, and anything will be ok?

IIUC, SetVector is just more restrictive than unordered_set, so it should be fine.

I could be missing something, but I don't understand why ToProcess needs to be a set-like container since we're erasing elements as we go (ie the erased elements won't be duplicate checked on next insertion). We skip any that have been previously processed in the inner loop too, which seems like it's doing the same work the set would be doing.

IIRC, rangeQuery() can return points that need to be deduped against the already present ones.

Two points I want to raise:

std::unordered_set<> should not have been used there in the first place. It is banned by https://llvm.org/docs/ProgrammersManual.html#other-set-like-container-options

Thanks Roman, I was not aware of that.

Thank you for the reviews!
Any chance you could stamp the last 3 remaining reviews so I could land everything? :)

There are no tests, so I'm not fully sure this is correct.

There are basic tests in unittests/tools/llvm-exegesis/ClusteringTests.cpp.

Since it was an unordered set, I guess there are zero restrictions on the order, and anything will be ok?

IIUC, SetVector is just more restrictive than unordered_set, so it should be fine.

This revision was not accepted when it landed; it landed in state Needs Revision.Nov 19 2018, 5:30 AM
This revision was automatically updated to reflect the committed changes.