This is an archive of the discontinued LLVM Phabricator instance.

[llvm-exegesis] InstructionBenchmarkClustering::dbScan(): use llvm::SetVector<> instead of ILLEGAL std::unordered_set<>
ClosedPublic

Authored by lebedev.ri on Nov 10 2018, 12:10 PM.

Details

Summary

Test data: 500k lines of benchmark.yaml, 23 MB. (That is a subset of the actual uops benchmark I was trying to analyze!)
Old time:

$ time ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=100000 -benchmarks-file=/tmp/benchmarks.yaml -analysis-inconsistencies-output-file=/tmp/clusters.html &> /dev/null

real    0m24.884s
user    0m24.099s
sys     0m0.785s

New time:

$ time ./bin/llvm-exegesis -mode=analysis -analysis-epsilon=100000 -benchmarks-file=/tmp/benchmarks.yaml -analysis-inconsistencies-output-file=/tmp/clusters.html &> /dev/null

real    0m10.469s
user    0m9.797s
sys     0m0.672s

So about -60% run time. And that isn't even the best part yet.

Old:

  • calls to allocation functions: 106560180 (yes, 107 *million* allocations.)
  • bytes allocated in total (ignoring deallocations): 12.17 GB

New:

  • calls to allocation functions: 3347676 (-96.86%) (just ~3.3 million)
  • bytes allocated in total (ignoring deallocations): 10.52 GB (~2GB less)

Two points I want to raise:

  • std::unordered_set<> should not have been used there in the first place. It is banned by https://llvm.org/docs/ProgrammersManual.html#other-set-like-container-options
  • There are no tests, so I'm not fully sure this is correct. Since it was an unordered set, I guess there are zero restrictions on the order, and anything will be ok?

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision.Nov 10 2018, 12:10 PM

I could be missing something, but I don't understand why ToProcess needs to be a set-like container since we're erasing elements as we go (ie the erased elements won't be duplicate checked on next insertion). We skip any that have been previously processed in the inner loop too, which seems like it's doing the same work the set would be doing.

The isNoise() check also looks odd, since if CurrentCluster.Id has id kNoise then it could push the same index into CurrentCluster.PointIndices an unspecified number of times depending on the order that the elements are inserted and removed from ToProcess, but if CurrentCluster.Id can't be kNoise then that's not relevant.

From the docs:

SetVector is an adapter class that defaults to using std::vector

so calling erase on the first element isn't going to be terribly efficient either.

I could be missing something, but I don't understand why ToProcess needs to be a set-like container since we're erasing elements as we go (ie the erased elements won't be duplicate checked on next insertion). We skip any that have been previously processed in the inner loop too, which seems like it's doing the same work the set would be doing.

Here I only improve the choice of the data structure, and don't make any algorithm-changing decisions.

I think the usage of Set is correct.
We will, at most, fail to deduplicate one element (the one we just removed) out of N returned by rangeQuery(Q).
But if we don't do any deduplication at all, then we will add all the N elements. And that sounds exponential.
Maybe it should be a custom SetVector, where the Set retains the knowledge about removed elements; I don't know yet.

The isNoise() check also looks odd, since if CurrentCluster.Id has id kNoise then it could push the same index into CurrentCluster.PointIndices an unspecified number of times depending on the order that the elements are inserted and removed from ToProcess, but if CurrentCluster.Id can't be kNoise then that's not relevant.

Interesting question, but irrelevant for the patch at hand.

From the docs:

SetVector is an adapter class that defaults to using std::vector

so calling erase on the first element isn't going to be terribly efficient either.

Well, I'd call -60% pretty efficient.
Further patches [do/will] improve upon that.

I think the usage of Set is correct.
We will, at most, fail to deduplicate one element (the one we just removed) out of N returned by rangeQuery(Q).
But if we don't do any deduplication at all, then we will add all the N elements. And that sounds exponential.

Ah I meant with deduplication via something like:

auto MiddleIt = ToProcess.insert(ToProcess.end(), Neighbors.begin(), Neighbors.end());
std::inplace_merge(ToProcess.begin(), MiddleIt, ToProcess.end());
ToProcess.erase(std::unique(ToProcess.begin(), ToProcess.end()), ToProcess.end());

when we're inserting elements, and then popping from the back of the vector at the beginning of the loop (since the iteration order for a std::unordered_set was undefined anyway). That does rely on the fact that rangeQuery() currently returns a sorted vector though.

Here I only improve the choice of the data structure, and don't make any algorithm-changing decisions.

I don't think the above changes the algorithmic behaviour, but I could be wrong.

Interesting question, but irrelevant for the patch at hand.

That's a valid point :)

I could be missing something, but I don't understand why ToProcess needs to be a set-like container since we're erasing elements as we go (ie the erased elements won't be duplicate checked on next insertion). We skip any that have been previously processed in the inner loop too, which seems like it's doing the same work the set would be doing.

The isNoise() check also looks odd, since if CurrentCluster.Id has id kNoise then it could push the same index into CurrentCluster.PointIndices an unspecified number of times depending on the order that the elements are inserted and removed from ToProcess, but if CurrentCluster.Id can't be kNoise then that's not relevant.

From the docs:

SetVector is an adapter class that defaults to using std::vector

so calling erase on the first element isn't going to be terribly efficient either.

Depends on how you use the "vector". The density of the elements gives it a big edge, especially paired with a slab allocator. If you can predict a rough upper bound on the amount of data you need to allocate, you can have an adapter class that simply emulates vector/queue semantics by using a window into that array: imagine the window starting at 0; 1000 inserts later it spans 1000 elements, and now removing something from the front is as simple as advancing the start of the window by 1-1000. If you're not using set semantics (I don't think you are, from the conversation yesterday), you can also gain a huge advantage copying data across such data structures.

@lebedev.ri, std::deque<> gave you a huge performance increase because it's essentially a combination of vectors and a slab allocator that usually allocates bigger slabs every time you exceed the capacity (IIRC), with a fancy iterator to hide crossing slab edges. Due to using slab allocation it also provides stable references as long as you only operate on the front and back of the queue (think tailq with an allocator/fancy iterator on top).

MaskRay requested changes to this revision.EditedNov 11 2018, 4:45 PM

When used to cluster points (measurements) this way with DBSCAN, the order in which points are processed does not matter. (But I would feel slightly better if the result were deterministic, i.e. tools/llvm-exegesis/lib/Analysis.cpp#L199 should sort PointIndices first).

I actually think a bitset (a vector<char> may be better) to track whether an element has been inserted, plus a pre-allocated vector (sized to the number of points, used as a stack), is the best option. That will be even faster than a DenseSet<size_t>.

This revision now requires changes to proceed.Nov 11 2018, 4:45 PM

Not related to this patch series but some complaints about the existing interface (sorry)... I feel it is sort of over-engineered, e.g.

  • kError (related to InstructionBenchmark::Error) is not used but it is defined as a special ClusterId value.
  • kNoise and kUndef can just be exposed as public members, and no static member functions noise() / error() are needed.
  • PerInstructionStats::min(). It is defined as numeric_limits<double>::min() but never modified.

When used to cluster points (measurements) this way with DBSCAN, the order in which points are processed does not matter.

Are you implying that it is incorrect to use anything but an unordered set here?
I can't just use DenseSet here; it ends up being at least one order of magnitude slower.
And we most certainly can't keep std::unordered_set<>.

(But I would feel slightly better if the result were deterministic, i.e. tools/llvm-exegesis/lib/Analysis.cpp#L199 should sort PointIndices first).

I actually think a bitset (a vector<char> may be better) to track whether an element has been inserted

Yes, that is likely, will investigate.

plus a pre-allocated vector (sized to the number of points, used as a stack) is the best option. That will be even faster than a DenseSet<size_t>.

courbet accepted this revision.Nov 19 2018, 2:25 AM

Two points I want to raise:

std::unordered_set<> should not have been used there in the first place. It is banned by https://llvm.org/docs/ProgrammersManual.html#other-set-like-container-options

Thanks Roman, I was not aware of that.

There are no tests, so I'm not fully sure this is correct.

There are basic tests in unittests/tools/llvm-exegesis/ClusteringTests.cpp.

Since it was an unordered set, I guess there are zero restrictions on the order, and anything will be ok?

IIUC, SetVector is just more restrictive than unordered_set, so it should be fine.

I could be missing something, but I don't understand why ToProcess needs to be a set-like container since we're erasing elements as we go (ie the erased elements won't be duplicate checked on next insertion). We skip any that have been previously processed in the inner loop too, which seems like it's doing the same work the set would be doing.

IIRC, rangeQuery() can return points that need to be deduped against the already present ones.

Two points I want to raise:

std::unordered_set<> should not have been used there in the first place. It is banned by https://llvm.org/docs/ProgrammersManual.html#other-set-like-container-options

Thanks Roman, I was not aware of that.

Thank you for the reviews!
Any chance you could stamp the last 3 remaining reviews so I could land everything? :)

There are no tests, so I'm not fully sure this is correct.

There are basic tests in unittests/tools/llvm-exegesis/ClusteringTests.cpp.

Since it was an unordered set, I guess there are zero restrictions on the order, and anything will be ok?

IIUC, SetVector is just more restrictive than unordered_set, so it should be fine.

This revision was not accepted when it landed; it landed in state Needs Revision.Nov 19 2018, 5:30 AM
This revision was automatically updated to reflect the committed changes.