This is an archive of the discontinued LLVM Phabricator instance.

[llvm-exegesis] Option to lobotomize dbscan (PR40880)
Abandoned · Public

Authored by lebedev.ri on Mar 19 2019, 3:42 AM.

Details

Reviewers
courbet
gchatelet
Summary

Let's suppose we have measured 4 different opcodes and got: 0.5, 1.0, 1.5, 2.0.
Let's suppose we are using -analysis-clustering-epsilon=0.5.
By default we will now start processing the 0.5 point, find that 1.0 is its neighbor, and add them to a new cluster.
Then we will notice that 1.5 is a neighbor of 1.0 and add it to that same cluster.
Then we will notice that 2.0 is a neighbor of 1.5 and add it to that same cluster.
So all these points end up in the same cluster.
This may or may not be a correct implementation of the DBSCAN clustering algorithm.

But this is rather horribly broken for the purpose of comparing clusters with the LLVM sched data.
Let's suppose all those opcodes are currently in the same sched cluster.
If I specify -analysis-inconsistency-epsilon=0.5, then no matter what the
checked-in LLVM values are, this cluster will never match them,
and thus it will always be displayed as inconsistent.

The obvious solution is to split some of these opcodes off into a different sched cluster.
But how do I do that? Out of the 4 opcodes displayed in the inconsistency report,
which ones are the "bad ones", i.e. the most different from the checked-in data?
I'd need to go into the .yaml and look it up manually.

The trivial solution is, when creating clusters, not to run the full DBSCAN algorithm,
but instead to "pick some unclustered point, pick all unclustered points that are its
neighbors, put them all into a new cluster, repeat". And as it happens, we arrive
at that algorithm simply by not performing the "add neighbors of a neighbor to the cluster" step.
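
To make that concrete, here is a minimal, self-contained sketch of the two behaviors (illustrative only, not the actual llvm-exegesis code; it assumes one dimension, an absolute-difference metric, and ignores DBSCAN's minPts/noise handling):

```cpp
// Toy comparison of full DBSCAN-style neighbor expansion vs. the proposed
// "no expansion" variant. Illustrative only; not the llvm-exegesis code.
#include <cmath>
#include <cstdio>
#include <vector>

// Indices of all still-unclustered points within Epsilon of Points[I].
static std::vector<size_t> findNeighbors(const std::vector<double> &Points,
                                         const std::vector<int> &ClusterId,
                                         size_t I, double Epsilon) {
  std::vector<size_t> Neighbors;
  for (size_t J = 0; J != Points.size(); ++J)
    if (J != I && ClusterId[J] < 0 &&
        std::fabs(Points[J] - Points[I]) <= Epsilon)
      Neighbors.push_back(J);
  return Neighbors;
}

// Expand == true: transitive growth, neighbors of neighbors join the cluster.
// Expand == false: the variant proposed here; a cluster is one seed point
// plus its immediate neighbors, and nothing more.
static std::vector<int> cluster(const std::vector<double> &Points,
                                double Epsilon, bool Expand) {
  std::vector<int> ClusterId(Points.size(), -1); // -1 means unclustered
  int NextId = 0;
  for (size_t I = 0; I != Points.size(); ++I) {
    if (ClusterId[I] >= 0)
      continue; // already assigned
    const int Id = NextId++;
    ClusterId[I] = Id;
    std::vector<size_t> Queue = findNeighbors(Points, ClusterId, I, Epsilon);
    while (!Queue.empty()) {
      const size_t J = Queue.back();
      Queue.pop_back();
      if (ClusterId[J] >= 0)
        continue;
      ClusterId[J] = Id;
      if (Expand) { // this is exactly the step the patch skips
        const std::vector<size_t> More =
            findNeighbors(Points, ClusterId, J, Epsilon);
        Queue.insert(Queue.end(), More.begin(), More.end());
      }
    }
  }
  return ClusterId;
}

int main() {
  const std::vector<double> Points = {0.5, 1.0, 1.5, 2.0};
  for (const bool Expand : {true, false}) {
    const std::vector<int> Ids = cluster(Points, /*Epsilon=*/0.5, Expand);
    std::printf("%s:", Expand ? "full expansion" : "no expansion");
    for (const int Id : Ids)
      std::printf(" %d", Id);
    std::printf("\n");
  }
  // Prints:
  //   full expansion: 0 0 0 0   (one chained cluster)
  //   no expansion:   0 0 1 1   (clusters {0.5, 1.0} and {1.5, 2.0})
}
```

On the 0.5/1.0/1.5/2.0 example from above, the full variant chains all four points into one cluster, while the no-expansion variant yields {0.5, 1.0} and {1.5, 2.0}.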

(This will also help with opcode denoising/stabilization)

While the current default is good for the abstract goal of 'analyse the clustering of measurements',
I'm not sure how often that is the actual goal, as opposed to 'compare LLVM data with measurements'.
So I'm not sure what the default should be.

Thoughts?

This is yet another step to bring me closer to being able to continue the cleanup of the bdver2 sched model.

Fixes PR40880.

Diff Detail

Repository
rL LLVM

Event Timeline

lebedev.ri created this revision. Mar 19 2019, 3:42 AM
lebedev.ri removed a project: Restricted Project. Mar 19 2019, 3:42 AM
Herald added a project: Restricted Project. Mar 19 2019, 3:42 AM
lebedev.ri changed the repository for this revision from rLLDB LLDB to rL LLVM. Mar 19 2019, 3:43 AM
lebedev.ri removed a project: Restricted Project.

This may or may not be a correct implementation of the DBSCAN clustering algorithm.

Yes, that's what we want. This really is one cluster.

While the current default is good for the abstract goal of 'analyse the clustering of measurements',
I'm not sure how often that is the actual goal, as opposed to 'compare LLVM data with measurements'.
So I'm not sure what the default should be.

I think that's a fair point: I originally intended one cluster to be, roughly, one sched class. But why do you need clusters at all if all you care about is the mismatch between each instruction and its scheduling data?

Then why do you need clustering at all? Why not create one cluster per instruction?

This may or may not be a correct implementation of the DBSCAN clustering algorithm.

Yes, that's what we want. This really is one cluster.

Yes, I do think that is the correct implementation of the algorithm.

While the current default is good for the abstract goal of 'analyse the clustering of measurements',
I'm not sure how often that is the actual goal, as opposed to 'compare LLVM data with measurements'.
So I'm not sure what the default should be.

I think that's a fair point: I originally intended one cluster to be, roughly, one sched class.
But why do you need clusters at all if all you care about is the mismatch between each instruction and its scheduling data?

Then why do you need clustering at all? Why not create one cluster per instruction?

No, the clustering is actually good.
If I only take one measurement set and then dumbly change all the sched values to match,
then when I take a new measurement set I will be "very" surprised that
it all of a sudden no longer matches the checked-in values.

Thus I take multiple measurements (3..10), and if those measurements are different,
the "cluster stabilization" (D58355) kicks in and I only get non-flaky clusters by default.

I'm not saying that this *is* the solution, but I do believe the problem exists,
and I'm not sure what other approaches there are to solve it.

There is another bug there, I suppose: we really should sort the measurements we loaded.
E.g. given measurements 0.5, 1.0, 1.5, if they are in that order then
you get (I think, didn't check) two clusters: 0.5+1.0 and 1.5 (correct).
But if they are ordered `1.0, 0.5, 1.5`, then they all end up in a single cluster.
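
(With the hypothetical cluster() helper from the sketch in the summary, and Epsilon = 0.5, the order dependence looks like this:)

```cpp
// The no-expansion variant is order dependent; the full one is not.
cluster({0.5, 1.0, 1.5}, 0.5, /*Expand=*/false); // ids: 0 0 1 -> two clusters
cluster({1.0, 0.5, 1.5}, 0.5, /*Expand=*/false); // ids: 0 0 0 -> one cluster
```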

I'm not saying that this *is* the solution, but I do believe the problem exists,
and I'm not sure what other approaches there are to solve it.

There is another bug there, I suppose: we really should sort the measurements we loaded.

Sorting for clustering only works in one dimension. As soon as you have several dimensions (e.g. uops), it becomes more complex (hence DBSCAN & others).

E.g. given measurements 0.5, 1.0, 1.5, if they are in that order then
you get (I think, didn't check) two clusters: 0.5+1.0 and 1.5 (correct).
But if they are ordered `1.0, 0.5, 1.5`, then they all end up in a single cluster.

DBSCAN without lobotomization is mostly insensitive to point ordering (see https://reviews.llvm.org/D59693).

To make my suggestion clearer: if you just want to compare measured instruction data to its checked-in data, why not cluster by instruction opcode (to merge the several measurements for a given instruction) and run the analysis on that?

Essentially: clusterid == opcode; you do away with DBSCAN entirely.
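
(A minimal sketch of what that could look like, with illustrative names only; this is not actual llvm-exegesis code:)

```cpp
// clusterid == opcode: merge all measurements of one instruction into one
// cluster and skip DBSCAN entirely. Hypothetical sketch.
#include <vector>

std::vector<int> clusterByOpcode(const std::vector<unsigned> &Opcodes) {
  std::vector<int> ClusterId(Opcodes.size());
  for (size_t I = 0; I != Opcodes.size(); ++I)
    ClusterId[I] = static_cast<int>(Opcodes[I]); // the opcode is the cluster id
  return ClusterId;
}
```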


To make my suggestion clearer: if you just want to compare measured instruction data to its checked-in data, why not cluster by instruction opcode (to merge the several measurements for a given instruction) and run the analysis on that?

Essentially: clusterid == opcode; you do away with DBSCAN entirely.

I think that's a fair point: I originally intended one cluster to be, roughly, one sched class.
But why do you need clusters at all if all you care about is the mismatch between each instruction and its scheduling data?

Then why do you need clustering at all? Why not create one cluster per instruction?

No, the clustering is actually good.
If I only take one measurement set and then dumbly change all the sched values to match,
then when I take a new measurement set I will be "very" surprised that
it all of a sudden no longer matches the checked-in values.

Thus I take multiple measurements (3..10), and if those measurements are different,
the "cluster stabilization" (D58355) kicks in and I only get non-flaky clusters by default.

To reword: if I do simple clustering by opcode, I will then need to add yet another
"stabilization" step: for each cluster, check that every measurement is a neighbor of all
the other points in that cluster, and if not, mark the cluster as noise.
(Well, not every point vs. every point, just the lower/upper triangle excluding the diagonal.)
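
(A sketch of that check, under the same illustrative one-dimensional assumptions as the earlier sketch:)

```cpp
// Every pair of points in the cluster must be within Epsilon of each other;
// otherwise the whole cluster should be marked as noise. Only the lower
// triangle of the pairwise-distance matrix is visited, diagonal excluded.
#include <cmath>
#include <vector>

bool isTightCluster(const std::vector<double> &ClusterPoints, double Epsilon) {
  for (size_t I = 1; I != ClusterPoints.size(); ++I)
    for (size_t J = 0; J != I; ++J)
      if (std::fabs(ClusterPoints[I] - ClusterPoints[J]) > Epsilon)
        return false; // found a too-distant pair: mark the cluster as noise
  return true;
}
```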

I can do that instead; maybe that would even be better than this (no dependency on measurement ordering).

Any advice on how to proceed? I would really love to see this issue resolved :)

To reword: if I do simple clustering by opcode, I will then need to add yet another
"stabilization" step: for each cluster, check that every measurement is a neighbor of all
the other points in that cluster, and if not, mark the cluster as noise.
(Well, not every point vs. every point, just the lower/upper triangle excluding the diagonal.)

OK I see, thanks. To sum up my understanding: there are some areas where two clusters that should be separate are so noisy that there is a dense region connecting them, so even taking a small epsilon will not separate them. You want to reject these merged clusters based on the variance of the points within the cluster.

One suggestion I have is to compute the variance within the cluster (this can be done incrementally when adding points to the cluster) and reject clusters where the variance is more than a certain threshold. What do you think?
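
(For the incremental part, Welford's online algorithm is one standard way to update a variance as points are added; a hedged sketch with illustrative names:)

```cpp
// Online (incremental) variance via Welford's algorithm. Illustrative sketch.
#include <cstddef>

struct RunningVariance {
  std::size_t Count = 0;
  double Mean = 0.0;
  double M2 = 0.0; // running sum of squared deviations from the mean

  // O(1) update when a new point joins the cluster.
  void add(double X) {
    ++Count;
    const double Delta = X - Mean;
    Mean += Delta / Count;
    M2 += Delta * (X - Mean);
  }

  // Population variance; use M2 / (Count - 1) for the sample variance.
  double variance() const { return Count > 1 ? M2 / Count : 0.0; }
};
// A cluster would then be rejected when variance() exceeds the threshold.
```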

I can do that instead; maybe that would even be better than this (no dependency on measurement ordering).

Yes, I would really like to avoid the dependence on the ordering.

To reword: if I do simple clustering by opcode, I will then need to add yet another
"stabilization" step: for each cluster, check that every measurement is a neighbor of all
the other points in that cluster, and if not, mark the cluster as noise.
(Well, not every point vs. every point, just the lower/upper triangle excluding the diagonal.)

OK I see, thanks. To sum up my understanding: there are some areas where two clusters that should be separate are so noisy that there is a dense region connecting them, so even taking a small epsilon will not separate them. You want to reject these merged clusters based on the variance of the points within the cluster.

There are two situations, as far as I can tell
(also, I'm only looking at the case with a single dimension - latency/uops/rthroughput - not a combination of measurements):

  1. Let's suppose we have measurements 0.5, 1.0, 1.5. If they are all from the same opcode, they will currently be put into the same cluster. This is unwanted (at least for me).
  2. If you have measurements 0.5 (opcode a) and 3.5 (opcode a), they will be put into different clusters, which, while correct, is also not quite what we want, because they are from the same opcode. That should be an "unstable" cluster. (It is unspecified why that happened: it could be noisy measurements, CPU pipeline quirks, a register fastpath, a dependence on the register values, etc.)

The second issue on its own I have resolved with D58355, but the first issue remains.
So I'm trying to solve the first issue without regressing the second.

One suggestion I have is to compute the variance within the cluster (this can be done incrementally when adding points to the cluster) and reject clusters where the variance is more than a certain threshold. What do you think?

I can do that instead; maybe that would even be better than this (no dependency on measurement ordering).

Yes, I would really like to avoid the dependence on the ordering.

Okay, I will try the "cluster by opcode + stabilize" approach, thanks!