Let's suppose we have measured 4 different opcodes, and got: 0.5, 1.0, 1.5, 2.0.
Let's suppose we are using -analysis-clustering-epsilon=0.5.
By default, we will start by processing the 0.5 point, find that 1.0 is its neighbor, and add them both to a new cluster.
Then we will notice that 1.5 is a neighbor of 1.0 and add it to that same cluster.
Then we will notice that 2.0 is a neighbor of 1.5 and add it to that same cluster.
So all these points ended up in the same cluster.
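
To make the chaining concrete, here is a minimal Python sketch of the "add neighbors of a neighbor" behaviour on the four values above. It is not the llvm-exegesis code: the function names are hypothetical, the points are 1-D, "neighbor" is assumed to mean a distance of at most epsilon, and DBSCAN's minimum-points and noise handling are left out.

```python
def is_neighbor(a, b, eps):
    # Assumption: a distance of at most eps (inclusive) makes two points neighbors.
    return abs(a - b) <= eps


def cluster_transitive(points, eps):
    """Grow each cluster by also pulling in neighbors of points already added."""
    unclustered = list(points)
    clusters = []
    while unclustered:
        cluster = [unclustered.pop(0)]
        queue = list(cluster)
        while queue:
            current = queue.pop(0)
            found = [p for p in unclustered if is_neighbor(current, p, eps)]
            for p in found:
                unclustered.remove(p)
            cluster.extend(found)
            # Newly added points get their own neighbors examined too;
            # this is the step that chains 0.5 all the way to 2.0.
            queue.extend(found)
        clusters.append(cluster)
    return clusters


print(cluster_transitive([0.5, 1.0, 1.5, 2.0], eps=0.5))
# prints [[0.5, 1.0, 1.5, 2.0]]
```

Even though 0.5 and 2.0 are 1.5 apart, three times the epsilon, they land in one cluster because each step of the chain is within 0.5.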
This may or may not be a correct implementation of the DBSCAN clustering algorithm.
But it is rather horribly broken for the purpose of comparing the clusters with the LLVM sched data.
Let's suppose all those opcodes are currently in the same sched cluster.
If I specify -analysis-inconsistency-epsilon=0.5, then no matter
what the LLVM values are, this cluster will never match them,
and thus this cluster will always be displayed as inconsistent.
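
One way to see why no LLVM value can help is sketched below in Python. It rests on an assumption about the matching rule that is not spelled out here: that a cluster matches only if a single LLVM value lies within the inconsistency epsilon of every measurement in it. The helper name is hypothetical.

```python
def can_match_any_value(measurements, eps):
    # Hypothetical rule: one value v must satisfy |v - m| <= eps for every m.
    # Such a v exists only if the measurements span at most 2 * eps.
    return max(measurements) - min(measurements) <= 2 * eps


print(can_match_any_value([0.5, 1.0, 1.5, 2.0], eps=0.5))  # prints False
print(can_match_any_value([0.5, 1.0], eps=0.5))            # prints True
```

Under that assumption the four-point cluster spans 1.5, which exceeds 2 * 0.5, so it is flagged regardless of what the sched model says.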
The solution is obviously to split off some of these opcodes into a different sched cluster.
But how do I do that? Out of the 4 opcodes displayed in the inconsistency report,
which ones are the "bad ones"? Which ones are the most different from the checked-in data?
I'd need to go into the .yaml and look it up manually.
The trivial solution is, when creating clusters, to not use the full DBSCAN algorithm,
but instead "pick some unclustered point, pick all unclustered points that are its neighbors,
put them all into a new cluster, repeat". And as it so happens, we can arrive
at that algorithm by not performing the "add neighbors of a neighbor to the cluster" step.
(This will also help with opcode denoising/stabilization)
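
The quoted procedure can be sketched in a few lines of Python. This is an illustration only, not the patch itself: the function name is hypothetical, the points are 1-D, and "neighbor" is assumed to mean a distance of at most epsilon.

```python
def cluster_direct_neighbors(points, eps):
    """Pick an unclustered point, take only its direct neighbors, repeat."""
    unclustered = list(points)
    clusters = []
    while unclustered:
        seed = unclustered.pop(0)
        # Only neighbors of the seed itself are taken; neighbors of those
        # neighbors are deliberately NOT followed.
        neighbors = [p for p in unclustered if abs(seed - p) <= eps]
        for p in neighbors:
            unclustered.remove(p)
        clusters.append([seed] + neighbors)
    return clusters


print(cluster_direct_neighbors([0.5, 1.0, 1.5, 2.0], eps=0.5))
# prints [[0.5, 1.0], [1.5, 2.0]]
```

Every point in a cluster is now within epsilon of that cluster's seed. Note that in this sketch the result depends on which point is picked first: starting from 1.0 instead would give [1.0, 0.5, 1.5] and [2.0].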
While the current default is good for the abstract goal of 'analyse clustering of measurements',
I'm not sure how often that is the actual goal, as opposed to 'compare llvm data with measurements'.
So I'm not sure what the default should be.
Thoughts?
This is yet another step to bring me closer to being able to continue the cleanup of the bdver2 sched model.
Fixes PR40880.