Ok, so this turned out to be easier than I expected.
Also, I initially thought that other modes might need this post-processing,
but I'm not sure which opcodes are affected there, if any.
The results look much better.
On BdVer2 this exposes at least one stable sched cluster
whose values are inconsistent with the measurements,
and a dozen or so somewhat-unstable clusters that are also inconsistent.
Resolves(?) PR41275
In this paragraph it is unclear what "next instruction" refers to.
Maybe rephrase to something like: "By construction, the execution of a snippet's instructions never overlaps. As a consequence, the per-snippet latency is the sum of the latencies of the instructions in the snippet. For instance, in the following example, latency(BT32rr R11D R11D) + latency(RCR8rCL R11B R11B) = 12."
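
To make the suggested wording concrete, here is a rough Python sketch of the arithmetic, assuming the instructions in a snippet execute strictly one after another. Only the total of 12 comes from the example above; the 1 + 11 split between the two instructions is made up for illustration and is not a measured value, and the helper names are hypothetical.

```python
def snippet_latency(instruction_latencies):
    """Per-snippet latency of a serial chain: the sum of its instructions' latencies."""
    return sum(instruction_latencies.values())


def remaining_latency(measured_total, known_latencies):
    """Latency left for the one unknown instruction after subtracting the known ones."""
    return measured_total - sum(known_latencies.values())


# ASSUMPTION: the 1/11 split is illustrative only; the source gives just the sum (12).
assumed = {"BT32rr R11D R11D": 1, "RCR8rCL R11B R11B": 11}

print(snippet_latency(assumed))                          # 12
print(remaining_latency(12, {"BT32rr R11D R11D": 1}))    # 11
```

The second helper shows why the sum matters: if the latency of one instruction in the pair is already known, the other can be recovered from the per-snippet measurement by subtraction.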