The MachineCombiner pattern-matches on machine instruction sequences and generates
a new instruction sequence that is hopefully more efficient.
Currently, the latency calculation in MachineCombiner used when finding the
depth of (the root of) the new/transformed instruction sequence includes the
latency of transient instructions (i.e. machine instructions such as COPY that
will be removed later, during register allocation).
This seems incorrect, as it can make the depth of the new sequence higher than
that of the old sequence (as happens in the affected test files), which implies
a longer critical path, so the MachineCombiner ends up rejecting the transform
for efficiency reasons.
Also, the logic in MachineTraceMetrics::Ensemble::updateDepth()
(which is used to calculate the latency/depth of the old instruction sequence)
already excludes the latency of transient instructions from its calculation.
Is this actually profitable compared to the original code? Shouldn't this use mls?