This is an archive of the discontinued LLVM Phabricator instance.

[MachineCombiner]: Avoid including transient instructions in latency calculation
Needs ReviewPublic

Authored by malharJ on Apr 11 2022, 6:59 AM.

Details

Reviewers
fhahn
georges
Summary

The MachineCombiner pattern matches on machine instruction sequences and generates
a new instruction sequence (that hopefully is more efficient).

Currently, latency calculation (in MachineCombiner) involved when finding the
depth of (root of) the new/transformed instruction sequence includes latency
of transient (ie. machine instructions like COPY, etc. that will be removed
later during register allocation).

This seems incorrect as it results in the depth of the new sequence to be
higher (in some cases, like in the affected test files) than the old sequence,
resulting in a longer critical path and the MachineCombiner ends up rejecting
the transform for efficiency reason.

Also, looking at the logic in MachineTraceMetrics::Ensemble::updateDepth()
(which is used to calculate the latency/depth of the old instruction sequence),
it excludes latency of transient instructions from the calculation.

Diff Detail

Event Timeline

malharJ created this revision.Apr 11 2022, 6:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2022, 6:59 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
malharJ requested review of this revision.Apr 11 2022, 6:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 11 2022, 6:59 AM
fhahn added inline comments.Apr 11 2022, 7:54 AM
llvm/test/CodeGen/AArch64/neon-mla-mls.ll
141

Is this actually profitable compared to the original code? Shouldn't this use mls?

georges added inline comments.Apr 11 2022, 9:29 AM
llvm/test/CodeGen/AArch64/neon-mla-mls.ll
141

I don't think you can use mls here since the negation is on the input accumulator rather than the multiplicand.
Apart from that, I agree with @fhahn that this doesn't look obviously profitable. fmov/mov are not zero-latency on a lot of micro-architectures.
The equivalent case without the fmov (e.g. accumulating into A rather than C so it ends up in v0 naturally) could be believably faster due to the late accumulator operand forwarding present on many micro-architectures, and at worst shouldn't be worse than the existing code. I'm not sure if that already happened prior to this change though.

malharJ added inline comments.Apr 11 2022, 11:02 AM
llvm/test/CodeGen/AArch64/neon-mla-mls.ll
141

I understand the concern about the additional fmov/mov,
but as @georges already pointed out, this is more related
to the calling convention of returning result in r0
(and also unrelated to my patch which just fixes the latency/depth calculation)

I am happy to interchange the order of %A and %C,
as that eliminates the fmov/mov, if that's what is required for this patch.

Also, independent of the above, I think the test can be renamed to use
"mla" (or maybe "negmla") instead of "mls"

Are you sure this code is doing what it is intended to be doing? If it is adding the latencies (depths) between operands of inserted instructions and their defs, and we say that the latency of transient instructions are ignored, that what adds the latency between the last instruction and whatever will end up using it?

The scheduling info this is using on it's own doesn't suggest this should be changing mul;sub to neg;mla: https://godbolt.org/z/G1WTW5P9x

mgabka added a subscriber: mgabka.Apr 13 2022, 1:40 AM
fhahn added a comment.Jan 9 2023, 9:38 AM

Is this by any chance fixed by D129615?

Matt added a subscriber: Matt.Aug 24 2023, 11:42 AM