This is an archive of the discontinued LLVM Phabricator instance.

[LICM] Check the number of divergent paths from loop header to target block
AbandonedPublic

Authored by jaykang10 on Oct 14 2021, 7:14 AM.

Details

Summary

If the control flow is divergent too much inside loop, it could be better to avoid to move instructions outside loop.

For example, let's say there is a switch instruction with big number of cases inside loop and the instruction is located in one of the case block. In this case, it is not easy to say the instruction is executed.

This patch checks the number of divergent paths from loop header to block which has the instruction. If the number of divergent paths is bigger than threshold, LICM pass does not move the instruction.

The performance improvement from spec2017 for AArch64 is as below.

Benchmark	improvement(%)
500.perlbench_r	2.176103248
502.gcc_r	-0.271585852
505.mcf_r	0.511525241
520.omnetpp_r	-0.008166838
523.xalancbmk_r	0.488642971
525.x264_r	-0.049168496
531.deepsjeng_r	0.021372649
541.leela_r	-0.023171636
557.xz_r	-0.012851939

Diff Detail

Event Timeline

jaykang10 created this revision.Oct 14 2021, 7:14 AM
jaykang10 requested review of this revision.Oct 14 2021, 7:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 14 2021, 7:14 AM
lebedev.ri requested changes to this revision.Oct 14 2021, 7:16 AM
lebedev.ri added a subscriber: lebedev.ri.

As it's stated in the pass description, it is a canonicalization transform, performed without a costmodel,
and it's expected to be undone in backend.

llvm/lib/Transforms/Scalar/LICM.cpp
15–20
This revision now requires changes to proceed.Oct 14 2021, 7:16 AM
jaykang10 abandoned this revision.Oct 14 2021, 7:22 AM

As it's stated in the pass description, it is a canonicalization transform, performed without a costmodel,
and it's expected to be undone in backend.

Ah, Thanks for letting me know. @lebedev.ri I missed it.

Let me abandon this patch.

I would like to mention one thing.

This patch is not related to register pressure or live range. It checks the number of divergent paths from loop header to the block which has loop invariant code. If the number is too big, the probability of the instruction being executed is small.

LICM pass is already doing similar thing with profile information.

jaykang10 reclaimed this revision.Oct 14 2021, 8:17 AM

In order to get more comments, I open this review again.

This revision now requires changes to proceed.Oct 14 2021, 8:17 AM

JFYI, Roman is correct. LICM is a canonicalizing transform. This patch will never be accepted. If the hoisting was not profitable, machine-licm exists precisely to undo it in a cost based way.

As to your point about the profile driven piece, you are correct, that is inconsistent. That should never have landed, or been enabled. I'm digging through the history to figure out how that happened.

JFYI, Roman is correct. LICM is a canonicalizing transform. This patch will never be accepted. If the hoisting was not profitable, machine-licm exists precisely to undo it in a cost based way.

As to your point about the profile driven piece, you are correct, that is inconsistent. That should never have landed, or been enabled. I'm digging through the history to figure out how that happened.

Thanks for comment @reames.

I can see you start the discussion to revert the commit with profile information.

Once the patch is reverted, let me abandon this patch.

@lebedev.ri , @reames : I am not saying this the right patch, but just wanted to leave a thought about LICM being a canonicalizing transform. The situation where we are in at the moment, is that LICM is deemed a canonicalisation transform, but we don't have a solid solution in the backend to undo this. LoopSink is an IR pass, works with profile info, and is unsuitable if register pressure needs to be taken into account. MachineSink would be the natural place for this, but currently lacks the ability to sink back into loops. Some work was started to address this, D94308, but that needs finishing and isn't the end of the story. It would need more heuristics, and then get enabled, so that is quite a way off from where we want to be.

Long story short, I think it is a bit easy to say this is a canonicalization transformation. I am not ready to challenge this, not sure I ever will, but just saying that we are in a limbo and not really in a good situation with this.

reames added a comment.EditedOct 15 2021, 10:32 AM

..., but we don't have a solid solution in the backend to undo this.

I would have a lot more sympathy for this argument if the examples being presented weren't near trivial sinking. The only example checked in with D87551 is a multiply used to implement a square operation. Such an instruction can (on x86, before regalloc) be trivially moved as a) it's non-faulting, b) it uses only a second register input so moving it as it worst neutral and c) the input and output are the same register class. The only mildly complicated bit is modeling the implicit eflags dead def, but even there, the location we sink to immediately kills it as well.

If we can't handle cases that simple in our sinking transforms, we should not workaround them elsewhere. That's embarrassing, and we should simply fix it.

If we were being presented with hard cases to undo with a cost based transform, I might be willing to entertain the notion that LICM should be non-canonical. As it stands, I don't see such examples.

To keep the conversation coherent I'll talk about the same feedback for D87551 here:

If we were being presented with hard cases to undo with a cost based transform, I might be willing to entertain the notion that LICM should be non-canonical. As it stands, I don't see such examples.

The original change that prompted adding this from @wenlei is a more complicated situation https://reviews.llvm.org/D65060#1596899. Specifically, once the values are hoisted out of the 200 switch arms there's not longer a dominating use block to sink them back into. In addition, the block frequency information at that point no longer indicates that there's an issue from the hoisting. Getting MachineLICM to handle that case is complicated and I suspect fragile.

To keep the conversation coherent I'll talk about the same feedback for D87551 here:

If we were being presented with hard cases to undo with a cost based transform, I might be willing to entertain the notion that LICM should be non-canonical. As it stands, I don't see such examples.

The original change that prompted adding this from @wenlei is a more complicated situation https://reviews.llvm.org/D65060#1596899. Specifically, once the values are hoisted out of the 200 switch arms there's not longer a dominating use block to sink them back into. In addition, the block frequency information at that point no longer indicates that there's an issue from the hoisting. Getting MachineLICM to handle that case is complicated and I suspect fragile.

FYI, I find this very unconvincing. First, the motivating test case appears to have been discussed, but never shared. Second, as discussed, we appeared to have something along the lines of a switch mapping to address offsets, which is then loaded from. We can and do pull such geps down through the phi, and that optimization should probably have triggered on the discussed example. In general, a missing optimization is a *bad* motivation for disabling a canonicalization.

jaykang10 planned changes to this revision.Dec 4 2021, 12:42 AM
lebedev.ri resigned from this revision.Jan 12 2023, 4:49 PM

This review seems to be stuck/dead, consider abandoning if no longer relevant.

Herald added a project: Restricted Project. · View Herald TranscriptJan 12 2023, 4:49 PM
Herald added a subscriber: StephenFan. · View Herald Transcript
jaykang10 abandoned this revision.Jan 13 2023, 12:10 AM