This is an archive of the discontinued LLVM Phabricator instance.

[PassManager] Run additional LICM before LoopRotate
ClosedPublic

Authored by lebedev.ri on Mar 24 2021, 2:39 AM.

Details

Summary

This is an alternative to D99204.
Better PhaseOrdering test TBD.

Loop rotation often has to perform code duplication
from header into preheader, which introduces PHI nodes.

In D99204, @thopre wrote:

With loop peeling, it is important that unnecessary PHIs be avoided or
it will lead to spurious peeling. One source of such PHIs is loop
rotation which creates PHIs for invariant loads. Those PHIs are
particularly problematic since loop peeling is now run as part of simple
loop unrolling before GVN is run, and are thus a source of spurious
peeling.

Note that while some of the loads can be hoisted and eventually
eliminated by instruction combining, this is not always possible due to
alignment issues. In particular, the motivating example [1] was a load
inside a class instance which cannot be hoisted because the `this`
pointer has an alignment of 1.

[1] http://lists.llvm.org/pipermail/llvm-dev/attachments/20210312/4ce73c47/attachment.cpp

Now, we could enhance LoopRotate to hoist loop-invariant code out of the loop
instead of duplicating it when that is possible, but wouldn't that duplicate
functionality? We already have LICM, and in fact we already run it right after
LoopRotate.

We could try moving LICM to before LoopRotate,
which is basically free from a compile-time perspective:
https://llvm-compile-time-tracker.com/compare.php?from=6c93eb4477d88af046b915bc955c03693b2cbb58&to=a4bee6d07732b1184c436da489040b912f0dc271&stat=instructions
But, looking at the statistics, I think it isn't great that we would no longer run LICM after LoopRotate; in particular:

statistic name | LoopRotate-LICM | LICM-LoopRotate | Δ | % | abs(%)
asm-printer.EmittedInsts | 9015930 | 9015799 | -131 | 0.00% | 0.00%
indvars.NumElimCmp | 3536 | 3544 | 8 | 0.23% | 0.23%
indvars.NumElimExt | 36725 | 36580 | -145 | -0.39% | 0.39%
indvars.NumElimIV | 1197 | 1187 | -10 | -0.84% | 0.84%
indvars.NumElimIdentity | 143 | 136 | -7 | -4.90% | 4.90%
indvars.NumElimRem | 4 | 5 | 1 | 25.00% | 25.00%
indvars.NumLFTR | 29842 | 29890 | 48 | 0.16% | 0.16%
indvars.NumReplaced | 2293 | 2227 | -66 | -2.88% | 2.88%
indvars.NumSimplifiedSDiv | 6 | 8 | 2 | 33.33% | 33.33%
indvars.NumWidened | 26438 | 26329 | -109 | -0.41% | 0.41%
instcount.TotalBlocks | 1178338 | 1173840 | -4498 | -0.38% | 0.38%
instcount.TotalFuncs | 111825 | 111829 | 4 | 0.00% | 0.00%
instcount.TotalInsts | 9905442 | 9896139 | -9303 | -0.09% | 0.09%
lcssa.NumLCSSA | 425871 | 423961 | -1910 | -0.45% | 0.45%
licm.NumHoisted | 378357 | 378753 | 396 | 0.10% | 0.10%
licm.NumMovedCalls | 2193 | 2208 | 15 | 0.68% | 0.68%
licm.NumMovedLoads | 35899 | 31821 | -4078 | -11.36% | 11.36%
licm.NumPromoted | 11178 | 11154 | -24 | -0.21% | 0.21%
licm.NumSunk | 13359 | 13587 | 228 | 1.71% | 1.71%
loop-delete.NumDeleted | 8547 | 8402 | -145 | -1.70% | 1.70%
loop-instsimplify.NumSimplified | 12876 | 11890 | -986 | -7.66% | 7.66%
loop-peel.NumPeeled | 1008 | 925 | -83 | -8.23% | 8.23%
loop-rotate.NumNotRotatedDueToHeaderSize | 368 | 365 | -3 | -0.82% | 0.82%
loop-rotate.NumRotated | 42015 | 42003 | -12 | -0.03% | 0.03%
loop-simplifycfg.NumLoopBlocksDeleted | 240 | 242 | 2 | 0.83% | 0.83%
loop-simplifycfg.NumLoopExitsDeleted | 497 | 20 | -477 | -95.98% | 95.98%
loop-simplifycfg.NumTerminatorsFolded | 618 | 336 | -282 | -45.63% | 45.63%
loop-unroll.NumCompletelyUnrolled | 11028 | 11032 | 4 | 0.04% | 0.04%
loop-unroll.NumUnrolled | 12608 | 12529 | -79 | -0.63% | 0.63%
mem2reg.NumDeadAlloca | 10222 | 10221 | -1 | -0.01% | 0.01%
mem2reg.NumPHIInsert | 192110 | 192106 | -4 | 0.00% | 0.00%
mem2reg.NumSingleStore | 637650 | 637643 | -7 | 0.00% | 0.00%
scalar-evolution.NumBruteForceTripCountsComputed | 814 | 812 | -2 | -0.25% | 0.25%
scalar-evolution.NumTripCountsComputed | 283108 | 282934 | -174 | -0.06% | 0.06%
scalar-evolution.NumTripCountsNotComputed | 106712 | 106718 | 6 | 0.01% | 0.01%
simple-loop-unswitch.NumBranches | 5178 | 4752 | -426 | -8.23% | 8.23%
simple-loop-unswitch.NumCostMultiplierSkipped | 914 | 503 | -411 | -44.97% | 44.97%
simple-loop-unswitch.NumSwitches | 20 | 18 | -2 | -10.00% | 10.00%
simple-loop-unswitch.NumTrivial | 183 | 95 | -88 | -48.09% | 48.09%

... but that actually regresses LICM (-12% licm.NumMovedLoads),
loop-simplifycfg (NumLoopExitsDeleted, NumTerminatorsFolded),
simple-loop-unswitch (NumTrivial).

What if we instead have LICM both before and after LoopRotate?

statistic name | LoopRotate-LICM | LICM-LoopRotate-LICM | Δ | % | abs(%)
asm-printer.EmittedInsts | 9015930 | 9014474 | -1456 | -0.02% | 0.02%
indvars.NumElimCmp | 3536 | 3546 | 10 | 0.28% | 0.28%
indvars.NumElimExt | 36725 | 36681 | -44 | -0.12% | 0.12%
indvars.NumElimIV | 1197 | 1185 | -12 | -1.00% | 1.00%
indvars.NumElimIdentity | 143 | 146 | 3 | 2.10% | 2.10%
indvars.NumElimRem | 4 | 5 | 1 | 25.00% | 25.00%
indvars.NumLFTR | 29842 | 29899 | 57 | 0.19% | 0.19%
indvars.NumReplaced | 2293 | 2299 | 6 | 0.26% | 0.26%
indvars.NumSimplifiedSDiv | 6 | 8 | 2 | 33.33% | 33.33%
indvars.NumWidened | 26438 | 26404 | -34 | -0.13% | 0.13%
instcount.TotalBlocks | 1178338 | 1173652 | -4686 | -0.40% | 0.40%
instcount.TotalFuncs | 111825 | 111829 | 4 | 0.00% | 0.00%
instcount.TotalInsts | 9905442 | 9895452 | -9990 | -0.10% | 0.10%
lcssa.NumLCSSA | 425871 | 425373 | -498 | -0.12% | 0.12%
licm.NumHoisted | 378357 | 383352 | 4995 | 1.32% | 1.32%
licm.NumMovedCalls | 2193 | 2204 | 11 | 0.50% | 0.50%
licm.NumMovedLoads | 35899 | 35755 | -144 | -0.40% | 0.40%
licm.NumPromoted | 11178 | 11163 | -15 | -0.13% | 0.13%
licm.NumSunk | 13359 | 14321 | 962 | 7.20% | 7.20%
loop-delete.NumDeleted | 8547 | 8538 | -9 | -0.11% | 0.11%
loop-instsimplify.NumSimplified | 12876 | 12041 | -835 | -6.48% | 6.48%
loop-peel.NumPeeled | 1008 | 924 | -84 | -8.33% | 8.33%
loop-rotate.NumNotRotatedDueToHeaderSize | 368 | 365 | -3 | -0.82% | 0.82%
loop-rotate.NumRotated | 42015 | 42005 | -10 | -0.02% | 0.02%
loop-simplifycfg.NumLoopBlocksDeleted | 240 | 241 | 1 | 0.42% | 0.42%
loop-simplifycfg.NumTerminatorsFolded | 618 | 619 | 1 | 0.16% | 0.16%
loop-unroll.NumCompletelyUnrolled | 11028 | 11029 | 1 | 0.01% | 0.01%
loop-unroll.NumUnrolled | 12608 | 12525 | -83 | -0.66% | 0.66%
mem2reg.NumPHIInsert | 192110 | 192073 | -37 | -0.02% | 0.02%
mem2reg.NumSingleStore | 637650 | 637652 | 2 | 0.00% | 0.00%
scalar-evolution.NumTripCountsComputed | 283108 | 282998 | -110 | -0.04% | 0.04%
scalar-evolution.NumTripCountsNotComputed | 106712 | 106691 | -21 | -0.02% | 0.02%
simple-loop-unswitch.NumBranches | 5178 | 5185 | 7 | 0.14% | 0.14%
simple-loop-unswitch.NumCostMultiplierSkipped | 914 | 925 | 11 | 1.20% | 1.20%
simple-loop-unswitch.NumTrivial | 183 | 179 | -4 | -2.19% | 2.19%

I.e. we end up with fewer instructions, less peeling, and more LICM activity;
also note that none of those four regressions appear here. Namely:

statistic name | LICM-LoopRotate | LICM-LoopRotate-LICM | Δ | % | abs(%)
asm-printer.EmittedInsts | 9015799 | 9014474 | -1325 | -0.01% | 0.01%
indvars.NumElimCmp | 3544 | 3546 | 2 | 0.06% | 0.06%
indvars.NumElimExt | 36580 | 36681 | 101 | 0.28% | 0.28%
indvars.NumElimIV | 1187 | 1185 | -2 | -0.17% | 0.17%
indvars.NumElimIdentity | 136 | 146 | 10 | 7.35% | 7.35%
indvars.NumLFTR | 29890 | 29899 | 9 | 0.03% | 0.03%
indvars.NumReplaced | 2227 | 2299 | 72 | 3.23% | 3.23%
indvars.NumWidened | 26329 | 26404 | 75 | 0.28% | 0.28%
instcount.TotalBlocks | 1173840 | 1173652 | -188 | -0.02% | 0.02%
instcount.TotalInsts | 9896139 | 9895452 | -687 | -0.01% | 0.01%
lcssa.NumLCSSA | 423961 | 425373 | 1412 | 0.33% | 0.33%
licm.NumHoisted | 378753 | 383352 | 4599 | 1.21% | 1.21%
licm.NumMovedCalls | 2208 | 2204 | -4 | -0.18% | 0.18%
licm.NumMovedLoads | 31821 | 35755 | 3934 | 12.36% | 12.36%
licm.NumPromoted | 11154 | 11163 | 9 | 0.08% | 0.08%
licm.NumSunk | 13587 | 14321 | 734 | 5.40% | 5.40%
loop-delete.NumDeleted | 8402 | 8538 | 136 | 1.62% | 1.62%
loop-instsimplify.NumSimplified | 11890 | 12041 | 151 | 1.27% | 1.27%
loop-peel.NumPeeled | 925 | 924 | -1 | -0.11% | 0.11%
loop-rotate.NumRotated | 42003 | 42005 | 2 | 0.00% | 0.00%
loop-simplifycfg.NumLoopBlocksDeleted | 242 | 241 | -1 | -0.41% | 0.41%
loop-simplifycfg.NumLoopExitsDeleted | 20 | 497 | 477 | 2385.00% | 2385.00%
loop-simplifycfg.NumTerminatorsFolded | 336 | 619 | 283 | 84.23% | 84.23%
loop-unroll.NumCompletelyUnrolled | 11032 | 11029 | -3 | -0.03% | 0.03%
loop-unroll.NumUnrolled | 12529 | 12525 | -4 | -0.03% | 0.03%
mem2reg.NumDeadAlloca | 10221 | 10222 | 1 | 0.01% | 0.01%
mem2reg.NumPHIInsert | 192106 | 192073 | -33 | -0.02% | 0.02%
mem2reg.NumSingleStore | 637643 | 637652 | 9 | 0.00% | 0.00%
scalar-evolution.NumBruteForceTripCountsComputed | 812 | 814 | 2 | 0.25% | 0.25%
scalar-evolution.NumTripCountsComputed | 282934 | 282998 | 64 | 0.02% | 0.02%
scalar-evolution.NumTripCountsNotComputed | 106718 | 106691 | -27 | -0.03% | 0.03%
simple-loop-unswitch.NumBranches | 4752 | 5185 | 433 | 9.11% | 9.11%
simple-loop-unswitch.NumCostMultiplierSkipped | 503 | 925 | 422 | 83.90% | 83.90%
simple-loop-unswitch.NumSwitches | 18 | 20 | 2 | 11.11% | 11.11%
simple-loop-unswitch.NumTrivial | 95 | 179 | 84 | 88.42% | 88.42%


(These statistics are from the vanilla LLVM test-suite plus RawSpeed and darktable.)

This does have an observable compile-time regression of about +0.5% geomean
https://llvm-compile-time-tracker.com/compare.php?from=7c5222e4d1a3a14f029e5f614c9aefd0fa505f1e&to=5d81826c3411982ca26e46b9d0aff34c80577664&stat=instructions
but I think that's basically nothing.

Diff Detail

Event Timeline

lebedev.ri created this revision.Mar 24 2021, 2:39 AM
lebedev.ri requested review of this revision.Mar 24 2021, 2:39 AM
lebedev.ri edited the summary of this revision. (Show Details)
lebedev.ri edited the summary of this revision. (Show Details)Mar 24 2021, 3:29 AM
lebedev.ri edited the summary of this revision. (Show Details)Mar 24 2021, 3:41 AM
lebedev.ri edited the summary of this revision. (Show Details)Mar 24 2021, 4:19 AM

Thanks for working on that. I've tried this patch with the application from which I created the reduced testcase sent to the mailing list and that does fix the extra peeling. Thank you very much. I'll try to review the patch now

> Thanks for working on that. I've tried this patch with the application from which I created the reduced testcase sent to the mailing list and that does fix the extra peeling. Thank you very much. I'll try to review the patch now

Glad to hear it!

LGTM

llvm/test/Other/pass-pipelines.ll
54

Shouldn't that CHECK-NOT be duplicated between each Loop Pass Manager check below as well?

lebedev.ri added inline comments.Mar 24 2021, 5:17 AM
llvm/test/Other/pass-pipelines.ll
54

I don't know.
This seems to be consistent with the previous checks, where there already were several LPMs,
but no CHECK-NOT in between.

nikic requested changes to this revision.Mar 24 2021, 5:21 AM

Just marking this to clarify that approval by @thopre is insufficient. Don't have time to review this right now.

This revision now requires changes to proceed.Mar 24 2021, 5:21 AM

> Just marking this to clarify that approval by @thopre is insufficient. Don't have time to review this right now.

Indeed, I didn't approve it intentionally

llvm/test/Other/pass-pipelines.ll
54

One of them was added a year ago by a different person than the author of the CHECK-NOT. I presume it was probably missed. Does the test pass if you add more CHECK-NOT?

One thing I can throw out now is that currently, we use split LPMs for LoopRotate and LICM, because historically it was too expensive to preserve MSSA in LoopRotate. After your change MSSA needs to be preserved anyway, and we are in the rather silly situation of LICM, LoopRotate and LICM again each using a separate LPM. If we want to go down this route, we'd probably want to land something like D74640 first in order to merge these LPMs again.

lebedev.ri marked 2 inline comments as done.

>> Just marking this to clarify that approval by @thopre is insufficient. Don't have time to review this right now.

> Indeed, I didn't approve it intentionally

Yep, not sure why that needed to be stated explicitly.

llvm/test/Other/pass-pipelines.ll
54

Hm, seems to work.

> One thing I can throw out now is that currently, we use split LPMs for LoopRotate and LICM, because historically it was too expensive to preserve MSSA in LoopRotate. After your change MSSA needs to be preserved anyway, and we are in the rather silly situation of LICM, LoopRotate and LICM again each using a separate LPM. If we want to go down this route, we'd probably want to land something like D74640 first in order to merge these LPMs again.

Notably, D74640 is a legacy-PM problem; I'm not sure how much we should care about it nowadays.
But yes, D74640 does help there somewhat: without vs. with

mkazantsev added inline comments.Mar 24 2021, 10:12 AM
llvm/lib/Passes/PassBuilder.cpp
579

Theoretically LICM should move all invariants out of the loop, and loop rotation should not create any new invariants. Do we really need this again?

thopre added inline comments.Mar 24 2021, 10:19 AM
llvm/lib/Passes/PassBuilder.cpp
579

Loads of size >4 whose base pointer is derived from the `this` pointer fail the isSafeToExecuteUnconditionally() check in LICM, because the alignment of `this` is 1.

lebedev.ri edited the summary of this revision. (Show Details)Mar 24 2021, 10:29 AM

> I.e. we end up with less instructions, more LICM activity (+30% more sunks out of loops!),

This sounds too good to be true ... and it is. The statistic was closer to counting sinking candidates than actually sunk instructions. I've fixed it in 8a168d2d70678164004fca8de78e98bfb6e1272d and would expect this patch to have a much more limited impact on it now.

lebedev.ri marked 2 inline comments as done.Mar 24 2021, 10:34 AM
lebedev.ri added inline comments.
llvm/lib/Passes/PassBuilder.cpp
579

> Theoretically LICM should move all invariants out of the loop, and loop rotation should not create any new invariants. Do we really need this again?

I think the numbers speak for themselves; see the 3rd table in the description I just added,
which shows improvements when going from LICM before LoopRotate to LICM after LoopRotate.
Especially note licm.NumSunk (+30%), and note that there was no such regression
when just moving the LICM to before LoopRotate.

Depending on what the run-time impact is in practice, the additional compile-time may or may not be worth it.
I'm running some additional testing offline on this, please give me a couple of days to test and analyze.

>> I.e. we end up with less instructions, more LICM activity (+30% more sunks out of loops!),

> This sounds too good to be true ... and it is. The statistic was closer to counting sinking candidates than actually sunk instructions. I've fixed it in 8a168d2d70678164004fca8de78e98bfb6e1272d and would expect this patch to have a much more limited impact on it now.

Sure, let me remeasure this.

thopre added inline comments.Mar 24 2021, 10:40 AM
llvm/lib/Passes/PassBuilder.cpp
579

Sorry, I misread: you meant moving LICM before LoopRotate rather than adding a new LICM pass before it.

lebedev.ri edited the summary of this revision. (Show Details)Mar 24 2021, 11:19 AM

>> I.e. we end up with less instructions, more LICM activity (+30% more sunks out of loops!),

> This sounds too good to be true ... and it is. The statistic was closer to counting sinking candidates than actually sunk instructions. I've fixed it in 8a168d2d70678164004fca8de78e98bfb6e1272d and would expect this patch to have a much more limited impact on it now.

Down to 7%.
Also, note that the stats suggest that early LICM alone is worse than two LICM runs:

> ... but that actually regresses LICM (-12% licm.NumMovedLoads),
> loop-simplifycfg (NumLoopExitsDeleted, NumTerminatorsFolded),
> simple-loop-unswitch (NumTrivial).

> In D99249#2647162, @lebedev.ri wrote:
>
> also note how none of those 4 regressions are here.

So i still believe this is what we need.

Looks nice for me, but let's wait for Alina's investigation to get concluded.

llvm/lib/Passes/PassBuilder.cpp
579

I'm not sure whether a change in the licm.NumSunk statistic in either direction is a regression or not. An increase can mean either "we somehow enabled more optimizations" or "we duplicated code and now sink the duplicated values twice". Thinking more about it, I can imagine how rotation can turn a PHI into a sinkable loop invariant. So I guess it's fine to keep LICM after it.

lebedev.ri marked an inline comment as done.Mar 26 2021, 2:04 AM

> Looks nice for me, but let's wait for Alina's investigation to get concluded.

@mkazantsev thank you for taking a look!

llvm/lib/Passes/PassBuilder.cpp
579

Looking at asm-printer.EmittedInsts and instcount.TotalInsts,
we quite clearly end up with fewer instructions in either of
the LICM-LoopRotate and LICM-LoopRotate-LICM configurations
(as compared to the baseline of LoopRotate-LICM),
and LICM-LoopRotate-LICM is clearly the winner overall.
Which is why that is the configuration I'm proposing here.

fhahn added a comment.Mar 26 2021, 2:15 AM

If only we could re-run LICM on the rotated loops :)

> If only we could re-run LICM on the rotated loops :)

Yep, @MaskRay has also been saying that in my pipeline patches :)

> Depending on what the run-time impact is in practice, the additional compile-time may or may not be worth it.
> I'm running some additional testing offline on this, please give me a couple of days to test and analyze.

@asbirlea How is it going?

asbirlea accepted this revision.Mar 31 2021, 10:40 AM

Most performance results look ok. There are a few that are mixed (+ & -), one regression that's still investigated that so far seems architecture specific and not related to this patch, a couple of noticeable improvements and a few marginal improvements.
Please note I have only tested the new pass manager.
So, green light from my side assuming the other reviewers are also on board with this.

@lebedev.ri @thopre Is it right that you have seen the performance gain in practice from this change in larger applications, such as the example with peeling that prompted the first patch?
I'd be curious what the impact for that is.

> Most performance results look ok. There are a few that are mixed (+ & -), one regression that's still investigated that so far seems architecture specific and not related to this patch, a couple of noticeable improvements and a few marginal improvements.
> Please note I have only tested the new pass manager.
> So, green light from my side assuming the other reviewers are also on board with this.

Thanks for running the tests. Sounds like the effect here is relatively minor, but at least it's not negative.

I think at this point there is a pretty clear motivation for why we want to run LICM before LoopRotate, but the motivation for the LICM-LoopRotate-LICM sequence is still somewhat murky to me. Do we have any PhaseOrdering tests that would regress without the second LICM run?

I'm not sure how to write a proper interestingness test for this, but e.g. given this input

(bad example, in the end everything optimizes away)
currently (-looprotate -licm):

*** IR Dump After LICMPass ***
; Preheader:
for.end.lr.ph:                                    ; preds = %entry
  %hue = getelementptr inbounds %"class.rawspeed::Cr2sRawInterpolator", %"class.rawspeed::Cr2sRawInterpolator"* %this, i64 0, i32 3
  %1 = load i32, i32* %hue, align 4, !tbaa !8
  %Cb.i.i = getelementptr inbounds %"struct.std::array.30", %"struct.std::array.30"* %MCUs, i64 0, i32 0, i64 undef, i32 0, i64 undef, i32 1
  %sub.i.i = add i32 %1, -16384
  %Cb.i.i.promoted = load i32, i32* %Cb.i.i, align 4, !tbaa !9
  br label %for.end

; Loop:
for.end:                                          ; preds = %for.end.lr.ph, %for.end
  %2 = phi i32 [ %Cb.i.i.promoted, %for.end.lr.ph ], [ %add.i.i, %for.end ]
  %MCUIdx.03 = phi i32 [ undef, %for.end.lr.ph ], [ %inc21, %for.end ]
  %add.i.i = add i32 %sub.i.i, %2
  %inc21 = add nsw i32 %MCUIdx.03, 1
  %cmp = icmp slt i32 %inc21, %sub
  br i1 %cmp, label %for.end, label %for.cond.for.end22_crit_edge, !llvm.loop !11

; Exit blocks
for.cond.for.end22_crit_edge:                     ; preds = %for.end
  %add.i.i.lcssa = phi i32 [ %add.i.i, %for.end ]
  store i32 %add.i.i.lcssa, i32* %Cb.i.i, align 4, !tbaa !9
  br label %for.end22

with -licm -loop-rotate:

*** IR Dump After LoopRotatePass ***
; Preheader:
for.end.lr.ph:                                    ; preds = %entry
  br label %for.end

; Loop:
for.end:                                          ; preds = %for.end.lr.ph, %for.end
  %MCUIdx.03 = phi i32 [ undef, %for.end.lr.ph ], [ %inc21, %for.end ]
  %1 = load i32, i32* %hue, align 4, !tbaa !8
  %2 = load i32, i32* %Cb.i.i, align 4, !tbaa !9
  %sub.i.i = add i32 %1, -16384
  %add.i.i = add i32 %sub.i.i, %2
  store i32 %add.i.i, i32* %Cb.i.i, align 4, !tbaa !9
  %inc21 = add nsw i32 %MCUIdx.03, 1
  %cmp = icmp slt i32 %inc21, %sub
  br i1 %cmp, label %for.end, label %for.cond.for.end22_crit_edge, !llvm.loop !11

; Exit blocks
for.cond.for.end22_crit_edge:                     ; preds = %for.end
  br label %for.end22

and with -licm -loop-rotate -licm:

*** IR Dump After LICMPass ***
; Preheader:
for.end.lr.ph:                                    ; preds = %entry
  %1 = load i32, i32* %hue, align 4, !tbaa !8
  %sub.i.i = add i32 %1, -16384
  %Cb.i.i.promoted = load i32, i32* %Cb.i.i, align 4, !tbaa !9
  br label %for.end

; Loop:
for.end:                                          ; preds = %for.end.lr.ph, %for.end
  %2 = phi i32 [ %Cb.i.i.promoted, %for.end.lr.ph ], [ %add.i.i, %for.end ]
  %MCUIdx.03 = phi i32 [ undef, %for.end.lr.ph ], [ %inc21, %for.end ]
  %add.i.i = add i32 %sub.i.i, %2
  %inc21 = add nsw i32 %MCUIdx.03, 1
  %cmp = icmp slt i32 %inc21, %sub
  br i1 %cmp, label %for.end, label %for.cond.for.end22_crit_edge, !llvm.loop !11

; Exit blocks
for.cond.for.end22_crit_edge:                     ; preds = %for.end
  %add.i.i.lcssa = phi i32 [ %add.i.i, %for.end ]
  store i32 %add.i.i.lcssa, i32* %Cb.i.i, align 4, !tbaa !9
  br label %for.end22

Note how -licm -loop-rotate causes %sub.i.i to be computed in-loop.
Let me know if that isn't a sufficient answer.

> Most performance results look ok. There are a few that are mixed (+ & -), one regression that's still investigated that so far seems architecture specific and not related to this patch, a couple of noticeable improvements and a few marginal improvements.
> Please note I have only tested the new pass manager.
> So, green light from my side assuming the other reviewers are also on board with this.
>
> @lebedev.ri @thopre Is it right that you have seen the performance gain in practice from this change in larger applications, such as the example with peeling that prompted the first patch?

Yes, we had an ML benchmark regress by about 4% because the peeling prevented vectorization due to loss of alignment on our target. I've tried both LICM-looprotate-LICM and LICM-looprotate, and both fix the regression.

> I'm not sure how to write a proper interestingness test for this, but e.g. given this input
>
> (bad example, in the end everything optimizes away)
> <...>
> Note how -licm -loop-rotate causes %sub.i.i to be computed in-loop.
> Let me know if that isn't a sufficient answer.

Or more visibly: https://godbolt.org/z/GzEbacs4K

To the best of my knowledge, all review feedback has been addressed here.
I will be landing this tomorrow, if nothing new blocking appears before then.
Thanks!

nikic added a comment.Apr 1 2021, 2:19 PM

Here's a reduced version of @lebedev.ri's example:

define void @test(i32* dereferenceable(4) %ptr, i32 %size) {
entry:                                     
  br label %for.cond
    
for.cond:
  %iv = phi i32 [ 0, %entry ], [ %iv.next, %for.cont ]
  %cmp = icmp ult i32 %iv, %size
  br i1 %cmp, label %for.cont, label %for.end

for.cont:
  %val = load i32, i32* %ptr, align 4
  %iv.next = add nsw i32 %iv, 1
  call void @use(i32 %val) inaccessiblememonly
  br label %for.cond

for.end:
  ret void
}

declare void @use(i32)

The problem is that the load is not speculatable due to the lack of an align attribute. After -loop-rotate we have:

define void @test(i32* align 4 dereferenceable(4) %ptr, i32 %size) {
entry:
  %cmp1 = icmp ult i32 0, %size
  br i1 %cmp1, label %for.cont.lr.ph, label %for.end

for.cont.lr.ph:                                   ; preds = %entry
  br label %for.cont

for.cont:                                         ; preds = %for.cont.lr.ph, %for.cont
  %iv2 = phi i32 [ 0, %for.cont.lr.ph ], [ %iv.next, %for.cont ]
  %val = load i32, i32* %ptr, align 4
  %iv.next = add nsw i32 %iv2, 1
  call void @use(i32 %val) #0
  %cmp = icmp ult i32 %iv.next, %size
  br i1 %cmp, label %for.cont, label %for.cond.for.end_crit_edge

for.cond.for.end_crit_edge:                       ; preds = %for.cont
  br label %for.end

for.end:                                          ; preds = %for.cond.for.end_crit_edge, %entry
  ret void
}

in which case the load is executed unconditionally (if the loop is executed) and can be hoisted. Thus the need for -licm after -loop-rotate.

Based on @thopre's comments, it seems like the lack of an align attribute on C++ `this` pointers was also the problem for his case. It's possible that I'm missing some C++ subtlety here, but I believe the lack of an align attribute here is just an oversight. At some point, we changed LLVM to no longer assume that pointers without alignment information have natural alignment, and clang was adjusted in D80166 to emit align attributes for dereferenceable pointers to compensate for that change. However, the align attributes were added only for reference types, not for `this` pointers. Notably, https://github.com/llvm/llvm-project/blob/17800f900dca8243773dec5f90578cce03069b8f/clang/lib/CodeGen/CGCall.cpp#L2310 is missing the align attribute. We're probably missing optimizations not just in LICM because of that.

Yep, that looks suspect.
I'll see if I can follow up on that.
Thanks!

This revision was not accepted when it landed; it landed in state Needs Review.Apr 2 2021, 1:12 AM
This revision was automatically updated to reflect the committed changes.

Maybe this caused https://bugs.llvm.org/show_bug.cgi?id=50550. Could you check and comment on it?

tra added a subscriber: tra.Oct 8 2021, 1:36 PM

FYI, this appears to introduce a somewhat serious performance regression for NVPTX:
https://bugs.llvm.org/show_bug.cgi?id=52037

It's not necessarily a bug, but rather an unfortunate interaction: the loop transforms block SROA, which hits a performance-sensitive quirk of NVPTX, where local storage is actually quite expensive.