This is an archive of the discontinued LLVM Phabricator instance.

[LoopPeel] Peel if it turns invariant loads dereferenceable.
ClosedPublic

Authored by fhahn on Aug 16 2021, 3:36 AM.

Details

Summary

This patch adds a new cost heuristic that allows peeling a single
iteration off multi-exit read-only loops, if they contain loop-invariant
loads that dominate the latch. If all non-latch exits are terminated
with unreachable, the invariant loads in the loop are guaranteed to be
dereferenceable, enabling hoisting/CSE'ing them.

This enables vectorization of loops with certain runtime-checks, like
multiple calls to std::vector::at.

This should give a 20-30% improvement in score of Geekbench5/HDR.

Diff Detail

Event Timeline

fhahn created this revision.Aug 16 2021, 3:36 AM
fhahn requested review of this revision.Aug 16 2021, 3:36 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 16 2021, 3:36 AM

Before reviewing the patch, a high level question. Fair warning, I am a bit worried about cost heuristic changes here, they tend to be delicate.

Why do we need to peel this? LICM is generally good at using speculation to hoist, and trivial unswitching is good at versioning the conditions to remove them. I'd expect to see the motivating case already handled by some combination of existing transforms. (You might have to iterate them a few times. So is this working around a pass ordering issue?)

llvm/lib/Transforms/Utils/LoopPeel.cpp
200

You don't appear to be requiring the load controls an exit. It's much less obvious that "just" removing a load is worthwhile. Consider a huge loop with one potentially invariant load.

fhahn added a comment.Aug 17 2021, 1:22 PM

Before reviewing the patch, a high level question. Fair warning, I am a bit worried about cost heuristic changes here, they tend to be delicate.

Why do we need to peel this? LICM is generally good at using speculation to hoist, and trivial unswitching is good at versioning the conditions to remove them. I'd expect to see the motivating case already handled by some combination of existing transforms. (You might have to iterate them a few times. So is this working around a pass ordering issue?)

Do you have any pointers at the kind of speculation LICM could be doing? I could not spot anything that would be applicable after an initial look.

I think the motivating case could indeed be handled by a set of existing passes, which was my first try. I think it would look roughly like the following:

  1. run LICM to hoist out the invariant loads feeding the branch in the header,
  2. run indvars to turn the check fed by the hoisted loads into an invariant check,
  3. unswitch
  4. LICM to hoist the now unconditional loads feeding the second branch in the unswitched loop,
  5. run indvars to turn the second branch condition into an invariant.
  6. unswitch

(there's a small caveat that this pass order does not catch the specific test case in peel-multiple-unreachable-exits-for-vectorization.ll, but the unreduced motivating std::vector::at example, which I just noticed)

The problem with that particular order is that it clashes with the existing order which is LICM,Unswitch in one loop pass manager, followed by Indvars and others in a separate loop pass manager. The passes are split up in different loop pass managers because we run function passes in between them (InstCombine & SimplifyCFG). Adjusting the pipeline seems like it would be quite a big shakeup and peeling off the first iteration seemed like a less invasive change and less work for the optimizations overall. Note that we would have to run LICM,indvars,unswitch once for each std::vector::at call/runtime check.

What do you think? I think eliminating the separation between the 2 loop pass managers would be beneficial in its own right, but SimplifyCFG and InstCombine seems like a substantial gap to bridge.

fhahn added a comment.Aug 17 2021, 1:26 PM

I think eliminating the separation between the 2 loop pass managers would be beneficial in its own right, but SimplifyCFG and InstCombine seems like a substantial gap to bridge.

There's also a FIXME in PassBuilder.cpp about that.

Before reviewing the patch, a high level question. Fair warning, I am a bit worried about cost heuristic changes here, they tend to be delicate.

Why do we need to peel this? LICM is generally good at using speculation to hoist, and trivial unswitching is good at versioning the conditions to remove them. I'd expect to see the motivating case already handled by some combination of existing transforms. (You might have to iterate them a few times. So is this working around a pass ordering issue?)

Do you have any pointers at the kind of speculation LICM could be doing? I could not spot anything that would be applicable after an initial look.

LICM does two forms of legality reasoning for trap-safety. Option 1 is guaranteed to execute. Option 2 is proving that the hoisted load can't fault at the new location. Looking at your test case (peel-multiple-unreachable-exits-for-vectorization.ll), option 1 does work (when iterated as you note). Option 2 would require deref facts on the arguments A and B of @sum_2_at_with_int_conversion. I don't see any reason why clang shouldn't be able to emit the deref info on those args? (Assuming this is lowering a c++ reference to a std::vector<> at least.) If it can, LICM should be able to speculate the loads for start and end from each vector outside the loop.

I think the motivating case could indeed be handled by a set of existing passes, which was my first try. I think it would look roughly like the following:

  1. run LICM to hoist out the invariant loads feeding the branch in the header,
  2. run indvars to turn the check fed by the hoisted loads into an invariant check,
  3. unswitch
  4. LICM to hoist the now unconditional loads feeding the second branch in the unswitched loop,
  5. run indvars to turn the second branch condition into an invariant.
  6. unswitch

(there's a small caveat that this pass order does not catch the specific test case in peel-multiple-unreachable-exits-for-vectorization.ll, but the unreduced motivating std::vector::at example, which I just noticed)

The problem with that particular order is that it clashes with the existing order which is LICM,Unswitch in one loop pass manager, followed by Indvars and others in a separate loop pass manager. The passes are split up in different loop pass managers because we run function passes in between them (InstCombine & SimplifyCFG). Adjusting the pipeline seems like it would be quite a big shakeup and peeling off the first iteration seemed like a less invasive change and less work for the optimizations overall. Note that we would have to run LICM,indvars,unswitch once for each std::vector::at call/runtime check.

What do you think? I think eliminating the separation between the 2 loop pass managers would be beneficial in its own right, but SimplifyCFG and InstCombine seems like a substantial gap to bridge.

I agree with your reasoning all the way through.

We'd previously explored the idea of adding loop versions of InstCombine and SimplifyCFG. (See LoopSimplifyCFG, I don't remember where we stood on instcombine.)

One mid-term idea would be to keep the split loop pass, but have the second one contain the same passes as the first. (Essentially, we'd have one unrolled iteration of the loop iteration with the instcombine and simplifycfg in place.) Though, actually, I don't remember, do we even iterate loop passes to fixed point if there are changes being made to the loop? (All of my context on this involves an out of tree pass order which solves this with brute force repetition.)

fhahn added a comment.Aug 19 2021, 1:32 PM

Do you have any pointers at the kind of speculation LICM could be doing? I could not spot anything that would be applicable after an initial look.

LICM does two forms of legality reasoning for trap-safety. Option 1 is guaranteed to execute. Option 2 is proving that the hoisted load can't fault at the new location. Looking at your test case (peel-multiple-unreachable-exits-for-vectorization.ll), option 1 does work (when iterated as you note). Option 2 would require deref facts on the arguments A and B of @sum_2_at_with_int_conversion. I don't see any reason why clang shouldn't be able to emit the deref info on those args? (Assuming this is lowering a c++ reference to a std::vector<> at least.) If it can, LICM should be able to speculate the loads for start and end from each vector outside the loop.

I realize that the problem description missed a key bit of information: the motivating case has vectors passed through pointers, not references so deref is not guaranteed unfortunately.

What do you think? I think eliminating the separation between the 2 loop pass managers would be beneficial in its own right, but SimplifyCFG and InstCombine seems like a substantial gap to bridge.

I agree with your reasoning all the way through.

We'd previously explored the idea of adding loop versions of InstCombine and SimplifyCFG. (See LoopSimplifyCFG, I don't remember where we stood on instcombine.)

One mid-term idea would be to keep the split loop pass, but have the second one contain the same passes as the first. (Essentially, we'd have one unrolled iteration of the loop iteration with the instcombine and simplifycfg in place.) Though, actually, I don't remember, do we even iterate loop passes to fixed point if there are changes being made to the loop? (All of my context on this involves an out of tree pass order which solves this with brute force repetition.)

AFAIK we do not iterate loop passes to a fixed point until no changes are made, I think we only execute them for each loop once (and new loops). Let me take a look and see on how re-composing the existing passes would look like.

I realize that the problem description missed a key bit of information: the motivating case has vectors passed through pointers, not references so deref is not guaranteed unfortunately.

Shouldn't points to std::vectors still be deref_or_null? If so, we should be able to prove non-null here.

AFAIK we do not iterate loop passes to a fixed point until no changes are made, I think we only execute them for each loop once (and new loops). Let me take a look and see on how re-composing the existing passes would look like.

Honestly, I feel like we really should do either a) bounded iteration, or b) unroll the bounded iteration by hand a small handful of times. The whole premise of the new PM was to allow conditional pass execution, this is a case where doing so is clearly "worth it".

mkazantsev added inline comments.Aug 29 2021, 10:08 PM
llvm/lib/Transforms/Utils/LoopPeel.cpp
164

Returns... ? Is that supposed to be boolean?

176

Can it be a single-exit loop where latch doesn't exit?

201

Does this actually lead to the effect you are aiming? Some complexly computed pointer (e.g. result of chain of geps) may be proven loop-invariant by SCEV, but how other passes will figure out this load is from a dereferenceable pointer?

mnadeem added a subscriber: mnadeem.Sep 1 2021, 4:41 PM
reames requested changes to this revision.Sep 10 2021, 11:55 AM

Pending action by author, please request review once ready for more discussion. Just getting this off my active review queue.

This revision now requires changes to proceed.Sep 10 2021, 11:55 AM
fhahn updated this revision to Diff 377986.Oct 7 2021, 1:05 PM

Bring the patch up to date again, address reviewer comments!

I realize that the problem description missed a key bit of information: the motivating case has vectors passed through pointers, not references so deref is not guaranteed unfortunately.

Shouldn't points to std::vectors still be deref_or_null? If so, we should be able to prove non-null here.

I think I am missing something here. I tired to put toegether a reduced example: https://clang.godbolt.org/z/jfqKzzK5x . I don't see how deref_or_null on %A/%B would help here. We do not check of %B is null before executing the first load %B.start = load i64*, i64** %B.gep.start.

AFAIK we do not iterate loop passes to a fixed point until no changes are made, I think we only execute them for each loop once (and new loops). Let me take a look and see on how re-composing the existing passes would look like.

Honestly, I feel like we really should do either a) bounded iteration, or b) unroll the bounded iteration by hand a small handful of times. The whole premise of the new PM was to allow conditional pass execution, this is a case where doing so is clearly "worth it".

I went back an experiemented with different pass-orderings. It looks like one other problem is that we cannot compute the backedge-taken count for the loop, because we cannot compute the exit count for the exit that depends on the load in the loop. There might be a way around that issue, but even then running indvar,licm,simple-loop-unswitch multiple times would still be a major shakeup of the pipeline I'd like to avoid for now, if possible.

mkazantsev accepted this revision.EditedOct 7 2021, 10:38 PM

LGTM, but please make sure Philip is also OK.

llvm/lib/Transforms/Utils/LoopPeel.cpp
178

Can we reuse logic from D110922? OK if it's done in a follow-up, just consider this.

nikic added a subscriber: nikic.Oct 8 2021, 12:40 AM

Shouldn't there be a check somewhere whether the load is already dereferenceable, in which case we presumably don't want to peel it?

llvm/lib/Transforms/Utils/LoopPeel.cpp
171

This variable is initialized to false and then never changed?

fhahn added a comment.Oct 8 2021, 12:44 AM

(just submitting responses to comments I forgot to send yesterday)

llvm/lib/Transforms/Utils/LoopPeel.cpp
164

It returns the number of iterations to peel off (added a comment). At the moment it's either 0 or 1, and the main reason it is not a bool is to assign the result directly to DesiredPeelCount.

176

Yes, the latch can also be non-exiting. Dropped the assert.

200

Thanks for the suggestion! I updated the code to track which values/instructions are users of such loads. Peeling is only limited to cases where an exit condition uses such a load (transitively).

201

Good point, I changed it to use the weaker Loop::isLoopInvariant.

fhahn updated this revision to Diff 378137.Oct 8 2021, 2:22 AM

Shouldn't there be a check somewhere whether the load is already dereferenceable, in which case we presumably don't want to peel it?

I added a check using isDereferenceablePointer to catch some cases where the load is guaranteed to be dereferenceable. This is still quite weak, but in practice I think most dereferenceable loads should be moved out of the loop before we check for peeling (especailly given the restriction to read only loops at the moment).

fhahn marked 4 inline comments as done.Oct 8 2021, 2:23 AM
fhahn added inline comments.
llvm/lib/Transforms/Utils/LoopPeel.cpp
171

Thanks, that's a leftover from earlier. Removed

178

I just saw that D110922 got reverted again. I'll submit a follow-up once the patch has settled.

dmakogon added inline comments.
llvm/lib/Transforms/Utils/LoopPeel.cpp
170

Seems like there's no need of the SE now

178

D110922 is relanded now

fhahn updated this revision to Diff 378190.Oct 8 2021, 5:57 AM
fhahn marked 2 inline comments as done.

Remove unused SE argument, use IsBlockFollowedByDeoptOrUnreachable

@reames do the latest update / the recent responses address your concerns?

@reames do the latest update / the recent responses address your concerns?

Yep, you addressed the major concerns. I think you could stand to wordsmith the function comment and submit message a bit to be clearer about the new heuristic, but the code addresses my concerns.

Note that I have not done a detailed code review as the patch was already LGTMed by others. I skimmed and didn't spot anything huge, but it was only a skim.

nikic added inline comments.Oct 11 2021, 12:54 PM
llvm/lib/Transforms/Utils/LoopPeel.cpp
219

nit: LoadUsers.contains(Exiting->getTerminator())?

This revision was not accepted when it landed; it landed in state Needs Review.Oct 12 2021, 3:42 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a comment.Oct 12 2021, 4:20 AM

@reames do the latest update / the recent responses address your concerns?

Yep, you addressed the major concerns. I think you could stand to wordsmith the function comment and submit message a bit to be clearer about the new heuristic, but the code addresses my concerns.

Thanks for confirming. I tried to adjust the message of the commit to make it more in line with the committed code.

llvm/lib/Transforms/Utils/LoopPeel.cpp
219

Thanks, I unfortunately forgot to add this to the committed version, but I pushed a follow up which also uses any_of instead of the explicit loop: 40d85f16c45e