This is an archive of the discontinued LLVM Phabricator instance.

[LoopPeeling] Support peeling loops with non-latch exits
ClosedPublic

Authored by nikic on Sep 28 2022, 5:23 AM.

Details

Summary

Loop peeling currently requires that a) the latch is exiting, b) the latch terminator is a branch, and c) all other exits are unreachable/deopt. This patch removes all of these limitations and adds the necessary branch weight updating support. It essentially works the same way as before, with "latch" replaced by "exiting terminator" and "loop trip count" replaced by "per-exit trip count".
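
For illustration, here is a minimal hypothetical loop (not taken from the patch or its tests) that was previously rejected because its latch is not exiting, but is now eligible for peeling:

loop:
  %i = phi i64 [ 0, %entry ], [ %i.next, %latch ]
  %done = icmp eq i64 %i, %n
  br i1 %done, label %exit, label %latch   ; the only exit is in the header
latch:
  %i.next = add i64 %i, 1
  br label %loop                           ; the latch itself is not exiting
exit: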

It's worth noting that there are still other limitations in the profitability heuristics: this patch enables peeling of loops to make conditions invariant (which is pretty much always highly profitable if possible), while peeling to make loads dereferenceable still checks that non-latch exits are unreachable, and PGO-based peeling has even more conditions. Those checks could be relaxed later if we consider those cases profitable.

The motivation for this change is that loops using iterator adaptors in Rust often optimize very badly, and end up with a loop phi of the form phi(true, false) in the final result. Peeling eliminates that phi and conditions based on it, which enables a lot of follow-on simplification.
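
For illustration, a minimal hypothetical sketch (not taken from an actual Rust test case) of that pattern, a "first iteration" flag that peeling turns into a constant:

loop:
  %first = phi i1 [ true, %preheader ], [ false, %latch ]
  br i1 %first, label %first.iter, label %later.iter   ; condition based on the phi
first.iter:                                            ; only reachable on the first iteration
  br label %latch
later.iter:
  br label %latch
latch:
  br i1 %cc, label %loop, label %exit
exit:

After one iteration is peeled off, %loop is only ever entered with %first == false (both the edge from the peeled iteration and the backedge feed in false), so the phi folds to a constant and the branch on it goes away.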

Diff Detail

Event Timeline

nikic created this revision.Sep 28 2022, 5:23 AM
nikic requested review of this revision.Sep 28 2022, 5:23 AM
Herald added a project: Restricted Project.Sep 28 2022, 5:23 AM
reames accepted this revision.Oct 6 2022, 1:23 PM

LGTM

This revision is now accepted and ready to land.Oct 6 2022, 1:23 PM
This revision was landed with ongoing or failed builds.Oct 7 2022, 3:36 AM
This revision was automatically updated to reflect the committed changes.
bgraur added a subscriber: bgraur.Oct 17 2022, 1:10 PM

Heads-up: root-causing is pointing to this patch for a miscompile that leads to an infinite loop in one of our tests.
Working on a reproducer.

tra added a subscriber: tra.Oct 17 2022, 2:49 PM

This change may have caused https://github.com/llvm/llvm-project/issues/28112 to show up again in a few of our tests (the timeout just mentioned by @bgraur).
Unfortunately, it's likely going to be hard to create a reduced reproducer, as the issue tends to show up in large code and often disappears in an unpredictable manner when the code gets reduced.

Is there a way to disable this heuristic?

Another possible source of this problem is that the loop transformation didn't take the 'convergent' attribute into account and may have duplicated code that should not have been duplicated. E.g. here's what we had to do in the past: https://reviews.llvm.org/D17518

nikic added a subscriber: nhaehnle.Oct 19 2022, 1:20 AM

It's plausible that the issue is related to convergence. As far as I can tell, loop peeling currently does not do any checks related to convergent operations in the loop. However, it's not really clear to me whether loop peeling of convergent operations is actually illegal -- it looks like (non-runtime) unrolling of convergent operations is allowed, so I would expect loop peeling to be allowed as well. I have a very hard time working out from the LangRef definition what requirements it actually imposes here.

Maybe @nhaehnle, who seems to be the expert on this topic, can chime in.

The common issue with duplication of convergent instructions is when two threads that previously would execute the same static convergent instruction end up executing different static convergent instructions after a transform. This is clearly not the case with loop peeling, so it really ought to be okay.

That said, there's the underlying meta-problem that we haven't yet adopted a good definition of convergent, meaning that given a piece of LLVM IR, it is not possible to tell which threads of a group are supposed to communicate in a given convergent instruction. It is entirely possible that there is some implicit assumption in the code about how convergent instructions behave that happens to get broken by a downstream transform now that the post-loop-peel CFG is different.

One interesting observation is that if we take the current definition of convergent as gospel (which we really shouldn't, but for the sake of the argument), then loop peeling with convergent ops is actually forbidden, and peeling loops with additional exits may make the problem worse. Specifically:

preheader:
  br loop
loop:
  convergentop
  br i1 cc1, exit, latch
latch:
  br i1 cc2, loop, exit
exit:

Peeling once results in:

loop.peel:
  convergentop'
  br i1 cc1', exit, latch.peel     <-- X
latch.peel:
  br i1 cc2, preheader, exit    <-- X
preheader:
  br loop
loop:
  convergentop
  br i1 cc1, exit, latch
latch:
  br i1 cc2, loop, exit
exit:

The convergentop inside the loop now has the two branches marked X as additional control dependencies, which is forbidden by current LangRef. It is possible that this wasn't an issue in the past because the peeled latch branch is most likely constant anyway and gets optimized away, and now there's some new loop that gets peeled with a secondary exit whose condition isn't constant.

Keep in mind that this is all speculation, and personally, I find it bizarre. The definition of convergent that I would prefer doesn't have this issue, but it may well be that something in the downstream PTX flow is sensitive to this somehow today. It would be good to properly root-cause the issue.

P.S.: It's also possible that the issue only manifests with nested loops. Either because perhaps an outer loop is peeled (does that ever happen?) or because an inner loop is peeled and one of the additional exits breaks out of or continues an outer loop.
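
To make the second case concrete, here is a purely hypothetical sketch (I'm not claiming this is the shape in the failing test) of an inner loop with a convergent op and a secondary exit that continues an outer loop:

outer.header:
  br label %inner.header
inner.header:
  call void @convergentop()
  br i1 %cc1, label %outer.latch, label %inner.latch   ; secondary exit that continues the outer loop
inner.latch:
  br i1 %cc2, label %inner.header, label %exit
outer.latch:
  br i1 %cc3, label %outer.header, label %exit
exit:

Before this patch, the secondary exit to %outer.latch would have blocked peeling of the inner loop; now the inner loop can be peeled, and the peeled copies of both exit branches become additional control dependencies of the convergent op remaining in the loop.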

tra added a comment.Oct 19 2022, 10:55 AM

It would be good to properly root-cause the issue.

I'll try extracting the IR/PTX from the test we have failing, though I'm not optimistic that it will be small enough to be usable.

tra added a comment.Oct 19 2022, 6:49 PM

As I suspected, the hanging code is ~2.5MB of unoptimized IR, reduced to ~1MB after optimizations. The IR is thread-sync heavy, and it's hard to tell if anything is obviously wrong after it gets further massaged by ptxas into GPU instructions. :-(

Would it be possible to introduce a cc1 option disabling this kind of peeling, so we have some sort of workaround until we figure out a better way to deal with the issue?

tra added a comment.Oct 24 2022, 10:52 AM

@nikic: This blocks our internal compiler release, and we need some sort of workaround until we can properly deal with the issue.

Would it be possible to introduce a cc1 option disabling this kind of peeling,...?

I do not understand the code well enough to tell where the best point to control this would be. If you could suggest a minimally invasive way to do it, I'd appreciate it.

nikic added a comment.Oct 25 2022, 1:03 AM

As I suspected, the hanging code is ~2.5MB of unoptimized IR, reduced to ~1MB after optimizations. The IR is thread-sync heavy, and it's hard to tell if anything is obviously wrong after it gets further massaged by ptxas into GPU instructions. :-(

In this test case, is there only a single additional loop being peeled? If there are multiple, is it possible to localize which one is the problematic one? Can you identify which of the removed conditions in canPeel() are the ones that are relevant? Does the problematic loop contain convergent instructions as speculated before?

Just some suggestions on how the problem could be narrowed a bit further.

Would it be possible to introduce a cc1 option disabling this kind of peeling, so we have some sort of workaround until we figure out a better way to deal with the issue?

For the record, D136643 has the opt option (I assume that's what you meant; a cc1 option would be unusual). I think that patch is too complicated as-is, but generally having a temporary option is fine.

tra added a comment.Oct 25 2022, 9:47 AM

As I suspected, the hanging code is ~2.5MB of unoptimized IR, reduced to ~1MB after optimizations. The IR is thread-sync heavy, and it's hard to tell if anything is obviously wrong after it gets further massaged by ptxas into GPU instructions. :-(

In this test case, is there only a single additional loop being peeled? If there are multiple, is it possible to localize which one is the problematic one? Can you identify which of the removed conditions in canPeel() are the ones that are relevant? Does the problematic loop contain convergent instructions as speculated before?

The bug triggers in IR generated at runtime for a fairly convoluted computation on sparse matrices. Attempts to reduce the test case result in substantially different generated IR and make the issue disappear.

To answer your questions:

  • There are multiple loops.
  • It's hard to impossible to tell which loop causes the problem, because the nature of the test makes it hard to precisely control the generated IR. It's also not feasible to detect the issue in the generated SASS by eyeballing it. I did manage to do that on a few occasions in the past, but that was with much smaller code.
  • Hence I can't tell where things went wrong.
  • The code does have a lot of calls to the @llvm.nvvm.barrier0 intrinsic, which does have the convergent attribute and does need control flow to remain 'structured' so that the GPU code can be correctly instrumented to re-converge after divergent conditional branches (see the sketch right below).
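
For reference, a minimal hypothetical IR fragment (not extracted from our failing test) of the pattern in question, a convergent barrier inside a loop with more than one exit:

declare void @llvm.nvvm.barrier0() convergent

loop:
  call void @llvm.nvvm.barrier0()
  br i1 %cc1, label %exit, label %latch   ; non-latch exit
latch:
  br i1 %cc2, label %loop, label %exit
exit: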

With D136643 we should be able to observe what exactly changes in the IR in that test, which may narrow the scope down to a subset of the loops involved.

Just some suggestions on how the problem could be narrowed a bit further.

Would it be possible to introduce a cc1 option disabling this kind of peeling, so we have some sort of workaround until we figure out a better way to deal with the issue?

For the record, D136643 has the opt option (I assume that's what you meant; a cc1 option would be unusual). I think that patch is too complicated as-is, but generally having a temporary option is fine.

This option is a workaround, not intended for general use. It needs to be a CC1 option because we need to pass it to the GPU cc1 sub-compilation, while we do want your change to keep working as-is during the host compilation. And yes, the patch is probably one of the more complicated on/off knobs I've seen. :-) If there's an easier way to control things, I'd be all for it, but I'll take whatever works without disrupting other LLVM users.