Disable jump threading into loop headers
ClosedPublic

Authored by kparzysz on Aug 7 2017, 8:52 AM.

Details

Summary

Consider this type of a loop:

for (...) {
  ...
  if (...) continue;
  ...
}

Normally, the "continue" would branch to the loop control code that checks whether the loop should continue iterating and which contains the (often) unique loop latch branch. In certain cases jump threading can "thread" the inner branch directly to the loop header, creating a second loop latch. Loop canonicalization would then transform this loop into a loop nest. The problem with this is that in such a loop nest neither loop is countable even if the original loop was. This may inhibit subsequent loop optimizations and be detrimental to performance.

Diff Detail

Repository
rL LLVM
kparzysz created this revision.Aug 7 2017, 8:52 AM

@bmakam: Isn't this very similar to some of your recent work?

@bmakam: Isn't this very similar to some of your recent work?

This is similar to r308422, but it was only limited to threading unconditional jumps in almost empty blocks. It seems like there are more places in jumpthreading that can turn a loop into non-canonical forms. We also saw in D29572 that jumpthreading can turn a loop into irreducible loop but it was performance neutral.

davide added a comment.Aug 8 2017, 8:23 AM

Do you have numbers/examples of performance delta(s) with this patch applied?

On our internal benchmarks it's mostly neutral, but the benchmark that motivated it improved by 5.3%. We have some extra code, though, that is supposed to deal with similar cases, so some of the impact may be reduced. By default it keeps the behavior unchanged, so for all other architectures it should have no effect.

Any thoughts on this?

Any thoughts on this?

I think that we should just disable this for all targets. JumpThreading is run early in the pipeline where we are supposed to be simplifying the IR, but importantly, canonicalizing it. Our canonical form must be chosen to enable subsequent optimizations, and if it's not, then we should fix that. The whole reason that LoopHeaders exists in the first place is to avoid this kind of thing. The comment above FindLoopHeaders reads:

/// FindLoopHeaders - We do not want jump threading to turn proper loop
/// structures into irreducible loops.  Doing this breaks up the loop nesting
/// hierarchy and pessimizes later transformations.  To prevent this from
/// happening, we first have to find the loop headers. ...

If this optimization is a good thing to do, then we should have a special run of JumpThreading late in the pipeline where we allow it to put loops into efficient-but-hard-to-analyze forms. That can certainly be follow-up work if it is found to be helpful.

kparzysz updated this revision to Diff 114038.Sep 6 2017, 11:38 AM
kparzysz retitled this revision from TTI interface for creating jump-threaded branches into loop headers to Disable jump threading into loop headers.
kparzysz edited the summary of this revision. (Show Details)
kparzysz added a reviewer: hfinkel.

Changed the code to simply disable jump threading into loop headers.

hfinkel accepted this revision.Sep 6 2017, 11:44 AM

Changed the code to simply disable jump threading into loop headers.

LGTM.

This revision is now accepted and ready to land.Sep 6 2017, 11:44 AM
This revision was automatically updated to reflect the committed changes.

On our internal benchmarks it's mostly neutral, but the benchmark that motivated it improved by 5.3%. We have some extra code, though, that is supposed to deal with similar cases, so some of the impact may be reduced. By default it keeps the behavior unchanged, so for all other architectures it should have no effect.

@kparzysz, This change regressed spec2017/perlbench by 3% on Falkor. Perhaps we need a late jump-threading pass like Hal suggested?

Have you tried it? Changing jump threading to have early/late versions is trivial, the question is when should the late one run, and if that would help in the first place.

Have you tried it? Changing jump threading to have early/late versions is trivial, the question is when should the late one run, and if that would help in the first place.

I reverted this change and it recovered the regression. I was going to try to re-run the whole jumpthreading pass after latesimplifycfg to see if it would help, but I don't know if it is a good place because LSR that runs after, will not have the loop in canonical form and loopsimplify will again turn it into irreducible loop. Another place I was thinking was to try before CGP.

I'll come up with a late jump threading pass that you can try plugging in, in addition to having the existing one.

Late jump threading is in D37816.

Have you tried it? Changing jump threading to have early/late versions is trivial, the question is when should the late one run, and if that would help in the first place.

I reverted this change and it recovered the regression. I was going to try to re-run the whole jumpthreading pass after latesimplifycfg to see if it would help, but I don't know if it is a good place because LSR that runs after, will not have the loop in canonical form and loopsimplify will again turn it into irreducible loop. Another place I was thinking was to try before CGP.

If it'll work right before CGP, that's probably best. That way targets can run IR-level passes that want to look at loops before that.

Have you tried it? Changing jump threading to have early/late versions is trivial, the question is when should the late one run, and if that would help in the first place.

I reverted this change and it recovered the regression. I was going to try to re-run the whole jumpthreading pass after latesimplifycfg to see if it would help, but I don't know if it is a good place because LSR that runs after, will not have the loop in canonical form and loopsimplify will again turn it into irreducible loop. Another place I was thinking was to try before CGP.

If it'll work right before CGP, that's probably best. That way targets can run IR-level passes that want to look at loops before that.

I tried right after latesimplifycfg and it almost recovers the regression. However, running it right before CGP made it even worse for performance of spec2017/perlbench. It seems to be due to some kind of bad interaction with LSR, I haven't digged deeper. Would running right before LSR be reasonable?

Have you tried it? Changing jump threading to have early/late versions is trivial, the question is when should the late one run, and if that would help in the first place.

I reverted this change and it recovered the regression. I was going to try to re-run the whole jumpthreading pass after latesimplifycfg to see if it would help, but I don't know if it is a good place because LSR that runs after, will not have the loop in canonical form and loopsimplify will again turn it into irreducible loop. Another place I was thinking was to try before CGP.

If it'll work right before CGP, that's probably best. That way targets can run IR-level passes that want to look at loops before that.

I tried right after latesimplifycfg and it almost recovers the regression. However, running it right before CGP made it even worse for performance of spec2017/perlbench. It seems to be due to some kind of bad interaction with LSR, I haven't digged deeper. Would running right before LSR be reasonable?

We need to figure out if LSR is doing something reasonable. It's possible that LSR is actually causing the regression, and old jumpthreading is just serving to obscure the loop structure (thus preventing LSR from doing whatever bad thing it's doing).

On another note, this patch was meant to prevent adding extra branches to the header from within the loop. Branches to the header that come from outside of the loop should be fine. In other words, it should be ok to refine the check to only disable jump threading when any of the PredBBs is dominated by the header (SuccBB).

We need to figure out if LSR is doing something reasonable. It's possible that LSR is actually causing the regression, and old jumpthreading is just serving to obscure the loop structure (thus preventing LSR from doing whatever bad thing it's doing).

It seems like LSR is doing the right thing, the difference in codegen is due to loopsimplify that is required in LSR. If LoopSimplify is required in CGP, then running LateJumpThreading right before CGP (D37866) completely recovers the regression.