This is an archive of the discontinued LLVM Phabricator instance.

LFTR for multiple exit loops
ClosedPublic

Authored by reames on May 29 2019, 1:27 PM.

Details

Summary

Teach IndVarSimply's LinearFunctionTestReplace transform to handle multiple exit loops. LFTR does two key things 1) it rewrites (all) exit tests in terms of a common IV potentially eliminating one in the process and 2) it moves any offset/indexing/f(i) style logic out of the loop.

This turns out to actually be pretty easy to implement. SCEV already has all the information we need to know what the backedge taken count is for each individual exit. (We use that when computing the BE taken count for the loop as a whole.) We basically just need to iterate through the exiting blocks and apply the existing logic with the exit specific BE taken count. (The previously landed NFC makes this super obvious.)

I chose to go ahead and apply this to all loop exits instead of only latch exits as originally proposed. After reviewing other passes, the only case I could find where LFTR form was harmful was LoopPredication. I've fixed the latch case, and guards aren't LFTRed anyways. We'll have some more work to do on the way towards widenable_conditions, but that's easily deferred.

Diff Detail

Repository
rL LLVM

Event Timeline

reames created this revision.May 29 2019, 1:27 PM
reames updated this revision to Diff 202256.May 30 2019, 11:18 AM

Clean up towards a practical patch. Need to land a bunch of tests, and then rebase. Also included the LoopPred fix needed to make the increase in scope non-pessimizing in practice. (There may be more pieces like that needed.)

reames updated this revision to Diff 202750.Jun 3 2019, 10:14 AM

Rebase on tests, and minor generalization. Still have a couple of other test failures to track down, so not yet ready for review.

reames planned changes to this revision.Jun 4 2019, 2:30 PM

The NFC portion of this turned out to be a bit more involved than expected. It's been separated into D62880. Once that lands, I'll rebase for the small remaining functional bit. I'm debating just going to all loop exits instead of just the latch exit. On further reflection, I'm not sure it's really going to matter impact wise, and fleshing out problems faster might be worthwhile.

Also, this is still blocked on one last correctness issue in the existing LFTR implementation. Specifically, we're not clearing the inbounds parameter on a pointer IV when moving to a dynamically dead IV. That needs fixed before we generalize the code this much and increase the number of opportunities for miscompiles sharply.

reames updated this revision to Diff 204179.Jun 11 2019, 3:21 PM
reames retitled this revision from [WIP] LFTR for multiple exit loops to LFTR for multiple exit loops.
reames edited the summary of this revision. (Show Details)
reames added reviewers: nikic, apilipenko, sanjoy.
reames set the repository for this revision to rL LLVM.
reames added a subscriber: llvm-commits.

Rebase on landed prep patches. Ready for actual review, though won't land until after https://reviews.llvm.org/D62939.

Herald added a project: Restricted Project. · View Herald TranscriptJun 11 2019, 3:21 PM

This looks good, but I'd like to see some tests that involve an IV change. In particular I would expect that this transformation can go from exit conditions on two different IVs, to exit conditions on a single IV and it would be nice to see that.

reames updated this revision to Diff 204387.Jun 12 2019, 4:50 PM

rebase on requested tests

Just to note, I'm deliberately not adding any tests for possibly poison IVs. I don't think there are any unique issues for multiple exit IVs, and giving that I want to leave the coverage concentrated on things which are unique to multiple exits.

nikic accepted this revision.Jun 13 2019, 10:10 AM

LGTM, thanks for the extra tests.

lib/Transforms/Scalar/IndVarSimplify.cpp
2634 ↗(On Diff #204387)

Rename BETakenCount to ExitCount here and also in linearFunctionTestReplace, possibly as NFC precommit? With multiple exits, this is not the backedge count anymore...

This revision is now accepted and ready to land.Jun 13 2019, 10:10 AM
reames marked an inline comment as done.Jun 13 2019, 11:38 AM
reames added inline comments.
lib/Transforms/Scalar/IndVarSimplify.cpp
2634 ↗(On Diff #204387)

Done in 363293.

This revision was automatically updated to reflect the committed changes.

I just submitted rL363964 which fixes a bug introduced in this change. I can't find an example which actually triggers it so it may be entirely latent, but if anyone is chasing a miscompile tied to this change, try applying that patch.

Hello,
I am going to be a bit unhelpful here. We see some regressions caused by this change, but I didn't have a chance to look into it today and so I don't have a reproducer. And here's my unhelpful question: are you aware or have you seen regressions, or are you still working on this? I am not really familiar with this pass, but from reading the description, it should do good things. Taking a wild guess, I see changes in the tests and they change comparisons from ult to ne, could it for example be that this confuses other passes/analysis?
Anyway, I will try to look into this tomorrow.
Cheers.

Hello,
I am going to be a bit unhelpful here. We see some regressions caused by this change, but I didn't have a chance to look into it today and so I don't have a reproducer. And here's my unhelpful question: are you aware or have you seen regressions, or are you still working on this? I am not really familiar with this pass, but from reading the description, it should do good things. Taking a wild guess, I see changes in the tests and they change comparisons from ult to ne, could it for example be that this confuses other passes/analysis?
Anyway, I will try to look into this tomorrow.
Cheers.

Just to be clear, you mean performance regressions correct? Not correctness ones?

I won't be surprised if this exposes weaknesses in other passes. If you can find me examples, I'm happy to go about trying to fix them. If there's too many, or the regressions are too large, we can also disable this behind a flag while we work through issues.

One case I know of is that you have a hot loop which runs very few iterations, this could end up being a regression. I have ideas on how to address that, but since they're involved, I'd really like to confirm a test case first.

Look forward to more details when you have them.

I just ran into a performance regression due to this commit. With https://martin.st/temp/glew-preproc-i686.c, compiling with clang -target i686-w64-mingw32 -c -O3 glew-preproc-i686.c went from 73 seconds before this commit, to 133 seconds afterwards.

There should be also positive impact on LSR pass from this change, since LSR likes equality comparisons.

I just ran into a performance regression due to this commit. With https://martin.st/temp/glew-preproc-i686.c, compiling with clang -target i686-w64-mingw32 -c -O3 glew-preproc-i686.c went from 73 seconds before this commit, to 133 seconds afterwards.

Clarification here. Do you mean this was a compile time regression? Or a runtime regression? (i.e. is your clang here build with a different compiler? Or is this a self host?)

I just ran into a performance regression due to this commit. With https://martin.st/temp/glew-preproc-i686.c, compiling with clang -target i686-w64-mingw32 -c -O3 glew-preproc-i686.c went from 73 seconds before this commit, to 133 seconds afterwards.

Clarification here. Do you mean this was a compile time regression? Or a runtime regression? (i.e. is your clang here build with a different compiler? Or is this a self host?)

Compile time regression. (I build the project it's part of with a 120 second timeout per translation unit, enforced with ulimit -t 120, that's why I noticed.)

Just to confirm that in my case it's a performance regression.
I am going to take a look now.

I think I am mainly looking at micro-architecture sensitivities and some unfortunate knock on effects of this patch.
I don't think I can blame anything of that on this patch. If I change my mind about that, I will let you know. :-) But then I'd of course have to come up with a reproducer (or a fix).
Thanks for your help, sorry for the noise!

Compile time regression. (I build the project it's part of with a 120 second timeout per translation unit, enforced with ulimit -t 120, that's why I noticed.)

If you could provide an IR reproducer, and ideally a hint where to look, I can see what I find. Please file a bug with reproduction instructions and an IR attachment.

I think I am mainly looking at micro-architecture sensitivities and some unfortunate knock on effects of this patch.
I don't think I can blame anything of that on this patch. If I change my mind about that, I will let you know. :-) But then I'd of course have to come up with a reproducer (or a fix).
Thanks for your help, sorry for the noise!

Np. I'm happy to help debate fixes for issues you might find. Feel free to add me as a reviewer (if appropriate), or start an email conversation with preliminary findings if it's helpful.

Compile time regression. (I build the project it's part of with a 120 second timeout per translation unit, enforced with ulimit -t 120, that's why I noticed.)

If you could provide an IR reproducer, and ideally a hint where to look, I can see what I find. Please file a bug with reproduction instructions and an IR attachment.

A C level reproducer is already linked (https://martin.st/temp/glew-preproc-i686.c, compiled with clang -target i686-w64-mingw32 -c -O3 glew-preproc-i686.c) I filed https://bugs.llvm.org/show_bug.cgi?id=42357 now as well, with that, and matching IR for reproducing on that level. The issue is not only that compilation takes longer, the output also almost doubled in size.