This is an archive of the discontinued LLVM Phabricator instance.

Move SeparateConstOffsetFromGEPPass() before LSR() and enable EnableGEPOpt by default.
ClosedPublic

Authored by gsocshubham on Jun 25 2022, 5:27 AM.

Details

Summary

GEPs across basic blocks were not being split because EnableGEPOpt was turned off by default. Hence, EarlyCSE missed the opportunity to eliminate the common part of those GEPs. This can be achieved simply by turning the GEP pass on.

  1. This patch moves SeparateConstOffsetFromGEPPass() just before LSR().
  2. It enables EnableGEPOpt by default.

Resolves - https://github.com/llvm/llvm-project/issues/50528

Added a unit test.
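For illustration, here is a hypothetical C analogue (not the patch's actual test case) of what SeparateConstOffsetFromGEP does: it splits the constant offset out of an address computation so that the variable part becomes a common subexpression that EarlyCSE can then eliminate across basic blocks.

```c
#include <assert.h>
#include <stddef.h>

/* Before splitting (conceptually): each branch computes its own full
 * address a + i + <const>, so no common part is exposed to CSE. */
int *addr_before(int *a, size_t i, int which) {
    return which ? &a[i + 5] : &a[i + 7];
}

/* After splitting (conceptually): the variable part a + i is computed
 * once, and only the constant offsets 5 and 7 differ, so EarlyCSE can
 * share the a + i computation across basic blocks. */
int *addr_after(int *a, size_t i, int which) {
    int *common = a + i;              /* the split-out common part */
    return which ? common + 5 : common + 7;
}
```

Both helpers compute identical addresses; only the second exposes the shared `a + i` computation for reuse.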

Diff Detail

Event Timeline

gsocshubham created this revision.Jun 25 2022, 5:27 AM
gsocshubham requested review of this revision.Jun 25 2022, 5:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 25 2022, 5:27 AM
gsocshubham edited the summary of this revision. (Show Details)Jun 25 2022, 5:28 AM
gsocshubham edited the summary of this revision. (Show Details)

Below are SPEC intrate benchmark results obtained on commit f6c79c6ae49f3a642bebe32a2346186c38bb83d7

Peak tuning(-O3)

Benchmark          %change

500.perlbench_r    2.173
505.mcf_r          -0.472
520.omnetpp_r      0.055
523.xalancbmk_r    -0.939
525.x264_r         -4.048
531.deepsjeng_r    4.944
541.leela_r        0.604
557.xz_r           0.209

Base tuning(-O2)

Benchmark          %change

500.perlbench_r    1.341
505.mcf_r          -0.526
520.omnetpp_r      0.961
523.xalancbmk_r    1.257
525.x264_r         -6.739
531.deepsjeng_r    0
541.leela_r        -0.462
557.xz_r           0.522

The benchmark with the largest regression is 525.x264_r. Most of the benchmarks show improved results.

gsocshubham added inline comments.Jun 25 2022, 5:40 AM
llvm/lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
1050

Let me know best course of action on lowerToSingleIndexGEPs().

I have temporarily removed it because I was running SPEC benchmarks for AArch64, and GEPs were being split using lowerToSingleIndexGEPs() when compiled via clang.

But when run via opt, lowerToArithmetics() gets called. I want lowerToArithmetics() to be the default.

Also, this pass was no longer used.

Please review and suggest changes.

xbolva00 added a subscriber: xbolva00.EditedJun 25 2022, 6:40 AM

Missing context
Missing test
The x264_r regression is a big one.
Many tests fail in precommit CI - please update them

dmgreen added a subscriber: dmgreen.Jul 1 2022, 5:43 AM

Hello. Unfortunately, I doubt that people will be in favour of this approach, especially if it is introducing ptr2int and int2ptr's so early in the pass pipeline. It looks like a pass that needs to be run as part of the backend.

There is a run of the pass in the AArch64 backend already, but it is disabled by default: https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AArch64/AArch64TargetMachine.cpp#L581
Enabling it didn't seem to help with the original case in #50528. Would it be possible to teach it what it needs to for that case, and from there enable the EnableGEPOpt flag?


Hello David,

Passing -aarch64-enable-gep-opt=true -O3 would do the job - https://clang.godbolt.org/z/hcjsh1vex - which is what this patch was proposing. WDYT?

Enabling it didn't seem to help with the original case in #50528. Would it be possible to teach it what it needs to for that case, and from there enable the EnableGEPOpt flag?

Sure. For #50528, enabling EnableGEPOpt reduces GEP instructions by half but is done by splitting GEP into ptr2int and int2ptr.
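For readers unfamiliar with the lowering: a rough C sketch of what the ptr2int/int2ptr splitting amounts to (illustrative only - the struct layout and helper names are made up; the pass itself emits ptrtoint/add/inttoptr IR):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A stand-in for the %struct.state_t from the issue (field layout is
 * invented for illustration). */
struct state_t { int a; int b; int c; };

/* GEP-style addressing: the address of the third field, kept as
 * pointer arithmetic. */
int *field_ptr_gep(struct state_t *s) {
    return &s->c;
}

/* lowerToArithmetics()-style lowering: ptr2int, an integer add of the
 * byte offset, then int2ptr.  offsetof stands in for the constant
 * offset the pass would materialize. */
int *field_ptr_arith(struct state_t *s) {
    uintptr_t base = (uintptr_t)s;                       /* ptr2int */
    uintptr_t addr = base + offsetof(struct state_t, c);
    return (int *)addr;                                  /* int2ptr */
}
```

The two helpers yield the same address; the second form is what makes optimizations harder for later IR passes when it appears early in the middle-end pipeline.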

nikic requested changes to this revision.Jul 4 2022, 12:14 PM
nikic added a subscriber: nikic.

Hello. Unfortunately, I doubt that people will be in favour of this approach, especially if it is introducing ptr2int and int2ptr's so early in the pass pipeline. It looks like a pass that needs to be run as part of the backend.

Indeed. This pass, especially in lowerToArithmetics() mode, is only usable in the backend.

This revision now requires changes to proceed.Jul 4 2022, 12:14 PM

Passing -aarch64-enable-gep-opt=true -O3 would do the job - https://clang.godbolt.org/z/hcjsh1vex what this patch was proposing. WDYT?

..

Sure. For #50528, enabling EnableGEPOpt reduces GEP instructions by half but is done by splitting GEP into ptr2int and int2ptr.

Oh yeah I see it does. I must have missed the -O3 off the time I tried it. That's good. I wonder why it was never enabled in the past.

From the look of it, it runs after LSR, which I think it would need to run before; otherwise it is likely to mess up what LSR has tried to do. That would be before the call to TargetPassConfig::addIRPasses(). I'm not sure the LICM run is necessary either, but I see it is used in other backends. We would need to gather some benchmarks to see how it behaves - e.g., do the issues in x264 still occur, and what happens across more benchmark cases?

gsocshubham added a comment.EditedJul 7 2022, 10:47 PM

Will running it before LSR make any difference compared to current location?

The issue in x264 was occurring due to the pass being registered very early, at the IR level. It is irrelevant now, since the pass is already enabled in AArch64 when the relevant flags are passed. As we were given the suggestion to move the GEP pass just before SelectionDAG, I have moved its registration from addIRPasses() to the end of AArch64PassConfig::addCodeGenPrepare(), and there is no regression in the x264 benchmark result compared to master.

Everything in AArch64PassConfig::addIRPasses counts as the "backend" from the point of view of LLVM. These are still LLVM-IR passes, but they run as part of the backend, prior to ISel, to perform certain target-specific optimizations and help with lowering that are easier to do on IR than on MIR. I think the pass really needs to run before the call to LSR, as LSR will be very opinionated about the GEPs in loops and we don't want to mess that up. LSR is added by TargetPassConfig::addIRPasses(), so moving the SeparateConstOffsetFromGEP passes anywhere before that call in AArch64PassConfig::addIRPasses should be OK.
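Concretely, the ordering described here looks roughly like the following sketch of the backend pipeline setup (abridged pseudocode, not an exact copy of AArch64TargetMachine.cpp):

```
if (TM->getOptLevel() == CodeGenOpt::Aggressive && EnableGEPOpt) {
  // Split constant offsets out of GEPs, then clean up the exposed
  // common subexpressions, before TargetPassConfig adds LSR below.
  addPass(createSeparateConstOffsetFromGEPPass(true));
  addPass(createEarlyCSEPass());
  addPass(createLICMPass());
}
TargetPassConfig::addIRPasses(); // adds LSR, among other IR passes
```

The key point is that the GEP-splitting block sits before the TargetPassConfig::addIRPasses() call, so LSR sees the already-split GEPs rather than having its work undone.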

Move GEP pass before LSR pass.

I have moved the GEP pass just before the LSR pass, and below are benchmark results at peak (-O3) -

Benchmark              %change w.r.t master
500.perlbench_r ->     2.085
505.mcf_r       ->     0.466
520.omnetpp_r   ->     0.607
523.xalancbmk_r ->     -0.515
531.deepsjeng_r ->     0.326
541.leela_r     ->     -0.980
557.xz_r        ->     -0.0571

Regarding teaching the SeparateConstOffsetFromGEPPass to handle all-constant GEPs -

From #50528, there is only one GEP with all-constant indices, shown below -

  store i32 %inc, i32* getelementptr inbounds (%struct.state_t, %struct.state_t* @s, i64 0, i32 2)

At first this was not considered a GEP at all, as it is folded into a store instruction, so I split it into two instructions in the test case -

  %temp = getelementptr inbounds %struct.state_t, %struct.state_t* @s, i64 0, i32 2
  store i32 %inc, i32* %temp

Even then, %temp is not considered by the pass, since it has all-constant indices -

// The backend can already nicely handle the case where all indices are
// constant.
if (GEP->hasAllConstantIndices())
  return false;

I tried forcefully splitting it by bypassing these checks, but it does not seem right to split a GEP with all-constant indices at the IR level. According to the comment, the backend should take care of it, but I still see repeated instructions -

madd    x8, x8, x10, x9
madd    x0, x11, x12, x8

I see two occurrences of the above pair of instructions in the assembly. Ideally they should occur only once.

Reference - https://clang.godbolt.org/z/KfrfT97hn

It sounds like this would be a good first step, and we can look into the other geps in the issue in a separate patch if needed. Can you:

  • Change the EnableGEPOpt from false to true
  • Add a test case from the bug, run through llc to show the codegen improvements.
  • Update the patch with full context
nikic resigned from this revision.Jul 14 2022, 12:48 PM

(Patch no longer modifies the middle-end pipeline in unacceptable ways)

gsocshubham retitled this revision from Enable SeparateConstOffsetFromGEPPass() at -O3 and -O2 to Move SeparateConstOffsetFromGEPPass() before LSR() and enable EnableGEPOpt by default..
gsocshubham edited the summary of this revision. (Show Details)

a. Updated patch with context.
b. Updated title and summary.
c. Moved SeparateConstOffsetFromGEPPass() before LSR()
d. Added a unit test.

This patch does not enable splitting of GEPs with all-constant indices - SeparateConstOffsetFromGEP.cpp:969

// The backend can already nicely handle the case where all indices are
// constant.
if (GEP->hasAllConstantIndices())
  return false;

Thanks - Please reread this and see what "with full context" means: https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface. The patch should be created with -U9999999.

I was thinking of the original longer test from the issue, but it is a bit long. This one might be OK too.

llvm/test/CodeGen/AArch64/cond-br-tuning.ll
33 ↗(On Diff #444933)

Change the store here to store 10 not 0. That should keep this testing what it did previously.

llvm/test/Transforms/SeparateConstOffsetFromGEP/split-gep.ll
2 ↗(On Diff #444933)

Remove -aarch64-enable-gep-opt - it is on by default now.

30 ↗(On Diff #444933)

Remove extra line.

Updated patch using -U9999999 and fixed review comments.

llvm/test/CodeGen/AArch64/cond-br-tuning.ll
33 ↗(On Diff #444933)

Done.

Thanks - Please reread this and see what "with full context" means: https://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface. The patch should be created with -U9999999.

Understood. Thanks. I have updated the patch using -U9999999. Is it fine now?

I was thinking of the original longer test from the issue, but it is a bit long. This one might be OK too.

gsocshubham edited the summary of this revision. (Show Details)Jul 20 2022, 1:16 AM

Thanks. This looks good, but did the test go missing from the previous revision?

llvm/test/CodeGen/AArch64/O3-pipeline.ll
1 ↗(On Diff #446067)

This should be removed, as the file doesn't use the script.

Fixed review comments.

llvm/test/Transforms/SeparateConstOffsetFromGEP/split-gep.ll
2 ↗(On Diff #444933)

Done.

30 ↗(On Diff #444933)

Done.

Thanks. This looks good, but did the test go missing from the previous revision?

Yes. It got removed. I have added it.

On my latest revision, the build failed. Is it due to pre-merge test failures?

Build Status
    Buildable 176471
    Build 265103: pre-merge checks - x64 windows failed · x64 debian failed

What script is used for building? Can someone please point me to it? Locally, the build passes. I want to replicate the current build locally with the same config options.

dmgreen accepted this revision.Jul 20 2022, 11:31 PM

Thanks. The buildbot failure seems unrelated.

This LGTM if you can alter the test slightly and things are passing locally.

llvm/test/CodeGen/AArch64/cond-br-tuning.ll
36 ↗(On Diff #446108)

%d = icmp ne i32 %c, 10

39 ↗(On Diff #446108)

store i32 10, i32* %ptr, align 4

This revision is now accepted and ready to land.Jul 20 2022, 11:31 PM
gsocshubham added inline comments.Jul 21 2022, 5:19 AM
llvm/test/CodeGen/AArch64/cond-br-tuning.ll
36 ↗(On Diff #446108)

Do you mean to change it to %d = icmp ne i32 %c, 0?

It is already %d = icmp ne i32 %c, 10

dmgreen added inline comments.Jul 21 2022, 5:27 AM
llvm/test/CodeGen/AArch64/cond-br-tuning.ll
36 ↗(On Diff #446108)

Yeah, sorry - that is what I meant. The store should store a value that isn't 0, so that the csel isn't optimized away. The cmp should still not be present (it does get optimized away, as it can reuse the adds flags).

Fixed review comments.

There is only one LIT test which is failing now - CodeGen/AArch64/GlobalISel/arm64-irtranslator.ll.

gsocshubham added inline comments.Jul 21 2022, 8:28 AM
llvm/test/CodeGen/AArch64/cond-br-tuning.ll
36 ↗(On Diff #446108)

Done.

39 ↗(On Diff #446108)

Done.

Regarding CodeGen/AArch64/GlobalISel/arm64-irtranslator.ll: it is failing due to a change in the GEP representation. There are -O0 and -O3 checks, of which the -O3 checks are failing. I tried to update the test case using ../../../../utils/update_mir_test_checks.py --llc-binary=../../../../../install/bin/llc arm64-irtranslator.ll, which updated the checks, but now the -O0 checks are failing. The test case contains 6k lines and is difficult to update manually.

I used this option to update just the O3 checks, but that is not helping -
../../../../utils/update_mir_test_checks.py --llc-binary=../../../../../install/bin/llc arm64-irtranslator.ll --filter O3

How do I update just the O3 checks for CodeGen/AArch64/GlobalISel/arm64-irtranslator.ll? I have updated CodeGen/AArch64/GlobalISel/arm64-irtranslator-gep.ll using utils/update_mir_test_checks.py successfully.

There is only one LIT test which is failing now - CodeGen/AArch64/GlobalISel/arm64-irtranslator.ll.

It looks like it is removing an unused alloca. Try this:

index ef559652380e..3fb33ecbb2c7 100644
--- a/llvm/test/CodeGen/AArch64/GlobalISel/arm64-irtranslator.ll
+++ b/llvm/test/CodeGen/AArch64/GlobalISel/arm64-irtranslator.ll
@@ -1458,10 +1458,12 @@ define void @test_lifetime_intrin() {
 ; O3-LABEL: name: test_lifetime_intrin
 ; O3: {{%[0-9]+}}:_(p0) = G_FRAME_INDEX %stack.0.slot
 ; O3-NEXT: LIFETIME_START %stack.0.slot
+; O3-NEXT: G_STORE
 ; O3-NEXT: LIFETIME_END %stack.0.slot
 ; O3-NEXT: RET_ReallyLR
   %slot = alloca i8, i32 4
   call void @llvm.lifetime.start.p0i8(i64 0, i8* %slot)
+  store volatile i8 10, i8* %slot
   call void @llvm.lifetime.end.p0i8(i64 0, i8* %slot)
   ret void
 }
llvm/test/CodeGen/AArch64/GlobalISel/arm64-irtranslator-gep.ll
54 ↗(On Diff #446511)

Make these %v1, %v2 - I think that is better. One of the loads is otherwise unused.

llvm/test/Transforms/SeparateConstOffsetFromGEP/split-gep.ll
1 ↗(On Diff #446511)

Oh - if this test is using the AArch64 target then it needs to be moved into an AArch64 subdirectory, which is only run when AArch64 is a registered target. If one doesn't exist already, make sure there is a lit.local.cfg.

Fix review comments and lit test failures - AArch64/GlobalISel/arm64-irtranslator.ll and AArch64/GlobalISel/arm64-irtranslator-gep.ll

llvm/test/CodeGen/AArch64/GlobalISel/arm64-irtranslator-gep.ll
54 ↗(On Diff #446511)

Done.

Thanks a lot for this.

Are there any other suggestions on this patch?

dmgreen accepted this revision.Jul 22 2022, 4:58 AM

Yeah - the patch LGTM. Thanks.

Do you have commit access? If not, I can commit it for you, I just need a "name <name@email.com>" to attribute it to.

I do not have commit access. Please commit for me.

Here are the details -

Name - Shubham Narlawar
Email - shubham.narlawar@rrlogic.co.in

"Shubham Narlawar <shubham.narlawar@rrlogic.co.in>"

This revision was landed with ongoing or failed builds.Jul 22 2022, 7:21 AM
This revision was automatically updated to reflect the committed changes.