Page MenuHomePhabricator

[InstCombine] fold cast of right-shift if high bits are not demanded
ClosedPublic

Authored by spatel on Sep 21 2021, 6:52 AM.

Details

Summary

(masked) trunc (lshr X, ShiftC) --> (masked) lshr (trunc X), C

Narrowing the shift should be better for analysis and can lead to follow-on transforms as shown.

Attempt at the general proof in Alive2:
https://alive2.llvm.org/ce/z/tRnnSF

Here are a couple of the specific tests:
https://alive2.llvm.org/ce/z/bCnTp-

Diff Detail

Event Timeline

spatel created this revision.Sep 21 2021, 6:52 AM
spatel requested review of this revision.Sep 21 2021, 6:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 21 2021, 6:53 AM

Should this be demandedbits-driven?

spatel planned changes to this revision.Sep 21 2021, 7:54 AM

Should this be demandedbits-driven?

Yes - I thought about that while drafting this, but I didn't follow through. I'll conjure some new tests and try that.

spatel updated this revision to Diff 373972.Sep 21 2021, 10:00 AM

Patch updated:
Generalized to be a demanded-bits fold. I added tests ending in 'or' rather than just 'and'. ( https://alive2.llvm.org/ce/z/TfaHnb )
I think we already fold other cases (for example, ending with a left-shift).

spatel retitled this revision from [InstCombine] fold cast between shift and mask to [InstCombine] fold cast of right-shift if high bits are not demanded.Sep 21 2021, 10:01 AM
spatel edited the summary of this revision. (Show Details)
lebedev.ri added inline comments.Sep 21 2021, 10:17 AM
llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
397
lebedev.ri accepted this revision.Sep 21 2021, 10:31 AM

Looks fine to me.

llvm/lib/Transforms/InstCombine/InstCombineSimplifyDemanded.cpp
397

No, ignore me.

This revision is now accepted and ready to land.Sep 21 2021, 10:31 AM
This revision was landed with ongoing or failed builds.Sep 21 2021, 1:10 PM
This revision was automatically updated to reflect the committed changes.

This is causing timeouts on a number of multistage builds. I have seen it at least on PPC, SystemZ and AArch64:
PPC: https://lab.llvm.org/buildbot/#/builders/121/builds/11680
PPC: https://lab.llvm.org/buildbot/#/builders/36/builds/12596
SystemZ: https://lab.llvm.org/buildbot/#/builders/8/builds/1821
AArch64: https://lab.llvm.org/buildbot/#/builders/179/builds/1073

The timeout happens when the stage 1 compiler is building lib/Transforms/Coroutines/CoroFrame.cpp
I'll see if I can track down what is causing the infinite loop/recursion.

I also run into infinite loops caused by this commit. It's reproducible with https://martin.st/temp/qsettings-preproc.cpp with clang -target x86_64-w64-mingw32 -c qsettings-preproc.cpp -O3 -std=c++17. But that's not minimzed/reduced at all (and that source file takes a pretty significant amount of time to compile even to begin with).

I also run into infinite loops caused by this commit. It's reproducible with https://martin.st/temp/qsettings-preproc.cpp with clang -target x86_64-w64-mingw32 -c qsettings-preproc.cpp -O3 -std=c++17. But that's not minimzed/reduced at all (and that source file takes a pretty significant amount of time to compile even to begin with).

This is causing timeouts on a number of multistage builds. I have seen it at least on PPC, SystemZ and AArch64:
PPC: https://lab.llvm.org/buildbot/#/builders/121/builds/11680
PPC: https://lab.llvm.org/buildbot/#/builders/36/builds/12596
SystemZ: https://lab.llvm.org/buildbot/#/builders/8/builds/1821
AArch64: https://lab.llvm.org/buildbot/#/builders/179/builds/1073

The timeout happens when the stage 1 compiler is building lib/Transforms/Coroutines/CoroFrame.cpp
I'll see if I can track down what is causing the infinite loop/recursion.

Thanks for letting me know. I reverted the patch (and the follow-up that fixed a clang test failure at 52832cd917af0), so I'm not holding up those bots while we investigate. I'm trying to reduce Martin's file now, but it is taking a while! If anyone finds a smaller example, please do post it here.

I also run into infinite loops caused by this commit. It's reproducible with https://martin.st/temp/qsettings-preproc.cpp with clang -target x86_64-w64-mingw32 -c qsettings-preproc.cpp -O3 -std=c++17. But that's not minimzed/reduced at all (and that source file takes a pretty significant amount of time to compile even to begin with).

This is causing timeouts on a number of multistage builds. I have seen it at least on PPC, SystemZ and AArch64:
PPC: https://lab.llvm.org/buildbot/#/builders/121/builds/11680
PPC: https://lab.llvm.org/buildbot/#/builders/36/builds/12596
SystemZ: https://lab.llvm.org/buildbot/#/builders/8/builds/1821
AArch64: https://lab.llvm.org/buildbot/#/builders/179/builds/1073

The timeout happens when the stage 1 compiler is building lib/Transforms/Coroutines/CoroFrame.cpp
I'll see if I can track down what is causing the infinite loop/recursion.

Thanks for letting me know. I reverted the patch (and the follow-up that fixed a clang test failure at 52832cd917af0), so I'm not holding up those bots while we investigate. I'm trying to reduce Martin's file now, but it is taking a while! If anyone finds a smaller example, please do post it here.

I am running a reducer on the CoroFrame.cpp one and it is pretty small so far and still reducing. I'll provide it once it is done in case it turns out to be useful.

nemanjai added a comment.EditedSep 22 2021, 11:10 AM

Reduced test case at https://pastebin.com/NVYbsRfD

Run opt -O3 -mtriple=powerpc64le-- -disable-output file.ll and it doesn't terminate.

Reduced test case at https://pastebin.com/NVYbsRfD

Run opt -O3 -mtriple=powerpc64le-- -disable-output file.ll and it doesn't terminate.

Thanks! It should be easy to find the opposing transform now. I got that test down to an infinite loop with -instcombine and:

declare void @use(i64)

define i64 @t0(i64 %x) {
  %a = ashr i64 %x, 3
  call void @use(i64 %a)
  %tr = trunc i64 %a to i32
  %sh = lshr i32 %tr, 6
  %z = zext i32 %sh to i64
  ret i64 %z
}

Reduced test case at https://pastebin.com/NVYbsRfD

Run opt -O3 -mtriple=powerpc64le-- -disable-output file.ll and it doesn't terminate.

Thanks! It should be easy to find the opposing transform now. I got that test down to an infinite loop with -instcombine and:

That was easy (I hope)...
We had a transform that was creating extra instructions without checking uses correctly, so we'd end up ping-pong'ing between that and this one. I don't see infinite looping on either of the failure examples posted here after:
1cd6b44f267b

Just a heads-up, I'm seeing timeouts again now with "2nd try" commited.

I'll see if I can pull out a reproducer working on main too.

Just a heads-up, I'm seeing timeouts again now with "2nd try" commited.

I'll see if I can pull out a reproducer working on main too.

Ok:

opt -o /dev/null -passes='instcombine' hang.ll

with hang.ll being

target datalayout = "n32"

define i32 @f_t15_t01_t09(i40 %x) {
entry:
  store i40 %x, i40* undef, align 1
  %0 = load i40, i40* undef, align 1
  %1 = add i40 %0, 2147483647
  %2 = select i1 undef, i40 %1, i40 %0
  %downscale = ashr i40 %2, 31
  %resize = trunc i40 %downscale to i16
  %resize1 = sext i16 %resize to i32
  %upscale = shl i32 %resize1, 31
  ret i32 %upscale
}

Just a heads-up, I'm seeing timeouts again now with "2nd try" commited.

I'll see if I can pull out a reproducer working on main too.

Ok:

opt -o /dev/null -passes='instcombine' hang.ll

with hang.ll being

target datalayout = "n32"

define i32 @f_t15_t01_t09(i40 %x) {
entry:
  store i40 %x, i40* undef, align 1
  %0 = load i40, i40* undef, align 1
  %1 = add i40 %0, 2147483647
  %2 = select i1 undef, i40 %1, i40 %0
  %downscale = ashr i40 %2, 31
  %resize = trunc i40 %downscale to i16
  %resize1 = sext i16 %resize to i32
  %upscale = shl i32 %resize1, 31
  ret i32 %upscale
}

Thanks! I'll step into this in the debugger now.
Let me know if I should revert.

The root bug is in that same block that caused the previous problem.
I think it's time to fix that for good by decomposing it into simpler folds that won't conflict with other transforms.
So I reverted this patch again: 3c5500907b10

bjope added a subscriber: bjope.Oct 4 2021, 3:04 PM

Hi @spatel,

I noticed a regression in a downstream benchmark, that at least partly seem to be caused by it. Here is a reduced example: https://godbolt.org/z/M9MKjcYPG

From what I can see there is a quite early run of InstCombine in the O3 pipeline, which basically happens directly after GlobalOpt without any CSE in between. So in such an early run of InstCombine we do trigger transforms based on "one use", which wouldn't have happened if running CSE before InstCombine. I figure that might be a more general problem and not only specific to the rewrites introduced in this patch.

We'll analyse the regression a bit more (maybe there are other things that happens that contributes to the regression). But wanted to mention the above. And it makes me a bit curious if it is a general problem with that early instcombine run that "one use" checks might be fooled by not having done CSE after GlobalOpt.

spatel added a comment.Oct 5 2021, 7:23 AM

I noticed a regression in a downstream benchmark, that at least partly seem to be caused by it. Here is a reduced example: https://godbolt.org/z/M9MKjcYPG

From what I can see there is a quite early run of InstCombine in the O3 pipeline, which basically happens directly after GlobalOpt without any CSE in between. So in such an early run of InstCombine we do trigger transforms based on "one use", which wouldn't have happened if running CSE before InstCombine. I figure that might be a more general problem and not only specific to the rewrites introduced in this patch.

We'll analyse the regression a bit more (maybe there are other things that happens that contributes to the regression). But wanted to mention the above. And it makes me a bit curious if it is a general problem with that early instcombine run that "one use" checks might be fooled by not having done CSE after GlobalOpt.

Thanks for posting the example. That does seem like a general problem, and it's worth experimenting with the pass manager to see if reordering the passes makes things better or worse.
I'm not sure if we have an IR pass that is responsible for seeing that we have redundant shift ops like in the example. Is that a possible trick for GVN?
Also, I tried running the example through codegen for x86 and AArch64, and they both manage to eliminate the redundant extra shift after legalization. Is it possible that your target is missing a semi-generic SDAG transform?

bjope added a comment.Oct 5 2021, 3:32 PM

I noticed a regression in a downstream benchmark, that at least partly seem to be caused by it. Here is a reduced example: https://godbolt.org/z/M9MKjcYPG

From what I can see there is a quite early run of InstCombine in the O3 pipeline, which basically happens directly after GlobalOpt without any CSE in between. So in such an early run of InstCombine we do trigger transforms based on "one use", which wouldn't have happened if running CSE before InstCombine. I figure that might be a more general problem and not only specific to the rewrites introduced in this patch.

We'll analyse the regression a bit more (maybe there are other things that happens that contributes to the regression). But wanted to mention the above. And it makes me a bit curious if it is a general problem with that early instcombine run that "one use" checks might be fooled by not having done CSE after GlobalOpt.

Thanks for posting the example. That does seem like a general problem, and it's worth experimenting with the pass manager to see if reordering the passes makes things better or worse.
I'm not sure if we have an IR pass that is responsible for seeing that we have redundant shift ops like in the example. Is that a possible trick for GVN?
Also, I tried running the example through codegen for x86 and AArch64, and they both manage to eliminate the redundant extra shift after legalization. Is it possible that your target is missing a semi-generic SDAG transform?

The IR posted in godbolt was a bit reduced, and running the example through codegen gave the same result also for my target.
Although. the original IR looked a bit more like in this example https://godbolt.org/z/s8Krzrq36 , which show that the number of instructions in the loop increase from 16 to 19, for x86, when using opt from trunc instead of the 13.0.0 version. And afaict this patch is the main difference.

spatel added a comment.Oct 5 2021, 4:17 PM

Although. the original IR looked a bit more like in this example https://godbolt.org/z/s8Krzrq36 , which show that the number of instructions in the loop increase from 16 to 19, for x86, when using opt from trunc instead of the 13.0.0 version. And afaict this patch is the main difference.

Ah - the larger example is very interesting if I'm seeing it correctly. We're xor'ing 4 bits from some value?

That could be made significantly shorter in IR or codegen:
https://alive2.llvm.org/ce/z/bWgS_h
https://godbolt.org/z/sP6n13xd3
(Note that the x86 codegen is likely better for a target without popcount! I'll file some bugs tomorrow.)

Although. the original IR looked a bit more like in this example https://godbolt.org/z/s8Krzrq36 , which show that the number of instructions in the loop increase from 16 to 19, for x86, when using opt from trunc instead of the 13.0.0 version. And afaict this patch is the main difference.

I agree with your analysis, and I don't mean to dismiss the regression, but we might be able to optimize the larger example even better (either because of or in spite of this patch!).

I filed these bugs:
https://llvm.org/PR52092
https://llvm.org/PR52093
https://llvm.org/PR52094