Add `shl`, `lshr` and `ashr` instructions to the DAG post-dominated by `trunc`, allowing TruncInstCombine to reduce bitwidth of expressions containing shifts. Fixes PR50555.
Event Timeline
We cannot do this transform as proposed here; it increases the instruction count.
Could you point me at the test for this change?
It should only contain lshr-of-zext, and there should not be any trunc;
please add a test where zext has an extra use.
llvm/test/Transforms/InstCombine/pr50555.ll:9 (On Diff #365189)

@lebedev.ri: Yes, you're right, it increases the instruction count by adding a new zext when the old one has an extra use. But it also narrows the lshr to a smaller bit width, which simplifies it. Isn't that a good compromise? Also, after sinking the zext, it can trigger a chain of further combines with other instructions, such as add and trunc.
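For reference, a minimal sketch of the narrowing under discussion (illustrative IR, not taken from the patch):

```llvm
; Before: lshr-of-zext, with the shift done in the wide type.
%z = zext i8 %x to i32
%s = lshr i32 %z, 1

; After: the shift is done in the narrow type and the zext is sunk below it.
; If %z has another use, the original zext must stay around, so the
; instruction count grows by one -- which is the objection above.
%s8 = lshr i8 %x, 1
%s  = zext i8 %s8 to i32
```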
I think this can be adjusted in the SLP vectorizer. We have the MinBWs container in there, to try to operate on non-wide instructions. It probably needs a tweak to handle this case. Did you try to modify the collectValuesToDemote function?
FWIW, I agree that it obviously improves the SLP snippet in question; I'm just not sure this is the right way to do it. Sorry.
llvm/test/Transforms/InstCombine/pr50555.ll:9 (On Diff #365189)

To be noted, this is the profitability heuristic of instcombine: don't increase the instruction count.

Sure. But it still increases the instruction count.
If we want to solve this as an instcombine (or maybe aggressive-instcombine) problem, we have to expand the pattern to make it clearly profitable. I'm not sure how to generalize it, but we can do the narrowing starting from the trunc and remove an instruction:
https://alive2.llvm.org/ce/z/lwtDwZ
```llvm
define i16 @src(i8 %x) {
  %z = zext i8 %x to i32
  %s = lshr i32 %z, 1
  %a = add nuw nsw i32 %s, %z
  %s2 = lshr i32 %a, 2
  %t = trunc i32 %s2 to i16
  ret i16 %t
}

define i16 @tgt(i8 %x) {
  %z = zext i8 %x to i16
  %s = lshr i16 %z, 1
  %a = add nuw nsw i16 %s, %z
  %s2 = lshr i16 %a, 2
  ret i16 %s2
}
```
Thanks to all. I've moved this fix to aggressive-instcombine, where it is even planned in the TODO section.
llvm/test/Transforms/InstCombine/pr50555.ll:9 (On Diff #365189)

Ok, I see, thanks for noting this.
Nice, this seems to fit naturally there.
That being said, you probably still want some standalone tests for the pattern in question, both positive and negative ones - what's the requirement on the shift amount?
Agree - aggressive-instcombine doesn't get nearly as much testing as regular instcombine, so we need more tests to be confident it doesn't over-reach.
Leaving the shift amount off the getRelevantOperands() list doesn't work on this example (crash):
```llvm
define i16 @sh_amt(i8 %x, i8 %sh1) {
  %z = zext i8 %x to i32
  %zs = zext i8 %sh1 to i32
  %s = lshr i32 %z, %zs
  %a = add nuw nsw i32 %s, %z
  %s2 = lshr i32 %a, 2
  %t = trunc i32 %s2 to i16
  ret i16 %t
}
```
Some observations for logical right-shift: this will have a hard time with variable shift amounts.
You need to avoid creating out-of-bounds shifts; there are two obvious options:
- https://alive2.llvm.org/ce/z/XShcju <- shift amount needs to be less than target width, and truncation should only drop zeros.
- https://alive2.llvm.org/ce/z/QiDPV7 <- could saturate the shift amount if you know that %x has more leading zeros than the number of bits to be truncated
We might already have this logic somewhere, not sure.
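A minimal sketch of the first option (hypothetical test IR, not from the patch): the shift amount is a constant below the target width, and the zext guarantees that the truncation only drops zeros:

```llvm
define i16 @narrow_lshr(i16 %x) {
  %z = zext i16 %x to i32
  %s = lshr i32 %z, 3       ; 3 < 16, and the high 16 bits of %s are zero
  %t = trunc i32 %s to i16  ; drops only zeros
  ret i16 %t
}

; can be narrowed to:
define i16 @narrow_lshr.tgt(i16 %x) {
  %s = lshr i16 %x, 3
  ret i16 %s
}
```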
Yes, thanks for your observations; I'm already working on it: https://alive2.llvm.org/ce/z/XcCJ9Q

Special care is also needed for the vector case so that the transform does not become more poisonous.
TruncInstCombine already has appropriate logic but needs to be tweaked.
For now I'm assuming that the shift amount is a constant (scalar or vector). I'm not sure that extending the transform with a check for variable shift amounts is worthwhile.
> I'm not sure that extending the transform with a check for variable shift amounts is worthwhile.
Define "good". I think supporting variable shifts will take exactly two lines: compute the known bits of the shift amount, and get the maximal shift amount via KnownBits::getMaxValue().
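Roughly like this (a hedged sketch, not the actual patch; shiftAmtFitsInWidth is a hypothetical helper):

```cpp
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Value.h"
#include "llvm/Support/KnownBits.h"

// Hypothetical helper: conservatively bound a variable shift amount.
static bool shiftAmtFitsInWidth(const llvm::Value *ShAmt, unsigned TargetWidth,
                                const llvm::DataLayout &DL) {
  llvm::KnownBits Known = llvm::computeKnownBits(ShAmt, DL);
  // getMaxValue() is the largest value consistent with the known-zero bits,
  // i.e. a safe upper bound on the runtime shift amount.
  return Known.getMaxValue().ult(TargetWidth);
}
```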
> Define "good". I think supporting variable shifts will take exactly two lines: compute the known bits of the shift amount, and get the maximal shift amount via KnownBits::getMaxValue().
Hmm, how could we compute the known bits of the shift amount at compile time? Do you mean analyzing the DAG for the shift amount Value, taking known bits recursively?

Also, I don't believe this computation makes sense: in most cases, when the shift amount is variable, its low byte is unknown. For instance, how could known bits help optimize @spatel's example?
```llvm
define i16 @sh_amt(i8 %x, i8 %sh1) {
  %z = zext i8 %x to i32
  %zs = zext i8 %sh1 to i32
  %s = lshr i32 %z, %zs
  %a = add nuw nsw i32 %s, %z
  %s2 = lshr i32 %a, 2
  %t = trunc i32 %s2 to i16
  ret i16 %t
}
```
You've seen llvm::computeKnownBits(), right?
> Also, I don't believe this computation makes sense: in most cases, when the shift amount is variable, its low byte is unknown. For instance, how could known bits help optimize @spatel's example (the @sh_amt function above)?
I find this comment to be highly inflammatory.
Just because there's a large number of cases it won't help with doesn't mean it can't ever help with anything.
https://alive2.llvm.org/ce/z/RkkBTy <- we have no idea what %y is, but we can tell it's less than the target bitwidth.
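For instance, something along these lines (a hedged reconstruction of that idea; the exact IR behind the link is not reproduced here):

```llvm
define i8 @src(i8 %x, i16 %y) {
  %z = zext i8 %x to i16
  %m = and i16 %y, 7        ; %y is unknown, but known bits prove %m < 8
  %s = lshr i16 %z, %m
  %t = trunc i16 %s to i8
  ret i8 %t
}

; can be narrowed to:
define i8 @tgt(i8 %x, i16 %y) {
  %m = and i16 %y, 7
  %mt = trunc i16 %m to i8
  %s = lshr i8 %x, %mt
  ret i8 %s
}
```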
AggressiveInstCombine runs only with -O3, right? Do we know how expensive it would be for -O2?
Yes - only at O3 currently. That's mainly because nobody has bothered to see if it was worth fighting over to include at -O2.
That question is much easier to answer since we have compile-time-tracker. But I'm not sure how to answer the cost question directly - I think we can approximate it by just removing the pass from O3 and checking the difference.
The result seems to be a consistent but very small cost (0.04% geomean here):
https://llvm-compile-time-tracker.com/?config=NewPM-O3&stat=instructions&remote=rotateright
> Hmm, how could we compute the known bits of the shift amount at compile time? Do you mean analyzing the DAG for the shift amount Value, taking known bits recursively?
>
> You've seen llvm::computeKnownBits(), right?
Thanks, used it.
I think we want to do this in three steps.
lshr is easy and obvious, but for ashr we want to count *sign* bits.
I haven't really thought about shl.
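For context, a minimal sketch of the sign-bit counting idea (a hypothetical helper built on llvm::ComputeNumSignBits, not the patch itself; the shift-amount bound is checked separately, as above):

```cpp
#include "llvm/Analysis/ValueTracking.h"
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/Value.h"

// For ashr, the bits dropped by the trunc must all be copies of the sign
// bit. ComputeNumSignBits returns how many of the top bits are known to
// equal the sign bit, so narrowing SrcWidth -> DstWidth is safe when the
// operand has more than SrcWidth - DstWidth known sign bits.
static bool ashrNarrowingLooksSafe(const llvm::Value *Op, unsigned SrcWidth,
                                   unsigned DstWidth,
                                   const llvm::DataLayout &DL) {
  return llvm::ComputeNumSignBits(Op, DL) > SrcWidth - DstWidth;
}
```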
Do you mean splitting this into three separate patches?
shl is simpler than both right shifts, since it moves no bits from the truncated part into the untruncated one. The condition used for shl here is necessary and sufficient, whereas it is only sufficient for the right shifts.
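A hedged illustration for shl (hypothetical test IR, not from the patch): a constant shift amount below the target width is enough, because the kept low bits never depend on the dropped high bits:

```llvm
define i16 @narrow_shl(i32 %x) {
  %s = shl i32 %x, 5        ; 5 < 16: the kept low 16 bits of %s depend
  %t = trunc i32 %s to i16  ; only on the low 16 bits of %x
  ret i16 %t
}

; can be narrowed to:
define i16 @narrow_shl.tgt(i32 %x) {
  %xt = trunc i32 %x to i16
  %s = shl i16 %xt, 5
  ret i16 %s
}
```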
llvm/test/Transforms/SLPVectorizer/X86/pr50555.ll:3 (On Diff #366403)

Should this be moved to be a phase-ordering test, do you think?
llvm/test/Transforms/SLPVectorizer/X86/pr50555.ll:3 (On Diff #366403)

Do you mean it's more a test that the SLP vectorizer follows aggressive-instcombine? Ok, moved.
Yes.
> shl is simpler than both right shifts, since it moves no bits from the truncated part into the untruncated one. The condition used for shl here is necessary and sufficient, whereas it is only sufficient for the right shifts.
That is kind of my point.
At least the left and right shifts have different legality rules,
and different right-shifts also have slightly different rules.
Not having to deal with everything at once will strictly simplify review.
llvm/test/Transforms/SLPVectorizer/X86/pr50555.ll:3 (On Diff #366403)

This should be in the X86 sub-directory - look at other tests in there for examples, as we don't specify explicit passes:

```
; RUN: opt -O2 -S < %s | FileCheck %s --check-prefixes=SSE
; RUN: opt -O2 -S -mattr=avx < %s | FileCheck %s --check-prefixes=AVX
; RUN: opt -passes='default<O2>' -S < %s | FileCheck %s --check-prefixes=SSE
; RUN: opt -passes='default<O2>' -S -mattr=avx < %s | FileCheck %s --check-prefixes=AVX
```
llvm/test/Transforms/SLPVectorizer/X86/pr50555.ll:3 (On Diff #366403)

Right - the goal of PhaseOrdering tests is to make sure that multiple passes interact as expected and that we get the expected results from the typical (-On) pass pipelines in 'opt'.
llvm/test/Transforms/SLPVectorizer/X86/pr50555.ll:3 (On Diff #366403)

Sure, thanks! Moved it to the subdirectory and changed it to the -O3 option.
Yes, I'm going to add ashr. I'm also planning to add AssumptionCache for use with computeKnownBits(), and to investigate the question of including AIC at -O2.
> ...and to investigate the question of including AIC at -O2.
Yeah, AIC for O2+ - it would be great if possible.