
[AggressiveInstCombine] Add shift instructions to `TruncInstCombine` DAG
AbandonedPublic

Authored by anton-afanasyev on Aug 9 2021, 7:31 AM.

Details

Summary
Add `shl`, `lshr` and `ashr` instructions to the DAG post-dominated by `trunc`,
allowing TruncInstCombine to reduce bitwidth of expressions containing shifts.

Fixes PR50555.

Diff Detail

Event Timeline

anton-afanasyev requested review of this revision. Aug 9 2021, 7:31 AM
Herald added a project: Restricted Project. Aug 9 2021, 7:31 AM
lebedev.ri requested changes to this revision. Aug 9 2021, 7:40 AM

We can not do this transform as proposed here,
it increases the instruction count.

Could you point me at the test for this change?
It should only contain lshr-of-zext, there should not be any trunc;
please add a test where zext has an extra use.

This revision now requires changes to proceed. Aug 9 2021, 7:40 AM

Add test with zext having extra use

anton-afanasyev added inline comments. Aug 9 2021, 8:33 AM
llvm/test/Transforms/InstCombine/pr50555.ll
9 ↗(On Diff #365189)


@lebedev.ri: Yes, you're right, it increases the instruction count, adding a new zext when the old one has an extra use. But it also narrows the lshr to a lower bit-width type, making it simpler. Isn't that a good compromise?

Also, after sinking the zext, it can trigger a chain of changes combining the zext with other instructions, as in the cases below: add, trunc and so on.

I think this can be adjusted in the SLP vectorizer. We have the MinBWs container in there to try to operate on non-wide instructions. It probably needs to be tweaked to handle this case. Did you try to modify the collectValuesToDemote function?

FWIW I agree that it obviously improves the SLP snippet in question,
I'm just not sure this is the right way to do it. Sorry.

llvm/test/Transforms/InstCombine/pr50555.ll
9 ↗(On Diff #365189)

@lebedev.ri: Yes, you're right, it increases the instruction count, adding a new zext when the old one has an extra use.

To be noted, this is the profitability heuristic of instcombine: don't increase instruction count.

But it also narrows the lshr to a lower bit-width type, making it simpler. Isn't that a good compromise?
Also, after sinking the zext, it can trigger a chain of changes combining the zext with other instructions, as in the cases below: add, trunc and so on.

Sure. But it still increases instruction count.

spatel added a comment. Aug 9 2021, 8:59 AM

If we want to solve this as an instcombine (or maybe aggressive-instcombine) problem, we have to expand the pattern to make it clearly profitable. I'm not sure how to generalize it, but we can do the narrowing starting from the trunc and remove an instruction:
https://alive2.llvm.org/ce/z/lwtDwZ

define i16 @src(i8 %x) {
  %z = zext i8 %x to i32
  %s = lshr i32 %z, 1
  %a = add nuw nsw i32 %s, %z
  %s2 = lshr i32 %a, 2
  %t = trunc i32 %s2 to i16
  ret i16 %t
}

define i16 @tgt(i8 %x) {
  %z = zext i8 %x to i16
  %s = lshr i16 %z, 1
  %a = add nuw nsw i16 %s, %z
  %s2 = lshr i16 %a, 2
  ret i16 %s2
}
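The equivalence of @src and @tgt above can also be sanity-checked exhaustively by modeling the IR semantics in Python (an editorial sketch; masking stands in for the fixed-width IR types, and the function names mirror the IR above):

```python
M32 = (1 << 32) - 1
M16 = (1 << 16) - 1

def src(x):
    z = x & 0xFF        # zext i8 -> i32
    s = z >> 1          # lshr i32 %z, 1
    a = (s + z) & M32   # add i32 (nuw nsw, so no wrap occurs anyway)
    s2 = a >> 2         # lshr i32 %a, 2
    return s2 & M16     # trunc i32 -> i16

def tgt(x):
    z = x & 0xFF        # zext i8 -> i16
    s = z >> 1          # lshr i16 %z, 1
    a = (s + z) & M16   # add i16
    s2 = a >> 2         # lshr i16 %a, 2
    return s2

# Exhaustive over all i8 inputs: the narrowed form agrees everywhere.
assert all(src(x) == tgt(x) for x in range(256))
```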
anton-afanasyev marked an inline comment as done. Aug 9 2021, 10:51 AM

Thanks to all, I've moved this fix to aggressive-instcombine, where it is even planned in the TODO section.

llvm/test/Transforms/InstCombine/pr50555.ll
9 ↗(On Diff #365189)

To be noted, this is the profitability heuristic of instcombine: don't increase instruction count.

Ok, I see, thanks for noting this.

anton-afanasyev marked an inline comment as done.

Move fix to AggressiveInstCombine

Remove InstCombine tests

Nice, this seems to fit naturally there.
That being said, you probably still want some standalone tests for the pattern in question, both positive ones and negative ones - what's the requirement on the shift amount?


Agree - aggressive-instcombine doesn't get nearly as much testing as regular instcombine, so we need more tests to be confident it doesn't over-reach.
Leaving the shift amount off the getRelevantOperands() list doesn't work on this example (it crashes):

define i16 @sh_amt(i8 %x, i8 %sh1) {
  %z = zext i8 %x to i32
  %zs = zext i8 %sh1 to i32
  %s = lshr i32 %z, %zs
  %a = add nuw nsw i32 %s, %z
  %s2 = lshr i32 %a, 2
  %t = trunc i32 %s2 to i16
  ret i16 %t
}

Some observations for logical right-shift: this will have a hard time with variable shift amounts.
You need to avoid creating out-of-bounds shifts; there are two obvious options:

  1. https://alive2.llvm.org/ce/z/XShcju <- shift amount needs to be less than target width, and truncation should only drop zeros.
  2. https://alive2.llvm.org/ce/z/QiDPV7 <- could saturate the shift amount if you know that %x has more leading zeros than the number of bits to be truncated

We might already have this logic somewhere, not sure.
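Option 1 can be illustrated with a brute-force check over tiny bit widths (a Python sketch, not a substitute for the alive2 proof): when the shift amount is below the narrow width and truncation only drops zeros, lshr commutes with truncation.

```python
WIDE, NARROW = 8, 4
NARROW_MASK = (1 << NARROW) - 1

for x in range(1 << NARROW):      # truncated-away high bits of x are zero
    for c in range(NARROW):       # shift amount below the narrow width
        shift_then_trunc = (x >> c) & NARROW_MASK  # lshr in i8, then trunc to i4
        trunc_then_shift = (x & NARROW_MASK) >> c  # trunc to i4, then lshr
        assert shift_then_trunc == trunc_then_shift
```

Dropping either precondition (a shift amount of NARROW or more, or nonzero high bits) makes the assertion fail, which is exactly why the legality check is needed.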


Yes, thanks for your observations, I'm already working on it: https://alive2.llvm.org/ce/z/XcCJ9Q
Special care is also needed for the vector case, so that the transform does not become more poisonous.
TruncInstCombine already has the appropriate logic, but it needs to be tweaked.
For now I'm assuming that the shift amount is a constant (int or vector). I'm not sure that a transform adding a check for a variable shift amount is good.


I'm not sure that a transform adding a check for a variable shift amount is good.

Define good. I think supporting variable shifts will take exactly two lines:
compute knownbits of the shift amount, and get the maximal shift amount via KnownBits::getMaxValue().


Hmm, how could we compute knownbits of the shift amount at compile time? Do you mean analyzing the DAG for the shift amount Value, taking knownbits recursively?
Also, I don't believe this computation makes sense: in most cases, when the shift amount is variable, its first byte is unknown. For instance, how could knownbits help to optimize @spatel's example?

define i16 @sh_amt(i8 %x, i8 %sh1) {
  %z = zext i8 %x to i32
  %zs = zext i8 %sh1 to i32
  %s = lshr i32 %z, %zs
  %a = add nuw nsw i32 %s, %z
  %s2 = lshr i32 %a, 2
  %t = trunc i32 %s2 to i16
  ret i16 %t
}
anton-afanasyev planned changes to this revision. Aug 10 2021, 11:28 AM
anton-afanasyev retitled this revision from [InstCombine] Get rid of `hasOneUses()` when swapping `lshr` and `zext` to [AggressiveInstCombine] Add `lshr` and `ashr` instructions to TruncInstCombine DAG.
anton-afanasyev edited the summary of this revision. (Show Details)


Hmm, how could we compute knownbits of the shift amount at compile time? Do you mean analyzing the DAG for the shift amount Value, taking knownbits recursively?

You've seen llvm::computeKnownBits(), right?

Also, I don't believe this computation makes sense: in most cases, when the shift amount is variable, its first byte is unknown. For instance, how could knownbits help to optimize @spatel's example?


I find this comment to be highly inflammatory.

Just because there's a large number of cases it won't help with doesn't mean it can't ever help with anything.
https://alive2.llvm.org/ce/z/RkkBTy <- we have no idea what %y is, but we can tell it's less than the target bitwidth.
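A minimal model of that known-bits bound (an editorial sketch; `max_from_known_zero` is a hypothetical stand-in for what llvm::computeKnownBits() plus KnownBits::getMaxValue() provide): a mask like `and i8 %y, 7` makes the high bits of the shift amount known zero, which bounds its maximum value below the target bit width even though %y itself is unknown.

```python
def max_from_known_zero(known_zero, width):
    # KnownBits::getMaxValue()-style bound: every bit that is not
    # known-zero is assumed set, giving the largest possible value.
    return ((1 << width) - 1) & ~known_zero

# `and i8 %y, 7` leaves the top five bits known zero, so the shift
# amount is at most 7 -- provably below an i16 target width.
known_zero = 0xFF & ~7
assert max_from_known_zero(known_zero, 8) == 7
```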

AggressiveInstCombine runs only with -O3, right? Do we know how expensive it would be for -O2?


Yes - only at O3 currently. That's mainly because nobody has bothered to check whether it's worth fighting to include it at -O2.

That question is much easier to answer since we have compile-time-tracker. But I'm not sure how to answer the cost question directly - I think we can approximate it by just removing the pass from O3 and checking the difference.

The result seems to be a consistent but very small cost (0.04% geomean here):
https://llvm-compile-time-tracker.com/?config=NewPM-O3&stat=instructions&remote=rotateright

anton-afanasyev edited the summary of this revision. (Show Details)

Update, add shl

Add negative and positive tests


You've seen llvm::computeKnownBits(), right?

Thanks, used it.

anton-afanasyev retitled this revision from [AggressiveInstCombine] Add `lshr` and `ashr` instructions to TruncInstCombine DAG to [AggressiveInstCombine] Add shift instructions to `TruncInstCombine` DAG. Aug 13 2021, 11:28 PM
anton-afanasyev edited the summary of this revision. (Show Details)
anton-afanasyev edited the summary of this revision. (Show Details)

Fix test

I think we want to do this in three steps.
lshr is easy and obvious, but for ashr we want to count *sign* bits.
Haven't really thought about shl.


Do you mean splitting this into three separate patches?
shl is simpler than both right shifts, since no bits move from the truncated part to the untruncated one.
The condition used for shl here is necessary and sufficient, whereas it is only sufficient for the right shifts.
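The claim that shl moves no bits from the truncated part into the kept part can be brute-force checked over tiny bit widths (a Python sketch, not the alive2 proof): the low bits of x << c depend only on the low bits of x, for any in-range shift amount and arbitrary high bits.

```python
WIDE, NARROW = 8, 4
MASK = (1 << NARROW) - 1

for x in range(1 << WIDE):        # all i8 values, high bits arbitrary
    for c in range(NARROW):       # any shift amount valid in the narrow type
        wide_then_trunc = (x << c) & MASK           # shl i8, then trunc to i4
        trunc_then_shl = ((x & MASK) << c) & MASK   # trunc to i4, then shl
        assert wide_then_trunc == trunc_then_shl
```

Note there is no precondition on the high bits of x here, unlike the lshr case; only the shift amount has to fit the narrow type.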

RKSimon added inline comments. Aug 15 2021, 3:06 AM
llvm/test/Transforms/SLPVectorizer/X86/pr50555.ll
3 ↗(On Diff #366403)

Should this be moved to be a phase ordering test do you think?

llvm/test/Transforms/SLPVectorizer/X86/pr50555.ll
3 ↗(On Diff #366403)

Do you mean it's more of a test that slp-vectorizer runs after aggressive-instcombine? Ok, moved.

Move SLPVectorizer test to PhaseOrdering


Do you mean splitting this into three separate patches?

Yes.

shl is simpler than both right shifts, since no bits move from the truncated part to the untruncated one.
The condition used for shl here is necessary and sufficient, whereas it is only sufficient for the right shifts.

That is kind of my point.
At least the left and right shifts have different legality rules,
and different right-shifts also have slightly different rules.
Not having to deal with everything at once will strictly simplify review.


Ok, start from shl: https://reviews.llvm.org/D108091

RKSimon added inline comments. Aug 16 2021, 9:16 AM
llvm/test/Transforms/SLPVectorizer/X86/pr50555.ll
3 ↗(On Diff #366403)

This should be in the X86 sub-directory - look at other tests in there for examples as we don't specify explicit passes:
e.g.

; RUN: opt -O2 -S < %s | FileCheck %s --check-prefixes=SSE
; RUN: opt -O2 -S -mattr=avx < %s | FileCheck %s --check-prefixes=AVX
; RUN: opt -passes='default<O2>' -S < %s | FileCheck %s --check-prefixes=SSE
; RUN: opt -passes='default<O2>' -S -mattr=avx < %s | FileCheck %s --check-prefixes=AVX
spatel added inline comments. Aug 16 2021, 9:33 AM
llvm/test/Transforms/SLPVectorizer/X86/pr50555.ll
3 ↗(On Diff #366403)

Right - the goal of PhaseOrdering tests is to make sure that >1 passes are interacting as expected and that we get the expected results from the typical (-On) pass pipelines in 'opt'.

anton-afanasyev marked 2 inline comments as done. Aug 16 2021, 11:38 AM
anton-afanasyev added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/pr50555.ll
3 ↗(On Diff #366403)

Sure, thanks! Moved to the subdirectory and changed to the -O3 option.

anton-afanasyev marked an inline comment as done.

Address comments

(Feel free to post lshr patch after landing D108091)

Is there anything left to do on this?


Rebase this?
ashr is left.

Yes, I'm going to add ashr. I'm also planning to add AssumptionCache to use it for computeKnownBits(), and to investigate the question of including AIC in -O2.

And to investigate the question of including AIC in -O2.

Yeah, AIC for O2+ - it would be great if possible.

@anton-afanasyev reverse ping - is there anything left on this patch?

anton-afanasyev abandoned this revision. Nov 3 2021, 11:14 AM

This was already committed in a series of patches, abandoning.