This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Simplify bswap -> shift
Closed, Public

Authored by chfast on Jan 19 2022, 7:22 AM.

Details

Summary

Simplify bswap(x) to shl(x) or lshr(x) if x has exactly one
"active byte", i.e. all active bits are contained within the
boundaries of a single byte of x.

https://alive2.llvm.org/ce/z/nvbbU5
https://alive2.llvm.org/ce/z/KiiL3J
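For illustration, a minimal IR sketch of the fold (names chosen for this example; the tests in llvm/test/Transforms/InstCombine/bswap-fold.ll are the authoritative cases):

declare i32 @llvm.bswap.i32(i32)

define i32 @bswap_known_low_byte(i32 %x) {
  %m = and i32 %x, 255                   ; only the low byte can be non-zero
  %b = call i32 @llvm.bswap.i32(i32 %m)  ; -> shl i32 %m, 24
  ret i32 %b
}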

Event Timeline

chfast created this revision. Jan 19 2022, 7:22 AM
chfast requested review of this revision. Jan 19 2022, 7:22 AM
Herald added a project: Restricted Project. Jan 19 2022, 7:22 AM
chfast added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
1242

To my understanding, LLVM assumes 8-bit bytes. I was not able to find a named constant to use in place of the literal 8.

This is one of several related patterns where we mask/shift a single byte from an end of the input:
https://alive2.llvm.org/ce/z/dtHMVr

define i32 @bs_shl32(i32 %0) {
  %2 = shl i32 %0, 24
  %3 = call i32 @llvm.bswap.i32(i32 %2)
  ret i32 %3
}

define i32 @bs_lshr32(i32 %0) {
  %2 = lshr i32 %0, 24
  %3 = call i32 @llvm.bswap.i32(i32 %2)
  ret i32 %3
}

define i32 @bs_and32_000000ff(i32 %x) {
  %m = and i32 %x, 255
  %b = call i32 @llvm.bswap.i32(i32 %m)
  ret i32 %b
}

define i32 @bs_and32_ff000000(i32 %x) {
  %m = and i32 %x, -16777216
  %b = call i32 @llvm.bswap.i32(i32 %m)
  ret i32 %b
}

It might help to see a larger, motivating example in source or IR, so we know what a potentially more general solution would look like.
The backend also misses these patterns, so we may want to mimic whatever we do in IR for codegen. I recently added a similar fold with D117508.

Can we do: computeKnownBits().countMaxActiveBits() <= 8 -> replace with shl; (BitWidth - computeKnownBits().countMaxTrailingZeros()) <= 8 -> replace with lshr? If the input to the bswap happens to be a shift in the other direction, the new shift should be combined with it by existing combines to form an And.
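For example (an illustrative sketch of that interaction):

declare i32 @llvm.bswap.i32(i32)

define i32 @bswap_of_shl(i32 %x) {
  %s = shl i32 %x, 24                    ; at least 24 known trailing zeros
  %b = call i32 @llvm.bswap.i32(i32 %s)  ; -> lshr i32 %s, 24
  ret i32 %b                             ; lshr(shl(%x, 24), 24) -> and i32 %x, 255
}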

Hi @spatel,

My motivation is rather boring. I have a generic data -> integer loader. The data is in big-endian order. For i32 example the data can be from 1 to 4 bytes in size. Here is the generic template which handles all 4 cases:
https://godbolt.org/z/dPcGjvb94

The load<1> has an unnecessary bswap; zext(*data) should be enough there.
There may be some optimization opportunities in load<3>, but I have not investigated this in detail.

I can implement the other proposed matches at the IR and CodeGen levels if you think this is a good idea.

If we're after a more general mechanism, it might be possible to extend llvm::recognizeBSwapOrBitReverseIdiom() to recognise shift/mask patterns as well - the inner collectBitParts already handles bswap etc.

llvm/test/Transforms/InstCombine/bswap-fold.ll
386

We don't gain much from duplicating tests for different bitwidths like this - one scalar and one vector test is typically enough, possibly with some basic additional multiuse and vector shift-amounts-with-undef tests as well.

My motivation is rather boring. I have a generic data -> integer loader. The data is in big-endian order. For i32 example the data can be from 1 to 4 bytes in size. Here is the generic template which handles all 4 cases:
https://godbolt.org/z/dPcGjvb94

Thanks! I guessed there was more than just this one potential optimization, but that makes it clearer.
Pushing a logical shift after the bswap (and reversing direction) might get us most of what we need:
https://alive2.llvm.org/ce/z/2zveR6
That should allow the existing demanded bits fold to trigger in the simplest cases if I'm seeing it correctly.
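A minimal IR sketch of that rewrite, assuming a byte-aligned shift amount:

; before
%s = shl i32 %x, 8
%b = call i32 @llvm.bswap.i32(i32 %s)
; after: the shift is pushed below the bswap and reversed
%b2 = call i32 @llvm.bswap.i32(i32 %x)
%r = lshr i32 %b2, 8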

That wouldn't help with the (bswap (and)) though would it? But my computeKnownBits suggestion would allow the bswap to be reduced to a shift.

Right - knownbits would give us more flexibility on the single byte cases, but it wouldn't do anything for the patterns that deal with >1 byte? There's some overlap, but these could be independent patches.

I was hoping that pushing the shift to the end could trigger a narrowing transform on the bswap, but we miss that too:

define i32 @_Z4loadILj3EEjPKh(i8* noundef %0) {
  %2 = bitcast i8* %0 to i24*
  %3 = load i24, i24* %2, align 1
  %4 = zext i24 %3 to i32
  %5 = call i32 @llvm.bswap.i32(i32 %4) ; can we do anything with this?
  %6 = lshr exact i32 %5, 8
  ret i32 %6
}

define i32 @_Z4loadILj2EEjPKh(i8* noundef %0) {
  %2 = bitcast i8* %0 to i16*
  %3 = load i16, i16* %2, align 1
  %4 = zext i16 %3 to i32
  %5 = call i32 @llvm.bswap.i32(i32 %4) ; bswap.i16
  %6 = lshr exact i32 %5, 16
  ret i32 %6
}
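If that narrowing existed, the second function could end up as something like (a sketch, not actual InstCombine output):

define i32 @_Z4loadILj2EEjPKh(i8* noundef %0) {
  %2 = bitcast i8* %0 to i16*
  %3 = load i16, i16* %2, align 1
  %4 = call i16 @llvm.bswap.i16(i16 %3)
  %5 = zext i16 %4 to i32
  ret i32 %5
}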
chfast updated this revision to Diff 401631. Jan 20 2022, 7:07 AM

[InstCombine] Simplify bswap -> shift

Simplify bswap(x) to shl(x) or lshr(x) if x has at most 8 (byte-size)
low/high active bits.

chfast retitled this revision from [InstCombine] Fold bswap(shl(x, C)) -> and(x, 255) to [InstCombine] Simplify bswap -> shift. Jan 20 2022, 7:08 AM
chfast edited the summary of this revision.

Can we do: computeKnownBits().countMaxActiveBits() <= 8 -> replace with shl; (BitWidth - computeKnownBits().countMaxTrailingZeros()) <= 8 -> replace with lshr?

I did this, except I used BitWidth - computeKnownBits().countMinTrailingZeros(), which I believe is the correct one.

llvm/test/Transforms/InstCombine/bswap-fold.ll
421

Any undef shift amount or and argument prevents this optimization. Is this expected?

442

Here we replace bswap with lshr i64 %2, 56, but it is further expanded to

shl i64 %0, 1
and i64 %3, 254
craig.topper added inline comments. Jan 20 2022, 9:09 AM
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
1221

Can this go further?

Something like:

unsigned TZ = alignDown(Known.countMinTrailingZeros(), 8);
unsigned LZ = alignDown(Known.countMinLeadingZeros(), 8);

if (BitWidth - TZ - LZ == 8) {
  if ((BitWidth - LZ - 8) > TZ)
    ShiftRight by ((BitWidth - LZ - 8)  - TZ)
  else
    ShiftLeft by (TZ - (BitWidth - LZ - 8))
}

Adapted from the SimplifyDemandedBits for bswap like https://reviews.llvm.org/D117508

craig.topper added inline comments. Jan 20 2022, 9:21 AM
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
1221

Oops, that should be:

unsigned TZ = alignDown(Known.countMinTrailingZeros(), 8);
unsigned LZ = alignDown(Known.countMinLeadingZeros(), 8);

if (BitWidth - TZ - LZ == 8) {
  if ((BitWidth - TZ - 8) > TZ)
    ShiftRight by ((BitWidth - TZ - 8)  - TZ)
  else
    ShiftLeft by (TZ - (BitWidth - TZ - 8))
}
spatel added inline comments. Jan 20 2022, 10:53 AM
llvm/test/Transforms/InstCombine/bswap-fold.ll
442

I added one-use checks with:
2d031ec5e53f
...so this shouldn't have an extra instruction now.

Please pre-commit the tests with the baseline CHECKs, so we just show diffs in this patch.

craig.topper added inline comments. Jan 20 2022, 10:53 AM
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
1221

Or even simpler:

unsigned TZ = alignDown(Known.countMinTrailingZeros(), 8);
unsigned LZ = alignDown(Known.countMinLeadingZeros(), 8);

if (BitWidth - TZ - LZ == 8) {
  if (LZ > TZ)
    ShiftRight by (LZ - TZ)
  else
    ShiftLeft by (TZ - LZ)
}

Wish I could amend or delete my previous comments.

llvm/test/Transforms/InstCombine/bswap-fold.ll
421

I didn't think of that, but it is correct for computeKnownBits. The undef doesn't allow computeKnownBits to determine anything.

I don't think I would worry too much about it. @spatel or @lebedev.ri what do you think?

chfast added inline comments. Jan 20 2022, 12:11 PM
llvm/lib/Transforms/InstCombine/InstCombineCalls.cpp
1221

Yes, I ended up with the same simplification on paper. Although I think you still have the shift kinds swapped.
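Roughly what I have on paper, with the shift kinds as I believe they should be (a sketch assuming the usual InstCombine helpers, not the exact committed code):

unsigned BitWidth = X->getType()->getScalarSizeInBits();
KnownBits Known = computeKnownBits(X, /*Depth=*/0, II);
unsigned TZ = alignDown(Known.countMinTrailingZeros(), 8);
unsigned LZ = alignDown(Known.countMinLeadingZeros(), 8);
// All possibly-set bits fit in one aligned byte; bswap just moves that
// byte from bit offset TZ to bit offset BitWidth - 8 - TZ == LZ.
if (BitWidth - TZ - LZ == 8) {
  Value *Res = LZ > TZ ? Builder.CreateShl(X, LZ - TZ)   // byte moves up
                       : Builder.CreateLShr(X, TZ - LZ); // byte moves down
  return replaceInstUsesWith(*II, Res);
}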

llvm/test/Transforms/InstCombine/bswap-fold.ll
442

I added one-use checks with:
2d031ec5e53f
...so this shouldn't have an extra instruction now.

Nice

Please pre-commit the tests with the baseline CHECKs, so we just show diffs in this patch.

Will do.

chfast updated this revision to Diff 401773. Jan 20 2022, 1:51 PM
chfast edited the summary of this revision.

Implemented the extended variant, which handles the "active byte" at any position, as suggested by @craig.topper.

This revision is now accepted and ready to land. Jan 20 2022, 1:59 PM

@spatel, can you also review this?

Should I land 2 commits: one adding the tests, and a second with the implementation and test updates?

lebedev.ri accepted this revision. Jan 20 2022, 2:20 PM
lebedev.ri edited the summary of this revision.
lebedev.ri added a subscriber: lebedev.ri.

LG

https://alive2.llvm.org/ce/z/nvbbU5
https://alive2.llvm.org/ce/z/KiiL3J

Should I land 2 commits: one adding the tests, and a second with the implementation and test updates?

Yes.

I think the commit message needs to be updated after the improved change. It's no longer about low/high active bits.

chfast edited the summary of this revision. Jan 20 2022, 2:23 PM
spatel accepted this revision. Jan 20 2022, 2:33 PM

@spatel, can you also review this?

LGTM

Should I land 2 commits: one adding the tests, and a second with the implementation and test updates?

Yes, the patch to add tests should be labeled "NFC". That way, if this patch is reverted, the tests can remain in tree.

llvm/test/Transforms/InstCombine/bswap-fold.ll
421

I doubt that vector bswap occurs enough to worry about, and the even rarer case with undefs shouldn't limit this patch.

chfast updated this revision to Diff 401807. Jan 20 2022, 3:47 PM

Trigger rebuild.

This revision was landed with ongoing or failed builds. Jan 20 2022, 4:26 PM
This revision was automatically updated to reflect the committed changes.

I have noticed this post about undef/poison: https://llvm.discourse.group/t/evolution-of-undef-and-poison-over-time/5917/2.
Should I have added tests with poison instead of undef?