This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Canonizing 'and' before 'shl'
Needs ReviewPublic

Authored by opaparo on Dec 14 2017, 5:32 AM.

Details

Summary

Following a debate that aroused here, a new canonical form to a masked shl is introduced. The canonical form will now be:

define i8 @andshl(i8 %x) {
  %and = and i8 %x, 1
  %shl = shl nuw nsw i8 %and, 3
  ret i8 %shl
}

instead of:

define i8 @andshl(i8 %x) {
  %and = shl i8 %x, 3
  %shl = and i8 %and, 8
  ret i8 %shl
}

Which will result, first and foremost, in smaller constants used by the 'and' instructions.

Some complementary changes are also introduced:

  1. InstCombiner::MatchBSwap (under lib/Transforms/InstCombine/InstCombineAndOrXor.cpp) was changed to recognize both the old and new patterns (tests will fail if only one the new pattern is recognized).
  2. New features and fine tuning were introduced to InstructionSimplify in order to continue supporting existing test cases as well as enhancing other similar test cases. Those are specified in test/Transforms/InstCombine/select-bitext-bitwise-ops.ll

Diff Detail

Repository
rL LLVM

Event Timeline

opaparo created this revision.Dec 14 2017, 5:32 AM
craig.topper added inline comments.Dec 14 2017, 10:00 AM
test/Transforms/InstCombine/2010-11-01-lshr-mask.ll
7–8

This looks like a regression.

test/Transforms/InstCombine/cast.ll
814

Any idea why were weren't simplifying this before this change?

At least the bswap part of this (and probably handling the regression that @craig.topper pointed out) should be split into independent patches. I don't know what the InstSimplify part is doing yet, but that seems like it should be an independent patch too.

You can see the bswap limitation with a small test like this:

declare void @extra_use(i32)
  
define i32 @test7_and_first(i32 %x) {
  %shl = shl i32 %x, 16
  %shr = lshr i32 %x, 16
  %or = or i32 %shl, %shr
  %t1 = and i32 %or, 16711935
  %shl3 = shl nuw i32 %t1, 8
  %and4 = lshr i32 %or, 8
  %shr5 = and i32 %and4, 16711935
  %or6 = or i32 %shl3, %shr5
  ret i32 %or6
}

define i32 @test7_and_first_extra_use(i32 %x) {
  %shl = shl i32 %x, 16
  %shr = lshr i32 %x, 16
  %or = or i32 %shl, %shr
  %t1 = and i32 %or, 16711935
  %shl3 = shl nuw i32 %t1, 8
  %and4 = lshr i32 %or, 8
  %shr5 = and i32 %and4, 16711935
  %or6 = or i32 %shl3, %shr5
  call void @extra_use(i32 %t1)
  ret i32 %or6
}


define i32 @test7_shl_first(i32 %x) {
  %shl = shl i32 %x, 16
  %shr = lshr i32 %x, 16
  %or = or i32 %shl, %shr
  %and2 = shl i32 %or, 8
  %shl3 = and i32 %and2, -16711936
  %and4 = lshr i32 %or, 8
  %shr5 = and i32 %and4, 16711935
  %or6 = or i32 %shl3, %shr5
  ret i32 %or6
}
$ ./opt -instcombine -S bswap.ll 

declare void @extra_use(i32)

define i32 @test7_and_first(i32 %x) {
  %or6 = call i32 @llvm.bswap.i32(i32 %x)
  ret i32 %or6
}

define i32 @test7_and_first_extra_use(i32 %x) {
  %shl = shl i32 %x, 16
  %shr = lshr i32 %x, 16
  %or = or i32 %shl, %shr
  %t1 = and i32 %or, 16711935
  %shl3 = shl nuw i32 %t1, 8
  %and4 = lshr i32 %or, 8
  %shr5 = and i32 %and4, 16711935
  %or6 = or i32 %shl3, %shr5
  call void @extra_use(i32 %t1)
  ret i32 %or6
}

define i32 @test7_shl_first(i32 %x) {
  %or6 = call i32 @llvm.bswap.i32(i32 %x)
  ret i32 %or6
}
opaparo updated this revision to Diff 127361.Dec 18 2017, 7:40 AM
  1. Relaxing the canonization condition: the masked shl will be canonized to the new form only if the 'shl''s 0th operand is not a shift instruction. This is due to other, better optimization that will be able to kick in.
  2. Reverted the InstructionSimplify and bswap changes. Those will be added in two different reviews.
opaparo marked an inline comment as done.Dec 18 2017, 7:41 AM
opaparo added inline comments.Dec 19 2017, 6:53 AM
test/Transforms/InstCombine/cast.ll
814

I've looked into it.
The key here is the first three instructions. Consider the old version:

%C = zext i8 %A to i32
%D = shl i32 %C, 4
%E = and i32 %D, 48

vs the new version:

%C = zext i8 %A to i32
%E = and i32 %C, 3
%D = shl i32 %E, 4

In the old version, 'shl' of 'zext' could not be transformed into 'zext' of 'shl' as in the general case they are not identical due to bits 8 and above of the result of the 'shl' that will be zeroed out. On the other hand, in the new version 'and' of 'zext' could be transformed to 'zext' of 'and'. Afterwards, since bits 8 and above of the 'shl' result are promised to be zeroes, 'shl' of 'zext' can now be transformed into 'zext' of 'shl'. This caused a chain reaction in which the 'zext' kept "sinking" down, eventually merging with the other 'zext'.
Note that, in the new version, if the RHS operand of the 'and' was greater or equal to 16 this transformation would again not be possible as there would again be a chance of losing bits. 'and' of 'zext' will still be transformed into 'zext' of 'and', but it will stop there.

opaparo marked an inline comment as done.Dec 19 2017, 6:54 AM

I know smaller constants is meaningful to X86. But do other targets have different immediate sizes for And instructions. ARM encodes immediates with a shift amount applied to them I think? Not sure about others.

lib/Transforms/InstCombine/InstCombineShifts.cpp
554

This transform is very similar to the one above, but the binop here is conditional. Should we be avoid pulling shifts above a conditional and?

spatel added a comment.Jul 9 2018, 7:22 AM

I know smaller constants is meaningful to X86. But do other targets have different immediate sizes for And instructions. ARM encodes immediates with a shift amount applied to them I think? Not sure about others.

This might be a stretch, but we could argue that smaller constants also increase the likelihood of CSE (smaller range means more potential equivalences?).

D48278 is an ARM-motivated patch that looks like it would benefit from this change.

I can't tell where we left off on this patch. Did we eliminate all of the roadblocks?