This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Improve ISel for v_bfi instructions.
AbandonedPublic

Authored by tsymalla on Sep 22 2022, 1:41 AM.

Details

Reviewers
foad
piotr
Summary

This patch introduces a new ISel pattern for
v_bfi instructions. In some cases, only a single
v_bfi instruction is generated, even if there
could be multiple ones. The final codegen has
leftover v_and and v_xor instructions.
Such cases can appear when using nested bitfieldInsert
instructions.

A (xor (and imm0, (xor (shl), (xor (and (xor (shl)), imm1)))) has two BFI parts.
The outer BFI part relies on the inner BFI part. During InstCombine, the inner xor
sequence gets turned into bfi_0 = (y & x) | (z & ~x) and later to a BFI, while the
outer BFI part stays untouched and will not be converted into a BFI instruction.

Diff Detail

Event Timeline

tsymalla created this revision.Sep 22 2022, 1:41 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 22 2022, 1:41 AM
tsymalla requested review of this revision.Sep 22 2022, 1:41 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 22 2022, 1:41 AM
foad requested changes to this revision.Sep 22 2022, 2:22 AM
foad added inline comments.
llvm/lib/Target/AMDGPU/SIInstructions.td
1908–1909

As written this is not true. Counterexample: a=x=y=C0=C1=0.

1911

Should have a DivergentBinFrag on the outermost node so that we don't select VALU instructions for uniform expressions.

This revision now requires changes to proceed.Sep 22 2022, 2:22 AM
tsymalla updated this revision to Diff 462240.Sep 22 2022, 10:56 AM

Added DivergentBinFrag class to outermost node.
Fixed comment.

tsymalla edited the summary of this revision. (Show Details)Sep 22 2022, 11:16 AM
foad added inline comments.Sep 23 2022, 12:26 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
1909

Still not true. Counterexample: a=y=C1=0, z=C0=1.

tsymalla added inline comments.Sep 23 2022, 1:18 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
1909

You are correct. I need to overthink the pattern. In general, the equation is not correct when when y != z and a=y=C1.

foad added inline comments.Sep 23 2022, 1:25 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
1909

It might be true if you restrict it to cases where ~C0|C1 is true, i.e. the bits set in C0 are a subset of the bits set in C1?

tsymalla added inline comments.Sep 28 2022, 6:33 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
1909

No, I don't think that applies in this particular example.
Maybe it makes more sense to restrict the matching to the case where C0 = 0xffc00 and C1 = 0x3ff00000. In this case, it should work. I cannot think of any (relevant) correlation between C0 and C1.

tsymalla updated this revision to Diff 464985.Oct 4 2022, 6:04 AM

Update pattern matching to include a shl instruction.
Describe a case where such pattern can occur.

tsymalla edited the summary of this revision. (Show Details)Oct 4 2022, 6:05 AM
tsymalla edited the summary of this revision. (Show Details)
foad added a comment.Oct 4 2022, 6:58 AM

I don't immediately see how shifts are relevant.

For the basic case of nested bitfield inserts, perhaps you could create tests for the cases you want to handle. For example, IR equivalents of:

  1. (x & y) | (~x & z) // single insert
  2. (x & y | (~x & ((u & v) | (~u & z))) // nested insert
  3. (x & ((u & v) | (~u & y))) | (~x & z) // nested insert

For the nested inserts we might want separate test cases depending on whether the "select" arguments x and u are known to be disjoint or not. E.g. 0x0F and 0xF0 are disjoint, 0xFF0 and 0x0FF overlap, and for non-constant values we don't know whether they overlap or not.

I don't immediately see how shifts are relevant.

For the basic case of nested bitfield inserts, perhaps you could create tests for the cases you want to handle. For example, IR equivalents of:

  1. (x & y) | (~x & z) // single insert
  2. (x & y | (~x & ((u & v) | (~u & z))) // nested insert
  3. (x & ((u & v) | (~u & y))) | (~x & z) // nested insert

For the nested inserts we might want separate test cases depending on whether the "select" arguments x and u are known to be disjoint or not. E.g. 0x0F and 0xF0 are disjoint, 0xFF0 and 0x0FF overlap, and for non-constant values we don't know whether they overlap or not.

The right and determine way to do this would be to transform the second xor, and, xor sequence into a and, and, or sequence (just like the inner one), so it gets picked up by Isel as well without writing any special pattern matching.
However, InstCombine does not handle such cases and only converts the inner sequence into a sequence so that it can be matched to a BFI.
It is correct that the shifts don't really relate to the pattern, they are used here to match such cases. See for example:

%24 = shl i32 %23, 10
%25 = xor i32 %24, %21
%26 = and i32 %25, 1047552
%27 = xor i32 %26, %21
%28 = select i1 false, i32 %23, i32 %27
%param.0.vec.extract = extractelement <3 x float> %19, i64 0
%29 = fmul reassoc nnan nsz arcp contract afn float %param.0.vec.extract, 1.023000e+03
%30 = fptoui float %29 to i32
%31 = shl i32 %30, 20
%32 = xor i32 %31, %28
%33 = and i32 %32, 1072693248
%34 = xor i32 %33, %28

This gets transformed into:

%15 = shl i32 %14, 10
  %16 = and i32 %15, 1047552
  %17 = and i32 %12, -1047553
  %18 = or i32 %16, %17
  %19 = fmul reassoc nnan nsz arcp contract afn float %8, 1.023000e+03
  %20 = fptoui float %19 to i32
  %21 = shl i32 %20, 20
  %22 = xor i32 %21, %12
  %23 = and i32 %22, 1072693248
  %24 = xor i32 %23, %18

If I see that correctly, this is implemented in InstCombineAndOrXor::visitMaskedMerge.
I don't know if such code sequence ever appears in other places, so I went with the route of implementing it in ISel.

I agree with creating all those tests.

foad added a comment.Oct 5 2022, 2:51 AM

Also, please try to include a link to an Alive2 proof that your transformations are correct.

tsymalla abandoned this revision.Oct 6 2022, 5:11 AM

In this specific example, visitMaskedMerge for xor InstCombine tries to combine the xor, and, xor pattern as long as both xor instruction use the same operand. This works for the first xor, and, xor sequence, but changes the IR in such way that the second xor, and, xor sequence (which depends on the result of the first one) cannot be matched anymore even if it could before. This prevents the second v_bfi from being generated.
I will abandon this change and try to generate the canonical form earlier.