This is an archive of the discontinued LLVM Phabricator instance.

Differential D134418

[AMDGPU] Improve ISel for v_bfi instructions.
AbandonedPublic

Authored by tsymalla on Sep 22 2022, 1:41 AM.

Details

Reviewers

foad
piotr

Summary

This patch introduces a new ISel pattern for v_bfi instructions. In some cases
only a single v_bfi instruction is generated even though more than one could
be, and the final codegen has leftover v_and and v_xor instructions.
Such cases can appear when using nested bitfieldInsert instructions.

A (xor (and imm0, (xor (shl), (xor (and (xor (shl)), imm1)))) has two BFI parts.
The outer BFI part relies on the inner BFI part. During InstCombine, the inner xor
sequence gets turned into bfi_0 = (y & x) | (z & ~x) and later to a BFI, while the
outer BFI part stays untouched and will not be converted into a BFI instruction.
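
To make the two forms in the summary concrete, the following Python sketch models the BFI operation as (y & x) | (z & ~x), the definition quoted from the ISA doc in the diff, and checks exhaustively on 4-bit values that the xor, and, xor sequence computes the same result. The helper names bfi and masked_merge_xor are made up for illustration and are not LLVM APIs.

```python
from itertools import product

WIDTH = 4                 # small width so the check can be exhaustive
MASK = (1 << WIDTH) - 1


def bfi(x, y, z):
    """Bitfield insert: bits of y where x is 1, bits of z where x is 0."""
    return ((y & x) | (z & ~x)) & MASK


def masked_merge_xor(x, y, z):
    """The xor, and, xor form of the same operation: ((y ^ z) & x) ^ z."""
    return (((y ^ z) & x) ^ z) & MASK


def forms_agree():
    """Compare both forms for every 4-bit x, y and z."""
    values = range(MASK + 1)
    return all(bfi(x, y, z) == masked_merge_xor(x, y, z)
               for x, y, z in product(values, repeat=3))


print(forms_agree())
```

Where x has a 1 bit, the two xors with z cancel and leave y. Where x has a 0 bit, only the final xor contributes and leaves z. That is why either form can in principle be selected as a single v_bfi.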

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

tsymalla created this revision. (Sep 22 2022, 1:41 AM)

Herald added a project: Restricted Project. (Sep 22 2022, 1:41 AM)

Herald added subscribers: kosarev, kerbowa, hiraditya and 7 others.

tsymalla requested review of this revision. (Sep 22 2022, 1:41 AM)

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B188122: Diff 462108. (Sep 22 2022, 1:42 AM)

foad requested changes to this revision. (Sep 22 2022, 2:22 AM)

foad added inline comments.

llvm/lib/Target/AMDGPU/SIInstructions.td
1908–1909	As written this is not true. Counterexample: a=x=y=C0=C1=0.
1911	Should have a `DivergentBinFrag` on the outermost node so that we don't select VALU instructions for uniform expressions.

This revision now requires changes to proceed. (Sep 22 2022, 2:22 AM)

Added DivergentBinFrag class to outermost node.
Fixed comment.

Harbormaster completed remote builds in B188222: Diff 462240. (Sep 22 2022, 10:56 AM)

tsymalla edited the summary of this revision. (Sep 22 2022, 11:16 AM)

foad added inline comments. (Sep 23 2022, 12:26 AM)

llvm/lib/Target/AMDGPU/SIInstructions.td
1909	Still not true. Counterexample: a=y=C1=0, z=C0=1.

tsymalla added inline comments. (Sep 23 2022, 1:18 AM)

llvm/lib/Target/AMDGPU/SIInstructions.td
1909	You are correct. I need to rethink the pattern. In general, the equation is not correct when y != z and a=y=C1.

foad added inline comments. (Sep 23 2022, 1:25 AM)

llvm/lib/Target/AMDGPU/SIInstructions.td
1909	It might be true if you restrict it to cases where ~C0|C1 is true, i.e. the bits set in C0 are a subset of the bits set in C1?

tsymalla added inline comments. (Sep 28 2022, 6:33 AM)

llvm/lib/Target/AMDGPU/SIInstructions.td
1909	No, I don't think that applies in this particular example. Maybe it makes more sense to restrict the matching to the case where C0 = 0xffc00 and C1 = 0x3ff00000. In this case, it should work. I cannot think of any (relevant) correlation between C0 and C1.

tsymalla mentioned this in rG82cac65dd286: [NFC][AMDGPU] Pre-commit test for D134418. (Oct 4 2022, 5:44 AM)

Update pattern matching to include a shl instruction.
Describe a case where such pattern can occur.

tsymalla edited the summary of this revision. (Oct 4 2022, 6:05 AM)

Harbormaster completed remote builds in B190190: Diff 464985. (Oct 4 2022, 6:51 AM)

I don't immediately see how shifts are relevant.

For the basic case of nested bitfield inserts, perhaps you could create tests for the cases you want to handle. For example, IR equivalents of:

(x & y) | (~x & z) // single insert
(x & y) | (~x & ((u & v) | (~u & z))) // nested insert
(x & ((u & v) | (~u & y))) | (~x & z) // nested insert

For the nested inserts we might want separate test cases depending on whether the "select" arguments x and u are known to be disjoint or not. E.g. 0x0F and 0xF0 are disjoint, 0xFF0 and 0x0FF overlap, and for non-constant values we don't know whether they overlap or not.
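
The three expressions above, and the disjoint versus overlapping distinction, can be modelled in a few lines of Python. This is only a sketch: single_insert, nested_in_false_arm, nested_in_true_arm and disjoint are hypothetical helper names, and values are truncated to 32 bits to mimic i32.

```python
M32 = 0xFFFFFFFF  # model i32 by truncating to 32 bits


def single_insert(x, y, z):
    # (x & y) | (~x & z)
    return ((x & y) | (~x & z)) & M32


def nested_in_false_arm(x, y, u, v, z):
    # (x & y) | (~x & ((u & v) | (~u & z)))
    return single_insert(x, y, single_insert(u, v, z))


def nested_in_true_arm(x, u, v, y, z):
    # (x & ((u & v) | (~u & y))) | (~x & z)
    return single_insert(x, single_insert(u, v, y), z)


def disjoint(x, u):
    # Two select masks are disjoint when they share no set bit.
    return (x & u) == 0


print(disjoint(0x0F, 0xF0))    # the disjoint pair from the comment
print(disjoint(0xFF0, 0x0FF))  # the overlapping pair from the comment

# With disjoint masks the order of the two inserts does not matter.
y, v, z = 0x11111111, 0x22222222, 0x33333333
a = nested_in_false_arm(0x0F, y, 0xF0, v, z)
b = nested_in_false_arm(0xF0, v, 0x0F, y, z)
print(a == b)

# With overlapping masks the outer insert wins on the shared bits,
# so swapping the two inserts changes the result.
c = nested_in_false_arm(0xFF0, y, 0x0FF, v, z)
d = nested_in_false_arm(0x0FF, v, 0xFF0, y, z)
print(c == d)
```

This illustrates why separate test cases for the disjoint, overlapping and unknown situations are useful: any rewrite that reorders or flattens the two inserts is only safe when the masks are known not to overlap.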

In D134418#3833436, @foad wrote:

I don't immediately see how shifts are relevant.

For the basic case of nested bitfield inserts, perhaps you could create tests for the cases you want to handle. For example, IR equivalents of:

(x & y) | (~x & z) // single insert

(x & y) | (~x & ((u & v) | (~u & z))) // nested insert

(x & ((u & v) | (~u & y))) | (~x & z) // nested insert

For the nested inserts we might want separate test cases depending on whether the "select" arguments x and u are known to be disjoint or not. E.g. 0x0F and 0xF0 are disjoint, 0xFF0 and 0x0FF overlap, and for non-constant values we don't know whether they overlap or not.

The right way to do this would be to transform the second xor, and, xor sequence into an and, and, or sequence (just like the inner one), so that ISel picks it up as well without any special pattern matching.
However, InstCombine does not handle such cases; it only converts the inner sequence into a form that can be matched to a BFI.
It is correct that the shifts don't really relate to the pattern; they are used here only to match such cases. See for example:

%24 = shl i32 %23, 10
%25 = xor i32 %24, %21
%26 = and i32 %25, 1047552
%27 = xor i32 %26, %21
%28 = select i1 false, i32 %23, i32 %27
%param.0.vec.extract = extractelement <3 x float> %19, i64 0
%29 = fmul reassoc nnan nsz arcp contract afn float %param.0.vec.extract, 1.023000e+03
%30 = fptoui float %29 to i32
%31 = shl i32 %30, 20
%32 = xor i32 %31, %28
%33 = and i32 %32, 1072693248
%34 = xor i32 %33, %28

This gets transformed into:

%15 = shl i32 %14, 10
%16 = and i32 %15, 1047552
%17 = and i32 %12, -1047553
%18 = or i32 %16, %17
%19 = fmul reassoc nnan nsz arcp contract afn float %8, 1.023000e+03
%20 = fptoui float %19 to i32
%21 = shl i32 %20, 20
%22 = xor i32 %21, %12
%23 = and i32 %22, 1072693248
%24 = xor i32 %23, %18
If I see that correctly, this is implemented in InstCombineAndOrXor::visitMaskedMerge.
I don't know if such code sequence ever appears in other places, so I went with the route of implementing it in ISel.
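
The two IR listings above can be mirrored in Python to see what changes. This is a hedged model, not LLVM code: s and t stand for the two shl results, z for the value being inserted into, and the function names are invented for this sketch. The constants are the ones from the IR.

```python
import random

M32 = 0xFFFFFFFF
C0 = 1047552       # 0x000FFC00, ten bits starting at bit 10
C1 = 1072693248    # 0x3FF00000, ten bits starting at bit 20


def before_instcombine(s, t, z):
    inner = ((s ^ z) & C0) ^ z                  # first xor, and, xor
    return (((t ^ inner) & C1) ^ inner) & M32   # second xor, and, xor


def after_instcombine(s, t, z):
    inner = (s & C0) | (z & (~C0 & M32))        # and, and, or
    # The first xor now uses z while the last xor uses inner, so the
    # two xors of the outer sequence no longer share an operand.
    return (((t ^ z) & C1) ^ inner) & M32


def two_bfis(s, t, z):
    def bfi(x, y, w):
        return ((y & x) | (w & ~x)) & M32
    return bfi(C1, t, bfi(C0, s, z))


def sample_agrees(trials=10_000, seed=0):
    """Compare the three functions on random 32-bit inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        s, t, z = (rng.getrandbits(32) for _ in range(3))
        if not (before_instcombine(s, t, z)
                == after_instcombine(s, t, z)
                == two_bfis(s, t, z)):
            return False
    return True


print(sample_agrees())
```

In this model all three functions agree because C0 and C1 share no bits, so z and the inner result are identical on the bits selected by C1. The outer sequence is still a bitfield insert in value, but it no longer has the shape of a masked merge with a shared xor operand.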

I agree with creating all those tests.

Also, please try to include a link to an Alive2 proof that your transformations are correct.
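
Short of an Alive2 proof, a brute-force check over a small bit width can at least catch counterexamples of the kind raised earlier in the review. The sketch below is a stand-in under stated assumptions, not Alive2 and not the patch's TableGen pattern: it takes the shape of the transformed IR quoted in this thread, ((t ^ z) & c1) ^ bfi(c0, s, z), and asks for which 3-bit mask pairs it equals bfi(c1, t, bfi(c0, s, z)) on every input. All names are hypothetical.

```python
from itertools import product

WIDTH = 3
FULL = (1 << WIDTH) - 1
VALUES = range(FULL + 1)


def bfi(x, y, z):
    return ((y & x) | (z & ~x)) & FULL


def counterexample(c0, c1):
    """Return (s, t, z) where the rewrite is wrong for these masks, or None."""
    for s, t, z in product(VALUES, repeat=3):
        inner = bfi(c0, s, z)
        lhs = (((t ^ z) & c1) ^ inner) & FULL
        rhs = bfi(c1, t, inner)
        if lhs != rhs:
            return (s, t, z)
    return None


def classify():
    """Split all mask pairs into always-valid ones and the rest."""
    valid, invalid = [], []
    for c0, c1 in product(VALUES, repeat=2):
        (valid if counterexample(c0, c1) is None else invalid).append((c0, c1))
    return valid, invalid


valid, invalid = classify()
# Check whether validity coincides with the masks sharing no bit.
print(all((c0 & c1) == 0 for c0, c1 in valid))
print(all((c0 & c1) != 0 for c0, c1 in invalid))
```

An exhaustive check at 3 bits is no proof for i32, but because the operations are purely bitwise it is a reasonable sanity test before writing the Alive2 input.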

In this specific example, InstCombine's visitMaskedMerge for xor combines the xor, and, xor pattern as long as both xor instructions use the same operand. This works for the first xor, and, xor sequence, but it changes the IR in such a way that the second xor, and, xor sequence (which depends on the result of the first one) can no longer be matched, even though it could be before. This prevents the second v_bfi from being generated.
I will abandon this change and try to generate the canonical form earlier.

Revision Contents

Path                                        Size
llvm/lib/Target/AMDGPU/SIInstructions.td    19 lines
llvm/test/CodeGen/AMDGPU/bfi_int.ll         24 lines

Diff 464985

llvm/lib/Target/AMDGPU/SIInstructions.td

(1,897 preceding lines not shown; enclosing definition: def BFIImm32 : PatFrag<)

 [{
 auto *X = dyn_cast<ConstantSDNode>(N->getOperand(0)->getOperand(1));
 auto *NotX = dyn_cast<ConstantSDNode>(N->getOperand(1)->getOperand(1));
 return X && NotX &&
 ~(unsigned)X->getZExtValue() == (unsigned)NotX->getZExtValue();
 }]
 >;

+// Create two BFI instructions at once, if possible.
+// This tries to handle one-level deep nested bitfieldInserts:
+//
+// ((src << numBits) ^ y) & imm0) ^ bfi(x, y, z)) =>
+// v_bfi (imm0, lshlrev(numBits, src), bfi(x, y, z))
+//
+// Such sequences can occur after InstCombine:
+// A (xor (and imm0, (xor (shl), (xor (and (xor (shl)), imm1)))) has two
+// BFI parts. The outer BFI part relies on the inner BFI part.
+// During InstCombine, the inner xor sequence gets turned into
+// bfi_0 = (y & x) | (z & ~x) and later to a BFI, while the outer BFI part
+// stays untouched and will not be converted into a BFI instruction.
+def : AMDGPUPat <
+(DivergentBinFrag<xor> (and (xor (shl i32:$src, (i32 imm:$numBits)), i32:$y), (i32 imm:$imm0)),
+(BFIImm32 i32:$x, i32:$y, i32:$z)),
+(V_BFI_B32_e64 VSrc_b32:$imm0, (V_LSHLREV_B32_e64 i32:$numBits, i32:$src),
+(V_BFI_B32_e64 VSrc_b32:$x, VSrc_b32:$y, VSrc_b32:$z))
+>;
+
 // Definition from ISA doc:
 // (y & x) | (z & ~x)
 def : AMDGPUPat <
 (DivergentBinFrag<or> (and i32:$y, i32:$x), (and i32:$z, (not i32:$x))),
 (V_BFI_B32_e64 VSrc_b32:$x, VSrc_b32:$y, VSrc_b32:$z)
 >;

 // (y & C) | (z & ~C)

(1,558 following lines not shown)

llvm/test/CodeGen/AMDGPU/bfi_int.ll

(1,906 preceding lines not shown; enclosing context: entry:)

 store i64 %scalar.use, i64 addrspace(1)* undef
 ret void
 }

 define i32 @v_bfi_seq_i32(i32 %x, i32 %y, i32 %z) {
 ; GFX7-LABEL: v_bfi_seq_i32:
 ; GFX7: ; %bb.0:
 ; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX7-NEXT: v_lshlrev_b32_e32 v0, 20, v0
 ; GFX7-NEXT: s_mov_b32 s4, 0xffc00
-; GFX7-NEXT: v_xor_b32_e32 v0, v0, v1
-; GFX7-NEXT: v_bfi_b32 v2, s4, v1, v2
-; GFX7-NEXT: v_and_b32_e32 v0, 0x3ff00000, v0
-; GFX7-NEXT: v_xor_b32_e32 v0, v0, v2
+; GFX7-NEXT: v_bfi_b32 v1, s4, v1, v2
+; GFX7-NEXT: v_lshlrev_b32_e32 v0, 20, v0
+; GFX7-NEXT: s_mov_b32 s4, 0x3ff00000
+; GFX7-NEXT: v_bfi_b32 v0, s4, v0, v1
 ; GFX7-NEXT: s_setpc_b64 s[30:31]
 ;
 ; GFX8-LABEL: v_bfi_seq_i32:
 ; GFX8: ; %bb.0:
 ; GFX8-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
-; GFX8-NEXT: v_lshlrev_b32_e32 v0, 20, v0
 ; GFX8-NEXT: s_mov_b32 s4, 0xffc00
-; GFX8-NEXT: v_xor_b32_e32 v0, v0, v1
-; GFX8-NEXT: v_bfi_b32 v2, s4, v1, v2
-; GFX8-NEXT: v_and_b32_e32 v0, 0x3ff00000, v0
-; GFX8-NEXT: v_xor_b32_e32 v0, v0, v2
+; GFX8-NEXT: v_bfi_b32 v1, s4, v1, v2
+; GFX8-NEXT: v_lshlrev_b32_e32 v0, 20, v0
+; GFX8-NEXT: s_mov_b32 s4, 0x3ff00000
+; GFX8-NEXT: v_bfi_b32 v0, s4, v0, v1
 ; GFX8-NEXT: s_setpc_b64 s[30:31]
 ;
 ; GFX10-LABEL: v_bfi_seq_i32:
 ; GFX10: ; %bb.0:
 ; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
-; GFX10-NEXT: v_lshlrev_b32_e32 v0, 20, v0
-; GFX10-NEXT: v_xor_b32_e32 v0, v0, v1
 ; GFX10-NEXT: v_bfi_b32 v1, 0xffc00, v1, v2
-; GFX10-NEXT: v_and_b32_e32 v0, 0x3ff00000, v0
-; GFX10-NEXT: v_xor_b32_e32 v0, v0, v1
+; GFX10-NEXT: v_lshlrev_b32_e32 v0, 20, v0
+; GFX10-NEXT: v_bfi_b32 v0, 0x3ff00000, v0, v1
 ; GFX10-NEXT: s_setpc_b64 s[30:31]
 ;
 ; GFX8-GISEL-LABEL: v_bfi_seq_i32:
 ; GFX8-GISEL: ; %bb.0:
 ; GFX8-GISEL-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
 ; GFX8-GISEL-NEXT: v_lshlrev_b32_e32 v0, 20, v0
 ; GFX8-GISEL-NEXT: v_and_b32_e32 v3, 0xffc00, v1
 ; GFX8-GISEL-NEXT: v_and_b32_e32 v2, 0xfff003ff, v2

(26 following lines not shown)