This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Introduce new ISel combine for trunc-slr patterns
ClosedPublic

Authored by tsymalla on Jan 28 2022, 5:32 AM.

Details

Summary

In some cases, when selecting a (trunc (srl)) pattern, the srl gets translated
to a v_lshrrev_b32_e64 instruction, whereas the truncation gets selected to
a sequence of v_and_b32_e64 and v_cmp_eq_u32_e64. In the final ISA, this appears
as a test of the nth bit:

v_lshrrev_b32_e32 v0, 2, v1
v_and_b32_e32 v0, 1, v0
v_cmp_eq_u32_e32 vcc_lo, 1, v0

However, when the shift amount is known at compile time, the whole sequence can be
reduced to two VALU instructions by adjusting the constant operand of the v_and to (1 << lshrrev_operand):

v_and_b32_e32 v0, (1 << 2), v1
v_cmp_ne_u32_e32 vcc_lo, 0, v0

In the example above, the following pseudo-code:

v0 = (v1 >> 2)
v0 = v0 & 1
vcc_lo = (v0 == 1)

would be translated to:

v0 = v1 & 0b100
vcc_lo = (v0 != 0)

which should yield an equivalent result.
This is a bit hard to test, as one needs to force the SelectionDAG to contain
the relevant nodes before instruction selection; the test sequence was roughly
derived from a production shader.
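
As a quick standalone sanity check of the claimed equivalence (an editorial illustration only, not part of the patch), here is a minimal C++ sketch comparing the "shift, mask, compare against 1" form with the "mask, compare against 0" form for every bit position:

#include <cassert>
#include <cstdint>

// "v_lshrrev + v_and + v_cmp_eq" form: ((V >> N) & 1) == 1
static bool bitSetViaShift(uint32_t V, unsigned N) {
  return ((V >> N) & 1u) == 1u;
}

// "v_and + v_cmp_ne" form: (V & (1 << N)) != 0
static bool bitSetViaMask(uint32_t V, unsigned N) {
  return (V & (1u << N)) != 0u;
}

int main() {
  for (uint32_t V : {0u, 1u, 4u, 5u, 0xdeadbeefu, 0xffffffffu})
    for (unsigned N = 0; N < 32; ++N)
      assert(bitSetViaShift(V, N) == bitSetViaMask(V, N));
  return 0;
}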

Diff Detail

Event Timeline

tsymalla created this revision.Jan 28 2022, 5:32 AM
tsymalla requested review of this revision.Jan 28 2022, 5:32 AM
Herald added a project: Restricted Project.Jan 28 2022, 5:32 AM
foad added a comment.Jan 28 2022, 6:22 AM

It seems like the xor is getting in the way. Would something like D38161 help instead?

> It seems like the xor is getting in the way. Would something like D38161 help instead?

Currently, the xor gets combined to a setcc_ne which gets combined to the srl / trunc sequence.
Initially, there is the xor / setcc_eq sequence which could be simplified like in D38161, removing the need for the xor.
Probably that would clean up everything a bit.

Can this be done as a combine instead? Also, if we handle this for the VALU, we should also handle it for the SALU.
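
For illustration, a rough sketch of what such a rewrite could look like as a DAG combine. This is hypothetical: the helper name and where it would be hooked in are made up, the landed change ended up as an ISel pattern in SIInstructions.td instead, and (as foad notes later in this thread) a combine in this direction risks fighting the generic fold in TargetLowering::SimplifySetCC that produces the srl/trunc form in the first place.

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Hypothetical: fold (i1 (trunc (srl X, C))) -> (setcc (and X, 1 << C), 0, ne).
static SDValue combineTruncOfSrl(SDNode *N, SelectionDAG &DAG) {
  if (N->getOpcode() != ISD::TRUNCATE || N->getValueType(0) != MVT::i1)
    return SDValue();

  SDValue Srl = N->getOperand(0);
  if (Srl.getOpcode() != ISD::SRL)
    return SDValue();

  EVT VT = Srl.getValueType();
  auto *ShAmt = dyn_cast<ConstantSDNode>(Srl.getOperand(1));
  if (!ShAmt || ShAmt->getZExtValue() >= VT.getScalarSizeInBits())
    return SDValue();

  SDLoc DL(N);
  // Mask out the tested bit and compare the result against zero.
  SDValue Mask = DAG.getConstant(1ULL << ShAmt->getZExtValue(), DL, VT);
  SDValue And = DAG.getNode(ISD::AND, DL, VT, Srl.getOperand(0), Mask);
  return DAG.getSetCC(DL, MVT::i1, And, DAG.getConstant(0, DL, VT),
                      ISD::SETNE);
}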

llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll
12

Don't need control flow in this test. Should also test the pattern for both scalar and vector inputs.

foad added a comment.Jan 28 2022, 8:05 AM

>> It seems like the xor is getting in the way. Would something like D38161 help instead?

> Currently, the xor gets combined to a setcc_ne which gets combined to the srl / trunc sequence.
> Initially, there is the xor / setcc_eq sequence which could be simplified like in D38161, removing the need for the xor.
> Probably that would clean up everything a bit.

Why do we even have the xor in the IR? Normally (if you run IR optimizations as well as backend passes) instcombine would combine it into the icmp. Why didn't this happen? Was it introduced late by StructurizeCFG? Does D118478 help?

>>> It seems like the xor is getting in the way. Would something like D38161 help instead?

>> Currently, the xor gets combined to a setcc_ne which gets combined to the srl / trunc sequence.
>> Initially, there is the xor / setcc_eq sequence which could be simplified like in D38161, removing the need for the xor.
>> Probably that would clean up everything a bit.

> Why do we even have the xor in the IR? Normally (if you run IR optimizations as well as backend passes) instcombine would combine it into the icmp. Why didn't this happen? Was it introduced late by StructurizeCFG? Does D118478 help?

You are correct, the xor is not needed in the IR. Simply inverting the predicate of the icmp should be sufficient. The xor is combined into a setcc, so just adjusting the test should be fine.

tsymalla added inline comments.Jan 28 2022, 8:20 AM
llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll
12

The control flow was used to prevent the truncate in the SDag (which is part of the pattern matched here) from being optimized away. I am going to check whether the adjustments to the test (see the comment from @foad) help here.
Going to test additional cases in the new revision.

tsymalla updated this revision to Diff 404477.Jan 31 2022, 4:03 AM

Added handling for scalar cases, improved test case.

foad added a comment.Feb 1 2022, 1:39 AM

I'm still not sure that we need this, if the xor can be cleaned up earlier. Does D118623 help?

> I'm still not sure that we need this, if the xor can be cleaned up earlier. Does D118623 help?

Jay, unfortunately it doesn't help. I tried your patch out, but for my test case the matcher won't apply, as there is no XOR in the LLVM IR; it only gets created as an SDag node. By the way, the comments in your change refer to the function "buildConditions"; that should be "insertConditions".
However, the idea here is not to remove the XOR, but to remove an additional VALU instruction that gets created because the TRUNCATE and the AND are translated separately in the MIR instead of being handled as one sequence. Replacing the "setcc ne" with its inverse and not introducing an additional XOR might remove the need for this change, but what about other possible cases where this pattern could get matched?

foad added a comment.Feb 2 2022, 2:34 AM

>> I'm still not sure that we need this, if the xor can be cleaned up earlier. Does D118623 help?

> Jay, unfortunately it doesn't help. I tried your patch out, but for my test case the matcher won't apply, as there is no XOR in the LLVM IR; it only gets created as an SDag node. By the way, the comments in your change refer to the function "buildConditions"; that should be "insertConditions".
> However, the idea here is not to remove the XOR, but to remove an additional VALU instruction that gets created because the TRUNCATE and the AND are translated separately in the MIR instead of being handled as one sequence. Replacing the "setcc ne" with its inverse and not introducing an additional XOR might remove the need for this change, but what about other possible cases where this pattern could get matched?

Well there is an XOR in the LLVM IR in the test case in your patch! Can you share another test case?

As a rule of thumb, if there is a missed optimisation, I would like to try to fix it as early as possible in the pass pipeline. Otherwise SelectionDAG has to get more and more complicated, to try to optimise things that should really have been cleaned up in the IR before instruction selection.

llvm/lib/Target/AMDGPU/SIInstructions.td
2290

v_cmp_ne_u32_e64 $a, 0, $a is probably better because 0 is always an inline constant, but 1 << $b might not be.

llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll
1

Please auto-generate the checks with utils/update_llc_test_checks.py and pre-commit this test with the old codegen, so that this patch will clearly show how the codegen changes.

41

"No newline at end of file" :)

>>> I'm still not sure that we need this, if the xor can be cleaned up earlier. Does D118623 help?

>> Jay, unfortunately it doesn't help. I tried your patch out, but for my test case the matcher won't apply, as there is no XOR in the LLVM IR; it only gets created as an SDag node. By the way, the comments in your change refer to the function "buildConditions"; that should be "insertConditions".
>> However, the idea here is not to remove the XOR, but to remove an additional VALU instruction that gets created because the TRUNCATE and the AND are translated separately in the MIR instead of being handled as one sequence. Replacing the "setcc ne" with its inverse and not introducing an additional XOR might remove the need for this change, but what about other possible cases where this pattern could get matched?

> Well there is an XOR in the LLVM IR in the test case in your patch! Can you share another test case?

> As a rule of thumb, if there is a missed optimisation, I would like to try to fix it as early as possible in the pass pipeline. Otherwise SelectionDAG has to get more and more complicated, to try to optimise things that should really have been cleaned up in the IR before instruction selection.

I think I misunderstood what you were saying. The pattern that gets matched here is inside the entry block of both functions. The XOR in the second test case (uniform) is there to prevent the truncate in the SDag from being optimized away and has no semantic relevance for the actual test. BTW, your patch still doesn't apply for me in this case. I agree with optimizing the XOR.
Thanks for your additional comments.

foad added a comment.Feb 2 2022, 7:23 AM

> The XOR in the second test case (uniform) is there to prevent the truncate in the SDag from being optimized away and has no semantic relevance for the actual test.

OK, sorry, I did not read the test carefully enough. I have looked at it more carefully now.

So the problem is that this DAG:

    t4: i32 = and t2, Constant:i32<2>
  t7: i1 = setcc t4, Constant:i32<0>, setne:ch
t9: i1,i64,ch = llvm.amdgcn.if t0, TargetConstant:i64<1324>, t7

gets optimized like this by a generic combine implemented in TargetLowering::SimplifySetCC:

    t24: i32 = srl t2, Constant:i64<1>
  t25: i1 = truncate t24
t9: i1,i64,ch = llvm.amdgcn.if t0, TargetConstant:i64<1324>, t25

I guess there is no way we can undo that with another combine, because they will end up fighting each other. So I think your patch is reasonable.

llvm/lib/Target/AMDGPU/SIInstructions.td
2290

Also, if you do this, there will only be one use of the constant, not two, so I don't think you will have to "Restrict the range to prevent using an additional VGPR for the shifted value".

llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll
1

Actually it will have to use update_mir_test_checks.

tsymalla updated this revision to Diff 405561.Feb 3 2022, 3:05 AM
tsymalla marked 5 inline comments as done.

Change requests from review addressed.

llvm/lib/Target/AMDGPU/SIInstructions.td
2290

The resulting value should still be checked to ensure no 32-bit overflow occurs, correct? For instance, if the shift amount is something like 33, 1 << 33 would not fit into 32 bits.

foad added inline comments.Feb 3 2022, 3:45 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
2290

I'm not sure there is any need to check. The result of a shift by 33 is undefined, so it doesn't really matter what code we generate in that case.

tsymalla added inline comments.Feb 3 2022, 5:30 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
2290

Sure, I’ll remove the check.

tsymalla updated this revision to Diff 405615.Feb 3 2022, 6:33 AM

Removed range checks.

tsymalla marked 2 inline comments as done.Feb 3 2022, 6:34 AM
tsymalla edited the summary of this revision. (Show Details)
foad accepted this revision.Feb 3 2022, 6:48 AM

LGTM, thanks! Just very minor comments inline.

llvm/lib/Target/AMDGPU/SIInstructions.td
2273

Don't need parens around the << expression.

llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll
1

"dagcombine" is not a very good name for the file, because this is isel not a combine, but I guess it's OK if you've already committed the file.

This revision is now accepted and ready to land.Feb 3 2022, 6:48 AM
This revision was landed with ongoing or failed builds.Feb 3 2022, 9:06 AM
This revision was automatically updated to reflect the committed changes.