This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Remove unnecessary waitcnts
Abandoned · Public

Authored by OutOfCache on Mar 24 2023, 10:39 AM.

Details

Reviewers

arsenm
mareko
tsymalla
msearles
sebastian-ne
foad

Group Reviewers

Restricted Project

Summary

Some ds_* instructions do not access LDS memory.
Therefore, @sebastian-ne suggested removing the waitcnts.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	60,030 ms	x64 debian > MLIR.Examples/standalone::test.toy

Event Timeline

OutOfCache created this revision. · Mar 24 2023, 10:39 AM

Herald added a project: Restricted Project. · Mar 24 2023, 10:39 AM

Herald added subscribers: kosarev, foad, kerbowa and 7 others.

OutOfCache requested review of this revision. · Mar 24 2023, 10:39 AM

Herald added a project: Restricted Project. · Mar 24 2023, 10:39 AM

Herald added subscribers: llvm-commits, wdng.

OutOfCache edited the summary of this revision. · Mar 24 2023, 10:44 AM

OutOfCache added reviewers: arsenm, mareko, Flakebi, tsymalla, msearles.

OutOfCache added a project: Restricted Project.

OutOfCache added a subscriber: Flakebi.

Harbormaster completed remote builds in B221637: Diff 508156. · Mar 24 2023, 11:34 AM

foad added inline comments. · Mar 25 2023, 5:02 AM

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1393 ↗	(On Diff #508156)	Might be better to make usesLGKM_CNT itself more accurate, by tweaking DSInstructions.td so that LGKM_CNT is set to 0 for instructions that don't use it.
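To illustrate the shape of that suggestion, here is a minimal, self-contained C++ sketch. It is not LLVM code: the `sketch` namespace, the `InstrDesc` struct, and the flag constants are hypothetical stand-ins for per-instruction flags generated from a .td file. The idea is that the pass only asks "is the flag set?", and the instruction definition decides the answer, so no opcode special cases are needed in the pass.

```cpp
#include <cstdint>

// Hypothetical stand-ins for per-instruction flags; the names echo the review
// discussion, but this is not the LLVM data structure.
namespace sketch {
constexpr std::uint64_t FlagDS = 1ull << 0;       // instruction uses the DS encoding
constexpr std::uint64_t FlagLGKM_CNT = 1ull << 1; // instruction is tracked by lgkmcnt

struct InstrDesc {
  const char *Name;
  std::uint64_t Flags;
};

// Pass side: the predicate consults the flag only.
inline bool usesLgkmCnt(const InstrDesc &D) {
  return (D.Flags & FlagLGKM_CNT) != 0;
}

// Definition side: what the instruction table would record.
constexpr InstrDesc DsReadB32{"ds_read_b32", FlagDS | FlagLGKM_CNT};
// Flag cleared, as this revision proposed; the later discussion in this
// thread concludes that the wait is in fact required.
constexpr InstrDesc DsBpermuteB32Proposed{"ds_bpermute_b32", FlagDS};
} // namespace sketch
```

With this layout, changing whether an instruction is waited on is a one-line change in its definition, which is what the revised diff does with `let LGKM_CNT = 0;`.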

Updated the diff to edit DSInstructions.td instead of SIInsertWaitcnts.cpp.

OutOfCache edited the summary of this revision. · Mar 27 2023, 3:27 AM

OutOfCache edited reviewers, added: sebastian-ne, Restricted Project; removed: Flakebi.

OutOfCache marked an inline comment as done.

OutOfCache added a subscriber: sebastian-ne.

OutOfCache added inline comments.

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
1393 ↗	(On Diff #508156)	Thank you! For some reason, I only found the definition of the intrinsic. I did not know there was a separate .td file for DS Instructions.

LGTM, thanks.

This revision is now accepted and ready to land. · Mar 27 2023, 3:45 AM

Herald added a subscriber: StephenFan. · Mar 27 2023, 3:45 AM

Harbormaster completed remote builds in B221937: Diff 508553. · Mar 27 2023, 4:14 AM

Are you sure about this? lgkmcnt(0) isn't about accessing LDS memory, but about waiting for the result to be received from the LDS block.

In D146829#4224452, @mareko wrote:

Are you sure about this? lgkmcnt(0) isn't about accessing LDS memory, but about waiting for the result to be received from the LDS block.

In that case I am not certain. Is there a way to verify whether this is acceptable?

The waitcnt's serve two purposes. They notify that the result of the operation is available to the thread that requested it, and they ensure that the effect of the operation is visible to other threads before this thread continues to do other operations. This latter purpose is used to ensure the happens-before relationship in the memory model. So for example, if a VMEM release atomic is done at workgroup scope, should these operations be visible to other threads before the result that is store-released onto VMEM?

If these operations go down the LDS queues (even if they are not performed in the LDS itself), then there are 2 queues for the waves of a workgroup, but a single L1 shared by all waves of a workgroup for VMEM. So to ensure visibility to all waves in the workgroup the LDS operation must be waited to complete before starting the VMEM operation if there needs to be a happens-before relation. That waiting is achieved by the waitcnt on LGKM before executing the VMEM instruction.
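The ordering argument above can be sketched with a toy timing model in C++. Everything here is an assumption made for illustration: the `Visibility` struct, the `simulate` function, and the latencies are invented, and nothing is measured from hardware. The model only shows that, when the operation on the LDS queue is slower than the VMEM store, the store can become visible first unless its issue is delayed until the LDS-queue operation has completed.

```cpp
#include <algorithm>

// Toy timing model; all latencies are made-up numbers, not hardware data.
struct Visibility {
  int ldsVisibleAt;   // time at which the LDS-queue operation has completed
  int vmemVisibleAt;  // time at which the VMEM store has completed
};

// The wave issues an LDS-queue operation at time 0, then a VMEM store.
// If waitOnLgkm is true, the store is not issued until the LDS-queue
// operation has completed (the effect of s_waitcnt lgkmcnt(0)).
inline Visibility simulate(bool waitOnLgkm, int ldsLatency, int vmemLatency) {
  const int ldsDone = 0 + ldsLatency;
  const int earliestIssue = 1;  // next issue slot after the LDS-queue operation
  const int vmemIssue =
      waitOnLgkm ? std::max(earliestIssue, ldsDone) : earliestIssue;
  return Visibility{ldsDone, vmemIssue + vmemLatency};
}

// True when the LDS-queue effect is visible no later than the VMEM store.
inline bool ldsVisibleFirst(const Visibility &v) {
  return v.ldsVisibleAt <= v.vmemVisibleAt;
}
```

For example, with an LDS-queue latency of 10 and a VMEM latency of 2, `simulate(false, 10, 2)` lets the store complete at time 3, before the LDS-queue operation at time 10, while `simulate(true, 10, 2)` delays the store so that it completes at time 12.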

In D146829#4224849, @t-tye wrote:

The waitcnt's serve two purposes. They notify that the result of the operation is available to the thread that requested it, and they ensure that the effect of the operation is visible to other threads before this thread continues to do other operations. This latter purpose is used to ensure the happens-before relationship in the memory model. So for example, if a VMEM release atomic is done at workgroup scope, should these operations be visible to other threads before the result that is store-released onto VMEM?

If these operations go down the LDS queues (even if they are not performed in the LDS itself), then there are 2 queues for the waves of a workgroup, but a single L1 shared by all waves of a workgroup for VMEM. So to ensure visibility to all waves in the workgroup the LDS operation must be waited to complete before starting the VMEM operation if there needs to be a happens-before relation. That waiting is achieved by the waitcnt on LGKM before executing the VMEM instruction.

Thank you for taking the time to explain! If I understand correctly, the waitcnt does not only notify the current lane that the result is available, but also the other lanes within the same workgroup. So without the waitcnt, there is a possibility that the other lanes see the result of the VMEM instruction first?

After further discussion, @mareko is right and the waitcnts are necessary. Thanks for bringing that up!

All lanes receive the result at the same time because it's really just v_mov under the hood. s_waitcnt only waits for it.

the waitcnt does not only notify the current lane that the result is available, but also the other lanes within the same workgroup

FWIW, it's lanes of the same wave, not workgroup.

In D146829#4227224, @OutOfCache wrote:

In D146829#4224849, @t-tye wrote:

The waitcnt's serve two purposes. They notify that the result of the operation is available to the thread that requested it, and they ensure that the effect of the operation is visible to other threads before this thread continues to do other operations. This latter purpose is used to ensure the happens-before relationship in the memory model. So for example, if a VMEM release atomic is done at workgroup scope, should these operations be visible to other threads before the result that is store-released onto VMEM?

If these operations go down the LDS queues (even if they are not performed in the LDS itself), then there are 2 queues for the waves of a workgroup, but a single L1 shared by all waves of a workgroup for VMEM. So to ensure visibility to all waves in the workgroup the LDS operation must be waited to complete before starting the VMEM operation if there needs to be a happens-before relation. That waiting is achieved by the waitcnt on LGKM before executing the VMEM instruction.

Thank you for taking the time to explain! If I understand correctly, the waitcnt does not only notify the current lane that the result is available, but also the other lanes within the same workgroup. So without the waitcnt, there is a possibility that the other lanes see the result of the VMEM instruction first?

waitcnt causes the thread to stall until the previous operations that change the counter are completed. This can be used to ensure a result has been returned to the thread. It can also be used to delay executing the following instructions until the previous operations have completed and so are globally visible. For the memory model it is the latter that it gets used for. If the thread ensures that the LDS/... operations are complete before executing a VMEM operation, it ensures all waves will see the updates to LDS and VMEM in the same order, which is a requirement for seq_cst and release memory orders.
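The stalling behaviour described here can be modelled roughly as a counter of outstanding operations. The following C++ sketch is a simplification under stated assumptions: the `WaitCounter` class and its methods are hypothetical, time is an abstract integer, and operations are assumed to complete in issue order, which need not hold for every kind of operation the real counter tracks.

```cpp
#include <cstddef>
#include <deque>

// Toy model of one wave's outstanding-operation counter. Assumes operations
// complete in issue order, which is a simplification.
class WaitCounter {
public:
  // Issue an operation that completes at the given time; the counter goes up.
  void issue(int completesAt) { pending_.push_back(completesAt); }

  // Model of "s_waitcnt N": stall until at most maxOutstanding operations
  // remain in flight, and return the time at which the wave resumes.
  int waitcnt(std::size_t maxOutstanding, int now) {
    while (pending_.size() > maxOutstanding) {
      if (pending_.front() > now) now = pending_.front();  // stall until done
      pending_.pop_front();                                // counter goes down
    }
    return now;
  }

  // Operations the model has not yet retired.
  std::size_t outstanding() const { return pending_.size(); }

private:
  std::deque<int> pending_;  // completion times, oldest first
};
```

In this model, a wave that issues two operations completing at times 10 and 15 and then calls `waitcnt(0, now)` cannot proceed before time 15, so anything it issues afterwards is ordered after both operations have completed.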

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/DSInstructions.td	4 lines
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.bpermute.ll	4 lines
llvm/test/CodeGen/AMDGPU/waitcnt-permute.mir	2 lines
llvm/test/CodeGen/AMDGPU/wqm.ll	6 lines

Diff 508553

llvm/lib/Target/AMDGPU/DSInstructions.td

[... 407 lines not shown ...]
  : DS_GWS<opName,
  " $data0$offset gds"> {

  let has_gws_data0 = 1;
  let hasSideEffects = 1;
  }

  class DS_VOID <string opName> : DS_Pseudo<opName,
  (outs), (ins), ""> {
+ let LGKM_CNT = 0;
  let mayLoad = 0;
  let mayStore = 0;
  let hasSideEffects = 1;
  let UseNamedOperandTable = 0;
  let AsmMatchConverter = "";

  let has_vdst = 0;
  let has_addr = 0;
[... 9 lines not shown ...]
  class DS_1A1D_PERMUTE <string opName, SDPatternOperator node = null_frag,
  RegisterOperand data_op = getLdStRegisterOperand<VGPR_32>.ret>
  : DS_Pseudo<opName,
  (outs data_op:$vdst),
  (ins VGPR_32:$addr, data_op:$data0, offset:$offset),
  " $vdst, $addr, $data0$offset",
  [(set i32:$vdst,
  (node (DS1Addr1Offset i32:$addr, i32:$offset), i32:$data0))] > {

+ let LGKM_CNT = 0;
  let mayLoad = 0;
  let mayStore = 0;
  let isConvergent = 1;

  let has_data1 = 0;
  let has_gds = 0;
  }

[... 176 lines not shown ...]
  def DS_XOR_SRC2_B64 : DS_1A<"ds_xor_src2_b64">;
  def DS_MIN_SRC2_F64 : DS_1A<"ds_min_src2_f64">;
  def DS_MAX_SRC2_F64 : DS_1A<"ds_max_src2_f64">;

  def DS_WRITE_SRC2_B32 : DS_1A<"ds_write_src2_b32">;
  def DS_WRITE_SRC2_B64 : DS_1A<"ds_write_src2_b64">;
  } // End SubtargetPredicate = HasDsSrc2Insts

- let Uses = [EXEC], mayLoad = 0, mayStore = 0, isConvergent = 1 in {
+ let Uses = [EXEC], LGKM_CNT = 0, mayLoad = 0, mayStore = 0, isConvergent = 1 in {
  def DS_SWIZZLE_B32 : DS_1A_RET <"ds_swizzle_b32", VGPR_32, 0, SwizzleImm>;
  }

  let mayStore = 0 in {
  defm DS_READ_I8 : DS_1A_RET_mc<"ds_read_i8">;
  defm DS_READ_U8 : DS_1A_RET_mc<"ds_read_u8">;
  defm DS_READ_I16 : DS_1A_RET_mc<"ds_read_i16">;
  defm DS_READ_U16 : DS_1A_RET_mc<"ds_read_u16">;
[... 1,056 lines not shown ...]

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.ds.bpermute.ll

[... 23 lines not shown ...]
  define amdgpu_kernel void @ds_bpermute_imm_index(ptr addrspace(1) %out, i32 %base_index, i32 %src) nounwind {
  %bpermute = call i32 @llvm.amdgcn.ds.bpermute(i32 64, i32 %src) #0
  store i32 %bpermute, ptr addrspace(1) %out, align 4
  ret void
  }

  ; CHECK-LABEL: {{^}}ds_bpermute_add_shl:
  ; CHECK: ds_bpermute_b32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} offset:4
- ; CHECK: s_waitcnt lgkmcnt
+ ; CHECK-NOT: s_waitcnt lgkmcnt
  define void @ds_bpermute_add_shl(ptr addrspace(1) %out, i32 %base_index, i32 %src) nounwind {
  %index = add i32 %base_index, 1
  %byte_index = shl i32 %index, 2
  %bpermute = call i32 @llvm.amdgcn.ds.bpermute(i32 %byte_index, i32 %src) #0
  store i32 %bpermute, ptr addrspace(1) %out, align 4
  ret void
  }

  ; CHECK-LABEL: {{^}}ds_bpermute_or_shl:
  ; CHECK: ds_bpermute_b32 v{{[0-9]+}}, v{{[0-9]+}}, v{{[0-9]+}} offset:4
- ; CHECK: s_waitcnt lgkmcnt
+ ; CHECK-NOT: s_waitcnt lgkmcnt
  define void @ds_bpermute_or_shl(ptr addrspace(1) %out, i32 %base_index, i32 %src) nounwind {
  %masked = and i32 %base_index, 62
  %index = or i32 %masked, 1
  %byte_index = shl i32 %index, 2
  %bpermute = call i32 @llvm.amdgcn.ds.bpermute(i32 %byte_index, i32 %src) #0
  store i32 %bpermute, ptr addrspace(1) %out, align 4
  ret void
  }

  attributes #0 = { nounwind readnone convergent }

llvm/test/CodeGen/AMDGPU/waitcnt-permute.mir

  # RUN: llc -mtriple=amdgcn -mcpu=fiji -verify-machineinstrs -run-pass si-insert-waitcnts -o - %s | FileCheck %s

  ...
  # CHECK-LABEL: name: waitcnt-permute{{$}}
  # CHECK: DS_BPERMUTE_B32
- # CHECK-NEXT: S_WAITCNT 127
+ # CHECK-NOT: S_WAITCNT 127

  name: waitcnt-permute
  liveins:
  - { reg: '$vgpr0' }
  - { reg: '$vgpr1' }
  - { reg: '$sgpr30_sgpr31' }
  body: |
  bb.0:
  liveins: $vgpr0, $vgpr1, $sgpr30_sgpr31

  $vgpr0 = DS_BPERMUTE_B32 killed $vgpr0, killed $vgpr1, 0, implicit $exec
  $vgpr0 = V_ADD_F32_e32 1065353216, killed $vgpr0, implicit $mode, implicit $exec
  S_SETPC_B64_return killed $sgpr30_sgpr31, implicit killed $vgpr0

  ...

llvm/test/CodeGen/AMDGPU/wqm.ll

[... 2,210 lines not shown ...]
  ; GFX9-W64-NEXT: v_cvt_i32_f32_e32 v0, v0
  ; GFX9-W64-NEXT: v_mov_b32_e32 v2, v0
  ; GFX9-W64-NEXT: s_not_b64 exec, exec
  ; GFX9-W64-NEXT: v_mov_b32_e32 v2, 0
  ; GFX9-W64-NEXT: s_not_b64 exec, exec
  ; GFX9-W64-NEXT: s_or_saveexec_b64 s[0:1], -1
  ; GFX9-W64-NEXT: ds_swizzle_b32 v2, v2 offset:swizzle(SWAP,2)
  ; GFX9-W64-NEXT: s_mov_b64 exec, s[0:1]
- ; GFX9-W64-NEXT: s_waitcnt lgkmcnt(0)
  ; GFX9-W64-NEXT: v_mov_b32_e32 v0, v2
  ; GFX9-W64-NEXT: v_cvt_f32_i32_e32 v1, v0
  ; GFX9-W64-NEXT: .LBB36_2: ; %ENDIF
  ; GFX9-W64-NEXT: s_or_b64 exec, exec, s[14:15]
  ; GFX9-W64-NEXT: s_and_b64 exec, exec, s[12:13]
  ; GFX9-W64-NEXT: v_mov_b32_e32 v0, v1
  ; GFX9-W64-NEXT: ; return to shader part epilog
  ;
[... 13 lines not shown ...]
  ; GFX10-W32-NEXT: v_cvt_i32_f32_e32 v0, v0
  ; GFX10-W32-NEXT: v_mov_b32_e32 v2, v0
  ; GFX10-W32-NEXT: s_not_b32 exec_lo, exec_lo
  ; GFX10-W32-NEXT: v_mov_b32_e32 v2, 0
  ; GFX10-W32-NEXT: s_not_b32 exec_lo, exec_lo
  ; GFX10-W32-NEXT: s_or_saveexec_b32 s0, -1
  ; GFX10-W32-NEXT: ds_swizzle_b32 v2, v2 offset:swizzle(SWAP,2)
  ; GFX10-W32-NEXT: s_mov_b32 exec_lo, s0
- ; GFX10-W32-NEXT: s_waitcnt lgkmcnt(0)
  ; GFX10-W32-NEXT: v_mov_b32_e32 v0, v2
  ; GFX10-W32-NEXT: v_cvt_f32_i32_e32 v1, v0
  ; GFX10-W32-NEXT: .LBB36_2: ; %ENDIF
  ; GFX10-W32-NEXT: s_or_b32 exec_lo, exec_lo, s13
  ; GFX10-W32-NEXT: s_and_b32 exec_lo, exec_lo, s12
  ; GFX10-W32-NEXT: v_mov_b32_e32 v0, v1
  ; GFX10-W32-NEXT: ; return to shader part epilog
  main_body:
[... 490 lines not shown ...]
  ; GFX9-W64-NEXT: v_cvt_i32_f32_e32 v0, v0
  ; GFX9-W64-NEXT: v_mov_b32_e32 v2, v0
  ; GFX9-W64-NEXT: s_not_b64 exec, exec
  ; GFX9-W64-NEXT: v_mov_b32_e32 v2, 0
  ; GFX9-W64-NEXT: s_not_b64 exec, exec
  ; GFX9-W64-NEXT: s_or_saveexec_b64 s[0:1], -1
  ; GFX9-W64-NEXT: ds_swizzle_b32 v2, v2 offset:swizzle(SWAP,2)
  ; GFX9-W64-NEXT: s_mov_b64 exec, s[0:1]
- ; GFX9-W64-NEXT: s_waitcnt lgkmcnt(0)
  ; GFX9-W64-NEXT: v_mov_b32_e32 v0, v2
  ; GFX9-W64-NEXT: v_cvt_f32_i32_e32 v1, v0
  ; GFX9-W64-NEXT: .LBB45_2: ; %ENDIF
  ; GFX9-W64-NEXT: s_or_b64 exec, exec, s[14:15]
  ; GFX9-W64-NEXT: s_and_b64 exec, exec, s[12:13]
  ; GFX9-W64-NEXT: v_mov_b32_e32 v0, v1
  ; GFX9-W64-NEXT: ; return to shader part epilog
  ;
[... 13 lines not shown ...]
  ; GFX10-W32-NEXT: v_cvt_i32_f32_e32 v0, v0
  ; GFX10-W32-NEXT: v_mov_b32_e32 v2, v0
  ; GFX10-W32-NEXT: s_not_b32 exec_lo, exec_lo
  ; GFX10-W32-NEXT: v_mov_b32_e32 v2, 0
  ; GFX10-W32-NEXT: s_not_b32 exec_lo, exec_lo
  ; GFX10-W32-NEXT: s_or_saveexec_b32 s0, -1
  ; GFX10-W32-NEXT: ds_swizzle_b32 v2, v2 offset:swizzle(SWAP,2)
  ; GFX10-W32-NEXT: s_mov_b32 exec_lo, s0
- ; GFX10-W32-NEXT: s_waitcnt lgkmcnt(0)
  ; GFX10-W32-NEXT: v_mov_b32_e32 v0, v2
  ; GFX10-W32-NEXT: v_cvt_f32_i32_e32 v1, v0
  ; GFX10-W32-NEXT: .LBB45_2: ; %ENDIF
  ; GFX10-W32-NEXT: s_or_b32 exec_lo, exec_lo, s13
  ; GFX10-W32-NEXT: s_and_b32 exec_lo, exec_lo, s12
  ; GFX10-W32-NEXT: v_mov_b32_e32 v0, v1
  ; GFX10-W32-NEXT: ; return to shader part epilog
  main_body:
[... 31 lines not shown ...]
  ; GFX9-W64-NEXT: s_cbranch_execz .LBB46_2
  ; GFX9-W64-NEXT: ; %bb.1: ; %IF
  ; GFX9-W64-NEXT: image_sample v2, v2, s[0:7], s[8:11] dmask:0x1
  ; GFX9-W64-NEXT: s_waitcnt vmcnt(0)
  ; GFX9-W64-NEXT: image_sample v2, v2, s[0:7], s[8:11] dmask:0x1
  ; GFX9-W64-NEXT: s_waitcnt vmcnt(0)
  ; GFX9-W64-NEXT: v_cvt_i32_f32_e32 v2, v2
  ; GFX9-W64-NEXT: ds_swizzle_b32 v2, v2 offset:swizzle(SWAP,2)
- ; GFX9-W64-NEXT: s_waitcnt lgkmcnt(0)
  ; GFX9-W64-NEXT: v_mov_b32_e32 v0, v2
  ; GFX9-W64-NEXT: v_cvt_f32_i32_e32 v0, v0
  ; GFX9-W64-NEXT: .LBB46_2: ; %ENDIF
  ; GFX9-W64-NEXT: s_or_b64 exec, exec, s[14:15]
  ; GFX9-W64-NEXT: s_and_b64 exec, exec, s[12:13]
  ; GFX9-W64-NEXT: ; return to shader part epilog
  ;
  ; GFX10-W32-LABEL: test_strict_wqm_within_wqm:
  ; GFX10-W32: ; %bb.0: ; %main_body
  ; GFX10-W32-NEXT: s_mov_b32 s12, exec_lo
  ; GFX10-W32-NEXT: s_wqm_b32 exec_lo, exec_lo
  ; GFX10-W32-NEXT: v_mov_b32_e32 v2, v0
  ; GFX10-W32-NEXT: v_mov_b32_e32 v0, 0
  ; GFX10-W32-NEXT: s_mov_b32 s13, exec_lo
  ; GFX10-W32-NEXT: v_cmpx_eq_u32_e32 0, v1
  ; GFX10-W32-NEXT: s_cbranch_execz .LBB46_2
  ; GFX10-W32-NEXT: ; %bb.1: ; %IF
  ; GFX10-W32-NEXT: image_sample v2, v2, s[0:7], s[8:11] dmask:0x1 dim:SQ_RSRC_IMG_1D
  ; GFX10-W32-NEXT: s_waitcnt vmcnt(0)
  ; GFX10-W32-NEXT: image_sample v2, v2, s[0:7], s[8:11] dmask:0x1 dim:SQ_RSRC_IMG_1D
  ; GFX10-W32-NEXT: s_waitcnt vmcnt(0)
  ; GFX10-W32-NEXT: v_cvt_i32_f32_e32 v2, v2
  ; GFX10-W32-NEXT: ds_swizzle_b32 v2, v2 offset:swizzle(SWAP,2)
- ; GFX10-W32-NEXT: s_waitcnt lgkmcnt(0)
  ; GFX10-W32-NEXT: v_mov_b32_e32 v0, v2
  ; GFX10-W32-NEXT: v_cvt_f32_i32_e32 v0, v0
  ; GFX10-W32-NEXT: .LBB46_2: ; %ENDIF
  ; GFX10-W32-NEXT: s_or_b32 exec_lo, exec_lo, s13
  ; GFX10-W32-NEXT: s_and_b32 exec_lo, exec_lo, s12
  ; GFX10-W32-NEXT: ; return to shader part epilog
  main_body:
  %cmp = icmp eq i32 %z, 0
[... 432 lines not shown ...]