This helps produce a simplified CFG for divergent regions and, in some cases, better code generation.
For example, given the IR below:
define amdgpu_kernel void @test() {
bb:
  br label %bb1

bb1:
  %tmp = phi i32 [ 0, %bb ], [ %tmp5, %bb4 ]
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %cnd = icmp eq i32 %tid, 0
  br i1 %cnd, label %bb4, label %bb2

bb2:
  %tmp3 = add nsw i32 %tmp, 1
  br label %bb4

bb4:
  %tmp5 = phi i32 [ %tmp3, %bb2 ], [ %tmp, %bb1 ]
  store volatile i32 %tmp5, ptr addrspace(1) undef
  br label %bb1
}

; Declared so the module parses standalone.
declare i32 @llvm.amdgcn.workitem.id.x()
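For reference, the before/after assembly can be regenerated with an llc invocation along these lines, assuming the IR above is saved as test.ll (the target CPU shown is an assumption based on the buffer_store sequence in the output; the exact command used for the experiment is not recorded here):

  llc -mtriple=amdgcn -mcpu=tahiti -O3 test.ll -o -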
We got the following assembly before the change:
  v_mov_b32_e32 v1, 0
  v_cmp_eq_u32_e32 vcc, 0, v0
  s_branch .LBB0_2
.LBB0_1:                                ; %bb4
                                        ;   in Loop: Header=BB0_2 Depth=1
  s_mov_b32 s2, -1
  s_mov_b32 s3, 0xf000
  buffer_store_dword v1, off, s[0:3], 0
  s_waitcnt vmcnt(0)
.LBB0_2:                                ; %bb
                                        ; =>This Inner Loop Header: Depth=1
  s_and_saveexec_b64 s[0:1], vcc
  s_xor_b64 s[0:1], exec, s[0:1]
  ; kill: def $sgpr0_sgpr1 killed $sgpr0_sgpr1 killed $exec
  s_cbranch_execnz .LBB0_1
; %bb.3:                                ; %bb2
                                        ;   in Loop: Header=BB0_2 Depth=1
  s_or_b64 exec, exec, s[0:1]
  s_waitcnt expcnt(0)
  v_add_i32_e64 v1, s[0:1], 1, v1
  s_branch .LBB0_1
After the change:
  s_mov_b32 s0, 0
  v_cmp_eq_u32_e32 vcc, 0, v0
  s_mov_b32 s2, -1
  s_mov_b32 s3, 0xf000
  v_mov_b32_e32 v0, s0
  s_branch .LBB0_2
.LBB0_1:                                ; %bb4
                                        ;   in Loop: Header=BB0_2 Depth=1
  buffer_store_dword v0, off, s[0:3], 0
  s_waitcnt vmcnt(0)
.LBB0_2:                                ; %bb1
                                        ; =>This Inner Loop Header: Depth=1
  s_and_saveexec_b64 s[0:1], vcc
  s_cbranch_execnz .LBB0_1
; %bb.3:                                ; %bb2
                                        ;   in Loop: Header=BB0_2 Depth=1
  s_or_b64 exec, exec, s[0:1]
  s_waitcnt expcnt(0)
  v_add_i32_e64 v0, s[0:1], 1, v0
  s_branch .LBB0_1
After the change we use one fewer VGPR and one fewer s_xor_b64, and get better
LICM, at the cost of one additional branch. Please note the experiment
was done with the workaround D139780 reverted, as that workaround stops
tail duplication completely for this case.
TODO: add a comment in the code noting that this is not a hard requirement.