This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Mark control flow intrinsics non-duplicable
Closed, Public

Authored by ruiling on Jan 26 2022, 7:19 AM.

Details

Reviewers

arsenm
foad
critson
nhaehnle

Commits

rGbe3f4591aff0: AMDGPU: Mark control flow intrinsics non-duplicable

Summary

This helps produce a simpler CFG for divergent regions and gives
better code generation in some cases.

For example, with the IR below:

define amdgpu_kernel void @test() {
bb:
  br label %bb1

bb1:
  %tmp = phi i32 [ 0, %bb ], [ %tmp5, %bb4 ]
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %cnd = icmp eq i32 %tid, 0
  br i1 %cnd, label %bb4, label %bb2

bb2:
  %tmp3 = add nsw i32 %tmp, 1
  br label %bb4

bb4:
  %tmp5 = phi i32 [ %tmp3, %bb2 ], [ %tmp, %bb1 ]
  store volatile i32 %tmp5, ptr addrspace(1) undef
  br label %bb1
}

Before the change, we got the following assembly:

  v_mov_b32_e32 v1, 0
  v_cmp_eq_u32_e32 vcc, 0, v0
  s_branch .LBB0_2
.LBB0_1:                                ; %bb4
                                        ;   in Loop: Header=BB0_2 Depth=1
  s_mov_b32 s2, -1
  s_mov_b32 s3, 0xf000
  buffer_store_dword v1, off, s[0:3], 0
  s_waitcnt vmcnt(0)
.LBB0_2:                                ; %bb1
                                        ; =>This Inner Loop Header: Depth=1
  s_and_saveexec_b64 s[0:1], vcc
  s_xor_b64 s[0:1], exec, s[0:1]
                                        ; kill: def $sgpr0_sgpr1 killed $sgpr0_sgpr1 killed $exec
  s_cbranch_execnz .LBB0_1
; %bb.3:                                ; %bb2
                                        ;   in Loop: Header=BB0_2 Depth=1
  s_or_b64 exec, exec, s[0:1]
  s_waitcnt expcnt(0)
  v_add_i32_e64 v1, s[0:1], 1, v1
  s_branch .LBB0_1

After the change:

  s_mov_b32 s0, 0
  v_cmp_eq_u32_e32 vcc, 0, v0
  s_mov_b32 s2, -1
  s_mov_b32 s3, 0xf000
  v_mov_b32_e32 v0, s0
  s_branch .LBB0_2
.LBB0_1:                                ; %bb4
                                        ;   in Loop: Header=BB0_2 Depth=1
  buffer_store_dword v0, off, s[0:3], 0
  s_waitcnt vmcnt(0)
.LBB0_2:                                ; %bb1
                                        ; =>This Inner Loop Header: Depth=1
  s_and_saveexec_b64 s[0:1], vcc 
  s_cbranch_execnz .LBB0_1
; %bb.3:                                ; %bb2
                                        ;   in Loop: Header=BB0_2 Depth=1
  s_or_b64 exec, exec, s[0:1]
  s_waitcnt expcnt(0)
  v_add_i32_e64 v0, s[0:1], 1, v0
  s_branch .LBB0_1

After the change we use one fewer VGPR and one fewer s_xor_b64, and get
better LICM, with one additional branch. Note that the experiment
was done with the workaround D139780 reverted, because that workaround
stops tail duplication completely for this case.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ruiling created this revision. Jan 26 2022, 7:19 AM

Herald added subscribers: kerbowa, hiraditya, t-tye and 5 others. Jan 26 2022, 7:19 AM

ruiling requested review of this revision. Jan 26 2022, 7:19 AM

Herald added a project: Restricted Project. Jan 26 2022, 7:19 AM

Herald added subscribers: llvm-commits, wdng.

Is this an alternative to D117796?

I don't think this is a good idea. We don't actually need a structured CFG at this point, and tail duplicating isn't exactly unstructuring anyway. This is not an alternative to fixing the LiveVariables update problem; the testcase that broke just happened to appear because of tail duplication.

This revision now requires changes to proceed. Jan 26 2022, 7:25 AM

In D118250#3272528, @foad wrote:

Is this an alternative to D117796?

No, this is not trying to fix that problem. That patch should still be needed for block split when processing PHI introduced by LCSSA.

In D118250#3272530, @arsenm wrote:

I don't think this is a good idea. We don't actually need a structured CFG at this point, and tail duplicating isn't exactly unstructuring anyway. This is not an alternative to fixing the LiveVariables update problem; the testcase that broke just happened to appear because of tail duplication.

Tail duplicating divergent branching is also a pretty bad idea. Can you prove ANY benefit from duplicating divergent branching?

In D118250#3272547, @ruiling wrote:

In D118250#3272530, @arsenm wrote:

I don't think this is a good idea. We don't actually need a structured CFG at this point, and tail duplicating isn't exactly unstructuring anyway. This is not an alternative to fixing the LiveVariables update problem; the testcase that broke just happened to appear because of tail duplication.

Tail duplicating divergent branching is also a pretty bad idea. Can you prove ANY benefit from duplicating divergent branching?

Structurization is an aid for inserting the exec mask manipulation instructions, and after that point we no longer care about preserving it. SI_IF isn't really a branch: it is logically just bit manipulation, although we insert a skip-exec branch in case it's necessary. It's the ugly glue we use between the real CFG and the divergent CFG.
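The "just bit manipulation" point can be illustrated with a small sketch. It models the exec mask as an integer and mirrors the s_and_saveexec_b64 / s_xor_b64 / s_or_b64 sequence that appears in the "before" assembly in the summary. The helper names si_if and si_end_cf are hypothetical stand-ins chosen for this illustration and are not LLVM APIs; the 4-lane masks in the usage lines are only for readability.

```python
# Toy model: exec is a 64-bit lane mask held in a Python int.
# The helper names below are hypothetical; they only mimic the scalar
# instructions seen in the summary's assembly listing.
MASK64 = (1 << 64) - 1


def s_and_saveexec_b64(exec_mask, cond):
    """Return (saved, new_exec): save the old exec, then exec = cond & exec."""
    saved = exec_mask & MASK64
    new_exec = cond & exec_mask & MASK64
    return saved, new_exec


def si_if(exec_mask, cond):
    """Enter the 'then' region; saved holds the lanes that skip it."""
    saved, new_exec = s_and_saveexec_b64(exec_mask, cond)
    saved = (new_exec ^ saved) & MASK64  # s_xor_b64 saved, exec, saved
    return saved, new_exec


def si_end_cf(exec_mask, saved):
    """Rejoin: re-enable the lanes that were masked off (s_or_b64)."""
    return (exec_mask | saved) & MASK64


saved, active = si_if(0b1111, 0b0101)
restored = si_end_cf(active, saved)
print(bin(saved), bin(active), bin(restored))
```

No control transfer is modelled here: every step is a mask update, which is why the only real branch in the generated code is the optional skip over a region with no active lanes.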

In D118250#3272547, @ruiling wrote:

In D118250#3272530, @arsenm wrote:

I don't think this is a good idea. We don't actually need a structured CFG at this point, and tail duplicating isn't exactly unstructuring anyway. This is not an alternative to fixing the LiveVariables update problem; the testcase that broke just happened to appear because of tail duplication.

Tail duplicating divergent branching is also a pretty bad idea. Can you prove ANY benefit from duplicating divergent branching?

It's probably not a good idea, but we don't really require this property. In the testcase I was looking at, it nets adding one additional instruction (probably because it ends up confusing the optimize exec passes). I guess given how infrequently this probably occurs, this is OK. I think it would be an interesting experiment to run benchmarks with and without this with global-isel (although we probably need to have correct regbankselect first)

Harbormaster completed remote builds in B145742: Diff 403261. Jan 27 2022, 3:14 AM

I think we don't need to do this, but it should in general result in a net increase in the number of instructions. Can you add a comment explaining that this is not a hard requirement to not duplicate?

This revision is now accepted and ready to land. Mar 28 2022, 3:42 PM

Herald added a project: Restricted Project. Mar 28 2022, 3:42 PM

Herald added a subscriber: hsmhsm.

Reverse ping

Herald added a subscriber: kosarev. Jun 3 2022, 6:29 AM

rebase with more comment

Herald added a subscriber: StephenFan. Feb 1 2023, 10:21 PM

ruiling edited the summary of this revision. Feb 1 2023, 10:25 PM

I revisited the problem. I agree that we are allowed to duplicate such intrinsics, since the intrinsic lowering already supports their use in an unstructured CFG. But the case shown suggests that duplicating them might not help generate better code. By making them non-duplicable, we get simpler usage of the control flow intrinsics. In testing on a large set of graphics shaders, the change caused no code generation differences.

Harbormaster completed remote builds in B211387: Diff 494168. Feb 1 2023, 11:35 PM
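As a rough illustration of what a non-duplicable flag does to a tail-duplication decision, here is a toy model. It is not LLVM's TailDuplicator: the Instr and Block classes, the should_tail_duplicate helper, and the size threshold are all hypothetical names and values invented for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instr:
    opcode: str
    not_duplicable: bool = False  # stands in for an isNotDuplicable-style flag


@dataclass
class Block:
    name: str
    instrs: List[Instr] = field(default_factory=list)


def should_tail_duplicate(block: Block, max_size: int = 4) -> bool:
    """Toy policy: only small blocks with no flagged instruction qualify."""
    if len(block.instrs) > max_size:
        return False
    return not any(i.not_duplicable for i in block.instrs)


# A block that starts with a control flow pseudo, with and without the flag.
flagged = Block("bb.3", [Instr("SI_END_CF", not_duplicable=True),
                         Instr("V_ADD_CO_U32_e64")])
unflagged = Block("bb.3", [Instr("SI_END_CF"),
                           Instr("V_ADD_CO_U32_e64")])
print(should_tail_duplicate(flagged), should_tail_duplicate(unflagged))
```

In this model the flag is a pure policy switch: it blocks the transformation without saying anything about whether duplication would have been correct, which matches the reviewers' point that this is not a hard requirement.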

LGTM with additional comments

llvm/lib/Target/AMDGPU/SIInstructions.td
345	Should add a comment that this is not a hard requirement
353	Should add a comment that this is not a hard requirement

This revision was landed with ongoing or failed builds. Feb 5 2023, 11:34 PM

Closed by commit rGbe3f4591aff0: AMDGPU: Mark control flow intrinsics non-duplicable (authored by ruiling).

This revision was automatically updated to reflect the committed changes.

ruiling added a commit: rGbe3f4591aff0: AMDGPU: Mark control flow intrinsics non-duplicable.

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/SIInstructions.td	4 lines
llvm/test/CodeGen/AMDGPU/stop-tail-duplicate-cfg-intrinsic.mir	77 lines

Diff 403261

llvm/lib/Target/AMDGPU/SIInstructions.td

(286 lines not shown; enclosing context: def WAVE_BARRIER : SPseudoInstSI<(outs), (ins),)
   let isConvergent = 1;
   let FixedSize = 1;
   let Size = 0;
 }

 // SI pseudo instructions. These are used by the CFG structurizer pass
 // and should be lowered to ISA instructions prior to codegen.

-let isTerminator = 1 in {
+let isTerminator = 1, isNotDuplicable = 1 in {

 let OtherPredicates = [EnableLateCFGStructurize] in {
 def SI_NON_UNIFORM_BRCOND_PSEUDO : CFPseudoInstSI <
   (outs),
   (ins SReg_1:$vcc, brtarget:$target),
   [(brcond i1:$vcc, bb:$target)]> {
   let Size = 12;
 }
(33 lines not shown)
 } // End isTerminator = 1

 def SI_END_CF : CFPseudoInstSI <
   (outs), (ins SReg_1:$saved), [], 1, 1> {
   let Size = 4;
   let isAsCheapAsAMove = 1;
   let isReMaterializable = 1;
   let hasSideEffects = 1;
+  let isNotDuplicable = 1;
     [inline comment] arsenm: Should add a comment that this is not a hard requirement
   let mayLoad = 1; // FIXME: Should not need memory flags
   let mayStore = 1;
 }

 def SI_IF_BREAK : CFPseudoInstSI <
   (outs SReg_1:$dst), (ins SReg_1:$vcc, SReg_1:$src), []> {
   let Size = 4;
+  let isNotDuplicable = 1;
     [inline comment] arsenm: Should add a comment that this is not a hard requirement
   let isAsCheapAsAMove = 1;
   let isReMaterializable = 1;
 }

 // Branch to the early termination block of the shader if SCC is 0.
 // This uses SCC from a previous SALU operation, i.e. the update of
 // a mask of live lanes after a kill/demote operation.
 // Only valid in pixel shaders.
(2,733 lines not shown)

llvm/test/CodeGen/AMDGPU/stop-tail-duplicate-cfg-intrinsic.mir

This file was added.

				# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -mtriple=amdgcn-amd-amdhsa -run-pass=early-tailduplication -verify-machineinstrs -o - %s | FileCheck %s

				# We used to duplicate blocks containing control flow intrinsics. We should stop
				# doing this, as we get no benefit and only make the CFG irreducible.

				---
				name: stop_duplicate_cfg_intrinsic
				tracksRegLiveness: true
				body: |
				; CHECK-LABEL: name: stop_duplicate_cfg_intrinsic
				; CHECK: bb.0:
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: liveins: $vgpr0
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: [[COPY:%[0-9]+]]:vgpr_32 = COPY $vgpr0
				; CHECK-NEXT: [[DEF:%[0-9]+]]:sreg_64 = IMPLICIT_DEF
				; CHECK-NEXT: [[S_MOV_B32_:%[0-9]+]]:sreg_32 = S_MOV_B32 0
				; CHECK-NEXT: [[COPY1:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_]]
				; CHECK-NEXT: [[V_CMP_EQ_U32_e64_:%[0-9]+]]:sreg_64_xexec = V_CMP_EQ_U32_e64 [[COPY]], [[COPY1]], implicit $exec
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: bb.1:
				; CHECK-NEXT: successors: %bb.2(0x40000000), %bb.3(0x40000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: [[PHI:%[0-9]+]]:vgpr_32 = PHI [[S_MOV_B32_]], %bb.0, %6, %bb.3
				; CHECK-NEXT: [[SI_IF:%[0-9]+]]:sreg_64_xexec = SI_IF [[V_CMP_EQ_U32_e64_]], %bb.2, implicit-def $exec, implicit-def $scc, implicit $exec
				; CHECK-NEXT: S_BRANCH %bb.3
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: bb.2:
				; CHECK-NEXT: successors: %bb.3(0x80000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: SI_END_CF [[SI_IF]], implicit-def $exec, implicit-def $scc, implicit $exec
				; CHECK-NEXT: [[S_MOV_B32_1:%[0-9]+]]:sreg_32 = S_MOV_B32 1
				; CHECK-NEXT: [[COPY2:%[0-9]+]]:vgpr_32 = COPY [[S_MOV_B32_1]]
				; CHECK-NEXT: %10:vgpr_32, dead %11:sreg_64_xexec = V_ADD_CO_U32_e64 [[PHI]], [[COPY2]], 0, implicit $exec
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: bb.3:
				; CHECK-NEXT: successors: %bb.1(0x80000000)
				; CHECK-NEXT: {{ $}}
				; CHECK-NEXT: [[PHI1:%[0-9]+]]:vgpr_32 = PHI %10, %bb.2, [[PHI]], %bb.1
				; CHECK-NEXT: [[S_MOV_B32_2:%[0-9]+]]:sreg_32 = S_MOV_B32 4294967295
				; CHECK-NEXT: [[S_MOV_B32_3:%[0-9]+]]:sreg_32 = S_MOV_B32 61440
				; CHECK-NEXT: [[REG_SEQUENCE:%[0-9]+]]:sreg_64 = REG_SEQUENCE [[S_MOV_B32_2]], %subreg.sub0, [[S_MOV_B32_3]], %subreg.sub1
				; CHECK-NEXT: [[REG_SEQUENCE1:%[0-9]+]]:sgpr_128 = REG_SEQUENCE [[DEF]], %subreg.sub0_sub1, [[REG_SEQUENCE]], %subreg.sub2_sub3
				; CHECK-NEXT: BUFFER_STORE_DWORD_OFFSET [[PHI1]], [[REG_SEQUENCE1]], 0, 0, 0, 0, 0, implicit $exec :: (volatile store (s32) into `i32 addrspace(1)* undef`, addrspace 1)
				; CHECK-NEXT: S_BRANCH %bb.1

				bb.1:
				liveins: $vgpr0

				%0:vgpr_32 = COPY $vgpr0
				%12:sreg_64 = IMPLICIT_DEF
				%4:sreg_32 = S_MOV_B32 0
				%14:vgpr_32 = COPY %4:sreg_32
				%5:sreg_64_xexec = V_CMP_EQ_U32_e64 %0:vgpr_32, %14:vgpr_32, implicit $exec

				bb.2:
				%6:vgpr_32 = PHI %4:sreg_32, %bb.1, %11:vgpr_32, %bb.4
				%8:sreg_64_xexec = SI_IF %5:sreg_64_xexec, %bb.3, implicit-def $exec, implicit-def $scc, implicit $exec
				S_BRANCH %bb.4

				bb.3:
				SI_END_CF %8:sreg_64_xexec, implicit-def $exec, implicit-def $scc, implicit $exec
				%13:sreg_32 = S_MOV_B32 1
				%15:vgpr_32 = COPY %13:sreg_32
				%10:vgpr_32, dead %20:sreg_64_xexec = V_ADD_CO_U32_e64 %6:vgpr_32, %15:vgpr_32, 0, implicit $exec

				bb.4:
				%11:vgpr_32 = PHI %10:vgpr_32, %bb.3, %6:vgpr_32, %bb.2
				%16:sreg_32 = S_MOV_B32 4294967295
				%17:sreg_32 = S_MOV_B32 61440
				%18:sreg_64 = REG_SEQUENCE %16:sreg_32, %subreg.sub0, %17:sreg_32, %subreg.sub1
				%19:sgpr_128 = REG_SEQUENCE %12:sreg_64, %subreg.sub0_sub1, %18:sreg_64, %subreg.sub2_sub3
				BUFFER_STORE_DWORD_OFFSET %11:vgpr_32, %19:sgpr_128, 0, 0, 0, 0, 0, implicit $exec :: (volatile store (s32) into `i32 addrspace(1)* undef`, addrspace 1)
				S_BRANCH %bb.2

				...