This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Mark control flow intrinsics non-duplicable
ClosedPublic

Authored by ruiling on Jan 26 2022, 7:19 AM.

Details

Summary

This is used to help get simplified CFG for divergent regions as well as
get better code generation in some cases.

For example, with below IR:

define amdgpu_kernel void @test() {
bb:
  br label %bb1

bb1:
  %tmp = phi i32 [ 0, %bb ], [ %tmp5, %bb4 ]
  %tid = call i32 @llvm.amdgcn.workitem.id.x()
  %cnd = icmp eq i32 %tid, 0
  br i1 %cnd, label %bb4, label %bb2

bb2:
  %tmp3 = add nsw i32 %tmp, 1
  br label %bb4

bb4:
  %tmp5 = phi i32 [ %tmp3, %bb2 ], [ %tmp, %bb1 ]
  store volatile i32 %tmp5, ptr addrspace(1) undef
  br label %bb1
}

We got below assembly before the change:

  v_mov_b32_e32 v1, 0
  v_cmp_eq_u32_e32 vcc, 0, v0
  s_branch .LBB0_2
.LBB0_1:                                ; %bb4
                                        ;   in Loop: Header=BB0_2 Depth=1
  s_mov_b32 s2, -1
  s_mov_b32 s3, 0xf000
  buffer_store_dword v1, off, s[0:3], 0
  s_waitcnt vmcnt(0)
.LBB0_2:                                ; %bb 
                                        ; =>This Inner Loop Header: Depth=1
  s_and_saveexec_b64 s[0:1], vcc 
  s_xor_b64 s[0:1], exec, s[0:1]
                                        ; kill: def $sgpr0_sgpr1 killed $sgpr0_sgpr1 killed $exec
  s_cbranch_execnz .LBB0_1
; %bb.3:                                ; %bb2
                                        ;   in Loop: Header=BB0_2 Depth=1
  s_or_b64 exec, exec, s[0:1]
  s_waitcnt expcnt(0)
  v_add_i32_e64 v1, s[0:1], 1, v1
  s_branch .LBB0_1

After the change:

  s_mov_b32 s0, 0
  v_cmp_eq_u32_e32 vcc, 0, v0
  s_mov_b32 s2, -1
  s_mov_b32 s3, 0xf000
  v_mov_b32_e32 v0, s0
  s_branch .LBB0_2
.LBB0_1:                                ; %bb4
                                        ;   in Loop: Header=BB0_2 Depth=1
  buffer_store_dword v0, off, s[0:3], 0
  s_waitcnt vmcnt(0)
.LBB0_2:                                ; %bb1
                                        ; =>This Inner Loop Header: Depth=1
  s_and_saveexec_b64 s[0:1], vcc 
  s_cbranch_execnz .LBB0_1
; %bb.3:                                ; %bb2
                                        ;   in Loop: Header=BB0_2 Depth=1
  s_or_b64 exec, exec, s[0:1]
  s_waitcnt expcnt(0)
  v_add_i32_e64 v0, s[0:1], 1, v0
  s_branch .LBB0_1

We are using one less VGPR, one less s_xor_, and better LICM with one
additional branch after the change. Please note the experiment
was done with reverting the workaround D139780, as it will stop the
tail-duplication completely for this case.

Diff Detail

Event Timeline

ruiling created this revision.Jan 26 2022, 7:19 AM
ruiling requested review of this revision.Jan 26 2022, 7:19 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 26 2022, 7:19 AM
foad added a comment.Jan 26 2022, 7:23 AM

Is this an alternative to D117796?

arsenm requested changes to this revision.Jan 26 2022, 7:25 AM

I don't think this is a good idea. We don't actually need a structured CFG at this point, and tail duplicating isn't exactly unstructuring anyway. This is not an alternative to fixing the LiveVariables update problem, it's just the testcase that broke happened to have appeared due to tail duplication

This revision now requires changes to proceed.Jan 26 2022, 7:25 AM

Is this an alternative to D117796?

No, this is not trying to fix that problem. That patch should still be needed for block split when processing PHI introduced by LCSSA.

I don't think this is a good idea. We don't actually need a structured CFG at this point, and tail duplicating isn't exactly unstructuring anyway. This is not an alternative to fixing the LiveVariables update problem, it's just the testcase that broke happened to have appeared due to tail duplication

Tail duplicating divergent branching is also pretty bad idea. Can you prove ANY benefit by duplicating divergent branching?

I don't think this is a good idea. We don't actually need a structured CFG at this point, and tail duplicating isn't exactly unstructuring anyway. This is not an alternative to fixing the LiveVariables update problem, it's just the testcase that broke happened to have appeared due to tail duplication

Tail duplicating divergent branching is also pretty bad idea. Can you prove ANY benefit by duplicating divergent branching?

Structurization is an aid to insert the exec mask manipulation instructions, and after that point we no longer care about preserving it. SI_IF isn't really a branch, although we insert a skip exec branch just in case it's necessary and is logically just bit manipulation. It's the ugly glue we use between the real CFG and the divergent CFG.

I don't think this is a good idea. We don't actually need a structured CFG at this point, and tail duplicating isn't exactly unstructuring anyway. This is not an alternative to fixing the LiveVariables update problem, it's just the testcase that broke happened to have appeared due to tail duplication

Tail duplicating divergent branching is also pretty bad idea. Can you prove ANY benefit by duplicating divergent branching?

It's probably not a good idea, but we don't really require this property. In the testcase I was looking at, it nets adding one additional instruction (probably because it ends up confusing the optimize exec passes). I guess given how infrequently this probably occurs, this is OK. I think it would be an interesting experiment to run benchmarks with and without this with global-isel (although we probably need to have correct regbankselect first)

arsenm accepted this revision.Mar 28 2022, 3:42 PM

I think we don't need to do this, but it should in general result in a net increase in the number of instructions. Can you add a comment explaining that this is not a hard requirement to not duplicate?

This revision is now accepted and ready to land.Mar 28 2022, 3:42 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 28 2022, 3:42 PM
Herald added a subscriber: hsmhsm. · View Herald Transcript
arsenm added a comment.Jun 3 2022, 6:29 AM

Reverse ping

ruiling updated this revision to Diff 494168.Feb 1 2023, 10:21 PM

rebase with more comment

ruiling edited the summary of this revision. (Show Details)Feb 1 2023, 10:25 PM

I revisit the problem again. I agree we are allowed to duplicate such intrinsic based on the fact that the intrinsic lower has already support their usage in unstructured CFG. but the case shown that duplicating such intrinsics might not be helpful for generating better code. By making them non-duplicable, we would get simplified usage of control flow intrinsics. Some testing on large set of graphics shaders, the change causes no code generation differences.

arsenm accepted this revision.Feb 2 2023, 2:58 AM

LGTM with additional comments

llvm/lib/Target/AMDGPU/SIInstructions.td
345

Should add a comment that this is not a hard requirement

353

Should add a comment that this is not a hard requirement

This revision was landed with ongoing or failed builds.Feb 5 2023, 11:34 PM
This revision was automatically updated to reflect the committed changes.