This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU : Fix common dominator of two incoming blocks terminates with uniform branch issue.
ClosedPublic

Authored by wdng on Mar 24 2017, 1:08 PM.

Details

Summary

We don't need to fix the PHI if the common dominator of the two incoming blocks terminates with a uniform branch. But looks like the condition is not strong enough to find two blocks with a uniform branch, which bring loop iteration order regression.

MI:
  %vreg5<def> = PHI %vreg125, <BB#1>, %vreg12, <BB#3>; SReg_64:%vreg5,%vreg125,%vreg12
  
MBB0:
BB#1: derived from LLVM BB %for.body.preheader
    Predecessors according to CFG: BB#0
	%vreg126<def> = S_MOV_B32 0; SReg_32_XM0:%vreg126
	%vreg125<def> = S_MOV_B64 0; SReg_64:%vreg125
	S_BRANCH <BB#3>
    Successors according to CFG: BB#3(?%)

MBB1:
BB#3: derived from LLVM BB %for.body
    Predecessors according to CFG: BB#1 BB#3
	%vreg5<def> = PHI %vreg125, <BB#1>, %vreg12, <BB#3>; SReg_64:%vreg5,%vreg125,%vreg12
	%vreg6<def> = PHI %vreg1, <BB#1>, %vreg9, <BB#3>; VGPR_32:%vreg6,%vreg1,%vreg9
	%vreg7<def> = PHI %vreg126, <BB#1>, %vreg11, <BB#3>; SReg_32_XM0:%vreg7,%vreg126,%vreg11
	%vreg8<def> = PHI %vreg2, <BB#1>, %vreg10, <BB#3>; VGPR_32:%vreg8,%vreg2,%vreg10
	%vreg127<def,tied6> = V_MAC_F32_e64 0, %vreg6, 0, %vreg6, 0, %vreg1<tied0>, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg127,%vreg6,%vreg6,%vreg1
	%vreg9<def> = V_MAD_F32 1, %vreg8, 0, %vreg8, 0, %vreg127<kill>, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg9,%vreg8,%vreg8,%vreg127
	%vreg128<def> = V_ADD_F32_e64 0, %vreg6, 0, %vreg6, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg128,%vreg6,%vreg6
	%vreg10<def,tied6> = V_MAC_F32_e64 0, %vreg128<kill>, 0, %vreg8, 0, %vreg2<tied0>, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg10,%vreg128,%vreg8,%vreg2
	%vreg129<def> = S_MOV_B32 1; SReg_32_XM0:%vreg129
	%vreg11<def> = S_ADD_I32 %vreg7, %vreg129<kill>, %SCC<imp-def,dead>; SReg_32_XM0:%vreg11,%vreg7,%vreg129
	%vreg130<def> = V_MUL_F32_e64 0, %vreg10, 0, %vreg10, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg130,%vreg10,%vreg10
	%vreg131<def,tied6> = V_MAC_F32_e64 0, %vreg9, 0, %vreg9, 0, %vreg130<tied0>, 0, 0, %EXEC<imp-use>; VGPR_32:%vreg131,%vreg9,%vreg9,%vreg130
	%vreg132<def> = V_MOV_B32_e32 1082130432, %EXEC<imp-use>; VGPR_32:%vreg132
	%vreg133<def> = V_CMP_NLE_F32_e64 0, %vreg131<kill>, 0, %vreg132<kill>, 0, 0, %EXEC<imp-use>; SReg_64:%vreg133 VGPR_32:%vreg131,%vreg132
	%vreg135<def> = COPY %vreg21; VGPR_32:%vreg135 SReg_32_XM0:%vreg21
	%vreg134<def> = V_CMP_GE_U32_e64 %vreg11, %vreg135, %EXEC<imp-use>; SReg_64:%vreg134 SReg_32_XM0:%vreg11 VGPR_32:%vreg135
	%vreg136<def> = S_OR_B64 %vreg134<kill>, %vreg133<kill>, %SCC<imp-def,dead>; SReg_64:%vreg136,%vreg134,%vreg133
	%vreg12<def> = SI_IF_BREAK %vreg136<kill>, %vreg5, %SCC<imp-def,dead>; SReg_64:%vreg12,%vreg136,%vreg5
	SI_LOOP %vreg12, <BB#3>, %EXEC<imp-def,dead>, %SCC<imp-def,dead>, %EXEC<imp-use>; SReg_64:%vreg12
	S_BRANCH <BB#4>
    Successors according to CFG: BB#4(0x04000000 / 0x80000000 = 3.12%) BB#3(0x7c000000 / 0x80000000 = 96.88%)

NCD: 
BB#1: derived from LLVM BB %for.body.preheader
    Predecessors according to CFG: BB#0
	%vreg126<def> = S_MOV_B32 0; SReg_32_XM0:%vreg126
	%vreg125<def> = S_MOV_B64 0; SReg_64:%vreg125
	S_BRANCH <BB#3>
    Successors according to CFG: BB#3(?%)

Diff Detail

Repository
rL LLVM

Event Timeline

wdng created this revision.Mar 24 2017, 1:08 PM
wdng edited the summary of this revision. (Show Details)Mar 24 2017, 1:10 PM
wdng edited the summary of this revision. (Show Details)
wdng edited reviewers, added: t-tye; removed: tony-tye.Mar 24 2017, 1:18 PM
arsenm edited edge metadata.Mar 24 2017, 1:51 PM
arsenm added a subscriber: llvm-commits.

This needs a test and is not correct. The direct predecessors could themselves be divergent blocks. You need to identify whether this phi is in a divergent region

wdng updated this revision to Diff 93360.Mar 29 2017, 5:52 AM

Add conditions to identify whether this phi is in a divergent region.

wdng edited the summary of this revision. (Show Details)Mar 30 2017, 4:06 PM
wdng added a reviewer: rampitec.
wdng updated this revision to Diff 93563.Mar 30 2017, 4:15 PM

Completely remove weak if condition for checking whether two blocks with a uniform branch, current implementation uses divergence analysis to detect whether two blocks with a uniform branch. If all conditions on all paths leading to a block are uniform, block is uniform.

rampitec edited edge metadata.Mar 30 2017, 6:22 PM
In D31350#714741, @wdng wrote:

Completely remove weak if condition for checking whether two blocks with a uniform branch, current implementation uses divergence analysis to detect whether two blocks with a uniform branch. If all conditions on all paths leading to a block are uniform, block is uniform.

I do not see divergence analysis used. I see that you are checking for PHI to be a constant, which is way stronger than just uniform.

lib/Target/AMDGPU/SIFixSGPRCopies.cpp
393 ↗(On Diff #93563)

Please use nullptr instead of NULL.

arsenm added inline comments.Mar 30 2017, 6:34 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
393 ↗(On Diff #93563)

if (IPostDom)

395 ↗(On Diff #93563)

This isn't going to do anything. PHINodes are IR constructs

This is still approaching this from the wrong direction. You need to move all SGPR phis in the entire region unless the inputs are from the control flow intrinsics. There also needs to be a test

wdng updated this revision to Diff 93668.Mar 31 2017, 9:29 AM

Based on Matt' comments, check conditions whether inputs are from the control flow intrinsics.

wdng added a comment.Mar 31 2017, 9:31 AM

Tests will be added later once we decide to use this approach. Thanks!

rampitec added inline comments.Apr 3 2017, 12:06 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
393 ↗(On Diff #93668)

I do not understand this. If we speak about %vreg5 from the code in the description, it is not defined by the post dominator. Its definitions are in BB#1 and BB#3 and both needs to be checked.

wdng updated this revision to Diff 94425.Apr 6 2017, 1:00 PM

Address code reviews: check terminators for all predecessors.

wdng edited the summary of this revision. (Show Details)Apr 6 2017, 1:01 PM
rampitec added inline comments.Apr 6 2017, 1:13 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
331 ↗(On Diff #94425)

You need recursive search to do it.

334 ↗(On Diff #94425)

Weird brace indention.

wdng added inline comments.Apr 6 2017, 1:22 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
331 ↗(On Diff #94425)

http://llvm.org/docs/ProgrammersManual.html#iterating-over-predecessors-successors-of-blocks, looks like this has iterated over of all predecessors of a BB, right?

rampitec added inline comments.Apr 6 2017, 1:23 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
331 ↗(On Diff #94425)

All immediate predecessors.

wdng updated this revision to Diff 94442.Apr 6 2017, 2:49 PM

Use BFS to search for all intermediate predecessors to check whether it has terminators.

wdng marked 5 inline comments as done.Apr 6 2017, 2:49 PM
rampitec added inline comments.Apr 6 2017, 2:53 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗(On Diff #94442)

You do not need two queues. In fact it is usually done with a SmallVector.

wdng added inline comments.Apr 6 2017, 3:02 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗(On Diff #94442)

BFS can be implemented in different ways, either using one queue or two. Are there any benefits to use SmallVector? Thanks!

rampitec added inline comments.Apr 6 2017, 3:05 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗(On Diff #94442)

You definitely do not need two. Then llvm preferrs to use its own containers over std.

wdng added inline comments.Apr 6 2017, 4:09 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗(On Diff #94442)

I agree to use LLVM's own container, but implementing BFS using vectors is not a good data structure choice although it's doable.

arsenm added inline comments.Apr 6 2017, 4:41 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
335 ↗(On Diff #94442)

There needs to be a set because there can be loops

test/CodeGen/AMDGPU/sgprcopies.ll
1 ↗(On Diff #94442)

s/SI GCN.

Also needs -verify-machineinstrs

3 ↗(On Diff #94442)

Should have better name and secretion of what this is really checking

12 ↗(On Diff #94442)

Should have instnamer run on this and the test could be simplified a bit

wdng updated this revision to Diff 94640.EditedApr 10 2017, 12:13 AM
  1. Add hash set to avoid processing cycles in CFG.
  2. This LIT test is based on changes of ocltst, renaming lit test function name.
wdng updated this revision to Diff 94644.Apr 10 2017, 12:19 AM

Upload correct diff.

wdng updated this revision to Diff 94735.EditedApr 10 2017, 1:47 PM

Use LLVM provided container for implementation.

rampitec added inline comments.Apr 10 2017, 5:28 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗(On Diff #94735)

Usually it is called Visited.

334 ↗(On Diff #94735)

And this is usually Worklist.

337 ↗(On Diff #94735)

Not needed.

339 ↗(On Diff #94735)

SmallVector<MachineBasicBlock*, 4> Worklist(MBB->pred_begin(), MBB->pred_end());
while (!Worklist.empty())

341 ↗(On Diff #94735)

pop_back_val.

343 ↗(On Diff #94735)

if (Visited.insert(mbb).second) continue;

348 ↗(On Diff #94735)

pred, not preds. Anyway, Worklist.insert(Worklist.end(), mbb->pred_begin(), mbb->pred_end()).

wdng updated this revision to Diff 94780.Apr 10 2017, 10:27 PM
wdng marked 2 inline comments as done.

Code changes based on code reviews.

wdng marked 6 inline comments as done.Apr 10 2017, 10:27 PM
rampitec added inline comments.Apr 11 2017, 10:36 AM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
344 ↗(On Diff #94780)

If you use Visited.insert(mbb).second it is one search in the set instead of two.

wdng added inline comments.Apr 11 2017, 11:34 AM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
344 ↗(On Diff #94780)

Looks like this is not correct. We need to check whether this node has visited before, then add into Visited if not found instead of inserting first and then search, correct?

rampitec added inline comments.Apr 11 2017, 11:38 AM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
344 ↗(On Diff #94780)
wdng updated this revision to Diff 94873.Apr 11 2017, 12:25 PM

Address code reviews. Thanks Stas!

wdng updated this revision to Diff 94909.Apr 11 2017, 5:10 PM

Fix other 3 LIT tests regression caused by code changes.

arsenm added inline comments.Apr 11 2017, 5:22 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
331 ↗(On Diff #94909)

This is a bad name because most blocks have terminators. Almost all should at this point, this needs to specify that it is divergent terminator

333 ↗(On Diff #94909)

You could maybe make this a map and cache the result so that every single block that needs to be visited doesn't need to walk all the way up each time

wdng added inline comments.Apr 11 2017, 11:47 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗(On Diff #94909)

Visited is used to check whether the visiting node has been visited before, once is has been visited before it's predecessors won't be processed again. For example:

   D
 /   \
A     C
|  \  |
B     E 
 \   /
   F

Assume starting from node F, trace down F->B->A->D, when visiting from F to E, A and D won't be processed again. So I don't think there is a need to save results like A and D.

wdng updated this revision to Diff 94979.Apr 12 2017, 8:53 AM

Change function name.

arsenm added inline comments.Apr 12 2017, 12:35 PM
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗(On Diff #94909)

If you consider the entire function, the same predecessors will be visited for phis in other blocks

78 ↗(On Diff #94979)

This should be the first included llvm header

test/CodeGen/AMDGPU/sgprcopies.ll
6 ↗(On Diff #94979)

This should be marked amdgpu_kernel

wdng updated this revision to Diff 95055.Apr 12 2017, 4:30 PM
wdng marked an inline comment as done.
  1. Address code reviews.
  2. Will create another patch for optimized searching of divergent terminator.
rampitec added inline comments.Apr 12 2017, 4:35 PM
test/CodeGen/AMDGPU/sgprcopies.ll
6 ↗(On Diff #95055)

Calling convention, not function name.

wdng updated this revision to Diff 95057.Apr 12 2017, 4:42 PM

Address code review.

This revision is now accepted and ready to land.Apr 12 2017, 4:46 PM
wdng marked 3 inline comments as done.Apr 12 2017, 4:46 PM
wdng added inline comments.
lib/Target/AMDGPU/SIFixSGPRCopies.cpp
333 ↗(On Diff #94909)

A separate patch will be created for optimized search divergent terminators.

This revision was automatically updated to reflect the committed changes.
wdng marked an inline comment as done.