Page MenuHomePhabricator

AMDGPU: Fix copying i1 value out of loop with non-uniform exit
ClosedPublic

Authored by nhaehnle on Nov 28 2017, 4:24 AM.

Details

Summary

When an i1-value is defined inside of a loop and used outside of it, we
cannot simply use the SGPR bitmask from the loop's last iteration.

There are also useful and correct cases of an i1-value being copied between
basic blocks, e.g. when a condition is computed outside of a loop and used
inside it. The concept of dominators is not sufficient to capture what is
going on, so I propose the notion of "lane-dominators".

Fixes a bug encountered in Nier: Automata.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103743
Change-Id: If37b969ddc71d823ab3004aeafb9ea050e45bd9a

Diff Detail

Repository
rL LLVM

Event Timeline

nhaehnle created this revision.Nov 28 2017, 4:24 AM
t-tye added inline comments.Nov 28 2017, 7:50 PM
lib/Target/AMDGPU/Utils/AMDGPULaneDominator.cpp
70 ↗(On Diff #124543)

Would this return true for the CFG:

B

v
A

v
end

Since no successors of A reaches A to cause false to be returned. But no check seems to be made to ensure that A traditionally-dominates B.

arsenm added inline comments.Nov 28 2017, 8:00 PM
lib/Target/AMDGPU/Utils/AMDGPULaneDominator.cpp
13–15 ↗(On Diff #124543)

Are you really trying to find control flow equivalent regions? Or is this a active lanes are a subset?

alex-t edited edge metadata.Nov 29 2017, 4:05 AM

If I understand everything correct...
The problem you're trying to solve is well known.
You have divergent loop-exit and a value that is uniformly defined inside the loop but used outside the loop.
In this case different threads would have different values.
Traditional Divergence Analysis cannot handle this. Since definition inside the loop body is uniform the use is uniform as well.
Since the value has no explicit data dependency of the loop index, the PHI-node in the loop header (that is divergent if loop-exit is) does not affect it's divergence formally.
The value in fact does have loop-carried dependency. For example:

%tid = call i32 @llvm.amdgcn.workitem.id.x()

for.body:

%val = add i32 %val, 1                                               <== definition of %val is uniform
%cmp = icmp gt i64 %tid, %arg1
br i1 %cmp, label %for.end, label %for.body             <== loop exit condition is divergent

for.end:

store i32 %val, i32 addrspace(1)* %out     <== each thread will have different %val here

Fortunately, you interested in the concrete def and concrete use: AMDGPU::laneDominates(DefInst->getParent(), &MBB))

Could you please look here: https://reviews.llvm.org/D40556

Could you use same approach?

You have 2 blocks: defBlock and useBlock and you want to know:

  1. is useBlock is control dependent of defBlock ?
  2. if 1 is true is defBlock's termination branch uniform?

The set of control dependencies for defBlock is it's post-dominance frontier set
The set of control dependencies for useBlock is it's post-dominance frontier set
We need to check the branches that are NOT common in 2 sets above.

nhaehnle added inline comments.Jan 26 2018, 8:22 AM
lib/Target/AMDGPU/Utils/AMDGPULaneDominator.cpp
13–15 ↗(On Diff #124543)

They don't have to be equivalent. In the case of:

   A
  / \
 /   \
B     C

A lane-dominates both B and C, but they're not equivalent.

70 ↗(On Diff #124543)

I'm afraid I'm not sure what the CFG you have in mind looks like. The function assumes that A traditionally-dominates B, the caller must ensure that (as per the comment).

If I understand everything correct...
The problem you're trying to solve is well known.
You have divergent loop-exit and a value that is uniformly defined inside the loop but used outside the loop.

More or less. However, whether the value is uniform or not doesn't really make a difference: I can change the test case so that %cc is non-uniform, and the same issue occurs. So this isn't really about DivergenceAnalysis.

Could you please look here: https://reviews.llvm.org/D40556

Could you use same approach?

You have 2 blocks: defBlock and useBlock and you want to know:

  1. is useBlock is control dependent of defBlock ?
  2. if 1 is true is defBlock's termination branch uniform? The set of control dependencies for defBlock is it's post-dominance frontier set The set of control dependencies for useBlock is it's post-dominance frontier set We need to check the branches that are NOT common in 2 sets above.

I don't think this works, but perhaps I'm misunderstanding you. In the test case which I've added, the defBlock is %for.body, and the useBlock is %for.end.

%for.end post-dominates the entire loop, so its post-dominance frontier is empty.

%for.body post-dominates %entry and %end.loop, so its PDF is only %mid.loop.

None of that information seems to help?

If I understand everything correct...
The problem you're trying to solve is well known.
You have divergent loop-exit and a value that is uniformly defined inside the loop but used outside the loop.

More or less. However, whether the value is uniform or not doesn't really make a difference: I can change the test case so that %cc is non-uniform, and the same issue occurs. So this isn't really about DivergenceAnalysis.

Could you please look here: https://reviews.llvm.org/D40556

Could you use same approach?

You have 2 blocks: defBlock and useBlock and you want to know:

  1. is useBlock is control dependent of defBlock ?
  2. if 1 is true is defBlock's termination branch uniform? The set of control dependencies for defBlock is it's post-dominance frontier set The set of control dependencies for useBlock is it's post-dominance frontier set We need to check the branches that are NOT common in 2 sets above.

I don't think this works, but perhaps I'm misunderstanding you. In the test case which I've added, the defBlock is %for.body, and the useBlock is %for.end.

%for.end post-dominates the entire loop, so its post-dominance frontier is empty.

%for.body post-dominates %entry and %end.loop, so its PDF is only %mid.loop.

None of that information seems to help?

for.body:

%i = phi i32 [0, %entry], [%i.inc, %end.loop]
  • %cc = icmp ult i32 %i, 4** <-- definition br i1 %cc, label %mid.loop, label %for.end

mid.loop:

%v = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> %rsrc, i32 %tid, i32 %i, i1 false, i1 false)
%cc2 = fcmp oge float %v, 0.0
  • br i1 %cc2, label %end.loop, label %for.end ** <-- divergent branch condition

end.loop:

%i.inc = add i32 %i, 1
br label %for.body

for.end:

**br i1 %cc, label %if, label %end**     <-- use

Since the use block's PDF is empty and def block PDF contains the only one block "mid.loop" we only should check the "mid.loop"'s termination branch divergence.
Here it's immediately clear that the "cc2" is divergent and the branch in "mid.loop" is divergent as well.
So, the use in "for.end" is divergent by the control dependency of the "mid.block" divergent branch.

If I understand everything correct...
The problem you're trying to solve is well known.
You have divergent loop-exit and a value that is uniformly defined inside the loop but used outside the loop.

More or less. However, whether the value is uniform or not doesn't really make a difference: I can change the test case so that %cc is non-uniform, and the same issue occurs. So this isn't really about DivergenceAnalysis.

Could you please look here: https://reviews.llvm.org/D40556

Could you use same approach?

You have 2 blocks: defBlock and useBlock and you want to know:

  1. is useBlock is control dependent of defBlock ?
  2. if 1 is true is defBlock's termination branch uniform? The set of control dependencies for defBlock is it's post-dominance frontier set The set of control dependencies for useBlock is it's post-dominance frontier set We need to check the branches that are NOT common in 2 sets above.

I don't think this works, but perhaps I'm misunderstanding you. In the test case which I've added, the defBlock is %for.body, and the useBlock is %for.end.

%for.end post-dominates the entire loop, so its post-dominance frontier is empty.

%for.body post-dominates %entry and %end.loop, so its PDF is only %mid.loop.

None of that information seems to help?

for.body:

 %i = phi i32 [0, %entry], [%i.inc, %end.loop]
** %cc = icmp ult i32 %i, 4**                               <-- definition
 br i1 %cc, label %mid.loop, label %for.end

mid.loop:

 %v = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> %rsrc, i32 %tid, i32 %i, i1 false, i1 false)
 %cc2 = fcmp oge float %v, 0.0
** br i1 %cc2, label %end.loop, label %for.end   **           <-- divergent branch condition

end.loop:

%i.inc = add i32 %i, 1
br label %for.body

for.end:

**br i1 %cc, label %if, label %end**     <-- use

Since the use block's PDF is empty and def block PDF contains the only one block "mid.loop" we only should check the "mid.loop"'s termination branch divergence.
Here it's immediately clear that the "cc2" is divergent and the branch in "mid.loop" is divergent as well.
So, the use in "for.end" is divergent by the control dependency of the "mid.block" divergent branch.

Right, but what if the uniformness were reversed? We could have something like:

for.body:
  %i = phi i32 [0, %entry], [%i.inc, %end.loop]
  %cc = icmp ult i32 %i, 4**                               <-- uniform definition
  %v = call float @llvm.amdgcn.buffer.load.f32(<4 x i32> %rsrc, i32 %tid, i32 %i, i1 false, i1 false)
  %cc2 = fcmp oge float %v, 0.0
  br i1 %cc2, label %mid.loop, label %for.end <-- divergent branch

mid.loop:
  ...
  br i1 %cc3, label %end.loop, label %for.end   <-- uniform branch condition defined somewhere else

The same problem would still exist, but the proposed way of looking at PDFs would not detect the situation.

arsenm added inline comments.Mar 26 2018, 9:30 AM
lib/Target/AMDGPU/SILowerI1Copies.cpp
146 ↗(On Diff #124543)

I think the info from the commit message needs to be added here

lib/Target/AMDGPU/Utils/AMDGPULaneDominator.cpp
52 ↗(On Diff #124543)

We keep repeating essentially the same control flow depth first search in various places, but I don't have a better idea until we really fix the control flow situation

test/CodeGen/AMDGPU/i1-copy-from-loop.ll
1–2 ↗(On Diff #124543)

s/SI/GCN

alex-t accepted this revision.Mar 27 2018, 6:03 AM
This revision is now accepted and ready to land.Mar 27 2018, 6:03 AM
This revision was automatically updated to reflect the committed changes.
hintonda removed a subscriber: hintonda.Apr 4 2018, 4:37 PM