Page MenuHomePhabricator

Fix for the OCL/LC to failure on some OCLPerf tests
Needs ReviewPublic

Authored by alex-t on May 29 2019, 11:25 AM.

Details

Reviewers
rampitec
nhaehnle
Summary

The reason is: uniform value that is defined inside the loop body and used outside the loop must be treated as divergent if loop has at least one divergent exit.

Diff Detail

Event Timeline

alex-t created this revision.May 29 2019, 11:25 AM
alex-t added a subscriber: llvm-commits.
rampitec added inline comments.May 29 2019, 11:51 AM
lib/Analysis/LegacyDivergenceAnalysis.cpp
112 ↗(On Diff #202000)

Describe the return value.

180 ↗(On Diff #202000)

Given that pass only runs when TTI.hasBranchDivergence() this change seems reasonable to me. I do not really believe an index of a loop with the divergent condition is uniform. It is only uniform across the enabled lanes, but not across of the whole wave. Basically you make it divergent or uniform depending on uses, which makes it a semi-uniform value, more or less. I think this logic needs a better description though.

alex-t updated this revision to Diff 202152.May 30 2019, 4:37 AM
alex-t marked an inline comment as done.

Added comments describing the reason for the change.

Herald added a project: Restricted Project. · View Herald TranscriptMay 30 2019, 4:37 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
alex-t added inline comments.May 30 2019, 4:37 AM
lib/Analysis/LegacyDivergenceAnalysis.cpp
180 ↗(On Diff #202000)

Since all threads of the wavefront that are live in the loop body observe same value. Thus, formally the value is uniform. The user outside the loop in LCSSA is, again formally, uses another value - LCSSA PHI-node, that is divergent. The problem arises when converting LCSSA back to normal form we insert the copy SGPR to VGPR.

The truth is that the value inside the loop is uniform but NOT scalar!

The proper fix would be to split the decision by two independent parts:

  1. Is the value uniform - !DA->isDivergent(V)
  2. Is it SALU or VALU: TargetLowering::isVALU(V) - this should check out-of-divergent-loop-users and as much other conditions as is necessary for the target.
alex-t marked 2 inline comments as done.May 30 2019, 4:37 AM
This revision is now accepted and ready to land.May 30 2019, 8:41 AM

In general I see the issue this way: there is no binary answer if value is divergent or uniform. It can be uniform across some scope: some number of lanes, active lanes, wave, workgroup, or even a whole program. So you need a wave uniform value, where DA returns uniformness across active lanes, which is less.

So in the future we need to consider returning uniformness scope from isUniform() instead of a bool.

nhaehnle requested changes to this revision.May 31 2019, 12:28 AM

Changing the meaning of "uniform" in this way is just bound to lead to confusion further down the line, so please can we not do that?

I agree with Stas that the notion of "uniform" alone is simply not enough, though I'd phrase it this way: we need to distinguish between uniform at definition and uniform at use. The value inside the loop is uniform at definition, but it is not uniform when used outside the loop with divergent exit.

This revision now requires changes to proceed.May 31 2019, 12:28 AM

See D60834 for what I think is the right direction to fix this.

Once again:

LCSSA is not sufficient for the following reason:
Since definition is considered uniform it is selected to SALU and produces SGPR.
The value in this SGPR is correct for all laves leave in loop body.
LCSSA PHI node that inserted in loop exit block will turn into the copy from SGPR to VGPR – this is incorrect.

I don't agree that the enhancement the definition of the "divergence" to the scope is correct way at all.
literally, the value is uniform if all threads observe same value. Nothing about exec mask, lanes or GPU :)
All threads in our case are all executing loop body. That's it.

My goal is to select instructions to SALU or VALU form in dependence of the both DA results and context.
Assigning correct register classes to the cross block values is part of this task. The problem is that current implementation only consults to DA making the decision.
All I need is to augment the DA results with the context.

Also please note: this review is about temporary solution that allow to not revert https://reviews.llvm.org/D59990#1521409

Incremental changes a easier to integrate.

See D60834 for what I think is the right direction to fix this.

Once again:

LCSSA is not sufficient for the following reason:
Since definition is considered uniform it is selected to SALU and produces SGPR.
The value in this SGPR is correct for all laves leave in loop body.
LCSSA PHI node that inserted in loop exit block will turn into the copy from SGPR to VGPR – this is incorrect.

Why isn't sgpr to vgpr copy correct? If it is done still inside the loop it is suboptimal but correct. When done outside of the loop it is incorrect, that's true.

alex-t updated this revision to Diff 202444.May 31 2019, 8:45 AM

added the Divergent Analysis test update that was missed

Changing the meaning of "uniform" in this way is just bound to lead to confusion further down the line, so please can we not do that?

I agree with Stas that the notion of "uniform" alone is simply not enough, though I'd phrase it this way: we need to distinguish between uniform at definition and uniform at use. The value inside the loop is uniform at definition, but it is not uniform when used outside the loop with divergent exit.

To me "uniform at def" vs "uniform at use" it to vague. Essentially what is different between use and def is the context in terms of the threads affected, i.e. exactly the scope. You value can be uniform at def, and them suddenly become divergent even before the use, just because you have changed the exec mask. That is not use makes it divergent, but the control flow context.

See D60834 for what I think is the right direction to fix this.

Once again:

LCSSA is not sufficient for the following reason:
Since definition is considered uniform it is selected to SALU and produces SGPR.
The value in this SGPR is correct for all laves leave in loop body.
LCSSA PHI node that inserted in loop exit block will turn into the copy from SGPR to VGPR – this is incorrect.

Why isn't sgpr to vgpr copy correct? If it is done still inside the loop it is suboptimal but correct. When done outside of the loop it is incorrect, that's true.

I obviously meant outside :)
I answered to Nicolai about LCSSA. I not yet sure why but the move is always placed in "Flow" bock that is outside the loop.

alex-t updated this revision to Diff 202607.Jun 2 2019, 9:26 AM

Alternative fix.

Needs test.

nhaehnle added inline comments.Jun 3 2019, 12:10 AM
lib/Target/AMDGPU/SIInstrInfo.cpp
4174–4180

Can you give an example of why this is necessary? Assuming we have LCSSA phis, I don't understand the reason for this.

5969–5979

shouldSink should never be used for correctness. See the comment:

/// MachineSink determines on its own whether the instruction is safe to sink;
/// this gives the target a hook to override the default behavior with regards
/// to which instructions should be sunk.

I suspect the right fix is to ensure that COPY instructions inserted by legalizeGenericOperands are marked with an implicit use of EXEC when a VGPR is involved.

alex-t marked 2 inline comments as done.Jun 3 2019, 7:59 AM
alex-t added inline comments.
lib/Target/AMDGPU/SIInstrInfo.cpp
4174–4180

The copy may be placed to the block produced by the edge splitting like below:

the copy of %14 in the example below will be placed to the bb.4.Flow that is already out of the loop body.

bb.2.Flow7:
; predecessors: %bb.0, %bb.4

successors: %bb.5(0x80000000); %bb.5(100.00%)

**%7:vgpr_32 **= PHI %108:vgpr_32, %bb.0, **%14:sreg_32_xm0,** %bb.4
SI_END_CF %6:sreg_64, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
S_BRANCH %bb.5

bb.3.for.body:
; predecessors: %bb.1, %bb.3

successors: %bb.4(0x04000000), %bb.3(0x7c000000); %bb.4(3.12%), %bb.3(96.88%)


%14:sreg_32_xm0 = S_ADD_I32 %10:sreg_32_xm0, killed %86:sreg_32_xm0, implicit-def dead $scc

SI_LOOP %15:sreg_64, %bb.3, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
S_BRANCH %bb.4

bb.4.Flow:
; predecessors: %bb.3

successors: %bb.2(0x80000000); %bb.2(100.00%)

SI_END_CF %15:sreg_64, implicit-def dead $exec, implicit-def dead $scc, implicit $exec
S_BRANCH %bb.2

bb.5.for.end:
; predecessors: %bb.2

%16:vgpr_32 = PHI **%7:vgpr_32, %bb.2**
5969–5979

Comment says: "this gives the target a hook to override the default behavior with regards
to which instructions should be sunk.". So, I don't understand why it "should never be used for correctness".
Also, adding extra operands to the copies leads Peephole optimizer to crash in assert

Assertion failed: Def->getNumOperands() == 2 && "Invalid number of operands", file llvm\lib\CodeGen\PeepholeOptimizer.cpp, line 1811

The Si Fix VGPR Copies pass adds implicit uses of EXEC later on...

rampitec added inline comments.Jun 3 2019, 9:04 AM
lib/Target/AMDGPU/SIInstrInfo.cpp
5974

I think even V to V copy is illegal to move into a non-control flow equivalent region.

nhaehnle added inline comments.Jun 5 2019, 12:30 AM
lib/Target/AMDGPU/SIInstrInfo.cpp
4174–4180

That PHI is not an LCSSA phi. Is this with a compilation that has D60834?

5969–5979

It should never be used for correctness because the comment says so: MachineSink determines on its own whether the instruction is safe to sink. This change makes that comment incorrect.

IMO it would be better to remove the assertion in the ValueTracker. Sure, it claims that bad things will happen everywhere, but we're already using COPY with an implicit EXEC use through much of the compiler pipeline without bad things happening, and this is the semantically right way forward.