The reason is: uniform value that is defined inside the loop body and used outside the loop must be treated as divergent if loop has at least one divergent exit.
|112 ↗||(On Diff #202000)|
Describe the return value.
|180 ↗||(On Diff #202000)|
Given that pass only runs when TTI.hasBranchDivergence() this change seems reasonable to me. I do not really believe an index of a loop with the divergent condition is uniform. It is only uniform across the enabled lanes, but not across of the whole wave. Basically you make it divergent or uniform depending on uses, which makes it a semi-uniform value, more or less. I think this logic needs a better description though.
|180 ↗||(On Diff #202000)|
Since all threads of the wavefront that are live in the loop body observe same value. Thus, formally the value is uniform. The user outside the loop in LCSSA is, again formally, uses another value - LCSSA PHI-node, that is divergent. The problem arises when converting LCSSA back to normal form we insert the copy SGPR to VGPR.
The truth is that the value inside the loop is uniform but NOT scalar!
The proper fix would be to split the decision by two independent parts:
In general I see the issue this way: there is no binary answer if value is divergent or uniform. It can be uniform across some scope: some number of lanes, active lanes, wave, workgroup, or even a whole program. So you need a wave uniform value, where DA returns uniformness across active lanes, which is less.
So in the future we need to consider returning uniformness scope from isUniform() instead of a bool.
Changing the meaning of "uniform" in this way is just bound to lead to confusion further down the line, so please can we not do that?
I agree with Stas that the notion of "uniform" alone is simply not enough, though I'd phrase it this way: we need to distinguish between uniform at definition and uniform at use. The value inside the loop is uniform at definition, but it is not uniform when used outside the loop with divergent exit.
LCSSA is not sufficient for the following reason:
Since definition is considered uniform it is selected to SALU and produces SGPR.
The value in this SGPR is correct for all laves leave in loop body.
LCSSA PHI node that inserted in loop exit block will turn into the copy from SGPR to VGPR – this is incorrect.
I don't agree that the enhancement the definition of the "divergence" to the scope is correct way at all.
literally, the value is uniform if all threads observe same value. Nothing about exec mask, lanes or GPU :)
All threads in our case are all executing loop body. That's it.
My goal is to select instructions to SALU or VALU form in dependence of the both DA results and context.
Assigning correct register classes to the cross block values is part of this task. The problem is that current implementation only consults to DA making the decision.
All I need is to augment the DA results with the context.
Also please note: this review is about temporary solution that allow to not revert https://reviews.llvm.org/D59990#1521409
Incremental changes a easier to integrate.
To me "uniform at def" vs "uniform at use" it to vague. Essentially what is different between use and def is the context in terms of the threads affected, i.e. exactly the scope. You value can be uniform at def, and them suddenly become divergent even before the use, just because you have changed the exec mask. That is not use makes it divergent, but the control flow context.
Can you give an example of why this is necessary? Assuming we have LCSSA phis, I don't understand the reason for this.
shouldSink should never be used for correctness. See the comment:
/// MachineSink determines on its own whether the instruction is safe to sink; /// this gives the target a hook to override the default behavior with regards /// to which instructions should be sunk.
I suspect the right fix is to ensure that COPY instructions inserted by legalizeGenericOperands are marked with an implicit use of EXEC when a VGPR is involved.
The copy may be placed to the block produced by the edge splitting like below:
the copy of %14 in the example below will be placed to the bb.4.Flow that is already out of the loop body.
successors: %bb.5(0x80000000); %bb.5(100.00%) **%7:vgpr_32 **= PHI %108:vgpr_32, %bb.0, **%14:sreg_32_xm0,** %bb.4 SI_END_CF %6:sreg_64, implicit-def dead $exec, implicit-def dead $scc, implicit $exec S_BRANCH %bb.5
successors: %bb.4(0x04000000), %bb.3(0x7c000000); %bb.4(3.12%), %bb.3(96.88%) %14:sreg_32_xm0 = S_ADD_I32 %10:sreg_32_xm0, killed %86:sreg_32_xm0, implicit-def dead $scc SI_LOOP %15:sreg_64, %bb.3, implicit-def dead $exec, implicit-def dead $scc, implicit $exec S_BRANCH %bb.4
successors: %bb.2(0x80000000); %bb.2(100.00%) SI_END_CF %15:sreg_64, implicit-def dead $exec, implicit-def dead $scc, implicit $exec S_BRANCH %bb.2
%16:vgpr_32 = PHI **%7:vgpr_32, %bb.2**
Comment says: "this gives the target a hook to override the default behavior with regards
Assertion failed: Def->getNumOperands() == 2 && "Invalid number of operands", file llvm\lib\CodeGen\PeepholeOptimizer.cpp, line 1811
The Si Fix VGPR Copies pass adds implicit uses of EXEC later on...
That PHI is not an LCSSA phi. Is this with a compilation that has D60834?
It should never be used for correctness because the comment says so: MachineSink determines on its own whether the instruction is safe to sink. This change makes that comment incorrect.
IMO it would be better to remove the assertion in the ValueTracker. Sure, it claims that bad things will happen everywhere, but we're already using COPY with an implicit EXEC use through much of the compiler pipeline without bad things happening, and this is the semantically right way forward.