This is an archive of the discontinued LLVM Phabricator instance.

Pass Divergence Analysis data to selection DAG to drive divergence dependent instruction selection
ClosedPublic

Authored by alex-t on Jul 11 2017, 9:52 AM.

Details

Summary

In SIMT architectures, VGPRs are a high-demand resource. At the same time, a significant part of the computation operates on naturally scalar data.
Such computations can be performed by the SALU, saving a lot of VGPRs. This is intended to increase occupancy.
Also, splitting the data flow into scalar and vector parts gives the instruction scheduler more flexibility, which can increase HW utilization.

On GPU targets we say an instruction is vector if it operates on VGPR operands, where each lane holds a different value.
We say an instruction is scalar if it operates on SGPRs, which are shared among all threads in the warp.

Divergence Analysis was introduced by F. Pereira et al. in 2013 and is now part of LLVM's core analyses.
Unfortunately, its results are mostly unused because there is no way to inform the instruction selection DAG of the divergence property of a concrete instruction.
In short, an IR operation that has no divergent operands produces a uniform result and should be selected to a scalar instruction.

We used to pass divergence data for memory access instructions through metadata, simply because MemSDNode has a memory operand that refers to the IR.
That approach is restricted to memory accesses only, so we would need another pass over the machine code that propagates the divergence property
from the value load through the computations and finally to the result store. Besides requiring one more pass,
it would repeat on machine instructions the same algorithm that the divergence analysis has already run over the IR.

Since the SDNode flags field was recently widened to 16 bits and 5 bits are still unoccupied, we have a chance to use one of them to pass divergence data to instruction selection.

This change introduces one possible implementation of such an enhancement.
It passes DA data for load instructions only. If accepted, we will go ahead and add the same handling for other instructions as well.

Diff Detail

Repository
rL LLVM

Event Timeline

alex-t created this revision.Jul 11 2017, 9:52 AM
lattner resigned from this revision.Jul 11 2017, 10:05 AM
rampitec added inline comments.Jul 11 2017, 1:33 PM
lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
334 ↗(On Diff #106050)

The analysis is pretty expensive, but not needed by all targets. There is TTI.hasBranchDivergence(). How about adding the analysis as required only if TTI.hasBranchDivergence()? It also means you will need to default isDivergent to 0 if the analysis is unavailable.

alex-t added inline comments.Jul 12 2017, 5:20 AM
lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
334 ↗(On Diff #106050)

Not sure it makes sense. DA.runOnFunction itself bails out if the target has no divergence:

bool DivergenceAnalysis::runOnFunction(Function &F) {
  auto *TTIWP = getAnalysisIfAvailable<TargetTransformInfoWrapperPass>();
  if (TTIWP == nullptr)
    return false;

  TargetTransformInfo &TTI = TTIWP->getTTI(F);
  // Fast path: if the target does not have branch divergence, we do not mark
  // any branch as divergent.
  if (!TTI.hasBranchDivergence())
    return false;
rampitec added inline comments.Jul 12 2017, 8:03 AM
lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
334 ↗(On Diff #106050)

It bails, yet it depends on DominatorTree on its own, so adding it as a required pass will cause DT to be built.

alex-t added a comment.Sep 1 2017, 8:47 AM

Ping. Is anybody going to look at this? :)

arsenm added inline comments.Sep 11 2017, 8:42 AM
include/llvm/CodeGen/SelectionDAGNodes.h
665–666 ↗(On Diff #106050)

I have a general concern about this. The way this is used is not going to fit with how the SelectionDAG APIs work, and it is going to be very invasive. An SDNode is supposed to be immutable, and some level of CSE is done by getNode. You can't have an API that involves setting a bit on a newly created node. Anything setting this needs to be done in getNode.

Are divergent and non-divergent nodes CSEable? These need to be handled somewhere to prevent them from folding.

You seem to only specially handle loads, but we have a lot of cases where we have combine issues from not knowing whether it's going to be selected to SALU or VALU instructions. If we have to somehow propagate this on every place a node is produced, that is a massive undertaking. I don't think that at this point it's worth trying to do such a level of work on SelectionDAG with GlobalISel on the way. Only handling loads I thought we could do just from the MemOperand.

test/CodeGen/AMDGPU/hsa-func.ll
2 ↗(On Diff #106050)

This should be dropped

alex-t added inline comments.Sep 11 2017, 9:07 AM
include/llvm/CodeGen/SelectionDAGNodes.h
665–666 ↗(On Diff #106050)

I agree with you in general... I also don't like explicitly propagating the divergence flag at every place in combining and/or legalizing.
The problem is that getNode is not the only point where a new SDNode may be created. For example, getLoad and getExtLoad bypass getNode and create a LoadSDNode explicitly. As for handling divergence in the CSE map... maybe I don't understand your point? If the node is CSEed, we don't care whether it is divergent or not.

alex-t updated this revision to Diff 122277.Nov 9 2017, 11:06 AM

Implementation changed according to the reviewers' suggestions.

rampitec edited edge metadata.Nov 9 2017, 12:06 PM

This actually looks clean to me, thank you!

This revision is now accepted and ready to land.Nov 16 2017, 1:29 PM
vpykhtin edited edge metadata.Dec 5 2017, 6:44 AM

In general, adding "custom" code to SelectionDAGBuilder::setValue looks odd. Instead, I would add a target-customizable postprocessing loop over Value <-> SDNode pairs in SelectionDAGISel::SelectBasicBlock right after the DAG is created. The target hook should be able to get whatever LLVM IR analysis it requires and annotate SDNodes.

lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
334 ↗(On Diff #122277)

this isn't fixed yet.

alex-t updated this revision to Diff 125574.Dec 5 2017, 10:53 AM

Targets that have no divergence do not depend on Divergence Analysis anymore.

In general, adding "custom" code to SelectionDAGBuilder::setValue looks odd. Instead, I would add a target-customizable postprocessing loop over Value <-> SDNode pairs in SelectionDAGISel::SelectBasicBlock right after the DAG is created. The target hook should be able to get whatever LLVM IR analysis it requires and annotate SDNodes.

The problem I see here is that the original Value is already unavailable after the DAG builder, which would mean we need to expose NodeMap to targets. In fact, the current solution looks better to me.

alex-t updated this revision to Diff 125725.Dec 6 2017, 7:27 AM

Attention please! If nobody has objections this will be committed next Friday.

Does ReplaceAllUsesWith need to propagate changes to the "IsDivergent" bit?

include/llvm/CodeGen/SelectionDAG.h
360 ↗(On Diff #125725)

I like this. :)

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
688 ↗(On Diff #125725)

You're mutating the node after it's been inserted into the CSEMap, which is generally bad. Also, it's not clear this is the node you need to set the "divergent" bit on (NewN could be something which will be eliminated by DAGCombine, like a BITCAST or MERGE_VALUES). Can we do this some other way which is more obviously correct?

There actually can be a problem with folding the node if we patch it after creation. At least this needs to be checked.

alex-t added a comment.Dec 8 2017, 4:38 AM

Does ReplaceAllUsesWith need to propagate changes to the "IsDivergent" bit?

Divergence Analysis is an iterative solver over SSA form, so after it is done we assume all Values are correctly annotated with the divergence flag.
When we change some DAG pattern (combiner/legalizer, etc.) into some other pattern, the divergence of any new node (and, recursively, of the resulting pattern root) is the superposition of the divergence of its operands.
So we partially repeat the work that was done by the DA, but locally, for each newly created node. This works because we assume all the operands have the correct bit set.
Any DAG transformation that changes a divergent pattern to a non-divergent one, or vice versa, is illegal.
Given that, we don't need to propagate the flag in ReplaceAllUsesWith.

alex-t added a comment.EditedDec 8 2017, 4:42 AM

There actually can be a problem with folding the node if we patch it after creation. At least this needs to be checked.

That's true. The problem is that in SelectionDAG::getNode (where the CSEMap insertion is) we have no Value and no chance to check its divergence.
And this is correct: SelectionDAG is for selection, and we should not expose IR Values to it.

The only way I see is to pass a Divergence parameter to getNode from all the SelectionDAGBuilder visitors. This would be correct but requires changing each of the 109 visitors and getNode().

alex-t added inline comments.Dec 8 2017, 5:06 AM
lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
688 ↗(On Diff #125725)

You are right.
BTW, this is not the only place that mutates an SDNode that has already been created.
Look at SelectionDAG.cpp, lines 4719-4722:

if (SDNode *E = FindNodeOrInsertPos(ID, DL, IP)) {
  E->intersectFlagsWith(Flags);
  return SDValue(E, 0);
}

An SDNode found in the map is mutated and then returned without any memoization of the mutation.

Any DAG transformation that change divergent pattern to not-divergent or vice versa is illegal.

Transforming "x*0 -> 0" is illegal if x is divergent? That seems surprising.

lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
688 ↗(On Diff #125725)

I'm more concerned about whether you're attaching the "divergent" bit to the right node. The mutation is probably mostly harmless; as you note, other places mess with flags.

Any DAG transformation that change divergent pattern to not-divergent or vice versa is illegal.

Transforming "x*0 -> 0" is illegal if x is divergent? That seems surprising.

Okay, I was unclear: except for constants. Your example is a corner case that turns a variable into a constant.
In that case, without bit propagation we are still correct but sub-optimal.
I can imagine, though, a case where a long sequence of constant folding ends up with pure zero. If, in addition, the operand that became constant was the only divergent operand, we would like to propagate.

More generally, there is a known corner case: for example, "get_local_id(x) & ~63" is uniform.

The idea here is that get_local_id() is a source of divergence, but only its low 6 bits are divergent; the upper bits are uniform for our target. Handling such cases would require extending the DA interface to produce a divergent bit mask instead of a one-bit answer, and employing computeKnownBits on an expression to deduce which expressions converge to uniform even if they depend on a non-uniform value.

The corner case is an order of magnitude less frequent than the more straightforward uses of a divergent expression, though. We have plans to extend divergence analysis in the future to handle this, but without a good mechanism to propagate it through the DAG it would not be very useful anyway.

There actually can be a problem with folding the node if we patch it after creation. At least this needs to be checked.

That's true. The problem is that in SelectionDAG::getNode (where the CSEMap insertion is) we have no Value and no chance to check its divergence.
And this is correct: SelectionDAG is for selection, and we should not expose IR Values to it.

The only way I see is to pass a Divergence parameter to getNode from all the SelectionDAGBuilder visitors. This would be correct but requires changing each of the 109 visitors and getNode().

In fact, we have no chance of getting two SDNodes that differ only by the divergence flag.
Please note that selection operates per block: SelectionDAGBuilder constructs the DAG for one block at a time.
Then it selects and emits the code, and then all the data, including the CSE map, is cleared.
FoldingSetNodeID creates the hash from the node and its operands.
Thus we hit the hash only if there is the same node with the same operands.
From the data-dependency point of view it must have the same divergence. So it is literally the same node, and setting the same divergence flag value does no harm.
The only case where we could have two nodes that differ only by divergence is when both have the same operands but one is control-dependent on a divergent branch.
That immediately means the two nodes belong to different basic blocks and hence cannot be folded.

Thank you, that makes sense. I have no objections then, since creating a node with the proper flags would result in a really massive patch changing all constructors and visitors; e.g., all node visitors would then need to gain knowledge about a property that is not node-specific. Post-patching seems better to me.

In this case w/o bit propagation we're still correct but sub-optimal.

I'm worried that you're covering up bugs by accepting "sub-optimal" results. Specifically, if you have a node which is marked divergent, but doesn't actually have any divergent operands, it will stay marked divergent in most cases... but if DAGCombine or legalization transforms it to some other equivalent operation, it'll erase the "divergent" marking. So your markings will look mostly correct in simple cases, but break for more complicated cases.

alex-t updated this revision to Diff 126767.Dec 13 2017, 8:57 AM

Divergence bit propagation added to ReplaceAllUsesWith

efriedma added inline comments.Dec 19 2017, 4:30 PM
lib/CodeGen/SelectionDAG/SelectionDAG.cpp
7311 ↗(On Diff #126767)

Missing code to unset IsDivergent if a node becomes non-divergent, and missing code to recursively propagate changes.

Actually, what I'd really like to see here is some sort of verifier for the divergent bit. It should be possible to recompute the divergence of the SelectionDAG at any point from first principles. There's a small set of operations which are fundamentally divergent: CopyFromReg where the register contains a divergent value (you should be able to derive this from DivergenceAnalysis), divergent memory accesses, and some target-specific intrinsics. (Not sure that's a complete list, but should be close.) All other operations are divergent if and only if they have a divergent predecessor.

Actually, what I'd really like to see here is some sort of verifier for the divergent bit. It should be possible to recompute the divergence of the SelectionDAG at any point from first principles. There's a small set of operations which are fundamentally divergent: CopyFromReg where the register contains a divergent value (you should be able to derive this from DivergenceAnalysis), divergent memory accesses, and some target-specific intrinsics. (Not sure that's a complete list, but should be close.) All other operations are divergent if and only if they have a divergent predecessor.

What you described is exactly the way the Divergence Analysis works. Do you really intend to create one more DA on top of the SelectionDAG?

In this case w/o bit propagation we're still correct but sub-optimal.

I'm worried that you're covering up bugs by accepting "sub-optimal" results. Specifically, if you have a node which is marked divergent, but doesn't actually have any divergent operands, it will stay marked divergent in most cases... but if DAGCombine or legalization transforms it to some other equivalent operation, it'll erase the "divergent" marking. So your markings will look mostly correct in simple cases, but break for more complicated cases.

I still insist that divergence-bit propagation through use replacement is not necessary. In most cases it is useless.
Please note that any node created in a combiner/legalizer transformation already has the correct divergence, because in CreateOperands the new node's divergence is computed as the OR of its operands' divergence bits.
Even if some transformation managed to replace a non-divergent use with a divergent one, that would be illegal.
The corner case in your example that turns a variable into a constant can never lead to incorrect code. In the worst case we would have a splat of zeroes that wastes a vector register, so the code is still correct but not optimal.

Do you really consider creating one more DA upon the Selection DAG?

Yes. IR instructions don't have a one-to-one correspondence with SelectionDAG nodes, so I think you're inevitably going to run into subtle bugs which will be difficult to track down.

The key here is computing whether a SelectionDAG node is "naturally" divergent (divergent regardless of its operands); once you have that, computing and verifying complete divergence is trivial. And computing whether a SelectionDAG node is naturally divergent shouldn't be hard, as far as I can tell, so this really shouldn't be much code overall.

Please note that any node that was created in the combiner/legalizer transformation already has the correct divergence because in CreateOperands the new node divergence is computed as OR over it's operands divergence bits.

This is only true if your original computation is correct, and if DAGCombine/Legalization doesn't create any nodes which are naturally divergent. Neither of those are safe assumptions, I think. DAGCombine and legalization will transform loads and stores, which could end up creating a naturally divergent node. And some divergent nodes will never be passed to SelectionDAGBuilder::setValue when you build the DAG, due to the way SelectionDAGBuilder handles values with illegal types. But I'm not sure that's a complete list of the issues with the current version, and there's no practical way to check without a verifier.

alex-t added a comment.EditedDec 21 2017, 11:40 PM

To start with, let's make sure we agree on terms.
A divergent machine runs a set of threads (a warp or wavefront) that execute the same instructions in the same order (SIMT).
A divergent operation operates on "vector" registers, where each register consists of many lanes and each thread operates on the data in the corresponding lane.
From the above it immediately follows that the only source of divergence is the thread ID, or any data derived from the thread ID.
Usually a small set of target intrinsics may be the source of such data.

There are 2 reasons for an operation to be divergent:

  1. It is data-dependent on some divergent operation:

%tid = call i64 get_global_id_x()                          ; source of divergence
%1 = add i64 %x, %tid                                      ; data dependence on operand 1
%2 = shl i64 %1, 16                                        ; data dependence on operand 0
%gep = getelementptr i32, i32 addrspace(1)* %array, i64 %2 ; data dependence on operand 1
%val = load i32, i32 addrspace(1)* %gep                    ; data dependence on operand 0

  2. It is uniform itself but control-dependent on a divergent branch:
int tid = get_global_id(0)

if (tid < n) {
  x = 1;               // no data-dependency on any divergent data
} else {
  x = 2;              // no data-dependency on any divergent data
}
y = x + 5;     // threads taking different branch-targets have different "y" value - operation is divergent ( it is vector addition on vector registers )

Since the selection DAG only models data dependencies, the latter case is out of scope for this discussion.
The DAG is constructed, transformed, and selected per block.

It follows that an operation in the selection DAG may only be divergent if there is a path in the DAG from some divergent node to it.

Initially the DAG is constructed by a walk over the IR (SelectionDAGBuilder) and models the IR exactly, so the divergence property is preserved.

Both DAG peephole optimizations (the combiner) and operation/type legalization do not create new edges in the data-dependence graph.
I mean that they match a pattern following the existing edges and then replace it with some other sub-graph, such that all incoming edges of the old subgraph become incoming edges of the new one, and the same holds for the outgoing edges.
Even if several incoming/outgoing edges are merged together, the data-flow pattern is kept.

This is only true if your original computation is correct, and if DAGCombine/Legalization doesn't create any nodes which are naturally divergent. Neither of those are safe assumptions, I think. DAGCombine and legalization will transform loads and stores, which could end up creating a naturally divergent node.

So, my question is: can you imagine even a theoretical sensible transformation that converts the graph in such a way that a uniform node gets a divergent input?

And some divergent nodes will never be passed to SelectionDAGBuilder::setValue when you build the DAG, due to the way SelectionDAGBuilder handles values with illegal types. But I'm not sure that's a complete list of the issues with the current version, and there's no practical way to check without a verifier.

Even if it creates a new DAG pattern, it returns the pattern's root, which (because of CreateOperands) has the correct divergence that will be passed to setValue. Or did I misunderstand what you meant?

This is only true if your original computation is correct, and if DAGCombine/Legalization doesn't create any nodes which are naturally divergent. Neither of those are safe assumptions, I think. DAGCombine and legalization will transform loads and stores, which could end up creating a naturally divergent node.

So, my question is: can you imagine even a theoretical sensible transformation that converts the graph in such a way that a uniform node gets a divergent input?

No, but that isn't the point. The problem is that you could replace a naturally divergent node with an equivalent naturally divergent node, but the new node doesn't have the divergent bit set (since the bit only gets set in DAGCombine for nodes with divergent operands, and naturally divergent nodes might not have divergent operands). Thinking about it a bit more, I guess regular load/store operations are a bad example; if a load produced multiple values given a uniform address, it would be a data race. But I think atomic memory operations could run into this issue? (Consider, for example, the code in DAGTypeLegalizer::PromoteIntRes_Atomic1.)

And some divergent nodes will never be passed to SelectionDAGBuilder::setValue when you build the DAG, due to the way SelectionDAGBuilder handles values with illegal types. But I'm not sure that's a complete list of the issues with the current version, and there's no practical way to check without a verifier.

Even if it creates a new DAG pattern, it returns the pattern's root, which (because of CreateOperands) has the correct divergence that will be passed to setValue. Or did I misunderstand what you meant?

That's not what I meant.

Say you have a call to a divergent function which returns an i64, but i64 isn't legal on your target (so the function effectively returns two values of type i32). We create the call, a couple CopyFromReg nodes, and then a MERGE_VALUES to merge the value. Then you set the MERGE_VALUES to be divergent... but that isn't really helpful: legalization for MERGE_VALUES erases the node, so the "divergent" bit goes away.

alex-t updated this revision to Diff 129965.Jan 16 2018, 8:26 AM

This is a draft of a divergence analysis solver on the selection DAG. In the course of the discussion, verification of the divergence bit was requested.
Analysis of one given block cannot cover control dependencies. Thus the divergence bits set from the IR, which reflect control dependencies, cannot match those computed on one isolated block's DAG. That's why this is not exactly verification: the analysis performed on the DAG augments the divergence information passed from the IR.

The diff uploaded is a draft, just to check: does it look like what you meant? In fact there are some issues to resolve:

  1. The content of the target-specific "isSDNodeSourceOfDivergence" procedure depends on the stage of DAG lowering where it is called. The most reasonable place is just before selection, after all combining/legalizing is done. At that point all the intrinsics are already expanded and turned into CopyFromReg or similar elementary operations, so it is unclear whether it is reasonable to have code handling these intrinsics.
  2. All the divergence-flag propagation in "ReplaceAllUsesWith" is useless and should be removed.
  3. This solution is not in fact verification, because the flags computed on a single block in general don't match those passed from the IR due to control dependencies. This is just yet another part of the analysis, augmenting the information.

I was thinking of a verifier more like the LLVM IR verifier... so we would constantly maintain correct divergence information, then check it in asserts builds. That way we can be confident the bit is right from building the DAG through ISel. In terms of code changes: essentially, make the divergence computation in createOperands call isSDNodeSourceOfDivergence, delete the changes to setValue, and make VerifyDAGDiverence assert rather than modify the node when it detects a difference.

  1. The content of the target-specific "isSDNodeSourceOfDivergence" procedure depends on the stage of DAG lowering where it is called. The most reasonable place is just before selection, after all combining/legalizing is done. At that point all the intrinsics are already expanded and turned into CopyFromReg or similar elementary operations, so it is unclear whether it is reasonable to have code handling these intrinsics.

It makes the rest of the patch cleaner if you handle intrinsics in isSDNodeSourceOfDivergence, I think.

  3. This solution is not in fact verification, because the flags computed on a single block in general don't match those passed from the IR due to control dependencies. This is just yet another part of the analysis, augmenting the information.

Specifically which nodes are a problem here? We should query the IR DivergenceAnalysis to compute isSDNodeSourceOfDivergence for a CopyFromReg from a live-in virtual register. (Not sure there's an existing map from registers to values, but you could easily construct one; basically the inverse of FunctionLoweringInfo::ValueMap.)

Specifically which nodes are a problem here? We should query the IR DivergenceAnalysis to compute isSDNodeSourceOfDivergence for a CopyFromReg from a live-in virtual register. (Not sure there's an existing map from registers to values, but you could easily construct one; basically the inverse of FunctionLoweringInfo::ValueMap.)

In one of my previous posts I explained what control dependencies are. Let's try again.
Consider the following OpenCL code:

uint tid = get_global_id(0);    // returns the ID of the individual workitem
if (tid < 10) {
  x = 2;
} else {
  x = 3;
} 
z = y + x; // all threads 0-9 have x= 2, others x= 3

Please note that the addition "z = y + x" is divergent, because different threads compute different values of "z".
Please also note that this addition does not depend on "tid" or any other divergent data. It is not possible to discover this dependency by analyzing an individual block; we need CFG information.
Divergence Analysis on IR covers control dependencies by means of special PHI-node processing.
For a regular node, divergence is computed as the logical OR of all operands' divergence bits.
For a PHI node, it additionally adds to the operand list all the branch instructions that terminate the basic blocks in the post-dominance frontier of the PHI's source blocks.

All the above means that we cannot just drop the IR divergence analysis results: the DAG only reflects data dependencies.
Analyzing an individual block's DAG, we can only follow data dependencies. So if we try to match the divergence bits computed on the IR (which account for control flow)
against those computed on the individual block's DAG, we will hit an assert on the divergence bits set on nodes that are control-dependent on divergent branches.

To track all the nodes that are divergent through control dependencies, we would need to maintain a special data structure across all stages of DAG processing.
That looks too resource-consuming.

There is one possible trade-off:
We can add a virtual hook in TargetTransformInfo to query whether the target supports divergence-analysis-driven selection. It returns true iff the target guarantees it has no transformations that may break divergence data integrity.
For AMDGPU that is always true.

If the target does not support this, we do not use the divergence bit at all.

This would allow us to use the functionality without even a theoretical threat to other targets.

Please also note that this addition does not depend on "tid" or any other divergent data. It is not possible to discover this dependency by analyzing an individual block. We need CFG information.

Yes, this is what I was getting at with "We should query the IR DivergenceAnalysis to compute isSDNodeSourceOfDivergence for a CopyFromReg from a live-in virtual register."; the nodes which need CFG information are precisely CopyFromReg nodes from virtual registers. Each virtual register created by the SelectionDAGBuilder should correspond to exactly one IR instruction.

Please also note that this addition does not depend on "tid" or any other divergent data. It is not possible to discover this dependency by analyzing an individual block. We need CFG information.

Yes, this is what I was getting at with "We should query the IR DivergenceAnalysis to compute isSDNodeSourceOfDivergence for a CopyFromReg from a live-in virtual register."; the nodes which need CFG information are precisely CopyFromReg nodes from virtual registers. Each virtual register created by the SelectionDAGBuilder should correspond to exactly one IR instruction.

In general this would work, but we still have several issues:

As I understand it, you are concerned about mutating the SDNode after it has been created in getNode().

  1. FunctionLoweringInfo::ValueMap is created during the SelectionDAGBuilder walk through the BasicBlock, so we cannot query live-in register divergence from CreateOperands => TargetLoweringInfo::isSDNodeSourceOfDivergence: by that point the ValueMap has not yet been filled in.
  2. Even if we were able to account for control dependencies in the SelectionDAGBuilder, we would still need a means to propagate the flag value through the DAG along the data dependency edges.

All of the above means that we cannot just validate the flag values and assert if they do not match. We have to run an iterative solver for each block just before selection to account for the control dependencies and to propagate the flag values.

I tried this approach, and it works at first glance.

One more item that should be discussed is the target-specific exceptions to the common divergence modeling algorithm.
For instance, the AMDGPU target has the amdgcn.readfirstlane/readlane intrinsics. They accept a vector register and return the value of the first (or a specific) lane.
So both accept a naturally divergent VGPR but return a scalar value.
Following the common divergence computation rule - "the divergence of an operation's result is the superposition of its operands' divergence" - we'd mark %scalar = tail call i32 @llvm.amdgcn.readfirstlane(i32 %tid) as divergent, which is not true.
In the IR form of divergence-driven selection we rely on the TargetTransformInfo::isAlwaysUniform hook that was added to the interface for this purpose.
It allows the target to declare an arbitrary set of target operations "always uniform" so that the analysis does not account for their operands' divergence.

To meet this design we'd have to add a similar hook to the TargetLoweringInfo interface. Is this feasible?
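A minimal sketch of how such a hook interacts with the propagation rule, using invented stand-in types (`TargetInfo`, `AMDGPUTargetInfo`, the `Opcode` enum, and `resultDivergent` are all hypothetical names for illustration; the real hook named in the text is TargetTransformInfo::isAlwaysUniform):

```cpp
#include <cassert>

// Toy opcode set standing in for real target operations.
enum Opcode { ADD, READFIRSTLANE, READLANE };

struct TargetInfo {
  virtual ~TargetInfo() = default;
  // Default: no operation is exempted from divergence propagation.
  virtual bool isAlwaysUniform(Opcode Op) const { (void)Op; return false; }
};

struct AMDGPUTargetInfo : TargetInfo {
  // readfirstlane/readlane take a divergent VGPR but produce a scalar,
  // so their results are uniform regardless of operand divergence.
  bool isAlwaysUniform(Opcode Op) const override {
    return Op == READFIRSTLANE || Op == READLANE;
  }
};

// During propagation the analysis would consult the hook before OR-ing
// the operand bits into the result bit.
bool resultDivergent(const TargetInfo &TI, Opcode Op, bool OperandsDivergent) {
  if (TI.isAlwaysUniform(Op))
    return false;
  return OperandsDivergent;
}
```

The design point is that the exemption is a per-target override with a conservative default, so targets that never declare anything "always uniform" keep the plain operand-OR behavior.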

As I understand it, you are concerned about mutating the SDNode after it has been created in getNode().

My most important concern is actually getting the modeling correct, so that queries come up with the correct result when DAGCombine queries the bit. If the bit on the SDNode is just a cache that can be recomputed/verified, it's fine to mutate it when we need to.

  1. FunctionLoweringInfo::ValueMap is created during the SelectionDAGBuilder walk through the BasicBlock, so we cannot query live-in register divergence from CreateOperands => TargetLoweringInfo::isSDNodeSourceOfDivergence: by that point the ValueMap has not yet been filled in.

Really? I thought we fill it in before we actually start building the SelectionDAG (in FunctionLoweringInfo::set). But you can move it earlier if you need to.

All of the above means that we cannot just validate the flag values and assert if they do not match. We have to run an iterative solver for each block just before selection to account for the control dependencies and to propagate the flag values.

I tried this approach, and it works at first glance.

Great!

To meet this design we'd have to add a similar hook to the TargetLoweringInfo interface. Is this feasible?

Yes, this should be fine.

  1. FunctionLoweringInfo::ValueMap is created during the SelectionDAGBuilder walk through the BasicBlock, so we cannot query live-in register divergence from CreateOperands => TargetLoweringInfo::isSDNodeSourceOfDivergence: by that point the ValueMap has not yet been filled in.

Really? I thought we fill it in before we actually start building the SelectionDAG (in FunctionLoweringInfo::set). But you can move it earlier if you need to.

All of the above means that we cannot just validate the flag values and assert if they do not match. We have to run an iterative solver for each block just before selection to account for the control dependencies and to propagate the flag values.

Oops... That was my mistake.

FunctionLoweringInfo::ValueMap gets filled in by FunctionLoweringInfo::CreateRegs in SelectionDAGISel::SelectAllBasicBlocks, much earlier than the SelectionDAGBuilder walks the IR. So everything works! :)

BTW, we don't need to verify the flags, since we're creating them in CreateOperands.
The flag for each node is computed from its own divergence and that of its operands; this happens during the SelectionDAGBuilder walk.
For each node, its operands have already been computed at this point, so the node's divergence is immediately set to the correct value.
This is correct precisely because, in contrast to IR, a DAG has no cycles.
The same holds if CreateOperands is called from the Combiner/Legalizer.

alex-t updated this revision to Diff 132862.Feb 5 2018, 11:10 AM

Here is an alternative implementation based on the TargetLoweringInfo hooks.

I'd like to see a verifier somewhere checking that the divergence bit is still correct after DAGCombine (it could differ from what SelectionDAG::createOperands would compute, given how ReplaceAllUsesWith works).

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
8258 ↗(On Diff #132862)

This is good; I'm happy we're cleanly computing divergence for a DAG node.

rampitec added inline comments.Feb 5 2018, 11:48 AM
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
781 ↗(On Diff #132862)

SIRegisterInfo::isVGPR()

794 ↗(On Diff #132862)

!DA || DA->isDivergent(...)

You are using getAnalysisIfAvailable, so it can be missing.

813 ↗(On Diff #132862)

Can you make isIntrinsicSourceOfDivergence() external and use it instead?

rampitec added inline comments.Feb 5 2018, 11:51 AM
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
786 ↗(On Diff #132862)

I am afraid it is not true to say that a VGPR is necessarily divergent.

alex-t added a comment.Feb 8 2018, 2:14 AM

I'd like to see a verifier somewhere checking that the divergence bit is still correct after DAGCombine (it could differ from what SelectionDAG::createOperands would compute, given how ReplaceAllUsesWith works).

Could you please clarify the goal of the verification? Let's say we managed to transform the DAG in such a way that a uniform pattern has been changed to a divergent one.
Then the approach depends on our attitude to the transformation.
If we agree that a transformation that changes the pattern's divergence is illegal, as when we change uniform to divergent, we should assert and bail out.
If we assume that the transformation is legal, as in your example of folding a divergent variable to a zero constant (x*0 => 0), we should recompute the divergence bits instead.
To handle both cases we need one more recomputation over all DAG nodes, like in my previous implementation, but with an error message if a uniform node becomes divergent.

I would prefer to just recompute the bits just before selection and leave the legality of the DAG transformations to those transformations' authors. In other words, we compute what we have.
If someone transforms the DAG incorrectly, that is their own problem.

alex-t added inline comments.Feb 8 2018, 2:18 AM
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
786 ↗(On Diff #132862)

That's not true. Do we have a means to detect that this is a splat vector?
If not, I'd stay with the conservative approach that considers all VGPRs divergent.
Alternatively, we could add one more target hook to query for special VGPRs that are uniform.

rampitec added inline comments.Feb 8 2018, 10:00 AM
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
786 ↗(On Diff #132862)

I doubt you can reliably detect it. The concern is potential unneeded moves and readfirstlane instructions, which is exactly what we are trying to avoid here.

If we're going to include the "divergent" bit in SDNodes, so we can query it all the time, the bit needs to be correct all the time. The goal of a verifier is to ensure that at any given point, the bits stored in the SelectionDAG are the same as the bits we would compute from scratch. So code still needs to do the right thing to update the divergence bits, if necessary, but the verifier lets us catch mistakes early. This is similar to the way we have a domtree verifier, to ensure transforms correctly update the domtree.

rampitec added inline comments.Feb 8 2018, 12:42 PM
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
786 ↗(On Diff #132862)

I withdraw this objection. Apparently this is all about physregs, and we do not have a lot of them at this stage.

alex-t added a comment.Feb 9 2018, 5:06 AM

If we're going to include the "divergent" bit in SDNodes, so we can query it all the time, the bit needs to be correct all the time. The goal of a verifier is to ensure that at any given point, the bits stored in the SelectionDAG are the same as the bits we would compute from scratch. So code still needs to do the right thing to update the divergence bits, if necessary, but the verifier lets us catch mistakes early. This is similar to the way we have a domtree verifier, to ensure transforms correctly update the domtree.

Recomputing the bits for the entire DAG any time the combiner changes something is too expensive.
Instead, I'd opt to propagate the bit in the ReplaceAllUses methods.
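One way this incremental update could look, sketched on the same toy node model rather than real SDNodes (`Node`, `computeBit`, and `updateDivergenceAfterRAUW` are hypothetical names for illustration): after all uses of a node `From` are rewritten to `To`, only the transitive users of `To` can change their bit, so re-propagation can start from them instead of rescanning the whole DAG.

```cpp
#include <cassert>
#include <queue>
#include <vector>

struct Node {
  std::vector<int> Operands;       // indices of operand nodes
  bool SourceOfDivergence = false;
  bool Divergent = false;
};

// Exact recomputation of one node's bit from its current operands.
static bool computeBit(const std::vector<Node> &Nodes, int N) {
  if (Nodes[N].SourceOfDivergence)
    return true;
  for (int Op : Nodes[N].Operands)
    if (Nodes[Op].Divergent)
      return true;
  return false;
}

// Called after all uses of some node were rewritten to To: revisit To's
// users transitively, updating bits only where they actually change.
void updateDivergenceAfterRAUW(std::vector<Node> &Nodes,
                               const std::vector<std::vector<int>> &Users,
                               int To) {
  std::queue<int> Worklist;
  for (int U : Users[To])
    Worklist.push(U);
  while (!Worklist.empty()) {
    int N = Worklist.front();
    Worklist.pop();
    bool NewBit = computeBit(Nodes, N);
    if (NewBit != Nodes[N].Divergent) {
      Nodes[N].Divergent = NewBit;
      for (int U : Users[N]) // this bit changed, so users may change too
        Worklist.push(U);
    }
  }
}
```

Note that because `computeBit` is an exact recomputation, the update can also clear a bit (divergent becomes uniform), which matters for folds like x*0 => 0 that replace a divergent value with a uniform constant.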

alex-t updated this revision to Diff 133617.Feb 9 2018, 7:41 AM

A preliminary revision illustrating a possible approach to keeping divergence information consistent across DAG transformations.

alex-t added inline comments.Feb 13 2018, 6:29 AM
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
794 ↗(On Diff #132862)

If DA == nullptr, in the case above we'd return ((bool)!DA), i.e. true?

Maybe it's better to return false for targets that have no DA?

I mean "DA && DA->isDivergent()": if we have no DA, we return false; if we do have one, the returned value is defined by the isDivergent result.

alex-t added inline comments.Feb 13 2018, 7:32 AM
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
781 ↗(On Diff #132862)

What should we do for R600Subtarget?

rampitec added inline comments.Feb 13 2018, 9:59 AM
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
781 ↗(On Diff #132862)

Since we are not going to change selection in R600, what we do for this case is practically unimportant.

794 ↗(On Diff #132862)

It is conservatively correct to return true. Presumably targets w/o DA will have no use of the bit anyway, but if they do, it is dangerous to assume uniformity.

alex-t added inline comments.Feb 13 2018, 10:06 AM
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
794 ↗(On Diff #132862)

targets w/o DA will have no use of the bit anyway, but if they do, it is dangerous

Sounds a bit paranoid :) I just noted that returning "true" for the target that has no divergence at all looks misleading.

alex-t added inline comments.Feb 14 2018, 1:00 AM
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
794 ↗(On Diff #132862)

Moreover, targets that have no DA have not overridden isSDNodeSourceOfDivergence either, and will never get here.

alex-t updated this revision to Diff 134442.Feb 15 2018, 9:02 AM

Some bug fixes and changes according to the reviewers' requirements.

rampitec added inline comments.Feb 15 2018, 9:22 AM
lib/Target/AMDGPU/AMDGPUISelLowering.cpp
794 ↗(On Diff #132862)

What if DA is invalidated and thus NULL, even for targets with divergence?

826 ↗(On Diff #134442)

It still duplicates the implementation in AMDGPUTargetTransformInfo.

I still want a verifier: a function which checks that the bits currently saved on SelectionDAG nodes are the same as the bits we would compute from scratch (and calls report_fatal_error() if they aren't). Maybe call it in a couple of places in SelectionDAGISel::CodeGenAndEmitDAG() if assertions are enabled.

This comment was removed by alex-t.
alex-t updated this revision to Diff 135310.Feb 21 2018, 12:22 PM

This is a preview of the implementation that keeps the divergence bits consistent throughout the DAG walk.
Please note that the verification algorithm has polynomial complexity and is expected to be switched on/off by an option (upcoming soon), defaulting to off.

You should be able to do verification in linear time. Just call SelectionDAG::AssignTopologicalOrder() before you start iterating over allnodes().

You should be able to do verification in linear time. Just call SelectionDAG::AssignTopologicalOrder() before you start iterating over allnodes().

Actually, on second thought, maybe don't do that; AssignTopologicalOrder() mutates the SelectionDAG, so it could change the generated code if we call it conditionally. But anyway, a topological sort should be straightforward.

alex-t updated this revision to Diff 135490.Feb 22 2018, 11:36 AM

Verification algorithm of linear complexity

efriedma added inline comments.Feb 22 2018, 12:42 PM
include/llvm/CodeGen/TargetLowering.h
2561 ↗(On Diff #135490)

Weird indentation; try clang-format?

lib/CodeGen/SelectionDAG/SelectionDAG.cpp
7506 ↗(On Diff #135490)

This is exactly what I was looking for; thanks.

lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp
748 ↗(On Diff #135490)

Maybe #ifndef NDEBUG.

Can we somehow skip this for targets which don't use divergence information?

alex-t updated this revision to Diff 135881.Feb 26 2018, 3:09 AM

Formatting fixed.
DAG divergence verification for "divergent" targets only.

alex-t updated this revision to Diff 135930.Feb 26 2018, 11:20 AM

One test fixed

make check-llvm has passed

Target-independent bits LGTM.

lib/Target/AMDGPU/AMDGPUISelLowering.cpp
838 ↗(On Diff #135930)

Formatting?

alex-t updated this revision to Diff 136723.Mar 2 2018, 5:39 AM

ready to land

alex-t marked 6 inline comments as done.Mar 2 2018, 5:40 AM
alex-t updated this revision to Diff 136772.Mar 2 2018, 9:21 AM
alex-t updated this revision to Diff 136985.Mar 5 2018, 5:58 AM
This revision was automatically updated to reflect the committed changes.