Currently not (xor_one_use) pattern is always selected to S_XNOR irrelative od the node divergence.
This relies on further custom selection pass which converts to VALU if necessary and replaces with V_NOT_B32 ( V_XOR_B32)
on those targets which have no V_XNOR.
Current change enables the patterns which explicitly select the not (xor_one_use) to appropriate form.
We assume that xor (not) is already turned into the not (xor) by the combiner.
Details
- Reviewers
rampitec foad - Commits
- rG5157f984ae2c: [AMDGPU] Enable divergence-driven XNOR selection
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
This looks like a regression in xnor.ll :
s_not_b32 s0, s0 v_not_b32_e32 v0, v0 v_xor_b32_e32 v0, s0, v0 v_xor_b32_e32 v0, s4, v0
but it is not really. All the nodes in the example are divergent and the divergent ( xor, x -1) is selected to V_NOT_B32 as of https://reviews.llvm.org/D115884 has been committed.
S_NOT_B32 appears at the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for the targets which have no V_XNOR. This optimization relies on the fact that if the NOT operand is SGPR and V_XOR_B32_e32 can accept SGPR as a first source operand.
I am not sure if it is always safe. The VALU instructions execution is controlled by the EXEC mask but SALU is not.
llvm/lib/Target/AMDGPU/VOP2Instructions.td | ||
---|---|---|
729 | Why are these two patterns required? Surely we can just let the NOT and the XOR be selected individually. The only effect of these patterns is to swap the order of the NOT and the XOR, but if that is beneficial then surely it should be done as a DAG combine instead? | |
llvm/lib/Target/AMDGPU/VOPInstructions.td | ||
797 ↗ | (On Diff #396159) | Please rebase this patch on D116241. |
llvm/test/CodeGen/AMDGPU/divergence-driven-xnor.ll | ||
2 | Also run this test on a subtarget that has v_xnor instructions? |
This is indeed a regression. It is always safe to keep s_not_b32 on SALU. Also note this effectively makes SIInstrInfo::lowerScalarXnor() useless. This is why XNOR was left behind by the D111907.
SIInstrInfo::lowerScalarXnor() is exactly the part of the "manual" SALU to VALU lowering that I am trying to get rid of.
The divergent "not" must be selected to the "V_NOT_B32_e32/64" otherwise we still have illegal VGPR to SGPR copies.
This happens because the divergent "not" node has divergent operands and their result will be likely in VGPR.
Also, we should select everything correctly first and can apply some peephole optimizations after.
In other words: we should not "cheat ourselves" during the selection. The selection should be done fairly corresponding to the node divergence bit.
Then we can apply the optimization in case it is safe.
Note that this is not the only case when we would like to further optimize the code after selection.
I'm planning to further add a separate pass for that.
We cannot solve the problem in the custom selection procedure because NOT node operand has not yet been selected and we do not know if it is SGPR or VGPR.
The only way, for now, is to post-process not(xor)/xor(not) in SIFixSGPRCopies. This may be considered a temporary hack until we have no proper pass for that.
SIInstrInfo::lowerScalarXnor() is dead after your patch and thus the patch has to remove it.
Then this is a clear regression, so if this requires a separate peephole later we need that peephole first and make sure the test does not regress.
SIInstrInfo::lowerScalarXnor() is dead after your patch
I don't understand why it is dead. In general moveToVALU moves instructions to VALU if any of their inputs are VGPRs, which can happen even if the result is uniform -- e.g. if some of the inputs are derived from a floating point calculation which had to use VALU instructions.
What moveToVALU does is convert to VALU everything to the end, corresponding to the data dependency graph.
This is generally not what we want. At first, any VGPR to SGPR copy with a uniform VGPR source is going to be V_READFIRSTLANE_B32.
Nevertheless, in some cases (and there are plenty of) it is profitable to convert the whole data chain to VALU. The decision should be made depending on the context analysis results.
In case it is considered profitable we still need to be able to convert the whole variety of operations from SALU to VALU.
That is why no moveToVALU potential callees are deleted so far.
I would not say that we need the peephole "first" because until this change is not applied there is nothing to optimize.
The only way I see here is to apply the peephole and the selection change at once. Hence we have to keep the pieces of the further post-isel optimizer within the SIFixSGPRCopies.
Added postprocessing of the selected machine IR. This makes it on par with the existing selection mechanism.
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp | ||
---|---|---|
924 ↗ | (On Diff #398148) | I have not read this code but it looks very complicated. I think this would be much better done as a DAG combine analogous to the existing SITargetLowering::reassociateScalarOps -- which is about 1/5th of the size of this function! |
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp | ||
---|---|---|
924 ↗ | (On Diff #398148) | This is not about the DAG at all. And this is not about the DAG combine. |
SITargetLowering::reassociateScalarOps exists to fix the instruction selection that is done in a wrong way.
No! It's not trying to fix anything, it's just trying to reassociate expressions to keep more of the intermediate results uniform, so we use fewer vgprs and fewer valu instructions. For example:
// v1 = (v0 + s0) + s1 v_add v1, v0, s0 v_add v1, v1, s1 --> // v1 = (s0 + s1) + v0 ; reassociated s_add s2, s0, s1 v_add v1, s2, v0
This is exactly the same kind of thing you need to do to restore the missed optimization in xnor.ll:
// v1 = ~(s0 ^ v0) v_xor v1, s0, v0 v_not v1, v1 --> // v1 = ~s0 ^ v0 s_not s1, s0 v_xor v1, s1, v0
To repeat what I have already said elsewhere: this is not a correctness issue. This is just an optimization, where you can choose to calculate either ~s0 ^ v0 or s0 ^ ~v0 (or ~(s0 ^ v0)) and get exactly the same result. The optimization is to prefer the first form, because the intermediate result ~s0 is uniform, so you can keep it in an sgpr and not waste vgprs and valu instructions.
I am sorry. I have really misled you by confusing the SITargetLowering::reassociateScalarOps with SIInstrInfo::lowerScalarXnor that we have recently discussed with @rampitec.
The SITargetLowering::reassociateScalarOps in our case does nothing because both XOR and NOT nodes are divergent.
The difference between what we had before my change to the selection and what we have now is following:
Before:
We selected both XOR and NOT to S_XOR_B32 and S_NOT_B32 and then optimized the whole pattern in SIInstrInfo::lowerScalarXnor and passed it to moveToVALU.
Now:
We select the divergent NOT to V_NOT_B32_e32 and divergent XOR to V_XOR_B32_e64. The selection is correct but we missed the opportunity to exploit the fact that even divergent NOT may be selected to S_NOT_B32 w/o the correctness lost in case its input is SGPR. But the latter fact is only available AFTER the selection is done.
During the selection process, we cannot predict if the divergent NOT operands are SGPRs because we are walking the DAG postorder and the node operands are not yet selected.
So, we only can apply the optimization AFTER the selection is done.
No, you cannot correctly select divergent NOT to S_NOT_B32. That is not what was happening before your patch (see https://reviews.llvm.org/D116270?vs=on&id=396159#change-5HrmrjqhUdXJ). What was happening was that an input like ~(uniform ^ divergent) was being "reassociated" to ~uniform ^ divergent so it could be correctly selected to S_NOT + V_XOR. I assume this was done with a very clever selection pattern, but I am suggesting that instead of that you could implement it as a DAG combine (to do the reassociation), so there is no need for clever selection patterns.
Once again, in my case BOTH nodes (not,xor) are divergent!
%s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0 DIVERGENT: %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1 DIVERGENT: %xor = xor i32 %v, %s.load DIVERGENT: %d = xor i32 %xor, -1 DIVERGENT: store i32 %d, i32 addrspace(1)* %out.load, align 4
And the correct selection is V_NOT_B32_e32.
But after the selection is done, we can in some cases rearrange instructions in such a way that we replace the V_NOT_B32_e32 with the S_NOT_B32 keeping the code correct.
The long test in my function does this.
Why it was S_NOT_B32 before my patch? Because of the SIInstrInfo::lowerScalarXnor + moveToVALU "magic". And, please note, that the latter one also works AFTER the selection is done and all the register classes for instructions inputs are known.
I know. I am suggesting that a DAG combine can rewrite this code to the equivalent of:
%s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0 DIVERGENT: %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1 %not = xor i32 %s.load, -1 DIVERGENT: %d = xor i32 %v, %not DIVERGENT: store i32 %d, i32 addrspace(1)* %out.load, align 4
Now %not is uniform, so it is trivial to select it to s_not.
The idea is brilliant. Unfortunately, it only can be implemented by changing LLVM common code.
If you look in the SITargetLowering::reassociateScalarOps you can see the operand constantness check and corresponding comment:
// If either operand is constant this will conflict with // DAGCombiner::ReassociateOps(). if (DAG.isConstantIntBuildVectorOrConstantInt(Op0) || DAG.isConstantIntBuildVectorOrConstantInt(Op1)) return SDValue();
Removing this guard leads to the infinite transforming the pattern back and forth in SITargetLowering::reassociateScalarOps and DAGCombiner::ReassociateOps().
The former transform (xor (xor uniform, divergent), -1) to (xor (xor uniform, -1), divergent) but the latter one transform it back by applying this rule:
if (N0.hasOneUse()) { ** // Reassociate: (op (op x, c1), y) -> (op (op x, y), c1) // iff (op x, c1) has one use** if (SDValue OpNode = DAG.getNode(Opc, SDLoc(N0), VT, N00, N1)) return DAG.getNode(Opc, DL, VT, OpNode, N01); return SDValue(); }
The infinite loop is formed because we add to the worklist the users of the transformed node. Hence this user comes to the combine in the next step and the same pattern gets transformed again and again.
What I can do here is to change the DAGCombiner::ReassociateOps() to make it consider node divergence. Not sure if this is suitable for upstreaming.
Oh dear. I don't really understand why the common code is doing this. The idea seems to come from a FIXME added in e260ed8628bbe245ffc39b130d121f2f50dc0bce
I think it is worth trying to change this generic combine to give up if x is uniform and y is divergent.
I think it is worth trying to change this generic combine to give up if x is uniform and y is divergent.
That is exactly what I have done for now:
if (N0.hasOneUse() && // In this case the transformation spoils the opportunity // to keep the whole operation scalar. !(!N0->isDivergent() && N1->isDivergent())) { // Reassociate: (op (op x, c1), y) -> (op (op x, y), c1) // iff (op x, c1) has one use if (SDValue OpNode = DAG.getNode(Opc, SDLoc(N0), VT, N00, N1)) return DAG.getNode(Opc, DL, VT, OpNode, N01); return SDValue(); }
but need to test if it does not break anything else :)
In general, I don't like the idea of making the DAGCombiner::reassociateOpsCommutative take into account the divergence.
We have 2 different heuristics to apply in the hope they make a profit.
Considering the arithmetic sub-tree user as root:
- keep commutative binary operation with one constant operand as close to the root in the tree as possible. This lets memory access patterns take advantage of the constant offset.
- given that the constant operand is uniform and the divergent node may use a uniform node but not vice-versa, keep the commutative binary operation with one constant operand as outer in the tree as possible in hope that it will be selected to the scalar operation.
t104: i32 = shl [ORD=15] [ID=17] # D:1 t27, Constant:i32<2> [ID=9] t105: i64 = zero_extend [ORD=15] [ID=19] # D:1 t104 t31: i64 = add [ORD=15] [ID=29] # D:1 t97, t105 t33: i64 = add [ORD=16] [ID=31] # D:1 t31, Constant:i64<4> [ID=6] t50: i32,ch = load<(load (s32) from %ir.b_ptr, addrspace 1)> [ORD=18] [ID=33] # D:1 t0, t33, undef:i64 [ID=2] # D:0
In the example above constant goes to the offset field in the load and the rest of the arithmetic sub-tree form the base. Since it is exposed to further combining, we end up with the simple move instead of 2 additions. Hence, the reassociation moving the constant to the operation that is uniform is not profitable here.
In our XOR example, we'd like to move constant operand up to another xor that already has one operand uniform.
To not introduce yet another regression fixing the current one, we would have to expose the heuristics details to the common code. For example: moving constant operand up in the commutative pattern is good if it is XOR but is unwanted if it is ADD that is the base of the load or store.
In general, it is a good idea to let the target decide if the concrete operation reassociation is profitable. So we may want to add TargetLoweringInfo::reassociateOpsCommutative hook to let the target attempt and then apply general reassociation if the target has not changed the pattern.
Overall this seems reasonable. The only alternative I can think of would need more complicated isel patterns that do the reassociation gated by predicates that check the divergence, something like:
let Predicate = doesNotHaveXNOR in def : GCNPat< (i32 (xor (xor_oneuse i32:$src0, i32:$src1), i32:$src2)), (i32 (V_XOR_B32 $src0, (V_XOR_B32 $src1, $src2))), [{ return src0->isDivergent() && !src1->isDivergent() && !src2->isDivergent(); }] >;
... and lots of commuted versions of the same thing, and the same for any other isel pattern that matches something that could be reassoicated. So that doesn't sound very scalable.
Currently not (xor_one_use) pattern is always selected to S_XNOR irrelative od the node divergence.
"irrelative od" -> "irrespective of" or "regardless of".
llvm/include/llvm/CodeGen/TargetLowering.h | ||
---|---|---|
3295 | Needs a proper descriptive comment. | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
12584 | Return early if !N0.hasOneUse(). | |
12586 | De-Morgan this to N0->isDivergent() || !N1->isDivergent(). | |
12592–12594 | I don't understand this heuristic. Can you give an example of when it would help? |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
12592–12594 | I could just demonstrate the concrete example but I would need to paste the DAGs here that look like overkill. So, I try to explain w/o the drawing. %and = and i32 %tmp, 16711935 ; 0x00ff00ff %tmp1 = and i32 %arg1, 4294967040 ; 0xffffff00 %tmp2 = or i32 %tmp1, -65536 %tmp3 = or i32 %tmp2, %and This is folded and can be selected to v_perm_b32 with this heuristic but will be 4 scalar operations w/o it. |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
12592–12594 | I still don't see why this would be useful in general. I think it means we should do this reassociation: |
Initial DAG
DAG after the transformation and constant folding
This can be selected to v_perm_b32
I guess this particular case could be handled by improving the v_perm matching in SITargetLowering::performOrCombine, e.g. add a cases for both of:
or (op x, c1), c2 -> perm x, x, permute_mask(c1, c2) or (perm x, x, c1), (op y, c2) -> perm x, y, permute_mask(c1, c2)
Or a single big case that handles:
or (or (op x, c1), c2), (op y, c3) -> perm x, y, permute_mask(c1, c2, c3)
Initial DAG
DAG after the transformation and constant folding
This can be selected to v_perm_b32
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
12592–12594 | I did not say that it is useful in all cases. I said that there is a high probability that the 2 or more constant operands in the same commutative arithmetic sub-tree will be folded by further steps of the combining. This is just a heuristic. v_mov_b32_e32 v3, 0xffff0500 v_perm_b32 v2, s0, v2, v3 instead of s_and_b32 s0, s0, 0xff00 s_or_b32 s0, s0, 0xffff0000 v_and_b32_e32 v2, 0xff00ff, v2 v_or_b32_e32 v2, s0, v2 In general, I agree that this is a fragile design and we will have to add more and more exemptions to keep all the optimizations working.
I consider the last one is much simpler to implement. |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
12592–12594 | I understand why this heuristic helps you form v_perm_b32 in one specific case. I don't understand why "there is a high probability that the 2 or more constant operands in the same commutative arithmetic sub-tree will be folded by further steps of the combining". As far as I can see, it is just luck that it helps this particular v_perm_b32 combine, and it could just as easily make another case fail to combine. Do you have any evidence that it increases the total number of v_perm_b32 combines across a large corpus of code? |
Bug fixed: Memory access DAG pattern check must ensure that "base + offset" pattern has MemSDNode users
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
12592–12594 | Yes. You are right. I cannot prove that this will help in another DAG pattern. |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
12592–12594 |
I suggest ignoring the regression. And if it turns out to be important, I suggest fixing it by improving v_perm pattern recognition in SITargetLowering::performOrCombine, not by tweaking isReassocProfitable. |
Looks reasonable, just some minor comments.
llvm/include/llvm/CodeGen/TargetLowering.h | ||
---|---|---|
3295 | That's better than nothing but it still doesn't describe what N0 and N1 are or what reassociation is being controlled. If I understand it is something like: (op N0:(op N00, N01), N1) -> (op (op N00, N1), N01) where both op are the same associative commutative op, and N01 is a constant. | |
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
12583 | This can use any_of and N->uses(). | |
12586 | Would be better to check that the use is the memory address, not (e.g.) the value being stored by a store instruction. But I don't know if there's a simple way to check that. | |
llvm/lib/Target/AMDGPU/SIISelLowering.h | ||
454 | We don't normally repeat virtual on the override -- not sure if it matters. |
Added detailed comment explainingthe new target hook.
Memory access pattern check changed to match exactly the address operand.
Minor fixes.
llvm/lib/Target/AMDGPU/SIISelLowering.cpp | ||
---|---|---|
12586 | That is a good idea but I cannot use the any_of then. The any_of uses the operator * type that is SDNode but I need the SDUse instead. So the check is going to remain a separate function. Also, despite this making the check more precise, it uncovers the permute regression that becomes explicit now. |
Needs a proper descriptive comment.