This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Enable divergence-driven XNOR selection
ClosedPublic

Authored by alex-t on Dec 24 2021, 6:54 AM.

Details

Summary

Currently the not (xor_one_use) pattern is always selected to S_XNOR irrespective of the node divergence.
This relies on a further custom lowering pass which converts the instruction to VALU if necessary, replacing it with V_NOT_B32 (V_XOR_B32)
on those targets which have no V_XNOR.
This change enables patterns that explicitly select not (xor_one_use) to the appropriate form based on divergence.
We assume that xor (not) has already been turned into not (xor) by the combiner.
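A rough sketch of the decision the new patterns encode (the real patterns are TableGen, not C++; the enum and helper below are purely illustrative and assume SelectionDAG's SDNode::isDivergent() as the divergence source):

#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

enum class XnorLowering {
  ScalarXnor,     // S_XNOR_B32 / S_XNOR_B64
  VectorXnor,     // V_XNOR_B32 on subtargets that have it
  VectorNotOfXor  // V_NOT_B32 of a V_XOR_B32 otherwise
};

// Pick the lowering for a (not (xor_one_use a, b)) node from its divergence
// bit instead of fixing it up after selection.
static XnorLowering chooseXnorLowering(const SDNode *N, bool HasVXnor) {
  if (!N->isDivergent())
    return XnorLowering::ScalarXnor;
  return HasVXnor ? XnorLowering::VectorXnor : XnorLowering::VectorNotOfXor;
}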

Diff Detail

Event Timeline

alex-t created this revision.Dec 24 2021, 6:54 AM
alex-t requested review of this revision.Dec 24 2021, 6:54 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 24 2021, 6:54 AM
Herald added a subscriber: wdng. · View Herald Transcript
alex-t updated this revision to Diff 396159.Dec 24 2021, 6:57 AM

LIT test file attributes corrected

This looks like a regression in xnor.ll :

	s_not_b32 s0, s0                        	v_not_b32_e32 v0, v0
	v_xor_b32_e32 v0, s0, v0                        v_xor_b32_e32 v0, s4, v0

but it is not really. All the nodes in the example are divergent, and the divergent (xor x, -1) is selected to V_NOT_B32 since https://reviews.llvm.org/D115884 was committed.
S_NOT_B32 appears on the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for targets that have no V_XNOR. This optimization relies on the fact that the NOT operand is an SGPR and V_XOR_B32_e32 can accept an SGPR as its first source operand.
I am not sure if it is always safe. VALU instruction execution is controlled by the EXEC mask, but SALU execution is not.

foad added a reviewer: foad.Dec 31 2021, 3:57 AM
foad added inline comments.
llvm/lib/Target/AMDGPU/VOP2Instructions.td
729

Why are these two patterns required? Surely we can just let the NOT and the XOR be selected individually. The only effect of these patterns is to swap the order of the NOT and the XOR, but if that is beneficial then surely it should be done as a DAG combine instead?

llvm/lib/Target/AMDGPU/VOPInstructions.td
797 ↗(On Diff #396159)

Please rebase this patch on D116241.

llvm/test/CodeGen/AMDGPU/divergence-driven-xnor.ll
2

Also run this test on a subtarget that has v_xnor instructions?

This looks like a regression in xnor.ll :

	s_not_b32 s0, s0                        	v_not_b32_e32 v0, v0
	v_xor_b32_e32 v0, s0, v0                        v_xor_b32_e32 v0, s4, v0

but it is not really. All the nodes in the example are divergent, and the divergent (xor x, -1) is selected to V_NOT_B32 since https://reviews.llvm.org/D115884 was committed.
S_NOT_B32 appears on the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for targets that have no V_XNOR. This optimization relies on the fact that the NOT operand is an SGPR and V_XOR_B32_e32 can accept an SGPR as its first source operand.
I am not sure if it is always safe. VALU instruction execution is controlled by the EXEC mask, but SALU execution is not.

This is indeed a regression. It is always safe to keep s_not_b32 on SALU. Also note this effectively makes SIInstrInfo::lowerScalarXnor() useless. This is why XNOR was left behind by D111907.

alex-t added a comment.EditedJan 6 2022, 3:39 AM

This looks like a regression in xnor.ll :

	s_not_b32 s0, s0                        	v_not_b32_e32 v0, v0
	v_xor_b32_e32 v0, s0, v0                        v_xor_b32_e32 v0, s4, v0

but it is not really. All the nodes in the example are divergent, and the divergent (xor x, -1) is selected to V_NOT_B32 since https://reviews.llvm.org/D115884 was committed.
S_NOT_B32 appears on the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for targets that have no V_XNOR. This optimization relies on the fact that the NOT operand is an SGPR and V_XOR_B32_e32 can accept an SGPR as its first source operand.
I am not sure if it is always safe. VALU instruction execution is controlled by the EXEC mask, but SALU execution is not.

This is indeed a regression. It is always safe to keep s_not_b32 on SALU. Also note this effectively makes SIInstrInfo::lowerScalarXnor() useless. This is why XNOR was left behind by D111907.

SIInstrInfo::lowerScalarXnor() is exactly the part of the "manual" SALU-to-VALU lowering that I am trying to get rid of.
The divergent "not" must be selected to V_NOT_B32_e32/64, otherwise we still have illegal VGPR-to-SGPR copies.
This happens because the divergent "not" node has divergent operands, and their result will likely be in a VGPR.
Also, we should select everything correctly first and apply peephole optimizations afterwards.
In other words, we should not "cheat ourselves" during the selection. The selection should be done faithfully, according to the node divergence bit.
Then we can apply the optimization where it is safe.
Note that this is not the only case where we would like to further optimize the code after selection.
I'm planning to add a separate pass for that later.

We cannot solve the problem in the custom selection procedure because the NOT node operand has not yet been selected, so we do not know if it is an SGPR or a VGPR.
The only way, for now, is to post-process not(xor)/xor(not) in SIFixSGPRCopies. This may be considered a temporary hack until we have a proper pass for that.
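A minimal sketch of the kind of post-selection rewrite described here (not the code from this patch; it assumes LLVM's MachineIR APIs, and the helper name and exact checks are illustrative):

#include "AMDGPU.h"
#include "SIInstrInfo.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
using namespace llvm;

// If a selected V_NOT_B32_e32 reads an SGPR, re-emit it as S_NOT_B32 so its
// (divergent) user can consume the scalar result directly. Real code would
// also have to verify that every user accepts an SGPR source operand.
static bool shrinkNotToScalar(MachineInstr &Not, const SIInstrInfo *TII,
                              MachineRegisterInfo &MRI) {
  if (Not.getOpcode() != AMDGPU::V_NOT_B32_e32 || !Not.getOperand(1).isReg())
    return false;
  Register Src = Not.getOperand(1).getReg();
  if (!Src.isVirtual() ||
      !TII->getRegisterInfo().isSGPRClass(MRI.getRegClass(Src)))
    return false; // the input is already in a VGPR; keep the VALU form
  Register SDst = MRI.createVirtualRegister(&AMDGPU::SReg_32RegClass);
  BuildMI(*Not.getParent(), Not, Not.getDebugLoc(),
          TII->get(AMDGPU::S_NOT_B32), SDst)
      .addReg(Src);
  MRI.replaceRegWith(Not.getOperand(0).getReg(), SDst);
  Not.eraseFromParent();
  return true;
}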

This looks like a regression in xnor.ll :

	s_not_b32 s0, s0                        	v_not_b32_e32 v0, v0
	v_xor_b32_e32 v0, s0, v0                        v_xor_b32_e32 v0, s4, v0

but it is not really. All the nodes in the example are divergent, and the divergent (xor x, -1) is selected to V_NOT_B32 since https://reviews.llvm.org/D115884 was committed.
S_NOT_B32 appears on the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for targets that have no V_XNOR. This optimization relies on the fact that the NOT operand is an SGPR and V_XOR_B32_e32 can accept an SGPR as its first source operand.
I am not sure if it is always safe. VALU instruction execution is controlled by the EXEC mask, but SALU execution is not.

This is indeed a regression. It is always safe to keep s_not_b32 on SALU. Also note this effectively makes SIInstrInfo::lowerScalarXnor() useless. This is why XNOR was left behind by D111907.

SIInstrInfo::lowerScalarXnor() is exactly the part of the "manual" SALU-to-VALU lowering that I am trying to get rid of.
The divergent "not" must be selected to V_NOT_B32_e32/64, otherwise we still have illegal VGPR-to-SGPR copies.
This happens because the divergent "not" node has divergent operands, and their result will likely be in a VGPR.
Also, we should select everything correctly first and apply peephole optimizations afterwards.
In other words, we should not "cheat ourselves" during the selection. The selection should be done faithfully, according to the node divergence bit.
Then we can apply the optimization where it is safe.
Note that this is not the only case where we would like to further optimize the code after selection.
I'm planning to add a separate pass for that later.

We cannot solve the problem in the custom selection procedure because the NOT node operand has not yet been selected, so we do not know if it is an SGPR or a VGPR.
The only way, for now, is to post-process not(xor)/xor(not) in SIFixSGPRCopies. This may be considered a temporary hack until we have a proper pass for that.

SIInstrInfo::lowerScalarXnor() is dead after your patch and thus the patch has to remove it.

Then this is a clear regression, so if this requires a separate peephole later we need that peephole first and make sure the test does not regress.

foad added a comment.Jan 7 2022, 1:14 AM

SIInstrInfo::lowerScalarXnor() is dead after your patch

I don't understand why it is dead. In general moveToVALU moves instructions to VALU if any of their inputs are VGPRs, which can happen even if the result is uniform -- e.g. if some of the inputs are derived from a floating point calculation which had to use VALU instructions.

alex-t added a comment.Jan 7 2022, 3:10 AM

SIInstrInfo::lowerScalarXnor() is dead after your patch

I don't understand why it is dead. In general moveToVALU moves instructions to VALU if any of their inputs are VGPRs, which can happen even if the result is uniform -- e.g. if some of the inputs are derived from a floating point calculation which had to use VALU instructions.

What moveToVALU does is convert everything downstream to VALU, following the data dependency graph.
This is generally not what we want. First, any VGPR-to-SGPR copy with a uniform VGPR source is going to be lowered to a V_READFIRSTLANE_B32.
Nevertheless, in some cases (and there are plenty of them) it is profitable to convert the whole data chain to VALU. The decision should be made based on the results of context analysis.
In case it is considered profitable, we still need to be able to convert the whole variety of operations from SALU to VALU.
That is why none of the potential moveToVALU callees have been deleted so far.
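A grossly simplified sketch of that cascading behaviour (assuming SIInstrInfo::getVALUOp(); the real moveToVALU also legalizes operands, fixes register classes and handles many special cases):

#include "SIInstrInfo.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
using namespace llvm;

// Once one instruction is forced onto the VALU, every SALU user of its
// (now VGPR) result has to follow, all the way down the dependency graph.
static void moveToVALUSketch(MachineInstr &Root, const SIInstrInfo &TII,
                             MachineRegisterInfo &MRI) {
  SmallVector<MachineInstr *, 16> Worklist{&Root};
  while (!Worklist.empty()) {
    MachineInstr *MI = Worklist.pop_back_val();
    unsigned NewOpc = TII.getVALUOp(*MI); // e.g. S_XOR_B32 -> V_XOR_B32_e64
    if (NewOpc == AMDGPU::INSTRUCTION_LIST_END)
      continue; // already VALU or no direct counterpart
    MI->setDesc(TII.get(NewOpc));
    if (!MI->getOperand(0).isReg() || !MI->getOperand(0).isDef())
      continue;
    for (MachineInstr &User : MRI.use_instructions(MI->getOperand(0).getReg()))
      Worklist.push_back(&User);
  }
}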

alex-t added a comment.Jan 7 2022, 3:21 AM

This looks like a regression in xnor.ll :

	s_not_b32 s0, s0                        	v_not_b32_e32 v0, v0
	v_xor_b32_e32 v0, s0, v0                        v_xor_b32_e32 v0, s4, v0

but it is not really. All the nodes in the example are divergent, and the divergent (xor x, -1) is selected to V_NOT_B32 since https://reviews.llvm.org/D115884 was committed.
S_NOT_B32 appears on the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for targets that have no V_XNOR. This optimization relies on the fact that the NOT operand is an SGPR and V_XOR_B32_e32 can accept an SGPR as its first source operand.
I am not sure if it is always safe. VALU instruction execution is controlled by the EXEC mask, but SALU execution is not.

This is indeed a regression. It is always safe to keep s_not_b32 on SALU. Also note this effectively makes SIInstrInfo::lowerScalarXnor() useless. This is why XNOR was left behind by D111907.

SIInstrInfo::lowerScalarXnor() is exactly the part of the "manual" SALU-to-VALU lowering that I am trying to get rid of.
The divergent "not" must be selected to V_NOT_B32_e32/64, otherwise we still have illegal VGPR-to-SGPR copies.
This happens because the divergent "not" node has divergent operands, and their result will likely be in a VGPR.
Also, we should select everything correctly first and apply peephole optimizations afterwards.
In other words, we should not "cheat ourselves" during the selection. The selection should be done faithfully, according to the node divergence bit.
Then we can apply the optimization where it is safe.
Note that this is not the only case where we would like to further optimize the code after selection.
I'm planning to add a separate pass for that later.

We cannot solve the problem in the custom selection procedure because the NOT node operand has not yet been selected, so we do not know if it is an SGPR or a VGPR.
The only way, for now, is to post-process not(xor)/xor(not) in SIFixSGPRCopies. This may be considered a temporary hack until we have a proper pass for that.

SIInstrInfo::lowerScalarXnor() is dead after your patch and thus the patch has to remove it.

Then this is a clear regression, so if this requires a separate peephole later we need that peephole first and make sure the test does not regress.

I would not say that we need the peephole "first", because until this change is applied there is nothing to optimize.
The only way I see here is to apply the peephole and the selection change at once. Hence we have to keep the pieces of the future post-ISel optimizer within SIFixSGPRCopies.

alex-t updated this revision to Diff 398148.Jan 7 2022, 7:56 AM

Added post-processing of the selected machine IR. This brings it on par with the existing selection mechanism.

alex-t marked 2 inline comments as done.Jan 7 2022, 7:58 AM
foad added inline comments.Jan 7 2022, 8:16 AM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
924 ↗(On Diff #398148)

I have not read this code but it looks very complicated. I think this would be much better done as a DAG combine analogous to the existing SITargetLowering::reassociateScalarOps -- which is about 1/5th of the size of this function!

alex-t added inline comments.Jan 7 2022, 1:12 PM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
924 ↗(On Diff #398148)

This is not about the DAG at all, and it is not about a DAG combine.
SITargetLowering::reassociateScalarOps exists to fix the instruction selection that is done in a wrong way. The instruction selection in LC was initially done incorrectly.
Namely, everything is selected to SALU and then manually lowered to VALU if there is at least one VGPR-to-SGPR copy. The whole work I am trying to do here is to change this so that we explicitly select the correct form of the instruction. This function is 5 times larger just because it includes everything done in reassociateScalarOps plus the corresponding moveToVALU logic.

foad added a comment.Jan 10 2022, 1:13 AM

SITargetLowering::reassociateScalarOps exists to fix the instruction selection that is done in a wrong way.

No! It's not trying to fix anything, it's just trying to reassociate expressions to keep more of the intermediate results uniform, so we use fewer vgprs and fewer valu instructions. For example:

// v1 = (v0 + s0) + s1
v_add v1, v0, s0
v_add v1, v1, s1
 -->
// v1 = (s0 + s1) + v0 ; reassociated
s_add s2, s0, s1
v_add v1, s2, v0

This is exactly the same kind of thing you need to do to restore the missed optimization in xnor.ll:

// v1 = ~(s0 ^ v0)
v_xor v1, s0, v0
v_not v1, v1
 -->
// v1 = ~s0 ^ v0
s_not s1, s0
v_xor v1, s1, v0
foad added a comment.Jan 10 2022, 2:20 AM

This looks like a regression in xnor.ll :

	s_not_b32 s0, s0                        	v_not_b32_e32 v0, v0
	v_xor_b32_e32 v0, s0, v0                        v_xor_b32_e32 v0, s4, v0

but it is not really. All the nodes in the example are divergent, and the divergent (xor x, -1) is selected to V_NOT_B32 since https://reviews.llvm.org/D115884 was committed.
S_NOT_B32 appears on the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for targets that have no V_XNOR. This optimization relies on the fact that the NOT operand is an SGPR and V_XOR_B32_e32 can accept an SGPR as its first source operand.
I am not sure if it is always safe. VALU instruction execution is controlled by the EXEC mask, but SALU execution is not.

To repeat what I have already said elsewhere: this is not a correctness issue. This is just an optimization, where you can choose to calculate either ~s0 ^ v0 or s0 ^ ~v0 (or ~(s0 ^ v0)) and get exactly the same result. The optimization is to prefer the first form, because the intermediate result ~s0 is uniform, so you can keep it in an sgpr and not waste vgprs and valu instructions.
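The algebraic fact being relied on here can be checked with arbitrary constants (the values below are illustrative):

// Negating either operand of an XOR is the same as negating its result,
// so ~(s0 ^ v0), ~s0 ^ v0 and s0 ^ ~v0 are interchangeable.
static_assert(~(0xA5u ^ 0x0Fu) == (~0xA5u ^ 0x0Fu), "xnor identity");
static_assert(~(0xA5u ^ 0x0Fu) == (0xA5u ^ ~0x0Fu), "xnor identity");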

alex-t added a comment.EditedJan 10 2022, 7:37 AM

This looks like a regression in xnor.ll :

	s_not_b32 s0, s0                        	v_not_b32_e32 v0, v0
	v_xor_b32_e32 v0, s0, v0                        v_xor_b32_e32 v0, s4, v0

but it is not really. All the nodes in the example are divergent, and the divergent (xor x, -1) is selected to V_NOT_B32 since https://reviews.llvm.org/D115884 was committed.
S_NOT_B32 appears on the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for targets that have no V_XNOR. This optimization relies on the fact that the NOT operand is an SGPR and V_XOR_B32_e32 can accept an SGPR as its first source operand.
I am not sure if it is always safe. VALU instruction execution is controlled by the EXEC mask, but SALU execution is not.

To repeat what I have already said elsewhere: this is not a correctness issue. This is just an optimization, where you can choose to calculate either ~s0 ^ v0 or s0 ^ ~v0 (or ~(s0 ^ v0)) and get exactly the same result. The optimization is to prefer the first form, because the intermediate result ~s0 is uniform, so you can keep it in an sgpr and not waste vgprs and valu instructions.

SITargetLowering::reassociateScalarOps exists to fix the instruction selection that is done in a wrong way.

No! It's not trying to fix anything, it's just trying to reassociate expressions to keep more of the intermediate results uniform, so we use fewer vgprs and fewer valu instructions. For example:

// v1 = (v0 + s0) + s1
v_add v1, v0, s0
v_add v1, v1, s1
 -->
// v1 = (s0 + s1) + v0 ; reassociated
s_add s2, s0, s1
v_add v1, s2, v0

This is exactly the same kind of thing you need to do to restore the missed optimization in xnor.ll:

// v1 = ~(s0 ^ v0)
v_xor v1, s0, v0
v_not v1, v1
 -->
// v1 = ~s0 ^ v0
s_not s1, s0
v_xor v1, s1, v0

I am sorry. I have really misled you by confusing SITargetLowering::reassociateScalarOps with SIInstrInfo::lowerScalarXnor, which we recently discussed with @rampitec.
SITargetLowering::reassociateScalarOps does nothing in our case because both the XOR and the NOT nodes are divergent.
The difference between what we had before my change to the selection and what we have now is the following:
Before:

We selected both XOR and NOT to S_XOR_B32 and S_NOT_B32 and then optimized the whole pattern in SIInstrInfo::lowerScalarXnor and passed it to moveToVALU.

Now:

We select the divergent NOT to V_NOT_B32_e32 and the divergent XOR to V_XOR_B32_e64. The selection is correct, but we miss the opportunity to exploit the fact that even a divergent NOT may be selected to S_NOT_B32 without losing correctness when its input is an SGPR. But the latter fact is only available AFTER the selection is done.

During the selection process, we cannot predict whether the divergent NOT operands are SGPRs because we are walking the DAG in post order and the node operands are not yet selected.
So, we can only apply the optimization AFTER the selection is done.

foad added a comment.Jan 10 2022, 7:45 AM

Now:

We select the divergent NOT to V_NOT_B32_e32 and the divergent XOR to V_XOR_B32_e64. The selection is correct, but we miss the opportunity to exploit the fact that even a divergent NOT may be selected to S_NOT_B32 without losing correctness.

No, you cannot correctly select divergent NOT to S_NOT_B32. That is not what was happening before your patch (see https://reviews.llvm.org/D116270?vs=on&id=396159#change-5HrmrjqhUdXJ). What was happening was that an input like ~(uniform ^ divergent) was being "reassociated" to ~uniform ^ divergent so it could be correctly selected to S_NOT + V_XOR. I assume this was done with a very clever selection pattern, but I am suggesting that instead of that you could implement it as a DAG combine (to do the reassociation), so there is no need for clever selection patterns.

alex-t added a comment.EditedJan 10 2022, 7:56 AM

Now:

We select the divergent NOT to V_NOT_B32_e32 and the divergent XOR to V_XOR_B32_e64. The selection is correct, but we miss the opportunity to exploit the fact that even a divergent NOT may be selected to S_NOT_B32 without losing correctness.

No, you cannot correctly select divergent NOT to S_NOT_B32. That is not what was happening before your patch (see https://reviews.llvm.org/D116270?vs=on&id=396159#change-5HrmrjqhUdXJ). What was happening was that an input like ~(uniform ^ divergent) was being "reassociated" to ~uniform ^ divergent so it could be correctly selected to S_NOT + V_XOR. I assume this was done with a very clever selection pattern, but I am suggesting that instead of that you could implement it as a DAG combine (to do the reassociation), so there is no need for clever selection patterns.

Once again, in my case BOTH nodes (not,xor) are divergent!

 %s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0
DIVERGENT:       %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
DIVERGENT:       %xor = xor i32 %v, %s.load
DIVERGENT:       %d = xor i32 %xor, -1
DIVERGENT:       store i32 %d, i32 addrspace(1)* %out.load, align 4

And the correct selection is V_NOT_B32_e32.
But after the selection is done, we can in some cases rearrange instructions so that the V_NOT_B32_e32 is replaced with an S_NOT_B32 while keeping the code correct.
The long test in my function does this.

Why was it S_NOT_B32 before my patch? Because of the SIInstrInfo::lowerScalarXnor + moveToVALU "magic". And please note that the latter also works AFTER the selection is done, when all the register classes of the instruction inputs are known.

foad added a comment.Jan 10 2022, 8:04 AM

Once again, in my case BOTH nodes (not,xor) are divergent!

 %s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0
DIVERGENT:       %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
DIVERGENT:       %xor = xor i32 %v, %s.load
DIVERGENT:       %d = xor i32 %xor, -1
DIVERGENT:       store i32 %d, i32 addrspace(1)* %out.load, align 4

I know. I am suggesting that a DAG combine can rewrite this code to the equivalent of:

                 %s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0
DIVERGENT:       %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
                 %not = xor i32 %s.load, -1
DIVERGENT:       %d = xor i32 %v, %not
DIVERGENT:       store i32 %d, i32 addrspace(1)* %out.load, align 4

Now %not is uniform, so it is trivial to select it to s_not.
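A sketch of the combine being suggested (not the code that eventually landed; the function name is illustrative and the usual SelectionDAG helpers are assumed):

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Fold (xor (xor Uniform, Divergent), -1) into (xor (xor Uniform, -1), Divergent)
// so the NOT part becomes uniform and can be selected to s_not_b32.
static SDValue reassociateXorNot(SDNode *N, SelectionDAG &DAG) {
  if (N->getOpcode() != ISD::XOR || !isAllOnesConstant(N->getOperand(1)))
    return SDValue();
  SDValue Inner = N->getOperand(0);
  if (Inner.getOpcode() != ISD::XOR || !Inner.hasOneUse())
    return SDValue();
  SDValue A = Inner.getOperand(0), B = Inner.getOperand(1);
  if (A->isDivergent())
    std::swap(A, B);                  // put the uniform operand in A
  if (A->isDivergent() || !B->isDivergent())
    return SDValue();                 // only interesting for mixed operands
  SDLoc DL(N);
  EVT VT = N->getValueType(0);
  SDValue NotA = DAG.getNOT(DL, A, VT);
  return DAG.getNode(ISD::XOR, DL, VT, NotA, B);
}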

alex-t added a comment.EditedJan 10 2022, 8:09 AM

Once again, in my case BOTH nodes (not,xor) are divergent!

 %s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0
DIVERGENT:       %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
DIVERGENT:       %xor = xor i32 %v, %s.load
DIVERGENT:       %d = xor i32 %xor, -1
DIVERGENT:       store i32 %d, i32 addrspace(1)* %out.load, align 4

I know. I am suggesting that a DAG combine can rewrite this code to the equivalent of:

                 %s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0
DIVERGENT:       %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
                 %not = xor i32 %s.load, -1
DIVERGENT:       %d = xor i32 %v, %not
DIVERGENT:       store i32 %d, i32 addrspace(1)* %out.load, align 4

Now %not is uniform, so it is trivial to select it to s_not.

Okay. I have finally got the idea. Thanks :)

Once again, in my case BOTH nodes (not,xor) are divergent!

 %s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0
DIVERGENT:       %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
DIVERGENT:       %xor = xor i32 %v, %s.load
DIVERGENT:       %d = xor i32 %xor, -1
DIVERGENT:       store i32 %d, i32 addrspace(1)* %out.load, align 4

I know. I am suggesting that a DAG combine can rewrite this code to the equivalent of:

                 %s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0
DIVERGENT:       %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
                 %not = xor i32 %s.load, -1
DIVERGENT:       %d = xor i32 %v, %not
DIVERGENT:       store i32 %d, i32 addrspace(1)* %out.load, align 4

Now %not is uniform, so it is trivial to select it to s_not.

The idea is brilliant. Unfortunately, it can only be implemented by changing LLVM common code.
If you look at SITargetLowering::reassociateScalarOps you can see the operand constantness check and the corresponding comment:

// If either operand is constant this will conflict with
// DAGCombiner::ReassociateOps().
if (DAG.isConstantIntBuildVectorOrConstantInt(Op0) ||
    DAG.isConstantIntBuildVectorOrConstantInt(Op1))
  return SDValue();

Removing this guard leads to infinitely transforming the pattern back and forth between SITargetLowering::reassociateScalarOps and DAGCombiner::ReassociateOps().
The former transforms (xor (xor uniform, divergent), -1) into (xor (xor uniform, -1), divergent), but the latter transforms it back by applying this rule:

if (N0.hasOneUse()) {
  // Reassociate: (op (op x, c1), y) -> (op (op x, y), c1)
  //              iff (op x, c1) has one use
  if (SDValue OpNode = DAG.getNode(Opc, SDLoc(N0), VT, N00, N1))
    return DAG.getNode(Opc, DL, VT, OpNode, N01);
  return SDValue();
}

The infinite loop forms because we add the users of the transformed node to the worklist. Hence the user comes back to the combiner on the next step, and the same pattern gets transformed again and again.
What I can do here is change DAGCombiner::ReassociateOps() to make it consider node divergence. Not sure if this is suitable for upstreaming.

foad added a comment.Jan 11 2022, 9:37 AM

Removing this guard leads to infinitely transforming the pattern back and forth between SITargetLowering::reassociateScalarOps and DAGCombiner::ReassociateOps().
The former transforms (xor (xor uniform, divergent), -1) into (xor (xor uniform, -1), divergent), but the latter transforms it back by applying this rule:

if (N0.hasOneUse()) {
  // Reassociate: (op (op x, c1), y) -> (op (op x, y), c1)
  //              iff (op x, c1) has one use
  if (SDValue OpNode = DAG.getNode(Opc, SDLoc(N0), VT, N00, N1))
    return DAG.getNode(Opc, DL, VT, OpNode, N01);
  return SDValue();
}

Oh dear. I don't really understand why the common code is doing this. The idea seems to come from a FIXME added in e260ed8628bbe245ffc39b130d121f2f50dc0bce

I think it is worth trying to change this generic combine to give up if x is uniform and y is divergent.

alex-t added a comment.EditedJan 11 2022, 12:11 PM

I think it is worth trying to change this generic combine to give up if x is uniform and y is divergent.

That is exactly what I have done for now:

if (N0.hasOneUse() &&
    // In this case the transformation spoils the opportunity
    // to keep the whole operation scalar.
    !(!N0->isDivergent() && N1->isDivergent())) {
  // Reassociate: (op (op x, c1), y) -> (op (op x, y), c1)
  //              iff (op x, c1) has one use
  if (SDValue OpNode = DAG.getNode(Opc, SDLoc(N0), VT, N00, N1))
    return DAG.getNode(Opc, DL, VT, OpNode, N01);
  return SDValue();
}

but I need to test that it does not break anything else :)

In general, I don't like the idea of making DAGCombiner::reassociateOpsCommutative take divergence into account.

We have 2 different heuristics, applied in the hope that they are profitable.

Considering the user of the arithmetic sub-tree as the root:

  1. Keep a commutative binary operation with one constant operand as close to the root of the tree as possible. This lets memory access patterns take advantage of the constant offset.
  2. Given that a constant operand is uniform, and a divergent node may use a uniform node but not vice versa, keep a commutative binary operation with one constant operand as far from the root as possible, in the hope that it will be selected to a scalar operation.
        t104: i32 = shl [ORD=15] [ID=17] # D:1 t27, Constant:i32<2> [ID=9]
      t105: i64 = zero_extend [ORD=15] [ID=19] # D:1 t104
    t31: i64 = add [ORD=15] [ID=29] # D:1 t97, t105
  t33: i64 = add [ORD=16] [ID=31] # D:1 t31, Constant:i64<4> [ID=6]
t50: i32,ch = load<(load (s32) from %ir.b_ptr, addrspace 1)> [ORD=18] [ID=33] # D:1 t0, t33, undef:i64 [ID=2] # D:0

In the example above, the constant goes to the offset field of the load, and the rest of the arithmetic sub-tree forms the base. Since it is exposed to further combining, we end up with a simple move instead of 2 additions. Hence, reassociating to move the constant onto the uniform operation is not profitable here.

In our XOR example, we'd like to move the constant operand up to the other xor, which already has one uniform operand.
To avoid introducing yet another regression while fixing the current one, we would have to expose the details of these heuristics to the common code. For example: moving the constant operand up in the commutative pattern is good if it is an XOR, but unwanted if it is an ADD that forms the base of a load or store.
In general, it is a good idea to let the target decide whether reassociating a concrete operation is profitable. So we may want to add a TargetLoweringInfo::reassociateOpsCommutative hook to let the target try the reassociation first, and then apply the general reassociation if the target has not changed the pattern.

alex-t updated this revision to Diff 400620.Jan 17 2022, 12:29 PM

DAG combiner hook added to control divergence-driven peephole optimizations.

I like this in general. @foad?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12586

Please simplify this condition. Probably make a separate "return false". The part "((!(!" is unreadable.

alex-t updated this revision to Diff 401633.Jan 20 2022, 7:09 AM

condition was made more readable

alex-t marked an inline comment as done.Jan 20 2022, 7:09 AM
foad added a comment.Jan 20 2022, 8:10 AM

Overall this seems reasonable. The only alternative I can think of would need more complicated isel patterns that do the reassociation gated by predicates that check the divergence, something like:

let Predicate = doesNotHaveXNOR in
def : GCNPat<
  (i32 (xor (xor_oneuse i32:$src0, i32:$src1), i32:$src2)),
  (i32 (V_XOR_B32 $src0, (V_XOR_B32 $src1, $src2))),
  [{ return src0->isDivergent() && !src1->isDivergent() && !src2->isDivergent(); }]
>;

... and lots of commuted versions of the same thing, and the same for any other isel pattern that matches something that could be reassociated. So that doesn't sound very scalable.

Currently not (xor_one_use) pattern is always selected to S_XNOR irrelative od the node divergence.

"irrelative od" -> "irrespective of" or "regardless of".

llvm/include/llvm/CodeGen/TargetLowering.h
3295

Needs a proper descriptive comment.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12584

Return early if !N0.hasOneUse().

12586

De-Morgan this to N0->isDivergent() || !N1->isDivergent().

12592–12594

I don't understand this heuristic. Can you give an example of when it would help?

alex-t added inline comments.Jan 20 2022, 9:04 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12592–12594

I could just demonstrate the concrete example, but I would need to paste the DAGs here, which looks like overkill. So I will try to explain without the drawing.
Let's imagine we have a sub-tree consisting of commutative arithmetic operations.
Let us have a path in the tree such that each node has at least one constant operand.
Given that, it is very likely that this sub-tree is going to be simplified by the combiner through arithmetic rules and constant folding.
This heuristic gives priority to such constant folding over keeping the outer node uniform.

%and = and i32 %tmp, 16711935     ; 0x00ff00ff
%tmp1 = and i32 %arg1, 4294967040 ; 0xffffff00
%tmp2 = or i32 %tmp1, -65536
%tmp3 = or i32 %tmp2, %and

This is folded and can be selected to v_perm_b32 with this heuristic, but will be 4 scalar operations without it.
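A quick constant check of that folding (the input values below are arbitrary; only the masks come from the snippet above):

// With these masks the four operations reduce to
// 0xffff : byte1(arg1) : byte0(tmp), i.e. a byte select that a single
// v_perm_b32 can express.
constexpr unsigned foldMasks(unsigned Tmp, unsigned Arg1) {
  unsigned And  = Tmp & 0x00ff00ffu;
  unsigned Tmp1 = Arg1 & 0xffffff00u;
  unsigned Tmp2 = Tmp1 | 0xffff0000u;
  return Tmp2 | And;
}
static_assert(foldMasks(0x12345678u, 0x9abcdef0u) == 0xffffde78u, "");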

foad added inline comments.Jan 20 2022, 9:25 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12592–12594

I still don't see why this would be useful in general. I think it means we should do this reassociation:
(op (op n00, C), (op2 n10, C2)) --> (op (op n00, (op2 n10, C2)), C)
where op2 is commutative but not necessarily the same as op. E.g. (x|C)|(z&C2) --> (x|(z&C2))|C

[image: initial DAG]

[image: DAG after the transformation and constant folding]

This can be selected to v_perm_b32.

foad added a comment.EditedJan 20 2022, 9:53 AM

I guess this particular case could be handled by improving the v_perm matching in SITargetLowering::performOrCombine, e.g. add a cases for both of:

or (op x, c1), c2 -> perm x, x, permute_mask(c1, c2)
or (perm x, x, c1), (op y, c2) -> perm x, y, permute_mask(c1, c2)

Or a single big case that handles:

or (or (op x, c1), c2), (op y, c3) -> perm x, y, permute_mask(c1, c2, c3)

[image: initial DAG]

[image: DAG after the transformation and constant folding]

This can be selected to v_perm_b32.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12592–12594

I did not say that it is useful in all cases. I said that there is a high probability that 2 or more constant operands in the same commutative arithmetic sub-tree will be folded by further steps of combining. This is just a heuristic.
And that is why I said that I don't like the combiner approach in general.
As soon as we factor divergence into one combining algorithm, we must teach all the other combining algorithms to leverage the divergence information. This is a huge amount of work. Without it, we likely restrict optimization opportunities just because we changed one of the peephole DAG transformations by introducing a new factor unknown to the rest of the combiner.
In my particular example, this heuristic lets us get

	v_mov_b32_e32 v3, 0xffff0500
	v_perm_b32 v2, s0, v2, v3

instead of

	s_and_b32 s0, s0, 0xff00
	s_or_b32 s0, s0, 0xffff0000
	v_and_b32_e32 v2, 0xff00ff, v2
	v_or_b32_e32 v2, s0, v2

In general, I agree that this is a fragile design and we will have to add more and more exceptions to keep all the optimizations working.
There are 2 clean ways:

  • Divergence-driven DAG combiner
  • Post ISel MIR optimizer

I consider the latter much simpler to implement.

foad added inline comments.Jan 21 2022, 3:57 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12592–12594

I understand why this heuristic helps you form v_perm_b32 in one specific case. I don't understand why "there is a high probability that the 2 or more constant operands in the same commutative arithmetic sub-tree will be folded by further steps of the combining". As far as I can see, it is just luck that it helps this particular v_perm_b32 combine, and it could just as easily make another case fail to combine. Do you have any evidence that it increases the total number of v_perm_b32 combines across a large corpus of code?

alex-t updated this revision to Diff 402009.Jan 21 2022, 9:09 AM

Bug fixed: Memory access DAG pattern check must ensure that "base + offset" pattern has MemSDNode users

alex-t added inline comments.Jan 21 2022, 9:31 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12592–12594

Yes. You are right. I cannot prove that this will help in another DAG pattern.
This is exactly what I tried to explain from the beginning.
The only reliable way to replace the "manual" selection that was done in SIFixSGPRCopies + SIInstrInfo::moveToVALU is fair divergence-driven selection plus further MIR processing.
Any attempt to add divergence information to one separate combiner transformation will lead to such "heuristics" that fix things case by case. That happens because the combiner is a peephole optimizer designed around common arithmetic rules, without any notion of divergence. Divergence is target-specific and should not affect the DAG combiner.
Summarizing all the above: without this hack, I have a regression.
And I expect there will be more such hacks if we keep influencing the combiner with divergence-driven transformations. The other approach, post-ISel MIR optimization, looks too complicated to you.
I have no other ideas so far.
BTW, I am not even sure that 4 scalar arithmetic operations instead of a v_perm_b32 are always really bad. We increase code size but save one VGPR. In some cases with high VGPR register pressure, that is good. In general, it is not easy to formalize in which cases a transformation based on node divergence is profitable.

foad added inline comments.Jan 21 2022, 9:47 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12592–12594

Summarizing all the above: without this hack, I have a regression.

I suggest ignoring the regression. And if it turns out to be important, I suggest fixing it by improving v_perm pattern recognition in SITargetLowering::performOrCombine, not by tweaking isReassocProfitable.

alex-t updated this revision to Diff 402570.Jan 24 2022, 9:36 AM

Removed the constant folding heuristic, which proved useless.

LGTM provided comments are fixed.

llvm/include/llvm/CodeGen/TargetLowering.h
3295

Still needs comment.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12595

Early return instead.

12597

Demorgan this.

alex-t updated this revision to Diff 402653.EditedJan 24 2022, 1:47 PM

Added comment to common target hook declaration. Condition changed.

alex-t marked 4 inline comments as done.Jan 24 2022, 1:48 PM

LGTM provided comments are fixed.

Fixed. Now, could you please accept?

foad added a comment.Jan 25 2022, 3:38 AM

Looks reasonable, just some minor comments.

llvm/include/llvm/CodeGen/TargetLowering.h
3295

That's better than nothing, but it still doesn't describe what N0 and N1 are or what reassociation is being controlled. If I understand correctly, it is something like: (op N0:(op N00, N01), N1) -> (op (op N00, N1), N01) where both ops are the same associative commutative op, and N01 is a constant.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12583

This can use any_of and N->uses().

12586

Would be better to check that the use is the memory address, not (e.g.) the value being stored by a store instruction. But I don't know if there's a simple way to check that.

llvm/lib/Target/AMDGPU/SIISelLowering.h
454

We don't normally repeat virtual on the override -- not sure if it matters.

alex-t updated this revision to Diff 403038.Jan 25 2022, 2:24 PM

Added a detailed comment explaining the new target hook.
Memory access pattern check changed to match the address operand exactly.
Minor fixes.

alex-t updated this revision to Diff 403039.Jan 25 2022, 2:33 PM
alex-t marked an inline comment as done.

removed accidentally added files

alex-t added inline comments.Jan 25 2022, 2:33 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12586

That is a good idea, but then I cannot use any_of: its iterator dereferences to an SDNode, while I need the SDUse instead. So the check is going to remain a separate function. Also, although this makes the check more precise, it uncovers the permute regression, which now becomes explicit.
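Putting the pieces of this review together, roughly what the hook and its address check could look like (a reconstruction from the comments above, not a copy of the committed diff; names and details such as how the address use is recognized are illustrative):

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Is the value produced by N used as the address of some memory access
// (and not merely, e.g., as the value being stored)?
static bool hasMemAddressUser(SDNode *N) {
  for (SDNode::use_iterator I = N->use_begin(), E = N->use_end(); I != E; ++I)
    if (auto *M = dyn_cast<MemSDNode>(*I))
      if (M->getBasePtr().getNode() == N)
        return true;
  return false;
}

// Refuse the generic (op (op x, c1), y) -> (op (op x, y), c1) reassociation
// when it would pull a divergent y into a uniform (op x, c1), unless the
// expression forms a base+offset address that a memory access can fold.
static bool isReassocProfitableSketch(SelectionDAG &DAG, SDValue N0,
                                      SDValue N1) {
  if (!N0.hasOneUse())
    return false;
  if (N0->isDivergent() || !N1->isDivergent())
    return true;
  // N0's single user is the outer operation; after reassociation it becomes
  // the base+offset node, so it is the one that should feed a memory address.
  return DAG.isBaseWithConstantOffset(N0) &&
         hasMemAddressUser(*N0->use_begin());
}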

rampitec added inline comments.Jan 25 2022, 2:38 PM
llvm/test/CodeGen/AMDGPU/permute.ll
108–109

Add FIXME here that v_perm_b32 with 0xffff0500 mask can be used.

158–159

Same here with mask 0xffff0500.

alex-t updated this revision to Diff 403051.Jan 25 2022, 3:19 PM

FIXME v_perm_b32 added

alex-t marked 2 inline comments as done.Jan 25 2022, 3:19 PM
This revision is now accepted and ready to land.Jan 25 2022, 3:20 PM
This revision was landed with ongoing or failed builds.Jan 26 2022, 4:30 AM
This revision was automatically updated to reflect the committed changes.