This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Enable divergence-driven XNOR selection
ClosedPublic

Authored by alex-t on Dec 24 2021, 6:54 AM.

Details

Summary

Currently the not (xor_one_use) pattern is always selected to S_XNOR irrespective of the node divergence.
This relies on a further custom lowering pass which converts the instruction to VALU if necessary, replacing it with V_NOT_B32 (V_XOR_B32)
on those targets which have no V_XNOR.
This change enables patterns that explicitly select not (xor_one_use) to the appropriate form based on divergence.
We assume that xor (not) has already been turned into not (xor) by the combiner.
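A rough sketch of the decision the new patterns encode (the real patterns are TableGen, not C++; the enum and helper below are purely illustrative and assume SelectionDAG's SDNode::isDivergent() as the divergence source):

#include "llvm/CodeGen/SelectionDAGNodes.h"
using namespace llvm;

enum class XnorLowering {
  ScalarXnor,     // S_XNOR_B32 / S_XNOR_B64
  VectorXnor,     // V_XNOR_B32 on subtargets that have it
  VectorNotOfXor  // V_NOT_B32 of a V_XOR_B32 otherwise
};

// Pick the lowering for a (not (xor_one_use a, b)) node from its divergence
// bit instead of fixing it up after selection.
static XnorLowering chooseXnorLowering(const SDNode *N, bool HasVXnor) {
  if (!N->isDivergent())
    return XnorLowering::ScalarXnor;
  return HasVXnor ? XnorLowering::VectorXnor : XnorLowering::VectorNotOfXor;
}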

Diff Detail

Event Timeline

alex-t created this revision.Dec 24 2021, 6:54 AM
alex-t requested review of this revision.Dec 24 2021, 6:54 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 24 2021, 6:54 AM
Herald added a subscriber: wdng. · View Herald Transcript
alex-t updated this revision to Diff 396159.Dec 24 2021, 6:57 AM

LIT test file attributes corrected

This looks like a regression in xnor.ll :

	s_not_b32 s0, s0                        	v_not_b32_e32 v0, v0
	v_xor_b32_e32 v0, s0, v0                        v_xor_b32_e32 v0, s4, v0

but it is not really. All the nodes in the example are divergent, and the divergent (xor x, -1) is selected to V_NOT_B32 since https://reviews.llvm.org/D115884 was committed.
S_NOT_B32 appears on the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for targets that have no V_XNOR. This optimization relies on the fact that the NOT operand is an SGPR and V_XOR_B32_e32 can accept an SGPR as its first source operand.
I am not sure if it is always safe. VALU instruction execution is controlled by the EXEC mask, but SALU execution is not.

foad added a reviewer: foad.Dec 31 2021, 3:57 AM
foad added inline comments.
llvm/lib/Target/AMDGPU/VOP2Instructions.td
729

Why are these two patterns required? Surely we can just let the NOT and the XOR be selected individually. The only effect of these patterns is to swap the order of the NOT and the XOR, but if that is beneficial then surely it should be done as a DAG combine instead?

llvm/lib/Target/AMDGPU/VOPInstructions.td
797 ↗(On Diff #396159)

Please rebase this patch on D116241.

llvm/test/CodeGen/AMDGPU/divergence-driven-xnor.ll
2

Also run this test on a subtarget that has v_xnor instructions?

This looks like a regression in xnor.ll :

	s_not_b32 s0, s0                        	v_not_b32_e32 v0, v0
	v_xor_b32_e32 v0, s0, v0                        v_xor_b32_e32 v0, s4, v0

but it is not really. All the nodes in the example are divergent, and the divergent (xor x, -1) is selected to V_NOT_B32 since https://reviews.llvm.org/D115884 was committed.
S_NOT_B32 appears on the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for targets that have no V_XNOR. This optimization relies on the fact that the NOT operand is an SGPR and V_XOR_B32_e32 can accept an SGPR as its first source operand.
I am not sure if it is always safe. VALU instruction execution is controlled by the EXEC mask, but SALU execution is not.

This is indeed a regression. It is always safe to keep s_not_b32 on SALU. Also note this effectively makes SIInstrInfo::lowerScalarXnor() useless. This is why XNOR was left behind by D111907.

alex-t added a comment.EditedJan 6 2022, 3:39 AM

This looks like a regression in xnor.ll :

	s_not_b32 s0, s0                        	v_not_b32_e32 v0, v0
	v_xor_b32_e32 v0, s0, v0                        v_xor_b32_e32 v0, s4, v0

but it is not really. All the nodes in the example are divergent, and the divergent (xor x, -1) is selected to V_NOT_B32 since https://reviews.llvm.org/D115884 was committed.
S_NOT_B32 appears on the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for targets that have no V_XNOR. This optimization relies on the fact that the NOT operand is an SGPR and V_XOR_B32_e32 can accept an SGPR as its first source operand.
I am not sure if it is always safe. VALU instruction execution is controlled by the EXEC mask, but SALU execution is not.

This is indeed a regression. It is always safe to keep s_not_b32 on SALU. Also note this effectively makes SIInstrInfo::lowerScalarXnor() useless. This is why XNOR was left behind by D111907.

SIInstrInfo::lowerScalarXnor() is exactly the part of the "manual" SALU-to-VALU lowering that I am trying to get rid of.
The divergent "not" must be selected to V_NOT_B32_e32/64, otherwise we still have illegal VGPR-to-SGPR copies.
This happens because the divergent "not" node has divergent operands, and their result will likely be in a VGPR.
Also, we should select everything correctly first and apply peephole optimizations afterwards.
In other words, we should not "cheat ourselves" during the selection. The selection should be done faithfully, according to the node divergence bit.
Then we can apply the optimization where it is safe.
Note that this is not the only case where we would like to further optimize the code after selection.
I'm planning to add a separate pass for that later.

We cannot solve the problem in the custom selection procedure because the NOT node operand has not yet been selected, so we do not know if it is an SGPR or a VGPR.
The only way, for now, is to post-process not(xor)/xor(not) in SIFixSGPRCopies. This may be considered a temporary hack until we have a proper pass for that.
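A minimal sketch of the kind of post-selection rewrite described here (not the code from this patch; it assumes LLVM's MachineIR APIs, and the helper name and exact checks are illustrative):

#include "AMDGPU.h"
#include "SIInstrInfo.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
using namespace llvm;

// If a selected V_NOT_B32_e32 reads an SGPR, re-emit it as S_NOT_B32 so its
// (divergent) user can consume the scalar result directly. Real code would
// also have to verify that every user accepts an SGPR source operand.
static bool shrinkNotToScalar(MachineInstr &Not, const SIInstrInfo *TII,
                              MachineRegisterInfo &MRI) {
  if (Not.getOpcode() != AMDGPU::V_NOT_B32_e32 || !Not.getOperand(1).isReg())
    return false;
  Register Src = Not.getOperand(1).getReg();
  if (!Src.isVirtual() ||
      !TII->getRegisterInfo().isSGPRClass(MRI.getRegClass(Src)))
    return false; // the input is already in a VGPR; keep the VALU form
  Register SDst = MRI.createVirtualRegister(&AMDGPU::SReg_32RegClass);
  BuildMI(*Not.getParent(), Not, Not.getDebugLoc(),
          TII->get(AMDGPU::S_NOT_B32), SDst)
      .addReg(Src);
  MRI.replaceRegWith(Not.getOperand(0).getReg(), SDst);
  Not.eraseFromParent();
  return true;
}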

This looks like a regression in xnor.ll :

	s_not_b32 s0, s0                        	v_not_b32_e32 v0, v0
	v_xor_b32_e32 v0, s0, v0                        v_xor_b32_e32 v0, s4, v0

but it is not really. All the nodes in the example are divergent, and the divergent (xor x, -1) is selected to V_NOT_B32 since https://reviews.llvm.org/D115884 was committed.
S_NOT_B32 appears on the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for targets that have no V_XNOR. This optimization relies on the fact that the NOT operand is an SGPR and V_XOR_B32_e32 can accept an SGPR as its first source operand.
I am not sure if it is always safe. VALU instruction execution is controlled by the EXEC mask, but SALU execution is not.

This is indeed a regression. It is always safe to keep s_not_b32 on SALU. Also note this effectively makes SIInstrInfo::lowerScalarXnor() useless. This is why XNOR was left behind by D111907.

SIInstrInfo::lowerScalarXnor() is exactly the part of the "manual" SALU-to-VALU lowering that I am trying to get rid of.
The divergent "not" must be selected to V_NOT_B32_e32/64, otherwise we still have illegal VGPR-to-SGPR copies.
This happens because the divergent "not" node has divergent operands, and their result will likely be in a VGPR.
Also, we should select everything correctly first and apply peephole optimizations afterwards.
In other words, we should not "cheat ourselves" during the selection. The selection should be done faithfully, according to the node divergence bit.
Then we can apply the optimization where it is safe.
Note that this is not the only case where we would like to further optimize the code after selection.
I'm planning to add a separate pass for that later.

We cannot solve the problem in the custom selection procedure because the NOT node operand has not yet been selected, so we do not know if it is an SGPR or a VGPR.
The only way, for now, is to post-process not(xor)/xor(not) in SIFixSGPRCopies. This may be considered a temporary hack until we have a proper pass for that.

SIInstrInfo::lowerScalarXnor() is dead after your patch and thus the patch has to remove it.

Then this is a clear regression, so if this requires a separate peephole later we need that peephole first and make sure the test does not regress.

foad added a comment.Jan 7 2022, 1:14 AM

SIInstrInfo::lowerScalarXnor() is dead after your patch

I don't understand why it is dead. In general moveToVALU moves instructions to VALU if any of their inputs are VGPRs, which can happen even if the result is uniform -- e.g. if some of the inputs are derived from a floating point calculation which had to use VALU instructions.

alex-t added a comment.Jan 7 2022, 3:10 AM

SIInstrInfo::lowerScalarXnor() is dead after your patch

I don't understand why it is dead. In general moveToVALU moves instructions to VALU if any of their inputs are VGPRs, which can happen even if the result is uniform -- e.g. if some of the inputs are derived from a floating point calculation which had to use VALU instructions.

What moveToVALU does is convert everything downstream to VALU, following the data dependency graph.
This is generally not what we want. First, any VGPR-to-SGPR copy with a uniform VGPR source is going to be lowered to a V_READFIRSTLANE_B32.
Nevertheless, in some cases (and there are plenty of them) it is profitable to convert the whole data chain to VALU. The decision should be made based on the results of context analysis.
In case it is considered profitable, we still need to be able to convert the whole variety of operations from SALU to VALU.
That is why none of the potential moveToVALU callees have been deleted so far.
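A grossly simplified sketch of that cascading behaviour (assuming SIInstrInfo::getVALUOp(); the real moveToVALU also legalizes operands, fixes register classes and handles many special cases):

#include "SIInstrInfo.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
using namespace llvm;

// Once one instruction is forced onto the VALU, every SALU user of its
// (now VGPR) result has to follow, all the way down the dependency graph.
static void moveToVALUSketch(MachineInstr &Root, const SIInstrInfo &TII,
                             MachineRegisterInfo &MRI) {
  SmallVector<MachineInstr *, 16> Worklist{&Root};
  while (!Worklist.empty()) {
    MachineInstr *MI = Worklist.pop_back_val();
    unsigned NewOpc = TII.getVALUOp(*MI); // e.g. S_XOR_B32 -> V_XOR_B32_e64
    if (NewOpc == AMDGPU::INSTRUCTION_LIST_END)
      continue; // already VALU or no direct counterpart
    MI->setDesc(TII.get(NewOpc));
    if (!MI->getOperand(0).isReg() || !MI->getOperand(0).isDef())
      continue;
    for (MachineInstr &User : MRI.use_instructions(MI->getOperand(0).getReg()))
      Worklist.push_back(&User);
  }
}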

alex-t added a comment.Jan 7 2022, 3:21 AM

This looks like a regression in xnor.ll :

	s_not_b32 s0, s0                        	v_not_b32_e32 v0, v0
	v_xor_b32_e32 v0, s0, v0                        v_xor_b32_e32 v0, s4, v0

but it is not really. All the nodes in the example are divergent, and the divergent (xor x, -1) is selected to V_NOT_B32 since https://reviews.llvm.org/D115884 was committed.
S_NOT_B32 appears on the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for targets that have no V_XNOR. This optimization relies on the fact that the NOT operand is an SGPR and V_XOR_B32_e32 can accept an SGPR as its first source operand.
I am not sure if it is always safe. VALU instruction execution is controlled by the EXEC mask, but SALU execution is not.

This is indeed a regression. It is always safe to keep s_not_b32 on SALU. Also note this effectively makes SIInstrInfo::lowerScalarXnor() useless. This is why XNOR was left behind by D111907.

SIInstrInfo::lowerScalarXnor() is exactly the part of the "manual" SALU-to-VALU lowering that I am trying to get rid of.
The divergent "not" must be selected to V_NOT_B32_e32/64, otherwise we still have illegal VGPR-to-SGPR copies.
This happens because the divergent "not" node has divergent operands, and their result will likely be in a VGPR.
Also, we should select everything correctly first and apply peephole optimizations afterwards.
In other words, we should not "cheat ourselves" during the selection. The selection should be done faithfully, according to the node divergence bit.
Then we can apply the optimization where it is safe.
Note that this is not the only case where we would like to further optimize the code after selection.
I'm planning to add a separate pass for that later.

We cannot solve the problem in the custom selection procedure because the NOT node operand has not yet been selected, so we do not know if it is an SGPR or a VGPR.
The only way, for now, is to post-process not(xor)/xor(not) in SIFixSGPRCopies. This may be considered a temporary hack until we have a proper pass for that.

SIInstrInfo::lowerScalarXnor() is dead after your patch and thus the patch has to remove it.

Then this is a clear regression, so if this requires a separate peephole later we need that peephole first and make sure the test does not regress.

I would not say that we need the peephole "first", because until this change is applied there is nothing to optimize.
The only way I see here is to apply the peephole and the selection change at once. Hence we have to keep the pieces of the future post-ISel optimizer within SIFixSGPRCopies.

alex-t updated this revision to Diff 398148.Jan 7 2022, 7:56 AM

Added post-processing of the selected machine IR. This brings it on par with the existing selection mechanism.

alex-t marked 2 inline comments as done.Jan 7 2022, 7:58 AM
foad added inline comments.Jan 7 2022, 8:16 AM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
924 ↗(On Diff #398148)

I have not read this code but it looks very complicated. I think this would be much better done as a DAG combine analogous to the existing SITargetLowering::reassociateScalarOps -- which is about 1/5th of the size of this function!

alex-t added inline comments.Jan 7 2022, 1:12 PM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
924 ↗(On Diff #398148)

This is not about the DAG at all, and it is not about a DAG combine.
SITargetLowering::reassociateScalarOps exists to fix the instruction selection that is done in a wrong way. The instruction selection in LC was initially done incorrectly.
Namely, everything is selected to SALU and then manually lowered to VALU if there is at least one VGPR-to-SGPR copy. The whole work I am trying to do here is to change this so that we explicitly select the correct form of the instruction. This function is 5 times larger just because it includes everything done in reassociateScalarOps plus the corresponding moveToVALU logic.

foad added a comment.Jan 10 2022, 1:13 AM

SITargetLowering::reassociateScalarOps exists to fix the instruction selection that is done in a wrong way.

No! It's not trying to fix anything, it's just trying to reassociate expressions to keep more of the intermediate results uniform, so we use fewer vgprs and fewer valu instructions. For example:

// v1 = (v0 + s0) + s1
v_add v1, v0, s0
v_add v1, v1, s1
 -->
// v1 = (s0 + s1) + v0 ; reassociated
s_add s2, s0, s1
v_add v1, s2, v0

This is exactly the same kind of thing you need to do to restore the missed optimization in xnor.ll:

// v1 = ~(s0 ^ v0)
v_xor v1, s0, v0
v_not v1, v1
 -->
// v1 = ~s0 ^ v0
s_not s1, s0
v_xor v1, s1, v0
foad added a comment.Jan 10 2022, 2:20 AM

This looks like a regression in xnor.ll :

	s_not_b32 s0, s0                        	v_not_b32_e32 v0, v0
	v_xor_b32_e32 v0, s0, v0                        v_xor_b32_e32 v0, s4, v0

but it is not really. All the nodes in the example are divergent, and the divergent (xor x, -1) is selected to V_NOT_B32 since https://reviews.llvm.org/D115884 was committed.
S_NOT_B32 appears on the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for targets that have no V_XNOR. This optimization relies on the fact that the NOT operand is an SGPR and V_XOR_B32_e32 can accept an SGPR as its first source operand.
I am not sure if it is always safe. VALU instruction execution is controlled by the EXEC mask, but SALU execution is not.

To repeat what I have already said elsewhere: this is not a correctness issue. This is just an optimization, where you can choose to calculate either ~s0 ^ v0 or s0 ^ ~v0 (or ~(s0 ^ v0)) and get exactly the same result. The optimization is to prefer the first form, because the intermediate result ~s0 is uniform, so you can keep it in an sgpr and not waste vgprs and valu instructions.
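The algebraic fact being relied on here can be checked with arbitrary constants (the values below are illustrative):

// Negating either operand of an XOR is the same as negating its result,
// so ~(s0 ^ v0), ~s0 ^ v0 and s0 ^ ~v0 are interchangeable.
static_assert(~(0xA5u ^ 0x0Fu) == (~0xA5u ^ 0x0Fu), "xnor identity");
static_assert(~(0xA5u ^ 0x0Fu) == (0xA5u ^ ~0x0Fu), "xnor identity");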

alex-t added a comment.EditedJan 10 2022, 7:37 AM

This looks like a regression in xnor.ll :

	s_not_b32 s0, s0                        	v_not_b32_e32 v0, v0
	v_xor_b32_e32 v0, s0, v0                        v_xor_b32_e32 v0, s4, v0

but it is not really. All the nodes in the example are divergent, and the divergent (xor x, -1) is selected to V_NOT_B32 since https://reviews.llvm.org/D115884 was committed.
S_NOT_B32 appears on the left because of the custom optimization that converts S_XNOR_B32 back to NOT (XOR) for targets that have no V_XNOR. This optimization relies on the fact that the NOT operand is an SGPR and V_XOR_B32_e32 can accept an SGPR as its first source operand.
I am not sure if it is always safe. VALU instruction execution is controlled by the EXEC mask, but SALU execution is not.

To repeat what I have already said elsewhere: this is not a correctness issue. This is just an optimization, where you can choose to calculate either ~s0 ^ v0 or s0 ^ ~v0 (or ~(s0 ^ v0)) and get exactly the same result. The optimization is to prefer the first form, because the intermediate result ~s0 is uniform, so you can keep it in an sgpr and not waste vgprs and valu instructions.

SITargetLowering::reassociateScalarOps exists to fix the instruction selection that is done in a wrong way.

No! It's not trying to fix anything, it's just trying to reassociate expressions to keep more of the intermediate results uniform, so we use fewer vgprs and fewer valu instructions. For example:

// v1 = (v0 + s0) + s1
v_add v1, v0, s0
v_add v1, v1, s1
 -->
// v1 = (s0 + s1) + v0 ; reassociated
s_add s2, s0, s1
v_add v1, s2, v0

This is exactly the same kind of thing you need to do to restore the missed optimization in xnor.ll:

// v1 = ~(s0 ^ v0)
v_xor v1, s0, v0
v_not v1, v1
 -->
// v1 = ~s0 ^ v0
s_not s1, s0
v_xor v1, s1, v0

I am sorry. I have really misled you by confusing SITargetLowering::reassociateScalarOps with SIInstrInfo::lowerScalarXnor, which we recently discussed with @rampitec.
SITargetLowering::reassociateScalarOps does nothing in our case because both the XOR and the NOT nodes are divergent.
The difference between what we had before my change to the selection and what we have now is the following:
Before:

We selected both XOR and NOT to S_XOR_B32 and S_NOT_B32 and then optimized the whole pattern in SIInstrInfo::lowerScalarXnor and passed it to moveToVALU.

Now:

We select the divergent NOT to V_NOT_B32_e32 and the divergent XOR to V_XOR_B32_e64. The selection is correct, but we miss the opportunity to exploit the fact that even a divergent NOT may be selected to S_NOT_B32 without losing correctness when its input is an SGPR. But the latter fact is only available AFTER the selection is done.

During the selection process, we cannot predict whether the divergent NOT operands are SGPRs because we are walking the DAG in post order and the node operands are not yet selected.
So, we can only apply the optimization AFTER the selection is done.

foad added a comment.Jan 10 2022, 7:45 AM

Now:

We select the divergent NOT to V_NOT_B32_e32 and the divergent XOR to V_XOR_B32_e64. The selection is correct, but we miss the opportunity to exploit the fact that even a divergent NOT may be selected to S_NOT_B32 without losing correctness.

No, you cannot correctly select divergent NOT to S_NOT_B32. That is not what was happening before your patch (see https://reviews.llvm.org/D116270?vs=on&id=396159#change-5HrmrjqhUdXJ). What was happening was that an input like ~(uniform ^ divergent) was being "reassociated" to ~uniform ^ divergent so it could be correctly selected to S_NOT + V_XOR. I assume this was done with a very clever selection pattern, but I am suggesting that instead of that you could implement it as a DAG combine (to do the reassociation), so there is no need for clever selection patterns.

alex-t added a comment.EditedJan 10 2022, 7:56 AM

Now:

We select the divergent NOT to V_NOT_B32_e32 and the divergent XOR to V_XOR_B32_e64. The selection is correct, but we miss the opportunity to exploit the fact that even a divergent NOT may be selected to S_NOT_B32 without losing correctness.

No, you cannot correctly select divergent NOT to S_NOT_B32. That is not what was happening before your patch (see https://reviews.llvm.org/D116270?vs=on&id=396159#change-5HrmrjqhUdXJ). What was happening was that an input like ~(uniform ^ divergent) was being "reassociated" to ~uniform ^ divergent so it could be correctly selected to S_NOT + V_XOR. I assume this was done with a very clever selection pattern, but I am suggesting that instead of that you could implement it as a DAG combine (to do the reassociation), so there is no need for clever selection patterns.

Once again, in my case BOTH nodes (not,xor) are divergent!

 %s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0
DIVERGENT:       %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
DIVERGENT:       %xor = xor i32 %v, %s.load
DIVERGENT:       %d = xor i32 %xor, -1
DIVERGENT:       store i32 %d, i32 addrspace(1)* %out.load, align 4

And the correct selection is V_NOT_B32_e32.
But after the selection is done, we can in some cases rearrange instructions so that the V_NOT_B32_e32 is replaced with an S_NOT_B32 while keeping the code correct.
The long test in my function does this.

Why was it S_NOT_B32 before my patch? Because of the SIInstrInfo::lowerScalarXnor + moveToVALU "magic". And please note that the latter also works AFTER the selection is done, when all the register classes of the instruction inputs are known.

foad added a comment.Jan 10 2022, 8:04 AM

Once again, in my case BOTH nodes (not,xor) are divergent!

 %s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0
DIVERGENT:       %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
DIVERGENT:       %xor = xor i32 %v, %s.load
DIVERGENT:       %d = xor i32 %xor, -1
DIVERGENT:       store i32 %d, i32 addrspace(1)* %out.load, align 4

I know. I am suggesting that a DAG combine can rewrite this code to the equivalent of:

                 %s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0
DIVERGENT:       %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
                 %not = xor i32 %s.load, -1
DIVERGENT:       %d = xor i32 %v, %not
DIVERGENT:       store i32 %d, i32 addrspace(1)* %out.load, align 4

Now %not is uniform, so it is trivial to select it to s_not.
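A sketch of the combine being suggested (not the code that eventually landed; the function name is illustrative and the usual SelectionDAG helpers are assumed):

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Fold (xor (xor Uniform, Divergent), -1) into (xor (xor Uniform, -1), Divergent)
// so the NOT part becomes uniform and can be selected to s_not_b32.
static SDValue reassociateXorNot(SDNode *N, SelectionDAG &DAG) {
  if (N->getOpcode() != ISD::XOR || !isAllOnesConstant(N->getOperand(1)))
    return SDValue();
  SDValue Inner = N->getOperand(0);
  if (Inner.getOpcode() != ISD::XOR || !Inner.hasOneUse())
    return SDValue();
  SDValue A = Inner.getOperand(0), B = Inner.getOperand(1);
  if (A->isDivergent())
    std::swap(A, B);                  // put the uniform operand in A
  if (A->isDivergent() || !B->isDivergent())
    return SDValue();                 // only interesting for mixed operands
  SDLoc DL(N);
  EVT VT = N->getValueType(0);
  SDValue NotA = DAG.getNOT(DL, A, VT);
  return DAG.getNode(ISD::XOR, DL, VT, NotA, B);
}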

alex-t added a comment.EditedJan 10 2022, 8:09 AM

Once again, in my case BOTH nodes (not,xor) are divergent!

 %s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0
DIVERGENT:       %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
DIVERGENT:       %xor = xor i32 %v, %s.load
DIVERGENT:       %d = xor i32 %xor, -1
DIVERGENT:       store i32 %d, i32 addrspace(1)* %out.load, align 4

I know. I am suggesting that a DAG combine can rewrite this code to the equivalent of:

                 %s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0
DIVERGENT:       %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
                 %not = xor i32 %s.load, -1
DIVERGENT:       %d = xor i32 %v, %not
DIVERGENT:       store i32 %d, i32 addrspace(1)* %out.load, align 4

Now %not is uniform, so it is trivial to select it to s_not.

Okay. I have finally got the idea. Thanks :)

Once again, in my case BOTH nodes (not,xor) are divergent!

 %s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0
DIVERGENT:       %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
DIVERGENT:       %xor = xor i32 %v, %s.load
DIVERGENT:       %d = xor i32 %xor, -1
DIVERGENT:       store i32 %d, i32 addrspace(1)* %out.load, align 4

I know. I am suggesting that a DAG combine can rewrite this code to the equivalent of:

                 %s.load = load i32, i32 addrspace(4)* %s.kernarg.offset.cast, align 4, !invariant.load !0
DIVERGENT:       %v = call i32 @llvm.amdgcn.workitem.id.x(), !range !1
                 %not = xor i32 %s.load, -1
DIVERGENT:       %d = xor i32 %v, %not
DIVERGENT:       store i32 %d, i32 addrspace(1)* %out.load, align 4

Now %not is uniform, so it is trivial to select it to s_not.

The idea is brilliant. Unfortunately, it can only be implemented by changing LLVM common code.
If you look at SITargetLowering::reassociateScalarOps you can see the operand constantness check and the corresponding comment:

// If either operand is constant this will conflict with
// DAGCombiner::ReassociateOps().
if (DAG.isConstantIntBuildVectorOrConstantInt(Op0) ||
    DAG.isConstantIntBuildVectorOrConstantInt(Op1))
  return SDValue();

Removing this guard leads to infinitely transforming the pattern back and forth between SITargetLowering::reassociateScalarOps and DAGCombiner::ReassociateOps().
The former transforms (xor (xor uniform, divergent), -1) into (xor (xor uniform, -1), divergent), but the latter transforms it back by applying this rule:

if (N0.hasOneUse()) {
  // Reassociate: (op (op x, c1), y) -> (op (op x, y), c1)
  //              iff (op x, c1) has one use
  if (SDValue OpNode = DAG.getNode(Opc, SDLoc(N0), VT, N00, N1))
    return DAG.getNode(Opc, DL, VT, OpNode, N01);
  return SDValue();
}

The infinite loop forms because we add the users of the transformed node to the worklist. Hence the user comes back to the combiner on the next step, and the same pattern gets transformed again and again.
What I can do here is change DAGCombiner::ReassociateOps() to make it consider node divergence. Not sure if this is suitable for upstreaming.

foad added a comment.Jan 11 2022, 9:37 AM

Removing this guard leads to infinitely transforming the pattern back and forth between SITargetLowering::reassociateScalarOps and DAGCombiner::ReassociateOps().
The former transforms (xor (xor uniform, divergent), -1) into (xor (xor uniform, -1), divergent), but the latter transforms it back by applying this rule:

if (N0.hasOneUse()) {
  // Reassociate: (op (op x, c1), y) -> (op (op x, y), c1)
  //              iff (op x, c1) has one use
  if (SDValue OpNode = DAG.getNode(Opc, SDLoc(N0), VT, N00, N1))
    return DAG.getNode(Opc, DL, VT, OpNode, N01);
  return SDValue();
}

Oh dear. I don't really understand why the common code is doing this. The idea seems to come from a FIXME added in e260ed8628bbe245ffc39b130d121f2f50dc0bce

I think it is worth trying to change this generic combine to give up if x is uniform and y is divergent.

alex-t added a comment.EditedJan 11 2022, 12:11 PM

I think it is worth trying to change this generic combine to give up if x is uniform and y is divergent.

That is exactly what I have done for now:

if (N0.hasOneUse() &&
    // In this case the transformation spoils the opportunity
    // to keep the whole operation scalar.
    !(!N0->isDivergent() && N1->isDivergent())) {
  // Reassociate: (op (op x, c1), y) -> (op (op x, y), c1)
  //              iff (op x, c1) has one use
  if (SDValue OpNode = DAG.getNode(Opc, SDLoc(N0), VT, N00, N1))
    return DAG.getNode(Opc, DL, VT, OpNode, N01);
  return SDValue();
}

but I need to test that it does not break anything else :)

In general, I don't like the idea of making DAGCombiner::reassociateOpsCommutative take divergence into account.

We have 2 different heuristics, applied in the hope that they are profitable.

Considering the user of the arithmetic sub-tree as the root:

  1. Keep a commutative binary operation with one constant operand as close to the root of the tree as possible. This lets memory access patterns take advantage of the constant offset.
  2. Given that a constant operand is uniform, and a divergent node may use a uniform node but not vice versa, keep a commutative binary operation with one constant operand as far from the root as possible, in the hope that it will be selected to a scalar operation.
        t104: i32 = shl [ORD=15] [ID=17] # D:1 t27, Constant:i32<2> [ID=9]
      t105: i64 = zero_extend [ORD=15] [ID=19] # D:1 t104
    t31: i64 = add [ORD=15] [ID=29] # D:1 t97, t105
  t33: i64 = add [ORD=16] [ID=31] # D:1 t31, Constant:i64<4> [ID=6]
t50: i32,ch = load<(load (s32) from %ir.b_ptr, addrspace 1)> [ORD=18] [ID=33] # D:1 t0, t33, undef:i64 [ID=2] # D:0

In the example above, the constant goes to the offset field of the load, and the rest of the arithmetic sub-tree forms the base. Since it is exposed to further combining, we end up with a simple move instead of 2 additions. Hence, reassociating to move the constant onto the uniform operation is not profitable here.

In our XOR example, we'd like to move the constant operand up to the other xor, which already has one uniform operand.
To avoid introducing yet another regression while fixing the current one, we would have to expose the details of these heuristics to the common code. For example: moving the constant operand up in the commutative pattern is good if it is an XOR, but unwanted if it is an ADD that forms the base of a load or store.
In general, it is a good idea to let the target decide whether reassociating a concrete operation is profitable. So we may want to add a TargetLoweringInfo::reassociateOpsCommutative hook to let the target try the reassociation first, and then apply the general reassociation if the target has not changed the pattern.

alex-t updated this revision to Diff 400620.Jan 17 2022, 12:29 PM

DAG combiner hook added to control divergence-driven peephole optimizations.

I like this in general. @foad?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12586

Please simplify this condition. Probably make a separate "return false". The part "((!(!" is unreadable.

alex-t updated this revision to Diff 401633.Jan 20 2022, 7:09 AM

condition was made more readable

alex-t marked an inline comment as done.Jan 20 2022, 7:09 AM
foad added a comment.Jan 20 2022, 8:10 AM

Overall this seems reasonable. The only alternative I can think of would need more complicated isel patterns that do the reassociation gated by predicates that check the divergence, something like:

let Predicate = doesNotHaveXNOR in
def : GCNPat<
  (i32 (xor (xor_oneuse i32:$src0, i32:$src1), i32:$src2)),
  (i32 (V_XOR_B32 $src0, (V_XOR_B32 $src1, $src2))),
  [{ return src0->isDivergent() && !src1->isDivergent() && !src2->isDivergent(); }]
>;

... and lots of commuted versions of the same thing, and the same for any other isel pattern that matches something that could be reassociated. So that doesn't sound very scalable.

Currently not (xor_one_use) pattern is always selected to S_XNOR irrelative od the node divergence.

"irrelative od" -> "irrespective of" or "regardless of".

llvm/include/llvm/CodeGen/TargetLowering.h
3295

Needs a proper descriptive comment.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12584

Return early if !N0.hasOneUse().

12586

De-Morgan this to N0->isDivergent() || !N1->isDivergent().

12592–12594

I don't understand this heuristic. Can you give an example of when it would help?

alex-t added inline comments.Jan 20 2022, 9:04 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12592–12594

I could just demonstrate the concrete example, but I would need to paste the DAGs here, which looks like overkill. So I will try to explain without the drawing.
Let's imagine we have a sub-tree consisting of commutative arithmetic operations.
Let us have a path in the tree such that each node has at least one constant operand.
Given that, it is very likely that this sub-tree is going to be simplified by the combiner through arithmetic rules and constant folding.
This heuristic gives priority to such constant folding over keeping the outer node uniform.

%and = and i32 %tmp, 16711935     ; 0x00ff00ff
%tmp1 = and i32 %arg1, 4294967040 ; 0xffffff00
%tmp2 = or i32 %tmp1, -65536
%tmp3 = or i32 %tmp2, %and

This is folded and can be selected to v_perm_b32 with this heuristic, but will be 4 scalar operations without it.
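A quick constant check of that folding (the input values below are arbitrary; only the masks come from the snippet above):

// With these masks the four operations reduce to
// 0xffff : byte1(arg1) : byte0(tmp), i.e. a byte select that a single
// v_perm_b32 can express.
constexpr unsigned foldMasks(unsigned Tmp, unsigned Arg1) {
  unsigned And  = Tmp & 0x00ff00ffu;
  unsigned Tmp1 = Arg1 & 0xffffff00u;
  unsigned Tmp2 = Tmp1 | 0xffff0000u;
  return Tmp2 | And;
}
static_assert(foldMasks(0x12345678u, 0x9abcdef0u) == 0xffffde78u, "");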

foad added inline comments.Jan 20 2022, 9:25 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12592–12594

I still don't see why this would be useful in general. I think it means we should do this reassociation:
(op (op n00, C), (op2 n10, C2)) --> (op (op n00, (op2 n10, C2)), C)
where op2 is commutative but not necessarily the same as op. E.g. (x|C)|(z&C2) --> (x|(z&C2))|C

[image: initial DAG]

[image: DAG after the transformation and constant folding]

This can be selected to v_perm_b32.

foad added a comment.EditedJan 20 2022, 9:53 AM

I guess this particular case could be handled by improving the v_perm matching in SITargetLowering::performOrCombine, e.g. add a cases for both of:

or (op x, c1), c2 -> perm x, x, permute_mask(c1, c2)
or (perm x, x, c1), (op y, c2) -> perm x, y, permute_mask(c1, c2)

Or a single big case that handles:

or (or (op x, c1), c2), (op y, c3) -> perm x, y, permute_mask(c1, c2, c3)

[image: initial DAG]

[image: DAG after the transformation and constant folding]

This can be selected to v_perm_b32.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12592–12594

I did not say that it is useful in all cases. I said that there is a high probability that 2 or more constant operands in the same commutative arithmetic sub-tree will be folded by further steps of combining. This is just a heuristic.
And that is why I said that I don't like the combiner approach in general.
As soon as we factor divergence into one combining algorithm, we must teach all the other combining algorithms to leverage the divergence information. This is a huge amount of work. Without it, we likely restrict optimization opportunities just because we changed one of the peephole DAG transformations by introducing a new factor unknown to the rest of the combiner.
In my particular example, this heuristic lets us get

	v_mov_b32_e32 v3, 0xffff0500
	v_perm_b32 v2, s0, v2, v3

instead of

	s_and_b32 s0, s0, 0xff00
	s_or_b32 s0, s0, 0xffff0000
	v_and_b32_e32 v2, 0xff00ff, v2
	v_or_b32_e32 v2, s0, v2

In general, I agree that this is a fragile design and we will have to add more and more exceptions to keep all the optimizations working.
There are 2 clean ways:

  • Divergence-driven DAG combiner
  • Post ISel MIR optimizer

I consider the latter much simpler to implement.

foad added inline comments.Jan 21 2022, 3:57 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12592–12594

I understand why this heuristic helps you form v_perm_b32 in one specific case. I don't understand why "there is a high probability that the 2 or more constant operands in the same commutative arithmetic sub-tree will be folded by further steps of the combining". As far as I can see, it is just luck that it helps this particular v_perm_b32 combine, and it could just as easily make another case fail to combine. Do you have any evidence that it increases the total number of v_perm_b32 combines across a large corpus of code?

alex-t updated this revision to Diff 402009.Jan 21 2022, 9:09 AM

Bug fixed: Memory access DAG pattern check must ensure that "base + offset" pattern has MemSDNode users

alex-t added inline comments.Jan 21 2022, 9:31 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12592–12594

Yes. You are right. I cannot prove that this will help in another DAG pattern.
This is exactly what I tried to explain from the beginning.
The only reliable way to replace the "manual" selection that was done in SIFixSGPRCopies + SIInstrInfo::moveToVALU is fair divergence-driven selection plus further MIR processing.
Any attempt to add divergence information to one separate combiner transformation will lead to such "heuristics" that fix things case by case. That happens because the combiner is a peephole optimizer designed around common arithmetic rules, without any notion of divergence. Divergence is target-specific and should not affect the DAG combiner.
Summarizing all the above: without this hack, I have a regression.
And I expect there will be more such hacks if we keep influencing the combiner with divergence-driven transformations. The other approach, post-ISel MIR optimization, looks too complicated to you.
I have no other ideas so far.
BTW, I am not even sure that 4 scalar arithmetic operations instead of a v_perm_b32 are always really bad. We increase code size but save one VGPR. In some cases with high VGPR register pressure, that is good. In general, it is not easy to formalize in which cases a transformation based on node divergence is profitable.

foad added inline comments.Jan 21 2022, 9:47 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12592–12594

Summarizing all the above: without this hack, I have a regression.

I suggest ignoring the regression. And if it turns out to be important, I suggest fixing it by improving v_perm pattern recognition in SITargetLowering::performOrCombine, not by tweaking isReassocProfitable.

alex-t updated this revision to Diff 402570.Jan 24 2022, 9:36 AM

Removed the constant folding heuristic, which proved useless.

LGTM provided comments are fixed.

llvm/include/llvm/CodeGen/TargetLowering.h
3295

Still needs comment.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12595

Early return instead.

12597

Demorgan this.

alex-t updated this revision to Diff 402653.EditedJan 24 2022, 1:47 PM

Added comment to common target hook declaration. Condition changed.

alex-t marked 4 inline comments as done.Jan 24 2022, 1:48 PM

LGTM provided comments are fixed.

Fixed. Now, could you please accept?

foad added a comment.Jan 25 2022, 3:38 AM

Looks reasonable, just some minor comments.

llvm/include/llvm/CodeGen/TargetLowering.h
3295

That's better than nothing, but it still doesn't describe what N0 and N1 are or what reassociation is being controlled. If I understand correctly, it is something like: (op N0:(op N00, N01), N1) -> (op (op N00, N1), N01) where both ops are the same associative commutative op, and N01 is a constant.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12583

This can use any_of and N->uses().

12586

Would be better to check that the use is the memory address, not (e.g.) the value being stored by a store instruction. But I don't know if there's a simple way to check that.

llvm/lib/Target/AMDGPU/SIISelLowering.h
454

We don't normally repeat virtual on the override -- not sure if it matters.

alex-t updated this revision to Diff 403038.Jan 25 2022, 2:24 PM

Added a detailed comment explaining the new target hook.
Memory access pattern check changed to match the address operand exactly.
Minor fixes.

alex-t updated this revision to Diff 403039.Jan 25 2022, 2:33 PM
alex-t marked an inline comment as done.

removed accidentally added files

alex-t added inline comments.Jan 25 2022, 2:33 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
12586

That is a good idea, but then I cannot use any_of: its iterator dereferences to an SDNode, while I need the SDUse instead. So the check is going to remain a separate function. Also, although this makes the check more precise, it uncovers the permute regression, which now becomes explicit.
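Putting the pieces of this review together, roughly what the hook and its address check could look like (a reconstruction from the comments above, not a copy of the committed diff; names and details such as how the address use is recognized are illustrative):

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Is the value produced by N used as the address of some memory access
// (and not merely, e.g., as the value being stored)?
static bool hasMemAddressUser(SDNode *N) {
  for (SDNode::use_iterator I = N->use_begin(), E = N->use_end(); I != E; ++I)
    if (auto *M = dyn_cast<MemSDNode>(*I))
      if (M->getBasePtr().getNode() == N)
        return true;
  return false;
}

// Refuse the generic (op (op x, c1), y) -> (op (op x, y), c1) reassociation
// when it would pull a divergent y into a uniform (op x, c1), unless the
// expression forms a base+offset address that a memory access can fold.
static bool isReassocProfitableSketch(SelectionDAG &DAG, SDValue N0,
                                      SDValue N1) {
  if (!N0.hasOneUse())
    return false;
  if (N0->isDivergent() || !N1->isDivergent())
    return true;
  // N0's single user is the outer operation; after reassociation it becomes
  // the base+offset node, so it is the one that should feed a memory address.
  return DAG.isBaseWithConstantOffset(N0) &&
         hasMemAddressUser(*N0->use_begin());
}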

rampitec added inline comments.Jan 25 2022, 2:38 PM
llvm/test/CodeGen/AMDGPU/permute.ll
108–109

Add FIXME here that v_perm_b32 with 0xffff0500 mask can be used.

158–159

Same here with mask 0xffff0500.

alex-t updated this revision to Diff 403051.Jan 25 2022, 3:19 PM

FIXME v_perm_b32 added

alex-t marked 2 inline comments as done.Jan 25 2022, 3:19 PM
This revision is now accepted and ready to land.Jan 25 2022, 3:20 PM
This revision was landed with ongoing or failed builds.Jan 26 2022, 4:30 AM
This revision was automatically updated to reflect the committed changes.