This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Introduce new ISel combine for trunc-slr patterns
ClosedPublic

Authored by tsymalla on Jan 28 2022, 5:32 AM.

Details

Summary

In some cases, when selecting a (trunc (srl)) pattern, the srl gets translated
to a v_lshrrev_b32_e64 instruction, whereas the truncation gets selected to
a sequence of v_and_b32_e64 and v_cmp_eq_u32_e64. In the final ISA, this appears
as a test of the nth bit:

v_lshrrev_b32_e32 v0, 2, v1
v_and_b32_e32 v0, 1, v0
v_cmp_eq_u32_e32 vcc_lo, 1, v0

However, when the shift amount is known at compile time, the whole sequence can be
reduced to two VALU instructions by adjusting the constant operand of the v_and to (1 << lshrrev_operand):

v_and_b32_e32 v0, (1 << 2), v1
v_cmp_ne_u32_e32 vcc_lo, 0, v0

In the example above, the following pseudo-code:

v0 = (v1 >> 2)
v0 = v0 & 1
vcc_lo = (v0 == 1)

would be translated to:

v0 = v1 & 0b100
vcc_lo = (v0 != 0)

which should yield an equivalent result.
This is a bit hard to test, as one needs to force the SelectionDAG to contain
the relevant nodes before instruction selection; the test sequence was roughly
derived from a production shader.
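
As a quick standalone sanity check of the claimed equivalence (an editorial illustration only, not part of the patch), here is a minimal C++ sketch comparing the "shift, mask, compare against 1" form with the "mask, compare against 0" form for every bit position:

#include <cassert>
#include <cstdint>

// "v_lshrrev + v_and + v_cmp_eq" form: ((V >> N) & 1) == 1
static bool bitSetViaShift(uint32_t V, unsigned N) {
  return ((V >> N) & 1u) == 1u;
}

// "v_and + v_cmp_ne" form: (V & (1 << N)) != 0
static bool bitSetViaMask(uint32_t V, unsigned N) {
  return (V & (1u << N)) != 0u;
}

int main() {
  for (uint32_t V : {0u, 1u, 4u, 5u, 0xdeadbeefu, 0xffffffffu})
    for (unsigned N = 0; N < 32; ++N)
      assert(bitSetViaShift(V, N) == bitSetViaMask(V, N));
  return 0;
}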

Diff Detail

Event Timeline

tsymalla created this revision.Jan 28 2022, 5:32 AM
tsymalla requested review of this revision.Jan 28 2022, 5:32 AM
Herald added a project: Restricted Project.Jan 28 2022, 5:32 AM
foad added a comment.Jan 28 2022, 6:22 AM

It seems like the xor is getting in the way. Would something like D38161 help instead?

> It seems like the xor is getting in the way. Would something like D38161 help instead?

Currently, the xor gets combined to a setcc_ne which gets combined to the srl / trunc sequence.
Initially, there is the xor / setcc_eq sequence which could be simplified like in D38161, removing the need for the xor.
Probably that would clean up everything a bit.

Can this be done as a combine instead? Also, if we handle this for the VALU, we should also handle it for the SALU.
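
For illustration, a rough sketch of what such a rewrite could look like as a DAG combine. This is hypothetical: the helper name and where it would be hooked in are made up, the landed change ended up as an ISel pattern in SIInstructions.td instead, and (as foad notes later in this thread) a combine in this direction risks fighting the generic fold in TargetLowering::SimplifySetCC that produces the srl/trunc form in the first place.

#include "llvm/CodeGen/SelectionDAG.h"
using namespace llvm;

// Hypothetical: fold (i1 (trunc (srl X, C))) -> (setcc (and X, 1 << C), 0, ne).
static SDValue combineTruncOfSrl(SDNode *N, SelectionDAG &DAG) {
  if (N->getOpcode() != ISD::TRUNCATE || N->getValueType(0) != MVT::i1)
    return SDValue();

  SDValue Srl = N->getOperand(0);
  if (Srl.getOpcode() != ISD::SRL)
    return SDValue();

  EVT VT = Srl.getValueType();
  auto *ShAmt = dyn_cast<ConstantSDNode>(Srl.getOperand(1));
  if (!ShAmt || ShAmt->getZExtValue() >= VT.getScalarSizeInBits())
    return SDValue();

  SDLoc DL(N);
  // Mask out the tested bit and compare the result against zero.
  SDValue Mask = DAG.getConstant(1ULL << ShAmt->getZExtValue(), DL, VT);
  SDValue And = DAG.getNode(ISD::AND, DL, VT, Srl.getOperand(0), Mask);
  return DAG.getSetCC(DL, MVT::i1, And, DAG.getConstant(0, DL, VT),
                      ISD::SETNE);
}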

llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll
12

Don't need control flow in this test. Should also test the pattern for both scalar and vector inputs.

foad added a comment.Jan 28 2022, 8:05 AM

>> It seems like the xor is getting in the way. Would something like D38161 help instead?

> Currently, the xor gets combined to a setcc_ne which gets combined to the srl / trunc sequence.
> Initially, there is the xor / setcc_eq sequence which could be simplified like in D38161, removing the need for the xor.
> Probably that would clean up everything a bit.

Why do we even have the xor in the IR? Normally (if you run IR optimizations as well as backend passes) instcombine would combine it into the icmp. Why didn't this happen? Was it introduced late by StructurizeCFG? Does D118478 help?

>>> It seems like the xor is getting in the way. Would something like D38161 help instead?

>> Currently, the xor gets combined to a setcc_ne which gets combined to the srl / trunc sequence.
>> Initially, there is the xor / setcc_eq sequence which could be simplified like in D38161, removing the need for the xor.
>> Probably that would clean up everything a bit.

> Why do we even have the xor in the IR? Normally (if you run IR optimizations as well as backend passes) instcombine would combine it into the icmp. Why didn't this happen? Was it introduced late by StructurizeCFG? Does D118478 help?

You are correct, the xor is not needed in the IR. Simply inverting the predicate of the icmp should be sufficient. The xor is combined into a setcc, so just adjusting the test should be fine.

tsymalla added inline comments.Jan 28 2022, 8:20 AM
llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll
12

The control flow was used to prevent the truncate in the SDag (which is part of the pattern matched here) from being optimized away. I am going to check whether the adjustments to the test (see the comment from @foad) help here.
Going to test additional cases in the new revision.

tsymalla updated this revision to Diff 404477.Jan 31 2022, 4:03 AM

Added handling for scalar cases, improved test case.

foad added a comment.Feb 1 2022, 1:39 AM

I'm still not sure that we need this, if the xor can be cleaned up earlier. Does D118623 help?

> I'm still not sure that we need this, if the xor can be cleaned up earlier. Does D118623 help?

Jay, unfortunately it doesn't help. I tried your patch out, but for my test case the matcher won't apply, as there is no XOR in the LLVM IR; it only gets created as an SDag node. By the way, the comments in your change refer to the function "buildConditions"; that should be "insertConditions".
However, the idea here is not to remove the XOR, but to remove an additional VALU instruction that gets created because the TRUNCATE and the AND are translated separately in the MIR instead of being handled as one sequence. Replacing the "setcc ne" with its inverse and not introducing an additional XOR might remove the need for this change, but what about other possible cases where this pattern could get matched?

foad added a comment.Feb 2 2022, 2:34 AM

>> I'm still not sure that we need this, if the xor can be cleaned up earlier. Does D118623 help?

> Jay, unfortunately it doesn't help. I tried your patch out, but for my test case the matcher won't apply, as there is no XOR in the LLVM IR; it only gets created as an SDag node. By the way, the comments in your change refer to the function "buildConditions"; that should be "insertConditions".
> However, the idea here is not to remove the XOR, but to remove an additional VALU instruction that gets created because the TRUNCATE and the AND are translated separately in the MIR instead of being handled as one sequence. Replacing the "setcc ne" with its inverse and not introducing an additional XOR might remove the need for this change, but what about other possible cases where this pattern could get matched?

Well there is an XOR in the LLVM IR in the test case in your patch! Can you share another test case?

As a rule of thumb, if there is a missed optimisation, I would like to try to fix it as early as possible in the pass pipeline. Otherwise SelectionDAG has to get more and more complicated, to try to optimise things that should really have been cleaned up in the IR before instruction selection.

llvm/lib/Target/AMDGPU/SIInstructions.td
2290

v_cmp_ne_u32_e64 $a, 0, $a is probably better because 0 is always an inline constant, but 1 << $b might not be.

llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll
1

Please auto-generate the checks with utils/update_llc_test_checks.py and pre-commit this test with the old codegen, so that this patch will clearly show how the codegen changes.

41

"No newline at end of file" :)

>>> I'm still not sure that we need this, if the xor can be cleaned up earlier. Does D118623 help?

>> Jay, unfortunately it doesn't help. I tried your patch out, but for my test case the matcher won't apply, as there is no XOR in the LLVM IR; it only gets created as an SDag node. By the way, the comments in your change refer to the function "buildConditions"; that should be "insertConditions".
>> However, the idea here is not to remove the XOR, but to remove an additional VALU instruction that gets created because the TRUNCATE and the AND are translated separately in the MIR instead of being handled as one sequence. Replacing the "setcc ne" with its inverse and not introducing an additional XOR might remove the need for this change, but what about other possible cases where this pattern could get matched?

> Well there is an XOR in the LLVM IR in the test case in your patch! Can you share another test case?

> As a rule of thumb, if there is a missed optimisation, I would like to try to fix it as early as possible in the pass pipeline. Otherwise SelectionDAG has to get more and more complicated, to try to optimise things that should really have been cleaned up in the IR before instruction selection.

I think I misunderstood what you were saying. The pattern that gets matched here is inside the entry block of both functions. The XOR in the second test case (uniform) is there to prevent the truncate in the SDag from being optimized away and has no semantic relevance for the actual test. BTW, your patch still doesn't apply for me in this case. I agree with optimizing the XOR.
Thanks for your additional comments.

foad added a comment.Feb 2 2022, 7:23 AM

> The XOR in the second test case (uniform) is there to prevent the truncate in the SDag from being optimized away and has no semantic relevance for the actual test.

OK, sorry, I did not read the test carefully enough. I have looked at it more carefully now.

So the problem is that this DAG:

    t4: i32 = and t2, Constant:i32<2>
  t7: i1 = setcc t4, Constant:i32<0>, setne:ch
t9: i1,i64,ch = llvm.amdgcn.if t0, TargetConstant:i64<1324>, t7

gets optimized like this by a generic combine implemented in TargetLowering::SimplifySetCC:

    t24: i32 = srl t2, Constant:i64<1>
  t25: i1 = truncate t24
t9: i1,i64,ch = llvm.amdgcn.if t0, TargetConstant:i64<1324>, t25

I guess there is no way we can undo that with another combine, because they will end up fighting each other. So I think your patch is reasonable.

llvm/lib/Target/AMDGPU/SIInstructions.td
2290

Also, if you do this, there will only be one use of the constant, not two, so I don't think you will have to "Restrict the range to prevent using an additional VGPR for the shifted value".

llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll
1

Actually it will have to use update_mir_test_checks.

tsymalla updated this revision to Diff 405561.Feb 3 2022, 3:05 AM
tsymalla marked 5 inline comments as done.

Change requests from review addressed.

llvm/lib/Target/AMDGPU/SIInstructions.td
2290

The resulting value should still be checked to ensure no 32-bit overflow occurs, correct? For instance, if the shift amount is something like 33, 1 << 33 would not fit into 32 bits.

foad added inline comments.Feb 3 2022, 3:45 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
2290

I'm not sure there is any need to check. The result of a shift by 33 is undefined, so it doesn't really matter what code we generate in that case.

tsymalla added inline comments.Feb 3 2022, 5:30 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
2290

Sure, I’ll remove the check.

tsymalla updated this revision to Diff 405615.Feb 3 2022, 6:33 AM

Removed range checks.

tsymalla marked 2 inline comments as done.Feb 3 2022, 6:34 AM
tsymalla edited the summary of this revision. (Show Details)
foad accepted this revision.Feb 3 2022, 6:48 AM

LGTM, thanks! Just very minor comments inline.

llvm/lib/Target/AMDGPU/SIInstructions.td
2273

Don't need parens around the << expression.

llvm/test/CodeGen/AMDGPU/dagcombine-lshr-and-cmp.ll
1

"dagcombine" is not a very good name for the file, because this is isel not a combine, but I guess it's OK if you've already committed the file.

This revision is now accepted and ready to land.Feb 3 2022, 6:48 AM
This revision was landed with ongoing or failed builds.Feb 3 2022, 9:06 AM
This revision was automatically updated to reflect the committed changes.