Page MenuHomePhabricator

[AMDGPU] Divergence-driven compare operations instruction selection
ClosedPublic

Authored by alex-t on Jul 15 2021, 9:47 AM.

Details

Summary

Description: This change enables the compare operations to be selected to SALU/VALU form

dependent of the SDNode divergence flag.

Diff Detail

Event Timeline

alex-t created this revision.Jul 15 2021, 9:47 AM
alex-t requested review of this revision.Jul 15 2021, 9:47 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 15 2021, 9:47 AM
Herald added a subscriber: wdng. · View Herald Transcript
rampitec added inline comments.Jul 15 2021, 12:21 PM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
594

TRI->getBoolRC()?

598

It is certainly too wide, you need to use clang-format.

llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
5137

This looks like a regression. A common one.

rampitec added a subscriber: Restricted Project.
arsenm added inline comments.Jul 15 2021, 12:26 PM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
592

When do copies from SCC appear? I thought the InstrEmitter could essentially always avoid these

Can you explain the motivation for this change further?
Particularly as it increases generated code size.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
4498

Do we now have to always use larger e64 instructions or are these later reduced if we can use vcc instead?

foad added inline comments.Jul 16 2021, 6:23 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.td
530

I think you probably need to look ahead to see if this setcc is used by a divergent select. Otherwise you regress: a == b ? c : d where a and b are uniform but c or d are divergent. In this case it is better to use v_cmp than s_cmp, even though the operands are uniform, because it can feed straight into v_cndmask.

alex-t updated this revision to Diff 359563.Jul 17 2021, 7:27 AM

Fixed amdgpu-codegenprepare-idiv.ll test regression + several minor fixes.

alex-t added inline comments.Jul 17 2021, 7:39 AM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
592

They appear if we have a uniform comparison that provides the operand for the select node. We haven't yet made it to be S_CSELECT if uniform and V_CNDMASK if divergent. That's why InstrEmitter has to adjust operands by adding the SCC copies. It may also happen for any VALU operation that takes SReg_64_XEXECRegClass operand.

594

No. getBoolRC returns the SReg_32RegClass but I really need SReg_32_XM0_XEXEC.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
4498

SIShrinkInstructions does whatever is possible to reduce back to e32.
As for the change motivation, the main goal is to select SALU instructions for uniform SDNodes and VALU for the divergent ones.
It increases code size a bit but it changes the VALU instruction that operates over the 64 32bit lanes to SALU instruction that operates over 3 scalar registers.

alex-t added inline comments.Jul 17 2021, 7:44 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.td
530

That's right. And is planned for the next change. I am just trying to keep it staged. In general, I am planning to add the dedicated procedure that checks the "VALU only" SDNode users beforehand and selects to VALU even uniform if any.

rampitec added inline comments.Jul 19 2021, 10:09 AM
llvm/test/CodeGen/AMDGPU/addrspacecast.ll
14

This also needs a sort of a look ahead.

alex-t updated this revision to Diff 359844.Jul 19 2021, 10:52 AM

Formatting fixed

alex-t added inline comments.Jul 19 2021, 11:02 AM
llvm/test/CodeGen/AMDGPU/addrspacecast.ll
14

All such a kind of regressions need to be addressed in the separate patch I believe. Please note here 2 facts:

  1. v_cndmask that consumes VCC is uniform and should be selected to S_CSELECT itself but this part of the work is not done yet.
  2. To fix all of these patterns I need to add a lookahead in the selection predicate - scan over all CC users and select only if no VALU instructions were found. I was planning this as a separate change.
rampitec added inline comments.Jul 19 2021, 11:09 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
5782

Unused variable initialization. You do not need to declare CondReg here at all.

nhaehnle added inline comments.Jul 20 2021, 8:00 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.td
530

Is it so clear-cut?

For a == b ? c : d with a and b uniform but c or d divergent, there's a choice between v_cmp + v_cndmask vs. s_cmp + s_cselect + v_cndmask.

So this change increases code size and trades 1 VALU for 2 SALU, but SALU utilization tends to be low, so purely based on instruction counts the s_cmp-based sequence is still better most of the time. On gfx10 there's a scheduling pitfall to watch out for though.

alex-t added inline comments.Jul 22 2021, 8:27 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.td
530

For this concrete example, S_CMP is still better. In my comment above I meant that I am planning to add a selection predicate as the right place to decide if the given node should be selected to VALU or SALU.
Further, we can try different strategies - one described above or another one, more complex.

Following the discussion regarding uniform "setcc" with a divergent use:
In the current implementation of the divergence-driven ISel node is selected only depending on the divergence bit value regardless of VALU or SALU uses.
The latter is not necessarily related to the user divergence. The user may be uniform but selected to VALU just because corresponding SALU instruction does not exist.
As mentioned above, the alternative approach is to select the given node to VALU if it has VALU users. At first, this requires some reasonable heuristic.
Let's say we have a uniform SDNode that has 1 VALU but 10 SALU users. Is it profitable to select it to VALU?
The selection hook that looks ahead for the users needs to be controlled by the option to try different heuristics.
Also, the problem in question is common for all the opcodes - not the "setcc" only. Thus, adding such a hook would require changing all the places in the target where the divergence bit is currently checked. That is why I insist this should go to a separate patch.

For the current one, I checked the possible regression related to the "select" pattern by running the OpenCL conformance math_bruteforce with the clean and changed compiler. The timing shows that the new one is on par with the old one.

alex-t updated this revision to Diff 367605.Thu, Aug 19, 1:45 PM

Unused variable initialization removed.

alex-t marked an inline comment as done.Thu, Aug 19, 1:46 PM
rampitec accepted this revision.Fri, Aug 20, 11:26 AM

LGTM. Please wait other reviewers before landing.

This revision is now accepted and ready to land.Fri, Aug 20, 11:26 AM
This revision was landed with ongoing or failed builds.Wed, Aug 25, 8:30 AM
This revision was automatically updated to reflect the committed changes.