This is an archive of the discontinued LLVM Phabricator instance.

Handling ADD|SUB U64 decomposed Pseudos not getting lowered to SDWA form
ClosedPublic

Authored by yassingh on Oct 24 2022, 10:51 PM.

Details

Summary

This patch fixes some of the V_ADD/SUB_U64_PSEUDO not getting converted to their sdwa form.
We still get below patterns in generated code:
v_and_b32_e32 v0, 0xff, v0
v_add_co_u32_e32 v0, vcc, v1, v0
v_addc_co_u32_e64 v1, s[0:1], 0, 0, vcc

and,
v_and_b32_e32 v2, 0xff, v2
v_add_co_u32_e32 v0, vcc, v0, v2
v_addc_co_u32_e32 v1, vcc, 0, v1, vcc

1st and 2nd instructions of both above examples should have been folded into sdwa add with BYTE_0 src operand.

The reason being the pseudo instruction is broken down into VOP3 instruction pair of V_ADD_CO_U32_e64 and V_ADDC_U32_e64.
The sdwa pass attempts lowering them to their VOP2 form before converting them into sdwa instructions. However V_ADDC_U32_e64
cannot be shrunk to it's VOP2 form if it has non-reg src1 operand.
This change attempts to fix that problem by only shrinking V_ADD_CO_U32_e64 instruction.

Diff Detail

Event Timeline

yassingh created this revision.Oct 24 2022, 10:51 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 24 2022, 10:51 PM
yassingh requested review of this revision.Oct 24 2022, 10:51 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 24 2022, 10:51 PM
yassingh edited the summary of this revision. (Show Details)Oct 24 2022, 11:10 PM
yassingh added reviewers: arsenm, ronlieb, rampitec.
yassingh added a subscriber: cdevadas.
rampitec added inline comments.Oct 25 2022, 12:10 PM
llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
871

This can be inline literal and still useful I suppose.

arsenm added inline comments.Oct 25 2022, 1:35 PM
llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
924–927

This is increasing the instruction size (and most likely the code size). This only makes sense to do if we know the fold into the operand can happen. This should perform those legality checks and make the full transform

yassingh added inline comments.Oct 26 2022, 4:59 AM
llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
871

Sorry I don't understand, are you suggesting moving this condition to a string literal?

924–927

Yes. There are 2 possible scenarios, if only the src1 operand is immediate then the 'mov' instruction will be folded. If both operands are immediates then 'MOV' will be part of generated assembly. Function 1 and 2 in the both test-files depict these scenarios.

I can add the legality check for this fold happening but then some of the instructions will be missed?

foad added a comment.Oct 26 2022, 8:00 AM

Can you precommit the tests and rebase so that we can see the diffs?

llvm/test/CodeGen/AMDGPU/v_add_u64_pseudo_sdwa.ll
9–10

This is silly. You have added a mov instruction to shrink the addc instruction for no reason, because it is not actually converted to sdwa form.

yassingh updated this revision to Diff 470845.Oct 26 2022, 9:54 AM
yassingh added a reviewer: foad.

Pre-committed the tests and rebased the diff

yassingh added inline comments.Oct 26 2022, 9:58 AM
llvm/test/CodeGen/AMDGPU/v_add_u64_pseudo_sdwa.ll
9–10

Yes you are right but the pass right now only attempts to convert to sdwa when both instructions (V_ADD_CO and V_ADDC_CO) are shrinkable to their sdwa formed so I went this way.
I can explore only shrinking the v_add_co instruction and avoid inserting that mov instruction?

rampitec added inline comments.Oct 26 2022, 11:14 AM
llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
871

I mean bailing on non-register operand limits the pass. The src1 operand can be inline literal and still convertible to sdwa form.

yassingh updated this revision to Diff 471118.Oct 27 2022, 4:35 AM

Only shrink V_ADD_CO_U32_e32 instruction if V_ADDC_CO_U32_e32 is not shrinkable. Hence avoiding introducing redundant copy instruction as suggested by @arsenm and @foad

yassingh added inline comments.Oct 27 2022, 4:53 AM
llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
871

Earlier we were checking if the instruction is not shrinkable don't proceed. Now we are also checking whether the reason for not shrinking is src1 being an immediate operand.
Hence we are covering src1 being an inline literal(correct me if I'm wrongly assuming that inline literal means immediate operand)

arsenm added inline comments.Oct 27 2022, 7:43 AM
llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
871

Inline immediate ares the ones that are free to encode in vsrc0, or in all vsrc operands for VOP3, as opposed to other literals that require materialization in a register. See TII::isInlineConstant

rampitec added inline comments.Oct 27 2022, 9:38 AM
llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
871

If an immediate is an inline literal instruction is still shrinkable. For example if it is 0 or 1 it is shrinkable, if it is 100 it is not.

cdevadas added inline comments.Nov 2 2022, 1:04 AM
llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
871

Need a comment here for allowing non-register vsrc1 cases.

yassingh updated this revision to Diff 474481.Nov 10 2022, 2:07 AM

Since only v_add_co_u32_e32 can be converted to SDWA form there is no need to try convert v_addc_co_u32_e64 to it's VOP2 form. This will allow us to get rid of all the checks we were performing on v_addc_co_u32. The shrinking of this instr will be taken care by si-shrink-instructions.

yassingh updated this revision to Diff 474487.Nov 10 2022, 2:13 AM
foad added inline comments.Nov 10 2022, 3:05 AM
llvm/tmp.mir
71 ↗(On Diff #474487)

This can be much simpler. Typically you only need name: and tracksRegLiveness: and body:.

If you used llc to generate this then you should try adding -simplify-mir.

yassingh added inline comments.Nov 10 2022, 3:25 AM
llvm/tmp.mir
71 ↗(On Diff #474487)

Thanks for pointing out, accidently added this file to review. Removing now

yassingh updated this revision to Diff 474494.Nov 10 2022, 3:26 AM
foad added a comment.Nov 10 2022, 10:08 AM

Looks good overall.

llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
902

"output", "its" (no apostrophe), full stop at end of sentence.

904

Doesn't this need to be VCC_LO for wave32? Please add a test for that. You can use SIRegisterInfo::getVCC() to get the appropriate reg for the wave size.

foad added a comment.Nov 10 2022, 10:09 AM

This change attempts to fix that problem by introducing a register operand in place of immediate src1 operand.

You're not doing that any more, so the description needs updating.

arsenm added inline comments.Nov 10 2022, 11:08 AM
llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
853

Only converting the low half to VOP2 is somewhat surprising given the name, but I guess this makes sense

yassingh updated this revision to Diff 474662.Nov 10 2022, 9:52 PM
yassingh edited the summary of this revision. (Show Details)

Addressed review comments, updated summary.

yassingh added inline comments.Nov 10 2022, 9:59 PM
llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
904

Updated to TRI->getVCC(). However I am not able to add test for VCC_LO. Tried compiling for gfx1010, wavefrontsize=32 but SIInstrInfo::canShrink returns false for V_ADD_CO_U32_e64 hence the pass does not attempt converting it to sdwa form.

Responsible condition in SIInstrInfo::canShrink() => if (!hasVALU32BitEncoding(MI.getOpcode())) return false;

foad added inline comments.Nov 11 2022, 3:18 AM
llvm/lib/Target/AMDGPU/SIPeepholeSDWA.cpp
904

Oh you're right, GFX10/11 V_ADD_CO_U32 does not have an e32 or sdwa form. Sorry.

arsenm accepted this revision.Nov 16 2022, 3:07 PM

LGTM

This revision is now accepted and ready to land.Nov 16 2022, 3:07 PM
This revision was landed with ongoing or failed builds.Nov 16 2022, 8:32 PM
This revision was automatically updated to reflect the committed changes.