This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Allow folding of sgpr to vgpr copy
ClosedPublic

Authored by rampitec on Oct 21 2019, 1:52 PM.

Details

Summary

Potentially sgpr to sgpr copy should also be possible.
That is however trickier because we may end up with a
wrong register class at use because of xm0/xexec permutations.

Diff Detail

Event Timeline

rampitec created this revision.Oct 21 2019, 1:52 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 21 2019, 1:52 PM
arsenm added inline comments.Oct 21 2019, 2:23 PM
llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll
79–80 ↗(On Diff #225952)

This looks like it got worse?

rampitec marked an inline comment as done.Oct 21 2019, 2:34 PM
rampitec added inline comments.
llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll
79–80 ↗(On Diff #225952)

Yes, this is regression specific to fma/mac. The reg class after the folding mismatches xm0/xexec operand definition of fma src.
The regression is however small, while some copies are eliminated in other cases.

rampitec marked an inline comment as done.Oct 21 2019, 2:44 PM
rampitec added inline comments.
llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll
79–80 ↗(On Diff #225952)

I.e. we should refine how we use sgpr register classes instead of inhibiting folding.

arsenm added inline comments.Oct 21 2019, 2:45 PM
llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll
79–80 ↗(On Diff #225952)

The fma src doesn't use xm0_xexec though? Can you add a testcase with this specific case? I think this should be easily avoidable

rampitec updated this revision to Diff 225968.Oct 21 2019, 3:51 PM
rampitec marked an inline comment as done.

Added test for fma regression.

rampitec marked an inline comment as done.Oct 21 2019, 4:06 PM
rampitec added inline comments.
llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll
79–80 ↗(On Diff #225952)

It is explicitly disabled in the SIFoldOperands::foldOperand():

// Don't fold subregister extracts into tied operands, only if it is a full
// copy since a subregister use tied to a full register def doesn't really
// make sense. e.g. don't fold:
//
// %1 = COPY %0:sub1
// %2<tied3> = V_MAC_{F16, F32} %3, %4, %1<tied0>
//
//  into
// %2<tied3> = V_MAC_{F16, F32} %3, %4, %0:sub1<tied0>
if (UseOp.isTied() && OpToFold.getSubReg() != AMDGPU::NoSubRegister)
  return;
rampitec updated this revision to Diff 225973.Oct 21 2019, 4:21 PM

Changed run line to gfx1010, otherwise folding of sgpr in the test does not happen because it violates constant bus restriction.

rampitec marked an inline comment as done.
rampitec added inline comments.
llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll
79–80 ↗(On Diff #225952)

This code is removed in child D69287 and folding happens again.

vpykhtin accepted this revision.Oct 23 2019, 9:38 AM

LGTM.

This revision is now accepted and ready to land.Oct 23 2019, 9:38 AM
This revision was automatically updated to reflect the committed changes.