This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Minimize number of s_mov generated by copyPhysReg
ClosedPublic

Authored by critson on Oct 10 2020, 7:10 AM.

Details

Summary

Generate the minimal set of s_mov instructions required when
expanding a SGPR copy operation in copyPhysReg.

Diff Detail

Event Timeline

critson created this revision.Oct 10 2020, 7:10 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 10 2020, 7:10 AM
critson requested review of this revision.Oct 10 2020, 7:10 AM
critson updated this revision to Diff 297442.Oct 10 2020, 11:09 PM
  • Merge code generation loops to avoid needing to generate work list
  • Fix potential issue when all elements of copy are overwritten
  • Fix test file location error

To motivate the peephole.
This pattern effects 2% of graphics shaders on GFX9, and nearly 7% on GFX10.
On average we save ~1.5 instructions per effected shader.
On some VulkanCTS tests the savings are much higher.
Given the relatively low gain, I assume it was not worth introducing a new peephole pass, and took this approach to address the duplicate s_mov instructions at the point of generation (when they cheapest to spot).

foad added inline comments.Oct 12 2020, 2:32 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
665

Is this "if" just an optimization?

676–678

In this case would it be better to generate an IMPLICIT_DEF of the superregister, instead of a real copy?

693

Typo "combine".

critson added inline comments.Oct 12 2020, 3:26 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
665

It is necessary because getRegSplitParts does not give a valid/useful results for a single SGPR.

676–678

Good point.

foad added inline comments.Oct 12 2020, 3:36 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
665

Perhaps getRegSplitParts could be fixed to handle this degenerate case. It seems fragile to list all the SReg_32 classes here -- e.g. did you forget SReg_32_XEXEC_HI and SReg_32_XM0_XEXEC?

critson updated this revision to Diff 297539.Oct 12 2020, 3:45 AM

Address reviewer comments.

arsenm added inline comments.Oct 12 2020, 9:30 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
645–653

Can we do something earlier? I don't think copyPhysReg should be considering context beyond the given instruction

critson added inline comments.Oct 12 2020, 7:47 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
645–653

I think the earliest we can do this is immediately following "Simple Register Coalescing", although I am unsure of the RA implications and I do not think we have an appropriate pass it fits into.

To avoid adding a new pass I think this might be a reasonable addition to "SI Fix VGPR copies" as that already passes over the whole IR and is doing work on copies.
I could rename "SI Fix VGPR copies" to something like "SI Late Fixup Copies" and put the peephole in there?

How about splitting this change into two parts? Limit this change to generating mixes of 64- and 32-bit moves, and then separately consider what to do about those moves of registers where subregisters are overwritten immediately.

How about splitting this change into two parts? Limit this change to generating mixes of 64- and 32-bit moves, and then separately consider what to do about those moves of registers where subregisters are overwritten immediately.

That was my plan.
I was just hoping I could get some input on where would be considered acceptable for the overwrite dedup to go.

critson updated this revision to Diff 298309.Oct 14 2020, 11:36 PM
  • Remove peephole
  • Pre-commit test
foad accepted this revision.Oct 15 2020, 1:17 AM

LGTM.

Remove peephole

Summary needs updating to reflect that.

This revision is now accepted and ready to land.Oct 15 2020, 1:17 AM
critson edited the summary of this revision. (Show Details)Oct 15 2020, 1:26 AM
This revision was landed with ongoing or failed builds.Oct 15 2020, 6:35 AM
This revision was automatically updated to reflect the committed changes.