This is an archive of the discontinued LLVM Phabricator instance.

[amdgpu] Lower SGPR-to-VGPR copy in the final phase of ISel.
ClosedPublic

Authored by hliao on Sep 11 2020, 10:05 PM.

Details

Summary
  • We need to lower a COPY from SGPR to VGPR into a real instruction. The standard COPY is used where the source and destination belong to the same register bank, so that the register coalescer can potentially merge them and save one COPY; for that reason, backend optimizations such as CSE do not handle COPYs. However, a copy from SGPR to VGPR always needs to be materialized as a native instruction, so it should be lowered into a real one before the other backend optimizations run.
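
  For illustration, a minimal sketch of the lowering described above, assuming it runs over the selected MachineIR; the helper name and the exact hook are illustrative assumptions, not the patch's code:

    // Rewrite a cross-bank COPY (SGPR source, VGPR destination) into a
    // native V_MOV so later MIR passes (e.g. machine CSE) can reason
    // about it like any other instruction. 32-bit case only.
    static void lowerSGPRToVGPRCopy(MachineInstr &MI, MachineRegisterInfo &MRI,
                                    const SIRegisterInfo *TRI,
                                    const SIInstrInfo *TII) {
      if (!MI.isCopy())
        return;
      Register Dst = MI.getOperand(0).getReg();
      Register Src = MI.getOperand(1).getReg();
      if (!TRI->isVGPR(MRI, Dst) || !TRI->isSGPRReg(MRI, Src))
        return; // Same-bank COPYs are left for the coalescer.
      BuildMI(*MI.getParent(), MI, MI.getDebugLoc(),
              TII->get(AMDGPU::V_MOV_B32_e32), Dst)
          .addReg(Src, 0, MI.getOperand(1).getSubReg());
      MI.eraseFromParent();
    }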

Diff Detail

Event Timeline

hliao created this revision.Sep 11 2020, 10:05 PM
hliao requested review of this revision.Sep 11 2020, 10:06 PM

With this patch, register pressure is reduced on certain tests, by up to 10%. As the generated code differs visibly from before, the regression tests are updated as well.

hliao added inline comments.Sep 11 2020, 10:37 PM
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1240–1244

This change is needed because CSE runs before operand folding. After replacing the COPY with V_MOV, we have only one V_MOV loading that literal value, and its result is reused multiple times in the following REG_SEQUENCE. This change fixes the wqm.ll regression test.
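
A hedged sketch of that heuristic, assuming a check of roughly this shape in the folding path (the helper and its exact placement are illustrative, not the literal diff):

    // Keep a literal-materializing V_MOV when its result has several
    // uses; folding the literal into every use would just duplicate it.
    static bool shouldKeepLiteralMov(const MachineInstr &DefMI,
                                     const MachineRegisterInfo &MRI) {
      if (DefMI.getOpcode() != AMDGPU::V_MOV_B32_e32 ||
          !DefMI.getOperand(1).isImm())
        return false;
      // A single use can still be folded profitably; with multiple uses,
      // reusing the already-materialized value is cheaper.
      return !MRI.hasOneNonDBGUse(DefMI.getOperand(0).getReg());
    }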

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
105–108

Add an option to enable or disable this lowering. Because the lowering may lengthen register live ranges and increase register pressure, some tests may regress in performance; having this option eases performance evaluation on suspect benchmarks.
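
Such a flag would typically be a cl::opt along these lines (the flag name here is an assumption, not necessarily the one used in the patch):

    #include "llvm/Support/CommandLine.h"

    // Hypothetical flag name; the patch's actual option may differ.
    static llvm::cl::opt<bool> EnableSGPRToVGPRCopyLowering(
        "amdgpu-lower-sgpr-to-vgpr-copy",
        llvm::cl::desc("Lower SGPR-to-VGPR copies into V_MOV during ISel"),
        llvm::cl::init(true), llvm::cl::Hidden);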

11357–11358

Only SGPR_(32|64) is handled so far, as instruction selection uses SReg_(32|64) by default. Operand folding currently works in pre-order, so the chain of MOVs from an immediate value to an SGPR and then to a VGPR is not folded efficiently: we fold the first MOV (immediate to SGPR) into the second MOV (SGPR to VGPR) instead of folding into the final user. IMHO, we should change the folding order from pre-order to post-order.
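
As an illustration of the post-order direction suggested here, a sketch that starts from a user operand and walks back through the mov chain to the literal (all names are illustrative; this is not the patch's code):

    #include <optional>

    // Walk backwards through V_MOV/S_MOV defs until the literal at the
    // head of the chain is found, so the user can fold it directly.
    static std::optional<int64_t>
    findImmThroughMovChain(Register Reg, const MachineRegisterInfo &MRI) {
      while (Reg.isVirtual()) {
        const MachineInstr *Def = MRI.getVRegDef(Reg);
        if (!Def || (Def->getOpcode() != AMDGPU::V_MOV_B32_e32 &&
                     Def->getOpcode() != AMDGPU::S_MOV_B32))
          return std::nullopt;
        const MachineOperand &Src = Def->getOperand(1);
        if (Src.isImm())
          return Src.getImm(); // Immediate at the head of the chain.
        if (!Src.isReg())
          return std::nullopt;
        Reg = Src.getReg(); // Step through the mov to its source.
      }
      return std::nullopt;
    }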

llvm/test/CodeGen/AMDGPU/fabs.ll
14

The additional optimizations remove redundant instructions, so the shrink pass can discover such patterns.

llvm/test/CodeGen/AMDGPU/sgpr-copy-cse.ll
10

This is the test extracted from that benchmark.

llvm/test/CodeGen/AMDGPU/waitcnt-vscnt.ll
156–157

As instructions are reduced, the schedule is slightly different from the previous one; those two s_waitcnt instructions can no longer be merged into a single one.

llvm/test/CodeGen/AMDGPU/wqm.ll
653

Here is the result of the pre-order operand folding. Previously, we had one S_MOV loading the immediate value, followed by a COPY from SGPR to VGPR. Since COPY does not allow an immediate operand, the fold could only happen at the user of that COPY. Now that we lower that COPY into V_MOV, the S_MOV is folded into the V_MOV first. However, that is a worse result, as a VGPR is used instead of an SGPR. IMHO, the proper way to fix this is to change the pre-order operand folding into a post-order one, i.e. always fold from the user instruction. The user is a better place to decide how to fold, especially when more than one operand could be folded.
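
A minimal sketch of the user-first decision suggested here, using SIInstrInfo::isOperandLegal to test whether the literal can go straight into the user (the helper name and its placement are illustrative assumptions):

    // Fold the original immediate directly into the user when legal,
    // leaving both movs dead, instead of folding it into the V_MOV and
    // occupying a VGPR with a constant.
    static bool shouldFoldImmIntoUser(const MachineInstr &UseMI, unsigned OpIdx,
                                      const MachineOperand &Imm,
                                      const SIInstrInfo *TII) {
      return Imm.isImm() && TII->isOperandLegal(UseMI, OpIdx, &Imm);
    }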

hliao updated this revision to Diff 291582.Sep 14 2020, 8:34 AM

Follow suggestions from clang-tidy.

foad added a subscriber: foad.Sep 14 2020, 9:48 AM
rampitec accepted this revision.Sep 14 2020, 11:28 AM

LGTM, but please also wait for Matt's review.

This revision is now accepted and ready to land.Sep 14 2020, 11:28 AM
arsenm added inline comments.Sep 17 2020, 8:17 AM
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1239–1245

This is a separate change

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
11341

I think it's unreasonable to have another surprise pass here

I don't think this transform necessarily makes sense. Copies should always be preferable to moves