- We need to lower a COPY from SGPR to VGPR into a real instruction. The standard COPY is used where the source and destination are in the same register bank, so that the coalescer can potentially merge them and save one COPY; because of that, backend optimizations such as CSE won't handle COPYs. A copy from SGPR to VGPR, however, always needs to be materialized as a native instruction, so it should be lowered into a real one before the other backend optimizations run.
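As a rough sketch of what the lowering does, assuming it runs inside an AMDGPU machine pass with SIInstrInfo and SIRegisterInfo available (the helper name is made up, and the 64-bit case, which would need the copy split into two 32-bit moves, is left out):

```cpp
#include "SIInstrInfo.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
using namespace llvm;

// Rewrite `Dst = COPY Src` into `Dst = V_MOV_B32_e32 Src` when Dst is a
// VGPR and Src an SGPR, so the move is visible to CSE and operand folding.
static bool lowerSGPRToVGPRCopy(MachineInstr &MI, MachineRegisterInfo &MRI,
                                const SIRegisterInfo &TRI,
                                const SIInstrInfo &TII) {
  if (!MI.isCopy())
    return false;
  Register Dst = MI.getOperand(0).getReg();
  // Same-bank copies stay as COPY so the coalescer can still remove them.
  if (!TRI.isVGPR(MRI, Dst) || !TRI.isSGPRReg(MRI, MI.getOperand(1).getReg()))
    return false;
  BuildMI(*MI.getParent(), MI, MI.getDebugLoc(),
          TII.get(AMDGPU::V_MOV_B32_e32), Dst)
      .add(MI.getOperand(1));
  MI.eraseFromParent();
  return true;
}
```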
Event Timeline
With this patch, register pressure is reduced on certain tests, with improvements of up to 10%. As the generated code is visibly different from before, the regression tests are updated as well.
llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
1240–1244: This change is needed because CSE runs before operand folding. After replacing the COPY with V_MOV, we have only one V_MOV loading that literal value, and its result is reused multiple times by the following REG_SEQUENCE. This change fixes the wqm.ll regression test.
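To picture the CSE effect, here is a toy value-numbering sketch in plain C++ (the literal value and register numbers are invented): once the literal load is a real instruction rather than an opaque COPY, repeated loads of the same literal key to a single def.

```cpp
#include <cstdio>
#include <map>

int main() {
  // Three uses of one literal, standing in for the REG_SEQUENCE operands.
  const unsigned Literal = 0x42;          // made-up value for illustration
  std::map<unsigned, int> AvailableMove;  // literal -> defining vreg
  int NextVReg = 0;
  for (int Use = 0; Use < 3; ++Use) {
    auto [It, Inserted] = AvailableMove.try_emplace(Literal, NextVReg);
    if (Inserted) // first occurrence: materialize the single mov
      std::printf("v%d = mov 0x%x\n", NextVReg++, Literal);
    std::printf("  use %d reads v%d\n", Use, It->second);
  }
}
```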
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
105–108: Add an option to enable or disable this lowering. Because the lowering may extend register live ranges and thus increase register pressure, some tests may show lower performance; having this option eases performance evaluation on suspected benchmarks.
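The option would presumably look like a standard cl::opt; a hedged sketch with an invented flag name, not necessarily what the patch adds:

```cpp
#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Hypothetical knob: gate the SGPR-to-VGPR COPY lowering so its effect on
// register pressure can be measured on suspected benchmarks.
static cl::opt<bool> EnableSGPRToVGPRCopyLowering(
    "amdgpu-lower-sgpr-to-vgpr-copy", cl::Hidden, cl::init(true),
    cl::desc("Lower SGPR-to-VGPR COPY into V_MOV before other backend "
             "optimizations"));
```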
11357–11358: Only SGPR_(32|64) is handled so far, since instruction selection uses SReg_(32|64) by default. Operand folding currently walks in pre-order, so the chain of MOVs (immediate value to SGPR, then SGPR to VGPR) is not folded efficiently: the first MOV (immediate to SGPR) gets folded into the second MOV (SGPR to VGPR) instead of reaching the real user. IMHO, we should change the folding order from pre-order to post-order; see the sketch below.
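A toy model of that ordering point in plain C++ (not the LLVM folder; all names invented): pre-order lets the intermediate move absorb the literal, while post-order starts at the final user, which is where the folding decision belongs.

```cpp
#include <cstdio>
#include <vector>

// Chain under discussion: d0 = mov-imm ; d1 = mov d0 ; u = add d1, x
struct Node {
  const char *Name;
  bool IsImmMove;           // true if this def materializes an immediate
  std::vector<Node *> Ops;  // use -> def edges
};

// Pre-order: visit defs first; the literal is folded into the immediate
// user of the mov-imm, i.e. the intermediate move d1, and stops there.
void foldPreOrder(const Node &Def, const Node &Use) {
  if (Def.IsImmMove)
    std::printf("pre-order:  fold literal into %s\n", Use.Name);
}

// Post-order: start from the final user, look through the mov chain, and
// let the user decide where the literal lands (e.g. literal-operand limits).
void foldPostOrder(const Node &User) {
  for (const Node *Def : User.Ops) {
    while (!Def->IsImmMove && !Def->Ops.empty())
      Def = Def->Ops.front(); // look through plain moves
    if (Def->IsImmMove)
      std::printf("post-order: fold literal into %s\n", User.Name);
  }
}

int main() {
  Node D0{"d0 (mov-imm)", true, {}};
  Node D1{"d1 (sgpr->vgpr mov)", false, {&D0}};
  Node U{"u (add)", false, {&D1}};
  foldPreOrder(D0, D1); // folds into d1, not into the add
  foldPostOrder(U);     // folds into u, the real user
}
```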
llvm/test/CodeGen/AMDGPU/fabs.ll
14: The additional optimization removes redundant instructions so that the shrink pass can discover such patterns.
llvm/test/CodeGen/AMDGPU/sgpr-copy-cse.ll
10: This is the test extracted from that benchmark.
llvm/test/CodeGen/AMDGPU/waitcnt-vscnt.ll
156–157: As instructions are reduced, the schedule is slightly different from the previous one, so those two s_waitcnt instructions could not be merged into a single one.
llvm/test/CodeGen/AMDGPU/wqm.ll
653: Here's the result of the pre-order operand folding. Previously, we had one S_MOV loading an immediate value followed by a COPY from SGPR to VGPR; since COPY doesn't allow an immediate operand, the fold could only happen on the user of that COPY. Now that we lower that COPY to V_MOV, the S_MOV is folded into that V_MOV first. That's not a good result, as one VGPR is used instead of one SGPR. IMHO, the proper way to fix this is to change the pre-order operand folding into a post-order one, i.e. always fold starting from the user instruction. That site is a better place to decide how to fold, especially when more than one operand could be folded.
I don't think this transform necessarily makes sense. Copies should always be preferable to moves.