Potentially an sgpr to sgpr copy fold should also be possible. That is, however, trickier because we may end up with a wrong register class at the use due to xm0/xexec permutations.
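For context, here is a minimal, hypothetical MIR sketch (not taken from the patch; values, opcodes, and operand lists are illustrative) of the kind of fold being discussed: a COPY from an sgpr into a vgpr feeding a VALU instruction is erased and the sgpr is used directly, provided the operand's register class and the constant bus restriction allow it.

```
# Hypothetical sketch, not from the patch; operand lists abbreviated.
# Before SIFoldOperands: the VALU use reads the sgpr through a vgpr copy.
%0:sgpr_32 = S_MOV_B32 42
%1:vgpr_32 = COPY %0
%2:vgpr_32 = V_ADD_U32_e32 %1, %3, implicit $exec

# After folding: the copy is gone and the sgpr is used as src0 directly,
# as long as the operand's register class and the constant bus limit permit it.
%2:vgpr_32 = V_ADD_U32_e32 %0, %3, implicit $exec
```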
Details

Reviewers: arsenm, vpykhtin
Commits: rG61e7a61bdccf: [AMDGPU] Allow folding of sgpr to vgpr copy

Event Timeline
llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll, lines 79–81
This looks like it got worse?
llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll, lines 79–81
Yes, this is a regression specific to fma/mac. The register class after the folding mismatches the xm0/xexec operand definition of the fma src.
llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll, lines 79–81
I.e. we should refine how we use sgpr register classes instead of inhibiting folding.
llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll, lines 79–81
The fma src doesn't use xm0_xexec though? Can you add a testcase with this specific case? I think this should be easily avoidable.
llvm/test/CodeGen/AMDGPU/fmul-2-combine-multi-use.ll, lines 79–81
It is explicitly disabled in SIFoldOperands::foldOperand():

```cpp
// Don't fold subregister extracts into tied operands, only if it is a full
// copy since a subregister use tied to a full register def doesn't really
// make sense. e.g. don't fold:
//
// %1 = COPY %0:sub1
// %2<tied3> = V_MAC_{F16, F32} %3, %4, %1<tied0>
//
// into
// %2<tied3> = V_MAC_{F16, F32} %3, %4, %0:sub1<tied0>
if (UseOp.isTied() && OpToFold.getSubReg() != AMDGPU::NoSubRegister)
  return;
```
Changed the run line to gfx1010; otherwise folding of the sgpr in the test does not happen because it would violate the constant bus restriction (pre-gfx10 targets allow only one constant-bus read per VALU instruction, while gfx10 allows two).
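For illustration only (the test's actual RUN lines are not quoted here and may differ), the kind of run line change meant is something like:

```
; Hypothetical RUN lines, not copied from the test.
; Before: a pre-gfx10 target, where a VALU instruction may read at most one
; sgpr/constant-bus operand, so the fold is rejected.
; RUN: llc -mtriple=amdgcn -mcpu=tahiti -verify-machineinstrs < %s | FileCheck %s
; After: gfx1010 allows two constant-bus reads per VALU instruction, so the
; folded sgpr operand stays legal and the fold fires.
; RUN: llc -mtriple=amdgcn -mcpu=gfx1010 -verify-machineinstrs < %s | FileCheck %s
```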