When selectVOP3PMadMixModsImpl fails, it can still create a new COPY instruction
via selectVOP3ModsImpl. When selectG_FMA_FMAD then gives up, that COPY
remains dead but is not automatically removed:
InstructionSelect does not check whether instructions created during selection
are dead.
Such a dead COPY has no register class on its destination operand and causes a crash.
The fix is to build the COPY only when operands are being added to the selected instruction.
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
- Lines 3443–3447: This copy should just have been restricted in the first place. I also would have expected this to not have produced instructions unless the match happened; can you first try constraining the operands of this copy, and then splitting the predicate part from the instruction construction?
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
- Line 4852: With this it looks like we will end up calling getVOP3ModsImpl twice. I guess this isn't called often enough (and the "get" function seems simple enough) to cause performance issues, but it still feels suboptimal.
Now that I have checked tests with mad_mix that have an SGPR input: the copy is not created, but it should have been.
The patch is not correct in this state.
Is there a way to move

```cpp
if (!MatchedSrc0 && !MatchedSrc1 && !MatchedSrc2)
  return false;
```

in selectG_FMA_FMAD before the lookup for modifiers begins via selectVOP3PMadMixModsImpl?
What looks equivalent to me is to check whether at least one of the operands is an fpext (possibly hidden behind other instructions that could be folded into mods). Unfortunately this still has to call getVOP3ModsImpl on each operand in the worst case before the actual mods selection begins.
This should be equivalent to trunk LLVM, but without creating the copy instruction when fma_mix does not get selected.
I was checking whether a test like this gets selected in the same way:

```llvm
define amdgpu_vs float @test_f16_f32_add_fma_ext_mul(float inreg %x, float %y, float %z, half %u, half %v) {
.entry:
  %a = fmul half %u, %v
  %b = fpext half %a to float
  %abs_x = call contract float @llvm.fabs.f32(float %x)
  %c = call float @llvm.fmuladd.f32(float %abs_x, float %y, float %b)
  %d = fadd float %c, %z
  ret float %d
}

declare float @llvm.fmuladd.f32(float, float, float)
declare float @llvm.fabs.f32(float)
```
selects into
```
%0:sreg_32 = COPY $sgpr0
%1:vgpr_32 = COPY $vgpr0
%2:vgpr_32 = COPY $vgpr1
%5:vgpr_32 = COPY $vgpr2
%6:vgpr_32 = COPY $vgpr3
%8:vgpr_32 = nofpexcept V_MUL_F16_e64 0, %5, 0, %6, 0, 0, implicit $mode, implicit $exec
%14:vgpr_32 = COPY %0
%11:vgpr_32 = V_FMA_MIX_F32 2, %14, 0, %1, 8, %8, 0, 0, 0, implicit $mode, implicit $exec
%12:vgpr_32 = nofpexcept V_ADD_F32_e64 0, %11, 0, %2, 0, 0, implicit $mode, implicit $exec
$vgpr0 = COPY %12
SI_RETURN_TO_EPILOG implicit $vgpr0
```
The previous version of this patch would not create the copy and would use the SGPR directly. This is visible only in MIR; SIFoldOperands folds this copy later anyway:

```
%11:vgpr_32 = V_FMA_MIX_F32 2, %0, 0, %1, 8, %8, 0, 0, 0, implicit $mode, implicit $exec
```
I do think we should lean more on post-selection operand folding for dealing with SGPR copies, and we should write a new and better version. The current version is "backwards": traditionally, after selection we would have excess SGPRs and try to remove them to legalize instructions. With RegBankSelect making every operand a VGPR, we need a version written from the perspective that everything starts as a VGPR and needs an optimal choice of SGPR operands.
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
- Line 4852: I agree, you should know everything after the first round of selectVOP3PMadMixModsImpl and should just pass through what to do.
I did not understand the last comment. What do I need to do exactly?
Now selectVOP3PMadMixModsImpl (unchanged) is called once we know that fma/mad_mix will be selected.
As far as I know, the select-modifiers functions don't fail and are called once the pattern has been matched.
The tricky thing here is that selectVOP3PMadMixModsImpl was used to check for the actual pattern (it needs to find an fpext of some operand).
Now we check for the fpext before selecting modifiers for the operands.
Remove the special-case getVOP3ModsImpl and FPEXT checks. Delete the copy build inside of selectVOP3ModsImpl, and move that code down to where the new instruction is constructed. You just need a new build-copy-to-VGPR helper.
llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-mul.ll
- Lines 173–174 (on Diff #475815): Should pre-contract this.
Removed the copy build from selectVOP3ModsImpl.
The copy is now built when adding operands to the selected instruction.
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
- Lines 3435–3437: The name isn't accurate since this isn't just blindly inserting a copy. How about copyToVGPRIfSrcFolded?
- Line 3438: Unrelated, but I don't understand why this modifier check is here. It doesn't factor into the constant bus restriction.
- Line 3680: Could you put this down in the callback with the addReg (same with the rest)?
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
- Line 3438: I wanted to remove both checks and also build the copy into a VGPR class, but there were way too many changes in the regression tests (mostly MIR). In the default use case (to my understanding at least) all checks could be removed.
Renamed the helper function and moved the call inside the lambda body. Copies are now created when the src operand is being added to the new instruction (MIB), and they are inserted before it.
llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-mul.ll
- Lines 173–174 (on Diff #475815): Should still use the fmuladd or fma intrinsic here.