Due to the encoding changes in GFX11, we had a hack in place that disables the use of VGPRs above 128. This patch removes the need for that hack. We introduce a new register class VGPR_32_Lo128 which is used for 16-bit operands of VOP1, VOP2, and VOPC instructions. This register class only has the low 128 VGPRs, but is otherwise identical to VGPR_32. Therefore, 16-bit VOP1, VOP2, and VOPC instructions are correctly limited to use the first 128 VGPRs, while the other instructions can freely use all 256. We introduce new pseduo-instructions used on GFX11 which have the suffix t16 (True 16) to use the VGPR_32_Lo128 register class.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Joe, is this class really needed? The patch overhauls all VOP 16 bit instructions with none of them turned into True16 [just yet?]. A true 16 bit instruction shall use a 16 bit register, not a 32 bit VGPR. I.e. operands shall belong to a class composed of (VGPR_LO16, VGPR_HI16) or (VGPR_LO16, VGPR_HI16, SGPR_LO16) if scalars are accepted. Therefore, I would expect special classes limiting those, not VGPR_32 itself. As is the patch limits the use of the 16 bit operations to half of the available registers. What's the plan here?
llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp | ||
---|---|---|
154 | No else after return. |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2880 | You probably do not need PSet for it, it is not handled anywhere. |
llvm/lib/Target/AMDGPU/GCNDPPCombine.cpp | ||
---|---|---|
126 | Don't need the llvm:: (there are a few occurrences in this file and in SIShrinkInstructions.cpp). | |
127 | Maybe add a brief comment here? These instructions are "shrinkable" but we don't want to shrink them pre-RA because <reasons>. | |
606–608 | This may be true but I don't see why you're asserting it here in particular. Couldn't this go into isShrinkable, near the other check that you added? |
llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-sitofp.mir | ||
---|---|---|
91–93 | This is undesirable. We don't want to see _f128 classes here. We should be selecting the _e64 form of the V_CVT instruction instead. If you rebase on 3743f9afeb51e0b7bdf2269583f32b7e35369168 you will see the same problem for fptosi as well as sitofp. The fix is to change the patterns in SIInstructions.td to always use the _e64 forms: https://reviews.llvm.org/differential/diff/459699/ |
We introduce a new register class VGPR_32_F128 which is used for
encodings VOP1, VOP2, and VOPC. This register class only has the first
128 VGPRs, but is otherwise identical to VGPR_32. Therefore, VOP1, VOP2,
and VOPC instructions are correctly limited to use the first 128
VGPRs, while the other instructions can freely use all 256.
This paragraph needs to explain that the new register class is used for 16-bit operands only. The whole point of this patch is that 32-bit operands will no longer be restricted to using the first 128 VGPRs.
llvm/lib/Target/AMDGPU/SIRegisterInfo.td | ||
---|---|---|
557 | I don't know what "_F128" is supposed to mean. I read this as a class for long double |
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
---|---|---|
3279–3281 | My understanding is that having a _T16_e32 instruction here should work correctly, but it is undesirable because it has more restrictive register classes than the _T16_e64 form. So perhaps we want a more general assertion in (or just before) the register allocator that there are no _T16_e32 instructions present? I'm not sure how or where to implement that. |
This is not the intended final state of true 16 bit instructions. You are correct that a full solution would use VGPR_LO16, VGPR_HI16 or something like that. However, this patch is a step towards that. This patch is strictly better than what we had before (modulo some shrinking issues), because only 16-bit operands of VOP1, VOP2, and VOPC instructions are limited to half the registers, rather than all operands of all instructions. There is very little wasted code as well, because VGPR_32_F128 can be directly replaced with VGPR_16_F128 in time. I expect tracking down code quality issues from using VGPR_16 to take quite some time, hence why I have submitted the patch at this point.
done
llvm/lib/Target/AMDGPU/GCNDPPCombine.cpp | ||
---|---|---|
606–608 | isShrinkable guards against shrinking in this pass, but this assert was supposed to check if previous passes shrunk anything. In theory, it's not necessary, but I put the assert for future proofing. | |
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
3279–3281 | From the perspective of the RA, allocating a _T16_e32 will work correctly. In this particular function, there is a restriction. V_FMAC_F16_T16_e32 -> V_FMA_F16_gfx9_e64 should not be allowed, because the register class in the former is VGPR_32_F128 and in the latter is VGPR_32. So it would require a COPY or something, which we are not doing. | |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
2880 | I don't fully understand this, but MachineLICMBase calls getRegPressureSetLimit and will hit llvm_unreachable without the PSet for VGPR_32_F128 | |
llvm/lib/Target/AMDGPU/SIRegisterInfo.td | ||
557 | It is short for First 128. Is a set with only the first 128 VGPRs. I will add a comment noting this. | |
llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp | ||
144 | That seems fine to me. I will rebase on that when its landed. | |
llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-sitofp.mir | ||
91–93 | I picked up those changes, thanks! |
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp | ||
---|---|---|
2880 | Add let GeneratePressureSet = 0; to the RC definition and remove this. |
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
---|---|---|
3279–3281 | Is a COPY needed in that situation? I thought perhaps the register allocator would just handle it, if the register classes overlapped. But I'm not sure. Anyway I guess this assert is fine for now at least. |
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
---|---|---|
3279–3281 | The MIR verifier will fail immediately after the twoaddressinstruction pass if you make an instruction like that. |
llvm/lib/Target/AMDGPU/SIInstructions.td | ||
---|---|---|
928 | Thanks, I have fixed the names. | |
987 | This is simply to account for the fact that we have new pseudo instructions for each instruction with 16 bit operands on GFX11. Therefore if a pattern directly directly refers to an instruction we need to duplicate it, one for each pseudo. | |
llvm/lib/Target/AMDGPU/SIRegisterInfo.td | ||
557 | Ok, replaced _F128 with _Lo128. | |
llvm/lib/Target/AMDGPU/VOP3Instructions.td | ||
928–931 | replaced _T16 with _t16 everywhere |
replaced _T16 with _t16 and _F128 with _Lo128. corrected name of CVT instruction in isel pat
Update commit message for renaming and the separate patch disabling pre-RA shrinking.
Can you copy the new commit message into the summary of this patch? You have to do that manually.
llvm/lib/Target/AMDGPU/SIInstructions.td | ||
---|---|---|
987 |
Yes. In future the _t16_e64 version will use 16-bit register classes for 16-bit operands but the regular _e64 version will continue to use 32-bit classes for them. | |
llvm/lib/Target/AMDGPU/SIRegisterInfo.td | ||
557 | Just saying: Lo128 is not ideal either because it will be confused with the _LO16/HI16 classes which mean something completely different. |
Updated some folding/shrinking cases that needed to have true16 pseudo instructions added.
Use the correct t16 instruction when setting mode reg to fix an issue found in testing. All known issue are resolved, please take a final look.
llvm/lib/Target/AMDGPU/SIModeRegister.cpp | ||
---|---|---|
177 ↗ | (On Diff #461328) | This is fine for now. As a follow up I'll try changing FPTRUNC_UPWARD_PSEUDO into an e64 instruction, so we can go back to using a simple setDesc call here. |
Don't need the llvm:: (there are a few occurrences in this file and in SIShrinkInstructions.cpp).