This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU][GFX11] Use VGPR_32_Lo128 for VOP1,2,C
ClosedPublic

Authored by Joe_Nash on Sep 12 2022, 1:11 PM.

Details

Summary
Due to the encoding changes in GFX11, we had a hack in place that
disables the use of VGPRs above 128. This patch removes the need for
that hack.

We introduce a new register class VGPR_32_Lo128 which is used for 16-bit
operands of VOP1, VOP2, and VOPC instructions. This register class only has the
low 128 VGPRs, but is otherwise identical to VGPR_32. Therefore, 16-bit VOP1,
VOP2, and VOPC instructions are correctly limited to use the first 128
VGPRs, while the other instructions can freely use all 256.

We introduce new pseduo-instructions used on GFX11 which have the suffix
t16 (True 16) to use the VGPR_32_Lo128 register class.

Diff Detail

Event Timeline

Joe_Nash created this revision.Sep 12 2022, 1:11 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 12 2022, 1:11 PM
Joe_Nash requested review of this revision.Sep 12 2022, 1:11 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 12 2022, 1:11 PM
Joe_Nash added reviewers: foad, arsenm, rampitec, nhaehnle, dp, Restricted Project.Sep 12 2022, 1:12 PM

Joe, is this class really needed? The patch overhauls all VOP 16 bit instructions with none of them turned into True16 [just yet?]. A true 16 bit instruction shall use a 16 bit register, not a 32 bit VGPR. I.e. operands shall belong to a class composed of (VGPR_LO16, VGPR_HI16) or (VGPR_LO16, VGPR_HI16, SGPR_LO16) if scalars are accepted. Therefore, I would expect special classes limiting those, not VGPR_32 itself. As is the patch limits the use of the 16 bit operations to half of the available registers. What's the plan here?

llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
155

No else after return.

rampitec added inline comments.Sep 12 2022, 2:31 PM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2880 ↗(On Diff #459540)

You probably do not need PSet for it, it is not handled anywhere.

foad added inline comments.Sep 13 2022, 2:14 AM
llvm/lib/Target/AMDGPU/GCNDPPCombine.cpp
126

Don't need the llvm:: (there are a few occurrences in this file and in SIShrinkInstructions.cpp).

127

Maybe add a brief comment here? These instructions are "shrinkable" but we don't want to shrink them pre-RA because <reasons>.

608–610

This may be true but I don't see why you're asserting it here in particular. Couldn't this go into isShrinkable, near the other check that you added?

foad added inline comments.Sep 13 2022, 3:09 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-sitofp.mir
92–93

This is undesirable. We don't want to see _f128 classes here. We should be selecting the _e64 form of the V_CVT instruction instead. If you rebase on 3743f9afeb51e0b7bdf2269583f32b7e35369168 you will see the same problem for fptosi as well as sitofp.

The fix is to change the patterns in SIInstructions.td to always use the _e64 forms: https://reviews.llvm.org/differential/diff/459699/
Pre-GFX11 this should be pretty harmless even if selecting the _e64 forms doesn't actually give any benefit.

foad added a comment.Sep 13 2022, 3:16 AM

We introduce a new register class VGPR_32_F128 which is used for
encodings VOP1, VOP2, and VOPC. This register class only has the first
128 VGPRs, but is otherwise identical to VGPR_32. Therefore, VOP1, VOP2,
and VOPC instructions are correctly limited to use the first 128
VGPRs, while the other instructions can freely use all 256.

This paragraph needs to explain that the new register class is used for 16-bit operands only. The whole point of this patch is that 32-bit operands will no longer be restricted to using the first 128 VGPRs.

arsenm added inline comments.Sep 13 2022, 5:01 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.td
557

I don't know what "_F128" is supposed to mean. I read this as a class for long double

foad added inline comments.Sep 13 2022, 5:23 AM
llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
145

I would prefer to change SIShrinkInstructions in a more general way that does not specifically check for True16 instructions: D133769

foad added inline comments.Sep 13 2022, 5:44 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
3294–3296

My understanding is that having a _T16_e32 instruction here should work correctly, but it is undesirable because it has more restrictive register classes than the _T16_e64 form. So perhaps we want a more general assertion in (or just before) the register allocator that there are no _T16_e32 instructions present? I'm not sure how or where to implement that.

Joe, is this class really needed? The patch overhauls all VOP 16 bit instructions with none of them turned into True16 [just yet?]. A true 16 bit instruction shall use a 16 bit register, not a 32 bit VGPR. I.e. operands shall belong to a class composed of (VGPR_LO16, VGPR_HI16) or (VGPR_LO16, VGPR_HI16, SGPR_LO16) if scalars are accepted. Therefore, I would expect special classes limiting those, not VGPR_32 itself. As is the patch limits the use of the 16 bit operations to half of the available registers. What's the plan here?

This is not the intended final state of true 16 bit instructions. You are correct that a full solution would use VGPR_LO16, VGPR_HI16 or something like that. However, this patch is a step towards that. This patch is strictly better than what we had before (modulo some shrinking issues), because only 16-bit operands of VOP1, VOP2, and VOPC instructions are limited to half the registers, rather than all operands of all instructions. There is very little wasted code as well, because VGPR_32_F128 can be directly replaced with VGPR_16_F128 in time. I expect tracking down code quality issues from using VGPR_16 to take quite some time, hence why I have submitted the patch at this point.

Joe_Nash marked 6 inline comments as done.Sep 13 2022, 12:27 PM

We introduce a new register class VGPR_32_F128 which is used for
encodings VOP1, VOP2, and VOPC. This register class only has the first
128 VGPRs, but is otherwise identical to VGPR_32. Therefore, VOP1, VOP2,
and VOPC instructions are correctly limited to use the first 128
VGPRs, while the other instructions can freely use all 256.

This paragraph needs to explain that the new register class is used for 16-bit operands only. The whole point of this patch is that 32-bit operands will no longer be restricted to using the first 128 VGPRs.

done

llvm/lib/Target/AMDGPU/GCNDPPCombine.cpp
608–610

isShrinkable guards against shrinking in this pass, but this assert was supposed to check if previous passes shrunk anything. In theory, it's not necessary, but I put the assert for future proofing.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
3294–3296

From the perspective of the RA, allocating a _T16_e32 will work correctly.

In this particular function, there is a restriction. V_FMAC_F16_T16_e32 -> V_FMA_F16_gfx9_e64 should not be allowed, because the register class in the former is VGPR_32_F128 and in the latter is VGPR_32. So it would require a COPY or something, which we are not doing.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2880 ↗(On Diff #459540)

I don't fully understand this, but MachineLICMBase calls getRegPressureSetLimit and will hit llvm_unreachable without the PSet for VGPR_32_F128

llvm/lib/Target/AMDGPU/SIRegisterInfo.td
557

It is short for First 128. Is a set with only the first 128 VGPRs. I will add a comment noting this.

llvm/lib/Target/AMDGPU/SIShrinkInstructions.cpp
145

That seems fine to me. I will rebase on that when its landed.

llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-sitofp.mir
92–93

I picked up those changes, thanks!

Joe_Nash updated this revision to Diff 459840.Sep 13 2022, 12:27 PM
Joe_Nash marked 3 inline comments as done.

followed review suggestions and rebased

rampitec added inline comments.Sep 13 2022, 1:29 PM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2880 ↗(On Diff #459540)

Add let GeneratePressureSet = 0; to the RC definition and remove this.

foad added a comment.Sep 14 2022, 2:07 AM

I'm happy with the general approach and the C++ parts. Can anyone take a closer look at the TableGen parts - maybe @dp or @rampitec? Thanks!

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
3294–3296

Is a COPY needed in that situation? I thought perhaps the register allocator would just handle it, if the register classes overlapped. But I'm not sure. Anyway I guess this assert is fine for now at least.

Joe_Nash marked an inline comment as done.Sep 14 2022, 7:29 AM
Joe_Nash added inline comments.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
3294–3296

The MIR verifier will fail immediately after the twoaddressinstruction pass if you make an instruction like that.

arsenm added inline comments.Sep 14 2022, 10:15 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.td
557

Lo128 would probably be more consistent terminology over "first"

llvm/lib/Target/AMDGPU/VOP3Instructions.td
928–931

Since the _T16 doesn't appear in the instruction mnemonic it should be lowercased

dp added inline comments.Sep 14 2022, 11:20 AM
llvm/lib/Target/AMDGPU/SIInstructions.td
928

The names cvt16to32_e64 and cvt32to16_e64 are misleading. They should be the other way around.

987

Why are these patterns special? _E64 variants have no VGPR limitations. Is this required for future changes?

rampitec added inline comments.Sep 14 2022, 2:07 PM
llvm/lib/Target/AMDGPU/SIRegisterInfo.td
557

I probably agree to that. F128 also hints a long double to me.
With that I still hope this class is transitional and will be replaced by a real 16 bit RC.

Joe_Nash marked 4 inline comments as done.Sep 14 2022, 2:27 PM
Joe_Nash added inline comments.
llvm/lib/Target/AMDGPU/SIInstructions.td
928

Thanks, I have fixed the names.

987

This is simply to account for the fact that we have new pseudo instructions for each instruction with 16 bit operands on GFX11. Therefore if a pattern directly directly refers to an instruction we need to duplicate it, one for each pseudo.

llvm/lib/Target/AMDGPU/SIRegisterInfo.td
557

Ok, replaced _F128 with _Lo128.
Yes, this is a transitional class.

llvm/lib/Target/AMDGPU/VOP3Instructions.td
928–931

replaced _T16 with _t16 everywhere

Joe_Nash updated this revision to Diff 460234.Sep 14 2022, 2:28 PM
Joe_Nash marked 3 inline comments as done.

replaced _T16 with _t16 and _F128 with _Lo128. corrected name of CVT instruction in isel pat

This revision is now accepted and ready to land.Sep 14 2022, 2:39 PM
Joe_Nash updated this revision to Diff 460247.Sep 14 2022, 3:15 PM

correct comment F128 -> Lo128

Joe_Nash updated this revision to Diff 460250.Sep 14 2022, 3:19 PM

Update commit message for renaming and the separate patch disabling pre-RA shrinking.

foad added a comment.Sep 15 2022, 3:23 AM

Update commit message for renaming and the separate patch disabling pre-RA shrinking.

Can you copy the new commit message into the summary of this patch? You have to do that manually.

llvm/lib/Target/AMDGPU/SIInstructions.td
987

Is this required for future changes?

Yes. In future the _t16_e64 version will use 16-bit register classes for 16-bit operands but the regular _e64 version will continue to use 32-bit classes for them.

llvm/lib/Target/AMDGPU/SIRegisterInfo.td
557

Just saying: Lo128 is not ideal either because it will be confused with the _LO16/HI16 classes which mean something completely different.

Joe_Nash retitled this revision from [AMDGPU][GFX11] Use VGPR_32_F128 for VOP1,2,C to [AMDGPU][GFX11] Use VGPR_32_Lo128 for VOP1,2,C.Sep 15 2022, 7:02 AM
Joe_Nash edited the summary of this revision. (Show Details)
Joe_Nash updated this revision to Diff 460878.Sep 16 2022, 1:27 PM

Updated some folding/shrinking cases that needed to have true16 pseudo instructions added.

Joe_Nash updated this revision to Diff 461328.Sep 19 2022, 1:00 PM

Use the correct t16 instruction when setting mode reg to fix an issue found in testing. All known issue are resolved, please take a final look.

rampitec accepted this revision.Sep 19 2022, 1:15 PM

Still LGTM

foad accepted this revision.Sep 20 2022, 2:22 AM
foad added inline comments.
llvm/lib/Target/AMDGPU/SIModeRegister.cpp
177–189

This is fine for now. As a follow up I'll try changing FPTRUNC_UPWARD_PSEUDO into an e64 instruction, so we can go back to using a simple setDesc call here.

This revision was landed with ongoing or failed builds.Sep 20 2022, 7:28 AM
This revision was automatically updated to reflect the committed changes.