Page MenuHomePhabricator

[AMDGPU] Make vector superclasses allocatable
ClosedPublic

Authored by cdevadas on Sep 5 2021, 10:15 PM.

Details

Summary

The combined vector register classes with both
VGPRs and AGPRs are currently unallocatable.
This patch turns them into allocatable as a
prerequisite to enable copy between VGPR and
AGPR registers during regalloc.

Also, added the missing AV register classes from
192b to 1024b.

Diff Detail

Event Timeline

cdevadas created this revision.Sep 5 2021, 10:15 PM
cdevadas requested review of this revision.Sep 5 2021, 10:15 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 5 2021, 10:15 PM
foad added a subscriber: critson.Sep 6 2021, 2:04 AM

+@critson for awareness.

arsenm added a comment.Sep 7 2021, 4:56 PM

Can you specifically mention VGPRs and AGPRs in the commit message/description

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
11446

Can you shortcut the getRegClassForReg if not virtual?

cdevadas updated this revision to Diff 371594.Sep 9 2021, 8:02 AM
cdevadas retitled this revision from [AMDGPU] Make vector super classes allocatable to [AMDGPU] Make vector superclasses allocatable.
cdevadas edited the summary of this revision. (Show Details)

Allow generating psets for AV reglcass.
Addressed the review comments.

cdevadas updated this revision to Diff 372125.Sep 12 2021, 9:48 AM

Redefined multiclass AVRegClass to take vregList and aregList separately and applied the decimate operation on them individually.

arsenm added inline comments.Sep 13 2021, 1:26 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2790

This logically doesn't flow right. This shouldn't be checking hasVectorRegisters, and just use isVGPR. You override this decision later with the AGPR class

cdevadas added inline comments.Sep 13 2021, 7:45 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2790

That check was added to consider AV classes too. The opcode is later changed to V_ACCVGPR_WRITE_B32_e64 if t is AGPR only class.

cdevadas updated this revision to Diff 373501.Sep 20 2021, 1:08 AM

Rebase
+ Ping

Do you know how RA will chose registers for an AV operand? V, A, and AV seem to have same AllocationPriority, so what exactly RA will be doing? At least on gfx908 we would want it to pick least congested RC. Well, after allocating everything which truly requires either V or A.
In fact I believe it will do directly the opposite, given that wider tuples have higher priority, so RA will likely start with AV before any other allocation is considered.

llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
81

So for AV and VS it will always tell VGPR? It does not seem conceptually right.

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
262

Now that AV is allocatable I suspect we will miss a lot of optimizations with multiple changes like this (e.g. hasAGPRs -> isAGPRClass) and maybe even make wrong and illegal decisions because we will be unable to correctly identify if that is a VGPR or AGPR presented an AV.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
896

What will happen if that is AV?

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2431

I do not think there can be a separate pressure on AV. What does that mean for AV pressure if you have pressure 256 on V and 256 on A? It cannot be increased or decreased separately.

Do you know how RA will chose registers for an AV operand? V, A, and AV seem to have same AllocationPriority, so what exactly RA will be doing?

You can set an explicit allocation order for a class. I think what you get now is VGPRs are higher priority than AGPRs. The real point is that RA can decide to introduce RA temporary registers to relieve pressure

Do you know how RA will chose registers for an AV operand? V, A, and AV seem to have same AllocationPriority, so what exactly RA will be doing?

You can set an explicit allocation order for a class. I think what you get now is VGPRs are higher priority than AGPRs. The real point is that RA can decide to introduce RA temporary registers to relieve pressure

An explicit order interleaving registers is probably needed at least on gfx908. However, it does not solve all the issues. Imagine you have set it and allocated 3x32 VGPR tuples and 3x32 AGPR tuples as a result. Then RA will start allocating smaller registers and needs 64 more VGPRs. It will end up with 96 + 64 = 160 VGPRs and 96 AGPRs, where it could get away with 128 of each class. This 2x performance difference.

Yet another question concerns gfx90a. Assume we are reading matrix C from memory into a register tuple. The mfma would need to use AGPR, but the load may use either AGPR or VGPR and has AV operand (same for a store). How likely will it happen that a VGPR tuple will be used for the load and then copied into AGPR? Can this happen?

Yet another question concerns gfx90a. Assume we are reading matrix C from memory into a register tuple. The mfma would need to use AGPR, but the load may use either AGPR or VGPR and has AV operand (same for a store). How likely will it happen that a VGPR tuple will be used for the load and then copied into AGPR? Can this happen?

Ideally we would have the concrete classes chosen ahead of time (i.e. regbankselect analyses the uses and defs and sets the bank, resulting in the concrete class). The main reason to make this change is to allow the allocator to split ranges with cross class copies. We wouldn't generally want AV_* classes in the incoming MIR.

Yet another question concerns gfx90a. Assume we are reading matrix C from memory into a register tuple. The mfma would need to use AGPR, but the load may use either AGPR or VGPR and has AV operand (same for a store). How likely will it happen that a VGPR tuple will be used for the load and then copied into AGPR? Can this happen?

Ideally we would have the concrete classes chosen ahead of time (i.e. regbankselect analyses the uses and defs and sets the bank, resulting in the concrete class). The main reason to make this change is to allow the allocator to split ranges with cross class copies. We wouldn't generally want AV_* classes in the incoming MIR.

In general selector cannot do it. You need to know RP by class and chose an RC accordingly. Moreover, on gfx90a you would also need to chose an instruction accordingly.

Yet another question concerns gfx90a. Assume we are reading matrix C from memory into a register tuple. The mfma would need to use AGPR, but the load may use either AGPR or VGPR and has AV operand (same for a store). How likely will it happen that a VGPR tuple will be used for the load and then copied into AGPR? Can this happen?

Ideally we would have the concrete classes chosen ahead of time (i.e. regbankselect analyses the uses and defs and sets the bank, resulting in the concrete class). The main reason to make this change is to allow the allocator to split ranges with cross class copies. We wouldn't generally want AV_* classes in the incoming MIR.

In general selector cannot do it. You need to know RP by class and chose an RC accordingly. Moreover, on gfx90a you would also need to chose an instruction accordingly.

Isn't it possible to select legal register classes by having constrained register operands in the instruction definition itself?

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
262

I agree we might miss out on certain optimizations. That should be handled (separately).
COPY is a special instruction, the machine opcode changes based on the target register for AMDGPU.
Any Pre-RA copy instance with AV class, I am forcing it to the VGPR class.

But illegal decisions should be averted. If it happens, mostly occurs at the selection.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
896

I guess this function, copyPhysReg will only be called post-RA.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2431

The assert here https://github.com/llvm/llvm-project/blob/main/llvm/utils/TableGen/CodeGenRegisters.cpp#L2028
enforces at least one PSet when we make the AV classes allocatable.

There should have been a flag to entirely avoid the psets for allocatable regclasses.

In general selector cannot do it. You need to know RP by class and chose an RC accordingly. Moreover, on gfx90a you would also need to chose an instruction accordingly.

Isn't it possible to select legal register classes by having constrained register operands in the instruction definition itself?

You could do specific instructions. Moreover, gfx90a has introduced _acd andf _vcd mfma variants because register classes of SrcC and Dst are tied. For SrcA and SrcB we actually have a freedom to use either RC independently. So we need to make a decision which register class to use and then which instruction to use consequentially. That decision cannot be reasonably done at selection time, in fact it can only be done during regalloc when we know actual pressure. Moreover, the decision will be different for gfx908 and gfx90a. For gfx908 you would want to allocate equal amount of both types of registers, and for gfx90a you would want to stay with VGPRs as long as you can. Then changing an RC will also require replacement of other instructions dealing with it, like v_accvgpr_* will turn into v_mov_b32 and vice versa, some v_accvgpr_* instructions dropped etc. And all of that is only possible inside the RA.

Right now we do not do anything like this, SrcC and Dst are always AGPRs, then SrcA and SrcB are VGPRs I believe. It is a suboptimal but easy choice. But since you are enabling AV operands you are leaving this to RA to decide, and I have a question how exactly will it decide? We could constraint reg classes of mfma operands just like we do now. That is probably an easiest way to at least preserve the functionality.

One thing to keep in mind: and allocatable AV class directly affects instructions we have to generate, not just select but also create in many different passes. I.e. if the specific class is AGPR we have to use v_accvgpr_*, if that is VGPR we have to use v_mob_b32. Then v_accvgpr_write_b32 does not work with immediates on gfx908, it does not work with SGPR sources, loads and stores do not work with AGPRs in gfx908 etc. I.e. that would be a huge and unsound codegen change across the BE.

With all of that I assume we have to always use specific RC for everything we select and generate. We basically cannot use AV as an operand of any instruction we have produced. The only use for the allocatable AV shall be RA copying registers instead of spilling. I do not think that is what this change is doing now.

llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
262

The point of this code is to avoid unneeded copies between VGPRs and AGPRs. Forcing it to VGPRs misses the point.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
896

Ah, right.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2431

There is GeneratePressureSet to avoid it, but the issue is conceptual. I do not understand a strategy of tracking pressure for a combined class.

rampitec added inline comments.Sep 21 2021, 12:48 PM
llvm/lib/Target/AMDGPU/GCNRegPressure.cpp
81

Given the last comment, if we have no real AV operands in any instructions this would be fine.

llvm/lib/Target/AMDGPU/SIFoldOperands.cpp
730

Then this shall not be needed because we should not have situations like this.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
11447

And this shall remain as is to not allow any live AV operands.

I have just applied the patch and verified there are no live AV operands after selection and basically anywhere. So I guess that clears the concerns of how RA would select an actual RC - it would not. This is what we need I suppose.
I am still unclear about PSet though. Maybe it is better to suppress PSet generation?

rampitec added inline comments.Sep 21 2021, 3:17 PM
llvm/lib/Target/AMDGPU/SIFixSGPRCopies.cpp
262

Given no actual AV operand this is OK.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
2790

You should not have any AV at this point, these all are resolved. Then it was working with AGPRs already. This portion does not seem to be needed.

However, reverting it revealed problem with global isel (llc -global-isel -mtriple=amdgcn-mesa-mesa3d -mcpu=gfx908 -o - -verify-machineinstrs < llvm/test/CodeGen/AMDGPU/ds_gws_align.ll):

%13:av_32 = COPY %8:sreg_32
%14:av_32 = COPY %9:sreg_32

You have w/a this by folding the immediate, but really global isel shall not produce AV operands.

4854

It is still isAGPR. No need to legalize VGPR.

llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2420

So again this is not that simple with a combined class. This is a lower bound, probably twice smaller than in reality. But I guess we cannot answer better.

llvm/lib/Target/AMDGPU/SIWholeQuadMode.cpp
1430 ↗(On Diff #373501)

It can only be VGPR here.

1440 ↗(On Diff #373501)

Ditto.

cdevadas added inline comments.Sep 22 2021, 5:59 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2431

GeneratePressureSet doesn't help to avoid the PSet entirely. As you can see, for all AV classes except AV_32, GeneratePressureSet is currently zero. The moment I reset this flag for AV_32, it asserts as I indicated earlier.

rampitec added inline comments.Sep 22 2021, 10:40 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2431

I did not see that assert because I was using optimized tablegen. Shall we omit the assert if GeneratePressureSet = 0? On practice it passes testing with GeneratePressureSet = 0 and optimized tablegen.

cdevadas added inline comments.Sep 22 2021, 11:01 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2431

Yes, it all worked well with no Psets generated for AV classes when I used the optimized tablegen.
I can post a patch to remove that assertion.

rampitec added inline comments.Sep 22 2021, 11:11 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.cpp
2431

Maybe just add || !GeneratePressureSet to the assertion?

cdevadas updated this revision to Diff 374601.Sep 23 2021, 10:22 AM

Removed psets entirely for AV classes + Addressed other comments.

Thanks. FoldImmediate() and Global ISel seems to be last issue to me.

Thanks. FoldImmediate() and Global ISel seems to be last issue to me.

The MAIInst uses AVSrc_32 for src0 & src1 operands.
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/VOP3PInstructions.td#L378

I am not sure we can do something to prevent GIsel from choosing AV classes here.
Yes, the w/a in FoldImmediate was done to adjust the register class as per the opcode we choose.

rampitec added a comment.EditedSep 24 2021, 2:06 AM

Thanks. FoldImmediate() and Global ISel seems to be last issue to me.

The MAIInst uses AVSrc_32 for src0 & src1 operands.
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/VOP3PInstructions.td#L378

I am not sure we can do something to prevent GIsel from choosing AV classes here.
Yes, the w/a in FoldImmediate was done to adjust the register class as per the opcode we choose.

Right, that's the only place where we currently have the freedom to chose either v or a, and it is supposed to be resolved at selection. The need to use AV class here comes from MC support so that asm will be free to use any register.

I believe custom code in regbank select can refine the register class for these 2 operands.

I have to note that we cannot rely on the optimization for correctness, if FoldImmediate will not run or succeed we will have broken IR and failure, so it has to be fixed in any case.

Thanks. FoldImmediate() and Global ISel seems to be last issue to me.

The MAIInst uses AVSrc_32 for src0 & src1 operands.
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/VOP3PInstructions.td#L378

I am not sure we can do something to prevent GIsel from choosing AV classes here.
Yes, the w/a in FoldImmediate was done to adjust the register class as per the opcode we choose.

Regbankselect is only implemented in a way that should just work and is probably not optimal. In the future we could improve the selector to take the hint of the incoming bank to select the specific A/V class

Thanks. FoldImmediate() and Global ISel seems to be last issue to me.

The MAIInst uses AVSrc_32 for src0 & src1 operands.
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/VOP3PInstructions.td#L378

I am not sure we can do something to prevent GIsel from choosing AV classes here.
Yes, the w/a in FoldImmediate was done to adjust the register class as per the opcode we choose.

Regbankselect is only implemented in a way that should just work and is probably not optimal. In the future we could improve the selector to take the hint of the incoming bank to select the specific A/V class

The problem is exactly it does not work if AV is selected. This needs to be resolved to V or A at selection, otherwise you get a compilation crash (and actually cannot reasonably select related instructions, it will not work).

Thanks. FoldImmediate() and Global ISel seems to be last issue to me.

The MAIInst uses AVSrc_32 for src0 & src1 operands.
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/VOP3PInstructions.td#L378

I am not sure we can do something to prevent GIsel from choosing AV classes here.
Yes, the w/a in FoldImmediate was done to adjust the register class as per the opcode we choose.

Something is missing code to constrain or insert a constraining copy when the instructions are initially emitted

Thanks. FoldImmediate() and Global ISel seems to be last issue to me.

The MAIInst uses AVSrc_32 for src0 & src1 operands.
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/AMDGPU/VOP3PInstructions.td#L378

I am not sure we can do something to prevent GIsel from choosing AV classes here.
Yes, the w/a in FoldImmediate was done to adjust the register class as per the opcode we choose.

Something is missing code to constrain or insert a constraining copy when the instructions are initially emitted

During InstructionSelect, the MIOperands' regclasses are assigned for the first time with llvm::constrainOperandRegClass.
https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/GlobalISel/Utils.cpp#L104 indicates that the regclasses are derived from the instruction definition itself.
Only when the regclass is unallocatable, the selected regbank is used to restrain the regclass for the MO (https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/GlobalISel/Utils.cpp#L114).
This is the reason SRC0 & SRC1 of MAIInst get AV class even though regbankSelect has managed to give them both VRegBank.
Should we restrain the MIOperand regclasses based on the incoming regbank when the corresponding RC in the instruction definition is its superclass?

Should we restrain the MIOperand regclasses based on the incoming regbank when the corresponding RC in the instruction definition is its superclass?

Yes, I'm surprised this doesn't happen already. Just the instruction definition can be ambiguous

Should we restrain the MIOperand regclasses based on the incoming regbank when the corresponding RC in the instruction definition is its superclass?

Yes, I'm surprised this doesn't happen already. Just the instruction definition can be ambiguous

Posted D112323.

cdevadas updated this revision to Diff 383645.Oct 31 2021, 3:01 AM

Removed the workaround post-Selection that forced a Vreg class for the occurrences of AV classes. With D112323, we no longer choose AV superclasses during instruction selection. They either become a Vreg class or an Areg class.

This revision is now accepted and ready to land.Nov 1 2021, 12:01 PM
This revision was automatically updated to reflect the committed changes.