This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Introduce more scratch registers in the ABI.
ClosedPublic

Authored by cdevadas on Mar 18 2020, 5:58 AM.

Details

Summary

The AMDGPU target has a convention that defines all VGPRs
(except the initial 32 argument registers) as callee-saved.
This convention is not always efficient: a callee that
requires many registers ends up emitting a large number of
spills, even though its caller needs only a few.

This patch revises the ABI by introducing more scratch registers
that a callee can freely use.
The 256 VGPRs are now divided into:

32 argument registers,
112 scratch registers, and
112 callee-saved registers.

The scratch registers and the CSRs are intermixed at regular
intervals (a split boundary of 8) to obtain a better occupancy.
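
To make the partitioning concrete, the following is a minimal sketch in plain C++ (not the patch's actual TableGen definitions) of the intended layout: v0-v31 are argument registers, and the remaining 224 VGPRs alternate between 8-wide scratch and callee-saved blocks. The assumption that the first block after the argument registers is a scratch block is made here for illustration only.

```cpp
#include <cstdio>

enum class VGPRKind { Argument, Scratch, CalleeSaved };

// Classify a VGPR number (0-255) under the revised convention described in
// the summary: 32 argument registers, then alternating 8-register blocks of
// scratch and callee-saved registers.
static VGPRKind classifyVGPR(unsigned Reg) {
  if (Reg < 32)
    return VGPRKind::Argument;                 // v0-v31: argument registers
  unsigned Block = (Reg - 32) / 8;             // 8-wide split boundary
  return (Block % 2 == 0) ? VGPRKind::Scratch  // assumed: even blocks scratch
                          : VGPRKind::CalleeSaved;
}

int main() {
  unsigned NumScratch = 0, NumCSR = 0;
  for (unsigned Reg = 0; Reg != 256; ++Reg) {
    VGPRKind K = classifyVGPR(Reg);
    NumScratch += (K == VGPRKind::Scratch);
    NumCSR += (K == VGPRKind::CalleeSaved);
  }
  // Should print 112 of each, matching the 32/112/112 split in the summary.
  std::printf("scratch: %u, callee-saved: %u\n", NumScratch, NumCSR);
  return 0;
}
```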

Diff Detail

Event Timeline

cdevadas created this revision.Mar 18 2020, 5:58 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 18 2020, 5:58 AM
arsenm added inline comments.Mar 18 2020, 7:26 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.td
220–223 ↗(On Diff #251047)

This part should be split into a separate change

cdevadas marked an inline comment as done.Mar 18 2020, 8:13 AM
cdevadas added inline comments.
llvm/lib/Target/AMDGPU/SIRegisterInfo.td
220–223 ↗(On Diff #251047)

Do you prefer it as a separate commit?

arsenm added inline comments.Mar 18 2020, 8:42 AM
llvm/lib/Target/AMDGPU/SIRegisterInfo.td
220–223 ↗(On Diff #251047)

Yes, this can be committed independently

cdevadas updated this revision to Diff 251138.Mar 18 2020, 11:35 AM
cdevadas edited the summary of this revision. (Show Details)

Reverted the changes made on the CostPerUse value.
It will go in a follow-up commit after this review.

I would expect this to be the preliminary commit

arsenm accepted this revision.Mar 20 2020, 9:08 AM
arsenm added inline comments.
llvm/docs/AMDGPUUsage.rst
8683

A description of why it's split this way may be helpful

This revision is now accepted and ready to land.Mar 20 2020, 9:08 AM
t-tye added inline comments.Mar 20 2020, 10:26 AM
llvm/docs/AMDGPUUsage.rst
8683

Is the striping being picked at 4 VGPRs to match the hardware VGPR allocation granularity (4 for <=GFX9 and 8 for >=GFX10)? How does this striping impact register file fragmentation? What is the impact of objects larger than 4 VGPRs being promoted to registers?

arsenm added inline comments.Mar 20 2020, 10:52 AM
llvm/docs/AMDGPUUsage.rst
8683

These aren't used for argument passing, so there's no concept of objects to consider

t-tye added inline comments.
llvm/docs/AMDGPUUsage.rst
8683

Also, this is an ABI-breaking change (as is the change to the handling of the wave scratch offset), so should the EI_ABIVERSION for each EI_OSABI in the ELF header be bumped? My thinking is no, since AMD has not yet started to support ISA-level linking or function pointers, so this change cannot affect any existing programs.
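
For reference, EI_OSABI and EI_ABIVERSION are bytes 7 and 8 of the ELF identification array (e_ident). The small C++ sketch below (not part of the patch) reads and prints them for a given object file; it shows the same fields reported by llvm-readelf --file-headers.

```cpp
#include <cstdio>

// Print the EI_OSABI and EI_ABIVERSION bytes of an ELF file.  These are
// bytes 7 and 8 of the e_ident[] identification array at the very start of
// every ELF object (the two fields discussed above).
int main(int argc, char **argv) {
  if (argc != 2) {
    std::fprintf(stderr, "usage: %s <elf-file>\n", argv[0]);
    return 1;
  }
  std::FILE *F = std::fopen(argv[1], "rb");
  if (!F) {
    std::fprintf(stderr, "cannot open %s\n", argv[1]);
    return 1;
  }
  unsigned char Ident[16] = {0};  // e_ident[EI_NIDENT]
  size_t Read = std::fread(Ident, 1, sizeof(Ident), F);
  std::fclose(F);
  // Check the 0x7f 'E' 'L' 'F' magic before trusting the other bytes.
  if (Read != sizeof(Ident) || Ident[0] != 0x7f || Ident[1] != 'E' ||
      Ident[2] != 'L' || Ident[3] != 'F') {
    std::fprintf(stderr, "not an ELF file\n");
    return 1;
  }
  std::printf("EI_OSABI      = %u\n", Ident[7]);
  std::printf("EI_ABIVERSION = %u\n", Ident[8]);
  return 0;
}
```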

arsenm added inline comments.Mar 20 2020, 11:26 AM
llvm/docs/AMDGPUUsage.rst
8683

We didn't have, and still don't have, an ABI worth considering for this

b-sumner marked an inline comment as done.Mar 20 2020, 11:44 AM
b-sumner added inline comments.
llvm/docs/AMDGPUUsage.rst
8683

Seems OK to me to not bump the version since we're not touching the kernel ABI and we don't have ISA linking.

cdevadas updated this revision to Diff 251739.Mar 20 2020, 1:06 PM

Added a brief description of the register split (AMDGPUUsage.rst).

arsenm accepted this revision.Mar 20 2020, 2:22 PM
arsenm added inline comments.
llvm/docs/AMDGPUUsage.rst
8709

guarantee is a bit too strong of a claim

t-tye added inline comments.Mar 20 2020, 5:35 PM
llvm/docs/AMDGPUUsage.rst
8683

These aren't used for argument passing, so there's no concept of objects to consider

But if the compiler wants to promote an object to contiguous registers, then it will end up spanning both clobbered and non-clobbered registers for objects larger than the granularity, forcing some spill/restore code for the parts that are in non-clobbered registers. If the striping were at a coarser granularity, this may be unnecessary whenever the object fits within the coarser granularity.

Do we think this could be an issue?
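
A minimal sketch of the fragmentation concern raised above, using the same illustrative layout assumption as before (a caller-saved stripe immediately after the 32 argument registers): it counts how many callee-saved VGPRs a contiguous tuple would touch for a given striping width, each of which would force a prologue/epilogue save/restore.

```cpp
#include <cstdio>

// Count how many callee-saved VGPRs the contiguous tuple
// v[Start .. Start + Size - 1] overlaps, given a striping width W after the
// 32 argument registers.  (Layout assumption, as above: the stripe starting
// at v32 is caller-saved/scratch, the next one callee-saved, and so on.)
static unsigned countCSRsTouched(unsigned Start, unsigned Size, unsigned W) {
  unsigned N = 0;
  for (unsigned Reg = Start; Reg < Start + Size; ++Reg) {
    if (Reg < 32)
      continue;                  // argument registers are never CSRs
    unsigned Stripe = (Reg - 32) / W;
    if (Stripe % 2 == 1)         // odd stripes: callee-saved
      ++N;
  }
  return N;
}

int main() {
  // With a 4-wide stripe, an 8-register tuple always overlaps a CSR block.
  std::printf("v[32..39], W=4: %u CSRs\n", countCSRsTouched(32, 8, 4));
  // With an 8-wide stripe, an aligned 8-register tuple can avoid CSRs...
  std::printf("v[32..39], W=8: %u CSRs\n", countCSRsTouched(32, 8, 8));
  // ...but a misaligned one still forces saves for the CSR part it covers.
  std::printf("v[36..43], W=8: %u CSRs\n", countCSRsTouched(36, 8, 8));
  return 0;
}
```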

cdevadas marked 2 inline comments as done.Mar 21 2020, 7:35 AM
cdevadas added inline comments.
llvm/docs/AMDGPUUsage.rst
8683

If you are still talking about passing the object, the compiler won't promote an object of large size into registers beyond the range defined by the convention (the first 32 VGPRs in our case, which we didn't split anyway).

8709

Will change it to 'get a better occupancy'

arsenm requested changes to this revision.Mar 21 2020, 9:04 AM

Can you add some tests stressing very wide VGPRs with calls? These would be dynamic vector indexing and some image intrinsics (or you could just use inline asm)

llvm/docs/AMDGPUUsage.rst
8683

We have a few contexts with single VGPR tuples > 4: some image instructions and indirect register indexing. We probably don't have any tests with calls stressing these cases

This revision now requires changes to proceed.Mar 21 2020, 9:04 AM
cdevadas marked an inline comment as done.Mar 21 2020, 10:29 AM
cdevadas added inline comments.
llvm/docs/AMDGPUUsage.rst
8683

Are you saying there are cases where more than 4 contiguous VGPRs need to be allocated by RA? If yes, can you tell me the instructions?
I was under the impression that we have instructions requiring at most 4 contiguous VGPRs (for instance, FLAT_LOAD_DWORDX4)

t-tye added inline comments.Mar 21 2020, 11:44 AM
llvm/docs/AMDGPUUsage.rst
8683

No, I am not talking about argument passing; I am talking about generating code for the body of the function. It would presumably be best for it to use only clobbered registers for values that do not need to be live across the calls it makes. So for objects larger than 4 dwords, that will mean prologue/epilogue spilling/restoring, and register shuffling at call sites.

8683

We do have T# and V# values used in image instructions that are larger than 4 dwords. The instructions that use them currently need them in SGPRs, but they may need to be moved from VGPRs (which would not need them to be contiguous). Do we also want to be sure the ABI will work for future hardware that may change, or are we OK with different ABIs for different targets?

t-tye requested changes to this revision.Mar 21 2020, 12:33 PM

Added @mjbedy to review.

t-tye added a reviewer: tpr.Mar 21 2020, 12:42 PM
rampitec added inline comments.Mar 23 2020, 1:13 PM
llvm/docs/AMDGPUUsage.rst
8683

There are some image instructions which have their address in a much longer VGPR tuple. If you look into SIRegisterInfo.td, our largest VGPR register class is 32 dwords long. I'd say this should also be the minimal interleave value.

arsenm added inline comments.Mar 23 2020, 1:50 PM
llvm/docs/AMDGPUUsage.rst
8683

But those are AGPRs, which last I checked were not considered preserved across calls anyway. The question is also how often those need to be live across calls, which is probably very rare.

rampitec added inline comments.Mar 23 2020, 2:12 PM
llvm/docs/AMDGPUUsage.rst
8683

AGPRs can be copied to VGPRs. Anyway, there are image addresses too which are pretty big.

Thank you all for the comments.
I can see that there are concerns with the current split boundary (4 VGPRs together), considering the fact that we have wide VGPR uses in certain scenarios (image instructions).
But, like Matt mentioned, how frequently do such scenarios occur?
Changing the split boundary to a large value would probably defeat the whole purpose of this patch: reducing the CSR spills and trying to ensure better occupancy.

What will happen if you need to pass VReg_1024 into a function? It might not be a common case, but will it even work?

It works, but that isn't changed by this patch. This is not changing the argument registers, which are all still in v0-v31. I believe the largest argument type we pass in registers is <8 x i32>, and we force <32 x i32> to be stack passed anyway

OK. What if such a register needs to be preserved by a caller? I guess there is no safe window for it, so it will be forced to spill. Then we will spill a whole huge register, not just a part of it, because we do not use sublane spilling (except to AGPRs), right?

Yes, that's what I would expect

Then probably interleave 4 is not the best choice. We may also need to adjust the cost of tuples to make aligned allocation more likely.

The cost of VGPR registers has been handled in https://reviews.llvm.org/D76417. This will ensure a balanced allocation of scratch registers and CSRs at every split (once the current patch is upstream).

It's a tradeoff between more CSR spills in general, and the case where a large tuple needs to live across a call (which I still expect to be extremely rare). AGPRs are never treated as CSRs anyway (and for the real AGPR use cases, calls are unlikely to be used)

D76417 changes the cost of VGPRs; I am speaking about tuples. Assume you interleave by 4. Now you can allocate something into v[64:68] or v[65:69]. If you use the latter, you will have a spill of 4 registers around any call site where it is live. If you use the former, you have a chance not to spill it.

mjbedy added inline comments.Mar 30 2020, 6:47 AM
llvm/docs/AMDGPUUsage.rst
8683

An additional concern I would have is the additional register fragmentation this might cause. This is essentially introducing an alignment requirement for wider VGPR allocations in these cases, which could increase register requirements. Has there been any investigation of the impact of this?

cdevadas updated this revision to Diff 261054.Apr 29 2020, 3:10 PM
cdevadas edited the summary of this revision. (Show Details)

Divided the VGPRs into an equal number of CSRs and scratch registers. Also added a test case for VGPR tuple allocation.

arsenm accepted this revision.May 1 2020, 11:21 AM

LGTM with test nits. This isn't perfect but we can't do much better now

llvm/test/CodeGen/AMDGPU/vgpr-tuple-allocation.ll
6 ↗(On Diff #261054)

inreg currently doesn't do anything for non-shaders, so you should remove it here.

8 ↗(On Diff #261054)

Typo: presreved

cdevadas updated this revision to Diff 261700.May 3 2020, 10:07 AM

Fixed the testcase.

arsenm accepted this revision.May 4 2020, 7:38 AM
t-tye accepted this revision.May 5 2020, 9:31 AM

LGTM. The call convention is still open to further refinement as more information is collected, but this appears to be an improvement so a good starting point.

This revision is now accepted and ready to land.May 5 2020, 9:31 AM
This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.May 5 2020, 10:53 AM

This breaks check-llvm on Windows: http://45.33.8.238/win/14566/step_11.txt

Please take a look, and revert if it takes a while to investigate. (Maybe the test just needs a triple?)