Page MenuHomePhabricator

[AMDGPU][RFC] Improve sgpr function arguments
Needs ReviewPublic

Authored by sebastian-ne on May 10 2021, 9:18 AM.

Details

Summary

The SGPR layout on functions calls currently looks like this:

+------------+--------------+-------------------------+-------------------+-------------------+------------------+

s[0:3] SRDarguments...s[30:31] return addresss32 stack pointers33 frame pointers34 base pointer

+------------+--------------+-------------------------+-------------------+-------------------+------------------+

The return address and stack pointer occupy multiple 4-aligned blocks
of SGPRs.
Large scalar memory reads require a 4-aligned block of SGPRs, so if less
of them are available, register allocation becomes more difficult.

The stack resource descriptor occupies SGPR0-3. If we want to pass
user-data SGPRS to a function, the SGPRs need to be moved from s[0:...]
to s[4:...] before the call.
This is also the case when flat scratch is used instead of the SRD, even
if s[0:4] is unused then, because the same call convention is used.

To improve this, I propose the following layout:

+--------------+-------------------+-------------------+-------------------------+--------------+------------------+

arguments...s32 stack pointers33 frame pointers[34:35] return addresss[36:39] SRDs40 base pointer

+--------------+-------------------+-------------------+-------------------------+--------------+------------------+

The stack and frame pointer stay where they are and the return address
is moved behind them, into s[34:35]. That way, only one 4-aligned block
is occupied.

To free s[0:3] for arguments, the SRD is moved to s[36:39]. This has
the effect, that all of s[0:31] can be used for arguments.

The base pointer is not used in the general case, so it is moved to
s40.

Using flat scratch instead of the SRD will leave a whole between the
return address and the base pointer (if used). So, alternatively we
could swap the base pointer and the SRD, leaving a 4-wide whole in the
common case and when the SRD is in use.

Diff Detail

Event Timeline

sebastian-ne created this revision.May 10 2021, 9:18 AM
sebastian-ne requested review of this revision.May 10 2021, 9:18 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 10 2021, 9:18 AM

We have another proposal we were working on to rearrange these a bit differently. We need to account for a few more inputs in the layout

We have another proposal we were working on to rearrange these a bit differently. We need to account for a few more inputs in the layout

As long as this remains in GFX land, we should be fine with it because our new proposal is for compute only (as of now).

We have another proposal we were working on to rearrange these a bit differently. We need to account for a few more inputs in the layout

As long as this remains in GFX land, we should be fine with it because our new proposal is for compute only (as of now).

I would like to keep the same calling convention in compute and graphics. At least regarding the stack pointer and others, because I don’t see a compelling reason to diverge even more. Actually, I’d like it if they were more common than they are now, because we implement some things twice at the moment.
The compute proposal should work just fine; if we move the stack and frame pointer, we end up with the same benefits as in this patch. I commented on the internal proposal for this (I hope I found the right one?).

We have another proposal we were working on to rearrange these a bit differently. We need to account for a few more inputs in the layout

As long as this remains in GFX land, we should be fine with it because our new proposal is for compute only (as of now).

I would like to keep the same calling convention in compute and graphics. At least regarding the stack pointer and others, because I don’t see a compelling reason to diverge even more. Actually, I’d like it if they were more common than they are now, because we implement some things twice at the moment.
The compute proposal should work just fine; if we move the stack and frame pointer, we end up with the same benefits as in this patch. I commented on the internal proposal for this (I hope I found the right one?).

Well, then you're saying unification of both ABIs and it is not discussed thoroughly internally. The layout needs to be documented and get reviewed internally before we can proceed with this patch.

t-tye added a comment.May 11 2021, 8:38 AM

We have another proposal we were working on to rearrange these a bit differently. We need to account for a few more inputs in the layout

As long as this remains in GFX land, we should be fine with it because our new proposal is for compute only (as of now).

I would like to keep the same calling convention in compute and graphics. At least regarding the stack pointer and others, because I don’t see a compelling reason to diverge even more. Actually, I’d like it if they were more common than they are now, because we implement some things twice at the moment.
The compute proposal should work just fine; if we move the stack and frame pointer, we end up with the same benefits as in this patch. I commented on the internal proposal for this (I hope I found the right one?).

Well, then you're saying unification of both ABIs and it is not discussed thoroughly internally. The layout needs to be documented and get reviewed internally before we can proceed with this patch.

I think the reason that compute was considering putting FP and BP at the high registers is because they can often be optimized away. So putting them there allows them to be contiguous with other callee user registers rather then leaving an unused hole that is harder for the register allocator to use. Is that reasonable? How does that work for gfx?

I am all for trying to unify the call convention for compute and gfx if possible:-)