This is an archive of the discontinued LLVM Phabricator instance.


Differential D61437

[AArch64] Static (de)allocation of SVE stack objects.
ClosedPublic

Authored by sdesmalen on May 2 2019, 5:38 AM.


Details

Reviewers

thegameg
rovka
t.p.northover
efriedma
rengolin
greened

Commits

rG4f99b6f0fe42: [AArch64] Static (de)allocation of SVE stack objects.
rL373585: [AArch64] Static (de)allocation of SVE stack objects.

Summary

Adds support to AArch64FrameLowering to allocate fixed-stack SVE objects.

The focus of this patch is purely to allow the stack frame to
allocate/deallocate space for scalable SVE objects. More dynamic
allocation (at compile-time, i.e. determining placement of SVE objects
on the stack), or resolving frame-index references that include
scalable-sized offsets, are left for subsequent patches.

SVE objects are allocated in the stack frame as a separate region below
the callee-save area, and above the alignment gap. This is done so that
the SVE objects can be accessed directly from the FP at (runtime)
VL-based offsets to benefit from using the VL-scaled addressing modes.

The layout looks as follows:

+-------------+
| stack arg   |   
+-------------+
| Callee Saves|
|   X29, X30  |       (if available)
|-------------| <- FP (if available)
|     :       |   
|  SVE area   |   
|     :       |   
+-------------+
|/////////////| alignment gap.
|     :       |   
| Stack objs  |
|     :       |   
+-------------+ <- SP after call and frame-setup

SVE and non-SVE stack objects are distinguished using different
StackIDs. The offsets for objects with TargetStackID::SVEVector should be
interpreted as purely scalable offsets within their respective SVE region.
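To make the split between fixed and scalable offsets concrete, here is a minimal C++ sketch of an offset with two components. It is only resolved to bytes once the runtime vector length is known. The struct `SketchStackOffset` and its members are hypothetical and much simpler than the patch's AArch64StackOffset. It follows the patch's convention of measuring the scalable part in `MVT::nxv1i8` units (vscale x 1 byte), so one SVE vector is 16 scalable bytes.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of a frame offset with a fixed part and a
// scalable part. This is NOT the AArch64StackOffset class from the patch.
struct SketchStackOffset {
  int64_t FixedBytes;    // known at compile time
  int64_t ScalableBytes; // in nxv1i8 units: multiplied by vscale at runtime

  SketchStackOffset operator+(const SketchStackOffset &O) const {
    return {FixedBytes + O.FixedBytes, ScalableBytes + O.ScalableBytes};
  }

  // Resolve to a concrete byte offset once the vector length is known.
  // vscale is the number of 128-bit granules in one SVE vector.
  int64_t resolve(unsigned VLBits) const {
    assert(VLBits >= 128 && VLBits <= 2048 && VLBits % 128 == 0 &&
           "SVE vector length must be a multiple of 128 bits, up to 2048");
    int64_t VScale = VLBits / 128;
    return FixedBytes + ScalableBytes * VScale;
  }
};
```

For example, a 16-byte fixed part plus two SVE vectors (32 scalable bytes) resolves to 48 bytes at the minimum 128-bit VL and to 144 bytes at a 512-bit VL.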


Event Timeline

sdesmalen created this revision. (May 2 2019, 5:38 AM)

Herald added subscribers: kristof.beyls, tschuett, javed.absar. (May 2 2019, 5:38 AM)

sdesmalen added parent revisions: D61436: [AArch64] NFC: Generalize emitFrameOffset to support more than byte offsets.; D61435: [AArch64] NFC: Add generic StackOffset to describe scalable offsets. (May 2 2019, 5:39 AM)

Your proposed stack layout doesn't really make sense. There are a few issues:

How do you compute the address of a stack argument?
Under the iOS and Windows calling conventions, vararg functions must allocate some fixed slots directly after the stack arguments.
How do you restore SP in the epilogue?

It would make a lot more sense to place the SVE objects somewhere between FP and SP; we already support allocating a variable amount of space between FP and SP, for stack realignment.

Not sure what impact this has on this patch; maybe not much?

sdesmalen mentioned this in D61436: [AArch64] NFC: Generalize emitFrameOffset to support more than byte offsets. (May 3 2019, 8:23 AM)

We've actually experimented with various layouts and eventually chose this layout for our HPC compiler.

Let me give some more clarification on the spill/fill addressing modes as background for this choice.

When loading regular (non-scalable) data from the stack in the presence of SVE stack objects, the base offset can be materialized using ADDVL, which adds a multiple of the runtime VL to a register. For example, a GPR register spilled at an offset SP + 16 bytes + 2 * sizeof(SVE vector) can be loaded using the sequence:

addvl x8, sp, #2
ldr x0, [x8, #16]
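The address arithmetic in the addvl/ldr sequence above can be modelled in a few lines of C++. This is only a sketch: `addvl` and `gprSpillAddress` are hypothetical helpers, and the vector length is passed in explicitly (in bytes) rather than read from hardware. The range check reflects ADDVL's signed 6-bit immediate.

```cpp
#include <cassert>
#include <cstdint>

// Models "addvl xd, xn, #imm": xd = xn + imm * (VL in bytes).
// The immediate is a signed 6-bit value, so imm must lie in [-32, 31].
uint64_t addvl(uint64_t Xn, int Imm, unsigned VLBytes) {
  assert(Imm >= -32 && Imm <= 31 && "ADDVL immediate out of range");
  return Xn + static_cast<int64_t>(Imm) * VLBytes;
}

// Address read by:  addvl x8, sp, #2 ; ldr x0, [x8, #16]
uint64_t gprSpillAddress(uint64_t SP, unsigned VLBytes) {
  uint64_t X8 = addvl(SP, 2, VLBytes); // skip two SVE vectors
  return X8 + 16;                      // fixed part folded into the LDR immediate
}
```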

Conversely, the SVE spill/fill addressing modes expect a (runtime) VL scaled offset. For example:

ldr z0, [sp, #2, mul vl]  // loads z0 from SP + 2 * sizeof(SVE vector)

If we want to load SVE vector z0 from an offset SP + 16 bytes + 2 * sizeof(SVE vector), this requires first materializing the base offset by adding 16 bytes, and then using the scaled addressing mode to load z0:

add x8, sp, #16
ldr z0, [x8, #2, mul vl]

Because the additional 'add <fixed-size offset>' (or, alternatively, 'addvl <scalable offset>') instruction is expensive, we keep fixed-size objects and scalable (SVE) objects in different regions. By allocating the SVE region before all other stack objects (CSRs, spills, locals), we benefit from not having to change the existing frame layout. More importantly, this means that accesses to almost all fixed-size stack objects (with the exception of stack arguments) will be as efficient as they would be without SVE stack objects, and don't require an extra frame register. In the presence of a frame pointer, we can also benefit from accessing our SVE objects directly from the FP.
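The cost argument above can be summarised as a tiny instruction-count model. The function `accessInstrCount` is hypothetical and deliberately ignores immediate range limits and register pressure. It only captures the point that a mixed fixed-plus-scalable offset costs one extra ADD or ADDVL, while an offset that matches the load's addressing mode does not.

```cpp
#include <cassert>
#include <cstdint>

enum class AccessKind { GPR, SVEVector };

// Hypothetical cost model: instructions needed to load an object at
// Base + FixedBytes + ScalableVecs * sizeof(SVE vector).
// Assumes every immediate fits its encoding (range limits are ignored).
unsigned accessInstrCount(AccessKind Kind, int64_t FixedBytes,
                          int64_t ScalableVecs) {
  unsigned Count = 1; // the load itself
  if (Kind == AccessKind::GPR) {
    // "ldr x0, [xN, #imm]" folds the fixed part; a scalable part needs ADDVL.
    if (ScalableVecs != 0)
      ++Count;
  } else {
    // "ldr z0, [xN, #imm, mul vl]" folds the scalable part; a fixed part
    // needs an ADD first.
    if (FixedBytes != 0)
      ++Count;
  }
  return Count;
}
```

Keeping each kind of object in its own region means the common case hits the single-instruction branch for both kinds.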

How do you compute the address of a stack argument?

We can compute the address of a stack argument using 'addvl' and regular 'add/sub' instructions.

Under the iOS and Windows calling conventions, vararg functions must allocate some fixed slots directly after the stack arguments.

I don't think there is a reason this decision would prevent the ability to create fixed slots directly after stack arguments (with some work to support it for these calling conventions, of course), although I admit we have not had to concern ourselves with this case for our HPC compiler which implements the AAPCS (with SVE extensions). I don't know enough about the iOS and Windows calling conventions to know if there are explicit assumptions made on the frame-layout other than this?

How do you restore SP in the epilogue?

We restore the SP in the epilogue by adding the scalable stack size to the SP as a last step. For example (from test/CodeGen/AArch64/framelayout-sve.mir):

# CHECK-NEXT: $sp = frame-setup ADDVL_XXI $sp, -2    // allocate scalable-sized stack
# CHECK-NEXT: $sp = frame-setup SUBXri $sp, 16, 0    // allocate fixed-size stack

# CHECK:      $sp = frame-destroy ADDXri $sp, 16, 0  // deallocate fixed-size stack
# CHECK-NEXT: $sp = frame-destroy ADDVL_XXI $sp, 2   // deallocate scalable-sized stack
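The prologue/epilogue ordering above can be checked with a small simulation of SP. The struct and functions are hypothetical. They mirror the two allocation steps in the CHECK lines and show that undoing them in reverse order restores SP for any vector length.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sizes of the two parts of a frame.
struct FrameSizes {
  int64_t FixedBytes;   // allocated with SUB/ADD of an immediate
  int64_t ScalableVecs; // allocated with ADDVL, in whole SVE vectors
};

// Mirrors: ADDVL sp, sp, #-ScalableVecs ; SUB sp, sp, #FixedBytes
uint64_t afterPrologue(uint64_t SP, FrameSizes F, unsigned VLBytes) {
  assert(VLBytes % 16 == 0 && "VL is a multiple of 128 bits");
  SP -= F.ScalableVecs * VLBytes;
  SP -= F.FixedBytes;
  return SP;
}

// Mirrors: ADD sp, sp, #FixedBytes ; ADDVL sp, sp, #ScalableVecs
uint64_t afterEpilogue(uint64_t SP, FrameSizes F, unsigned VLBytes) {
  assert(VLBytes % 16 == 0 && "VL is a multiple of 128 bits");
  SP += F.FixedBytes;
  SP += F.ScalableVecs * VLBytes;
  return SP;
}
```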

The description of the addressing modes is helpful; I didn't realize there was native support for vl-relative arithmetic. I guess that makes it more straightforward than I was expecting for stack address computations to skip over the SVE spill slots.

I'm not sure I understand why it's important to allocate the SVE spill slots before the CSRs, as opposed to allocating them between the CSRs and the regular locals/spills. For code which has a frame pointer, placing the SVE spill slots between the CSRs and the locals/spills has a number of benefits over your suggested layout:

the epilogue is cheaper (you don't need an addvl after restoring sp from fp)
it's cheaper to access arguments passed on the stack
it's cheaper to access the SVE spill slots: you can arrange for the frame pointer to point to the top of the SVE spill area, and use negative offsets from it to spill/restore SVE registers in a single instruction.
code using frame pointers can be unwound using a non-SVE-aware DWARF unwinder.

And I don't see any benefits to the other order, unless I'm missing something big.

In cases where we don't have a frame pointer, the two orders are basically equivalent, I guess.

I guess on current SVE implementations, there isn't any advantage to aligning SVE spill slots more than 16 bytes? And you don't expect that to ever change on future implementations?

I'll write out a diagram for my suggested layout, using a similar style to the one in framelayout-sve.mir, to make sure I'm describing it clearly:

#     +--------------+
#     | stack arg    |
#     +--------------+ <- SP before call
#     | Callee Saves |
#     +--------------+
#     | Frame record |
#     +--------------+ <- FP
#     | SVE objs     |
#     +--------------+
#     | gap for stack realignment, if there's an over-aligned local variable
#     +--------------+
#     |     :        |
#     | Stack objs   |
#     |     :        |
#     +--------------+ <- SP after call and frame-setup
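To make the suggested layout concrete, the following sketch computes a few addresses in it. It is hypothetical and assumes the frame record is exactly 16 bytes (x29, x30), there is no realignment gap, and the SVE area is a whole number of vectors. With FP pointing at the frame record, SVE slot I can be loaded with a single "ldr zN, [x29, #-I, mul vl]", and stack arguments sit at a fixed, non-scalable distance above FP.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical frame description for the suggested layout.
struct Frame {
  int64_t CalleeSaveBytes; // callee-saves above the 16-byte frame record
  int64_t SVEVecs;         // SVE area size, in whole vectors
  int64_t LocalBytes;      // fixed-size locals/spills (no realignment gap)
};

// FP sits above the locals and the SVE area.
uint64_t fpFromSP(uint64_t SP, const Frame &F, unsigned VLBytes) {
  assert(VLBytes % 16 == 0 && "VL is a multiple of 128 bits");
  return SP + F.LocalBytes + F.SVEVecs * VLBytes;
}

// Address of SVE slot I (1 = directly below FP): ldr zN, [x29, #-I, mul vl]
uint64_t sveSlotFromFP(uint64_t FP, int64_t I, unsigned VLBytes) {
  return FP - I * VLBytes;
}

// Incoming stack arguments start above the frame record and callee-saves.
uint64_t stackArgFromFP(uint64_t FP, const Frame &F) {
  return FP + 16 + F.CalleeSaveBytes;
}
```

Note that the lowest SVE slot ends exactly where the fixed-size locals begin, and the distance from FP to the stack arguments does not depend on the vector length.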

I don't know enough about the iOS and Windows calling conventions to know if there are explicit assumptions made on the frame-layout other than this?

That's the only relevant restriction, I think, unless you want to count the Windows unwind rules.

Added more unittests for StackOffset.

sdesmalen mentioned this in D61435: [AArch64] NFC: Add generic StackOffset to describe scalable offsets. (May 7 2019, 5:15 AM)

Thanks for your suggestions @efriedma!

Just to double check, your suggested layout has the frame-record *after* the callee-saves. The current layout however, puts the frame-record above the callee-saves. Are you suggesting to change that?

I'm not sure I understand why it's important to allocate the SVE spill slots before the CSRs, as opposed to allocating them between the CSRs and the regular locals/spills.

The current layout for our HPC compiler was a trade-off between getting an efficient implementation for SVE spills/fills on the one hand and limiting our downstream debt on the other. By keeping the layout as unchanged as possible (i.e. keeping all existing offsets to locals/spills the same, with the exception of stack arguments), we figured this simplified the code and reduced the chance of introducing bugs or regressing performance for accesses to regular stack objects in the presence of any SVE slots (with the exception of stack arguments).

I spent some time investigating your suggestion to place the SVE area between the callee-saves and locals/spills and found some things worth noting/considering: 

In the presence of an SVE area, the compiler should then no longer use stack-slot scavenging to reuse gaps in the CSR area, because accesses from the SP will be expensive.
The compiler will have less flexibility to choose the best base pointer to access a stack-slot, because using the FP to access a non-SVE local/spill will require an extra ADDVL instruction. For large stack-frames, this may incur an overhead (and would probably require the emergency spill slot).
Allocation of (non-SVE) stack space will always need to happen in separate steps: because the scalable area sits in the middle, it will no longer be possible to allocate the entire stack space in one go and then save the callee-saves from the new SP. Instead, the compiler needs to first allocate stack space for callee-saves, store callee-saves, and finally allocate the remaining stack space. Pre/post-incrementing addressing modes can be used for the first two steps, but I don't know if this would be more expensive than using the regular addressing modes.
The emergency scavenging slot will always need to be allocated near the SP (or BP), rather than the FP. This is not really a problem, just something that differs from the case where the stack contains no SVE objects.
We'd need to change the location of the frame-record within the callee-saves. If we do so, we'll probably want to do that regardless of whether the stack contains SVE spills or not to keep the layouts similar. Also the distance between FP and locals/spills would be smaller, which is probably beneficial. According to the AAPCS, the placement of the FrameRecord within the stack frame is unspecified (section 5.2.3 The Frame Pointer). Do you know if the same freedom holds true for iOS and Windows calling conventions?
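The multi-step allocation described in the third point above can be illustrated by generating an example prologue. `sketchPrologue` is hypothetical and is not what LLVM emits: it saves only the frame record (with a pre-indexed stp), then allocates the SVE area with ADDVL and the fixed-size area with SUB, asserting that each amount fits a single instruction's immediate.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical sketch of the prologue steps when the SVE area sits between
// the callee-saves and the fixed-size locals. Only the frame record
// (x29, x30) is saved here; a real prologue may save more registers.
std::vector<std::string> sketchPrologue(int SVEVecs, int LocalBytes) {
  assert(SVEVecs >= 0 && SVEVecs <= 32 && "must fit one ADDVL immediate");
  assert(LocalBytes >= 0 && LocalBytes < 4096 && LocalBytes % 16 == 0 &&
         "must fit one SUB immediate and keep SP 16-byte aligned");
  std::vector<std::string> Seq;
  // Steps 1 and 2: allocate the callee-save area and store into it.
  Seq.push_back("stp x29, x30, [sp, #-16]!");
  Seq.push_back("mov x29, sp");
  // Step 3a: allocate the scalable (SVE) area.
  if (SVEVecs != 0)
    Seq.push_back("addvl sp, sp, #-" + std::to_string(SVEVecs));
  // Step 3b: allocate the remaining fixed-size area.
  if (LocalBytes != 0)
    Seq.push_back("sub sp, sp, #" + std::to_string(LocalBytes));
  return Seq;
}
```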

the epilogue is cheaper (you don't need an addvl after restoring sp from fp)

In most cases, however, LLVM chooses to restore the stack by incrementing the stack pointer, even when that is suboptimal (e.g. when the FP is available and restoring the SP by adding sizeof(stack) requires more than one add instruction). The exception seems to be when the stack is aligned to more than 16 bytes and it needs to be restored using the frame pointer. Do you know if this behaviour is intentional?

it's cheaper to access arguments passed on the stack

Correct.

it's cheaper to access the SVE spill slots: you can arrange for the frame pointer to point to the top of the SVE spill area, and use negative offsets from it to spill/restore SVE registers in a single instruction.

Note that with the layout proposed in this patch, we can overcome that by extending the 16-byte frame record to n x 16 bytes (equal to sizeof(1 SVE-vector spill)) and accessing all SVE objects directly from FP + 1 + Offset.

code using frame pointers can be unwound using a non-SVE-aware DWARF unwinder.

When using a frame-pointer, that is still the case with the proposed layout, because the FP will always point to the frame-record, so it can always easily find the previous FP and LR, and offsets to the (non-SVE) callee-saves will be unchanged.

I guess on current SVE implementations, there isn't any advantage to aligning SVE spill slots more than 16 bytes? And you don't expect that to ever change on future implementations?

Locals arising from use of the ACLE may be set to a different alignment, but since the ACLE does not allow them being members of structs or arrays, there is probably little value in doing so. One advantage of placing the SVE area as you suggested is that we could easily implement such re-alignment by moving up the alignment gap between the callee-saves and the SVE area.

In D61437#1497980, @sdesmalen wrote:

Thanks for your suggestions @efriedma!

Just to double check, your suggested layout has the frame-record *after* the callee-saves. The current layout however, puts the frame-record above the callee-saves. Are you suggesting to change that?

Yes, I'm suggesting to rearrange them, to make the fp more useful for accessing SVE spills.

I'm not sure I understand why it's important to allocate the SVE spill slots before the CSRs, as opposed to allocating them between the CSRs and the regular locals/spills.

The current layout for our HPC compiler was a trade-off between getting an efficient implementation for SVE spills/fills on the one hand and limiting our downstream debt on the other. By keeping the layout as unchanged as possible (i.e. keeping all existing offsets to locals/spills the same, with the exception of stack arguments), we figured this simplified the code and reduced the chance of introducing bugs or regressing performance for accesses to regular stack objects in the presence of any SVE slots (with the exception of stack arguments).

If the SVE spill area is below the CSRs, you can leverage the existing checks to handle stack realignment, so I don't think it's that complicated to implement. But maybe your approach requires changing fewer places.

I spent some time investigating your suggestion to place the SVE area between the callee-saves and locals/spills and found some things worth noting/considering: 

In the presence of an SVE area, the compiler should then no longer use stack-slot scavenging to reuse gaps in the CSR area, because accesses from the SP will be expensive.

I don't think there's ever more than one 8-byte slot; not a great loss. And if we really wanted to, we could access the slot relative to fp.

The compiler will have less flexibility to choose the best base pointer to access a stack-slot, because using the FP to access a non-SVE local/spill will require an extra ADDVL instruction. For large stack-frames, this may incur an overhead (and would probably require the emergency spill slot).

We don't normally use fp anyway, unless the function has dynamic allocations; the legal negative offsets from fp are much smaller than the legal positive offsets from sp. And if there are dynamic allocations, we often emit a base pointer anyway.

But on a related note, we end up forcing a base pointer in all cases with dynamic allocation and SVE spill slots, which I guess is a potential downside.

Allocation of (non-SVE) stack space will always need to happen in separate steps: because the scalable area sits in the middle, it will no longer be possible to allocate the entire stack space in one go and then save the callee-saves from the new SP. Instead, the compiler needs to first allocate stack space for callee-saves, store callee-saves, and finally allocate the remaining stack space. Pre/post-incrementing addressing modes can be used for the first two steps, but I don't know if this would be more expensive than using the regular addressing modes.

On cortex-a57 etc., the performance of pre/post-increment is basically the same as an extra arithmetic instruction, IIRC. So yes, it's slightly more expensive, but not by a lot.

The emergency scavenging slot will always need to be allocated near the SP (or BP), rather than the FP. This is not really a problem, just something that differs from the case where the stack contains no SVE objects.

This is probably a one-line change, since we already do this in cases with stack realignment.

We'd need to change the location of the frame-record within the callee-saves. If we do so, we'll probably want to do that regardless of whether the stack contains SVE spills or not to keep the layouts similar. Also the distance between FP and locals/spills would be smaller, which is probably beneficial. According to the AAPCS, the placement of the FrameRecord within the stack frame is unspecified (section 5.2.3 The Frame Pointer). Do you know if the same freedom holds true for iOS and Windows calling conventions?

It doesn't matter on iOS. On Windows, the document describing unwind data actually claims the frame record is supposed to be allocated after the local variables for functions with dynamic stack allocations, but we currently don't implement that, and we haven't seen any issues. Maybe there's some interaction between C++ exceptions and dynamic allocation we don't implement correctly? I haven't really spent any time trying to break it, and dynamic allocations combined with C++ exception handling doesn't really show up in real-world code.

the epilogue is cheaper (you don't need an addvl after restoring sp from fp)

In most cases, however, LLVM chooses to restore the stack by incrementing the stack pointer, even when that is suboptimal (e.g. when the FP is available and restoring the SP by adding sizeof(stack) requires more than one add instruction). The exception seems to be when the stack is aligned to more than 16 bytes and it needs to be restored using the frame pointer. Do you know if this behaviour is intentional?

That isn't intentional, I think; probably just nobody noticed. Stack frames that require more than one instruction are rare, and frames that require more than two basically never happen.

it's cheaper to access arguments passed on the stack

Correct.

it's cheaper to access the SVE spill slots: you can arrange for the frame pointer to point to the top of the SVE spill area, and use negative offsets from it to spill/restore SVE registers in a single instruction.

Note that with the layout proposed in this patch, we can overcome that by extending the 16-byte frame record to n x 16 bytes (equal to sizeof(1 SVE-vector spill)) and accessing all SVE objects directly from FP + 1 + Offset.

Oh, that's clever, and I guess it's not that expensive.

code using frame pointers can be unwound using a non-SVE-aware DWARF unwinder.

When using a frame-pointer, that is still the case with the proposed layout, because the FP will always point to the frame-record, so it can always easily find the previous FP and LR, and offsets to the (non-SVE) callee-saves will be unchanged.

Sorry, I didn't state this correctly. The key here would be if code isn't using frame pointers, we could emit a frame pointer for all functions with SVE spill slots, and then get correct unwinding without an SVE-aware unwinder, and without recompiling everything with frame pointers.

I guess on current SVE implementations, there isn't any advantage to aligning SVE spill slots more than 16 bytes? And you don't expect that to ever change on future implementations?

Locals arising from use of the ACLE may be set to a different alignment, but since the ACLE does not allow them being members of structs or arrays, there is probably little value in doing so. One advantage of placing the SVE area as you suggested is that we could easily implement such re-alignment by moving up the alignment gap between the callee-saves and the SVE area.

Yes, that's what I was thinking.

Ignore the bit about Windows unwinding. I remembered how it actually works; the Microsoft document is just wrong, and it's not actually necessary to allocate the frame record in any particular position on Windows (except that it has to be somewhere that isn't allocated using _chkstk... but that wouldn't happen anyway with the layout I'm suggesting).

This sounds like it will report the wrong stack size in PEI for the StackSize remark and the stack size warning. Is that expected?

In D61437#1498210, @efriedma wrote:

If the SVE spill area is below the CSRs, you can leverage the existing checks to handle stack realignment, so I don't think it's that complicated to implement. But maybe your approach requires changing fewer places.

I think you've made some compelling reasons to try the change in layout! I'll actually try this out downstream first before updating this patch, so I can run it through our SVE testing and see if there is any impact on performance or if I run into anything unexpected.

But on a related note, we end up forcing a base pointer in all cases with dynamic allocation and SVE spill slots, which I guess is a potential downside.

Does LLVM need this information to be available before register allocation so it knows whether to use the register or not? Because we would only know if we'd need a BP if there are any SVE instructions that would lead to spills *after* register allocation (unless the BP is reserved during RA and only used for scavenging).

When using a frame-pointer, that is still the case with the proposed layout, because the FP will always point to the frame-record, so it can always easily find the previous FP and LR, and offsets to the (non-SVE) callee-saves will be unchanged.

Sorry, I didn't state this correctly. The key here would be if code isn't using frame pointers, we could emit a frame pointer for all functions with SVE spill slots, and then get correct unwinding without an SVE-aware unwinder, and without recompiling everything with frame pointers.

Okay, so we should always use the FP if the function needs unwind table entries and has SVE spills/locals.

In D61437#1498242, @efriedma wrote:

Ignore the bit about Windows unwinding. I remembered how it actually works; the Microsoft document is just wrong, and it's not actually necessary to allocate the frame record in any particular position on Windows (except that it has to be somewhere that isn't allocated using _chkstk... but that wouldn't happen anyway with the layout I'm suggesting).

Thanks for clarifying!

In D61437#1501444, @thegameg wrote:

This sounds like it will report the wrong stack size in PEI for the StackSize remark and the stack size warning. Is that expected?

For now the answer is yes. One of our primary concerns at the moment is adding basic SVE spill/fill support and we appreciate the caveat that nothing in LLVM really supports the concept of scalable types yet, including offsets and sizes.

Patch D61435 introduces (AArch64)StackOffset, which we'll use to describe offsets composed of a scalable and a fixed-size part. Instead of recording sizes and offsets as an 'int' or 'unsigned', they should be described as an instance of a StackOffset/StackSize class. When the scalable-type patch (D32530) lands, we should make an effort to roll out the StackOffset class (perhaps with an alias for StackSize) to generic CodeGen interfaces such as getStackSize().

Does LLVM need this information to be available before register allocation so it knows whether to use the register or not?

You have to decide whether the register is reserved before register allocation, so the register allocator doesn't decide to use it, yes. You should be able to change the answer after register allocation: basically, reserve the register through register allocation, then decide after register allocation you don't really need it and "unreserve" it.

I wouldn't really worry about optimizing this; dynamic stack allocation is rare in most C and C++ codebases, and one integer register likely doesn't matter much.

Okay, so we should always use the FP if the function needs unwind table entries and has SVE spills/locals.

Yes. Granted, you probably want a frame pointer anyway for functions with SVE spills/locals.

cameron.mcinally added a subscriber: cameron.mcinally. (Jul 30 2019, 1:34 PM)

Herald added a reviewer: rengolin. (Jul 30 2019, 1:34 PM)

greened added a subscriber: greened. (Aug 1 2019, 11:20 AM)

I wouldn't really worry about optimizing this; dynamic stack allocation is rare in most C and C++ codebases, and one integer register likely doesn't matter much.

Note that the situation is different with Fortran, where dynamic stack allocation is much more common, though I don't know whether this particular issue will impact performance all that much.

greened added inline comments. (Aug 1 2019, 12:53 PM)

lib/Target/AArch64/AArch64FrameLowering.cpp
187	Needs a comment explaining what this does.
865	This is confusing. Asserting that `SVEStackSize` is non-zero but the message sort of implies it must be true. Maybe word this similarly to the assert right above: "unexpected function without stack frame but with SVE objects."
1281	Maybe this and the uses below are better as a separate NFC patch?
1436	It would be helpful to have a comment here explaining why we are not `Done` if `SVEStackSize` is non-zero.
1486	Add a comment here explaining what this is doing.
lib/Target/AArch64/AArch64InstrInfo.cpp
3052	It's not clear to me what these are. Could you name them a bit more clearly, specifically without acronyms?
lib/Target/AArch64/AArch64StackOffset.h
97	Again, `PL` and `VL` are not very clear.

Changed the location of the SVE area within the frame-layout as suggested by @efriedma
Updated the summary.

Herald added subscribers: nhaehnle, jvesely, arsenm. (Aug 2 2019, 6:24 AM)

sdesmalen added inline comments. (Aug 2 2019, 6:24 AM)

lib/Target/AArch64/AArch64FrameLowering.cpp
1281	This change is no longer in my updated patch.
1436	This change is no longer in my updated patch.
1486	This change is no longer in my updated patch.
lib/Target/AArch64/AArch64InstrInfo.cpp
3052	Good point, I see how this is unclear. I've renamed the variables in my latest revision!

sdesmalen added a parent revision: D65653: [AArch64] Change location of frame-record within callee-save area. (Aug 2 2019, 6:25 AM)

@efriedma, sorry for taking a while to update this patch with the new layout. Other than being distracted by many other things, I tried it on our downstream repo first to see if this might lead to any negative performance impact. This all seems fine, and I now realise this approach makes the code in AArch64FrameLowering a bit simpler (which was the opposite of what I initially thought). I separated out the patch to reorder the frame-record within the callee-save area into D65653.

sdesmalen added a subscriber: joelkevinjones. (Aug 2 2019, 7:10 AM)

troyj added a subscriber: troyj. (Aug 2 2019, 8:44 AM)

I wonder if this should have a test that ensures we generate VL-scaled addressing modes for SVE object addressing. If there's not enough codegen yet to emit the asm, then we should probably add such a test when we can. After all, it's the stated goal of this patch. :)

In D61437#1612435, @greened wrote:

I wonder if this should have a test that ensures we generate VL-scaled addressing modes for SVE object addressing. If there's not enough codegen yet to emit the asm, then we should probably add such a test when we can. After all, it's the stated goal of this patch. :)

You're right. I have a separate patch for that, which I could share next week. This patch only adds the support to allocate the SVE area using ADDVL.
(The compiler currently guards against accessing any stack objects in the presence of an SVE area using an assert).

My previous patch didn't have any context, so I've added it now (git format-patch -U999999).

This LGTM but I think someone else should probably sign off on it as well.

Gentle ping. Now that D65653 has been committed, I think this patch is ready for review again.

(I also have some further patches prepared that follow this one: one that implements sp/fp accesses to SVE objects, and another that implements the saving/restoring of SVE callee-save registers. If it would help, I can post those on Phabricator for context.)

greened added inline comments. (Sep 3 2019, 7:11 AM)

include/llvm/CodeGen/TargetFrameLowering.h
28	Why was the formatting changed here? It's easier for me to read with the newlines.

Updated formatting of TargetStackID enum.

sdesmalen marked an inline comment as done. (Sep 19 2019, 12:18 AM)

sdesmalen added a child revision: D67749: [AArch64] Stackframe accesses to SVE objects. (Sep 19 2019, 12:32 AM)

Ping.

@eli.friedman are you happy with this patch now that the SVE region is moved between callee-saves and locals as you suggested in https://reviews.llvm.org/D61437#1490588 ?

I think you've basically landed all the significant pieces of this already? But sure, LGTM.

This revision is now accepted and ready to land. (Sep 30 2019, 6:30 PM)

Closed by commit rL373585: [AArch64] Static (de)allocation of SVE stack objects. (authored by s.desmalen) (Oct 3 2019, 4:34 AM)

This revision was automatically updated to reflect the committed changes.

Herald added a project: Restricted Project. (Oct 3 2019, 4:34 AM)

Revision Contents

Path                                               Size
include/llvm/CodeGen/MIRYamlMapping.h              1 line
include/llvm/CodeGen/TargetFrameLowering.h         6 lines
lib/Target/AArch64/AArch64FrameLowering.h          11 lines
lib/Target/AArch64/AArch64FrameLowering.cpp        76 lines
lib/Target/AArch64/AArch64InstrInfo.cpp            31 lines
lib/Target/AArch64/AArch64MachineFunctionInfo.h    16 lines
lib/Target/AArch64/AArch64StackOffset.h            45 lines
lib/Target/AMDGPU/SIFrameLowering.cpp              2 lines
test/CodeGen/AArch64/framelayout-sve.mir           121 lines
unittests/Target/AArch64/TestStackOffset.cpp       71 lines

Diff 213033

include/llvm/CodeGen/MIRYamlMapping.h

...
   static void enumeration(yaml::IO &IO, TargetStackID::Value &ID) {
     IO.enumCase(ID, "default", TargetStackID::Default);
     IO.enumCase(ID, "sgpr-spill", TargetStackID::SGPRSpill);
+    IO.enumCase(ID, "sve-vec", TargetStackID::SVEVector);
     IO.enumCase(ID, "noalloc", TargetStackID::NoAlloc);
   }
 };
...

include/llvm/CodeGen/TargetFrameLowering.h

...
 class RegScavenger;

 namespace TargetStackID {
-  enum Value {
-    Default = 0,
-    SGPRSpill = 1,
-    NoAlloc = 255
-  };
+  enum Value { Default = 0, SGPRSpill = 1, SVEVector = 2, NoAlloc = 255 };
 }

 /// Information about stack frame layout on the target. It holds the direction
...

Inline comment from greened on the new enum Value line: Why was the formatting changed here? It's easier for me to read with the newlines.

lib/Target/AArch64/AArch64FrameLowering.h

...
                                      int FI) const override;
   int getSEHFrameIndexOffset(const MachineFunction &MF, int FI) const;

+  bool isSupportedStackID(TargetStackID::Value ID) const override {
+    switch (ID) {
+    default:
+      return false;
+    case TargetStackID::Default:
+    case TargetStackID::SVEVector:
+    case TargetStackID::NoAlloc:
+      return true;
+    }
+  }
+
 private:
   bool shouldCombineCSRLocalStackBump(MachineFunction &MF,
                                       unsigned StackBumpBytes) const;
...

lib/Target/AArch64/AArch64FrameLowering.cpp

Context not available.
 // | prev_fp, prev_lr                  |
 // | (a.k.a. "frame record")           |
 // |-----------------------------------| <- fp(=x29)
+// |                                   |
+// | SVE stack objects                 |
+// |                                   |
+// |-----------------------------------|
 // |.empty.space.to.make.part.below....|
 // |.aligned.in.case.it.needs.more.than| (size of this area is unknown at
 // |.the.standard.16-byte.alignment....|  compile time; if present)
greened: Needs a comment explaining what this does.
Context not available.
   return DefaultSafeSPDisplacement;
 }

+/// Returns the size of the entire SVE stackframe (calleesaves + spills).
+static StackOffset getSVEStackSize(const MachineFunction &MF) {
+  const AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
+  return {(int64_t)AFI->getStackSizeSVE(), MVT::nxv1i8};
+}

 bool AArch64FrameLowering::canUseRedZone(const MachineFunction &MF) const {
   if (!EnableRedZone)
     return false;
Context not available.
   const AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
   unsigned NumBytes = AFI->getLocalStackSize();

-  return !(MFI.hasCalls() || hasFP(MF) || NumBytes > 128);
+  return !(MFI.hasCalls() || hasFP(MF) || NumBytes > 128 ||
+           getSVEStackSize(MF));
 }

 /// hasFP - Return true if the specified function should have a dedicated frame
Context not available.
   if (canUseRedZone(MF))
     return false;

+  // When there is an SVE area on the stack, always allocate the
+  // callee-saves and spills/locals separately.
+  if (getSVEStackSize(MF))
+    return false;

   return true;
 }

greened: This is confusing. Asserting that `SVEStackSize` is non-zero but the message sort of implies it must be true. Maybe word this similarly to the assert right above: "unexpected function without stack frame but with SVE objects."
Context not available.
   // Ideally it should match SP value after prologue.
   AFI->setTaggedBasePointerOffset(MFI.getStackSize());

+  const StackOffset &SVEStackSize = getSVEStackSize(MF);

   // getStackSize() includes all the locals in its size calculation. We don't
   // include these locals when computing the stack size of a funclet, as they
   // are allocated in the parent's stack frame and accessed via the frame
Context not available.
                            : (int)MFI.getStackSize();
   if (!AFI->hasStackFrame() && !windowsRequiresStackProbe(MF, NumBytes)) {
     assert(!HasFP && "unexpected function without stack frame but with FP");
+    assert(!SVEStackSize &&
+           "unexpected function without stack frame but with SVE objects");
     // All of the stack allocation is for locals.
     AFI->setLocalStackSize(NumBytes);
     if (!NumBytes)
Context not available.
   AFI->setLocalStackSize(NumBytes - PrologueSaveSize);
   bool CombineSPBump = shouldCombineCSRLocalStackBump(MF, NumBytes);
   if (CombineSPBump) {
+    assert(!SVEStackSize && "Cannot combine SP bump with SVE");
     emitFrameOffset(MBB, MBBI, DL, AArch64::SP, AArch64::SP,
                     {-NumBytes, MVT::i8}, TII, MachineInstr::FrameSetup, false,
                     NeedsWinCFI, &HasWinCFI);
Context not available.
     NumBytes = 0;
   }

+  emitFrameOffset(MBB, MBBI, DL, AArch64::SP, AArch64::SP, -SVEStackSize, TII,
+                  MachineInstr::FrameSetup);

   // Allocate space for the rest of the frame.
   if (NumBytes) {
     const bool NeedsRealignment = RegInfo->needsStackRealignment(MF);
greened: Maybe this and the uses below are better as a separate NFC patch?
sdesmalen (author): This change is no longer in my updated patch.
greened: It would be helpful to have a comment here explaining why we are not `Done` if `SVEStackSize` is non-zero.
sdesmalen (author): This change is no longer in my updated patch.
Context not available.
         .setMIFlag(MachineInstr::FrameDestroy);
   }

+  const StackOffset &SVEStackSize = getSVEStackSize(MF);

   // If there is a single SP update, insert it before the ret and we're done.
   if (CombineSPBump) {
+    assert(!SVEStackSize && "Cannot combine SP bump with SVE");
     emitFrameOffset(MBB, MBB.getFirstTerminator(), DL, AArch64::SP, AArch64::SP,
                     {NumBytes + (int64_t)AfterCSRPopSize, MVT::i8}, TII,
                     MachineInstr::FrameDestroy, false, NeedsWinCFI, &HasWinCFI);
Context not available.
   NumBytes -= PrologueSaveSize;
   assert(NumBytes >= 0 && "Negative stack allocation size!?");

+  // Deallocate the SVE area.
+  if (SVEStackSize)
+    if (!AFI->isStackRealigned())
+      emitFrameOffset(MBB, LastPopI, DL, AArch64::SP, AArch64::SP, SVEStackSize,
+                      TII, MachineInstr::FrameDestroy);

   if (!hasFP(MF)) {
     bool RedZone = canUseRedZone(MF);
     // If this was a redzone leaf function, we don't need to restore the
greened: Add a comment here explaining what this is doing.
sdesmalen (author): This change is no longer in my updated patch.
Context not available.
   bool isCSR =
       !isFixed && ObjectOffset >= -((int)AFI->getCalleeSavedStackSize());

+  const StackOffset &SVEStackSize = getSVEStackSize(MF);
+  assert(!SVEStackSize && "Accessing frame indices in presence of SVE "
+                          "not yet supported");

   // Use frame pointer to reference fixed objects. Use it for locals if
   // there are VLAs or a dynamically realigned SP (and thus the SP isn't
   // reliable as a base). Make sure useFPForScavengingIndex() does the
Context not available.
                     << ' ' << printReg(Reg, RegInfo);
              dbgs() << "\n";);

+  bool HasSVEStackObjects = [&MFI]() {
+    for (int I = MFI.getObjectIndexBegin(); I != 0; ++I)
+      if (MFI.getStackID(I) == TargetStackID::SVEVector &&
+          MFI.getObjectOffset(I) < 0)
+        return true;
+    // Note: We don't take allocatable stack objects into
+    // account yet, because allocation for those is not yet
+    // implemented.
+    return false;
+  }();

   // If any callee-saved registers are used, the frame cannot be eliminated.
-  bool CanEliminateFrame = SavedRegs.count() == 0;
+  bool CanEliminateFrame = (SavedRegs.count() == 0) && !HasSVEStackObjects;

   // The CSR spill slots have not been allocated yet, so estimateStackSize
   // won't include them.
Context not available.

 void AArch64FrameLowering::processFunctionBeforeFrameFinalized(
     MachineFunction &MF, RegScavenger *RS) const {
+  MachineFrameInfo &MFI = MF.getFrameInfo();
+
+  assert(getStackGrowthDirection() == TargetFrameLowering::StackGrowsDown &&
+         "Upwards growing stack unsupported");
+
+  // Process all fixed stack SVE objects.
+  int64_t Offset = 0;
+  for (int I = MFI.getObjectIndexBegin(); I != 0; ++I) {
+    unsigned StackID = MFI.getStackID(I);
+    if (StackID == TargetStackID::SVEVector) {
+      int64_t FixedOffset = -MFI.getObjectOffset(I);
+      if (FixedOffset > Offset)
+        Offset = FixedOffset;
+    }
+  }
+
+  unsigned MaxAlign = getStackAlignment();
+  uint64_t SVEStackSize = alignTo(Offset, MaxAlign);
+
+  AArch64FunctionInfo *AFI = MF.getInfo<AArch64FunctionInfo>();
+  AFI->setStackSizeSVE(SVEStackSize);
+  assert(MaxAlign <= 16 && "Cannot align scalable vectors more than 16 bytes");

   // If this function isn't doing Win64-style C++ EH, we don't need to do
   // anything.
   if (!MF.hasEHFunclets())
     return;
   const TargetInstrInfo &TII = *MF.getSubtarget().getInstrInfo();
-  MachineFrameInfo &MFI = MF.getFrameInfo();
   WinEHFuncInfo &EHInfo = *MF.getWinEHFuncInfo();

   MachineBasicBlock &MBB = MF.front();
Context not available.
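The size computation added to `processFunctionBeforeFrameFinalized` can be modelled outside LLVM. The sketch below makes a few assumptions: `StackID`, `FixedObject`, `alignToPow2` and `computeSVEStackSize` are hypothetical stand-ins for `TargetStackID`, `MachineFrameInfo` and `llvm::alignTo`, and, as in the patch, only fixed stack objects are considered.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for the parts of MachineFrameInfo that the
// SVE size computation reads.
enum class StackID { Default, SVEVector };

struct FixedObject {
  StackID ID;
  int64_t Offset; // negative: the object lives below the incoming SP
};

// Round Value up to a multiple of Align, which must be a power of two.
uint64_t alignToPow2(uint64_t Value, uint64_t Align) {
  assert(Align != 0 && (Align & (Align - 1)) == 0 && "Align must be 2^k");
  return (Value + Align - 1) & ~(Align - 1);
}

// The SVE area has to reach down to the deepest fixed SVE object, and its
// size is then rounded up to the stack alignment (16 on AArch64). The result
// is in scalable bytes, which are scaled by VL/128 (VL in bits) at run time.
uint64_t computeSVEStackSize(const std::vector<FixedObject> &Objects,
                             uint64_t StackAlign = 16) {
  int64_t Offset = 0;
  for (const FixedObject &Obj : Objects)
    if (Obj.ID == StackID::SVEVector && -Obj.Offset > Offset)
      Offset = -Obj.Offset;
  return alignToPow2(static_cast<uint64_t>(Offset), StackAlign);
}
```

Given the fixed object from `framelayout-sve.mir` (size 18, offset -18), the model yields 32 scalable bytes, or two data vectors. This matches the `ADDVL_XXI $sp, -2` that the test expects in the prologue.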

lib/Target/AArch64/AArch64InstrInfo.cpp

Context not available.
     MaxEncoding = 0xfff;
     ShiftSize = 12;
     break;
+  case AArch64::ADDVL_XXI:
+  case AArch64::ADDPL_XXI:
+    MaxEncoding = 31;
+    ShiftSize = 0;
+    if (Offset < 0) {
+      MaxEncoding = 32;
+      Sign = -1;
+      Offset = -Offset;
+    }
+    break;
   default:
     llvm_unreachable("Unsupported opcode");
   }
greened: It's not clear to me what these are. Could you name them a bit more clearly, specifically without acronyms?
sdesmalen (author): Good point, I see how this is unclear. I've renamed the variables in my latest revision!
Context not available.
                                StackOffset Offset, const TargetInstrInfo *TII,
                                MachineInstr::MIFlag Flag, bool SetNZCV,
                                bool NeedsWinCFI, bool *HasWinCFI) {
-  int64_t Bytes;
-  Offset.getForFrameOffset(Bytes);
+  int64_t Bytes, NumPredicateVectors, NumDataVectors;
+  Offset.getForFrameOffset(Bytes, NumPredicateVectors, NumDataVectors);

   // First emit non-scalable frame offsets, or a simple 'mov'.
   if (Bytes || (!Offset && SrcReg != DestReg)) {
Context not available.
                        NeedsWinCFI, HasWinCFI);
     SrcReg = DestReg;
   }

+  assert(!(SetNZCV && (NumPredicateVectors || NumDataVectors)) &&
+         "SetNZCV not supported with SVE vectors");
+  assert(!(NeedsWinCFI && (NumPredicateVectors || NumDataVectors)) &&
+         "WinCFI not supported with SVE vectors");
+
+  if (NumDataVectors) {
+    emitFrameOffsetAdj(MBB, MBBI, DL, DestReg, SrcReg, NumDataVectors,
+                       AArch64::ADDVL_XXI, TII, Flag, NeedsWinCFI, nullptr);
+    SrcReg = DestReg;
+  }
+
+  if (NumPredicateVectors) {
+    assert(DestReg != AArch64::SP && "Unaligned access to SP");
+    emitFrameOffsetAdj(MBB, MBBI, DL, DestReg, SrcReg, NumPredicateVectors,
+                       AArch64::ADDPL_XXI, TII, Flag, NeedsWinCFI, nullptr);
+  }
 }

 MachineInstr *AArch64InstrInfo::foldMemoryOperandImpl(
Context not available.
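The new `ADDVL_XXI`/`ADDPL_XXI` case sets `MaxEncoding` to 31, or to 32 for negative offsets. Both instructions take a signed 6-bit multiplier (-32 to 31), applied to the vector length or to the predicate length respectively. The body of `emitFrameOffsetAdj` that consumes these values is not shown above, so the sketch below is only a model. It assumes the offset is split greedily into chunks no larger than `MaxEncoding`, and `splitScaledImmediate` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: split an ADDVL/ADDPL multiplier into immediates that
// each fit the signed 6-bit field [-32, 31]. MaxEncoding and Sign are chosen
// as in the new ADDVL_XXI/ADDPL_XXI case. The greedy loop is an assumption
// about how the remainder is emitted.
std::vector<int64_t> splitScaledImmediate(int64_t Offset) {
  int64_t Sign = 1;
  int64_t MaxEncoding = 31;
  if (Offset < 0) {
    MaxEncoding = 32; // the negative range reaches one step further
    Sign = -1;
    Offset = -Offset;
  }
  std::vector<int64_t> Immediates;
  while (Offset != 0) {
    int64_t ThisVal = std::min(Offset, MaxEncoding);
    Immediates.push_back(Sign * ThisVal);
    Offset -= ThisVal;
  }
  return Immediates;
}
```

Under this model, two instructions cover multipliers from -64 to 62. That is the same window `StackOffset::getForFrameOffset` checks before it moves part of a predicate-granular offset into ADDVL.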

lib/Target/AArch64/AArch64MachineFunctionInfo.h

Context not available.
   /// returned struct in a register. This field holds the virtual register into
   /// which the sret argument is passed.
   unsigned SRetReturnReg = 0;
+  /// SVE stack size (for predicates and data vectors) are maintained here
+  /// rather than in FrameInfo, as the placement and Stack IDs are target
+  /// specific.
+  uint64_t StackSizeSVE = 0;
+
+  /// HasCalculatedStackSizeSVE indicates whether StackSizeSVE is valid.
+  bool HasCalculatedStackSizeSVE = false;

   /// Has a value when it is known whether or not the function uses a
   /// redzone, and no value otherwise.
Context not available.
     ArgumentStackToRestore = bytes;
   }

+  bool hasCalculatedStackSizeSVE() const { return HasCalculatedStackSizeSVE; }
+
+  void setStackSizeSVE(uint64_t S) {
+    HasCalculatedStackSizeSVE = true;
+    StackSizeSVE = S;
+  }
+
+  uint64_t getStackSizeSVE() const { return StackSizeSVE; }

   bool hasStackFrame() const { return HasStackFrame; }
   void setHasStackFrame(bool s) { HasStackFrame = s; }

Context not available.

lib/Target/AArch64/AArch64StackOffset.h

Context not available.
 /// vector and a 64bit GPR.
 class StackOffset {
   int64_t Bytes;
+  int64_t ScalableBytes;

   explicit operator int() const;

 public:
   using Part = std::pair<int64_t, MVT>;

-  StackOffset() : Bytes(0) {}
+  StackOffset() : Bytes(0), ScalableBytes(0) {}

   StackOffset(int64_t Offset, MVT::SimpleValueType T) : StackOffset() {
-    assert(!MVT(T).isScalableVector() && "Scalable types not supported");
     *this += Part(Offset, T);
   }

-  StackOffset(const StackOffset &Other) : Bytes(Other.Bytes) {}
+  StackOffset(const StackOffset &Other)
+      : Bytes(Other.Bytes), ScalableBytes(Other.ScalableBytes) {}

   StackOffset &operator=(const StackOffset &) = default;

   StackOffset &operator+=(const StackOffset::Part &Other) {
     assert(Other.second.getSizeInBits() % 8 == 0 &&
            "Offset type is not a multiple of bytes");
-    Bytes += Other.first * (Other.second.getSizeInBits() / 8);
+    int64_t OffsetInBytes = Other.first * (Other.second.getSizeInBits() / 8);
+    if (Other.second.isScalableVector())
+      ScalableBytes += OffsetInBytes;
+    else
+      Bytes += OffsetInBytes;
     return *this;
   }

   StackOffset &operator+=(const StackOffset &Other) {
     Bytes += Other.Bytes;
+    ScalableBytes += Other.ScalableBytes;
     return *this;
   }

Context not available.

   StackOffset &operator-=(const StackOffset &Other) {
     Bytes -= Other.Bytes;
+    ScalableBytes -= Other.ScalableBytes;
     return *this;
   }

Context not available.
     return Res;
   }

greened: Again, `PL` and `VL` are not very clear.
+  /// Returns the scalable part of the offset in bytes.
+  int64_t getScalableBytes() const { return ScalableBytes; }

   /// Returns the non-scalable part of the offset in bytes.
   int64_t getBytes() const { return Bytes; }

   /// Returns the offset in parts to which this frame offset can be
   /// decomposed for the purpose of describing a frame offset.
   /// For non-scalable offsets this is simply its byte size.
-  void getForFrameOffset(int64_t &ByteSized) const { ByteSized = Bytes; }
+  void getForFrameOffset(int64_t &NumBytes, int64_t &NumPredicateVectors,
+                         int64_t &NumDataVectors) const {
+    assert(isValid() && "Invalid frame offset");
+
+    NumBytes = Bytes;
+    NumDataVectors = 0;
+    NumPredicateVectors = ScalableBytes / 2;
+    // This method is used to get the offsets to adjust the frame offset.
+    // If the function requires ADDPL to be used and needs more than two ADDPL
+    // instructions, part of the offset is folded into NumDataVectors so that it
+    // uses ADDVL for part of it, reducing the number of ADDPL instructions.
+    if (NumPredicateVectors % 8 == 0 || NumPredicateVectors < -64 ||
+        NumPredicateVectors > 62) {
+      NumDataVectors = NumPredicateVectors / 8;
+      NumPredicateVectors -= NumDataVectors * 8;
+    }
+  }

   /// Returns whether the offset is known zero.
-  explicit operator bool() const { return Bytes; }
+  explicit operator bool() const { return Bytes || ScalableBytes; }

+  bool isValid() const {
+    // The smallest scalable element supported by scaled SVE addressing
+    // modes are predicates, which are 2 scalable bytes in size. So the scalable
+    // byte offset must always be a multiple of 2.
+    return ScalableBytes % 2 == 0;
+  }
 };

 } // end namespace llvm
Context not available.
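The self-contained model below exercises two parts of the patch without LLVM: the split between fixed and scalable bytes, and the way `getForFrameOffset` turns scalable bytes into ADDVL/ADDPL counts. `ValueType` (a size in bits plus a scalable flag) is a hypothetical replacement for `MVT`, and `MiniStackOffset` is an illustrative name. The accounting and decomposition logic follow the diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for MVT: size in bits and whether the type is a
// scalable vector (its runtime size is then scaled by VL/128).
struct ValueType {
  int64_t SizeInBits;
  bool IsScalable;
};

class MiniStackOffset {
  int64_t Bytes = 0;         // fixed-size part
  int64_t ScalableBytes = 0; // part that scales with the vector length

public:
  MiniStackOffset() = default;
  MiniStackOffset(int64_t Count, ValueType T) { add(Count, T); }

  // Same routing as StackOffset::operator+=(const Part &) in the diff.
  void add(int64_t Count, ValueType T) {
    assert(T.SizeInBits % 8 == 0 && "Offset type is not a multiple of bytes");
    int64_t OffsetInBytes = Count * (T.SizeInBits / 8);
    if (T.IsScalable)
      ScalableBytes += OffsetInBytes;
    else
      Bytes += OffsetInBytes;
  }

  MiniStackOffset operator+(const MiniStackOffset &Other) const {
    MiniStackOffset Res = *this;
    Res.Bytes += Other.Bytes;
    Res.ScalableBytes += Other.ScalableBytes;
    return Res;
  }

  // Predicates (2 scalable bytes) are the smallest scalable unit.
  bool isValid() const { return ScalableBytes % 2 == 0; }

  // Same decomposition as StackOffset::getForFrameOffset in the diff:
  // one predicate vector is 2 scalable bytes, one data vector is 16.
  void getForFrameOffset(int64_t &NumBytes, int64_t &NumPredicateVectors,
                         int64_t &NumDataVectors) const {
    assert(isValid() && "Invalid frame offset");
    NumBytes = Bytes;
    NumDataVectors = 0;
    NumPredicateVectors = ScalableBytes / 2;
    if (NumPredicateVectors % 8 == 0 || NumPredicateVectors < -64 ||
        NumPredicateVectors > 62) {
      NumDataVectors = NumPredicateVectors / 8;
      NumPredicateVectors -= NumDataVectors * 8;
    }
  }
};
```

The test values mirror cases from `TestStackOffset.cpp`. 18 scalable bytes remain nine ADDPL steps. 130 scalable bytes exceed the two-ADDPL window, so they become eight ADDVL steps plus one ADDPL step.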

lib/Target/AMDGPU/SIFrameLowering.cpp

Context not available.
   case TargetStackID::NoAlloc:
   case TargetStackID::SGPRSpill:
     return true;
+  case TargetStackID::SVEVector:
+    return false;
   }
   llvm_unreachable("Invalid TargetStackID::Value");
 }
Context not available.

test/CodeGen/AArch64/framelayout-sve.mir

This file was added.

# RUN: llc -mtriple=aarch64-none-linux-gnu -run-pass=prologepilog %s -o - | FileCheck %s
#
# Test allocation and deallocation of SVE objects on the stack,
# as well as using a combination of scalable and non-scalable
# offsets to access the SVE on the stack.
#
# SVE objects are allocated below the (scalar) callee saves,
# and above spills/locals and the alignment gap, e.g.
#
#     +-------------+
#     | stack arg   |
#     +-------------+ <- SP before call
#     | Callee Saves|
#     | Frame record|       (if available)
#     |-------------| <- FP (if available)
#     |  SVE area   |
#     +-------------+
#     |/////////////| alignment gap.
#     |     :       |
#     | Stack objs  |
#     |     :       |
#     +-------------+ <- SP after call and frame-setup
#
--- |

  define void @test_allocate_sve() nounwind { entry: unreachable }
  define void @test_allocate_sve_gpr_callee_saves() nounwind { entry: unreachable }
  define void @test_allocate_sve_gpr_realigned() nounwind { entry: unreachable }

...
# +----------+
# | %fixed-  | // scalable SVE object of n * 18 bytes, aligned to 16 bytes,
# |  stack.0 | // to be materialized with 2*ADDVL (<=> 2 * n * 16bytes)
# +----------+
# | %stack.0 | // not scalable
# +----------+ <- SP

# CHECK-LABEL: name: test_allocate_sve
# CHECK: stackSize: 16

# CHECK: bb.0.entry:
# CHECK-NEXT: $sp = frame-setup ADDVL_XXI $sp, -2
# CHECK-NEXT: $sp = frame-setup SUBXri $sp, 16, 0

# CHECK-NEXT: $sp = frame-destroy ADDVL_XXI $sp, 2
# CHECK-NEXT: $sp = frame-destroy ADDXri $sp, 16, 0
# CHECK-NEXT: RET_ReallyLR
name: test_allocate_sve
fixedStack:
  - { id: 0, stack-id: sve-vec, size: 18, alignment: 2, offset: -18 }
stack:
  - { id: 0, stack-id: default, size: 16, alignment: 8 }
body: |
  bb.0.entry:
    RET_ReallyLR
---
...
# +----------+
# | x20, x21 | // callee saves
# +----------+
# | %fixed-  | // scalable objects
# |  stack.0 |
# +----------+
# | %stack.0 | // not scalable
# +----------+ <- SP

# CHECK-LABEL: name: test_allocate_sve_gpr_callee_saves
# CHECK: stackSize: 32

# CHECK: bb.0.entry:
# CHECK-NEXT: $sp = frame-setup STPXpre killed $x21, killed $x20, $sp, -2
# CHECK-NEXT: $sp = frame-setup ADDVL_XXI $sp, -2
# CHECK-NEXT: $sp = frame-setup SUBXri $sp, 16, 0
# CHECK-NEXT: $x20 = IMPLICIT_DEF
# CHECK-NEXT: $x21 = IMPLICIT_DEF
# CHECK-NEXT: $sp = frame-destroy ADDVL_XXI $sp, 2
# CHECK-NEXT: $sp = frame-destroy ADDXri $sp, 16, 0
# CHECK-NEXT: $sp, $x21, $x20 = frame-destroy LDPXpost $sp, 2
# CHECK-NEXT: RET_ReallyLR
name: test_allocate_sve_gpr_callee_saves
fixedStack:
  - { id: 0, stack-id: sve-vec, size: 18, alignment: 2, offset: -18 }
stack:
  - { id: 0, stack-id: default, size: 16, alignment: 8 }
body: |
  bb.0.entry:
    $x20 = IMPLICIT_DEF
    $x21 = IMPLICIT_DEF
    RET_ReallyLR
---
...
# +----------+
# | lr, fp   | // frame record
# +----------+ <- FP
# | %fixed-  | // scalable objects
# |  stack.0 |
# +----------+
# |//////////| // alignment gap
# | %stack.0 | // not scalable
# +----------+ <- SP
# CHECK-LABEL: name: test_allocate_sve_gpr_realigned
# CHECK: stackSize: 32

# CHECK: bb.0.entry:
# CHECK-NEXT: $sp = frame-setup STPXpre killed $fp, killed $lr, $sp, -2
# CHECK-NEXT: $fp = frame-setup ADDXri $sp, 0, 0
# CHECK-NEXT: $sp = frame-setup ADDVL_XXI $sp, -2
# CHECK-NEXT: $[[TMP:x[0-9]+]] = frame-setup SUBXri $sp, 16, 0
# CHECK-NEXT: $sp = ANDXri killed $[[TMP]]
# CHECK-NEXT: $sp = frame-destroy ADDXri $fp, 0, 0
# CHECK-NEXT: $sp, $fp, $lr = frame-destroy LDPXpost $sp, 2
# CHECK-NEXT: RET_ReallyLR
name: test_allocate_sve_gpr_realigned
fixedStack:
  - { id: 0, stack-id: sve-vec, size: 18, alignment: 2, offset: -18 }
stack:
  - { id: 0, stack-id: default, size: 16, alignment: 32 }
body: |
  bb.0.entry:
    RET_ReallyLR
---
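Once a vector length is fixed, the CHECK lines of `test_allocate_sve` reduce to plain SP arithmetic. The sketch below replays them for a hypothetical runtime vector length `VLBytes` (16 for a 128-bit implementation). `FrameTrace` and `traceTestAllocateSVE` are illustrative names. The sketch shows that the 16 bytes reported as `stackSize` cover only the fixed-size object. The two vector lengths of the SVE area come from the separately tracked scalable size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the prologue/epilogue checked by test_allocate_sve.
// VLBytes is the runtime SVE vector length in bytes (a multiple of 16).
struct FrameTrace {
  uint64_t SPAfterSetup;   // SP once the frame-setup instructions have run
  uint64_t SPAfterDestroy; // SP once the frame-destroy instructions have run
};

FrameTrace traceTestAllocateSVE(uint64_t EntrySP, uint64_t VLBytes) {
  assert(VLBytes >= 16 && VLBytes % 16 == 0 &&
         "SVE vector length is a multiple of 128 bits");
  uint64_t SP = EntrySP;
  SP -= 2 * VLBytes; // $sp = frame-setup ADDVL_XXI $sp, -2
  SP -= 16;          // $sp = frame-setup SUBXri $sp, 16, 0
  const uint64_t AfterSetup = SP;
  SP += 2 * VLBytes; // $sp = frame-destroy ADDVL_XXI $sp, 2
  SP += 16;          // $sp = frame-destroy ADDXri $sp, 16, 0
  return {AfterSetup, SP};
}
```

On a 512-bit implementation, the same two ADDVL steps move SP by 128 bytes instead of 32. This is presumably why the SVE area is kept out of the fixed `stackSize`.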

unittests/Target/AArch64/TestStackOffset.cpp

Context not available.

   StackOffset C(2, MVT::v4i64);
   EXPECT_EQ(64, C.getBytes());

+  StackOffset D(2, MVT::nxv4i64);
+  EXPECT_EQ(64, D.getScalableBytes());
+
+  StackOffset E(2, MVT::v4i64);
+  EXPECT_EQ(0, E.getScalableBytes());
+
+  StackOffset F(2, MVT::nxv4i64);
+  EXPECT_EQ(0, F.getBytes());
 }

 TEST(StackOffset, Add) {
Context not available.
   StackOffset D(1, MVT::i32);
   D += A;
   EXPECT_EQ(12, D.getBytes());

+  StackOffset E(1, MVT::nxv1i32);
+  StackOffset F = C + E;
+  EXPECT_EQ(12, F.getBytes());
+  EXPECT_EQ(4, F.getScalableBytes());
 }

 TEST(StackOffset, Sub) {
Context not available.
   StackOffset D(1, MVT::i64);
   D -= A;
   EXPECT_EQ(0, D.getBytes());

+  C += StackOffset(2, MVT::nxv1i32);
+  StackOffset E = StackOffset(1, MVT::nxv1i32);
+  StackOffset F = C - E;
+  EXPECT_EQ(4, F.getBytes());
+  EXPECT_EQ(4, F.getScalableBytes());
 }

 TEST(StackOffset, isZero) {
Context not available.
   StackOffset B(0, MVT::i32);
   EXPECT_TRUE(!A);
   EXPECT_TRUE(!(A + B));

+  StackOffset C(0, MVT::nxv1i32);
+  EXPECT_TRUE(!(A + C));
+
+  StackOffset D(1, MVT::nxv1i32);
+  EXPECT_FALSE(!(A + D));
+}
+
+TEST(StackOffset, isValid) {
+  EXPECT_FALSE(StackOffset(1, MVT::nxv8i1).isValid());
+  EXPECT_TRUE(StackOffset(2, MVT::nxv8i1).isValid());
+
+  EXPECT_DEATH(StackOffset(1, MVT::i1),
+               "Offset type is not a multiple of bytes");
+  EXPECT_DEATH(StackOffset(1, MVT::nxv1i1),
+               "Offset type is not a multiple of bytes");
 }

 TEST(StackOffset, getForFrameOffset) {
   StackOffset A(1, MVT::i64);
   StackOffset B(1, MVT::i32);
-  int64_t ByteSized;
-  (A + B).getForFrameOffset(ByteSized);
+  StackOffset C(1, MVT::nxv4i32);
+  // If all offsets can be materialized with only ADDVL,
+  // make sure PLSized is 0.
+  int64_t ByteSized, VLSized, PLSized;
+  (A + B + C).getForFrameOffset(ByteSized, PLSized, VLSized);
   EXPECT_EQ(12, ByteSized);
+  EXPECT_EQ(1, VLSized);
+  EXPECT_EQ(0, PLSized);
+
+  // If we need an ADDPL to materialize the offset, and the number of scalable
+  // bytes fits the ADDPL immediate, fold the scalable bytes to fit in PLSized.
+  StackOffset D(1, MVT::nxv16i1);
+  (C + D).getForFrameOffset(ByteSized, PLSized, VLSized);
+  EXPECT_EQ(0, ByteSized);
+  EXPECT_EQ(0, VLSized);
+  EXPECT_EQ(9, PLSized);
+
+  StackOffset E(4, MVT::nxv4i32);
+  StackOffset F(1, MVT::nxv16i1);
+  (E + F).getForFrameOffset(ByteSized, PLSized, VLSized);
+  EXPECT_EQ(0, ByteSized);
+  EXPECT_EQ(0, VLSized);
+  EXPECT_EQ(33, PLSized);
+
+  // If the offset requires an ADDPL instruction to materialize, and would
+  // require more than two instructions, decompose it into both
+  // ADDVL (n x 16 bytes) and ADDPL (n x 2 bytes) instructions.
+  StackOffset G(8, MVT::nxv4i32);
+  StackOffset H(1, MVT::nxv16i1);
+  (G + H).getForFrameOffset(ByteSized, PLSized, VLSized);
+  EXPECT_EQ(0, ByteSized);
+  EXPECT_EQ(8, VLSized);
+  EXPECT_EQ(1, PLSized);
 }
Context not available.
