Page MenuHomePhabricator

[AArch64] Static (de)allocation of SVE stack objects.
Needs ReviewPublic

Authored by sdesmalen on May 2 2019, 5:38 AM.

Details

Summary

Adds support to AArch64FrameLowering to allocate fixed-stack SVE objects.

The focus of this patch is purely to allow the stack frame to
allocate/deallocate space for scalable SVE objects. More dynamic
allocation (at compile-time, i.e. determining placement of SVE objects
on the stack), or resolving frame-index references that include
scalable-sized offsets, are left for subsequent patches.

SVE objects are allocated in the stack frame as a separate region below
the callee-save area, and above the alignment gap. This is done so that
the SVE objects can be accessed directly from the FP at (runtime)
VL-based offsets to benefit from using the VL-scaled addressing modes.

The layout looks as follows:

+-------------+
| stack arg   |   
+-------------+
| Callee Saves|
|   X29, X30  |       (if available)
|-------------| <- FP (if available)
|     :       |   
|  SVE area   |   
|     :       |   
+-------------+
|/////////////| alignment gap.
|     :       |   
| Stack objs  |
|     :       |   
+-------------+ <- SP after call and frame-setup

SVE and non-SVE stack objects are distinguished using different
StackIDs. The offsets for objects with TargetStackID::SVEVector should be
interpreted as purely scalable offsets within their respective SVE region.

Diff Detail

Event Timeline

sdesmalen created this revision.May 2 2019, 5:38 AM

Your proposed stack layout doesn't really make sense. There are a few issues:

  1. How do you compute the address of a stack argument?
  2. Under the ios and Windows calling conventions, vararg functions must allocate some fixed slots directly after the stack arguments.
  3. How do you restore SP in the epilogue?

It would make a lot more sense to place the SVE objects somewhere between FP and SP; we already support allocating a variable amount of space between FP and SP, for stack realignment.

Not sure what impact this has on this patch; maybe not much?

We've actually experimented with various layouts and eventually chose this layout for our HPC compiler.

Let me give some more clarification on the spill/fill addressing modes as background for this choice.

When loading regular (non-scalable) data from the stack in the presence of SVE stack objects, the base offset can be materialized using ADDVL, which adds a multiple of the runtime VL to a register. For example, a GPR register spilled at an offset SP + 16 bytes + 2 * sizeof(SVE vector) can be loaded using the sequence:

addvl x8, sp, #2
ldr x0, [x8, #16]

Conversely, the SVE spill/fill addressing modes expect a (runtime) VL scaled offset. For example:

ldr z0, [sp, #2, mul vl]  // loads z0 from SP + 2 * sizeof(SVE vector)

If we want to load SVE vector z0 from an offset SP + 16 bytes + 2 * sizeof(SVE vector), this requires first materializing the base offset by adding 16 bytes, and then using the scaled addressing mode to load z0:

add x8, sp, #16
ldr z0, [x8, #2, mul vl]

Because the additional add <fixed-size offset>, or alternatively addvl <scalable offset> is expensive, we distinguish fixed-size objects and scalable (SVE) objects in different regions. By allocating the SVE region before all other stack objects (CSRs, spills, locals), we benefit that the existing frame-layout doesn't need to change. More importantly, this means that accesses to almost all fixed-size stack objects (with exception of stack arguments) will be as efficient as they would be without SVE stack objects, and don't require an extra frame register. In the presence of a frame-pointer, we can also benefit from accessing our SVE objects directly from the FP.

  1. How do you compute the address of a stack argument?

We can compute the address of a stack argument using 'addvl' and regular 'add/sub' instructions.

  1. Under the ios and Windows calling conventions, vararg functions must allocate some fixed slots directly after the stack arguments.

I don't think there is a reason this decision would prevent the ability to create fixed slots directly after stack arguments (with some work to support it for these calling conventions, of course), although I admit we have not had to concern ourselves with this case for our HPC compiler which implements the AAPCS (with SVE extensions). I don't know enough about the iOS and Windows calling conventions to know if there are explicit assumptions made on the frame-layout other than this?

  1. How do you restore SP in the epilogue?

We restore the SP in the epilogue by adding the scalable stack-size to the SP as a last step. For example (from test/CodeGen/AArch64/framelayout-sve.mir)

# CHECK-NEXT: $sp = frame-setup ADDVL_XXI $sp, -2    // allocate scalable-sized stack
# CHECK-NEXT: $sp = frame-setup SUBXri $sp, 16, 0    // allocate fixed-size stack

# CHECK:      $sp = frame-destroy ADDXri $sp, 16, 0  // deallocate fixed-size stack
# CHECK-NEXT: $sp = frame-destroy ADDVL_XXI $sp, 2   // deallocate scalable-sized stack

The description of the addressing modes is helpful; I didn't realize there was native support for vl-relative arithmetic. I guess that makes it more straightforward than I was expecting for stack address computations to skip over the SVE spill slots.

I'm not sure I understand why it's important to allocate the SVE spill slots before the CSRs, as opposed to allocating them between the CSRs and the regular locals/spills. For code which has a frame pointer, placing the SVE spill slots between the CSRs and the locals/spills has a number of benefits over your suggested layout:

  1. the epilogue is cheaper (you don't need an addvl after restoring sp from fp)
  2. it's cheaper to access arguments passed on the stack
  3. it's cheaper to access the SVE spill slots: you can arrange for the frame pointer to point to the top of the SVE spill area, and use negative offsets from it to spill/restore SVE registers in a single instruction.
  4. code using frame pointers can be unwound using a non-SVE-aware DWARF unwinder.

And I don't see any benefits to the other order, unless I'm missing something big.

In cases where we don't have a frame pointer, the two orders are basically equivalent, I guess.

I guess on current SVE implementations, there isn't any advantage to aligning SVE spill slots more than 16 bytes? And you don't expect that to ever change on future implementations?

I'll write out a diagram for my suggested layout, using a similar style to the one in framelayout-sve.mir, to make sure I'm describing it clearly:

#     +--------------+
#     | stack arg    |
#     +--------------+ <- SP before call
#     | Callee Saves |
#     +--------------+
#     | Frame record |
#     +--------------+ <- FP
#     | SVE objs     |
#     +--------------+
#     | gap for stack realignment, if there's an over-aligned local variable
#     +--------------+
#     |     :        |
#     | Stack objs   |
#     |     :        |
#     +--------------+ <- SP after call and frame-setup

I don't know enough about the iOS and Windows calling conventions to know if there are explicit assumptions made on the frame-layout other than this?

That's the only relevant restriction, I think, unless you want to count the Windows unwind rules.

sdesmalen updated this revision to Diff 198440.May 7 2019, 5:06 AM
  • Added more unittests for StackOffset.

Thanks for your suggestions @efriedma!

Just to double check, your suggested layout has the frame-record *after* the callee-saves. The current layout however, puts the frame-record above the callee-saves. Are you suggesting to change that?

I'm not sure I understand why it's important to allocate the SVE spill slots before the CSRs, as opposed to allocating them between the CSRs and the regular locals/spills.

The current layout for our HPC compiler was a trade-off between getting an efficient implementation for SVE spills/fills on one hand, while keeping in mind a way to limit our downstream debt on the other hand. By keeping the layout as unchanged as possible (i.e. keeping all existing offsets to locals/spills the same, with the exception of stack arguments), we figured this simplified the code and reduced the chance of introducing bugs or regressing performance for accesses to regular stack objects in the presence of any SVE slots (with exception of stack arguments).

I spent some time investigating your suggestion to place the SVE area between the callee-saves and locals/spills and found some things worth noting/considering:


  • In the presence of an SVE area, the compiler should then no longer use stack-slot scavenging to reuse gaps in the CSR area, because accesses from the SP will be expensive.
  • The compiler will have less flexibility to choose the best base pointer to access a stack-slot, because using the FP to access a non-SVE local/spill will require an extra ADDVL instruction. For large stack-frames, this may incur an overhead (and would probably require the emergency spill slot).
  • Allocation of (non-SVE) stack space will always need to happen in separate steps, because it will no longer be possible to allocate the entire stack space in one go and then save the callee-saves from the new SP, because the scalable area is inserted in the middle. Instead, compiler needs to first allocate stack space for callee-saves, store callee-saves, and finally allocate the remaining stack-space. Pre/post-incrementing addressing modes can be used for the first two steps, but I don't know if this would be more expensive than using the regular addressing modes.
  • The emergency scavenging will always need to be allocated near the SP (or BP), rather than FP. This is not really a problem, but more something that is different when the stack does not contain any SVE objects.
  • We'd need to change the location of the frame-record within the callee-saves. If we do so, we'll probably want to do that regardless of whether the stack contains SVE spills or not to keep the layouts similar. Also the distance between FP and locals/spills would be smaller, which is probably beneficial. According to the AAPCS, the placement of the FrameRecord within the stack frame is unspecified (section 5.2.3 The Frame Pointer). Do you know if the same freedom holds true for iOS and Windows calling conventions?
  1. the epilogue is cheaper (you don't need an addvl after restoring sp from fp)

In most cases however, LLVM chooses to restore the stack by incrementing the stack-pointer, even when that is suboptimal (e.g. when the FP is available and restoring the SP by adding sizeof(stack) requires more than 1 add instruction). The exception seems to be when the stack is aligned > 16 bytes and it needs to restore it by using the frame-pointer. Do you know if this behaviour is intentional?

  1. it's cheaper to access arguments passed on the stack

Correct.

  1. it's cheaper to access the SVE spill slots: you can arrange for the frame pointer to point to the top of the SVE spill area, and use negative offsets from it to spill/restore SVE registers in a single instruction.

Note that with the layout proposed in this patch, we can overcome that by extending the 16 byte frame-record to be n x 16 bytes <=> sizeof(1 SVE-vec spill), and access all SVE objects directly from FP + 1 + Offset.

  1. code using frame pointers can be unwound using a non-SVE-aware DWARF unwinder.

When using a frame-pointer, that is still the case with the proposed layout, because the FP will always point to the frame-record, so it can always easily find the previous FP and LR, and offsets to the (non-SVE) callee-saves will be unchanged.

I guess on current SVE implementations, there isn't any advantage to aligning SVE spill slots more than 16 bytes? And you don't expect that to ever change on future implementations?

Locals arising from use of the ACLE may be set to a different alignment, but since the ACLE does not allow them being members of structs or arrays, there is probably little value in doing so. One advantage of placing the SVE area as you suggested is that we could easily implement such re-alignment by moving up the alignment gap between the callee-saves and the SVE area.

Thanks for your suggestions @efriedma!

Just to double check, your suggested layout has the frame-record *after* the callee-saves. The current layout however, puts the frame-record above the callee-saves. Are you suggesting to change that?

Yes, I'm suggesting to rearrange them, to make the fp more useful for accessing SVE spills.

I'm not sure I understand why it's important to allocate the SVE spill slots before the CSRs, as opposed to allocating them between the CSRs and the regular locals/spills.

The current layout for our HPC compiler was a trade-off between getting an efficient implementation for SVE spills/fills on one hand, while keeping in mind a way to limit our downstream debt on the other hand. By keeping the layout as unchanged as possible (i.e. keeping all existing offsets to locals/spills the same, with the exception of stack arguments), we figured this simplified the code and reduced the chance of introducing bugs or regressing performance for accesses to regular stack objects in the presence of any SVE slots (with exception of stack arguments).

If the SVE spill area is below the CSRs, you can leverage the existing checks to handle stack realignment, so I don't think it's that complicated to implement. But maybe your approach requires changing fewer places.

I spent some time investigating your suggestion to place the SVE area between the callee-saves and locals/spills and found some things worth noting/considering:


  • In the presence of an SVE area, the compiler should then no longer use stack-slot scavenging to reuse gaps in the CSR area, because accesses from the SP will be expensive.

I don't think there's ever more than one 8-byte slot; not a great loss. And if we really wanted to, we could access the slot relative to fp.

  • The compiler will have less flexibility to choose the best base pointer to access a stack-slot, because using the FP to access a non-SVE local/spill will require an extra ADDVL instruction. For large stack-frames, this may incur an overhead (and would probably require the emergency spill slot).

We don't normally use fp anyway, unless the function has dynamic allocations; the legal negative offsets from fp are much smaller than the legal positive offsets from sp. And if there are dynamic allocations, we often emit a base pointer anyway.

But on a related note, we end up forcing a base pointer in all cases with dynamic allocation and SVE spill slots, which I guess is a potential downside.

  • Allocation of (non-SVE) stack space will always need to happen in separate steps, because it will no longer be possible to allocate the entire stack space in one go and then save the callee-saves from the new SP, because the scalable area is inserted in the middle. Instead, compiler needs to first allocate stack space for callee-saves, store callee-saves, and finally allocate the remaining stack-space. Pre/post-incrementing addressing modes can be used for the first two steps, but I don't know if this would be more expensive than using the regular addressing modes.

On cortex-a57 etc., the performance of pre/post-increment is basically the same as an extra arithmetic instruction, IIRC. So yes, it's slightly more expensive, but not by a lot.

  • The emergency scavenging will always need to be allocated near the SP (or BP), rather than FP. This is not really a problem, but more something that is different when the stack does not contain any SVE objects.

This is probably a one-line change, since we already do this in cases with stack realignment.

  • We'd need to change the location of the frame-record within the callee-saves. If we do so, we'll probably want to do that regardless of whether the stack contains SVE spills or not to keep the layouts similar. Also the distance between FP and locals/spills would be smaller, which is probably beneficial. According to the AAPCS, the placement of the FrameRecord within the stack frame is unspecified (section 5.2.3 The Frame Pointer). Do you know if the same freedom holds true for iOS and Windows calling conventions?

It doesn't matter on iOS. On Windows, the document describing unwind data actually claims the frame record is supposed to be allocated after the local variables for functions with dynamic stack allocations, but we currently don't implement that, and we haven't seen any issues. Maybe there's some interaction between C++ exceptions and dynamic allocation we don't implement correctly? I haven't really spent any time trying to break it, and dynamic allocations combined with C++ exception handling doesn't really show up in real-world code.

  1. the epilogue is cheaper (you don't need an addvl after restoring sp from fp)

In most cases however, LLVM chooses to restore the stack by incrementing the stack-pointer, even when that is suboptimal (e.g. when the FP is available and restoring the SP by adding sizeof(stack) requires more than 1 add instruction). The exception seems to be when the stack is aligned > 16 bytes and it needs to restore it by using the frame-pointer. Do you know if this behaviour is intentional?

That isn't intentional, I think; probably just nobody noticed. Stack frames that require more than one instruction are rare, and frames that require more than two basically never happen.

  1. it's cheaper to access arguments passed on the stack

Correct.

  1. it's cheaper to access the SVE spill slots: you can arrange for the frame pointer to point to the top of the SVE spill area, and use negative offsets from it to spill/restore SVE registers in a single instruction.

Note that with the layout proposed in this patch, we can overcome that by extending the 16 byte frame-record to be n x 16 bytes <=> sizeof(1 SVE-vec spill), and access all SVE objects directly from FP + 1 + Offset.

Oh, that's clever, and I guess it's not that expensive.

  1. code using frame pointers can be unwound using a non-SVE-aware DWARF unwinder.

When using a frame-pointer, that is still the case with the proposed layout, because the FP will always point to the frame-record, so it can always easily find the previous FP and LR, and offsets to the (non-SVE) callee-saves will be unchanged.

Sorry, I didn't state this correctly. The key here would be if code isn't using frame pointers, we could emit a frame pointer for all functions with SVE spill slots, and then get correct unwinding without a SVE-aware unwinder, and without recompiling everything with frame pointers.

I guess on current SVE implementations, there isn't any advantage to aligning SVE spill slots more than 16 bytes? And you don't expect that to ever change on future implementations?

Locals arising from use of the ACLE may be set to a different alignment, but since the ACLE does not allow them being members of structs or arrays, there is probably little value in doing so. One advantage of placing the SVE area as you suggested is that we could easily implement such re-alignment by moving up the alignment gap between the callee-saves and the SVE area.

Yes, that's what I was thinking.

Ignore the bit about Windows unwinding. I remembered how it actually works; the Microsoft document is just wrong, and it's not actually necessary to allocate the frame record in any particular position on WIndows (except that it has to be somewhere that isn't allocated using _chkstk... but that wouldn't happen anyway with the layout I'm suggesting).

This sounds like it will report the wrong stack size in PEI for the StackSize remark and the stack size warning. Is that expected?

If the SVE spill area is below the CSRs, you can leverage the existing checks to handle stack realignment, so I don't think it's that complicated to implement. But maybe your approach requires changing fewer places.

I think you've made some compelling reasons to try the change in layout! I'll actually try this out downstream first before updating this patch, so I can run it through our SVE testing and see if there is any impact on performance or if I run into anything unexpected.

But on a related note, we end up forcing a base pointer in all cases with dynamic allocation and SVE spill slots, which I guess is a potential downside.

Does LLVM need this information to be available before register allocation so it knows whether to use the register or not? Because we would only know if we'd need a BP if there are any SVE instructions that would lead to spills *after* register allocation (unless the BP is reserved during RA and only used for scavenging).

When using a frame-pointer, that is still the case with the proposed layout, because the FP will always point to the frame-record, so it can always easily find the previous FP and LR, and offsets to the (non-SVE) callee-saves will be unchanged.

Sorry, I didn't state this correctly. The key here would be if code isn't using frame pointers, we could emit a frame pointer for all functions with SVE spill slots, and then get correct unwinding without a SVE-aware unwinder, and without recompiling everything with frame pointers.

Okay, so we should always use the FP if the function needs unwind table entries and has SVE spills/locals.

Ignore the bit about Windows unwinding. I remembered how it actually works; the Microsoft document is just wrong, and it's not actually necessary to allocate the frame record in any particular position on WIndows (except that it has to be somewhere that isn't allocated using _chkstk... but that wouldn't happen anyway with the layout I'm suggesting).

Thanks for clarifying!

This sounds like it will report the wrong stack size in PEI for the StackSize remark and the stack size warning. Is that expected?

For now the answer is yes. One of our primary concerns at the moment is adding basic SVE spill/fill support and we appreciate the caveat that nothing in LLVM really supports the concept of scalable types yet, including offsets and sizes.

Patch D61435 introduces (AArch64)StackOffset, which we'll use to describe offsets composed of a scalable and fixed-size part. Instead of recording sizes and offsets as an 'int' or 'unsigned', they should be described as an instance of a StackOffset/StackSize class. When the scalable-type patch (D32530) lands we should make an effort to roll out the StackOffset class (perhaps with an alias for StackSize) to generic CodeGen interfaces such as getStackSize().

Does LLVM need this information to be available before register allocation so it knows whether to use the register or not?

You have to decide whether the register is reserved before register allocation, so the register allocator doesn't decide to use it, yes. You should be able to change the answer after register allocation: basically, reserve the register through register allocation, then decide after register allocation you don't really need it and "unreserve" it.

I wouldn't really worry about optimizing this; dynamic stack allocation is rare in most C and C++ codebases, and one integer register likely doesn't matter much.

Okay, so we should always use the FP if the function needs unwind table entries and has SVE spills/locals.

Yes. Granted, you probably want a frame pointer anyway for functions with SVE spills/locals.

I wouldn't really worry about optimizing this; dynamic stack allocation is rare in most C and C++ codebases, and one integer register likely doesn't matter much.

Note that the situation is different with Fortran, where dynamic stack allocation is much more common, though I don't know whether this particular issue will impact performance all that much.

greened added inline comments.Thu, Aug 1, 12:53 PM
lib/Target/AArch64/AArch64FrameLowering.cpp
191

Needs a comment explaining what this does.

903

This is confusing. Asserting that SVEStackSize is non-zero but the message sort of implies it must be true. Maybe word this similarly to the assert right above: "unexpected function without stack frame but with SVE objects."

1345

Maybe this and the uses below are better as a separate NFC patch?

1503

It would be helpful to have a comment here explaining why we are not Done if SVEStackSize is non-zero.

1554

Add a comment here explaining what this is doing.

lib/Target/AArch64/AArch64InstrInfo.cpp
3076–3077

It's not clear to me what these are. Could you name them a bit more clearly, specifically without acronyms?

lib/Target/AArch64/AArch64StackOffset.h
106–123

Again, PL and VL are not very clear.

sdesmalen updated this revision to Diff 213033.Fri, Aug 2, 6:24 AM
sdesmalen marked 7 inline comments as done.
sdesmalen edited the summary of this revision. (Show Details)
sdesmalen added a reviewer: greened.
  • Changed the location of the SVE area within the frame-layout as suggested by @efriedma
  • Updated the summary.
sdesmalen added inline comments.Fri, Aug 2, 6:24 AM
lib/Target/AArch64/AArch64FrameLowering.cpp
1345

This change is no longer in my updated patch.

1503

This change is no longer in my updated patch.

1554

This change is no longer in my updated patch.

lib/Target/AArch64/AArch64InstrInfo.cpp
3076–3077

Good point, I see how this is unclear. I've renamed the variables in my latest revision!

@efriedma, sorry for taking a while to update this patch with the new layout. Other than being distracted by many other things, I tried it on our downstream repo first to see if this might lead to any negative performance impact. This all seems fine, and I now realise this approach makes the code in AArch64FrameLowering a bit simpler (which was the opposite of what I initially thought). I separated out the patch to reorder the frame-record within the callee-save area into D65653.

troyj added a subscriber: troyj.Fri, Aug 2, 8:44 AM

I wonder if this should have a test that ensures we generate VL-scaled addressing modes for SVE object addressing. If there's not enough codegen yet to emit the asm, then we should probably add such a test when we can. After all, it's the stated goal of this patch. :)

I wonder if this should have a test that ensures we generate VL-scaled addressing modes for SVE object addressing. If there's not enough codegen yet to emit the asm, then we should probably add such a test when we can. After all, it's the stated goal of this patch. :)

You're right. I have a separate patch for that, that I could share next week. This patch only adds the support to allocate the SVE area using ADDVL.
(The compiler currently guards against accessing any stack objects in the presence of an SVE area using an assert).

sdesmalen updated this revision to Diff 213058.Fri, Aug 2, 9:17 AM

My previous patch didn't have any context, so added it now (git format-patch -U999999)

This LGTM but I think someone else should probably sign off on it as well.