This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Fix RVV stack frame alignment bugs
ClosedPublic

Authored by frasercrmck on May 17 2022, 7:02 AM.

Details

Summary

This patch addresses several alignment issues in the stack frame when
RVV objects are taken into account.

One bug is that the RVV stack was never guaranteed to keep the alignment
of the stack *as a whole*. We must maintain a 16-byte aligned stack at
all times, especially when calling other functions. With the standard V
extension, this is conveniently happening since VLEN is at least 128 and
always 16-byte aligned. However, we support Zvl64b which does not
guarantee this. To fix this, the RVV stack size is rounded up to be
aligned to 16 bytes. This in practice generally makes us allocate a
stack sized at least 2*VLEN in size, and a multiple of 2.

|------------------------------| -- <-- FP
| 8-byte callee-save           | |      |
|------------------------------| |      |
| one VLENB-sized RVV object   | |      |
|------------------------------| |      |
| 8-byte local variable        | |      |
|------------------------------| -- <-- SP (must be aligned to 16)

In the example above, with Zvl64b we are decrementing SP by 12 bytes
which does not leave SP correctly aligned. We therefore introduce an
extra VLENB-sized amount used for alignment. This would therefore ensure
the total stack size was 16 bytes (48 for Zvl128b, 80 for Zvl256b, etc):

|------------------------------| -- <-- FP
| 8-byte callee-save           | |      |
|------------------------------| |      |
| one VLENB-sized padding obj  | |      |
| one VLENB-sized RVV object   | |      |
|------------------------------| |      |
| 8-byte local variable        | |      |
|------------------------------| -- <-- SP

A new RVV invariant has been introduced in this patch, which is that the
base of the RVV stack itself is now always aligned to 16 bytes, not 8 as
before. This keeps us more in line with the scalar stack and should be
easier to reason about. The calculation of the RVV padding has thus
changed to be the amount required to align the scalar local variable
section to the RVV section's alignment. This amount is further rounded
up when setting up the initial stack to keep everything aligned:

|------------------------------| -- <-- FP
| 8-byte callee-save           |
|------------------------------|
|                              |
| RVV objects                  |
| (aligned to at least 16)     |
|                              |
|------------------------------|
| RVV padding of 8 bytes       |
|------------------------------|
| 8-byte local variable        |
|------------------------------| -- <-- SP

In the example above, it's clear that we need 8 bytes of padding to keep
the RVV section aligned to 16 when using SP. But to keep SP *itself*
aligned to 16 we can't decrement the initial stack pointer by 24 - we
have to round up to 32.

With the RVV section correctly aligned, the second bug fixed by
this patch is that RVV objects themselves are now correctly aligned. We
were previously only guaranteeing an alignment of 8 bytes, even if they
required a higher alignment. This is relatively simple and in practice
we see more rounding up of VLEN amounts to account for alignment in
between objects:

|------------------------------|
| RVV object (aligned to 16)   |
|------------------------------|
| no padding necessary         |
|------------------------------|
| 2*VLENB RVV object (align 16)|
|------------------------------|
| VLENB alignment padding      |
|------------------------------|
| RVV object (align 32)        |
|------------------------------|
| 3*VLENB alignment padding    |
|------------------------------|
| VLENB RVV object (align 32)  |
|------------------------------| -- <-- base of RVV section

Note that a lot of the regressions in codegen owing to the new alignment
rules are correct but actually only strictly necessary for Zvl64b (and
Zvl32b but that's not really supported). I plan a follow-up patch to
take the known VLEN into account when padding for alignment.

Diff Detail

Event Timeline

frasercrmck created this revision.May 17 2022, 7:02 AM
Herald added a project: Restricted Project. · View Herald TranscriptMay 17 2022, 7:02 AM
frasercrmck requested review of this revision.May 17 2022, 7:02 AM

Thank you for a informative and well written patch description!

Reading through, it seems like your second bug mentioned could probably be reproduced and fixed independently of the others? Would it make sense to split that into it's own patch?

One bit I don't understand the reasoning on is increasing minimal RVV alignment to 16 bytes. You say that this makes it easier to reason about, but I don't really follow that.

llvm/test/CodeGen/RISCV/rvv/rv32-spill-vector.ll
55

Near as I can tell, these shifts are coming from the 16 byte minimum RVV alignment right? Or is there some other cause I'm missing.

As noted in the top-level comment, I wonder if this is worthwhile.

Reading through, it seems like your second bug mentioned could probably be reproduced and fixed independently of the others? Would it make sense to split that into it's own patch?

It's likely, yeah. It's something I came across most of the way through fixing everything else so I didn't really think about it. I'm sure I could come up with a test that exercises this, and I think the fix is relatively self-contained. I'll give it a bash later.

One bit I don't understand the reasoning on is increasing minimal RVV alignment to 16 bytes. You say that this makes it easier to reason about, but I don't really follow that.

Yeah, it's a good question. To be honest the real reason slipped my mind while writing the description. It was in part of fixing the scalar stack alignment. Since it's underneath the RVV section and needs to be aligned to 16 bytes, unless we know that the RVV section is 16-byte aligned (in size) then we'd have to dynamically realign sp to keep the scalar section below it correctly aligned. This would affect the situation where our (minimum) VLEN is 64.

I thought that dynamically realigning the scalar stack was less desirable than just padding out the RVV stack size, but also that having different alignment requirements depending on the RVV version was overly complicated and just leads to maintenance burden (the simpler the FrameLowering the better, imo). Now that doesn't preclude us from generating different code to satisfy that same alignment requirement depending on our minimum VLEN, so in practice I don't expect the 16-byte alignment requirement to negatively affect us when VLEN >= 128 - once I add in that optimization.

kito-cheng added inline comments.May 18 2022, 9:15 AM
llvm/test/CodeGen/RISCV/rvv/rv32-spill-vector.ll
55

Does it possible to re-align by this way?

csrr a0, vlenb
addi a0, a0, 15
andi a0, a0, -16
sub sp, sp, a0
frasercrmck added inline comments.May 19 2022, 12:09 AM
llvm/test/CodeGen/RISCV/rvv/rv32-spill-vector.ll
55

Technically I think we could, but because we may be in a situation where we only have sp/bp and we need to jump over the RVV section to reach callee saves or fixed objects, that would complicate the code we emit for frame offset calculations.

I think, on balance, having a "known" size for the RVV section is preferable, even if it may waste stack space on certain (zvl32b/zvl64b) configurations.

frasercrmck edited the summary of this revision. (Show Details)May 19 2022, 3:54 AM

Reading through, it seems like your second bug mentioned could probably be reproduced and fixed independently of the others? Would it make sense to split that into it's own patch?

It's likely, yeah. It's something I came across most of the way through fixing everything else so I didn't really think about it. I'm sure I could come up with a test that exercises this, and I think the fix is relatively self-contained. I'll give it a bash later.

Split into D125962 and D125964. Thanks!

fix up a comment: RVV padding isn't only required when *accessing* RVV objects;
it affects the calculation of *any* objects when we have a non-zero RVV stack
section.

One bit I don't understand the reasoning on is increasing minimal RVV alignment to 16 bytes. You say that this makes it easier to reason about, but I don't really follow that.

Yeah, it's a good question. To be honest the real reason slipped my mind while writing the description. It was in part of fixing the scalar stack alignment. Since it's underneath the RVV section and needs to be aligned to 16 bytes, unless we know that the RVV section is 16-byte aligned (in size) then we'd have to dynamically realign sp to keep the scalar section below it correctly aligned. This would affect the situation where our (minimum) VLEN is 64.

I thought that dynamically realigning the scalar stack was less desirable than just padding out the RVV stack size, but also that having different alignment requirements depending on the RVV version was overly complicated and just leads to maintenance burden (the simpler the FrameLowering the better, imo). Now that doesn't preclude us from generating different code to satisfy that same alignment requirement depending on our minimum VLEN, so in practice I don't expect the 16-byte alignment requirement to negatively affect us when VLEN >= 128 - once I add in that optimization.

Ok, I think that makes sense. We can always chose a more complicated scheme later if we don't like the extra padding, but fixing bugs comes first.

Again though, I think this is a separable patch. Can I ask you to split that off? We should be able to demonstrate the misaligned stack without fixing either a) internal underalignment alignments of RVV objects, or b) your offseting code (I think?). If I understood you correctly, this basically just ends up needing to set RVVPadding more often.

Hm, though one further question which maybe shows I don't fully understand. With vl128b, is there anything preventing the actual implementation from having a VLEN=196? My understanding was that the vlNb variants were a minimum, not an exact value. If so, don't we need to pad out to a 16b alignment even with vl128? (Except when actual register size is known.)

craig.topper added a comment.EditedMay 19 2022, 9:11 AM

One bit I don't understand the reasoning on is increasing minimal RVV alignment to 16 bytes. You say that this makes it easier to reason about, but I don't really follow that.

Yeah, it's a good question. To be honest the real reason slipped my mind while writing the description. It was in part of fixing the scalar stack alignment. Since it's underneath the RVV section and needs to be aligned to 16 bytes, unless we know that the RVV section is 16-byte aligned (in size) then we'd have to dynamically realign sp to keep the scalar section below it correctly aligned. This would affect the situation where our (minimum) VLEN is 64.

I thought that dynamically realigning the scalar stack was less desirable than just padding out the RVV stack size, but also that having different alignment requirements depending on the RVV version was overly complicated and just leads to maintenance burden (the simpler the FrameLowering the better, imo). Now that doesn't preclude us from generating different code to satisfy that same alignment requirement depending on our minimum VLEN, so in practice I don't expect the 16-byte alignment requirement to negatively affect us when VLEN >= 128 - once I add in that optimization.

Ok, I think that makes sense. We can always chose a more complicated scheme later if we don't like the extra padding, but fixing bugs comes first.

Again though, I think this is a separable patch. Can I ask you to split that off? We should be able to demonstrate the misaligned stack without fixing either a) internal underalignment alignments of RVV objects, or b) your offseting code (I think?). If I understood you correctly, this basically just ends up needing to set RVVPadding more often.

Hm, though one further question which maybe shows I don't fully understand. With vl128b, is there anything preventing the actual implementation from having a VLEN=196? My understanding was that the vlNb variants were a minimum, not an exact value. If so, don't we need to pad out to a 16b alignment even with vl128? (Except when actual register size is known.)

VLEN must be a power of 2 less than or equal to 65536. From the spec "The number of bits in a single vector register, VLEN ≥ ELEN, which must be a power of 2, and must be no greater than 2^16." So VLEN=196 is not possible

Ok, I think that makes sense. We can always chose a more complicated scheme later if we don't like the extra padding, but fixing bugs comes first.

Again though, I think this is a separable patch. Can I ask you to split that off? We should be able to demonstrate the misaligned stack without fixing either a) internal underalignment alignments of RVV objects, or b) your offseting code (I think?). If I understood you correctly, this basically just ends up needing to set RVVPadding more often.

I think there's something separable but I do think it'd have to bring a lot of this patch with it. To keep the base of the RVV section aligned to 16, we'd need to use this new computation of the RVV padding as a hard-coded 16 doesn't cut it. Once the padding changes, we'd also need to change some or all of the offsetting/frame index reference code to account for it.

(To clarify, with this patch the RVV padding is conceptually only the part at the bottom of the stack between the scalar section and the base of the RVV section. The alignment above the RVV section is taken into account via the alignTo in getStackSizeWithRVVPadding.

On main without this patch, RVV padding is more nebulous and is the combined above/below parts, so the offsetting only needs to round some offsets up to 8, rather than accurately tracking the various sections we're traversing. For example, If the scalar section size is 8 then the RVV padding is effectively zero between the scalar and RVV sections, and is 16 above RVV. If the scalar section size is 12 then the RVV padding is 4 below and 12 above.

I personally found that quite difficult to reason about, so that's why I made the conceptual change. I tried to reflect this in the changes I made to the stack diagrams in the code - I hope it's clear enough. It's quite subtle and does affect how offsets are computed. I think with this patch the whole scheme is a bit clearer - conceptually, at least)

Anyway I guess my conclusion is that these two things are more interlinked than it may first appear, and splitting out the 16-byte stuff almost feels as though it's the majority of this patch. The smaller part may actually the ensuring of RVV alignments greater than 16 and all the internal RVV alignments as you point out.

We could also maybe split off the change that increases the size of the RVV stack to be a multiple of 16 bytes, while leaving it underaligned at 8. That might reduce some of the test noise involving the shifts. It might also fix the scalar-stacka-align.ll issue too.

StephenFan added a comment.EditedMay 20 2022, 7:02 AM
This comment has been deleted.
StephenFan added inline comments.May 20 2022, 7:04 AM
llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
753

If I didn't misunderstand, the SP here is assumed that align to maxalign, which is the max alignment of all stack objects (including rvv stack objects). But I found https://github.com/llvm/llvm-project/blob/main/llvm/lib/CodeGen/MachineFrameInfo.cpp#L61 only records max align on non-scalable stack objects.

And I also worried that if the size of RVV padding is big enough, the size of the local scalar variable field is out of the range of 12 bits signed number. If it is necessary to add an extra scavenger spill slot for it?

And I also worried that if the size of RVV padding is big enough, the size of the local scalar variable field is out of the range of 12 bits signed number. If it is necessary to add an extra scavenger spill slot for it?

We already reserve a scavenger slot if there are any RVV objects at all in the frame - I believe that this one reg is sufficient to cover this eventuality. I don't think there's anything special about RVV padding compared with just the number of scalar local variables being too large for a 12-bit int in the "regular" case.

llvm/lib/Target/RISCV/RISCVFrameLowering.cpp
753

Yeah, that's right. This patch itself isn't enough - we have to update MaxAlign ourselves with the RVV objects. You probably saw that in D125973.

Although, some scalable objects are included in max align, as you can already see in rvv-stack-align.mir which does realign the stack to 32 if there's a 32-byte aligned scalable-vector alloca. I haven't dug into that to see why some are included and some aren't, but I don't think it affects our correctness once D125973 is merged.

I don't think it makes much sense to merge D125973 first as without this patch, correctly updating the max align will do nothing to fix any of the scalable misalignment issues and I don't think it'll solve any non-scalable bugs either. That's why it's based on this one.

Still trying to wrap my head around this code. In the process, I've got a small cleanup to propose. https://reviews.llvm.org/D126088

Thinking about what this code does, I think we've got two pieces:

  • Step 1 - Pick a base register (FP, SP, or BP). There's some legality aspects in this selection (we can't index across either dynamic realign or var sized objects since we don't know their sizes), but it's also a profitability decision.
  • Step 2 - Given choice from 1, compute offset expression.

I'm wondering if it makes sense to separate the code out this way. If we had a function which only did step 2, then having the asserts to make sure we didn't try to cross var sized objects becomes easy. It also becomes easy to ensure that e.g. all the SP offset paths stay in sync.

Thoughts?

StephenFan accepted this revision.May 21 2022, 6:24 AM

And I also worried that if the size of RVV padding is big enough, the size of the local scalar variable field is out of the range of 12 bits signed number. If it is necessary to add an extra scavenger spill slot for it?

We already reserve a scavenger slot if there are any RVV objects at all in the frame - I believe that this one reg is sufficient to cover this eventuality. I don't think there's anything special about RVV padding compared with just the number of scalar local variables being too large for a 12-bit int in the "regular" case.

You are right. Thanks for your explanation!

This revision is now accepted and ready to land.May 21 2022, 6:24 AM

Thinking about what this code does, I think we've got two pieces:

  • Step 1 - Pick a base register (FP, SP, or BP). There's some legality aspects in this selection (we can't index across either dynamic realign or var sized objects since we don't know their sizes), but it's also a profitability decision.
  • Step 2 - Given choice from 1, compute offset expression.

I'm wondering if it makes sense to separate the code out this way. If we had a function which only did step 2, then having the asserts to make sure we didn't try to cross var sized objects becomes easy. It also becomes easy to ensure that e.g. all the SP offset paths stay in sync.

Thoughts?

I prototyped this structure, and all of the tests in tree continued to pass.

Given this change has been LGTMed, this is not a blocker, but I think this code organization does make sense as a post cleanup.

I am not particular confident in the correctness of this code as it stands. Either the existing code, or this patch.

I prototyped this structure, and all of the tests in tree continued to pass.

Given this change has been LGTMed, this is not a blocker, but I think this code organization does make sense as a post cleanup.

I am not particular confident in the correctness of this code as it stands. Either the existing code, or this patch.

Sorry, forgot to reply. The change you propose makes sense to me logically. I can't tell if there are gotchas in reorganizing things like that though.

Since I think discussions have been across various patches at this point, which aspect of this patch are you suspicious of?

I prototyped this structure, and all of the tests in tree continued to pass.

Given this change has been LGTMed, this is not a blocker, but I think this code organization does make sense as a post cleanup.

I am not particular confident in the correctness of this code as it stands. Either the existing code, or this patch.

Sorry, forgot to reply. The change you propose makes sense to me logically. I can't tell if there are gotchas in reorganizing things like that though.

Since I think discussions have been across various patches at this point, which aspect of this patch are you suspicious of?

A couple of themes, but nothing actionable.

Theme 1 - We have confirmed we have untested code. We know this code used to differ between two different cases of using SP as the base register. Its not clear if this is correct, or not. This implies both lack of test coverage, and (probably) lack of understanding.

Theme 2 - This is very complicated code, and the changes are broad enough to be hard to track.

I am not objecting to this landing. It's probably more correct than what's in tree. I'm just saying that even with this, we probably have uncaught issues here. (Or at least, the code structure doesn't convince me that we *don't*.) I am a strong proponent of reducing code complexity until it can't be simplified further as a means to flesh out bugs.

This revision was landed with ongoing or failed builds.May 23 2022, 11:04 PM
This revision was automatically updated to reflect the committed changes.

A couple of themes, but nothing actionable.

Theme 1 - We have confirmed we have untested code. We know this code used to differ between two different cases of using SP as the base register. Its not clear if this is correct, or not. This implies both lack of test coverage, and (probably) lack of understanding.

Theme 2 - This is very complicated code, and the changes are broad enough to be hard to track.

I am not objecting to this landing. It's probably more correct than what's in tree. I'm just saying that even with this, we probably have uncaught issues here. (Or at least, the code structure doesn't convince me that we *don't*.) I am a strong proponent of reducing code complexity until it can't be simplified further as a means to flesh out bugs.

Yeah, good points. I hope we can do further work on this code to simplify it and increase readability so we can reduce the barrier to entry somewhat. I don't have any plans for Theme 2, really, but for 1) I am going to root out the untested code and work out whether it's a lack of coverage or whether it's logically impossible to reach.