This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Add a test showing incorrect RVV stack alignment
ClosedPublic

Authored by frasercrmck on Oct 1 2021, 7:19 AM.

Details

Summary

The RISC-V stack is assumed to be aligned to 16 bytes and can handle stack
realignment for larger objects, but the "RVV stack" is only ensured to be
aligned to 8 bytes. This means that objects specified at a larger alignment may
be misaligned, not only for 16-byte-aligned RVV objects that don't trigger
realignment, but also for 32-byte-and-larger-aligned objects which do.

The new test checks a variety of alignment configurations, showing the
misaligned cases.

Diff Detail

Event Timeline

frasercrmck created this revision.Oct 1 2021, 7:19 AM
frasercrmck requested review of this revision.Oct 1 2021, 7:19 AM

So I'm not 100% sure yet, but this alignment bug may well also apply to certain scalar (well, non-scalable-vector) stack objects if there's a non-zero RVV stack size and they're on the "wrong" size of the RVV section from whichever frame register we're using.

I still haven't found the right fix for this. Do we have to track the maximum vector-object alignment separately to minimize changes on generated code that doesn't need vector alignment larger than 8 bytes?

If this misalign problem does affect the correct executable of the program contain RVV stack object, and to fix it, the following two ways may make sense:

  1. Deal with allocations of RVV vector object like variable sized object.
  2. Change the RVV frame layout to what D93750 did. Instead of calculating the total RVVStackSize and allocate them all at once, D93750 will allocate the RVV stack objects one by one. Thus we can realign for every RVV stack object separately when it was allocating.
luke957 added a subscriber: luke957.Oct 3 2021, 2:59 AM

I didn't understand the problem. In the second test case rvv_stack_align16, the sp will be udpated to sp -16 - vlenb x 2. It is aligned to 16 bytes for vlenb = 8, 16, ..., etc. Isn't it?
In addition, the memory alignment constraint is to align to the size of element. <vscale x 4 x i32> is to align 4, not align to 16.

  • reduce the test a bit less to show actual 16-byte-alignment bug

I didn't understand the problem. In the second test case rvv_stack_align16, the sp will be udpated to sp -16 - vlenb x 2. It is aligned to 16 bytes for vlenb = 8, 16, ..., etc. Isn't it?
In addition, the memory alignment constraint is to align to the size of element. <vscale x 4 x i32> is to align 4, not align to 16.

Hi, you're right, thanks. I reduced it at the last minute a little too far. I've updated the test so that with two 8-byte objects in front it uses an offset of 24 from the stack pointer.

You're right that the minimum/desirable alignment for RVV is the same as the element size, but it's not a requirement on the maximum: just the minimum. The user can always supply their own alignment to something like alloca and we should be able to support it. And lastly, as it happens, if you use IRBuilder to create alloca <vscale x 4 x i32> you should in fact see that our getPrefTypeAlign does indeed return 16 despite our best wishes. I think the DataLayout still always aligns vectors to their (known) size.

If this misalign problem does affect the correct executable of the program contain RVV stack object, and to fix it, the following two ways may make sense:

  1. Deal with allocations of RVV vector object like variable sized object.
  2. Change the RVV frame layout to what D93750 did. Instead of calculating the total RVVStackSize and allocate them all at once, D93750 will allocate the RVV stack objects one by one. Thus we can realign for every RVV stack object separately when it was allocating.

Thanks for the ideas. I must admit I don't know how this should be fixed, given all of the fp/sp/bp/fixed object/"rvv object"/"default object" scenarios. Given that the usual case is <= 8-byte alignment for vector objects, I wouldn't want to pessimise the default path if it can be avoided. But if we have any objects with greater alignment we should be producing the correct stack layout.

As @HsiangKai says, since VLEN is at least 64 we know any combination of vector objects is always aligned to at least 8 (I suspect this may add yet another issue when supporting VLEN=32 @craig.topper, @rogfer01). We could possibly track vector object alignment separately and realign the stack pointer if there's anything >= 8. We might need the base pointer in that scenario?

This issue is failing some of our tests though. We're creating an alloca and even though we can "underspecify" the alignment (from the DataLayout's point of view) after creation to 4 or 8, some other LLVM passes just reset it back to the pref type align. I suppose that's legal but it was a little unexpected when I was trying to "fix" this in the IR.

As @HsiangKai says, since VLEN is at least 64 we know any combination of vector objects is always aligned to at least 8 (I suspect this may add yet another issue when supporting VLEN=32 @craig.topper, @rogfer01). We could possibly track vector object alignment separately and realign the stack pointer if there's anything >= 8. We might need the base pointer in that scenario?

Ah yes I now see why this isn't good enough. We can calculate the base sp alignment easy enough. But if we needed to calculate offsets for multiple aligned objects, e.g.:

| <vscale x 1 x i64> align 16 | <- SP + alignTo(VLENB, 32) + alignTo(VLENB, 16)
|-----------------------------|
| <vscale x 1 x i64> align 32 | <- SP + alignTo(VLENB, 32)
|-----------------------------|
| <vscale x 1 x i64>          | <- SP

If we only know that VLENB >= 8 then we need to dynamically "align" each offset by an unknown amount. This is what you meant by option 2, right @StephenFan?

  1. Change the RVV frame layout to what D93750 did. Instead of calculating the total RVVStackSize and allocate them all at once, D93750 will allocate the RVV stack objects one by one. Thus we can realign for every RVV stack object separately when it was allocating.

Unless we do something more blunt like scale VLENB up by a known good multiplier. But that would quickly start to generate huge offsets.

| <vscale x 1 x i64> align 16 | <- SP + VLENB*6
|-----------------------------|
| <vscale x 1 x i64> align 32 | <- SP + VLENB*4
|-----------------------------|
| <vscale x 1 x i64>          | <- SP

I suppose we could dynamically capture alignTo(VLENB, MaxRVVAlign) in a register and use that in offset computations? Since that's a static amount, if we know that the maximum alignment is at most 8 then nothing needs to change. It would still overalign but less so.

| <vscale x 1 x i64> align 16 | <- SP + 2*alignTo(VLENB, 32)
|-----------------------------|
| <vscale x 1 x i64> align 32 | <- SP + alignTo(VLENB, 32)
|-----------------------------|
| <vscale x 1 x i64>          | <- SP

I think we'd have to materialize that amount each time, though. We can't keep it live through the whole function.

As @HsiangKai says, since VLEN is at least 64 we know any combination of vector objects is always aligned to at least 8 (I suspect this may add yet another issue when supporting VLEN=32 @craig.topper, @rogfer01). We could possibly track vector object alignment separately and realign the stack pointer if there's anything >= 8. We might need the base pointer in that scenario?

Ah yes I now see why this isn't good enough. We can calculate the base sp alignment easy enough. But if we needed to calculate offsets for multiple aligned objects, e.g.:

| <vscale x 1 x i64> align 16 | <- SP + alignTo(VLENB, 32) + alignTo(VLENB, 16)
|-----------------------------|
| <vscale x 1 x i64> align 32 | <- SP + alignTo(VLENB, 32)
|-----------------------------|
| <vscale x 1 x i64>          | <- SP

If we only know that VLENB >= 8 then we need to dynamically "align" each offset by an unknown amount. This is what you meant by option 2, right @StephenFan?

Yes. But I didn't understand why we need to alignTo(VLENB, 32) rather than just alignTo(32) ? The alignment of vector register is always 8, and the alloca instruction is not related to vector register.

resuscitate old test

Yes. But I didn't understand why we need to alignTo(VLENB, 32) rather than just alignTo(32) ? The alignment of vector register is always 8, and the alloca instruction is not related to vector register.

Sorry I never replied - I had to switch focus for a (long) while. I'm now getting back to fixing this bug.

By alignTo(VLENB, 32) I was trying to capture the notion that in my example we're dynamically rounding up the size of a vector to the next object's alignment, so that we can address it when multiple vector objects are in contention. I was assuming that the base pointer is aligned to the maximum alignment. So Object #0 (closest to bp) is aligned by the bp. Object #1 must be aligned to 32 but is also offset by the size of Object #0 which is of size VLENB. So if VLENB not aligned to 32 (i.e., in practice it is smaller than 32 - anything larger is automatically aligned as it's a power of 2) it needs to be rounded up to 32 - if it's larger we need to skip ahead by VLENB instead. Maybe my notation was confusing. Or maybe I misunderstood you?

I'm not sure if this is what we actually want to do, at least in the initial version - I was just prototyping ideas.

add rv32 lines and zve64x

reames accepted this revision.May 17 2022, 8:00 AM
reames added a subscriber: reames.

LGTM. We can land the tests and tweak them again if needed. Most of the discussion here seems related to the possible fixes, not the tests themselves.

This revision is now accepted and ready to land.May 17 2022, 8:00 AM
This revision was landed with ongoing or failed builds.May 17 2022, 8:06 AM
This revision was automatically updated to reflect the committed changes.