Download Raw Diff

Details

Reviewers

rogfer01
HsiangKai
craig.topper
StephenFan
reames

Commits

rGf7709a059ac0: [RISCV] Add a test showing incorrect RVV stack alignment

Summary

The RISC-V stack is assumed to be aligned to 16 bytes and can handle stack
realignment for larger objects, but the "RVV stack" is only ensured to be
aligned to 8 bytes. This means that objects specified at a larger alignment may
be misaligned, not only for 16-byte-aligned RVV objects that don't trigger
realignment, but also for 32-byte-and-larger-aligned objects which do.

The new test checks a variety of alignment configurations, showing the
misaligned cases.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

frasercrmck created this revision.Oct 1 2021, 7:19 AM

Herald added subscribers: achieveartificialintelligence, vkmr, evandro and 23 others. · View Herald TranscriptOct 1 2021, 7:19 AM

frasercrmck requested review of this revision.Oct 1 2021, 7:19 AM

Herald added subscribers: llvm-commits, MaskRay. · View Herald TranscriptOct 1 2021, 7:19 AM

So I'm not 100% sure yet, but this alignment bug may well also apply to certain scalar (well, non-scalable-vector) stack objects if there's a non-zero RVV stack size and they're on the "wrong" size of the RVV section from whichever frame register we're using.

I still haven't found the right fix for this. Do we have to track the maximum vector-object alignment separately to minimize changes on generated code that doesn't need vector alignment larger than 8 bytes?

If this misalign problem does affect the correct executable of the program contain RVV stack object, and to fix it, the following two ways may make sense:

Deal with allocations of RVV vector object like variable sized object.
Change the RVV frame layout to what D93750 did. Instead of calculating the total RVVStackSize and allocate them all at once, D93750 will allocate the RVV stack objects one by one. Thus we can realign for every RVV stack object separately when it was allocating.

luke957 added a subscriber: luke957.Oct 3 2021, 2:59 AM

I didn't understand the problem. In the second test case rvv_stack_align16, the sp will be udpated to sp -16 - vlenb x 2. It is aligned to 16 bytes for vlenb = 8, 16, ..., etc. Isn't it?
In addition, the memory alignment constraint is to align to the size of element. <vscale x 4 x i32> is to align 4, not align to 16.

reduce the test a bit less to show actual 16-byte-alignment bug

Harbormaster completed remote builds in B126798: Diff 376825.Oct 4 2021, 1:55 AM

In D110933#3039129, @HsiangKai wrote:

I didn't understand the problem. In the second test case rvv_stack_align16, the sp will be udpated to sp -16 - vlenb x 2. It is aligned to 16 bytes for vlenb = 8, 16, ..., etc. Isn't it?
In addition, the memory alignment constraint is to align to the size of element. <vscale x 4 x i32> is to align 4, not align to 16.

Hi, you're right, thanks. I reduced it at the last minute a little too far. I've updated the test so that with two 8-byte objects in front it uses an offset of 24 from the stack pointer.

You're right that the minimum/desirable alignment for RVV is the same as the element size, but it's not a requirement on the maximum: just the minimum. The user can always supply their own alignment to something like alloca and we should be able to support it. And lastly, as it happens, if you use IRBuilder to create alloca <vscale x 4 x i32> you should in fact see that our getPrefTypeAlign does indeed return 16 despite our best wishes. I think the DataLayout still always aligns vectors to their (known) size.

In D110933#3038463, @StephenFan wrote:

If this misalign problem does affect the correct executable of the program contain RVV stack object, and to fix it, the following two ways may make sense:

Deal with allocations of RVV vector object like variable sized object.

Change the RVV frame layout to what D93750 did. Instead of calculating the total RVVStackSize and allocate them all at once, D93750 will allocate the RVV stack objects one by one. Thus we can realign for every RVV stack object separately when it was allocating.

Thanks for the ideas. I must admit I don't know how this should be fixed, given all of the fp/sp/bp/fixed object/"rvv object"/"default object" scenarios. Given that the usual case is <= 8-byte alignment for vector objects, I wouldn't want to pessimise the default path if it can be avoided. But if we have any objects with greater alignment we should be producing the correct stack layout.

As @HsiangKai says, since VLEN is at least 64 we know any combination of vector objects is always aligned to at least 8 (I suspect this may add yet another issue when supporting VLEN=32 @craig.topper, @rogfer01). We could possibly track vector object alignment separately and realign the stack pointer if there's anything >= 8. We might need the base pointer in that scenario?

This issue is failing some of our tests though. We're creating an alloca and even though we can "underspecify" the alignment (from the DataLayout's point of view) after creation to 4 or 8, some other LLVM passes just reset it back to the pref type align. I suppose that's legal but it was a little unexpected when I was trying to "fix" this in the IR.

frasercrmck added a reviewer: StephenFan.Oct 4 2021, 8:49 AM

In D110933#3039424, @frasercrmck wrote:

As @HsiangKai says, since VLEN is at least 64 we know any combination of vector objects is always aligned to at least 8 (I suspect this may add yet another issue when supporting VLEN=32 @craig.topper, @rogfer01). We could possibly track vector object alignment separately and realign the stack pointer if there's anything >= 8. We might need the base pointer in that scenario?

Ah yes I now see why this isn't good enough. We can calculate the base sp alignment easy enough. But if we needed to calculate offsets for multiple aligned objects, e.g.:

| <vscale x 1 x i64> align 16 | <- SP + alignTo(VLENB, 32) + alignTo(VLENB, 16)
|-----------------------------|
| <vscale x 1 x i64> align 32 | <- SP + alignTo(VLENB, 32)
|-----------------------------|
| <vscale x 1 x i64>          | <- SP

If we only know that VLENB >= 8 then we need to dynamically "align" each offset by an unknown amount. This is what you meant by option 2, right @StephenFan?

In D110933#3038463, @StephenFan wrote:

Change the RVV frame layout to what D93750 did. Instead of calculating the total RVVStackSize and allocate them all at once, D93750 will allocate the RVV stack objects one by one. Thus we can realign for every RVV stack object separately when it was allocating.

Unless we do something more blunt like scale VLENB up by a known good multiplier. But that would quickly start to generate huge offsets.

| <vscale x 1 x i64> align 16 | <- SP + VLENB*6
|-----------------------------|
| <vscale x 1 x i64> align 32 | <- SP + VLENB*4
|-----------------------------|
| <vscale x 1 x i64>          | <- SP

I suppose we could dynamically capture alignTo(VLENB, MaxRVVAlign) in a register and use that in offset computations? Since that's a static amount, if we know that the maximum alignment is at most 8 then nothing needs to change. It would still overalign but less so.

| <vscale x 1 x i64> align 16 | <- SP + 2*alignTo(VLENB, 32)
|-----------------------------|
| <vscale x 1 x i64> align 32 | <- SP + alignTo(VLENB, 32)
|-----------------------------|
| <vscale x 1 x i64>          | <- SP

I think we'd have to materialize that amount each time, though. We can't keep it live through the whole function.

In D110933#3040335, @frasercrmck wrote:
In D110933#3039424, @frasercrmck wrote:

As @HsiangKai says, since VLEN is at least 64 we know any combination of vector objects is always aligned to at least 8 (I suspect this may add yet another issue when supporting VLEN=32 @craig.topper, @rogfer01). We could possibly track vector object alignment separately and realign the stack pointer if there's anything >= 8. We might need the base pointer in that scenario?

Ah yes I now see why this isn't good enough. We can calculate the base sp alignment easy enough. But if we needed to calculate offsets for multiple aligned objects, e.g.:
| <vscale x 1 x i64> align 16 | <- SP + alignTo(VLENB, 32) + alignTo(VLENB, 16)
|-----------------------------|
| <vscale x 1 x i64> align 32 | <- SP + alignTo(VLENB, 32)
|-----------------------------|
| <vscale x 1 x i64>          | <- SP
If we only know that VLENB >= 8 then we need to dynamically "align" each offset by an unknown amount. This is what you meant by option 2, right @StephenFan?

Yes. But I didn't understand why we need to alignTo(VLENB, 32) rather than just alignTo(32) ? The alignment of vector register is always 8, and the alloca instruction is not related to vector register.

resuscitate old test

Herald added a project: Restricted Project. · View Herald TranscriptMay 11 2022, 3:20 AM

Herald added subscribers: sunshaoce, • pcwang-thead, eopXD and 2 others. · View Herald Transcript

In D110933#3042262, @StephenFan wrote:

Yes. But I didn't understand why we need to alignTo(VLENB, 32) rather than just alignTo(32) ? The alignment of vector register is always 8, and the alloca instruction is not related to vector register.

Sorry I never replied - I had to switch focus for a (long) while. I'm now getting back to fixing this bug.

By alignTo(VLENB, 32) I was trying to capture the notion that in my example we're dynamically rounding up the size of a vector to the next object's alignment, so that we can address it when multiple vector objects are in contention. I was assuming that the base pointer is aligned to the maximum alignment. So Object #0 (closest to bp) is aligned by the bp. Object #1 must be aligned to 32 but is also offset by the size of Object #0 which is of size VLENB. So if VLENB not aligned to 32 (i.e., in practice it is smaller than 32 - anything larger is automatically aligned as it's a power of 2) it needs to be rounded up to 32 - if it's larger we need to skip ahead by VLENB instead. Maybe my notation was confusing. Or maybe I misunderstood you?

I'm not sure if this is what we actually want to do, at least in the initial version - I was just prototyping ideas.

Harbormaster completed remote builds in B163868: Diff 428608.May 11 2022, 4:02 AM

add rv32 lines and zve64x

frasercrmck added a child revision: D125787: [RISCV] Fix RVV stack frame alignment bugs.May 17 2022, 7:03 AM

Harbormaster completed remote builds in B164881: Diff 430044.May 17 2022, 7:59 AM

LGTM. We can land the tests and tweak them again if needed. Most of the discussion here seems related to the possible fixes, not the tests themselves.

This revision is now accepted and ready to land.May 17 2022, 8:00 AM

This revision was landed with ongoing or failed builds.May 17 2022, 8:06 AM

Closed by commit rGf7709a059ac0: [RISCV] Add a test showing incorrect RVV stack alignment (authored by frasercrmck). · Explain Why

This revision was automatically updated to reflect the committed changes.

frasercrmck added a commit: rGf7709a059ac0: [RISCV] Add a test showing incorrect RVV stack alignment.

Diff 430066

llvm/test/CodeGen/RISCV/rvv/rvv-stack-align.mir

This file was added.

				# NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				# RUN: llc -mtriple riscv32 -mattr=+zve64x -start-before=prologepilog -o - \
				# RUN: -verify-machineinstrs %s \| FileCheck %s --check-prefix=RV32
				# RUN: llc -mtriple riscv32 -mattr=+v -start-before=prologepilog -o - \
				# RUN: -verify-machineinstrs %s \| FileCheck %s --check-prefix=RV32
				# RUN: llc -mtriple riscv64 -mattr=+zve64x -start-before=prologepilog -o - \
				# RUN: -verify-machineinstrs %s \| FileCheck %s --check-prefix=RV64
				# RUN: llc -mtriple riscv64 -mattr=+v -start-before=prologepilog -o - \
				# RUN: -verify-machineinstrs %s \| FileCheck %s --check-prefix=RV64
				--- \|
				target datalayout = "e-m:e-p:64:64-i64:64-i128:128-n64-S128"
				target triple = "riscv64"

				declare void @extern(<vscale x 4 x i32>*)

				define void @rvv_stack_align8() #0 {
				; RV32-LABEL: rvv_stack_align8:
				; RV32: # %bb.0:
				; RV32-NEXT: addi sp, sp, -48
				; RV32-NEXT: sw ra, 44(sp) # 4-byte Folded Spill
				; RV32-NEXT: csrr a0, vlenb
				; RV32-NEXT: slli a0, a0, 1
				; RV32-NEXT: sub sp, sp, a0
				; RV32-NEXT: addi a0, sp, 32
				; RV32-NEXT: addi a1, sp, 16
				; RV32-NEXT: addi a2, sp, 8
				; RV32-NEXT: call extern@plt
				; RV32-NEXT: csrr a0, vlenb
				; RV32-NEXT: slli a0, a0, 1
				; RV32-NEXT: add sp, sp, a0
				; RV32-NEXT: lw ra, 44(sp) # 4-byte Folded Reload
				; RV32-NEXT: addi sp, sp, 48
				; RV32-NEXT: ret
				;
				; RV64-LABEL: rvv_stack_align8:
				; RV64: # %bb.0:
				; RV64-NEXT: addi sp, sp, -32
				; RV64-NEXT: sd ra, 24(sp) # 8-byte Folded Spill
				; RV64-NEXT: csrr a0, vlenb
				; RV64-NEXT: slli a0, a0, 1
				; RV64-NEXT: sub sp, sp, a0
				; RV64-NEXT: addi a0, sp, 24
				; RV64-NEXT: addi a1, sp, 16
				; RV64-NEXT: addi a2, sp, 8
				; RV64-NEXT: call extern@plt
				; RV64-NEXT: csrr a0, vlenb
				; RV64-NEXT: slli a0, a0, 1
				; RV64-NEXT: add sp, sp, a0
				; RV64-NEXT: ld ra, 24(sp) # 8-byte Folded Reload
				; RV64-NEXT: addi sp, sp, 32
				; RV64-NEXT: ret
				%a = alloca <vscale x 4 x i32>, align 8
				%b = alloca i64
				%c = alloca i64
				call void @extern(<vscale x 4 x i32>* %a)
				ret void
				}

				; FIXME: The alloca is not correctly aligned to 16 bytes.

				define void @rvv_stack_align16() #0 {
				; RV32-LABEL: rvv_stack_align16:
				; RV32: # %bb.0:
				; RV32-NEXT: addi sp, sp, -48
				; RV32-NEXT: sw ra, 44(sp) # 4-byte Folded Spill
				; RV32-NEXT: csrr a0, vlenb
				; RV32-NEXT: slli a0, a0, 1
				; RV32-NEXT: sub sp, sp, a0
				; RV32-NEXT: addi a0, sp, 32
				; RV32-NEXT: addi a1, sp, 16
				; RV32-NEXT: addi a2, sp, 8
				; RV32-NEXT: call extern@plt
				; RV32-NEXT: csrr a0, vlenb
				; RV32-NEXT: slli a0, a0, 1
				; RV32-NEXT: add sp, sp, a0
				; RV32-NEXT: lw ra, 44(sp) # 4-byte Folded Reload
				; RV32-NEXT: addi sp, sp, 48
				; RV32-NEXT: ret
				;
				; RV64-LABEL: rvv_stack_align16:
				; RV64: # %bb.0:
				; RV64-NEXT: addi sp, sp, -32
				; RV64-NEXT: sd ra, 24(sp) # 8-byte Folded Spill
				; RV64-NEXT: csrr a0, vlenb
				; RV64-NEXT: slli a0, a0, 1
				; RV64-NEXT: sub sp, sp, a0
				; RV64-NEXT: addi a0, sp, 24
				; RV64-NEXT: addi a1, sp, 16
				; RV64-NEXT: addi a2, sp, 8
				; RV64-NEXT: call extern@plt
				; RV64-NEXT: csrr a0, vlenb
				; RV64-NEXT: slli a0, a0, 1
				; RV64-NEXT: add sp, sp, a0
				; RV64-NEXT: ld ra, 24(sp) # 8-byte Folded Reload
				; RV64-NEXT: addi sp, sp, 32
				; RV64-NEXT: ret
				%a = alloca <vscale x 4 x i32>, align 16
				%b = alloca i64
				%c = alloca i64
				call void @extern(<vscale x 4 x i32>* %a)
				ret void
				}

				; FIXME: The alloca is not correctly aligned to 32 bytes.

				define void @rvv_stack_align32() #0 {
				; RV32-LABEL: rvv_stack_align32:
				; RV32: # %bb.0:
				; RV32-NEXT: addi sp, sp, -32
				; RV32-NEXT: sw ra, 28(sp) # 4-byte Folded Spill
				; RV32-NEXT: sw s0, 24(sp) # 4-byte Folded Spill
				; RV32-NEXT: addi s0, sp, 32
				; RV32-NEXT: csrr a0, vlenb
				; RV32-NEXT: slli a0, a0, 1
				; RV32-NEXT: sub sp, sp, a0
				; RV32-NEXT: andi sp, sp, -32
				; RV32-NEXT: addi a0, sp, 24
				; RV32-NEXT: addi a1, sp, 16
				; RV32-NEXT: addi a2, sp, 8
				; RV32-NEXT: call extern@plt
				; RV32-NEXT: addi sp, s0, -32
				; RV32-NEXT: lw ra, 28(sp) # 4-byte Folded Reload
				; RV32-NEXT: lw s0, 24(sp) # 4-byte Folded Reload
				; RV32-NEXT: addi sp, sp, 32
				; RV32-NEXT: ret
				;
				; RV64-LABEL: rvv_stack_align32:
				; RV64: # %bb.0:
				; RV64-NEXT: addi sp, sp, -32
				; RV64-NEXT: sd ra, 24(sp) # 8-byte Folded Spill
				; RV64-NEXT: sd s0, 16(sp) # 8-byte Folded Spill
				; RV64-NEXT: addi s0, sp, 32
				; RV64-NEXT: csrr a0, vlenb
				; RV64-NEXT: slli a0, a0, 1
				; RV64-NEXT: sub sp, sp, a0
				; RV64-NEXT: andi sp, sp, -32
				; RV64-NEXT: addi a0, sp, 16
				; RV64-NEXT: addi a1, sp, 8
				; RV64-NEXT: mv a2, sp
				; RV64-NEXT: call extern@plt
				; RV64-NEXT: addi sp, s0, -32
				; RV64-NEXT: ld ra, 24(sp) # 8-byte Folded Reload
				; RV64-NEXT: ld s0, 16(sp) # 8-byte Folded Reload
				; RV64-NEXT: addi sp, sp, 32
				; RV64-NEXT: ret
				%a = alloca <vscale x 4 x i32>, align 32
				%b = alloca i64
				%c = alloca i64
				call void @extern(<vscale x 4 x i32>* %a)
				ret void
				}

				attributes #0 = { nounwind nofree nosync }
				...
				---
				name: rvv_stack_align8
				alignment: 4
				frameInfo:
				isFrameAddressTaken: false
				isReturnAddressTaken: false
				hasStackMap: false
				hasPatchPoint: false
				stackSize: 0
				offsetAdjustment: 0
				maxAlignment: 8
				adjustsStack: false
				hasCalls: true
				stackProtector: ''
				maxCallFrameSize: 4294967295
				cvBytesOfCalleeSavedRegisters: 0
				hasOpaqueSPAdjustment: false
				hasVAStart: false
				hasMustTailInVarArgFunc: false
				hasTailCall: false
				localFrameSize: 0
				savePoint: ''
				restorePoint: ''
				fixedStack: []
				stack:
				- { id: 0, name: a, type: default, offset: 0, size: 16, alignment: 8,
				stack-id: scalable-vector, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: b, type: default, offset: 0, size: 8, alignment: 8,
				stack-id: default, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 2, name: c, type: default, offset: 0, size: 8, alignment: 8,
				stack-id: default, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				body: \|
				bb.0 (%ir-block.0):
				ADJCALLSTACKDOWN 0, 0, implicit-def dead $x2, implicit $x2
				$x10 = ADDI %stack.0.a, 0
				$x11 = ADDI %stack.1.b, 0
				$x12 = ADDI %stack.2.c, 0
				PseudoCALL target-flags(riscv-plt) @extern, csr_ilp32d_lp64d, implicit-def dead $x1, implicit killed $x10, implicit-def $x2
				ADJCALLSTACKUP 0, 0, implicit-def dead $x2, implicit $x2
				PseudoRET

				...
				---
				name: rvv_stack_align16
				alignment: 4
				frameInfo:
				isFrameAddressTaken: false
				isReturnAddressTaken: false
				hasStackMap: false
				hasPatchPoint: false
				stackSize: 0
				offsetAdjustment: 0
				maxAlignment: 16
				adjustsStack: false
				hasCalls: true
				stackProtector: ''
				maxCallFrameSize: 4294967295
				cvBytesOfCalleeSavedRegisters: 0
				hasOpaqueSPAdjustment: false
				hasVAStart: false
				hasMustTailInVarArgFunc: false
				hasTailCall: false
				localFrameSize: 0
				savePoint: ''
				restorePoint: ''
				fixedStack: []
				stack:
				- { id: 0, name: a, type: default, offset: 0, size: 16, alignment: 16,
				stack-id: scalable-vector, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: b, type: default, offset: 0, size: 8, alignment: 8,
				stack-id: default, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 2, name: c, type: default, offset: 0, size: 8, alignment: 8,
				stack-id: default, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				body: \|
				bb.0 (%ir-block.0):
				ADJCALLSTACKDOWN 0, 0, implicit-def dead $x2, implicit $x2
				$x10 = ADDI %stack.0.a, 0
				$x11 = ADDI %stack.1.b, 0
				$x12 = ADDI %stack.2.c, 0
				PseudoCALL target-flags(riscv-plt) @extern, csr_ilp32d_lp64d, implicit-def dead $x1, implicit killed $x10, implicit-def $x2
				ADJCALLSTACKUP 0, 0, implicit-def dead $x2, implicit $x2
				PseudoRET

				...
				---
				name: rvv_stack_align32
				alignment: 4
				frameInfo:
				isFrameAddressTaken: false
				isReturnAddressTaken: false
				hasStackMap: false
				hasPatchPoint: false
				stackSize: 0
				offsetAdjustment: 0
				maxAlignment: 32
				adjustsStack: false
				hasCalls: true
				stackProtector: ''
				maxCallFrameSize: 4294967295
				cvBytesOfCalleeSavedRegisters: 0
				hasOpaqueSPAdjustment: false
				hasVAStart: false
				hasMustTailInVarArgFunc: false
				hasTailCall: false
				localFrameSize: 0
				savePoint: ''
				restorePoint: ''
				fixedStack: []
				stack:
				- { id: 0, name: a, type: default, offset: 0, size: 16, alignment: 32,
				stack-id: scalable-vector, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 1, name: b, type: default, offset: 0, size: 8, alignment: 8,
				stack-id: default, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				- { id: 2, name: c, type: default, offset: 0, size: 8, alignment: 8,
				stack-id: default, callee-saved-register: '', callee-saved-restored: true,
				debug-info-variable: '', debug-info-expression: '', debug-info-location: '' }
				body: \|
				bb.0 (%ir-block.0):
				ADJCALLSTACKDOWN 0, 0, implicit-def dead $x2, implicit $x2
				$x10 = ADDI %stack.0.a, 0
				$x11 = ADDI %stack.1.b, 0
				$x12 = ADDI %stack.2.c, 0
				PseudoCALL target-flags(riscv-plt) @extern, csr_ilp32d_lp64d, implicit-def dead $x1, implicit killed $x10, implicit-def $x2
				ADJCALLSTACKUP 0, 0, implicit-def dead $x2, implicit $x2
				PseudoRET

				...

This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Add a test showing incorrect RVV stack alignment
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 430066

llvm/test/CodeGen/RISCV/rvv/rvv-stack-align.mir

This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Add a test showing incorrect RVV stack alignmentClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 430066

llvm/test/CodeGen/RISCV/rvv/rvv-stack-align.mir

[RISCV] Add a test showing incorrect RVV stack alignment
ClosedPublic