Code in CC_ARM_AAPCS_Custom_Aggregate() is responsible for handling homogeneous aggregates for CC_ARM_AAPCS_VFP. When an aggregate ends up fully on the stack, the function tries to pack the aggregate's items as tightly as possible: once the first item is laid out, the alignment used for each subsequent item is the size of one item.
This logic goes wrong for 128-bit vectors: their natural alignment under AAPCS is only 64 bits, so raising the alignment to the 128-bit item size can insert unexpected padding between the first and second elements.
Example:
$ cat test.c
#include <arm_neon.h>

typedef struct { double A[4]; } S_d64_4;
typedef struct { uint32x4_t A[2]; } S_v128_2;

int foo(S_d64_4 P0, S_d64_4 P1, float P2, S_v128_2 P3) {
  // * P0 is passed in D0-D3.
  // * P1 is passed in D4-D7.
  // * P2 is passed in [SP, SP+4).
  // * P3.A[0] is passed in [SP+8, SP+24).
  // * P3.A[1] should be passed according to AAPCS in [SP+24, SP+40) but the
  //   code produced by Clang/LLVM expects it in [SP+32, SP+48).
  return vgetq_lane_u32(P3.A[0], 0) + vgetq_lane_u32(P3.A[1], 0);
}

$ clang -target arm-none-eabi -mcpu=cortex-a53 -S test.c -o -
[...]
foo:
        push    {r11, lr}
        mov     r11, sp
        sub     sp, sp, #8
        bfc     sp, #0, #4
        ldr     r0, [r11, #40]  /* load from entry-SP + #32 */
        ldr     r1, [r11, #16]  /* load from entry-SP + #8 */
        add     r0, r1, r0
        mov     sp, r11
        pop     {r11, pc}
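To see how the layout code arrives at these offsets, here is a minimal standalone sketch of the packing logic. This is an illustration only, not the actual ARMCallingConv.cpp code; AllocateStack below is a simplified stand-in for CCState's stack-slot allocation.

#include <cstdio>

// Allocate a stack slot of Size bytes at Align-byte alignment; Offset tracks
// the running offset from the incoming SP.
static unsigned AllocateStack(unsigned &Offset, unsigned Size, unsigned Align) {
  Offset = (Offset + Align - 1) & ~(Align - 1); // round up to the alignment
  unsigned Slot = Offset;
  Offset += Size;
  return Slot;
}

int main() {
  unsigned Offset = 8; // P3 starts at entry-SP + 8 in the example above
  unsigned Size = 16;  // one 128-bit vector element
  unsigned Align = 8;  // AAPCS alignment of a 128-bit vector is only 8 bytes

  for (int I = 0; I < 2; ++I) {
    unsigned Slot = AllocateStack(Offset, Size, Align);
    printf("P3.A[%d] at [SP+%u, SP+%u)\n", I, Slot, Slot + Size);
    Align = Size; // bug: raises the alignment from 8 to 16 for later items
  }
  // Prints [SP+8, SP+24) and then [SP+32, SP+48): 8 bytes of padding appear
  // because SP+24 is not 16-byte aligned.
  return 0;
}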
The proposed patch fixes the problem by updating the alignment to the item size only when that reduces it.
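In terms of the sketch above, the fix amounts to replacing the unconditional update with std::min (this mirrors the idea of the patch, not its exact diff):

  // Requires #include <algorithm> for std::min.
  for (int I = 0; I < 2; ++I) {
    unsigned Slot = AllocateStack(Offset, Size, Align);
    printf("P3.A[%d] at [SP+%u, SP+%u)\n", I, Slot, Slot + Size);
    Align = std::min(Align, Size); // only ever lowers the alignment
  }
  // Now prints [SP+8, SP+24) and [SP+24, SP+40), as AAPCS requires, while an
  // aggregate of i64s (Align 8) split into i32 slots (Size 4) still packs
  // tightly at 4-byte alignment.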
Could you hoist the std::min(Align, Size) out of the loop so it's clear it isn't changing every iteration?
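For concreteness, the hoisted form would look something like this in the sketch above (RestAlign is just an illustrative name):

  // Neither Align nor Size changes inside the loop, so the packed alignment
  // for the items after the first can be computed once up front.
  const unsigned RestAlign = std::min(Align, Size);

  for (int I = 0; I < 2; ++I) {
    unsigned Slot = AllocateStack(Offset, Size, Align);
    printf("P3.A[%d] at [SP+%u, SP+%u)\n", I, Slot, Slot + Size);
    Align = RestAlign; // items after the first are packed tightly
  }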