Currently clang aligns to 16 bytes when passing __m128/__m256/__m512 vector types.
However, when calculating va_arg, every argument is treated as 4-byte aligned, including
struct, union and vector types. For struct/union there is no problem, because they are
aligned to 4 bytes when passed. For __m128/__m256/__m512 vector types, va_arg gets the wrong result.
This patch makes va_arg follow the rules below:
- When the target doesn't support avx and avx512: get __m128/__m256/__m512 from a 16-byte aligned stack slot.
- When the target supports avx: get __m256/__m512 from a 32-byte aligned stack slot.
- When the target supports avx512: get __m512 from a 64-byte aligned stack slot.
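
The rules above can be sketched as a small alignment helper. This is only an illustration, not the code in the patch: the names `simd_level`, `va_arg_vector_align` and `align_up` are hypothetical, and the sketch assumes that a vector is never aligned beyond its own size (so __m128 stays at 16 bytes, and __m256 at 32 bytes on an avx512 target).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature levels, for illustration only. */
enum simd_level { SIMD_NO_AVX, SIMD_AVX, SIMD_AVX512 };

/* Alignment va_arg should assume for a vector of vec_bytes bytes
 * (16 for __m128, 32 for __m256, 64 for __m512). */
static size_t va_arg_vector_align(size_t vec_bytes, enum simd_level level)
{
    size_t max_align = 16;              /* no avx, no avx512 */
    if (level == SIMD_AVX)
        max_align = 32;
    else if (level == SIMD_AVX512)
        max_align = 64;

    /* Assumption: alignment is capped by the vector's own size. */
    return vec_bytes < max_align ? vec_bytes : max_align;
}

/* Round addr up to the next multiple of align (align is a power of two). */
static uintptr_t align_up(uintptr_t addr, size_t align)
{
    return (addr + align - 1) & ~(uintptr_t)(align - 1);
}
```

With this helper, fetching a vector argument amounts to rounding the current argument pointer up with `align_up(ptr, va_arg_vector_align(size, level))` before loading, instead of assuming 4-byte alignment.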