This is an archive of the discontinued LLVM Phabricator instance.

[X86] Support replacing aligned vector moves with unaligned moves when avx is enabled. (off by default)
Abandoned · Public

Authored by LuoYuanke on Sep 28 2020, 1:35 AM.

Details

Summary
With AVX, the performance of aligned and unaligned vector moves on X86 is the
same if the address is aligned. However, if the address is not aligned, an
aligned vector move raises an exception while an unaligned vector move can
still run. To be conservative, the LLVM option "x86-enable-unaligned-vector-move"
is added to enable this preference; it is off by default.

Change-Id: I85ab9749013d7e1abb237e03bc22eeacfd37836a
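
For illustration, a minimal sketch of the difference (not taken from this patch; the exact instructions chosen depend on the target and optimization level):

#include <immintrin.h>

// Dereferencing a __m128* promises 16-byte alignment, so codegen may emit an
// aligned move such as vmovaps, which faults on a misaligned address.
__m128 load_via_cast(const float *p) {
  return *(const __m128 *)p;
}

// _mm_loadu_ps makes no alignment promise and lowers to an unaligned move
// such as vmovups, which accepts any address.
__m128 load_unaligned(const float *p) {
  return _mm_loadu_ps(p);
}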

Diff Detail

Event Timeline

LuoYuanke created this revision.Sep 28 2020, 1:35 AM
Herald added a project: Restricted Project. · View Herald Transcript · Sep 28 2020, 1:35 AM
LuoYuanke requested review of this revision.Sep 28 2020, 1:35 AM

What issue is this fixing?

However, if the address is not aligned, movaps raises an exception while movups can still run.

That sounds like either a miscompile happened along the way, or the original source code had UB to begin with.

It can avoid a segmentation fault when an unaligned pointer is cast:

#include <immintrin.h>

extern __m128 value;

void add(void* pointer) {
    value = _mm_add_ps(value,*(__m128*)pointer);
}
lebedev.ri requested changes to this revision.Sep 28 2020, 2:00 AM

That is undefined behaviour:
https://godbolt.org/z/xdWKje

This revision now requires changes to proceed.Sep 28 2020, 2:00 AM
RKSimon added a subscriber: RKSimon.

As @lebedev.ri has said, I don't think this is a good idea, and if it's happening it sounds like you have an underlying bug in your code that the sanitizers would probably help you with.

Ignoring the motivating code here for a second: using aligned load/store instructions with AVX is a little weird. If we fold the load into an arithmetic op, with AVX it doesn't have to be aligned, so using the folded instruction suppresses the fault check. This is different from SSE, where we can only fold aligned loads, except on AMD CPUs. So AVX provides an inconsistent faulting experience. But if we were going to change that, I'd hunt down all the places we check the alignment and remove them, and take the code size reduction in the compiler instead of adding a new pass.
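
A small sketch of the folding behaviour described above (the exact instructions are illustrative and depend on the target and optimization level):

#include <immintrin.h>

__m128 value;

// With SSE2 the load below is typically folded into 'addps xmm, [mem]',
// which requires a 16-byte aligned address and faults otherwise.
// With AVX the folded form 'vaddps xmm, xmm, [mem]' does not check
// alignment, so a fault only occurs if a standalone vmovaps is emitted.
void add(const __m128 *pointer) {
  value = _mm_add_ps(value, *pointer);
}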

If the address is random, it may happen to be aligned during validation, so the sanitizer tools don't notice it, but in real runs it crashes randomly. There is no harm in replacing movaps with movups, and it can avoid some crashes. Is it doable to add an option to let the user choose movups or movaps?

If you change void* to char*, you get a clang diagnostic:

<source>:4:38: warning: cast from 'char *' to '__m128 *' increases required alignment from 1 to 16 [-Wcast-align]
    __m128 value = _mm_add_ps(value,*(__m128*)pointer);
                                     ^~~~~~~~~~~~~~~~

So this should in principle not require sanitizers.
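
A minimal reproducer for that diagnostic, assuming the code is compiled with -Wcast-align (this variant of the example is an assumption, not the exact godbolt source):

#include <immintrin.h>

extern __m128 value;

void add(char *pointer) {
  // warning: cast from 'char *' to '__m128 *' increases required alignment
  // from 1 to 16 [-Wcast-align]
  value = _mm_add_ps(value, *(__m128 *)pointer);
}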

I didn't get the diagnostic at https://godbolt.org/z/8aGhd5. Another example may look like this: a float array is packed in a struct.

#include <immintrin.h>

__m128 value;

typedef struct _data_str {
    int header;
    float src[400];
} data_t;

data_t data;

void add(__m128* pointer) {
    value = _mm_add_ps(value, *pointer);
}

void foo() {
    for (int i = 0; i < 400; i += 4)
    add((__m128*)(&data.src[i]));
}
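
For context: on x86, src starts at offset 4 in the struct (right after the int header), so whenever data itself is 16-byte aligned, the pointer passed to add() for the indices used in foo() is 4 bytes past a 16-byte boundary. A quick standalone check (a hypothetical helper, not part of the example above):

#include <stddef.h>
#include <stdio.h>

typedef struct _data_str {
  int header;
  float src[400];
} data_t;

int main(void) {
  // On x86, offsetof(data_t, src) is 4, so &data.src[i] cannot be 16-byte
  // aligned for i = 0, 4, 8, ... if data itself is 16-byte aligned.
  printf("offsetof(data_t, src) = %zu\n", offsetof(data_t, src));
  return 0;
}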

Compiling for SSE, this code will likely use the memory form of addps, which will fault on the misalignment. I know this patch only targets AVX.

I don’t think you can motivate this change by showing what code you want to accept if the code would crash when compiled with the default SSE2 target.

Yes, this patch only targets AVX.

LuoYuanke added inline comments.Sep 30 2020, 1:26 AM
llvm/lib/Target/X86/X86MovapsToMovups.cpp
70 ↗(On Diff #294623)

We only target AVX.

Note that even if x86 codegen always emits unaligned ops (which will cause new questions/bug reports),
the original IR will still contain UB, and it will only be a question of time until that causes some other 'miscompile'.
I really think this should be approached from the front-end diagnostics side.

Sorry, what does 'UB' mean? Why would it cause a 'miscompile'? The compiler still thinks the address is aligned. Selecting movups doesn't break the compiler's assumption. Is there any reason movaps is better than movups? To detect the alignment exception?

Sorry, what does 'UB' mean?

undefined behavior

Why would it cause a 'miscompile'? The compiler still thinks the address is aligned.

That is very precisely my point.

Selecting movups doesn't break the compiler's assumption. Is there any reason movaps is better than movups? To detect the alignment exception?

Why do we need to detect the alignment exception? It is just like an assert; it can be done in debug mode. So can we select movaps in a debug build and movups in a non-debug build?

I never said that. I'm only saying that the original source code has undefined behavior,
and even if you mask it with this patch, it will most likely manifest in some other way later on.

LuoYuanke updated this revision to Diff 296033.Oct 4 2020, 2:44 AM

Add the LLVM option "-enable-x86-movaps-to-movups" to enable the movups preference.
The option is false by default.

RKSimon added inline comments.Oct 7 2020, 2:37 AM
llvm/lib/Target/X86/CMakeLists.txt
54

sorting

llvm/lib/Target/X86/X86MovapsToMovups.cpp
1 ↗(On Diff #296040)

Very minor issue, but this isn't just movups; how about X86UnalignedVectorMoves.cpp?

13 ↗(On Diff #296040)

"So unaligned load/stores may be preferred if hardware exceptions can't be trusted."?

LuoYuanke updated this revision to Diff 296662.Oct 7 2020, 6:43 AM

Rebase and address Simon's comments.

LuoYuanke marked 2 inline comments as done.Oct 7 2020, 7:00 AM
LuoYuanke added inline comments.
llvm/lib/Target/X86/X86MovapsToMovups.cpp
13 ↗(On Diff #296040)

Sometimes users prefer an inefficient unaligned load/store to a hardware exception when the address is unaligned. We provide the opportunity for the user to choose the behavior.

Alignment information isn't just used to select aligned or unaligned instructions; it's also used by alias analysis, for example. If the compiler is being told an incorrect alignment, it could cause incorrect optimization in other parts of the compiler.

(That's what i've been saying all this time...)

craig.topper added inline comments.Oct 7 2020, 8:07 AM
llvm/lib/Target/X86/X86MovapsToMovups.cpp
1 ↗(On Diff #296040)

But the patch only looks at movaps opcodes. It should look at all move opcodes.

38 ↗(On Diff #296040)

This option should start with x86-, and again, it shouldn't just be movaps. It also needs to handle movapd, movdqa, movdqa64, and movdqa32.
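
A sketch of what the renamed option might look like (the variable name and description text are illustrative; the flag string is the one from the final summary):

#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Off by default; only replaces aligned vector moves when the user opts in.
static cl::opt<bool> X86EnableUnalignedVectorMove(
    "x86-enable-unaligned-vector-move",
    cl::desc("Replace aligned vector moves with unaligned moves when AVX is "
             "enabled"),
    cl::init(false), cl::Hidden);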

LuoYuanke updated this revision to Diff 297380.Oct 9 2020, 7:54 PM

Addressed Craig's comments. Added transform for movapd and movdq.

LuoYuanke retitled this revision from [X86] Replace movaps with movups when avx is enabled. to [X86] Replace aligned vector move with unaligned move when avx is enabled..
LuoYuanke edited the summary of this revision. (Show Details)
pengfei added inline comments.Oct 9 2020, 8:14 PM
llvm/lib/Target/X86/X86UnalignedVectorMoves.cpp
11

Change the comments as well.

92

Change | to ||

LuoYuanke updated this revision to Diff 297381.Oct 9 2020, 8:26 PM
LuoYuanke retitled this revision from [X86] Replace aligned vector move with unaligned move when avx is enabled. to [X86] Replace aligned vector move with unaligned move when avx is enabled..
LuoYuanke edited the summary of this revision. (Show Details)

Address Pengfei's comments.

LuoYuanke marked 2 inline comments as done.Oct 9 2020, 8:27 PM
craig.topper added inline comments.Oct 9 2020, 8:31 PM
llvm/lib/Target/X86/X86UnalignedVectorMoves.cpp
94

Why do we need 3 separate functions?

craig.topper added inline comments.Oct 9 2020, 8:33 PM
llvm/lib/Target/X86/X86UnalignedVectorMoves.cpp
229

Don't include non-VEX/EVEX opcodes. Those aren't used when AVX is enabled.

LuoYuanke added inline comments.Oct 9 2020, 8:34 PM
llvm/lib/Target/X86/X86UnalignedVectorMoves.cpp
94

Separating them into 3 functions looks clearer to me. I can merge them into 1 switch statement and add 3 comments to the code. Do you prefer merging them?

craig.topper added inline comments.Oct 9 2020, 8:41 PM
llvm/lib/Target/X86/X86UnalignedVectorMoves.cpp
94

I'd prefer one function. And if you can get all 3 parts of each case onto one line without exceeding 80 columns, I'd prefer that:

case X86::VMOVDQA32Z128mr: NewOpc = X86::VMOVDQU32Z128mr; break;

Align the start of NewOpc on every line, and the same for the break. See, for example, the nested switch in X86InstrInfo::optimizeCompareInstr.
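
A minimal sketch of a single switch in that style, as it might appear in the pass (the opcode list is abbreviated; names beyond the one case quoted above follow the usual X86 backend opcode naming and should be treated as assumptions):

static unsigned getUnalignedOpcode(unsigned Opc) {
  unsigned NewOpc = 0; // 0 means "not an aligned vector move"
  switch (Opc) {
  default: break;
  case X86::VMOVAPSrm:       NewOpc = X86::VMOVUPSrm;       break;
  case X86::VMOVAPSYrm:      NewOpc = X86::VMOVUPSYrm;      break;
  case X86::VMOVAPDrm:       NewOpc = X86::VMOVUPDrm;       break;
  case X86::VMOVDQArm:       NewOpc = X86::VMOVDQUrm;       break;
  case X86::VMOVDQA32Z128mr: NewOpc = X86::VMOVDQU32Z128mr; break;
  }
  return NewOpc;
}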

LuoYuanke updated this revision to Diff 297385.Oct 9 2020, 9:36 PM

Address Craig's comments.

LuoYuanke marked 2 inline comments as done.Oct 9 2020, 9:37 PM
pengfei added inline comments.Oct 9 2020, 10:05 PM
llvm/lib/Target/X86/X86UnalignedVectorMoves.cpp
108

Could you align all the NewOpc assignments and breaks?

lebedev.ri retitled this revision from [X86] Replace aligned vector move with unaligned move when avx is enabled. to [X86] Support replacing aligned vector moves with unaligned moves when avx is enabled. (off by default).Oct 9 2020, 11:16 PM
LuoYuanke updated this revision to Diff 297401.Oct 10 2020, 2:34 AM

Address Pengfei's comments.

LuoYuanke marked an inline comment as done.Oct 10 2020, 2:35 AM

LGTM. Thanks!
Since the pass is turned off by default, I think we can let it in. @lebedev.ri, what's your opinion?

lebedev.ri added a comment.EditedOct 10 2020, 2:59 AM

I still retain my original opinion that this is trying to paper over broken source code,
and incorrectly so, because even if the backend doesn't make use of the alignment information
that was lowered from the source code into IR, the IR will still contain incorrect alignment
information, and it is only a matter of time until that UB manifests in some other way.

As i see it, there are 5 options:

  1. Don't manually vectorize the code
  2. Use UBSan to catch these issues
  3. Enhance clang/clang-tidy to better catch these issues
  4. Don't do aligned loads https://godbolt.org/z/38jrvE
  5. Add a clang (!) switch to make __m128 unaligned

I strongly suggest that option 4 be taken.
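
A minimal sketch of what option 4 means for the very first example in this thread, assuming _mm_loadu_ps as the unaligned load (the linked godbolt may differ in detail):

#include <immintrin.h>

extern __m128 value;

void add(const void *pointer) {
  // _mm_loadu_ps makes no alignment claim, so this is well defined for any
  // pointer that refers to 16 readable bytes of float data.
  value = _mm_add_ps(value, _mm_loadu_ps((const float *)pointer));
}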

I think it is friendly for the compiler to provide an opportunity for the user to decide whether they prefer aligned or unaligned loads. As far as I know, some processors also have a control register to raise or suppress exceptions on unaligned memory accesses. Leaving the decision to the user doesn't harm any existing behavior. For X86 we have a choice between the aligned and unaligned instructions; what if some processor only has instructions that don't raise exceptions on unaligned memory accesses?

lebedev.ri resigned from this revision.Oct 10 2020, 4:17 AM

I think we are talking past each other.
Do you agree that even with this patch, the LLVM IR will still contain an incorrect alignment on loads (align 16, https://godbolt.org/z/4d8xM3)?
Do you agree that it is undefined behaviour?
Do you agree that by only hiding that fact in the backend, the middle-end optimization pipeline is still free to make use of that incorrect information to miscompile the code?
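
For reference, a sketch of what the middle end sees for the first example in this thread (the IR in the comment is abridged and approximate; the godbolt link above shows the full output):

#include <immintrin.h>

extern __m128 value;

// Clang lowers the dereference roughly to
//   %v = load <4 x float>, <4 x float>* %pointer, align 16
// and that 'align 16' remains visible to every middle-end pass, no matter
// whether the backend later prints vmovaps or vmovups.
void add(void *pointer) {
  value = _mm_add_ps(value, *(__m128 *)pointer);
}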

LuoYuanke abandoned this revision.Nov 19 2020, 5:52 PM

I am abandoning this patch since we didn't reach a consensus.