Page MenuHomePhabricator

[X86] Support replacing aligned vector moves with unaligned moves when avx is enabled.
Needs RevisionPublic

Authored by LiuChen3 on Mar 29 2021, 11:52 PM.

Details

Summary

With AVX the performance for aligned vector move and unaligned vector move on X86
are the same if the address is aligned. In this case we prefer to use unaligned
move because it can avoid some run time exceptions.
"-muse-unaligned-vector-move" and "-mno-use-unaligned-vector-move" are added to
enable this preference. This transform is disabled as default.

This patch is a replacement of D88396.

Diff Detail

Event Timeline

LiuChen3 created this revision.Mar 29 2021, 11:52 PM
LiuChen3 requested review of this revision.Mar 29 2021, 11:52 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptMar 29 2021, 11:52 PM
craig.topper added inline comments.Mar 30 2021, 12:56 AM
clang/include/clang/Driver/Options.td
1646

As far the user is concerned this isn’t a transform. From their perspective it’s always use unaligned move instructions.

llvm/test/CodeGen/X86/avx512vl-unaligned-load-store.ll
7

CHECK isn’t a valid prefix for this file

22

What are the tests with align 1 intended to show?

The only use I could really see for this is to prevent a developers code from crashing when it’s distributed to someone else. For paranoia because it’s possible you have a bug and got lucky with alignment in your internal testing before you shipped.

If you need this your code has undefined behavior which should be fixed. You should not use this to make a known runtime exception go away.

Your code could still be miscompiled. For example, llvm really likes to replace ADD with OR when the bits don’t overlap. So it would be very easy to have your pointer arithmetic miscompiled because llvm believes a pointer is aligned but really isn’t.

Your code would not be portable to SSE. If your application does dynamic dispatch most of your users may get the AVX code path, but the smaller percentage on older hardware or cheaper hardware that doesn’t have AVX still get the exceptions.

I think you need to be very careful with how this feature is communicated.

LiuChen3 added inline comments.Mar 30 2021, 1:34 AM
clang/include/clang/Driver/Options.td
1646

How about "Always emit unaligned move instructions." ? Do you have any suggestion here?

llvm/lib/Target/X86/X86MCInstLower.cpp
2708 ↗(On Diff #334059)

Forgot to check the SSE level. I will add in next patch.

llvm/test/CodeGen/X86/avx512vl-unaligned-load-store.ll
7

Thanks for reminding. I forgot to delete the these. I will remove these in next patch.

22

This is to distinguish the unaligned-mov converted from aligned-move and the original unaligned-move. See the difference between line12 and line28.

The only use I could really see for this is to prevent a developers code from crashing when it’s distributed to someone else. For paranoia because it’s possible you have a bug and got lucky with alignment in your internal testing before you shipped.

If you need this your code has undefined behavior which should be fixed. You should not use this to make a known runtime exception go away.

Your code could still be miscompiled. For example, llvm really likes to replace ADD with OR when the bits don’t overlap. So it would be very easy to have your pointer arithmetic miscompiled because llvm believes a pointer is aligned but really isn’t.

Your code would not be portable to SSE. If your application does dynamic dispatch most of your users may get the AVX code path, but the smaller percentage on older hardware or cheaper hardware that doesn’t have AVX still get the exceptions.

+ your code won't be portable to other compilers.

I think you need to be very careful with how this feature is communicated.

+1, i have already wrote all that in the previous revision.
I really don't think this should go in.

I really don't think this should go in.

Here are more arguments for why, I think, this is an useful option in my opinion, in arbitrary order:

  1. This was requested by and added for users of Intel Compiler. Having similar option in LLVM would make the two compilers more compatible and ease the transition of new customers to LLVM.
  2. This fixes an inconsistency in optimization; suppose a load operation was merged into another instruction (e.g., load and add becomes `add [memop]'). If a misaligned pointer is passed to the two-instruction sequence, it will raise an exception. If the same pointer is passed to the memop instruction, it will work. Thus, the behavior of misalignment depends upon what optimization levels and passes are applied, and small source changes could cause issues to appear and disappear. It's better for the user to consistently use unaligned load/store to improve the debug experience.
  3. Makes good use of HW that is capable of handling misaligned data gracefully. It is not necessarily a bug in users code but a third-part library. For example it would allow using a library built in old ages where stack alignment was 4-byte only.

If you still think this can hinder the raise of a desired exception for a mis-aligned access (I'd argue that "going slower" is better than "raising exception"), then let's consider adding this as an option that is OFF by default.
This would give the most flexibility to everyone.

I think I wouldn't mind if we just didn't emit aligned loads/store instructions for AVX/AVX512 from isel and other places in the compiler in the first place. As noted, if the load gets folded the alignment check doesn't happen. That would reduce the size of the isel tables and remove branches, reducing complexity of the compiler. Adding a new step and a command line to undo the earlier decision increases complexity.

The counter argument to that is that the alignment check has found bugs in the vectorizer on more than one occasion that I know of.

Sergey, remind me, does icc always emit unaligned loads/stores? Is there any option to control it?

Sergey, remind me, does icc always emit unaligned loads/stores? Is there any option to control it?

There was a control to emit aligned opcodes, yes, but I don't think anyone ever used it.

I think I wouldn't mind if we just didn't emit aligned loads/store instructions for AVX/AVX512 from isel and other places in the compiler in the first place. As noted, if the load gets folded the alignment check doesn't happen. That would reduce the size of the isel tables and remove branches, reducing complexity of the compiler. Adding a new step and a command line to undo the earlier decision increases complexity.

The counter argument to that is that the alignment check has found bugs in the vectorizer on more than one occasion that I know of.

Can I understand that if we implement it in isel, you will no longer oppose this patch?

lebedev.ri added a comment.EditedMon, Apr 12, 4:43 AM

I'm still uncomfortable with changing current status quo, even though i obviously don't get to cast the final vote here.

One should not use aligned loads in hope that they will cause an exception to detect address misalignment.
That's UBSan's job. -fsanitize=undefined/-fsanitize=aligment *should* catch it.
If it does not do so in your particular case, please file a bug, i would like to take a look.

Likewise, i don't think one should do overaligned loads and hope that they will just work.
UB is UB. The code will still be miscompiled, but you've just hidden your warning.

Likewise, even if unaligned loads can be always used, i would personally find it pretty surprising
to suddenly see unaliged loads instead of aligned ones.
Also, isn't that only possible/so when AVX is available?
Also, doesn't that cause compiler lock-in?
What happens without AVX? Do so anyways at the perfomance's cost?
Or back to exceptions?

Should this process in any form other than the UBSan changes,
i would like to first see a RFC on llvm-dev.
Sorry about being uneasy about this. :S

LiuChen3 updated this revision to Diff 337057.Tue, Apr 13, 12:46 AM
  1. Rebase;
  2. Emit unaligned move in ISEL;
  3. Only do the conversion on AVX machine.

I am still working on fast-isel.

I'm still uncomfortable with changing current status quo, even though i obviously don't get to cast the final vote here.

One should not use aligned loads in hope that they will cause an exception to detect address misalignment.
That's UBSan's job. -fsanitize=undefined/-fsanitize=aligment *should* catch it.
If it does not do so in your particular case, please file a bug, i would like to take a look.

Likewise, i don't think one should do overaligned loads and hope that they will just work.
UB is UB. The code will still be miscompiled, but you've just hidden your warning.

Likewise, even if unaligned loads can be always used, i would personally find it pretty surprising
to suddenly see unaliged loads instead of aligned ones.
Also, isn't that only possible/so when AVX is available?
Also, doesn't that cause compiler lock-in?
What happens without AVX? Do so anyways at the perfomance's cost?
Or back to exceptions?

Should this process in any form other than the UBSan changes,
i would like to first see a RFC on llvm-dev.
Sorry about being uneasy about this. :S

We are happy to hear your voice. We will discuss this on llvm-dev later after confirming Craigs's opinion.

I think I wouldn't mind if we just didn't emit aligned loads/store instructions for AVX/AVX512 from isel and other places in the compiler in the first place. As noted, if the load gets folded the alignment check doesn't happen. That would reduce the size of the isel tables and remove branches, reducing complexity of the compiler. Adding a new step and a command line to undo the earlier decision increases complexity.

The counter argument to that is that the alignment check has found bugs in the vectorizer on more than one occasion that I know of.

Hi, @craig.topper. I'm not sure if I understand what you mean correctly. Do you mean we can remove the alignload/alignstore pattern match so that we can reduce the size of the isel tables? But this means that there is no option to control this behavior.

craig.topper added inline comments.Tue, Apr 13, 1:00 AM
clang/include/clang/Driver/Options.td
1649

This makes it sound like unaligned moves would never be used even if it unaligned.

llvm/lib/Target/X86/X86InstrFragmentsSIMD.td
834

Won’t this make this return true for any load when only SSE is enabled. So SSE will use an aligned load instruction for an unaligned address.

LiuChen3 updated this revision to Diff 337064.Tue, Apr 13, 1:36 AM

Address Craig's comments

craig.topper added inline comments.Tue, Apr 13, 2:58 PM
clang/lib/Driver/ToolChains/CommonArgs.cpp
1723 ↗(On Diff #337064)

Could you just put this in target features in the IR and not have to deal with LTO specially? Similar to what we do with retpoline, speculative loading hardening, etc.

craig.topper added inline comments.Tue, Apr 13, 3:00 PM
clang/include/clang/Driver/Options.td
1649

From the user's perspective you're not transforming instructions. The instructions don't exist before the compiler runs.

LiuChen3 updated this revision to Diff 337333.Tue, Apr 13, 10:11 PM

Address Craig's comments

LiuChen3 edited the summary of this revision. (Show Details)Tue, Apr 13, 10:13 PM
lebedev.ri requested changes to this revision.Thu, Apr 15, 12:15 AM
This revision now requires changes to proceed.Thu, Apr 15, 12:15 AM