This is an archive of the discontinued LLVM Phabricator instance.

[X86] Support replacing aligned vector moves with unaligned moves when avx is enabled. (off by default)
Abandoned · Public

Authored by LuoYuanke on Sep 28 2020, 1:35 AM.

Details

Summary
With AVX, the performance of aligned and unaligned vector moves on X86 is the
same if the address is aligned. However, if the address is not aligned, an
aligned vector move raises an exception while an unaligned vector move can
still run. To be conservative, the LLVM option "x86-enable-unaligned-vector-move"
is added to enable this preference; it is off by default.

Change-Id: I85ab9749013d7e1abb237e03bc22eeacfd37836a
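
For illustration, a minimal sketch of the difference (not taken from this patch; the exact instructions chosen depend on the target and optimization level):

#include <immintrin.h>

// Dereferencing a __m128* promises 16-byte alignment, so codegen may emit an
// aligned move such as vmovaps, which faults on a misaligned address.
__m128 load_via_cast(const float *p) {
  return *(const __m128 *)p;
}

// _mm_loadu_ps makes no alignment promise and lowers to an unaligned move
// such as vmovups, which accepts any address.
__m128 load_unaligned(const float *p) {
  return _mm_loadu_ps(p);
}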

Diff Detail

Event Timeline

LuoYuanke created this revision.Sep 28 2020, 1:35 AM
Herald added a project: Restricted Project. · View Herald Transcript · Sep 28 2020, 1:35 AM
LuoYuanke requested review of this revision.Sep 28 2020, 1:35 AM

What issue is this fixing?

However, if the address is not aligned, movaps raises an exception while movups can still run.

That sounds like either a miscompile happened along the way, or the original source code had UB to begin with.

It can avoid a segmentation fault when an unaligned pointer is cast:

#include <immintrin.h>

extern __m128 value;

void add(void* pointer) {
    value = _mm_add_ps(value,*(__m128*)pointer);
}
lebedev.ri requested changes to this revision.Sep 28 2020, 2:00 AM

That is undefined behaviour:
https://godbolt.org/z/xdWKje

This revision now requires changes to proceed.Sep 28 2020, 2:00 AM
RKSimon added a subscriber: RKSimon.

As @lebedev.ri has said, I don't think this is a good idea, and if it's happening it sounds like you have an underlying bug in your code that the sanitizers would probably help you with.

Ignoring the motivating code here for a second: using aligned load/store instructions with AVX is a little weird. If we fold the load into an arithmetic op, with AVX it doesn't have to be aligned, so using the folded instruction suppresses the fault check. This is different from SSE, where we can only fold aligned loads, except on AMD CPUs. So AVX provides an inconsistent faulting experience. But if we were going to change that, I'd hunt down all the places we check the alignment and remove them, and take the code size reduction in the compiler instead of adding a new pass.
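
A small sketch of the folding behaviour described above (the exact instructions are illustrative and depend on the target and optimization level):

#include <immintrin.h>

__m128 value;

// With SSE2 the load below is typically folded into 'addps xmm, [mem]',
// which requires a 16-byte aligned address and faults otherwise.
// With AVX the folded form 'vaddps xmm, xmm, [mem]' does not check
// alignment, so a fault only occurs if a standalone vmovaps is emitted.
void add(const __m128 *pointer) {
  value = _mm_add_ps(value, *pointer);
}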

If the address is random, it may happen to be aligned during validation, so the sanitizer tools don't notice it, but in real runs it crashes randomly. There is no harm in replacing movaps with movups, and it can avoid some crashes. Is it doable to add an option to let the user choose movups or movaps?

If you change void* to char*, you get a clang diagnostic:

<source>:4:38: warning: cast from 'char *' to '__m128 *' increases required alignment from 1 to 16 [-Wcast-align]
    __m128 value = _mm_add_ps(value,*(__m128*)pointer);
                                     ^~~~~~~~~~~~~~~~

So this should in principle not require sanitizers.
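
A minimal reproducer for that diagnostic, assuming the code is compiled with -Wcast-align (this variant of the example is an assumption, not the exact godbolt source):

#include <immintrin.h>

extern __m128 value;

void add(char *pointer) {
  // warning: cast from 'char *' to '__m128 *' increases required alignment
  // from 1 to 16 [-Wcast-align]
  value = _mm_add_ps(value, *(__m128 *)pointer);
}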

I didn't get the diagnostic at https://godbolt.org/z/8aGhd5. Another example may look like this: a float array is packed in a struct.

#include <immintrin.h>

__m128 value;

typedef struct _data_str {
    int header;
    float src[400];
} data_t;

data_t data;

void add(__m128* pointer) {
    value = _mm_add_ps(value, *pointer);
}

void foo() {
    for (int i = 0; i < 400; i += 4)
    add((__m128*)(&data.src[i]));
}
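
For context: on x86, src starts at offset 4 in the struct (right after the int header), so whenever data itself is 16-byte aligned, the pointer passed to add() for the indices used in foo() is 4 bytes past a 16-byte boundary. A quick standalone check (a hypothetical helper, not part of the example above):

#include <stddef.h>
#include <stdio.h>

typedef struct _data_str {
  int header;
  float src[400];
} data_t;

int main(void) {
  // On x86, offsetof(data_t, src) is 4, so &data.src[i] cannot be 16-byte
  // aligned for i = 0, 4, 8, ... if data itself is 16-byte aligned.
  printf("offsetof(data_t, src) = %zu\n", offsetof(data_t, src));
  return 0;
}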

Compiling for SSE, this code will likely use the memory form of addps, which will fault on the misalignment. I know this patch only targets AVX.

I don’t think you can motivate this change by showing what code you want to accept if the code would crash when compiled with the default SSE2 target.

Yes, this patch only targets AVX.

LuoYuanke added inline comments.Sep 30 2020, 1:26 AM
llvm/lib/Target/X86/X86MovapsToMovups.cpp
70 ↗(On Diff #294623)

We only target AVX.

Note that even if x86 codegen always emits unaligned ops (which will cause new questions/bug reports),
the original IR will still contain UB, and it will only be a question of time until that causes some other 'miscompile'.
I really think this should be approached from the front-end diagnostics side.

Sorry, what does 'UB' mean? Why would it cause a 'miscompile'? The compiler still thinks the address is aligned. Selecting movups doesn't break the compiler's assumption. Is there any reason movaps is better than movups? To detect the alignment exception?

Sorry, what does 'UB' mean?

undefined behavior

Why would it cause a 'miscompile'? The compiler still thinks the address is aligned.

That is very precisely my point.

Selecting movups doesn't break the compiler's assumption. Is there any reason movaps is better than movups? To detect the alignment exception?

Why do we need to detect the alignment exception? It is just like an assert; it can be done in debug mode. So can we select movaps in a debug build and movups in a non-debug build?

I never said that. I'm only saying that the original source code has undefined behavior,
and even if you mask it with this patch, it will most likely manifest in some other way later on.

LuoYuanke updated this revision to Diff 296033.Oct 4 2020, 2:44 AM

Add the LLVM option "-enable-x86-movaps-to-movups" to enable the movups preference.
The option is false by default.

RKSimon added inline comments.Oct 7 2020, 2:37 AM
llvm/lib/Target/X86/CMakeLists.txt
54

sorting

llvm/lib/Target/X86/X86MovapsToMovups.cpp
1 ↗(On Diff #296040)

Very minor issue, but this isn't just movups; how about X86UnalignedVectorMoves.cpp?

13 ↗(On Diff #296040)

"So unaligned load/stores may be preferred if hardware exceptions can't be trusted."?

LuoYuanke updated this revision to Diff 296662.Oct 7 2020, 6:43 AM

Rebase and address Simon's comments.

LuoYuanke marked 2 inline comments as done.Oct 7 2020, 7:00 AM
LuoYuanke added inline comments.
llvm/lib/Target/X86/X86MovapsToMovups.cpp
13 ↗(On Diff #296040)

Sometimes users prefer an inefficient unaligned load/store to a hardware exception when the address is unaligned. We provide the opportunity for the user to choose the behavior.

Alignment information isn't just used to select aligned or unaligned instructions; it's also used by alias analysis, for example. If the compiler is being told an incorrect alignment, it could cause incorrect optimization in other parts of the compiler.

(That's what i've been saying all this time...)

craig.topper added inline comments.Oct 7 2020, 8:07 AM
llvm/lib/Target/X86/X86MovapsToMovups.cpp
1 ↗(On Diff #296040)

But the patch only looks at movaps opcodes. It should look at all move opcodes.

38 ↗(On Diff #296040)

This option should start with x86-, and again, it shouldn't just be movaps. It also needs to handle movapd, movdqa, movdqa64, and movdqa32.
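
A sketch of what the renamed option might look like (the variable name and description text are illustrative; the flag string is the one from the final summary):

#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Off by default; only replaces aligned vector moves when the user opts in.
static cl::opt<bool> X86EnableUnalignedVectorMove(
    "x86-enable-unaligned-vector-move",
    cl::desc("Replace aligned vector moves with unaligned moves when AVX is "
             "enabled"),
    cl::init(false), cl::Hidden);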

LuoYuanke updated this revision to Diff 297380.Oct 9 2020, 7:54 PM

Addressed Craig's comments. Added transform for movapd and movdq.

LuoYuanke retitled this revision from [X86] Replace movaps with movups when avx is enabled. to [X86] Replace aligned vector move with unaligned move when avx is enabled..
LuoYuanke edited the summary of this revision. (Show Details)
pengfei added inline comments.Oct 9 2020, 8:14 PM
llvm/lib/Target/X86/X86UnalignedVectorMoves.cpp
11

Change the comments as well.

92

Change | to ||

LuoYuanke updated this revision to Diff 297381.Oct 9 2020, 8:26 PM
LuoYuanke retitled this revision from [X86] Replace aligned vector move with unaligned move when avx is enabled. to [X86] Replace aligned vector move with unaligned move when avx is enabled..
LuoYuanke edited the summary of this revision. (Show Details)

Address Pengfei's comments.

LuoYuanke marked 2 inline comments as done.Oct 9 2020, 8:27 PM
craig.topper added inline comments.Oct 9 2020, 8:31 PM
llvm/lib/Target/X86/X86UnalignedVectorMoves.cpp
94

Why do we need 3 separate functions?

craig.topper added inline comments.Oct 9 2020, 8:33 PM
llvm/lib/Target/X86/X86UnalignedVectorMoves.cpp
229

Don't include non-VEX/EVEX opcodes. Those aren't used when AVX is enabled.

LuoYuanke added inline comments.Oct 9 2020, 8:34 PM
llvm/lib/Target/X86/X86UnalignedVectorMoves.cpp
94

Separating them into 3 functions looks clearer to me. I can merge them into 1 switch statement and add 3 comments to the code. Do you prefer merging them?

craig.topper added inline comments.Oct 9 2020, 8:41 PM
llvm/lib/Target/X86/X86UnalignedVectorMoves.cpp
94

I'd prefer one function. And if you can get all 3 parts of each case onto one line without exceeding 80 columns, I'd prefer that:

case X86::VMOVDQA32Z128mr: NewOpc = X86::VMOVDQU32Z128mr; break;

Align the start of NewOpc on every line, and the same for the break. See, for example, the nested switch in X86InstrInfo::optimizeCompareInstr.
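
A minimal sketch of a single switch in that style, as it might appear in the pass (the opcode list is abbreviated; names beyond the one case quoted above follow the usual X86 backend opcode naming and should be treated as assumptions):

static unsigned getUnalignedOpcode(unsigned Opc) {
  unsigned NewOpc = 0; // 0 means "not an aligned vector move"
  switch (Opc) {
  default: break;
  case X86::VMOVAPSrm:       NewOpc = X86::VMOVUPSrm;       break;
  case X86::VMOVAPSYrm:      NewOpc = X86::VMOVUPSYrm;      break;
  case X86::VMOVAPDrm:       NewOpc = X86::VMOVUPDrm;       break;
  case X86::VMOVDQArm:       NewOpc = X86::VMOVDQUrm;       break;
  case X86::VMOVDQA32Z128mr: NewOpc = X86::VMOVDQU32Z128mr; break;
  }
  return NewOpc;
}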

LuoYuanke updated this revision to Diff 297385.Oct 9 2020, 9:36 PM

Address Craig's comments.

LuoYuanke marked 2 inline comments as done.Oct 9 2020, 9:37 PM
pengfei added inline comments.Oct 9 2020, 10:05 PM
llvm/lib/Target/X86/X86UnalignedVectorMoves.cpp
108

Could you align all the NewOpc assignments and breaks?

lebedev.ri retitled this revision from [X86] Replace aligned vector move with unaligned move when avx is enabled. to [X86] Support replacing aligned vector moves with unaligned moves when avx is enabled. (off by default).Oct 9 2020, 11:16 PM
LuoYuanke updated this revision to Diff 297401.Oct 10 2020, 2:34 AM

Address Pengfei's comments.

LuoYuanke marked an inline comment as done.Oct 10 2020, 2:35 AM

LGTM. Thanks!
Since the pass is turned off by default, I think we can let it in. @lebedev.ri, what's your opinion?

lebedev.ri added a comment.EditedOct 10 2020, 2:59 AM

I still retain my original opinion that this is trying to paper over broken source code,
and incorrectly so, because even if the backend doesn't make use of the alignment information
that was lowered from the source code into IR, the IR will still contain incorrect alignment
information, and it is only a matter of time until that UB manifests in some other way.

As i see it, there are 5 options:

  1. Don't manually vectorize the code
  2. Use UBSan to catch these issues
  3. Enhance clang/clang-tidy to better catch these issues
  4. Don't do aligned loads https://godbolt.org/z/38jrvE
  5. Add a clang (!) switch to make __m128 unaligned

I strongly suggest that option 4 be taken.
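
A minimal sketch of what option 4 means for the very first example in this thread, assuming _mm_loadu_ps as the unaligned load (the linked godbolt may differ in detail):

#include <immintrin.h>

extern __m128 value;

void add(const void *pointer) {
  // _mm_loadu_ps makes no alignment claim, so this is well defined for any
  // pointer that refers to 16 readable bytes of float data.
  value = _mm_add_ps(value, _mm_loadu_ps((const float *)pointer));
}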

I think it is friendly for the compiler to provide an opportunity for the user to decide whether they prefer aligned or unaligned loads. As far as I know, some processors also have a control register to raise or suppress exceptions on unaligned memory accesses. Leaving the decision to the user doesn't harm any existing behavior. For X86 we have a choice between the aligned and unaligned instructions; what if some processor only has instructions that don't raise exceptions on unaligned memory accesses?

lebedev.ri resigned from this revision.Oct 10 2020, 4:17 AM

I think we are talking past each other.
Do you agree that even with this patch, the LLVM IR will still contain an incorrect alignment on loads (align 16, https://godbolt.org/z/4d8xM3)?
Do you agree that it is undefined behaviour?
Do you agree that by only hiding that fact in the backend, the middle-end optimization pipeline is still free to make use of that incorrect information to miscompile the code?
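
For reference, a sketch of what the middle end sees for the first example in this thread (the IR in the comment is abridged and approximate; the godbolt link above shows the full output):

#include <immintrin.h>

extern __m128 value;

// Clang lowers the dereference roughly to
//   %v = load <4 x float>, <4 x float>* %pointer, align 16
// and that 'align 16' remains visible to every middle-end pass, no matter
// whether the backend later prints vmovaps or vmovups.
void add(void *pointer) {
  value = _mm_add_ps(value, *(__m128 *)pointer);
}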

LuoYuanke abandoned this revision.Nov 19 2020, 5:52 PM

I am abandoning this patch since we didn't reach a consensus.