This is an archive of the discontinued LLVM Phabricator instance.

Differential D99565

[X86] Support replacing aligned vector moves with unaligned moves when avx is enabled.
Abandoned · Public

Authored by LuoYuanke on Mar 29 2021, 11:52 PM.

Details

Reviewers

pengfei
craig.topper
kbsmith1
smaslov
RKSimon
LiuChen3
lebedev.ri

Summary

With AVX, aligned and unaligned vector moves on X86 perform the same when the
address is aligned. In that case we prefer the unaligned move, because it avoids
the run-time exceptions that aligned moves raise on misaligned addresses.
"-muse-unaligned-vector-move" and "-mno-use-unaligned-vector-move" are added to
control this preference. It is disabled by default.

This patch is a replacement of D88396.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

LiuChen3 created this revision. Mar 29 2021, 11:52 PM

Herald added subscribers: jansvoboda11, dang, pengfei, hiraditya. Mar 29 2021, 11:52 PM

LiuChen3 requested review of this revision. Mar 29 2021, 11:52 PM

Herald added projects: Restricted Project, Restricted Project. Mar 29 2021, 11:52 PM

Herald added subscribers: llvm-commits, cfe-commits.

LiuChen3 added reviewers: pengfei, LuoYuanke, craig.topper, kbsmith1. Mar 29 2021, 11:55 PM

LuoYuanke added reviewers: smaslov, lebedev.ri. Mar 30 2021, 12:50 AM

LuoYuanke added a reviewer: RKSimon.

craig.topper added inline comments. Mar 30 2021, 12:56 AM

clang/include/clang/Driver/Options.td
1646	As far as the user is concerned this isn’t a transform. From their perspective it’s "always use unaligned move instructions".
llvm/test/CodeGen/X86/avx512vl-unaligned-load-store.ll
7	CHECK isn’t a valid prefix for this file
22	What are the tests with align 1 intended to show?

Harbormaster completed remote builds in B96251: Diff 334059. Mar 30 2021, 12:57 AM

The only use I could really see for this is to prevent a developer’s code from crashing when it’s distributed to someone else: paranoia, because it’s possible you have a bug and got lucky with alignment in your internal testing before you shipped.

If you need this your code has undefined behavior which should be fixed. You should not use this to make a known runtime exception go away.

Your code could still be miscompiled. For example, llvm really likes to replace ADD with OR when the bits don’t overlap. So it would be very easy to have your pointer arithmetic miscompiled because llvm believes a pointer is aligned but really isn’t.

Your code would not be portable to SSE. If your application does dynamic dispatch most of your users may get the AVX code path, but the smaller percentage on older hardware or cheaper hardware that doesn’t have AVX still get the exceptions.

I think you need to be very careful with how this feature is communicated.
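
The ADD-to-OR point above can be illustrated with plain integer arithmetic. The sketch below is a minimal model, not LLVM code: the function names are hypothetical and addresses are treated as 64-bit integers. It shows that `base | offset` equals `base + offset` only while the two operands share no set bits, which is what an alignment assumption on `base` guarantees and what a misaligned pointer violates.

```cpp
#include <cstdint>

// Address computed directly: base plus a byte offset.
constexpr std::uint64_t addr_with_add(std::uint64_t base, std::uint64_t offset) {
  return base + offset;
}

// Address as an optimizer may rewrite it once it believes the low bits of
// `base` are zero (i.e. `base` is sufficiently aligned): ADD becomes OR.
constexpr std::uint64_t addr_with_or(std::uint64_t base, std::uint64_t offset) {
  return base | offset;
}

// The rewrite is only valid when no bit is set in both operands.
constexpr bool rewrite_is_safe(std::uint64_t base, std::uint64_t offset) {
  return (base & offset) == 0;
}

// 16-byte aligned base: the low four bits are zero, so OR and ADD agree.
static_assert(rewrite_is_safe(0x1000, 8), "aligned base shares no bits with 8");
static_assert(addr_with_or(0x1000, 8) == addr_with_add(0x1000, 8), "same address");

// Misaligned base (bit 3 already set): OR silently drops the offset.
static_assert(!rewrite_is_safe(0x1008, 8), "bit 3 overlaps");
static_assert(addr_with_add(0x1008, 8) == 0x1010, "intended address");
static_assert(addr_with_or(0x1008, 8) == 0x1008, "address after the rewrite");
```

In this model, hiding the alignment exception does not make the access correct: with a misaligned base of 0x1008 the rewritten form addresses 0x1008 rather than 0x1010.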

LiuChen3 added inline comments. Mar 30 2021, 1:34 AM

clang/include/clang/Driver/Options.td
1646	How about "Always emit unaligned move instructions."? Do you have any suggestions here?
llvm/lib/Target/X86/X86MCInstLower.cpp
2708 ↗	(On Diff #334059)	Forgot to check the SSE level. I will add it in the next patch.
llvm/test/CodeGen/X86/avx512vl-unaligned-load-store.ll
7	Thanks for the reminder. I forgot to delete these; I will remove them in the next patch.
22	This is to distinguish an unaligned move converted from an aligned move from an originally unaligned move. See the difference between line 12 and line 28.

In D99565#2657813, @craig.topper wrote:

The only use I could really see for this is to prevent a developer’s code from crashing when it’s distributed to someone else: paranoia, because it’s possible you have a bug and got lucky with alignment in your internal testing before you shipped.

If you need this your code has undefined behavior which should be fixed. You should not use this to make a known runtime exception go away.

Your code could still be miscompiled. For example, llvm really likes to replace ADD with OR when the bits don’t overlap. So it would be very easy to have your pointer arithmetic miscompiled because llvm believes a pointer is aligned but really isn’t.

Your code would not be portable to SSE. If your application does dynamic dispatch most of your users may get the AVX code path, but the smaller percentage on older hardware or cheaper hardware that doesn’t have AVX still get the exceptions.

+ your code won't be portable to other compilers.

I think you need to be very careful with how this feature is communicated.

+1, I have already written all that in the previous revision.
I really don't think this should go in.

I really don't think this should go in.

Here are more arguments for why I think this is a useful option, in no particular order:

This was requested by and added for users of the Intel Compiler. Having a similar option in LLVM would make the two compilers more compatible and ease the transition of new customers to LLVM.
This fixes an inconsistency in optimization. Suppose a load operation was merged into another instruction (e.g., load and add becomes `add [memop]`). If a misaligned pointer is passed to the two-instruction sequence, it will raise an exception. If the same pointer is passed to the memop instruction, it will work. Thus, the behavior on misalignment depends on which optimization levels and passes are applied, and small source changes can cause issues to appear and disappear. It's better for the user to consistently use unaligned load/store to improve the debug experience.
It makes good use of HW that is capable of handling misaligned data gracefully. The bug is not necessarily in the user's code; it may be in a third-party library. For example, it would allow using a library built long ago, when stack alignment was only 4 bytes.

If you still think this can suppress a desired exception for a misaligned access (I'd argue that "going slower" is better than "raising an exception"), then let's consider adding this as an option that is OFF by default.
This would give the most flexibility to everyone.
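
The inconsistency described above (a separate aligned load faults, the same load folded into an arithmetic instruction does not) can be written down as a small truth table. This is a simplified sketch with hypothetical names, not compiler code: it ignores non-temporal moves and other explicitly aligned instructions, and it assumes the usual x86 rule that VEX-encoded memory operands carry no alignment requirement while legacy SSE memory operands do.

```cpp
#include <cstdint>

enum class Encoding { LegacySSE, VEX };

enum class Access {
  AlignedMove,      // movaps / vmovaps style
  UnalignedMove,    // movups / vmovups style
  FoldedMemOperand  // load folded into e.g. an add
};

// Returns true when the modelled access raises an alignment exception.
constexpr bool faults(Encoding enc, Access acc, std::uint64_t addr,
                      std::uint64_t width_bytes) {
  return (addr % width_bytes != 0) &&
         (acc == Access::AlignedMove ||
          (acc == Access::FoldedMemOperand && enc == Encoding::LegacySSE));
}

// With AVX, the same misaligned pointer faults or not depending on folding.
static_assert(faults(Encoding::VEX, Access::AlignedMove, 0x1004, 16), "");
static_assert(!faults(Encoding::VEX, Access::FoldedMemOperand, 0x1004, 16), "");

// Without AVX, the folded form still faults, so the code is not portable.
static_assert(faults(Encoding::LegacySSE, Access::FoldedMemOperand, 0x1004, 16), "");

// Unaligned moves never fault in this model; aligned addresses never fault.
static_assert(!faults(Encoding::VEX, Access::UnalignedMove, 0x1004, 16), "");
static_assert(!faults(Encoding::LegacySSE, Access::AlignedMove, 0x1000, 16), "");
```

The third assertion is the portability concern raised earlier in the thread: on an SSE-only code path the exception comes back regardless of how the load was selected.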

I think I wouldn't mind if we just didn't emit aligned loads/store instructions for AVX/AVX512 from isel and other places in the compiler in the first place. As noted, if the load gets folded the alignment check doesn't happen. That would reduce the size of the isel tables and remove branches, reducing complexity of the compiler. Adding a new step and a command line to undo the earlier decision increases complexity.

The counter argument to that is that the alignment check has found bugs in the vectorizer on more than one occasion that I know of.

Sergey, remind me, does icc always emit unaligned loads/stores? Is there any option to control it?

Sergey, remind me, does icc always emit unaligned loads/stores? Is there any option to control it?

There was a control to emit aligned opcodes, yes, but I don't think anyone ever used it.

In D99565#2678073, @craig.topper wrote:

I think I wouldn't mind if we just didn't emit aligned loads/store instructions for AVX/AVX512 from isel and other places in the compiler in the first place. As noted, if the load gets folded the alignment check doesn't happen. That would reduce the size of the isel tables and remove branches, reducing complexity of the compiler. Adding a new step and a command line to undo the earlier decision increases complexity.

The counter argument to that is that the alignment check has found bugs in the vectorizer on more than one occasion that I know of.

Do I understand correctly that if we implement it in isel, you will no longer oppose this patch?

I'm still uncomfortable with changing the status quo, even though I obviously don't get to cast the final vote here.

One should not use aligned loads in the hope that they will cause an exception to detect address misalignment.
That's UBSan's job. -fsanitize=undefined/-fsanitize=alignment *should* catch it.
If it does not do so in your particular case, please file a bug; I would like to take a look.

Likewise, I don't think one should do overaligned loads and hope that they will just work.
UB is UB. The code will still be miscompiled, but you've just hidden your warning.

Likewise, even if unaligned loads can always be used, I would personally find it pretty surprising
to suddenly see unaligned loads instead of aligned ones.
Also, isn't that only possible when AVX is available?
Also, doesn't that cause compiler lock-in?
What happens without AVX? Do so anyway at the cost of performance?
Or back to exceptions?

Should this proceed in any form other than the UBSan changes,
I would like to first see an RFC on llvm-dev.
Sorry about being uneasy about this. :S

Rebase;
Emit unaligned move in ISEL;
Only do the conversion on AVX machine.

I am still working on fast-isel.

In D99565#2682809, @lebedev.ri wrote:

I'm still uncomfortable with changing the status quo, even though I obviously don't get to cast the final vote here.

One should not use aligned loads in the hope that they will cause an exception to detect address misalignment.
That's UBSan's job. -fsanitize=undefined/-fsanitize=alignment *should* catch it.
If it does not do so in your particular case, please file a bug; I would like to take a look.

Likewise, I don't think one should do overaligned loads and hope that they will just work.
UB is UB. The code will still be miscompiled, but you've just hidden your warning.

Likewise, even if unaligned loads can always be used, I would personally find it pretty surprising
to suddenly see unaligned loads instead of aligned ones.
Also, isn't that only possible when AVX is available?
Also, doesn't that cause compiler lock-in?
What happens without AVX? Do so anyway at the cost of performance?
Or back to exceptions?

Should this proceed in any form other than the UBSan changes,
I would like to first see an RFC on llvm-dev.
Sorry about being uneasy about this. :S

We are happy to hear your voice. We will discuss this on llvm-dev later, after confirming Craig's opinion.

In D99565#2678073, @craig.topper wrote:

I think I wouldn't mind if we just didn't emit aligned loads/store instructions for AVX/AVX512 from isel and other places in the compiler in the first place. As noted, if the load gets folded the alignment check doesn't happen. That would reduce the size of the isel tables and remove branches, reducing complexity of the compiler. Adding a new step and a command line to undo the earlier decision increases complexity.

The counter argument to that is that the alignment check has found bugs in the vectorizer on more than one occasion that I know of.

Hi, @craig.topper. I'm not sure I understand you correctly. Do you mean we can remove the alignedload/alignedstore pattern matching so that we can reduce the size of the isel tables? But that would mean there is no option to control this behavior.

craig.topper added inline comments. Apr 13 2021, 1:00 AM

clang/include/clang/Driver/Options.td
1649	This makes it sound like unaligned moves would never be used even if the access is unaligned.
llvm/lib/Target/X86/X86InstrFragmentsSIMD.td
834	Won’t this make this return true for any load when only SSE is enabled? Then SSE will use an aligned load instruction for an unaligned address.

Harbormaster completed remote builds in B98431: Diff 337057. Apr 13 2021, 1:22 AM

Address Craig's comments

Harbormaster completed remote builds in B98437: Diff 337064. Apr 13 2021, 2:11 AM

craig.topper added inline comments. Apr 13 2021, 2:58 PM

clang/lib/Driver/ToolChains/CommonArgs.cpp
1723 ↗	(On Diff #337064)	Could you just put this in target features in the IR and not have to deal with LTO specially? Similar to what we do with retpoline, speculative load hardening, etc.

craig.topper added inline comments. Apr 13 2021, 3:00 PM

clang/include/clang/Driver/Options.td
1649	From the user's perspective you're not transforming instructions. The instructions don't exist before the compiler runs.

Address Craig's comments

LiuChen3 edited the summary of this revision. Apr 13 2021, 10:13 PM

Harbormaster completed remote builds in B98614: Diff 337333. Apr 14 2021, 12:33 AM

lebedev.ri requested changes to this revision. Apr 15 2021, 12:15 AM

This revision now requires changes to proceed. Apr 15 2021, 12:15 AM

Matt added a subscriber: Matt. Oct 2 2021, 6:08 AM

This review seems to be stuck/dead, consider abandoning if no longer relevant.

This revision now requires review to proceed. Jan 12 2023, 4:49 PM

Herald added a project: Restricted Project. Jan 12 2023, 4:49 PM

Herald added a subscriber: StephenFan.

In D99565#4049330, @lebedev.ri wrote:

This review seems to be stuck/dead, consider abandoning if no longer relevant.

@LiuChen3 has been inactive for a long time. How can I help to abandon the patch?

LuoYuanke abandoned this revision. Jan 12 2023, 4:55 PM

Revision Contents

Path                                                        Size
clang/include/clang/Driver/Options.td                       2 lines
clang/test/Driver/x86-target-features.c                     5 lines
llvm/include/llvm/Support/X86TargetParser.def               1 line
llvm/lib/Support/X86TargetParser.cpp                        1 line
llvm/lib/Target/X86/X86.td                                  6 lines
llvm/lib/Target/X86/X86InstrFragmentsSIMD.td                10 lines
llvm/lib/Target/X86/X86Subtarget.h                          11 lines
llvm/test/CodeGen/X86/avx-unaligned-load-store.ll           441 lines
llvm/test/CodeGen/X86/avx512-unaligned-load-store.ll        595 lines
llvm/test/CodeGen/X86/avx512vl-unaligned-load-store.ll      747 lines

Diff 337333

clang/include/clang/Driver/Options.td

[1,637 lines not shown]
 } // end -f[no-]sanitize* flags

 def funsafe_math_optimizations : Flag<["-"], "funsafe-math-optimizations">,
   Group<f_Group>;
 def fno_unsafe_math_optimizations : Flag<["-"], "fno-unsafe-math-optimizations">,
   Group<f_Group>;
 def fassociative_math : Flag<["-"], "fassociative-math">, Group<f_Group>;
 def fno_associative_math : Flag<["-"], "fno-associative-math">, Group<f_Group>;
 defm reciprocal_math : BoolFOption<"reciprocal-math",
   LangOpts<"AllowRecip">, DefaultFalse,
   PosFlag<SetTrue, [CC1Option], "Allow division operations to be reassociated",
           [menable_unsafe_fp_math.KeyPath]>,
   NegFlag<SetFalse>>;
 def fapprox_func : Flag<["-"], "fapprox-func">, Group<f_Group>, Flags<[CC1Option, NoDriverOption]>,
   MarshallingInfoFlag<LangOpts<"ApproxFunc">>, ImpliedByAnyOf<[menable_unsafe_fp_math.KeyPath]>;
 defm finite_math_only : BoolFOption<"finite-math-only",
   LangOpts<"FiniteMathOnly">, DefaultFalse,
   PosFlag<SetTrue, [CC1Option], "", [cl_finite_math_only.KeyPath, ffast_math.KeyPath]>,
   NegFlag<SetFalse>>;
 defm signed_zeros : BoolFOption<"signed-zeros",
[2,418 lines not shown]
 def msha : Flag<["-"], "msha">, Group<m_x86_Features_Group>;
 def mno_sha : Flag<["-"], "mno-sha">, Group<m_x86_Features_Group>;
 def mtbm : Flag<["-"], "mtbm">, Group<m_x86_Features_Group>;
 def mno_tbm : Flag<["-"], "mno-tbm">, Group<m_x86_Features_Group>;
 def mtsxldtrk : Flag<["-"], "mtsxldtrk">, Group<m_x86_Features_Group>;
 def mno_tsxldtrk : Flag<["-"], "mno-tsxldtrk">, Group<m_x86_Features_Group>;
 def muintr : Flag<["-"], "muintr">, Group<m_x86_Features_Group>;
 def mno_uintr : Flag<["-"], "mno-uintr">, Group<m_x86_Features_Group>;
+def munalignedvecmove : Flag<["-"], "muse-unaligned-vector-move">, Group<m_x86_Features_Group>;
+def mno_unalignedvecmove : Flag<["-"], "mno-use-unaligned-vector-move">, Group<m_x86_Features_Group>;
 def mvaes : Flag<["-"], "mvaes">, Group<m_x86_Features_Group>;
 def mno_vaes : Flag<["-"], "mno-vaes">, Group<m_x86_Features_Group>;
 def mvpclmulqdq : Flag<["-"], "mvpclmulqdq">, Group<m_x86_Features_Group>;
 def mno_vpclmulqdq : Flag<["-"], "mno-vpclmulqdq">, Group<m_x86_Features_Group>;
 def mwaitpkg : Flag<["-"], "mwaitpkg">, Group<m_x86_Features_Group>;
 def mno_waitpkg : Flag<["-"], "mno-waitpkg">, Group<m_x86_Features_Group>;
 def mxop : Flag<["-"], "mxop">, Group<m_x86_Features_Group>;
 def mno_xop : Flag<["-"], "mno-xop">, Group<m_x86_Features_Group>;
[2,045 lines not shown]

clang/test/Driver/x86-target-features.c

[287 lines not shown]
 // RUN: %clang -target i386-unknown-linux-gnu -march=i386 -mno-uintr %s -### -o %t.o 2>&1 | FileCheck -check-prefix=NO-UINTR %s
 // UINTR: "-target-feature" "+uintr"
 // NO-UINTR: "-target-feature" "-uintr"

 // RUN: %clang -target i386-unknown-linux-gnu -march=i386 -mavxvnni %s -### -o %t.o 2>&1 | FileCheck --check-prefix=AVX-VNNI %s
 // RUN: %clang -target i386-unknown-linux-gnu -march=i386 -mno-avxvnni %s -### -o %t.o 2>&1 | FileCheck --check-prefix=NO-AVX-VNNI %s
 // AVX-VNNI: "-target-feature" "+avxvnni"
 // NO-AVX-VNNI: "-target-feature" "-avxvnni"
+
+// RUN: %clang -target i386-linux-gnu -muse-unaligned-vector-move %s -### -o %t.o 2>&1 | FileCheck -check-prefix=UNALIGNEDVECMOVE %s
+// RUN: %clang -target i386-linux-gnu -mno-use-unaligned-vector-move %s -### -o %t.o 2>&1 | FileCheck -check-prefix=NO-UNALIGNEDVECMOVE %s
+// UNALIGNEDVECMOVE: "-target-feature" "+use-unaligned-vector-move"
+// NO-UNALIGNEDVECMOVE: "-target-feature" "-use-unaligned-vector-move"
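
The added test lines check that each driver flag becomes a `-target-feature` string. A rough sketch of that mapping follows, assuming it uses the same convention as the neighbouring `-muintr`/`-mno-uintr` checks: strip `-m`, then turn an optional `no-` prefix into `-` and anything else into `+`. `targetFeatureFor` is a hypothetical helper written for illustration, not a clang function.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: derive the -target-feature string for an x86
// feature flag such as "-muintr" or "-mno-uintr".
std::string targetFeatureFor(const std::string &flag) {
  assert(flag.compare(0, 2, "-m") == 0 && "expected an -m<feature> flag");
  std::string name = flag.substr(2);                      // drop "-m"
  const bool negative = name.compare(0, 3, "no-") == 0;   // "-mno-<feature>"
  if (negative)
    name.erase(0, 3);                                     // drop "no-"
  return (negative ? "-" : "+") + name;
}
```

Under that assumption, `targetFeatureFor("-muse-unaligned-vector-move")` yields `+use-unaligned-vector-move` and the `-mno-` form yields `-use-unaligned-vector-move`, matching the two CHECK lines.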

llvm/include/llvm/Support/X86TargetParser.def

[194 lines not shown]
 X86_FEATURE (HRESET, "hreset")
 X86_FEATURE (AVXVNNI, "avxvnni")
 // These features aren't really CPU features, but the frontend can set them.
 X86_FEATURE (RETPOLINE_EXTERNAL_THUNK, "retpoline-external-thunk")
 X86_FEATURE (RETPOLINE_INDIRECT_BRANCHES, "retpoline-indirect-branches")
 X86_FEATURE (RETPOLINE_INDIRECT_CALLS, "retpoline-indirect-calls")
 X86_FEATURE (LVI_CFI, "lvi-cfi")
 X86_FEATURE (LVI_LOAD_HARDENING, "lvi-load-hardening")
+X86_FEATURE (UNALIGNED_VECTOR_MOVE, "use-unaligned-vector-move")
 #undef X86_FEATURE_COMPAT
 #undef X86_FEATURE

llvm/lib/Support/X86TargetParser.cpp

[504 lines not shown]

 // Not really CPU features, but need to be in the table because clang uses
 // target features to communicate them to the backend.
 constexpr FeatureBitset ImpliedFeaturesRETPOLINE_EXTERNAL_THUNK = {};
 constexpr FeatureBitset ImpliedFeaturesRETPOLINE_INDIRECT_BRANCHES = {};
 constexpr FeatureBitset ImpliedFeaturesRETPOLINE_INDIRECT_CALLS = {};
 constexpr FeatureBitset ImpliedFeaturesLVI_CFI = {};
 constexpr FeatureBitset ImpliedFeaturesLVI_LOAD_HARDENING = {};
+constexpr FeatureBitset ImpliedFeaturesUNALIGNED_VECTOR_MOVE = {};

 // XSAVE features are dependent on basic XSAVE.
 constexpr FeatureBitset ImpliedFeaturesXSAVEC = FeatureXSAVE;
 constexpr FeatureBitset ImpliedFeaturesXSAVEOPT = FeatureXSAVE;
 constexpr FeatureBitset ImpliedFeaturesXSAVES = FeatureXSAVE;

 // MMX->3DNOW->3DNOWA chain.
 constexpr FeatureBitset ImpliedFeaturesMMX = {};
[142 lines not shown]

llvm/lib/Target/X86/X86.td

[516 lines not shown]
 def FeatureUseGLMDivSqrtCosts
     : SubtargetFeature<"use-glm-div-sqrt-costs", "UseGLMDivSqrtCosts", "true",
                        "Use Goldmont specific floating point div/sqrt costs">;

 // Enable use of alias analysis during code generation.
 def FeatureUseAA : SubtargetFeature<"use-aa", "UseAA", "true",
                                     "Use alias analysis during codegen">;

+/// Always emit unaligned move instructions on AVX machine.
+def FeatureUnalignedVecMove : SubtargetFeature<"use-unaligned-vector-move",
+                                "UseUnalignedVectorMove", "true",
+                                "Always emit unaligned vector move instructions "
+                                "on AVX machine.">;
+
 // Bonnell
 def ProcIntelAtom : SubtargetFeature<"", "X86ProcFamily", "IntelAtom", "">;
 // Silvermont
 def ProcIntelSLM : SubtargetFeature<"", "X86ProcFamily", "IntelSLM", "">;

 //===----------------------------------------------------------------------===//
 // Register File Description
 //===----------------------------------------------------------------------===//
[962 lines not shown]

llvm/lib/Target/X86/X86InstrFragmentsSIMD.td

[818 lines not shown]
 def loadv32i16 : PatFrag<(ops node:$ptr), (v32i16 (load node:$ptr))>;
 def loadv64i8 : PatFrag<(ops node:$ptr), (v64i8 (load node:$ptr))>;

 // 128-/256-/512-bit extload pattern fragments
 def extloadv2f32 : PatFrag<(ops node:$ptr), (extloadvf32 node:$ptr)>;
 def extloadv4f32 : PatFrag<(ops node:$ptr), (extloadvf32 node:$ptr)>;
 def extloadv8f32 : PatFrag<(ops node:$ptr), (extloadvf32 node:$ptr)>;

-// Like 'store', but always requires vector size alignment.
+// Like 'store', but always requires vector size alignment when target doesn't
+// have use-unaligned-vector-move feature.
 def alignedstore : PatFrag<(ops node:$val, node:$ptr),
                            (store node:$val, node:$ptr), [{
+  if (Subtarget->useUnalignedVecMove())
+    return false;
   auto *St = cast<StoreSDNode>(N);
   return St->getAlignment() >= St->getMemoryVT().getStoreSize();
 }]>;

-// Like 'load', but always requires vector size alignment.
+// Like 'load', but always requires vector size alignment when target doesn't
+// have use-unaligned-vector-move feature.
 def alignedload : PatFrag<(ops node:$ptr), (load node:$ptr), [{
+  if (Subtarget->useUnalignedVecMove())
+    return false;
   auto *Ld = cast<LoadSDNode>(N);
   return Ld->getAlignment() >= Ld->getMemoryVT().getStoreSize();
 }]>;

 // 128-bit aligned load pattern fragments
 // NOTE: all 128-bit integer vector loads are promoted to v2i64
 def alignedloadv4f32 : PatFrag<(ops node:$ptr),
                                (v4f32 (alignedload node:$ptr))>;
[352 lines not shown]
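
To see how the `alignedload`/`alignedstore` predicate interacts with the AVX check in `useUnalignedVecMove()`, the logic can be modelled outside of LLVM. This is a sketch only: `SubtargetModel` and `matchesAlignedPattern` are hypothetical stand-ins, and the alignment and store size are passed as plain byte counts instead of being read from an SDNode.

```cpp
#include <cstdint>

// Hypothetical stand-in for the two X86Subtarget bits the patch consults.
struct SubtargetModel {
  bool HasAVX;
  bool UseUnalignedVectorMove;  // set by +use-unaligned-vector-move

  // Same condition as useUnalignedVecMove() in the patch: the feature
  // only takes effect when AVX is available.
  constexpr bool useUnalignedVecMove() const {
    return HasAVX && UseUnalignedVectorMove;
  }
};

// Same shape as the patched PatFrag predicate: refuse the aligned pattern
// when the feature is active, otherwise require vector-size alignment.
constexpr bool matchesAlignedPattern(const SubtargetModel &ST,
                                     std::uint64_t AlignBytes,
                                     std::uint64_t StoreSizeBytes) {
  return !ST.useUnalignedVecMove() && AlignBytes >= StoreSizeBytes;
}

// AVX + feature: never the aligned pattern, even for an aligned access.
static_assert(!matchesAlignedPattern(SubtargetModel{true, true}, 32, 32), "");
// SSE only + feature: the alignment check still decides.
static_assert(matchesAlignedPattern(SubtargetModel{false, true}, 16, 16), "");
static_assert(!matchesAlignedPattern(SubtargetModel{false, true}, 1, 16), "");
// AVX without the feature: behaviour is unchanged.
static_assert(matchesAlignedPattern(SubtargetModel{true, false}, 32, 32), "");
```

In this model an SSE-only target with the feature set keeps using aligned moves only for accesses known to be aligned, which is consistent with the `movaps` lines expected under the CHECK_SSE prefix in the new test file.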

llvm/lib/Target/X86/X86Subtarget.h

[462 lines not shown] class X86Subtarget final : public X86GenSubtargetInfo {
   bool UseLVILoadHardening = false;

   /// Use software floating point for code generation.
   bool UseSoftFloat = false;

   /// Use alias analysis during code generation.
   bool UseAA = false;

+  /// Always emit unaligned vector move instructions on AVX machine.
+  bool UseUnalignedVectorMove = false;
+
   /// The minimum alignment known to hold of the stack frame on
   /// entry to the function and which must be maintained by every function.
   Align stackAlignment = Align(4);

   Align TileConfigAlignment = Align(4);

   /// Max. memset / memcpy size that is turned into rep/movs, rep/stos ops.
   ///
[419 lines not shown] bool isCallingConvWin64(CallingConv::ID CC) const {
     case CallingConv::X86_64_SysV:
       return false;
     // Otherwise, who knows what this is.
     default:
       return false;
     }
   }

+  /// Unaligned vector move achieve the same performance as aligned vector move
+  /// does when the address is aligned on AVX machine. We will always emit
+  /// unaligned vector move on AVX machine when the UseUnalignedVectorMove is
+  /// set.
+  bool useUnalignedVecMove() const {
+    return hasAVX() && UseUnalignedVectorMove;
+  }
+
   /// Classify a global variable reference for the current subtarget according
   /// to how we should reference it in a non-pcrel context.
   unsigned char classifyLocalReference(const GlobalValue *GV) const;

   unsigned char classifyGlobalReference(const GlobalValue *GV,
                                         const Module &M) const;
   unsigned char classifyGlobalReference(const GlobalValue *GV) const;

[36 lines not shown]

llvm/test/CodeGen/X86/avx-unaligned-load-store.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=sse4.2,+use-unaligned-vector-move | FileCheck %s -check-prefix=CHECK_SSE
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx,+use-unaligned-vector-move | FileCheck %s -check-prefix=CHECK_AVX
				; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=sse4.2,+use-unaligned-vector-move | FileCheck %s -check-prefix=CHECK_SSE32
				; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=avx,+use-unaligned-vector-move | FileCheck %s -check-prefix=CHECK_AVX32

				define void @test_256_load(double* nocapture %d, float* nocapture %f, <4 x i64>* nocapture %i) nounwind {
				; CHECK_SSE-LABEL: test_256_load:
				; CHECK_SSE: # %bb.0: # %entry
				; CHECK_SSE-NEXT: pushq %r15
				; CHECK_SSE-NEXT: pushq %r14
				; CHECK_SSE-NEXT: pushq %rbx
				; CHECK_SSE-NEXT: subq $96, %rsp
				; CHECK_SSE-NEXT: movq %rdx, %r14
				; CHECK_SSE-NEXT: movq %rsi, %r15
				; CHECK_SSE-NEXT: movq %rdi, %rbx
				; CHECK_SSE-NEXT: movaps (%rdx), %xmm4
				; CHECK_SSE-NEXT: movaps %xmm4, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
				; CHECK_SSE-NEXT: movaps 16(%rdx), %xmm5
				; CHECK_SSE-NEXT: movaps %xmm5, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
				; CHECK_SSE-NEXT: movaps (%rsi), %xmm2
				; CHECK_SSE-NEXT: movaps %xmm2, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
				; CHECK_SSE-NEXT: movaps 16(%rsi), %xmm3
				; CHECK_SSE-NEXT: movaps %xmm3, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
				; CHECK_SSE-NEXT: movaps (%rdi), %xmm0
				; CHECK_SSE-NEXT: movaps %xmm0, {{[-0-9]+}}(%r{{[sb]}}p) # 16-byte Spill
				; CHECK_SSE-NEXT: movaps 16(%rdi), %xmm1
				; CHECK_SSE-NEXT: movaps %xmm1, (%rsp) # 16-byte Spill
				; CHECK_SSE-NEXT: callq dummy@PLT
				; CHECK_SSE-NEXT: movaps {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 16-byte Reload
				; CHECK_SSE-NEXT: movaps %xmm0, (%rbx)
				; CHECK_SSE-NEXT: movaps (%rsp), %xmm0 # 16-byte Reload
				; CHECK_SSE-NEXT: movaps %xmm0, 16(%rbx)
				; CHECK_SSE-NEXT: movaps {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 16-byte Reload
				; CHECK_SSE-NEXT: movaps %xmm0, (%r15)
				; CHECK_SSE-NEXT: movaps {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 16-byte Reload
				; CHECK_SSE-NEXT: movaps %xmm0, 16(%r15)
				; CHECK_SSE-NEXT: movaps {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 16-byte Reload
				; CHECK_SSE-NEXT: movaps %xmm0, (%r14)
				; CHECK_SSE-NEXT: movaps {{[-0-9]+}}(%r{{[sb]}}p), %xmm0 # 16-byte Reload
				; CHECK_SSE-NEXT: movaps %xmm0, 16(%r14)
				; CHECK_SSE-NEXT: addq $96, %rsp
				; CHECK_SSE-NEXT: popq %rbx
				; CHECK_SSE-NEXT: popq %r14
				; CHECK_SSE-NEXT: popq %r15
				; CHECK_SSE-NEXT: retq
				;
				; CHECK_AVX-LABEL: test_256_load:
				; CHECK_AVX: # %bb.0: # %entry
				; CHECK_AVX-NEXT: pushq %r15
				; CHECK_AVX-NEXT: pushq %r14
				; CHECK_AVX-NEXT: pushq %rbx
				; CHECK_AVX-NEXT: subq $96, %rsp
				; CHECK_AVX-NEXT: movq %rdx, %r14
				; CHECK_AVX-NEXT: movq %rsi, %r15
				; CHECK_AVX-NEXT: movq %rdi, %rbx
				; CHECK_AVX-NEXT: vmovups (%rdi), %ymm0
				; CHECK_AVX-NEXT: vmovups %ymm0, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
				; CHECK_AVX-NEXT: vmovups (%rsi), %ymm1
				; CHECK_AVX-NEXT: vmovups %ymm1, {{[-0-9]+}}(%r{{[sb]}}p) # 32-byte Spill
				; CHECK_AVX-NEXT: vmovups (%rdx), %ymm2
				; CHECK_AVX-NEXT: vmovups %ymm2, (%rsp) # 32-byte Spill
				; CHECK_AVX-NEXT: callq dummy@PLT
				; CHECK_AVX-NEXT: vmovups {{[-0-9]+}}(%r{{[sb]}}p), %ymm0 # 32-byte Reload
				; CHECK_AVX-NEXT: vmovups %ymm0, (%rbx)
				; CHECK_AVX-NEXT: vmovups {{[-0-9]+}}(%r{{[sb]}}p), %ymm0 # 32-byte Reload
				; CHECK_AVX-NEXT: vmovups %ymm0, (%r15)
				; CHECK_AVX-NEXT: vmovups (%rsp), %ymm0 # 32-byte Reload
				; CHECK_AVX-NEXT: vmovups %ymm0, (%r14)
				; CHECK_AVX-NEXT: addq $96, %rsp
				; CHECK_AVX-NEXT: popq %rbx
				; CHECK_AVX-NEXT: popq %r14
				; CHECK_AVX-NEXT: popq %r15
				; CHECK_AVX-NEXT: vzeroupper
				; CHECK_AVX-NEXT: retq
				;
				; CHECK_SSE32-LABEL: test_256_load:
				; CHECK_SSE32: # %bb.0: # %entry
				; CHECK_SSE32-NEXT: pushl %ebp
				; CHECK_SSE32-NEXT: movl %esp, %ebp
				; CHECK_SSE32-NEXT: pushl %ebx
				; CHECK_SSE32-NEXT: pushl %edi
				; CHECK_SSE32-NEXT: pushl %esi
				; CHECK_SSE32-NEXT: andl $-16, %esp
				; CHECK_SSE32-NEXT: subl $160, %esp
				; CHECK_SSE32-NEXT: movl 16(%ebp), %esi
				; CHECK_SSE32-NEXT: movl 12(%ebp), %edi
				; CHECK_SSE32-NEXT: movl 8(%ebp), %ebx
				; CHECK_SSE32-NEXT: movaps (%ebx), %xmm0
				; CHECK_SSE32-NEXT: movaps %xmm0, {{[-0-9]+}}(%e{{[sb]}}p) # 16-byte Spill
				; CHECK_SSE32-NEXT: movaps 16(%ebx), %xmm1
				; CHECK_SSE32-NEXT: movaps %xmm1, {{[-0-9]+}}(%e{{[sb]}}p) # 16-byte Spill
				; CHECK_SSE32-NEXT: movaps (%edi), %xmm2
				; CHECK_SSE32-NEXT: movaps %xmm2, {{[-0-9]+}}(%e{{[sb]}}p) # 16-byte Spill
				; CHECK_SSE32-NEXT: movaps 16(%edi), %xmm3
				; CHECK_SSE32-NEXT: movaps %xmm3, {{[-0-9]+}}(%e{{[sb]}}p) # 16-byte Spill
				; CHECK_SSE32-NEXT: movaps (%esi), %xmm4
				; CHECK_SSE32-NEXT: movaps %xmm4, {{[-0-9]+}}(%e{{[sb]}}p) # 16-byte Spill
				; CHECK_SSE32-NEXT: movaps 16(%esi), %xmm5
				; CHECK_SSE32-NEXT: movaps %xmm5, {{[-0-9]+}}(%e{{[sb]}}p) # 16-byte Spill
				; CHECK_SSE32-NEXT: movaps %xmm5, {{[0-9]+}}(%esp)
				; CHECK_SSE32-NEXT: movaps %xmm4, {{[0-9]+}}(%esp)
				; CHECK_SSE32-NEXT: movaps %xmm3, (%esp)
				; CHECK_SSE32-NEXT: calll dummy@PLT
				; CHECK_SSE32-NEXT: movaps {{[-0-9]+}}(%e{{[sb]}}p), %xmm0 # 16-byte Reload
				; CHECK_SSE32-NEXT: movaps %xmm0, (%ebx)
				; CHECK_SSE32-NEXT: movaps {{[-0-9]+}}(%e{{[sb]}}p), %xmm0 # 16-byte Reload
				; CHECK_SSE32-NEXT: movaps %xmm0, 16(%ebx)
				; CHECK_SSE32-NEXT: movaps {{[-0-9]+}}(%e{{[sb]}}p), %xmm0 # 16-byte Reload
				; CHECK_SSE32-NEXT: movaps %xmm0, (%edi)
				; CHECK_SSE32-NEXT: movaps {{[-0-9]+}}(%e{{[sb]}}p), %xmm0 # 16-byte Reload
				; CHECK_SSE32-NEXT: movaps %xmm0, 16(%edi)
				; CHECK_SSE32-NEXT: movaps {{[-0-9]+}}(%e{{[sb]}}p), %xmm0 # 16-byte Reload
				; CHECK_SSE32-NEXT: movaps %xmm0, (%esi)
				; CHECK_SSE32-NEXT: movaps {{[-0-9]+}}(%e{{[sb]}}p), %xmm0 # 16-byte Reload
				; CHECK_SSE32-NEXT: movaps %xmm0, 16(%esi)
				; CHECK_SSE32-NEXT: leal -12(%ebp), %esp
				; CHECK_SSE32-NEXT: popl %esi
				; CHECK_SSE32-NEXT: popl %edi
				; CHECK_SSE32-NEXT: popl %ebx
				; CHECK_SSE32-NEXT: popl %ebp
				; CHECK_SSE32-NEXT: retl
				;
				; CHECK_AVX32-LABEL: test_256_load:
				; CHECK_AVX32: # %bb.0: # %entry
				; CHECK_AVX32-NEXT: pushl %ebx
				; CHECK_AVX32-NEXT: pushl %edi
				; CHECK_AVX32-NEXT: pushl %esi
				; CHECK_AVX32-NEXT: subl $112, %esp
				; CHECK_AVX32-NEXT: movl {{[0-9]+}}(%esp), %esi
				; CHECK_AVX32-NEXT: movl {{[0-9]+}}(%esp), %edi
				; CHECK_AVX32-NEXT: movl {{[0-9]+}}(%esp), %ebx
				; CHECK_AVX32-NEXT: vmovups (%ebx), %ymm0
				; CHECK_AVX32-NEXT: vmovups %ymm0, {{[-0-9]+}}(%e{{[sb]}}p) # 32-byte Spill
				; CHECK_AVX32-NEXT: vmovups (%edi), %ymm1
				; CHECK_AVX32-NEXT: vmovups %ymm1, {{[-0-9]+}}(%e{{[sb]}}p) # 32-byte Spill
				; CHECK_AVX32-NEXT: vmovups (%esi), %ymm2
				; CHECK_AVX32-NEXT: vmovups %ymm2, (%esp) # 32-byte Spill
				; CHECK_AVX32-NEXT: calll dummy@PLT
				; CHECK_AVX32-NEXT: vmovups {{[-0-9]+}}(%e{{[sb]}}p), %ymm0 # 32-byte Reload
				; CHECK_AVX32-NEXT: vmovups %ymm0, (%ebx)
				; CHECK_AVX32-NEXT: vmovups {{[-0-9]+}}(%e{{[sb]}}p), %ymm0 # 32-byte Reload
				; CHECK_AVX32-NEXT: vmovups %ymm0, (%edi)
				; CHECK_AVX32-NEXT: vmovups (%esp), %ymm0 # 32-byte Reload
				; CHECK_AVX32-NEXT: vmovups %ymm0, (%esi)
				; CHECK_AVX32-NEXT: addl $112, %esp
				; CHECK_AVX32-NEXT: popl %esi
				; CHECK_AVX32-NEXT: popl %edi
				; CHECK_AVX32-NEXT: popl %ebx
				; CHECK_AVX32-NEXT: vzeroupper
				; CHECK_AVX32-NEXT: retl
				entry:
				%0 = bitcast double* %d to <4 x double>*
				%tmp1.i = load <4 x double>, <4 x double>* %0, align 32
				%1 = bitcast float* %f to <8 x float>*
				%tmp1.i17 = load <8 x float>, <8 x float>* %1, align 32
				%tmp1.i16 = load <4 x i64>, <4 x i64>* %i, align 32
				tail call void @dummy(<4 x double> %tmp1.i, <8 x float> %tmp1.i17, <4 x i64> %tmp1.i16) nounwind
				store <4 x double> %tmp1.i, <4 x double>* %0, align 32
				store <8 x float> %tmp1.i17, <8 x float>* %1, align 32
				store <4 x i64> %tmp1.i16, <4 x i64>* %i, align 32
				ret void
				}

				declare void @dummy(<4 x double>, <8 x float>, <4 x i64>)

				define void @storev16i16(<16 x i16> %a) nounwind {
				; CHECK_SSE-LABEL: storev16i16:
				; CHECK_SSE: # %bb.0:
				; CHECK_SSE-NEXT: movaps %xmm1, (%rax)
				; CHECK_SSE-NEXT: movaps %xmm0, (%rax)
				;
				; CHECK_AVX-LABEL: storev16i16:
				; CHECK_AVX: # %bb.0:
				; CHECK_AVX-NEXT: vmovups %ymm0, (%rax)
				;
				; CHECK_SSE32-LABEL: storev16i16:
				; CHECK_SSE32: # %bb.0:
				; CHECK_SSE32-NEXT: movaps %xmm1, (%eax)
				; CHECK_SSE32-NEXT: movaps %xmm0, (%eax)
				;
				; CHECK_AVX32-LABEL: storev16i16:
				; CHECK_AVX32: # %bb.0:
				; CHECK_AVX32-NEXT: vmovups %ymm0, (%eax)
				store <16 x i16> %a, <16 x i16>* undef, align 32
				unreachable
				}

				define void @storev16i16_01(<16 x i16> %a) nounwind {
				; CHECK_SSE-LABEL: storev16i16_01:
				; CHECK_SSE: # %bb.0:
				; CHECK_SSE-NEXT: movups %xmm1, (%rax)
				; CHECK_SSE-NEXT: movups %xmm0, (%rax)
				;
				; CHECK_AVX-LABEL: storev16i16_01:
				; CHECK_AVX: # %bb.0:
				; CHECK_AVX-NEXT: vmovups %ymm0, (%rax)
				;
				; CHECK_SSE32-LABEL: storev16i16_01:
				; CHECK_SSE32: # %bb.0:
				; CHECK_SSE32-NEXT: movups %xmm1, (%eax)
				; CHECK_SSE32-NEXT: movups %xmm0, (%eax)
				;
				; CHECK_AVX32-LABEL: storev16i16_01:
				; CHECK_AVX32: # %bb.0:
				; CHECK_AVX32-NEXT: vmovups %ymm0, (%eax)
				store <16 x i16> %a, <16 x i16>* undef, align 4
				unreachable
				}

				define void @storev32i8(<32 x i8> %a) nounwind {
				; CHECK_SSE-LABEL: storev32i8:
				; CHECK_SSE: # %bb.0:
				; CHECK_SSE-NEXT: movaps %xmm1, (%rax)
				; CHECK_SSE-NEXT: movaps %xmm0, (%rax)
				;
				; CHECK_AVX-LABEL: storev32i8:
				; CHECK_AVX: # %bb.0:
				; CHECK_AVX-NEXT: vmovups %ymm0, (%rax)
				;
				; CHECK_SSE32-LABEL: storev32i8:
				; CHECK_SSE32: # %bb.0:
				; CHECK_SSE32-NEXT: movaps %xmm1, (%eax)
				; CHECK_SSE32-NEXT: movaps %xmm0, (%eax)
				;
				; CHECK_AVX32-LABEL: storev32i8:
				; CHECK_AVX32: # %bb.0:
				; CHECK_AVX32-NEXT: vmovups %ymm0, (%eax)
				store <32 x i8> %a, <32 x i8>* undef, align 32
				unreachable
				}

				define void @storev32i8_01(<32 x i8> %a) nounwind {
				; CHECK_SSE-LABEL: storev32i8_01:
				; CHECK_SSE: # %bb.0:
				; CHECK_SSE-NEXT: movups %xmm1, (%rax)
				; CHECK_SSE-NEXT: movups %xmm0, (%rax)
				;
				; CHECK_AVX-LABEL: storev32i8_01:
				; CHECK_AVX: # %bb.0:
				; CHECK_AVX-NEXT: vmovups %ymm0, (%rax)
				;
				; CHECK_SSE32-LABEL: storev32i8_01:
				; CHECK_SSE32: # %bb.0:
				; CHECK_SSE32-NEXT: movups %xmm1, (%eax)
				; CHECK_SSE32-NEXT: movups %xmm0, (%eax)
				;
				; CHECK_AVX32-LABEL: storev32i8_01:
				; CHECK_AVX32: # %bb.0:
				; CHECK_AVX32-NEXT: vmovups %ymm0, (%eax)
				store <32 x i8> %a, <32 x i8>* undef, align 4
				unreachable
				}

				; It is faster to make two saves, if the data is already in xmm registers. For
				; example, after making an integer operation.
				define void @double_save(<4 x i32> %A, <4 x i32> %B, <8 x i32>* %P) nounwind ssp {
				; CHECK_SSE-LABEL: double_save:
				; CHECK_SSE: # %bb.0:
				; CHECK_SSE-NEXT: movaps %xmm1, 16(%rdi)
				; CHECK_SSE-NEXT: movaps %xmm0, (%rdi)
				; CHECK_SSE-NEXT: retq
				;
				; CHECK_AVX-LABEL: double_save:
				; CHECK_AVX: # %bb.0:
				; CHECK_AVX-NEXT: vmovups %xmm1, 16(%rdi)
				; CHECK_AVX-NEXT: vmovups %xmm0, (%rdi)
				; CHECK_AVX-NEXT: retq
				;
				; CHECK_SSE32-LABEL: double_save:
				; CHECK_SSE32: # %bb.0:
				; CHECK_SSE32-NEXT: movl {{[0-9]+}}(%esp), %eax
				; CHECK_SSE32-NEXT: movaps %xmm1, 16(%eax)
				; CHECK_SSE32-NEXT: movaps %xmm0, (%eax)
				; CHECK_SSE32-NEXT: retl
				;
				; CHECK_AVX32-LABEL: double_save:
				; CHECK_AVX32: # %bb.0:
				; CHECK_AVX32-NEXT: movl {{[0-9]+}}(%esp), %eax
				; CHECK_AVX32-NEXT: vmovups %xmm1, 16(%eax)
				; CHECK_AVX32-NEXT: vmovups %xmm0, (%eax)
				; CHECK_AVX32-NEXT: retl
				%Z = shufflevector <4 x i32>%A, <4 x i32>%B, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				store <8 x i32> %Z, <8 x i32>* %P, align 16
				ret void
				}

				define void @double_save_volatile(<4 x i32> %A, <4 x i32> %B, <8 x i32>* %P) nounwind {
				; CHECK_SSE-LABEL: double_save_volatile:
				; CHECK_SSE: # %bb.0:
				; CHECK_SSE-NEXT: movaps %xmm1, 16(%rdi)
				; CHECK_SSE-NEXT: movaps %xmm0, (%rdi)
				; CHECK_SSE-NEXT: retq
				;
				; CHECK_AVX-LABEL: double_save_volatile:
				; CHECK_AVX: # %bb.0:
				; CHECK_AVX-NEXT: # kill: def $xmm0 killed $xmm0 def $ymm0
				; CHECK_AVX-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
				; CHECK_AVX-NEXT: vmovups %ymm0, (%rdi)
				; CHECK_AVX-NEXT: vzeroupper
				; CHECK_AVX-NEXT: retq
				;
				; CHECK_SSE32-LABEL: double_save_volatile:
				; CHECK_SSE32: # %bb.0:
				; CHECK_SSE32-NEXT: movl {{[0-9]+}}(%esp), %eax
				; CHECK_SSE32-NEXT: movaps %xmm1, 16(%eax)
				; CHECK_SSE32-NEXT: movaps %xmm0, (%eax)
				; CHECK_SSE32-NEXT: retl
				;
				; CHECK_AVX32-LABEL: double_save_volatile:
				; CHECK_AVX32: # %bb.0:
				; CHECK_AVX32-NEXT: # kill: def $xmm0 killed $xmm0 def $ymm0
				; CHECK_AVX32-NEXT: movl {{[0-9]+}}(%esp), %eax
				; CHECK_AVX32-NEXT: vinsertf128 $1, %xmm1, %ymm0, %ymm0
				; CHECK_AVX32-NEXT: vmovups %ymm0, (%eax)
				; CHECK_AVX32-NEXT: vzeroupper
				; CHECK_AVX32-NEXT: retl
				%Z = shufflevector <4 x i32>%A, <4 x i32>%B, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
				store volatile <8 x i32> %Z, <8 x i32>* %P, align 16
				ret void
				}

				define void @add8i32(<8 x i32>* %ret, <8 x i32>* %bp) nounwind {
				; CHECK_SSE-LABEL: add8i32:
				; CHECK_SSE: # %bb.0:
				; CHECK_SSE-NEXT: movups (%rsi), %xmm0
				; CHECK_SSE-NEXT: movups 16(%rsi), %xmm1
				; CHECK_SSE-NEXT: movups %xmm1, 16(%rdi)
				; CHECK_SSE-NEXT: movups %xmm0, (%rdi)
				; CHECK_SSE-NEXT: retq
				;
				; CHECK_AVX-LABEL: add8i32:
				; CHECK_AVX: # %bb.0:
				; CHECK_AVX-NEXT: vmovups (%rsi), %ymm0
				; CHECK_AVX-NEXT: vmovups %ymm0, (%rdi)
				; CHECK_AVX-NEXT: vzeroupper
				; CHECK_AVX-NEXT: retq
				;
				; CHECK_SSE32-LABEL: add8i32:
				; CHECK_SSE32: # %bb.0:
				; CHECK_SSE32-NEXT: movl {{[0-9]+}}(%esp), %eax
				; CHECK_SSE32-NEXT: movl {{[0-9]+}}(%esp), %ecx
				; CHECK_SSE32-NEXT: movups (%ecx), %xmm0
				; CHECK_SSE32-NEXT: movups 16(%ecx), %xmm1
				; CHECK_SSE32-NEXT: movups %xmm1, 16(%eax)
				; CHECK_SSE32-NEXT: movups %xmm0, (%eax)
				; CHECK_SSE32-NEXT: retl
				;
				; CHECK_AVX32-LABEL: add8i32:
				; CHECK_AVX32: # %bb.0:
				; CHECK_AVX32-NEXT: movl {{[0-9]+}}(%esp), %eax
				; CHECK_AVX32-NEXT: movl {{[0-9]+}}(%esp), %ecx
				; CHECK_AVX32-NEXT: vmovups (%ecx), %ymm0
				; CHECK_AVX32-NEXT: vmovups %ymm0, (%eax)
				; CHECK_AVX32-NEXT: vzeroupper
				; CHECK_AVX32-NEXT: retl
				%b = load <8 x i32>, <8 x i32>* %bp, align 1
				%x = add <8 x i32> zeroinitializer, %b
				store <8 x i32> %x, <8 x i32>* %ret, align 1
				ret void
				}

				define void @add4i64a64(<4 x i64>* %ret, <4 x i64>* %bp) nounwind {
				; CHECK_SSE-LABEL: add4i64a64:
				; CHECK_SSE: # %bb.0:
				; CHECK_SSE-NEXT: movaps (%rsi), %xmm0
				; CHECK_SSE-NEXT: movaps 16(%rsi), %xmm1
				; CHECK_SSE-NEXT: movaps %xmm0, (%rdi)
				; CHECK_SSE-NEXT: movaps %xmm1, 16(%rdi)
				; CHECK_SSE-NEXT: retq
				;
				; CHECK_AVX-LABEL: add4i64a64:
				; CHECK_AVX: # %bb.0:
				; CHECK_AVX-NEXT: vmovups (%rsi), %ymm0
				; CHECK_AVX-NEXT: vmovups %ymm0, (%rdi)
				; CHECK_AVX-NEXT: vzeroupper
				; CHECK_AVX-NEXT: retq
				;
				; CHECK_SSE32-LABEL: add4i64a64:
				; CHECK_SSE32: # %bb.0:
				; CHECK_SSE32-NEXT: movl {{[0-9]+}}(%esp), %eax
				; CHECK_SSE32-NEXT: movl {{[0-9]+}}(%esp), %ecx
				; CHECK_SSE32-NEXT: movaps (%ecx), %xmm0
				; CHECK_SSE32-NEXT: movaps 16(%ecx), %xmm1
				; CHECK_SSE32-NEXT: movaps %xmm0, (%eax)
				; CHECK_SSE32-NEXT: movaps %xmm1, 16(%eax)
				; CHECK_SSE32-NEXT: retl
				;
				; CHECK_AVX32-LABEL: add4i64a64:
				; CHECK_AVX32: # %bb.0:
				; CHECK_AVX32-NEXT: movl {{[0-9]+}}(%esp), %eax
				; CHECK_AVX32-NEXT: movl {{[0-9]+}}(%esp), %ecx
				; CHECK_AVX32-NEXT: vmovups (%ecx), %ymm0
				; CHECK_AVX32-NEXT: vmovups %ymm0, (%eax)
				; CHECK_AVX32-NEXT: vzeroupper
				; CHECK_AVX32-NEXT: retl
				%b = load <4 x i64>, <4 x i64>* %bp, align 64
				%x = add <4 x i64> zeroinitializer, %b
				store <4 x i64> %x, <4 x i64>* %ret, align 64
				ret void
				}

				define void @add4i64a16(<4 x i64>* %ret, <4 x i64>* %bp) nounwind {
				; CHECK_SSE-LABEL: add4i64a16:
				; CHECK_SSE: # %bb.0:
				; CHECK_SSE-NEXT: movaps (%rsi), %xmm0
				; CHECK_SSE-NEXT: movaps 16(%rsi), %xmm1
				; CHECK_SSE-NEXT: movaps %xmm1, 16(%rdi)
				; CHECK_SSE-NEXT: movaps %xmm0, (%rdi)
				; CHECK_SSE-NEXT: retq
				;
				; CHECK_AVX-LABEL: add4i64a16:
				; CHECK_AVX: # %bb.0:
				; CHECK_AVX-NEXT: vmovups (%rsi), %ymm0
				; CHECK_AVX-NEXT: vmovups %ymm0, (%rdi)
				; CHECK_AVX-NEXT: vzeroupper
				; CHECK_AVX-NEXT: retq
				;
				; CHECK_SSE32-LABEL: add4i64a16:
				; CHECK_SSE32: # %bb.0:
				; CHECK_SSE32-NEXT: movl {{[0-9]+}}(%esp), %eax
				; CHECK_SSE32-NEXT: movl {{[0-9]+}}(%esp), %ecx
				; CHECK_SSE32-NEXT: movaps (%ecx), %xmm0
				; CHECK_SSE32-NEXT: movaps 16(%ecx), %xmm1
				; CHECK_SSE32-NEXT: movaps %xmm1, 16(%eax)
				; CHECK_SSE32-NEXT: movaps %xmm0, (%eax)
				; CHECK_SSE32-NEXT: retl
				;
				; CHECK_AVX32-LABEL: add4i64a16:
				; CHECK_AVX32: # %bb.0:
				; CHECK_AVX32-NEXT: movl {{[0-9]+}}(%esp), %eax
				; CHECK_AVX32-NEXT: movl {{[0-9]+}}(%esp), %ecx
				; CHECK_AVX32-NEXT: vmovups (%ecx), %ymm0
				; CHECK_AVX32-NEXT: vmovups %ymm0, (%eax)
				; CHECK_AVX32-NEXT: vzeroupper
				; CHECK_AVX32-NEXT: retl
				%b = load <4 x i64>, <4 x i64>* %bp, align 16
				%x = add <4 x i64> zeroinitializer, %b
				store <4 x i64> %x, <4 x i64>* %ret, align 16
				ret void
				}
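
The mnemonics in the CHECK lines of this file follow a simple pattern, which the sketch below tries to summarize. It is a simplified model, not the patch's implementation: the real selection happens in the X86 TableGen patterns, and `pickVectorMove` and `checkAgainstTests` are hypothetical names. The AVX branch without the feature is an assumption about default behavior, since none of the RUN lines here exercise it.

```cpp
#include <cassert>
#include <string>

// Simplified, hypothetical model of the move mnemonic seen in the CHECK lines.
// VecBytes is the width of the register actually moved (16 for xmm, 32 for ymm).
inline std::string pickVectorMove(bool HasAVX, bool UseUnalignedVectorMove,
                                  unsigned AlignBytes, unsigned VecBytes) {
  // With the feature on an AVX target, the tests always show the unaligned form.
  if (HasAVX && UseUnalignedVectorMove)
    return "vmovups";
  bool IsAligned = AlignBytes >= VecBytes;
  // Assumption: without the feature, AVX uses the aligned form when it can.
  if (HasAVX)
    return IsAligned ? "vmovaps" : "vmovups";
  // SSE targets ignore the feature and keep choosing by alignment.
  return IsAligned ? "movaps" : "movups";
}

// Spot checks against functions in the test file above.
inline void checkAgainstTests() {
  assert(pickVectorMove(false, true, 32, 16) == "movaps"); // storev16i16, SSE
  assert(pickVectorMove(false, true, 4, 16) == "movups");  // storev16i16_01, SSE
  assert(pickVectorMove(true, true, 32, 32) == "vmovups"); // storev16i16, AVX
  assert(pickVectorMove(true, true, 16, 16) == "vmovups"); // double_save, AVX
}
```

Note that on SSE a 256-bit access is split into two 16-byte halves, which is why `VecBytes` is 16 in the SSE cases even though the IR type is 32 bytes wide.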

llvm/test/CodeGen/X86/avx512-unaligned-load-store.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=avx512f,+use-unaligned-vector-move | FileCheck %s -check-prefix=X64
				; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=avx512f,+use-unaligned-vector-move | FileCheck %s -check-prefix=X86

				define <16 x i32> @test17(i8 * %addr) {
				; X64-LABEL: test17:
				; X64: # %bb.0:
				; X64-NEXT: vmovups (%rdi), %zmm0
				; X64-NEXT: retq
				;
				; X86-LABEL: test17:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups (%eax), %zmm0
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <16 x i32>*
				%res = load <16 x i32>, <16 x i32>* %vaddr, align 64
				ret <16 x i32>%res
				}

				define void @test18(i8 * %addr, <8 x i64> %data) {
				; X64-LABEL: test18:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %zmm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test18:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %zmm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x i64>*
				store <8 x i64>%data, <8 x i64>* %vaddr, align 64
				ret void
				}

				define void @test19(i8 * %addr, <16 x i32> %data) {
				; X64-LABEL: test19:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %zmm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test19:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %zmm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <16 x i32>*
				store <16 x i32>%data, <16 x i32>* %vaddr, align 1
				ret void
				}

				define void @test20(i8 * %addr, <16 x i32> %data) {
				; X64-LABEL: test20:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %zmm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test20:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %zmm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <16 x i32>*
				store <16 x i32>%data, <16 x i32>* %vaddr, align 64
				ret void
				}

				define <8 x i64> @test21(i8 * %addr) {
				; X64-LABEL: test21:
				; X64: # %bb.0:
				; X64-NEXT: vmovups (%rdi), %zmm0
				; X64-NEXT: retq
				;
				; X86-LABEL: test21:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups (%eax), %zmm0
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x i64>*
				%res = load <8 x i64>, <8 x i64>* %vaddr, align 64
				ret <8 x i64>%res
				}

				define void @test22(i8 * %addr, <8 x i64> %data) {
				; X64-LABEL: test22:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %zmm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test22:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %zmm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x i64>*
				store <8 x i64>%data, <8 x i64>* %vaddr, align 1
				ret void
				}

				define <8 x i64> @test23(i8 * %addr) {
				; X64-LABEL: test23:
				; X64: # %bb.0:
				; X64-NEXT: vmovups (%rdi), %zmm0
				; X64-NEXT: retq
				;
				; X86-LABEL: test23:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups (%eax), %zmm0
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x i64>*
				%res = load <8 x i64>, <8 x i64>* %vaddr, align 1
				ret <8 x i64>%res
				}

				define void @test24(i8 * %addr, <8 x double> %data) {
				; X64-LABEL: test24:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %zmm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test24:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %zmm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x double>*
				store <8 x double>%data, <8 x double>* %vaddr, align 64
				ret void
				}

				define <8 x double> @test25(i8 * %addr) {
				; X64-LABEL: test25:
				; X64: # %bb.0:
				; X64-NEXT: vmovups (%rdi), %zmm0
				; X64-NEXT: retq
				;
				; X86-LABEL: test25:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups (%eax), %zmm0
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x double>*
				%res = load <8 x double>, <8 x double>* %vaddr, align 64
				ret <8 x double>%res
				}

				define void @test26(i8 * %addr, <16 x float> %data) {
				; X64-LABEL: test26:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %zmm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test26:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %zmm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <16 x float>*
				store <16 x float>%data, <16 x float>* %vaddr, align 64
				ret void
				}

				define <16 x float> @test27(i8 * %addr) {
				; X64-LABEL: test27:
				; X64: # %bb.0:
				; X64-NEXT: vmovups (%rdi), %zmm0
				; X64-NEXT: retq
				;
				; X86-LABEL: test27:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups (%eax), %zmm0
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <16 x float>*
				%res = load <16 x float>, <16 x float>* %vaddr, align 64
				ret <16 x float>%res
				}

				define void @test28(i8 * %addr, <8 x double> %data) {
				; X64-LABEL: test28:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %zmm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test28:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %zmm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x double>*
				store <8 x double>%data, <8 x double>* %vaddr, align 1
				ret void
				}

				define <8 x double> @test29(i8 * %addr) {
				; X64-LABEL: test29:
				; X64: # %bb.0:
				; X64-NEXT: vmovups (%rdi), %zmm0
				; X64-NEXT: retq
				;
				; X86-LABEL: test29:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups (%eax), %zmm0
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x double>*
				%res = load <8 x double>, <8 x double>* %vaddr, align 1
				ret <8 x double>%res
				}

				define void @test30(i8 * %addr, <16 x float> %data) {
				; X64-LABEL: test30:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %zmm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test30:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %zmm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <16 x float>*
				store <16 x float>%data, <16 x float>* %vaddr, align 1
				ret void
				}

				define <16 x float> @test31(i8 * %addr) {
				; X64-LABEL: test31:
				; X64: # %bb.0:
				; X64-NEXT: vmovups (%rdi), %zmm0
				; X64-NEXT: retq
				;
				; X86-LABEL: test31:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups (%eax), %zmm0
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <16 x float>*
				%res = load <16 x float>, <16 x float>* %vaddr, align 1
				ret <16 x float>%res
				}

				define <16 x i32> @test32(i8 * %addr, <16 x i32> %old, <16 x i32> %mask1) {
				; X64-LABEL: test32:
				; X64: # %bb.0:
				; X64-NEXT: vptestmd %zmm1, %zmm1, %k1
				; X64-NEXT: vmovdqu32 (%rdi), %zmm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test32:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmd %zmm1, %zmm1, %k1
				; X86-NEXT: vmovdqu32 (%eax), %zmm0 {%k1}
				; X86-NEXT: retl
				%mask = icmp ne <16 x i32> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <16 x i32>*
				%r = load <16 x i32>, <16 x i32>* %vaddr, align 64
				%res = select <16 x i1> %mask, <16 x i32> %r, <16 x i32> %old
				ret <16 x i32>%res
				}

				define <16 x i32> @test33(i8 * %addr, <16 x i32> %old, <16 x i32> %mask1) {
				; X64-LABEL: test33:
				; X64: # %bb.0:
				; X64-NEXT: vptestmd %zmm1, %zmm1, %k1
				; X64-NEXT: vmovdqu32 (%rdi), %zmm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test33:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmd %zmm1, %zmm1, %k1
				; X86-NEXT: vmovdqu32 (%eax), %zmm0 {%k1}
				; X86-NEXT: retl
				%mask = icmp ne <16 x i32> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <16 x i32>*
				%r = load <16 x i32>, <16 x i32>* %vaddr, align 1
				%res = select <16 x i1> %mask, <16 x i32> %r, <16 x i32> %old
				ret <16 x i32>%res
				}

				define <16 x i32> @test34(i8 * %addr, <16 x i32> %mask1) {
				; X64-LABEL: test34:
				; X64: # %bb.0:
				; X64-NEXT: vptestmd %zmm0, %zmm0, %k1
				; X64-NEXT: vmovdqu32 (%rdi), %zmm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test34:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmd %zmm0, %zmm0, %k1
				; X86-NEXT: vmovdqu32 (%eax), %zmm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = icmp ne <16 x i32> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <16 x i32>*
				%r = load <16 x i32>, <16 x i32>* %vaddr, align 64
				%res = select <16 x i1> %mask, <16 x i32> %r, <16 x i32> zeroinitializer
				ret <16 x i32>%res
				}

				define <16 x i32> @test35(i8 * %addr, <16 x i32> %mask1) {
				; X64-LABEL: test35:
				; X64: # %bb.0:
				; X64-NEXT: vptestmd %zmm0, %zmm0, %k1
				; X64-NEXT: vmovdqu32 (%rdi), %zmm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test35:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmd %zmm0, %zmm0, %k1
				; X86-NEXT: vmovdqu32 (%eax), %zmm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = icmp ne <16 x i32> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <16 x i32>*
				%r = load <16 x i32>, <16 x i32>* %vaddr, align 1
				%res = select <16 x i1> %mask, <16 x i32> %r, <16 x i32> zeroinitializer
				ret <16 x i32>%res
				}

				define <8 x i64> @test36(i8 * %addr, <8 x i64> %old, <8 x i64> %mask1) {
				; X64-LABEL: test36:
				; X64: # %bb.0:
				; X64-NEXT: vptestmq %zmm1, %zmm1, %k1
				; X64-NEXT: vmovdqu64 (%rdi), %zmm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test36:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmq %zmm1, %zmm1, %k1
				; X86-NEXT: vmovdqu64 (%eax), %zmm0 {%k1}
				; X86-NEXT: retl
				%mask = icmp ne <8 x i64> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <8 x i64>*
				%r = load <8 x i64>, <8 x i64>* %vaddr, align 64
				%res = select <8 x i1> %mask, <8 x i64> %r, <8 x i64> %old
				ret <8 x i64>%res
				}

				define <8 x i64> @test37(i8 * %addr, <8 x i64> %old, <8 x i64> %mask1) {
				; X64-LABEL: test37:
				; X64: # %bb.0:
				; X64-NEXT: vptestmq %zmm1, %zmm1, %k1
				; X64-NEXT: vmovdqu64 (%rdi), %zmm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test37:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmq %zmm1, %zmm1, %k1
				; X86-NEXT: vmovdqu64 (%eax), %zmm0 {%k1}
				; X86-NEXT: retl
				%mask = icmp ne <8 x i64> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <8 x i64>*
				%r = load <8 x i64>, <8 x i64>* %vaddr, align 1
				%res = select <8 x i1> %mask, <8 x i64> %r, <8 x i64> %old
				ret <8 x i64>%res
				}

				define <8 x i64> @test38(i8 * %addr, <8 x i64> %mask1) {
				; X64-LABEL: test38:
				; X64: # %bb.0:
				; X64-NEXT: vptestmq %zmm0, %zmm0, %k1
				; X64-NEXT: vmovdqu64 (%rdi), %zmm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test38:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmq %zmm0, %zmm0, %k1
				; X86-NEXT: vmovdqu64 (%eax), %zmm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = icmp ne <8 x i64> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <8 x i64>*
				%r = load <8 x i64>, <8 x i64>* %vaddr, align 64
				%res = select <8 x i1> %mask, <8 x i64> %r, <8 x i64> zeroinitializer
				ret <8 x i64>%res
				}

				define <8 x i64> @test39(i8 * %addr, <8 x i64> %mask1) {
				; X64-LABEL: test39:
				; X64: # %bb.0:
				; X64-NEXT: vptestmq %zmm0, %zmm0, %k1
				; X64-NEXT: vmovdqu64 (%rdi), %zmm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test39:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmq %zmm0, %zmm0, %k1
				; X86-NEXT: vmovdqu64 (%eax), %zmm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = icmp ne <8 x i64> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <8 x i64>*
				%r = load <8 x i64>, <8 x i64>* %vaddr, align 1
				%res = select <8 x i1> %mask, <8 x i64> %r, <8 x i64> zeroinitializer
				ret <8 x i64>%res
				}

				define <16 x float> @test40(i8 * %addr, <16 x float> %old, <16 x float> %mask1) {
				; X64-LABEL: test40:
				; X64: # %bb.0:
				; X64-NEXT: vxorps %xmm2, %xmm2, %xmm2
				; X64-NEXT: vcmpneq_oqps %zmm2, %zmm1, %k1
				; X64-NEXT: vmovups (%rdi), %zmm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test40:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vxorps %xmm2, %xmm2, %xmm2
				; X86-NEXT: vcmpneq_oqps %zmm2, %zmm1, %k1
				; X86-NEXT: vmovups (%eax), %zmm0 {%k1}
				; X86-NEXT: retl
				%mask = fcmp one <16 x float> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <16 x float>*
				%r = load <16 x float>, <16 x float>* %vaddr, align 64
				%res = select <16 x i1> %mask, <16 x float> %r, <16 x float> %old
				ret <16 x float>%res
				}

				define <16 x float> @test41(i8 * %addr, <16 x float> %old, <16 x float> %mask1) {
				; X64-LABEL: test41:
				; X64: # %bb.0:
				; X64-NEXT: vxorps %xmm2, %xmm2, %xmm2
				; X64-NEXT: vcmpneq_oqps %zmm2, %zmm1, %k1
				; X64-NEXT: vmovups (%rdi), %zmm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test41:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vxorps %xmm2, %xmm2, %xmm2
				; X86-NEXT: vcmpneq_oqps %zmm2, %zmm1, %k1
				; X86-NEXT: vmovups (%eax), %zmm0 {%k1}
				; X86-NEXT: retl
				%mask = fcmp one <16 x float> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <16 x float>*
				%r = load <16 x float>, <16 x float>* %vaddr, align 1
				%res = select <16 x i1> %mask, <16 x float> %r, <16 x float> %old
				ret <16 x float>%res
				}

				define <16 x float> @test42(i8 * %addr, <16 x float> %mask1) {
				; X64-LABEL: test42:
				; X64: # %bb.0:
				; X64-NEXT: vxorps %xmm1, %xmm1, %xmm1
				; X64-NEXT: vcmpneq_oqps %zmm1, %zmm0, %k1
				; X64-NEXT: vmovups (%rdi), %zmm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test42:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vxorps %xmm1, %xmm1, %xmm1
				; X86-NEXT: vcmpneq_oqps %zmm1, %zmm0, %k1
				; X86-NEXT: vmovups (%eax), %zmm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = fcmp one <16 x float> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <16 x float>*
				%r = load <16 x float>, <16 x float>* %vaddr, align 64
				%res = select <16 x i1> %mask, <16 x float> %r, <16 x float> zeroinitializer
				ret <16 x float>%res
				}

				define <16 x float> @test43(i8 * %addr, <16 x float> %mask1) {
				; X64-LABEL: test43:
				; X64: # %bb.0:
				; X64-NEXT: vxorps %xmm1, %xmm1, %xmm1
				; X64-NEXT: vcmpneq_oqps %zmm1, %zmm0, %k1
				; X64-NEXT: vmovups (%rdi), %zmm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test43:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vxorps %xmm1, %xmm1, %xmm1
				; X86-NEXT: vcmpneq_oqps %zmm1, %zmm0, %k1
				; X86-NEXT: vmovups (%eax), %zmm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = fcmp one <16 x float> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <16 x float>*
				%r = load <16 x float>, <16 x float>* %vaddr, align 1
				%res = select <16 x i1> %mask, <16 x float> %r, <16 x float> zeroinitializer
				ret <16 x float>%res
				}

				define <8 x double> @test44(i8 * %addr, <8 x double> %old, <8 x double> %mask1) {
				; X64-LABEL: test44:
				; X64: # %bb.0:
				; X64-NEXT: vxorpd %xmm2, %xmm2, %xmm2
				; X64-NEXT: vcmpneq_oqpd %zmm2, %zmm1, %k1
				; X64-NEXT: vmovupd (%rdi), %zmm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test44:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vxorpd %xmm2, %xmm2, %xmm2
				; X86-NEXT: vcmpneq_oqpd %zmm2, %zmm1, %k1
				; X86-NEXT: vmovupd (%eax), %zmm0 {%k1}
				; X86-NEXT: retl
				%mask = fcmp one <8 x double> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <8 x double>*
				%r = load <8 x double>, <8 x double>* %vaddr, align 64
				%res = select <8 x i1> %mask, <8 x double> %r, <8 x double> %old
				ret <8 x double>%res
				}

				define <8 x double> @test45(i8 * %addr, <8 x double> %old, <8 x double> %mask1) {
				; X64-LABEL: test45:
				; X64: # %bb.0:
				; X64-NEXT: vxorpd %xmm2, %xmm2, %xmm2
				; X64-NEXT: vcmpneq_oqpd %zmm2, %zmm1, %k1
				; X64-NEXT: vmovupd (%rdi), %zmm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test45:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vxorpd %xmm2, %xmm2, %xmm2
				; X86-NEXT: vcmpneq_oqpd %zmm2, %zmm1, %k1
				; X86-NEXT: vmovupd (%eax), %zmm0 {%k1}
				; X86-NEXT: retl
				%mask = fcmp one <8 x double> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <8 x double>*
				%r = load <8 x double>, <8 x double>* %vaddr, align 1
				%res = select <8 x i1> %mask, <8 x double> %r, <8 x double> %old
				ret <8 x double>%res
				}

				define <8 x double> @test46(i8 * %addr, <8 x double> %mask1) {
				; X64-LABEL: test46:
				; X64: # %bb.0:
				; X64-NEXT: vxorpd %xmm1, %xmm1, %xmm1
				; X64-NEXT: vcmpneq_oqpd %zmm1, %zmm0, %k1
				; X64-NEXT: vmovupd (%rdi), %zmm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test46:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vxorpd %xmm1, %xmm1, %xmm1
				; X86-NEXT: vcmpneq_oqpd %zmm1, %zmm0, %k1
				; X86-NEXT: vmovupd (%eax), %zmm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = fcmp one <8 x double> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <8 x double>*
				%r = load <8 x double>, <8 x double>* %vaddr, align 64
				%res = select <8 x i1> %mask, <8 x double> %r, <8 x double> zeroinitializer
				ret <8 x double>%res
				}

				define <8 x double> @test47(i8 * %addr, <8 x double> %mask1) {
				; X64-LABEL: test47:
				; X64: # %bb.0:
				; X64-NEXT: vxorpd %xmm1, %xmm1, %xmm1
				; X64-NEXT: vcmpneq_oqpd %zmm1, %zmm0, %k1
				; X64-NEXT: vmovupd (%rdi), %zmm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test47:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vxorpd %xmm1, %xmm1, %xmm1
				; X86-NEXT: vcmpneq_oqpd %zmm1, %zmm0, %k1
				; X86-NEXT: vmovupd (%eax), %zmm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = fcmp one <8 x double> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <8 x double>*
				%r = load <8 x double>, <8 x double>* %vaddr, align 1
				%res = select <8 x i1> %mask, <8 x double> %r, <8 x double> zeroinitializer
				ret <8 x double>%res
				}

llvm/test/CodeGen/X86/avx512vl-unaligned-load-store.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512f,+avx512vl,+use-unaligned-vector-move | FileCheck %s -check-prefix=X64
				; RUN: llc < %s -mtriple=i686-unknown-unknown -mattr=+avx512f,+avx512vl,+use-unaligned-vector-move | FileCheck %s -check-prefix=X86

				define <8 x i32> @test_256_1(i8 * %addr) {
				; CHECK-LABEL: test_256_1:
				; CHECK: # %bb.0:
				craig.topper: CHECK isn’t a valid prefix for this file.
				LiuChen3: Thanks for the reminder. I forgot to delete these. I will remove them in the next patch.
				; CHECK-NEXT: vmovups (%rdi), %ymm0
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_1:
				; X64: # %bb.0:
				; X64-NEXT: vmovups (%rdi), %ymm0
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_1:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups (%eax), %ymm0
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x i32>*
				%res = load <8 x i32>, <8 x i32>* %vaddr, align 1
				ret <8 x i32>%res
				craig.topper: What are the tests with align 1 intended to show?
				LiuChen3: They distinguish an unaligned move that was converted from an aligned move from a move that was unaligned originally. See the difference between line 12 and line 28.
				}

				define <8 x i32> @test_256_2(i8 * %addr) {
				; CHECK-LABEL: test_256_2:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups (%rdi), %ymm0
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_2:
				; X64: # %bb.0:
				; X64-NEXT: vmovups (%rdi), %ymm0
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_2:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups (%eax), %ymm0
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x i32>*
				%res = load <8 x i32>, <8 x i32>* %vaddr, align 32
				ret <8 x i32>%res
				}

				define void @test_256_3(i8 * %addr, <4 x i64> %data) {
				; CHECK-LABEL: test_256_3:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups %ymm0, (%rdi)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_3:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %ymm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_3:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %ymm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <4 x i64>*
				store <4 x i64>%data, <4 x i64>* %vaddr, align 32
				ret void
				}

				define void @test_256_4(i8 * %addr, <8 x i32> %data) {
				; CHECK-LABEL: test_256_4:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups %ymm0, (%rdi)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_4:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %ymm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_4:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %ymm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x i32>*
				store <8 x i32>%data, <8 x i32>* %vaddr, align 1
				ret void
				}

				define void @test_256_5(i8 * %addr, <8 x i32> %data) {
				; CHECK-LABEL: test_256_5:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups %ymm0, (%rdi)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_5:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %ymm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_5:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %ymm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x i32>*
				store <8 x i32>%data, <8 x i32>* %vaddr, align 32
				ret void
				}

				define <4 x i64> @test_256_6(i8 * %addr) {
				; CHECK-LABEL: test_256_6:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups (%rdi), %ymm0
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_6:
				; X64: # %bb.0:
				; X64-NEXT: vmovups (%rdi), %ymm0
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_6:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups (%eax), %ymm0
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <4 x i64>*
				%res = load <4 x i64>, <4 x i64>* %vaddr, align 32
				ret <4 x i64>%res
				}

				define void @test_256_7(i8 * %addr, <4 x i64> %data) {
				; CHECK-LABEL: test_256_7:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups %ymm0, (%rdi)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_7:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %ymm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_7:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %ymm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <4 x i64>*
				store <4 x i64>%data, <4 x i64>* %vaddr, align 1
				ret void
				}

				define <4 x i64> @test_256_8(i8 * %addr) {
				; CHECK-LABEL: test_256_8:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups (%rdi), %ymm0
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_8:
				; X64: # %bb.0:
				; X64-NEXT: vmovups (%rdi), %ymm0
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_8:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups (%eax), %ymm0
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <4 x i64>*
				%res = load <4 x i64>, <4 x i64>* %vaddr, align 1
				ret <4 x i64>%res
				}

				define void @test_256_9(i8 * %addr, <4 x double> %data) {
				; CHECK-LABEL: test_256_9:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups %ymm0, (%rdi)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_9:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %ymm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_9:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %ymm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <4 x double>*
				store <4 x double>%data, <4 x double>* %vaddr, align 32
				ret void
				}

				define <4 x double> @test_256_10(i8 * %addr) {
				; CHECK-LABEL: test_256_10:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups (%rdi), %ymm0
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_10:
				; X64: # %bb.0:
				; X64-NEXT: vmovups (%rdi), %ymm0
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_10:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups (%eax), %ymm0
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <4 x double>*
				%res = load <4 x double>, <4 x double>* %vaddr, align 32
				ret <4 x double>%res
				}

				define void @test_256_11(i8 * %addr, <8 x float> %data) {
				; CHECK-LABEL: test_256_11:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups %ymm0, (%rdi)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_11:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %ymm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_11:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %ymm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x float>*
				store <8 x float>%data, <8 x float>* %vaddr, align 32
				ret void
				}

				define <8 x float> @test_256_12(i8 * %addr) {
				; CHECK-LABEL: test_256_12:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups (%rdi), %ymm0
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_12:
				; X64: # %bb.0:
				; X64-NEXT: vmovups (%rdi), %ymm0
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_12:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups (%eax), %ymm0
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x float>*
				%res = load <8 x float>, <8 x float>* %vaddr, align 32
				ret <8 x float>%res
				}

				define void @test_256_13(i8 * %addr, <4 x double> %data) {
				; CHECK-LABEL: test_256_13:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups %ymm0, (%rdi)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_13:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %ymm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_13:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %ymm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <4 x double>*
				store <4 x double>%data, <4 x double>* %vaddr, align 1
				ret void
				}

				define <4 x double> @test_256_14(i8 * %addr) {
				; CHECK-LABEL: test_256_14:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups (%rdi), %ymm0
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_14:
				; X64: # %bb.0:
				; X64-NEXT: vmovups (%rdi), %ymm0
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_14:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups (%eax), %ymm0
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <4 x double>*
				%res = load <4 x double>, <4 x double>* %vaddr, align 1
				ret <4 x double>%res
				}

				define void @test_256_15(i8 * %addr, <8 x float> %data) {
				; CHECK-LABEL: test_256_15:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups %ymm0, (%rdi)
				; CHECK-NEXT: vzeroupper
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_15:
				; X64: # %bb.0:
				; X64-NEXT: vmovups %ymm0, (%rdi)
				; X64-NEXT: vzeroupper
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_15:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups %ymm0, (%eax)
				; X86-NEXT: vzeroupper
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x float>*
				store <8 x float>%data, <8 x float>* %vaddr, align 1
				ret void
				}

				define <8 x float> @test_256_16(i8 * %addr) {
				; CHECK-LABEL: test_256_16:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vmovups (%rdi), %ymm0
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_16:
				; X64: # %bb.0:
				; X64-NEXT: vmovups (%rdi), %ymm0
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_16:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vmovups (%eax), %ymm0
				; X86-NEXT: retl
				%vaddr = bitcast i8* %addr to <8 x float>*
				%res = load <8 x float>, <8 x float>* %vaddr, align 1
				ret <8 x float>%res
				}

				define <8 x i32> @test_256_17(i8 * %addr, <8 x i32> %old, <8 x i32> %mask1) {
				; CHECK-LABEL: test_256_17:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmd %ymm1, %ymm1, %k1
				; CHECK-NEXT: vmovdqu32 (%rdi), %ymm0 {%k1}
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_17:
				; X64: # %bb.0:
				; X64-NEXT: vptestmd %ymm1, %ymm1, %k1
				; X64-NEXT: vmovdqu32 (%rdi), %ymm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_17:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmd %ymm1, %ymm1, %k1
				; X86-NEXT: vmovdqu32 (%eax), %ymm0 {%k1}
				; X86-NEXT: retl
				%mask = icmp ne <8 x i32> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <8 x i32>*
				%r = load <8 x i32>, <8 x i32>* %vaddr, align 32
				%res = select <8 x i1> %mask, <8 x i32> %r, <8 x i32> %old
				ret <8 x i32>%res
				}

				define <8 x i32> @test_256_18(i8 * %addr, <8 x i32> %old, <8 x i32> %mask1) {
				; CHECK-LABEL: test_256_18:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmd %ymm1, %ymm1, %k1
				; CHECK-NEXT: vmovdqu32 (%rdi), %ymm0 {%k1}
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_18:
				; X64: # %bb.0:
				; X64-NEXT: vptestmd %ymm1, %ymm1, %k1
				; X64-NEXT: vmovdqu32 (%rdi), %ymm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_18:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmd %ymm1, %ymm1, %k1
				; X86-NEXT: vmovdqu32 (%eax), %ymm0 {%k1}
				; X86-NEXT: retl
				%mask = icmp ne <8 x i32> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <8 x i32>*
				%r = load <8 x i32>, <8 x i32>* %vaddr, align 1
				%res = select <8 x i1> %mask, <8 x i32> %r, <8 x i32> %old
				ret <8 x i32>%res
				}

				define <8 x i32> @test_256_19(i8 * %addr, <8 x i32> %mask1) {
				; CHECK-LABEL: test_256_19:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmd %ymm0, %ymm0, %k1
				; CHECK-NEXT: vmovdqu32 (%rdi), %ymm0 {%k1} {z}
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_19:
				; X64: # %bb.0:
				; X64-NEXT: vptestmd %ymm0, %ymm0, %k1
				; X64-NEXT: vmovdqu32 (%rdi), %ymm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_19:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmd %ymm0, %ymm0, %k1
				; X86-NEXT: vmovdqu32 (%eax), %ymm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = icmp ne <8 x i32> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <8 x i32>*
				%r = load <8 x i32>, <8 x i32>* %vaddr, align 32
				%res = select <8 x i1> %mask, <8 x i32> %r, <8 x i32> zeroinitializer
				ret <8 x i32>%res
				}

				define <8 x i32> @test_256_20(i8 * %addr, <8 x i32> %mask1) {
				; CHECK-LABEL: test_256_20:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmd %ymm0, %ymm0, %k1
				; CHECK-NEXT: vmovdqu32 (%rdi), %ymm0 {%k1} {z}
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_20:
				; X64: # %bb.0:
				; X64-NEXT: vptestmd %ymm0, %ymm0, %k1
				; X64-NEXT: vmovdqu32 (%rdi), %ymm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_20:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmd %ymm0, %ymm0, %k1
				; X86-NEXT: vmovdqu32 (%eax), %ymm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = icmp ne <8 x i32> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <8 x i32>*
				%r = load <8 x i32>, <8 x i32>* %vaddr, align 1
				%res = select <8 x i1> %mask, <8 x i32> %r, <8 x i32> zeroinitializer
				ret <8 x i32>%res
				}

				define <4 x i64> @test_256_21(i8 * %addr, <4 x i64> %old, <4 x i64> %mask1) {
				; CHECK-LABEL: test_256_21:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmq %ymm1, %ymm1, %k1
				; CHECK-NEXT: vmovdqu64 (%rdi), %ymm0 {%k1}
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_21:
				; X64: # %bb.0:
				; X64-NEXT: vptestmq %ymm1, %ymm1, %k1
				; X64-NEXT: vmovdqu64 (%rdi), %ymm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_21:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmq %ymm1, %ymm1, %k1
				; X86-NEXT: vmovdqu64 (%eax), %ymm0 {%k1}
				; X86-NEXT: retl
				%mask = icmp ne <4 x i64> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <4 x i64>*
				%r = load <4 x i64>, <4 x i64>* %vaddr, align 32
				%res = select <4 x i1> %mask, <4 x i64> %r, <4 x i64> %old
				ret <4 x i64>%res
				}

				define <4 x i64> @test_256_22(i8 * %addr, <4 x i64> %old, <4 x i64> %mask1) {
				; CHECK-LABEL: test_256_22:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmq %ymm1, %ymm1, %k1
				; CHECK-NEXT: vmovdqu64 (%rdi), %ymm0 {%k1}
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_22:
				; X64: # %bb.0:
				; X64-NEXT: vptestmq %ymm1, %ymm1, %k1
				; X64-NEXT: vmovdqu64 (%rdi), %ymm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_22:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmq %ymm1, %ymm1, %k1
				; X86-NEXT: vmovdqu64 (%eax), %ymm0 {%k1}
				; X86-NEXT: retl
				%mask = icmp ne <4 x i64> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <4 x i64>*
				%r = load <4 x i64>, <4 x i64>* %vaddr, align 1
				%res = select <4 x i1> %mask, <4 x i64> %r, <4 x i64> %old
				ret <4 x i64>%res
				}

				define <4 x i64> @test_256_23(i8 * %addr, <4 x i64> %mask1) {
				; CHECK-LABEL: test_256_23:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmq %ymm0, %ymm0, %k1
				; CHECK-NEXT: vmovdqu64 (%rdi), %ymm0 {%k1} {z}
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_23:
				; X64: # %bb.0:
				; X64-NEXT: vptestmq %ymm0, %ymm0, %k1
				; X64-NEXT: vmovdqu64 (%rdi), %ymm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_23:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmq %ymm0, %ymm0, %k1
				; X86-NEXT: vmovdqu64 (%eax), %ymm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = icmp ne <4 x i64> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <4 x i64>*
				%r = load <4 x i64>, <4 x i64>* %vaddr, align 32
				%res = select <4 x i1> %mask, <4 x i64> %r, <4 x i64> zeroinitializer
				ret <4 x i64>%res
				}

				define <4 x i64> @test_256_24(i8 * %addr, <4 x i64> %mask1) {
				; CHECK-LABEL: test_256_24:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmq %ymm0, %ymm0, %k1
				; CHECK-NEXT: vmovdqu64 (%rdi), %ymm0 {%k1} {z}
				; CHECK-NEXT: retq
				; X64-LABEL: test_256_24:
				; X64: # %bb.0:
				; X64-NEXT: vptestmq %ymm0, %ymm0, %k1
				; X64-NEXT: vmovdqu64 (%rdi), %ymm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_256_24:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmq %ymm0, %ymm0, %k1
				; X86-NEXT: vmovdqu64 (%eax), %ymm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = icmp ne <4 x i64> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <4 x i64>*
				%r = load <4 x i64>, <4 x i64>* %vaddr, align 1
				%res = select <4 x i1> %mask, <4 x i64> %r, <4 x i64> zeroinitializer
				ret <4 x i64>%res
				}

				define <4 x i32> @test_128_17(i8 * %addr, <4 x i32> %old, <4 x i32> %mask1) {
				; CHECK-LABEL: test_128_17:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmd %xmm1, %xmm1, %k1
				; CHECK-NEXT: vmovdqu32 (%rdi), %xmm0 {%k1}
				; CHECK-NEXT: retq
				; X64-LABEL: test_128_17:
				; X64: # %bb.0:
				; X64-NEXT: vptestmd %xmm1, %xmm1, %k1
				; X64-NEXT: vmovdqu32 (%rdi), %xmm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_128_17:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmd %xmm1, %xmm1, %k1
				; X86-NEXT: vmovdqu32 (%eax), %xmm0 {%k1}
				; X86-NEXT: retl
				%mask = icmp ne <4 x i32> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <4 x i32>*
				%r = load <4 x i32>, <4 x i32>* %vaddr, align 16
				%res = select <4 x i1> %mask, <4 x i32> %r, <4 x i32> %old
				ret <4 x i32>%res
				}

				define <4 x i32> @test_128_18(i8 * %addr, <4 x i32> %old, <4 x i32> %mask1) {
				; CHECK-LABEL: test_128_18:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmd %xmm1, %xmm1, %k1
				; CHECK-NEXT: vmovdqu32 (%rdi), %xmm0 {%k1}
				; CHECK-NEXT: retq
				; X64-LABEL: test_128_18:
				; X64: # %bb.0:
				; X64-NEXT: vptestmd %xmm1, %xmm1, %k1
				; X64-NEXT: vmovdqu32 (%rdi), %xmm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_128_18:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmd %xmm1, %xmm1, %k1
				; X86-NEXT: vmovdqu32 (%eax), %xmm0 {%k1}
				; X86-NEXT: retl
				%mask = icmp ne <4 x i32> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <4 x i32>*
				%r = load <4 x i32>, <4 x i32>* %vaddr, align 1
				%res = select <4 x i1> %mask, <4 x i32> %r, <4 x i32> %old
				ret <4 x i32>%res
				}

				define <4 x i32> @test_128_19(i8 * %addr, <4 x i32> %mask1) {
				; CHECK-LABEL: test_128_19:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmd %xmm0, %xmm0, %k1
				; CHECK-NEXT: vmovdqu32 (%rdi), %xmm0 {%k1} {z}
				; CHECK-NEXT: retq
				; X64-LABEL: test_128_19:
				; X64: # %bb.0:
				; X64-NEXT: vptestmd %xmm0, %xmm0, %k1
				; X64-NEXT: vmovdqu32 (%rdi), %xmm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_128_19:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmd %xmm0, %xmm0, %k1
				; X86-NEXT: vmovdqu32 (%eax), %xmm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = icmp ne <4 x i32> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <4 x i32>*
				%r = load <4 x i32>, <4 x i32>* %vaddr, align 16
				%res = select <4 x i1> %mask, <4 x i32> %r, <4 x i32> zeroinitializer
				ret <4 x i32>%res
				}

				define <4 x i32> @test_128_20(i8 * %addr, <4 x i32> %mask1) {
				; CHECK-LABEL: test_128_20:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmd %xmm0, %xmm0, %k1
				; CHECK-NEXT: vmovdqu32 (%rdi), %xmm0 {%k1} {z}
				; CHECK-NEXT: retq
				; X64-LABEL: test_128_20:
				; X64: # %bb.0:
				; X64-NEXT: vptestmd %xmm0, %xmm0, %k1
				; X64-NEXT: vmovdqu32 (%rdi), %xmm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_128_20:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmd %xmm0, %xmm0, %k1
				; X86-NEXT: vmovdqu32 (%eax), %xmm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = icmp ne <4 x i32> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <4 x i32>*
				%r = load <4 x i32>, <4 x i32>* %vaddr, align 1
				%res = select <4 x i1> %mask, <4 x i32> %r, <4 x i32> zeroinitializer
				ret <4 x i32>%res
				}

				define <2 x i64> @test_128_21(i8 * %addr, <2 x i64> %old, <2 x i64> %mask1) {
				; CHECK-LABEL: test_128_21:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmq %xmm1, %xmm1, %k1
				; CHECK-NEXT: vmovdqu64 (%rdi), %xmm0 {%k1}
				; CHECK-NEXT: retq
				; X64-LABEL: test_128_21:
				; X64: # %bb.0:
				; X64-NEXT: vptestmq %xmm1, %xmm1, %k1
				; X64-NEXT: vmovdqu64 (%rdi), %xmm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_128_21:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmq %xmm1, %xmm1, %k1
				; X86-NEXT: vmovdqu64 (%eax), %xmm0 {%k1}
				; X86-NEXT: retl
				%mask = icmp ne <2 x i64> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <2 x i64>*
				%r = load <2 x i64>, <2 x i64>* %vaddr, align 16
				%res = select <2 x i1> %mask, <2 x i64> %r, <2 x i64> %old
				ret <2 x i64>%res
				}

				define <2 x i64> @test_128_22(i8 * %addr, <2 x i64> %old, <2 x i64> %mask1) {
				; CHECK-LABEL: test_128_22:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmq %xmm1, %xmm1, %k1
				; CHECK-NEXT: vmovdqu64 (%rdi), %xmm0 {%k1}
				; CHECK-NEXT: retq
				; X64-LABEL: test_128_22:
				; X64: # %bb.0:
				; X64-NEXT: vptestmq %xmm1, %xmm1, %k1
				; X64-NEXT: vmovdqu64 (%rdi), %xmm0 {%k1}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_128_22:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmq %xmm1, %xmm1, %k1
				; X86-NEXT: vmovdqu64 (%eax), %xmm0 {%k1}
				; X86-NEXT: retl
				%mask = icmp ne <2 x i64> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <2 x i64>*
				%r = load <2 x i64>, <2 x i64>* %vaddr, align 1
				%res = select <2 x i1> %mask, <2 x i64> %r, <2 x i64> %old
				ret <2 x i64>%res
				}

				define <2 x i64> @test_128_23(i8 * %addr, <2 x i64> %mask1) {
				; CHECK-LABEL: test_128_23:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmq %xmm0, %xmm0, %k1
				; CHECK-NEXT: vmovdqu64 (%rdi), %xmm0 {%k1} {z}
				; CHECK-NEXT: retq
				; X64-LABEL: test_128_23:
				; X64: # %bb.0:
				; X64-NEXT: vptestmq %xmm0, %xmm0, %k1
				; X64-NEXT: vmovdqu64 (%rdi), %xmm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_128_23:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmq %xmm0, %xmm0, %k1
				; X86-NEXT: vmovdqu64 (%eax), %xmm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = icmp ne <2 x i64> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <2 x i64>*
				%r = load <2 x i64>, <2 x i64>* %vaddr, align 16
				%res = select <2 x i1> %mask, <2 x i64> %r, <2 x i64> zeroinitializer
				ret <2 x i64>%res
				}

				define <2 x i64> @test_128_24(i8 * %addr, <2 x i64> %mask1) {
				; CHECK-LABEL: test_128_24:
				; CHECK: # %bb.0:
				; CHECK-NEXT: vptestmq %xmm0, %xmm0, %k1
				; CHECK-NEXT: vmovdqu64 (%rdi), %xmm0 {%k1} {z}
				; CHECK-NEXT: retq
				; X64-LABEL: test_128_24:
				; X64: # %bb.0:
				; X64-NEXT: vptestmq %xmm0, %xmm0, %k1
				; X64-NEXT: vmovdqu64 (%rdi), %xmm0 {%k1} {z}
				; X64-NEXT: retq
				;
				; X86-LABEL: test_128_24:
				; X86: # %bb.0:
				; X86-NEXT: movl {{[0-9]+}}(%esp), %eax
				; X86-NEXT: vptestmq %xmm0, %xmm0, %k1
				; X86-NEXT: vmovdqu64 (%eax), %xmm0 {%k1} {z}
				; X86-NEXT: retl
				%mask = icmp ne <2 x i64> %mask1, zeroinitializer
				%vaddr = bitcast i8* %addr to <2 x i64>*
				%r = load <2 x i64>, <2 x i64>* %vaddr, align 1
				%res = select <2 x i1> %mask, <2 x i64> %r, <2 x i64> zeroinitializer
				ret <2 x i64>%res
				}