This is an archive of the discontinued LLVM Phabricator instance.

Differential D41330

[X86] Reduce Store Forward Block issues in HW
ClosedPublic

Authored by lsaba on Dec 17 2017, 3:25 AM.

Details

Reviewers

craig.topper
guyblank
zvi
zansari
RKSimon
andreadb
oren_ben_simhon

Commits

rG927468309fdd: [X86] Reduce Store Forward Block issues in HW - Recommit after fixing Bug 36346
rL328973: [X86] Reduce Store Forward Block issues in HW - Recommit after fixing Bug 36346

Summary

If a load follows a store and reloads data that the store has written to memory, Intel microarchitectures can in many cases forward the data directly from the store to the load. This "store forwarding" saves cycles by enabling the load to obtain the data directly instead of reading it from cache or memory.
A "store forward block" occurs when a store cannot be forwarded to the load. The most typical case of a store forward block on the Intel Core microarchitecture is a small store that cannot be forwarded to a larger load.
The estimated penalty for a store forward block is ~13 cycles.

This pass tries to recognize and handle cases where a "store forward block" is created by the compiler when lowering memcpy calls to a sequence of a load and a store.

The pass currently handles only cases where memcpy is lowered to XMM/YMM registers; it tries to break the memcpy into smaller copies.
Breaking the memcpy should be possible since there is no atomicity guarantee for loads and stores to XMM/YMM.
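
For illustration only, a source-level pattern of the kind the summary describes might look like the sketch below. The struct `S` and the function `storeThenCopy` are hypothetical names, not part of the patch, and whether the struct copy is actually lowered to a single XMM load and store depends on the compiler and target options.

```cpp
// Hypothetical 16-byte struct: a whole-struct copy may be lowered to a
// single 128-bit (XMM) load followed by a 128-bit store.
struct S {
  int a, b, c, d;
};

// The 4-byte store to s1->b is followed by a 16-byte copy that re-reads the
// same bytes. If the copy becomes one XMM load, the narrow store cannot be
// forwarded to the wider load.
void storeThenCopy(S *s1, S *s2, int x) {
  s1->b = x; // potential blocking store (4 bytes)
  *s2 = *s1; // potential blocked load (16 bytes), then store
}
```

The semantics are the same whether or not the copy is split; only the generated load and store widths would differ.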

Event Timeline

lsaba created this revision. Dec 17 2017, 3:25 AM

Herald added a subscriber: mgorny. Dec 17 2017, 3:25 AM

lsaba added a reviewer: zansari. Dec 17 2017, 3:26 AM

RKSimon added reviewers: RKSimon, andreadb. Dec 17 2017, 4:01 AM

Few style comments. I didn't look at the algorithm closely yet.

lib/Target/X86/X86FixupSFB.cpp
121	What about the EVEX versions of these instructions?
222	Remove the parentheses here.
245	Use a "static const unsigned " variable instead of a #define.
459	Capitalize variable names.

Addressed Craig's comments

lib/Target/X86/X86FixupSFB.cpp
121	More cases can still be added in the future

hfinkel added a subscriber: hfinkel. Dec 19 2017, 10:21 PM

hfinkel added inline comments.

lib/Target/X86/X86FixupSFB.cpp
33	AA is too conservative. Did you try turning it on? :-) If you overload useAA() in X86Subtarget, then you'll actually get non-trivial AA in the backend. Given all of the recent work on adding scheduling models, we should probably experiment with this again. That having been said, it relies on having good MMOs. If we don't, then you'll have trouble using it for this purpose until that's fixed.
225	Can you elaborate on what you're seeing? Missing MMOs are something we should fix. If the MMO size is too small, that's a bug (i.e., can cause miscompiles from invalid instruction scheduling). MMO sizes that are too big should also be fixed.

lsaba added inline comments. Dec 20 2017, 7:08 AM

lib/Target/X86/X86FixupSFB.cpp
33	Perhaps it's better to say that the solution approach suggested in the comment is more conservative, and not the AA, I will try to explain what I meant with an example:

    %struct.S = type { i32, i32, i32, i32 }

    ; Function Attrs: nounwind uwtable
    define void @test_conditional_block(%struct.S* nocapture %s1, %struct.S* nocapture %s2, i32 %x, %struct.S* nocapture %s3, %struct.S* nocapture readonly %s4) local_unnamed_addr #0 {
    entry:
      %b = getelementptr inbounds %struct.S, %struct.S* %s1, i64 0, i32 1
      store i32 %x, i32* %b, align 4
      %0 = bitcast %struct.S* %s3 to i8*
      %1 = bitcast %struct.S* %s4 to i8*
      tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 16, i32 4, i1 false)
      %2 = bitcast %struct.S* %s2 to i8*
      %3 = bitcast %struct.S* %s1 to i8*
      tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 16, i32 4, i1 false)
      ret void
    }

the currently generated code:

    movl %edx, 4(%rdi)      // blocking store
    movups (%r8), %xmm0
    movups %xmm0, (%rcx)
    movups (%rdi), %xmm0    // blocked load
    movups %xmm0, (%rsi)

If we were to solve the problem by loading the old value, then storing like the comment suggests:

    movups (%rdi), %xmm0    // blocked load
    vinsert %edx, xmm0, 1   // store b to xmm0 at index 1
    movups %xmm0, (%rsi)

We would have to prove that the previous copy from s3 to s4:

    movups (%r8), %xmm0
    movups %xmm0, (%rcx)

does not alias or must alias (and in that case do a similar handling for it) the copy from s1 to s2, while the AA would naturally assume they may alias in this case.

On the other hand, the proposed solution in this patch (breaking the memcpy) will be correct regardless to the memory relation between s3, s4 and s1,s2. I am not too familiar with the capabilities of the backend AA, I did not try turning it on :) correct me if i'm wrong, but relying on it and using the approach in the comment will yield a more conservative optimization than the one implemented here, at least for the trivial cases.

In general, the solution in the comment gets more complicated if the blocking store is in a predecessing block, in that case we'll need to prove safety in all predecessing blocks, or hoist the memcpy for conditional blocks
225	You are right. I saw missing MMOs, I will try to fix the ones I encountered, but it doesn't guarantee there won't be more in code I did not reach so i'm not sure it would be safe to remove those defines
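
As a rough sketch of the "breaking the memcpy" approach discussed in the comment on line 33 above: the copy is split so that the bytes written by each blocking store are copied by a chunk of their own, and the remaining ranges are covered by the widest moves that fit. The names `Chunk`, `emitMoves` and `splitCopy` are hypothetical and this is not the pass's code; it assumes the blocking stores lie inside the copy, do not overlap, and are keyed by their displacement from the start of the copy.

```cpp
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// One piece of the original copy: byte offset from the copy's base, and size.
using Chunk = std::pair<int64_t, unsigned>;

// Greedily covers [Offset, Offset + Size) with moves of 16, 8, 4, 2 or 1
// bytes, the move sizes named by the MOV*SZ constants in the patch.
static void emitMoves(int64_t Offset, unsigned Size, std::vector<Chunk> &Out) {
  for (unsigned MoveSize = 16; MoveSize >= 1 && Size > 0; MoveSize /= 2) {
    while (Size >= MoveSize) {
      Out.emplace_back(Offset, MoveSize);
      Offset += MoveSize;
      Size -= MoveSize;
    }
  }
}

// Splits a CopySize-byte copy around the blocking stores. The map key is the
// store's displacement relative to the copy's base; the value is its size.
std::vector<Chunk> splitCopy(unsigned CopySize,
                             const std::map<int64_t, unsigned> &BlockingStores) {
  std::vector<Chunk> Chunks;
  int64_t Cursor = 0;
  for (const auto &Store : BlockingStores) { // std::map iterates in key order
    // Bytes between the previous store and this one.
    emitMoves(Cursor, static_cast<unsigned>(Store.first - Cursor), Chunks);
    // Bytes written by the blocking store get their own moves.
    emitMoves(Store.first, Store.second, Chunks);
    Cursor = Store.first + Store.second;
  }
  // Tail of the copy after the last blocking store.
  emitMoves(Cursor, CopySize - static_cast<unsigned>(Cursor), Chunks);
  return Chunks;
}
```

For the example in the comment (a 16-byte copy with a 4-byte store at displacement 4), this sketch yields chunks of 4, 4 and 8 bytes at offsets 0, 4 and 8. The sorted iteration is the reason an ordered map is convenient here.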

lsaba updated this revision to Diff 128164. Dec 26 2017, 4:00 AM

add MMO info

hfinkel added inline comments. Dec 28 2017, 6:56 AM

lib/Target/X86/X86FixupSFB.cpp
33	I think that this makes sense, please explain this better in the comment (i.e., that you're speculatively assuming aliasing, as the transformation is correct either way, and you often won't be able to prove a sufficient number of the interesting cases).
225	Okay, thanks (although you should use something other than `#define`s: constants, a function with a switch statement, etc.).

addressed @hfinkel comments.
ping to reviewers.

lsaba marked 2 inline comments as done. Jan 3 2018, 12:13 AM

Ping.

Pinging again,,

craig.topper added inline comments. Jan 17 2018, 5:19 PM

lib/Target/X86/X86FixupSFB.cpp
110	I think you need to add VMOVDQU64Z128rm/VMOVDQU32Z128rm/VMOVDQA64Z128rm/VMOVDQA32Z128rm. Those are the EVEX integer equivalents.
117	Add the EVEX integer instructions.
247	Add a comment about what's considered a relevant addressing mode?
255	Why can't these be MachineOperand references? Why are they assigned to nullptr and then immediately overwritten?
test/CodeGen/X86/fixup-sfb.ll
5	Add avx512vl command line?

All of the load/store switch statements can be moved to more compact lookup tables, similar to what we do for domain switching in X86InstrInfo.cpp?

lib/Target/X86/X86FixupSFB.cpp
18	microarchitecture

addressed comments from @RKSimon and @craig.topper

Herald added a subscriber: hintonda. Jan 31 2018, 9:20 AM

lsaba added a comment. Jan 31 2018, 9:25 AM

This comment was removed by lsaba.

craig.topper added inline comments. Feb 2 2018, 9:32 AM

lib/Target/X86/X86FixupSFB.cpp
128	Should these be DenseMaps? @RKSimon, what do you think? Can you use a std::array or std::pair instead of std::vector for the second type since the size is fixed. Each std::vector will be a separate heap allocation.
255	I don't think this was addressed. Why aren't these references instead of pointers?

addressed @craig.topper comments

lsaba added a reviewer: oren_ben_simhon. Feb 6 2018, 6:13 AM

ping to reviewers

courbet added a subscriber: courbet. Feb 8 2018, 12:25 AM

craig.topper added inline comments. Feb 8 2018, 12:28 AM

lib/Target/X86/X86FixupSFB.cpp
128	I don't think the DenseMap question here was answered.
233	Return a reference?
238	Return a reference here too.
426	Potentially is misspelled.
450	Can you clang-format at least this function? That 8 looks really out of place. So does the splitting of the getRegClass call.
455	Pass the map by reference. Probably const reference if you're not modifying it.

addressed @craig.topper 's comments

lib/Target/X86/X86FixupSFB.cpp
128	I think DenseMap is less suitable since it doesn't have a static initializer constructor
450	This weird format is actually clang-format's output, I changed it to make it more readable

craig.topper added inline comments. Feb 8 2018, 2:04 PM

lib/Target/X86/X86FixupSFB.cpp
448	Is MRI->getTargetRegisterInfo() different than the TRI you already have?

fixed TRI @craig.topper

LGTM

This revision is now accepted and ready to land. Feb 11 2018, 12:36 AM

committed in https://reviews.llvm.org/rL324835

Sorry for showing up late, but I was looking at this code because it turns out it is miscompiling code for us. We're working on a test case, but there are really a number of basic LLVM coding convention problems here.

Craig, I hate to say this, but I don't think this should have gotten approved in its current form. It violates several clear guidelines in the LLVM coding standards, and pretty frequently. Maybe try to help folks ramping up on LLVM get their patches up to a higher code quality bar before approving?

lib/Target/X86/X86FixupSFB.cpp
1	I find using only the acronym "SFB" everywhere really confusing. It makes it hard to discover things. Also, the word `Fixup` doesn't really communicate much. Maybe `X86AvoidStoreForwardingBlocks`?
104	Why can't these predicates be generated by tablegen from the instruction definitions? I really don't like having lots of tables in every pass as it creates a maintenance nightmare when adding new instructions and other issues.
128	But this still isn't going to be a constant. It isn't marked constexpr. This will require building a map every time the binary starts. Please don't create globals with complex initializers in LLVM: http://llvm.org/docs/CodingStandards.html#do-not-use-static-constructors Also, std::map is really slow... Why not just build a function with a switch to do this mapping? Even better, could this be generated by tablegen?
248–252	It seems really weird to have getters above for some of these and then to not use them here. If we want a better API for extracting the memory operands, we should build one that can be used everywhere (likely in X86InstrInfo.h) rather than a couple of ad-hoc ones that we end up not using in places like this. =[
272	We don't use all caps constants typically in LLVM. And we almost always expose debug flags for this kind of constant. It should also be at the top of the file and the function comment actually attached to the function rather than the constant.
428–430	Please use early-continue or early-exit to reduce indentation when writing LLVM code: http://llvm.org/docs/CodingStandards.html#use-early-exits-and-continue-to-simplify-code
444	Missing vertical space here.
464	Please follow LLVM's variable naming conventions rather than `inst`. Also `Inst` isn't a very idiomatic variable name. But worse, it is actively confusing. What is it? I would assume an instruction, perhaps a `MachineInstr`, but it isn't. It is a pair of int and unsigned?? I have no idea what this variable means.
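
To illustrate the early-continue style requested in the comment on lines 428–430, here is a minimal sketch. `Instr` and `collectCandidates` are hypothetical stand-ins, not LLVM types or functions from the patch; the point is only that each rejected case leaves the loop body immediately, so the main work stays at one indentation level.

```cpp
#include <vector>

// Hypothetical stand-in for a machine instruction.
struct Instr {
  bool IsLoad;
  bool HasOneUse;
  int Id;
};

// Collects the ids of loads whose result has a single use.
std::vector<int> collectCandidates(const std::vector<Instr> &Block) {
  std::vector<int> Candidates;
  for (const Instr &I : Block) {
    if (!I.IsLoad)
      continue; // not a load: nothing to inspect
    if (!I.HasOneUse)
      continue; // loaded value is used elsewhere: skip
    Candidates.push_back(I.Id);
  }
  return Candidates;
}
```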

@chandlerc Thanks for your comments.
still checking the option of using tablegen for some of the tables.
Any luck with the reproducer?

lib/Target/X86/X86FixupSFB.cpp
128	back to using switch functions for this.
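
A switch-based mapping of the kind referred to here might look like the following sketch. The `Op` enumerators are hypothetical stand-ins for the tablegen-generated `X86::` opcodes so that the example compiles on its own, and the function name `getYMMtoXMMLoadOpcode` is chosen for illustration; the pairs follow the first entries of the `YMMtoXMMLoadMap` table in Diff 133786.

```cpp
// Hypothetical stand-ins for the tablegen-generated X86:: opcode values.
namespace Op {
enum : unsigned {
  VMOVUPSrm,
  VMOVUPDrm,
  VMOVUPSYrm,
  VMOVAPSYrm,
  VMOVUPDYrm,
  VMOVAPDYrm,
  Invalid
};
} // namespace Op

// Maps a 256-bit (YMM) load opcode to the unaligned 128-bit (XMM) load.
// No global object is constructed at program start, unlike a std::map global.
static unsigned getYMMtoXMMLoadOpcode(unsigned LoadOpcode) {
  switch (LoadOpcode) {
  case Op::VMOVUPSYrm:
  case Op::VMOVAPSYrm:
    return Op::VMOVUPSrm;
  case Op::VMOVUPDYrm:
  case Op::VMOVAPDYrm:
    return Op::VMOVUPDrm;
  default:
    return Op::Invalid; // a real pass would more likely assert here
  }
}
```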

Fix the reported bug (added case to bug 36346, based on code provided by Richard Smith)
and limit the transformation to 64-bit only

lsaba reopened this revision. Mar 19 2018, 8:30 AM

This revision is now accepted and ready to land. Mar 19 2018, 8:30 AM

fix comments from @niravd and @craig.topper in review https://reviews.llvm.org/D43619#inline-381377 (closing that one)

ping @ reviewers, if there are no more comments i'd like to re-submit this

The changes from just D43619 look correct but I haven't looked deeply at this part of the patch. From the comments it looks like the patch is a bit out of date as std::map should have been converted back to switch statements.

Minor comment: I got tripped up on remembering what StoreForwardingBlock meant the first time I saw it. I think Store-Forward Blocking is a more intuitive name than StoreForwardingBlock.

hintonda removed a subscriber: hintonda. Mar 27 2018, 11:53 AM

In D41330#1049332, @niravd wrote:

The changes from just D43619 look correct but I haven't looked deeply at this part of the patch. From the comments it looks like the patch is a bit out of date as std::map should have been converted back to switch statements.

The switch statements are back, are you referring to the DisplacementSizeMap? this one needs to be sorted, that's why i'm using std::map

craig.topper added inline comments. Mar 28 2018, 12:17 PM

lib/Target/X86/X86AvoidStoreForwardingBlocks.cpp
60 ↗	(On Diff #139434)	Prefix command line option strings with "x86-"
344 ↗	(On Diff #139434)	Should this be called BlockCnt instead of BlockLimit. It itself is not a limit.
345 ↗	(On Diff #139434)	Why do we need InspectionLimit? Can't we use x86AvoidSFBInspectionLimit everywhere? Or at least make InspectionLimit const. Naively it looks like the limit might be changed in the function.
346 ↗	(On Diff #139434)	LI isn't a great name for this iterator. The name seems to have been chosen because it starts at LoadInst, but that's not meaningful once you walk away from LoadInst.
347 ↗	(On Diff #139434)	This won't visit the first instruction in the block. And it visits LoadInst. Is that intended?
368 ↗	(On Diff #139434)	PredCnt?
544 ↗	(On Diff #139434)	Use std::make_pair so you don't need to be explicit with the types.
578 ↗	(On Diff #139434)	Capitalize
638 ↗	(On Diff #139434)	Variables should be capitalized.
697 ↗	(On Diff #139434)	BlockingStoresDispSizeMap.empty()
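
Two of the suggestions above (lines 544 and 697) are about standard-library idioms and can be shown in isolation. In this sketch `DisplacementSizeMap` is assumed to be an alias for `std::map<int64_t, unsigned>`, and `recordStore` and `hasBlockingStores` are hypothetical helpers, not functions from the patch.

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Assumed alias: displacement of a blocking store mapped to its size in bytes.
using DisplacementSizeMap = std::map<int64_t, unsigned>;

// std::make_pair deduces the pair's element types from its arguments, so the
// call site does not need to spell out std::pair<int64_t, unsigned>.
void recordStore(DisplacementSizeMap &Map, int64_t Disp, unsigned Size) {
  Map.insert(std::make_pair(Disp, Size));
}

// empty() states the intent directly, instead of comparing size() with 0.
bool hasBlockingStores(const DisplacementSizeMap &Map) {
  return !Map.empty();
}
```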

lsaba marked 10 inline comments as done. Mar 29 2018, 7:09 AM

lsaba added inline comments.

lib/Target/X86/X86AvoidStoreForwardingBlocks.cpp
345 ↗	(On Diff #139434)	I did it for readability purposes only
347 ↗	(On Diff #139434)	Nope, fixed.

Updated with @craig.topper comments

craig.topper added inline comments. Mar 29 2018, 10:38 AM

lib/Target/X86/X86AvoidStoreForwardingBlocks.cpp
347 ↗	(On Diff #140236)	I don't think you want std::next here. reverse_iterator already makes it point to the instruction before LoadInst. So now I think you skip that instruction.

lsaba added inline comments. Mar 30 2018, 12:30 AM

lib/Target/X86/X86AvoidStoreForwardingBlocks.cpp
347 ↗	(On Diff #140236)	are you sure? I've double checked, without std::next the iterator starts from the load instruction
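
The question in this exchange rests on how `std::reverse_iterator` behaves in the standard library, which the sketch below demonstrates on a `std::list`. It only illustrates the standard-library semantics; lsaba's reply reports different behaviour for the `MachineBasicBlock` reverse iterator, so the LLVM iterator should be checked against the LLVM version in use rather than inferred from this example. The function `firstVisited` is a hypothetical helper.

```cpp
#include <iterator>
#include <list>

// Returns the first element seen when walking backwards from Pos with a
// std::reverse_iterator built directly from Pos. Pos must not be begin(),
// and with UseNext there must be at least two elements before Pos.
int firstVisited(std::list<int>::const_iterator Pos, bool UseNext) {
  std::reverse_iterator<std::list<int>::const_iterator> RIt(Pos);
  if (UseNext)
    RIt = std::next(RIt); // steps one more element towards the front
  // A std::reverse_iterator dereferences to the element before its base.
  return *RIt;
}
```

With a list holding 10, 20, 30 and `Pos` pointing at 30, the plain reverse iterator yields 20, and adding `std::next` yields 10, so under these semantics the extra `std::next` would skip one element.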

LGTM

Closed by commit rL328973: [X86] Reduce Store Forward Block issues in HW - Recommit after fixing Bug 36346 (authored by lsaba). Apr 2 2018, 6:53 AM

This revision was automatically updated to reflect the committed changes.

This seems to break the Machine Verifier. I filed PR37153. Do you mind taking a look please? Thanks!

In D41330#1069883, @thegameg wrote:

This seems to break the Machine Verifier. I filed PR37153. Do you mind taking a look please? Thanks!

Sure, taking a look.

In D41330#1069883, @thegameg wrote:

This seems to break the Machine Verifier. I filed PR37153. Do you mind taking a look please? Thanks!

Please see https://reviews.llvm.org/D45823
This should hopefully solve it

Revision Contents

Path	Size
lib/Target/X86/CMakeLists.txt	1 line
lib/Target/X86/X86.h	3 lines
lib/Target/X86/X86FixupSFB.cpp	558 lines
lib/Target/X86/X86TargetMachine.cpp	1 line
test/CodeGen/X86/fixup-sfb.ll	1373 lines

Diff 133786

lib/Target/X86/CMakeLists.txt

(25 lines hidden) set(sources
  X86CallingConv.cpp
  X86CallLowering.cpp
  X86CmovConversion.cpp
  X86DomainReassignment.cpp
  X86ExpandPseudo.cpp
  X86FastISel.cpp
  X86FixupBWInsts.cpp
  X86FixupLEAs.cpp
+ X86FixupSFB.cpp
  X86FixupSetCC.cpp
  X86FloatingPoint.cpp
  X86FrameLowering.cpp
  X86InstructionSelector.cpp
  X86ISelDAGToDAG.cpp
  X86ISelLowering.cpp
  X86IndirectBranchTracking.cpp
  X86InterleavedAccess.cpp
(31 lines hidden)

lib/Target/X86/X86.h

(64 lines hidden)

  /// Return a pass that removes redundant LEA instructions and redundant address
  /// recalculations.
  FunctionPass *createX86OptimizeLEAs();

  /// Return a pass that transforms setcc + movzx pairs into xor + setcc.
  FunctionPass *createX86FixupSetCC();

+ /// Return a pass that avoids creating store forward block issues in the hardware.
+ FunctionPass *createX86FixupSFB();
+
  /// Return a pass that expands WinAlloca pseudo-instructions.
  FunctionPass *createX86WinAllocaExpander();

  /// Return a pass that optimizes the code-size of x86 call sequences. This is
  /// done by replacing esp-relative movs with pushes.
  FunctionPass *createX86CallFrameOptimization();

  /// Return an IR pass that inserts EH registration stack objects and explicit
(41 lines hidden)

lib/Target/X86/X86FixupSFB.cpp

This file was added.

//===- X86FixupSFB.cpp - Avoid HW Store Forward Block issues -----------===//
//
// The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// If a load follows a store and reloads data that the store has written to
// memory, Intel microarchitectures can in many cases forward the data directly
// from the store to the load. This "store forwarding" saves cycles by enabling
// the load to directly obtain the data instead of accessing the data from
// cache or memory.
// A "store forward block" occurs in cases that a store cannot be forwarded to
// the load. The most typical case of store forward block on Intel Core
// microarchitecture is that a small store cannot be forwarded to a large load.
// The estimated penalty for a store forward block is ~13 cycles.
//
// This pass tries to recognize and handle cases where "store forward block"
// is created by the compiler when lowering memcpy calls to a sequence
// of a load and a store.
//
// The pass currently only handles cases where memcpy is lowered to
// XMM/YMM registers, it tries to break the memcpy into smaller copies.
// breaking the memcpy should be possible since there is no atomicity
// guarantee for loads and stores to XMM/YMM.
//
// It could be better for performance to solve the problem by loading
// to XMM/YMM then inserting the partial store before storing back from XMM/YMM
// to memory, but this will result in a more conservative optimization since it
// requires we prove that all memory accesses between the blocking store and the
// load must alias/don't alias before we can move the store, whereas the
// transformation done here is correct regardless to other memory accesses.
//===----------------------------------------------------------------------===//

#include "X86InstrInfo.h"
#include "X86Subtarget.h"
#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineOperand.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/IR/DebugInfoMetadata.h"
#include "llvm/IR/DebugLoc.h"
#include "llvm/IR/Function.h"
#include "llvm/MC/MCInstrDesc.h"

using namespace llvm;

#define DEBUG_TYPE "x86-fixup-SFB"

static cl::opt<bool> DisableX86FixupSFB("disable-fixup-SFB", cl::Hidden,
                                        cl::desc("X86: Disable SFB fixup."),
                                        cl::init(false));
namespace {

class FixupSFBPass : public MachineFunctionPass {
public:
  FixupSFBPass() : MachineFunctionPass(ID) {}

  StringRef getPassName() const override {
    return "X86 Fixup Store Forward Block";
  }

  bool runOnMachineFunction(MachineFunction &MF) override;

private:
  MachineRegisterInfo *MRI;
  const X86InstrInfo *TII;
  const X86RegisterInfo *TRI;
  SmallVector<std::pair<MachineInstr *, MachineInstr *>, 2> BlockedLoadsStores;
  SmallVector<MachineInstr *, 2> ForRemoval;

  /// \brief Returns couples of Load then Store to memory which look
  /// like a memcpy.
  void findPotentiallylBlockedCopies(MachineFunction &MF);
  /// \brief Break the memcpy's load and store into smaller copies
  /// such that each memory load that was blocked by a smaller store
  /// would now be copied separately.
  void
  breakBlockedCopies(MachineInstr *LoadInst, MachineInstr *StoreInst,
                     const std::map<int64_t, unsigned> &BlockingStoresDisp);
  /// \brief Break a copy of size Size to smaller copies.
  void buildCopies(int Size, MachineInstr *LoadInst, int64_t LdDispImm,
                   MachineInstr *StoreInst, int64_t StDispImm);

  void buildCopy(MachineInstr *LoadInst, unsigned NLoadOpcode, int64_t LoadDisp,
                 MachineInstr *StoreInst, unsigned NStoreOpcode,
                 int64_t StoreDisp, unsigned Size);

  unsigned getRegSizeInBytes(MachineInstr *Inst);
  static char ID;
};

} // end anonymous namespace

char FixupSFBPass::ID = 0;

FunctionPass *llvm::createX86FixupSFB() { return new FixupSFBPass(); }

static bool isXMMLoadOpcode(unsigned Opcode) {
  return Opcode == X86::MOVUPSrm || Opcode == X86::MOVAPSrm ||
         Opcode == X86::VMOVUPSrm || Opcode == X86::VMOVAPSrm ||
         Opcode == X86::VMOVUPDrm || Opcode == X86::VMOVAPDrm ||
         Opcode == X86::VMOVDQUrm || Opcode == X86::VMOVDQArm ||
         Opcode == X86::VMOVUPSZ128rm || Opcode == X86::VMOVAPSZ128rm ||
         Opcode == X86::VMOVUPDZ128rm || Opcode == X86::VMOVAPDZ128rm ||
         Opcode == X86::VMOVDQU64Z128rm || Opcode == X86::VMOVDQA64Z128rm ||
         Opcode == X86::VMOVDQU32Z128rm || Opcode == X86::VMOVDQA32Z128rm;
}
static bool isYMMLoadOpcode(unsigned Opcode) {
  return Opcode == X86::VMOVUPSYrm || Opcode == X86::VMOVAPSYrm ||
         Opcode == X86::VMOVUPDYrm || Opcode == X86::VMOVAPDYrm ||
         Opcode == X86::VMOVDQUYrm || Opcode == X86::VMOVDQAYrm ||
         Opcode == X86::VMOVUPSZ256rm || Opcode == X86::VMOVAPSZ256rm ||
         Opcode == X86::VMOVUPDZ256rm || Opcode == X86::VMOVAPDZ256rm ||
         Opcode == X86::VMOVDQU64Z256rm || Opcode == X86::VMOVDQA64Z256rm ||
         Opcode == X86::VMOVDQU32Z256rm || Opcode == X86::VMOVDQA32Z256rm;
}

static bool isPotentialBlockedMemCpyLd(unsigned Opcode) {
  return isXMMLoadOpcode(Opcode) || isYMMLoadOpcode(Opcode);
}

std::map<unsigned, std::pair<unsigned, unsigned>> PotentialBlockedMemCpy{
    {X86::MOVUPSrm, {X86::MOVUPSmr, X86::MOVAPSmr}},
    {X86::MOVAPSrm, {X86::MOVUPSmr, X86::MOVAPSmr}},
    {X86::VMOVUPSrm, {X86::VMOVUPSmr, X86::VMOVAPSmr}},
    {X86::VMOVAPSrm, {X86::VMOVUPSmr, X86::VMOVAPSmr}},
    {X86::VMOVUPDrm, {X86::VMOVUPDmr, X86::VMOVAPDmr}},
    {X86::VMOVAPDrm, {X86::VMOVUPDmr, X86::VMOVAPDmr}},
    {X86::VMOVDQUrm, {X86::VMOVDQUmr, X86::VMOVDQAmr}},
    {X86::VMOVDQArm, {X86::VMOVDQUmr, X86::VMOVDQAmr}},
    {X86::VMOVUPSZ128rm, {X86::VMOVUPSZ128mr, X86::VMOVAPSZ128mr}},
    {X86::VMOVAPSZ128rm, {X86::VMOVUPSZ128mr, X86::VMOVAPSZ128mr}},
    {X86::VMOVUPDZ128rm, {X86::VMOVUPDZ128mr, X86::VMOVAPDZ128mr}},
    {X86::VMOVAPDZ128rm, {X86::VMOVUPDZ128mr, X86::VMOVAPDZ128mr}},
    {X86::VMOVUPSYrm, {X86::VMOVUPSYmr, X86::VMOVAPSYmr}},
    {X86::VMOVAPSYrm, {X86::VMOVUPSYmr, X86::VMOVAPSYmr}},
    {X86::VMOVUPDYrm, {X86::VMOVUPDYmr, X86::VMOVAPDYmr}},
    {X86::VMOVAPDYrm, {X86::VMOVUPDYmr, X86::VMOVAPDYmr}},
    {X86::VMOVDQUYrm, {X86::VMOVDQUYmr, X86::VMOVDQAYmr}},
    {X86::VMOVDQAYrm, {X86::VMOVDQUYmr, X86::VMOVDQAYmr}},
    {X86::VMOVUPSZ256rm, {X86::VMOVUPSZ256mr, X86::VMOVAPSZ256mr}},
    {X86::VMOVAPSZ256rm, {X86::VMOVUPSZ256mr, X86::VMOVAPSZ256mr}},
    {X86::VMOVUPDZ256rm, {X86::VMOVUPDZ256mr, X86::VMOVAPDZ256mr}},
    {X86::VMOVAPDZ256rm, {X86::VMOVUPDZ256mr, X86::VMOVAPDZ256mr}},
    {X86::VMOVDQU64Z128rm, {X86::VMOVDQU64Z128mr, X86::VMOVDQA64Z128mr}},
    {X86::VMOVDQA64Z128rm, {X86::VMOVDQU64Z128mr, X86::VMOVDQA64Z128mr}},
    {X86::VMOVDQU32Z128rm, {X86::VMOVDQU32Z128mr, X86::VMOVDQA32Z128mr}},
    {X86::VMOVDQA32Z128rm, {X86::VMOVDQU32Z128mr, X86::VMOVDQA32Z128mr}},
    {X86::VMOVDQU64Z256rm, {X86::VMOVDQU64Z256mr, X86::VMOVDQA64Z256mr}},
    {X86::VMOVDQA64Z256rm, {X86::VMOVDQU64Z256mr, X86::VMOVDQA64Z256mr}},
    {X86::VMOVDQU32Z256rm, {X86::VMOVDQU32Z256mr, X86::VMOVDQA32Z256mr}},
    {X86::VMOVDQA32Z256rm, {X86::VMOVDQU32Z256mr, X86::VMOVDQA32Z256mr}},
};

				static bool isPotentialBlockedMemCpyPair(unsigned LdOpcode, unsigned StOpcode) {
				auto PotentialStores = PotentialBlockedMemCpy.at(LdOpcode);
				return PotentialStores.first == StOpcode ||
				PotentialStores.second == StOpcode;
				}

				static bool isPotentialBlockingStoreInst(int Opcode, int LoadOpcode) {
				bool PBlock = false;
				PBlock |= Opcode == X86::MOV64mr || Opcode == X86::MOV64mi32 ||
				Opcode == X86::MOV32mr || Opcode == X86::MOV32mi ||
				Opcode == X86::MOV16mr || Opcode == X86::MOV16mi ||
				Opcode == X86::MOV8mr || Opcode == X86::MOV8mi;
				if (isYMMLoadOpcode(LoadOpcode))
				PBlock |= Opcode == X86::VMOVUPSmr || Opcode == X86::VMOVAPSmr ||
				Opcode == X86::VMOVUPDmr || Opcode == X86::VMOVAPDmr ||
				Opcode == X86::VMOVDQUmr || Opcode == X86::VMOVDQAmr ||
				Opcode == X86::VMOVUPSZ128mr || Opcode == X86::VMOVAPSZ128mr ||
				Opcode == X86::VMOVUPDZ128mr || Opcode == X86::VMOVAPDZ128mr ||
				Opcode == X86::VMOVDQU64Z128mr ||
				Opcode == X86::VMOVDQA64Z128mr ||
				Opcode == X86::VMOVDQU32Z128mr || Opcode == X86::VMOVDQA32Z128mr;
				return PBlock;
				}

				static const int MOV128SZ = 16;
				static const int MOV64SZ = 8;
				static const int MOV32SZ = 4;
				static const int MOV16SZ = 2;
				static const int MOV8SZ = 1;

				std::map<unsigned, unsigned> YMMtoXMMLoadMap = {
				{X86::VMOVUPSYrm, X86::VMOVUPSrm},
				{X86::VMOVAPSYrm, X86::VMOVUPSrm},
				{X86::VMOVUPDYrm, X86::VMOVUPDrm},
				{X86::VMOVAPDYrm, X86::VMOVUPDrm},
				{X86::VMOVDQUYrm, X86::VMOVDQUrm},
				{X86::VMOVDQAYrm, X86::VMOVDQUrm},
				{X86::VMOVUPSZ256rm, X86::VMOVUPSZ128rm},
				{X86::VMOVAPSZ256rm, X86::VMOVUPSZ128rm},
				{X86::VMOVUPDZ256rm, X86::VMOVUPDZ128rm},
				{X86::VMOVAPDZ256rm, X86::VMOVUPDZ128rm},
				{X86::VMOVDQU64Z256rm, X86::VMOVDQU64Z128rm},
				{X86::VMOVDQA64Z256rm, X86::VMOVDQU64Z128rm},
				{X86::VMOVDQU32Z256rm, X86::VMOVDQU32Z128rm},
				{X86::VMOVDQA32Z256rm, X86::VMOVDQU32Z128rm},
				};

				std::map<unsigned, unsigned> YMMtoXMMStoreMap = {
				{X86::VMOVUPSYmr, X86::VMOVUPSmr},
				{X86::VMOVAPSYmr, X86::VMOVUPSmr},
				{X86::VMOVUPDYmr, X86::VMOVUPDmr},
				{X86::VMOVAPDYmr, X86::VMOVUPDmr},
				{X86::VMOVDQUYmr, X86::VMOVDQUmr},
				{X86::VMOVDQAYmr, X86::VMOVDQUmr},
				{X86::VMOVUPSZ256mr, X86::VMOVUPSZ128mr},
				{X86::VMOVAPSZ256mr, X86::VMOVUPSZ128mr},
				{X86::VMOVUPDZ256mr, X86::VMOVUPDZ128mr},
				{X86::VMOVAPDZ256mr, X86::VMOVUPDZ128mr},
				{X86::VMOVDQU64Z256mr, X86::VMOVDQU64Z128mr},
				{X86::VMOVDQA64Z256mr, X86::VMOVDQU64Z128mr},
				{X86::VMOVDQU32Z256mr, X86::VMOVDQU32Z128mr},
				{X86::VMOVDQA32Z256mr, X86::VMOVDQU32Z128mr},
				craig.topper (Done): Remove the parentheses here.
				};

				static int getAddrOffset(MachineInstr *MI) {
				hfinkel (Not Done): Can you elaborate on what you're seeing? Missing MMOs are something we should fix. If the MMO size is too small, that's a bug (i.e., it can cause miscompiles from invalid instruction scheduling). MMO sizes that are too big should also be fixed.
				lsaba (Author, Not Done): You are right. I saw missing MMOs. I will try to fix the ones I encountered, but that doesn't guarantee there won't be more in code I did not reach, so I'm not sure it would be safe to remove those defines.
				hfinkel (Done): Okay, thanks (although you should use something other than `#define`s: constants, a function with a switch statement, etc.).
				const MCInstrDesc &Descl = MI->getDesc();
				int AddrOffset = X86II::getMemoryOperandNo(Descl.TSFlags);
				assert(AddrOffset != -1 && "Expected Memory Operand");
				AddrOffset += X86II::getOperandBias(Descl);
				return AddrOffset;
				}

				static MachineOperand &getBaseOperand(MachineInstr *MI) {
				craig.topper (Done): Return a reference?
				int AddrOffset = getAddrOffset(MI);
				return MI->getOperand(AddrOffset + X86::AddrBaseReg);
				}

				static MachineOperand &getDispOperand(MachineInstr *MI) {
				craig.topper (Done): Return a reference here too.
				int AddrOffset = getAddrOffset(MI);
				return MI->getOperand(AddrOffset + X86::AddrDisp);
				}

				// Relevant addressing modes contain only base register and immediate
				// displacement or frameindex and immediate displacement.
				// TODO: Consider expanding to other addressing modes in the future
				craig.topper (Done): Use a "static const unsigned" variable instead of a #define.
				static bool isRelevantAddressingMode(MachineInstr *MI) {
				int AddrOffset = getAddrOffset(MI);
				craig.topper (Done): Add a comment about what's considered a relevant addressing mode?
				MachineOperand &Base = MI->getOperand(AddrOffset + X86::AddrBaseReg);
				MachineOperand &Disp = MI->getOperand(AddrOffset + X86::AddrDisp);
				MachineOperand &Scale = MI->getOperand(AddrOffset + X86::AddrScaleAmt);
				MachineOperand &Index = MI->getOperand(AddrOffset + X86::AddrIndexReg);
				MachineOperand &Segment = MI->getOperand(AddrOffset + X86::AddrSegmentReg);
				chandlerc (Not Done): It seems really weird to have getters above for some of these and then to not use them here. If we want a better API for extracting the memory operands, we should build one that can be used everywhere (likely in X86InstrInfo.h) rather than a couple of ad-hoc ones that we end up not using in places like this. =[

				if (!((Base.isReg() && Base.getReg() != X86::NoRegister) || Base.isFI()))
				return false;
				craig.topper (Done): Why can't these be MachineOperand references? Why are they assigned to nullptr and then immediately overwritten?
				craig.topper (Done): I don't think this was addressed. Why aren't these references instead of pointers?
				if (!Disp.isImm())
				return false;
				if (Scale.getImm() != 1)
				return false;
				if (!(Index.isReg() && Index.getReg() == X86::NoRegister))
				return false;
				if (!(Segment.isReg() && Segment.getReg() == X86::NoRegister))
				return false;
				return true;
				}

				// Collect potentially blocking stores.
				// Limit the number of instructions backwards we want to inspect
				// since the effect of store block won't be visible if the store
				// and load instructions have enough instructions in between to
				// keep the core busy.
				static const unsigned LIMIT = 20;
				chandlerc (Done): We don't use all caps constants typically in LLVM. And we almost always expose debug flags for this kind of constant. It should also be at the top of the file and the function comment actually attached to the function rather than the constant.
				static SmallVector<MachineInstr *, 2>
				findPotentialBlockers(MachineInstr *LoadInst) {
				SmallVector<MachineInstr *, 2> PotentialBlockers;
				unsigned BlockLimit = 0;
				for (MachineBasicBlock::iterator LI = LoadInst,
				BB = LoadInst->getParent()->begin();
				LI != BB; --LI) {
				BlockLimit++;
				if (BlockLimit >= LIMIT)
				break;
				MachineInstr &MI = *LI;
				if (MI.getDesc().isCall())
				break;
				PotentialBlockers.push_back(&MI);
				}
				// If we didn't get to the instructions limit try predecessing blocks.
				// Ideally we should traverse the predecessor blocks in depth with some
				// coloring algorithm, but for now let's just look at the first order
				// predecessors.
				if (BlockLimit < LIMIT) {
				MachineBasicBlock *MBB = LoadInst->getParent();
				int LimitLeft = LIMIT - BlockLimit;
				for (MachineBasicBlock::pred_iterator PB = MBB->pred_begin(),
				PE = MBB->pred_end();
				PB != PE; ++PB) {
				MachineBasicBlock *PMBB = *PB;
				int PredLimit = 0;
				for (MachineBasicBlock::reverse_iterator PMI = PMBB->rbegin(),
				PME = PMBB->rend();
				PMI != PME; ++PMI) {
				PredLimit++;
				if (PredLimit >= LimitLeft)
				break;
				if (PMI->getDesc().isCall())
				break;
				PotentialBlockers.push_back(&*PMI);
				}
				}
				}
				return PotentialBlockers;
				}

				void FixupSFBPass::buildCopy(MachineInstr *LoadInst, unsigned NLoadOpcode,
				int64_t LoadDisp, MachineInstr *StoreInst,
				unsigned NStoreOpcode, int64_t StoreDisp,
				unsigned Size) {
				MachineOperand &LoadBase = getBaseOperand(LoadInst);
				MachineOperand &StoreBase = getBaseOperand(StoreInst);
				MachineBasicBlock *MBB = LoadInst->getParent();
				MachineMemOperand *LMMO = *LoadInst->memoperands_begin();
				MachineMemOperand *SMMO = *StoreInst->memoperands_begin();

				unsigned Reg1 = MRI->createVirtualRegister(
				TII->getRegClass(TII->get(NLoadOpcode), 0, TRI, *(MBB->getParent())));
				BuildMI(*MBB, LoadInst, LoadInst->getDebugLoc(), TII->get(NLoadOpcode), Reg1)
				.add(LoadBase)
				.addImm(1)
				.addReg(X86::NoRegister)
				.addImm(LoadDisp)
				.addReg(X86::NoRegister)
				.addMemOperand(MBB->getParent()->getMachineMemOperand(
				LMMO->getPointerInfo(), LMMO->getFlags(), Size, 0));
				DEBUG(LoadInst->getPrevNode()->dump());
				// If the load and store are consecutive, use the loadInst location to
				// reduce register pressure.
				MachineInstr *StInst = StoreInst;
				if (StoreInst->getPrevNode() == LoadInst)
				StInst = LoadInst;
				BuildMI(*MBB, StInst, StInst->getDebugLoc(), TII->get(NStoreOpcode))
				.add(StoreBase)
				.addImm(1)
				.addReg(X86::NoRegister)
				.addImm(StoreDisp)
				.addReg(X86::NoRegister)
				.addReg(Reg1)
				.addMemOperand(MBB->getParent()->getMachineMemOperand(
				SMMO->getPointerInfo(), SMMO->getFlags(), Size, 0));
				DEBUG(StInst->getPrevNode()->dump());
				}

				void FixupSFBPass::buildCopies(int Size, MachineInstr *LoadInst,
				int64_t LdDispImm, MachineInstr *StoreInst,
				int64_t StDispImm) {
				int LdDisp = LdDispImm;
				int StDisp = StDispImm;
				while (Size > 0) {
				if ((Size - MOV128SZ >= 0) && isYMMLoadOpcode(LoadInst->getOpcode())) {
				Size = Size - MOV128SZ;
				buildCopy(LoadInst, YMMtoXMMLoadMap.at(LoadInst->getOpcode()), LdDisp,
				StoreInst, YMMtoXMMStoreMap.at(StoreInst->getOpcode()), StDisp,
				MOV128SZ);
				LdDisp += MOV128SZ;
				StDisp += MOV128SZ;
				continue;
				}
				if (Size - MOV64SZ >= 0) {
				Size = Size - MOV64SZ;
				buildCopy(LoadInst, X86::MOV64rm, LdDisp, StoreInst, X86::MOV64mr, StDisp,
				MOV64SZ);
				LdDisp += MOV64SZ;
				StDisp += MOV64SZ;
				continue;
				}
				if (Size - MOV32SZ >= 0) {
				Size = Size - MOV32SZ;
				buildCopy(LoadInst, X86::MOV32rm, LdDisp, StoreInst, X86::MOV32mr, StDisp,
				MOV32SZ);
				LdDisp += MOV32SZ;
				StDisp += MOV32SZ;
				continue;
				}
				if (Size - MOV16SZ >= 0) {
				Size = Size - MOV16SZ;
				buildCopy(LoadInst, X86::MOV16rm, LdDisp, StoreInst, X86::MOV16mr, StDisp,
				MOV16SZ);
				LdDisp += MOV16SZ;
				StDisp += MOV16SZ;
				continue;
				}
				if (Size - MOV8SZ >= 0) {
				Size = Size - MOV8SZ;
				buildCopy(LoadInst, X86::MOV8rm, LdDisp, StoreInst, X86::MOV8mr, StDisp,
				MOV8SZ);
				LdDisp += MOV8SZ;
				StDisp += MOV8SZ;
				continue;
				}
				}
				assert(Size == 0 && "Wrong size division");
				}

				static void updateKillStatus(MachineInstr *LoadInst, MachineInstr *StoreInst) {
				MachineOperand &LoadBase = getBaseOperand(LoadInst);
				MachineOperand &StoreBase = getBaseOperand(StoreInst);
				if (LoadBase.isReg()) {
				MachineInstr *LastLoad = LoadInst->getPrevNode();
				// If the original load and store to xmm/ymm were consecutive
				// then the partial copies were also created in
				// a consecutive order to reduce register pressure,
				// and the location of the last load is before the last store.
				if (StoreInst->getPrevNode() == LoadInst)
				LastLoad = LoadInst->getPrevNode()->getPrevNode();
				getBaseOperand(LastLoad).setIsKill(LoadBase.isKill());
				}
				if (StoreBase.isReg()) {
				MachineInstr *StInst = StoreInst;
				if (StoreInst->getPrevNode() == LoadInst)
				StInst = LoadInst;
				getBaseOperand(StInst->getPrevNode()).setIsKill(StoreBase.isKill());
				}
				}

				void FixupSFBPass::findPotentiallylBlockedCopies(MachineFunction &MF) {
				for (auto &MBB : MF)
				craig.topper (Done): Potentially is misspelled.
				for (auto &MI : MBB)
				if (isPotentialBlockedMemCpyLd(MI.getOpcode())) {
				int DefVR = MI.getOperand(0).getReg();
				if (MRI->hasOneUse(DefVR))
				chandlerc (Done): Please use early-continue or early-exit to reduce indentation when writing LLVM code: http://llvm.org/docs/CodingStandards.html#use-early-exits-and-continue-to-simplify-code
				for (auto UI = MRI->use_nodbg_begin(DefVR), UE = MRI->use_nodbg_end();
				UI != UE;) {
				MachineOperand &StoreMO = *UI++;
				MachineInstr &StoreMI = *StoreMO.getParent();
				if (isPotentialBlockedMemCpyPair(MI.getOpcode(),
				StoreMI.getOpcode()) &&
				(StoreMI.getParent() == MI.getParent()))
				if (isRelevantAddressingMode(&MI) &&
				isRelevantAddressingMode(&StoreMI))
				BlockedLoadsStores.push_back(
				std::pair<MachineInstr *, MachineInstr *>(&MI, &StoreMI));
				}
				}
				}
				chandlerc (Done): Missing vertical space here.
				unsigned FixupSFBPass::getRegSizeInBytes(MachineInstr *LoadInst) {
				auto TRC = TII->getRegClass(TII->get(LoadInst->getOpcode()), 0, TRI,
				*LoadInst->getParent()->getParent());
				return TRI->getRegSizeInBits(*TRC) / 8;
				craig.topper (Done): Is MRI->getTargetRegisterInfo() different than the TRI you already have?
				}

				craig.topper (Not Done): Can you clang-format at least this function? That 8 looks really out of place. So does the splitting of the getRegClass call.
				lsaba (Author, Not Done): This weird format is actually clang-format's output. I changed it to make it more readable.
				void FixupSFBPass::breakBlockedCopies(
				MachineInstr *LoadInst, MachineInstr *StoreInst,
				const std::map<int64_t, unsigned> &BlockingStoresDisp) {
				int64_t LdDispImm = getDispOperand(LoadInst).getImm();
				int64_t StDispImm = getDispOperand(StoreInst).getImm();
				craig.topper (Done): Pass the map by reference. Probably const reference if you're not modifying it.

				int64_t LdDisp1 = LdDispImm;
				int64_t LdDisp2 = 0;
				int64_t StDisp1 = StDispImm;
				craig.topper (Done): Capitalize variable names.
				int64_t StDisp2 = 0;
				unsigned Size1 = 0;
				unsigned Size2 = 0;
				int64_t LdStDelta = StDispImm - LdDispImm;
				for (auto inst : BlockingStoresDisp) {
				chandlerc (Done): Please follow LLVM's variable naming conventions rather than `inst`. Also `Inst` isn't a very idiomatic variable name. But worse, it is actively confusing. What is it? I would assume an instruction, perhaps a `MachineInstr`, but it isn't. It is a pair of an int and an unsigned?? I have no idea what this variable means.
				LdDisp2 = inst.first;
				StDisp2 = inst.first + LdStDelta;
				Size1 = std::abs(std::abs(LdDisp2) - std::abs(LdDisp1));
				Size2 = inst.second;
				buildCopies(Size1, LoadInst, LdDisp1, StoreInst, StDisp1);
				buildCopies(Size2, LoadInst, LdDisp2, StoreInst, StDisp2);
				LdDisp1 = LdDisp2 + Size2;
				StDisp1 = StDisp2 + Size2;
				}
				unsigned Size3 = (LdDispImm + getRegSizeInBytes(LoadInst)) - LdDisp1;
				buildCopies(Size3, LoadInst, LdDisp1, StoreInst, StDisp1);
				}

				bool FixupSFBPass::runOnMachineFunction(MachineFunction &MF) {
				bool Changed = false;

				if (DisableX86FixupSFB || skipFunction(MF.getFunction()))
				return false;

				MRI = &MF.getRegInfo();
				assert(MRI->isSSA() && "Expected MIR to be in SSA form");
				TII = MF.getSubtarget<X86Subtarget>().getInstrInfo();
				TRI = MF.getSubtarget<X86Subtarget>().getRegisterInfo();

				DEBUG(dbgs() << "Start X86FixupSFB\n";);
				// Look for a load then a store to XMM/YMM which look like a memcpy
				findPotentiallylBlockedCopies(MF);

				for (auto LoadStoreInst : BlockedLoadsStores) {
				MachineInstr *LoadInst = LoadStoreInst.first;
				SmallVector<MachineInstr *, 2> PotentialBlockers =
				findPotentialBlockers(LoadInst);

				MachineOperand &LoadBase = getBaseOperand(LoadInst);
				int64_t LdDispImm = getDispOperand(LoadInst).getImm();
				std::map<int64_t, unsigned> BlockingStoresDisp;
				int LdBaseReg = LoadBase.isReg() ? LoadBase.getReg() : LoadBase.getIndex();

				for (auto PBInst : PotentialBlockers) {
				if (isPotentialBlockingStoreInst(PBInst->getOpcode(),
				LoadInst->getOpcode())) {
				if (!isRelevantAddressingMode(PBInst))
				continue;
				MachineOperand &PBstoreBase = getBaseOperand(PBInst);
				int64_t PBstDispImm = getDispOperand(PBInst).getImm();
				assert(PBInst->hasOneMemOperand() && "Expected One Memory Operand");
				unsigned PBstSize = (*PBInst->memoperands_begin())->getSize();
				int PBstBaseReg =
				PBstoreBase.isReg() ? PBstoreBase.getReg() : PBstoreBase.getIndex();
				// This check doesn't cover all cases, but it will suffice for now.
				// TODO: take branch probability into consideration, if the blocking
				// store is in an unreached block, breaking the memcopy could lose
				// performance.
				if (((LoadBase.isReg() && PBstoreBase.isReg()) ||
				(LoadBase.isFI() && PBstoreBase.isFI())) &&
				LdBaseReg == PBstBaseReg &&
				((PBstDispImm >= LdDispImm) &&
				(PBstDispImm <=
				LdDispImm + (getRegSizeInBytes(LoadInst) - PBstSize)))) {
				if (BlockingStoresDisp.count(PBstDispImm)) {
				if (BlockingStoresDisp[PBstDispImm] > PBstSize)
				BlockingStoresDisp[PBstDispImm] = PBstSize;

				} else
				BlockingStoresDisp[PBstDispImm] = PBstSize;
				}
				}
				}

				if (BlockingStoresDisp.size() == 0)
				continue;

				// We found a store forward block, break the memcpy's load and store
				// into smaller copies such that each smaller store that was causing
				// a store block would now be copied separately.
				MachineInstr *StoreInst = LoadStoreInst.second;
				DEBUG(dbgs() << "Blocked load and store instructions: \n");
				DEBUG(LoadInst->dump());
				DEBUG(StoreInst->dump());
				DEBUG(dbgs() << "Replaced with:\n");
				breakBlockedCopies(LoadInst, StoreInst, BlockingStoresDisp);
				updateKillStatus(LoadInst, StoreInst);
				ForRemoval.push_back(LoadInst);
				ForRemoval.push_back(StoreInst);
				}
				for (auto RemovedInst : ForRemoval) {
				RemovedInst->eraseFromParent();
				}
				ForRemoval.clear();
				BlockedLoadsStores.clear();
				DEBUG(dbgs() << "End X86FixupSFB\n";);

				return Changed;
				}
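
The splitting logic in `runOnMachineFunction`, `breakBlockedCopies` and `buildCopies` can be hard to follow inside the MIR plumbing. The following is a minimal standalone sketch of the same idea, not the pass itself: it uses no LLVM APIs, and the names `Copy`, `recordBlocker`, `planCopies` and `planSplit` are hypothetical helpers introduced only for illustration. It assumes the blocking stores do not overlap each other, and it adds a size guard that the patch does not have.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// One replacement copy: byte displacement from the base and width in bytes.
struct Copy {
  int64_t Disp;
  unsigned Size;
};

// Keeps a candidate store only if it lies entirely inside the load's bytes,
// and remembers the smallest store seen at each displacement.
void recordBlocker(std::map<int64_t, unsigned> &Blockers, int64_t LdDisp,
                   unsigned RegSize, int64_t StDisp, unsigned StSize) {
  if (StSize > RegSize) // extra guard added for this sketch only
    return;
  if (StDisp < LdDisp || StDisp > LdDisp + int64_t(RegSize - StSize))
    return;
  auto It = Blockers.find(StDisp);
  if (It == Blockers.end())
    Blockers[StDisp] = StSize;
  else if (It->second > StSize)
    It->second = StSize;
}

// Greedily covers Size bytes starting at Disp with the widest move allowed.
// A 16-byte move is only used when a 32-byte (YMM) load is being split.
void planCopies(int64_t Disp, unsigned Size, bool AllowXMM,
                std::vector<Copy> &Out) {
  static const unsigned Widths[] = {16, 8, 4, 2, 1};
  while (Size > 0) {
    for (unsigned W : Widths) {
      if ((W == 16 && !AllowXMM) || W > Size)
        continue;
      Out.push_back({Disp, W});
      Disp += W;
      Size -= W;
      break;
    }
  }
}

// Splits a RegSize-byte load at LdDisp so that every blocking store is
// reloaded by a copy of exactly its own position and width.
std::vector<Copy> planSplit(int64_t LdDisp, unsigned RegSize,
                            const std::map<int64_t, unsigned> &Blockers) {
  assert((RegSize == 16 || RegSize == 32) && "XMM or YMM load expected");
  std::vector<Copy> Out;
  const bool AllowXMM = RegSize == 32;
  int64_t Cur = LdDisp;
  for (const auto &B : Blockers) { // std::map iterates in displacement order
    if (B.first < Cur) // overlapping stores are outside this sketch
      continue;
    planCopies(Cur, unsigned(B.first - Cur), AllowXMM, Out); // gap before store
    planCopies(B.first, B.second, AllowXMM, Out);            // the store itself
    Cur = B.first + B.second;
  }
  planCopies(Cur, unsigned(LdDisp + RegSize - Cur), AllowXMM, Out); // tail
  return Out;
}
```

For a 16-byte load at displacement 0 with one 4-byte store at displacement 4, `planSplit` yields copies of 4, 4 and 8 bytes at displacements 0, 4 and 8. That corresponds to the `movl`/`movl`/`movq` sequence in the CHECK lines of `test_conditional_block` in fixup-sfb.ll.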

lib/Target/X86/X86TargetMachine.cpp

				}

				void X86PassConfig::addPreRegAlloc() {
				if (getOptLevel() != CodeGenOpt::None) {
				addPass(&LiveRangeShrinkID);
				addPass(createX86FixupSetCC());
				addPass(createX86OptimizeLEAs());
				addPass(createX86CallFrameOptimization());
				addPass(createX86FixupSFB());
				}

				addPass(createX86WinAllocaExpander());
				}
				void X86PassConfig::addMachineSSAOptimization() {
				addPass(createX86DomainReassignmentPass());
				TargetPassConfig::addMachineSSAOptimization();
				}

test/CodeGen/X86/fixup-sfb.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc < %s -mtriple=x86_64-linux | FileCheck %s -check-prefix=CHECK
				; RUN: llc < %s -mtriple=x86_64-linux --disable-fixup-SFB | FileCheck %s --check-prefix=DISABLED
				; RUN: llc < %s -mtriple=x86_64-linux -mcpu=core-avx2 | FileCheck %s -check-prefix=CHECK-AVX2
				; RUN: llc < %s -mtriple=x86_64-linux -mcpu=skx | FileCheck %s -check-prefix=CHECK-AVX512
				craig.topper (Done): Add avx512vl command line?

				target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"
				target triple = "x86_64-unknown-linux-gnu"

				%struct.S = type { i32, i32, i32, i32 }

				; Function Attrs: nounwind uwtable
				define void @test_conditional_block(%struct.S* nocapture %s1, %struct.S* nocapture %s2, i32 %x, %struct.S* nocapture %s3, %struct.S* nocapture readonly %s4) local_unnamed_addr #0 {
				; CHECK-LABEL: test_conditional_block:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: cmpl $18, %edx
				; CHECK-NEXT: jl .LBB0_2
				; CHECK-NEXT: # %bb.1: # %if.then
				; CHECK-NEXT: movl %edx, 4(%rdi)
				; CHECK-NEXT: .LBB0_2: # %if.end
				; CHECK-NEXT: movups (%r8), %xmm0
				; CHECK-NEXT: movups %xmm0, (%rcx)
				; CHECK-NEXT: movl (%rdi), %eax
				; CHECK-NEXT: movl %eax, (%rsi)
				; CHECK-NEXT: movl 4(%rdi), %eax
				; CHECK-NEXT: movl %eax, 4(%rsi)
				; CHECK-NEXT: movq 8(%rdi), %rax
				; CHECK-NEXT: movq %rax, 8(%rsi)
				; CHECK-NEXT: retq
				;
				; DISABLED-LABEL: test_conditional_block:
				; DISABLED: # %bb.0: # %entry
				; DISABLED-NEXT: cmpl $18, %edx
				; DISABLED-NEXT: jl .LBB0_2
				; DISABLED-NEXT: # %bb.1: # %if.then
				; DISABLED-NEXT: movl %edx, 4(%rdi)
				; DISABLED-NEXT: .LBB0_2: # %if.end
				; DISABLED-NEXT: movups (%r8), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%rcx)
				; DISABLED-NEXT: movups (%rdi), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%rsi)
				; DISABLED-NEXT: retq
				;
				; CHECK-AVX2-LABEL: test_conditional_block:
				; CHECK-AVX2: # %bb.0: # %entry
				; CHECK-AVX2-NEXT: cmpl $18, %edx
				; CHECK-AVX2-NEXT: jl .LBB0_2
				; CHECK-AVX2-NEXT: # %bb.1: # %if.then
				; CHECK-AVX2-NEXT: movl %edx, 4(%rdi)
				; CHECK-AVX2-NEXT: .LBB0_2: # %if.end
				; CHECK-AVX2-NEXT: vmovups (%r8), %xmm0
				; CHECK-AVX2-NEXT: vmovups %xmm0, (%rcx)
				; CHECK-AVX2-NEXT: movl (%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, (%rsi)
				; CHECK-AVX2-NEXT: movl 4(%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, 4(%rsi)
				; CHECK-AVX2-NEXT: movq 8(%rdi), %rax
				; CHECK-AVX2-NEXT: movq %rax, 8(%rsi)
				; CHECK-AVX2-NEXT: retq
				;
				; CHECK-AVX512-LABEL: test_conditional_block:
				; CHECK-AVX512: # %bb.0: # %entry
				; CHECK-AVX512-NEXT: cmpl $18, %edx
				; CHECK-AVX512-NEXT: jl .LBB0_2
				; CHECK-AVX512-NEXT: # %bb.1: # %if.then
				; CHECK-AVX512-NEXT: movl %edx, 4(%rdi)
				; CHECK-AVX512-NEXT: .LBB0_2: # %if.end
				; CHECK-AVX512-NEXT: vmovups (%r8), %xmm0
				; CHECK-AVX512-NEXT: vmovups %xmm0, (%rcx)
				; CHECK-AVX512-NEXT: movl (%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, (%rsi)
				; CHECK-AVX512-NEXT: movl 4(%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, 4(%rsi)
				; CHECK-AVX512-NEXT: movq 8(%rdi), %rax
				; CHECK-AVX512-NEXT: movq %rax, 8(%rsi)
				; CHECK-AVX512-NEXT: retq
				entry:
				%cmp = icmp sgt i32 %x, 17
				br i1 %cmp, label %if.then, label %if.end

				if.then: ; preds = %entry
				%b = getelementptr inbounds %struct.S, %struct.S* %s1, i64 0, i32 1
				store i32 %x, i32* %b, align 4
				br label %if.end

				if.end: ; preds = %if.then, %entry
				%0 = bitcast %struct.S* %s3 to i8*
				%1 = bitcast %struct.S* %s4 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 16, i32 4, i1 false)
				%2 = bitcast %struct.S* %s2 to i8*
				%3 = bitcast %struct.S* %s1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 16, i32 4, i1 false)
				ret void
				}

				; Function Attrs: nounwind uwtable
				define void @test_imm_store(%struct.S* nocapture %s1, %struct.S* nocapture %s2, i32 %x, %struct.S* nocapture %s3) local_unnamed_addr #0 {
				; CHECK-LABEL: test_imm_store:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: movl $0, (%rdi)
				; CHECK-NEXT: movl $1, (%rcx)
				; CHECK-NEXT: movl (%rdi), %eax
				; CHECK-NEXT: movl %eax, (%rsi)
				; CHECK-NEXT: movq 4(%rdi), %rax
				; CHECK-NEXT: movq %rax, 4(%rsi)
				; CHECK-NEXT: movl 12(%rdi), %eax
				; CHECK-NEXT: movl %eax, 12(%rsi)
				; CHECK-NEXT: retq
				;
				; DISABLED-LABEL: test_imm_store:
				; DISABLED: # %bb.0: # %entry
				; DISABLED-NEXT: movl $0, (%rdi)
				; DISABLED-NEXT: movl $1, (%rcx)
				; DISABLED-NEXT: movups (%rdi), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%rsi)
				; DISABLED-NEXT: retq
				;
				; CHECK-AVX2-LABEL: test_imm_store:
				; CHECK-AVX2: # %bb.0: # %entry
				; CHECK-AVX2-NEXT: movl $0, (%rdi)
				; CHECK-AVX2-NEXT: movl $1, (%rcx)
				; CHECK-AVX2-NEXT: movl (%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, (%rsi)
				; CHECK-AVX2-NEXT: movq 4(%rdi), %rax
				; CHECK-AVX2-NEXT: movq %rax, 4(%rsi)
				; CHECK-AVX2-NEXT: movl 12(%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, 12(%rsi)
				; CHECK-AVX2-NEXT: retq
				;
				; CHECK-AVX512-LABEL: test_imm_store:
				; CHECK-AVX512: # %bb.0: # %entry
				; CHECK-AVX512-NEXT: movl $0, (%rdi)
				; CHECK-AVX512-NEXT: movl $1, (%rcx)
				; CHECK-AVX512-NEXT: movl (%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, (%rsi)
				; CHECK-AVX512-NEXT: movq 4(%rdi), %rax
				; CHECK-AVX512-NEXT: movq %rax, 4(%rsi)
				; CHECK-AVX512-NEXT: movl 12(%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, 12(%rsi)
				; CHECK-AVX512-NEXT: retq
				entry:
				%a = getelementptr inbounds %struct.S, %struct.S* %s1, i64 0, i32 0
				store i32 0, i32* %a, align 4
				%a1 = getelementptr inbounds %struct.S, %struct.S* %s3, i64 0, i32 0
				store i32 1, i32* %a1, align 4
				%0 = bitcast %struct.S* %s2 to i8*
				%1 = bitcast %struct.S* %s1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 16, i32 4, i1 false)
				ret void
				}

				; Function Attrs: nounwind uwtable
				define void @test_nondirect_br(%struct.S* nocapture %s1, %struct.S* nocapture %s2, i32 %x, %struct.S* nocapture %s3, %struct.S* nocapture readonly %s4, i32 %x2) local_unnamed_addr #0 {
				; CHECK-LABEL: test_nondirect_br:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: cmpl $18, %edx
				; CHECK-NEXT: jl .LBB2_2
				; CHECK-NEXT: # %bb.1: # %if.then
				; CHECK-NEXT: movl %edx, 4(%rdi)
				; CHECK-NEXT: .LBB2_2: # %if.end
				; CHECK-NEXT: cmpl $14, %r9d
				; CHECK-NEXT: jl .LBB2_4
				; CHECK-NEXT: # %bb.3: # %if.then2
				; CHECK-NEXT: movl %r9d, 12(%rdi)
				; CHECK-NEXT: .LBB2_4: # %if.end3
				; CHECK-NEXT: movups (%r8), %xmm0
				; CHECK-NEXT: movups %xmm0, (%rcx)
				; CHECK-NEXT: movq (%rdi), %rax
				; CHECK-NEXT: movq %rax, (%rsi)
				; CHECK-NEXT: movl 8(%rdi), %eax
				; CHECK-NEXT: movl %eax, 8(%rsi)
				; CHECK-NEXT: movl 12(%rdi), %eax
				; CHECK-NEXT: movl %eax, 12(%rsi)
				; CHECK-NEXT: retq
				;
				; DISABLED-LABEL: test_nondirect_br:
				; DISABLED: # %bb.0: # %entry
				; DISABLED-NEXT: cmpl $18, %edx
				; DISABLED-NEXT: jl .LBB2_2
				; DISABLED-NEXT: # %bb.1: # %if.then
				; DISABLED-NEXT: movl %edx, 4(%rdi)
				; DISABLED-NEXT: .LBB2_2: # %if.end
				; DISABLED-NEXT: cmpl $14, %r9d
				; DISABLED-NEXT: jl .LBB2_4
				; DISABLED-NEXT: # %bb.3: # %if.then2
				; DISABLED-NEXT: movl %r9d, 12(%rdi)
				; DISABLED-NEXT: .LBB2_4: # %if.end3
				; DISABLED-NEXT: movups (%r8), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%rcx)
				; DISABLED-NEXT: movups (%rdi), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%rsi)
				; DISABLED-NEXT: retq
				;
				; CHECK-AVX2-LABEL: test_nondirect_br:
				; CHECK-AVX2: # %bb.0: # %entry
				; CHECK-AVX2-NEXT: cmpl $18, %edx
				; CHECK-AVX2-NEXT: jl .LBB2_2
				; CHECK-AVX2-NEXT: # %bb.1: # %if.then
				; CHECK-AVX2-NEXT: movl %edx, 4(%rdi)
				; CHECK-AVX2-NEXT: .LBB2_2: # %if.end
				; CHECK-AVX2-NEXT: cmpl $14, %r9d
				; CHECK-AVX2-NEXT: jl .LBB2_4
				; CHECK-AVX2-NEXT: # %bb.3: # %if.then2
				; CHECK-AVX2-NEXT: movl %r9d, 12(%rdi)
				; CHECK-AVX2-NEXT: .LBB2_4: # %if.end3
				; CHECK-AVX2-NEXT: vmovups (%r8), %xmm0
				; CHECK-AVX2-NEXT: vmovups %xmm0, (%rcx)
				; CHECK-AVX2-NEXT: movq (%rdi), %rax
				; CHECK-AVX2-NEXT: movq %rax, (%rsi)
				; CHECK-AVX2-NEXT: movl 8(%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, 8(%rsi)
				; CHECK-AVX2-NEXT: movl 12(%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, 12(%rsi)
				; CHECK-AVX2-NEXT: retq
				;
				; CHECK-AVX512-LABEL: test_nondirect_br:
				; CHECK-AVX512: # %bb.0: # %entry
				; CHECK-AVX512-NEXT: cmpl $18, %edx
				; CHECK-AVX512-NEXT: jl .LBB2_2
				; CHECK-AVX512-NEXT: # %bb.1: # %if.then
				; CHECK-AVX512-NEXT: movl %edx, 4(%rdi)
				; CHECK-AVX512-NEXT: .LBB2_2: # %if.end
				; CHECK-AVX512-NEXT: cmpl $14, %r9d
				; CHECK-AVX512-NEXT: jl .LBB2_4
				; CHECK-AVX512-NEXT: # %bb.3: # %if.then2
				; CHECK-AVX512-NEXT: movl %r9d, 12(%rdi)
				; CHECK-AVX512-NEXT: .LBB2_4: # %if.end3
				; CHECK-AVX512-NEXT: vmovups (%r8), %xmm0
				; CHECK-AVX512-NEXT: vmovups %xmm0, (%rcx)
				; CHECK-AVX512-NEXT: movq (%rdi), %rax
				; CHECK-AVX512-NEXT: movq %rax, (%rsi)
				; CHECK-AVX512-NEXT: movl 8(%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, 8(%rsi)
				; CHECK-AVX512-NEXT: movl 12(%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, 12(%rsi)
				; CHECK-AVX512-NEXT: retq
				entry:
				%cmp = icmp sgt i32 %x, 17
				br i1 %cmp, label %if.then, label %if.end

				if.then: ; preds = %entry
				%b = getelementptr inbounds %struct.S, %struct.S* %s1, i64 0, i32 1
				store i32 %x, i32* %b, align 4
				br label %if.end

				if.end: ; preds = %if.then, %entry
				%cmp1 = icmp sgt i32 %x2, 13
				br i1 %cmp1, label %if.then2, label %if.end3

				if.then2: ; preds = %if.end
				%d = getelementptr inbounds %struct.S, %struct.S* %s1, i64 0, i32 3
				store i32 %x2, i32* %d, align 4
				br label %if.end3

				if.end3: ; preds = %if.then2, %if.end
				%0 = bitcast %struct.S* %s3 to i8*
				%1 = bitcast %struct.S* %s4 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 16, i32 4, i1 false)
				%2 = bitcast %struct.S* %s2 to i8*
				%3 = bitcast %struct.S* %s1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 16, i32 4, i1 false)
				ret void
				}

				; Function Attrs: nounwind uwtable
				define void @test_2preds_block(%struct.S* nocapture %s1, %struct.S* nocapture %s2, i32 %x, %struct.S* nocapture %s3, %struct.S* nocapture readonly %s4, i32 %x2) local_unnamed_addr #0 {
				; CHECK-LABEL: test_2preds_block:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: movl %r9d, 12(%rdi)
				; CHECK-NEXT: cmpl $18, %edx
				; CHECK-NEXT: jl .LBB3_2
				; CHECK-NEXT: # %bb.1: # %if.then
				; CHECK-NEXT: movl %edx, 4(%rdi)
				; CHECK-NEXT: .LBB3_2: # %if.end
				; CHECK-NEXT: movups (%r8), %xmm0
				; CHECK-NEXT: movups %xmm0, (%rcx)
				; CHECK-NEXT: movl (%rdi), %eax
				; CHECK-NEXT: movl %eax, (%rsi)
				; CHECK-NEXT: movl 4(%rdi), %eax
				; CHECK-NEXT: movl %eax, 4(%rsi)
				; CHECK-NEXT: movl 8(%rdi), %eax
				; CHECK-NEXT: movl %eax, 8(%rsi)
				; CHECK-NEXT: movl 12(%rdi), %eax
				; CHECK-NEXT: movl %eax, 12(%rsi)
				; CHECK-NEXT: retq
				;
				; DISABLED-LABEL: test_2preds_block:
				; DISABLED: # %bb.0: # %entry
				; DISABLED-NEXT: movl %r9d, 12(%rdi)
				; DISABLED-NEXT: cmpl $18, %edx
				; DISABLED-NEXT: jl .LBB3_2
				; DISABLED-NEXT: # %bb.1: # %if.then
				; DISABLED-NEXT: movl %edx, 4(%rdi)
				; DISABLED-NEXT: .LBB3_2: # %if.end
				; DISABLED-NEXT: movups (%r8), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%rcx)
				; DISABLED-NEXT: movups (%rdi), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%rsi)
				; DISABLED-NEXT: retq
				;
				; CHECK-AVX2-LABEL: test_2preds_block:
				; CHECK-AVX2: # %bb.0: # %entry
				; CHECK-AVX2-NEXT: movl %r9d, 12(%rdi)
				; CHECK-AVX2-NEXT: cmpl $18, %edx
				; CHECK-AVX2-NEXT: jl .LBB3_2
				; CHECK-AVX2-NEXT: # %bb.1: # %if.then
				; CHECK-AVX2-NEXT: movl %edx, 4(%rdi)
				; CHECK-AVX2-NEXT: .LBB3_2: # %if.end
				; CHECK-AVX2-NEXT: vmovups (%r8), %xmm0
				; CHECK-AVX2-NEXT: vmovups %xmm0, (%rcx)
				; CHECK-AVX2-NEXT: movl (%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, (%rsi)
				; CHECK-AVX2-NEXT: movl 4(%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, 4(%rsi)
				; CHECK-AVX2-NEXT: movl 8(%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, 8(%rsi)
				; CHECK-AVX2-NEXT: movl 12(%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, 12(%rsi)
				; CHECK-AVX2-NEXT: retq
				;
				; CHECK-AVX512-LABEL: test_2preds_block:
				; CHECK-AVX512: # %bb.0: # %entry
				; CHECK-AVX512-NEXT: movl %r9d, 12(%rdi)
				; CHECK-AVX512-NEXT: cmpl $18, %edx
				; CHECK-AVX512-NEXT: jl .LBB3_2
				; CHECK-AVX512-NEXT: # %bb.1: # %if.then
				; CHECK-AVX512-NEXT: movl %edx, 4(%rdi)
				; CHECK-AVX512-NEXT: .LBB3_2: # %if.end
				; CHECK-AVX512-NEXT: vmovups (%r8), %xmm0
				; CHECK-AVX512-NEXT: vmovups %xmm0, (%rcx)
				; CHECK-AVX512-NEXT: movl (%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, (%rsi)
				; CHECK-AVX512-NEXT: movl 4(%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, 4(%rsi)
				; CHECK-AVX512-NEXT: movl 8(%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, 8(%rsi)
				; CHECK-AVX512-NEXT: movl 12(%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, 12(%rsi)
				; CHECK-AVX512-NEXT: retq
				entry:
				%d = getelementptr inbounds %struct.S, %struct.S* %s1, i64 0, i32 3
				store i32 %x2, i32* %d, align 4
				%cmp = icmp sgt i32 %x, 17
				br i1 %cmp, label %if.then, label %if.end

				if.then: ; preds = %entry
				%b = getelementptr inbounds %struct.S, %struct.S* %s1, i64 0, i32 1
				store i32 %x, i32* %b, align 4
				br label %if.end

				if.end: ; preds = %if.then, %entry
				%0 = bitcast %struct.S* %s3 to i8*
				%1 = bitcast %struct.S* %s4 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 16, i32 4, i1 false)
				%2 = bitcast %struct.S* %s2 to i8*
				%3 = bitcast %struct.S* %s1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 16, i32 4, i1 false)
				ret void
				}
				%struct.S2 = type { i64, i64 }

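				; A conditional 8-byte store to offset 8 of %s1 precedes the 16-byte copy from %s1, which is expected to be split into two 8-byte copies.
				; Function Attrs: nounwind uwtable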
				define void @test_type64(%struct.S2* nocapture %s1, %struct.S2* nocapture %s2, i32 %x, %struct.S2* nocapture %s3, %struct.S2* nocapture readonly %s4) local_unnamed_addr #0 {
				; CHECK-LABEL: test_type64:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: cmpl $18, %edx
				; CHECK-NEXT: jl .LBB4_2
				; CHECK-NEXT: # %bb.1: # %if.then
				; CHECK-NEXT: movslq %edx, %rax
				; CHECK-NEXT: movq %rax, 8(%rdi)
				; CHECK-NEXT: .LBB4_2: # %if.end
				; CHECK-NEXT: movups (%r8), %xmm0
				; CHECK-NEXT: movups %xmm0, (%rcx)
				; CHECK-NEXT: movq (%rdi), %rax
				; CHECK-NEXT: movq %rax, (%rsi)
				; CHECK-NEXT: movq 8(%rdi), %rax
				; CHECK-NEXT: movq %rax, 8(%rsi)
				; CHECK-NEXT: retq
				;
				; DISABLED-LABEL: test_type64:
				; DISABLED: # %bb.0: # %entry
				; DISABLED-NEXT: cmpl $18, %edx
				; DISABLED-NEXT: jl .LBB4_2
				; DISABLED-NEXT: # %bb.1: # %if.then
				; DISABLED-NEXT: movslq %edx, %rax
				; DISABLED-NEXT: movq %rax, 8(%rdi)
				; DISABLED-NEXT: .LBB4_2: # %if.end
				; DISABLED-NEXT: movups (%r8), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%rcx)
				; DISABLED-NEXT: movups (%rdi), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%rsi)
				; DISABLED-NEXT: retq
				;
				; CHECK-AVX2-LABEL: test_type64:
				; CHECK-AVX2: # %bb.0: # %entry
				; CHECK-AVX2-NEXT: cmpl $18, %edx
				; CHECK-AVX2-NEXT: jl .LBB4_2
				; CHECK-AVX2-NEXT: # %bb.1: # %if.then
				; CHECK-AVX2-NEXT: movslq %edx, %rax
				; CHECK-AVX2-NEXT: movq %rax, 8(%rdi)
				; CHECK-AVX2-NEXT: .LBB4_2: # %if.end
				; CHECK-AVX2-NEXT: vmovups (%r8), %xmm0
				; CHECK-AVX2-NEXT: vmovups %xmm0, (%rcx)
				; CHECK-AVX2-NEXT: movq (%rdi), %rax
				; CHECK-AVX2-NEXT: movq %rax, (%rsi)
				; CHECK-AVX2-NEXT: movq 8(%rdi), %rax
				; CHECK-AVX2-NEXT: movq %rax, 8(%rsi)
				; CHECK-AVX2-NEXT: retq
				;
				; CHECK-AVX512-LABEL: test_type64:
				; CHECK-AVX512: # %bb.0: # %entry
				; CHECK-AVX512-NEXT: cmpl $18, %edx
				; CHECK-AVX512-NEXT: jl .LBB4_2
				; CHECK-AVX512-NEXT: # %bb.1: # %if.then
				; CHECK-AVX512-NEXT: movslq %edx, %rax
				; CHECK-AVX512-NEXT: movq %rax, 8(%rdi)
				; CHECK-AVX512-NEXT: .LBB4_2: # %if.end
				; CHECK-AVX512-NEXT: vmovups (%r8), %xmm0
				; CHECK-AVX512-NEXT: vmovups %xmm0, (%rcx)
				; CHECK-AVX512-NEXT: movq (%rdi), %rax
				; CHECK-AVX512-NEXT: movq %rax, (%rsi)
				; CHECK-AVX512-NEXT: movq 8(%rdi), %rax
				; CHECK-AVX512-NEXT: movq %rax, 8(%rsi)
				; CHECK-AVX512-NEXT: retq
				entry:
				%cmp = icmp sgt i32 %x, 17
				br i1 %cmp, label %if.then, label %if.end

				if.then: ; preds = %entry
				%conv = sext i32 %x to i64
				%b = getelementptr inbounds %struct.S2, %struct.S2* %s1, i64 0, i32 1
				store i64 %conv, i64* %b, align 8
				br label %if.end

				if.end: ; preds = %if.then, %entry
				%0 = bitcast %struct.S2* %s3 to i8*
				%1 = bitcast %struct.S2* %s4 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 16, i32 8, i1 false)
				%2 = bitcast %struct.S2* %s2 to i8*
				%3 = bitcast %struct.S2* %s1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 16, i32 8, i1 false)
				ret void
				}
				%struct.S3 = type { i64, i8, i8, i16, i32 }

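				; Conditional stores of 8 bytes at offset 0 and 1 byte at offset 8 precede the 16-byte copy; the expected split is 8 + 1 + 4 + 2 + 1 bytes.
				; Function Attrs: noinline nounwind uwtable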
				define void @test_mixed_type(%struct.S3* nocapture %s1, %struct.S3* nocapture %s2, i32 %x, %struct.S3* nocapture readnone %s3, %struct.S3* nocapture readnone %s4) local_unnamed_addr #0 {
				; CHECK-LABEL: test_mixed_type:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: cmpl $18, %edx
				; CHECK-NEXT: jl .LBB5_2
				; CHECK-NEXT: # %bb.1: # %if.then
				; CHECK-NEXT: movslq %edx, %rax
				; CHECK-NEXT: movq %rax, (%rdi)
				; CHECK-NEXT: movb %dl, 8(%rdi)
				; CHECK-NEXT: .LBB5_2: # %if.end
				; CHECK-NEXT: movq (%rdi), %rax
				; CHECK-NEXT: movq %rax, (%rsi)
				; CHECK-NEXT: movb 8(%rdi), %al
				; CHECK-NEXT: movb %al, 8(%rsi)
				; CHECK-NEXT: movl 9(%rdi), %eax
				; CHECK-NEXT: movl %eax, 9(%rsi)
				; CHECK-NEXT: movzwl 13(%rdi), %eax
				; CHECK-NEXT: movw %ax, 13(%rsi)
				; CHECK-NEXT: movb 15(%rdi), %al
				; CHECK-NEXT: movb %al, 15(%rsi)
				; CHECK-NEXT: retq
				;
				; DISABLED-LABEL: test_mixed_type:
				; DISABLED: # %bb.0: # %entry
				; DISABLED-NEXT: cmpl $18, %edx
				; DISABLED-NEXT: jl .LBB5_2
				; DISABLED-NEXT: # %bb.1: # %if.then
				; DISABLED-NEXT: movslq %edx, %rax
				; DISABLED-NEXT: movq %rax, (%rdi)
				; DISABLED-NEXT: movb %dl, 8(%rdi)
				; DISABLED-NEXT: .LBB5_2: # %if.end
				; DISABLED-NEXT: movups (%rdi), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%rsi)
				; DISABLED-NEXT: retq
				;
				; CHECK-AVX2-LABEL: test_mixed_type:
				; CHECK-AVX2: # %bb.0: # %entry
				; CHECK-AVX2-NEXT: cmpl $18, %edx
				; CHECK-AVX2-NEXT: jl .LBB5_2
				; CHECK-AVX2-NEXT: # %bb.1: # %if.then
				; CHECK-AVX2-NEXT: movslq %edx, %rax
				; CHECK-AVX2-NEXT: movq %rax, (%rdi)
				; CHECK-AVX2-NEXT: movb %dl, 8(%rdi)
				; CHECK-AVX2-NEXT: .LBB5_2: # %if.end
				; CHECK-AVX2-NEXT: movq (%rdi), %rax
				; CHECK-AVX2-NEXT: movq %rax, (%rsi)
				; CHECK-AVX2-NEXT: movb 8(%rdi), %al
				; CHECK-AVX2-NEXT: movb %al, 8(%rsi)
				; CHECK-AVX2-NEXT: movl 9(%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, 9(%rsi)
				; CHECK-AVX2-NEXT: movzwl 13(%rdi), %eax
				; CHECK-AVX2-NEXT: movw %ax, 13(%rsi)
				; CHECK-AVX2-NEXT: movb 15(%rdi), %al
				; CHECK-AVX2-NEXT: movb %al, 15(%rsi)
				; CHECK-AVX2-NEXT: retq
				;
				; CHECK-AVX512-LABEL: test_mixed_type:
				; CHECK-AVX512: # %bb.0: # %entry
				; CHECK-AVX512-NEXT: cmpl $18, %edx
				; CHECK-AVX512-NEXT: jl .LBB5_2
				; CHECK-AVX512-NEXT: # %bb.1: # %if.then
				; CHECK-AVX512-NEXT: movslq %edx, %rax
				; CHECK-AVX512-NEXT: movq %rax, (%rdi)
				; CHECK-AVX512-NEXT: movb %dl, 8(%rdi)
				; CHECK-AVX512-NEXT: .LBB5_2: # %if.end
				; CHECK-AVX512-NEXT: movq (%rdi), %rax
				; CHECK-AVX512-NEXT: movq %rax, (%rsi)
				; CHECK-AVX512-NEXT: movb 8(%rdi), %al
				; CHECK-AVX512-NEXT: movb %al, 8(%rsi)
				; CHECK-AVX512-NEXT: movl 9(%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, 9(%rsi)
				; CHECK-AVX512-NEXT: movzwl 13(%rdi), %eax
				; CHECK-AVX512-NEXT: movw %ax, 13(%rsi)
				; CHECK-AVX512-NEXT: movb 15(%rdi), %al
				; CHECK-AVX512-NEXT: movb %al, 15(%rsi)
				; CHECK-AVX512-NEXT: retq
				entry:
				%cmp = icmp sgt i32 %x, 17
				br i1 %cmp, label %if.then, label %if.end

				if.then: ; preds = %entry
				%conv = sext i32 %x to i64
				%a = getelementptr inbounds %struct.S3, %struct.S3* %s1, i64 0, i32 0
				store i64 %conv, i64* %a, align 8
				%conv1 = trunc i32 %x to i8
				%b = getelementptr inbounds %struct.S3, %struct.S3* %s1, i64 0, i32 1
				store i8 %conv1, i8* %b, align 8
				br label %if.end

				if.end: ; preds = %if.then, %entry
				%0 = bitcast %struct.S3* %s2 to i8*
				%1 = bitcast %struct.S3* %s1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 16, i32 8, i1 false)
				ret void
				}
				%struct.S4 = type { i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32, i32 }

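				; Two 4-byte stores (offsets 4 and 36) precede one 48-byte copy; the pieces around the stores are expected to be split, while bytes 16..31 are still copied with a vector move.
				; Function Attrs: nounwind uwtable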
				define void @test_multiple_blocks(%struct.S4* nocapture %s1, %struct.S4* nocapture %s2) local_unnamed_addr #0 {
				; CHECK-LABEL: test_multiple_blocks:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: movl $0, 4(%rdi)
				; CHECK-NEXT: movl $0, 36(%rdi)
				; CHECK-NEXT: movups 16(%rdi), %xmm0
				; CHECK-NEXT: movups %xmm0, 16(%rsi)
				; CHECK-NEXT: movl 32(%rdi), %eax
				; CHECK-NEXT: movl %eax, 32(%rsi)
				; CHECK-NEXT: movl 36(%rdi), %eax
				; CHECK-NEXT: movl %eax, 36(%rsi)
				; CHECK-NEXT: movq 40(%rdi), %rax
				; CHECK-NEXT: movq %rax, 40(%rsi)
				; CHECK-NEXT: movl (%rdi), %eax
				; CHECK-NEXT: movl %eax, (%rsi)
				; CHECK-NEXT: movl 4(%rdi), %eax
				; CHECK-NEXT: movl %eax, 4(%rsi)
				; CHECK-NEXT: movq 8(%rdi), %rax
				; CHECK-NEXT: movq %rax, 8(%rsi)
				; CHECK-NEXT: retq
				;
				; DISABLED-LABEL: test_multiple_blocks:
				; DISABLED: # %bb.0: # %entry
				; DISABLED-NEXT: movl $0, 4(%rdi)
				; DISABLED-NEXT: movl $0, 36(%rdi)
				; DISABLED-NEXT: movups 16(%rdi), %xmm0
				; DISABLED-NEXT: movups %xmm0, 16(%rsi)
				; DISABLED-NEXT: movups 32(%rdi), %xmm0
				; DISABLED-NEXT: movups %xmm0, 32(%rsi)
				; DISABLED-NEXT: movups (%rdi), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%rsi)
				; DISABLED-NEXT: retq
				;
				; CHECK-AVX2-LABEL: test_multiple_blocks:
				; CHECK-AVX2: # %bb.0: # %entry
				; CHECK-AVX2-NEXT: movl $0, 4(%rdi)
				; CHECK-AVX2-NEXT: movl $0, 36(%rdi)
				; CHECK-AVX2-NEXT: vmovups 16(%rdi), %xmm0
				; CHECK-AVX2-NEXT: vmovups %xmm0, 16(%rsi)
				; CHECK-AVX2-NEXT: movl 32(%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, 32(%rsi)
				; CHECK-AVX2-NEXT: movl 36(%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, 36(%rsi)
				; CHECK-AVX2-NEXT: movq 40(%rdi), %rax
				; CHECK-AVX2-NEXT: movq %rax, 40(%rsi)
				; CHECK-AVX2-NEXT: movl (%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, (%rsi)
				; CHECK-AVX2-NEXT: movl 4(%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, 4(%rsi)
				; CHECK-AVX2-NEXT: vmovups 8(%rdi), %xmm0
				; CHECK-AVX2-NEXT: vmovups %xmm0, 8(%rsi)
				; CHECK-AVX2-NEXT: movq 24(%rdi), %rax
				; CHECK-AVX2-NEXT: movq %rax, 24(%rsi)
				; CHECK-AVX2-NEXT: retq
				;
				; CHECK-AVX512-LABEL: test_multiple_blocks:
				; CHECK-AVX512: # %bb.0: # %entry
				; CHECK-AVX512-NEXT: movl $0, 4(%rdi)
				; CHECK-AVX512-NEXT: movl $0, 36(%rdi)
				; CHECK-AVX512-NEXT: vmovups 16(%rdi), %xmm0
				; CHECK-AVX512-NEXT: vmovups %xmm0, 16(%rsi)
				; CHECK-AVX512-NEXT: movl 32(%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, 32(%rsi)
				; CHECK-AVX512-NEXT: movl 36(%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, 36(%rsi)
				; CHECK-AVX512-NEXT: movq 40(%rdi), %rax
				; CHECK-AVX512-NEXT: movq %rax, 40(%rsi)
				; CHECK-AVX512-NEXT: movl (%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, (%rsi)
				; CHECK-AVX512-NEXT: movl 4(%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, 4(%rsi)
				; CHECK-AVX512-NEXT: vmovups 8(%rdi), %xmm0
				; CHECK-AVX512-NEXT: vmovups %xmm0, 8(%rsi)
				; CHECK-AVX512-NEXT: movq 24(%rdi), %rax
				; CHECK-AVX512-NEXT: movq %rax, 24(%rsi)
				; CHECK-AVX512-NEXT: retq
				entry:
				%b = getelementptr inbounds %struct.S4, %struct.S4* %s1, i64 0, i32 1
				store i32 0, i32* %b, align 4
				%b3 = getelementptr inbounds %struct.S4, %struct.S4* %s1, i64 0, i32 9
				store i32 0, i32* %b3, align 4
				%0 = bitcast %struct.S4* %s2 to i8*
				%1 = bitcast %struct.S4* %s1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 48, i32 4, i1 false)
				ret void
				}
				%struct.S5 = type { i16, i16, i16, i16, i16, i16, i16, i16 }

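				; A conditional 2-byte store to offset 2 of %s1 precedes the 16-byte copy from %s1; the expected split is 2 + 2 + 8 + 4 bytes.
				; Function Attrs: nounwind uwtable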
				define void @test_type16(%struct.S5* nocapture %s1, %struct.S5* nocapture %s2, i32 %x, %struct.S5* nocapture %s3, %struct.S5* nocapture readonly %s4) local_unnamed_addr #0 {
				; CHECK-LABEL: test_type16:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: cmpl $18, %edx
				; CHECK-NEXT: jl .LBB7_2
				; CHECK-NEXT: # %bb.1: # %if.then
				; CHECK-NEXT: movw %dx, 2(%rdi)
				; CHECK-NEXT: .LBB7_2: # %if.end
				; CHECK-NEXT: movups (%r8), %xmm0
				; CHECK-NEXT: movups %xmm0, (%rcx)
				; CHECK-NEXT: movzwl (%rdi), %eax
				; CHECK-NEXT: movw %ax, (%rsi)
				; CHECK-NEXT: movzwl 2(%rdi), %eax
				; CHECK-NEXT: movw %ax, 2(%rsi)
				; CHECK-NEXT: movq 4(%rdi), %rax
				; CHECK-NEXT: movq %rax, 4(%rsi)
				; CHECK-NEXT: movl 12(%rdi), %eax
				; CHECK-NEXT: movl %eax, 12(%rsi)
				; CHECK-NEXT: retq
				;
				; DISABLED-LABEL: test_type16:
				; DISABLED: # %bb.0: # %entry
				; DISABLED-NEXT: cmpl $18, %edx
				; DISABLED-NEXT: jl .LBB7_2
				; DISABLED-NEXT: # %bb.1: # %if.then
				; DISABLED-NEXT: movw %dx, 2(%rdi)
				; DISABLED-NEXT: .LBB7_2: # %if.end
				; DISABLED-NEXT: movups (%r8), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%rcx)
				; DISABLED-NEXT: movups (%rdi), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%rsi)
				; DISABLED-NEXT: retq
				;
				; CHECK-AVX2-LABEL: test_type16:
				; CHECK-AVX2: # %bb.0: # %entry
				; CHECK-AVX2-NEXT: cmpl $18, %edx
				; CHECK-AVX2-NEXT: jl .LBB7_2
				; CHECK-AVX2-NEXT: # %bb.1: # %if.then
				; CHECK-AVX2-NEXT: movw %dx, 2(%rdi)
				; CHECK-AVX2-NEXT: .LBB7_2: # %if.end
				; CHECK-AVX2-NEXT: vmovups (%r8), %xmm0
				; CHECK-AVX2-NEXT: vmovups %xmm0, (%rcx)
				; CHECK-AVX2-NEXT: movzwl (%rdi), %eax
				; CHECK-AVX2-NEXT: movw %ax, (%rsi)
				; CHECK-AVX2-NEXT: movzwl 2(%rdi), %eax
				; CHECK-AVX2-NEXT: movw %ax, 2(%rsi)
				; CHECK-AVX2-NEXT: movq 4(%rdi), %rax
				; CHECK-AVX2-NEXT: movq %rax, 4(%rsi)
				; CHECK-AVX2-NEXT: movl 12(%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, 12(%rsi)
				; CHECK-AVX2-NEXT: retq
				;
				; CHECK-AVX512-LABEL: test_type16:
				; CHECK-AVX512: # %bb.0: # %entry
				; CHECK-AVX512-NEXT: cmpl $18, %edx
				; CHECK-AVX512-NEXT: jl .LBB7_2
				; CHECK-AVX512-NEXT: # %bb.1: # %if.then
				; CHECK-AVX512-NEXT: movw %dx, 2(%rdi)
				; CHECK-AVX512-NEXT: .LBB7_2: # %if.end
				; CHECK-AVX512-NEXT: vmovups (%r8), %xmm0
				; CHECK-AVX512-NEXT: vmovups %xmm0, (%rcx)
				; CHECK-AVX512-NEXT: movzwl (%rdi), %eax
				; CHECK-AVX512-NEXT: movw %ax, (%rsi)
				; CHECK-AVX512-NEXT: movzwl 2(%rdi), %eax
				; CHECK-AVX512-NEXT: movw %ax, 2(%rsi)
				; CHECK-AVX512-NEXT: movq 4(%rdi), %rax
				; CHECK-AVX512-NEXT: movq %rax, 4(%rsi)
				; CHECK-AVX512-NEXT: movl 12(%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, 12(%rsi)
				; CHECK-AVX512-NEXT: retq
				entry:
				%cmp = icmp sgt i32 %x, 17
				br i1 %cmp, label %if.then, label %if.end

				if.then: ; preds = %entry
				%conv = trunc i32 %x to i16
				%b = getelementptr inbounds %struct.S5, %struct.S5* %s1, i64 0, i32 1
				store i16 %conv, i16* %b, align 2
				br label %if.end

				if.end: ; preds = %if.then, %entry
				%0 = bitcast %struct.S5* %s3 to i8*
				%1 = bitcast %struct.S5* %s4 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 16, i32 2, i1 false)
				%2 = bitcast %struct.S5* %s2 to i8*
				%3 = bitcast %struct.S5* %s1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 16, i32 2, i1 false)
				ret void
				}

				%struct.S6 = type { [4 x i32], i32, i32, i32, i32 }

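				; The copy source is a byval stack argument with a 4-byte store to offset 24; the first 16 bytes are still copied with a vector move and the rest as 8 + 4 + 4 bytes.
				; Function Attrs: nounwind uwtable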
				define void @test_stack(%struct.S6* noalias nocapture sret %agg.result, %struct.S6* byval nocapture readnone align 8 %s1, %struct.S6* byval nocapture align 8 %s2, i32 %x) local_unnamed_addr #0 {
				; CHECK-LABEL: test_stack:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: movl %esi, {{[0-9]+}}(%rsp)
				; CHECK-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0
				; CHECK-NEXT: movups %xmm0, (%rdi)
				; CHECK-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; CHECK-NEXT: movq %rax, 16(%rdi)
				; CHECK-NEXT: movl {{[0-9]+}}(%rsp), %eax
				; CHECK-NEXT: movl %eax, 24(%rdi)
				; CHECK-NEXT: movl {{[0-9]+}}(%rsp), %eax
				; CHECK-NEXT: movl %eax, 28(%rdi)
				; CHECK-NEXT: movq %rdi, %rax
				; CHECK-NEXT: retq
				;
				; DISABLED-LABEL: test_stack:
				; DISABLED: # %bb.0: # %entry
				; DISABLED-NEXT: movl %esi, {{[0-9]+}}(%rsp)
				; DISABLED-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%rdi)
				; DISABLED-NEXT: movaps {{[0-9]+}}(%rsp), %xmm0
				; DISABLED-NEXT: movups %xmm0, 16(%rdi)
				; DISABLED-NEXT: movq %rdi, %rax
				; DISABLED-NEXT: retq
				;
				; CHECK-AVX2-LABEL: test_stack:
				; CHECK-AVX2: # %bb.0: # %entry
				; CHECK-AVX2-NEXT: movl %esi, {{[0-9]+}}(%rsp)
				; CHECK-AVX2-NEXT: vmovups {{[0-9]+}}(%rsp), %xmm0
				; CHECK-AVX2-NEXT: vmovups %xmm0, (%rdi)
				; CHECK-AVX2-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; CHECK-AVX2-NEXT: movq %rax, 16(%rdi)
				; CHECK-AVX2-NEXT: movl {{[0-9]+}}(%rsp), %eax
				; CHECK-AVX2-NEXT: movl %eax, 24(%rdi)
				; CHECK-AVX2-NEXT: movl {{[0-9]+}}(%rsp), %eax
				; CHECK-AVX2-NEXT: movl %eax, 28(%rdi)
				; CHECK-AVX2-NEXT: movq %rdi, %rax
				; CHECK-AVX2-NEXT: retq
				;
				; CHECK-AVX512-LABEL: test_stack:
				; CHECK-AVX512: # %bb.0: # %entry
				; CHECK-AVX512-NEXT: movl %esi, {{[0-9]+}}(%rsp)
				; CHECK-AVX512-NEXT: vmovups {{[0-9]+}}(%rsp), %xmm0
				; CHECK-AVX512-NEXT: vmovups %xmm0, (%rdi)
				; CHECK-AVX512-NEXT: movq {{[0-9]+}}(%rsp), %rax
				; CHECK-AVX512-NEXT: movq %rax, 16(%rdi)
				; CHECK-AVX512-NEXT: movl {{[0-9]+}}(%rsp), %eax
				; CHECK-AVX512-NEXT: movl %eax, 24(%rdi)
				; CHECK-AVX512-NEXT: movl {{[0-9]+}}(%rsp), %eax
				; CHECK-AVX512-NEXT: movl %eax, 28(%rdi)
				; CHECK-AVX512-NEXT: movq %rdi, %rax
				; CHECK-AVX512-NEXT: retq
				entry:
				%s6.sroa.0.0..sroa_cast1 = bitcast %struct.S6* %s2 to i8*
				%s6.sroa.3.0..sroa_idx4 = getelementptr inbounds %struct.S6, %struct.S6* %s2, i64 0, i32 3
				store i32 %x, i32* %s6.sroa.3.0..sroa_idx4, align 8
				%0 = bitcast %struct.S6* %agg.result to i8*
				call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* nonnull %s6.sroa.0.0..sroa_cast1, i64 32, i32 4, i1 false)
				ret void
				}

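				; Each store to %s1 is followed by a call to @bar before the copy; the expected output keeps the single 16-byte copy from %s1 even with the pass enabled.
				; Function Attrs: nounwind uwtable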
				define void @test_limit_all(%struct.S* %s1, %struct.S* nocapture %s2, i32 %x, %struct.S* nocapture %s3, %struct.S* nocapture readonly %s4, i32 %x2) local_unnamed_addr #0 {
				; CHECK-LABEL: test_limit_all:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: pushq %rbp
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: pushq %r15
				; CHECK-NEXT: .cfi_def_cfa_offset 24
				; CHECK-NEXT: pushq %r14
				; CHECK-NEXT: .cfi_def_cfa_offset 32
				; CHECK-NEXT: pushq %r12
				; CHECK-NEXT: .cfi_def_cfa_offset 40
				; CHECK-NEXT: pushq %rbx
				; CHECK-NEXT: .cfi_def_cfa_offset 48
				; CHECK-NEXT: .cfi_offset %rbx, -48
				; CHECK-NEXT: .cfi_offset %r12, -40
				; CHECK-NEXT: .cfi_offset %r14, -32
				; CHECK-NEXT: .cfi_offset %r15, -24
				; CHECK-NEXT: .cfi_offset %rbp, -16
				; CHECK-NEXT: movq %r8, %r15
				; CHECK-NEXT: movq %rcx, %r14
				; CHECK-NEXT: movl %edx, %ebp
				; CHECK-NEXT: movq %rsi, %r12
				; CHECK-NEXT: movq %rdi, %rbx
				; CHECK-NEXT: movl %r9d, 12(%rbx)
				; CHECK-NEXT: callq bar
				; CHECK-NEXT: cmpl $18, %ebp
				; CHECK-NEXT: jl .LBB9_2
				; CHECK-NEXT: # %bb.1: # %if.then
				; CHECK-NEXT: movl %ebp, 4(%rbx)
				; CHECK-NEXT: movq %rbx, %rdi
				; CHECK-NEXT: callq bar
				; CHECK-NEXT: .LBB9_2: # %if.end
				; CHECK-NEXT: movups (%r15), %xmm0
				; CHECK-NEXT: movups %xmm0, (%r14)
				; CHECK-NEXT: movups (%rbx), %xmm0
				; CHECK-NEXT: movups %xmm0, (%r12)
				; CHECK-NEXT: popq %rbx
				; CHECK-NEXT: popq %r12
				; CHECK-NEXT: popq %r14
				; CHECK-NEXT: popq %r15
				; CHECK-NEXT: popq %rbp
				; CHECK-NEXT: retq
				;
				; DISABLED-LABEL: test_limit_all:
				; DISABLED: # %bb.0: # %entry
				; DISABLED-NEXT: pushq %rbp
				; DISABLED-NEXT: .cfi_def_cfa_offset 16
				; DISABLED-NEXT: pushq %r15
				; DISABLED-NEXT: .cfi_def_cfa_offset 24
				; DISABLED-NEXT: pushq %r14
				; DISABLED-NEXT: .cfi_def_cfa_offset 32
				; DISABLED-NEXT: pushq %r12
				; DISABLED-NEXT: .cfi_def_cfa_offset 40
				; DISABLED-NEXT: pushq %rbx
				; DISABLED-NEXT: .cfi_def_cfa_offset 48
				; DISABLED-NEXT: .cfi_offset %rbx, -48
				; DISABLED-NEXT: .cfi_offset %r12, -40
				; DISABLED-NEXT: .cfi_offset %r14, -32
				; DISABLED-NEXT: .cfi_offset %r15, -24
				; DISABLED-NEXT: .cfi_offset %rbp, -16
				; DISABLED-NEXT: movq %r8, %r15
				; DISABLED-NEXT: movq %rcx, %r14
				; DISABLED-NEXT: movl %edx, %ebp
				; DISABLED-NEXT: movq %rsi, %r12
				; DISABLED-NEXT: movq %rdi, %rbx
				; DISABLED-NEXT: movl %r9d, 12(%rbx)
				; DISABLED-NEXT: callq bar
				; DISABLED-NEXT: cmpl $18, %ebp
				; DISABLED-NEXT: jl .LBB9_2
				; DISABLED-NEXT: # %bb.1: # %if.then
				; DISABLED-NEXT: movl %ebp, 4(%rbx)
				; DISABLED-NEXT: movq %rbx, %rdi
				; DISABLED-NEXT: callq bar
				; DISABLED-NEXT: .LBB9_2: # %if.end
				; DISABLED-NEXT: movups (%r15), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%r14)
				; DISABLED-NEXT: movups (%rbx), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%r12)
				; DISABLED-NEXT: popq %rbx
				; DISABLED-NEXT: popq %r12
				; DISABLED-NEXT: popq %r14
				; DISABLED-NEXT: popq %r15
				; DISABLED-NEXT: popq %rbp
				; DISABLED-NEXT: retq
				;
				; CHECK-AVX2-LABEL: test_limit_all:
				; CHECK-AVX2: # %bb.0: # %entry
				; CHECK-AVX2-NEXT: pushq %rbp
				; CHECK-AVX2-NEXT: .cfi_def_cfa_offset 16
				; CHECK-AVX2-NEXT: pushq %r15
				; CHECK-AVX2-NEXT: .cfi_def_cfa_offset 24
				; CHECK-AVX2-NEXT: pushq %r14
				; CHECK-AVX2-NEXT: .cfi_def_cfa_offset 32
				; CHECK-AVX2-NEXT: pushq %r12
				; CHECK-AVX2-NEXT: .cfi_def_cfa_offset 40
				; CHECK-AVX2-NEXT: pushq %rbx
				; CHECK-AVX2-NEXT: .cfi_def_cfa_offset 48
				; CHECK-AVX2-NEXT: .cfi_offset %rbx, -48
				; CHECK-AVX2-NEXT: .cfi_offset %r12, -40
				; CHECK-AVX2-NEXT: .cfi_offset %r14, -32
				; CHECK-AVX2-NEXT: .cfi_offset %r15, -24
				; CHECK-AVX2-NEXT: .cfi_offset %rbp, -16
				; CHECK-AVX2-NEXT: movq %r8, %r15
				; CHECK-AVX2-NEXT: movq %rcx, %r14
				; CHECK-AVX2-NEXT: movl %edx, %ebp
				; CHECK-AVX2-NEXT: movq %rsi, %r12
				; CHECK-AVX2-NEXT: movq %rdi, %rbx
				; CHECK-AVX2-NEXT: movl %r9d, 12(%rbx)
				; CHECK-AVX2-NEXT: callq bar
				; CHECK-AVX2-NEXT: cmpl $18, %ebp
				; CHECK-AVX2-NEXT: jl .LBB9_2
				; CHECK-AVX2-NEXT: # %bb.1: # %if.then
				; CHECK-AVX2-NEXT: movl %ebp, 4(%rbx)
				; CHECK-AVX2-NEXT: movq %rbx, %rdi
				; CHECK-AVX2-NEXT: callq bar
				; CHECK-AVX2-NEXT: .LBB9_2: # %if.end
				; CHECK-AVX2-NEXT: vmovups (%r15), %xmm0
				; CHECK-AVX2-NEXT: vmovups %xmm0, (%r14)
				; CHECK-AVX2-NEXT: vmovups (%rbx), %xmm0
				; CHECK-AVX2-NEXT: vmovups %xmm0, (%r12)
				; CHECK-AVX2-NEXT: popq %rbx
				; CHECK-AVX2-NEXT: popq %r12
				; CHECK-AVX2-NEXT: popq %r14
				; CHECK-AVX2-NEXT: popq %r15
				; CHECK-AVX2-NEXT: popq %rbp
				; CHECK-AVX2-NEXT: retq
				;
				; CHECK-AVX512-LABEL: test_limit_all:
				; CHECK-AVX512: # %bb.0: # %entry
				; CHECK-AVX512-NEXT: pushq %rbp
				; CHECK-AVX512-NEXT: .cfi_def_cfa_offset 16
				; CHECK-AVX512-NEXT: pushq %r15
				; CHECK-AVX512-NEXT: .cfi_def_cfa_offset 24
				; CHECK-AVX512-NEXT: pushq %r14
				; CHECK-AVX512-NEXT: .cfi_def_cfa_offset 32
				; CHECK-AVX512-NEXT: pushq %r12
				; CHECK-AVX512-NEXT: .cfi_def_cfa_offset 40
				; CHECK-AVX512-NEXT: pushq %rbx
				; CHECK-AVX512-NEXT: .cfi_def_cfa_offset 48
				; CHECK-AVX512-NEXT: .cfi_offset %rbx, -48
				; CHECK-AVX512-NEXT: .cfi_offset %r12, -40
				; CHECK-AVX512-NEXT: .cfi_offset %r14, -32
				; CHECK-AVX512-NEXT: .cfi_offset %r15, -24
				; CHECK-AVX512-NEXT: .cfi_offset %rbp, -16
				; CHECK-AVX512-NEXT: movq %r8, %r15
				; CHECK-AVX512-NEXT: movq %rcx, %r14
				; CHECK-AVX512-NEXT: movl %edx, %ebp
				; CHECK-AVX512-NEXT: movq %rsi, %r12
				; CHECK-AVX512-NEXT: movq %rdi, %rbx
				; CHECK-AVX512-NEXT: movl %r9d, 12(%rbx)
				; CHECK-AVX512-NEXT: callq bar
				; CHECK-AVX512-NEXT: cmpl $18, %ebp
				; CHECK-AVX512-NEXT: jl .LBB9_2
				; CHECK-AVX512-NEXT: # %bb.1: # %if.then
				; CHECK-AVX512-NEXT: movl %ebp, 4(%rbx)
				; CHECK-AVX512-NEXT: movq %rbx, %rdi
				; CHECK-AVX512-NEXT: callq bar
				; CHECK-AVX512-NEXT: .LBB9_2: # %if.end
				; CHECK-AVX512-NEXT: vmovups (%r15), %xmm0
				; CHECK-AVX512-NEXT: vmovups %xmm0, (%r14)
				; CHECK-AVX512-NEXT: vmovups (%rbx), %xmm0
				; CHECK-AVX512-NEXT: vmovups %xmm0, (%r12)
				; CHECK-AVX512-NEXT: popq %rbx
				; CHECK-AVX512-NEXT: popq %r12
				; CHECK-AVX512-NEXT: popq %r14
				; CHECK-AVX512-NEXT: popq %r15
				; CHECK-AVX512-NEXT: popq %rbp
				; CHECK-AVX512-NEXT: retq
				entry:
				%d = getelementptr inbounds %struct.S, %struct.S* %s1, i64 0, i32 3
				store i32 %x2, i32* %d, align 4
				tail call void @bar(%struct.S* %s1) #3
				%cmp = icmp sgt i32 %x, 17
				br i1 %cmp, label %if.then, label %if.end

				if.then: ; preds = %entry
				%b = getelementptr inbounds %struct.S, %struct.S* %s1, i64 0, i32 1
				store i32 %x, i32* %b, align 4
				tail call void @bar(%struct.S* nonnull %s1) #3
				br label %if.end

				if.end: ; preds = %if.then, %entry
				%0 = bitcast %struct.S* %s3 to i8*
				%1 = bitcast %struct.S* %s4 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 16, i32 4, i1 false)
				%2 = bitcast %struct.S* %s2 to i8*
				%3 = bitcast %struct.S* %s1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 16, i32 4, i1 false)
				ret void
				}

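				; Only the store in %if.then is followed by a call to @bar; the expected split (8 + 4 + 4 bytes) isolates just the entry-block store at offset 12.
				; Function Attrs: nounwind uwtable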
				define void @test_limit_one_pred(%struct.S* %s1, %struct.S* nocapture %s2, i32 %x, %struct.S* nocapture %s3, %struct.S* nocapture readonly %s4, i32 %x2) local_unnamed_addr #0 {
				; CHECK-LABEL: test_limit_one_pred:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: pushq %r15
				; CHECK-NEXT: .cfi_def_cfa_offset 16
				; CHECK-NEXT: pushq %r14
				; CHECK-NEXT: .cfi_def_cfa_offset 24
				; CHECK-NEXT: pushq %r12
				; CHECK-NEXT: .cfi_def_cfa_offset 32
				; CHECK-NEXT: pushq %rbx
				; CHECK-NEXT: .cfi_def_cfa_offset 40
				; CHECK-NEXT: pushq %rax
				; CHECK-NEXT: .cfi_def_cfa_offset 48
				; CHECK-NEXT: .cfi_offset %rbx, -40
				; CHECK-NEXT: .cfi_offset %r12, -32
				; CHECK-NEXT: .cfi_offset %r14, -24
				; CHECK-NEXT: .cfi_offset %r15, -16
				; CHECK-NEXT: movq %r8, %r12
				; CHECK-NEXT: movq %rcx, %r15
				; CHECK-NEXT: movq %rsi, %r14
				; CHECK-NEXT: movq %rdi, %rbx
				; CHECK-NEXT: movl %r9d, 12(%rbx)
				; CHECK-NEXT: cmpl $18, %edx
				; CHECK-NEXT: jl .LBB10_2
				; CHECK-NEXT: # %bb.1: # %if.then
				; CHECK-NEXT: movl %edx, 4(%rbx)
				; CHECK-NEXT: movq %rbx, %rdi
				; CHECK-NEXT: callq bar
				; CHECK-NEXT: .LBB10_2: # %if.end
				; CHECK-NEXT: movups (%r12), %xmm0
				; CHECK-NEXT: movups %xmm0, (%r15)
				; CHECK-NEXT: movq (%rbx), %rax
				; CHECK-NEXT: movq %rax, (%r14)
				; CHECK-NEXT: movl 8(%rbx), %eax
				; CHECK-NEXT: movl %eax, 8(%r14)
				; CHECK-NEXT: movl 12(%rbx), %eax
				; CHECK-NEXT: movl %eax, 12(%r14)
				; CHECK-NEXT: addq $8, %rsp
				; CHECK-NEXT: popq %rbx
				; CHECK-NEXT: popq %r12
				; CHECK-NEXT: popq %r14
				; CHECK-NEXT: popq %r15
				; CHECK-NEXT: retq
				;
				; DISABLED-LABEL: test_limit_one_pred:
				; DISABLED: # %bb.0: # %entry
				; DISABLED-NEXT: pushq %r15
				; DISABLED-NEXT: .cfi_def_cfa_offset 16
				; DISABLED-NEXT: pushq %r14
				; DISABLED-NEXT: .cfi_def_cfa_offset 24
				; DISABLED-NEXT: pushq %r12
				; DISABLED-NEXT: .cfi_def_cfa_offset 32
				; DISABLED-NEXT: pushq %rbx
				; DISABLED-NEXT: .cfi_def_cfa_offset 40
				; DISABLED-NEXT: pushq %rax
				; DISABLED-NEXT: .cfi_def_cfa_offset 48
				; DISABLED-NEXT: .cfi_offset %rbx, -40
				; DISABLED-NEXT: .cfi_offset %r12, -32
				; DISABLED-NEXT: .cfi_offset %r14, -24
				; DISABLED-NEXT: .cfi_offset %r15, -16
				; DISABLED-NEXT: movq %r8, %r15
				; DISABLED-NEXT: movq %rcx, %r14
				; DISABLED-NEXT: movq %rsi, %r12
				; DISABLED-NEXT: movq %rdi, %rbx
				; DISABLED-NEXT: movl %r9d, 12(%rbx)
				; DISABLED-NEXT: cmpl $18, %edx
				; DISABLED-NEXT: jl .LBB10_2
				; DISABLED-NEXT: # %bb.1: # %if.then
				; DISABLED-NEXT: movl %edx, 4(%rbx)
				; DISABLED-NEXT: movq %rbx, %rdi
				; DISABLED-NEXT: callq bar
				; DISABLED-NEXT: .LBB10_2: # %if.end
				; DISABLED-NEXT: movups (%r15), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%r14)
				; DISABLED-NEXT: movups (%rbx), %xmm0
				; DISABLED-NEXT: movups %xmm0, (%r12)
				; DISABLED-NEXT: addq $8, %rsp
				; DISABLED-NEXT: popq %rbx
				; DISABLED-NEXT: popq %r12
				; DISABLED-NEXT: popq %r14
				; DISABLED-NEXT: popq %r15
				; DISABLED-NEXT: retq
				;
				; CHECK-AVX2-LABEL: test_limit_one_pred:
				; CHECK-AVX2: # %bb.0: # %entry
				; CHECK-AVX2-NEXT: pushq %r15
				; CHECK-AVX2-NEXT: .cfi_def_cfa_offset 16
				; CHECK-AVX2-NEXT: pushq %r14
				; CHECK-AVX2-NEXT: .cfi_def_cfa_offset 24
				; CHECK-AVX2-NEXT: pushq %r12
				; CHECK-AVX2-NEXT: .cfi_def_cfa_offset 32
				; CHECK-AVX2-NEXT: pushq %rbx
				; CHECK-AVX2-NEXT: .cfi_def_cfa_offset 40
				; CHECK-AVX2-NEXT: pushq %rax
				; CHECK-AVX2-NEXT: .cfi_def_cfa_offset 48
				; CHECK-AVX2-NEXT: .cfi_offset %rbx, -40
				; CHECK-AVX2-NEXT: .cfi_offset %r12, -32
				; CHECK-AVX2-NEXT: .cfi_offset %r14, -24
				; CHECK-AVX2-NEXT: .cfi_offset %r15, -16
				; CHECK-AVX2-NEXT: movq %r8, %r12
				; CHECK-AVX2-NEXT: movq %rcx, %r15
				; CHECK-AVX2-NEXT: movq %rsi, %r14
				; CHECK-AVX2-NEXT: movq %rdi, %rbx
				; CHECK-AVX2-NEXT: movl %r9d, 12(%rbx)
				; CHECK-AVX2-NEXT: cmpl $18, %edx
				; CHECK-AVX2-NEXT: jl .LBB10_2
				; CHECK-AVX2-NEXT: # %bb.1: # %if.then
				; CHECK-AVX2-NEXT: movl %edx, 4(%rbx)
				; CHECK-AVX2-NEXT: movq %rbx, %rdi
				; CHECK-AVX2-NEXT: callq bar
				; CHECK-AVX2-NEXT: .LBB10_2: # %if.end
				; CHECK-AVX2-NEXT: vmovups (%r12), %xmm0
				; CHECK-AVX2-NEXT: vmovups %xmm0, (%r15)
				; CHECK-AVX2-NEXT: movq (%rbx), %rax
				; CHECK-AVX2-NEXT: movq %rax, (%r14)
				; CHECK-AVX2-NEXT: movl 8(%rbx), %eax
				; CHECK-AVX2-NEXT: movl %eax, 8(%r14)
				; CHECK-AVX2-NEXT: movl 12(%rbx), %eax
				; CHECK-AVX2-NEXT: movl %eax, 12(%r14)
				; CHECK-AVX2-NEXT: addq $8, %rsp
				; CHECK-AVX2-NEXT: popq %rbx
				; CHECK-AVX2-NEXT: popq %r12
				; CHECK-AVX2-NEXT: popq %r14
				; CHECK-AVX2-NEXT: popq %r15
				; CHECK-AVX2-NEXT: retq
				;
				; CHECK-AVX512-LABEL: test_limit_one_pred:
				; CHECK-AVX512: # %bb.0: # %entry
				; CHECK-AVX512-NEXT: pushq %r15
				; CHECK-AVX512-NEXT: .cfi_def_cfa_offset 16
				; CHECK-AVX512-NEXT: pushq %r14
				; CHECK-AVX512-NEXT: .cfi_def_cfa_offset 24
				; CHECK-AVX512-NEXT: pushq %r12
				; CHECK-AVX512-NEXT: .cfi_def_cfa_offset 32
				; CHECK-AVX512-NEXT: pushq %rbx
				; CHECK-AVX512-NEXT: .cfi_def_cfa_offset 40
				; CHECK-AVX512-NEXT: pushq %rax
				; CHECK-AVX512-NEXT: .cfi_def_cfa_offset 48
				; CHECK-AVX512-NEXT: .cfi_offset %rbx, -40
				; CHECK-AVX512-NEXT: .cfi_offset %r12, -32
				; CHECK-AVX512-NEXT: .cfi_offset %r14, -24
				; CHECK-AVX512-NEXT: .cfi_offset %r15, -16
				; CHECK-AVX512-NEXT: movq %r8, %r12
				; CHECK-AVX512-NEXT: movq %rcx, %r15
				; CHECK-AVX512-NEXT: movq %rsi, %r14
				; CHECK-AVX512-NEXT: movq %rdi, %rbx
				; CHECK-AVX512-NEXT: movl %r9d, 12(%rbx)
				; CHECK-AVX512-NEXT: cmpl $18, %edx
				; CHECK-AVX512-NEXT: jl .LBB10_2
				; CHECK-AVX512-NEXT: # %bb.1: # %if.then
				; CHECK-AVX512-NEXT: movl %edx, 4(%rbx)
				; CHECK-AVX512-NEXT: movq %rbx, %rdi
				; CHECK-AVX512-NEXT: callq bar
				; CHECK-AVX512-NEXT: .LBB10_2: # %if.end
				; CHECK-AVX512-NEXT: vmovups (%r12), %xmm0
				; CHECK-AVX512-NEXT: vmovups %xmm0, (%r15)
				; CHECK-AVX512-NEXT: movq (%rbx), %rax
				; CHECK-AVX512-NEXT: movq %rax, (%r14)
				; CHECK-AVX512-NEXT: movl 8(%rbx), %eax
				; CHECK-AVX512-NEXT: movl %eax, 8(%r14)
				; CHECK-AVX512-NEXT: movl 12(%rbx), %eax
				; CHECK-AVX512-NEXT: movl %eax, 12(%r14)
				; CHECK-AVX512-NEXT: addq $8, %rsp
				; CHECK-AVX512-NEXT: popq %rbx
				; CHECK-AVX512-NEXT: popq %r12
				; CHECK-AVX512-NEXT: popq %r14
				; CHECK-AVX512-NEXT: popq %r15
				; CHECK-AVX512-NEXT: retq
				entry:
				%d = getelementptr inbounds %struct.S, %struct.S* %s1, i64 0, i32 3
				store i32 %x2, i32* %d, align 4
				%cmp = icmp sgt i32 %x, 17
				br i1 %cmp, label %if.then, label %if.end

				if.then: ; preds = %entry
				%b = getelementptr inbounds %struct.S, %struct.S* %s1, i64 0, i32 1
				store i32 %x, i32* %b, align 4
				tail call void @bar(%struct.S* nonnull %s1) #3
				br label %if.end

				if.end: ; preds = %if.then, %entry
				%0 = bitcast %struct.S* %s3 to i8*
				%1 = bitcast %struct.S* %s4 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 16, i32 4, i1 false)
				%2 = bitcast %struct.S* %s2 to i8*
				%3 = bitcast %struct.S* %s1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 16, i32 4, i1 false)
				ret void
				}


				declare void @bar(%struct.S*) local_unnamed_addr #1


				; Function Attrs: argmemonly nounwind
				declare void @llvm.memcpy.p0i8.p0i8.i64(i8* nocapture writeonly, i8* nocapture readonly, i64, i32, i1) #1

				attributes #0 = { nounwind uwtable "target-cpu"="x86-64" }

				%struct.S7 = type { float, float, float, float, float, float, float, float }

				; Function Attrs: nounwind uwtable
				define void @test_conditional_block_float(%struct.S7* nocapture %s1, %struct.S7* nocapture %s2, i32 %x, %struct.S7* nocapture %s3, %struct.S7* nocapture readonly %s4, float %y) local_unnamed_addr #0 {
				; CHECK-LABEL: test_conditional_block_float:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: cmpl $18, %edx
				; CHECK-NEXT: jl .LBB11_2
				; CHECK-NEXT: # %bb.1: # %if.then
				; CHECK-NEXT: movl $1065353216, 4(%rdi) # imm = 0x3F800000
				; CHECK-NEXT: .LBB11_2: # %if.end
				; CHECK-NEXT: movups (%r8), %xmm0
				; CHECK-NEXT: movups 16(%r8), %xmm1
				; CHECK-NEXT: movups %xmm1, 16(%rcx)
				; CHECK-NEXT: movups %xmm0, (%rcx)
				; CHECK-NEXT: movl (%rdi), %eax
				; CHECK-NEXT: movl 4(%rdi), %ecx
				; CHECK-NEXT: movq 8(%rdi), %rdx
				; CHECK-NEXT: movups 16(%rdi), %xmm0
				; CHECK-NEXT: movups %xmm0, 16(%rsi)
				; CHECK-NEXT: movl %eax, (%rsi)
				; CHECK-NEXT: movl %ecx, 4(%rsi)
				; CHECK-NEXT: movq %rdx, 8(%rsi)
				; CHECK-NEXT: retq
				;
				; DISABLED-LABEL: test_conditional_block_float:
				; DISABLED: # %bb.0: # %entry
				; DISABLED-NEXT: cmpl $18, %edx
				; DISABLED-NEXT: jl .LBB11_2
				; DISABLED-NEXT: # %bb.1: # %if.then
				; DISABLED-NEXT: movl $1065353216, 4(%rdi) # imm = 0x3F800000
				; DISABLED-NEXT: .LBB11_2: # %if.end
				; DISABLED-NEXT: movups (%r8), %xmm0
				; DISABLED-NEXT: movups 16(%r8), %xmm1
				; DISABLED-NEXT: movups %xmm1, 16(%rcx)
				; DISABLED-NEXT: movups %xmm0, (%rcx)
				; DISABLED-NEXT: movups (%rdi), %xmm0
				; DISABLED-NEXT: movups 16(%rdi), %xmm1
				; DISABLED-NEXT: movups %xmm1, 16(%rsi)
				; DISABLED-NEXT: movups %xmm0, (%rsi)
				; DISABLED-NEXT: retq
				;
				; CHECK-AVX2-LABEL: test_conditional_block_float:
				; CHECK-AVX2: # %bb.0: # %entry
				; CHECK-AVX2-NEXT: cmpl $18, %edx
				; CHECK-AVX2-NEXT: jl .LBB11_2
				; CHECK-AVX2-NEXT: # %bb.1: # %if.then
				; CHECK-AVX2-NEXT: movl $1065353216, 4(%rdi) # imm = 0x3F800000
				; CHECK-AVX2-NEXT: .LBB11_2: # %if.end
				; CHECK-AVX2-NEXT: vmovups (%r8), %ymm0
				; CHECK-AVX2-NEXT: vmovups %ymm0, (%rcx)
				; CHECK-AVX2-NEXT: movl (%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, (%rsi)
				; CHECK-AVX2-NEXT: movl 4(%rdi), %eax
				; CHECK-AVX2-NEXT: movl %eax, 4(%rsi)
				; CHECK-AVX2-NEXT: vmovups 8(%rdi), %xmm0
				; CHECK-AVX2-NEXT: vmovups %xmm0, 8(%rsi)
				; CHECK-AVX2-NEXT: movq 24(%rdi), %rax
				; CHECK-AVX2-NEXT: movq %rax, 24(%rsi)
				; CHECK-AVX2-NEXT: vzeroupper
				; CHECK-AVX2-NEXT: retq
				;
				; CHECK-AVX512-LABEL: test_conditional_block_float:
				; CHECK-AVX512: # %bb.0: # %entry
				; CHECK-AVX512-NEXT: cmpl $18, %edx
				; CHECK-AVX512-NEXT: jl .LBB11_2
				; CHECK-AVX512-NEXT: # %bb.1: # %if.then
				; CHECK-AVX512-NEXT: movl $1065353216, 4(%rdi) # imm = 0x3F800000
				; CHECK-AVX512-NEXT: .LBB11_2: # %if.end
				; CHECK-AVX512-NEXT: vmovups (%r8), %ymm0
				; CHECK-AVX512-NEXT: vmovups %ymm0, (%rcx)
				; CHECK-AVX512-NEXT: movl (%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, (%rsi)
				; CHECK-AVX512-NEXT: movl 4(%rdi), %eax
				; CHECK-AVX512-NEXT: movl %eax, 4(%rsi)
				; CHECK-AVX512-NEXT: vmovups 8(%rdi), %xmm0
				; CHECK-AVX512-NEXT: vmovups %xmm0, 8(%rsi)
				; CHECK-AVX512-NEXT: movq 24(%rdi), %rax
				; CHECK-AVX512-NEXT: movq %rax, 24(%rsi)
				; CHECK-AVX512-NEXT: vzeroupper
				; CHECK-AVX512-NEXT: retq
				entry:
				%cmp = icmp sgt i32 %x, 17
				br i1 %cmp, label %if.then, label %if.end

				if.then: ; preds = %entry
				%b = getelementptr inbounds %struct.S7, %struct.S7* %s1, i64 0, i32 1
				store float 1.0, float* %b, align 4
				br label %if.end

				if.end: ; preds = %if.then, %entry
				%0 = bitcast %struct.S7* %s3 to i8*
				%1 = bitcast %struct.S7* %s4 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 32, i32 4, i1 false)
				%2 = bitcast %struct.S7* %s2 to i8*
				%3 = bitcast %struct.S7* %s1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 32, i32 4, i1 false)
				ret void
				}

				%struct.S8 = type { i64, i64, i64, i64, i64, i64 }

				; Function Attrs: nounwind uwtable
				define void @test_conditional_block_ymm(%struct.S8* nocapture %s1, %struct.S8* nocapture %s2, i32 %x, %struct.S8* nocapture %s3, %struct.S8* nocapture readonly %s4) local_unnamed_addr #0 {
				; CHECK-LABEL: test_conditional_block_ymm:
				; CHECK: # %bb.0: # %entry
				; CHECK-NEXT: cmpl $18, %edx
				; CHECK-NEXT: jl .LBB12_2
				; CHECK-NEXT: # %bb.1: # %if.then
				; CHECK-NEXT: movq $1, 8(%rdi)
				; CHECK-NEXT: .LBB12_2: # %if.end
				; CHECK-NEXT: movups (%r8), %xmm0
				; CHECK-NEXT: movups 16(%r8), %xmm1
				; CHECK-NEXT: movups %xmm1, 16(%rcx)
				; CHECK-NEXT: movups %xmm0, (%rcx)
				; CHECK-NEXT: movq (%rdi), %rax
				; CHECK-NEXT: movq 8(%rdi), %rcx
				; CHECK-NEXT: movups 16(%rdi), %xmm0
				; CHECK-NEXT: movups %xmm0, 16(%rsi)
				; CHECK-NEXT: movq %rax, (%rsi)
				; CHECK-NEXT: movq %rcx, 8(%rsi)
				; CHECK-NEXT: retq
				;
				; DISABLED-LABEL: test_conditional_block_ymm:
				; DISABLED: # %bb.0: # %entry
				; DISABLED-NEXT: cmpl $18, %edx
				; DISABLED-NEXT: jl .LBB12_2
				; DISABLED-NEXT: # %bb.1: # %if.then
				; DISABLED-NEXT: movq $1, 8(%rdi)
				; DISABLED-NEXT: .LBB12_2: # %if.end
				; DISABLED-NEXT: movups (%r8), %xmm0
				; DISABLED-NEXT: movups 16(%r8), %xmm1
				; DISABLED-NEXT: movups %xmm1, 16(%rcx)
				; DISABLED-NEXT: movups %xmm0, (%rcx)
				; DISABLED-NEXT: movups (%rdi), %xmm0
				; DISABLED-NEXT: movups 16(%rdi), %xmm1
				; DISABLED-NEXT: movups %xmm1, 16(%rsi)
				; DISABLED-NEXT: movups %xmm0, (%rsi)
				; DISABLED-NEXT: retq
				;
				; CHECK-AVX2-LABEL: test_conditional_block_ymm:
				; CHECK-AVX2: # %bb.0: # %entry
				; CHECK-AVX2-NEXT: cmpl $18, %edx
				; CHECK-AVX2-NEXT: jl .LBB12_2
				; CHECK-AVX2-NEXT: # %bb.1: # %if.then
				; CHECK-AVX2-NEXT: movq $1, 8(%rdi)
				; CHECK-AVX2-NEXT: .LBB12_2: # %if.end
				; CHECK-AVX2-NEXT: vmovups (%r8), %ymm0
				; CHECK-AVX2-NEXT: vmovups %ymm0, (%rcx)
				; CHECK-AVX2-NEXT: movq (%rdi), %rax
				; CHECK-AVX2-NEXT: movq %rax, (%rsi)
				; CHECK-AVX2-NEXT: movq 8(%rdi), %rax
				; CHECK-AVX2-NEXT: movq %rax, 8(%rsi)
				; CHECK-AVX2-NEXT: vmovups 16(%rdi), %xmm0
				; CHECK-AVX2-NEXT: vmovups %xmm0, 16(%rsi)
				; CHECK-AVX2-NEXT: vzeroupper
				; CHECK-AVX2-NEXT: retq
				;
				; CHECK-AVX512-LABEL: test_conditional_block_ymm:
				; CHECK-AVX512: # %bb.0: # %entry
				; CHECK-AVX512-NEXT: cmpl $18, %edx
				; CHECK-AVX512-NEXT: jl .LBB12_2
				; CHECK-AVX512-NEXT: # %bb.1: # %if.then
				; CHECK-AVX512-NEXT: movq $1, 8(%rdi)
				; CHECK-AVX512-NEXT: .LBB12_2: # %if.end
				; CHECK-AVX512-NEXT: vmovups (%r8), %ymm0
				; CHECK-AVX512-NEXT: vmovups %ymm0, (%rcx)
				; CHECK-AVX512-NEXT: movq (%rdi), %rax
				; CHECK-AVX512-NEXT: movq %rax, (%rsi)
				; CHECK-AVX512-NEXT: movq 8(%rdi), %rax
				; CHECK-AVX512-NEXT: movq %rax, 8(%rsi)
				; CHECK-AVX512-NEXT: vmovups 16(%rdi), %xmm0
				; CHECK-AVX512-NEXT: vmovups %xmm0, 16(%rsi)
				; CHECK-AVX512-NEXT: vzeroupper
				; CHECK-AVX512-NEXT: retq
				entry:
				%cmp = icmp sgt i32 %x, 17
				br i1 %cmp, label %if.then, label %if.end

				if.then: ; preds = %entry
				%b = getelementptr inbounds %struct.S8, %struct.S8* %s1, i64 0, i32 1
				store i64 1, i64* %b, align 4
				br label %if.end

				if.end: ; preds = %if.then, %entry
				%0 = bitcast %struct.S8* %s3 to i8*
				%1 = bitcast %struct.S8* %s4 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 32, i32 4, i1 false)
				%2 = bitcast %struct.S8* %s2 to i8*
				%3 = bitcast %struct.S8* %s1 to i8*
				tail call void @llvm.memcpy.p0i8.p0i8.i64(i8* %2, i8* %3, i64 32, i32 4, i1 false)
				ret void
				}