This is an archive of the discontinued LLVM Phabricator instance.

Revert the revert of vectorization commits
Needs Review · Public

Authored by george.karpenkov on Feb 6 2020, 5:21 PM.

Details

Summary

Revert of https://github.com/llvm/llvm-project/commit/bfa32573bf2d0ab587f9a5d933ea2144a382cf3c

The vectorization introduced by the commit causes a miscompile.

This is the input LLVM IR: https://gist.github.com/cheshire/9dff84d8fbb83278736278854c746d8d
This is IR with the mentioned commit reverted: https://gist.github.com/cheshire/fb06443834735c2fa04fa0eacb288606 (produces correct results)
This is IR with ToT LLVM: https://gist.github.com/cheshire/d0b4195b1be344f52d77b98c70b5241d

If just a diff in IR is not enough, the IR dump is actually runnable.

To run it, compile the standalone (no dependencies, single file) driver tool at https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/xla/tools/driver.cc and link it with the good/bad IR versions above. Then run it with a single argument: the name of a file containing the buffer assignment description (contents at https://gist.github.com/cheshire/f929aa118978f4bcad03a3da11209d27).
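
For reference, the steps might look roughly like this (file names are placeholders for the gists above, and driver.cc may need extra flags depending on your toolchain; clang accepts .ll files directly as inputs):

  clang++ -O2 driver.cc good.ll -o driver    # or bad.ll, the miscompiled version
  ./driver buffer_assignment.txt             # prints the "Output: (...)" block shown below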

With the vectorization reverted, the driver produces this output:

Output:
(
1.15675, 1.04532, 1.1549, 1.04836, 0.947364, 1.04667, 1.04491, 0.944249, 1.04323, 0.94699, 0.855764, 0.945469, 1.1536, 1.04247, 1.15175, 1.0455, 0.944784, 1.04382
,
1.15675, 1.04532, 1.1549, 1.04836, 0.947364, 1.04667, 1.04491, 0.944249, 1.04323, 0.94699, 0.855764, 0.945469, 1.1536, 1.04247, 1.15175, 1.0455, 0.944784, 1.04382
)

Without the revert, the output is different:

Output:
(
1.15675, 1.04532, 1.1549, 1.04836, 0.947364, 1.04667, 1.04491, 0.944249, 1.04323, 0.94699, 0.855764, 0.945469, 1.1536, 1.04247, 1.15175, 1.0455, 0.944784, 1.04382
,
1.15675, 1.10195, 1.1549, 1.04836, 0.947364, 1.04667, 1.04491, 0.899154, 1.04323, 0.94699, 0.855764, 0.945469, 1.1536, 1.09595, 1.15175, 1.0455, 0.944784, 1.04382
)

(note, for example, the 2nd, 8th, and 14th numbers in the second row)

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · Feb 6 2020, 5:21 PM
ABataev requested changes to this revision. · Feb 7 2020, 7:53 AM

I tried to compare the LLVM IR and did not find incorrect optimizations. Honestly, it is very hard to compare such big files; it would be good if you could try to generate a smaller reproducer. Also, reverting is not productive; it would be better to try to find a proper fix.

This revision now requires changes to proceed. · Feb 7 2020, 7:53 AM

@ABataev Please check out the runnable reproducer above. It clearly demonstrates a miscompile: if the pass is correct, the same IR input should give identical output (barring undefined behavior, which we check for with sanitizers).

Also, reverting is not productive; it would be better to try to find a proper fix.

Miscompiles should be reverted based on the incorrect behavior alone.

@ABataev Please check out the runnable reproducer above. It clearly demonstrates a miscompile: if the pass is correct, the same IR input should give identical output (barring undefined behavior, which we check for with sanitizers).

The runnable reproducer requires some prerequisites. It would be good if you could provide a simpler one.

Also, reverting is not productive; it would be better to try to find a proper fix.

Miscompiles should be reverted based on the incorrect behavior alone.

I'm not sure that this patch is the real cause of the problem you have. Maybe it triggers some other issues in the backend, but it is very hard to say at the moment whether that is the case.

The runnable reproducer requires some prerequisites. It would be good if you could provide a simpler one.

I understand, but we have actually already spent a very large amount of time reducing this.

To make sure we are on the same page: what do you mean by prerequisites? To run the reproducer, you only need to wget a single C++ file with no dependencies and link it against the IR dump.

The runnable reproducer requires some prerequisites. It would be good if you could provide a simpler one.

I understand, but we have actually already spent a very large amount of time reducing this.

To make sure we are on the same page: what do you mean by prerequisites? To run the reproducer, you only need to wget a single C++ file with no dependencies and link it against the IR dump.

Still, in its current form it won't help. As I already said, I don't see a real problem in the LLVM IR files. Maybe this optimization just triggers some other dangerous transformations. We need to investigate this thoroughly to understand what the real cause of the problem is.

george.karpenkov requested review of this revision. · Feb 7 2020, 3:08 PM

I did manage to reduce the test case further.

For input IR: https://gist.github.com/cheshire/17067d5ba4781817861c8b21d15c928d

Bad optimized version: https://gist.github.com/cheshire/bf1047b4385bcf82c22a70f5cf1fb5df

Good optimized version: https://gist.github.com/cheshire/8bea1f36ab849f8945bc190b519272a6

The compilation comes from an XLA test case that looks like this:

HloModule EntryModule

ENTRY EntryModule {
  %input0 = f64[] parameter(0)
  %sign_227 = f64[] sign(f64[] %input0)
  %multiply_235 = f64[] multiply(f64[] %sign_227, f64[] %sign_227)

  %p4 = f64[2,2] broadcast(%input0), dimensions={}
  %dot_81 = f64[2,2] dot(f64[2,2] %p4, f64[2,2] %p4), lhs_contracting_dims={1}, rhs_contracting_dims={1}
  %br2 = f64[2,2] broadcast(f64[] %multiply_235), dimensions={}
  %reshape_294 = f64[2,2] multiply(f64[2,2] %dot_81, f64[2,2] br2)

  %broadcast_298 = f64[2,3,2,3] broadcast(f64[2,2] %reshape_294), dimensions={0,2}

  %arg7_8 = f64[3,3] parameter(1)
  %broadcast_300 = f64[2,3,2,3] broadcast(f64[3,3] %arg7_8), dimensions={1,3}
  %multiply_301 = f64[2,3,2,3] multiply(f64[2,3,2,3] %broadcast_298, f64[2,3,2,3] %broadcast_300)

  %reshape_302 = f64[6,6] reshape(f64[2,3,2,3] %multiply_301)
  %zero = f64[] constant(0)
  %zeros = f64[6,6] broadcast(f64[] %zero), dimensions={}

  %diag = pred[6,6] constant({{1,0,0,0,0,0}, {0,1,0,0,0,0}, {0,0,1,0,0,0}, {0,0,0,1,0,0}, {0,0,0,0,1,0}, {0,0,0,0,0,1}})
  ROOT %select_316 = f64[6,6] select(pred[6,6] %diag, f64[6,6] %reshape_302, f64[6,6] %zeros)
}

Essentially, it performs some element-wise multiplications on random input floats, and then replaces all non-diagonal entries with zeros.
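
For readers less comfortable with HLO, here is a rough NumPy sketch of the same computation (my own rendering for illustration, not XLA-generated code; input0 is a scalar, arg7_8 is a 3x3 array):

  import numpy as np

  def entry_module(input0, arg7_8):
      m = np.sign(input0) * np.sign(input0)               # sign_227 squared
      p4 = np.full((2, 2), input0)                        # broadcast of input0
      dot = p4 @ p4.T                                     # dot_81: contract dim 1 of both operands
      r294 = dot * m                                      # reshape_294
      b298 = np.broadcast_to(r294[:, None, :, None], (2, 3, 2, 3))     # dimensions={0,2}
      b300 = np.broadcast_to(arg7_8[None, :, None, :], (2, 3, 2, 3))   # dimensions={1,3}
      r302 = (b298 * b300).reshape(6, 6)                  # reshape_302
      return np.where(np.eye(6, dtype=bool), r302, 0.0)   # select: keep only the diagonal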

Difference in output:

Expected literal:
f64[6,6] {
  { 0.000275566374291495, 0, 0, 0, 0, 0 },
  { 0, 0.00040918878254846619, 0, 0, 0, 0 },
  { 0, 0, -0.00058600272184509581, 0, 0, 0 },
  { 0, 0, 0, 0.000275566374291495, 0, 0 },
  { 0, 0, 0, 0, 0.00040918878254846619, 0 },
  { 0, 0, 0, 0, 0, -0.00058600272184509581 }
}

Actual literal:
f64[6,6] {
  { 0.000275566374291495, 0, 0, 0, 0, 0 },
  { 0, 1, 0, 0, 0, 0 },
  { 0, 0, -0.00058600272184509581, 0, 0, 0 },
  { 0, 0, 0, 0.000275566374291495, 0, 0 },
  { 0, 0, 0, 0, 0.00040918878254846619, 0 },
  { 0, 0, 0, 0, 0, -0.00058600272184509581 }
}
1/1 runs miscompared.

This is not something that could be caused by fast math: in the bad version, a number produced by element-wise multiplication of random input floats is exactly "1".

I find this instruction suspicious: https://gist.github.com/cheshire/bf1047b4385bcf82c22a70f5cf1fb5df#file-bad_input_ir_opt-L29

Why is the result of the comparison cast to double and then inserted into the vector?

Thanks, will check this.

Thanks, will check this.

Should we revert the commit while you investigate?

Thanks, will check this.

Should we revert the commit while you investigate?

The problem here is that this patch does not introduce new vectorization; instead, it just triggers the existing vectorization in more cases. If there is a bug in the vectorizer, this patch merely reveals it, it does not introduce it.

Thanks, will check this.

Should we revert the commit while you investigate?

The problem here is that this patch does not introduce new vectorization; instead, it just triggers the existing vectorization in more cases. If there is a bug in the vectorizer, this patch merely reveals it, it does not introduce it.

Even if we believe that this commit only reveals a bug (and is not itself buggy), I think the correct action is to revert. I have done this in the past with my own changes.

Reverting is not an indictment of the quality of the change being reverted; it just keeps trunk green and helps us all sleep better.

The problem here is that this patch does not introduce new vectorization; instead, it just triggers the existing vectorization in more cases. If there is a bug in the vectorizer, this patch merely reveals it, it does not introduce it.

As Sanjoy mentioned above, we would still have to revert in this case, as it is the only sustainable option to deal with miscompiles.

The problem here is that this patch does not introduce new vectorization; instead, it just triggers the existing vectorization in more cases. If there is a bug in the vectorizer, this patch merely reveals it, it does not introduce it.

As Sanjoy mentioned above, we would still have to revert in this case, as it is the only sustainable option to deal with miscompiles.

Even if it is not the real cause of the problem? We need to investigate first to be sure that the vectorizer is the real cause.

Even if it is not the real cause of the problem?

Yes, of course. Many revisions are rolled back like this. It would be completely unsustainable to try to find the "real" culprit for every breaking change, and it is often not even clear where the "real" blame lies. Taking this to an extreme: a commit to the compiler that breaks all executables by triggering a CPU bug should still be rolled back, even if it perfectly fits the compiler contract.

Even if it is not the real cause of the problem?

Yes, of course. Many revisions are rolled back like this. It would be completely unsustainable to try to find the "real" culprit for every breaking change, and it is often not even clear where the "real" blame lies. Taking this to an extreme: a commit to the compiler that breaks all executables by triggering a CPU bug should still be rolled back, even if it perfectly fits the compiler contract.

I can investigate it thoroughly on Monday. Give me until Monday to investigate, and after that I will revert it myself if no proper fix is found.

Hmm, I tried this reproducer with trunk and it generates the same LLVM IR as your correct variant. Could you try to check it using trunk? Maybe the bug has already been fixed. The conversion of the fcmp result to double also looks suspicious to me, but that looks like an instruction combiner bug, not a vectorizer one; the vectorizer does not do this kind of transformation on the IR. Meanwhile, I'll try to investigate your original reproducer; maybe it is also fixed already.

@ABataev Thanks for looking into this! You are right: when I try the opt tool, I also get the same result. Yet somehow, inside TensorFlow, rolling back this revision does fix the miscompile.
Do you think you could give any advice on how to narrow it down? Maybe places to print the IR before and after the vectorization passes?

@ABataev Thanks for looking into this! You are right: when I try the opt tool, I also get the same result. Yet somehow, inside TensorFlow, rolling back this revision does fix the miscompile.
Do you think you could give any advice on how to narrow it down? Maybe places to print the IR before and after the vectorization passes?

There is an option in LLVM that lets you limit the number of passes that run, to find the pass that causes the miscompilation. I can't recall the option right now; I will tell you tomorrow.

@ABataev Thanks for looking into this! You are right: when I try the opt tool, I also get the same result. Yet somehow, inside TensorFlow, rolling back this revision does fix the miscompile.
Do you think you could give any advice on how to narrow it down? Maybe places to print the IR before and after the vectorization passes?

There is an option in LLVM that lets you limit the number of passes that run, to find the pass that causes the miscompilation. I can't recall the option right now; I will tell you tomorrow.

Yep, found it. Here it is: https://llvm.org/docs/OptBisect.html. You can try to bisect the optimization passes and find the wrong one.
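
For example, assuming the reduced input IR is saved as input.ll (the exact pipeline flags depend on how XLA invokes the optimizer, so treat this as a sketch):

  opt -O2 -opt-bisect-limit=100 -S input.ll -o out.ll

Passes numbered above the limit are skipped, so bisecting on the limit value and re-running the reproducer pinpoints the first pass whose inclusion produces the miscompile.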

Has a bug tracking this issue already been filed? (Mark it as blocking release-10.0.0.)

@lebedev.ri Thanks! No, I haven't filed the issue yet. The current blocker is that I see the regression when compiling with the JIT, but not with opt. I think I need to figure out the exact set of flags to give opt to reproduce the behavior.

llvm/test/Transforms/SLPVectorizer/X86/different-vec-widths.ll