
[VectorCombine] try to create vector loads from scalar loads
ClosedPublic

Authored by spatel on Jun 12 2020, 2:29 PM.

Details

Summary

This should allow this pass or others to fold subsequent scalar ops together more easily because they will see extractelement ops from a single vector rather than incomplete parts of that vector.

Currently, this transform makes no overall difference for the most basic patterns because the backend (DAGCombiner) will narrow the loads back down to scalars via narrowExtractedVectorLoad().

For now, we do not get any differences for scalar integer loads because those extracts are not free. We will need to match larger patterns and/or adjust the cost equation to allow that.

Diff Detail

Event Timeline

spatel created this revision.Jun 12 2020, 2:29 PM
Herald added a project: Restricted Project.Jun 12 2020, 2:29 PM

lebedev.ri added inline comments.

Some thoughts.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
57

I'm not sure why we'd care whether the load is of a bitcast.
Why are we using the bitcast's source type as the source of truth?
Are we trying to avoid introducing some cache issues?

I'd think we should instead assess (check, brute-force) each possible wider
load type: first check the cost, then isSafeToLoadUnconditionally().

101

Shouldn't this be +=?

120–121

++NumLoadsVectorized;

nikic added a subscriber: nikic.Jun 13 2020, 1:00 PM

To add to what @lebedev.ri said, this patch violates the opaque pointer model towards which LLVM is migrating. Pointer element types are not allowed to influence optimization behavior.

lebedev.ri requested changes to this revision.Jun 14 2020, 1:38 AM

To add to what @lebedev.ri said, this patch violates the opaque pointer model towards which LLVM is migrating. Pointer element types are not allowed to influence optimization behavior.

Ah, good point, too.

This revision now requires changes to proceed.Jun 14 2020, 1:38 AM

To add to what @lebedev.ri said, this patch violates the opaque pointer model towards which LLVM is migrating. Pointer element types are not allowed to influence optimization behavior.

Ok, let me see if I can rework this using just the cost model. This patch started within InstCombine, so we didn't have access to costs and didn't want to do the transform too loosely. The bitcast was used as a proxy for "cost effective" - it indicated that either the original code or the vectorizers had validated the vector type as a legitimate type for the target.

spatel updated this revision to Diff 280132.Thu, Jul 23, 8:13 AM
spatel retitled this revision from [VectorCombine] try to create vector loads from bitcasted scalar pointers to [VectorCombine] try to create vector loads from scalar loads.
spatel edited the summary of this revision. (Show Details)
spatel marked 3 inline comments as done.Thu, Jul 23, 8:16 AM

Patch updated - this uses the target, cost model, and load attributes only (not pointer types/casts) to decide if we can create a vector load.

nikic added inline comments.Thu, Jul 23, 2:26 PM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
86

Will other middle-end passes be able to handle this equally well? I don't have anything specific in mind here, but I would suspect that some passes can deal with a plain load better than with a bitcast + load + extractelement sequence.

spatel marked 2 inline comments as done.Fri, Jul 24, 9:29 AM
spatel added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
86

Other middle-end passes almost certainly will *not* handle this as well. :)
It's not quite the same pattern/problem since we're creating vector ops here, but that's what led to removing the generic LoadCombine IR pass ( http://lists.llvm.org/pipermail/llvm-dev/2016-September/105291.html ).
I'm assuming that VectorCombine is running late enough (after GVN, etc.) that we've already done all of the general IR optimizations that can be done with the narrow ops. I could try to cobble together some PhaseOrdering tests to enforce that, but they would be negative tests currently.

spatel marked an inline comment as done.Fri, Jul 31, 8:22 AM

Ping.

RKSimon added inline comments.Sat, Aug 1, 5:10 AM
llvm/test/Transforms/VectorCombine/X86/load.ll
60 ↗(On Diff #280132)

do we have test coverage for non-zero gep indices?

spatel marked an inline comment as done.Sat, Aug 1, 7:20 AM
spatel added inline comments.
llvm/test/Transforms/VectorCombine/X86/load.ll
60 ↗(On Diff #280132)

No, that was missing. Added with rGd620a6fe98f7.

spatel updated this revision to Diff 282395.Sat, Aug 1, 7:26 AM
spatel marked an inline comment as done.

Patch updated:
No code changes, but added tests for non-zero gep offsets. This actually works in more cases than I was expecting, given that we're only using the base stripPointerCasts(). If there are enough dereferenceable bytes, isSafeToLoadUnconditionally() can still allow widening the load at the gep offset.

nikic added inline comments.Sat, Aug 1, 8:15 AM
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
86

I'm not sure how safe that assumption is. For example, in Rust it is possible for code to go through two ThinLTO links at different hierarchy levels. In that case the first one would convert everything to vector loads (given ample dereferenceability information), and the second one might have trouble optimizing based on that.

spatel updated this revision to Diff 283886.Fri, Aug 7, 6:27 AM

Patch updated:
Adjusted to match the most basic pattern that starts with an insertelement (so no extract is created here). Hopefully, that removes any concern about interactions with other passes. I.e., the transform should almost always be profitable. (We could argue that this could be part of canonicalization, but we conservatively try not to create vector ops from scalar ops in passes like instcombine.)

LGTM with one minor.

@nikic Does this look OK now?

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
84

Since you're inserting into undef, this is really a BUILD_VECTOR - you might get better results with getScalarizationOverhead?

spatel updated this revision to Diff 283925.Fri, Aug 7, 9:15 AM

Patch updated:
Use TTI.getScalarizationOverhead() to model insert cost of original code. I had not used this API before and the documentation comment isn't entirely clear to me, so please see if that looks as intended.

nikic added a comment.Fri, Aug 7, 9:39 AM

@nikic Does this look OK now?

Looks good to me as well.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
64

assert / use cast<>? Don't think the pointer operand can have a non-pointer type.

RKSimon accepted this revision.Sat, Aug 8, 1:24 AM

LGTM - the getScalarizationOverhead() change is OK

lebedev.ri accepted this revision.Sat, Aug 8, 1:31 AM
This revision is now accepted and ready to land.Sat, Aug 8, 1:31 AM
xbolva00 added inline comments.
llvm/lib/Transforms/Vectorize/VectorCombine.cpp
64

Not resolved?

spatel updated this revision to Diff 284132.Sat, Aug 8, 8:30 AM
spatel marked 2 inline comments as done.

Patch updated:
Changed pointer check to an assert.

xbolva00 accepted this revision.Sat, Aug 8, 8:34 AM
srj added a subscriber: srj. (Edited) Tue, Aug 11, 3:57 PM

This appears to have broken some of Halide's codegen for Hexagon/HVX; as of this revision, some of our tests are now failing with

llvm/lib/IR/Type.cpp:617: static llvm::FixedVectorType* llvm::FixedVectorType::get(llvm::Type*, unsigned int): Assertion `NumElts > 0 && "#Elements of a VectorType must be greater than 0"' failed.

(It's not yet clear whether this is a regression on LLVM's part or a change that reveals a latent bug in Halide; I'm investigating to determine which.)

srj added a comment.Tue, Aug 11, 5:05 PM

Update: it appears that VectorSize (from TTI.getMinVectorRegisterBitWidth()) is zero in this case, which causes the assertion failure.

This appears to be the case because HexagonTTIImpl::getMinVectorRegisterBitWidth() returns 0 if useHVX() isn't true... and useHVX() returns false if the HexagonAutoHVX option isn't enabled.

By design, Halide doesn't enable the HexagonAutoHVX option; we like to do all the vectorization ourselves.

I'm not sure how to resolve this issue -- the flaw here seems to lie in HexagonTTIImpl's assumption that disabling HexagonAutoHVX should cause it to report zero-width vectors. That seems to be a dubious decision, and one that is at odds with every other implementation of getMinVectorRegisterBitWidth() that I see in trunk LLVM (none of them appear to ever return 0).

Would it make sense to consider backing out this change until this can be resolved, since it clearly appears to have bad consequences for Hexagon/HVX codegen?

In D81766#2211936, @srj wrote:

Update: it appears that VectorSize (from TTI.getMinVectorRegisterBitWidth()) is zero in this case, which causes the assertion failure.

This appears to be the case because HexagonTTIImpl::getMinVectorRegisterBitWidth() returns 0 if useHVX() isn't true... and useHVX() returns false if the HexagonAutoHVX option isn't enabled.

By design, Halide doesn't enable the HexagonAutoHVX option; we like to do all the vectorization ourselves.

I'm not sure how to resolve this issue -- the flaw here seems to lie in HexagonTTIImpl's assumption that disabling HexagonAutoHVX should cause it to report zero-width vectors. That seems to be a dubious decision, and one that is at odds with every other implementation of getMinVectorRegisterBitWidth() that I see in trunk LLVM (none of them appear to ever return 0).

Would it make sense to consider backing out this change until this can be resolved, since it clearly appears to have bad consequences for Hexagon/HVX codegen?

Can we just add another condition (!VectorSize) to this bailout?
if (!ScalarSize || VectorSize % ScalarSize != 0)

srj added a comment.Tue, Aug 11, 5:17 PM

Can we just add another condition (!VectorSize) to this bailout?
if (!ScalarSize || VectorSize % ScalarSize != 0)

Changing it to if (!ScalarSize || !VectorSize || VectorSize % ScalarSize != 0) does indeed seem to make our failure go away, so that would be fine as a quick fix.

(I'm still surprised that getMinVectorRegisterBitWidth() should ever return 0, but I can take that up with Qualcomm folks separately; it is entirely possible I don't understand the full semantics of that method.)

In D81766#2211952, @srj wrote:

Can we just add another condition (!VectorSize) to this bailout?
if (!ScalarSize || VectorSize % ScalarSize != 0)

Changing it to if (!ScalarSize || !VectorSize || VectorSize % ScalarSize != 0) does indeed seem to make our failure go away, so that would be fine as a quick fix.

(I'm still surprised that getMinVectorRegisterBitWidth() should ever return 0, but I can take that up with Qualcomm folks separately; it is entirely possible I don't understand the full semantics of that method.)

rGb0b95dab1ce2
I'll see if I can find a better API and/or test case for that tomorrow.

Test added here:
rGb97e402ca5ba

The Loop and SLP vectorizers check this:

// If the target claims to have no vector registers don't attempt
// vectorization.
if (!TTI->getNumberOfRegisters(TTI->getRegisterClassForType(true)))
  return false;

So I should probably add that check to this pass too to be safer.

The Hexagon issue is fixed in rGa2dc19b81b1e.