This is an archive of the discontinued LLVM Phabricator instance.

Differential D81766

[VectorCombine] try to create vector loads from scalar loads
ClosedPublic

Authored by spatel on Jun 12 2020, 2:29 PM.

Details

Reviewers

lebedev.ri
RKSimon
anton-afanasyev
vporpo
ABataev
xbolva00

Commits

rG43bdac290663: [VectorCombine] try to create vector loads from scalar loads

Summary

This should allow this pass or others to fold subsequent scalar ops together more easily because they will see extractelement ops from a single vector rather than incomplete parts of that vector.

Currently, this transform will make no overall difference to these most basic patterns because the backend (DAGCombiner) will narrow the loads back down to scalars via narrowExtractedVectorLoad().

For now, we do not get any differences for scalar integer loads because those extracts are not free. We will need to match larger patterns and/or adjust the cost equation to allow that.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

spatel created this revision.Jun 12 2020, 2:29 PM

Herald added a project: Restricted Project.Jun 12 2020, 2:29 PM

Herald added subscribers: steven.zhang, hiraditya, mcrosier.

Some thoughts.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
60	I'm not sure why we'd care whether the load is of bitcast. Why are we using bitcast src type as the source of truth? Are we trying to avoid introducing some cache issues? I'd think we should instead assess (check, brute-force) each possible wider load type, first checking cost and then `isSafeToLoadUnconditionally()`.
104	Shouldn't this be `+=`?
123–124	`++NumLoadsVectorized;`

To add to what @lebedev.ri said, this patch violates the opaque pointer model towards which LLVM is migrating. Pointer element types are not allowed to influence optimization behavior.

In D81766#2091526, @nikic wrote:

To add to what @lebedev.ri said, this patch violates the opaque pointer model towards which LLVM is migrating. Pointer element types are not allowed to influence optimization behavior.

Ah, good point, too.

This revision now requires changes to proceed.Jun 14 2020, 1:38 AM

In D81766#2091526, @nikic wrote:

To add to what @lebedev.ri said, this patch violates the opaque pointer model towards which LLVM is migrating. Pointer element types are not allowed to influence optimization behavior.

Ok, let me see if I can rework this using just the cost model. This patch started within InstCombine, so we didn't have access to costs and didn't want to do the transform too loosely. The bitcast was used as a proxy for "cost effective" - it indicated that either the original code or the vectorizers had validated the vector type as a legitimate type for the target.

spatel updated this revision to Diff 280132.Jul 23 2020, 8:13 AM

spatel retitled this revision from [VectorCombine] try to create vector loads from bitcasted scalar pointers to [VectorCombine] try to create vector loads from scalar loads.

spatel edited the summary of this revision.

Patch updated - this uses the target, cost model, and load attributes only (not pointer types/casts) to decide if we can create a vector load.

nikic added inline comments.Jul 23 2020, 2:26 PM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
130	Will other middle end passes be able to handle this well as well? I don't have anything specific in mind here, but would suspect that some passes will be able to deal with a "load" better than a "bitcast, load, extractelement" sequence.

spatel marked 2 inline comments as done.Jul 24 2020, 9:29 AM

spatel added inline comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
130	Other middle end passes almost certainly will not handle this as well. :) It's not quite the same pattern/problem since we're creating vector ops here, but that's what led to removing the generic LoadCombine IR pass ( http://lists.llvm.org/pipermail/llvm-dev/2016-September/105291.html ). I'm assuming that VectorCombine is running late enough (after GVN, etc.) that we've already done all of the general IR optimizations that can be done with the narrow ops. I could try to cobble some PhaseOrdering tests to enforce it, but these would be negative tests currently.

Ping.

RKSimon added inline comments.Aug 1 2020, 5:10 AM

llvm/test/Transforms/VectorCombine/X86/load.ll
60	do we have test coverage for non-zero gep indices?

spatel marked an inline comment as done.Aug 1 2020, 7:20 AM

spatel added inline comments.

llvm/test/Transforms/VectorCombine/X86/load.ll
60	No, that was missing. Added with rGd620a6fe98f7.

Patch updated:
No code changes, but added tests for non-zero gep offsets. This actually works more than I was expecting given that we're only using the base stripPointerCasts(). If there are enough dereferenceable bytes, isSafeToLoadUnconditionally() can still manage to use the offset load via the existing gep.
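As a rough illustration of the byte-range reasoning above (a sketch only, not code from the patch): in the `gep01_load_i16_insert_v8i16` test the pointer is `dereferenceable(18)`, the GEP selects element 1 of an `<8 x i16>` (a 2-byte offset), and the widened load reads 16 bytes, so the access ends exactly at byte 18. The helper name `fitsInDereferenceableBytes` is hypothetical; the real `isSafeToLoadUnconditionally()` considers more than this, such as alignment.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an LLVM API): returns true if a load of
// `loadBytes` bytes starting `offsetBytes` past a base pointer stays inside
// the `derefBytes` bytes known to be dereferenceable from that base.
bool fitsInDereferenceableBytes(uint64_t derefBytes, uint64_t offsetBytes,
                                uint64_t loadBytes) {
  assert(loadBytes > 0 && "a load must access at least one byte");
  // Two comparisons instead of offsetBytes + loadBytes, to avoid overflow.
  return offsetBytes <= derefBytes && loadBytes <= derefBytes - offsetBytes;
}
```

With 18 dereferenceable bytes, a 2-byte offset, and a 16-byte load, the check passes; with only 16 dereferenceable bytes the same offset load would be rejected.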

nikic added inline comments.Aug 1 2020, 8:15 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
130	I'm not sure how safe that assumption is. For example, in Rust it is possible for code to go through two ThinLTO links at different hierarchy levels. In that case the first one would convert everything to vector loads (given ample dereferenceability information), and the second one might have trouble optimizing based on that.

Patch updated:
Adjusted to match the most basic pattern that starts with an insertelement (so there's no extract created here). Hopefully, that removes any concern about interactions with other passes. I.e., the transform should almost always be profitable. (We could make an argument that this could be part of canonicalization, but we conservatively try not to create vector ops from scalar ops in passes like instcombine.)

LGTM with one minor. @nikic Does this look OK now?

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
128	Since you're inserting into undef, this is really a BUILD_VECTOR - you might get better results with getScalarizationOverhead?

Patch updated:
Use TTI.getScalarizationOverhead() to model insert cost of original code. I had not used this API before and the documentation comment isn't entirely clear to me, so please see if that looks as intended.
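A minimal sketch of the profitability decision described here, assuming the three costs have already been obtained from the target (in the patch they come from `TTI.getMemoryOpCost()` and `TTI.getScalarizationOverhead()`). The struct and function names are hypothetical and exist only for illustration.

```cpp
#include <cassert>

// Hypothetical cost inputs; in the patch these come from TTI queries.
struct LoadInsertCosts {
  int ScalarLoad;     // cost of the original scalar load
  int InsertOverhead; // cost of inserting the scalar into lane 0 of undef
  int VectorLoad;     // cost of the replacement vector load
};

// Mirrors the comparison in the patch: bail out only when the old pattern
// is strictly cheaper, so a tie goes to the vector load.
bool shouldFormVectorLoad(const LoadInsertCosts &C) {
  assert(C.ScalarLoad >= 0 && C.InsertOverhead >= 0 && C.VectorLoad >= 0 &&
         "costs are expected to be non-negative");
  int OldCost = C.ScalarLoad + C.InsertOverhead;
  int NewCost = C.VectorLoad;
  return !(OldCost < NewCost);
}
```

Letting ties go to the vector form matches the comment in the patch that the backend can invert the transform if it does not turn out to be a win.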

In D81766#2202770, @RKSimon wrote:

LGTM with one minor. @nikic Does this look OK now?

Looks good to me as well.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
108	assert / use `cast<>`? Don't think the pointer operand can have a non-pointer type.

LGTM - the getScalarizationOverhead() change is OK

lebedev.ri accepted this revision.Aug 8 2020, 1:31 AM

This revision is now accepted and ready to land.Aug 8 2020, 1:31 AM

xbolva00 added a subscriber: xbolva00.Aug 8 2020, 2:16 AM

xbolva00 added inline comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
108	Not resolved?

RKSimon mentioned this in D64142: [SLP] try to create vector loads from bitcasted scalar pointers.Aug 8 2020, 3:40 AM

Patch updated:
Changed pointer check to an assert.

xbolva00 accepted this revision.Aug 8 2020, 8:34 AM

Closed by commit rG43bdac290663: [VectorCombine] try to create vector loads from scalar loads (authored by spatel).Aug 9 2020, 6:19 AM

This revision was automatically updated to reflect the committed changes.

spatel added a commit: rG43bdac290663: [VectorCombine] try to create vector loads from scalar loads.

This appears to have broken some of Halide's codegen for Hexagon/HVX; as of this revision, some of our tests are now failing with

llvm/lib/IR/Type.cpp:617: static llvm::FixedVectorType* llvm::FixedVectorType::get(llvm::Type*, unsigned int): Assertion `NumElts > 0 && "#Elements of a VectorType must be greater than 0"' failed.

(It's not yet clear whether this is an injection on LLVM's part, or a change that reveals a latent bug in Halide; I'm investigating to determine.)

Update: it appears that VectorSize (from TTI.getMinVectorRegisterBitWidth()) is zero in this case, which causes the assertion failure.

This appears to be the case because HexagonTTIImpl::getMinVectorRegisterBitWidth() returns 0 if useHVX() isn't true... and useHVX() returns false if the HexagonAutoHVX option isn't enabled.

By design, Halide doesn't enable the HexagonAutoHVX option; we like to do all the vectorization ourselves.

I'm not sure how to resolve this issue -- the flaw here seems to lie in HexagonTTIImpl's assumption that disabling HexagonAutoHVX should cause it to report zero-width vectors, which seems to be a dubious decision (and one that is at odds with every other implementation of getMinVectorRegisterBitWidth() that I see in trunk LLVM; none of them appear to ever return 0).

Would it make sense to consider backing out this change until this can be resolved, since it clearly appears to have bad consequences for Hexagon/HVX codegen?

In D81766#2211936, @srj wrote:

Update: it appears that VectorSize (from TTI.getMinVectorRegisterBitWidth()) is zero in this case, which causes the assertion failure.

This appears to be the case because HexagonTTIImpl::getMinVectorRegisterBitWidth() returns 0 if useHVX() isn't true... and useHVX() returns false if the HexagonAutoHVX option isn't enabled.

By design, Halide doesn't enable the HexagonAutoHVX option; we like to do all the vectorization ourselves.

I'm not sure how to resolve this issue -- the flaw here seems to lie in HexagonTTIImpl's assumption that disabling HexagonAutoHVX should cause it to report zero-width vectors, which seems to be a dubious decision (and one that is at odds with every other implementation of getMinVectorRegisterBitWidth() that I see in trunk LLVM; none of them appear to ever return 0).

Would it make sense to consider backing out this change until this can be resolved, since it clearly appears to have bad consequences for Hexagon/HVX codegen?

Can we just add another condition (!VectorSize) to this bailout?
if (!ScalarSize || VectorSize % ScalarSize != 0)

Can we just add another condition (!VectorSize) to this bailout?
if (!ScalarSize || VectorSize % ScalarSize != 0)

Changing it to if (!ScalarSize || !VectorSize || VectorSize % ScalarSize != 0) does indeed seem to make our failure go away, so that would be fine as a quick fix.

(I'm still surprised that getMinVectorRegisterBitWidth() should ever return 0, but I can take that up with Qualcomm folks separately; it is entirely possible I don't understand the full semantics of that method.)
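To show why the extra `!VectorSize` condition matters, here is a small standalone sketch of the bailout logic, independent of LLVM. The function name `vectorElementCount` is hypothetical, and returning 0 to signal "bail out" is a choice made for this sketch. Without the `!VectorSize` guard, a zero register width slips through because `0 % ScalarSize` is 0, and the element count becomes 0, which is the condition the `FixedVectorType::get` assertion rejects.

```cpp
#include <cassert>
#include <cstdint>

// Returns the number of scalar elements that fit in the vector register, or
// 0 if the transform should bail out (hypothetical convention for this sketch).
unsigned vectorElementCount(unsigned VectorSize, uint64_t ScalarSize) {
  // Same condition as the quick fix discussed above.
  if (!ScalarSize || !VectorSize || VectorSize % ScalarSize != 0)
    return 0;
  unsigned NumElts = static_cast<unsigned>(VectorSize / ScalarSize);
  // Same invariant that FixedVectorType::get asserts on.
  assert(NumElts > 0 && "#Elements of a VectorType must be greater than 0");
  return NumElts;
}
```

For example, a 128-bit register with 32-bit scalars gives 4 elements, while a reported width of 0 now takes the early return instead of reaching the assertion.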

spatel mentioned this in rGb0b95dab1ce2: [VectorCombine] add safety check for 0-width register.Aug 11 2020, 5:30 PM

In D81766#2211952, @srj wrote:

Can we just add another condition (!VectorSize) to this bailout?
if (!ScalarSize || VectorSize % ScalarSize != 0)

Changing it to if (!ScalarSize || !VectorSize || VectorSize % ScalarSize != 0) does indeed seem to make our failure go away, so that would be fine as a quick fix.

(I'm still surprised that getMinVectorRegisterBitWidth() should ever return 0, but I can take that up with Qualcomm folks separately; it is entirely possible I don't understand the full semantics of that method.)

rGb0b95dab1ce2
I'll see if I can find a better API and/or test case for that tomorrow.

In D81766#2211991, @spatel wrote:

In D81766#2211952, @srj wrote:

Can we just add another condition (!VectorSize) to this bailout?
if (!ScalarSize || VectorSize % ScalarSize != 0)

Changing it to if (!ScalarSize || !VectorSize || VectorSize % ScalarSize != 0) does indeed seem to make our failure go away, so that would be fine as a quick fix.

(I'm still surprised that getMinVectorRegisterBitWidth() should ever return 0, but I can take that up with Qualcomm folks separately; it is entirely possible I don't understand the full semantics of that method.)

rGb0b95dab1ce2
I'll see if I can find a better API and/or test case for that tomorrow.

Test added here:
rGb97e402ca5ba

The Loop and SLP vectorizers check this:

// If the target claims to have no vector registers don't attempt
// vectorization.
if (!TTI->getNumberOfRegisters(TTI->getRegisterClassForType(true)))
  return false;

So I should probably add that check to this pass too to be safer.

spatel mentioned this in rGcc892fd9f4cb: [VectorCombine] early exit if target has no vector registers.Aug 12 2020, 6:29 AM

The Hexagon issue is fixed in rGa2dc19b81b1e.

spatel mentioned this in D86160: [VectorCombine] allow vector loads with mismatched insert type.Aug 18 2020, 11:03 AM

spatel mentioned this in rG8fb055932c08: [VectorCombine] allow vector loads with mismatched insert type.Sep 2 2020, 5:11 AM

MaskRay mentioned this in D87538: [VectorCombine] Don't vectorize scalar load under asan/hwasan/memtag/tsan.Sep 11 2020, 12:00 PM

MaskRay mentioned this in rG4452cc4086ac: [VectorCombine] Don't vectorize scalar load under asan/hwasan/memtag/tsan.Sep 15 2020, 9:52 AM

FYI: @yaxunl -- I've run into this while compiling rocFFT, so it may bite you, too.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
105	This triggers an assertion in `CreateBitCast` below when we happen to strip a necessary `AddrSpaceCast` and put `PtrOp` in a different AS. Reproducer is here: https://gist.github.com/Artem-B/98a4420dda4f0c36364ddc170a8b12c5 At the very least the code should check that AS didn't change.
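One possible shape of such a check, shown on a toy model and not on LLVM's `Value` classes: strip casts only while the address space stays the same, so the stripped pointer can never land in a different address space than the original load's pointer. `ToyPtr` and `stripCastsWithinAddrSpace` are hypothetical names, and this is not necessarily how the follow-up fix was implemented.

```cpp
#include <cassert>

// Toy stand-in for a pointer value (an assumption of this sketch).
struct ToyPtr {
  unsigned AddrSpace;       // address space of this pointer
  const ToyPtr *CastSource; // non-null if this pointer is a cast of another
};

// Walks back through casts, but stops before any cast whose source lives in
// a different address space.
const ToyPtr *stripCastsWithinAddrSpace(const ToyPtr *P) {
  assert(P && "expected a pointer");
  while (P->CastSource && P->CastSource->AddrSpace == P->AddrSpace)
    P = P->CastSource;
  return P;
}
```

In this model a plain bitcast chain is stripped down to its base, while a pointer produced by an address-space cast is returned unchanged.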

tra mentioned this in D89577: [VectorCombine] Avoid crossing address space boundaries..Oct 16 2020, 11:28 AM

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/VectorCombine.cpp	59 lines
llvm/test/Transforms/VectorCombine/X86/load.ll	53 lines

Diff 284191

llvm/lib/Transforms/Vectorize/VectorCombine.cpp

(10 lines not shown)
// vectorization passes.		// vectorization passes.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "llvm/Transforms/Vectorize/VectorCombine.h"		#include "llvm/Transforms/Vectorize/VectorCombine.h"
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
#include "llvm/Analysis/BasicAliasAnalysis.h"		#include "llvm/Analysis/BasicAliasAnalysis.h"
#include "llvm/Analysis/GlobalsModRef.h"		#include "llvm/Analysis/GlobalsModRef.h"
		#include "llvm/Analysis/Loads.h"
#include "llvm/Analysis/TargetTransformInfo.h"		#include "llvm/Analysis/TargetTransformInfo.h"
#include "llvm/Analysis/ValueTracking.h"		#include "llvm/Analysis/ValueTracking.h"
#include "llvm/Analysis/VectorUtils.h"		#include "llvm/Analysis/VectorUtils.h"
#include "llvm/IR/Dominators.h"		#include "llvm/IR/Dominators.h"
#include "llvm/IR/Function.h"		#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"		#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/PatternMatch.h"		#include "llvm/IR/PatternMatch.h"
#include "llvm/InitializePasses.h"		#include "llvm/InitializePasses.h"
#include "llvm/Pass.h"		#include "llvm/Pass.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
#include "llvm/Transforms/Utils/Local.h"		#include "llvm/Transforms/Utils/Local.h"
#include "llvm/Transforms/Vectorize.h"		#include "llvm/Transforms/Vectorize.h"

using namespace llvm;		using namespace llvm;
using namespace llvm::PatternMatch;		using namespace llvm::PatternMatch;

#define DEBUG_TYPE "vector-combine"		#define DEBUG_TYPE "vector-combine"
		STATISTIC(NumVecLoad, "Number of vector loads formed");
STATISTIC(NumVecCmp, "Number of vector compares formed");		STATISTIC(NumVecCmp, "Number of vector compares formed");
STATISTIC(NumVecBO, "Number of vector binops formed");		STATISTIC(NumVecBO, "Number of vector binops formed");
STATISTIC(NumVecCmpBO, "Number of vector compare + binop formed");		STATISTIC(NumVecCmpBO, "Number of vector compare + binop formed");
STATISTIC(NumShufOfBitcast, "Number of shuffles moved after bitcast");		STATISTIC(NumShufOfBitcast, "Number of shuffles moved after bitcast");
STATISTIC(NumScalarBO, "Number of scalar binops formed");		STATISTIC(NumScalarBO, "Number of scalar binops formed");
STATISTIC(NumScalarCmp, "Number of scalar compares formed");		STATISTIC(NumScalarCmp, "Number of scalar compares formed");

static cl::opt<bool> DisableVectorCombine(		static cl::opt<bool> DisableVectorCombine(
"disable-vector-combine", cl::init(false), cl::Hidden,		"disable-vector-combine", cl::init(false), cl::Hidden,
cl::desc("Disable all vector combine transforms"));		cl::desc("Disable all vector combine transforms"));

static cl::opt<bool> DisableBinopExtractShuffle(		static cl::opt<bool> DisableBinopExtractShuffle(
"disable-binop-extract-shuffle", cl::init(false), cl::Hidden,		"disable-binop-extract-shuffle", cl::init(false), cl::Hidden,
cl::desc("Disable binop extract to shuffle transforms"));		cl::desc("Disable binop extract to shuffle transforms"));

static const unsigned InvalidIndex = std::numeric_limits<unsigned>::max();		static const unsigned InvalidIndex = std::numeric_limits<unsigned>::max();

namespace {		namespace {
class VectorCombine {		class VectorCombine {
public:		public:
VectorCombine(Function &F, const TargetTransformInfo &TTI,		VectorCombine(Function &F, const TargetTransformInfo &TTI,
const DominatorTree &DT)		const DominatorTree &DT)
: F(F), Builder(F.getContext()), TTI(TTI), DT(DT) {}		: F(F), Builder(F.getContext()), TTI(TTI), DT(DT) {}

bool run();		bool run();

private:		private:
Function &F;		Function &F;
IRBuilder<> Builder;		IRBuilder<> Builder;
const TargetTransformInfo &TTI;		const TargetTransformInfo &TTI;
const DominatorTree &DT;		const DominatorTree &DT;

		bool vectorizeLoadInsert(Instruction &I);
ExtractElementInst *getShuffleExtract(ExtractElementInst *Ext0,		ExtractElementInst *getShuffleExtract(ExtractElementInst *Ext0,
ExtractElementInst *Ext1,		ExtractElementInst *Ext1,
unsigned PreferredExtractIndex) const;		unsigned PreferredExtractIndex) const;
bool isExtractExtractCheap(ExtractElementInst *Ext0, ExtractElementInst *Ext1,		bool isExtractExtractCheap(ExtractElementInst *Ext0, ExtractElementInst *Ext1,
unsigned Opcode,		unsigned Opcode,
ExtractElementInst *&ConvertToShuffle,		ExtractElementInst *&ConvertToShuffle,
unsigned PreferredExtractIndex);		unsigned PreferredExtractIndex);
void foldExtExtCmp(ExtractElementInst *Ext0, ExtractElementInst *Ext1,		void foldExtExtCmp(ExtractElementInst *Ext0, ExtractElementInst *Ext1,
Instruction &I);		Instruction &I);
void foldExtExtBinop(ExtractElementInst *Ext0, ExtractElementInst *Ext1,		void foldExtExtBinop(ExtractElementInst *Ext0, ExtractElementInst *Ext1,
Instruction &I);		Instruction &I);
bool foldExtractExtract(Instruction &I);		bool foldExtractExtract(Instruction &I);
bool foldBitcastShuf(Instruction &I);		bool foldBitcastShuf(Instruction &I);
bool scalarizeBinopOrCmp(Instruction &I);		bool scalarizeBinopOrCmp(Instruction &I);
bool foldExtractedCmps(Instruction &I);		bool foldExtractedCmps(Instruction &I);
};		};
} // namespace		} // namespace

static void replaceValue(Value &Old, Value &New) {		static void replaceValue(Value &Old, Value &New) {
Old.replaceAllUsesWith(&New);		Old.replaceAllUsesWith(&New);
New.takeName(&Old);		New.takeName(&Old);
}		}

		bool VectorCombine::vectorizeLoadInsert(Instruction &I) {
		// Match insert of scalar load.
		Value *Scalar;
		if (!match(&I, m_InsertElt(m_Undef(), m_Value(Scalar), m_ZeroInt())))
		return false;
		auto *Load = dyn_cast<LoadInst>(Scalar);
		Type *ScalarTy = Scalar->getType();
		if (!Load || !Load->isSimple())
		return false;

		// TODO: Extend this to match GEP with constant offsets.
		Value *PtrOp = Load->getPointerOperand()->stripPointerCasts();
		assert(isa<PointerType>(PtrOp->getType()) && "Expected a pointer type");

		unsigned VectorSize = TTI.getMinVectorRegisterBitWidth();
		uint64_t ScalarSize = ScalarTy->getPrimitiveSizeInBits();
		if (!ScalarSize || VectorSize % ScalarSize != 0)
		return false;

		// Check safety of replacing the scalar load with a larger vector load.
		unsigned VecNumElts = VectorSize / ScalarSize;
		auto *VectorTy = VectorType::get(ScalarTy, VecNumElts, false);
		// TODO: Allow insert/extract subvector if the type does not match.
		if (VectorTy != I.getType())
		return false;
		Align Alignment = Load->getAlign();
		const DataLayout &DL = I.getModule()->getDataLayout();
		if (!isSafeToLoadUnconditionally(PtrOp, VectorTy, Alignment, DL, Load, &DT))
		return false;

		// Original pattern: insertelt undef, load [free casts of] ScalarPtr, 0
		int OldCost = TTI.getMemoryOpCost(Instruction::Load, ScalarTy, Alignment,
		Load->getPointerAddressSpace());
		APInt DemandedElts = APInt::getOneBitSet(VecNumElts, 0);
		OldCost += TTI.getScalarizationOverhead(VectorTy, DemandedElts, true, false);

		// New pattern: load VecPtr
		int NewCost = TTI.getMemoryOpCost(Instruction::Load, VectorTy, Alignment,
		Load->getPointerAddressSpace());

		// We can aggressively convert to the vector form because the backend can
		// invert this transform if it does not result in a performance win.
		if (OldCost < NewCost)
		return false;

		// It is safe and potentially profitable to load a vector directly:
		// inselt undef, load Scalar, 0 --> load VecPtr
		IRBuilder<> Builder(Load);
		Value *CastedPtr = Builder.CreateBitCast(PtrOp, VectorTy->getPointerTo());
		LoadInst *VecLd = Builder.CreateAlignedLoad(VectorTy, CastedPtr, Alignment);
		replaceValue(I, *VecLd);
		++NumVecLoad;
		return true;
		}

/// Determine which, if any, of the inputs should be replaced by a shuffle		/// Determine which, if any, of the inputs should be replaced by a shuffle
/// followed by extract from a different index.		/// followed by extract from a different index.
ExtractElementInst *VectorCombine::getShuffleExtract(		ExtractElementInst *VectorCombine::getShuffleExtract(
ExtractElementInst *Ext0, ExtractElementInst *Ext1,		ExtractElementInst *Ext0, ExtractElementInst *Ext1,
unsigned PreferredExtractIndex = InvalidIndex) const {		unsigned PreferredExtractIndex = InvalidIndex) const {
assert(isa<ConstantInt>(Ext0->getIndexOperand()) &&		assert(isa<ConstantInt>(Ext0->getIndexOperand()) &&
isa<ConstantInt>(Ext1->getIndexOperand()) &&		isa<ConstantInt>(Ext1->getIndexOperand()) &&
"Expected constant extract indexes");		"Expected constant extract indexes");
(521 lines not shown)
for (BasicBlock &BB : F) {		for (BasicBlock &BB : F) {
// Do not delete instructions under here and invalidate the iterator.		// Do not delete instructions under here and invalidate the iterator.
// Walk the block forwards to enable simple iterative chains of transforms.		// Walk the block forwards to enable simple iterative chains of transforms.
// TODO: It could be more efficient to remove dead instructions		// TODO: It could be more efficient to remove dead instructions
// iteratively in this loop rather than waiting until the end.		// iteratively in this loop rather than waiting until the end.
for (Instruction &I : BB) {		for (Instruction &I : BB) {
if (isa<DbgInfoIntrinsic>(I))		if (isa<DbgInfoIntrinsic>(I))
continue;		continue;
Builder.SetInsertPoint(&I);		Builder.SetInsertPoint(&I);
		MadeChange |= vectorizeLoadInsert(I);
MadeChange |= foldExtractExtract(I);		MadeChange |= foldExtractExtract(I);
MadeChange |= foldBitcastShuf(I);		MadeChange |= foldBitcastShuf(I);
MadeChange |= scalarizeBinopOrCmp(I);		MadeChange |= scalarizeBinopOrCmp(I);
MadeChange |= foldExtractedCmps(I);		MadeChange |= foldExtractedCmps(I);
}		}
}		}

// We're done with transforms, so remove dead instructions.		// We're done with transforms, so remove dead instructions.
(64 lines not shown)

llvm/test/Transforms/VectorCombine/X86/load.ll

	(51 lines not shown)
	;			;
	%bc = bitcast <4 x float>* %p to float*			%bc = bitcast <4 x float>* %p to float*
	%r = load float, float* %bc, align 16			%r = load float, float* %bc, align 16
	ret float %r			ret float %r
	}			}

	define float @matching_fp_vector_gep00(<4 x float>* align 16 dereferenceable(16) %p) {			define float @matching_fp_vector_gep00(<4 x float>* align 16 dereferenceable(16) %p) {
	; CHECK-LABEL: @matching_fp_vector_gep00(			; CHECK-LABEL: @matching_fp_vector_gep00(
	; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <4 x float>, <4 x float>* [[P:%.*]], i64 0, i64 0			; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <4 x float>, <4 x float>* [[P:%.*]], i64 0, i64 0
	; CHECK-NEXT: [[R:%.*]] = load float, float* [[GEP]], align 16			; CHECK-NEXT: [[R:%.*]] = load float, float* [[GEP]], align 16
	; CHECK-NEXT: ret float [[R]]			; CHECK-NEXT: ret float [[R]]
	;			;
	%gep = getelementptr inbounds <4 x float>, <4 x float>* %p, i64 0, i64 0			%gep = getelementptr inbounds <4 x float>, <4 x float>* %p, i64 0, i64 0
	%r = load float, float* %gep, align 16			%r = load float, float* %gep, align 16
	ret float %r			ret float %r
	}			}

	(100 lines not shown)
	;			;
	%bc = bitcast <8 x float>* %p to double*			%bc = bitcast <8 x float>* %p to double*
	%r = load double, double* %bc, align 32			%r = load double, double* %bc, align 32
	ret double %r			ret double %r
	}			}

	define <4 x float> @load_f32_insert_v4f32(float* align 16 dereferenceable(16) %p) {			define <4 x float> @load_f32_insert_v4f32(float* align 16 dereferenceable(16) %p) {
	; CHECK-LABEL: @load_f32_insert_v4f32(			; CHECK-LABEL: @load_f32_insert_v4f32(
	; CHECK-NEXT: [[S:%.*]] = load float, float* [[P:%.*]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[P:%.*]] to <4 x float>*
	; CHECK-NEXT: [[R:%.*]] = insertelement <4 x float> undef, float [[S]], i32 0			; CHECK-NEXT: [[R:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 4
	; CHECK-NEXT: ret <4 x float> [[R]]			; CHECK-NEXT: ret <4 x float> [[R]]
	;			;
	%s = load float, float* %p, align 4			%s = load float, float* %p, align 4
	%r = insertelement <4 x float> undef, float %s, i32 0			%r = insertelement <4 x float> undef, float %s, i32 0
	ret <4 x float> %r			ret <4 x float> %r
	}			}

	define <4 x float> @casted_load_f32_insert_v4f32(<4 x float>* align 4 dereferenceable(16) %p) {			define <4 x float> @casted_load_f32_insert_v4f32(<4 x float>* align 4 dereferenceable(16) %p) {
	; CHECK-LABEL: @casted_load_f32_insert_v4f32(			; CHECK-LABEL: @casted_load_f32_insert_v4f32(
	; CHECK-NEXT: [[B:%.*]] = bitcast <4 x float>* [[P:%.*]] to float*			; CHECK-NEXT: [[R:%.*]] = load <4 x float>, <4 x float>* [[P:%.*]], align 4
	; CHECK-NEXT: [[S:%.*]] = load float, float* [[B]], align 4
	; CHECK-NEXT: [[R:%.*]] = insertelement <4 x float> undef, float [[S]], i32 0
	; CHECK-NEXT: ret <4 x float> [[R]]			; CHECK-NEXT: ret <4 x float> [[R]]
	;			;
	%b = bitcast <4 x float>* %p to float*			%b = bitcast <4 x float>* %p to float*
	%s = load float, float* %b, align 4			%s = load float, float* %b, align 4
	%r = insertelement <4 x float> undef, float %s, i32 0			%r = insertelement <4 x float> undef, float %s, i32 0
	ret <4 x float> %r			ret <4 x float> %r
	}			}

				; Element type does not change cost.

	define <4 x i32> @load_i32_insert_v4i32(i32* align 16 dereferenceable(16) %p) {			define <4 x i32> @load_i32_insert_v4i32(i32* align 16 dereferenceable(16) %p) {
	; CHECK-LABEL: @load_i32_insert_v4i32(			; CHECK-LABEL: @load_i32_insert_v4i32(
	; CHECK-NEXT: [[S:%.*]] = load i32, i32* [[P:%.*]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[P:%.*]] to <4 x i32>*
	; CHECK-NEXT: [[R:%.*]] = insertelement <4 x i32> undef, i32 [[S]], i32 0			; CHECK-NEXT: [[R:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
	; CHECK-NEXT: ret <4 x i32> [[R]]			; CHECK-NEXT: ret <4 x i32> [[R]]
	;			;
	%s = load i32, i32* %p, align 4			%s = load i32, i32* %p, align 4
	%r = insertelement <4 x i32> undef, i32 %s, i32 0			%r = insertelement <4 x i32> undef, i32 %s, i32 0
	ret <4 x i32> %r			ret <4 x i32> %r
	}			}

				; Pointer type does not change cost.

	define <4 x i32> @casted_load_i32_insert_v4i32(<16 x i8>* align 4 dereferenceable(16) %p) {			define <4 x i32> @casted_load_i32_insert_v4i32(<16 x i8>* align 4 dereferenceable(16) %p) {
	; CHECK-LABEL: @casted_load_i32_insert_v4i32(			; CHECK-LABEL: @casted_load_i32_insert_v4i32(
	; CHECK-NEXT: [[B:%.*]] = bitcast <16 x i8>* [[P:%.*]] to i32*			; CHECK-NEXT: [[TMP1:%.*]] = bitcast <16 x i8>* [[P:%.*]] to <4 x i32>*
	; CHECK-NEXT: [[S:%.*]] = load i32, i32* [[B]], align 4			; CHECK-NEXT: [[R:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
	; CHECK-NEXT: [[R:%.*]] = insertelement <4 x i32> undef, i32 [[S]], i32 0
	; CHECK-NEXT: ret <4 x i32> [[R]]			; CHECK-NEXT: ret <4 x i32> [[R]]
	;			;
	%b = bitcast <16 x i8>* %p to i32*			%b = bitcast <16 x i8>* %p to i32*
	%s = load i32, i32* %b, align 4			%s = load i32, i32* %b, align 4
	%r = insertelement <4 x i32> undef, i32 %s, i32 0			%r = insertelement <4 x i32> undef, i32 %s, i32 0
	ret <4 x i32> %r			ret <4 x i32> %r
	}			}

				; This is canonical form for vector element access.

	define <4 x float> @gep00_load_f32_insert_v4f32(<4 x float>* align 16 dereferenceable(16) %p) {			define <4 x float> @gep00_load_f32_insert_v4f32(<4 x float>* align 16 dereferenceable(16) %p) {
	; CHECK-LABEL: @gep00_load_f32_insert_v4f32(			; CHECK-LABEL: @gep00_load_f32_insert_v4f32(
	; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <4 x float>, <4 x float>* [[P:%.*]], i64 0, i64 0			; CHECK-NEXT: [[R:%.*]] = load <4 x float>, <4 x float>* [[P:%.*]], align 16
	; CHECK-NEXT: [[S:%.*]] = load float, float* [[GEP]], align 16
	; CHECK-NEXT: [[R:%.*]] = insertelement <4 x float> undef, float [[S]], i64 0
	; CHECK-NEXT: ret <4 x float> [[R]]			; CHECK-NEXT: ret <4 x float> [[R]]
	;			;
	%gep = getelementptr inbounds <4 x float>, <4 x float>* %p, i64 0, i64 0			%gep = getelementptr inbounds <4 x float>, <4 x float>* %p, i64 0, i64 0
	%s = load float, float* %gep, align 16			%s = load float, float* %gep, align 16
	%r = insertelement <4 x float> undef, float %s, i64 0			%r = insertelement <4 x float> undef, float %s, i64 0
	ret <4 x float> %r			ret <4 x float> %r
	}			}

				; If there are enough dereferenceable bytes, we can offset the vector load.

	define <8 x i16> @gep01_load_i16_insert_v8i16(<8 x i16>* align 16 dereferenceable(18) %p) {			define <8 x i16> @gep01_load_i16_insert_v8i16(<8 x i16>* align 16 dereferenceable(18) %p) {
	; CHECK-LABEL: @gep01_load_i16_insert_v8i16(			; CHECK-LABEL: @gep01_load_i16_insert_v8i16(
	; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <8 x i16>, <8 x i16>* [[P:%.*]], i64 0, i64 1			; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <8 x i16>, <8 x i16>* [[P:%.*]], i64 0, i64 1
	; CHECK-NEXT: [[S:%.*]] = load i16, i16* [[GEP]], align 2			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[GEP]] to <8 x i16>*
	; CHECK-NEXT: [[R:%.*]] = insertelement <8 x i16> undef, i16 [[S]], i64 0			; CHECK-NEXT: [[R:%.*]] = load <8 x i16>, <8 x i16>* [[TMP1]], align 2
	; CHECK-NEXT: ret <8 x i16> [[R]]			; CHECK-NEXT: ret <8 x i16> [[R]]
	;			;
	%gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i64 0, i64 1			%gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i64 0, i64 1
	%s = load i16, i16* %gep, align 2			%s = load i16, i16* %gep, align 2
	%r = insertelement <8 x i16> undef, i16 %s, i64 0			%r = insertelement <8 x i16> undef, i16 %s, i64 0
	ret <8 x i16> %r			ret <8 x i16> %r
	}			}

				; Negative test - can't safely load the offset vector, but could load+shuffle.

	define <8 x i16> @gep01_load_i16_insert_v8i16_deref(<8 x i16>* align 16 dereferenceable(17) %p) {			define <8 x i16> @gep01_load_i16_insert_v8i16_deref(<8 x i16>* align 16 dereferenceable(17) %p) {
	; CHECK-LABEL: @gep01_load_i16_insert_v8i16_deref(			; CHECK-LABEL: @gep01_load_i16_insert_v8i16_deref(
	; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <8 x i16>, <8 x i16>* [[P:%.*]], i64 0, i64 1			; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <8 x i16>, <8 x i16>* [[P:%.*]], i64 0, i64 1
	; CHECK-NEXT: [[S:%.*]] = load i16, i16* [[GEP]], align 2			; CHECK-NEXT: [[S:%.*]] = load i16, i16* [[GEP]], align 2
	; CHECK-NEXT: [[R:%.*]] = insertelement <8 x i16> undef, i16 [[S]], i64 0			; CHECK-NEXT: [[R:%.*]] = insertelement <8 x i16> undef, i16 [[S]], i64 0
	; CHECK-NEXT: ret <8 x i16> [[R]]			; CHECK-NEXT: ret <8 x i16> [[R]]
	;			;
	%gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i64 0, i64 1			%gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i64 0, i64 1
	%s = load i16, i16* %gep, align 2			%s = load i16, i16* %gep, align 2
	%r = insertelement <8 x i16> undef, i16 %s, i64 0			%r = insertelement <8 x i16> undef, i16 %s, i64 0
	ret <8 x i16> %r			ret <8 x i16> %r
	}			}

				; If there are enough dereferenceable bytes, we can offset the vector load.

	define <8 x i16> @gep10_load_i16_insert_v8i16(<8 x i16>* align 16 dereferenceable(32) %p) {			define <8 x i16> @gep10_load_i16_insert_v8i16(<8 x i16>* align 16 dereferenceable(32) %p) {
	; CHECK-LABEL: @gep10_load_i16_insert_v8i16(			; CHECK-LABEL: @gep10_load_i16_insert_v8i16(
	; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <8 x i16>, <8 x i16>* [[P:%.*]], i64 1, i64 0			; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <8 x i16>, <8 x i16>* [[P:%.*]], i64 1, i64 0
	; CHECK-NEXT: [[S:%.*]] = load i16, i16* [[GEP]], align 16			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i16* [[GEP]] to <8 x i16>*
	; CHECK-NEXT: [[R:%.*]] = insertelement <8 x i16> undef, i16 [[S]], i64 0			; CHECK-NEXT: [[R:%.*]] = load <8 x i16>, <8 x i16>* [[TMP1]], align 16
	; CHECK-NEXT: ret <8 x i16> [[R]]			; CHECK-NEXT: ret <8 x i16> [[R]]
	;			;
	%gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i64 1, i64 0			%gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i64 1, i64 0
	%s = load i16, i16* %gep, align 16			%s = load i16, i16* %gep, align 16
	%r = insertelement <8 x i16> undef, i16 %s, i64 0			%r = insertelement <8 x i16> undef, i16 %s, i64 0
	ret <8 x i16> %r			ret <8 x i16> %r
	}			}

				; Negative test - can't safely load the offset vector, but could load+shuffle.

	define <8 x i16> @gep10_load_i16_insert_v8i16_deref(<8 x i16>* align 16 dereferenceable(31) %p) {			define <8 x i16> @gep10_load_i16_insert_v8i16_deref(<8 x i16>* align 16 dereferenceable(31) %p) {
	; CHECK-LABEL: @gep10_load_i16_insert_v8i16_deref(			; CHECK-LABEL: @gep10_load_i16_insert_v8i16_deref(
	; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <8 x i16>, <8 x i16>* [[P:%.*]], i64 1, i64 0			; CHECK-NEXT: [[GEP:%.*]] = getelementptr inbounds <8 x i16>, <8 x i16>* [[P:%.*]], i64 1, i64 0
	; CHECK-NEXT: [[S:%.*]] = load i16, i16* [[GEP]], align 16			; CHECK-NEXT: [[S:%.*]] = load i16, i16* [[GEP]], align 16
	; CHECK-NEXT: [[R:%.*]] = insertelement <8 x i16> undef, i16 [[S]], i64 0			; CHECK-NEXT: [[R:%.*]] = insertelement <8 x i16> undef, i16 [[S]], i64 0
	; CHECK-NEXT: ret <8 x i16> [[R]]			; CHECK-NEXT: ret <8 x i16> [[R]]
	;			;
	%gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i64 1, i64 0			%gep = getelementptr inbounds <8 x i16>, <8 x i16>* %p, i64 1, i64 0
	%s = load i16, i16* %gep, align 16			%s = load i16, i16* %gep, align 16
	%r = insertelement <8 x i16> undef, i16 %s, i64 0			%r = insertelement <8 x i16> undef, i16 %s, i64 0
	ret <8 x i16> %r			ret <8 x i16> %r
	}			}

				; Negative test - do not alter volatile.

	define <4 x float> @load_f32_insert_v4f32_volatile(float* align 16 dereferenceable(16) %p) {			define <4 x float> @load_f32_insert_v4f32_volatile(float* align 16 dereferenceable(16) %p) {
	; CHECK-LABEL: @load_f32_insert_v4f32_volatile(			; CHECK-LABEL: @load_f32_insert_v4f32_volatile(
	; CHECK-NEXT: [[S:%.*]] = load volatile float, float* [[P:%.*]], align 4			; CHECK-NEXT: [[S:%.*]] = load volatile float, float* [[P:%.*]], align 4
	; CHECK-NEXT: [[R:%.*]] = insertelement <4 x float> undef, float [[S]], i32 0			; CHECK-NEXT: [[R:%.*]] = insertelement <4 x float> undef, float [[S]], i32 0
	; CHECK-NEXT: ret <4 x float> [[R]]			; CHECK-NEXT: ret <4 x float> [[R]]
	;			;
	%s = load volatile float, float* %p, align 4			%s = load volatile float, float* %p, align 4
	%r = insertelement <4 x float> undef, float %s, i32 0			%r = insertelement <4 x float> undef, float %s, i32 0
	ret <4 x float> %r			ret <4 x float> %r
	}			}

				; Negative test? - pointer is not as aligned as load.

	define <4 x float> @load_f32_insert_v4f32_align(float* align 1 dereferenceable(16) %p) {			define <4 x float> @load_f32_insert_v4f32_align(float* align 1 dereferenceable(16) %p) {
	; CHECK-LABEL: @load_f32_insert_v4f32_align(			; CHECK-LABEL: @load_f32_insert_v4f32_align(
	; CHECK-NEXT: [[S:%.*]] = load float, float* [[P:%.*]], align 4			; CHECK-NEXT: [[S:%.*]] = load float, float* [[P:%.*]], align 4
	; CHECK-NEXT: [[R:%.*]] = insertelement <4 x float> undef, float [[S]], i32 0			; CHECK-NEXT: [[R:%.*]] = insertelement <4 x float> undef, float [[S]], i32 0
	; CHECK-NEXT: ret <4 x float> [[R]]			; CHECK-NEXT: ret <4 x float> [[R]]
	;			;
	%s = load float, float* %p, align 4			%s = load float, float* %p, align 4
	%r = insertelement <4 x float> undef, float %s, i32 0			%r = insertelement <4 x float> undef, float %s, i32 0
	ret <4 x float> %r			ret <4 x float> %r
	}			}

				; Negative test - not enough bytes.

	define <4 x float> @load_f32_insert_v4f32_deref(float* align 4 dereferenceable(15) %p) {			define <4 x float> @load_f32_insert_v4f32_deref(float* align 4 dereferenceable(15) %p) {
	; CHECK-LABEL: @load_f32_insert_v4f32_deref(			; CHECK-LABEL: @load_f32_insert_v4f32_deref(
	; CHECK-NEXT: [[S:%.*]] = load float, float* [[P:%.*]], align 4			; CHECK-NEXT: [[S:%.*]] = load float, float* [[P:%.*]], align 4
	; CHECK-NEXT: [[R:%.*]] = insertelement <4 x float> undef, float [[S]], i32 0			; CHECK-NEXT: [[R:%.*]] = insertelement <4 x float> undef, float [[S]], i32 0
	; CHECK-NEXT: ret <4 x float> [[R]]			; CHECK-NEXT: ret <4 x float> [[R]]
	;			;
	%s = load float, float* %p, align 4			%s = load float, float* %p, align 4
	%r = insertelement <4 x float> undef, float %s, i32 0			%r = insertelement <4 x float> undef, float %s, i32 0
	ret <4 x float> %r			ret <4 x float> %r
	}			}

				; TODO: Should load v4i32.

	define <8 x i32> @load_i32_insert_v8i32(i32* align 16 dereferenceable(16) %p) {			define <8 x i32> @load_i32_insert_v8i32(i32* align 16 dereferenceable(16) %p) {
	; CHECK-LABEL: @load_i32_insert_v8i32(			; CHECK-LABEL: @load_i32_insert_v8i32(
	; CHECK-NEXT: [[S:%.*]] = load i32, i32* [[P:%.*]], align 4			; CHECK-NEXT: [[S:%.*]] = load i32, i32* [[P:%.*]], align 4
	; CHECK-NEXT: [[R:%.*]] = insertelement <8 x i32> undef, i32 [[S]], i32 0			; CHECK-NEXT: [[R:%.*]] = insertelement <8 x i32> undef, i32 [[S]], i32 0
	; CHECK-NEXT: ret <8 x i32> [[R]]			; CHECK-NEXT: ret <8 x i32> [[R]]
	;			;
	%s = load i32, i32* %p, align 4			%s = load i32, i32* %p, align 4
	%r = insertelement <8 x i32> undef, i32 %s, i32 0			%r = insertelement <8 x i32> undef, i32 %s, i32 0
	ret <8 x i32> %r			ret <8 x i32> %r
	}			}

				; TODO: Should load v4i32.

	define <8 x i32> @casted_load_i32_insert_v8i32(<4 x i32>* align 4 dereferenceable(16) %p) {			define <8 x i32> @casted_load_i32_insert_v8i32(<4 x i32>* align 4 dereferenceable(16) %p) {
	; CHECK-LABEL: @casted_load_i32_insert_v8i32(			; CHECK-LABEL: @casted_load_i32_insert_v8i32(
	; CHECK-NEXT: [[B:%.*]] = bitcast <4 x i32>* [[P:%.*]] to i32*			; CHECK-NEXT: [[B:%.*]] = bitcast <4 x i32>* [[P:%.*]] to i32*
	; CHECK-NEXT: [[S:%.*]] = load i32, i32* [[B]], align 4			; CHECK-NEXT: [[S:%.*]] = load i32, i32* [[B]], align 4
	; CHECK-NEXT: [[R:%.*]] = insertelement <8 x i32> undef, i32 [[S]], i32 0			; CHECK-NEXT: [[R:%.*]] = insertelement <8 x i32> undef, i32 [[S]], i32 0
	; CHECK-NEXT: ret <8 x i32> [[R]]			; CHECK-NEXT: ret <8 x i32> [[R]]
	;			;
	%b = bitcast <4 x i32>* %p to i32*			%b = bitcast <4 x i32>* %p to i32*
	%s = load i32, i32* %b, align 4			%s = load i32, i32* %b, align 4
	%r = insertelement <8 x i32> undef, i32 %s, i32 0			%r = insertelement <8 x i32> undef, i32 %s, i32 0
	ret <8 x i32> %r			ret <8 x i32> %r
	}			}
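
The dereferenceable(18)/dereferenceable(17) and dereferenceable(32)/dereferenceable(31) pairs in the gep01 and gep10 tests come down to byte arithmetic: a 16-byte <8 x i16> load that starts 2 bytes (gep01) or 16 bytes (gep10) past %p must end within the bytes known to be dereferenceable. A minimal sketch of that bound follows. The helper name `fitsInDereferenceable` and its parameters are hypothetical and do not come from VectorCombine.cpp, and the sketch covers only the size check, not alignment, volatility, or cost.

```cpp
#include <cstdint>

// Hypothetical helper, for illustration only (not the patch's code).
// Returns true when a load of `vecBytes` bytes that starts `offsetBytes`
// past a pointer known to be dereferenceable for `derefBytes` bytes stays
// entirely inside that dereferenceable region.
constexpr bool fitsInDereferenceable(std::uint64_t derefBytes,
                                     std::uint64_t offsetBytes,
                                     std::uint64_t vecBytes) {
  // Written as a subtraction so the check cannot overflow.
  return offsetBytes <= derefBytes && vecBytes <= derefBytes - offsetBytes;
}
```

With these assumptions, gep01 needs 2 + 16 = 18 bytes and gep10 needs 16 + 16 = 32 bytes, so the variants with one byte fewer are the negative tests. The same bound with offset 0 matches the dereferenceable(15) case for <4 x float>.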