This is an archive of the discontinued LLVM Phabricator instance.


Differential D98240

[VectorCombine] Simplify to scalar store if only one element updated
Closed, Public

Authored by qiucf on Mar 9 2021, 2:00 AM.


Details

Reviewers

spatel
efriedma
fhahn
lebedev.ri
RKSimon

Commits

rG2db4979c0fe0: [VectorCombine] Simplify to scalar store if only one element updated

Summary

This is the VectorCombine version of revision D71828. It simplifies the load-insertelement-store pattern into a getelementptr followed by a scalar store.
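As a rough illustration of why the rewrite is sound, the two forms can be modelled in plain C++ (not the LLVM API). This is only a sketch: `Vec4`, `updateViaVector` and `updateViaScalar` are hypothetical names, and the model assumes no other write to the vector happens between the load and the store.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for <4 x i32>.
using Vec4 = std::array<std::int32_t, 4>;

// Before: load the whole vector, replace one lane, store the whole vector.
inline void updateViaVector(Vec4 *p, std::size_t idx, std::int32_t b) {
  assert(idx < p->size());
  Vec4 tmp = *p; // load <4 x i32>
  tmp[idx] = b;  // insertelement
  *p = tmp;      // store <4 x i32>
}

// After: compute the address of the lane and store only the scalar.
inline void updateViaScalar(Vec4 *p, std::size_t idx, std::int32_t b) {
  assert(idx < p->size());
  std::int32_t *elt = p->data() + idx; // getelementptr
  *elt = b;                            // store i32
}
```

Both functions leave the vector in the same state, which is the equivalence the transform relies on.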

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time	Test
520 ms	x64 debian > Clang.CodeGen::aggregate-assign-call.c
560 ms	x64 debian > Clang.CodeGen::attr-arm-sve-vector-bits-call.c
450 ms	x64 debian > Clang.CodeGen::available-externally-suppress.c
470 ms	x64 debian > Clang.CodeGen::cfi-icall-cross-dso.c
580 ms	x64 debian > Clang.CodeGen::dllimport.c
(207 tests failed in total)

Event Timeline

qiucf created this revision. Mar 9 2021, 2:00 AM

Herald added subscribers: jfb, hiraditya. Mar 9 2021, 2:00 AM

qiucf requested review of this revision. Mar 9 2021, 2:00 AM

Herald added a project: Restricted Project. Mar 9 2021, 2:00 AM

Herald added a subscriber: llvm-commits.

HLJ2009 added a subscriber: HLJ2009. Mar 9 2021, 2:26 AM

Fix pipeline tests

Herald added subscribers: nikic, kerbowa, steven_wu and 2 others. Mar 9 2021, 2:35 AM

Since the original proposal, MemorySSA has evolved. I still don't know much about that though. cc @asbirlea @nikic to see if this implementation is ok or should we use MemorySSA here?

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
777	Can shorten: if (!SI || !SI->isSimple() ...)
829	This line of the comment is not accurate now. Remove or update.

Harbormaster completed remote builds in B92825: Diff 329254. Mar 9 2021, 8:52 AM

Harbormaster completed remote builds in B92829: Diff 329261. Mar 9 2021, 9:38 AM

In D98240#2614181, @spatel wrote:

Since the original proposal, MemorySSA has evolved. I still don't know much about that though. cc @asbirlea @nikic to see if this implementation is ok or should we use MemorySSA here?

MemorySSA may be handy in this case, but it looks like it's not available close-by in the current pipeline position?

nikic added inline comments. Mar 10 2021, 3:28 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
810	This line shouldn't be dropped.

Address comments

Harbormaster completed remote builds in B93309: Diff 329968. Mar 11 2021, 1:24 PM

fhahn mentioned this in D100273: [VectorCombine] Scalarize vector load/extract. Apr 11 2021, 2:04 PM

I put up a similar patch to handle extractelement (load %ptr), %index in a similar fashion: D100273. Together with this patch, they greatly improve codegen for certain code generated using the C/C++ matrix extension https://clang.godbolt.org/z/qsccPdPf4

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
761	I think we need a limit here, to avoid excessive compile time?
802	Do we need a pointer cast here? I think we can just create a `GEP` with the vector pointer and indices `0, Idx`.
805	What about other metadata?
807	Is there a reason to remove the instruction here? I don't think the other functions do so, so it might be better to keep things consistent (or change it for other patterns as well)

Address comments

qiucf added inline comments. Apr 11 2021, 11:13 PM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
807	I think that's because we're operating on a `store`, which will not be erased automatically for empty use.

Harbormaster completed remote builds in B98217: Diff 336754. Apr 11 2021, 11:47 PM

fhahn mentioned this in D100302: [VectorCombine] Run load/extract scalarization after scalarizing store. Apr 12 2021, 6:38 AM

fhahn added a child revision: D100302: [VectorCombine] Run load/extract scalarization after scalarizing store. Apr 12 2021, 6:39 AM

fhahn added inline comments. Apr 12 2021, 6:43 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
789	Why only allow constants here? If the index is a non-constant, it should be even more profitable, because most targets probably do not have instructions to insert at a variable index.

Allow non-constant index.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
789	Good catch. Thanks.

Harbormaster completed remote builds in B98409: Diff 337032. Apr 12 2021, 8:48 PM

LGTM, but @fhahn likely has a wider/fresher view of the expected patterns, so see if there are any other comments.

fhahn added inline comments. Apr 14 2021, 6:49 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
54	This seems a bit high to start with, perhaps we should start with a smaller limit? Unfortunately there are not many instances of this pattern in the test-suite/SPEC2000/SPEC2006 so we can't really evaluate the impact.
768	this potentially could be made a bit simpler by using `any_of` instead of the explicit loop.
800	I think we are also missing test coverage for the case where load and store are not in the same BB?
801	do we have test coverage for that?
802	I think we still need some more coverage for the various scenarios for `isMemModifiedBetween`, e.g. the case where we have stores in between that are must-alias, no-alias and may-alias. Also we should have a test for the limit.
884	I think there are some oddities with respect to `GlobalsAA` and it should also be preserved, e.g. see D82342 (same for the new PM)

lebedev.ri added inline comments. Apr 14 2021, 6:53 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
53–54	Option name doesn't match the description/usage. Iteration, to me, means how many times we will rerun this whole pass, while you use it as a number of instructions to scan.
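The scan limit discussed here bounds the number of instructions inspected between the load and the store, not the number of pass iterations. A minimal sketch of such a bounded scan follows, written with `std::any_of` as suggested for line 768. Everything in it is hypothetical and independent of LLVM: `Instr`, `mayBeModifiedBetween`, and the `MaxInstrsToScan` value are illustrative stand-ins, not the names or default used in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for an IR instruction: the sketch only needs to know
// whether it may write to the memory location of interest.
struct Instr {
  bool mayWriteLoc;
};

// Hypothetical scan budget; the value is illustrative only.
constexpr unsigned MaxInstrsToScan = 8;

// Returns true if any instruction in [begin, end) may modify the location,
// or if the budget runs out first (the conservative answer).
inline bool mayBeModifiedBetween(std::vector<Instr>::const_iterator begin,
                                 std::vector<Instr>::const_iterator end,
                                 unsigned budget = MaxInstrsToScan) {
  assert(budget > 0 && "a zero budget would reject every candidate");
  unsigned scanned = 0;
  return std::any_of(begin, end, [&](const Instr &I) {
    if (++scanned > budget)
      return true; // give up and assume the location was clobbered
    return I.mayWriteLoc;
  });
}
```

Returning true when the budget is exhausted keeps the transform safe: it is merely skipped for long instruction ranges, which trades a missed optimization for bounded compile time.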

Address comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
884	Do you mean dropping this and use result from GlobalsAA? I see GlobalsAA is already preserved.

Harbormaster completed remote builds in B98828: Diff 337649. Apr 15 2021, 1:30 AM

RKSimon added a subscriber: RKSimon. Apr 16 2021, 5:50 AM

RKSimon added inline comments.

llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll
3	You've made this a X86 test but used a very generic data layout (and tested on big endian as well) - maybe move out of the x86 sub directory?

RKSimon added a reviewer: RKSimon. Apr 16 2021, 5:51 AM

fhahn added inline comments. Apr 19 2021, 6:00 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
784	I'm not sure if the vector GEP will properly handle types that are non-power-of-2. Perhaps it might be good to limit this to types for which the following holds: `SI->typeSizeEqualsStoreSize(LI->getType())`?
884	Ah never mind, I missed that it was already preserved.

fhahn added inline comments. Apr 19 2021, 6:42 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
809	I think we need to set the alignment here. For example, the original store could have an alignment less than the default for the type.
llvm/test/Transforms/InstCombine/load-insert-store.ll
1	I'm not sure why this has been removed?
llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll
15	can you also add a test with element types > i8?

Address some comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
784	Hmm.. Not sure what you mean, maybe something like

%0 = getelementptr inbounds <16 x i24>, <16 x i24>* %q, i32 0, i32 3
store i24 %s, i24* %0, align 4

this should be the expected result. Or do we need to ensure that the scalar type size of the load equals the store's value type size?
llvm/test/Transforms/InstCombine/load-insert-store.ll
1	It's moved from `InstCombine` to `VectorCombine`, not deleted.
llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll
15	Do you mean `insertelement <16 x i8> %0, i32 %s, i32 2`? If not, there's a test for `<8 x i16>` below.

Harbormaster completed remote builds in B100203: Diff 339536. Apr 22 2021, 3:51 AM

fhahn added a subscriber: bjope. Apr 23 2021, 6:59 AM

fhahn added inline comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
784	Yes this was what I meant. The created pointer looks OK, but I think this is an area where it's worth to be extra careful. I am not too familiar with the potential problems, but I think @bjope has experience with such targets. @bjope WDYT, do you think this would work as expected? (my preference would be to start with dis-allowing the transform if the sizes don't match initially and enable it once we got confirmation that it is safe)
809	Perhaps I missed it, but could you add a test that specifies a smaller alignment for a store (`!align 1`) for a vector of `i16` or larger?
llvm/test/Transforms/InstCombine/load-insert-store.ll
1	Oh right, so when it was originally added the plan was to optimize the pattern in instcombine? Might be worth moving the file in a separate commit and then just have the diff here show the changes by this patch.
llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll
15	yep I missed that one, sorry.

bjope added inline comments. Apr 26 2021, 12:43 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
784	Yes, I think this would be a bit more complicated if trying to support types for which `DL.typeSizeEqualsStoreSize(SI->getType())` isn't true. Since the vectors are bit-packed you can't simply address a single vector element using a GEP otherwise. You wouldn't end up clobbering adjacent elements when doing the store (unless doing some kind of read-modify-write operation). Maybe you even need to look at the typeAllocSize (comparing it with the storeSize) if alignment is important (to avoid misaligned stores).

But I also wonder if using GEP:s to address individual vector elements is something we do elsewhere. I know that https://llvm.org/docs/GetElementPtr.html#can-gep-index-into-vector-elements still says that "In the future, it will probably be outright disallowed.". Has there been discussions somewhere about opening up "pandoras box" (?) and start doing such things? If so, is that well tested somewhere?

For example something like this compiles without any complaints:

; RUN: llc -O3 -mtriple x86_64-- -o - %s
target datalayout = "E"
define void @foo(<8 x i4>* %q, i4 %s) {
  %p = getelementptr inbounds <8 x i4>, <8 x i4>* %q, i32 0, i32 3
  store i4 %s, i4* %p
  ret void
}

but it ends up writing to `3(%rdi)` while the third element in that vector actually is at half of the byte at `1(%rdi)`.
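The arithmetic behind the `<8 x i4>` example can be sketched in a few lines of C++. This is an illustration only, not LLVM code: the three helpers are hypothetical, and `gepByteOffset` assumes the GEP steps over each element in whole bytes (bit width rounded up to a byte), which matches the `3(%rdi)` result quoted above for `i4` but ignores any extra padding a real data layout might add for other types.

```cpp
#include <cassert>
#include <cstdint>

// True when the element's bit width fills whole bytes, i.e. its type size
// equals its store size.
inline bool sizeEqualsStoreSize(std::uint64_t elemBits) {
  assert(elemBits > 0);
  return elemBits % 8 == 0;
}

// Byte offset a GEP-style computation would produce, ASSUMING each element
// is stepped over in whole bytes.
inline std::uint64_t gepByteOffset(std::uint64_t idx, std::uint64_t elemBits) {
  assert(elemBits > 0);
  return idx * ((elemBits + 7) / 8);
}

// Byte that holds the first bit of element idx in a bit-packed vector.
inline std::uint64_t packedByteOffset(std::uint64_t idx,
                                      std::uint64_t elemBits) {
  assert(elemBits > 0);
  return (idx * elemBits) / 8;
}
```

For index 3 of an `i4` vector the two offsets are 3 and 1, so a scalar store through the GEP lands on the wrong byte. For `i32` both are 12, which is why restricting the transform to element types whose size equals their store size avoids the problem.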

fhahn added inline comments. Apr 26 2021, 7:23 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
784	> Yes, I think this would be a bit more complicated if trying to support types for which DL.typeSizeEqualsStoreSize(SI->getType()) isn't true. Since the vectors are bit-packed you can't simply address a single vector element using a GEP otherwise. You wouldn't end up clobbering adjacent elements when doing the store (unless doing some kind of read-modify-write operation).

Thanks for confirming!

> Maybe you even need to look at the typeAllocSize (comparing it with the storeSize) if alignment is important (to avoid misaligned stores).

We are setting the alignment of the store to the minimum of the alignment for the scalar store and the original alignment of the store. Would that be missing something?

> But I also wonder if using GEP:s to address individual vector elements is something we do elsewhere.

It looks like at least `instcombine` likes to introduce such GEPs. Originally I was using something similar to the code in the link, which `instcombine` turned into a vector GEP, hence I updated D100273 to directly emit vector GEPs and suggested that here as well. https://godbolt.org/z/q4o6fM3eP

Update tests and store size constraint. Thanks for explanation from @bjope and @fhahn

Harbormaster completed remote builds in B101447: Diff 341243. Apr 28 2021, 11:57 AM

In D98240#2723175, @qiucf wrote:

Update tests and store size constraint. Thanks for explanation from @bjope and @fhahn

Thanks for the update! I think the only thing missing is a few tests for uncommon vector types, like <8 x i4> or <4 x i31>. Probably worth having at least 2 different tests with different types.

Add more tests.

Harbormaster completed remote builds in B103119: Diff 343568. May 6 2021, 8:19 PM

LGTM, thanks!

This revision is now accepted and ready to land. May 7 2021, 12:53 PM

Closed by commit rG2db4979c0fe0: [VectorCombine] Simplify to scalar store if only one element updated (authored by qiucf). May 8 2021, 3:17 AM

This revision was automatically updated to reflect the committed changes.

qiucf added a commit: rG2db4979c0fe0: [VectorCombine] Simplify to scalar store if only one element updated.

qiucf mentioned this in D71828: [InstCombine] Convert vector store to scalar store if only one element updated. May 8 2021, 7:33 PM

fhahn mentioned this in rG86497785d540: [VectorCombine] Scalarize vector load/extract. May 24 2021, 1:29 AM

For targets not supporting scalar load from vector memory (like ours), this breaks it:

%43 = load <8 x i32>, <8 x i32> addrspace(201)* %1, align 32, !tbaa !28
%44 = extractelement <8 x i32> %43, i32 0

Now:

%43 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(201)* %1, i32 0, i32 0
%44 = load i32, i32 addrspace(201)* %43, align 32

Are targets expected to provide patterns?

In D98240#2781048, @hgreving wrote:
For targets not supporting scalar load from vector memory (like ours), this breaks it:
%43 = load <8 x i32>, <8 x i32> addrspace(201)* %1, align 32, !tbaa !28
%44 = extractelement <8 x i32> %43, i32 0
Now:
%43 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(201)* %1, i32 0, i32 0
%44 = load i32, i32 addrspace(201)* %43, align 32
Are targets expected to provide patterns?

Interesting! I guess the code assumes that a scalar load is always possible & at least as cheap as the vector version. But I think it would make sense to ask the cost-model if that's the case. Not sure if it would be possible to test this with an in-tree target?

Hmm wait, I completely ignored this patch :/
Does this really not do any cost modelling?
This should at least check that scalar load isn't more costly
than the original vector load + insertelement + vector store.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
762–764	Does this ignore debuginfo?

Sorry for completely ignoring this :(
I'm fine with fixing this up as a followup, if that happens today.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
804–807	I'm certain this is a miscompile. The alignment that was valid for a vector store is not guaranteed to be valid for the store of a single vector element, unless it's the 0'th element of course. I think this needs to be something like

newalign = 1;
if (auto *C = dyn_cast<ConstantInt>(Idx)) {
  newalign = max(old store align, old load align);
  newalign = commonAlignment(newalign, Idx * DL.getTypeSize(NewElement.getType()))
}
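To make the alignment argument concrete, here is a small standalone sketch of the computation for a constant index. It is not the patch's code: `commonAlign` and `elementStoreAlign` are hypothetical helpers, and `commonAlign` is written on the assumption that it mirrors what LLVM's `commonAlignment` computes, namely the largest power of two that divides both the known alignment and the byte offset.

```cpp
#include <cassert>
#include <cstdint>

// Largest power of two dividing both `align` and `offset`.
// An offset of 0 keeps `align` unchanged.
inline std::uint64_t commonAlign(std::uint64_t align, std::uint64_t offset) {
  assert(align != 0 && (align & (align - 1)) == 0 &&
         "alignment must be a power of two");
  std::uint64_t bits = align | offset;
  return bits & (~bits + 1); // isolate the lowest set bit
}

// Alignment that can be claimed for a scalar store to element `idx` of a
// vector whose base address is known to be `vecAlign`-aligned.
inline std::uint64_t elementStoreAlign(std::uint64_t vecAlign,
                                       std::uint64_t elemBytes,
                                       std::uint64_t idx) {
  return commonAlign(vecAlign, idx * elemBytes);
}
```

With a 32-byte-aligned `<8 x i32>`, element 0 keeps alignment 32, element 1 only guarantees 4, and element 2 guarantees 8. Reusing the vector's alignment for every element would therefore overstate it for any nonzero index. For a non-constant index the offset is unknown, so the reviewer's sketch falls back to alignment 1.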

lebedev.ri added inline comments. May 26 2021, 1:33 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
802	Technically, size of the `insertelement` index doesn't have to match the size of the GEP index, the latter is controlled in datalayout. I'm not sure if/how we need to deal with that here, however.

In D98240#2781262, @fhahn wrote:
In D98240#2781048, @hgreving wrote:
For targets not supporting scalar load from vector memory (like ours), this breaks it:
%43 = load <8 x i32>, <8 x i32> addrspace(201)* %1, align 32, !tbaa !28
%44 = extractelement <8 x i32> %43, i32 0
Now:
%43 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(201)* %1, i32 0, i32 0
%44 = load i32, i32 addrspace(201)* %43, align 32
Are targets expected to provide patterns?
Interesting! I guess the code assumes that a scalar load is always possible & at least as cheap as the vector version. But I think it would make sense to ask the cost-model if that's the case. Not sure if it would be possible to test this with an in-tree target?

Hi thanks for getting back to me. I'm not sure if it's a cost model question, a straight-up disable switch for not morphing vector derefs into scalar might be better? Is there anything else in this pass that might do that? Unfortunately yes, I think there's no proper upstream target with this constraint. Though I am guessing I am not the only downstream target with a vector memory like that. The problem with trying to make this work is that I am worried about what happens to the pointer. Will I always be able to rely on that it will be aligned, probably not...

In D98240#2782010, @hgreving wrote:

Interesting! I guess the code assumes that a scalar load is always possible & at least as cheap as the vector version. But I think it would make sense to ask the cost-model if that's the case. Not sure if it would be possible to test this with an in-tree target?

Hi thanks for getting back to me. I'm not sure if it's a cost model question, a straight-up disable switch for not morphing vector derefs into scalar might be better? Is there anything else in this pass that might do that? Unfortunately yes, I think there's no proper upstream target with this constraint. Though I am guessing I am not the only downstream target with a vector memory like that. The problem with trying to make this work is that I am worried about what happens to the pointer. Will I always be able to rely on that it will be aligned, probably not...

I guess it depends on whether the backend can legalize/convert back to a vector load. In general, relying on the middle-end to not scalarize those loads for correctness seems a bit fragile. I'm not sure if it makes sense to add a TTI hook that's not used by any in-tree targets.

@qiucf would you be able to look into extending the code to check for the cost?

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
804–807	That's a good point. @qiucf can you look into this?

In D98240#2788629, @fhahn wrote:

In D98240#2782010, @hgreving wrote:

Interesting! I guess the code assumes that a scalar load is always possible & at least as cheap as the vector version. But I think it would make sense to ask the cost-model if that's the case. Not sure if it would be possible to test this with an in-tree target?

Hi thanks for getting back to me. I'm not sure if it's a cost model question, a straight-up disable switch for not morphing vector derefs into scalar might be better? Is there anything else in this pass that might do that? Unfortunately yes, I think there's no proper upstream target with this constraint. Though I am guessing I am not the only downstream target with a vector memory like that. The problem with trying to make this work is that I am worried about what happens to the pointer. Will I always be able to rely on that it will be aligned, probably not...

I guess it depends on whether the backend can legalize/convert back to a vector load. In general, relying on the middle-end to not scalarize those loads for correctness seems a bit fragile. I'm not sure if it makes sense to add a TTI hook that's not used by any in-tree targets.

@qiucf would you be able to look into extending the code to check for the cost?

I'll look into it, thanks. Sorry for the late notice.

qiucf mentioned this in D103419: [VectorCombine] Fix alignment in single element store. May 31 2021, 10:40 AM

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/VectorCombine.cpp	72 lines
llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll (moved from llvm/test/Transforms/InstCombine/load-insert-store.ll)	67 lines

Diff 329254

llvm/lib/Transforms/Vectorize/VectorCombine.cpp

[...]
 static cl::opt<bool> DisableVectorCombine(
     "disable-vector-combine", cl::init(false), cl::Hidden,
     cl::desc("Disable all vector combine transforms"));

 static cl::opt<bool> DisableBinopExtractShuffle(
     "disable-binop-extract-shuffle", cl::init(false), cl::Hidden,
     cl::desc("Disable binop extract to shuffle transforms"));

 static const unsigned InvalidIndex = std::numeric_limits<unsigned>::max();

 namespace {
 class VectorCombine {
 public:
   VectorCombine(Function &F, const TargetTransformInfo &TTI,
-                const DominatorTree &DT)
-      : F(F), Builder(F.getContext()), TTI(TTI), DT(DT) {}
+                const DominatorTree &DT, AAResults &AA)
+      : F(F), Builder(F.getContext()), TTI(TTI), DT(DT), AA(AA) {}

   bool run();

 private:
   Function &F;
   IRBuilder<> Builder;
   const TargetTransformInfo &TTI;
   const DominatorTree &DT;
+  AAResults &AA;

   bool vectorizeLoadInsert(Instruction &I);
   ExtractElementInst *getShuffleExtract(ExtractElementInst *Ext0,
                                         ExtractElementInst *Ext1,
                                         unsigned PreferredExtractIndex) const;
   bool isExtractExtractCheap(ExtractElementInst *Ext0, ExtractElementInst *Ext1,
                              unsigned Opcode,
                              ExtractElementInst *&ConvertToShuffle,
                              unsigned PreferredExtractIndex);
   void foldExtExtCmp(ExtractElementInst *Ext0, ExtractElementInst *Ext1,
                      Instruction &I);
   void foldExtExtBinop(ExtractElementInst *Ext0, ExtractElementInst *Ext1,
                        Instruction &I);
   bool foldExtractExtract(Instruction &I);
   bool foldBitcastShuf(Instruction &I);
   bool scalarizeBinopOrCmp(Instruction &I);
   bool foldExtractedCmps(Instruction &I);
+  bool foldSingleElementStore(Instruction &I);
 };
 } // namespace

 static void replaceValue(Value &Old, Value &New) {
   Old.replaceAllUsesWith(&New);
   New.takeName(&Old);
 }

▲ Show 20 Lines • Show All 653 Lines • ▼ Show 20 Lines	bool VectorCombine::foldExtractedCmps(Instruction &I) {
Value *VecLogic = Builder.CreateBinOp(cast<BinaryOperator>(I).getOpcode(),		Value *VecLogic = Builder.CreateBinOp(cast<BinaryOperator>(I).getOpcode(),
VCmp, Shuf);		VCmp, Shuf);
Value *NewExt = Builder.CreateExtractElement(VecLogic, CheapIndex);		Value *NewExt = Builder.CreateExtractElement(VecLogic, CheapIndex);
replaceValue(I, *NewExt);		replaceValue(I, *NewExt);
++NumVecCmpBO;		++NumVecCmpBO;
return true;		return true;
}		}

		// Check if memory loc modified between two instrs in the same BB
		static bool isMemModifiedBetween(BasicBlock::iterator Begin,
		BasicBlock::iterator End,
		const MemoryLocation &Loc, AAResults &AA) {
		for (BasicBlock::iterator BBI = Begin; BBI != End; ++BBI)
		fhahnUnsubmitted Done Reply Inline Actions I think we need a limit here, to avoid excessive compile time? fhahn: I think we need a limit here, to avoid excessive compile time?
		if (isModSet(AA.getModRefInfo(&*BBI, Loc)))
		return true;
		return false;
		lebedev.riUnsubmitted Not Done Reply Inline Actions Does this ignore debuginfo? lebedev.ri: Does this ignore debuginfo?
		}

		// Combine patterns like:
		// %0 = load <4 x i32>, <4 x i32>* %a
		fhahnUnsubmitted Done Reply Inline Actions this potentially could be made a bit simpler by using `any_of` instead of the explicit loop. fhahn: this potentially could be made a bit simpler by using `any_of` instead of the explicit loop.
		// %1 = insertelement <4 x i32> %0, i32 %b, i32 1
		// store <4 x i32> %1, <4 x i32>* %a
		// to:
		// %0 = bitcast <4 x i32>* %a to i32*
		// %1 = getelementptr inbounds i32, i32* %0, i64 0, i64 1
		// store i32 %b, i32* %1
		bool VectorCombine::foldSingleElementStore(Instruction &I) {
		StoreInst *SI = dyn_cast<StoreInst>(&I);
		if (SI == nullptr \|\| !SI->isSimple() \|\|
		spatelUnsubmitted Done Reply Inline Actions Can shorten: if (!SI \|\| !SI->isSimple() ...) spatel: Can shorten: if (!SI \|\| !SI->isSimple() ...)
		!SI->getValueOperand()->getType()->isVectorTy())
		return false;

		// TODO: Combine more complicated patterns (multiple insert) by referencing
		// TargetTransformInfo.
		Instruction *Source;
		Value *NewElement;
		fhahnUnsubmitted Not Done Reply Inline Actions I'm not sure if the vector GEP will properly handle types that are non-power-of-2 properly. Perhaps it might be good to limit this to types for which the following holds? `SI->typeSizeEqualsStoreSize(LI->getType())`? fhahn: I'm not sure if the vector GEP will properly handle types that are non-power-of-2 properly.
		qiucfAuthorUnsubmitted Done Reply Inline Actions Hmm.. Not sure what you mean, maybe something like %0 = getelementptr inbounds <16 x i24>, <16 x i24>* %q, i32 0, i32 3 store i24 %s, i24* %0, align 4 this should be expected result. Or do we need to ensure scalar type size of load equals to store's value type size? qiucf: Hmm.. Not sure what you mean, maybe something like ``` %0 = getelementptr inbounds <16 x i24>…
		fhahnUnsubmitted Not Done Reply Inline Actions Yes this was what I meant. The created pointer looks OK, but I think this is an area where it's worth to be extra careful. I am not too familiar with the potential problems, but I think @bjope has experience with such targets. @bjope WDYT, do you think this would work as expected? (my preference would be to start with dis-allowing the transform if the sizes don't match initially and enable it once we got confirmation that it is safe) fhahn: Yes this was what I meant. The created pointer looks OK, but I think this is an area where it's…
		bjopeUnsubmitted Not Done Reply Inline Actions Yes, I think this would be a bit more complicated if trying to support types for with `DL.typeSizeEqualsStoreSize(SI->getType())` isn't true. Since the vectors are bit-packed you can't simply address a single vector element using a GEP otherwise. You wouldn't end up clobbering adjacent elements when doing the store (unless doing some kind of read-modify-write operation). Maybe you even need to look at the typeAllocSize (comparing it with the storeSize) if alignment is important (to avoid misaligned stores). But I also wonder if using GEP:s to address individual vector elements is something we do elsewhere. I know that https://llvm.org/docs/GetElementPtr.html#can-gep-index-into-vector-elements still says that "In the future, it will probably be outright disallowed.". Has there been discussions somewhere about opening up "pandoras box" (?) and start doing such things? If so, is that well tested somewhere? For example something like this compiles without any complaints: ; RUN: llc -O3 -mtriple x86_64-- -o - %s target datalayout = "E" define void @foo(<8 x i4>* %q, i4 %s) { %p = getelementptr inbounds <8 x i4>, <8 x i4>* %q, i32 0, i32 3 store i4 %s, i4* %p ret void } but it ends up writing to `3(%rdi)` while the third element in that vector actually is at half of the byte at `1(%rdi)`. bjope: Yes, I think this would be a bit more complicated if trying to support types for with `DL.
		fhahnUnsubmitted Not Done Reply Inline Actions Yes, I think this would be a bit more complicated if trying to support types for with DL.typeSizeEqualsStoreSize(SI->getType()) isn't true. Since the vectors are bit-packed you can't simply address a single vector element using a GEP otherwise. You wouldn't end up clobbering adjacent elements when doing the store (unless doing some kind of read-modify-write operation). Thanks for confirming! Maybe you even need to look at the typeAllocSize (comparing it with the storeSize) if alignment is important (to avoid misaligned stores). We are setting the alignment of the store to the minimum of the alignment for the scalar store and the original alignment of the store. Would the be missing something? But I also wonder if using GEP:s to address individual vector elements is something we do elsewhere. It looks like at least `instcombine` likes to introduce such GEPs. Originally I was using something to the code in the link, which `instcombine` turned in a vector GEP, hence I updated D100273 to directly emit vector GEPS and suggested that here as well. https://godbolt.org/z/q4o6fM3eP fhahn: > Yes, I think this would be a bit more complicated if trying to support types for with DL.
		Constant *Idx;
		if (!match(SI->getValueOperand(),
		m_InsertElt(m_Instruction(Source), m_Value(NewElement),
		m_Constant(Idx))))
		return false;
		fhahnUnsubmitted Done Reply Inline Actions Why only allow constants here? If the index is a non-constant, it should be even more profitable, because most targets probably do not have instructions to insert at a variable index. fhahn: Why only allow constants here? If the index is a non-constant, it should be even more…
		qiucfAuthorUnsubmitted Done Reply Inline Actions Good catch. Thanks. qiucf: Good catch. Thanks.

		if (auto *Load = dyn_cast<LoadInst>(Source)) {
		Value *SrcAddr = Load->getPointerOperand()->stripPointerCasts();
		// Don't optimize for atomic/volatile load or stores.
		if (!Load->isSimple() \|\| Load->getParent() != SI->getParent() \|\|
		SrcAddr != SI->getPointerOperand()->stripPointerCasts() \|\|
		isMemModifiedBetween(Load->getIterator(), SI->getIterator(),
		MemoryLocation::get(SI), AA))
		return false;

		Type *ElePtrType = NewElement->getType()->getPointerTo();
		fhahnUnsubmitted Done Reply Inline Actions I think we are also missing test coverage for the case where load and store are not in the same BB? fhahn: I think we are also missing test coverage for the case where load and store are not in the same…
		Value *ElePtr =
		fhahnUnsubmitted Done Reply Inline Actions do we have test coverage for that? fhahn: do we have test coverage for that?
		Builder.CreatePointerCast(SI->getPointerOperand(), ElePtrType);
		fhahnUnsubmitted Done Reply Inline Actions Do we need a pointer cast here? I think we can just create a `GEP` with the vector pointer and indices `0, Idx`. fhahn: Do we need a pointer cast here? I think we can just create a `GEP` with the vector pointer…
		fhahnUnsubmitted Done Reply Inline Actions I think we still need some more coverage for the various scenarios for `isMemModifiedBetween`, .e.g. the case where we have stores in between that are must-alias, no-alias and may-alias. Also we should have a test for the limit. fhahn: I think we still need some more coverage for the various scenarios for `isMemModifiedBetween`, .
		lebedev.riUnsubmitted Not Done Reply Inline Actions Technically, size of the `insertelement` index doesn't have to match the size of the GEP index, the latter is controlled in datalayout. I'm not sure if/how we need to deal with that here, however. lebedev.ri: Technically, size of the `insertelement` index doesn't have to match the size of the GEP index…
		Value *GEP = Builder.CreateInBoundsGEP(NewElement->getType(), ElePtr, Idx);
		StoreInst *NSI = Builder.CreateStore(NewElement, GEP);
		NSI->copyMetadata(*SI, {LLVMContext::MD_nontemporal});
		fhahn (Done): What about other metadata?
		replaceValue(I, *NSI);
		I.eraseFromParent();
		fhahn (Not Done): Is there a reason to remove the instruction here? I don't think the other functions do so, so it might be better to keep things consistent (or change it for other patterns as well).
		qiucf (Author, Done): I think that's because we're operating on a `store`, which will not be erased automatically when it has no uses.
		lebedev.ri (Not Done): I'm certain this is a miscompile. The alignment that was valid for a vector store is not guaranteed to be valid for the store of a single vector element, unless it's the 0'th element of course. I think this needs to be something like `newalign = 1; if (auto *C = dyn_cast<ConstantInt>(Idx)) { newalign = max(old store align, old load align); newalign = commonAlignment(newalign, Idx * DL.getTypeSize(NewElement.getType())) }`
		fhahn (Not Done): That's a good point. @qiucf can you look into this?
		return true;
		}
		fhahn (Done): I think we need to set the alignment here. For example, the original store could have an alignment less than the default for the type.
		fhahn (Not Done): Perhaps I missed it, but could you add a test that specifies a smaller alignment for a store (`!align 1`) for a vector of `i16` or larger?

		return false;
		}
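
The alignment concern raised in the inline comments above can be illustrated with a small standalone sketch. It does not use LLVM headers; `commonAlignmentSketch` and `scalarStoreAlign` are hypothetical names that only mirror the arithmetic suggested by lebedev.ri for a constant index, assuming alignments are powers of two and the element size is given in bytes. For a non-constant index the reviewer's suggestion falls back to an alignment of 1, which is not modeled here.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for the idea behind llvm::commonAlignment: the largest
// power of two that divides both Align and Offset. An Offset of 0 leaves
// Align unchanged.
constexpr std::uint64_t commonAlignmentSketch(std::uint64_t Align,
                                              std::uint64_t Offset) {
  std::uint64_t Bits = Align | Offset;
  return Bits & (~Bits + 1); // isolate the lowest set bit
}

// Alignment that is safe for a scalar store to element Idx of a vector whose
// load/store carried the given alignments (constant-index case only).
constexpr std::uint64_t scalarStoreAlign(std::uint64_t LoadAlign,
                                         std::uint64_t StoreAlign,
                                         std::uint64_t Idx,
                                         std::uint64_t EltBytes) {
  return commonAlignmentSketch(std::max(LoadAlign, StoreAlign),
                               Idx * EltBytes);
}
```

Under these assumptions, element 3 of a `<8 x i16>` vector stored with `align 16` sits at byte offset 6, so only `align 2` can be claimed for the scalar store, while element 0 keeps the full vector alignment.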

/// This is the entry point for all transforms. Pass manager differences are		/// This is the entry point for all transforms. Pass manager differences are
/// handled in the callers of this function.		/// handled in the callers of this function.
bool VectorCombine::run() {		bool VectorCombine::run() {
if (DisableVectorCombine)		if (DisableVectorCombine)
return false;		return false;

// Don't attempt vectorization if the target does not support vectors.		// Don't attempt vectorization if the target does not support vectors.
if (!TTI.getNumberOfRegisters(TTI.getRegisterClassForType(/*Vector*/ true)))		if (!TTI.getNumberOfRegisters(TTI.getRegisterClassForType(/*Vector*/ true)))
return false;		return false;

bool MadeChange = false;		bool MadeChange = false;
for (BasicBlock &BB : F) {		for (BasicBlock &BB : F) {
// Ignore unreachable basic blocks.		// Ignore unreachable basic blocks.
if (!DT.isReachableFromEntry(&BB))		if (!DT.isReachableFromEntry(&BB))
continue;		continue;
// Do not delete instructions under here and invalidate the iterator.		// Do not delete instructions under here and invalidate the iterator.
		spatel (Done): This line of the comment is not accurate now. Remove or update.
// Walk the block forwards to enable simple iterative chains of transforms.		// Walk the block forwards to enable simple iterative chains of transforms.
// TODO: It could be more efficient to remove dead instructions		// TODO: It could be more efficient to remove dead instructions
// iteratively in this loop rather than waiting until the end.		// iteratively in this loop rather than waiting until the end.
for (Instruction &I : BB) {		for (Instruction &I : make_early_inc_range(BB)) {
if (isa<DbgInfoIntrinsic>(I))		if (isa<DbgInfoIntrinsic>(I))
continue;		continue;
Builder.SetInsertPoint(&I);		Builder.SetInsertPoint(&I);
MadeChange |= vectorizeLoadInsert(I);		MadeChange |= vectorizeLoadInsert(I);
MadeChange |= foldExtractExtract(I);		MadeChange |= foldExtractExtract(I);
MadeChange |= foldBitcastShuf(I);		MadeChange |= foldBitcastShuf(I);
MadeChange |= scalarizeBinopOrCmp(I);		MadeChange |= scalarizeBinopOrCmp(I);
MadeChange |= foldExtractedCmps(I);		MadeChange |= foldExtractedCmps(I);
		MadeChange |= foldSingleElementStore(I);
}		}
}		}

// We're done with transforms, so remove dead instructions.		// We're done with transforms, so remove dead instructions.
if (MadeChange)		if (MadeChange)
for (BasicBlock &BB : F)		for (BasicBlock &BB : F)
SimplifyInstructionsInBlock(&BB);		SimplifyInstructionsInBlock(&BB);

// (11 lines not shown)	public:
}		}

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<DominatorTreeWrapperPass>();		AU.addRequired<DominatorTreeWrapperPass>();
AU.addRequired<TargetTransformInfoWrapperPass>();		AU.addRequired<TargetTransformInfoWrapperPass>();
AU.setPreservesCFG();		AU.setPreservesCFG();
AU.addPreserved<DominatorTreeWrapperPass>();		AU.addPreserved<DominatorTreeWrapperPass>();
AU.addPreserved<GlobalsAAWrapperPass>();		AU.addPreserved<GlobalsAAWrapperPass>();
AU.addPreserved<AAResultsWrapperPass>();		AU.addPreserved<AAResultsWrapperPass>();
nikic (Done): This line shouldn't be dropped.
AU.addPreserved<BasicAAWrapperPass>();		AU.addPreserved<BasicAAWrapperPass>();
FunctionPass::getAnalysisUsage(AU);		FunctionPass::getAnalysisUsage(AU);
}		}

bool runOnFunction(Function &F) override {		bool runOnFunction(Function &F) override {
if (skipFunction(F))		if (skipFunction(F))
return false;		return false;
auto &TTI = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);		auto &TTI = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
auto &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();		auto &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();
VectorCombine Combiner(F, TTI, DT);		auto &AA = getAnalysis<AAResultsWrapperPass>().getAAResults();
		VectorCombine Combiner(F, TTI, DT, AA);
return Combiner.run();		return Combiner.run();
}		}
};		};
		fhahn (Not Done): I think there are some oddities with respect to `GlobalsAA` and it should also be preserved, e.g. see D82342 (same for the new PM).
		qiucf (Author, Done): Do you mean dropping this and using the result from GlobalsAA? I see GlobalsAA is already preserved.
		fhahn (Not Done): Ah never mind, I missed that it was already preserved.
} // namespace		} // namespace

char VectorCombineLegacyPass::ID = 0;		char VectorCombineLegacyPass::ID = 0;
INITIALIZE_PASS_BEGIN(VectorCombineLegacyPass, "vector-combine",		INITIALIZE_PASS_BEGIN(VectorCombineLegacyPass, "vector-combine",
"Optimize scalar/vector ops", false,		"Optimize scalar/vector ops", false,
false)		false)
INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)		INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
INITIALIZE_PASS_END(VectorCombineLegacyPass, "vector-combine",		INITIALIZE_PASS_END(VectorCombineLegacyPass, "vector-combine",
"Optimize scalar/vector ops", false, false)		"Optimize scalar/vector ops", false, false)
Pass *llvm::createVectorCombinePass() {		Pass *llvm::createVectorCombinePass() {
return new VectorCombineLegacyPass();		return new VectorCombineLegacyPass();
}		}

PreservedAnalyses VectorCombinePass::run(Function &F,		PreservedAnalyses VectorCombinePass::run(Function &F,
FunctionAnalysisManager &FAM) {		FunctionAnalysisManager &FAM) {
TargetTransformInfo &TTI = FAM.getResult<TargetIRAnalysis>(F);		TargetTransformInfo &TTI = FAM.getResult<TargetIRAnalysis>(F);
DominatorTree &DT = FAM.getResult<DominatorTreeAnalysis>(F);		DominatorTree &DT = FAM.getResult<DominatorTreeAnalysis>(F);
VectorCombine Combiner(F, TTI, DT);		AAResults &AA = FAM.getResult<AAManager>(F);
		VectorCombine Combiner(F, TTI, DT, AA);
if (!Combiner.run())		if (!Combiner.run())
return PreservedAnalyses::all();		return PreservedAnalyses::all();
PreservedAnalyses PA;		PreservedAnalyses PA;
PA.preserveSet<CFGAnalyses>();		PA.preserveSet<CFGAnalyses>();
PA.preserve<GlobalsAA>();		PA.preserve<GlobalsAA>();
PA.preserve<AAManager>();		PA.preserve<AAManager>();
PA.preserve<BasicAA>();		PA.preserve<BasicAA>();
return PA;		return PA;
}		}

llvm/test/Transforms/InstCombine/load-insert-store.ll

This file was moved to llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll.

llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll

This file was moved from llvm/test/Transforms/InstCombine/load-insert-store.ll.

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -S -instcombine < %s | FileCheck %s			; RUN: opt -S -vector-combine -data-layout=e < %s | FileCheck %s
				; RUN: opt -S -vector-combine -data-layout=E < %s | FileCheck %s
				RKSimon (Done): You've made this an X86 test but used a very generic data layout (and tested on big endian as well) - maybe move out of the x86 sub directory?

	define void @insert_store(<16 x i8>* %q, i8 zeroext %s) {			define void @insert_store(<16 x i8>* %q, i8 zeroext %s) {
	; CHECK-LABEL: @insert_store(			; CHECK-LABEL: @insert_store(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load <16 x i8>, <16 x i8>* [[Q:%.*]], align 16			; CHECK-NEXT: [[TMP0:%.*]] = bitcast <16 x i8>* [[Q:%.*]] to i8*
	; CHECK-NEXT: [[VECINS:%.*]] = insertelement <16 x i8> [[TMP0]], i8 [[S:%.*]], i32 3			; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, i8* [[TMP0]], i32 3
	; CHECK-NEXT: store <16 x i8> [[VECINS]], <16 x i8>* [[Q]], align 16			; CHECK-NEXT: store i8 [[S:%.*]], i8* [[TMP1]], align 1
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%0 = load <16 x i8>, <16 x i8>* %q			%0 = load <16 x i8>, <16 x i8>* %q
	%vecins = insertelement <16 x i8> %0, i8 %s, i32 3			%vecins = insertelement <16 x i8> %0, i8 %s, i32 3
				fhahn (Not Done): Can you also add a test with element types > i8?
				qiucf (Author, Done): Do you mean `insertelement <16 x i8> %0, i32 %s, i32 2`? If not, there's a test for `<8 x i16>` below.
				fhahn (Not Done): Yep, I missed that one, sorry.
	store <16 x i8> %vecins, <16 x i8>* %q			store <16 x i8> %vecins, <16 x i8>* %q
	ret void			ret void
	}			}
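
As a rough illustration of why the rewrite in `@insert_store` preserves behavior, the following standalone sketch models both forms in plain C++. The names `Vec16`, `insertViaVector` and `insertViaScalarStore` are hypothetical and exist only for this example; it assumes an in-range index and no intervening memory writes, and it is not how the pass itself is implemented.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Vec16 = std::array<std::uint8_t, 16>; // models <16 x i8>

// Original pattern: load the whole vector, replace one lane, store it back.
inline void insertViaVector(Vec16 *Q, std::uint8_t S, std::size_t Idx) {
  Vec16 Tmp = *Q; // load <16 x i8>
  Tmp[Idx] = S;   // insertelement
  *Q = Tmp;       // store <16 x i8>
}

// Transformed pattern: compute the element address and store only the scalar.
inline void insertViaScalarStore(Vec16 *Q, std::uint8_t S, std::size_t Idx) {
  std::uint8_t *Elt = Q->data() + Idx; // getelementptr
  *Elt = S;                            // store i8
}
```

Both functions leave the same bytes in memory for any valid `Idx`, which is the property the updated CHECK lines rely on.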

	define void @single_shuffle_store(<4 x i32>* %a, i32 %b) {			define void @insert_store_i16(<8 x i16>* %q, i16 zeroext %s) {
	; CHECK-LABEL: @single_shuffle_store(			; CHECK-LABEL: @insert_store_i16(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i32>, <4 x i32>* [[A:%.*]], align 16			; CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x i16>* [[Q:%.*]] to i16*
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x i32> [[TMP0]], i32 [[B:%.*]], i32 1			; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i16, i16* [[TMP0]], i32 3
	; CHECK-NEXT: store <4 x i32> [[TMP1]], <4 x i32>* [[A]], align 16, !nontemporal !0			; CHECK-NEXT: store i16 [[S:%.*]], i16* [[TMP1]], align 2
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%0 = load <4 x i32>, <4 x i32>* %a			%0 = load <8 x i16>, <8 x i16>* %q
	%1 = insertelement <4 x i32> %0, i32 %b, i32 1			%vecins = insertelement <8 x i16> %0, i16 %s, i32 3
	%2 = shufflevector <4 x i32> %0, <4 x i32> %1, <4 x i32> <i32 0, i32 5, i32 2, i32 3>			store <8 x i16> %vecins, <8 x i16>* %q
	store <4 x i32> %2, <4 x i32>* %a, !nontemporal !0
	ret void			ret void
	}			}

	define void @volatile_update(<16 x i8>* %q, <16 x i8>* %p, i8 zeroext %s) {			define void @volatile_update(<16 x i8>* %q, <16 x i8>* %p, i8 zeroext %s) {
	; CHECK-LABEL: @volatile_update(			; CHECK-LABEL: @volatile_update(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load <16 x i8>, <16 x i8>* [[Q:%.*]], align 16			; CHECK-NEXT: [[TMP0:%.*]] = load <16 x i8>, <16 x i8>* [[Q:%.*]], align 16
	; CHECK-NEXT: [[VECINS0:%.*]] = insertelement <16 x i8> [[TMP0]], i8 [[S:%.*]], i32 3			; CHECK-NEXT: [[VECINS0:%.*]] = insertelement <16 x i8> [[TMP0]], i8 [[S:%.*]], i32 3
	; (24 lines not shown)
	;			;
	entry:			entry:
	%ld = load <16 x i8>, <16 x i8>* %p			%ld = load <16 x i8>, <16 x i8>* %p
	%ins = insertelement <16 x i8> %ld, i8 %s, i32 3			%ins = insertelement <16 x i8> %ld, i8 %s, i32 3
	store <16 x i8> %ins, <16 x i8>* %q			store <16 x i8> %ins, <16 x i8>* %q
	ret void			ret void
	}			}

				; We can't transform if any instr could modify memory in between.
				; Here p and q may alias, so we can't remove the load.
				; r is impossible to alias with others, so it's safe to transform.
	define void @insert_store_mem_modify(<16 x i8>* %p, <16 x i8>* %q, <16 x i8>* noalias %r, i8 %s) {			define void @insert_store_mem_modify(<16 x i8>* %p, <16 x i8>* %q, <16 x i8>* noalias %r, i8 %s) {
	; CHECK-LABEL: @insert_store_mem_modify(			; CHECK-LABEL: @insert_store_mem_modify(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[LD:%.*]] = load <16 x i8>, <16 x i8>* [[P:%.*]], align 16			; CHECK-NEXT: [[LD:%.*]] = load <16 x i8>, <16 x i8>* [[P:%.*]], align 16
	; CHECK-NEXT: store <16 x i8> zeroinitializer, <16 x i8>* [[Q:%.*]], align 16			; CHECK-NEXT: store <16 x i8> zeroinitializer, <16 x i8>* [[Q:%.*]], align 16
	; CHECK-NEXT: [[INS:%.*]] = insertelement <16 x i8> [[LD]], i8 [[S:%.*]], i32 3			; CHECK-NEXT: [[INS:%.*]] = insertelement <16 x i8> [[LD]], i8 [[S:%.*]], i32 3
	; CHECK-NEXT: store <16 x i8> [[INS]], <16 x i8>* [[P]], align 16			; CHECK-NEXT: store <16 x i8> [[INS]], <16 x i8>* [[P]], align 16
	; CHECK-NEXT: [[LD2:%.*]] = load <16 x i8>, <16 x i8>* [[Q]], align 16
	; CHECK-NEXT: store <16 x i8> zeroinitializer, <16 x i8>* [[R:%.*]], align 16			; CHECK-NEXT: store <16 x i8> zeroinitializer, <16 x i8>* [[R:%.*]], align 16
	; CHECK-NEXT: [[INS2:%.*]] = insertelement <16 x i8> [[LD2]], i8 [[S]], i32 7			; CHECK-NEXT: [[TMP0:%.*]] = bitcast <16 x i8>* [[Q]] to i8*
	; CHECK-NEXT: store <16 x i8> [[INS2]], <16 x i8>* [[Q]], align 16			; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, i8* [[TMP0]], i32 7
				; CHECK-NEXT: store i8 [[S]], i8* [[TMP1]], align 1
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%ld = load <16 x i8>, <16 x i8>* %p			%ld = load <16 x i8>, <16 x i8>* %p
	store <16 x i8> zeroinitializer, <16 x i8>* %q			store <16 x i8> zeroinitializer, <16 x i8>* %q
	%ins = insertelement <16 x i8> %ld, i8 %s, i32 3			%ins = insertelement <16 x i8> %ld, i8 %s, i32 3
	store <16 x i8> %ins, <16 x i8>* %p			store <16 x i8> %ins, <16 x i8>* %p

	%ld2 = load <16 x i8>, <16 x i8>* %q			%ld2 = load <16 x i8>, <16 x i8>* %q
	store <16 x i8> zeroinitializer, <16 x i8>* %r			store <16 x i8> zeroinitializer, <16 x i8>* %r
	%ins2 = insertelement <16 x i8> %ld2, i8 %s, i32 7			%ins2 = insertelement <16 x i8> %ld2, i8 %s, i32 7
	store <16 x i8> %ins2, <16 x i8>* %q			store <16 x i8> %ins2, <16 x i8>* %q
	ret void			ret void
	}			}
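
The check exercised by `@insert_store_mem_modify` can be modeled with a toy scan over the instructions between the load and the store. This is a simplified sketch, not the patch's `isMemModifiedBetween`: `ToyInst`, `AliasKind`, `isMemModifiedBetweenSketch` and the `MaxScan` parameter are hypothetical, and the real helper takes instruction iterators, a `MemoryLocation` and alias-analysis results. The scan limit is assumed from the review comment asking for "a test for the limit"; its actual value is not shown here.

```cpp
#include <cstddef>
#include <vector>

enum class AliasKind { NoAlias, MayAlias, MustAlias };

// Hypothetical summary of one instruction between the load and the store.
struct ToyInst {
  bool MayWrite;            // could this instruction write memory?
  AliasKind AliasWithStore; // how its target relates to the stored location
};

// Returns true when the transform must be skipped: either some intervening
// instruction may write to the location, or the scan budget is exhausted and
// we answer conservatively.
inline bool isMemModifiedBetweenSketch(const std::vector<ToyInst> &Between,
                                       std::size_t MaxScan) {
  std::size_t Scanned = 0;
  for (const ToyInst &I : Between) {
    if (++Scanned > MaxScan)
      return true; // too many instructions to inspect, give up
    if (I.MayWrite && I.AliasWithStore != AliasKind::NoAlias)
      return true; // may-alias and must-alias writes both block the fold
  }
  return false;
}
```

In the test above, the store to `%q` between the load and store of `%p` corresponds to a may-alias write, so the first half is left alone; the store to the `noalias` pointer `%r` corresponds to a no-alias write, so the second half is folded.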

				; Check cases when calls may modify memory
				define void @insert_store_with_call(<16 x i8>* %p, <16 x i8>* %q, i8 %s) {
				; CHECK-LABEL: @insert_store_with_call(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[LD:%.*]] = load <16 x i8>, <16 x i8>* [[P:%.*]], align 16
				; CHECK-NEXT: call void @maywrite(<16 x i8>* [[P]])
				; CHECK-NEXT: [[INS:%.*]] = insertelement <16 x i8> [[LD]], i8 [[S:%.*]], i32 3
				; CHECK-NEXT: store <16 x i8> [[INS]], <16 x i8>* [[P]], align 16
				; CHECK-NEXT: call void @foo()
				; CHECK-NEXT: call void @nowrite(<16 x i8>* [[P]])
				; CHECK-NEXT: [[TMP0:%.*]] = bitcast <16 x i8>* [[P]] to i8*
				; CHECK-NEXT: [[TMP1:%.*]] = getelementptr inbounds i8, i8* [[TMP0]], i32 7
				; CHECK-NEXT: store i8 [[S]], i8* [[TMP1]], align 1
				; CHECK-NEXT: ret void
				;
				entry:
				%ld = load <16 x i8>, <16 x i8>* %p
				call void @maywrite(<16 x i8>* %p)
				%ins = insertelement <16 x i8> %ld, i8 %s, i32 3
				store <16 x i8> %ins, <16 x i8>* %p
				call void @foo() ; Barrier
				%ld2 = load <16 x i8>, <16 x i8>* %p
				call void @nowrite(<16 x i8>* %p)
				%ins2 = insertelement <16 x i8> %ld2, i8 %s, i32 7
				store <16 x i8> %ins2, <16 x i8>* %p
				ret void
				}

				declare void @foo()
				declare void @maywrite(<16 x i8>*)
				declare void @nowrite(<16 x i8>*) readonly

	!0 = !{}			!0 = !{}