This is an archive of the discontinued LLVM Phabricator instance.


Differential D98240

[VectorCombine] Simplify to scalar store if only one element updated
ClosedPublic

Authored by qiucf on Mar 9 2021, 2:00 AM.


Details

Reviewers

spatel
efriedma
fhahn
lebedev.ri
RKSimon

Commits

rG2db4979c0fe0: [VectorCombine] Simplify to scalar store if only one element updated

Summary

This is the VectorCombine version of revision D71828; it simplifies the load-insertelt-store pattern into getelementptr-store.
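As a rough illustration of the rewrite described in the summary, the sketch below models a `<4 x i32>` vector in memory as a `std::array` and compares the two forms. The names `Vec4`, `storeViaInsert` and `storeScalar` are hypothetical and are not part of the patch; the equivalence only holds when the index is in range and nothing else writes the memory between the load and the store.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Toy model of a <4 x i32> vector living in memory (hypothetical name).
using Vec4 = std::array<std::int32_t, 4>;

// Before: load the whole vector, replace one lane, store the whole vector back.
inline void storeViaInsert(Vec4 &mem, std::size_t idx, std::int32_t value) {
  Vec4 tmp = mem;   // load <4 x i32>
  tmp[idx] = value; // insertelement
  mem = tmp;        // store <4 x i32>
}

// After: address the lane directly and store only the scalar.
inline void storeScalar(Vec4 &mem, std::size_t idx, std::int32_t value) {
  std::int32_t *lane = mem.data() + idx; // getelementptr
  *lane = value;                         // store i32
}
```

Both functions leave the memory in the same state for any in-range index, which is the property the transform relies on.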

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

qiucf created this revision. Mar 9 2021, 2:00 AM

Herald added subscribers: jfb, hiraditya. Mar 9 2021, 2:00 AM

qiucf requested review of this revision. Mar 9 2021, 2:00 AM

Herald added a project: Restricted Project. Mar 9 2021, 2:00 AM

Herald added a subscriber: llvm-commits.

HLJ2009 added a subscriber: HLJ2009. Mar 9 2021, 2:26 AM

Fix pipeline tests

Herald added subscribers: nikic, kerbowa, steven_wu and 2 others. Mar 9 2021, 2:35 AM

Since the original proposal, MemorySSA has evolved. I still don't know much about that though. cc @asbirlea @nikic to see if this implementation is ok or should we use MemorySSA here?

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
777	Can shorten: if (!SI || !SI->isSimple() ...)
829	This line of the comment is not accurate now. Remove or update.

Harbormaster completed remote builds in B92825: Diff 329254. Mar 9 2021, 8:52 AM

Harbormaster completed remote builds in B92829: Diff 329261. Mar 9 2021, 9:38 AM

In D98240#2614181, @spatel wrote:

Since the original proposal, MemorySSA has evolved. I still don't know much about that though. cc @asbirlea @nikic to see if this implementation is ok or should we use MemorySSA here?

MemorySSA may be handy in this case, but it looks like it's not available close-by in the current pipeline position?

nikic added inline comments. Mar 10 2021, 3:28 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
810	This line shouldn't be dropped.

Address comments

Harbormaster completed remote builds in B93309: Diff 329968. Mar 11 2021, 1:24 PM

fhahn mentioned this in D100273: [VectorCombine] Scalarize vector load/extract. Apr 11 2021, 2:04 PM

I put up a similar patch to handle extractelement (load %ptr), %index in a similar fashion: D100273. Together with this patch, they greatly improve codegen for certain code generated using the C/C++ matrix extension https://clang.godbolt.org/z/qsccPdPf4

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
761	I think we need a limit here, to avoid excessive compile time?
802	Do we need a pointer cast here? I think we can just create a `GEP` with the vector pointer and indices `0, Idx`.
805	What about other metadata?
807	Is there a reason to remove the instruction here? I don't think the other functions do so, so it might be better to keep things consistent (or change it for other patterns as well)

Address comments

qiucf added inline comments. Apr 11 2021, 11:13 PM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
807	I think that's because we're operating on a `store`, which will not be erased automatically for empty use.

Harbormaster completed remote builds in B98217: Diff 336754. Apr 11 2021, 11:47 PM

fhahn mentioned this in D100302: [VectorCombine] Run load/extract scalarization after scalarizing store. Apr 12 2021, 6:38 AM

fhahn added a child revision: D100302: [VectorCombine] Run load/extract scalarization after scalarizing store. Apr 12 2021, 6:39 AM

fhahn added inline comments. Apr 12 2021, 6:43 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
795	Why only allow constants here? If the index is a non-constant, it should be even more profitable, because most targets probably do not have instructions to insert at a variable index.

Allow non-constant index.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
795	Good catch. Thanks.

Harbormaster completed remote builds in B98409: Diff 337032. Apr 12 2021, 8:48 PM

LGTM, but @fhahn likely has a wider/fresher view of the expected patterns, so see if there are any other comments.

fhahn added inline comments. Apr 14 2021, 6:49 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
54	This seems a bit high to start with, perhaps we should start with a smaller limit? Unfortunately there are not many instances of this pattern in the test-suite/SPEC2000/SPEC2006 so we can't really evaluate the impact.
768	this potentially could be made a bit simpler by using `any_of` instead of the explicit loop.
800	I think we are also missing test coverage for the case where load and store are not in the same BB?
801	do we have test coverage for that?
802	I think we still need some more coverage for the various scenarios for `isMemModifiedBetween`, e.g. the case where we have stores in between that are must-alias, no-alias and may-alias. Also we should have a test for the limit.
884	I think there are some oddities with respect to `GlobalsAA` and it should also be preserved, e.g. see D82342 (same for the new PM)
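The scan limit and the `any_of` suggestion from the inline comments above can be sketched without any LLVM types. This is a toy model under stated assumptions: `Inst`, `mayWriteLoc` and `isModifiedWithinLimit` are hypothetical names, and the boolean flag stands in for whatever the alias-analysis query would return. Hitting the limit is treated conservatively as "modified", mirroring the shape of the loop in the patch.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an instruction: all we track is whether it may
// write to the memory location of interest.
struct Inst {
  bool mayWriteLoc;
};

// Returns true if some instruction in [begin, end) may modify the location,
// or if more than maxScan non-writing instructions had to be inspected.
// Giving up early is reported as "modified" so callers stay conservative.
inline bool isModifiedWithinLimit(std::vector<Inst>::const_iterator begin,
                                  std::vector<Inst>::const_iterator end,
                                  std::size_t maxScan) {
  std::size_t scanned = 0;
  return std::any_of(begin, end, [&](const Inst &inst) {
    return inst.mayWriteLoc || ++scanned > maxScan;
  });
}
```

Tests for the limit, as requested in the review, would then need one case just under the threshold and one just over it.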

lebedev.ri added inline comments. Apr 14 2021, 6:53 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
53–54	Option name doesn't match the description/usage. Iteration, to me, means how many times we will rerun this whole pass, while you use it as a number of instructions to scan.

Address comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
884	Do you mean dropping this and use result from GlobalsAA? I see GlobalsAA is already preserved.

Harbormaster completed remote builds in B98828: Diff 337649. Apr 15 2021, 1:30 AM

RKSimon added a subscriber: RKSimon. Apr 16 2021, 5:50 AM

RKSimon added inline comments.

llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll
3	You've made this an X86 test but used a very generic data layout (and tested on big endian as well) - maybe move out of the x86 sub directory?

RKSimon added a reviewer: RKSimon. Apr 16 2021, 5:51 AM

fhahn added inline comments. Apr 19 2021, 6:00 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
784	I'm not sure if the vector GEP will properly handle types that are non-power-of-2. Perhaps it might be good to limit this to types for which the following holds? `SI->typeSizeEqualsStoreSize(LI->getType())`?
884	Ah never mind, I missed that it was already preserved.

fhahn added inline comments. Apr 19 2021, 6:42 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
809	I think we need to set the alignment here. For example, the original store could have an alignment less than the default for the type.
llvm/test/Transforms/InstCombine/load-insert-store.ll
1	I'm not sure why this has been removed?
llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll
15	can you also add a test with element types > i8?

Address some comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
784	Hmm.. Not sure what you mean, maybe something like

%0 = getelementptr inbounds <16 x i24>, <16 x i24>* %q, i32 0, i32 3
store i24 %s, i24* %0, align 4

This should be the expected result. Or do we need to ensure that the scalar type size of the load equals the store's value type size?
llvm/test/Transforms/InstCombine/load-insert-store.ll
1	It's moved from `InstCombine` to `VectorCombine`, not deleted.
llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll
15	Do you mean `insertelement <16 x i8> %0, i32 %s, i32 2`? If not, there's a test for `<8 x i16>` below.

Harbormaster completed remote builds in B100203: Diff 339536. Apr 22 2021, 3:51 AM

fhahn added a subscriber: bjope. Apr 23 2021, 6:59 AM

fhahn added inline comments.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
784	Yes this was what I meant. The created pointer looks OK, but I think this is an area where it's worth to be extra careful. I am not too familiar with the potential problems, but I think @bjope has experience with such targets. @bjope WDYT, do you think this would work as expected? (my preference would be to start with dis-allowing the transform if the sizes don't match initially and enable it once we got confirmation that it is safe)
809	Perhaps I missed it, but could you add a test that specifies a smaller alignment for a store (`!align 1`) for a vector of `i16` or larger?
llvm/test/Transforms/InstCombine/load-insert-store.ll
1	Oh right, so when it was originally added the plan was to optimize the pattern in instcombine? Might be worth moving the file in a separate commit and then just have the diff here show the changes by this patch.
llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll
15	yep I missed that one, sorry.

bjope added inline comments. Apr 26 2021, 12:43 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
784	Yes, I think this would be a bit more complicated if trying to support types for which `DL.typeSizeEqualsStoreSize(SI->getType())` isn't true. Since the vectors are bit-packed you can't simply address a single vector element using a GEP otherwise. You wouldn't end up clobbering adjacent elements when doing the store (unless doing some kind of read-modify-write operation). Maybe you even need to look at the typeAllocSize (comparing it with the storeSize) if alignment is important (to avoid misaligned stores).

But I also wonder if using GEP:s to address individual vector elements is something we do elsewhere. I know that https://llvm.org/docs/GetElementPtr.html#can-gep-index-into-vector-elements still says that "In the future, it will probably be outright disallowed.". Has there been discussions somewhere about opening up "pandoras box" (?) and start doing such things? If so, is that well tested somewhere?

For example something like this compiles without any complaints:

; RUN: llc -O3 -mtriple x86_64-- -o - %s
target datalayout = "E"
define void @foo(<8 x i4>* %q, i4 %s) {
  %p = getelementptr inbounds <8 x i4>, <8 x i4>* %q, i32 0, i32 3
  store i4 %s, i4* %p
  ret void
}

but it ends up writing to `3(%rdi)` while the third element in that vector actually is at half of the byte at `1(%rdi)`.
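The mismatch in the `<8 x i4>` example comes down to two different offset calculations, which the following sketch works through. All function names here are hypothetical. It assumes a GEP step equals the element size rounded up to whole bytes, which matches the `i4` case but ignores any extra alloc-size padding (for instance on `i24`).

```cpp
#include <cstdint>

// Bytes a GEP advances per element, assumed here to be the element size
// rounded up to whole bytes (true for i4; alloc-size padding is ignored).
constexpr std::uint64_t bytesPerGepStep(std::uint64_t elemBits) {
  return (elemBits + 7) / 8;
}

// Byte offset that a GEP with indices (0, idx) would compute.
constexpr std::uint64_t gepByteOffset(std::uint64_t elemBits,
                                      std::uint64_t idx) {
  return idx * bytesPerGepStep(elemBits);
}

// Byte that holds the first bit of lane idx when lanes are bit-packed.
constexpr std::uint64_t packedByteOffset(std::uint64_t elemBits,
                                         std::uint64_t idx) {
  return (idx * elemBits) / 8;
}

// Bit position inside that byte where lane idx starts.
constexpr std::uint64_t packedBitInByte(std::uint64_t elemBits,
                                        std::uint64_t idx) {
  return (idx * elemBits) % 8;
}

// True when the element's bit width fills its storage exactly, so the two
// offset calculations above agree for every index.
constexpr bool sizeEqualsStoreSize(std::uint64_t elemBits) {
  return elemBits == 8 * bytesPerGepStep(elemBits);
}
```

For `i4` and index 3 the GEP-style offset is 3 bytes while the packed lane starts halfway into byte 1, which is the discrepancy reported above; for `i32` both calculations give 12.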

fhahn added inline comments. Apr 26 2021, 7:23 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
784	> Yes, I think this would be a bit more complicated if trying to support types for which DL.typeSizeEqualsStoreSize(SI->getType()) isn't true. Since the vectors are bit-packed you can't simply address a single vector element using a GEP otherwise. You wouldn't end up clobbering adjacent elements when doing the store (unless doing some kind of read-modify-write operation).

Thanks for confirming!

> Maybe you even need to look at the typeAllocSize (comparing it with the storeSize) if alignment is important (to avoid misaligned stores).

We are setting the alignment of the store to the minimum of the alignment for the scalar store and the original alignment of the store. Would that be missing something?

> But I also wonder if using GEP:s to address individual vector elements is something we do elsewhere.

It looks like at least `instcombine` likes to introduce such GEPs. Originally I was using something similar to the code in the link, which `instcombine` turned into a vector GEP, hence I updated D100273 to directly emit vector GEPs and suggested that here as well. https://godbolt.org/z/q4o6fM3eP

Update tests and store size constraint. Thanks for explanation from @bjope and @fhahn

Harbormaster completed remote builds in B101447: Diff 341243. Apr 28 2021, 11:57 AM

In D98240#2723175, @qiucf wrote:

Update tests and store size constraint. Thanks for explanation from @bjope and @fhahn

Thanks for the update! I think the only thing missing is a few tests for uncommon vector types, like <8 x i4> or <4 x i31>. It is probably worth having at least 2 different tests with different types.

Add more tests.

Harbormaster completed remote builds in B103119: Diff 343568. May 6 2021, 8:19 PM

LGTM, thanks!

This revision is now accepted and ready to land. May 7 2021, 12:53 PM

Closed by commit rG2db4979c0fe0: [VectorCombine] Simplify to scalar store if only one element updated (authored by qiucf). May 8 2021, 3:17 AM

This revision was automatically updated to reflect the committed changes.

qiucf added a commit: rG2db4979c0fe0: [VectorCombine] Simplify to scalar store if only one element updated.

qiucf mentioned this in D71828: [InstCombine] Convert vector store to scalar store if only one element updated. May 8 2021, 7:33 PM

fhahn mentioned this in rG86497785d540: [VectorCombine] Scalarize vector load/extract. May 24 2021, 1:29 AM

For targets not supporting scalar load from vector memory (like ours), this breaks it:

%43 = load <8 x i32>, <8 x i32> addrspace(201)* %1, align 32, !tbaa !28
%44 = extractelement <8 x i32> %43, i32 0

Now:

%43 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(201)* %1, i32 0, i32 0
%44 = load i32, i32 addrspace(201)* %43, align 32

Are targets expected to provide patterns?

In D98240#2781048, @hgreving wrote:
For targets not supporting scalar load from vector memory (like ours), this breaks it:
%43 = load <8 x i32>, <8 x i32> addrspace(201)* %1, align 32, !tbaa !28
%44 = extractelement <8 x i32> %43, i32 0
Now:
%43 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(201)* %1, i32 0, i32 0
%44 = load i32, i32 addrspace(201)* %43, align 32
Are targets expected to provide patterns?

Interesting! I guess the code assumes that a scalar load is always possible & at least as cheap as the vector version. But I think it would make sense to ask the cost-model if that's the case. Not sure if it would be possible to test this with an in-tree target?

Hmm wait, I completely ignored this patch :/
Does this really not do any cost modelling?
This should at least check that scalar load isn't more costly
than the original vector load + insertelement + vector store.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
768–770	Does this ignore debuginfo?

Sorry for completely ignoring this :(
I'm fine with fixing this up as a followup, if that happens today.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
810–813	I'm certain this is a miscompile. The alignment that was valid for a vector store is not guaranteed to be valid for the store of a single vector element, unless it's the 0'th element of course. I think this needs to be something like

newalign = 1;
if (auto *C = dyn_cast<ConstantInt>(Idx)) {
  newalign = max(old store align, old load align);
  newalign = commonAlignment(newalign, Idx * DL.getTypeSize(NewElement.getType()))
}
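To make the alignment concern concrete, here is a small standalone sketch of the arithmetic for a constant index. `commonAlign` and `laneAlign` are hypothetical helpers written for this illustration; they only model the idea of taking the largest power of two that divides both the vector's alignment and the byte offset of the lane, and they do not cover the unknown-index case that the comment handles by falling back to 1.

```cpp
#include <cstdint>

// Largest power of two dividing both `align` (itself a power of two) and
// `offset`. An offset of 0 keeps the full alignment.
constexpr std::uint64_t commonAlign(std::uint64_t align, std::uint64_t offset) {
  const std::uint64_t combined = align | offset;
  return combined & (~combined + 1); // isolate the lowest set bit
}

// Alignment that can be claimed for lane `idx` of a vector whose base address
// is aligned to `vecAlign` bytes and whose elements are `elemBytes` wide.
constexpr std::uint64_t laneAlign(std::uint64_t vecAlign,
                                  std::uint64_t elemBytes, std::uint64_t idx) {
  return commonAlign(vecAlign, idx * elemBytes);
}
```

With a 32-byte-aligned `<8 x i32>`, lane 0 keeps alignment 32 but lane 1 can only be assumed 4-byte aligned, which is why copying the vector store's alignment onto the scalar store is unsafe for non-zero indices.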

lebedev.ri added inline comments. May 26 2021, 1:33 AM

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
808	Technically, size of the `insertelement` index doesn't have to match the size of the GEP index, the latter is controlled in datalayout. I'm not sure if/how we need to deal with that here, however.

In D98240#2781262, @fhahn wrote:
In D98240#2781048, @hgreving wrote:
For targets not supporting scalar load from vector memory (like ours), this breaks it:
%43 = load <8 x i32>, <8 x i32> addrspace(201)* %1, align 32, !tbaa !28
%44 = extractelement <8 x i32> %43, i32 0
Now:
%43 = getelementptr inbounds <8 x i32>, <8 x i32> addrspace(201)* %1, i32 0, i32 0
%44 = load i32, i32 addrspace(201)* %43, align 32
Are targets expected to provide patterns?
Interesting! I guess the code assumes that a scalar load is always possible & at least as cheap as the vector version. But I think it would make sense to ask the cost-model if that's the case. Not sure if it would be possible to test this with an in-tree target?

Hi thanks for getting back to me. I'm not sure if it's a cost model question, a straight-up disable switch for not morphing vector derefs into scalar might be better? Is there anything else in this pass that might do that? Unfortunately yes, I think there's no proper upstream target with this constraint. Though I am guessing I am not the only downstream target with a vector memory like that. The problem with trying to make this work is that I am worried about what happens to the pointer. Will I always be able to rely on that it will be aligned, probably not...

In D98240#2782010, @hgreving wrote:

Interesting! I guess the code assumes that a scalar load is always possible & at least as cheap as the vector version. But I think it would make sense to ask the cost-model if that's the case. Not sure if it would be possible to test this with an in-tree target?

Hi thanks for getting back to me. I'm not sure if it's a cost model question, a straight-up disable switch for not morphing vector derefs into scalar might be better? Is there anything else in this pass that might do that? Unfortunately yes, I think there's no proper upstream target with this constraint. Though I am guessing I am not the only downstream target with a vector memory like that. The problem with trying to make this work is that I am worried about what happens to the pointer. Will I always be able to rely on that it will be aligned, probably not...

I guess it depends on whether the backend can legalize/convert back to a vector load. In general, relying on the middle-end to not scalarize those loads for correctness seems a bit fragile. I'm not sure if it makes sense to add a TTI hook that's not used by any in-tree targets.

@qiucf would you be able to look into extending the code to check for the cost?

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
810–813	That's a good point. @qiucf can you look into this?

In D98240#2788629, @fhahn wrote:

In D98240#2782010, @hgreving wrote:

Interesting! I guess the code assumes that a scalar load is always possible & at least as cheap as the vector version. But I think it would make sense to ask the cost-model if that's the case. Not sure if it would be possible to test this with an in-tree target?

Hi thanks for getting back to me. I'm not sure if it's a cost model question, a straight-up disable switch for not morphing vector derefs into scalar might be better? Is there anything else in this pass that might do that? Unfortunately yes, I think there's no proper upstream target with this constraint. Though I am guessing I am not the only downstream target with a vector memory like that. The problem with trying to make this work is that I am worried about what happens to the pointer. Will I always be able to rely on that it will be aligned, probably not...

I guess it depends on whether the backend can legalize/convert back to a vector load. In general, relying on the middle-end to not scalarize those loads for correctness seems a bit fragile. I'm not sure if it makes sense to add a TTI hook that's not used by any in-tree targets.

@qiucf would you be able to look into extending the code to check for the cost?

I'll look into it, thanks. Sorry for the late notice.

qiucf mentioned this in D103419: [VectorCombine] Fix alignment in single element store. May 31 2021, 10:40 AM

Revision Contents

Path                                                                                                                          Size
llvm/lib/Transforms/Vectorize/VectorCombine.cpp                                                                               83 lines
llvm/test/CodeGen/AMDGPU/opt-pipeline.ll                                                                                      2 lines
llvm/test/Other/opt-LTO-pipeline.ll                                                                                           2 lines
llvm/test/Transforms/InstCombine/load-insert-store.ll (moved to llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll)  63 lines

Diff 336754

llvm/lib/Transforms/Vectorize/VectorCombine.cpp

static cl::opt<bool> DisableVectorCombine(		static cl::opt<bool> DisableVectorCombine(
"disable-vector-combine", cl::init(false), cl::Hidden,		"disable-vector-combine", cl::init(false), cl::Hidden,
cl::desc("Disable all vector combine transforms"));		cl::desc("Disable all vector combine transforms"));

static cl::opt<bool> DisableBinopExtractShuffle(		static cl::opt<bool> DisableBinopExtractShuffle(
"disable-binop-extract-shuffle", cl::init(false), cl::Hidden,		"disable-binop-extract-shuffle", cl::init(false), cl::Hidden,
cl::desc("Disable binop extract to shuffle transforms"));		cl::desc("Disable binop extract to shuffle transforms"));

		static cl::opt<unsigned> IterationThreshold(
		"vector-combine-iteration-threshold", cl::init(100), cl::Hidden,
		cl::desc("Max number of instructions to scan for vector combining."));

static const unsigned InvalidIndex = std::numeric_limits<unsigned>::max();		static const unsigned InvalidIndex = std::numeric_limits<unsigned>::max();

namespace {		namespace {
class VectorCombine {		class VectorCombine {
public:		public:
VectorCombine(Function &F, const TargetTransformInfo &TTI,		VectorCombine(Function &F, const TargetTransformInfo &TTI,
const DominatorTree &DT)		const DominatorTree &DT, AAResults &AA)
: F(F), Builder(F.getContext()), TTI(TTI), DT(DT) {}		: F(F), Builder(F.getContext()), TTI(TTI), DT(DT), AA(AA) {}

bool run();		bool run();

private:		private:
Function &F;		Function &F;
IRBuilder<> Builder;		IRBuilder<> Builder;
const TargetTransformInfo &TTI;		const TargetTransformInfo &TTI;
const DominatorTree &DT;		const DominatorTree &DT;
		AAResults &AA;

bool vectorizeLoadInsert(Instruction &I);		bool vectorizeLoadInsert(Instruction &I);
ExtractElementInst *getShuffleExtract(ExtractElementInst *Ext0,		ExtractElementInst *getShuffleExtract(ExtractElementInst *Ext0,
ExtractElementInst *Ext1,		ExtractElementInst *Ext1,
unsigned PreferredExtractIndex) const;		unsigned PreferredExtractIndex) const;
bool isExtractExtractCheap(ExtractElementInst *Ext0, ExtractElementInst *Ext1,		bool isExtractExtractCheap(ExtractElementInst *Ext0, ExtractElementInst *Ext1,
unsigned Opcode,		unsigned Opcode,
ExtractElementInst *&ConvertToShuffle,		ExtractElementInst *&ConvertToShuffle,
unsigned PreferredExtractIndex);		unsigned PreferredExtractIndex);
void foldExtExtCmp(ExtractElementInst *Ext0, ExtractElementInst *Ext1,		void foldExtExtCmp(ExtractElementInst *Ext0, ExtractElementInst *Ext1,
Instruction &I);		Instruction &I);
void foldExtExtBinop(ExtractElementInst *Ext0, ExtractElementInst *Ext1,		void foldExtExtBinop(ExtractElementInst *Ext0, ExtractElementInst *Ext1,
Instruction &I);		Instruction &I);
bool foldExtractExtract(Instruction &I);		bool foldExtractExtract(Instruction &I);
bool foldBitcastShuf(Instruction &I);		bool foldBitcastShuf(Instruction &I);
bool scalarizeBinopOrCmp(Instruction &I);		bool scalarizeBinopOrCmp(Instruction &I);
bool foldExtractedCmps(Instruction &I);		bool foldExtractedCmps(Instruction &I);
		bool foldSingleElementStore(Instruction &I);
};		};
} // namespace		} // namespace

static void replaceValue(Value &Old, Value &New) {		static void replaceValue(Value &Old, Value &New) {
Old.replaceAllUsesWith(&New);		Old.replaceAllUsesWith(&New);
New.takeName(&Old);		New.takeName(&Old);
}		}

[lines elided]	bool VectorCombine::foldExtractedCmps(Instruction &I) {

Value *Shuf = createShiftShuffle(VCmp, ExpensiveIndex, CheapIndex, Builder);		Value *Shuf = createShiftShuffle(VCmp, ExpensiveIndex, CheapIndex, Builder);
Value *VecLogic = Builder.CreateBinOp(cast<BinaryOperator>(I).getOpcode(),		Value *VecLogic = Builder.CreateBinOp(cast<BinaryOperator>(I).getOpcode(),
VCmp, Shuf);		VCmp, Shuf);
Value *NewExt = Builder.CreateExtractElement(VecLogic, CheapIndex);		Value *NewExt = Builder.CreateExtractElement(VecLogic, CheapIndex);
replaceValue(I, *NewExt);		replaceValue(I, *NewExt);
++NumVecCmpBO;		++NumVecCmpBO;
return true;		return true;
}		}

		// Check if memory loc modified between two instrs in the same BB
		static bool isMemModifiedBetween(BasicBlock::iterator Begin,
		BasicBlock::iterator End,
		const MemoryLocation &Loc, AAResults &AA) {
		unsigned LoopCnt = 0;
		for (BasicBlock::iterator BBI = Begin; BBI != End; ++BBI)
		if (isModSet(AA.getModRefInfo(&*BBI, Loc)) ||
		++LoopCnt > IterationThreshold)
		return true;
		return false;
		}

		// Combine patterns like:
		// %0 = load <4 x i32>, <4 x i32>* %a
		// %1 = insertelement <4 x i32> %0, i32 %b, i32 1
		// store <4 x i32> %1, <4 x i32>* %a
		// to:
		// %0 = bitcast <4 x i32>* %a to i32*
		// %1 = getelementptr inbounds i32, i32* %0, i64 0, i64 1
		// store i32 %b, i32* %1
		bool VectorCombine::foldSingleElementStore(Instruction &I) {
		StoreInst *SI = dyn_cast<StoreInst>(&I);
		if (!SI || !SI->isSimple() || !SI->getValueOperand()->getType()->isVectorTy())
		return false;

		// TODO: Combine more complicated patterns (multiple insert) by referencing
		// TargetTransformInfo.
		Instruction *Source;
		Value *NewElement;
		Constant *Idx;
		if (!match(SI->getValueOperand(),
		m_InsertElt(m_Instruction(Source), m_Value(NewElement),
		m_Constant(Idx))))
		return false;

		if (auto *Load = dyn_cast<LoadInst>(Source)) {
		Value *SrcAddr = Load->getPointerOperand()->stripPointerCasts();
		// Don't optimize for atomic/volatile load or stores.
		if (!Load->isSimple() || Load->getParent() != SI->getParent() ||
		SrcAddr != SI->getPointerOperand()->stripPointerCasts() ||
		isMemModifiedBetween(Load->getIterator(), SI->getIterator(),
		MemoryLocation::get(SI), AA))
		return false;

		Value *GEP = GetElementPtrInst::CreateInBounds(
		fhahn (Not Done): Is there a reason to remove the instruction here? I don't think the other functions do so, so it might be better to keep things consistent (or change it for other patterns as well).
		qiucf (Author, Done): I think that's because we're operating on a `store`, which will not be erased automatically for empty use.
		SI->getPointerOperand(), {ConstantInt::get(Idx->getType(), 0), Idx});
		lebedev.ri (Not Done): Technically, the size of the `insertelement` index doesn't have to match the size of the GEP index; the latter is controlled in datalayout. I'm not sure if/how we need to deal with that here, however.
		Builder.Insert(GEP);
		fhahn (Done): I think we need to set the alignment here. For example, the original store could have an alignment less than the default for the type.
		fhahn (Not Done): Perhaps I missed it, but could you add a test that specifies a smaller alignment for a store (`!align 1`) for a vector of `i16` or larger?
		StoreInst *NSI = Builder.CreateStore(NewElement, GEP);
		NSI->copyMetadata(*SI);
		replaceValue(I, *NSI);
		// Need erasing the store manually.
		lebedev.ri (Not Done): I'm certain this is a miscompile. The alignment that was valid for a vector store is not guaranteed to be valid for the store of a single vector element, unless it's the 0'th element of course. I think this needs to be something like `newalign = 1; if (auto *C = dyn_cast<ConstantInt>(Idx)) { newalign = max(old store align, old load align); newalign = commonAlignment(newalign, Idx * DL.getTypeSize(NewElement.getType())) }`
		fhahn (Not Done): That's a good point. @qiucf can you look into this?
		I.eraseFromParent();
		return true;
		}

		return false;
		}
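
The alignment concern raised in the inline comments above can be illustrated with a small standalone sketch. It does not use LLVM; `commonAlign` and `scalarStoreAlign` are hypothetical helper names introduced here, and the variable-index fallback of 1 simply follows the conservative suggestion in the review comment rather than whatever the final patch does. The idea is that a scalar store at byte offset `Idx * ElementSize` can only claim the largest power of two that divides both the vector alignment and that offset.

```cpp
#include <cassert>
#include <cstdint>

// Largest power of two dividing both `align` (assumed to be a power of two)
// and `offset`. An offset of zero keeps the full alignment.
uint64_t commonAlign(uint64_t align, uint64_t offset) {
  if (offset == 0)
    return align;
  uint64_t offsetAlign = offset & (~offset + 1); // lowest set bit of offset
  return offsetAlign < align ? offsetAlign : align;
}

// Hypothetical helper: alignment that is safe for the scalar store of one
// element. With a non-constant index it conservatively returns 1, as in the
// review sketch.
uint64_t scalarStoreAlign(uint64_t vecAlign, uint64_t elemBytes,
                          bool idxIsConstant, uint64_t idx) {
  if (!idxIsConstant)
    return 1;
  return commonAlign(vecAlign, idx * elemBytes);
}

// A few worked cases; the first two correspond to the `align 1` and
// `align 2` stores expected by the tests in this revision.
void checkAlignmentExamples() {
  assert(scalarStoreAlign(16, 1, true, 3) == 1);  // <16 x i8>, element 3
  assert(scalarStoreAlign(16, 2, true, 3) == 2);  // <8 x i16>, element 3
  assert(scalarStoreAlign(16, 4, true, 2) == 8);  // <4 x i32>, element 2
  assert(scalarStoreAlign(16, 4, true, 0) == 16); // element 0 keeps align 16
  assert(scalarStoreAlign(16, 4, false, 0) == 1); // variable index
}
```

Only element 0 inherits the full vector alignment, which is why copying the original store's alignment to the scalar store would be unsound.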

/// This is the entry point for all transforms. Pass manager differences are		/// This is the entry point for all transforms. Pass manager differences are
/// handled in the callers of this function.		/// handled in the callers of this function.
bool VectorCombine::run() {		bool VectorCombine::run() {
if (DisableVectorCombine)		if (DisableVectorCombine)
return false;		return false;

// Don't attempt vectorization if the target does not support vectors.		// Don't attempt vectorization if the target does not support vectors.
if (!TTI.getNumberOfRegisters(TTI.getRegisterClassForType(/*Vector*/ true)))		if (!TTI.getNumberOfRegisters(TTI.getRegisterClassForType(/*Vector*/ true)))
return false;		return false;
		spatel (Done): This line of the comment is not accurate now. Remove or update.

bool MadeChange = false;		bool MadeChange = false;
for (BasicBlock &BB : F) {		for (BasicBlock &BB : F) {
// Ignore unreachable basic blocks.		// Ignore unreachable basic blocks.
if (!DT.isReachableFromEntry(&BB))		if (!DT.isReachableFromEntry(&BB))
continue;		continue;
// Do not delete instructions under here and invalidate the iterator.		// Use early increment range so that we can erase instructions in loop.
// Walk the block forwards to enable simple iterative chains of transforms.		for (Instruction &I : make_early_inc_range(BB)) {
// TODO: It could be more efficient to remove dead instructions
// iteratively in this loop rather than waiting until the end.
for (Instruction &I : BB) {
if (isa<DbgInfoIntrinsic>(I))		if (isa<DbgInfoIntrinsic>(I))
continue;		continue;
Builder.SetInsertPoint(&I);		Builder.SetInsertPoint(&I);
MadeChange |= vectorizeLoadInsert(I);		MadeChange |= vectorizeLoadInsert(I);
MadeChange |= foldExtractExtract(I);		MadeChange |= foldExtractExtract(I);
MadeChange |= foldBitcastShuf(I);		MadeChange |= foldBitcastShuf(I);
MadeChange |= scalarizeBinopOrCmp(I);		MadeChange |= scalarizeBinopOrCmp(I);
MadeChange |= foldExtractedCmps(I);		MadeChange |= foldExtractedCmps(I);
		MadeChange |= foldSingleElementStore(I);
}		}
}		}

// We're done with transforms, so remove dead instructions.		// We're done with transforms, so remove dead instructions.
if (MadeChange)		if (MadeChange)
for (BasicBlock &BB : F)		for (BasicBlock &BB : F)
SimplifyInstructionsInBlock(&BB);		SimplifyInstructionsInBlock(&BB);

return MadeChange;		return MadeChange;
}		}

// Pass manager boilerplate below here.		// Pass manager boilerplate below here.

namespace {		namespace {
class VectorCombineLegacyPass : public FunctionPass {		class VectorCombineLegacyPass : public FunctionPass {
public:		public:
static char ID;		static char ID;
VectorCombineLegacyPass() : FunctionPass(ID) {		VectorCombineLegacyPass() : FunctionPass(ID) {
initializeVectorCombineLegacyPassPass(*PassRegistry::getPassRegistry());		initializeVectorCombineLegacyPassPass(*PassRegistry::getPassRegistry());
}		}

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<DominatorTreeWrapperPass>();		AU.addRequired<DominatorTreeWrapperPass>();
AU.addRequired<TargetTransformInfoWrapperPass>();		AU.addRequired<TargetTransformInfoWrapperPass>();
		AU.addRequired<AAResultsWrapperPass>();
AU.setPreservesCFG();		AU.setPreservesCFG();
AU.addPreserved<DominatorTreeWrapperPass>();		AU.addPreserved<DominatorTreeWrapperPass>();
nikic (Done): This line shouldn't be dropped.
AU.addPreserved<GlobalsAAWrapperPass>();		AU.addPreserved<GlobalsAAWrapperPass>();
AU.addPreserved<AAResultsWrapperPass>();		AU.addPreserved<AAResultsWrapperPass>();
AU.addPreserved<BasicAAWrapperPass>();		AU.addPreserved<BasicAAWrapperPass>();
FunctionPass::getAnalysisUsage(AU);		FunctionPass::getAnalysisUsage(AU);
}		}

bool runOnFunction(Function &F) override {		bool runOnFunction(Function &F) override {
if (skipFunction(F))		if (skipFunction(F))
return false;		return false;
auto &TTI = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);		auto &TTI = getAnalysis<TargetTransformInfoWrapperPass>().getTTI(F);
auto &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();		auto &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();
		fhahn (Not Done): I think there are some oddities with respect to `GlobalsAA` and it should also be preserved, e.g. see D82342 (same for the new PM).
		qiucf (Author, Done): Do you mean dropping this and use result from GlobalsAA? I see GlobalsAA is already preserved.
		fhahn (Not Done): Ah never mind, I missed that it was already preserved.
VectorCombine Combiner(F, TTI, DT);		auto &AA = getAnalysis<AAResultsWrapperPass>().getAAResults();
		VectorCombine Combiner(F, TTI, DT, AA);
return Combiner.run();		return Combiner.run();
}		}
};		};
} // namespace		} // namespace

char VectorCombineLegacyPass::ID = 0;		char VectorCombineLegacyPass::ID = 0;
INITIALIZE_PASS_BEGIN(VectorCombineLegacyPass, "vector-combine",		INITIALIZE_PASS_BEGIN(VectorCombineLegacyPass, "vector-combine",
"Optimize scalar/vector ops", false,		"Optimize scalar/vector ops", false,
false)		false)
INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)		INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
INITIALIZE_PASS_END(VectorCombineLegacyPass, "vector-combine",		INITIALIZE_PASS_END(VectorCombineLegacyPass, "vector-combine",
"Optimize scalar/vector ops", false, false)		"Optimize scalar/vector ops", false, false)
Pass *llvm::createVectorCombinePass() {		Pass *llvm::createVectorCombinePass() {
return new VectorCombineLegacyPass();		return new VectorCombineLegacyPass();
}		}

PreservedAnalyses VectorCombinePass::run(Function &F,		PreservedAnalyses VectorCombinePass::run(Function &F,
FunctionAnalysisManager &FAM) {		FunctionAnalysisManager &FAM) {
TargetTransformInfo &TTI = FAM.getResult<TargetIRAnalysis>(F);		TargetTransformInfo &TTI = FAM.getResult<TargetIRAnalysis>(F);
DominatorTree &DT = FAM.getResult<DominatorTreeAnalysis>(F);		DominatorTree &DT = FAM.getResult<DominatorTreeAnalysis>(F);
VectorCombine Combiner(F, TTI, DT);		AAResults &AA = FAM.getResult<AAManager>(F);
		VectorCombine Combiner(F, TTI, DT, AA);
if (!Combiner.run())		if (!Combiner.run())
return PreservedAnalyses::all();		return PreservedAnalyses::all();
PreservedAnalyses PA;		PreservedAnalyses PA;
PA.preserveSet<CFGAnalyses>();		PA.preserveSet<CFGAnalyses>();
PA.preserve<GlobalsAA>();		PA.preserve<GlobalsAA>();
PA.preserve<AAManager>();		PA.preserve<AAManager>();
PA.preserve<BasicAA>();		PA.preserve<BasicAA>();
return PA;		return PA;
}		}

llvm/test/CodeGen/AMDGPU/opt-pipeline.ll

	; GCN-O1-NEXT: Basic Alias Analysis (stateless AA impl)			; GCN-O1-NEXT: Basic Alias Analysis (stateless AA impl)
	; GCN-O1-NEXT: Function Alias Analysis Results			; GCN-O1-NEXT: Function Alias Analysis Results
	; GCN-O1-NEXT: Lazy Branch Probability Analysis			; GCN-O1-NEXT: Lazy Branch Probability Analysis
	; GCN-O1-NEXT: Lazy Block Frequency Analysis			; GCN-O1-NEXT: Lazy Block Frequency Analysis
	; GCN-O1-NEXT: Optimization Remark Emitter			; GCN-O1-NEXT: Optimization Remark Emitter
	; GCN-O1-NEXT: Combine redundant instructions			; GCN-O1-NEXT: Combine redundant instructions
	; GCN-O1-NEXT: Simplify the CFG			; GCN-O1-NEXT: Simplify the CFG
	; GCN-O1-NEXT: Dominator Tree Construction			; GCN-O1-NEXT: Dominator Tree Construction
	; GCN-O1-NEXT: Optimize scalar/vector ops
	; GCN-O1-NEXT: Basic Alias Analysis (stateless AA impl)			; GCN-O1-NEXT: Basic Alias Analysis (stateless AA impl)
	; GCN-O1-NEXT: Function Alias Analysis Results			; GCN-O1-NEXT: Function Alias Analysis Results
				; GCN-O1-NEXT: Optimize scalar/vector ops
	; GCN-O1-NEXT: Natural Loop Information			; GCN-O1-NEXT: Natural Loop Information
	; GCN-O1-NEXT: Lazy Branch Probability Analysis			; GCN-O1-NEXT: Lazy Branch Probability Analysis
	; GCN-O1-NEXT: Lazy Block Frequency Analysis			; GCN-O1-NEXT: Lazy Block Frequency Analysis
	; GCN-O1-NEXT: Optimization Remark Emitter			; GCN-O1-NEXT: Optimization Remark Emitter
	; GCN-O1-NEXT: Combine redundant instructions			; GCN-O1-NEXT: Combine redundant instructions
	; GCN-O1-NEXT: Canonicalize natural loops			; GCN-O1-NEXT: Canonicalize natural loops
	; GCN-O1-NEXT: LCSSA Verifier			; GCN-O1-NEXT: LCSSA Verifier
	; GCN-O1-NEXT: Loop-Closed SSA Form Pass			; GCN-O1-NEXT: Loop-Closed SSA Form Pass

llvm/test/Other/opt-LTO-pipeline.ll

	; CHECK-NEXT: Function Alias Analysis Results			; CHECK-NEXT: Function Alias Analysis Results
	; CHECK-NEXT: Natural Loop Information			; CHECK-NEXT: Natural Loop Information
	; CHECK-NEXT: Lazy Branch Probability Analysis			; CHECK-NEXT: Lazy Branch Probability Analysis
	; CHECK-NEXT: Lazy Block Frequency Analysis			; CHECK-NEXT: Lazy Block Frequency Analysis
	; CHECK-NEXT: Optimization Remark Emitter			; CHECK-NEXT: Optimization Remark Emitter
	; CHECK-NEXT: Combine redundant instructions			; CHECK-NEXT: Combine redundant instructions
	; CHECK-NEXT: Demanded bits analysis			; CHECK-NEXT: Demanded bits analysis
	; CHECK-NEXT: Bit-Tracking Dead Code Elimination			; CHECK-NEXT: Bit-Tracking Dead Code Elimination
				; CHECK-NEXT: Function Alias Analysis Results
	; CHECK-NEXT: Optimize scalar/vector ops			; CHECK-NEXT: Optimize scalar/vector ops
	; CHECK-NEXT: Scalar Evolution Analysis			; CHECK-NEXT: Scalar Evolution Analysis
	; CHECK-NEXT: Alignment from assumptions			; CHECK-NEXT: Alignment from assumptions
	; CHECK-NEXT: Function Alias Analysis Results
	; CHECK-NEXT: Optimization Remark Emitter			; CHECK-NEXT: Optimization Remark Emitter
	; CHECK-NEXT: Combine redundant instructions			; CHECK-NEXT: Combine redundant instructions
	; CHECK-NEXT: Lazy Value Information Analysis			; CHECK-NEXT: Lazy Value Information Analysis
	; CHECK-NEXT: Jump Threading			; CHECK-NEXT: Jump Threading
	; CHECK-NEXT: Cross-DSO CFI			; CHECK-NEXT: Cross-DSO CFI
	; CHECK-NEXT: Lower type metadata			; CHECK-NEXT: Lower type metadata
	; CHECK-NEXT: Lower type metadata			; CHECK-NEXT: Lower type metadata
	; CHECK-NEXT: FunctionPass Manager			; CHECK-NEXT: FunctionPass Manager

llvm/test/Transforms/InstCombine/load-insert-store.ll

This file was moved to llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll.

llvm/test/Transforms/VectorCombine/X86/load-insert-store.ll

This file was moved from llvm/test/Transforms/InstCombine/load-insert-store.ll.

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -S -instcombine < %s | FileCheck %s			; RUN: opt -S -vector-combine -data-layout=e < %s | FileCheck %s
				; RUN: opt -S -vector-combine -data-layout=E < %s | FileCheck %s
				RKSimon (Done): You've made this an X86 test but used a very generic data layout (and tested on big endian as well) - maybe move out of the x86 sub directory?

	define void @insert_store(<16 x i8>* %q, i8 zeroext %s) {			define void @insert_store(<16 x i8>* %q, i8 zeroext %s) {
	; CHECK-LABEL: @insert_store(			; CHECK-LABEL: @insert_store(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load <16 x i8>, <16 x i8>* [[Q:%.*]], align 16			; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds <16 x i8>, <16 x i8>* [[Q:%.*]], i32 0, i32 3
	; CHECK-NEXT: [[VECINS:%.*]] = insertelement <16 x i8> [[TMP0]], i8 [[S:%.*]], i32 3			; CHECK-NEXT: store i8 [[S:%.*]], i8* [[TMP0]], align 1
	; CHECK-NEXT: store <16 x i8> [[VECINS]], <16 x i8>* [[Q]], align 16
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%0 = load <16 x i8>, <16 x i8>* %q			%0 = load <16 x i8>, <16 x i8>* %q
	%vecins = insertelement <16 x i8> %0, i8 %s, i32 3			%vecins = insertelement <16 x i8> %0, i8 %s, i32 3
	store <16 x i8> %vecins, <16 x i8>* %q			store <16 x i8> %vecins, <16 x i8>* %q
				fhahn (Not Done): Can you also add a test with element types > i8?
				qiucf (Author, Done): Do you mean `insertelement <16 x i8> %0, i32 %s, i32 2`? If not, there's a test for `<8 x i16>` below.
				fhahn (Not Done): Yep, I missed that one, sorry.
	ret void			ret void
	}			}

	define void @single_shuffle_store(<4 x i32>* %a, i32 %b) {			define void @insert_store_i16(<8 x i16>* %q, i16 zeroext %s) {
	; CHECK-LABEL: @single_shuffle_store(			; CHECK-LABEL: @insert_store_i16(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i32>, <4 x i32>* [[A:%.*]], align 16			; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds <8 x i16>, <8 x i16>* [[Q:%.*]], i32 0, i32 3
	; CHECK-NEXT: [[TMP1:%.*]] = insertelement <4 x i32> [[TMP0]], i32 [[B:%.*]], i32 1			; CHECK-NEXT: store i16 [[S:%.*]], i16* [[TMP0]], align 2
	; CHECK-NEXT: store <4 x i32> [[TMP1]], <4 x i32>* [[A]], align 16, !nontemporal !0
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%0 = load <4 x i32>, <4 x i32>* %a			%0 = load <8 x i16>, <8 x i16>* %q
	%1 = insertelement <4 x i32> %0, i32 %b, i32 1			%vecins = insertelement <8 x i16> %0, i16 %s, i32 3
	%2 = shufflevector <4 x i32> %0, <4 x i32> %1, <4 x i32> <i32 0, i32 5, i32 2, i32 3>			store <8 x i16> %vecins, <8 x i16>* %q
	store <4 x i32> %2, <4 x i32>* %a, !nontemporal !0
	ret void			ret void
	}			}

	define void @volatile_update(<16 x i8>* %q, <16 x i8>* %p, i8 zeroext %s) {			define void @volatile_update(<16 x i8>* %q, <16 x i8>* %p, i8 zeroext %s) {
	; CHECK-LABEL: @volatile_update(			; CHECK-LABEL: @volatile_update(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load <16 x i8>, <16 x i8>* [[Q:%.*]], align 16			; CHECK-NEXT: [[TMP0:%.*]] = load <16 x i8>, <16 x i8>* [[Q:%.*]], align 16
	; CHECK-NEXT: [[VECINS0:%.*]] = insertelement <16 x i8> [[TMP0]], i8 [[S:%.*]], i32 3			; CHECK-NEXT: [[VECINS0:%.*]] = insertelement <16 x i8> [[TMP0]], i8 [[S:%.*]], i32 3
	Show All 24 Lines
	;			;
	entry:			entry:
	%ld = load <16 x i8>, <16 x i8>* %p			%ld = load <16 x i8>, <16 x i8>* %p
	%ins = insertelement <16 x i8> %ld, i8 %s, i32 3			%ins = insertelement <16 x i8> %ld, i8 %s, i32 3
	store <16 x i8> %ins, <16 x i8>* %q			store <16 x i8> %ins, <16 x i8>* %q
	ret void			ret void
	}			}

				; We can't transform if any instr could modify memory in between.
				; Here p and q may alias, so we can't remove the load.
				; r is impossible to alias with others, so it's safe to transform.
	define void @insert_store_mem_modify(<16 x i8>* %p, <16 x i8>* %q, <16 x i8>* noalias %r, i8 %s) {			define void @insert_store_mem_modify(<16 x i8>* %p, <16 x i8>* %q, <16 x i8>* noalias %r, i8 %s) {
	; CHECK-LABEL: @insert_store_mem_modify(			; CHECK-LABEL: @insert_store_mem_modify(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[LD:%.*]] = load <16 x i8>, <16 x i8>* [[P:%.*]], align 16			; CHECK-NEXT: [[LD:%.*]] = load <16 x i8>, <16 x i8>* [[P:%.*]], align 16
	; CHECK-NEXT: store <16 x i8> zeroinitializer, <16 x i8>* [[Q:%.*]], align 16			; CHECK-NEXT: store <16 x i8> zeroinitializer, <16 x i8>* [[Q:%.*]], align 16
	; CHECK-NEXT: [[INS:%.*]] = insertelement <16 x i8> [[LD]], i8 [[S:%.*]], i32 3			; CHECK-NEXT: [[INS:%.*]] = insertelement <16 x i8> [[LD]], i8 [[S:%.*]], i32 3
	; CHECK-NEXT: store <16 x i8> [[INS]], <16 x i8>* [[P]], align 16			; CHECK-NEXT: store <16 x i8> [[INS]], <16 x i8>* [[P]], align 16
	; CHECK-NEXT: [[LD2:%.*]] = load <16 x i8>, <16 x i8>* [[Q]], align 16
	; CHECK-NEXT: store <16 x i8> zeroinitializer, <16 x i8>* [[R:%.*]], align 16			; CHECK-NEXT: store <16 x i8> zeroinitializer, <16 x i8>* [[R:%.*]], align 16
	; CHECK-NEXT: [[INS2:%.*]] = insertelement <16 x i8> [[LD2]], i8 [[S]], i32 7			; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds <16 x i8>, <16 x i8>* [[Q]], i32 0, i32 7
	; CHECK-NEXT: store <16 x i8> [[INS2]], <16 x i8>* [[Q]], align 16			; CHECK-NEXT: store i8 [[S]], i8* [[TMP0]], align 1
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%ld = load <16 x i8>, <16 x i8>* %p			%ld = load <16 x i8>, <16 x i8>* %p
	store <16 x i8> zeroinitializer, <16 x i8>* %q			store <16 x i8> zeroinitializer, <16 x i8>* %q
	%ins = insertelement <16 x i8> %ld, i8 %s, i32 3			%ins = insertelement <16 x i8> %ld, i8 %s, i32 3
	store <16 x i8> %ins, <16 x i8>* %p			store <16 x i8> %ins, <16 x i8>* %p

	%ld2 = load <16 x i8>, <16 x i8>* %q			%ld2 = load <16 x i8>, <16 x i8>* %q
	store <16 x i8> zeroinitializer, <16 x i8>* %r			store <16 x i8> zeroinitializer, <16 x i8>* %r
	%ins2 = insertelement <16 x i8> %ld2, i8 %s, i32 7			%ins2 = insertelement <16 x i8> %ld2, i8 %s, i32 7
	store <16 x i8> %ins2, <16 x i8>* %q			store <16 x i8> %ins2, <16 x i8>* %q
	ret void			ret void
	}			}
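
The reason the first half of `@insert_store_mem_modify` must stay untouched can be sketched outside of LLVM. The following is a minimal C++ model, not the pass itself: `Vec`, `originalSeq`, `scalarizedSeq`, and `sameResult` are names made up for this illustration. It runs the original load/insert/store sequence and the scalarized form with an intervening full-vector store, once with distinct pointers and once with aliasing pointers.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Vec = std::array<uint8_t, 16>;

// Original pattern: load the whole vector from p, let an intervening store
// write zeros through q, insert one element, store the whole vector to p.
void originalSeq(Vec *p, Vec *q, uint8_t s) {
  Vec ld = *p;
  *q = Vec{}; // store <16 x i8> zeroinitializer
  ld[3] = s;
  *p = ld;
}

// Scalarized pattern: the load is gone and only one element is written.
void scalarizedSeq(Vec *p, Vec *q, uint8_t s) {
  *q = Vec{};
  (*p)[3] = s;
}

// Runs both sequences on identical initial memory and reports whether the
// final memory contents agree.
bool sameResult(bool alias) {
  Vec a1, b1;
  a1.fill(7);
  b1.fill(9);
  Vec a2 = a1, b2 = b1;
  originalSeq(&a1, alias ? &a1 : &b1, 42);
  scalarizedSeq(&a2, alias ? &a2 : &b2, 42);
  return a1 == a2 && b1 == b2;
}
```

With distinct pointers `sameResult(false)` holds. With `alias` set, the original sequence writes the stale loaded values back over the zeros while the scalarized one leaves the zeros in place, so the results differ. That is the may-alias situation the memory-modification check is meant to reject.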

				; Check cases when calls may modify memory
				define void @insert_store_with_call(<16 x i8>* %p, <16 x i8>* %q, i8 %s) {
				; CHECK-LABEL: @insert_store_with_call(
				; CHECK-NEXT: entry:
				; CHECK-NEXT: [[LD:%.*]] = load <16 x i8>, <16 x i8>* [[P:%.*]], align 16
				; CHECK-NEXT: call void @maywrite(<16 x i8>* [[P]])
				; CHECK-NEXT: [[INS:%.*]] = insertelement <16 x i8> [[LD]], i8 [[S:%.*]], i32 3
				; CHECK-NEXT: store <16 x i8> [[INS]], <16 x i8>* [[P]], align 16
				; CHECK-NEXT: call void @foo()
				; CHECK-NEXT: call void @nowrite(<16 x i8>* [[P]])
				; CHECK-NEXT: [[TMP0:%.*]] = getelementptr inbounds <16 x i8>, <16 x i8>* [[P]], i32 0, i32 7
				; CHECK-NEXT: store i8 [[S]], i8* [[TMP0]], align 1
				; CHECK-NEXT: ret void
				;
				entry:
				%ld = load <16 x i8>, <16 x i8>* %p
				call void @maywrite(<16 x i8>* %p)
				%ins = insertelement <16 x i8> %ld, i8 %s, i32 3
				store <16 x i8> %ins, <16 x i8>* %p
				call void @foo() ; Barrier
				%ld2 = load <16 x i8>, <16 x i8>* %p
				call void @nowrite(<16 x i8>* %p)
				%ins2 = insertelement <16 x i8> %ld2, i8 %s, i32 7
				store <16 x i8> %ins2, <16 x i8>* %p
				ret void
				}

				declare void @foo()
				declare void @maywrite(<16 x i8>*)
				declare void @nowrite(<16 x i8>*) readonly

	!0 = !{}			!0 = !{}