This is an archive of the discontinued LLVM Phabricator instance.

Paths

include/llvm/InitializePasses.h
include/llvm/LinkAllPasses.h
include/llvm/Transforms/Vectorize.h
lib/Transforms/Vectorize/CMakeLists.txt
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
lib/Transforms/Vectorize/Vectorize.cpp
test/Transforms/LoadStoreVectorizer/AMDGPU/extended-index.ll
test/Transforms/LoadStoreVectorizer/AMDGPU/insertion-point.ll
test/Transforms/LoadStoreVectorizer/AMDGPU/interleaved-mayalias-store.ll
test/Transforms/LoadStoreVectorizer/AMDGPU/lit.local.cfg
test/Transforms/LoadStoreVectorizer/AMDGPU/merge-stores.ll
test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll
test/Transforms/LoadStoreVectorizer/AMDGPU/no-implicit-float.ll

Differential D19501

Add LoadStoreVectorizer pass
ClosedPublic

Authored by arsenm on Apr 25 2016, 3:11 PM.

Details

Reviewers

tstellarAMD
jlebar
escha
resistor

Summary

This was contributed by Apple, and I've been working on
minimal cleanups and generalizing it.

Event Timeline

arsenm updated this revision to Diff 54922. Apr 25 2016, 3:11 PM

arsenm retitled this revision from to Add LoadStoreVectorizer pass.

arsenm updated this object.

arsenm added reviewers: escha, resistor, tstellarAMD.

arsenm added a subscriber: llvm-commits.

Herald added subscribers: mzolotukhin, sanjoy. Apr 25 2016, 3:11 PM

Why can't that be achieved by SLP vectorizer?

Michael

In D19501#411387, @mzolotukhin wrote:

Why can't that be achieved by SLP vectorizer?

Michael

The SLPVectorizer is concerned with forming all vectors. It will give up if it can't vectorize all downstream scalar operations fed by the loads. On GPUs that only have vector loads, and all other operations are scalar, this isn't desirable. At best the vectors are formed just to be scalarized during legalization.

Yeah, I can see the motivation, but my question was why not add this capability to SLP instead of creating a completely new pass. This seems to introduce a lot of code duplication.

Michael

In D19501#411411, @mzolotukhin wrote:

Yeah, I can see the motivation, but my question was why not add this capability to SLP instead of creating a completely new pass. This seems to introduce a lot of code duplication.

Michael

It's not exactly the same problem. The SLP vectorizer tries to vectorize stores, then the operations leading to the stores if they are vectorizable, and then the loads. This pass vectorizes loads and stores and ignores the intervening instructions (of which there may be none; loads that feed a computation whose result is never stored should also be vectorized). The LoadStoreVectorizer should be able to produce vectors of arbitrary size and split them as requested by the target, even if the types are not legal.
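
As a rough illustration of the difference described above, the following is a minimal sketch in plain C++ (no LLVM types) of grouping memory accesses purely by address, without ever looking at how the loaded values are used. The `Access` struct and `findConsecutiveRuns` are hypothetical names for this example and are not taken from the patch; a real pass would also have to respect aliasing, program order, and the vector widths the target asks for.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of a scalar memory access: which base pointer it uses,
// its byte offset from that base, and how many bytes it touches.
struct Access {
  int Base;
  int64_t Offset;
  int64_t Size;
};

// Groups accesses into runs in which each access starts exactly where the
// previous one ends. Users of the loaded values are never inspected.
std::vector<std::vector<Access>> findConsecutiveRuns(std::vector<Access> Accesses) {
  std::sort(Accesses.begin(), Accesses.end(),
            [](const Access &A, const Access &B) {
              return A.Base != B.Base ? A.Base < B.Base : A.Offset < B.Offset;
            });
  std::vector<std::vector<Access>> Runs;
  for (const Access &A : Accesses) {
    assert(A.Size > 0 && "accesses must touch at least one byte");
    bool Extends = !Runs.empty() && Runs.back().back().Base == A.Base &&
                   Runs.back().back().Offset + Runs.back().back().Size == A.Offset;
    if (Extends)
      Runs.back().push_back(A);
    else
      Runs.push_back({A});
  }
  return Runs;
}
```

For four-byte accesses at byte offsets 8, 0, 4 and 16 from one base, the first three form a single run and the access at offset 16 stands alone.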

mssimpso added a subscriber: mssimpso. Apr 27 2016, 1:00 PM

I also see quite a bit of duplication with SLP. Could this also be done with the ConsecutiveStore and ConsecutiveLoad optimizations in SelectionDAG?

In D19501#414396, @mssimpso wrote:

I also see quite a bit of duplication with SLP. Could this also be done with the ConsecutiveStore and ConsecutiveLoad optimizations in SelectionDAG?

The DAG load and store combining is less effective, and it is also restricted to combining into legal types.

Hi Matt,

I think it's ok to have it in a separate pass, but we really need to try to factor out common parts. For instance, we definitely have propagateMetadata and isConsecutiveAccess in SLP.

Thanks,
Michael

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
519	Can it be a range loop?

Cleanups

In D19501#414459, @mzolotukhin wrote:

Hi Matt,

I think it's ok to have it in a separate pass, but we really need to try to factor out common parts. For instance, we definitely have propagateMetadata and isConsecutiveAccess in SLP.

Thanks,
Michael

isConsecutiveAccess is already factored into a separate utility function, but it seems to have diverged from the one here. Some work will be needed to merge them. I've added a FIXME for it

In D19501#426396, @arsenm wrote:

In D19501#414459, @mzolotukhin wrote:

Hi Matt,

I think it's ok to have it in a separate pass, but we really need to try to factor out common parts. For instance, we definitely have propagateMetadata and isConsecutiveAccess in SLP.

Thanks,
Michael

isConsecutiveAccess is already factored into a separate utility function, but it seems to have diverged from the one here. Some work will be needed to merge them. I've added a FIXME for it

D20639 splits SLPVectorizer's propagateMetadata into a utility function

I am very excited about this; please let me know if there's anything I can do to help get it in.

include/llvm/Transforms/Vectorize.h
147	Should the VecRegSize be pulled from TTI (at least as a default)?
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
89	Typos

arsenm added inline comments. May 25 2016, 2:01 PM

include/llvm/Transforms/Vectorize.h
147	This is done in the follow-up patches D19502 and D19504.

Just wanted to mention that this pass was originally contributed by Volkan Keles.

Thanks Volkan!

I can crash this locally. We assert for trying to create an invalid bitcast at LoadStoreVectorizer.cpp:833:

831	      if (Extracted->getType() != UI->getType())
832	        Extracted =
833	          cast<Instruction>(Builder.CreateBitCast(Extracted, UI->getType()));

(rr) p Extracted->dump()
  %25 = extractelement <2 x i64> %23, i32 1
(rr) p UI->getType()->dump()
%"struct.Eigen::half"*

I will try to get a testcase.

OK, I got a somewhat minimized testcase.

$ curl https://gist.githubusercontent.com/anonymous/c3d28b883b97b476b9a2ecb4d5dac807/raw/cd830af846bd8738544783305eb7618449d06eba/- | llc

llc: ../src/lib/IR/Instructions.cpp:2582: static llvm::CastInst *llvm::CastInst::Create(Instruction::CastOps, llvm::Value *, llvm::Type *, const llvm::Twine &, llvm::Instruction *): Assertion `castIsValid(op, S, Ty) && "Invalid cast!"' failed.
#0 0x0000000001927c08 llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x1927c08)
#1 0x0000000001926296 llvm::sys::RunSignalHandlers() (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x1926296)
#2 0x000000000192880a SignalHandler(int) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x192880a)
#3 0x00007fe0a84d5340 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x10340)
#4 0x00007fe0a76fdcc9 gsignal /build/eglibc-3GlaMS/eglibc-2.19/signal/../nptl/sysdeps/unix/sysv/linux/raise.c:56:0
#5 0x00007fe0a77010d8 abort /build/eglibc-3GlaMS/eglibc-2.19/stdlib/abort.c:91:0
#6 0x00007fe0a76f6b86 __assert_fail_base /build/eglibc-3GlaMS/eglibc-2.19/assert/assert.c:92:0
#7 0x00007fe0a76f6c32 (/lib/x86_64-linux-gnu/libc.so.6+0x2fc32)
#8 0x00000000014dd7b9 llvm::CastInst::Create(llvm::Instruction::CastOps, llvm::Value*, llvm::Type*, llvm::Twine const&, llvm::Instruction*) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x14dd7b9)
#9 0x000000000079d58b llvm::IRBuilder<llvm::ConstantFolder, llvm::IRBuilderDefaultInserter>::CreateCast(llvm::Instruction::CastOps, llvm::Value*, llvm::Type*, llvm::Twine const&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x79d58b)
#10 0x00000000019d84fd (anonymous namespace)::Vectorizer::vectorizeLoadChain(llvm::ArrayRef<llvm::Value*>) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x19d84fd)
#11 0x00000000019d7f57 (anonymous namespace)::Vectorizer::vectorizeLoadChain(llvm::ArrayRef<llvm::Value*>) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x19d7f57)
#12 0x00000000019d7809 (anonymous namespace)::Vectorizer::vectorizeChains(llvm::MapVector<llvm::Value*, llvm::SmallVector<llvm::Value*, 8u>, llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int> >, std::vector<std::pair<llvm::Value*, llvm::SmallVector<llvm::Value*, 8u> >, std::allocator<std::pair<llvm::Value*, llvm::SmallVector<llvm::Value*, 8u> > > > >&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x19d7809)
#13 0x00000000019d5cd8 (anonymous namespace)::LoadStoreVectorizer::runOnFunction(llvm::Function&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x19d5cd8)
#14 0x00000000014f84d4 llvm::FPPassManager::runOnFunction(llvm::Function&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x14f84d4)
#15 0x00000000014f871b llvm::FPPassManager::runOnModule(llvm::Module&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x14f871b)
#16 0x00000000014f8bf7 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x14f8bf7)
#17 0x00000000005fc419 compileModule(char**, llvm::LLVMContext&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x5fc419)
#18 0x00000000005f9a4b main (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x5f9a4b)
#19 0x00007fe0a76e8ec5 __libc_start_main /build/eglibc-3GlaMS/eglibc-2.19/csu/libc-start.c:321:0
#20 0x00000000005f5f22 _start (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x5f5f22)
Stack dump:
0.	Program arguments: /usr/local/google/home/jlebar/code/llvm/release/bin/llc 
1.	Running pass 'Function Pass Manager' on module '<stdin>'.
2.	Running pass 'GPU Load and Store Vectorizer' on function '@_ZN5Eigen8internal15EigenMetaKernelINS_15TensorEvaluatorIKNS_14TensorAssignOpINS_9TensorMapINS_6TensorIxLi0ELi1ElEELi1EEEKNS_18TensorConversionOpIxKNS_20TensorTupleReducerOpINS0_18ArgMaxTupleReducerINS_5TupleIlNS_4halfEEEEEKNS_5arrayIlLm1EEEKNS4_INS5_IKSC_Li1ELi1ElEELi1EEEEEEEEENS_9GpuDeviceEEElEEvT_T0_'
Aborted (core dumped)

OK, I see the problem. This is trying to vectorize two consecutive loads, where the first is loading from an i64*, and the second is loading from an Eigen::half**. We can in fact vectorize these on my target, because i64 and pointers have the same size. But we need an inttoptr instruction if the vector type is <2 x i64>.
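
The choice of cast described above can be sketched as follows. This is a minimal illustration under a simplified type model in which every non-pointer type is an integer; `Ty` and `pickCast` are hypothetical names and are not code from the patch or its follow-up.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified type model: only what is needed to pick a cast.
// Assumption: any type that is not a pointer is an integer.
struct Ty {
  bool IsPointer;
  unsigned SizeInBits;
};

// Picks the IR cast that turns an extracted vector element (From) back into
// the type the original scalar load produced (To).
std::string pickCast(Ty From, Ty To) {
  assert(From.SizeInBits == To.SizeInBits && "element sizes must match");
  if (From.IsPointer == To.IsPointer)
    return "bitcast"; // int <-> int or pointer <-> pointer
  return From.IsPointer ? "ptrtoint" : "inttoptr";
}
```

With a `<2 x i64>` vector load, the element that replaces the `Eigen::half**` load is an `i64` that has to become a pointer, so the sketch selects `inttoptr` where the crashing code asked for a bitcast. LLVM's `IRBuilder` offers `CreateBitOrPointerCast`, which makes a similar three-way choice; whether the follow-up patch uses it is not stated in this thread.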

Minor cleanup and additional tests

In D19501#441328, @jlebar wrote:

OK, I see the problem. This is trying to vectorize two consecutive loads, where the first is loading from an i64*, and the second is loading from an Eigen::half**. We can in fact vectorize these on my target, because i64 and pointers have the same size. But we need an inttoptr instruction if the vector type is <2 x i64>.

I'll post a separate follow up patch which fixes this case

arsenm added a child revision: D20705: LoadStoreVectorizer: Fix assert when merging pointer ops. May 26 2016, 3:40 PM

jlebar added a parent revision: D20639: SLPVectorizer: Move propagateMetadata to VectorUtils. May 26 2016, 5:12 PM

jlebar removed a parent revision: D20639: SLPVectorizer: Move propagateMetadata to VectorUtils.

jlebar added a parent revision: D20639: SLPVectorizer: Move propagateMetadata to VectorUtils.

Not to scope-creep this, because it's immensely helpful as-is, but looking through some Eigen code, it seems like it would also be helpful to merge consecutive vector loads where possible. (In NVPTX, this usually means merging two loads of 2xf32 into one 4xf32 load.) Eigen tries to be clever and vectorize itself, but it doesn't always use the right packet size.

Another question I have is: NVPTX has some vectorizable load instructions -- notably ld.global.nc, which means "load from the global address space using the texture cache" -- that are modeled as LLVM intrinsics, so they don't get merged by this pass. I wonder what's the portable way to teach this pass about these loads. Should they be modeled in LLVM as regular load instructions with some sort of annotation, instead of as an LLVM intrinsic?

volkan added a subscriber: volkan. Jun 6 2016, 12:14 AM

My intuition would be -- if loading using the texture cache doesn't change the result, but rather is just a performance thing, that would seem to be something you'd set with metadata on the instruction, right?

In D19501#449697, @escha wrote:

My intuition would be -- if loading using the texture cache doesn't change the result, but rather is just a performance thing, that would seem to be something you'd set with metadata on the instruction, right?

The instruction I have in mind does (potentially) change the result -- the texture cache is noncoherent, so it's only guaranteed to give the same result as a regular load if the data we're loading hasn't been modified since the kernel started.

ping

arsenm's current patch queue, for my reference: D20705, D19504, D19506, D19507, D19508, D20696, D20639, D20705.

Hi,

Some random comments after a quick first round of review inline.

Thanks,
Michael

mzolotukhin added inline comments. Jun 9 2016, 6:43 PM

include/llvm/Transforms/Vectorize.h
147	Minor: I'd rather hardcode `128` for now, just to make the pass signature similar to existing passes (no other pass accepts an argument in `create...Pass` I think). But it probably doesn't matter as you're going to fix it in follow-ups anyway.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
78–79	I think the second line should be indented here.
108	Bikeshed: I'd rename `Vectorizer` to `LoadStoreVectorizer` and `LoadStoreVectorizer` to `LoadStoreVectorizerPass` or something like this.
153–155	Should we bail out on this for all potential targets? For example, why is it a problem to vectorize integer stores on x86 using `movdqa` integer instructions?
287	Some comments would be helpful here (and for other functions).
417	Range loop?
487–488	Might be a good idea to parametrize `64`.
679	Any chance it can be somehow combined with `vectorizeStoreChain`? They look so similar.
730–731	Is there something special about `4`? Can we add a define/const for it?

jlebar mentioned this in D20605: [NVPTX] Allow load/store vectorization. Jun 10 2016, 11:28 AM

Not done going through this, but I tried your branch at https://github.com/arsenm/llvm/tree/ls-vectorizer, and I'm still ICE'ing when compiling thrust.

$ git clone https://github.com/thrust/thrust.git
$ cd thrust
$ CLANG_PATH=/abs/path/to/dir/containing/clang/binary scons arch=sm_35 Wall=no Werror=no mode=release cuda_compiler=clang std=c++11 cuda_path=/usr/local/cuda-7.0 -j48 unit_tests

Range types must match instruction type!
  %wide.load = load <16 x i8>, <16 x i8>* %60, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544 = load <16 x i8>, <16 x i8>* %62, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.1 = load <16 x i8>, <16 x i8>* %66, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.1 = load <16 x i8>, <16 x i8>* %68, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.2 = load <16 x i8>, <16 x i8>* %72, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.2 = load <16 x i8>, <16 x i8>* %74, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.3 = load <16 x i8>, <16 x i8>* %78, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.3 = load <16 x i8>, <16 x i8>* %80, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.epil = load <16 x i8>, <16 x i8>* %84, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.epil = load <16 x i8>, <16 x i8>* %86, align 1, !tbaa !61, !range !63, !alias.scope !64
fatal error: error in backend: Broken function found, compilation aborted!

Interestingly, this is happening during *host* compilation. It happens even if I don't add a patch to enable the load/store vectorizer for nvptx.

Smaller steps to reproduce:

$ curl https://gist.githubusercontent.com/anonymous/5c8863562db1dda94a821c51e3ad001a/raw/b8f3e3cc9b0995ab89d5e8c3840dcea30dc96d0b/- | opt -O2
(same ICE)

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
39	Nit, number of scalar instructions removed might be a more interesting metric, because it's notable whether we manage to vectorize in groups of 2 vs 4.
78–79	clang-format formats it for me as-is.

arsenm added inline comments. Jun 10 2016, 12:49 PM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
153–155	I don't really know what this attribute does, it was just there
730–731	The basic type alignment on GPUs? It's on my todo list to make this check allowsMemoryAccess instead

suyog added a subscriber: suyog. Jun 10 2016, 1:23 PM

asbirlea added a subscriber: asbirlea. Jun 10 2016, 3:36 PM

echristo added a subscriber: echristo. Jun 10 2016, 4:46 PM

Looks pretty good to me; mostly nits / comment suggestions, but I have one nontrivial concern about how we ensure that we always build the longest vectors we can.

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
71	Nit, if this is all the comment says, do we need it? It seems just to repeat what's already in the function name.
74	Nit, "Reorders" would be consistent with the rest of the comments in this file. Also, do you mean s/user/users/?
74	It's not obvious to me how this function reorders the users of I without looking at the source. Something like "Reorders instructions to ensure that I dominates all of its users." or something would have helped me.
88	Perhaps, "Checks if there are any instructions which may affect the value returned by a load/store in Chain between From and To." Since this function checks not just for aliasing memory instructions, but also side-effecting instructions.
96	Probably worth indicating that you expect all of the elements of Map to be loads, or all of them to be stores.
98	Suggest "Finds loads/stores to consecutive memory addresses and vectorizes them." Otherwise it can be read as saying that the instructions themselves must appear consecutively in the BB. Probably also worth indicating that all elements of Instrs must be loads, or all must be stores.
153–155	It came from r187340, back in 2013, by Nadav Rotem <nrotem@apple.com>. There's no additional information in the patch. AFAICT the intent is specifically not to vectorize loads/stores of fp types when noimplicitfloat is set. Which sort of makes sense, based on the tiny information I can glean about noimplicitfloat.
169	s/I/BBI/ would have made this a lot more obvious to me, since we use I for values/instructions most everywhere else.
209	It's not clear to me why we need this line, although all the cases I can think of where this is false seem quite bizarre, so maybe it doesn't matter.
231	Suggest s/Otherwise// -- it's not clear that this is contrasting with "Check if they are based on the same pointer."
243	Nit, suggest s/proving/checking/ -- "proving" is kind of a strong term for what we do, sort of implies it's harder than it is (so it's confusing when I see it's so simple).
260	Suggest something like // Only look through ZExt/SExt to make it clearer why we're doing an early return.
306	Nit, consider llvm::is_contained()
326	Why this special case? This isn't as strong as a general DCE, so it seems to me like it gives us a false sense of security? i.e. should we just say "run DCE after this pass"?
339	s/ElementSize/ElementSizeBits/? (I don't care as much what the local vars here are called, but if the arg is called ElementSize, I might reasonably pass a value in bytes.)
339	I'm not sure "bisect" is quite the right word for this? I expected this to split Chain in half, but it looks like it actually splits off the rightmost 1, 2, or 3 bytes' worth of elements, so that the left part has a size that's a multiple of 4 bytes (ish). I might call that "splitIntoRoundSize" or something. A postcondition assert() would probably also help clarify this.
356	Consider llvm::is_contained
378	suggest "(the vectorized load is inserted at the location of the first load in the chain)."
389	Not GetPointerOperand?
415	To me this would be a lot more clear if we returned the relevant maps (or, I suppose, explicitly took them as params). Otherwise it's not clear from the signature (or the comments) where we collect the instructions into.
437	Consider llvm::all_of(LI->users(), ...);
500	Nit, It might be helpful to print out at least the first instruction in the chain. At least, that was helpful to me when I was debugging it.
503	Can we move this closer to where it's used?
509	Nit, we know that Instrs is <= 64 elements, so could we just make ConsecutiveChain an array of signed ints? Then we could use -1 instead of ~0U (which implied to me that we were going to do some bit twiddling below). Also I think you could initialize the array with int ConsecutiveChain[64] = {-1}; which would be clearer than setting each element in the loop, imo.
518	I cannot for the life of me figure out what invariant you're trying to preserve with these checks. I guess a comment is in order. :) (I get that we prefer to vectorize chains of scalar instructions that appear in order (j > i), but I can't figure out what CurDistance and NewDistance are checking. It would make sense if you wanted to prefer chains of scalar instructions that are close to each other, but then NewDistance should be abs(j - i), right?)
519	Not sure, but this might be clearer if you simply used a "continue" here instead of a flag variable.
524	If ConsecutiveChain was not -1, don't we need to remove ConsecutiveChain[i] from Tails? (Maybe it would be better to build Heads and Tails, or at least Tails, in a separate loop, over ConsecutiveChain, so we don't have to worry about the case where one instr is a tail for two heads and then gets overwritten just one time.)
530	The intent here is to build the largest vectors we can, right? (Or at least, we want to build up vectors of the max vector width, if possible? Beyond that doesn't make a difference.) I don't see how we ensure this happens, unless there is some implicit requirement on the order of Instrs that I'm missing. For example, if the elements of Instrs are "load p[3], load p[2], load p[1], load p[0]", then it seems that the first element of Heads will be "load p[2]", implying a two-wide vector load. I don't see what here would cause us to wait on this one and do the four-wide load starting at p[0]. Sorry if I'm missing something obvious.
538	Actually, I'm not sure why we need the (Tails.count(I) || Heads.count(I)) part of this predicate, and therefore I'm not sure why we need Tails at all. That probably deserves a comment at least.
578	Hm, looks like an additional constraint on bisect() is that the array passed should have more than 4 elements -- probably worth saying that somewhere.
630	If you like, you could write this as BasicBlock::iterator First, Last; std::tie(First, Last) = getBoundaryInstrs(Chain); Same for the other call to getBoundaryInstrs.
754	A correctness-critical assumption above is that the loads are inserted at the location of the first load but here it looks like the opposite?
764	Nit, I'd prefer to move this into the if and else blocks, so it's clear that we don't use it elsewhere.
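
The splitting behaviour described in the comment on line 339 can be sketched as below. This is only a reading of the reviewer's description (a left piece whose size is a multiple of 4 bytes, and a right piece holding the remainder), not the patch's actual `bisect()`; the name `splitIntoRoundSize` is the one suggested in the review, and the signature is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Returns {LeftCount, RightCount}. The left piece is the longest prefix whose
// size in bytes is a multiple of four; the right piece is whatever remains.
std::pair<std::size_t, std::size_t> splitIntoRoundSize(std::size_t NumElts,
                                                       unsigned ElementSizeBits) {
  assert(ElementSizeBits > 0 && ElementSizeBits % 8 == 0 &&
         "sub-byte element sizes are not modelled here");
  const std::size_t ElementSizeBytes = ElementSizeBits / 8;
  std::size_t Left = NumElts;
  while (Left > 0 && (Left * ElementSizeBytes) % 4 != 0)
    --Left;
  // Postcondition of the kind the review asks for.
  assert((Left * ElementSizeBytes) % 4 == 0 && "left piece must be round");
  return {Left, NumElts - Left};
}
```

Taking the element size in bits, as the parameter name says, avoids the bytes-versus-bits ambiguity raised in the same review comment. A chain of seven `i8` elements splits into four and three, while three `i16` elements split into two and one.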

Mostly address review comments

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
39	I added this as well, I think both might be useful
108	Then you end up with the macro functions adding another "Pass" to the name, so you end up with initializeLoadStoreVectorizerPassPass etc., so it's better to not include Pass in the class name.
169	Better, use a range loop
209	I think this is to fix cases like trying to merge i1 into i8. This was already crashing earlier though, so I'm just having sub-byte types never considered.
287	The comments are already on the declarations. Usually for private functions like this I would put them on the body. Should I move them?
326	It's run after each vectorization. I'm guessing the additional uses might somehow interfere with vectorization of other chains? This can be investigated later.
500	I'm not sure why this one is even needed. The full chain is already printed out right above this by the "LSV: Loads to vectorize" DEBUG
509	Can't do that; it needs to be reset after the j loop.
538	I'm more likely to break something trying to refactor this compared to all of the style fixes, so I'll leave looking at that for later.
754	I had another test which I apparently forgot to include which shows the alias insertion order behavior. I've also added another which shows the insertion point. With the insertion point as Last, the load, store, load, store case is correct and vectorizes as store, load v2, store. If I change the load insertion point to be before the first, it regresses and produces the incorrect load v2, store v2. Maybe the comment is backwards?

arsenm added a child revision: D21458: LoadStoreVectorizer: Fix crashes on sub-byte types. Jun 16 2016, 9:13 PM

jlebar added inline comments. Jun 17 2016, 9:31 AM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
88	Nit, "Legalizes" and "Breaks" would be consistent with the rest of the file. Also suggest s/vector half/piece/ -- "vector half" is kind of hard to parse.
96	This was marked as done, but it doesn't appear to have been done. Not a big deal if that's intentional, but pointing it out in case it wasn't. Probably worth indicating that you expect all of the elements of Map to be loads, or all of them to be stores.
190	Nit, would getOperandAddressSpace() or getPointerAddressSpace() be a better name? The address space itself isn't an operand of the instruction.
500	Well, they are sort of different...that one may never appear if we find no chains or whatever. But I take your point that this one may not be so useful. FWIW some additional well-placed log messages might be helpful; I was debugging with a coworker a few days ago why some instructions weren't being vectorized, and it required gdb. But we can also easily go back and add these later.
509	Ah, I see. I clearly have no idea what this loop does. :)
754	"Maybe the comment is backwards?"

I can't manage to convince myself that those checks are correct if we do anything but what the comments say. In the test, I see that you have

  %ld.c = load double, double addrspace(1)* %c, align 8 ; may alias store to %a
  store double 0.0, double addrspace(1)* %a, align 8
  %ld.c.idx.1 = load double, double addrspace(1)* %c.idx.1, align 8 ; may alias store to %a
  store double 0.0, double addrspace(1)* %a.idx.1, align 8

If the comments are correct and both loads may alias the first store, isn't transforming this into "store; load v2; store" (what the test is checking for) unsafe? I grant that vectorizing both the loads and the stores is also unsafe.
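
The safety question in the last comment can be phrased as a small check: a load may only be moved to the insertion point if no store that may alias it lies in between. The sketch below models this in plain C++; `MemOp`, `mayAlias` and `canMoveLoad` are hypothetical names, and equal `Loc` values stand in for "alias analysis cannot rule out aliasing".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one memory instruction in a basic block.
struct MemOp {
  bool IsStore;
  int Loc; // abstract location id; equal ids mean "may alias"
};

bool mayAlias(const MemOp &A, const MemOp &B) { return A.Loc == B.Loc; }

// Returns true if the load at index From can be moved next to the instruction
// at index To without crossing a store that may alias it.
bool canMoveLoad(const std::vector<MemOp> &BB, std::size_t From, std::size_t To) {
  assert(From < BB.size() && To < BB.size() && !BB[From].IsStore);
  const std::size_t Lo = From < To ? From : To;
  const std::size_t Hi = From < To ? To : From;
  for (std::size_t I = Lo + 1; I < Hi; ++I)
    if (BB[I].IsStore && mayAlias(BB[I], BB[From]))
      return false;
  return true;
}
```

Modelling the test as load, store, load, store with both loads possibly aliasing the first store, the check fails whether the merged load is placed at the first load or at the second, which matches the concern that "store; load v2; store" is unsafe under those comments.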

asbirlea added inline comments. Jun 17 2016, 9:40 AM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
509	I'm confused why that is the case. Each i iteration sets ConsecutiveChain[i]=-1 and all read/write accesses in the j loop are on ConsecutiveChain[i]. Initializing all ConsecutiveChain to -1 before the i loop should have the equivalent behavior. Could you clarify what I'm missing?
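
One C++ detail matters for the initialization discussed for `ConsecutiveChain`: a single brace initializer such as `int ConsecutiveChain[64] = {-1};` sets only element 0 to -1 and zero-initializes the rest, so it does not produce an "all -1" array. A minimal sketch of the pitfall and of an explicit fill follows; `MaxChain`, `braceInitialized` and `resetChain` are hypothetical names, and 64 is the chunk size mentioned in the review.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t MaxChain = 64; // chunk size mentioned in the review

// Pitfall: one initializer only sets element 0; the rest become 0.
std::array<int, MaxChain> braceInitialized() {
  std::array<int, MaxChain> Chain = {-1};
  return Chain;
}

// Marks the first NumInstrs slots as "no successor found yet" (-1).
void resetChain(std::array<int, MaxChain> &Chain, std::size_t NumInstrs) {
  assert(NumInstrs <= MaxChain && "chunk larger than the array");
  std::fill(Chain.begin(), Chain.begin() + NumInstrs, -1);
}
```

Calling `resetChain` once before the outer loop gives every slot the value -1, which is the starting state the comment above argues is equivalent to setting each element inside the loop.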

As the author of some of that code, it is quite possible the code is just not well-written and can be done better ;-)

jlebar added inline comments. Jun 17 2016, 1:23 PM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
304	Talking to asbirlea, we are not convinced this is safe. In particular, we cannot safely reorder across instructions that may read or may write arbitrary memory, nor can we safely reorder loads and stores that are not in the chain but which nonetheless may alias each other.

I think we're safe wrt instructions which may read or write, because these will never appear between a chain's first and last instructions. (I would feel much better if we had an assert to this effect, though.) But I am not convinced we're safe wrt regular loads and stores whose addresses depend on the result of a vectorized load. For example, imagine we have

  b = load a
  load [b]   // b does not alias a
  store [b]
  c = load a+1

Suppose that we choose to vectorize the first and last load and WLOG assume we insert the vectorized load at the position of c:

  load [b]
  store [b]
  {b, c} = load {a, a+1}

Now we call reorder() to fix this up. The first two instructions are going to be moved below the vectorized load in the reverse order of I->uses(). AFAICT there is no guarantee on the order of uses(), but even if there were, outputting in the opposite order seems probably not right?

Even if that were somehow right, afaict we can make reorder() do arbitrary, unsafe reorderings of loads/stores with the following idiom.

  b = load a
  b1 = b + 0   // copy b to b1
  b2 = b + 0   // copy b to b2
  load [b1]    // b1 does not alias a
  store [b2]   // b2 does not alias a
  c = load a+1

Now if the order of b1 and b2 in the source determines their order in users(), then that controls the order of "load [b1]" and "store [b2]" in the result, even though there's only one correct ordering of that load and store.

So to summarize, it may be my ignorance, but I don't see why we can rely on there being any particular order to users(). But even if users() has some guaranteed order, I don't see how any ordering could guarantee that we never reorder an aliasing load/store pair.

asbirlea added inline comments. Jun 17 2016, 2:29 PM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

304

Tried testing this out using jlebar's first example.

Input:

%struct.buffer_t = type { i64, i8*, [4 x i32], [4 x i32], [4 x i32], i32, i8, i8, [2 x i8] }       
  
define i32 @test(%struct.buffer_t* noalias %buff) #0 {                                             
entry:
  %tmp1 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 1           
  %buff.host = load i8*, i8** %tmp1, align 8                                                       
  %buff.val = load i8, i8* %buff.host, align 8                                                     
  store i8 0, i8* %buff.host, align 8                                                              
  %tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 0           
  %buff.dev = load i64, i64* %tmp0, align 8                                                        
  ret i32 0
} 
  
attributes #0 = { nounwind }

Output:

%struct.buffer_t = type { i64, i8*, [4 x i32], [4 x i32], [4 x i32], i32, i8, i8, [2 x i8] }       
  
; Function Attrs: nounwind
define i32 @test(%struct.buffer_t* noalias %buff) #0 {                                             
entry:
  %tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 0           
  %0 = bitcast i64* %tmp0 to <2 x i64>*                                                            
  %1 = load <2 x i64>, <2 x i64>* %0, align 8                                                      
  %2 = extractelement <2 x i64> %1, i32 0
  %3 = extractelement <2 x i64> %1, i32 1                                                          
  %4 = inttoptr i64 %3 to i8*                                                                      
  store i8 0, i8* %4, align 8
  %buff.val = load i8, i8* %4, align 8                                                             
  ret i32 0
} 
  
attributes #0 = { nounwind }

The first and last loads got vectorized and moved to the beginning, but the uses that now follow are in reversed order.

The input had load [b], store [b]. The output has store [b], load [b].
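
One way such a reversal can arise is sketched below in plain C++. The assumption, which this thread does not confirm, is that users happen to be visited in block order and that each one is re-inserted directly after the vectorized load. `Block`, `moveAfter` and `sinkNaive` are hypothetical names, and instructions are modelled as strings.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

using Block = std::vector<std::string>;

// Moves Inst so that it sits immediately after Anchor.
void moveAfter(Block &BB, std::string Inst, const std::string &Anchor) {
  auto From = std::find(BB.begin(), BB.end(), Inst);
  assert(From != BB.end() && "instruction not in block");
  BB.erase(From);
  auto To = std::find(BB.begin(), BB.end(), Anchor);
  assert(To != BB.end() && "anchor not in block");
  BB.insert(To + 1, Inst);
}

// Naive sinking: each user is placed directly after Def, so a user visited
// later lands in front of the ones visited before it.
Block sinkNaive(Block BB, const std::string &Def,
                const std::vector<std::string> &UsersInVisitOrder) {
  for (const std::string &U : UsersInVisitOrder)
    moveAfter(BB, U, Def);
  return BB;
}
```

Starting from "load [b], store [b], vecload" and visiting the load before the store, the result is "vecload, store [b], load [b]", the same swap seen in the output above.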

asbirlea added inline comments. Jun 17 2016, 3:20 PM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
289	Nit: Could the Instruction instance "User" be renamed to something different than a class name? It makes it somewhat harder to follow.

If we were to always insert vectorized loads at the location of the first load and place stores at the location of the last store, I actually don't see why we'd need reorder() at all.

asbirlea added inline comments. Jun 20 2016, 1:42 PM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
304	FWIW, the example above can be fixed by keeping a temporary of the last instruction inserted. For example:

  Instruction *InsertAfter = I;
  for (User *U : I->users())
    if (Instruction *User = dyn_cast<Instruction>(U))
      if (!DT.dominates(I, User) && User->getOpcode() != Instruction::PHI) {
        User->removeFromParent();
        User->insertAfter(InsertAfter);
        reorder(User);
        InsertAfter = User;
      }

However, this does not account for the recursive inserts AND it assumes the users() are in BB order.

A correct and very iffy way:
1. Create a set (include the recursive call here) of all (transitive) users.
2. Make a pass over all instructions in BB; if an instruction is in the collected set, insert it after the latest inserted instruction (as in the above sample); or just go over the BB in reverse order.

I would favor avoiding the reorder entirely over having a pass over the BB at every reorder.
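
The two-step approach can be sketched in plain C++ as follows. Instructions are modelled as integer ids and `UserMap`, `collectTransitiveUsers` and `sinkUsersBelowDef` are hypothetical names; PHI handling and dominance checks are left out. The sketch only keeps the moved users in their original relative order. It does not address the separate concern about moving them across unrelated instructions that may alias.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Instructions are modelled as integer ids; Users maps a definition to the
// instructions that use it directly.
using UserMap = std::map<int, std::vector<int>>;

// Step 1: gather every instruction that transitively uses Def.
std::set<int> collectTransitiveUsers(const UserMap &Users, int Def) {
  std::set<int> Seen;
  std::vector<int> Worklist = {Def};
  while (!Worklist.empty()) {
    int Cur = Worklist.back();
    Worklist.pop_back();
    auto It = Users.find(Cur);
    if (It == Users.end())
      continue;
    for (int U : It->second)
      if (Seen.insert(U).second)
        Worklist.push_back(U);
  }
  return Seen;
}

// Step 2: one pass over the block. Users that currently precede Def are
// re-emitted right after it, keeping their original relative order.
std::vector<int> sinkUsersBelowDef(const std::vector<int> &BB,
                                   const UserMap &Users, int Def) {
  const std::set<int> ToMove = collectTransitiveUsers(Users, Def);
  std::vector<int> Before, Moved, After;
  bool SeenDef = false;
  for (int I : BB) {
    if (I == Def) {
      SeenDef = true;
      continue;
    }
    if (SeenDef)
      After.push_back(I);
    else if (ToMove.count(I))
      Moved.push_back(I);
    else
      Before.push_back(I);
  }
  assert(SeenDef && "Def must be in the block");
  std::vector<int> Result = Before;
  Result.push_back(Def);
  Result.insert(Result.end(), Moved.begin(), Moved.end());
  Result.insert(Result.end(), After.begin(), After.end());
  return Result;
}
```

Because the block is walked once in program order, the result does not depend on the iteration order of the user lists, which was the weak point of the per-user insertion.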

Here's a test-case that currently produces incorrect results on my end.

target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnueabihf"

define i32 @test(i32* noalias %ptr) {
entry:
  br label %"for something"

"for something":
  %index = phi i64 [ 0, %entry ], [ %index.next, %"for something" ]
  %next.gep = getelementptr i32, i32* %ptr, i64 %index
  %a1 = add nsw i64 %index, 1
  %next.gep1 = getelementptr i32, i32* %ptr, i64 %a1
  %a2 = add nsw i64 %index, 2
  %next.gep2 = getelementptr i32, i32* %ptr, i64 %a2

  %l1 = load i32, i32* %next.gep1, align 4                                                         
  %l2 = load i32, i32* %next.gep, align 4                                                          
  store i32 0, i32* %next.gep1, align 4                                                            
  store i32 0, i32* %next.gep, align 4
  %l3 = load i32, i32* %next.gep1, align 4                                                         
  %l4 = load i32, i32* %next.gep2, align 4   
  %index.next = add i64 %index, 8
  %cmp_res = icmp eq i64 %index.next, 8
  br i1 %cmp_res, label %ending, label %"for something"

ending:
  ret i32 0

It does not take into account the stores as aliasing with the loads, and leaves the stores as the first instructions.
Output I see:

%0 = bitcast i32* %next.gep to <2 x i32>*                                                        
store <2 x i32> zeroinitializer, <2 x i32>* %0, align 4                                          
%1 = bitcast i32* %next.gep to <2 x i32>*                                                        
%2 = load <2 x i32>, <2 x i32>* %1, align 4                                                      
%3 = extractelement <2 x i32> %2, i32 0
%4 = extractelement <2 x i32> %2, i32 1                                                          
%5 = bitcast i32* %next.gep1 to <2 x i32>*                                                       
%6 = load <2 x i32>, <2 x i32>* %5, align 4                                                      
%7 = extractelement <2 x i32> %6, i32 0
%8 = extractelement <2 x i32> %6, i32 1

Is there a follow-up patch fixing this that I missed?
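
The missing check described in this report (stores sitting between the loads of a chain and aliasing them) can be modelled roughly as below. This is a sketch under simplifying assumptions, not what the pass does: `Access`, `overlaps`, and `canHoistLoadsToFirst` are hypothetical names, accesses are plain byte ranges off a single base pointer, and a range-overlap test stands in for a real AliasAnalysis query.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One memory access in program order: a byte range relative to a common base.
struct Access {
  bool IsStore;
  long Offset;
  long Size;
};

// Conservative stand-in for an alias query: do the byte ranges intersect?
bool overlaps(const Access &A, const Access &B) {
  return A.Offset < B.Offset + B.Size && B.Offset < A.Offset + A.Size;
}

// Chain holds ascending indices of loads in BB. Merging them at the position
// of the first load hoists every later load above whatever lies in between,
// so any intervening store that overlaps such a load makes the merge unsafe.
bool canHoistLoadsToFirst(const std::vector<Access> &BB,
                          const std::vector<std::size_t> &Chain) {
  assert(!Chain.empty() && "chain must contain at least one load");
  const std::size_t First = Chain.front();
  for (std::size_t Idx : Chain)
    for (std::size_t I = First + 1; I < Idx; ++I)
      if (BB[I].IsStore && overlaps(BB[I], BB[Idx]))
        return false;
  return true;
}
```

For the loop body in the test case (loads at offsets 4 and 0, stores at 4 and 0, then loads at 4 and 8), pairing `%l2` with `%l3` would be rejected because the store to `%next.gep1` lies between them, while `%l1` with `%l2` and `%l3` with `%l4` have no store in between.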

Example in the previous comment should produce correct code with these changes.

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
322	This should be: LastInstr = ++I; (using updated patch) It's always valid: either uses the next valid instruction or BB.end(). Fixes issue that isVectorizable misses testing the last instruction in the chain.
376	Consider asserting Chain.size() == ChainInstrs.size()?
390	As loads are inserted at the location of the last load, either invert sign (VVIdx > VIdx) and update comment, or fix load insert location. The former misses vectorization opportunities, but it avoids (at least some of the) incorrect code generation.
394	Same as above.

asbirlea added inline comments.Jun 23 2016, 5:15 PM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
538	As I understand it, this is the condition for finding the chain of maximum length. The check that Head cannot be found in Tails, means it's the beginning of a chain, hence one of the longest chains. However, if the longest chain fails to vectorize, this same check prevents any vectorization of the remaining chain. Here's a suggestion to address vectorizing the chain suffix:

  for (int i = 0; i < Heads.size(); i++) {
    unsigned Head = Heads[i];
    if (VectorizedValues.count(Instrs[Head]))
      continue;
    // Skip if a longer chain exists in the remaining Heads/Tails
    for (int j = i+1; j < Tails.size(); j++)
      if (Head == Tails[j])
        continue;

Feel free to add additional improvements for vectorization of a chain prefix.
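
The idea of falling back to a chain suffix can be illustrated in isolation. The sketch below is not the Heads/Tails bookkeeping from the patch; `Chain`, `longestVectorizableSuffix`, and the `CanVectorize` callback are hypothetical, and the chain is just a list of instruction indices.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Chain = std::vector<int>;

// Returns the longest suffix of C with at least two elements that
// CanVectorize accepts, or an empty chain if none does. The full chain is
// tried first, so nothing changes when the longest chain already vectorizes.
Chain longestVectorizableSuffix(
    const Chain &C, const std::function<bool(const Chain &)> &CanVectorize) {
  for (std::size_t Start = 0; Start + 2 <= C.size(); ++Start) {
    Chain Suffix(C.begin() + static_cast<std::ptrdiff_t>(Start), C.end());
    if (CanVectorize(Suffix))
      return Suffix;
  }
  return {};
}
```

With a chain `{0, 1, 2, 3}` and a predicate that rejects any chain containing instruction 0, the helper returns `{1, 2, 3}` instead of giving up on the whole chain.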

After discussion on IRC, I'm OK taking this as-is, with an explicit understanding that there are existing changes that need to be made here, so future patches will not have the same bias towards existing code that we might normally have. Collaborating with the code living out-of-tree is just too painful.

Alina has fixes to the correctness issues in her patch queue, and one of us will go back and ensure that all of my comments above (particularly about the ConsecutiveChain algorithm) are addressed.

This revision is now accepted and ready to land.Jun 28 2016, 4:03 PM

arsenm marked 2 inline comments as done.Jun 30 2016, 11:28 AM

arsenm added inline comments.

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
96	I put it on vectorizeChains. I can add it here too
322	Leaving for future patch
376	This needs to go with the ++I patch
390	Are you going to fix this / do you have test cases for this and below?

asbirlea added inline comments.Jun 30 2016, 11:32 AM

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
390	Yes. I have an updated version of the current patch that fixes this and adds a couple of unit tests.

r274293

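Revision Contents

Path                                                                      Size
include/llvm/InitializePasses.h                                           1 line
include/llvm/LinkAllPasses.h                                              1 line
include/llvm/Transforms/Vectorize.h                                       7 lines
lib/Transforms/Vectorize/CMakeLists.txt                                   3 lines
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp                          823 lines
lib/Transforms/Vectorize/Vectorize.cpp                                    1 line
test/Transforms/LoadStoreVectorizer/AMDGPU/extended-index.ll              150 lines
test/Transforms/LoadStoreVectorizer/AMDGPU/insertion-point.ll             62 lines
test/Transforms/LoadStoreVectorizer/AMDGPU/interleaved-mayalias-store.ll  28 lines
test/Transforms/LoadStoreVectorizer/AMDGPU/lit.local.cfg                  3 lines
test/Transforms/LoadStoreVectorizer/AMDGPU/merge-stores.ll                635 lines
test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll               91 lines
test/Transforms/LoadStoreVectorizer/AMDGPU/no-implicit-float.ll           20 lines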

Diff 61062

include/llvm/InitializePasses.h

  void initializeLiveDebugValuesPass(PassRegistry&);
  void initializeLiveDebugVariablesPass(PassRegistry&);
  void initializeLiveIntervalsPass(PassRegistry&);
  void initializeLiveRegMatrixPass(PassRegistry&);
  void initializeLiveStacksPass(PassRegistry&);
  void initializeLiveVariablesPass(PassRegistry&);
  void initializeLoadCombinePass(PassRegistry&);
  void initializeLoaderPassPass(PassRegistry&);
+ void initializeLoadStoreVectorizerPass(PassRegistry&);
  void initializeLocalStackSlotPassPass(PassRegistry&);
  void initializeLoopAccessAnalysisPass(PassRegistry&);
  void initializeLoopDataPrefetchPass(PassRegistry&);
  void initializeLoopDeletionPass(PassRegistry&);
  void initializeLoopDistributePass(PassRegistry&);
  void initializeLoopExtractorPass(PassRegistry&);
  void initializeLoopIdiomRecognizePass(PassRegistry&);
  void initializeLoopInfoWrapperPassPass(PassRegistry&);

include/llvm/LinkAllPasses.h

  ForcePassLinking() {
    (void) llvm::createLintPass();
    (void) llvm::createSinkingPass();
    (void) llvm::createLowerAtomicPass();
    (void) llvm::createCorrelatedValuePropagationPass();
    (void) llvm::createMemDepPrinter();
    (void) llvm::createInstructionSimplifierPass();
    (void) llvm::createLoopVectorizePass();
    (void) llvm::createSLPVectorizerPass();
+   (void) llvm::createLoadStoreVectorizerPass(128);
    (void) llvm::createBBVectorizePass();
    (void) llvm::createPartiallyInlineLibCallsPass();
    (void) llvm::createScalarizerPass();
    (void) llvm::createSeparateConstOffsetFromGEPPass();
    (void) llvm::createSpeculativeExecutionPass();
    (void) llvm::createSpeculativeExecutionIfHasBranchDivergencePass();
    (void) llvm::createRewriteSymbolsPass();
    (void) llvm::createStraightLineStrengthReducePass();

include/llvm/Transforms/Vectorize.h

  /// ScalarEvolution. After the vectorization, AliasAnalysis,
  /// ScalarEvolution and CFG are preserved.
  ///
  /// @return True if the BB is changed, false otherwise.
  ///
  bool vectorizeBasicBlock(Pass *P, BasicBlock &BB,
                           const VectorizeConfig &C = VectorizeConfig());

+ //===----------------------------------------------------------------------===//
+ //
+ // LoadStoreVectorizer - Create vector loads and stores, but leave scalar
+ // operations.
+ //
+ Pass *createLoadStoreVectorizerPass(unsigned VecRegSize = 128);

jlebar: Should the VecRegSize be pulled from TTI (at least as a default)?
arsenm: This is done in the follow up patches D19502, D19504
mzolotukhin: Minor: I'd rather hardcode `128` for now, just to make the pass signature similar to existing passes (no other pass accepts an argument in `create...Pass` I think). But it probably doesn't matter as you're going to fix it in follow-ups anyway.

+
  } // End llvm namespace

  #endif

lib/Transforms/Vectorize/CMakeLists.txt

  add_llvm_library(LLVMVectorize
    BBVectorize.cpp
-   Vectorize.cpp
+   LoadStoreVectorizer.cpp
    LoopVectorize.cpp
    SLPVectorizer.cpp
+   Vectorize.cpp

    ADDITIONAL_HEADER_DIRS
    ${LLVM_MAIN_INCLUDE_DIR}/llvm/Transforms
    )

  add_dependencies(LLVMVectorize intrinsics_gen)

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp

This file was added.

				//===----- LoadStoreVectorizer.cpp - GPU Load & Store Vectorizer ----------===//
				//
				// The LLVM Compiler Infrastructure
				//
				// This file is distributed under the University of Illinois Open Source
				// License. See LICENSE.TXT for details.
				//
				//===----------------------------------------------------------------------===//
				//
				//===----------------------------------------------------------------------===//

				#include "llvm/Transforms/Vectorize.h"
				#include "llvm/ADT/MapVector.h"
				#include "llvm/ADT/PostOrderIterator.h"
				#include "llvm/ADT/SetVector.h"
				#include "llvm/ADT/Statistic.h"
				#include "llvm/ADT/Triple.h"
				#include "llvm/Analysis/AliasAnalysis.h"
				#include "llvm/Analysis/ScalarEvolution.h"
				#include "llvm/Analysis/ScalarEvolutionExpressions.h"
				#include "llvm/Analysis/TargetTransformInfo.h"
				#include "llvm/Analysis/ValueTracking.h"
				#include "llvm/Analysis/VectorUtils.h"
				#include "llvm/IR/DataLayout.h"
				#include "llvm/IR/Dominators.h"
				#include "llvm/IR/IRBuilder.h"
				#include "llvm/IR/Instructions.h"
				#include "llvm/IR/Module.h"
				#include "llvm/IR/Type.h"
				#include "llvm/IR/Value.h"
				#include "llvm/Support/CommandLine.h"
				#include "llvm/Support/Debug.h"
				#include "llvm/Support/raw_ostream.h"

				using namespace llvm;

				#define DEBUG_TYPE "load-store-vectorizer"
				STATISTIC(NumVectorInstructions, "Number of vector accesses generated");
				STATISTIC(NumScalarsVectorized, "Number of scalar accesses vectorized");
				jlebar: Nit, number of scalar instructions removed might be a more interesting metric, because it's notable whether we manage to vectorize in groups of 2 vs 4.
				arsenm: I added this as well, I think both might be useful

				namespace {

				// TODO: Remove this
				static const unsigned TargetBaseAlign = 4;

				class Vectorizer {
				typedef SmallVector<Value *, 8> ValueList;
				typedef MapVector<Value *, ValueList> ValueListMap;

				Function &F;
				AliasAnalysis &AA;
				DominatorTree &DT;
				ScalarEvolution &SE;
				const DataLayout &DL;
				IRBuilder<> Builder;
				ValueListMap StoreRefs;
				ValueListMap LoadRefs;
				unsigned VecRegSize;

				public:
				Vectorizer(Function &F, AliasAnalysis &AA, DominatorTree &DT,
				ScalarEvolution &SE, unsigned VecRegSize)
				: F(F), AA(AA), DT(DT), SE(SE), DL(F.getParent()->getDataLayout()),
				Builder(SE.getContext()), VecRegSize(VecRegSize) {}

				bool run();

				private:
				Value *getPointerOperand(Value *I);

				unsigned getAddressSpaceOperand(Value *I);
				jlebar: Nit, if this is all the comment says, do we need it? It seems just to repeat what's already in the function name.

				bool isConsecutiveAccess(Value *A, Value *B);

				jlebar: Nit, "Reorders" would be consistent with the rest of the comments in this file. Also, do you mean s/user/users/?
				jlebar: It's not obvious to me how this function reorders the users of I without looking at the source. Something like "Reorders instructions to ensure that I dominates all of its users." or something would have helped me.
				/// Reorders the users of I after vectorization to ensure that I dominates its
				/// users.
				void reorder(Instruction *I);

				/// Returns the first and the last instructions in Chain.
				mzolotukhin: I think the second line should be indented here.
				jlebar: clang-format formats it for me as-is.
				std::pair<BasicBlock::iterator, BasicBlock::iterator>
				getBoundaryInstrs(ArrayRef<Value *> Chain);

				/// Erases the original instructions after vectorizing.
				void eraseInstructions(ArrayRef<Value *> Chain);

				/// "Legalize" the vector type that would be produced by combining \p
				/// ElementSizeBits elements in \p Chain. Break into two pieces such that the
				/// total size of each vector half is 1, 2 or a multiple of 4 bytes.
				jlebar: Perhaps, "Checks if there are any instructions which may affect the value returned by a load/store in Chain between From and To." Since this function checks not just for aliasing memory instructions, but also side-effecting instructions.
				jlebar: Nit, "Legalizes" and "Breaks" would be consistent with the rest of the file. Also suggest s/vector half/piece/ -- "vector half" is kind of hard to parse.
				/// \p Chain is expected to have more than 4 elements.
				jlebar: Typos
				std::pair<ArrayRef<Value *>, ArrayRef<Value *>>
				splitOddVectorElts(ArrayRef<Value *> Chain, unsigned ElementSizeBits);

				/// Checks if there are any instructions which may affect the memory accessed
				/// in the chain between \p From and \p To.
				bool isVectorizable(ArrayRef<Value *> &Chain, BasicBlock::iterator From,
				BasicBlock::iterator To);
				jlebar: Probably worth indicating that you expect all of the elements of Map to be loads, or all of them to be stores.
				jlebar: This was marked as done, but it doesn't appear to have been done. Not a big deal if that's intentional, but pointing it out in case it wasn't. Probably worth indicating that you expect all of the elements of Map to be loads, or all of them to be stores.
				arsenm: I put it on vectorizeChains. I can add it here too

				/// Collects load and store instructions to vectorize.
				jlebar: Suggest "Finds loads/stores to consecutive memory addresses and vectorizes them." Otherwise it can be read as saying that the instructions themselves must appear consecutively in the BB. Probably also worth indicating that all elements of Instrs must be loads, or all must be stores.
				void collectInstructions(BasicBlock *BB);

				/// Processes the collected instructions, the \p Map. The elements of \p Map
				/// should be all loads or all stores.
				bool vectorizeChains(ValueListMap &Map);

				/// Finds the load/stores to consecutive memory addresses and vectorizes them.
				bool vectorizeInstructions(ArrayRef<Value *> Instrs);

				/// Vectorizes the load instructions in Chain.
				mzolotukhin: Bikeshed: I'd rename `Vectorizer` to `LoadStoreVectorizer` and `LoadStoreVectorizer` to `LoadStoreVectorizerPass` or something like this.
				arsenm: Then you end up with the macro functions adding another "Pass" to the name, so you end up with initializeLoadStoreVectorizerPassPass etc., so it's better to not include Pass in the class name.
				bool vectorizeLoadChain(ArrayRef<Value *> Chain);

				/// Vectorizes the store instructions in Chain.
				bool vectorizeStoreChain(ArrayRef<Value *> Chain);
				};

				class LoadStoreVectorizer : public FunctionPass {
				public:
				static char ID;
				unsigned VecRegSize;

				LoadStoreVectorizer(unsigned VecRegSize = 128) : FunctionPass(ID),
				VecRegSize(VecRegSize) {
				initializeLoadStoreVectorizerPass(*PassRegistry::getPassRegistry());
				}

				bool runOnFunction(Function &F) override;

				const char *getPassName() const override {
				return "GPU Load and Store Vectorizer";
				}

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.addRequired<AAResultsWrapperPass>();
				AU.addRequired<ScalarEvolutionWrapperPass>();
				AU.addRequired<DominatorTreeWrapperPass>();
				AU.setPreservesCFG();
				}
				};
				}

				INITIALIZE_PASS_BEGIN(LoadStoreVectorizer, DEBUG_TYPE,
				"Vectorize load and Store instructions", false, false);
				INITIALIZE_PASS_DEPENDENCY(SCEVAAWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(DominatorTreeWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(AAResultsWrapperPass)
				INITIALIZE_PASS_DEPENDENCY(GlobalsAAWrapperPass)
				INITIALIZE_PASS_END(LoadStoreVectorizer, DEBUG_TYPE,
				"Vectorize load and store instructions", false, false);

				char LoadStoreVectorizer::ID = 0;

				Pass *llvm::createLoadStoreVectorizerPass(unsigned VecRegSize) {
				return new LoadStoreVectorizer(VecRegSize);
				}

				bool LoadStoreVectorizer::runOnFunction(Function &F) {
				mzolotukhin: Should we bail out on this for all potential targets? E.g. why is it a problem for, e.g., vectorizing integer stores on x86 using `movdqa` integer instructions?
				arsenm: I don't really know what this attribute does, it was just there
				jlebar: It came from r187340, back in 2013, by Nadav Rotem <nrotem@apple.com>. There's no additional information in the patch. AFAICT the intent is specifically not to vectorize loads/stores of fp types when noimplicitfloat is set. Which sort of makes sense, based on the tiny information I can glean about noimplicitfloat.
				AliasAnalysis &AA = getAnalysis<AAResultsWrapperPass>().getAAResults();
				DominatorTree &DT = getAnalysis<DominatorTreeWrapperPass>().getDomTree();
				ScalarEvolution &SE = getAnalysis<ScalarEvolutionWrapperPass>().getSE();

				// Don't vectorize when the attribute NoImplicitFloat is used.
				if (F.hasFnAttribute(Attribute::NoImplicitFloat))
				return false;

				Vectorizer V(F, AA, DT, SE, VecRegSize);
				return V.run();
				}

				// Vectorizer Implementation
				bool Vectorizer::run() {
				jlebar: s/I/BBI/ would have made this a lot more obvious to me, since we use I for values/instructions most everywhere else.
				arsenm: Better, use a range loop
				bool Changed = false;

				// Scan the blocks in the function in post order.
				for (BasicBlock *BB : post_order(&F)) {
				collectInstructions(BB);
				Changed |= vectorizeChains(LoadRefs);
				Changed |= vectorizeChains(StoreRefs);
				}

				return Changed;
				}

				Value *Vectorizer::getPointerOperand(Value *I) {
				if (LoadInst *LI = dyn_cast<LoadInst>(I))
				return LI->getPointerOperand();
				if (StoreInst *SI = dyn_cast<StoreInst>(I))
				return SI->getPointerOperand();
				return nullptr;
				}

				unsigned Vectorizer::getAddressSpaceOperand(Value *I) {
				jlebar: Nit, would getOperandAddressSpace() or getPointerAddressSpace() be a better name? The address space itself isn't an operand of the instruction.
				if (LoadInst *L = dyn_cast<LoadInst>(I))
				return L->getPointerAddressSpace();
				if (StoreInst *S = dyn_cast<StoreInst>(I))
				return S->getPointerAddressSpace();
				return -1;
				}

				// FIXME: Merge with llvm::isConsecutiveAccess
				bool Vectorizer::isConsecutiveAccess(Value *A, Value *B) {
				Value *PtrA = getPointerOperand(A);
				Value *PtrB = getPointerOperand(B);
				unsigned ASA = getAddressSpaceOperand(A);
				unsigned ASB = getAddressSpaceOperand(B);

				// Check that the address spaces match and that the pointers are valid.
				if (!PtrA || !PtrB || (ASA != ASB))
				return false;

				// Make sure that A and B are different pointers of the same size type.
				jlebar: It's not clear to me why we need this line, although all the cases I can think of where this is false seem quite bizarre, so maybe it doesn't matter.
				arsenm: I think this is to fix cases like trying to merge i1 into i8. This was already crashing earlier though, so I'm just having sub-byte types never considered.
				unsigned PtrBitWidth = DL.getPointerSizeInBits(ASA);
				Type *PtrATy = PtrA->getType()->getPointerElementType();
				Type *PtrBTy = PtrB->getType()->getPointerElementType();
				if (PtrA == PtrB ||
				DL.getTypeStoreSize(PtrATy) != DL.getTypeStoreSize(PtrBTy) ||
				DL.getTypeStoreSize(PtrATy->getScalarType()) !=
				DL.getTypeStoreSize(PtrBTy->getScalarType()))
				return false;

				APInt Size(PtrBitWidth, DL.getTypeStoreSize(PtrATy));

				APInt OffsetA(PtrBitWidth, 0), OffsetB(PtrBitWidth, 0);
				PtrA = PtrA->stripAndAccumulateInBoundsConstantOffsets(DL, OffsetA);
				PtrB = PtrB->stripAndAccumulateInBoundsConstantOffsets(DL, OffsetB);

				APInt OffsetDelta = OffsetB - OffsetA;

				// Check if they are based on the same pointer. That makes the offsets
				// sufficient.
				if (PtrA == PtrB)
				return OffsetDelta == Size;

				jlebar: Suggest s/Otherwise// -- it's not clear that this is contrasting with "Check if they are based on the same pointer."
				// Compute the necessary base pointer delta to have the necessary final delta
				// equal to the size.
				APInt BaseDelta = Size - OffsetDelta;

				// Compute the distance with SCEV between the base pointers.
				const SCEV *PtrSCEVA = SE.getSCEV(PtrA);
				const SCEV *PtrSCEVB = SE.getSCEV(PtrB);
				const SCEV *C = SE.getConstant(BaseDelta);
				const SCEV *X = SE.getAddExpr(PtrSCEVA, C);
				if (X == PtrSCEVB)
				return true;

				jlebar: Nit, suggest s/proving/checking/ -- "proving" is kind of a strong term for what we do, sort of implies it's harder than it is (so it's confusing when I see it's so simple).
				// Sometimes even this doesn't work, because SCEV can't always see through
				// patterns that look like (gep (ext (add (shl X, C1), C2))). Try checking
				// things the hard way.

				// Look through GEPs after checking they're the same except for the last
				// index.
				GetElementPtrInst *GEPA = dyn_cast<GetElementPtrInst>(getPointerOperand(A));
				GetElementPtrInst *GEPB = dyn_cast<GetElementPtrInst>(getPointerOperand(B));
				if (!GEPA || !GEPB || GEPA->getNumOperands() != GEPB->getNumOperands())
				return false;
				unsigned FinalIndex = GEPA->getNumOperands() - 1;
				for (unsigned i = 0; i < FinalIndex; i++)
				if (GEPA->getOperand(i) != GEPB->getOperand(i))
				return false;

				Instruction *OpA = dyn_cast<Instruction>(GEPA->getOperand(FinalIndex));
				Instruction *OpB = dyn_cast<Instruction>(GEPB->getOperand(FinalIndex));
				jlebar: Suggest something like // Only look through ZExt/SExt to make it clearer why we're doing an early return.
				if (!OpA || !OpB || OpA->getOpcode() != OpB->getOpcode() ||
				OpA->getType() != OpB->getType())
				return false;

				// Only look through a ZExt/SExt.
				if (!isa<SExtInst>(OpA) && !isa<ZExtInst>(OpA))
				return false;

				OpA = dyn_cast<Instruction>(OpA->getOperand(0));
				OpB = dyn_cast<Instruction>(OpB->getOperand(0));
				if (!OpA || !OpB || OpA->getType() != OpB->getType())
				return false;

				// Now we need to prove that adding 1 to OpA won't overflow.
				unsigned BitWidth = OpA->getType()->getScalarSizeInBits();
				APInt KnownZero = APInt(BitWidth, 0);
				APInt KnownOne = APInt(BitWidth, 0);
				computeKnownBits(OpA, KnownZero, KnownOne, DL, 0, nullptr, OpA, &DT);
				// If any bits are known to be zero other than the sign bit in OpA, we can
				// add 1 to it while guaranteeing no overflow of any sort.
				KnownZero &= ~APInt::getHighBitsSet(BitWidth, 1);
				if (KnownZero == 0)
				return false;

				const SCEV *OffsetSCEVA = SE.getSCEV(OpA);
				const SCEV *OffsetSCEVB = SE.getSCEV(OpB);
				const SCEV *One = SE.getConstant(APInt(BitWidth, 1));
				mzolotukhin: Some comments would be helpful here (and for other functions).
				arsenm: The comments are already on the declarations. Usually for private functions like this I would put them on the body. Should I move them?
				const SCEV *X2 = SE.getAddExpr(OffsetSCEVA, One);
				return X2 == OffsetSCEVB;
				asbirlea: Nit: Could the Instruction instance "User" be renamed to something different than a class name? It makes it somewhat harder to follow.
				}

				void Vectorizer::reorder(Instruction *I) {
				for (User *U : I->users()) {
				Instruction *User = dyn_cast<Instruction>(U);
				if (!User || User->getOpcode() == Instruction::PHI)
				continue;

				if (!DT.dominates(I, User)) {
				User->removeFromParent();
				User->insertAfter(I);
				reorder(User);
				}
				}
				}
				jlebarUnsubmitted Not Done Reply Inline Actions Talking to asbirlea, we are not convinced this is safe. In particular, we cannot safely reorder across instructions that may read or may write arbitrary memory, nor can we safely reorder loads and stores that are not in the chain but which nonetheless may alias each other. I think we're safe wrt instructions which may read or write, because these will never appear between a chain's first and last instructions. (I would feel much better if we had an assert to this effect, though.) But I am not convinced we're safe wrt regular loads and stores whose addresses depend on the result of a vectorized load. For example, imagine we have b = load a load [b] // b does not alias a store [b] c = load a+1 Suppose that we choose to vectorize the first and last load and WLOG assume we insert the vectorized load at the position of c: load [b] store [b] {b, c} = load {a, a+1} Now we call reorder() to fix this up. The first two instructions are going to be moved below the vectorized load in the reverse order of I->uses(). AFAICT there is no guarantee on the order of uses(), but even if there were, outputting in the opposite order seems probably not right? Even if that were somehow right, afaict we can make reorder() do arbitrary, unsafe reorderings of loads/stores with the following idiom. b = load a b1 = b + 0 // copy b to b1 b2 = b + 0 // copy b to b2 load [b1] // b1 does not alias a store [b2] // b2 does not alias a c = load a+1 Now if the order of b1 and b2 in the source determines their order in users(), then that controls the order of "load [b1]" and "store [b2]" in the result, even though there's only one correct ordering of that load and store. So to summarize, it may be my ignorance, but I don't see why we can rely on there being any particular order to users(). But even if users() has some guaranteed order, I don't see how any ordering could guarantee that we never reorder an aliasing load/store pair. 
				asbirlea (Not Done): Tried testing this out using jlebar's first example.

				Input:

				    %struct.buffer_t = type { i64, i8, [4 x i32], [4 x i32], [4 x i32], i32, i8, i8, [2 x i8] }

				    define i32 @test(%struct.buffer_t* noalias %buff) #0 {
				    entry:
				      %tmp1 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 1
				      %buff.host = load i8, i8* %tmp1, align 8
				      %buff.val = load i8, i8* %buff.host, align 8
				      store i8 0, i8* %buff.host, align 8
				      %tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 0
				      %buff.dev = load i64, i64* %tmp0, align 8
				      ret i32 0
				    }

				    attributes #0 = { nounwind }

				Output:

				    %struct.buffer_t = type { i64, i8, [4 x i32], [4 x i32], [4 x i32], i32, i8, i8, [2 x i8] }

				    ; Function Attrs: nounwind
				    define i32 @test(%struct.buffer_t* noalias %buff) #0 {
				    entry:
				      %tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 0
				      %0 = bitcast i64* %tmp0 to <2 x i64>*
				      %1 = load <2 x i64>, <2 x i64>* %0, align 8
				      %2 = extractelement <2 x i64> %1, i32 0
				      %3 = extractelement <2 x i64> %1, i32 1
				      %4 = inttoptr i64 %3 to i8*
				      store i8 0, i8* %4, align 8
				      %buff.val = load i8, i8* %4, align 8
				      ret i32 0
				    }

				    attributes #0 = { nounwind }

				The first and last loads got vectorized and moved to the beginning, but the uses that now follow are in reversed order. The input had load [b], store [b]. The output has store [b], load [b].
				asbirlea (Not Done): FWIW, the example above can be fixed by keeping a temporary of the last instruction inserted. For example:

				    Instruction *InsertAfter = I;
				    for (User *U : I->users())
				      if (Instruction *User = dyn_cast<Instruction>(U))
				        if (!DT.dominates(I, User) && User->getOpcode() != Instruction::PHI) {
				          User->removeFromParent();
				          User->insertAfter(InsertAfter);
				          reorder(User);
				          InsertAfter = User;
				        }

				However, this does not account for the recursive inserts AND it assumes the users() are in BB order.

				A correct and very iffy way:
				1. Create a set (include the recursive call here) of all (transitive) users.
				2. Make a pass over all instructions in the BB; if an instruction is in the collected set, insert it after the latest inserted instruction (as in the sample above); or just go over the BB in reverse order.

				I would favor avoiding the reorder entirely over having a pass over the BB at every reorder.
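The two-step approach asbirlea describes (collect the transitive users, then re-place them in basic-block order) can be sketched without any LLVM types. This is only an illustrative model, not code from the patch: instructions are plain integer ids, `Block` is the block in program order, and `sinkUsersAfter` is a hypothetical helper name.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Hypothetical model: each instruction is an integer id and `Block` lists the
// ids in program order. `Users` is the already-collected transitive user set
// of the instruction `Anchor` (for example, the new vector load).
//
// Every user that currently sits before `Anchor` is re-placed directly after
// it. The users are visited in block order, so their relative order is
// preserved instead of depending on the iteration order of users().
std::vector<int> sinkUsersAfter(const std::vector<int> &Block, int Anchor,
                                const std::set<int> &Users) {
  std::vector<int> Result;
  std::vector<int> Moved; // users seen before Anchor, in original order
  bool SeenAnchor = false;
  for (int Inst : Block) {
    if (!SeenAnchor && Users.count(Inst)) {
      Moved.push_back(Inst);
      continue;
    }
    Result.push_back(Inst);
    if (Inst == Anchor) {
      SeenAnchor = true;
      Result.insert(Result.end(), Moved.begin(), Moved.end());
    }
  }
  assert(SeenAnchor && "Anchor must be part of Block");
  return Result;
}
```

With ids 10 = `load [b]`, 11 = `store [b]` and 12 = the vector load, `sinkUsersAfter({10, 11, 12}, 12, {10, 11})` keeps the load ahead of the store. The cost is one pass over the block per call, which is the overhead asbirlea would rather avoid.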

				std::pair<BasicBlock::iterator, BasicBlock::iterator>
				jlebar (Done): Nit, consider llvm::is_contained()
				Vectorizer::getBoundaryInstrs(ArrayRef<Value *> Chain) {
				Instruction *C0 = cast<Instruction>(Chain[0]);
				BasicBlock::iterator FirstInstr = C0->getIterator();
				BasicBlock::iterator LastInstr = C0->getIterator();

				BasicBlock *BB = C0->getParent();
				unsigned NumFound = 0;
				for (Instruction &I : *BB) {
				if (!is_contained(Chain, &I))
				continue;

				++NumFound;
				if (NumFound == 1) {
				FirstInstr = I.getIterator();
				} else if (NumFound == Chain.size()) {
				LastInstr = I.getIterator();
				asbirlea (Not Done): This should be: LastInstr = ++I; (using updated patch). It's always valid: it is either the next valid instruction or BB.end(). This fixes the issue that isVectorizable misses testing the last instruction in the chain.
				arsenm (Author, Not Done): Leaving for future patch
				break;
				}
				}

				jlebar (Not Done): Why this special case? This isn't as strong as a general DCE, so it seems to me like it gives us a false sense of security? i.e. should we just say "run DCE after this pass"?
				arsenm (Author, Not Done): It's run after each vectorization. I'm guessing the additional uses might somehow interfere with vectorization of other chains? This can be investigated later.
				return std::make_pair(FirstInstr, LastInstr);
				}

				void Vectorizer::eraseInstructions(ArrayRef<Value *> Chain) {
				SmallVector<Instruction *, 16> Instrs;
				for (Value *V : Chain) {
				Value *PtrOperand = getPointerOperand(V);
				assert(PtrOperand && "Instruction must have a pointer operand.");
				Instrs.push_back(cast<Instruction>(V));
				if (GetElementPtrInst *GEP = dyn_cast<GetElementPtrInst>(PtrOperand))
				Instrs.push_back(GEP);
				}

				jlebar (Done): s/ElementSize/ElementSizeBits/? (I don't care as much what the local vars here are called, but if the arg is called ElementSize, I might reasonably pass a value in bytes.)
				jlebar (Done): I'm not sure "bisect" is quite the right word for this? I expected this to split Chain in half, but it looks like it actually splits off the rightmost 1, 2, or 3 bytes' worth of elements, so that the left part has a size that's a multiple of 4 bytes (ish). I might call that "splitIntoRoundSize" or something. A postcondition assert() would probably also help clarify this.
				// Erase instructions.
				for (Value *V : Instrs) {
				Instruction *Instr = cast<Instruction>(V);
				if (Instr->use_empty())
				Instr->eraseFromParent();
				}
				}

				std::pair<ArrayRef<Value *>, ArrayRef<Value *>>
				Vectorizer::splitOddVectorElts(ArrayRef<Value *> Chain,
				unsigned ElementSizeBits) {
				unsigned ElemSizeInBytes = ElementSizeBits / 8;
				unsigned SizeInBytes = ElemSizeInBytes * Chain.size();
				unsigned NumRight = (SizeInBytes % 4) / ElemSizeInBytes;
				unsigned NumLeft = Chain.size() - NumRight;
				return std::make_pair(Chain.slice(0, NumLeft), Chain.slice(NumLeft));
				}
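The arithmetic in splitOddVectorElts is easier to follow with concrete numbers. The sketch below repeats the same four computations on element counts alone; `splitCounts` is a hypothetical name used only for illustration, and it assumes elements are at least 8 bits wide (otherwise the byte size would be zero).

```cpp
#include <cassert>
#include <utility>

// Returns {NumLeft, NumRight} for a chain of NumElts elements, using the same
// arithmetic as splitOddVectorElts. Assumes ElementSizeBits >= 8.
std::pair<unsigned, unsigned> splitCounts(unsigned NumElts,
                                          unsigned ElementSizeBits) {
  unsigned ElemSizeInBytes = ElementSizeBits / 8;
  assert(ElemSizeInBytes != 0 && "sub-byte elements are not modelled here");
  unsigned SizeInBytes = ElemSizeInBytes * NumElts;
  // Elements that spill past the last 4-byte boundary go to the right part.
  unsigned NumRight = (SizeInBytes % 4) / ElemSizeInBytes;
  unsigned NumLeft = NumElts - NumRight;
  return {NumLeft, NumRight};
}
```

For seven i8 accesses the total is 7 bytes, so 3 elements are split off and 4 remain on the left. For five i16 accesses the total is 10 bytes, so 1 element is split off and 4 remain. In both cases the left part covers a multiple of 4 bytes, which is the postcondition jlebar suggests asserting.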
				jlebar (Done): Consider llvm::if_contains

				bool Vectorizer::isVectorizable(ArrayRef<Value *> &Chain,
				BasicBlock::iterator From,
				BasicBlock::iterator To) {
				SmallVector<std::pair<Value *, unsigned>, 16> MemoryInstrs;
				SmallVector<std::pair<Value *, unsigned>, 16> ChainInstrs;

				unsigned Idx = 0;
				for (auto I = From, E = To; I != E; ++I, ++Idx) {
				if (isa<LoadInst>(I) || isa<StoreInst>(I)) {
				if (!is_contained(Chain, &*I))
				MemoryInstrs.push_back({ &*I, Idx });
				else
				ChainInstrs.push_back({ &*I, Idx });
				} else if (I->mayHaveSideEffects()) {
				DEBUG(dbgs() << "LSV: Found side-effecting operation: " << *I << '\n');
				return false;
				}
				}

				asbirlea (Not Done): Consider asserting Chain.size() == ChainInstrs.size()?
				arsenm (Author, Not Done): This needs to go with the ++I patch
				for (auto EntryMem : MemoryInstrs) {
				Value *V = EntryMem.first;
				jlebar (Done): suggest "(the vectorized load is inserted at the location of the first load in the chain)."
				unsigned VIdx = EntryMem.second;
				for (auto EntryChain : ChainInstrs) {
				Value *VV = EntryChain.first;
				unsigned VVIdx = EntryChain.second;
				if (isa<LoadInst>(V) && isa<LoadInst>(VV))
				continue;

				// We can ignore the alias as long as the load comes before the store,
				// because that means we won't be moving the load past the store to
				// vectorize it (the vectorized load is inserted at the location of the
				// first load in the chain).
				jlebar (Done): Not GetPointerOperand?
				if (isa<StoreInst>(V) && isa<LoadInst>(VV) && VVIdx < VIdx)
				asbirlea (Not Done): As loads are inserted at the location of the last load, either invert the sign (VVIdx > VIdx) and update the comment, or fix the load insert location. The former misses vectorization opportunities, but it avoids (at least some of) the incorrect code generation.
				arsenm (Author, Not Done): Are you going to fix this / do you have test cases for this and below?
				asbirlea (Not Done): Yes. I have an updated version of the current patch that fixes this and adds a couple of unit tests.
				continue;

				// Same case, but in reverse.
				if (isa<LoadInst>(V) && isa<StoreInst>(VV) && VVIdx > VIdx)
				asbirlea (Not Done): Same as above.
				continue;

				Instruction *M0 = cast<Instruction>(V);
				Instruction *M1 = cast<Instruction>(VV);
				Value *Ptr0 = getPointerOperand(M0);
				Value *Ptr1 = getPointerOperand(M1);
				unsigned S0 =
				DL.getTypeStoreSize(Ptr0->getType()->getPointerElementType());
				unsigned S1 =
				DL.getTypeStoreSize(Ptr1->getType()->getPointerElementType());

				if (AA.alias(MemoryLocation(Ptr0, S0), MemoryLocation(Ptr1, S1))) {
				DEBUG(
				dbgs() << "LSV: Found alias.\n"
				" Aliasing instruction and pointer:\n"
				<< *V << " aliases " << *Ptr0 << '\n'
				<< " Aliased instruction and pointer:\n"
				<< *VV << " aliases " << *Ptr1 << '\n'
				);

				return false;
				jlebar (Not Done): To me this would be a lot more clear if we returned the relevant maps (or, I suppose, explicitly took them as params). Otherwise it's not clear from the signature (or the comments) where we collect the instructions into.
				}
				}
				mzolotukhin (Done): Range loop?
				}

				return true;
				}
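The disagreement about the `VVIdx < VIdx` shortcut in isVectorizable comes down to which chain members actually move when the vector access is emitted. A small model makes that explicit. It is an assumption-laden sketch, not part of the patch: positions are block indices as in isVectorizable, and `movesAcross` is a hypothetical helper.

```cpp
#include <algorithm>
#include <cassert>

// A chain member originally at OrigIdx is replaced by a vector access emitted
// at InsertIdx. It is reordered relative to another memory instruction at
// OtherIdx only if that instruction lies strictly between the two positions.
bool movesAcross(unsigned OrigIdx, unsigned InsertIdx, unsigned OtherIdx) {
  unsigned Lo = std::min(OrigIdx, InsertIdx);
  unsigned Hi = std::max(OrigIdx, InsertIdx);
  return Lo < OtherIdx && OtherIdx < Hi;
}
```

Take chain loads at indices 0 and 3 with an unrelated store at index 1. If the vector load is emitted at the last load (index 3), the load at 0 crosses the store, so the alias check may only be skipped for chain loads that come after the store (`VVIdx > VIdx`), as asbirlea suggests. If it were emitted at the first load (index 0), the roles flip and the `VVIdx < VIdx` shortcut in the code comment would be the right one.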

				void Vectorizer::collectInstructions(BasicBlock *BB) {
				LoadRefs.clear();
				StoreRefs.clear();

				for (Instruction &I : *BB) {
				if (!I.mayReadOrWriteMemory())
				continue;

				if (LoadInst *LI = dyn_cast<LoadInst>(&I)) {
				if (!LI->isSimple())
				continue;

				Type *Ty = LI->getType();
				if (!VectorType::isValidElementType(Ty->getScalarType()))
				continue;
				jlebar (Done): Consider llvm::all_of(LI->users(), ...);

				// No point in looking at these if they're too big to vectorize.
				if (DL.getTypeSizeInBits(Ty) > VecRegSize / 2)
				continue;

				// Make sure all the users of a vector are constant-index extracts.
				if (isa<VectorType>(Ty) &&
				!all_of(LI->users(), [LI](const User *U) {
				const Instruction *UI = cast<Instruction>(U);
				return isa<ExtractElementInst>(UI) &&
				isa<ConstantInt>(UI->getOperand(1));
				}))
				continue;

				// TODO: Target hook to filter types.

				// Save the load locations.
				Value *Ptr = GetUnderlyingObject(LI->getPointerOperand(), DL);
				LoadRefs[Ptr].push_back(LI);

				} else if (StoreInst *SI = dyn_cast<StoreInst>(&I)) {
				if (!SI->isSimple())
				continue;

				Type *Ty = SI->getValueOperand()->getType();
				if (!VectorType::isValidElementType(Ty->getScalarType()))
				continue;

				if (DL.getTypeSizeInBits(Ty) > VecRegSize / 2)
				continue;

				if (isa<VectorType>(Ty) &&
				!all_of(SI->users(), [SI](const User *U) {
				const Instruction *UI = cast<Instruction>(U);
				return isa<ExtractElementInst>(UI) &&
				isa<ConstantInt>(UI->getOperand(1));
				}))
				continue;

				// Save store location.
				Value *Ptr = GetUnderlyingObject(SI->getPointerOperand(), DL);
				StoreRefs[Ptr].push_back(SI);
				}
				}
				}

				bool Vectorizer::vectorizeChains(ValueListMap &Map) {
				bool Changed = false;

				for (const std::pair<Value *, ValueList> &Chain : Map) {
				unsigned Size = Chain.second.size();
				mzolotukhin (Not Done): Might be a good idea to parametrize `64`.
				if (Size < 2)
				continue;

				DEBUG(dbgs() << "LSV: Analyzing a chain of length " << Size << ".\n");

				// Process the stores in chunks of 64.
				for (unsigned CI = 0, CE = Size; CI < CE; CI += 64) {
				unsigned Len = std::min<unsigned>(CE - CI, 64);
				ArrayRef<Value *> Chunk(&Chain.second[CI], Len);
				Changed |= vectorizeInstructions(Chunk);
				}
				}
				jlebar (Not Done): Nit, it might be helpful to print out at least the first instruction in the chain. At least, that was helpful to me when I was debugging it.
				arsenm (Author, Not Done): I'm not sure why this one is even needed. The full chain is already printed out right above this by the "LSV: Loads to vectorize" DEBUG
				jlebar (Not Done): Well, they are sort of different...that one may never appear if we find no chains or whatever. But I take your point that this one may not be so useful. FWIW some additional well-placed log messages might be helpful; I was debugging with a coworker a few days ago why some instructions weren't being vectorized, and it required gdb. But we can also easily go back and add these later.

				return Changed;
				}
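mzolotukhin's suggestion to parametrize the literal `64` in vectorizeChains amounts to passing the chunk size in. The sketch below shows only the range computation; `chunkRanges` and its `ChunkSize` parameter are hypothetical names, and the real pass would hand each range to vectorizeInstructions as an ArrayRef.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Returns (offset, length) pairs that cover [0, Size) in pieces of at most
// ChunkSize elements, mirroring the CI/CE loop in vectorizeChains.
std::vector<std::pair<std::size_t, std::size_t>>
chunkRanges(std::size_t Size, std::size_t ChunkSize) {
  assert(ChunkSize > 0 && "chunk size must be positive");
  std::vector<std::pair<std::size_t, std::size_t>> Ranges;
  for (std::size_t CI = 0; CI < Size; CI += ChunkSize)
    Ranges.emplace_back(CI, std::min(Size - CI, ChunkSize));
  return Ranges;
}
```

A chain of 150 accesses with a chunk size of 64 yields the ranges (0, 64), (64, 64) and (128, 22). Note that the fixed-size `ConsecutiveChain[64]` buffer in vectorizeInstructions would have to follow the same constant.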
				jlebar (Done): Can we move this closer to where it's used?

				bool Vectorizer::vectorizeInstructions(ArrayRef<Value *> Instrs) {
				DEBUG(dbgs() << "LSV: Vectorizing " << Instrs.size() << " instructions.\n");
				SmallSetVector<int, 16> Heads, Tails;
				int ConsecutiveChain[64];

				jlebar (Done): Nit, we know that Instrs is <= 64 elements, so could we just make ConsecutiveChain an array of signed ints? Then we could use -1 instead of ~0U (which implied to me that we were going to do some bit twiddling below). Also I think you could initialize the array with

				    int ConsecutiveChain[64] = {-1};

				which would be clearer than setting each element in the loop, imo.
				arsenm (Author, Not Done): Can't do that. It needs to be reset after the j loop
				jlebar (Not Done): Ah, I see. I clearly have no idea what this loop does. :)
				asbirlea (Not Done): I'm confused why that is the case. Each i iteration sets ConsecutiveChain[i]=-1 and all read/write accesses in the j loop are on ConsecutiveChain[i]. Initializing all of ConsecutiveChain to -1 before the i loop should have the equivalent behavior. Could you clarify what I'm missing?
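Independently of whether the per-iteration reset is needed, one C++ detail matters for the suggested `int ConsecutiveChain[64] = {-1};`: aggregate initialization sets only the listed elements and value-initializes the rest, so only element 0 would be -1. The sketch below demonstrates this and shows one way to fill every slot; the function names are made up for illustration.

```cpp
#include <array>
#include <cassert>

// Brace initialization with a single value: element 0 is -1, every remaining
// element is zero-initialized.
int secondSlotAfterBraceInit() {
  int Chain[64] = {-1};
  return Chain[1];
}

// Filling every slot with -1 has to be done explicitly, for example with
// std::array::fill (std::fill over a plain array works the same way).
std::array<int, 64> makeUnsetChain() {
  std::array<int, 64> Chain;
  Chain.fill(-1);
  return Chain;
}
```

So hoisting the initialization out of the `i` loop, as asbirlea proposes, would need an explicit fill rather than the brace form.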
				// Do a quadratic search on all of the given stores and find all of the pairs
				// of stores that follow each other.
				for (int i = 0, e = Instrs.size(); i < e; ++i) {
				ConsecutiveChain[i] = -1;
				for (int j = e - 1; j >= 0; --j) {
				if (i == j)
				continue;

				if (isConsecutiveAccess(Instrs[i], Instrs[j])) {
				jlebar (Not Done): I cannot for the life of me figure out what invariant you're trying to preserve with these checks. I guess a comment is in order. :) (I get that we prefer to vectorize chains of scalar instructions that appear in order (j > i), but I can't figure out what CurDistance and NewDistance are checking. It would make sense if you wanted to prefer chains of scalar instructions that are close to each other, but then NewDistance should be abs(j - i), right?)
				if (ConsecutiveChain[i] != -1) {
				mzolotukhin (Done): Can it be a range loop?
				jlebar (Done): Not sure, but this might be clearer if you simply used a "continue" here instead of a flag variable.
				int CurDistance = std::abs(ConsecutiveChain[i] - i);
				int NewDistance = std::abs(ConsecutiveChain[i] - j);
				if (j < i || NewDistance > CurDistance)
				continue; // Should not insert.
				}
				jlebar (Not Done): If ConsecutiveChain was not -1, don't we need to remove ConsecutiveChain[i] from Tails? (Maybe it would be better to build Heads and Tails, or at least Tails, in a separate loop, over ConsecutiveChain, so we don't have to worry about the case where one instr is a tail for two heads and then gets overwritten just one time.)

				Tails.insert(j);
				Heads.insert(i);
				ConsecutiveChain[i] = j;
				}
				}
				jlebar (Not Done): The intent here is to build the largest vectors we can, right? (Or at least, we want to build up vectors of the max vector width, if possible? Beyond that doesn't make a difference.) I don't see how we ensure this happens, unless there is some implicit requirement on the order of Instrs that I'm missing. For example, if the elements of Instrs are

				    load p[3]
				    load p[2]
				    load p[1]
				    load p[0]

				then it seems that the first element of Heads will be "load p[2]", implying a two-wide vector load. I don't see what here would cause us to wait on this one and do the four-wide load starting at p[0]. Sorry if I'm missing something obvious.
				}

				bool Changed = false;
				SmallPtrSet<Value *, 16> VectorizedValues;

				for (int Head : Heads) {
				if (Tails.count(Head))
				continue;
				jlebar (Not Done): Actually, I'm not sure why we need the (Tails.count(I) || Heads.count(I)) part of this predicate, and therefore I'm not sure why we need Tails at all. That probably deserves a comment at least.
				arsenm (Author, Not Done): I'm more likely to break something trying to refactor this compared to all of the style fixes, so I'll leave looking at that for later.
				asbirlea (Not Done): As I understand it, this is the condition for finding the chain of maximum length. The check that Head cannot be found in Tails means it's the beginning of a chain, hence one of the longest chains. However, if the longest chain fails to vectorize, this same check prevents any vectorization of the remaining chain. Here's a suggestion to address vectorizing the chain suffix:

				    for (int i = 0; i < Heads.size(); i++) {
				      unsigned Head = Heads[i];
				      if (VectorizedValues.count(Instrs[Head]))
				        continue;
				      // Skip if a longer chain exists in the remaining Heads/Tails
				      for (int j = i+1; j < Tails.size(); j++)
				        if (Head == Tails[j])
				          continue;

				Feel free to add additional improvements for vectorization of a chain prefix.
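The head/tail bookkeeping discussed in this thread can be traced on jlebar's `p[3], p[2], p[1], p[0]` example. The sketch below is a standalone model, not the pass itself: each instruction is reduced to the element offset it accesses, "consecutive" means "accesses the next offset", the VectorizedValues bookkeeping is left out, and `findChains` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Offsets[i] is the element offset accessed by instruction i. Returns the
// chains (as lists of instruction indices) that the head/tail walk produces.
std::vector<std::vector<int>> findChains(const std::vector<int> &Offsets) {
  const int E = static_cast<int>(Offsets.size());
  std::vector<int> Next(E, -1);  // plays the role of ConsecutiveChain
  std::vector<int> Heads, Tails; // insertion-ordered sets
  auto Contains = [](const std::vector<int> &S, int V) {
    return std::find(S.begin(), S.end(), V) != S.end();
  };
  auto Insert = [&Contains](std::vector<int> &S, int V) {
    if (!Contains(S, V))
      S.push_back(V);
  };

  // Quadratic search for pairs (i, j) where j accesses the element after i.
  for (int i = 0; i < E; ++i) {
    for (int j = E - 1; j >= 0; --j) {
      if (i == j || Offsets[j] != Offsets[i] + 1)
        continue;
      if (Next[i] != -1) {
        int CurDistance = std::abs(Next[i] - i);
        int NewDistance = std::abs(Next[i] - j);
        if (j < i || NewDistance > CurDistance)
          continue; // Should not insert.
      }
      Insert(Tails, j);
      Insert(Heads, i);
      Next[i] = j;
    }
  }

  // Only heads that are not also tails start a chain.
  std::vector<std::vector<int>> Chains;
  for (int Head : Heads) {
    if (Contains(Tails, Head))
      continue;
    std::vector<int> Chain;
    for (int I = Head; I != -1 && (Contains(Tails, I) || Contains(Heads, I));
         I = Next[I])
      Chain.push_back(I);
    Chains.push_back(Chain);
  }
  return Chains;
}
```

For offsets `{3, 2, 1, 0}` the heads are found in the order 1, 2, 3 (the loads of p[2], p[1], p[0]), but 1 and 2 are also tails and get skipped. Only index 3 starts a chain, and the walk visits indices 3, 2, 1, 0, that is p[0] through p[3]. In this model the `Tails.count(Head)` check is what defers the two-wide candidate in favor of the four-wide one, consistent with asbirlea's reading.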

				// We found an instr that starts a chain. Now follow the chain and try to
				// vectorize it.
				SmallVector<Value *, 16> Operands;
				int I = Head;
				while (I != -1 && (Tails.count(I) || Heads.count(I))) {
				if (VectorizedValues.count(Instrs[I]))
				break;

				Operands.push_back(Instrs[I]);
				I = ConsecutiveChain[I];
				}

				bool Vectorized = false;
				if (isa<LoadInst>(*Operands.begin()))
				Vectorized = vectorizeLoadChain(Operands);
				else
				Vectorized = vectorizeStoreChain(Operands);

				// Mark the vectorized instructions so that we don't vectorize them again.
				if (Vectorized)
				VectorizedValues.insert(Operands.begin(), Operands.end());
				Changed |= Vectorized;
				}

				return Changed;
				}

				bool Vectorizer::vectorizeStoreChain(ArrayRef<Value *> Chain) {
				StoreInst *S0 = cast<StoreInst>(Chain[0]);
				Type *StoreTy = S0->getValueOperand()->getType();
				unsigned Sz = DL.getTypeSizeInBits(StoreTy);
				unsigned VF = VecRegSize / Sz;
				unsigned ChainSize = Chain.size();

				if (!isPowerOf2_32(Sz) || VF < 2 || ChainSize < 2)
				return false;

				// Store size should be 1B, 2B or multiple of 4B.
				// TODO: Target hook for size constraint?
				jlebar (Done): Hm, looks like an additional constraint on bisect() is that the array passed should have more than 4 elements -- probably worth saying that somewhere.
				unsigned SzInBytes = (Sz / 8) * ChainSize;
				if (SzInBytes > 2 && SzInBytes % 4 != 0) {
				DEBUG(dbgs() << "LSV: Size should be 1B, 2B "
				"or multiple of 4B. Splitting.\n");
				if (SzInBytes == 3)
				return vectorizeStoreChain(Chain.slice(0, ChainSize - 1));

				auto Chains = splitOddVectorElts(Chain, Sz);
				return vectorizeStoreChain(Chains.first) |
				vectorizeStoreChain(Chains.second);
				}

				VectorType *VecTy;
				VectorType *VecStoreTy = dyn_cast<VectorType>(StoreTy);
				if (VecStoreTy)
				VecTy = VectorType::get(StoreTy->getScalarType(),
				Chain.size() * VecStoreTy->getNumElements());
				else
				VecTy = VectorType::get(StoreTy, Chain.size());

				// If it's more than the max vector size, break it into two pieces.
				// TODO: Target hook to control types to split to.
				if (ChainSize > VF) {
				DEBUG(dbgs() << "LSV: Vector factor is too big."
				" Creating two separate arrays.\n");
				return vectorizeStoreChain(Chain.slice(0, VF)) |
				vectorizeStoreChain(Chain.slice(VF));
				}

				DEBUG(
				dbgs() << "LSV: Stores to vectorize:\n";
				for (Value *V : Chain)
				V->dump();
				);

				// Check alignment restrictions.
				unsigned Alignment = S0->getAlignment();

				// If the store is going to be misaligned, don't vectorize it.
				// TODO: Check TLI.allowsMisalignedMemoryAccess
				if ((Alignment % SzInBytes) != 0 && (Alignment % TargetBaseAlign) != 0) {
				if (S0->getPointerAddressSpace() == 0) {
				// If we're storing to an object on the stack, we control its alignment,
				// so we can cheat and change it!
				Value *V = GetUnderlyingObject(S0->getPointerOperand(), DL);
				if (AllocaInst *AI = dyn_cast_or_null<AllocaInst>(V)) {
				AI->setAlignment(TargetBaseAlign);
				Alignment = TargetBaseAlign;
				} else {
				return false;
				}
				} else {
				jlebar (Done): If you like, you could write this as

				    BasicBlock::iterator First, Last;
				    std::tie(First, Last) = getBoundaryInstrs(Chain);

				Same for the other call to getBoundaryInstrs.
				return false;
				}
				}

				BasicBlock::iterator First, Last;
				std::tie(First, Last) = getBoundaryInstrs(Chain);

				if (!isVectorizable(Chain, First, Last))
				return false;

				// Set insert point.
				Builder.SetInsertPoint(&*Last);
				unsigned AS = S0->getPointerAddressSpace();

				Value *Vec = UndefValue::get(VecTy);

				if (VecStoreTy) {
				unsigned VecWidth = VecStoreTy->getNumElements();
				for (unsigned I = 0, E = Chain.size(); I != E; ++I) {
				StoreInst *Store = cast<StoreInst>(Chain[I]);
				for (unsigned J = 0, NE = VecStoreTy->getNumElements(); J != NE; ++J) {
				unsigned NewIdx = J + I * VecWidth;
				Value *Extract = Builder.CreateExtractElement(Store->getValueOperand(),
				Builder.getInt32(J));
				if (Extract->getType() != StoreTy->getScalarType())
				Extract = Builder.CreateBitCast(Extract, StoreTy->getScalarType());

				Value *Insert = Builder.CreateInsertElement(Vec, Extract,
				Builder.getInt32(NewIdx));
				Vec = Insert;
				}
				}
				} else {
				for (unsigned I = 0, E = Chain.size(); I != E; ++I) {
				StoreInst *Store = cast<StoreInst>(Chain[I]);
				Value *Extract = Store->getValueOperand();
				if (Extract->getType() != StoreTy->getScalarType())
				Extract = Builder.CreateBitCast(Extract, StoreTy->getScalarType());

				Value *Insert = Builder.CreateInsertElement(Vec, Extract,
				Builder.getInt32(I));
				Vec = Insert;
				}
				}

				Value *Bitcast =
				Builder.CreateBitCast(S0->getPointerOperand(), VecTy->getPointerTo(AS));
				StoreInst *SI = cast<StoreInst>(Builder.CreateStore(Vec, Bitcast));
				propagateMetadata(SI, Chain);
				mzolotukhin (Not Done): Any chance it can be somehow combined with `vectorizeStoreChain`? They look so similar.
				SI->setAlignment(Alignment);

				eraseInstructions(Chain);
				++NumVectorInstructions;
				NumScalarsVectorized += Chain.size();
				return true;
				}
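The "Vector factor is too big" step in vectorizeStoreChain (and the matching step in vectorizeLoadChain) peels off VF elements at a time by recursion. The sketch below models only that step, for element sizes where the 4-byte size constraint never triggers (for example 32-bit elements). `pieceSizes` is a hypothetical name, and the 128-bit register width used in the example is an assumption, not a value taken from the patch.

```cpp
#include <cassert>
#include <vector>

// Sizes of the pieces a chain is cut into when it is longer than
// VF = VecRegSizeBits / ElementSizeBits. Pieces of fewer than two elements
// are dropped, matching the `ChainSize < 2` early return.
std::vector<unsigned> pieceSizes(unsigned ChainSize, unsigned ElementSizeBits,
                                 unsigned VecRegSizeBits) {
  assert(ElementSizeBits != 0 && "element size must be non-zero");
  std::vector<unsigned> Pieces;
  unsigned VF = VecRegSizeBits / ElementSizeBits;
  if (VF < 2)
    return Pieces; // nothing can be vectorized
  while (ChainSize > VF) {
    Pieces.push_back(VF); // Chain.slice(0, VF)
    ChainSize -= VF;      // continue with Chain.slice(VF)
  }
  if (ChainSize >= 2)
    Pieces.push_back(ChainSize);
  return Pieces;
}
```

Ten 32-bit stores with an assumed 128-bit register give VF = 4 and pieces of 4, 4 and 2. With nine stores the trailing single store is left scalar.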

				bool Vectorizer::vectorizeLoadChain(ArrayRef<Value *> Chain) {
				LoadInst *L0 = cast<LoadInst>(Chain[0]);
				Type *LoadTy = L0->getType();
				unsigned Sz = DL.getTypeSizeInBits(LoadTy);
				unsigned VF = VecRegSize / Sz;
				unsigned ChainSize = Chain.size();

				if (!isPowerOf2_32(Sz) || VF < 2 || ChainSize < 2)
				return false;

				// Load size should be 1B, 2B or multiple of 4B.
				// TODO: Should size constraint be a target hook?
				unsigned SzInBytes = (Sz / 8) * ChainSize;
				if (SzInBytes > 2 && SzInBytes % 4 != 0) {
				DEBUG(dbgs() << "LSV: Size should be 1B, 2B or multiple of 4B. Splitting.\n");
				if (SzInBytes == 3)
				return vectorizeLoadChain(Chain.slice(0, ChainSize - 1));
				auto Chains = splitOddVectorElts(Chain, Sz);
				return vectorizeLoadChain(Chains.first) | vectorizeLoadChain(Chains.second);
				}

				VectorType *VecTy;
				VectorType *VecLoadTy = dyn_cast<VectorType>(LoadTy);
				if (VecLoadTy)
				VecTy = VectorType::get(LoadTy->getScalarType(),
				Chain.size() * VecLoadTy->getNumElements());
				else
				VecTy = VectorType::get(LoadTy, Chain.size());

				// If it's more than the max vector size, break it into two pieces.
				// TODO: Target hook to control types to split to.
				if (ChainSize > VF) {
				DEBUG(dbgs() << "LSV: Vector factor is too big. "
				"Creating two separate arrays.\n");
				return vectorizeLoadChain(Chain.slice(0, VF)) |
				vectorizeLoadChain(Chain.slice(VF));
				}

				// Check alignment restrictions.
				unsigned Alignment = L0->getAlignment();

				// If the load is going to be misaligned, don't vectorize it.
				// TODO: Check TLI.allowsMisalignedMemoryAccess and remove TargetBaseAlign.
				if ((Alignment % SzInBytes) != 0 && (Alignment % TargetBaseAlign) != 0) {
				mzolotukhin (Done): Is there something special about `4`? Can we add a define/const for it?
				arsenm (Author, Not Done): The basic type alignment on GPUs? It's on my todo list to make this check allowsMemoryAccess instead
				if (L0->getPointerAddressSpace() == 0) {
				// If we're loading from an object on the stack, we control its alignment,
				// so we can cheat and change it!
				Value *V = GetUnderlyingObject(L0->getPointerOperand(), DL);
				if (AllocaInst *AI = dyn_cast_or_null<AllocaInst>(V)) {
				AI->setAlignment(TargetBaseAlign);
				Alignment = TargetBaseAlign;
				} else {
				return false;
				}
				} else {
				return false;
				}
				}

				DEBUG(
				dbgs() << "LSV: Loads to vectorize:\n";
				for (Value *V : Chain)
				V->dump();
				);

				BasicBlock::iterator First, Last;
				std::tie(First, Last) = getBoundaryInstrs(Chain);
				jlebar (Not Done): A correctness-critical assumption above is that

				> the loads are inserted at the location of the first load

				but here it looks like the opposite?
				arsenm (Author, Not Done): I had another test which I apparently forgot to include which shows the alias insertion order behavior. I've also added another which shows the insertion point. With the insertion point as Last, the load store load store case is correct and vectorizes as store load v2 store. If I change the load insertion point to be before the first, it regresses and produces the incorrect load v2 store v2. Maybe the comment is backwards?
				jlebar (Not Done):

				> Maybe the comment is backwards?

				I can't manage to convince myself that those checks are correct if we do anything but what the comments say. In the test, I see that you have

				    %ld.c = load double, double addrspace(1)* %c, align 8 ; may alias store to %a
				    store double 0.0, double addrspace(1)* %a, align 8
				    %ld.c.idx.1 = load double, double addrspace(1)* %c.idx.1, align 8 ; may alias store to %a
				    store double 0.0, double addrspace(1)* %a.idx.1, align 8

				If the comments are correct and both loads may alias the first store, isn't transforming this into "store; load v2; store" (what the test is checking for) unsafe? I grant that vectorizing both the loads and the stores is also unsafe.

				if (!isVectorizable(Chain, First, Last))
				return false;

				// Set insert point.
				Builder.SetInsertPoint(&*Last);

				unsigned AS = L0->getPointerAddressSpace();
				Value *Bitcast =
				Builder.CreateBitCast(L0->getPointerOperand(), VecTy->getPointerTo(AS));
				jlebar (Done): Nit, I'd prefer to move this into the if and else blocks, so it's clear that we don't use it elsewhere.

				  LoadInst *LI = cast<LoadInst>(Builder.CreateLoad(Bitcast));
				  propagateMetadata(LI, Chain);
				  LI->setAlignment(Alignment);

				  if (VecLoadTy) {
				    SmallVector<Instruction *, 16> InstrsToErase;
				    SmallVector<Instruction *, 16> InstrsToReorder;

				    unsigned VecWidth = VecLoadTy->getNumElements();
				    for (unsigned I = 0, E = Chain.size(); I != E; ++I) {
				      for (auto Use : Chain[I]->users()) {
				        Instruction *UI = cast<Instruction>(Use);
				        unsigned Idx = cast<ConstantInt>(UI->getOperand(1))->getZExtValue();
				        unsigned NewIdx = Idx + I * VecWidth;
				        Value *V = Builder.CreateExtractElement(LI, Builder.getInt32(NewIdx));
				        Instruction *Extracted = cast<Instruction>(V);
				        if (Extracted->getType() != UI->getType())
				          Extracted =
				              cast<Instruction>(Builder.CreateBitCast(Extracted, UI->getType()));

				        // Replace the old instruction.
				        UI->replaceAllUsesWith(Extracted);
				        InstrsToReorder.push_back(Extracted);
				        InstrsToErase.push_back(UI);
				      }
				    }

				    for (Instruction *ModUser : InstrsToReorder)
				      reorder(ModUser);

				    for (auto I : InstrsToErase)
				      I->eraseFromParent();
				  } else {
				    SmallVector<Instruction *, 16> InstrsToReorder;

				    for (unsigned I = 0, E = Chain.size(); I != E; ++I) {
				      Value *V = Builder.CreateExtractElement(LI, Builder.getInt32(I));
				      Instruction *Extracted = cast<Instruction>(V);
				      Instruction *UI = cast<Instruction>(Chain[I]);
				      if (Extracted->getType() != UI->getType())
				        Extracted =
				            cast<Instruction>(Builder.CreateBitCast(Extracted, UI->getType()));

				      // Replace the old instruction.
				      UI->replaceAllUsesWith(Extracted);
				      InstrsToReorder.push_back(Extracted);
				    }

				    for (Instruction *ModUser : InstrsToReorder)
				      reorder(ModUser);
				  }

				  eraseInstructions(Chain);

				  ++NumVectorInstructions;
				  NumScalarsVectorized += Chain.size();
				  return true;
				}
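
The index arithmetic in the vector-load branch above (NewIdx = Idx + I * VecWidth) can be illustrated outside LLVM. The following is a minimal sketch, assuming every chain element is a vector of the same width and that the merged load lays them out back to back in chain order. remapExtractIndex and concatChain are hypothetical names, not functions from the patch.

```cpp
#include <cassert>
#include <vector>

// Flat index into the merged vector for element Idx of the I-th original load.
unsigned remapExtractIndex(unsigned Idx, unsigned I, unsigned VecWidth) {
  assert(Idx < VecWidth && "extract index must lie inside the original vector");
  return Idx + I * VecWidth;
}

// Hypothetical model of the merged load: the original vectors concatenated
// in chain order.
std::vector<int> concatChain(const std::vector<std::vector<int>> &Chain) {
  std::vector<int> Merged;
  for (const auto &V : Chain)
    Merged.insert(Merged.end(), V.begin(), V.end());
  return Merged;
}
```

For example, with two <2 x i32> loads holding {10, 11} and {20, 21}, element 1 of the second load maps to index 3 of the merged vector, which holds 21.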

lib/Transforms/Vectorize/Vectorize.cpp

	Show All 23 Lines
	using namespace llvm;			using namespace llvm;

	/// initializeVectorizationPasses - Initialize all passes linked into the			/// initializeVectorizationPasses - Initialize all passes linked into the
	/// Vectorization library.			/// Vectorization library.
	void llvm::initializeVectorization(PassRegistry &Registry) {			void llvm::initializeVectorization(PassRegistry &Registry) {
	initializeBBVectorizePass(Registry);			initializeBBVectorizePass(Registry);
	initializeLoopVectorizePass(Registry);			initializeLoopVectorizePass(Registry);
	initializeSLPVectorizerPass(Registry);			initializeSLPVectorizerPass(Registry);
				initializeLoadStoreVectorizerPass(Registry);
	}			}

	void LLVMInitializeVectorization(LLVMPassRegistryRef R) {			void LLVMInitializeVectorization(LLVMPassRegistryRef R) {
	initializeVectorization(*unwrap(R));			initializeVectorization(*unwrap(R));
	}			}

	void LLVMAddBBVectorizePass(LLVMPassManagerRef PM) {			void LLVMAddBBVectorizePass(LLVMPassManagerRef PM) {
	unwrap(PM)->add(createBBVectorizePass());			unwrap(PM)->add(createBBVectorizePass());
	Show All 9 Lines

test/Transforms/LoadStoreVectorizer/AMDGPU/extended-index.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -basicaa -load-store-vectorizer -S -o - %s | FileCheck %s

				target datalayout = "e-p:32:32-p1:64:64-p2:64:64-p3:32:32-p4:64:64-p5:32:32-p24:64:64-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64"

				declare i32 @llvm.amdgcn.workitem.id.x() #1

				; CHECK-LABEL: @basic_merge_sext_index(
				; CHECK: sext i32 %id.x to i64
				; CHECK: load <2 x float>
				; CHECK: store <2 x float> zeroinitializer
				define void @basic_merge_sext_index(float addrspace(1)* nocapture %a, float addrspace(1)* nocapture %b, float addrspace(1)* nocapture readonly %c) #0 {
				entry:
				%id.x = call i32 @llvm.amdgcn.workitem.id.x()
				%sext.id.x = sext i32 %id.x to i64
				%a.idx.x = getelementptr inbounds float, float addrspace(1)* %a, i64 %sext.id.x
				%c.idx.x = getelementptr inbounds float, float addrspace(1)* %c, i64 %sext.id.x
				%a.idx.x.1 = getelementptr inbounds float, float addrspace(1)* %a.idx.x, i64 1
				%c.idx.x.1 = getelementptr inbounds float, float addrspace(1)* %c.idx.x, i64 1

				%ld.c = load float, float addrspace(1)* %c.idx.x, align 4
				%ld.c.idx.1 = load float, float addrspace(1)* %c.idx.x.1, align 4

				store float 0.0, float addrspace(1)* %a.idx.x, align 4
				store float 0.0, float addrspace(1)* %a.idx.x.1, align 4

				%add = fadd float %ld.c, %ld.c.idx.1
				store float %add, float addrspace(1)* %b, align 4
				ret void
				}

				; CHECK-LABEL: @basic_merge_zext_index(
				; CHECK: zext i32 %id.x to i64
				; CHECK: load <2 x float>
				; CHECK: store <2 x float>
				define void @basic_merge_zext_index(float addrspace(1)* nocapture %a, float addrspace(1)* nocapture %b, float addrspace(1)* nocapture readonly %c) #0 {
				entry:
				%id.x = call i32 @llvm.amdgcn.workitem.id.x()
				%zext.id.x = zext i32 %id.x to i64
				%a.idx.x = getelementptr inbounds float, float addrspace(1)* %a, i64 %zext.id.x
				%c.idx.x = getelementptr inbounds float, float addrspace(1)* %c, i64 %zext.id.x
				%a.idx.x.1 = getelementptr inbounds float, float addrspace(1)* %a.idx.x, i64 1
				%c.idx.x.1 = getelementptr inbounds float, float addrspace(1)* %c.idx.x, i64 1

				%ld.c = load float, float addrspace(1)* %c.idx.x, align 4
				%ld.c.idx.1 = load float, float addrspace(1)* %c.idx.x.1, align 4
				store float 0.0, float addrspace(1)* %a.idx.x, align 4
				store float 0.0, float addrspace(1)* %a.idx.x.1, align 4

				%add = fadd float %ld.c, %ld.c.idx.1
				store float %add, float addrspace(1)* %b, align 4
				ret void
				}

				; CHECK-LABEL: @merge_op_zext_index(
				; CHECK: load <2 x float>
				; CHECK: store <2 x float>
				define void @merge_op_zext_index(float addrspace(1)* nocapture noalias %a, float addrspace(1)* nocapture noalias %b, float addrspace(1)* nocapture readonly noalias %c) #0 {
				entry:
				%id.x = call i32 @llvm.amdgcn.workitem.id.x()
				%shl = shl i32 %id.x, 2
				%zext.id.x = zext i32 %shl to i64
				%a.0 = getelementptr inbounds float, float addrspace(1)* %a, i64 %zext.id.x
				%c.0 = getelementptr inbounds float, float addrspace(1)* %c, i64 %zext.id.x

				%id.x.1 = or i32 %shl, 1
				%id.x.1.ext = zext i32 %id.x.1 to i64

				%a.1 = getelementptr inbounds float, float addrspace(1)* %a, i64 %id.x.1.ext
				%c.1 = getelementptr inbounds float, float addrspace(1)* %c, i64 %id.x.1.ext

				%ld.c.0 = load float, float addrspace(1)* %c.0, align 4
				store float 0.0, float addrspace(1)* %a.0, align 4
				%ld.c.1 = load float, float addrspace(1)* %c.1, align 4
				store float 0.0, float addrspace(1)* %a.1, align 4

				%add = fadd float %ld.c.0, %ld.c.1
				store float %add, float addrspace(1)* %b, align 4
				ret void
				}

				; CHECK-LABEL: @merge_op_sext_index(
				; CHECK: load <2 x float>
				; CHECK: store <2 x float>
				define void @merge_op_sext_index(float addrspace(1)* nocapture noalias %a, float addrspace(1)* nocapture noalias %b, float addrspace(1)* nocapture readonly noalias %c) #0 {
				entry:
				%id.x = call i32 @llvm.amdgcn.workitem.id.x()
				%shl = shl i32 %id.x, 2
				%zext.id.x = sext i32 %shl to i64
				%a.0 = getelementptr inbounds float, float addrspace(1)* %a, i64 %zext.id.x
				%c.0 = getelementptr inbounds float, float addrspace(1)* %c, i64 %zext.id.x

				%id.x.1 = or i32 %shl, 1
				%id.x.1.ext = sext i32 %id.x.1 to i64

				%a.1 = getelementptr inbounds float, float addrspace(1)* %a, i64 %id.x.1.ext
				%c.1 = getelementptr inbounds float, float addrspace(1)* %c, i64 %id.x.1.ext

				%ld.c.0 = load float, float addrspace(1)* %c.0, align 4
				store float 0.0, float addrspace(1)* %a.0, align 4
				%ld.c.1 = load float, float addrspace(1)* %c.1, align 4
				store float 0.0, float addrspace(1)* %a.1, align 4

				%add = fadd float %ld.c.0, %ld.c.1
				store float %add, float addrspace(1)* %b, align 4
				ret void
				}

				; This case fails to vectorize if not using the extra extension
				; handling in isConsecutiveAccess.

				; CHECK-LABEL: @zext_trunc_phi_1(
				; CHECK: loop:
				; CHECK: load <2 x i32>
				; CHECK: store <2 x i32>
				define void @zext_trunc_phi_1(i32 addrspace(1)* nocapture noalias %a, i32 addrspace(1)* nocapture noalias %b, i32 addrspace(1)* nocapture readonly noalias %c, i32 %n, i64 %arst, i64 %aoeu) #0 {
				entry:
				%cmp0 = icmp eq i32 %n, 0
				br i1 %cmp0, label %exit, label %loop

				loop:
				%indvars.iv = phi i64 [ %indvars.iv.next, %loop ], [ 0, %entry ]
				%trunc.iv = trunc i64 %indvars.iv to i32
				%idx = shl i32 %trunc.iv, 4

				%idx.ext = zext i32 %idx to i64
				%c.0 = getelementptr inbounds i32, i32 addrspace(1)* %c, i64 %idx.ext
				%a.0 = getelementptr inbounds i32, i32 addrspace(1)* %a, i64 %idx.ext

				%idx.1 = or i32 %idx, 1
				%idx.1.ext = zext i32 %idx.1 to i64
				%c.1 = getelementptr inbounds i32, i32 addrspace(1)* %c, i64 %idx.1.ext
				%a.1 = getelementptr inbounds i32, i32 addrspace(1)* %a, i64 %idx.1.ext

				%ld.c.0 = load i32, i32 addrspace(1)* %c.0, align 4
				store i32 %ld.c.0, i32 addrspace(1)* %a.0, align 4
				%ld.c.1 = load i32, i32 addrspace(1)* %c.1, align 4
				store i32 %ld.c.1, i32 addrspace(1)* %a.1, align 4

				%indvars.iv.next = add i64 %indvars.iv, 1
				%lftr.wideiv = trunc i64 %indvars.iv.next to i32

				%exitcond = icmp eq i32 %lftr.wideiv, %n
				br i1 %exitcond, label %exit, label %loop

				exit:
				ret void
				}

				attributes #0 = { nounwind }
				attributes #1 = { nounwind readnone }
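
The merge_op_* and zext_trunc_phi_1 tests above build the second index as an or with 1 on a left-shifted value and then extend both indices to 64 bits. The sketch below shows, under the assumption that the low bit of the 32-bit index is zero, why the two extended indices are exactly one element apart. It is an arithmetic illustration only and makes no claim about how isConsecutiveAccess reaches the same conclusion. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the IR in zext_trunc_phi_1: %idx = shl i32 %trunc.iv, 4.
uint32_t shiftedIndex(uint32_t Iv) { return Iv << 4; }

// True when zext(%idx | 1) - zext(%idx) == 1 for the given induction value.
bool zextIndicesAreConsecutive(uint32_t Iv) {
  uint32_t Idx = shiftedIndex(Iv);
  uint32_t Idx1 = Idx | 1u; // low bits of Idx are zero, so this equals Idx + 1
  uint64_t Ext0 = static_cast<uint64_t>(Idx);  // zext i32 to i64
  uint64_t Ext1 = static_cast<uint64_t>(Idx1); // zext i32 to i64
  return Ext1 - Ext0 == 1;
}
```

Because the shifted index is even, adding 1 cannot wrap in 32 bits, so the relationship also survives the extension for the largest inputs.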

test/Transforms/LoadStoreVectorizer/AMDGPU/insertion-point.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -basicaa -load-store-vectorizer -S -o - %s | FileCheck %s

				target datalayout = "e-p:32:32-p1:64:64-p2:64:64-p3:32:32-p4:64:64-p5:32:32-p24:64:64-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64"

				; Check the position of the inserted vector load relative to the
				; existing adds.

				; CHECK-LABEL: @insert_load_point(
				; CHECK: %z = add i32 %x, 4
				; CHECK: %w = add i32 %y, 9
				; CHECK: load <2 x float>
				; CHECK: %foo = add i32 %z, %w
				define void @insert_load_point(float addrspace(1)* nocapture %a, float addrspace(1)* nocapture %b, float addrspace(1)* nocapture readonly %c, i64 %idx, i32 %x, i32 %y) #0 {
				entry:
				%a.idx.x = getelementptr inbounds float, float addrspace(1)* %a, i64 %idx
				%c.idx.x = getelementptr inbounds float, float addrspace(1)* %c, i64 %idx
				%a.idx.x.1 = getelementptr inbounds float, float addrspace(1)* %a.idx.x, i64 1
				%c.idx.x.1 = getelementptr inbounds float, float addrspace(1)* %c.idx.x, i64 1

				%z = add i32 %x, 4
				%ld.c = load float, float addrspace(1)* %c.idx.x, align 4
				%w = add i32 %y, 9
				%ld.c.idx.1 = load float, float addrspace(1)* %c.idx.x.1, align 4
				%foo = add i32 %z, %w

				store float 0.0, float addrspace(1)* %a.idx.x, align 4
				store float 0.0, float addrspace(1)* %a.idx.x.1, align 4

				%add = fadd float %ld.c, %ld.c.idx.1
				store float %add, float addrspace(1)* %b, align 4
				store i32 %foo, i32 addrspace(3)* null, align 4
				ret void
				}

				; CHECK-LABEL: @insert_store_point(
				; CHECK: %z = add i32 %x, 4
				; CHECK: %w = add i32 %y, 9
				; CHECK: store <2 x float>
				; CHECK: %foo = add i32 %z, %w
				define void @insert_store_point(float addrspace(1)* nocapture %a, float addrspace(1)* nocapture %b, float addrspace(1)* nocapture readonly %c, i64 %idx, i32 %x, i32 %y) #0 {
				entry:
				%a.idx.x = getelementptr inbounds float, float addrspace(1)* %a, i64 %idx
				%c.idx.x = getelementptr inbounds float, float addrspace(1)* %c, i64 %idx
				%a.idx.x.1 = getelementptr inbounds float, float addrspace(1)* %a.idx.x, i64 1
				%c.idx.x.1 = getelementptr inbounds float, float addrspace(1)* %c.idx.x, i64 1

				%ld.c = load float, float addrspace(1)* %c.idx.x, align 4
				%ld.c.idx.1 = load float, float addrspace(1)* %c.idx.x.1, align 4

				%z = add i32 %x, 4
				store float 0.0, float addrspace(1)* %a.idx.x, align 4
				%w = add i32 %y, 9
				store float 0.0, float addrspace(1)* %a.idx.x.1, align 4
				%foo = add i32 %z, %w

				%add = fadd float %ld.c, %ld.c.idx.1
				store float %add, float addrspace(1)* %b, align 4
				store i32 %foo, i32 addrspace(3)* null, align 4
				ret void
				}

				attributes #0 = { nounwind }

test/Transforms/LoadStoreVectorizer/AMDGPU/interleaved-mayalias-store.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -basicaa -load-store-vectorizer -S -o - %s | FileCheck %s

				target datalayout = "e-p:32:32-p1:64:64-p2:64:64-p3:32:32-p4:64:64-p5:32:32-p24:64:64-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64"

				; It is OK to vectorize the load as long as the may-alias store
				; occurs before the vector load.

				; CHECK: store double 0.000000e+00, double addrspace(1)* %a,
				; CHECK: load <2 x double>
				; CHECK: store double 0.000000e+00, double addrspace(1)* %a.idx.1
				define void @interleave(double addrspace(1)* nocapture %a, double addrspace(1)* nocapture %b, double addrspace(1)* nocapture readonly %c) #0 {
				entry:
				%a.idx.1 = getelementptr inbounds double, double addrspace(1)* %a, i64 1
				%c.idx.1 = getelementptr inbounds double, double addrspace(1)* %c, i64 1

				%ld.c = load double, double addrspace(1)* %c, align 8 ; may alias store to %a
				store double 0.0, double addrspace(1)* %a, align 8

				%ld.c.idx.1 = load double, double addrspace(1)* %c.idx.1, align 8 ; may alias store to %a
				store double 0.0, double addrspace(1)* %a.idx.1, align 8

				%add = fadd double %ld.c, %ld.c.idx.1
				store double %add, double addrspace(1)* %b

				ret void
				}

				attributes #0 = { nounwind }

test/Transforms/LoadStoreVectorizer/AMDGPU/lit.local.cfg

This file was added.

				if not 'AMDGPU' in config.root.targets:
				    config.unsupported = True

test/Transforms/LoadStoreVectorizer/AMDGPU/merge-stores.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -load-store-vectorizer -S -o - %s | FileCheck %s
				; Copy of test/CodeGen/AMDGPU/merge-stores.ll with some additions

				; TODO: Vector element tests
				; TODO: Non-zero base offset for load and store combinations
				; TODO: Same base addrspacecasted


				; CHECK-LABEL: @merge_global_store_2_constants_i8(
				; CHECK: store <2 x i8> <i8 -56, i8 123>, <2 x i8> addrspace(1)* %{{[0-9]+}}, align 2
				define void @merge_global_store_2_constants_i8(i8 addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr i8, i8 addrspace(1)* %out, i32 1

				store i8 123, i8 addrspace(1)* %out.gep.1
				store i8 456, i8 addrspace(1)* %out, align 2
				ret void
				}
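
The test above stores the literal 456 through an i8 pointer while its CHECK line expects i8 -56. The two values are consistent if only the low 8 bits of the literal are kept and read back as a signed byte. The sketch below shows that arithmetic; wrapToI8 is a hypothetical helper, and the assumption that the IR parser truncated the literal this way is inferred from the test, not stated in it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: interprets the low 8 bits of Value as a signed i8.
int wrapToI8(int64_t Value) {
  int Low = static_cast<int>(Value & 0xFF); // keep the low 8 bits: 0..255
  return Low >= 128 ? Low - 256 : Low;      // two's-complement reinterpretation
}
```

Here 456 keeps the bit pattern 200 (0xC8), which reads as -56 when signed, while 123 fits in 8 bits and is unchanged.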

				; CHECK-LABEL: @merge_global_store_2_constants_i8_natural_align
				; CHECK: store <2 x i8> <i8 -56, i8 123>, <2 x i8> addrspace(1)* %{{[0-9]+$}}
				define void @merge_global_store_2_constants_i8_natural_align(i8 addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr i8, i8 addrspace(1)* %out, i32 1

				store i8 123, i8 addrspace(1)* %out.gep.1
				store i8 456, i8 addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @merge_global_store_2_constants_i16
				; CHECK: store <2 x i16> <i16 456, i16 123>, <2 x i16> addrspace(1)* %{{[0-9]+}}, align 4
				define void @merge_global_store_2_constants_i16(i16 addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr i16, i16 addrspace(1)* %out, i32 1

				store i16 123, i16 addrspace(1)* %out.gep.1
				store i16 456, i16 addrspace(1)* %out, align 4
				ret void
				}

				; CHECK-LABEL: @merge_global_store_2_constants_0_i16
				; CHECK: store <2 x i16> zeroinitializer, <2 x i16> addrspace(1)* %{{[0-9]+}}, align 4
				define void @merge_global_store_2_constants_0_i16(i16 addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr i16, i16 addrspace(1)* %out, i32 1

				store i16 0, i16 addrspace(1)* %out.gep.1
				store i16 0, i16 addrspace(1)* %out, align 4
				ret void
				}

				; CHECK-LABEL: @merge_global_store_2_constants_i16_natural_align
				; CHECK: store <2 x i16> <i16 456, i16 123>, <2 x i16> addrspace(1)* %{{[0-9]+$}}
				define void @merge_global_store_2_constants_i16_natural_align(i16 addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr i16, i16 addrspace(1)* %out, i32 1

				store i16 123, i16 addrspace(1)* %out.gep.1
				store i16 456, i16 addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @merge_global_store_2_constants_half_natural_align
				; CHECK: store <2 x half> <half 0xH3C00, half 0xH4000>, <2 x half> addrspace(1)* %{{[0-9]+$}}
				define void @merge_global_store_2_constants_half_natural_align(half addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr half, half addrspace(1)* %out, i32 1

				store half 2.0, half addrspace(1)* %out.gep.1
				store half 1.0, half addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @merge_global_store_2_constants_i32
				; CHECK: store <2 x i32> <i32 456, i32 123>, <2 x i32> addrspace(1)* %{{[0-9]+$}}
				define void @merge_global_store_2_constants_i32(i32 addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1

				store i32 123, i32 addrspace(1)* %out.gep.1
				store i32 456, i32 addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @merge_global_store_2_constants_i32_f32
				; CHECK: store <2 x i32> <i32 456, i32 1065353216>, <2 x i32> addrspace(1)* %{{[0-9]+$}}
				define void @merge_global_store_2_constants_i32_f32(i32 addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1
				%out.gep.1.bc = bitcast i32 addrspace(1)* %out.gep.1 to float addrspace(1)*
				store float 1.0, float addrspace(1)* %out.gep.1.bc
				store i32 456, i32 addrspace(1)* %out
				ret void
				}
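
In the test above, a store of float 1.0 is merged into a <2 x i32> store, and the CHECK line expects the element i32 1065353216. That number is the IEEE-754 single-precision bit pattern of 1.0 (0x3F800000) read as an integer. A minimal sketch of the reinterpretation follows, assuming a 32-bit IEEE-754 float on the host. floatBitsAsI32 is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterprets a 32-bit float as the integer with the same bits, in the
// spirit of an IR bitcast from float to i32.
uint32_t floatBitsAsI32(float F) {
  static_assert(sizeof(float) == sizeof(uint32_t),
                "assumes a 32-bit IEEE-754 float");
  uint32_t Bits;
  std::memcpy(&Bits, &F, sizeof(Bits)); // copy the bytes, no value conversion
  return Bits;
}
```

std::memcpy is used here because it copies the object representation without the undefined behavior of a pointer cast.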

				; CHECK-LABEL: @merge_global_store_2_constants_f32_i32
				; CHECK: store <2 x float> <float 4.000000e+00, float 0x370EC00000000000>, <2 x float> addrspace(1)* %{{[0-9]+$}}
				define void @merge_global_store_2_constants_f32_i32(float addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr float, float addrspace(1)* %out, i32 1
				%out.gep.1.bc = bitcast float addrspace(1)* %out.gep.1 to i32 addrspace(1)*
				store i32 123, i32 addrspace(1)* %out.gep.1.bc
				store float 4.0, float addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @merge_global_store_4_constants_i32
				; CHECK: store <4 x i32> <i32 1234, i32 123, i32 456, i32 333>, <4 x i32> addrspace(1)* %{{[0-9]+$}}
				define void @merge_global_store_4_constants_i32(i32 addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1
				%out.gep.2 = getelementptr i32, i32 addrspace(1)* %out, i32 2
				%out.gep.3 = getelementptr i32, i32 addrspace(1)* %out, i32 3

				store i32 123, i32 addrspace(1)* %out.gep.1
				store i32 456, i32 addrspace(1)* %out.gep.2
				store i32 333, i32 addrspace(1)* %out.gep.3
				store i32 1234, i32 addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @merge_global_store_4_constants_f32_order
				; CHECK: store <4 x float> <float 8.000000e+00, float 1.000000e+00, float 2.000000e+00, float 4.000000e+00>, <4 x float> addrspace(1)* %{{[0-9]+}}
				define void @merge_global_store_4_constants_f32_order(float addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr float, float addrspace(1)* %out, i32 1
				%out.gep.2 = getelementptr float, float addrspace(1)* %out, i32 2
				%out.gep.3 = getelementptr float, float addrspace(1)* %out, i32 3

				store float 8.0, float addrspace(1)* %out
				store float 1.0, float addrspace(1)* %out.gep.1
				store float 2.0, float addrspace(1)* %out.gep.2
				store float 4.0, float addrspace(1)* %out.gep.3
				ret void
				}

				; First store is out of order.
				; CHECK-LABEL: @merge_global_store_4_constants_f32
				; CHECK: store <4 x float> <float 8.000000e+00, float 1.000000e+00, float 2.000000e+00, float 4.000000e+00>, <4 x float> addrspace(1)* %{{[0-9]+$}}
				define void @merge_global_store_4_constants_f32(float addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr float, float addrspace(1)* %out, i32 1
				%out.gep.2 = getelementptr float, float addrspace(1)* %out, i32 2
				%out.gep.3 = getelementptr float, float addrspace(1)* %out, i32 3

				store float 1.0, float addrspace(1)* %out.gep.1
				store float 2.0, float addrspace(1)* %out.gep.2
				store float 4.0, float addrspace(1)* %out.gep.3
				store float 8.0, float addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @merge_global_store_4_constants_mixed_i32_f32
				; CHECK: store <4 x float> <float 8.000000e+00, float 0x36D6000000000000, float 2.000000e+00, float 0x36E1000000000000>, <4 x float> addrspace(1)* %{{[0-9]+}}
				define void @merge_global_store_4_constants_mixed_i32_f32(float addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr float, float addrspace(1)* %out, i32 1
				%out.gep.2 = getelementptr float, float addrspace(1)* %out, i32 2
				%out.gep.3 = getelementptr float, float addrspace(1)* %out, i32 3

				%out.gep.1.bc = bitcast float addrspace(1)* %out.gep.1 to i32 addrspace(1)*
				%out.gep.3.bc = bitcast float addrspace(1)* %out.gep.3 to i32 addrspace(1)*

				store i32 11, i32 addrspace(1)* %out.gep.1.bc
				store float 2.0, float addrspace(1)* %out.gep.2
				store i32 17, i32 addrspace(1)* %out.gep.3.bc
				store float 8.0, float addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @merge_global_store_3_constants_i32
				; CHECK: store <3 x i32> <i32 1234, i32 123, i32 456>, <3 x i32> addrspace(1)* %{{[0-9]+$}}
				define void @merge_global_store_3_constants_i32(i32 addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1
				%out.gep.2 = getelementptr i32, i32 addrspace(1)* %out, i32 2

				store i32 123, i32 addrspace(1)* %out.gep.1
				store i32 456, i32 addrspace(1)* %out.gep.2
				store i32 1234, i32 addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @merge_global_store_2_constants_i64
				; CHECK: store <2 x i64> <i64 456, i64 123>, <2 x i64> addrspace(1)* %{{[0-9]+$}}
				define void @merge_global_store_2_constants_i64(i64 addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr i64, i64 addrspace(1)* %out, i64 1

				store i64 123, i64 addrspace(1)* %out.gep.1
				store i64 456, i64 addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @merge_global_store_4_constants_i64
				; CHECK: store <2 x i64> <i64 456, i64 333>, <2 x i64> addrspace(1)* %{{[0-9]+$}}
				; CHECK: store <2 x i64> <i64 1234, i64 123>, <2 x i64> addrspace(1)* %{{[0-9]+$}}
				define void @merge_global_store_4_constants_i64(i64 addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr i64, i64 addrspace(1)* %out, i64 1
				%out.gep.2 = getelementptr i64, i64 addrspace(1)* %out, i64 2
				%out.gep.3 = getelementptr i64, i64 addrspace(1)* %out, i64 3

				store i64 123, i64 addrspace(1)* %out.gep.1
				store i64 456, i64 addrspace(1)* %out.gep.2
				store i64 333, i64 addrspace(1)* %out.gep.3
				store i64 1234, i64 addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @merge_global_store_2_adjacent_loads_i32
				; CHECK: [[LOAD:%[0-9]+]] = load <2 x i32>
				; CHECK: [[ELT0:%[0-9]+]] = extractelement <2 x i32> [[LOAD]], i32 0
				; CHECK: [[ELT1:%[0-9]+]] = extractelement <2 x i32> [[LOAD]], i32 1
				; CHECK: [[INSERT0:%[0-9]+]] = insertelement <2 x i32> undef, i32 [[ELT0]], i32 0
				; CHECK: [[INSERT1:%[0-9]+]] = insertelement <2 x i32> [[INSERT0]], i32 [[ELT1]], i32 1
				; CHECK: store <2 x i32> [[INSERT1]]
				define void @merge_global_store_2_adjacent_loads_i32(i32 addrspace(1)* %out, i32 addrspace(1)* %in) #0 {
				%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1
				%in.gep.1 = getelementptr i32, i32 addrspace(1)* %in, i32 1

				%lo = load i32, i32 addrspace(1)* %in
				%hi = load i32, i32 addrspace(1)* %in.gep.1

				store i32 %lo, i32 addrspace(1)* %out
				store i32 %hi, i32 addrspace(1)* %out.gep.1
				ret void
				}

				; CHECK-LABEL: @merge_global_store_2_adjacent_loads_i32_nonzero_base
				; CHECK: extractelement
				; CHECK: extractelement
				; CHECK: insertelement
				; CHECK: insertelement
				; CHECK: store <2 x i32>
				define void @merge_global_store_2_adjacent_loads_i32_nonzero_base(i32 addrspace(1)* %out, i32 addrspace(1)* %in) #0 {
				%in.gep.0 = getelementptr i32, i32 addrspace(1)* %in, i32 2
				%in.gep.1 = getelementptr i32, i32 addrspace(1)* %in, i32 3

				%out.gep.0 = getelementptr i32, i32 addrspace(1)* %out, i32 2
				%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 3
				%lo = load i32, i32 addrspace(1)* %in.gep.0
				%hi = load i32, i32 addrspace(1)* %in.gep.1

				store i32 %lo, i32 addrspace(1)* %out.gep.0
				store i32 %hi, i32 addrspace(1)* %out.gep.1
				ret void
				}

				; CHECK-LABEL: @merge_global_store_2_adjacent_loads_shuffle_i32
				; CHECK: [[LOAD:%[0-9]+]] = load <2 x i32>
				; CHECK: [[ELT0:%[0-9]+]] = extractelement <2 x i32> [[LOAD]], i32 0
				; CHECK: [[ELT1:%[0-9]+]] = extractelement <2 x i32> [[LOAD]], i32 1
				; CHECK: [[INSERT0:%[0-9]+]] = insertelement <2 x i32> undef, i32 [[ELT1]], i32 0
				; CHECK: [[INSERT1:%[0-9]+]] = insertelement <2 x i32> [[INSERT0]], i32 [[ELT0]], i32 1
				; CHECK: store <2 x i32> [[INSERT1]]
				define void @merge_global_store_2_adjacent_loads_shuffle_i32(i32 addrspace(1)* %out, i32 addrspace(1)* %in) #0 {
				%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1
				%in.gep.1 = getelementptr i32, i32 addrspace(1)* %in, i32 1

				%lo = load i32, i32 addrspace(1)* %in
				%hi = load i32, i32 addrspace(1)* %in.gep.1

				store i32 %hi, i32 addrspace(1)* %out
				store i32 %lo, i32 addrspace(1)* %out.gep.1
				ret void
				}

				; CHECK-LABEL: @merge_global_store_4_adjacent_loads_i32
				; CHECK: load <4 x i32>
				; CHECK: store <4 x i32>
				define void @merge_global_store_4_adjacent_loads_i32(i32 addrspace(1)* %out, i32 addrspace(1)* %in) #0 {
				%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1
				%out.gep.2 = getelementptr i32, i32 addrspace(1)* %out, i32 2
				%out.gep.3 = getelementptr i32, i32 addrspace(1)* %out, i32 3
				%in.gep.1 = getelementptr i32, i32 addrspace(1)* %in, i32 1
				%in.gep.2 = getelementptr i32, i32 addrspace(1)* %in, i32 2
				%in.gep.3 = getelementptr i32, i32 addrspace(1)* %in, i32 3

				%x = load i32, i32 addrspace(1)* %in
				%y = load i32, i32 addrspace(1)* %in.gep.1
				%z = load i32, i32 addrspace(1)* %in.gep.2
				%w = load i32, i32 addrspace(1)* %in.gep.3

				store i32 %x, i32 addrspace(1)* %out
				store i32 %y, i32 addrspace(1)* %out.gep.1
				store i32 %z, i32 addrspace(1)* %out.gep.2
				store i32 %w, i32 addrspace(1)* %out.gep.3
				ret void
				}

				; CHECK-LABEL: @merge_global_store_3_adjacent_loads_i32
				; CHECK: load <3 x i32>
				; CHECK: store <3 x i32>
				define void @merge_global_store_3_adjacent_loads_i32(i32 addrspace(1)* %out, i32 addrspace(1)* %in) #0 {
				%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1
				%out.gep.2 = getelementptr i32, i32 addrspace(1)* %out, i32 2
				%in.gep.1 = getelementptr i32, i32 addrspace(1)* %in, i32 1
				%in.gep.2 = getelementptr i32, i32 addrspace(1)* %in, i32 2

				%x = load i32, i32 addrspace(1)* %in
				%y = load i32, i32 addrspace(1)* %in.gep.1
				%z = load i32, i32 addrspace(1)* %in.gep.2

				store i32 %x, i32 addrspace(1)* %out
				store i32 %y, i32 addrspace(1)* %out.gep.1
				store i32 %z, i32 addrspace(1)* %out.gep.2
				ret void
				}

				; CHECK-LABEL: @merge_global_store_4_adjacent_loads_f32
				; CHECK: load <4 x float>
				; CHECK: store <4 x float>
				define void @merge_global_store_4_adjacent_loads_f32(float addrspace(1)* %out, float addrspace(1)* %in) #0 {
				%out.gep.1 = getelementptr float, float addrspace(1)* %out, i32 1
				%out.gep.2 = getelementptr float, float addrspace(1)* %out, i32 2
				%out.gep.3 = getelementptr float, float addrspace(1)* %out, i32 3
				%in.gep.1 = getelementptr float, float addrspace(1)* %in, i32 1
				%in.gep.2 = getelementptr float, float addrspace(1)* %in, i32 2
				%in.gep.3 = getelementptr float, float addrspace(1)* %in, i32 3

				%x = load float, float addrspace(1)* %in
				%y = load float, float addrspace(1)* %in.gep.1
				%z = load float, float addrspace(1)* %in.gep.2
				%w = load float, float addrspace(1)* %in.gep.3

				store float %x, float addrspace(1)* %out
				store float %y, float addrspace(1)* %out.gep.1
				store float %z, float addrspace(1)* %out.gep.2
				store float %w, float addrspace(1)* %out.gep.3
				ret void
				}

				; CHECK-LABEL: @merge_global_store_4_adjacent_loads_i32_nonzero_base
				; CHECK: load <4 x i32>
				; CHECK: store <4 x i32>
				define void @merge_global_store_4_adjacent_loads_i32_nonzero_base(i32 addrspace(1)* %out, i32 addrspace(1)* %in) #0 {
				%in.gep.0 = getelementptr i32, i32 addrspace(1)* %in, i32 11
				%in.gep.1 = getelementptr i32, i32 addrspace(1)* %in, i32 12
				%in.gep.2 = getelementptr i32, i32 addrspace(1)* %in, i32 13
				%in.gep.3 = getelementptr i32, i32 addrspace(1)* %in, i32 14
				%out.gep.0 = getelementptr i32, i32 addrspace(1)* %out, i32 7
				%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 8
				%out.gep.2 = getelementptr i32, i32 addrspace(1)* %out, i32 9
				%out.gep.3 = getelementptr i32, i32 addrspace(1)* %out, i32 10

				%x = load i32, i32 addrspace(1)* %in.gep.0
				%y = load i32, i32 addrspace(1)* %in.gep.1
				%z = load i32, i32 addrspace(1)* %in.gep.2
				%w = load i32, i32 addrspace(1)* %in.gep.3

				store i32 %x, i32 addrspace(1)* %out.gep.0
				store i32 %y, i32 addrspace(1)* %out.gep.1
				store i32 %z, i32 addrspace(1)* %out.gep.2
				store i32 %w, i32 addrspace(1)* %out.gep.3
				ret void
				}

				; CHECK-LABEL: @merge_global_store_4_adjacent_loads_inverse_i32
				; CHECK: load <4 x i32>
				; CHECK: store <4 x i32>
				define void @merge_global_store_4_adjacent_loads_inverse_i32(i32 addrspace(1)* %out, i32 addrspace(1)* %in) #0 {
				%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1
				%out.gep.2 = getelementptr i32, i32 addrspace(1)* %out, i32 2
				%out.gep.3 = getelementptr i32, i32 addrspace(1)* %out, i32 3
				%in.gep.1 = getelementptr i32, i32 addrspace(1)* %in, i32 1
				%in.gep.2 = getelementptr i32, i32 addrspace(1)* %in, i32 2
				%in.gep.3 = getelementptr i32, i32 addrspace(1)* %in, i32 3

				%x = load i32, i32 addrspace(1)* %in
				%y = load i32, i32 addrspace(1)* %in.gep.1
				%z = load i32, i32 addrspace(1)* %in.gep.2
				%w = load i32, i32 addrspace(1)* %in.gep.3

				; Make sure the barrier doesn't stop this
				tail call void @llvm.amdgcn.s.barrier() #1

				store i32 %w, i32 addrspace(1)* %out.gep.3
				store i32 %z, i32 addrspace(1)* %out.gep.2
				store i32 %y, i32 addrspace(1)* %out.gep.1
				store i32 %x, i32 addrspace(1)* %out

				ret void
				}

				; CHECK-LABEL: @merge_global_store_4_adjacent_loads_shuffle_i32
				; CHECK: load <4 x i32>
				; CHECK: store <4 x i32>
				define void @merge_global_store_4_adjacent_loads_shuffle_i32(i32 addrspace(1)* %out, i32 addrspace(1)* %in) #0 {
				%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1
				%out.gep.2 = getelementptr i32, i32 addrspace(1)* %out, i32 2
				%out.gep.3 = getelementptr i32, i32 addrspace(1)* %out, i32 3
				%in.gep.1 = getelementptr i32, i32 addrspace(1)* %in, i32 1
				%in.gep.2 = getelementptr i32, i32 addrspace(1)* %in, i32 2
				%in.gep.3 = getelementptr i32, i32 addrspace(1)* %in, i32 3

				%x = load i32, i32 addrspace(1)* %in
				%y = load i32, i32 addrspace(1)* %in.gep.1
				%z = load i32, i32 addrspace(1)* %in.gep.2
				%w = load i32, i32 addrspace(1)* %in.gep.3

				; Make sure the barrier doesn't stop this
				tail call void @llvm.amdgcn.s.barrier() #1

				store i32 %w, i32 addrspace(1)* %out
				store i32 %z, i32 addrspace(1)* %out.gep.1
				store i32 %y, i32 addrspace(1)* %out.gep.2
				store i32 %x, i32 addrspace(1)* %out.gep.3

				ret void
				}

				; CHECK-LABEL: @merge_global_store_4_adjacent_loads_i8
				; CHECK: load <4 x i8>
				; CHECK: extractelement <4 x i8>
				; CHECK: extractelement <4 x i8>
				; CHECK: extractelement <4 x i8>
				; CHECK: extractelement <4 x i8>
				; CHECK: insertelement <4 x i8>
				; CHECK: insertelement <4 x i8>
				; CHECK: insertelement <4 x i8>
				; CHECK: insertelement <4 x i8>
				; CHECK: store <4 x i8>
				define void @merge_global_store_4_adjacent_loads_i8(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #0 {
				%out.gep.1 = getelementptr i8, i8 addrspace(1)* %out, i8 1
				%out.gep.2 = getelementptr i8, i8 addrspace(1)* %out, i8 2
				%out.gep.3 = getelementptr i8, i8 addrspace(1)* %out, i8 3
				%in.gep.1 = getelementptr i8, i8 addrspace(1)* %in, i8 1
				%in.gep.2 = getelementptr i8, i8 addrspace(1)* %in, i8 2
				%in.gep.3 = getelementptr i8, i8 addrspace(1)* %in, i8 3

				%x = load i8, i8 addrspace(1)* %in, align 4
				%y = load i8, i8 addrspace(1)* %in.gep.1
				%z = load i8, i8 addrspace(1)* %in.gep.2
				%w = load i8, i8 addrspace(1)* %in.gep.3

				store i8 %x, i8 addrspace(1)* %out, align 4
				store i8 %y, i8 addrspace(1)* %out.gep.1
				store i8 %z, i8 addrspace(1)* %out.gep.2
				store i8 %w, i8 addrspace(1)* %out.gep.3
				ret void
				}

				; CHECK-LABEL: @merge_global_store_4_adjacent_loads_i8_natural_align
				; CHECK: load <4 x i8>
				; CHECK: store <4 x i8>
				define void @merge_global_store_4_adjacent_loads_i8_natural_align(i8 addrspace(1)* %out, i8 addrspace(1)* %in) #0 {
				%out.gep.1 = getelementptr i8, i8 addrspace(1)* %out, i8 1
				%out.gep.2 = getelementptr i8, i8 addrspace(1)* %out, i8 2
				%out.gep.3 = getelementptr i8, i8 addrspace(1)* %out, i8 3
				%in.gep.1 = getelementptr i8, i8 addrspace(1)* %in, i8 1
				%in.gep.2 = getelementptr i8, i8 addrspace(1)* %in, i8 2
				%in.gep.3 = getelementptr i8, i8 addrspace(1)* %in, i8 3

				%x = load i8, i8 addrspace(1)* %in
				%y = load i8, i8 addrspace(1)* %in.gep.1
				%z = load i8, i8 addrspace(1)* %in.gep.2
				%w = load i8, i8 addrspace(1)* %in.gep.3

				store i8 %x, i8 addrspace(1)* %out
				store i8 %y, i8 addrspace(1)* %out.gep.1
				store i8 %z, i8 addrspace(1)* %out.gep.2
				store i8 %w, i8 addrspace(1)* %out.gep.3
				ret void
				}

				; CHECK-LABEL: @merge_global_store_4_vector_elts_loads_v4i32
				; CHECK: load <4 x i32>
				; CHECK: store <4 x i32>
				define void @merge_global_store_4_vector_elts_loads_v4i32(i32 addrspace(1)* %out, <4 x i32> addrspace(1)* %in) #0 {
				%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1
				%out.gep.2 = getelementptr i32, i32 addrspace(1)* %out, i32 2
				%out.gep.3 = getelementptr i32, i32 addrspace(1)* %out, i32 3
				%vec = load <4 x i32>, <4 x i32> addrspace(1)* %in

				%x = extractelement <4 x i32> %vec, i32 0
				%y = extractelement <4 x i32> %vec, i32 1
				%z = extractelement <4 x i32> %vec, i32 2
				%w = extractelement <4 x i32> %vec, i32 3

				store i32 %x, i32 addrspace(1)* %out
				store i32 %y, i32 addrspace(1)* %out.gep.1
				store i32 %z, i32 addrspace(1)* %out.gep.2
				store i32 %w, i32 addrspace(1)* %out.gep.3
				ret void
				}

				; CHECK-LABEL: @merge_local_store_2_constants_i8
				; CHECK: store <2 x i8> <i8 -56, i8 123>, <2 x i8> addrspace(3)* %{{[0-9]+}}, align 2
				define void @merge_local_store_2_constants_i8(i8 addrspace(3)* %out) #0 {
				%out.gep.1 = getelementptr i8, i8 addrspace(3)* %out, i32 1

				store i8 123, i8 addrspace(3)* %out.gep.1
				store i8 456, i8 addrspace(3)* %out, align 2
				ret void
				}

				; CHECK-LABEL: @merge_local_store_2_constants_i32
				; CHECK: store <2 x i32> <i32 456, i32 123>, <2 x i32> addrspace(3)* %{{[0-9]+$}}
				define void @merge_local_store_2_constants_i32(i32 addrspace(3)* %out) #0 {
				%out.gep.1 = getelementptr i32, i32 addrspace(3)* %out, i32 1

				store i32 123, i32 addrspace(3)* %out.gep.1
				store i32 456, i32 addrspace(3)* %out
				ret void
				}

				; CHECK-LABEL: @merge_local_store_2_constants_i32_align_2
				; CHECK: store i32
				; CHECK: store i32
				define void @merge_local_store_2_constants_i32_align_2(i32 addrspace(3)* %out) #0 {
				%out.gep.1 = getelementptr i32, i32 addrspace(3)* %out, i32 1

				store i32 123, i32 addrspace(3)* %out.gep.1, align 2
				store i32 456, i32 addrspace(3)* %out, align 2
				ret void
				}

				; CHECK-LABEL: @merge_local_store_4_constants_i32
				; CHECK: store <4 x i32> <i32 1234, i32 123, i32 456, i32 333>, <4 x i32> addrspace(3)*
				define void @merge_local_store_4_constants_i32(i32 addrspace(3)* %out) #0 {
				%out.gep.1 = getelementptr i32, i32 addrspace(3)* %out, i32 1
				%out.gep.2 = getelementptr i32, i32 addrspace(3)* %out, i32 2
				%out.gep.3 = getelementptr i32, i32 addrspace(3)* %out, i32 3

				store i32 123, i32 addrspace(3)* %out.gep.1
				store i32 456, i32 addrspace(3)* %out.gep.2
				store i32 333, i32 addrspace(3)* %out.gep.3
				store i32 1234, i32 addrspace(3)* %out
				ret void
				}

				; CHECK-LABEL: @merge_global_store_5_constants_i32
				; CHECK: store <4 x i32> <i32 9, i32 12, i32 16, i32 -12>, <4 x i32> addrspace(1)* %{{[0-9]+}}, align 4
				; CHECK: store i32
				define void @merge_global_store_5_constants_i32(i32 addrspace(1)* %out) {
				store i32 9, i32 addrspace(1)* %out, align 4
				%idx1 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 1
				store i32 12, i32 addrspace(1)* %idx1, align 4
				%idx2 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 2
				store i32 16, i32 addrspace(1)* %idx2, align 4
				%idx3 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 3
				store i32 -12, i32 addrspace(1)* %idx3, align 4
				%idx4 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 4
				store i32 11, i32 addrspace(1)* %idx4, align 4
				ret void
				}

				; CHECK-LABEL: @merge_global_store_6_constants_i32
				; CHECK: store <4 x i32> <i32 13, i32 15, i32 62, i32 63>, <4 x i32> addrspace(1)* %{{[0-9]+}}, align 4
				; CHECK: store <2 x i32> <i32 11, i32 123>, <2 x i32> addrspace(1)* %{{[0-9]+}}, align 4
				define void @merge_global_store_6_constants_i32(i32 addrspace(1)* %out) {
				store i32 13, i32 addrspace(1)* %out, align 4
				%idx1 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 1
				store i32 15, i32 addrspace(1)* %idx1, align 4
				%idx2 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 2
				store i32 62, i32 addrspace(1)* %idx2, align 4
				%idx3 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 3
				store i32 63, i32 addrspace(1)* %idx3, align 4
				%idx4 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 4
				store i32 11, i32 addrspace(1)* %idx4, align 4
				%idx5 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 5
				store i32 123, i32 addrspace(1)* %idx5, align 4
				ret void
				}

				; CHECK-LABEL: @merge_global_store_7_constants_i32
				; CHECK: store <4 x i32> <i32 34, i32 999, i32 65, i32 33>, <4 x i32> addrspace(1)* %{{[0-9]+}}, align 4
				; CHECK: store <3 x i32> <i32 98, i32 91, i32 212>, <3 x i32> addrspace(1)* %{{[0-9]+}}, align 4
				define void @merge_global_store_7_constants_i32(i32 addrspace(1)* %out) {
				store i32 34, i32 addrspace(1)* %out, align 4
				%idx1 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 1
				store i32 999, i32 addrspace(1)* %idx1, align 4
				%idx2 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 2
				store i32 65, i32 addrspace(1)* %idx2, align 4
				%idx3 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 3
				store i32 33, i32 addrspace(1)* %idx3, align 4
				%idx4 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 4
				store i32 98, i32 addrspace(1)* %idx4, align 4
				%idx5 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 5
				store i32 91, i32 addrspace(1)* %idx5, align 4
				%idx6 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 6
				store i32 212, i32 addrspace(1)* %idx6, align 4
				ret void
				}

				; CHECK-LABEL: @merge_global_store_8_constants_i32
				; CHECK: store <4 x i32> <i32 34, i32 999, i32 65, i32 33>, <4 x i32> addrspace(1)* %{{[0-9]+}}, align 4
				; CHECK: store <4 x i32> <i32 98, i32 91, i32 212, i32 999>, <4 x i32> addrspace(1)* %{{[0-9]+}}, align 4
				define void @merge_global_store_8_constants_i32(i32 addrspace(1)* %out) {
				store i32 34, i32 addrspace(1)* %out, align 4
				%idx1 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 1
				store i32 999, i32 addrspace(1)* %idx1, align 4
				%idx2 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 2
				store i32 65, i32 addrspace(1)* %idx2, align 4
				%idx3 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 3
				store i32 33, i32 addrspace(1)* %idx3, align 4
				%idx4 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 4
				store i32 98, i32 addrspace(1)* %idx4, align 4
				%idx5 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 5
				store i32 91, i32 addrspace(1)* %idx5, align 4
				%idx6 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 6
				store i32 212, i32 addrspace(1)* %idx6, align 4
				%idx7 = getelementptr inbounds i32, i32 addrspace(1)* %out, i64 7
				store i32 999, i32 addrspace(1)* %idx7, align 4
				ret void
				}

				; CHECK-LABEL: @copy_v3i32_align4
				; CHECK: %vec = load <3 x i32>, <3 x i32> addrspace(1)* %in, align 4
				; CHECK: store <3 x i32> %vec, <3 x i32> addrspace(1)* %out
				define void @copy_v3i32_align4(<3 x i32> addrspace(1)* noalias %out, <3 x i32> addrspace(1)* noalias %in) #0 {
				%vec = load <3 x i32>, <3 x i32> addrspace(1)* %in, align 4
				store <3 x i32> %vec, <3 x i32> addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @copy_v3i64_align4
				; CHECK: %vec = load <3 x i64>, <3 x i64> addrspace(1)* %in, align 4
				; CHECK: store <3 x i64> %vec, <3 x i64> addrspace(1)* %out
				define void @copy_v3i64_align4(<3 x i64> addrspace(1)* noalias %out, <3 x i64> addrspace(1)* noalias %in) #0 {
				%vec = load <3 x i64>, <3 x i64> addrspace(1)* %in, align 4
				store <3 x i64> %vec, <3 x i64> addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @copy_v3f32_align4
				; CHECK: %vec = load <3 x float>, <3 x float> addrspace(1)* %in, align 4
				; CHECK: store <3 x float>
				define void @copy_v3f32_align4(<3 x float> addrspace(1)* noalias %out, <3 x float> addrspace(1)* noalias %in) #0 {
				%vec = load <3 x float>, <3 x float> addrspace(1)* %in, align 4
				%fadd = fadd <3 x float> %vec, <float 1.0, float 2.0, float 4.0>
				store <3 x float> %fadd, <3 x float> addrspace(1)* %out
				ret void
				}

				; CHECK-LABEL: @copy_v3f64_align4
				; CHECK: %vec = load <3 x double>, <3 x double> addrspace(1)* %in, align 4
				; CHECK: store <3 x double> %fadd, <3 x double> addrspace(1)* %out
				define void @copy_v3f64_align4(<3 x double> addrspace(1)* noalias %out, <3 x double> addrspace(1)* noalias %in) #0 {
				%vec = load <3 x double>, <3 x double> addrspace(1)* %in, align 4
				%fadd = fadd <3 x double> %vec, <double 1.0, double 2.0, double 4.0>
				store <3 x double> %fadd, <3 x double> addrspace(1)* %out
				ret void
				}

				declare void @llvm.amdgcn.s.barrier() #1

				attributes #0 = { nounwind }
				attributes #1 = { convergent nounwind }
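Note on the `merge_global_store_{5,6,7,8}_constants_i32` tests above: their CHECK lines expect runs of 5, 6, 7 and 8 adjacent i32 stores to come out as 4+1, 4+2, 4+3 and 4+4 elements respectively. A minimal sketch of a front-to-back split that reproduces those sizes is below. The helper name `splitStoreChain` and the default limit of 4 elements are assumptions inferred from the CHECK lines only; this is not code from LoadStoreVectorizer.cpp and may not match how the pass actually picks chunk boundaries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration; not part of the patch.
// Splits a run of `numStores` adjacent, equally sized stores into chunk
// sizes, taking at most `maxElts` elements per chunk from front to back.
// The default of 4 is an assumption read off the <4 x i32> CHECK lines.
std::vector<std::size_t> splitStoreChain(std::size_t numStores,
                                         std::size_t maxElts = 4) {
  assert(maxElts > 0 && "chunk size must be positive");
  std::vector<std::size_t> chunks;
  while (numStores > 0) {
    const std::size_t take = std::min(numStores, maxElts);
    chunks.push_back(take);  // a chunk of size 1 would remain a scalar store
    numStores -= take;
  }
  return chunks;
}
```

Under this sketch, a trailing chunk of one element corresponds to the lone `; CHECK: store i32` in the 5-constant test, and a trailing chunk of three corresponds to the `<3 x i32>` store in the 7-constant test.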

test/Transforms/LoadStoreVectorizer/AMDGPU/merge-vectors.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -basicaa -load-store-vectorizer -S -o - %s | FileCheck %s

				target datalayout = "e-p:32:32-p1:64:64-p2:64:64-p3:32:32-p4:64:64-p5:32:32-p24:64:64-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64"

				; CHECK-LABEL: @merge_v2i32_v2i32(
				; CHECK: load <4 x i32>
				; CHECK: store <4 x i32> zeroinitializer
				define void @merge_v2i32_v2i32(<2 x i32> addrspace(1)* nocapture %a, <2 x i32> addrspace(1)* nocapture readonly %b) #0 {
				entry:
				%a.1 = getelementptr inbounds <2 x i32>, <2 x i32> addrspace(1)* %a, i64 1
				%b.1 = getelementptr inbounds <2 x i32>, <2 x i32> addrspace(1)* %b, i64 1

				%ld.c = load <2 x i32>, <2 x i32> addrspace(1)* %b, align 4
				%ld.c.idx.1 = load <2 x i32>, <2 x i32> addrspace(1)* %b.1, align 4

				store <2 x i32> zeroinitializer, <2 x i32> addrspace(1)* %a, align 4
				store <2 x i32> zeroinitializer, <2 x i32> addrspace(1)* %a.1, align 4

				ret void
				}

				; CHECK-LABEL: @merge_v1i32_v1i32(
				; CHECK: load <2 x i32>
				; CHECK: store <2 x i32> zeroinitializer
				define void @merge_v1i32_v1i32(<1 x i32> addrspace(1)* nocapture %a, <1 x i32> addrspace(1)* nocapture readonly %b) #0 {
				entry:
				%a.1 = getelementptr inbounds <1 x i32>, <1 x i32> addrspace(1)* %a, i64 1
				%b.1 = getelementptr inbounds <1 x i32>, <1 x i32> addrspace(1)* %b, i64 1

				%ld.c = load <1 x i32>, <1 x i32> addrspace(1)* %b, align 4
				%ld.c.idx.1 = load <1 x i32>, <1 x i32> addrspace(1)* %b.1, align 4

				store <1 x i32> zeroinitializer, <1 x i32> addrspace(1)* %a, align 4
				store <1 x i32> zeroinitializer, <1 x i32> addrspace(1)* %a.1, align 4

				ret void
				}

				; CHECK-LABEL: @no_merge_v3i32_v3i32(
				; CHECK: load <3 x i32>
				; CHECK: load <3 x i32>
				; CHECK: store <3 x i32> zeroinitializer
				; CHECK: store <3 x i32> zeroinitializer
				define void @no_merge_v3i32_v3i32(<3 x i32> addrspace(1)* nocapture %a, <3 x i32> addrspace(1)* nocapture readonly %b) #0 {
				entry:
				%a.1 = getelementptr inbounds <3 x i32>, <3 x i32> addrspace(1)* %a, i64 1
				%b.1 = getelementptr inbounds <3 x i32>, <3 x i32> addrspace(1)* %b, i64 1

				%ld.c = load <3 x i32>, <3 x i32> addrspace(1)* %b, align 4
				%ld.c.idx.1 = load <3 x i32>, <3 x i32> addrspace(1)* %b.1, align 4

				store <3 x i32> zeroinitializer, <3 x i32> addrspace(1)* %a, align 4
				store <3 x i32> zeroinitializer, <3 x i32> addrspace(1)* %a.1, align 4

				ret void
				}

				; CHECK-LABEL: @merge_v2i16_v2i16(
				; CHECK: load <4 x i16>
				; CHECK: store <4 x i16> zeroinitializer
				define void @merge_v2i16_v2i16(<2 x i16> addrspace(1)* nocapture %a, <2 x i16> addrspace(1)* nocapture readonly %b) #0 {
				entry:
				%a.1 = getelementptr inbounds <2 x i16>, <2 x i16> addrspace(1)* %a, i64 1
				%b.1 = getelementptr inbounds <2 x i16>, <2 x i16> addrspace(1)* %b, i64 1

				%ld.c = load <2 x i16>, <2 x i16> addrspace(1)* %b, align 4
				%ld.c.idx.1 = load <2 x i16>, <2 x i16> addrspace(1)* %b.1, align 4

				store <2 x i16> zeroinitializer, <2 x i16> addrspace(1)* %a, align 4
				store <2 x i16> zeroinitializer, <2 x i16> addrspace(1)* %a.1, align 4

				ret void
				}

				; Ideally this would be merged
				; CHECK-LABEL: @merge_load_i32_v2i16(
				; CHECK: load i32,
				; CHECK: load <2 x i16>
				define void @merge_load_i32_v2i16(i32 addrspace(1)* nocapture %a) #0 {
				entry:
				%a.1 = getelementptr inbounds i32, i32 addrspace(1)* %a, i32 1
				%a.1.cast = bitcast i32 addrspace(1)* %a.1 to <2 x i16> addrspace(1)*

				%ld.0 = load i32, i32 addrspace(1)* %a
				%ld.1 = load <2 x i16>, <2 x i16> addrspace(1)* %a.1.cast

				ret void
				}

				attributes #0 = { nounwind }
				attributes #1 = { nounwind readnone }
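Note on the vector-merging tests above: pairs of `<2 x i32>`, `<1 x i32>` and `<2 x i16>` accesses are expected to merge, while the pair of `<3 x i32>` accesses is expected to stay separate. One rule that is consistent with all four expectations is a cap of 128 bits on the merged access. The sketch below encodes only that rule. The function name `fitsInOneAccess` and the 128-bit default are assumptions inferred from the CHECK lines; the pass may reject the `<3 x i32>` case for a different reason.

```cpp
#include <cassert>

// Hypothetical predicate for illustration; not taken from the patch.
// Returns true if two adjacent accesses, each of `numElts` elements that are
// `eltBits` wide, fit in a single access of at most `maxBits` bits.
// The 128-bit default is an assumption inferred from the tests.
bool fitsInOneAccess(unsigned numElts, unsigned eltBits,
                     unsigned maxBits = 128) {
  assert(numElts > 0 && eltBits > 0 && "empty access");
  return 2u * numElts * eltBits <= maxBits;
}
```

With these assumptions, two `<2 x i32>` accesses total exactly 128 bits and pass, while two `<3 x i32>` accesses total 192 bits and fail.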

test/Transforms/LoadStoreVectorizer/AMDGPU/no-implicit-float.ll

This file was added.

				; RUN: opt -mtriple=amdgcn-amd-amdhsa -load-store-vectorizer -S -o - %s | FileCheck %s

				; CHECK-LABEL: @no_implicit_float(
				; CHECK: store i32
				; CHECK: store i32
				; CHECK: store i32
				; CHECK: store i32
				define void @no_implicit_float(i32 addrspace(1)* %out) #0 {
				%out.gep.1 = getelementptr i32, i32 addrspace(1)* %out, i32 1
				%out.gep.2 = getelementptr i32, i32 addrspace(1)* %out, i32 2
				%out.gep.3 = getelementptr i32, i32 addrspace(1)* %out, i32 3

				store i32 123, i32 addrspace(1)* %out.gep.1
				store i32 456, i32 addrspace(1)* %out.gep.2
				store i32 333, i32 addrspace(1)* %out.gep.3
				store i32 1234, i32 addrspace(1)* %out
				ret void
				}

				attributes #0 = { nounwind noimplicitfloat }


Revision Contents

Diff 61062
