This is an archive of the discontinued LLVM Phabricator instance.

Add LoadStoreVectorizer pass
ClosedPublic

Authored by arsenm on Apr 25 2016, 3:11 PM.

Details

Summary

This was contributed by Apple, and I've been working on
minimal cleanups and generalizing it.

Diff Detail

Event Timeline

arsenm updated this revision to Diff 54922.Apr 25 2016, 3:11 PM
arsenm retitled this revision from to Add LoadStoreVectorizer pass.
arsenm updated this object.
arsenm added reviewers: escha, resistor, tstellarAMD.
arsenm added a subscriber: llvm-commits.

Why can't that be achieved by SLP vectorizer?

Michael

> Why can't that be achieved by SLP vectorizer?
>
> Michael

The SLPVectorizer is concerned with forming complete vector computations. It will give up if it can't vectorize all of the downstream scalar operations fed by the loads. On GPUs that only have vector loads, where all other operations are scalar, this isn't desirable: at best the vectors are formed only to be scalarized during legalization.

Yeah, I can see the motivation, but my question was why not to add this capability to SLP instead of creating a completely new pass. This seems to be introducing a lot of code duplication.

Michael

> Yeah, I can see the motivation, but my question was why not to add this capability to SLP instead of creating a completely new pass. This seems to be introducing a lot of code duplication.
>
> Michael

It's not exactly the same problem. SLP tries to vectorize stores, then checks whether the operations leading to the stores are vectorizable, and then the loads. This pass vectorizes the loads and stores themselves and ignores the intervening instructions (of which there may be none; loads used to compute something that is never stored should also be vectorized). The LoadStoreVectorizer should be able to produce vectors of arbitrary size and split them as requested by the target, even if the types are not legal.
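To make the contrast concrete, here is a hypothetical IR sketch (not from the patch) of the pattern this pass targets, where only the memory operations are vectorized:

```llvm
; Two adjacent scalar loads with scalar-only downstream math. SLP would
; give up (the fadds don't vectorize profitably on a scalar-ALU GPU);
; LSV can still merge the loads.
%p1 = getelementptr inbounds float, float* %p, i64 1
%a  = load float, float* %p, align 8
%b  = load float, float* %p1, align 4
%x  = fadd float %a, 1.0
%y  = fadd float %b, 2.0
; conceptually becomes:
;   %vp = bitcast float* %p to <2 x float>*
;   %v  = load <2 x float>, <2 x float>* %vp, align 8
;   %a  = extractelement <2 x float> %v, i32 0
;   %b  = extractelement <2 x float> %v, i32 1
```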

I also see quite a bit of duplication with SLP. Could this also be done with the ConsecutiveStore and ConsecutiveLoad optimizations in SelectionDAG?

> I also see quite a bit of duplication with SLP. Could this also be done with the ConsecutiveStore and ConsecutiveLoad optimizations in SelectionDAG?

The DAG load and store combining is less effective, and is also restricted to combining into legal types.

Hi Matt,

I think it's ok to have it in a separate pass, but we really need to try to factor out common parts. For instance, we definitely have propagateMetadata and isConsecutiveAccess in SLP.

Thanks,
Michael

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
519

Can it be a range loop?

arsenm updated this revision to Diff 56811.May 10 2016, 1:55 PM

Cleanups

arsenm marked an inline comment as done.May 10 2016, 1:58 PM

> Hi Matt,
>
> I think it's ok to have it in a separate pass, but we really need to try to factor out common parts. For instance, we definitely have propagateMetadata and isConsecutiveAccess in SLP.
>
> Thanks,
> Michael

isConsecutiveAccess is already factored into a separate utility function, but it seems to have diverged from the one here. Some work will be needed to merge them. I've added a FIXME for it.


D20639 splits SLPVectorizer's propagateMetadata into a utility function

jlebar added a subscriber: jlebar.May 25 2016, 1:58 PM

I am very excited about this; please let me know if there's anything I can do to help get it in.

include/llvm/Transforms/Vectorize.h
147

Should the VecRegSize be pulled from TTI (at least as a default)?

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
89

Typos

arsenm added inline comments.May 25 2016, 2:01 PM
include/llvm/Transforms/Vectorize.h
147

This is done in the follow up patches D19502, D19504

Just wanted to mention that this pass was originally contributed by Volkan Keles.

Thanks Volkan!

I can crash this locally. We assert for trying to create an invalid bitcast at LoadStoreVectorizer.cpp:833:

831	      if (Extracted->getType() != UI->getType())
832	        Extracted =
833	          cast<Instruction>(Builder.CreateBitCast(Extracted, UI->getType()));

(rr) p Extracted->dump()
  %25 = extractelement <2 x i64> %23, i32 1
(rr) p UI->getType()->dump()
%"struct.Eigen::half"*

I will try to get a testcase.

OK, I got a somewhat minimized testcase.

$ curl https://gist.githubusercontent.com/anonymous/c3d28b883b97b476b9a2ecb4d5dac807/raw/cd830af846bd8738544783305eb7618449d06eba/- | llc

llc: ../src/lib/IR/Instructions.cpp:2582: static llvm::CastInst *llvm::CastInst::Create(Instruction::CastOps, llvm::Value *, llvm::Type *, const llvm::Twine &, llvm::Instruction *): Assertion `castIsValid(op, S, Ty) && "Invalid cast!"' failed.
#0 0x0000000001927c08 llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x1927c08)
#1 0x0000000001926296 llvm::sys::RunSignalHandlers() (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x1926296)
#2 0x000000000192880a SignalHandler(int) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x192880a)
#3 0x00007fe0a84d5340 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x10340)
#4 0x00007fe0a76fdcc9 gsignal /build/eglibc-3GlaMS/eglibc-2.19/signal/../nptl/sysdeps/unix/sysv/linux/raise.c:56:0
#5 0x00007fe0a77010d8 abort /build/eglibc-3GlaMS/eglibc-2.19/stdlib/abort.c:91:0
#6 0x00007fe0a76f6b86 __assert_fail_base /build/eglibc-3GlaMS/eglibc-2.19/assert/assert.c:92:0
#7 0x00007fe0a76f6c32 (/lib/x86_64-linux-gnu/libc.so.6+0x2fc32)
#8 0x00000000014dd7b9 llvm::CastInst::Create(llvm::Instruction::CastOps, llvm::Value*, llvm::Type*, llvm::Twine const&, llvm::Instruction*) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x14dd7b9)
#9 0x000000000079d58b llvm::IRBuilder<llvm::ConstantFolder, llvm::IRBuilderDefaultInserter>::CreateCast(llvm::Instruction::CastOps, llvm::Value*, llvm::Type*, llvm::Twine const&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x79d58b)
#10 0x00000000019d84fd (anonymous namespace)::Vectorizer::vectorizeLoadChain(llvm::ArrayRef<llvm::Value*>) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x19d84fd)
#11 0x00000000019d7f57 (anonymous namespace)::Vectorizer::vectorizeLoadChain(llvm::ArrayRef<llvm::Value*>) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x19d7f57)
#12 0x00000000019d7809 (anonymous namespace)::Vectorizer::vectorizeChains(llvm::MapVector<llvm::Value*, llvm::SmallVector<llvm::Value*, 8u>, llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int> >, std::vector<std::pair<llvm::Value*, llvm::SmallVector<llvm::Value*, 8u> >, std::allocator<std::pair<llvm::Value*, llvm::SmallVector<llvm::Value*, 8u> > > > >&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x19d7809)
#13 0x00000000019d5cd8 (anonymous namespace)::LoadStoreVectorizer::runOnFunction(llvm::Function&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x19d5cd8)
#14 0x00000000014f84d4 llvm::FPPassManager::runOnFunction(llvm::Function&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x14f84d4)
#15 0x00000000014f871b llvm::FPPassManager::runOnModule(llvm::Module&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x14f871b)
#16 0x00000000014f8bf7 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x14f8bf7)
#17 0x00000000005fc419 compileModule(char**, llvm::LLVMContext&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x5fc419)
#18 0x00000000005f9a4b main (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x5f9a4b)
#19 0x00007fe0a76e8ec5 __libc_start_main /build/eglibc-3GlaMS/eglibc-2.19/csu/libc-start.c:321:0
#20 0x00000000005f5f22 _start (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x5f5f22)
Stack dump:
0.	Program arguments: /usr/local/google/home/jlebar/code/llvm/release/bin/llc 
1.	Running pass 'Function Pass Manager' on module '<stdin>'.
2.	Running pass 'GPU Load and Store Vectorizer' on function '@_ZN5Eigen8internal15EigenMetaKernelINS_15TensorEvaluatorIKNS_14TensorAssignOpINS_9TensorMapINS_6TensorIxLi0ELi1ElEELi1EEEKNS_18TensorConversionOpIxKNS_20TensorTupleReducerOpINS0_18ArgMaxTupleReducerINS_5TupleIlNS_4halfEEEEEKNS_5arrayIlLm1EEEKNS4_INS5_IKSC_Li1ELi1ElEELi1EEEEEEEEENS_9GpuDeviceEEElEEvT_T0_'
Aborted (core dumped)

OK, I see the problem. This is trying to vectorize two consecutive loads, where the first is loading from an i64*, and the second is loading from an Eigen::half**. We can in fact vectorize these on my target, because i64 and pointers have the same size. But we need an inttoptr instruction if the vector type is <2 x i64>.
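A sketch of what the fixed codegen needs to emit for the pointer lane (hypothetical IR, assuming 64-bit pointers):

```llvm
%v  = load <2 x i64>, <2 x i64>* %vp, align 8
%e0 = extractelement <2 x i64> %v, i32 0   ; the original i64 value
%e1 = extractelement <2 x i64> %v, i32 1
; bitcast from i64 to a pointer type is invalid; the pointer lane
; has to go through inttoptr instead:
%p1 = inttoptr i64 %e1 to %"struct.Eigen::half"*
```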

arsenm updated this revision to Diff 58680.May 26 2016, 1:17 PM

Minor cleanup and additional tests

> OK, I see the problem. This is trying to vectorize two consecutive loads, where the first is loading from an i64*, and the second is loading from an Eigen::half**. We can in fact vectorize these on my target, because i64 and pointers have the same size. But we need an inttoptr instruction if the vector type is <2 x i64>.

I'll post a separate follow-up patch which fixes this case.

jlebar added a comment.Jun 2 2016, 1:37 PM

Not to scope-creep this, because it's immensely helpful as-is, but looking through some Eigen code, it seems like it would also be helpful to merge consecutive vector loads where possible. (In NVPTX, this usually means merging two loads of 2xf32 into one 4xf32 load.) Eigen tries to be clever and vectorize itself, but it doesn't always use the right packet size.
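For illustration, a hypothetical sketch (assuming the two loads are adjacent and the combined access is sufficiently aligned) of the merge being asked for:

```llvm
%q  = getelementptr inbounds <2 x float>, <2 x float>* %p, i64 1
%lo = load <2 x float>, <2 x float>* %p, align 16
%hi = load <2 x float>, <2 x float>* %q, align 8
; conceptually becomes one 4-wide load plus shuffles:
;   %vp = bitcast <2 x float>* %p to <4 x float>*
;   %v  = load <4 x float>, <4 x float>* %vp, align 16
;   %lo = shufflevector <4 x float> %v, <4 x float> undef, <2 x i32> <i32 0, i32 1>
;   %hi = shufflevector <4 x float> %v, <4 x float> undef, <2 x i32> <i32 2, i32 3>
```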

jlebar added a comment.Jun 2 2016, 1:43 PM

Another question I have is: NVPTX has some vectorizable load instructions -- notably ld.global.nc, which means, load from the global address space using the texture cache -- that are modeled as llvm intrinsics, so they don't get merged by this pass. I wonder what's the portable way to teach this pass about these loads. Like, should they be modeled in LLVM as regular load instructions with some sort of annotation, instead of as an LLVM intrinsic?

volkan added a subscriber: volkan.Jun 6 2016, 12:14 AM
escha edited edge metadata.Jun 6 2016, 7:38 AM

My intuition would be -- if loading using the texture cache doesn't change the result, but rather is just a performance thing, that would seem to be something you'd set with metadata on the instruction, right?

jlebar added a comment.Jun 6 2016, 9:34 AM

> My intuition would be -- if loading using the texture cache doesn't change the result, but rather is just a performance thing, that would seem to be something you'd set with metadata on the instruction, right?

The instruction I have in mind does (potentially) change the result -- the texture cache is noncoherent, so it's only guaranteed to give the same result as a regular load if the data we're loading hasn't been modified since the kernel started.

jlebar added a comment.Jun 9 2016, 5:38 PM

arsenm's current patch queue, for my reference: D20705, D19504, D19506, D19507, D19508, D20696, D20639, D20705.

Hi,

Some random comments after a quick first round of review inline.

Thanks,
Michael

mzolotukhin added inline comments.Jun 9 2016, 6:43 PM
include/llvm/Transforms/Vectorize.h
147

Minor: I'd rather hardcode 128 for now, just to make the pass signature similar to existing passes (no other pass accepts an argument in create...Pass I think). But it probably doesn't matter as you're going to fix it in follow-ups anyway.

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
78–79

I think the second line should be indented here.

108

Bikeshed: I'd rename Vectorizer to LoadStoreVectorizer and LoadStoreVectorizer to LoadStoreVectorizerPass or something like this.

153–155

Should we bail out on this for all potential targets? E.g. why is it a problem for, e.g., vectorizing integer stores on x86 using movdqa integer instructions?

287

Some comments would be helpful here (and for other functions).

417

Range loop?

487–488

Might be a good idea to parametrize 64.

679

Any chance it can be somehow combined with vectorizeStoreChain? They look so similar..

730–731

Is there something special about 4? Can we add a define/const for it?

jlebar added a comment.EditedJun 10 2016, 11:59 AM

Not done going through this, but I tried your branch at https://github.com/arsenm/llvm/tree/ls-vectorizer, and I'm still ICE'ing when compiling thrust.

$ git clone https://github.com/thrust/thrust.git
$ cd thrust
$ CLANG_PATH=/abs/path/to/dir/containing/clang/binary scons arch=sm_35 Wall=no Werror=no mode=release cuda_compiler=clang std=c++11 cuda_path=/usr/local/cuda-7.0 -j48 unit_tests

Range types must match instruction type!
  %wide.load = load <16 x i8>, <16 x i8>* %60, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544 = load <16 x i8>, <16 x i8>* %62, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.1 = load <16 x i8>, <16 x i8>* %66, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.1 = load <16 x i8>, <16 x i8>* %68, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.2 = load <16 x i8>, <16 x i8>* %72, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.2 = load <16 x i8>, <16 x i8>* %74, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.3 = load <16 x i8>, <16 x i8>* %78, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.3 = load <16 x i8>, <16 x i8>* %80, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.epil = load <16 x i8>, <16 x i8>* %84, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.epil = load <16 x i8>, <16 x i8>* %86, align 1, !tbaa !61, !range !63, !alias.scope !64
fatal error: error in backend: Broken function found, compilation aborted!

Interestingly, this is happening during *host* compilation. It happens even if I don't add a patch to enable the load/store vectorizer for nvptx.

Smaller steps to reproduce:

$ curl https://gist.githubusercontent.com/anonymous/5c8863562db1dda94a821c51e3ad001a/raw/b8f3e3cc9b0995ab89d5e8c3840dcea30dc96d0b/- | opt -O2
(same ICE)
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
39

Nit, number of scalar instructions removed might be a more interesting metric, because it's notable whether we manage to vectorize in groups of 2 vs 4.

78–79

clang-format formats it for me as-is.

arsenm added inline comments.Jun 10 2016, 12:49 PM
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
153–155

I don't really know what this attribute does, it was just there

730–731

The basic type alignment on GPUs? It's on my todo list to make this check allowsMemoryAccess instead

suyog added a subscriber: suyog.Jun 10 2016, 1:23 PM

Looks pretty good to me; mostly nits / comment suggestions, but I have one nontrivial concern about how we ensure that we always build the longest vectors we can.

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
71

Nit, if this is all the comment says, do we need it? It seems just to repeat what's already in the function name.

74

Nit, "Reorders" would be consistent with the rest of the comments in this file. Also, do you mean s/user/users/?

74

It's not obvious to me how this function reorders the users of I without looking at the source. Something like "Reorders instructions to ensure that I dominates all of its users." or something would have helped me.

88

Perhaps, "Checks if there are any instructions which may affect the value returned by a load/store in Chain between From and To." Since this function checks not just for aliasing memory instructions, but also side-effecting instructions.

96

Probably worth indicating that you expect all of the elements of Map to be loads, or all of them to be stores.

98

Suggest "Finds loads/stores to consecutive memory addresses and vectorizes them." Otherwise it can be read as saying that the instructions themselves must appear consecutively in the BB. Probably also worth indicating that all elements of Instrs must be loads, or all must be stores.

153–155

It came from r187340, back in 2013, by Nadav Rotem <nrotem@apple.com>. There's no additional information in the patch.

AFAICT the intent is specifically not to vectorize loads/stores of fp types when noimplicitfloat is set. Which sort of makes sense, based on the tiny information I can glean about noimplicitfloat.

169

s/I/BBI/ would have made this a lot more obvious to me, since we use I for values/instructions most everywhere else.

209

It's not clear to me why we need this line, although all the cases I can think of where this is false seem quite bizarre, so maybe it doesn't matter.

231

Suggest s/Otherwise// -- it's not clear that this is contrasting with "Check if they are based on the same pointer."

243

Nit, suggest s/proving/checking/ -- "proving" is kind of a strong term for what we do, sort of implies it's harder than it is (so it's confusing when I see it's so simple).

260

Suggest something like

// Only look through ZExt/SExt

to make it clearer why we're doing an early return.

306

Nit, consider llvm::is_contained()

326

Why this special case? This isn't as strong as a general DCE, so it seems to me like it gives us a false sense of security? i.e. should we just say "run DCE after this pass"?

339

s/ElementSize/ElementSizeBits/? (I don't care as much what the local vars here are called, but if the arg is called ElementSize, I might reasonably pass a value in bytes.)

339

I'm not sure "bisect" is quite the right word for this? I expected this to split Chain in half, but it looks like it actually splits off the rightmost 1, 2, or 3 bytes' worth of elements, so that the left part has a size that's a multiple of 4 bytes (ish). I might call that "splitIntoRoundSize" or something.

A postcondition assert() would probably also help clarify this.
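To pin down that reading, here is a standalone model of the described behavior (hypothetical; element sizes are in bytes, the name splitAtRoundSize is invented, and this is not the patch's code):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Model of the split as described: peel elements off the right end of the
// chain until the size of the left part is a multiple of 4 bytes.
// Returns (#elements kept on the left, #elements split off on the right).
std::pair<size_t, size_t>
splitAtRoundSize(const std::vector<size_t> &ElemBytes) {
  size_t Total = 0;
  for (size_t B : ElemBytes)
    Total += B;
  size_t Left = ElemBytes.size();
  while (Left > 0 && Total % 4 != 0) {
    --Left;
    Total -= ElemBytes[Left];
  }
  return {Left, ElemBytes.size() - Left};
}
```

Under this reading, the postcondition would be that the kept prefix's byte size is a multiple of 4 (or empty).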

356

Consider llvm::is_contained

378

suggest "(the vectorized load is inserted at the location of the first load in the chain)."

389

Not GetPointerOperand?

415

To me this would be a lot more clear if we returned the relevant maps (or, I suppose, explicitly took them as params). Otherwise it's not clear from the signature (or the comments) where we collect the instructions into.

437

Consider llvm::all_of(LI->users(), ...);

500

Nit, It might be helpful to print out at least the first instruction in the chain. At least, that was helpful to me when I was debugging it.

503

Can we move this closer to where it's used?

509

Nit, we know that Instrs is <= 64 elements, so could we just make ConsecutiveChain an array of signed ints? Then we could use -1 instead of ~0U (which implied to me that we were going to do some bit twiddling below).

Also I think you could initialize the array with

int ConsecutiveChain[64] = {-1};

which would be clearer than setting each element in the loop, imo.

518

I cannot for the life of me figure out what invariant you're trying to preserve with these checks. I guess a comment is in order. :)

(I get that we prefer to vectorize chains of scalar instructions that appear in order (j > i), but I can't figure out what CurDistance and NewDistance are checking. It would make sense if you wanted to prefer chains of scalar instructions that are close to each other, but then NewDistance should be abs(j - i), right?)

519

Not sure, but this might be clearer if you simply used a "continue" here instead of a flag variable.

524

If ConsecutiveChain was not -1, don't we need to remove ConsecutiveChain[i] from Tails? (Maybe it would be better to build Heads and Tails, or at least Tails, in a separate loop, over ConsecutiveChain, so we don't have to worry about the case where one instr is a tail for two heads and then gets overwritten just one time.)

530

The intent here is to build the largest vectors we can, right? (Or at least, we want to build up vectors of the max vector width, if possible? Beyond that doesn't make a difference.)

I don't see how we ensure this happens, unless there is some implicit requirement on the order of Instrs that I'm missing. For example, if the elements of Instrs are

  • load p[3]
  • load p[2]
  • load p[1]
  • load p[0]

then it seems that the first element of Heads will be "load p[2]", implying a two-wide vector load. I don't see what here would cause us to wait on this one and do the four-wide load starting at p[0].

Sorry if I'm missing something obvious.

538

Actually, I'm not sure why we need the (Tails.count(I) || Heads.count(I)) part of this predicate, and therefore I'm not sure why we need Tails at all. That probably deserves a comment at least.

578

Hm, looks like an additional constraint on bisect() is that the array passed should have more than 4 elements -- probably worth saying that somewhere.

630

If you like, you could write this as

BasicBlock::iterator First, Last;
std::tie(First, Last) = getBoundaryInstrs(Chain);

Same for the other call to getBoundaryInstrs.

754

A correctness-critical assumption above is that

> the loads are inserted at the location of the first load

but here it looks like the opposite?

764

Nit, I'd prefer to move this into the if and else blocks, so it's clear that we don't use it elsewhere.

arsenm updated this revision to Diff 61062.Jun 16 2016, 9:12 PM
arsenm edited edge metadata.
arsenm marked 26 inline comments as done.

Mostly address review comments

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
39

I added this as well, I think both might be useful

108

Then you end up with the macro functions adding another "Pass" to the name, so you end up with initializeLoadStoreVectorizerPassPass etc., so it's better to not include Pass in the class name.

169

Better, use a range loop

209

I think this is to fix cases like trying to merge i1 into i8. This was already crashing earlier though, so I'm just having sub-byte types never considered.

287

The comments are already on the declarations. Usually for private functions like this I would put them on the body. Should I move them?

326

It's run after each vectorization. I'm guessing the additional uses might somehow interfere with vectorization of other chains? This can be investigated later.

500

I'm not sure why this one is even needed. The full chain is already printed out right above this by the "LSV: Loads to vectorize" DEBUG

509

Can't do that. It needs to be reset after the j loop.

538

I'm more likely to break something trying to refactor this compared to all of the style fixes, so I'll leave looking at that for later.

754

I had another test which I apparently forgot to include which shows the alias insertion order behavior. I've also added another which shows the insertion point.

With the insertion point as Last, the load store load store case is correct and vectorizes as
store
load v2
store.

If I change the load insertion point to be before the first, it regresses and produces the incorrect
load v2
store v2

Maybe the comment is backwards?

jlebar added inline comments.Jun 17 2016, 9:31 AM
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
88

Nit, "Legalizes" and "Breaks" would be consistent with the rest of the file. Also suggest s/vector half/piece/ -- "vector half" is kind of hard to parse.

96

This was marked as done, but it doesn't appear to have been done. Not a big deal if that's intentional, but pointing it out in case it wasn't.

> Probably worth indicating that you expect all of the elements of Map to be loads, or all of them to be stores.

190

Nit, would getOperandAddressSpace() or getPointerAddressSpace() be a better name? The address space itself isn't an operand of the instruction.

500

Well, they are sort of different...that one may never appear if we find no chains or whatever. But I take your point that this one may not be so useful. FWIW some additional well-placed log messages might be helpful; I was debugging with a coworker a few days ago why some instructions weren't being vectorized, and it required gdb. But we can also easily go back and add these later.

509

Ah, I see. I clearly have no idea what this loop does. :)

754

> Maybe the comment is backwards?

I can't manage to convince myself that those checks are correct if we do anything but what the comments say.

In the test, I see that you have

%ld.c = load double, double addrspace(1)* %c, align 8 ; may alias store to %a
store double 0.0, double addrspace(1)* %a, align 8
%ld.c.idx.1 = load double, double addrspace(1)* %c.idx.1, align 8 ; may alias store to %a
store double 0.0, double addrspace(1)* %a.idx.1, align 8

If the comments are correct and both loads may alias the first store, isn't transforming this into "store; load v2; store" (what the test is checking for) unsafe? I grant that vectorizing both the loads and the stores is also unsafe.

asbirlea added inline comments.Jun 17 2016, 9:40 AM
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
509

I'm confused why that is the case. Each i iteration sets ConsecutiveChain[i]=-1 and all read/write accesses in the j loop are on ConsecutiveChain[i]. Initializing all ConsecutiveChain to -1 before the i loop should have the equivalent behavior. Could you clarify what I'm missing?

escha added a comment.Jun 17 2016, 9:53 AM

As the author of some of that code, it is quite possible the code is just not well-written and can be done better ;-)

jlebar added inline comments.Jun 17 2016, 1:23 PM
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
304

Talking to asbirlea, we are not convinced this is safe. In particular, we cannot safely reorder across instructions that may read or may write arbitrary memory, nor can we safely reorder loads and stores that are not in the chain but which nonetheless may alias each other.

I think we're safe wrt instructions which may read or write, because these will never appear between a chain's first and last instructions. (I would feel much better if we had an assert to this effect, though.)

But I am not convinced we're safe wrt regular loads and stores whose addresses depend on the result of a vectorized load. For example, imagine we have

b = load a
load [b]  // b does not alias a
store [b]
c = load a+1

Suppose that we choose to vectorize the first and last load and WLOG assume we insert the vectorized load at the position of c:

load [b]
store [b]
{b, c} = load {a, a+1}

Now we call reorder() to fix this up. The first two instructions are going to be moved below the vectorized load in the reverse order of I->uses(). AFAICT there is no guarantee on the order of uses(), but even if there were, emitting them in the opposite order seems wrong.

Even if that were somehow right, afaict we can make reorder() do arbitrary, unsafe reorderings of loads/stores with the following idiom.

b = load a
b1 = b + 0 // copy b to b1
b2 = b + 0 // copy b to b2
load [b1]  // b1 does not alias a
store [b2]  // b2 does not alias a
c = load a+1

Now if the order of b1 and b2 in the source determines their order in users(), then that controls the order of "load [b1]" and "store [b2]" in the result, even though there's only one correct ordering of that load and store.

So to summarize, it may be my ignorance, but I don't see why we can rely on there being any particular order to users(). But even if users() has some guaranteed order, I don't see how any ordering could guarantee that we never reorder an aliasing load/store pair.

asbirlea added inline comments.Jun 17 2016, 2:29 PM
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
304

Tried testing this out using jlebar's first example.

Input:

%struct.buffer_t = type { i64, i8*, [4 x i32], [4 x i32], [4 x i32], i32, i8, i8, [2 x i8] }       
  
define i32 @test(%struct.buffer_t* noalias %buff) #0 {                                             
entry:
  %tmp1 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 1           
  %buff.host = load i8*, i8** %tmp1, align 8                                                       
  %buff.val = load i8, i8* %buff.host, align 8                                                     
  store i8 0, i8* %buff.host, align 8                                                              
  %tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 0           
  %buff.dev = load i64, i64* %tmp0, align 8                                                        
  ret i32 0
} 
  
attributes #0 = { nounwind }

Output:

%struct.buffer_t = type { i64, i8*, [4 x i32], [4 x i32], [4 x i32], i32, i8, i8, [2 x i8] }       
  
; Function Attrs: nounwind
define i32 @test(%struct.buffer_t* noalias %buff) #0 {                                             
entry:
  %tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 0           
  %0 = bitcast i64* %tmp0 to <2 x i64>*                                                            
  %1 = load <2 x i64>, <2 x i64>* %0, align 8                                                      
  %2 = extractelement <2 x i64> %1, i32 0
  %3 = extractelement <2 x i64> %1, i32 1                                                          
  %4 = inttoptr i64 %3 to i8*                                                                      
  store i8 0, i8* %4, align 8
  %buff.val = load i8, i8* %4, align 8                                                             
  ret i32 0
} 
  
attributes #0 = { nounwind }

The first and last load got vectorized and moved at the beginning, but the uses that now follow are in reversed order.

The input had load [b], store [b]. The output has store [b], load [b].

asbirlea added inline comments.Jun 17 2016, 3:20 PM
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
289

Nit: Could the Instruction instance "User" be renamed to something different than a class name? It makes it somewhat harder to follow.

If we were to always insert vectorized loads at the location of the first load and place stores at the location of the last store, I actually don't see why we'd need reorder() at all.

asbirlea added inline comments.Jun 20 2016, 1:42 PM
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
304

FWIW, the example above can be fixed by keeping a temporary of the last instruction inserted. For example:

Instruction *InsertAfter = I;
for (User *U : I->users())
  if (Instruction *User = dyn_cast<Instruction>(U))
    if (!DT.dominates(I, User) && User->getOpcode() != Instruction::PHI) {
      User->removeFromParent();
      User->insertAfter(InsertAfter);
      reorder(User);
      InsertAfter = User;
    }

However, this does not account for the recursive inserts AND it assumes the users() are in BB order.

A correct but very iffy way: 1. create a set (including the recursive call here) of all (transitive) users; 2. make a pass over all instructions in the BB, and if an instruction is in the collected set, insert it after the latest inserted instruction (as in the sample above), or just go over the BB in reverse order.

I would favor avoiding the reorder entirely over having a pass over the BB at every reorder.
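A standalone sketch of that two-step scheme (hypothetical model, not LLVM code: instructions are indices in basic-block order, Users[i] lists the instructions that use the value defined by i, and the names collectUsers/reorderAfter are invented):

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Step 1 of the scheme: collect the transitive users of I (the recursive
// part); the set insert also guards against revisiting an instruction.
static void collectUsers(int I, const std::vector<std::vector<int>> &Users,
                         std::set<int> &Out) {
  for (int U : Users[I])
    if (Out.insert(U).second)
      collectUsers(U, Users, Out);
}

// Step 2: one pass over the block; non-users keep their order, and every
// collected user is spliced back in, in original block order, right after I.
std::vector<int> reorderAfter(const std::vector<int> &BB, int I,
                              const std::vector<std::vector<int>> &Users) {
  std::set<int> Moved;
  collectUsers(I, Users, Moved);
  std::vector<int> Out, Tail;
  for (int Inst : BB)
    (Moved.count(Inst) ? Tail : Out).push_back(Inst);
  Out.insert(std::find(Out.begin(), Out.end(), I) + 1, Tail.begin(),
             Tail.end());
  return Out;
}
```

In the earlier example this keeps load [b] before store [b] after both are moved below the vectorized load, but as noted it still costs a pass over the block per reorder.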

Here's a test-case that currently produces incorrect results on my end.

target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnueabihf"

define i32 @test(i32* noalias %ptr) {
entry:
  br label %"for something"

"for something":
  %index = phi i64 [ 0, %entry ], [ %index.next, %"for something" ]
  %next.gep = getelementptr i32, i32* %ptr, i64 %index
  %a1 = add nsw i64 %index, 1
  %next.gep1 = getelementptr i32, i32* %ptr, i64 %a1
  %a2 = add nsw i64 %index, 2
  %next.gep2 = getelementptr i32, i32* %ptr, i64 %a2

  %l1 = load i32, i32* %next.gep1, align 4                                                         
  %l2 = load i32, i32* %next.gep, align 4                                                          
  store i32 0, i32* %next.gep1, align 4                                                            
  store i32 0, i32* %next.gep, align 4
  %l3 = load i32, i32* %next.gep1, align 4                                                         
  %l4 = load i32, i32* %next.gep2, align 4   
  %index.next = add i64 %index, 8
  %cmp_res = icmp eq i64 %index.next, 8
  br i1 %cmp_res, label %ending, label %"for something"

ending:
  ret i32 0
}

It does not account for the stores aliasing the loads, and leaves the stores as the first instructions.
Output I see:

%0 = bitcast i32* %next.gep to <2 x i32>*                                                        
store <2 x i32> zeroinitializer, <2 x i32>* %0, align 4                                          
%1 = bitcast i32* %next.gep to <2 x i32>*                                                        
%2 = load <2 x i32>, <2 x i32>* %1, align 4                                                      
%3 = extractelement <2 x i32> %2, i32 0
%4 = extractelement <2 x i32> %2, i32 1                                                          
%5 = bitcast i32* %next.gep1 to <2 x i32>*                                                       
%6 = load <2 x i32>, <2 x i32>* %5, align 4                                                      
%7 = extractelement <2 x i32> %6, i32 0
%8 = extractelement <2 x i32> %6, i32 1

Is there a follow-up patch fixing this that I missed?

Example in the previous comment should produce correct code with these changes.

lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
322

This should be: LastInstr = ++I; (using the updated patch).
It's always valid: it either points to the next valid instruction or to BB.end().

This fixes an issue where isVectorizable misses testing the last instruction in the chain.

376

Consider asserting Chain.size() == ChainInstrs.size()?

390

As loads are inserted at the location of the last load, either invert the comparison (VVIdx > VIdx) and update the comment, or fix the load insert location.
The former misses vectorization opportunities, but it avoids (at least some of the) incorrect code generation.

394

Same as above.

asbirlea added inline comments.Jun 23 2016, 5:15 PM
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
538

As I understand it, this is the condition for finding the chain of maximum length. The check that Head cannot be found in Tails means it is the beginning of a chain, hence one of the longest chains. However, if the longest chain fails to vectorize, this same check prevents any vectorization of the remaining chain.

Here's a suggestion to address vectorizing the chain suffix:

for (unsigned i = 0; i < Heads.size(); i++) {
    unsigned Head = Heads[i];
    if (VectorizedValues.count(Instrs[Head]))
      continue;
    // Skip if a longer chain exists in the remaining Heads/Tails.
    bool LongerChainExists = false;
    for (unsigned j = i + 1; j < Tails.size(); j++)
      if (Head == Tails[j]) {
        LongerChainExists = true;
        break;
      }
    if (LongerChainExists)
      continue;

Feel free to add additional improvements for vectorization of a chain prefix.
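The Heads/Tails bookkeeping can be modeled in isolation. Below is a self-contained C++ sketch (names are hypothetical, a simplified stand-in for the pass's actual data structures) of the check under discussion: a head starts a maximal chain only if it does not also appear as some other pair's tail:

```cpp
#include <vector>

// Heads/Tails are parallel arrays of indices describing consecutive pairs:
// the access at index Heads[i] is directly followed in memory by the one
// at index Tails[i]. A head that is never a tail begins a maximal chain.
std::vector<unsigned> maximalChainHeads(const std::vector<unsigned> &Heads,
                                        const std::vector<unsigned> &Tails) {
  std::vector<unsigned> Result;
  for (unsigned Head : Heads) {
    bool IsTail = false;
    for (unsigned Tail : Tails)
      if (Head == Tail) {
        IsTail = true;
        break;
      }
    if (!IsTail)
      Result.push_back(Head);
  }
  return Result;
}
```

With Heads = {0, 1, 2} and Tails = {1, 2, 3}, only index 0 is a maximal head; if the chain starting at 0 fails to vectorize, this filter alone never reconsiders 1 or 2, which is the suffix-vectorization gap the comment describes.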

jlebar accepted this revision.Jun 28 2016, 4:03 PM
jlebar added a reviewer: jlebar.

After discussion on IRC, I'm OK taking this as-is, with an explicit understanding that there are existing changes that need to be made here, so future patches will not have the same bias towards existing code that we might normally have. Collaborating with the code living out-of-tree is just too painful.

Alina has fixes to the correctness issues in her patch queue, and one of us will go back and ensure that all of my comments above (particularly about the ConsecutiveChain algorithm) are addressed.

This revision is now accepted and ready to land.Jun 28 2016, 4:03 PM
arsenm marked 2 inline comments as done.Jun 30 2016, 11:28 AM
arsenm added inline comments.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
96

I put it on vectorizeChains. I can add it here too.

322

Leaving for future patch

376

This needs to go with the ++I patch

390

Are you going to fix this / do you have test cases for this and below?

asbirlea added inline comments.Jun 30 2016, 11:32 AM
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
390

Yes. I have an updated version of the current patch that fixes this and adds a couple of unit tests.

arsenm closed this revision.Jun 30 2016, 4:18 PM
arsenm marked an inline comment as done.

r274293