This was contributed by Apple, and I've been working on
minimal cleanups and generalizing it.
Reviewers: tstellarAMD, jlebar, escha, resistor
The SLPVectorizer is concerned with forming all vectors. It will give up if it can't vectorize all downstream scalar operations fed by the loads. On GPUs that only have vector loads, and all other operations are scalar, this isn't desirable. At best the vectors are formed just to be scalarized during legalization.
Yeah, I can see the motivation, but my question was why not add this capability to SLP instead of creating a completely new pass. This seems to be introducing a lot of code duplication.
Michael
It's not exactly the same problem. SLP tries to vectorize stores, then checks whether the operations leading to the stores are vectorizable, and then the loads. This pass vectorizes loads and stores while ignoring the intervening instructions (of which there may be none; loads used to compute something that is never stored should also be vectorized). The LoadStoreVectorizer should be able to produce vectors of arbitrary size and split them as requested by the target, even if the types are not legal.
I also see quite a bit of duplication with SLP. Could this also be done with the ConsecutiveStore and ConsecutiveLoad optimizations in SelectionDAG?
The DAG load and store combining is less effective, and it also restricts combining to legal types.
Hi Matt,
I think it's ok to have it in a separate pass, but we really need to try to factor out common parts. For instance, we definitely have propagateMetadata and isConsecutiveAccess in SLP.
Thanks,
Michael
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
- Line 519: Can it be a range loop?
isConsecutiveAccess is already factored into a separate utility function, but it seems to have diverged from the one here. Some work will be needed to merge them. I've added a FIXME for it.
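For reference, the consecutiveness check both passes need boils down to comparing decomposed (base, offset) address pairs. A minimal sketch of that idea under simplifying assumptions (symbolic base pointers and byte offsets — this is not the actual LLVM utility, which works through SCEV and GEPs):

```python
def is_consecutive(base_a, off_a, size_a, base_b, off_b):
    """Two accesses are consecutive when they share a base pointer and the
    second starts exactly where the first ends (offset + size in bytes)."""
    return base_a == base_b and off_b == off_a + size_a

# Two 4-byte loads at offsets 0 and 4 from the same base are consecutive:
print(is_consecutive("p", 0, 4, "p", 4))  # → True
# ...but not with a 4-byte gap between them:
print(is_consecutive("p", 0, 4, "p", 8))  # → False
```

Anything that can't be decomposed this way (different bases, unprovable offsets) is simply treated as not consecutive.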
Just wanted to mention that this pass was originally contributed by Volkan Keles.
Thanks Volkan!
I can crash this locally. We assert for trying to create an invalid bitcast at LoadStoreVectorizer.cpp:833:
```
831   if (Extracted->getType() != UI->getType())
832     Extracted =
833         cast<Instruction>(Builder.CreateBitCast(Extracted, UI->getType()));
```

```
(rr) p Extracted->dump()
%25 = extractelement <2 x i64> %23, i32 1
(rr) p UI->getType()->dump()
%"struct.Eigen::half"*
```
I will try to get a testcase.
OK, I got a somewhat minimized testcase.
```
$ curl https://gist.githubusercontent.com/anonymous/c3d28b883b97b476b9a2ecb4d5dac807/raw/cd830af846bd8738544783305eb7618449d06eba/- | llc
llc: ../src/lib/IR/Instructions.cpp:2582: static llvm::CastInst *llvm::CastInst::Create(Instruction::CastOps, llvm::Value *, llvm::Type *, const llvm::Twine &, llvm::Instruction *): Assertion `castIsValid(op, S, Ty) && "Invalid cast!"' failed.
#0 0x0000000001927c08 llvm::sys::PrintStackTrace(llvm::raw_ostream&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x1927c08)
#1 0x0000000001926296 llvm::sys::RunSignalHandlers() (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x1926296)
#2 0x000000000192880a SignalHandler(int) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x192880a)
#3 0x00007fe0a84d5340 __restore_rt (/lib/x86_64-linux-gnu/libpthread.so.0+0x10340)
#4 0x00007fe0a76fdcc9 gsignal /build/eglibc-3GlaMS/eglibc-2.19/signal/../nptl/sysdeps/unix/sysv/linux/raise.c:56:0
#5 0x00007fe0a77010d8 abort /build/eglibc-3GlaMS/eglibc-2.19/stdlib/abort.c:91:0
#6 0x00007fe0a76f6b86 __assert_fail_base /build/eglibc-3GlaMS/eglibc-2.19/assert/assert.c:92:0
#7 0x00007fe0a76f6c32 (/lib/x86_64-linux-gnu/libc.so.6+0x2fc32)
#8 0x00000000014dd7b9 llvm::CastInst::Create(llvm::Instruction::CastOps, llvm::Value*, llvm::Type*, llvm::Twine const&, llvm::Instruction*) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x14dd7b9)
#9 0x000000000079d58b llvm::IRBuilder<llvm::ConstantFolder, llvm::IRBuilderDefaultInserter>::CreateCast(llvm::Instruction::CastOps, llvm::Value*, llvm::Type*, llvm::Twine const&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x79d58b)
#10 0x00000000019d84fd (anonymous namespace)::Vectorizer::vectorizeLoadChain(llvm::ArrayRef<llvm::Value*>) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x19d84fd)
#11 0x00000000019d7f57 (anonymous namespace)::Vectorizer::vectorizeLoadChain(llvm::ArrayRef<llvm::Value*>) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x19d7f57)
#12 0x00000000019d7809 (anonymous namespace)::Vectorizer::vectorizeChains(llvm::MapVector<llvm::Value*, llvm::SmallVector<llvm::Value*, 8u>, llvm::DenseMap<llvm::Value*, unsigned int, llvm::DenseMapInfo<llvm::Value*>, llvm::detail::DenseMapPair<llvm::Value*, unsigned int> >, std::vector<std::pair<llvm::Value*, llvm::SmallVector<llvm::Value*, 8u> >, std::allocator<std::pair<llvm::Value*, llvm::SmallVector<llvm::Value*, 8u> > > > >&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x19d7809)
#13 0x00000000019d5cd8 (anonymous namespace)::LoadStoreVectorizer::runOnFunction(llvm::Function&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x19d5cd8)
#14 0x00000000014f84d4 llvm::FPPassManager::runOnFunction(llvm::Function&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x14f84d4)
#15 0x00000000014f871b llvm::FPPassManager::runOnModule(llvm::Module&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x14f871b)
#16 0x00000000014f8bf7 llvm::legacy::PassManagerImpl::run(llvm::Module&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x14f8bf7)
#17 0x00000000005fc419 compileModule(char**, llvm::LLVMContext&) (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x5fc419)
#18 0x00000000005f9a4b main (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x5f9a4b)
#19 0x00007fe0a76e8ec5 __libc_start_main /build/eglibc-3GlaMS/eglibc-2.19/csu/libc-start.c:321:0
#20 0x00000000005f5f22 _start (/usr/local/google/home/jlebar/code/llvm/release/bin/llc+0x5f5f22)
Stack dump:
0.  Program arguments: /usr/local/google/home/jlebar/code/llvm/release/bin/llc
1.  Running pass 'Function Pass Manager' on module '<stdin>'.
2.  Running pass 'GPU Load and Store Vectorizer' on function '@_ZN5Eigen8internal15EigenMetaKernelINS_15TensorEvaluatorIKNS_14TensorAssignOpINS_9TensorMapINS_6TensorIxLi0ELi1ElEELi1EEEKNS_18TensorConversionOpIxKNS_20TensorTupleReducerOpINS0_18ArgMaxTupleReducerINS_5TupleIlNS_4halfEEEEEKNS_5arrayIlLm1EEEKNS4_INS5_IKSC_Li1ELi1ElEELi1EEEEEEEEENS_9GpuDeviceEEElEEvT_T0_'
Aborted (core dumped)
```
OK, I see the problem. This is trying to vectorize two consecutive loads, where the first is loading from an i64*, and the second is loading from an Eigen::half**. We can in fact vectorize these on my target, because i64 and pointers have the same size. But we need an inttoptr instruction if the vector type is <2 x i64>.
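The fix being described amounts to choosing the right cast opcode when rebuilding each scalar from the vector lane: a plain bitcast only works between same-width non-pointer types, while a pointer has to be rebuilt from a same-width integer lane with inttoptr. A hedged sketch of that selection logic (hypothetical helper, not the patch's actual code):

```python
def pick_cast(src_is_int, src_bits, dst_is_pointer, dst_bits):
    """Choose the IR cast used to turn an extracted vector lane back into
    the original scalar type."""
    if not dst_is_pointer:
        return "bitcast"
    # Pointers can only be rebuilt from integer lanes of the same width.
    return "inttoptr" if src_is_int and src_bits == dst_bits else None

# An i64 lane feeding an Eigen::half** user on a 64-bit target:
print(pick_cast(True, 64, True, 64))   # → inttoptr
# A 32-bit pointer cannot be rebuilt from an i64 lane:
print(pick_cast(True, 64, True, 32))   # → None
```

The None case is where the pass would have to refuse to vectorize the chain rather than emit an invalid cast.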
Not to scope-creep this, because it's immensely helpful as-is, but looking through some Eigen code, it seems like it would also be helpful to merge consecutive vector loads where possible. (In NVPTX, this usually means merging two loads of 2xf32 into one 4xf32 load.) Eigen tries to be clever and vectorize itself, but it doesn't always use the right packet size.
Another question I have: NVPTX has some vectorizable load instructions -- notably ld.global.nc, which means "load from the global address space using the texture cache" -- that are modeled as LLVM intrinsics, so they don't get merged by this pass. I wonder what the portable way is to teach this pass about these loads. Should they be modeled in LLVM as regular load instructions with some sort of annotation, instead of as an LLVM intrinsic?
My intuition would be -- if loading using the texture cache doesn't change the result, but rather is just a performance thing, that would seem to be something you'd set with metadata on the instruction, right?
The instruction I have in mind does (potentially) change the result -- the texture cache is noncoherent, so it's only guaranteed to give the same result as a regular load if the data we're loading hasn't been modified since the kernel started.
include/llvm/Transforms/Vectorize.h
- Line 147: Minor: I'd rather hardcode 128 for now, just to make the pass signature similar to existing passes (no other pass accepts an argument in create...Pass, I think). But it probably doesn't matter, as you're going to fix it in follow-ups anyway.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
- Lines 78–79: I think the second line should be indented here.
- Line 108: Bikeshed: I'd rename Vectorizer to LoadStoreVectorizer and LoadStoreVectorizer to LoadStoreVectorizerPass, or something like that.
- Lines 153–155: Should we bail out on this for all potential targets? E.g., why is it a problem for vectorizing integer stores on x86 using movdqa integer instructions?
- Line 287: Some comments would be helpful here (and for other functions).
- Line 417: Range loop?
- Lines 487–488: Might be a good idea to parametrize 64.
- Line 679: Any chance it can somehow be combined with vectorizeStoreChain? They look so similar.
- Lines 730–731: Is there something special about 4? Can we add a define/const for it?
Not done going through this, but I tried your branch at https://github.com/arsenm/llvm/tree/ls-vectorizer, and I'm still ICE'ing when compiling thrust.
```
$ git clone https://github.com/thrust/thrust.git
$ cd thrust
$ CLANG_PATH=/abs/path/to/dir/containing/clang/binary scons arch=sm_35 Wall=no Werror=no mode=release cuda_compiler=clang std=c++11 cuda_path=/usr/local/cuda-7.0 -j48 unit_tests
```

```
Range types must match instruction type!
  %wide.load = load <16 x i8>, <16 x i8>* %60, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544 = load <16 x i8>, <16 x i8>* %62, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.1 = load <16 x i8>, <16 x i8>* %66, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.1 = load <16 x i8>, <16 x i8>* %68, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.2 = load <16 x i8>, <16 x i8>* %72, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.2 = load <16 x i8>, <16 x i8>* %74, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.3 = load <16 x i8>, <16 x i8>* %78, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.3 = load <16 x i8>, <16 x i8>* %80, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.epil = load <16 x i8>, <16 x i8>* %84, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.epil = load <16 x i8>, <16 x i8>* %86, align 1, !tbaa !61, !range !63, !alias.scope !64
fatal error: error in backend: Broken function found, compilation aborted!
```
Interestingly, this is happening during *host* compilation. It happens even if I don't add a patch to enable the load/store vectorizer for nvptx.
Smaller steps to reproduce:
```
$ curl https://gist.githubusercontent.com/anonymous/5c8863562db1dda94a821c51e3ad001a/raw/b8f3e3cc9b0995ab89d5e8c3840dcea30dc96d0b/- | opt -O2
```

(same ICE)
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
- Line 39: Nit: the number of scalar instructions removed might be a more interesting metric, because it's notable whether we manage to vectorize in groups of 2 vs. 4.
- Lines 78–79: clang-format formats it for me as-is.
Looks pretty good to me; mostly nits / comment suggestions, but I have one nontrivial concern about how we ensure that we always build the longest vectors we can.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
- Line 71: Nit: if this is all the comment says, do we need it? It seems just to repeat what's already in the function name.
- Line 74: Nit: "Reorders" would be consistent with the rest of the comments in this file. Also, do you mean s/user/users/?
- Line 74: It's not obvious to me how this function reorders the users of I without looking at the source. Something like "Reorders instructions to ensure that I dominates all of its users." would have helped me.
- Line 88: Perhaps, "Checks if there are any instructions which may affect the value returned by a load/store in Chain between From and To." Since this function checks not just for aliasing memory instructions, but also side-effecting instructions.
- Line 96: Probably worth indicating that you expect all of the elements of Map to be loads, or all of them to be stores.
- Line 98: Suggest "Finds loads/stores to consecutive memory addresses and vectorizes them." Otherwise it can be read as saying that the instructions themselves must appear consecutively in the BB. Probably also worth indicating that all elements of Instrs must be loads, or all must be stores.
- Lines 153–155: It came from r187340, back in 2013, by Nadav Rotem <nrotem@apple.com>. There's no additional information in the patch. AFAICT the intent is specifically not to vectorize loads/stores of fp types when noimplicitfloat is set, which sort of makes sense, based on the tiny amount of information I can glean about noimplicitfloat.
- Line 169: s/I/BBI/ would have made this a lot more obvious to me, since we use I for values/instructions most everywhere else.
- Line 209: It's not clear to me why we need this line, although all the cases I can think of where this is false seem quite bizarre, so maybe it doesn't matter.
- Line 231: Suggest s/Otherwise// -- it's not clear that this is contrasting with "Check if they are based on the same pointer."
- Line 243: Nit: suggest s/proving/checking/ -- "proving" is kind of a strong term for what we do; it sort of implies it's harder than it is (so it's confusing when I see it's so simple).
- Line 260: Suggest something like "// Only look through ZExt/SExt" to make it clearer why we're doing an early return.
- Line 306: Nit: consider llvm::is_contained().
- Line 326: Why this special case? This isn't as strong as a general DCE, so it seems to me like it gives us a false sense of security. I.e., should we just say "run DCE after this pass"?
- Line 339: s/ElementSize/ElementSizeBits/? (I don't care as much what the local vars here are called, but if the arg is called ElementSize, I might reasonably pass a value in bytes.)
- Line 339: I'm not sure "bisect" is quite the right word for this. I expected this to split Chain in half, but it looks like it actually splits off the rightmost 1, 2, or 3 bytes' worth of elements, so that the left part has a size that's a multiple of 4 bytes (ish). I might call that "splitIntoRoundSize" or something. A postcondition assert() would probably also help clarify this.
- Line 356: Consider llvm::is_contained().
- Line 378: Suggest "(the vectorized load is inserted at the location of the first load in the chain)."
- Line 389: Not GetPointerOperand?
- Line 415: To me this would be a lot clearer if we returned the relevant maps (or, I suppose, explicitly took them as params). Otherwise it's not clear from the signature (or the comments) where we collect the instructions into.
- Line 437: Consider llvm::all_of(LI->users(), ...);
- Line 500: Nit: it might be helpful to print out at least the first instruction in the chain. At least, that was helpful to me when I was debugging it.
- Line 503: Can we move this closer to where it's used?
- Line 509: Nit: we know that Instrs is <= 64 elements, so could we just make ConsecutiveChain an array of signed ints? Then we could use -1 instead of ~0U (which implied to me that we were going to do some bit twiddling below). Also I think you could initialize the array with "int ConsecutiveChain[64] = {-1};", which would be clearer than setting each element in the loop, imo.
- Line 518: I cannot for the life of me figure out what invariant you're trying to preserve with these checks. I guess a comment is in order. :) (I get that we prefer to vectorize chains of scalar instructions that appear in order (j > i), but I can't figure out what CurDistance and NewDistance are checking. It would make sense if you wanted to prefer chains of scalar instructions that are close to each other, but then NewDistance should be abs(j - i), right?)
- Line 519: Not sure, but this might be clearer if you simply used a "continue" here instead of a flag variable.
- Line 524: If ConsecutiveChain was not -1, don't we need to remove ConsecutiveChain[i] from Tails? (Maybe it would be better to build Heads and Tails, or at least Tails, in a separate loop over ConsecutiveChain, so we don't have to worry about the case where one instr is a tail for two heads and then gets overwritten just one time.)
- Line 530: The intent here is to build the largest vectors we can, right? (Or at least, we want to build up vectors of the max vector width, if possible; beyond that it doesn't make a difference.) I don't see how we ensure this happens, unless there is some implicit requirement on the order of Instrs that I'm missing. For example, if the elements of Instrs are […], then it seems that the first element of Heads will be "load p[2]", implying a two-wide vector load. I don't see what here would cause us to wait on this one and do the four-wide load starting at p[0]. Sorry if I'm missing something obvious.
- Line 538: Actually, I'm not sure why we need the (Tails.count(I) || Heads.count(I)) part of this predicate, and therefore I'm not sure why we need Tails at all. That probably deserves a comment at least.
- Line 578: Hm, looks like an additional constraint on bisect() is that the array passed should have more than 4 elements -- probably worth saying that somewhere.
- Line 630: If you like, you could write this as "BasicBlock::iterator First, Last; std::tie(First, Last) = getBoundaryInstrs(Chain);" Same for the other call to getBoundaryInstrs.
- Line 754: A correctness-critical assumption above is that […], but here it looks like the opposite?
- Line 764: Nit: I'd prefer to move this into the if and else blocks, so it's clear that we don't use it elsewhere.
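To make the line-339 discussion above concrete: the "bisect" helper splits off the trailing 1–3 bytes' worth of elements so that the remaining chain covers a multiple of four bytes. A rough model of that arithmetic (hypothetical Python, assuming 1-, 2-, or 4-byte element sizes; not the pass's code):

```python
def split_round(num_elems, elem_size_bytes):
    """Split a chain so the left part covers a multiple of 4 bytes; the
    remainder (up to 3 bytes' worth of elements) is split off on the right."""
    total_bytes = num_elems * elem_size_bytes
    left_bytes = total_bytes - (total_bytes % 4)
    left_elems = left_bytes // elem_size_bytes
    return left_elems, num_elems - left_elems

# Seven i8 loads: keep a 4-byte prefix, split off the trailing 3 elements.
print(split_round(7, 1))  # → (4, 3)
# Five i16 loads (10 bytes): keep 8 bytes = 4 elements, split off 1.
print(split_round(5, 2))  # → (4, 1)
```

Which is indeed a "split into round size" rather than a bisection, matching the naming complaint above.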
Mostly address review comments
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
- Line 39: I added this as well; I think both might be useful.
- Line 108: Then you end up with the macro functions adding another "Pass" to the name, so you end up with initializeLoadStoreVectorizerPassPass etc., so it's better to not include Pass in the class name.
- Line 169: Better, use a range loop.
- Line 209: I think this is to fix cases like trying to merge i1 into i8. This was already crashing earlier, though, so I'm just having sub-byte types never considered.
- Line 287: The comments are already on the declarations. Usually for private functions like this I would put them on the body. Should I move them?
- Line 326: It's run after each vectorization. I'm guessing the additional uses might somehow interfere with vectorization of other chains? This can be investigated later.
- Line 500: I'm not sure why this one is even needed. The full chain is already printed out right above this by the "LSV: Loads to vectorize" DEBUG.
- Line 509: Can't do that; it needs to be reset after the j loop.
- Line 538: I'm more likely to break something trying to refactor this compared to all of the style fixes, so I'll leave looking at that for later.
- Line 754: I had another test, which I apparently forgot to include, that shows the alias insertion order behavior. I've also added another which shows the insertion point. With the insertion point as Last, the load/store/load/store case is correct and vectorizes as […]. If I change the load insertion point to be before the first, it regresses and produces the incorrect […]. Maybe the comment is backwards?
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
- Line 89: Nit: "Legalizes" and "Breaks" would be consistent with the rest of the file. Also suggest s/vector half/piece/ -- "vector half" is kind of hard to parse.
- Line 97: This was marked as done, but it doesn't appear to have been done. Not a big deal if that's intentional, but pointing it out in case it wasn't.
- Line 191: Nit: would getOperandAddressSpace() or getPointerAddressSpace() be a better name? The address space itself isn't an operand of the instruction.
- Line 500: Well, they are sort of different... that one may never appear if we find no chains or whatever. But I take your point that this one may not be so useful. FWIW, some additional well-placed log messages might be helpful; I was debugging with a coworker a few days ago why some instructions weren't being vectorized, and it required gdb. But we can also easily go back and add these later.
- Line 509: Ah, I see. I clearly have no idea what this loop does. :)
- Line 754: I can't manage to convince myself that those checks are correct if we do anything but what the comments say. In the test, I see that you have

    %ld.c = load double, double addrspace(1)* %c, align 8             ; may alias store to %a
    store double 0.0, double addrspace(1)* %a, align 8
    %ld.c.idx.1 = load double, double addrspace(1)* %c.idx.1, align 8 ; may alias store to %a
    store double 0.0, double addrspace(1)* %a.idx.1, align 8

  If the comments are correct and both loads may alias the first store, isn't transforming this into "store; load v2; store" (what the test is checking for) unsafe? I grant that vectorizing both the loads and the stores is also unsafe.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
- Line 510: I'm confused why that is the case. Each i iteration sets ConsecutiveChain[i] = -1, and all read/write accesses in the j loop are on ConsecutiveChain[i]. Initializing all of ConsecutiveChain to -1 before the i loop should have equivalent behavior. Could you clarify what I'm missing?
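To make the initialization question above concrete, here is a deliberately simplified model of the consecutive-chain search (hypothetical Python with unit-size elements; not the actual loop from the patch). Under the stated assumption that each i iteration only reads and writes its own slot, hoisting the -1 initialization out of the i loop is behavior-preserving:

```python
def build_consecutive_chain(offsets):
    """For each access i, record the index j of the access that starts
    exactly one unit after it, or -1 if there is none."""
    n = len(offsets)
    chain = [-1] * n  # initialized once up front, per the question above
    for i in range(n):
        for j in range(n):
            if i != j and offsets[j] == offsets[i] + 1:
                chain[i] = j
    return chain

# Accesses at offsets 0, 1, 2 chain as 0 -> 1 -> 2:
print(build_consecutive_chain([0, 1, 2]))  # → [1, 2, -1]
```

Whether the real loop has an extra dependence that defeats this (as the author's earlier reply suggests) is exactly what the comment above is asking.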
As the author of some of that code, it is quite possible the code is just not well-written and can be done better ;-)
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
- Line 305: Talking to asbirlea, we are not convinced this is safe. In particular, we cannot safely reorder across instructions that may read or may write arbitrary memory, nor can we safely reorder loads and stores that are not in the chain but which nonetheless may alias each other.

  I think we're safe wrt instructions which may read or write, because these will never appear between a chain's first and last instructions. (I would feel much better if we had an assert to this effect, though.) But I am not convinced we're safe wrt regular loads and stores whose addresses depend on the result of a vectorized load. For example, imagine we have

    b = load a
    load [b]   // b does not alias a
    store [b]
    c = load a+1

  Suppose that we choose to vectorize the first and last load, and WLOG assume we insert the vectorized load at the position of c:

    load [b]
    store [b]
    {b, c} = load {a, a+1}

  Now we call reorder() to fix this up. The first two instructions are going to be moved below the vectorized load in the reverse order of I->uses(). AFAICT there is no guarantee on the order of uses(), but even if there were, outputting in the opposite order seems probably not right.

  Even if that were somehow right, AFAICT we can make reorder() do arbitrary, unsafe reorderings of loads/stores with the following idiom:

    b = load a
    b1 = b + 0  // copy b to b1
    b2 = b + 0  // copy b to b2
    load [b1]   // b1 does not alias a
    store [b2]  // b2 does not alias a
    c = load a+1

  Now if the order of b1 and b2 in the source determines their order in users(), then that controls the order of "load [b1]" and "store [b2]" in the result, even though there's only one correct ordering of that load and store.

  So to summarize, it may be my ignorance, but I don't see why we can rely on there being any particular order to users(). But even if users() has some guaranteed order, I don't see how any ordering could guarantee that we never reorder an aliasing load/store pair.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
- Line 305: Tried testing this out using jlebar's first example.

  Input:

    %struct.buffer_t = type { i64, i8*, [4 x i32], [4 x i32], [4 x i32], i32, i8, i8, [2 x i8] }

    define i32 @test(%struct.buffer_t* noalias %buff) #0 {
    entry:
      %tmp1 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 1
      %buff.host = load i8*, i8** %tmp1, align 8
      %buff.val = load i8, i8* %buff.host, align 8
      store i8 0, i8* %buff.host, align 8
      %tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 0
      %buff.dev = load i64, i64* %tmp0, align 8
      ret i32 0
    }

    attributes #0 = { nounwind }

  Output:

    %struct.buffer_t = type { i64, i8*, [4 x i32], [4 x i32], [4 x i32], i32, i8, i8, [2 x i8] }

    ; Function Attrs: nounwind
    define i32 @test(%struct.buffer_t* noalias %buff) #0 {
    entry:
      %tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 0
      %0 = bitcast i64* %tmp0 to <2 x i64>*
      %1 = load <2 x i64>, <2 x i64>* %0, align 8
      %2 = extractelement <2 x i64> %1, i32 0
      %3 = extractelement <2 x i64> %1, i32 1
      %4 = inttoptr i64 %3 to i8*
      store i8 0, i8* %4, align 8
      %buff.val = load i8, i8* %4, align 8
      ret i32 0
    }

    attributes #0 = { nounwind }

  The first and last loads got vectorized and moved to the beginning, but the uses that now follow are in reversed order. The input had load [b], store [b]; the output has store [b], load [b].
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
- Line 289: Nit: could the Instruction instance "User" be renamed to something different from the class name? It makes it somewhat harder to follow.
If we were to always insert vectorized loads at the location of the first load and place stores at the location of the last store, I actually don't see why we'd need reorder() at all.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
- Line 305: FWIW, the example above can be fixed by keeping a temporary of the last instruction inserted. For example:

    Instruction *InsertAfter = I;
    for (User *U : I->users())
      if (Instruction *User = dyn_cast<Instruction>(U))
        if (!DT.dominates(I, User) && User->getOpcode() != Instruction::PHI) {
          User->removeFromParent();
          User->insertAfter(InsertAfter);
          reorder(User);
          InsertAfter = User;
        }

  However, this does not account for the recursive inserts, AND it assumes the users() are in BB order. A correct but very iffy way: 1. create a set (include the recursive call here) of all (transitive) users; 2. make a pass over all instructions in the BB: if an instruction is in the collected set, insert it after the latest inserted instruction (as in the above sample); or just go over the BB in reverse order. I would favor avoiding the reorder entirely over having a pass over the BB at every reorder.
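The single-pass scheme suggested above can be modeled as follows (hypothetical Python treating instructions as opaque names; the real pass works on LLVM BasicBlocks). Displaced users are re-emitted after the producer in their original program order, so an aliasing load/store pair keeps its relative order:

```python
def reorder_users(block, producer, users):
    """Move every user that currently precedes `producer` to just after it,
    preserving the users' original relative order (a single pass)."""
    prod_idx = block.index(producer)
    displaced = [x for x in block[:prod_idx] if x in users]
    out = []
    for inst in block:
        if inst in displaced:
            continue
        out.append(inst)
        if inst == producer:
            out.extend(displaced)  # original order kept: load stays before store
    return out

# jlebar's example: the aliasing load/store pair must keep its order
# after being moved below the vectorized load.
print(reorder_users(["load_b", "store_b", "vec_load"], "vec_load",
                    {"load_b", "store_b"}))
# → ['vec_load', 'load_b', 'store_b']
```

This only models a single block and non-transitive users; the recursive case discussed above would need the set-then-sweep approach.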
Here's a test-case that currently produces incorrect results on my end.
```
target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnueabihf"

define i32 @test(i32* noalias %ptr) {
entry:
  br label %"for something"

"for something":
  %index = phi i64 [ 0, %entry ], [ %index.next, %"for something" ]
  %next.gep = getelementptr i32, i32* %ptr, i64 %index
  %a1 = add nsw i64 %index, 1
  %next.gep1 = getelementptr i32, i32* %ptr, i64 %a1
  %a2 = add nsw i64 %index, 2
  %next.gep2 = getelementptr i32, i32* %ptr, i64 %a2
  %l1 = load i32, i32* %next.gep1, align 4
  %l2 = load i32, i32* %next.gep, align 4
  store i32 0, i32* %next.gep1, align 4
  store i32 0, i32* %next.gep, align 4
  %l3 = load i32, i32* %next.gep1, align 4
  %l4 = load i32, i32* %next.gep2, align 4
  %index.next = add i64 %index, 8
  %cmp_res = icmp eq i64 %index.next, 8
  br i1 %cmp_res, label %ending, label %"for something"

ending:
  ret i32 0
}
```
The pass does not treat the stores as aliasing the loads, and it leaves the stores as the first instructions.
Output I see:
```
%0 = bitcast i32* %next.gep to <2 x i32>*
store <2 x i32> zeroinitializer, <2 x i32>* %0, align 4
%1 = bitcast i32* %next.gep to <2 x i32>*
%2 = load <2 x i32>, <2 x i32>* %1, align 4
%3 = extractelement <2 x i32> %2, i32 0
%4 = extractelement <2 x i32> %2, i32 1
%5 = bitcast i32* %next.gep1 to <2 x i32>*
%6 = load <2 x i32>, <2 x i32>* %5, align 4
%7 = extractelement <2 x i32> %6, i32 0
%8 = extractelement <2 x i32> %6, i32 1
```
Is there a follow-up patch fixing this that I missed?
Example in the previous comment should produce correct code with these changes.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
- Line 323: This should be "LastInstr = ++I;" (using the updated patch). Fixes an issue where isVectorizable misses testing the last instruction in the chain.
- Line 377: Consider asserting Chain.size() == ChainInstrs.size()?
- Line 391: As loads are inserted at the location of the last load, either invert the sign (VVIdx > VIdx) and update the comment, or fix the load insert location.
- Line 395: Same as above.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
- Line 539: As I understand it, this is the condition for finding the chain of maximum length. The check that Head cannot be found in Tails means it's the beginning of a chain, hence one of the longest chains. However, if the longest chain fails to vectorize, this same check prevents any vectorization of the remaining chain. Here's a suggestion to address vectorizing the chain suffix:

    for (int i = 0; i < Heads.size(); i++) {
      unsigned Head = Heads[i];
      if (VectorizedValues.count(Instrs[Head]))
        continue;
      // Skip if a longer chain exists in the remaining Heads/Tails
      for (int j = i+1; j < Tails.size(); j++)
        if (Head == Tails[j])
          continue;

  Feel free to add additional improvements for vectorization of a chain prefix.
After discussion on IRC, I'm OK taking this as-is, with an explicit understanding that there are existing changes that need to be made here, so future patches will not have the same bias towards existing code that we might normally have. Collaborating with the code living out-of-tree is just too painful.
Alina has fixes to the correctness issues in her patch queue, and one of us will go back and ensure that all of my comments above (particularly about the ConsecutiveChain algorithm) are addressed.
lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
- Line 391: Yes. I have an updated version of the current patch that fixes this and adds a couple of unit tests.
Should the VecRegSize be pulled from TTI (at least as a default)?