Page MenuHomePhabricator

Add LoadStoreVectorizer pass

Authored by arsenm on Apr 25 2016, 3:11 PM.



This was contributed by Apple, and I've been working on
minimal cleanups and generalizing it.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
mzolotukhin added inline comments.Jun 9 2016, 6:43 PM

Minor: I'd rather hardcode 128 for now, just to make the pass signature similar to existing passes (no other pass accepts an argument in create...Pass I think). But it probably doesn't matter as you're going to fix it in follow-ups anyway.


I think the second line should be indented here.


Bikeshed: I'd rename Vectorizer to LoadStoreVectorizer and LoadStoreVectorizer to LoadStoreVectorizerPass os something like this.


Should we bail out on this for all potential targets? E.g. why is it a problem for, e.g., vectorizing integer stores on x86 using movdqa integer instructions?


Some comments would be helpful here (and for other functions).


Range loop?


Might be a good idea to parametrize 64.


Any chance it can be somehow combined with vectorizeStoreChain? They look so similar..


Is there something special about 4? Can we add a define/const for it?

jlebar added a comment.EditedJun 10 2016, 11:59 AM

Not done going through this, but I tried your branch at, and I'm still ICE'ing when compiling thrust.

$ git clone
$ cd thrust
$ CLANG_PATH=/abs/path/to/dir/containing/clang/binary scons arch=sm_35 Wall=no Werror=no mode=release cuda_compiler=clang std=c++11 cuda_path=/usr/local/cuda-7.0 -j48 unit_tests

Range types must match instruction type!
  %wide.load = load <16 x i8>, <16 x i8>* %60, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544 = load <16 x i8>, <16 x i8>* %62, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.1 = load <16 x i8>, <16 x i8>* %66, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.1 = load <16 x i8>, <16 x i8>* %68, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.2 = load <16 x i8>, <16 x i8>* %72, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.2 = load <16 x i8>, <16 x i8>* %74, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.3 = load <16 x i8>, <16 x i8>* %78, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.3 = load <16 x i8>, <16 x i8>* %80, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load.epil = load <16 x i8>, <16 x i8>* %84, align 1, !tbaa !61, !range !63, !alias.scope !64
Range types must match instruction type!
  %wide.load544.epil = load <16 x i8>, <16 x i8>* %86, align 1, !tbaa !61, !range !63, !alias.scope !64
fatal error: error in backend: Broken function found, compilation aborted!

Interestingly, this is happening during *host* compilation. It happens even if I don't add a patch to enable the load/store vectorizer for nvptx.

Smaller steps to reproduce:

$ curl | opt -O2
(same ICE)

Nit, number of scalar instructions removed might be a more interesting metric, because it's notable whether we manage to vectorize in groups of 2 vs 4.


clang-format formats it for me as-is.

arsenm added inline comments.Jun 10 2016, 12:49 PM

I don't really know what this attribute does, it was just there


The basic type alignment on GPUs? It's on my todo list to make this check allowsMemoryAccess instead

suyog added a subscriber: suyog.Jun 10 2016, 1:23 PM

Looks pretty good to me; mostly nits / comment suggestions, but I have one nontrivial concern about how we ensure that we always build the longest vectors we can.


Nit, if this is all the comment says, do we need it? It seems just to repeat what's already in the function name.


Nit, "Reorders" would be consistent with the rest of the comments in this file. Also, do you mean s/user/users/?


It's not obvious to me how this function reorders the users of I without looking at the source. Something like "Reorders instructions to ensure that I dominates all of its users." or something would have helped me.


Perhaps, "Checks if there are any instructions which may affect the value returned by a load/store in Chain between From and To." Since this function checks not just for aliasing memory instructions, but also side-effecting instructions.


Probably worth indicating that you expect all of the elements of Map to be loads, or all of them to be stores.


Suggest "Finds loads/stores to consecutive memory addresses and vectorizes them." Otherwise it can be read as saying that the instructions themselves must appear consecutively in the BB. Probably also worth indicating that all elements of Instrs must be loads, or all must be stores.


It came from r187340, back in 2013, by Nadav Rotem <>. There's no additional information in the patch.

AFAICT the intent is specifically not to vectorize loads/stores of fp types when noimplicitfloat is set. Which sort of makes sense, based on the tiny information I can glean about noimplicitfloat.


s/I/BBI/ would have made this a lot more obvious to me, since we use I for values/instructions most everywhere else.


It's not clear to me why we need this line, although all the cases I can think of where this is false seem quite bizarre, so maybe it doesn't matter.


Suggest s/Otherwise// -- it's not clear that this is contrasting with "Check if they are based on the same pointer."


Nit, suggest s/proving/checking/ -- "proving" is kind of a strong term for what we do, sort of implies it's harder than it is (so it's confusing when I see it's so simple).


Suggest something like

// Only look through ZExt/SExt

to make it clearer why we're doing an early return.


Nit, consider llvm::is_contained()


Why this special case? This isn't as strong as a general DCE, so it seems to me like it gives us a false sense of security? i.e. should we just say "run DCE after this pass"?


s/ElementSize/ElementSizeBits/? (I don't care as much what the local vars here are called, but if the arg is called ElementSize, I might reasonably pass a value in bytes.)


I'm not sure "bisect" is quite the right word for this? I expected this to split Chain in half, but it looks like it actually splits off the rightmost 1, 2, or 3 bytes' worth of elements, so that the left part has a size that's a multiple of 4 bytes (ish). I might call that "splitIntoRoundSize" or something.

A postcondition assert() would probably also help clarify this.


Consider llvm::if_contains


suggest "(the vectorized load is inserted at the location of the first load in the chain)."


Not GetPointerOperand?


To me this would be a lot more clear if we returned the relevant maps (or, I suppose, explicitly took them as params). Otherwise it's not clear from the signature (or the comments) where we collect the instructions into.


Consider llvm::all_of(LI->users(), ...);


Nit, It might be helpful to print out at least the first instruction in the chain. At least, that was helpful to me when I was debugging it.


Can we move this closer to where it's used?


Nit, we know that Instrs is <= 64 elements, so could we just make ConsecutiveChain an array of signed ints? Then we could use -1 instead of ~0U (which implied to me that we were going to do some bit twiddling below).

Also I think you could initialize the array with

int ConsecutiveChain[64] = {-1};

which would be clearer than setting each element in the loop, imo.


I cannot for the life of me figure out what invariant you're trying to preserve with these checks. I guess a comment is in order. :)

(I get that we prefer to vectorize chains of scalar instructions that appear in order (j > i), but I can't figure out what CurDistance and NewDistance are checking. It would make sense if you wanted to prefer chains of scalar instructions that are close to each other, but then NewDistance should be abs(j - i), right?)


Not sure, but this might be clearer if you simply used a "continue" here instead of a flag variable.


If ConsecutiveChain was not -1, don't we need to remove ConsecutiveChain[i] from Tails? (Maybe it would be better to build Heads and Tails, or at least Tails, in a separate loop, over ConsecutiveChain, so we don't have to worry about the case where one instr is a tail for two heads and then gets overwritten just one time.)


The intent here is to build the largest vectors we can, right? (Or at least, we want to build up vectors of the max vector width, if possible? Beyond that doesn't make a difference.)

I don't see how we ensure this happens, unless there is some implicit requirement on the order of Instrs that I'm missing. For example, if the elements of Instrs are

  • load p[3]
  • load p[2]
  • load p[1]
  • load p[0]

then it seems that the first element of Heads will be "load p[2]", implying a two-wide vector load. I don't see what here would cause us to wait on this one and do the four-wide load starting at p[0].

Sorry if I'm missing something obvious.


Actually, I'm not sure why we need the (Tails.count(I) || Heads.count(I)) part of this predicate, and therefore I'm not sure why we need Tails at all. That probably deserves a comment at least.


Hm, looks like an additional constraint on bisect() is that the array passed should have more than 4 elements -- probably worth saying that somewhere.


If you like, you could write this as

BasicBlock::iterator First, Last;
std::tie(First, Last) = getBoundaryInstrs(Chain);

Same for the other call to getBoundaryInstrs.


A correctness-critical assumption above is that

the loads are inserted at the location of the first load

but here it looks like the opposite?


Nit, I'd prefer to move this into the if and else blocks, so it's clear that we don't use it elsewhere.

arsenm updated this revision to Diff 61062.Jun 16 2016, 9:12 PM
arsenm edited edge metadata.
arsenm marked 26 inline comments as done.

Mostly address review comments


I added this as well, I think both might be useful


Then you end up with the macro functions adding another "Pass" to the name, so you end up with initializeLoadStoreVectorizerPassPass etc., so it's better to not include Pass in the class name.


Better, use a range loop


I think this is to fix cases like trying to merge i1 into i8. This was already crashing earlier though, so I'm just having sub-byte types never considered.


The comments are already on the declarations. Usually for private functions like this I would put them on the body. Should I move them?


It's run after each vectorization. I'm guessing the additional uses might somehow interfere with vectorization of other chains? This can be investigated later.


I'm not sure why this one is even needed. The full chain is already printed out right above this by the "LSV: Loads to vectorize" DEBUG


Can't do that. it needs to be reset after the j loop


I'm more likely to break something trying to refactor this compared to all of the style fixes, so I'll leave looking at that for later.


I had another test which I apparently forgot to include which shows the alias insertion order behavior. I've also added another which shows the insertion point.

With the insertion point as Last, the load store load store case is correct and vectorizes as
load v2

If I change the land insertion point to be before the first, it regresses and produces the incorrect
load v2
store v2

Maybe the comment is backwards?

jlebar added inline comments.Jun 17 2016, 9:31 AM

Nit, "Legalizes" and "Breaks" would be consistent with the rest of the file. Also suggest s/vector half/piece/ -- "vector half" is kind of hard to parse.


This was marked as done, but it doesn't appear to have been done. Not a big deal if that's intentional, but pointing it out in case it wasn't.

Probably worth indicating that you expect all of the elements of Map to be loads, or all of them to be stores.


Nit, would getOperandAddressSpace() or getPointerAddressSpace() be a better name? The address space itself isn't an operand of the instruction.


Well, they are sort of different...that one may never appear if we find no chains or whatever. But I take your point that this one may not be so useful. FWIW some additional well-placed log messages might be helpful; I was debugging with a coworker a few days ago why some instructions weren't being vectorized, and it required gdb. But we can also easily go back and add these later.


Ah, I see. I clearly have no idea what this loop does. :)


Maybe the comment is backwards?

I can't manage to convince myself that those checks are correct if we do anything but what the comments say.

In the test, I see that you have

%ld.c = load double, double addrspace(1)* %c, align 8 ; may alias store to %a
store double 0.0, double addrspace(1)* %a, align 8
%ld.c.idx.1 = load double, double addrspace(1)* %c.idx.1, align 8 ; may alias store to %a
store double 0.0, double addrspace(1)* %a.idx.1, align 8

If the comments are correct and both loads may alias the first store, isn't transforming this into "store; load v2; store" (what the test is checking for) unsafe? I grant that vectorizing both the loads and the stores is also unsafe.

asbirlea added inline comments.Jun 17 2016, 9:40 AM

I'm confused why that is the case. Each i iteration sets ConsecutiveChain[i]=-1 and all read/write accesses in the j loop are on ConsecutiveChain[i]. Initializing all ConsecutiveChain to -1 before the i loop should have the equivalent behavior. Could you clarify what I'm missing?

escha added a comment.Jun 17 2016, 9:53 AM

As the author of some of that code, it is quite possible the code is just not well-written and can be done better ;-)

jlebar added inline comments.Jun 17 2016, 1:23 PM

Talking to asbirlea, we are not convinced this is safe. In particular, we cannot safely reorder across instructions that may read or may write arbitrary memory, nor can we safely reorder loads and stores that are not in the chain but which nonetheless may alias each other.

I think we're safe wrt instructions which may read or write, because these will never appear between a chain's first and last instructions. (I would feel much better if we had an assert to this effect, though.)

But I am not convinced we're safe wrt regular loads and stores whose addresses depend on the result of a vectorized load. For example, imagine we have

b = load a
load [b]  // b does not alias a
store [b]
c = load a+1

Suppose that we choose to vectorize the first and last load and WLOG assume we insert the vectorized load at the position of c:

load [b]
store [b]
{b, c} = load {a, a+1}

Now we call reorder() to fix this up. The first two instructions are going to be moved below the vectorized load in the reverse order of I->uses(). AFAICT there is no guarantee on the order of uses(), but even if there were, outputting in the opposite order seems probably not right?

Even if that were somehow right, afaict we can make reorder() do arbitrary, unsafe reorderings of loads/stores with the following idiom.

b = load a
b1 = b + 0 // copy b to b1
b2 = b + 0 // copy b to b2
load [b1]  // b1 does not alias a
store [b2]  // b2 does not alias a
c = load a+1

Now if the order of b1 and b2 in the source determines their order in users(), then that controls the order of "load [b1]" and "store [b2]" in the result, even though there's only one correct ordering of that load and store.

So to summarize, it may be my ignorance, but I don't see why we can rely on there being any particular order to users(). But even if users() has some guaranteed order, I don't see how any ordering could guarantee that we never reorder an aliasing load/store pair.

asbirlea added inline comments.Jun 17 2016, 2:29 PM

Tried testing this out using jlebar's first example.


%struct.buffer_t = type { i64, i8*, [4 x i32], [4 x i32], [4 x i32], i32, i8, i8, [2 x i8] }       
define i32 @test(%struct.buffer_t* noalias %buff) #0 {                                             
  %tmp1 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 1     = load i8*, i8** %tmp1, align 8                                                       
  %buff.val = load i8, i8*, align 8                                                     
  store i8 0, i8*, align 8                                                              
  %tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 0     = load i64, i64* %tmp0, align 8                                                        
  ret i32 0
attributes #0 = { nounwind }


%struct.buffer_t = type { i64, i8*, [4 x i32], [4 x i32], [4 x i32], i32, i8, i8, [2 x i8] }       
; Function Attrs: nounwind
define i32 @test(%struct.buffer_t* noalias %buff) #0 {                                             
  %tmp0 = getelementptr inbounds %struct.buffer_t, %struct.buffer_t* %buff, i64 0, i32 0           
  %0 = bitcast i64* %tmp0 to <2 x i64>*                                                            
  %1 = load <2 x i64>, <2 x i64>* %0, align 8                                                      
  %2 = extractelement <2 x i64> %1, i32 0
  %3 = extractelement <2 x i64> %1, i32 1                                                          
  %4 = inttoptr i64 %3 to i8*                                                                      
  store i8 0, i8* %4, align 8
  %buff.val = load i8, i8* %4, align 8                                                             
  ret i32 0
attributes #0 = { nounwind }

The first and last load got vectorized and moved at the beginning, but the uses that now follow are in reversed order.

The input had load [b], store [b]. The output has store[b], load[b].

asbirlea added inline comments.Jun 17 2016, 3:20 PM

Nit: Could the Instruction instance "User" be renamed to something different than a class name? It makes it somewhat harder to follow.

If we were to always insert vectorized loads at the location of the first load and place stores at the location of the last store, I actually don't see why we'd need reorder() at all.

asbirlea added inline comments.Jun 20 2016, 1:42 PM

FWIW, the example above can be fixed by keeping a temporary of the last instruction inserted. For example:

Instruction* InsertAfter=I;
for (User *U : I->users())                                                                       
    if (Instruction *User = dyn_cast<Instruction>(U))                                              
      if (!DT.dominates(I, User) && User->getOpcode() != Instruction::PHI) {                       
        InsertAfter = User;                                                                        

However, this does not account for the recursive inserts AND it assumes the users() are in BB order.

A correct and very iffy way: 1. create a set (include the recursive call here) of all (transitive) users, 2. make a pass over all instructions in BB; if instruction is in the collected set insert it after the latest inserted instruction (as in above sample); or just go over the BB in reverse order.

I would favor avoiding the reorder entirely than having a pass over the BB at every reorder..

Here's a test-case that currently produces incorrect results on my end.

target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnueabihf"

define i32 @test(i32* noalias %ptr) {
  br label %"for something"

"for something":
  %index = phi i64 [ 0, %entry ], [, %"for something" ]
  %next.gep = getelementptr i32, i32* %ptr, i64 %index
  %a1 = add nsw i64 %index, 1
  %next.gep1 = getelementptr i32, i32* %ptr, i64 %a1
  %a2 = add nsw i64 %index, 2
  %next.gep2 = getelementptr i32, i32* %ptr, i64 %a2

  %l1 = load i32, i32* %next.gep1, align 4                                                         
  %l2 = load i32, i32* %next.gep, align 4                                                          
  store i32 0, i32* %next.gep1, align 4                                                            
  store i32 0, i32* %next.gep, align 4
  %l3 = load i32, i32* %next.gep1, align 4                                                         
  %l4 = load i32, i32* %next.gep2, align 4 = add i64 %index, 8
  %cmp_res = icmp eq i64, 8
  br i1 %cmp_res, label %ending, label %"for something"

  ret i32 0

It does not take into account the stores as aliasing with the loads, and leaves the stores as the first instructions.
Output I see:

%0 = bitcast i32* %next.gep to <2 x i32>*                                                        
store <2 x i32> zeroinitializer, <2 x i32>* %0, align 4                                          
%1 = bitcast i32* %next.gep to <2 x i32>*                                                        
%2 = load <2 x i32>, <2 x i32>* %1, align 4                                                      
%3 = extractelement <2 x i32> %2, i32 0
%4 = extractelement <2 x i32> %2, i32 1                                                          
%5 = bitcast i32* %next.gep1 to <2 x i32>*                                                       
%6 = load <2 x i32>, <2 x i32>* %5, align 4                                                      
%7 = extractelement <2 x i32> %6, i32 0
%8 = extractelement <2 x i32> %6, i32 1

Is there a follow-up patch fixing this that I missed?

Example in the previous comment should produce correct code with these changes.


This should be: LastInstr = ++I; (using updated patch)
It's always valid: either uses the next valid instruction or BB.end().

Fixes issue that isVectorizable misses testing the last instruction in the chain.


Consider asserting Chain.size() == ChainInstrs.size()?


As loads are inserted at the location of the last load, either invert sign (VVIdx > VIdx) and update comment, or fix load insert location.
The former misses vectorization opportunities, but it avoids (at least some of the) incorrect code generation.


Same as above.

asbirlea added inline comments.Jun 23 2016, 5:15 PM

As I understand it, this is the condition for finding the chain of maximum length. The check that Head cannot be found in Tails, means it's the beginning of a chain, hence one of the longest chains. However, if the longest chain fails to vectorize, this same check prevents any vectorization of the remaining chain.

Here's a suggestion to address vectorizing the chain suffix:

for (int i = 0; i < Heads.size(); i++) {
    unsigned Head = Heads[i];
    if (VectorizedValues.count(Instrs[Head]))
    // Skip if a longer chain exists in the remaining Heads/Tails
    for (int j = i+1; j < Tails.size(); j++)
      if (Head == Tails[j])

Feel free to add additional improvements for vectorization of a chain prefix.

jlebar accepted this revision.Jun 28 2016, 4:03 PM
jlebar added a reviewer: jlebar.

After discussion on IRC, I'm OK taking this as-is, with an explicit understanding that there are existing changes that need to be made here, so future patches will not have the same bias towards existing code that we might normally have. Collaborating with the code living out-of-tree is just too painful.

Alina has fixes to the correctness issues in her patch queue, and one of us will go back and ensure that all of my comments above (particularly about the ConsecutiveChain algorithm) are addressed.

This revision is now accepted and ready to land.Jun 28 2016, 4:03 PM
arsenm marked 2 inline comments as done.Jun 30 2016, 11:28 AM
arsenm added inline comments.

I put it on vectorizeChains. I can add it here too


Leaving for future patch


This needs to go with the ++I patch


Are you going to fix this / do you have test cases for this and below?

asbirlea added inline comments.Jun 30 2016, 11:32 AM

Yes. I have an updated version of the current patch that fixes this and adds couple of unit tests.

arsenm closed this revision.Jun 30 2016, 4:18 PM
arsenm marked an inline comment as done.