This is an archive of the discontinued LLVM Phabricator instance.

Differential D6675

[SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads
Needs Review · Public

Authored by suyog on Dec 15 2014, 9:22 PM.

Details

Reviewers

nadav
aschwaighofer

Summary

This patch is an enhancement to r224119, which vectorizes horizontal reductions from consecutive loads.

Earlier, in r224119, we handled the tree:

             +
           /    \
         /       \
       +         +
      /  \       /  \
    /     \     /    \
a[0]  a[1] a[2] a[3]

where originally, we had

Left              Right
a[0]              a[1]
a[2]              a[3]

In r224119, we compared (Left[i], Right[i]) and (Right[i], Left[i+1]):

Left        Right
a[0] ---> a[1]
           /             
          /                                                        
         /
       \/
a[2]       a[3]

and then rearranged them to

Left        Right
a[0]        a[2]
a[1]        a[3]

so that we can bundle Left and Right into vectors of loads.

However, with a bigger tree,

                 + 
               /    \ 
             /       \ 
           /          \
          /            \
         +              +
       /   \            /  \
      /     \          /    \
     /       \        /      \
   +        +      +       +
  /  \      /  \    /  \     /  \
0    1   2   3  4   5  6   7
 
                            
   Left              Right
    0                  1
    4                  5
    2                  3
    6                  7

In this case, the comparison of Right[i] and Left[i+1] fails, and the code remains scalar.

If we drop the comparison of Right[i] with Left[i+1] and compare only Left[i] with Right[i],
we are able to rearrange Left and Right into:

Left               Right
 0                    4
 1                    5
 2                    6
 3                    7

We would then bundle (0,1), (4,5) and (2,3), (6,7) into vector loads,
and then have vector adds of (01, 45) and (23, 67).

However, notice that this changes the order of the additions.
Originally, (01) and (23) would have been added together, and likewise (45) and (67).
For integer addition this does not create any issue, but for other
data types with precision concerns, there might be a problem.
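
The precision concern can be illustrated with a small self-contained sketch. The values below are chosen for the example and do not come from the patch, and `unsigned` is used for the integer case so that wrap-around is well defined in C++.

```cpp
#include <cassert>

// Original association: (a0 + a1) + (a2 + a3).
template <typename T> T sumOriginal(T A0, T A1, T A2, T A3) {
  volatile T L = A0 + A1; // volatile forces each partial sum to be rounded to T
  volatile T R = A2 + A3;
  return L + R;
}

// Association after bundling {a0, a1} and {a2, a3}: (a0 + a2) + (a1 + a3).
template <typename T> T sumReordered(T A0, T A1, T A2, T A3) {
  volatile T L = A0 + A2;
  volatile T R = A1 + A3;
  return L + R;
}

void demoReassociation() {
  // float: 1e8f + 1.0f rounds back to 1e8f, so the two groupings differ.
  assert(sumOriginal(1e8f, 1.0f, -1e8f, 1.0f) == 0.0f);
  assert(sumReordered(1e8f, 1.0f, -1e8f, 1.0f) == 2.0f);
  // unsigned: wrap-around addition is associative, so both groupings agree.
  assert(sumOriginal(4000000000u, 1u, 4000000000u, 1u) ==
         sumReordered(4000000000u, 1u, 4000000000u, 1u));
}
```

Calling `demoReassociation()` passes on a target with IEEE-754 single-precision `float`, which is the assumption behind the chosen values.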

-ffast-math would have eliminated this precision concern, but it would also have
re-associated the tree itself into (+(+(+(+(0,1)2)3....)

Hence, in this patch we check for integer types and only then skip
the extra comparison of (Right[i], Left[i+1]).

With this patch, we now vectorize the above type of tree for any length of consecutive loads
of integer type.
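
The reordering described in this summary can be modelled on plain integers. This is only an illustrative sketch, not the patch itself: each load a[k] is represented by its index k, `isConsecutive` is a hypothetical stand-in for `isConsecutiveAccess`, and the integer-type check is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for isConsecutiveAccess(): load a[B] directly
// follows load a[A] when B == A + 1.
bool isConsecutive(int A, int B) { return B == A + 1; }

// Model of the reordering loop: whenever Left[i] and Right[i] are
// consecutive, exchange Right[i] with Left[i + 1].
void reorderIfConsecutive(std::vector<int> &Left, std::vector<int> &Right) {
  assert(Left.size() == Right.size() && "operand lists must match in size");
  for (std::size_t i = 0; i + 1 < Left.size(); ++i)
    if (isConsecutive(Left[i], Right[i]))
      std::swap(Left[i + 1], Right[i]);
}
```

Starting from Left = {0, 4, 2, 6} and Right = {1, 5, 3, 7}, the loop swaps at i = 0 and i = 2 and leaves Left = {0, 1, 2, 3} and Right = {4, 5, 6, 7}, which is the rearranged table shown in the summary.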

For the test case:

#include <arm_neon.h>
int hadd(int* a){
    return (a[0] + a[1]) + (a[2] + a[3]) + (a[4] + a[5]) + (a[6] + a[7]);
}

AArch64 assembly before this patch:

ldp     w8, w9, [x0]
ldp     w10, w11, [x0, #8]
ldp     w12, w13, [x0, #16]
ldp     w14, w15, [x0, #24]
add     w8, w8, w9
add     w9, w10, w11
add     w10, w12, w13
add     w11, w14, w15
add     w8, w8, w9
add     w9, w10, w11
add     w0, w8, w9
ret

AArch64 assembly after this patch:

ldp     d0, d1, [x0]
ldp     d2, d3, [x0, #16]
add     v0.2s, v0.2s, v2.2s
add     v1.2s, v1.2s, v3.2s
add     v0.2s, v0.2s, v1.2s
fmov    w8, s0
mov     w9, v0.s[1]
add     w0, w8, w9
ret

Please help in reviewing this patch. I have not run LNT yet, since this is just an enhancement
to r224119. I will update with LNT results if required.

Regards,
Suyog

Diff Detail

Repository: rL LLVM

Event Timeline

suyog updated this revision to Diff 17319. Dec 15 2014, 9:22 PM

suyog retitled this revision from to [SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads .

suyog updated this object.

suyog edited the test plan for this revision.

suyog added reviewers: nadav, aschwaighofer, jmolloy.

suyog set the repository for this revision to rL LLVM.

suyog added a subscriber: Unknown Object (MLST).

Herald added a subscriber: aemerson. Dec 15 2014, 9:22 PM

Hi James,

Thanks for the review.

Yes, I agree the generated code can be optimized further to use v0.4s (addition of 4 scalars at a time) instead of
v0.2s (addition of 2 scalars at a time). So, basically, we need to emit <4x> instead of <2x> vectors in the IR.

AFAIK, given the way we build the bottom-up tree in SLP (for the kind of tree I have
described in the patch description), when we try to bundle up loads for a vector binary operator, we always
bundle them in pairs of 2 loads.

For example (I am numbering the + operators for reference):

                 + 1
               /    \
             /       \
           /          \
          /            \
         + 2           + 3
       /   \            /  \
      /     \          /    \
     /       \        /      \
   + 4     + 5    + 6     + 7
  /  \      /  \    /  \     /  \
0    1   2   3  4   5  6   7

When visiting this tree, we encounter + 2 and + 3 and determine that they can be vectorized. We then visit the left operands of + 2 and + 3
and come to + 4 and + 6, which can again be vectorized. So we visit the Left and Right of + 4 and + 6. We now find that

Left -> a[0] and a[4]
Right -> a[1] and a[5]

With my patch, we re-arrange them as

Left -> a[0] and a[1]
Right -> a[4] and a[5].

We see that both Left and Right now hold consecutive loads and hence can be bundled into vector loads of <2x>.
Note that at this point we are unaware of the other loads and of + 5 and + 7. Hence, we do not emit <4x> vector loads.

This traversal of operators and operands was already in the existing code, and I didn't disturb that :).

Maybe we can add code to DAG combine to handle this kind of IR, so that if we encounter consecutive vector loads of 2 elements
each, we combine them into a vector load of 4 elements.

So basically, we need to reduce the tree :

                +   
            /       \
          /          \
        +            +
      /   \         /    \
 load   load    load  load
2x@0 2x@2  2x@4 2x@6

to something like :

      +
   /     \
  /       \ 
load     load
4x@0   4x@4

Feel free to correct my understanding :)
I am trying to solve this type of problem in incremental steps.

(The exciting thing is, as I was writing this mail, I got an idea for the above type of reduction,
where I can vectorize 2x loads into 4x loads :).
I need to check whether it already exists and come up with a patch if not.)

Regards,
Suyog

Original Message -------

Sender : James Molloy<james@jamesmolloy.co.uk>
Date : Dec 16, 2014 16:25 (GMT+09:00)
Title : Re: [PATCH] [SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads

Hi suyog,
This is a good improvement, thanks for working on it!

I'll take a closer look today, but for now I did notice that the generated AArch64 assembly isn't as optimal as it could be. I'd expect:

ldp q0, q1
add v0.4s, v0.4s, v1.4s
addv s0, v0.4s

Cheers,

James

Hi James,

I have no plans as of now to add an IR intrinsic for horizontal reductions; I would need to spend cycles understanding how it works and implementing it, and I am short of those :(

It would be interesting to see it, though.
Meanwhile, IMO we should try to exploit every available opportunity to vectorize (vector code that is 2 times faster is still better than scalar code, though 4 times faster is best) until we have an IR intrinsic framework for it. :)

Your suggestions are always valuable and welcome :).
Reviews of the attached patch are awaited :).

Regards,
Suyog

Original Message -------

Sender : James Molloy<james@jamesmolloy.co.uk>
Date : Dec 16, 2014 17:09 (GMT+09:00)
Title : Re: Re: [PATCH] [SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads

Hi suyog,

Yes, we could pattern match this at the DAG level, but it would be rather fragile, I think. My previous suggestion was to use IR-level intrinsics to model horizontal reductions, because that avoids the pattern matching and each backend could lower it into an efficient form.

As it's you rather than me doing this work, however, I won't push hard for it :)

Cheers,

James

Reverted r224119 because floating-point data types cannot be freely re-associated unless -ffast-math is specified.

Also modified this patch to exchange Left and Right only for integer data types.
Modified the test case to check vectorization for integer types and not for floating-point data types.

Please help in reviewing the patch.

Regards,
Suyog

Hi Suyog,

Thanks for working on this, please find comments from me inline.

Michael

lib/Transforms/Vectorize/SLPVectorizer.cpp
1244	No need for whitespace before the brace.
1838–1841	I think we should only swap if `Left[i]` and `Left[i+1]` are consecutive - that's the only case we get something from this reordering. In your current approach we might lose consecutiveness in `Right[i]` and `Right[i+1]` by swapping `Left[i+1]` and `Right[i]` even if later on `Left[i]` and `Left[i+1]` won't be consecutive.
2077	Redundant whitespace here too.

mzolotukhin added inline comments. Dec 19 2014, 3:54 PM

lib/Transforms/Vectorize/SLPVectorizer.cpp
444	s/Rorder/Reorder/

Hi Michael,

Thanks for the review. I will take care of the typos and the extra whitespace in a fresh upload.

For the logic part, please find my comments inline.

Your comments are awaited :)

lib/Transforms/Vectorize/SLPVectorizer.cpp
1838–1841	For consecutive loads, if Left[i] and Left[i+1] are consecutive, then we will never arrive at this check, since they are already consecutive and can be bundled into a vector.

Let's take an example:

return (a[0] + a[1]) + (a[2] + a[3])

The tree for this will be:

             +
           /    \
         /       \
       +         +
      /  \       /  \
    /     \     /    \
a[0]  a[1] a[2] a[3]

where

       Left    Right
i=0    a[0]    a[1]
i=1    a[2]    a[3]

(Please note the contents of Left and Right :). Seems confusing, but it's the way :) )

Here Left[0] = a[0] and Left[1] = a[2], which are not consecutive and hence cannot be bundled. Right[0] = a[1] and Right[1] = a[3], which are also not consecutive and hence cannot be bundled. But here, Left[i] and Right[i] are consecutive, and hence we exchange Left[i+1] and Right[i].

Left and Right after re-arranging:

       Left    Right
i=0    a[0]    a[2]
i=1    a[1]    a[3]

Now, since Left[i] and Left[i+1] are consecutive, they can be bundled into a vector. Same with Right.

Now, this disturbs the original addition, since we were supposed to add a[0] to a[1] and a[2] to a[3] and finally add those two sums. But after rearranging, which in turn vectorizes the code, we are now adding

a[0]  a[1]
 +     +
a[2]  a[3]

This doesn't affect the result for integers, but for floating-point data types with precision issues it might cause a difference in the final answer. Hence, we do this re-arrangement for integer data types only.

I think you assumed that Left will contain the left subtree (a[0] and a[1]) while Right will contain the right subtree (a[2] and a[3]), which is not so.

Please confirm, and also correct my understanding if wrong :) Your suggestions/corrections are most welcome :)

Hi Suyog,

Thanks for the explanation, it actually matches my understanding.

But yes, I was a bit confused with what we actually load in a vector - Left[i] and Left[i+1], or Left[i] and Right[i]. Now it's clear, and your approach looks right.

One more question though: what if Left[i+1] and Right[i+1] are consecutive, but Left[i] and Right[i] are not (i.e. (p1[0] + p2[0]) + (a[0] + a[1]))? Does it make sense to swap p2[0] (Right[i]) and a[0] (Left[i+1]) ? In this case we will get at least one pair of consecutive loads - in the right operands.

Hi Michael,

Thanks for the reply.

We are bundling Left[i] and Left[i+1] (and Right[i] and Right[i+1]).
So it's not beneficial for us if Left[i+1] and Right[i+1] are consecutive.

Also, it doesn't make sense to vectorize one bundle of loads
while leaving the other bundle scalar, since you need two vectors to operate on together.
This is taken care of by the cost model.

And anyway, the loop runs only up to e-1 and not e.

For the sake of example, let's say that one of the Left-Right pairs is consecutive
while the other is not.

Let's take an example:

(a[0] + a[1]) + (a[4] + a[6])
              
Left                Right
a[0]                 a[1]
a[4]                 a[6]

This is not vectorizable in its current form.

Now, with our patch, since Left[i] and Right[i] are consecutive (integers),
we swap Left[i+1] and Right[i]:

Left                  Right
a[0]                   a[4]
a[1]                   a[6]

Now, since the elements in Left are consecutive, Left can be bundled into a vector load.
The elements in Right are not consecutive, hence they cannot be bundled into a vector load.

It doesn't make sense to vectorize one bundle of loads while the other bundle remains scalar.
This is taken care of by the cost model. When we calculate the cost of vectorization, it comes
out positive relative to the scalar code (positive means more expensive), and hence we do not vectorize the code at all.
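
The argument above can be reduced to a toy check. This is a deliberately simplified stand-in and not how the SLP cost model is actually computed: loads are modelled by their indices, and the hypothetical `bothBundlesVectorizable` only captures the claim that a vector add needs both operand bundles to be consecutive loads.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True when every element directly follows its predecessor, i.e. the
// bundle could be fetched with a single vector load.
bool isConsecutiveBundle(const std::vector<int> &Bundle) {
  for (std::size_t i = 0; i + 1 < Bundle.size(); ++i)
    if (Bundle[i + 1] != Bundle[i] + 1)
      return false;
  return true;
}

// Simplified stand-in for the profitability decision: in this model a
// vector add is only worthwhile when both operand bundles are vector loads.
bool bothBundlesVectorizable(const std::vector<int> &Left,
                             const std::vector<int> &Right) {
  assert(Left.size() == Right.size() && "bundles must have equal width");
  return isConsecutiveBundle(Left) && isConsecutiveBundle(Right);
}
```

For the example above, `bothBundlesVectorizable({0, 1}, {4, 6})` is false because a[4] and a[6] are not adjacent, while `bothBundlesVectorizable({0, 1}, {2, 3})` is true.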

Any questions/suggestions/comments are welcome.

I have updated the patch, addressing the extra whitespace and typos.
Please help in reviewing it.

Regards,
Suyog

Hi Suyog,

Yep, I see what you mean. However, to me Left and Right operands look pretty symmetrical, so probably we should either try to make both of them consecutive simultaneously (option a), or make at least one of them consecutive (option b).

For option (a) we need to add a check:

// We can't make the right operands consecutive - bail out.
if (!(isConsecutiveAccess(Left[i+1], Right[i+1])))
  continue;

For option (b) we need to change

if (!(isConsecutiveAccess(Left[i], Right[i])))
  continue;

to

if (!isConsecutiveAccess(Left[i], Right[i]) && !isConsecutiveAccess(Left[i+1], Right[i+1]))
  // If we can't make any of the pairs consecutive - bail out.
  continue;

I prefer option (b) here, but if you have any arguments against it, option (a) is also good.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1838	Also, braces around isConsecutiveAccess are redundant.

Hi Michael,

We could add that check for Left[i+1] and Right[i+1] as well.
If we add that, we would have to increase the loop variable by 2 :)
as we would be checking both the i and i+1 elements.

By the way, the loop will always run only 2 times in any case,
since the tree we are considering will always have only 2 children per node.

But this would be a very strict check. If we skip it, the
decision whether to vectorize is taken by the cost model, as demonstrated earlier.

The only question is whether to check at this point itself or leave it to
the cost model to decide, which it does very well.

I am of the opinion that we should not check Left[i+1] and Right[i+1],
since it would hardly make any difference to the logic or the outcome,
and if we include that check, the code would look a bit ugly :)

Removed the redundant braces around the isConsecutiveAccess check.

Regards,
Suyog

Hi Suyog,

I see what you mean, but the loop doesn't seem to behave as you describe. If Left.size() equals 2, the loop performs only one iteration. That means we don't try to make the Right operands consecutive (yes, they might become consecutive in some situations, as in your original example).

Could we actually check all operands before making any swaps? I.e. if we have (a[0]+b[0])+(b[1]+a[1]), then the current algorithm will not handle it, right? In this case we have two pairs of consecutive loads, but to find them we need to look through all the loads first. It's a purely theoretical case though; I don't know how often it occurs in real life.
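
One way such a look at all four loads could work is sketched below. This is an assumption made for illustration and not something proposed in this form in the thread: `Load`, `follows` and `commuteLanes` are hypothetical names, and the sketch only exchanges the two operands of a single add, which relies on commutativity alone.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A load is modelled as (base array, element index); hypothetical type.
struct Load {
  char Base;
  int Index;
};

// B directly follows A when both use the same base and B is one element on.
bool follows(const Load &A, const Load &B) {
  return A.Base == B.Base && B.Index == A.Index + 1;
}

// For each lane after the first, exchange the lane's own two operands when
// that makes both Left and Right continue the previous lane's loads.
void commuteLanes(std::vector<Load> &Left, std::vector<Load> &Right) {
  assert(Left.size() == Right.size() && "operand lists must match in size");
  for (std::size_t i = 1; i < Left.size(); ++i) {
    bool AlreadyGood =
        follows(Left[i - 1], Left[i]) && follows(Right[i - 1], Right[i]);
    bool SwapHelps =
        follows(Left[i - 1], Right[i]) && follows(Right[i - 1], Left[i]);
    if (!AlreadyGood && SwapHelps)
      std::swap(Left[i], Right[i]);
  }
}
```

For (a[0]+b[0])+(b[1]+a[1]), Left = {a[0], b[1]} and Right = {b[0], a[1]} become Left = {a[0], a[1]} and Right = {b[0], b[1]}.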

By the way, have you measured performance with your change? Do you expect any gains on SPEC or other benchmarks?

Hi Michael,

Thanks for replying.

Two quick points to mention from your reply.

By swapping Right[i] with Left[i+1] after the check, we ensure that the resulting Left is consecutive. We do not need to check whether Right is consecutive as well after the swap. If the resulting Right is consecutive, the bundling code further on (already existing in SLP) will bundle it into a vector load, as a result of which the cost model will calculate the cost of vectorization as negative and finally vectorize the whole code. If the resulting Right doesn't have consecutive loads, the bundling code will not bundle these loads into a vector, which in turn makes the cost of the vector code positive (expensive) and hence avoids vectorization of the whole code.

The case you mentioned (a[0]+b[0])+(a[1]+b[1]) was already handled without this patch. Left is a[0] & a[1] and Right is b[0] & b[1], and hence it is already vectorizable.

The twist in the example you mentioned happens with (a[0]+b[0]) + (b[1]+a[1]), where Left is a[0] & b[1] and Right is b[0] & a[1] (though I am doubtful this ever happens). Since we are checking Left[i] & Right[i] only, there won't be any swapping in this case and we won't vectorize this code, though it has the potential to be vectorized.

Now, since for a reduction of the above type, where we have a single tree, we know that Left and Right will always have size=2 because of the way the tree is formed, we can actually eliminate the loop and check all 4 elements for consecutive loads.

I would like to have your opinion on this: should we eliminate the loop and just check the 4 elements?

I ran the LNT tests on a similar patch submitted earlier, for 10 iterations on X86, and I didn't see any significant improvement or any regression, though the AArch64 assembly generated above is a good improvement.

I would also like to give context of earlier discussions on this whole exercise:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-September/076930.html
http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-November/078666.html

Waiting for your suggestions.

Regards,
Suyog

Hi Suyog,

Thanks for the links to the previous discussions.

Sizes of Left and Right equal 2 only for trees of depth 2, right (i.e. only for a sum of 4 operands, as in our previous examples)? That's not true for a sum of 8 operands, for which we have a deeper tree. So, if I'm not totally confused here, unrolling the loop would mean that we only handle 4-operand sums, which doesn't look better than your original code. So, I'd prefer to keep the loop.

However, I'm still wondering whether it's possible to solve the problem in a more general way, e.g. sort the operands based on their index from the same base pointer, and then distribute them between the Left and Right arrays to get the maximum number of consecutive pairs. What do you think, is it doable? And if so, will it catch any useful case that the current approach doesn't catch? By the way, I totally agree that the example from my previous letter is quite artificial, and maybe it's OK if we give up on it. And if a general solution would bring a lot of complexity to the code, I'd rather go with your current implementation.

Thanks,
Michael

PS: My answers might be delayed during the Christmas season, but I'll get back to the discussion as soon as I can.

Hi Michael.

Ideally, for a sum of 8 operands, for which we have a deeper tree, Right and Left should each have 4 operands.
But because of the way the tree is built up, where we recursively call build_tree() for Left and Right, we handle 2 loads each at a time.

Let's take an example :). I will number the add ops:

                  + (1)
              /      \ 
             /        \ 
            /          \
           /            \
         + (2)            + (3)
       /   \            /  \
      /     \          /    \
     /       \        /      \
   +(4)       +(5)    +(6)    +(7)
  /  \      /  \    /  \     /  \
0    1     2    3  4    5    6   7

When add 1 is encountered (the topmost add), we split the tree into left and right subtrees and recursively call build tree on
the left and right of the 2nd and 3rd adds. So we go to the left of the 2nd and 3rd adds and arrive at the 4th and 6th adds. Then we again go to the left
of the 4th and 6th adds and encounter a[0] and a[4]; we put them into the Left vector and go to the right of the 4th and 6th adds. We arrive at
a[1] and a[5], which we put into the Right vector. After this we check whether the elements in Left and Right can be bundled into a vector of loads.

Left        Right
a[0]          a[1]
a[4]          a[5]

At this point, we are totally unaware of the other loads, since we haven't yet called build_tree() on the right side of the 2nd and 3rd adds.
(DFS running in parallel for the subtrees starting at the 2nd and 3rd adds.)

Once we are done processing and bundling the above pair of loads into a vector, we move to the right recursively, and finally
encounter the 5th and 7th adds and then have Left and Right as:

Left           Right
a[2]          a[3]
a[6]          a[7]

Note that at this point, a[0], a[1], a[4] and a[5] are already processed. I hope you get my point :)

Left and Right will contain more than 2 operands only if the number of subtrees is more than 2, which is not possible with a single tree,
as a node in a single tree has only 2 children at a time.
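
The traversal order described above can be reproduced on a toy tree. The sketch below is only a model of that description, not the SLP code: `Node`, `Bundle` and `visitBundle` are hypothetical names, and the walk starts from the bundle made of the 2nd and 3rd adds.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy expression tree: a leaf carries a load index, an inner node is an add.
struct Node {
  int Index;       // load index for leaves, unused for adds
  const Node *Lhs; // null for leaves
  const Node *Rhs; // null for leaves
};

using Bundle = std::vector<const Node *>;
using LeftRight = std::pair<std::vector<int>, std::vector<int>>;

// Depth-first walk over a bundle of adds: split the bundle into its left
// and right operands, record them once they are loads, otherwise recurse
// into Left first and then into Right.
void visitBundle(const Bundle &Adds, std::vector<LeftRight> &Seen) {
  assert(!Adds.empty() && "bundle must not be empty");
  Bundle Left, Right;
  for (const Node *Add : Adds) {
    assert(Add->Lhs && Add->Rhs && "bundle must contain adds only");
    Left.push_back(Add->Lhs);
    Right.push_back(Add->Rhs);
  }
  // The toy tree is balanced, so inspecting one operand is enough.
  if (!Left.front()->Lhs) {
    LeftRight LR;
    for (const Node *N : Left)
      LR.first.push_back(N->Index);
    for (const Node *N : Right)
      LR.second.push_back(N->Index);
    Seen.push_back(LR);
    return;
  }
  visitBundle(Left, Seen);
  visitBundle(Right, Seen);
}
```

On the 8-load tree above, the walk records Left = {0, 4}, Right = {1, 5} first and Left = {2, 6}, Right = {3, 7} second, so no step ever sees more than two loads per side.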

I think this should be handled in a better way, because the above code has even more potential to be vectorized into <4x> vectors instead of <2x> vectors,
though I am not yet sure how to do that. I would be happy to have your suggestions on it.

Awaiting your reply !!

Regards,
Suyog

(Merry Xmas and Happy New Year :) )

Hi Suyog,

Happy New Year!

Thanks for the detailed explanation! I think that since the current algorithm always works locally, and sees only two left and two right operands (since we start from a BinOp), we should confine reorderIfConsecutiveLoads to size=2. That could be done either by unrolling the loop, or by adding an assertion and a comment at the beginning. I prefer getting rid of the loop now, because the current approach targets a very specific case and isn't intended to be general, while the presence of the loop made me think that it's more or less general, which is misleading. We should either take a general approach, or confine ourselves to the specific case. Hopefully in the future we'll implement a general algorithm for handling big reductions.

And one more nit: we can enable this for floats too, when fast-math is on. You can find an example of how it could be done in LoopVectorizer.cpp:5251.

Hi Suyog,

I've also just managed to construct an example in which we perform an incorrect transformation.

Here it is:

@a = common global [1000 x i32] zeroinitializer, align 16
@b = common global [1000 x i32] zeroinitializer, align 16
@c = common global [1000 x i32] zeroinitializer, align 16

; Function Attrs: nounwind readonly ssp uwtable
define void @foo() #0 {
entry:
  %a0 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 0), align 16, !tbaa !2
  %a1 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 1), align 4, !tbaa !2
  %a2 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 2), align 8, !tbaa !2
  %a3 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 3), align 4, !tbaa !2
  %a4 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 4), align 16, !tbaa !2
  %a5 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 5), align 4, !tbaa !2
  %a6 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 6), align 8, !tbaa !2
  %a7 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 7), align 4, !tbaa !2

  %b0 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 0), align 16, !tbaa !2
  %b1 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 1), align 4, !tbaa !2
  %b2 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 2), align 8, !tbaa !2
  %b3 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 3), align 4, !tbaa !2
  %b4 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 4), align 16, !tbaa !2
  %b5 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 5), align 4, !tbaa !2
  %b6 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 6), align 8, !tbaa !2
  %b7 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 7), align 4, !tbaa !2

  %add01 = add i32 %a0, %a1
  %add02 = add i32 %a4, %b4
  %add0 = add i32 %add01, %add02

  %add11 = add i32 %b0, %b1
  %add12 = add i32 %a5, %b5
  %add1 = add i32 %add11, %add12

  %add21 = add i32 %a2, %b2
  %add22 = add i32 %a6, %b6
  %add2 = add i32 %add21, %add22

  %add31 = add i32 %a3, %b3
  %add32 = add i32 %a7, %b7
  %add3 = add i32 %add31, %add32

  store i32 %add0, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 0), align 16
  store i32 %add1, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 1), align 4
  store i32 %add2, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 2), align 8
  store i32 %add3, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 3), align 4
  ret void
}

The code might look confusing, but it's actually pretty simple. I took the computation c[0:3] = (a[0:3]+b[0:3]) + (a[4:7]+b[4:7]) and swapped b[0] and a[1] in it. The patched compiler incorrectly swaps these two operands back.

The problem happens because reorderIfConsecutiveLoads is currently called not only for reductions, but for store chains as well. While it's valid to swap operands in a reduction, it's illegal to do so across the lanes in usual vector computations.
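
The difference between the two situations can be shown on two lanes with plain integers. This is a minimal sketch with made-up values, not the IR above: the function names are hypothetical and the lanes stand for the first inner adds of c[0] and c[1] (%add01 and %add11).

```cpp
#include <array>
#include <cassert>

// Two independent lanes as written in the source: lane 0 adds a0 and a1,
// lane 1 adds b0 and b1 (the shape after swapping b[0] and a[1]).
std::array<int, 2> lanesAsWritten(int A0, int A1, int B0, int B1) {
  return std::array<int, 2>{{A0 + A1, B0 + B1}};
}

// Lanes after exchanging Right[0] (a1) with Left[1] (b0): the vector add
// now computes {a0 + b0, a1 + b1}.
std::array<int, 2> lanesAfterSwap(int A0, int A1, int B0, int B1) {
  return std::array<int, 2>{{A0 + B0, A1 + B1}};
}

// Sum of both lanes, which is all a horizontal reduction would observe.
int reduceLanes(const std::array<int, 2> &Lanes) { return Lanes[0] + Lanes[1]; }

void demoStoreChain() {
  const std::array<int, 2> Before = lanesAsWritten(1, 2, 10, 20);
  const std::array<int, 2> After = lanesAfterSwap(1, 2, 10, 20);
  // A reduction only sees the total, which the swap preserves...
  assert(reduceLanes(Before) == reduceLanes(After));
  // ...but a store chain writes each lane separately, and those differ.
  assert(Before != After);
}
```

With these values the lanes are {3, 30} before the swap and {11, 22} after it, while both sum to 33.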

Hi Michael,

Thanks for the review and the example.

I think I should abandon this patch for now, since, as we discussed, it has gradually become too specific, and it also works incorrectly for store chains.

Basically, I will have to come up with a patch that tracks reductions (by setting some flag) and checks for consecutiveness of loads only in that case.

Was a nice discussion :)

Regards,
Suyog

Hi Suyog,

Thank you for working on this, it's a very promising area!

Looking forward to your new patch :)

jmolloy removed a reviewer: jmolloy. Mar 18 2015, 3:25 AM

mzolotukhin removed a subscriber: mzolotukhin. May 12 2015, 6:56 PM

Revision Contents

Path                                                      Size
lib/Transforms/Vectorize/SLPVectorizer.cpp                28 lines
test/Transforms/SLPVectorizer/AArch64/horizontaladd.ll    58 lines

Diff 17545

lib/Transforms/Vectorize/SLPVectorizer.cpp

Show First 20 Lines • Show All 433 Lines • ▼ Show 20 Lines	for (auto &Iter : BlocksSchedules) {
BlockScheduling *BS = Iter.second.get();		BlockScheduling *BS = Iter.second.get();
BS->clear();		BS->clear();
}		}
}		}

/// \returns true if the memory operations A and B are consecutive.		/// \returns true if the memory operations A and B are consecutive.
bool isConsecutiveAccess(Value A, Value B);		bool isConsecutiveAccess(Value A, Value B);

		/// For consecutive loads (+(+ v0, v1)(+ v2, v3)), Left had v0 and v2
		/// while Right had v1 and v3, which prevented bundling them into
		/// a vector of loads. Reorder them so that Left now has v0 and v1
		mzolotukhinUnsubmitted Not Done Reply Inline Actions s/Rorder/Reorder/ mzolotukhin: s/Rorder/Reorder/
		/// while Right has v2 and v3 enabling their bundling into a vector.
		void reorderIfConsecutiveLoads(SmallVectorImpl<Value *> &Left,
		SmallVectorImpl<Value *> &Right);

/// \brief Perform LICM and CSE on the newly generated gather sequences.		/// \brief Perform LICM and CSE on the newly generated gather sequences.
void optimizeGatherSequence();		void optimizeGatherSequence();

/// \returns true if it is benefitial to reverse the vector order.		/// \returns true if it is benefitial to reverse the vector order.
bool shouldReorder() const {		bool shouldReorder() const {
return NumLoadsWantToChangeOrder > NumLoadsWantToKeepOrder;		return NumLoadsWantToChangeOrder > NumLoadsWantToKeepOrder;
}		}

▲ Show 20 Lines • Show All 779 Lines • ▼ Show 20 Lines	case Instruction::Xor: {
newTreeEntry(VL, true);		newTreeEntry(VL, true);
DEBUG(dbgs() << "SLP: added a vector of bin op.\n");		DEBUG(dbgs() << "SLP: added a vector of bin op.\n");

// Sort operands of the instructions so that each side is more likely to		// Sort operands of the instructions so that each side is more likely to
// have the same opcode.		// have the same opcode.
if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {		if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {
ValueList Left, Right;		ValueList Left, Right;
reorderInputsAccordingToOpcode(VL, Left, Right);		reorderInputsAccordingToOpcode(VL, Left, Right);
		reorderIfConsecutiveLoads(Left, Right);
		mzolotukhinUnsubmitted Not Done Reply Inline Actions No need in whitespace before the brace. mzolotukhin: No need in whitespace before the brace.
buildTree_rec(Left, Depth + 1);		buildTree_rec(Left, Depth + 1);
buildTree_rec(Right, Depth + 1);		buildTree_rec(Right, Depth + 1);
return;		return;
}		}

for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {		for (unsigned i = 0, e = VL0->getNumOperands(); i < e; ++i) {
ValueList Operands;		ValueList Operands;
// Prepare the operand vector.		// Prepare the operand vector.
▲ Show 20 Lines • Show All 568 Lines • ▼ Show 20 Lines	bool BoUpSLP::isConsecutiveAccess(Value A, Value B) {
// Otherwise compute the distance with SCEV between the base pointers.		// Otherwise compute the distance with SCEV between the base pointers.
const SCEV *PtrSCEVA = SE->getSCEV(PtrA);		const SCEV *PtrSCEVA = SE->getSCEV(PtrA);
const SCEV *PtrSCEVB = SE->getSCEV(PtrB);		const SCEV *PtrSCEVB = SE->getSCEV(PtrB);
const SCEV *C = SE->getConstant(BaseDelta);		const SCEV *C = SE->getConstant(BaseDelta);
const SCEV *X = SE->getAddExpr(PtrSCEVA, C);		const SCEV *X = SE->getAddExpr(PtrSCEVA, C);
return X == PtrSCEVB;		return X == PtrSCEVB;
}		}

		void BoUpSLP::reorderIfConsecutiveLoads(SmallVectorImpl<Value *> &Left,
		SmallVectorImpl<Value *> &Right) {
		for (unsigned i = 0, e = Left.size(); i < e - 1; ++i) {
		if (!isa<LoadInst>(Left[i]) \|\| !isa<LoadInst>(Right[i]))
		return;
		LoadInst *L = dyn_cast<LoadInst>(Left[i]);
		bool isInt = L->getType()->isIntegerTy();
		if (!isInt)
		return;
		if (!isConsecutiveAccess(Left[i], Right[i]))
		mzolotukhinUnsubmitted Not Done Reply Inline Actions Also, braces around isConsecutiveAccess are redundant. mzolotukhin: Also, braces around isConsecutiveAccess are redundant.
		continue;
		else
		std::swap(Left[i + 1], Right[i]);
		mzolotukhinUnsubmitted Not Done Reply Inline Actions I think we should only swap if `Left[i]` and `Left[i+1]` are consecutive - that's the only case we get something from this reordering. In your current approach we might lose consecutiveness in `Right[i]` and `Right[i+1]` by swapping `Left[i+1]` and `Right[i]` even if later on `Left[i]` and `Left[i+1]` won't be consecutive. mzolotukhin: I think we should only swap if `Left[i]` and `Left[i+1]` are consecutive - that's the only case…
		suyogAuthorUnsubmitted Not Done Reply Inline Actions For consecutive loads, if Left[i] and Left[i+1] are consecutive, then we will never arrive at this check, since they are already consecutive and can be bundled into a vector. Lets take an example : return (a[0] + a[1]) + (a[2] + a[3]) the tree for this will be : + / \ / \ + + / \ / \ / \ / \ a[0] a[1] a[2] a[3] where Left Right i=0 a[0] a[1] i=1 a[2] a[3] (Please note the contents of Left and Right :). Seems confusing, but its the way :) ) here Left[0] = a[0] and Left[1] = a[2] which are not consecutive and hence, cannot be bundled. Right[0] = a[1] and Right[1] = a[3], which are also not consecutive and hence cannot be bundled. But here, Left[i] and Right[i] are consecutive and hence we exchange Left[i+1] and Right[i]. The Left and Right after formed after re-arranging : Left Right i=0 a[0] a[2] i=1 a[1] a[3] Now since, Left[i] and Left[i+1] are consecutive, they can be bundled into a vector. Same with Right. Now, this disturbs the original addition, since we were supposed to add a[0] with a[1] and a[2] with a[3] and finally add their additions. But after rearranging, which in turn vectorizes the code, we are now adding a[0] a[1] + + a[2] a[3] This doesn't affect the result for integers, but for floating point data types with precision issues, this might cause difference in final answer. Hence, we are doing this re-arrangements for integer data types only. I think you considered Left will contain left subtree (a[0] and a[1]) while Right will contain right subtree (a[2] and a[3]), which is not so. Please confirm and also correct my understanding if wrong :) Your suggestions/corrections are most welcomed :) suyog: For consecutive loads, if Left[i] and Left[i+1] are consecutive, then we will never arrive at…
		}
		}

void BoUpSLP::setInsertPointAfterBundle(ArrayRef<Value *> VL) {		void BoUpSLP::setInsertPointAfterBundle(ArrayRef<Value *> VL) {
Instruction *VL0 = cast<Instruction>(VL[0]);		Instruction *VL0 = cast<Instruction>(VL[0]);
BasicBlock::iterator NextInst = VL0;		BasicBlock::iterator NextInst = VL0;
++NextInst;		++NextInst;
Builder.SetInsertPoint(VL0->getParent(), NextInst);		Builder.SetInsertPoint(VL0->getParent(), NextInst);
Builder.SetCurrentDebugLocation(VL0->getDebugLoc());		Builder.SetCurrentDebugLocation(VL0->getDebugLoc());
}		}

... (214 lines not shown) ...		switch (Opcode) {
case Instruction::FRem:		case Instruction::FRem:
case Instruction::Shl:		case Instruction::Shl:
case Instruction::LShr:		case Instruction::LShr:
case Instruction::AShr:		case Instruction::AShr:
case Instruction::And:		case Instruction::And:
case Instruction::Or:		case Instruction::Or:
case Instruction::Xor: {		case Instruction::Xor: {
ValueList LHSVL, RHSVL;		ValueList LHSVL, RHSVL;
if (isa<BinaryOperator>(VL0) && VL0->isCommutative())		if (isa<BinaryOperator>(VL0) && VL0->isCommutative()) {
reorderInputsAccordingToOpcode(E->Scalars, LHSVL, RHSVL);		reorderInputsAccordingToOpcode(E->Scalars, LHSVL, RHSVL);
		reorderIfConsecutiveLoads(LHSVL, RHSVL);
		mzolotukhin (inline comment): Redundant whitespace here too.
		}
else		else
for (int i = 0, e = E->Scalars.size(); i < e; ++i) {		for (int i = 0, e = E->Scalars.size(); i < e; ++i) {
LHSVL.push_back(cast<Instruction>(E->Scalars[i])->getOperand(0));		LHSVL.push_back(cast<Instruction>(E->Scalars[i])->getOperand(0));
RHSVL.push_back(cast<Instruction>(E->Scalars[i])->getOperand(1));		RHSVL.push_back(cast<Instruction>(E->Scalars[i])->getOperand(1));
}		}

setInsertPointAfterBundle(E->Scalars);		setInsertPointAfterBundle(E->Scalars);

... (1,737 lines not shown) ...

test/Transforms/SLPVectorizer/AArch64/horizontaladd.ll

				; RUN: opt < %s -basicaa -slp-vectorizer -S -mtriple=aarch64-unknown-linux-gnu -mcpu=cortex-a57 | FileCheck %s
				target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64--linux-gnu"

				; float hadd (float *a) {
				; return (a[0] + a[1]) + (a[2] + a[3]);
				; }

				; CHECK-LABEL: @hadd
				; CHECK-NOT: load <2 x float>*
				; CHECK-NOT: fadd <2 x float>
				; CHECK-NOT: extractelement <2 x float>

				define float @hadd(float* nocapture readonly %a) {
				entry:
				%0 = load float* %a, align 4
				%arrayidx1 = getelementptr inbounds float* %a, i64 1
				%1 = load float* %arrayidx1, align 4
				%add = fadd float %0, %1
				%arrayidx2 = getelementptr inbounds float* %a, i64 2
				%2 = load float* %arrayidx2, align 4
				%arrayidx3 = getelementptr inbounds float* %a, i64 3
				%3 = load float* %arrayidx3, align 4
				%add4 = fadd float %2, %3
				%add5 = fadd float %add, %add4
				ret float %add5
				}
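
The `CHECK-NOT` lines for `@hadd` reflect the author's point that reordering the operands changes which values are added first, which is harmless for integers but can change a floating-point result. The following is a minimal sketch of that effect, not code from the patch; the function names `sumOriginal` and `sumReordered` are hypothetical, and the example assumes IEEE-754 single precision without fast-math style optimizations.

```cpp
// Original association from the source: (a[0] + a[1]) + (a[2] + a[3]).
float sumOriginal(const float a[4]) {
  float lo = a[0] + a[1];
  float hi = a[2] + a[3];
  return lo + hi;
}

// Association after the reordering: the vector add computes the lanes
// <a[0] + a[2], a[1] + a[3]>, and the two lanes are then added together.
float sumReordered(const float a[4]) {
  float lane0 = a[0] + a[2];
  float lane1 = a[1] + a[3];
  return lane0 + lane1;
}
```

With inputs such as `{1e8f, 1.0f, -1e8f, 1.0f}`, the small terms are absorbed by the large ones in the original order but survive in the reordered one, so the two functions can return different values.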

				; CHECK-LABEL: @hadd_int
				; CHECK: load <2 x i32>*
				; CHECK: add <2 x i32>
				; CHECK: extractelement <2 x i32>
				define i32 @hadd_int(i32* nocapture readonly %a) {
				entry:
				%0 = load i32* %a, align 4
				%arrayidx1 = getelementptr inbounds i32* %a, i64 1
				%1 = load i32* %arrayidx1, align 4
				%arrayidx2 = getelementptr inbounds i32* %a, i64 2
				%2 = load i32* %arrayidx2, align 4
				%arrayidx3 = getelementptr inbounds i32* %a, i64 3
				%3 = load i32* %arrayidx3, align 4
				%arrayidx6 = getelementptr inbounds i32* %a, i64 4
				%4 = load i32* %arrayidx6, align 4
				%arrayidx7 = getelementptr inbounds i32* %a, i64 5
				%5 = load i32* %arrayidx7, align 4
				%arrayidx10 = getelementptr inbounds i32* %a, i64 6
				%6 = load i32* %arrayidx10, align 4
				%arrayidx11 = getelementptr inbounds i32* %a, i64 7
				%7 = load i32* %arrayidx11, align 4
				%add1 = add i32 %0, %1
				%add2 = add i32 %2, %3
				%add3 = add i32 %4, %5
				%add4 = add i32 %6, %7
				%add5 = add i32 %add1, %add2
				%add6 = add i32 %add3, %add4
				%add7 = add i32 %add5, %add6
				ret i32 %add7
				}
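
The reordering rule described in the inline comments (when `Left[i]` and `Right[i]` are consecutive, exchange `Left[i+1]` and `Right[i]`) can be sketched in isolation as follows. This is an illustrative model only, not the patch's `reorderIfConsecutiveLoads`: loads are represented by their array index, and `LoadIndex`, `isConsecutive`, and `reorderIfConsecutive` are hypothetical names introduced for the example.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a load instruction: only the index k of a[k] matters.
using LoadIndex = int;

// True when B reads the element immediately after A, i.e. a[k] then a[k+1].
bool isConsecutive(LoadIndex A, LoadIndex B) { return B == A + 1; }

// Sketch of the rule from the review discussion. Left and Right are assumed
// to have the same length. Whenever Left[i] and Right[i] are consecutive,
// Left[i+1] and Right[i] are exchanged.
void reorderIfConsecutive(std::vector<LoadIndex> &Left,
                          std::vector<LoadIndex> &Right) {
  for (std::size_t i = 0; i + 1 < Left.size(); ++i)
    if (isConsecutive(Left[i], Right[i]))
      std::swap(Left[i + 1], Right[i]);
}
```

For the four-load example, `Left = {0, 2}` and `Right = {1, 3}` become `Left = {0, 1}` and `Right = {2, 3}`, so each side holds consecutive loads that can be bundled. The sketch swaps unconditionally on the `Left[i]`/`Right[i]` check, which is the behavior the reviewer questions; it does not model any additional guard.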

Diff 17545
