This is an archive of the discontinued LLVM Phabricator instance.

[SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads
Needs ReviewPublic

Authored by suyog on Dec 15 2014, 9:22 PM.

Details

Summary

This patch is enhancement to r224119 which vectorizes horizontal reductions from consecutive loads.

Earlier in r224119, we handled tree :

             +
           /    \
         /       \
       +         +
      /  \       /  \
    /     \     /    \
a[0]  a[1] a[2] a[3]

where originally, we had

Left              Right
a[0]              a[1]
a[2]              a[3]

In r224119, we compared, (Left[i], Right[i]) and (Right[i], Left[i+1])

Left        Right
a[0] ---> a[1]
           /             
          /                                                        
         /
       \/
a[2]       a[3]

And then rearrange it to

Left        Right
a[0]        a[2]
a[1]        a[3]

so that, we can bundle left and right into vector of loads.

However, with bigger tree,

                 + 
               /    \ 
             /       \ 
           /          \
          /            \
         +              +
       /   \            /  \
      /     \          /    \
     /       \        /      \
   +        +      +       +
  /  \      /  \    /  \     /  \
0    1   2   3  4   5  6   7
 
                            
   Left              Right
    0                  1
    4                  5
    2                  3
    6                  7

In this case, Comparison of Right[i] and Left[i+1] would fail, and code remains scalar.

If we eliminate comparison Right[i] and Left[i+1], and just compare Left[i] with Right[i],
we would be able to re-arrange Left and Right into :

Left               Right
 0                    4
 1                    5
 2                    6
 3                    7

And then would bundle (0,1) (4,5) and (2,3) (6,7) into vector loads.
And then have vector adds of (01, 45) and (23, 67).

However, notice that, this would disturb the sequence of addition.
Originally, (01) and (23) should have been added. Same with (45) and (67).
For integer type addition, this would not create any issue, but for other
data types with precision concerns, there might be a problem.

ffast-math would have eliminated this precision concern, but it would have
re-associated the tree itself into (+(+(+(+(0,1)2)3....)

Hence, in this patch we are checking for integer types and then only skipping
the extra comparison of (Right[i], Left[i+1]).

With this patch, we now vectorize above type of tree for any length of consecutive loads
of integer type.

For test case:

#include <arm_neon.h>
int hadd(int* a){
    return (a[0] + a[1]) + (a[2] + a[3]) + (a[4] + a[5]) + (a[6] + a[7]);
}

AArch64 assembly before this patch :

ldp      w8, w9, [x0]
ldp     w10, w11, [x0, #8]
ldp     w12, w13, [x0, #16]
ldp     w14, w15, [x0, #24]
add      w8, w8, w9
add      w9, w10, w11
add      w10, w12, w13
add      w11, w14, w15
add      w8, w8, w9
add      w9, w10, w11
add      w0, w8, w9
ret

AArch64 assembly after this patch :

ldp      d0, d1, [x0]
ldp     d2, d3, [x0, #16]
add     v0.2s, v0.2s, v2.2s
add     v1.2s, v1.2s, v3.2s
add     v0.2s, v0.2s, v1.2s
fmov    w8, s0
mov             w9, v0.s[1]
add      w0, w8, w9
ret

Please help in reviewing this patch. I did not run LNT as of now, since this is just enhancement
to r224119. I will update with LNT results if required.

Regards,
Suyog

Diff Detail

Repository
rL LLVM

Event Timeline

suyog updated this revision to Diff 17319.Dec 15 2014, 9:22 PM
suyog retitled this revision from to [SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads .
suyog updated this object.
suyog edited the test plan for this revision. (Show Details)
suyog added reviewers: nadav, aschwaighofer, jmolloy.
suyog set the repository for this revision to rL LLVM.
suyog added a subscriber: Unknown Object (MLST).

Hi James,

Thanks for the review.

Yes, I agree the code generated can be optimized further to have v0.4s (addition of 4 scalars at a time) instead of
v0.2s (addition of 2 scalars at a time). So, basically we need to emit <4x> instead of <2x> vectors in the IR.

AFAIK, the way we build the Bottom up tree in SLP ( for the kind of tree I have
described in the description below ), when we try to bundle up loads for a vector binary operator, we always
bundle up in pair of 2 loads.

For ex (I am numbering the + operators for reference):

                 + 1
               /    \
             /       \
           /          \
          /            \
         + 2           + 3
       /   \            /  \
      /     \          /    \
     /       \        /      \
   + 4     + 5    + 6     + 7
  /  \      /  \    /  \     /  \
0    1   2   3  4   5  6   7

When visiting this tree, we encounter 2 and 3 + and we say that it can be vectorized. Now we visit, left of 2 and 3,
and come to 4 and 6 + , which can again be vectorized. So we visit, Left and Right of 4 and 6 + . We now find, that

Left -> a[0] and a[4]
Right -> a[1] and a[5]

With my patch, we re-arrange them as

Left -> a[0] and a[1]
Right -> a[4] and a[5].

We see that both Left and Right now have consecutive loads and hence can be bundled into a vector load of <2x>.
Note, that at this point, we are unaware of the other loads and 5 7 +. Hence, we are not emitting <4x> vector loads.

This traversal of operators and operands was already in existing code, and I didn't disturb that :).

May be we can put code to handle such type of IR's in DAG combine where if we encounter consecutive vector loads of 2 loads
at a time, we can combine them into vector load of 4 loads.

So basically, we need to reduce the tree :

                +   
            /       \
          /          \
        +            +
      /   \         /    \
 load   load    load  load
2x@0 2x@2  2x@4 2x@6

to something like :

      +
   /     \
  /       \ 
load     load
4x@0   4x@4

Feel free to correct me in my understanding :)
I am trying to solve this type of problems in incremental steps.

(Exciting thing is, as I was writing this mail, I got an idea for above type of reduction,
where I can vectorize 2x loads into 4xloads :).
Need to check if it already exist and come up with a patch if not.)

Regards,
Suyog

  • Original Message -------

Sender : James Molloy<james@jamesmolloy.co.uk>
Date : Dec 16, 2014 16:25 (GMT+09:00)
Title : Re: [PATCH] [SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads

Hi suyog,
This is a good improvement, thanks for working on it!

I'll take a closer look today, but for now I did notice that the generated aarch64 assembly isn't as optimal as it could be. I'd expect:

Ldp q0, q1
Add v0.4s, v0.4s, v1.4s
Addv s0, v0.4s

Cheers,

James

Hi James,

I am not having plans as of now to include IR intrinsic for horizontal reductions, need to spend cycles to understand its functioning and implement it, which I am short of :(

It would be interesting to see it though.
Meanwhile, IMO we should try to exploit every available opportunity to vectorize (2 times faster vector code better than scalar one, though 4 times faster is the best), until we have an IR intrinsic framework for it. :)

Your suggestions are always valuable and welcomed :).
Reviews for the attached patch awaited :).

Regards,
Suyog

  • Original Message -------

Sender : James Molloy<james@jamesmolloy.co.uk>
Date : Dec 16, 2014 17:09 (GMT+09:00)
Title : Re: Re: [PATCH] [SLPVectorization] Enhance Ability to Vectorize Horizontal Reductions from Consecutive Loads

Hi suyog,

Yes, we could pattern match this at the DAG level, but it would be rather fragile I think? My previous suggestion was to use IR level intrinsics to model horizontal reductions , because it avoids the pattern matching and each backed could lower it into an efficient form without pattern matching.

As its you rather than me doing this work however I won't push hard for it :)

Cheers,

James

suyog updated this revision to Diff 17385.Dec 17 2014, 4:09 AM

Reverted 224119 because floating point data types cannot be freely re-associated unless ffast-math specified.

Also modified this patch now to exchange left and right only for integer data type now.
Modified the test case to check vectorization for integer type and not for floating point data types.

Please help in reviewing the patch.

Regards,
Suyog

Hi Suyog,

Thanks for working on this, please find comments from me inline.

Michael

lib/Transforms/Vectorize/SLPVectorizer.cpp
1244

No need in whitespace before the brace.

1838–1841

I think we should only swap if Left[i] and Left[i+1] are consecutive - that's the only case we get something from this reordering.

In your current approach we might lose consecutiveness in Right[i] and Right[i+1] by swapping Left[i+1] and Right[i] even if later on Left[i] and Left[i+1] won't be consecutive.

2077

Redundant whitespace here too.

mzolotukhin added inline comments.Dec 19 2014, 3:54 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
444

s/Rorder/Reorder/

suyog added a comment.Dec 21 2014, 9:49 AM

Hi Michael,

Thanks for the review. I will take care of the typos and extra space in fresh upload.

For the logic part, please find my comments inline.

Your comments are awaited :)

lib/Transforms/Vectorize/SLPVectorizer.cpp
1838–1841

For consecutive loads, if Left[i] and Left[i+1] are consecutive, then we will never arrive at this check, since they are already consecutive and can be bundled into a vector.

Lets take an example :

return (a[0] + a[1]) + (a[2] + a[3])

the tree for this will be :

            +
         /     \
        /       \
       +        +
      /  \      /  \
     /    \    /    \
a[0] a[1] a[2]  a[3]

where

        Left       Right
i=0    a[0]       a[1]
i=1    a[2]       a[3]

(Please note the contents of Left and Right :). Seems confusing, but its the way :) )

here Left[0] = a[0] and Left[1] = a[2] which are not consecutive and hence, cannot be bundled. Right[0] = a[1] and Right[1] = a[3], which are also not consecutive and hence cannot be bundled.

But here, Left[i] and Right[i] are consecutive and hence we exchange Left[i+1] and Right[i]. The Left and Right after formed after re-arranging :

           Left           Right   
i=0       a[0]            a[2]
i=1       a[1]            a[3]

Now since, Left[i] and Left[i+1] are consecutive, they can be bundled into a vector. Same with Right.

Now, this disturbs the original addition, since we were supposed to add a[0] with a[1] and a[2] with a[3] and finally add their additions.
But after rearranging, which in turn vectorizes the code, we are now adding

a[0]   a[1]
  +      +
a[2]   a[3]

This doesn't affect the result for integers, but for floating point data types with precision issues, this might cause difference in final answer. Hence, we are doing this re-arrangements for integer data types only.

I think you considered Left will contain left subtree (a[0] and a[1]) while Right will contain right subtree (a[2] and a[3]), which is not so. Please confirm and also correct my understanding if wrong :)

Your suggestions/corrections are most welcomed :)

Hi Suyog,

Thanks for the explanation, it actually matches my understanding.

But yes, I was a bit confused with what we actually load in a vector - Left[i] and Left[i+1], or Left[i] and Right[i]. Now it's clear, and your approach looks right.

One more question though: what if Left[i+1] and Right[i+1] are consecutive, but Left[i] and Right[i] are not (i.e. (p1[0] + p2[0]) + (a[0] + a[1]))? Does it make sense to swap p2[0] (Right[i]) and a[0] (Left[i+1]) ? In this case we will get at least one pair of consecutive loads - in the right operands.

suyog updated this revision to Diff 17543.Dec 21 2014, 9:39 PM

Hi Michael,

Thanks for the reply.

We are bundling Left[i] and Left[i+1] (and Right[i] and Right[i+1]).
So its not beneficial for us if Left[i+1] and Right[i+1] are consecutive.

Also, it doesn't make any sense to vectorize one bundle of load
while leave the other bundle scalar, since u need two vectors to operate together.
This is taken care by the cost model.

And anyways we are checking the loop till e-1 and not e.

For example purpose lets say that one of the Left-Right is consecutive
while other is not.

Lets take an example -

(a[0] + a[1]) + (a[4] + a[6]))
              
Left                Right
a[0]                 a[1]
a[4]                 a[6]

This is not vectorizable in current form.

Now, with our patch, since Left[i] and Right[i] are consecutive (integers),
and hence swap Left[i+1] and Right[i]

Left                  Right
a[0]                   a[4]
a[1]                   a[6]

Now, since elements in Left are consecutive, Left can be bundled into a vector of load.
Elements in Right are not consecutive, hence they cannot be bundled into a vector load.

It doesn't make any sense to vectorize one bundle of load while other bundle remains scalar.
This is taken care by the cost model. When we calculate the cost of vectorization, it comes
to be positive than scalar code (positive means expensive), and hence we do not vectorize the entire code.

Any questions/suggestions/comments are welcomed.

I have updated patch addressing extra space and typos.
Please help in reviewing it.

Regards,
Suyog

Hi Suyog,

Yep, I see what you mean. However, to me Left and Right operands look pretty symmetrical, so probably we should either try to make both of them consecutive simultaneously (option a), or make at least one of them consecutive (option b).

For option (a) we need to add a check:

// We can't make the right operands consecutive - bail out.
if (!(isConsecutiveAccess(Left[i+1], Right[i+1])))
  continue;

For option (b) we need to change

if (!(isConsecutiveAccess(Left[i], Right[i])))
  continue;

to

if (!isConsecutiveAccess(Left[i], Right[i]) && !isConsecutiveAccess(Left[i+1], Right[i+1]))
  // If we can't make any of the pairs consecutive - bail out.
  continue;

I prefer the option (b) here, but if you have any arguments against it, the option (a) is also good.

lib/Transforms/Vectorize/SLPVectorizer.cpp
1838

Also, braces around isConsecutiveAccess are redundant.

suyog updated this revision to Diff 17545.Dec 22 2014, 12:00 AM

Hi Michael,

We could add that check for Left[i+1] and Right[i+1] as well.
If we add that, we would have to increase the loop variable by 2 :)
as we are checking both i and i+1 elements.

By the way, The loop will always run for 2 times only in any case,
since the tree which we are considering, will always have only 2 children for a node.

But, this would be a very strict checking. If we skip this checking,
decision to vectorize or not will be taken by cost model as demonstrated earlier.

The only question is to whether to check at this point itself or leave it for
the cost model to decide, which it does very well.

I am of the opinion that we should not check for Left[i+1] and Right[i+1],
since it would hardly make any difference in the logic and the outcome,
and if we include that check, the code would look a bit ugly :)

Removed redundant braces around isConsecutive check.

Regards,
Suyog

Hi Suyog,

I see what you mean, but the loop doesn't look like behaving as you describe. In case Left.size() equals 2, the loop performs only one iteration. That means that we don't try to make Right operands consecutive (yep, they might become consecutive in some situations though, like your original example).

Could we actually check all operands before making any swaps? I.e. if we have (a[0]+b[0])+(b[1]+a[1]), then the current algorithm will not handle it, right? In this case we have two pairs of consecutive loads, but to find them we need to look through all loads first. It's purely theoretical case though, I don't know how often it occurs in real life.

By the way, have you measured performance with your change? Do you expect any gains on SPECs or other benchmarks?

suyog added a comment.Dec 23 2014, 6:03 AM

Hi Michael,

Thanks for replying.

Two quick points to mention from your reply.

  1. By swapping Right[i] with Left[i+1] after check, we are ensuring that resultant Left after swap are consecutive. We do not need to check if after swapping, Right is as well consecutive. If the resultant Right becomes consecutive, the bundling code ahead (already existing in SLP) will bundle it into a load vector as a result of which the cost model will calculate the cost of vectorization as negative and finally vectorize the whole code. If the resultant Right doesn't have consecutive loads, the bundling code ahead will not bundle these loads into vector, which in turn makes the cost of vector positive (expensive) and hence avoid vectorization of whole code.
  1. The case you mentioned (a[0]+b[0])+(a[1]+b[1]) was already handled without this patch as well. Left - a[0] & a[1] and Right - b[0] & b[1] and hence vectorizable already.

    The twist in the example as u mentioned happens with (a[0]+b[0]) + (b[1]+a[1]), where Left - a[0] & b[1] and Right - b[0] & a[1] (though i am doubtful if this ever happens). Since we are checking Left[i] & Right[i] only, there won't be any swapping in this case and we won't vectorize this code, though it has potential to be vectorized.

    Now, since for reduction case of above type where we have a single tree we know that the Left and Right will always have size=2 because of the way tree is formed, we can actually eliminate loop and check all the 4 elements for consecutive loads.

    I would like to have your opinion on this - whether we should eliminate loop and just check 4 elements?

    I had run LNT test on similar patch submitted earlier for 10 iterations on X86 and i didn't see any significant improvement nor any regression, though the AArch64 assembly generated above is a good improvement.

I would also like to give context of earlier discussions on this whole exercise:

http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-September/076930.html
http://lists.cs.uiuc.edu/pipermail/llvmdev/2014-November/078666.html

Waiting for your suggestions.

Regards,
Suyog

Hi Suyog,

Thanks for the links to the previous discussions.

Sizes of Left and Right equal 2 only for the trees with depth 2, right (i.e. only for sum of 4 operands, like in our previous examples)? That's not true for sum of 8 operands, for which we have a deeper tree. So, if I'm not totally confused here, unrolling the loop would mean that we only handle 4-operands sums, which doesn't look better than your original code. So, I'd prefer to keep the loop.

However, I'm still thinking if it's possible to solve the problem in a more general way. E.g. sort the operands basing on their index from the same base pointer, and then distribute them between Left and Right arrays to get maximum number of consecutive pairs. What do you think, is it doable? And if so, will it catch any useful case that the current approach doesn't catch? By the way, I totally agree that the example from my previous letter is quite artificial, and maybe it's ok if we give up on it. And if a general solution would bring a lot of complexity to the code, I'd rather go with you current implementation.

Thanks,
Michael

PS: My answers might be delayed during the Christmas season, but I'll get back to the discussion as soon as I can.

Hi Michael.

Ideally, for sum of 8 operands, for which we have a deeper tree, Right and Left should each have 4 operands.
But the way the tree is build up, we recursively call build_tree() for Left and Right, we handle 2 loads each at a time.

Lets take an example :). I will number the adds op

                  + (1)
              /      \ 
             /        \ 
            /          \
           /            \
         + (2)            + (3)
       /   \            /  \
      /     \          /    \
     /       \        /      \
   +(4)       +(5)    +(6)    +(7)
  /  \      /  \    /  \     /  \
0    1     2    3  4    5    6   7

When the add 1 is encountered (top most add), we split the tree into left and right subtree, and recursively call build tree on
left and right of 2nd and 3rd add. So, we go to left of 2nd and 3rd add, and arrive at 4th and 6th add. Then again we go to left
of 4th and 6th add and encounter a[0] and a[4], we put them into Left vector and go to right of 4th and 6th add. We arrive at
a[1] and a[5], we put them into vector Right. After this we check if Elements in Left and Right can be bundled into a vector of loads.

Left        Right
a[0]          a[1]
a[4]          a[5]

At this point, we are totally unaware of the other loads, since we haven't called build_tree() on the right side of 2nd and 3rd add yet.
(DFS running in parallel for subtree starting at 2nd and 3rd add).

Once, we are done with processing and bundling the above pair of loads into a vector, we then move to the right recursively, and finally
encounter 5th and 7th add and then have Left and Right as:

Left           Right
a[2]          a[3]
a[6]          a[7]

Note that at this point, a[0], a[1], a[4] and a[5] are already processed. I hope you get my point :)

The Left and Right will contain more than 2 operands if the number of subtree are more than 2, which is not possible with single tree,
as single tree has only 2 children at a time.

I think that this should be handled in better way, because the above code had even more potential to get vectorize into <4x> vectors instead of <2x> vectors,
though i am not yet sure how to do that. I would be happy for your suggestions on it.

Awaiting your reply !!

Regards,
Suyog

(Merry Xmas and Happy New Year :) )

Hi Suyog,

Happy New Year!

Thanks for the detailed explanation! I think that since the current algorithm always works locally, and sees only two left and two right operands (since we start from BinOp), we should confine
reorderIfConsecutiveLoads to size=2. That could be done either by unrolling the loop, or by adding assertion+comment at the beginning. I prefer getting rid of the loop now, because the current approach targets a very specific case and isn't intended to be general, while the presence of the loop made me think that it's more or less general, which is misleading. We should either do a general approach, or confine ourselves with the specific particular case. Hopefully in future we'll implement a general algorithm for handling big reductions.

And one more nit: we can enable this for floats too, when fast-math is on. You can find an example of how it could be done in LoopVectorizer.cpp:5251.

Hi Suyog,

I've also just managed to construct an example in which we perform an incorrect transformation.

Here it is:

@a = common global [1000 x i32] zeroinitializer, align 16
@b = common global [1000 x i32] zeroinitializer, align 16
@c = common global [1000 x i32] zeroinitializer, align 16

; Function Attrs: nounwind readonly ssp uwtable
define void @foo() #0 {
entry:
  %a0 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 0), align 16, !tbaa !2
  %a1 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 1), align 4, !tbaa !2
  %a2 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 2), align 8, !tbaa !2
  %a3 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 3), align 4, !tbaa !2
  %a4 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 4), align 16, !tbaa !2
  %a5 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 5), align 4, !tbaa !2
  %a6 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 6), align 8, !tbaa !2
  %a7 = load i32* getelementptr inbounds ([1000 x i32]* @a, i64 0, i64 7), align 4, !tbaa !2

  %b0 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 0), align 16, !tbaa !2
  %b1 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 1), align 4, !tbaa !2
  %b2 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 2), align 8, !tbaa !2
  %b3 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 3), align 4, !tbaa !2
  %b4 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 4), align 16, !tbaa !2
  %b5 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 5), align 4, !tbaa !2
  %b6 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 6), align 8, !tbaa !2
  %b7 = load i32* getelementptr inbounds ([1000 x i32]* @b, i64 0, i64 7), align 4, !tbaa !2

  %add01 = add i32 %a0, %a1
  %add02 = add i32 %a4, %b4
  %add0 = add i32 %add01, %add02

  %add11 = add i32 %b0, %b1
  %add12 = add i32 %a5, %b5
  %add1 = add i32 %add11, %add12

  %add21 = add i32 %a2, %b2
  %add22 = add i32 %a6, %b6
  %add2 = add i32 %add21, %add22

  %add31 = add i32 %a3, %b3
  %add32 = add i32 %a7, %b7
  %add3 = add i32 %add31, %add32

  store i32 %add0, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 0), align 16
  store i32 %add1, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 1), align 4
  store i32 %add2, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 2), align 8
  store i32 %add3, i32* getelementptr inbounds ([1000 x i32]* @c, i32 0, i64 3), align 4
  ret void
}

The code might look confusing, but it's actually pretty simple. I took computation c[0:3] = (a[0:3]+b[0:3]) + (a[4:7]+b[4:7]) and swapped b[0] and a[1] in it. The patched compiler incorrectly swaps these two operands back.

The problem happens because reorderIfConsecutiveLoads is currently called not only for reductions, but for store-chains as well. While it's valid to swap operands in reduction, it's illegal to do so across the lanes in usual vector computations.

suyog added a comment.Jan 8 2015, 12:18 PM

Hi Michael,

Thanks for the review and the example.

I think i should abandon this patch for now, since it has become too specific gradually as we discussed as well as wrongly working for store chains.

Basically, i will have to come up with patch tracking redcution (setting some flag) and then checking for consecutiveness of loads in that case.

Was a nice discussion :)

Regards,
Suyog

Hi Suyog,

Thank you for working on this, it's a very promising area!

Looking forward to your new patch:)