Page MenuHomePhabricator

[SLPVectorization] Vectorize flat addition in a single tree (+(+(+ v1 v2) v3) v4)

Authored by suyog on Dec 31 2014, 2:53 AM.



This is one more patch based on previous discussions.

This patch vectorizes flat addition of integer type from a single array whose
expression tree is of type (+(+(+ v1 v2) v3) v4).


int foo (int *a) {
  return a[0] + a[1] + a[2] + a[3];

The IR for above code is :

define i32 @hadd(i32* %a) {
    %0 = load i32* %a, align 4
    %arrayidx1 = getelementptr inbounds i32* %a, i32 1
    %1 = load i32* %arrayidx1, align 4
    %add = add nsw i32 %0, %1
    %arrayidx2 = getelementptr inbounds i32* %a, i32 2
    %2 = load i32* %arrayidx2, align 4
    %add3 = add nsw i32 %add, %2
    %arrayidx4 = getelementptr inbounds i32* %a, i32 3
    %3 = load i32* %arrayidx4, align 4
    %add5 = add nsw i32 %add3, %3
     ret i32 %add5

The above addition can be modeled as combination of two shuffle vectors, two vector adds and an extractelement instruction.

After vectorization with this patch IR :

define i32 @hadd(i32* %a) {
     %0 = bitcast i32* %a to <4 x i32>*
     %1 = load <4 x i32>* %0, align 4
     %rdx.shuf = shufflevector <4 x i32> %1, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
     %bin.rdx = add <4 x i32> %1, %rdx.shuf
     %rdx.shuf1 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
     %bin.rdx2 = add <4 x i32> %bin.rdx, %rdx.shuf1
     %2 = extractelement <4 x i32> %bin.rdx2, i32 0
     ret i32 %2

AArch assembly before patch :

ldp	 w8, w9, [x0]

ldp w10, w11, [x0, #8]
add w8, w8, w9
add w8, w8, w10
add w0, w8, w11

AArch assembly after this patch:

ldr	q0, [x0]

ext v1.16b, v0.16b, v0.16b, #8
add v0.4s, v0.4s, v1.4s
dup v1.4s, v0.s[1]
add v0.4s, v0.4s, v1.4s
fmov w0, s0


This patch handles any number of such addition like a[0]-a[7]. Added test case for same.

I have written a newfunction "matchFlatReduction" to identify this type of tree as i didn't want to disturb the original "matchAssociateReduction".

Please help in reviewing this patch. No make-check regressions observed.


Diff Detail


Event Timeline

suyog updated this revision to Diff 17744.Dec 31 2014, 2:53 AM
suyog retitled this revision from to [SLPVectorization] Vectorize flat addition in a single tree (+(+(+ v1 v2) v3) v4).
suyog updated this object.
suyog edited the test plan for this revision. (Show Details)
suyog set the repository for this revision to rL LLVM.
suyog added a subscriber: Unknown Object (MLST).

Hey Suyog,

Looks interesting :-) I can't speak for the technical content, but I did notice a few stylistic problems which might save you some patch ping-pong later.

See inline.



Why three slashes? Normally we use two.


It's more conventional in LLVM to type these as Value *Op0, the style you're using in this function varies from declaration to declaration. There are several more instances of this in the patch.


Add a space after the if. There are several instances of this in the patch, including for other control constructs.


Looks like you're an indent level too far in here.


This looks a bit weird, I suggest you run it through clang-format.

suyog updated this revision to Diff 17757.Jan 1 2015, 11:58 PM

Hi Charlie,

Thanks a lot for pointing out issues. I had earlier uploaded unformatted version of the flat, corrected it
by running clang-format. My bad, sincere apology.

Addressed your concerns and ran clang-format on the added code.


jmolloy requested changes to this revision.Jan 5 2015, 2:45 AM
jmolloy edited edge metadata.


Comments inline.


Mutable globals? This is a really bad code smell.


Why does it matter that it feeds a return? Why wouldn't feeding a store also trigger? or a call operand?

It looks like you're using mutable globals to track state, which is a really bad pattern. It'll mean that two SLPVectorizer passes can't be used in parallel, which will break some JIT compilers and people compiling multiple Modules in parallel.


Please write comments in full sentences, without ellipses (...) where possible. Where ellipses are needed, they have 3 periods (...).






As I've mentioned several times in different threads, I don't like this. Architectures such as AArch64 have dedicated reduction instructions (ADDV), and so their cost does not follow the IR pattern given above.

The IR pattern above is matched to pairwise-adds by the X86 backend, so that cost isn't the same either.

This revision now requires changes to proceed.Jan 5 2015, 2:45 AM
suyog added a comment.Jan 5 2015, 4:24 AM

Hi James,

Thanks for the review.

Yes its a very bad code design and i will come up with better design for tracking flags.
I had this feeling while writing code itself. Thanks for pointing out.

For some of the issues, you raised, commenting inline.



Will it be beneficial if we had Reduction width less than 4, say suppose 2?

I had just copied this from matchAssociativeReduction, i feel the reason there would be the same.


If we allow it for floating point data types, results may vary, since (a+b)+c != a+(b+c) in case of floating point data structure (Chandler pointed this in earlier patches as well). Since, by vectorizing, we are changing the addition order, it may affect floating point additions. Hence, only integer add. We can allow it for integer multiplication as well though.


The assembly generated as of now after vectorization, does not generate ADDV, which is bad.
But if we need to vectorize a horizontal addition, is there any other way it would be done on IR level?
Once, we achieve it at IR level, we can lower it to ADDV at DAG level in DAGCombine.

You had suggested earlier to have an IR intrinsic to indicate pattern and then lower that to machine specific instructions. Any other way than that?

suyog abandoned this revision.Dec 14 2015, 8:17 AM
nadav edited edge metadata.Dec 14 2015, 9:05 AM

It looks like this patch is not ready for review.

Before submitting it again please run this code on the llvm test suite and collect performance numbers (runtime and compile time). I want to make sure we are not regressing and assess the wins.


I agree with jmolloy. No globals please.

suyog added a comment.Dec 14 2015, 9:08 AM

@nadav : I abandoned this revision. Did i do anything wrong which triggered something else?