This is an archive of the discontinued LLVM Phabricator instance.

[SLPVectorization] Vectorize flat addition in a single tree (+(+(+ v1 v2) v3) v4)
AbandonedPublic

Authored by suyog on Dec 31 2014, 2:53 AM.

Details

Summary

This is one more patch based on previous discussions.

This patch vectorizes flat addition of integer type from a single array whose
expression tree is of type (+(+(+ v1 v2) v3) v4).

e.g.

int foo (int *a) {
  return a[0] + a[1] + a[2] + a[3];
}

The IR for above code is :

define i32 @hadd(i32* %a) {
entry:
    %0 = load i32* %a, align 4
    %arrayidx1 = getelementptr inbounds i32* %a, i32 1
    %1 = load i32* %arrayidx1, align 4
    %add = add nsw i32 %0, %1
    %arrayidx2 = getelementptr inbounds i32* %a, i32 2
    %2 = load i32* %arrayidx2, align 4
    %add3 = add nsw i32 %add, %2
    %arrayidx4 = getelementptr inbounds i32* %a, i32 3
    %3 = load i32* %arrayidx4, align 4
    %add5 = add nsw i32 %add3, %3
     ret i32 %add5
  }

The above addition can be modeled as combination of two shuffle vectors, two vector adds and an extractelement instruction.

After vectorization with this patch IR :

define i32 @hadd(i32* %a) {
 entry:
     %0 = bitcast i32* %a to <4 x i32>*
     %1 = load <4 x i32>* %0, align 4
     %rdx.shuf = shufflevector <4 x i32> %1, <4 x i32> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
     %bin.rdx = add <4 x i32> %1, %rdx.shuf
     %rdx.shuf1 = shufflevector <4 x i32> %bin.rdx, <4 x i32> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
     %bin.rdx2 = add <4 x i32> %bin.rdx, %rdx.shuf1
     %2 = extractelement <4 x i32> %bin.rdx2, i32 0
     ret i32 %2
 }

AArch assembly before patch :

ldp	 w8, w9, [x0]

ldp w10, w11, [x0, #8]
add w8, w8, w9
add w8, w8, w10
add w0, w8, w11
ret

AArch assembly after this patch:

ldr	q0, [x0]

ext v1.16b, v0.16b, v0.16b, #8
add v0.4s, v0.4s, v1.4s
dup v1.4s, v0.s[1]
add v0.4s, v0.4s, v1.4s
fmov w0, s0

ret

This patch handles any number of such addition like a[0]-a[7]. Added test case for same.

I have written a newfunction "matchFlatReduction" to identify this type of tree as i didn't want to disturb the original "matchAssociateReduction".

Please help in reviewing this patch. No make-check regressions observed.

Regards,
Suyog

Diff Detail

Repository
rL LLVM

Event Timeline

suyog updated this revision to Diff 17744.Dec 31 2014, 2:53 AM
suyog retitled this revision from to [SLPVectorization] Vectorize flat addition in a single tree (+(+(+ v1 v2) v3) v4).
suyog updated this object.
suyog edited the test plan for this revision. (Show Details)
suyog set the repository for this revision to rL LLVM.
suyog added a subscriber: Unknown Object (MLST).

Hey Suyog,

Looks interesting :-) I can't speak for the technical content, but I did notice a few stylistic problems which might save you some patch ping-pong later.

See inline.

Cheers,
Charlie.

lib/Transforms/Vectorize/SLPVectorizer.cpp
3327–3328

Why three slashes? Normally we use two.

3352–3353

It's more conventional in LLVM to type these as Value *Op0, the style you're using in this function varies from declaration to declaration. There are several more instances of this in the patch.

3359–3360

Add a space after the if. There are several instances of this in the patch, including for other control constructs.

3606–3607

Looks like you're an indent level too far in here.

3610–3613

This looks a bit weird, I suggest you run it through clang-format.

suyog updated this revision to Diff 17757.Jan 1 2015, 11:58 PM

Hi Charlie,

Thanks a lot for pointing out issues. I had earlier uploaded unformatted version of the flat, corrected it
by running clang-format. My bad, sincere apology.

Addressed your concerns and ran clang-format on the added code.

Regards,
Suyog

jmolloy requested changes to this revision.Jan 5 2015, 2:45 AM
jmolloy edited edge metadata.

Hi,

Comments inline.

lib/Transforms/Vectorize/SLPVectorizer.cpp
78

Mutable globals? This is a really bad code smell.

1629

Why does it matter that it feeds a return? Why wouldn't feeding a store also trigger? or a call operand?

It looks like you're using mutable globals to track state, which is a really bad pattern. It'll mean that two SLPVectorizer passes can't be used in parallel, which will break some JIT compilers and people compiling multiple Modules in parallel.

3327

Please write comments in full sentences, without ellipses (...) where possible. Where ellipses are needed, they have 3 periods (...).

3345

Why?

3348

Why?

3607

As I've mentioned several times in different threads, I don't like this. Architectures such as AArch64 have dedicated reduction instructions (ADDV), and so their cost does not follow the IR pattern given above.

The IR pattern above is matched to pairwise-adds by the X86 backend, so that cost isn't the same either.

This revision now requires changes to proceed.Jan 5 2015, 2:45 AM
suyog added a comment.Jan 5 2015, 4:24 AM

Hi James,

Thanks for the review.

Yes its a very bad code design and i will come up with better design for tracking flags.
I had this feeling while writing code itself. Thanks for pointing out.

For some of the issues, you raised, commenting inline.

Regards,
Suyog

lib/Transforms/Vectorize/SLPVectorizer.cpp
3345

Will it be beneficial if we had Reduction width less than 4, say suppose 2?

I had just copied this from matchAssociativeReduction, i feel the reason there would be the same.

3348

If we allow it for floating point data types, results may vary, since (a+b)+c != a+(b+c) in case of floating point data structure (Chandler pointed this in earlier patches as well). Since, by vectorizing, we are changing the addition order, it may affect floating point additions. Hence, only integer add. We can allow it for integer multiplication as well though.

3607

The assembly generated as of now after vectorization, does not generate ADDV, which is bad.
But if we need to vectorize a horizontal addition, is there any other way it would be done on IR level?
Once, we achieve it at IR level, we can lower it to ADDV at DAG level in DAGCombine.

You had suggested earlier to have an IR intrinsic to indicate pattern and then lower that to machine specific instructions. Any other way than that?

suyog abandoned this revision.Dec 14 2015, 8:17 AM
nadav edited edge metadata.Dec 14 2015, 9:05 AM

It looks like this patch is not ready for review.

Before submitting it again please run this code on the llvm test suite and collect performance numbers (runtime and compile time). I want to make sure we are not regressing and assess the wins.

lib/Transforms/Vectorize/SLPVectorizer.cpp
79

I agree with jmolloy. No globals please.

suyog added a comment.Dec 14 2015, 9:08 AM

@nadav : I abandoned this revision. Did i do anything wrong which triggered something else?