This is an archive of the discontinued LLVM Phabricator instance.

[SLPVectorization] Vectorize Reduction Chain feeding into a 'return' statement
ClosedPublic

Authored by suyog on Nov 12 2014, 8:11 AM.

Details

Summary

This patch vectorizes a reduction chain feeding into a 'return' statement.

(With the current code, the reduction chain has to be in a specific form for the return/store/other roots.
Improvements to it will follow in separate patches.)
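For context, the kind of kernel this targets is the horizontal-add pattern used as the test case later in this thread, where a tree of adds over adjacent loads feeds the return directly:

    // A reduction (horizontal add) whose result feeds the 'return' directly,
    // with no intervening store; this is the pattern the patch recognizes.
    float hadd(float *a) {
      return ((a[0] + a[2]) + (a[1] + a[3]));
    }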

I do have one question - Should this be moved under a special 'flag' (same as horizontal store)?

Please help in reviewing the patch.

Regards,
Suyog

Diff Detail

Repository
rL LLVM

Event Timeline

suyog updated this revision to Diff 16094.Nov 12 2014, 8:11 AM
suyog retitled this revision from to [SLPVectorization] Vectorize Reduction Chain feeding into a 'return' statement.
suyog updated this object.
suyog edited the test plan for this revision. (Show Details)
suyog added a subscriber: Unknown Object (MLST).
hfinkel added inline comments.Nov 12 2014, 8:21 AM
test/Transforms/SLPVectorizer/X86/return.ll
Line 58 (On Diff #16094)

Please remove unnecessary attributes.

Line 62 (On Diff #16094)

Please remove unnecessary metadata.

aschwaighofer edited edge metadata.Nov 12 2014, 8:28 AM

The code looks fine to me.

There could be a benefit in calling:

SLPVectorizer::tryToVectorize(BinaryOperator *V, BoUpSLP &R)

instead of directly calling tryToVectorizePair.
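(A rough sketch of the difference, for illustration only; it assumes the hook sits in SLPVectorizer::vectorizeChainsInBlock and that tryToVectorizePair takes the two values plus the BoUpSLP reference. It is not the exact patch.)

    // Illustrative sketch only, not the committed code: try to vectorize a
    // reduction whose root binary operator feeds a non-void 'ret'.
    if (ReturnInst *RI = dyn_cast<ReturnInst>(BB->getTerminator()))
      if (RI->getNumOperands() != 0)
        if (BinaryOperator *BinOp = dyn_cast<BinaryOperator>(RI->getOperand(0))) {
          // As in the patch: vectorize the two operands of the root add.
          if (tryToVectorizePair(BinOp->getOperand(0), BinOp->getOperand(1), R))
            Changed = true;
          // Suggested alternative: hand the whole root to tryToVectorize(),
          // which can also consider wider groupings.
          // if (tryToVectorize(BinOp, R)) Changed = true;
        }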

Did you run performance tests (test-suite) on x86 or arm?

Thanks for working on this!

hfinkel edited edge metadata.Nov 12 2014, 8:31 AM

I do have one question - Should this be moved under a special 'flag' (same as horizontal store)?

I think that this is not necessary (unless we find that it introduces performance regressions); I don't see any reason for the cost modeling to be special here. Arnold, do you agree?

suyog updated this revision to Diff 16100.Nov 12 2014, 9:10 AM
suyog edited edge metadata.

Hi Hal, Arnold, James,

Thanks for reviewing the patch.

I have updated the patch removing unnecessary metadata from test case.

Following Arnold's suggestion, I tried using tryToVectorize() in place of tryToVectorizeList().
It broke some of the existing test cases (a <2 x float> was emitted instead of a <4 x float>).
I need to investigate that; until then, I am sticking with tryToVectorizeList().

As Arnold suggested, this is an intermediate step toward improving horizontal reductions.
The improvements pointed out earlier in the 'llvm-dev' mailing-list discussion will follow in upcoming patches.

I have not run the performance tests yet; I will update soon with the results.
(Unfortunately my laptop broke yesterday, so I have to rely on my office workstation.)

However, for AArch64, the test case is:

  float hadd(float *a) {
    return ((a[0] + a[2]) + (a[1] + a[3]));
  }

Assembly without the patch:

  ldp  s0, s1, [x0]
  ldp  s2, s3, [x0, #8]
  fadd s0, s0, s2
  fadd s1, s1, s3
  fadd s0, s0, s1
  ret

Assembly with the patch:

  ldp   d0, d1, [x0]
  fadd  v0.2s, v0.2s, v1.2s
  faddp s0, v0.2s
This indicates improved code generation: the paired loads feed a single vector fadd and a horizontal faddp, replacing the three scalar fadds.

Please help in reviewing the updated patch.

Regards,
Suyog

suyog added a comment.Nov 14 2014, 4:27 AM

Hi all,

I ran the performance test suite on X86 for 10 iterations; the output is below.

Please NOTE: BASELINE is WITH THE PROPOSED PATCH and CURRENT is WITHOUT THE PATCH.
(I tried swapping the baseline and the current so that current would contain the proposed patch; strangely,
the report always lists the patched build as the baseline.) No regressions were observed.
Test cases marked in RED are improvements, though they are unrelated to the proposed patch.

This code will be reusable for future improvements in identifying consecutive memory accesses in the same subtree.
Those will come in a separate patch, unrelated to the objective of this one.

I also checked the code generated for types smaller than 32 bits on AArch64.

Test case:

  #include <arm_neon.h>
  short hadd(short *a) {
    return ((a[0] + a[2]) + (a[1] + a[3]));
  }

IR after -O1 (without SLP):

  define i16 @hadd(i16* nocapture readonly %a) #0 {
  entry:
    %0 = load i16* %a, align 2, !tbaa !1
    %conv13 = zext i16 %0 to i32
    %arrayidx1 = getelementptr inbounds i16* %a, i64 2
    %1 = load i16* %arrayidx1, align 2, !tbaa !1
    %conv214 = zext i16 %1 to i32
    %arrayidx3 = getelementptr inbounds i16* %a, i64 1
    %2 = load i16* %arrayidx3, align 2, !tbaa !1
    %conv415 = zext i16 %2 to i32
    %arrayidx5 = getelementptr inbounds i16* %a, i64 3
    %3 = load i16* %arrayidx5, align 2, !tbaa !1
    %conv616 = zext i16 %3 to i32
    %add7 = add nuw nsw i32 %conv214, %conv13
    %add = add nuw nsw i32 %add7, %conv415
    %add8 = add nuw nsw i32 %add, %conv616
    %conv9 = trunc i32 %add8 to i16
    ret i16 %conv9
  }

Since extension and truncation operations are involved here, the current patch does not vectorize this case.

If we remove the extensions and truncations:

  define i16 @hadd(i16* nocapture readonly %a) #0 {
  entry:
    %0 = load i16* %a, align 2, !tbaa !1
    %arrayidx1 = getelementptr inbounds i16* %a, i64 2
    %1 = load i16* %arrayidx1, align 2, !tbaa !1
    %arrayidx3 = getelementptr inbounds i16* %a, i64 1
    %2 = load i16* %arrayidx3, align 2, !tbaa !1
    %arrayidx5 = getelementptr inbounds i16* %a, i64 3
    %3 = load i16* %arrayidx5, align 2, !tbaa !1
    %add7 = add nuw nsw i16 %0, %1
    %add = add nuw nsw i16 %2, %3
    %add8 = add nuw nsw i16 %add, %add7
    ret i16 %add8
  }

LLVM vectorizes this with the patch above.

Assembly code for the 16-bit case with extension/truncation after running the SLP pass (no vectorization done in this case):

  ldrh w8, [x0]
  ldrh w9, [x0, #4]
  ldrh w10, [x0, #2]
  ldrh w11, [x0, #6]
  add  w8, w9, w8
  add  w8, w8, w10
  add  w0, w8, w11
  ret

Assembly code for the 16-bit case without extension/truncation after running the SLP pass (vectorization done in this case):

  ldrh w8, [x0]
  ldrh w9, [x0, #2]
  ldrh w10, [x0, #4]
  ldrh w11, [x0, #6]
  fmov s0, w8
  fmov s1, w10
  ins  v0.s[1], w9
  ins  v1.s[1], w11
  add  v0.2s, v0.2s, v1.2s
  fmov w8, s0
  mov  w9, v0.s[1]
  add  w0, w9, w8
  ret

This looks like poor code for data types smaller than 32 bits.

However, the current patch doesn't vectorize types smaller than 32 bits, since it skips vectorization when a truncation/extension is encountered.

Please help in reviewing this patch.

Regards,
Suyog

suyog added a comment.Nov 18 2014, 5:05 AM

Gentle Ping !!

(Note: for the LNT test results, please refer to the Phabricator link; the snapshot is attached there.
Not sure why the snapshot didn't get attached to the mail triggered by Phabricator.)

Regards,
Suyog

ab added a subscriber: ab.Nov 18 2014, 7:15 AM

LGTM.

I don't think this should go behind a flag.

suyog accepted this revision.Nov 19 2014, 6:07 AM
suyog added a reviewer: suyog.
This revision is now accepted and ready to land.Nov 19 2014, 6:07 AM
suyog closed this revision.Nov 19 2014, 8:07 AM
suyog updated this revision to Diff 16386.

Closed by commit rL222364 (authored by @suyog).