This is an archive of the discontinued LLVM Phabricator instance.

[SLPVectorization] Vectorize Reduction Chain feeding into a 'return' statement
ClosedPublic

Authored by suyog on Nov 12 2014, 8:11 AM.

Details

Summary

This patch vectorizes a reduction chain feeding into a 'return' statement.

(With the current code, the reduction chain has to be in a specific form for the return/store/other roots.
Improvements to it will follow in separate patches.)
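For context, the kind of kernel this targets is the horizontal-add pattern used as the test case later in this thread, where a tree of adds over adjacent loads feeds the return directly:

    // A reduction (horizontal add) whose result feeds the 'return' directly,
    // with no intervening store; this is the pattern the patch recognizes.
    float hadd(float *a) {
      return ((a[0] + a[2]) + (a[1] + a[3]));
    }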

I do have one question - Should this be moved under a special 'flag' (same as horizontal store)?

Please help in reviewing the patch.

Regards,
Suyog

Diff Detail

Repository
rL LLVM

Event Timeline

suyog updated this revision to Diff 16094.Nov 12 2014, 8:11 AM
suyog retitled this revision from to [SLPVectorization] Vectorize Reduction Chain feeding into a 'return' statement.
suyog updated this object.
suyog edited the test plan for this revision. (Show Details)
suyog added a subscriber: Unknown Object (MLST).
hfinkel added inline comments.Nov 12 2014, 8:21 AM
test/Transforms/SLPVectorizer/X86/return.ll
Line 58 (On Diff #16094)

Please remove unnecessary attributes.

Line 62 (On Diff #16094)

Please remove unnecessary metadata.

aschwaighofer edited edge metadata.Nov 12 2014, 8:28 AM

The code looks fine to me.

There could be a benefit in calling:

SLPVectorizer::tryToVectorize(BinaryOperator *V, BoUpSLP &R)

instead of directly calling tryToVectorizePair.
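(A rough sketch of the difference, for illustration only; it assumes the hook sits in SLPVectorizer::vectorizeChainsInBlock and that tryToVectorizePair takes the two values plus the BoUpSLP reference. It is not the exact patch.)

    // Illustrative sketch only, not the committed code: try to vectorize a
    // reduction whose root binary operator feeds a non-void 'ret'.
    if (ReturnInst *RI = dyn_cast<ReturnInst>(BB->getTerminator()))
      if (RI->getNumOperands() != 0)
        if (BinaryOperator *BinOp = dyn_cast<BinaryOperator>(RI->getOperand(0))) {
          // As in the patch: vectorize the two operands of the root add.
          if (tryToVectorizePair(BinOp->getOperand(0), BinOp->getOperand(1), R))
            Changed = true;
          // Suggested alternative: hand the whole root to tryToVectorize(),
          // which can also consider wider groupings.
          // if (tryToVectorize(BinOp, R)) Changed = true;
        }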

Did you run performance tests (test-suite) on x86 or arm?

Thanks for working on this!

hfinkel edited edge metadata.Nov 12 2014, 8:31 AM

I do have one question - Should this be moved under a special 'flag' (same as horizontal store)?

I think that this is not necessary (unless we find that it introduces performance regressions); I don't see any reason for the cost modeling to be special here. Arnold, do you agree?

suyog updated this revision to Diff 16100.Nov 12 2014, 9:10 AM
suyog edited edge metadata.

Hi Hal, Arnold, James,

Thanks for reviewing the patch.

I have updated the patch removing unnecessary metadata from test case.

Following Arnold's suggestion, I tried using tryToVectorize() in place of tryToVectorizeList().
It broke some of the existing test cases (a <2 x float> was emitted instead of a <4 x float>).
I need to investigate that; until then, I am sticking with tryToVectorizeList().

As Arnold suggested, this is an intermediate step toward improving horizontal reductions.
The improvements pointed out earlier in the 'llvm-dev' mailing-list discussion will follow in upcoming patches.

I have not run the performance tests yet; I will update soon with the results.
(Unfortunately my laptop broke yesterday, so I have to rely on my office workstation.)

However, for AArch64, the test case is:

  float hadd(float *a) {
    return ((a[0] + a[2]) + (a[1] + a[3]));
  }

Assembly without the patch:

  ldp  s0, s1, [x0]
  ldp  s2, s3, [x0, #8]
  fadd s0, s0, s2
  fadd s1, s1, s3
  fadd s0, s0, s1
  ret

Assembly with the patch:

  ldp   d0, d1, [x0]
  fadd  v0.2s, v0.2s, v1.2s
  faddp s0, v0.2s
This indicates improved code generation: the paired loads feed a single vector fadd and a horizontal faddp, replacing the three scalar fadds.

Please help in reviewing the updated patch.

Regards,
Suyog

suyog added a comment.Nov 14 2014, 4:27 AM

Hi all,

I ran the performance test suite on X86 for 10 iterations; the output is below.

Please NOTE: BASELINE is WITH THE PROPOSED PATCH and CURRENT is WITHOUT THE PATCH.
(I tried swapping the baseline and the current so that current would contain the proposed patch; strangely,
the report always lists the patched build as the baseline.) No regressions were observed.
Test cases marked in RED are improvements, though they are unrelated to the proposed patch.

This code will be reusable for future improvements in identifying consecutive memory accesses in the same subtree.
Those will come in a separate patch, unrelated to the objective of this one.

I also checked the code generated for types smaller than 32 bits on AArch64.

Test case:

  #include <arm_neon.h>
  short hadd(short *a) {
    return ((a[0] + a[2]) + (a[1] + a[3]));
  }

IR after -O1 (without SLP):

  define i16 @hadd(i16* nocapture readonly %a) #0 {
  entry:
    %0 = load i16* %a, align 2, !tbaa !1
    %conv13 = zext i16 %0 to i32
    %arrayidx1 = getelementptr inbounds i16* %a, i64 2
    %1 = load i16* %arrayidx1, align 2, !tbaa !1
    %conv214 = zext i16 %1 to i32
    %arrayidx3 = getelementptr inbounds i16* %a, i64 1
    %2 = load i16* %arrayidx3, align 2, !tbaa !1
    %conv415 = zext i16 %2 to i32
    %arrayidx5 = getelementptr inbounds i16* %a, i64 3
    %3 = load i16* %arrayidx5, align 2, !tbaa !1
    %conv616 = zext i16 %3 to i32
    %add7 = add nuw nsw i32 %conv214, %conv13
    %add = add nuw nsw i32 %add7, %conv415
    %add8 = add nuw nsw i32 %add, %conv616
    %conv9 = trunc i32 %add8 to i16
    ret i16 %conv9
  }

Since extension and truncation operations are involved here, the current patch does not vectorize this case.

If we remove the extensions and truncations:

  define i16 @hadd(i16* nocapture readonly %a) #0 {
  entry:
    %0 = load i16* %a, align 2, !tbaa !1
    %arrayidx1 = getelementptr inbounds i16* %a, i64 2
    %1 = load i16* %arrayidx1, align 2, !tbaa !1
    %arrayidx3 = getelementptr inbounds i16* %a, i64 1
    %2 = load i16* %arrayidx3, align 2, !tbaa !1
    %arrayidx5 = getelementptr inbounds i16* %a, i64 3
    %3 = load i16* %arrayidx5, align 2, !tbaa !1
    %add7 = add nuw nsw i16 %0, %1
    %add = add nuw nsw i16 %2, %3
    %add8 = add nuw nsw i16 %add, %add7
    ret i16 %add8
  }

LLVM vectorizes this with the patch above.

Assembly code for the 16-bit case with extension/truncation after running the SLP pass (no vectorization done in this case):

  ldrh w8, [x0]
  ldrh w9, [x0, #4]
  ldrh w10, [x0, #2]
  ldrh w11, [x0, #6]
  add  w8, w9, w8
  add  w8, w8, w10
  add  w0, w8, w11
  ret

Assembly code for the 16-bit case without extension/truncation after running the SLP pass (vectorization done in this case):

  ldrh w8, [x0]
  ldrh w9, [x0, #2]
  ldrh w10, [x0, #4]
  ldrh w11, [x0, #6]
  fmov s0, w8
  fmov s1, w10
  ins  v0.s[1], w9
  ins  v1.s[1], w11
  add  v0.2s, v0.2s, v1.2s
  fmov w8, s0
  mov  w9, v0.s[1]
  add  w0, w9, w8
  ret

This looks like poor code for data types smaller than 32 bits.

However, the current patch doesn't vectorize types smaller than 32 bits, since it skips vectorization when a truncation/extension is encountered.

Please help in reviewing this patch.

Regards,
Suyog

suyog added a comment.Nov 18 2014, 5:05 AM

Gentle Ping !!

(Note: for the LNT test results, please refer to the Phabricator link; the snapshot is attached there.
Not sure why the snapshot didn't get attached to the mail triggered by Phabricator.)

Regards,
Suyog

ab added a subscriber: ab.Nov 18 2014, 7:15 AM

LGTM.

I don't think this should go behind a flag.

suyog accepted this revision.Nov 19 2014, 6:07 AM
suyog added a reviewer: suyog.
This revision is now accepted and ready to land.Nov 19 2014, 6:07 AM
suyog closed this revision.Nov 19 2014, 8:07 AM
suyog updated this revision to Diff 16386.

Closed by commit rL222364 (authored by @suyog).