[SLPVectorizer] Limit the number of block chain instructions to max register size
Abandoned · Public

Authored by mkazantsev on Apr 26 2017, 4:50 AM.

Details

Summary

When aggregating instructions of the same type in the vectorizeChainsInBlock method, SLPVectorizer
does not take the maximum vector register size into account. As a result, it may generate vector
instructions that use vectors wider than allowed.

This patch limits the number of aggregated instructions so that their overall size does not exceed
the max allowed vector size.
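The limiting logic proposed here can be sketched as follows (a minimal Python sketch, not the actual patch; `limit_chain` and its parameters are hypothetical stand-ins for the values SLPVectorizer queries from the target):

```python
def limit_chain(insts, elem_size_bits, max_vec_reg_size_bits):
    """Truncate a chain of same-typed scalar instructions so that the
    combined vector width does not exceed the max vector register size."""
    max_elems = max(1, max_vec_reg_size_bits // elem_size_bits)
    return insts[:max_elems]

# E.g. a chain of 16 i64 instructions against a 512-bit register:
# only the first 8 (8 * 64 = 512 bits) would be kept.
chain = [f"inst{i}" for i in range(16)]
print(len(limit_chain(chain, 64, 512)))  # → 8
```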

Diff Detail

mkazantsev created this revision.Apr 26 2017, 4:50 AM
anna added a subscriber: mssimpso.Apr 26 2017, 6:56 AM

Hi Max,
I think we are allowed to have vector operations (other than vector stores) on vector sizes greater than getMaxVecRegSize in the IR. In the SLP vectorizer, we look for the max size when storing. Both GEP vectorization and phi-vectorization go by the number of scalar elements chosen for vectorization, rather than R.getMaxVecRegSize (for example, GEP chooses chunks of 16).

Also, looking at all the llvm tests, there are many cases where GEPs, shuffles, inserts and extracts operate on <16 x i64> vector types, i.e. 1024 bits. The maximum vector register size among all targets is 512 bits (zmm registers in AVX512). Running this through codegen, I can see we use the correct vector register size allowed at the target, even though the IR has <16 x i64>. Given this, I don't think this is a bug.
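The arithmetic behind this observation can be checked directly (a sketch using only the figures quoted above; the helper names are made up for illustration):

```python
def vector_bits(num_elems, elem_bits):
    """Total bit width of a <num_elems x iN> vector type."""
    return num_elems * elem_bits

def legal_parts(total_bits, max_reg_bits):
    """How many legal-width registers the backend would split a wide
    vector into (ceiling division, for the non-exact-multiple case)."""
    return -(-total_bits // max_reg_bits)

wide = vector_bits(16, 64)        # <16 x i64>
print(wide)                       # → 1024 bits in the IR
print(legal_parts(wide, 512))     # → 2 zmm-sized pieces on AVX512
```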

@mssimpso Is this correct? Also, I would like to know the design philosophy behind having the large vector sizes, and allowing LLC to find the correct vector register size.

In D32533#738049, @anna wrote:

@mssimpso Is this correct? Also, I would like to know the design philosophy behind having the large vector sizes, and allowing LLC to find the correct vector register size.

As far as I know, this is correct. The vectorizers are generally allowed to (and in some cases do) create vectors that are wider than the physical vector register size of the target. The backend is supposed to know how to split them up into legal sizes. The min/max vector register size options in SLP can use some work, though (see D31965 and the TODO on line 325). I think they are primarily used to narrow the search space for compile-time savings. But I think we currently only do this for store-rooted trees (vectorizeStores).

anna added a comment.Apr 26 2017, 9:15 AM

Okay, that explains why we limit to the target vector register size *only* for the store chain in SLP (compile-time benefit). However, if we look at the Loop Vectorizer, we consider the maximum vector register size when generating the code in the IR. This also gives a more accurate cost model for LV.

Not considering the physical vector register size is limiting the SLP cost model, right? For example, on the target, we would end up with 4 shuffles instead of a single shuffle.

In D32533#738239, @anna wrote:

Not considering the physical vector register size is limiting the SLP cost model, right? For example, on the target, we would end up with 4 shuffles instead of a single shuffle.

The physical vector register size is used when computing costs. In the loop vectorizer, we set a MaxVF based on the target register size, then compute the cost for all VFs up to this size and select the VF that is most profitable. But for SLP (for store-rooted trees), we have MinVecRegSize and MaxVecRegSize (where MaxVecRegSize is defined by the target). We try VFs based on these sizes from Max to Min and vectorize with the first one that is profitable (we don't try all of them).
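The two selection strategies described above can be contrasted in a sketch (the cost function and the candidate VF lists are toy stand-ins; real costs come from the target's cost model, not from these helpers):

```python
def lv_select(vfs, cost):
    """Loop Vectorizer style: evaluate every candidate VF up to MaxVF
    and keep the most profitable one (falling back to scalar, VF=1)."""
    profitable = [vf for vf in vfs if cost(vf) < cost(1)]
    return min(profitable, key=cost, default=1)

def slp_store_select(vfs, cost):
    """SLP store-chain style: walk from the largest VF down and take
    the FIRST profitable one, without trying the rest."""
    for vf in sorted(vfs, reverse=True):
        if cost(vf) < cost(1):
            return vf
    return 1

# Toy cost table: VF=4 is cheapest, but VF=8 is already profitable.
toy_cost = {1: 10, 2: 9, 4: 4, 8: 6}.get
print(lv_select([2, 4, 8], toy_cost))         # → 4
print(slp_store_select([2, 4, 8], toy_cost))  # → 8
```

The difference the comment describes falls out directly: LV's exhaustive search lands on the cheapest VF, while SLP's first-profitable walk from the top may settle for a wider but costlier one.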

But again, it looks to me like these limits are only imposed on the store-rooted trees. For vectorizeChainsInBlock and vectorizeGEPIndices, it looks like we basically let the vector be as wide as it can be, compute the cost of that, and if profitable, vectorize. The costs are still based on the target register size, though.

anna added a comment.Apr 26 2017, 10:32 AM

Thanks Matt. Looking at the SLP vectorizer code, I agree with what you've said.

Just to summarize, SLP vectorizer computes with the max possible vector length in the IR for GEPs and PHIs. However, the physical vector register size is used in the cost calculation, and vectorization is done only if it's profitable. I've verified that vectorizing stores, GEPs and PHIs uses this cost threshold to decide if vectorization is profitable. The downside to how the cost model is used in SLP vectorizer is that we may miss out on vectorizing GEPs and PHIs: because we choose the widest possible vector, if that turns out to be unprofitable (e.g. because the target's physical register size is small), we simply won't vectorize.

I think the right fix from a profitability standpoint (i.e. vectorizing more GEPs and PHIs) is to do the same thing we do for store-rooted trees: try from max VF = target's physical vector register width down to min VF, and stop when we find something profitable. The issue might be a higher compile time. From a correctness standpoint, the SLP code is doing the right thing.

In D32533#738239, @anna wrote:

Okay, that explains why we limit to the target vector register size *only* for the store chain in SLP (compile-time benefit). However, if we look at the Loop Vectorizer, we consider the maximum vector register size when generating the code in the IR. This also gives a more accurate cost model for LV.

Not considering the physical vector register size is limiting the SLP cost model, right? For example, on the target, we would end up with 4 shuffles instead of a single shuffle.

SLP and LV, unfortunately, have different approaches here.

SLP, except for the store-chain case, ignores register sizes. The assumption is that (a) the legalizer will do a good job, and (b) the cost model accurately reflects legalization costs. LV is more conservative, and will not create vectors wider than the register size (it has a flag that enables it to do so, but it's off by default).

The direction we want to move in is *not* limiting the vector size in either SLP or LV, that is, the opposite of what this patch does. This is important for vectorizing code that mixes types of different sizes. That hasn't happened yet, for a couple of reasons. One is that the assumptions SLP makes about the cost model and legalization don't necessarily hold. :-) The other is that doing it correctly also requires modeling register pressure. We already do this in LV for the interleaving (unrolling) factor, to an extent, but it needs to be integrated with the vectorization-factor heuristic as well.

So, overall, I'd say the right solution here is not to stop SLP from creating wide vectors, but to fix the backend/cost model issues.

mkazantsev abandoned this revision.Apr 26 2017, 9:20 PM

Thank you guys for the clarification. I'm abandoning this revision since the existing behavior is correct.