From what I can tell, we are artificially restricting the pass: it bails out if it would vectorize to a non-power-of-2 number of elements. That is, everything below the changed part of this patch already works as intended for calculating costs and tree elements. As a safeguard, I am proposing to add a debug flag for experimentation, in case this change reveals regressions.
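For illustration, here is a reduced sketch (names invented; not a test from this patch) of the kind of pattern the bail-out affects: three adjacent scalar loads that could become a single <3 x float> load, except that 3 is not a power of 2:

; Hypothetical reduced example (not from this patch): three adjacent
; scalar loads feeding inserts. These could become one <3 x float>
; load, but 3 is not a power of 2, so the pass currently bails out.
define <4 x float> @load3(float* %p) {
  %p1 = getelementptr inbounds float, float* %p, i64 1
  %p2 = getelementptr inbounds float, float* %p, i64 2
  %x0 = load float, float* %p, align 4
  %x1 = load float, float* %p1, align 4
  %x2 = load float, float* %p2, align 4
  %v0 = insertelement <4 x float> undef, float %x0, i32 0
  %v1 = insertelement <4 x float> %v0, float %x1, i32 1
  %v2 = insertelement <4 x float> %v1, float %x2, i32 2
  ret <4 x float> %v2
}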
A test similar to the one in this diff:
rL369255
...shows that we can already generate a non-standard vector size (<2 x float>) and shuffle.
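For reference, a reduced sketch of that kind of output (invented names; not the exact test from rL369255):

; Hypothetical reduced example (not the exact test from rL369255):
; a non-standard vector size (<2 x float>) that we can already
; generate, widened to <4 x float> by a shuffle.
define <4 x float> @widen2(<2 x float>* %p) {
  %v2 = load <2 x float>, <2 x float>* %p, align 8
  %v4 = shufflevector <2 x float> %v2, <2 x float> undef, <4 x i32> <i32 0, i32 1, i32 1, i32 1>
  ret <4 x float> %v4
}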
The motivating case is from PR16739:
https://bugs.llvm.org/show_bug.cgi?id=16739
...and after instcombine, we end up with:
define <4 x float> @PR16739_byref(<4 x float>* nocapture readonly dereferenceable(16) %x) {
  %1 = bitcast <4 x float>* %x to <3 x float>*
  %2 = load <3 x float>, <3 x float>* %1, align 4
  %i3 = shufflevector <3 x float> %2, <3 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 2>
  ret <4 x float> %i3
}
And because we know the pointer is dereferenceable up to 16 bytes, the backend generates optimal x86 code:
movups (%rdi), %xmm0
shufps $164, %xmm0, %xmm0 ## xmm0 = xmm0[0,1,2,2]
This does not appear to interact with proposal D57779, but maybe we are just lacking the regression tests to show it?