This is an archive of the discontinued LLVM Phabricator instance.

In D93192#2451838, @xbolva00 wrote:

Kinda big impact on code size

https://llvm-compile-time-tracker.com/compare.php?from=6c8ded0d8c3c256defaf72c0596614eea465ca27&to=fac7c7ec3ccd64d19b6d33af0a8bc2f3f7f7b047&stat=size-text

Is it ok? No performance regressions?

Yes, I observed this impact too before committing. It looks suspiciously good, but the fix is really just exploiting potential that was already built in. It's a bug fix, so I doubt it can make things worse.

In short: I'm still sure that the current patch is a bug fix. However, before the fix, the bug had induced vectorization in some borderline cases. Below is a typical case.

I've dug into several of the files from the tests above (for instance, consumer-typeset/z35.c; here are the details: http://llvm-compile-time-tracker.com/compare.php?from=a6c25539c1ed6b60dc3b92a5abe25f6bdb6e9788&to=283577865a97c26852292afae4fa57c689edbbfb&stat=size-total&details=on) and checked the SLP stats:

$ ~/llvm/llvm-project/build_deb_base/bin/clang ... -mllvm --stats -c consumer-typeset/z35.c -o z35.base.o 2>stat.base
$ ~/llvm/llvm-project/build_deb_exp/bin/clang ... -mllvm --stats -c consumer-typeset/z35.c -o z35.exp.o 2>stat.exp
$ size z35.base.o
   text    data     bss     dec     hex filename
   9057       0      56    9113    2399 z43.base.o
$ size z43.exp.o
   text    data     bss     dec     hex filename
   8929       0      56    8985    2319 z43.exp.o
$ grep SLP stat.*
stat.base:   35 SLP                          - Number of vector instructions generated

So code size really did decrease (text went from 9057 to 8929 bytes, i.e. by 128 bytes), but only because some vector instructions were _not generated_. I've investigated these instructions and also gathered the vectorization costs.
Before bugfixing:

%53 = insertelement <4 x %union.rec*> undef, %union.rec* %.in, i32 0
%54 = shufflevector <4 x %union.rec*> %53, <4 x %union.rec*> undef, <4 x i32> zeroinitializer
%55 = bitcast %union.rec* %.in to <4 x %union.rec*>*
store <4 x %union.rec*> %54, <4 x %union.rec*>* %55, align 8, !tbaa !6

After bugfixing:

%53 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 1, i32 1
store %union.rec* %.in, %union.rec** %53, align 8, !tbaa !6
%54 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 1, i32 0
store %union.rec* %.in, %union.rec** %54, align 8, !tbaa !6
%55 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 0, i32 1
store %union.rec* %.in, %union.rec** %55, align 8, !tbaa !6
%56 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 0, i32 0
store %union.rec* %.in, %union.rec** %56, align 8, !tbaa !6

Asm code before:

movq    %rax, %xmm0
pshufd  $68, %xmm0, %xmm0               # xmm0 = xmm0[0,1,0,1]
movdqu  %xmm0, 16(%rax)
movdqu  %xmm0, (%rax)

Asm code after:

movq    %rax, 24(%rax)
movq    %rax, 16(%rax)
movq    %rax, 8(%rax)
movq    %rax, (%rax)

So <4 x %union.rec*> actually doesn't fit in the maximum vector register (4 x 64 = 256 bits), but the store instruction could be lowered to two movdqu. The total cost of the tree is -1, which allows vectorization. After the bug fix, however, the four stores aren't vectorized as two <2 x %union.rec*> stores, since the cost of each part is zero. We have a border case here, where 0 + 0 = -1.
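The border case can be reproduced with a tiny model of the profitability check. This is only a sketch: it assumes the vectorizer accepts a tree when its cost is strictly below the negated threshold and that the threshold has its default value of 0. The helper names `isProfitable`, `kCostWholeTree` and `kCostHalfTree` are hypothetical; the cost values are the ones quoted above.

```cpp
#include <cassert>

// Simplified model (an assumption, not the real SLPVectorizer code):
// a tree is vectorized only if its cost is strictly below -threshold.
bool isProfitable(int treeCost, int threshold = 0) {
  assert(threshold >= 0 && "sketch only models non-negative thresholds");
  return treeCost < -threshold;
}

// Costs taken from the comment above.
constexpr int kCostWholeTree = -1; // one <4 x %union.rec*> tree (before the fix)
constexpr int kCostHalfTree = 0;   // each <2 x %union.rec*> tree (after the fix)

// Before the fix the 4-wide tree passes the check; after the fix each
// 2-wide half is exactly break-even and is rejected.
bool vectorizedBeforeFix() { return isProfitable(kCostWholeTree); }
bool vectorizedAfterFix() { return isProfitable(kCostHalfTree); }
```

Under these assumptions `vectorizedBeforeFix()` is true and `vectorizedAfterFix()` is false, which is the "0 + 0 = -1" situation: the halves are judged separately, so their costs are never summed into the negative total.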

Not sure what to do about it (allow 2 * MaxVecRegSize for the LLVM vectors of a tree?), but the current fix is still necessary: previously the bug exploited this "border case" accidentally, for random cases of combined store chains.

anton-afanasyev mentioned this in D94974: [SLP] Try doubled MaxElts for stores vectorization. Jan 19 2021, 8:19 AM

Revision Contents

Path                                                                Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp                     2 lines
llvm/test/Transforms/SLPVectorizer/X86/combined-stores-chains.ll    35 lines

Diff 311554

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

[6,070 preceding lines not shown; enclosing scope: for (int Cnt = E; Cnt > 0; --Cnt) {]
   while (I != E + 1 && !VectorizedStores.count(Stores[I])) {
     Operands.push_back(Stores[I]);
     // Move to the next value in the chain.
     I = ConsecutiveChain[I];
   }

   // If a vector register can't hold 1 element, we are done.
   unsigned MaxVecRegSize = R.getMaxVecRegSize();
-  unsigned EltSize = R.getVectorElementSize(Stores[0]);
+  unsigned EltSize = R.getVectorElementSize(Operands[0]);
   if (MaxVecRegSize % EltSize != 0)
     continue;

   unsigned MaxElts = MaxVecRegSize / EltSize;
   // FIXME: Is division-by-2 the correct step? Should we assert that the
   // register size is a power-of-2?
   unsigned StartIdx = 0;
   for (unsigned Size = llvm::PowerOf2Ceil(MaxElts); Size >= 2; Size /= 2) {
[1,840 following lines not shown]
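To see why taking the element size from the wrong store matters, here is a minimal standalone sketch of the width-selection loop. It does not use LLVM: `powerOf2Ceil` is a stand-in for `llvm::PowerOf2Ceil`, `candidateSizes` is a hypothetical helper, and the 128-bit register size and the i32/i64 element sizes are assumptions chosen to match the combined-chains test, not values read from the patch.

```cpp
#include <cassert>
#include <vector>

// Stand-in for llvm::PowerOf2Ceil: smallest power of two >= n (for n >= 1).
unsigned powerOf2Ceil(unsigned n) {
  unsigned p = 1;
  while (p < n)
    p *= 2;
  return p;
}

// Vector widths (in elements) that the loop would try, widest first.
// Mirrors: for (Size = PowerOf2Ceil(MaxElts); Size >= 2; Size /= 2)
std::vector<unsigned> candidateSizes(unsigned maxVecRegBits, unsigned eltBits) {
  assert(eltBits != 0 && "element size must be non-zero");
  std::vector<unsigned> sizes;
  if (maxVecRegBits % eltBits != 0)
    return sizes; // such chains are skipped, as in the original loop
  unsigned maxElts = maxVecRegBits / eltBits;
  for (unsigned size = powerOf2Ceil(maxElts); size >= 2; size /= 2)
    sizes.push_back(size);
  return sizes;
}
```

With an assumed 128-bit register, `candidateSizes(128, 32)` yields widths 4 and 2, while `candidateSizes(128, 64)` yields only 2. So if a chain of 64-bit stores is measured with the 32-bit element size of an unrelated `Stores[0]`, a 4-element (256-bit) vector is attempted; using `Operands[0]`, the first store of the chain actually being vectorized, caps the width at 2.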

llvm/test/Transforms/SLPVectorizer/X86/combined-stores-chains.ll

[17 preceding lines not shown]
 ; CHECK-NEXT:    [[T21:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 4
 ; CHECK-NEXT:    [[T25:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 5
 ; CHECK-NEXT:    [[T29:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 6
 ; CHECK-NEXT:    [[T32:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 7
 ; CHECK-NEXT:    [[T212:%.*]] = getelementptr inbounds i64, i64* [[T02]], i64 8
 ; CHECK-NEXT:    [[T252:%.*]] = getelementptr inbounds i64, i64* [[T02]], i64 9
 ; CHECK-NEXT:    [[T292:%.*]] = getelementptr inbounds i64, i64* [[T02]], i64 10
 ; CHECK-NEXT:    [[T322:%.*]] = getelementptr inbounds i64, i64* [[T02]], i64 11
-; CHECK-NEXT:    [[T19:%.*]] = load i32, i32* [[T14]], align 4
-; CHECK-NEXT:    [[T23:%.*]] = load i32, i32* [[T18]], align 4
-; CHECK-NEXT:    [[T27:%.*]] = load i32, i32* [[T22]], align 4
-; CHECK-NEXT:    [[T30:%.*]] = load i32, i32* [[T26]], align 4
-; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i64* [[T142]] to <2 x i64>*
-; CHECK-NEXT:    [[TMP2:%.*]] = load <2 x i64>, <2 x i64>* [[TMP1]], align 8
-; CHECK-NEXT:    [[TMP3:%.*]] = bitcast i64* [[T222]] to <2 x i64>*
+; CHECK-NEXT:    [[TMP1:%.*]] = bitcast i32* [[T14]] to <4 x i32>*
+; CHECK-NEXT:    [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
+; CHECK-NEXT:    [[TMP3:%.*]] = bitcast i64* [[T142]] to <2 x i64>*
 ; CHECK-NEXT:    [[TMP4:%.*]] = load <2 x i64>, <2 x i64>* [[TMP3]], align 8
-; CHECK-NEXT:    [[T20:%.*]] = add nsw i32 [[T19]], 4
-; CHECK-NEXT:    [[T24:%.*]] = add nsw i32 [[T23]], 4
-; CHECK-NEXT:    [[T28:%.*]] = add nsw i32 [[T27]], 6
-; CHECK-NEXT:    [[T31:%.*]] = add nsw i32 [[T30]], 7
-; CHECK-NEXT:    [[TMP5:%.*]] = add nsw <2 x i64> [[TMP2]], <i64 4, i64 4>
-; CHECK-NEXT:    [[TMP6:%.*]] = add nsw <2 x i64> [[TMP4]], <i64 6, i64 7>
-; CHECK-NEXT:    [[TMP7:%.*]] = bitcast i64* [[T212]] to <2 x i64>*
-; CHECK-NEXT:    store <2 x i64> [[TMP5]], <2 x i64>* [[TMP7]], align 8
-; CHECK-NEXT:    [[TMP8:%.*]] = bitcast i64* [[T292]] to <2 x i64>*
-; CHECK-NEXT:    store <2 x i64> [[TMP6]], <2 x i64>* [[TMP8]], align 8
-; CHECK-NEXT:    store i32 [[T20]], i32* [[T21]], align 4
-; CHECK-NEXT:    store i32 [[T24]], i32* [[T25]], align 4
-; CHECK-NEXT:    store i32 [[T28]], i32* [[T29]], align 4
-; CHECK-NEXT:    store i32 [[T31]], i32* [[T32]], align 4
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i64* [[T222]] to <2 x i64>*
+; CHECK-NEXT:    [[TMP6:%.*]] = load <2 x i64>, <2 x i64>* [[TMP5]], align 8
+; CHECK-NEXT:    [[TMP7:%.*]] = add nsw <4 x i32> [[TMP2]], <i32 4, i32 4, i32 6, i32 7>
+; CHECK-NEXT:    [[TMP8:%.*]] = add nsw <2 x i64> [[TMP4]], <i64 4, i64 4>
+; CHECK-NEXT:    [[TMP9:%.*]] = add nsw <2 x i64> [[TMP6]], <i64 6, i64 7>
+; CHECK-NEXT:    [[TMP10:%.*]] = bitcast i64* [[T212]] to <2 x i64>*
+; CHECK-NEXT:    store <2 x i64> [[TMP8]], <2 x i64>* [[TMP10]], align 8
+; CHECK-NEXT:    [[TMP11:%.*]] = bitcast i64* [[T292]] to <2 x i64>*
+; CHECK-NEXT:    store <2 x i64> [[TMP9]], <2 x i64>* [[TMP11]], align 8
+; CHECK-NEXT:    [[TMP12:%.*]] = bitcast i32* [[T21]] to <4 x i32>*
+; CHECK-NEXT:    store <4 x i32> [[TMP7]], <4 x i32>* [[TMP12]], align 4
 ; CHECK-NEXT:    ret void
 ;
   %t0 = bitcast i8* %v0 to i32*
   %t1 = bitcast i8* %v1 to i32*

   %t02 = bitcast i8* %v0 to i64*
   %t12 = bitcast i8* %v1 to i64*

[52 following lines not shown]

[SLP] Fix vector element size for the store chains (Closed, Public)