Vector element size could be different for different store chains.
This patch prevents wrong computation of maximum number of elements
for that case.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Yes, I've observed this impact too before committing. It looks suspiciously good, but actually this fix is just exploiting potential built before. It's bug fix, so I doubt it's able to make things worse.
In-short: I'm still do sure that current patch is bugfixing. But this bug had induced some of boundary vectorization cases before fixing. Below is a typical case.
I've dug into the several files of the tests above (for instance, consumer-typeset/z35.c, here is details: http://llvm-compile-time-tracker.com/compare.php?from=a6c25539c1ed6b60dc3b92a5abe25f6bdb6e9788&to=283577865a97c26852292afae4fa57c689edbbfb&stat=size-total&details=on) and checked SLP stats:
$ ~/llvm/llvm-project/build_deb_base/bin/clang ... -mllvm --stats -c consumer-typeset/z35.c -o z35.base.o 2>stat.base $ ~/llvm/llvm-project/build_deb_exp/bin/clang ... -mllvm --stats -c consumer-typeset/z35.c -o z35.exp.o 2>stat.exp $ size z35.base.o text data bss dec hex filename 9057 0 56 9113 2399 z43.base.o $ size z43.exp.o text data bss dec hex filename 8929 0 56 8985 2319 z43.exp.o $ grep SLP stat.* stat.base: 35 SLP - Number of vector instructions generated
So, really code size decreased, but due to the number of vector instrs _not generated_. I've investigated these instructions and also gathered vectorization costs.
Before bugfixing:
%53 = insertelement <4 x %union.rec*> undef, %union.rec* %.in, i32 0 %54 = shufflevector <4 x %union.rec*> %53, <4 x %union.rec*> undef, <4 x i32> zeroinitializer %55 = bitcast %union.rec* %.in to <4 x %union.rec*>* store <4 x %union.rec*> %54, <4 x %union.rec*>* %55, align 8, !tbaa !6
After bugfixing:
%53 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 1, i32 1 store %union.rec* %.in, %union.rec** %53, align 8, !tbaa !6 %54 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 1, i32 0 store %union.rec* %.in, %union.rec** %54, align 8, !tbaa !6 %55 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 0, i32 1 store %union.rec* %.in, %union.rec** %55, align 8, !tbaa !6 %56 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 0, i32 0 store %union.rec* %.in, %union.rec** %56, align 8, !tbaa !6
Asm code before:
movq %rax, %xmm0 pshufd $68, %xmm0, %xmm0 # xmm0 = xmm0[0,1,0,1] movdqu %xmm0, 16(%rax) movdqu %xmm0, (%rax)
Asm code after:
movq %rax, 24(%rax) movq %rax, 16(%rax) movq %rax, 8(%rax) movq %rax, (%rax)
So actually <4 x %union.rec*> doesn't fit to maximum vector register (4x64=256), but store instr could be lowered to two movdqu. The total cost of tree is -1, allowing vectorization. But four stores hasn't been vectorized to two <2 x %union.reg*> after bugfixing, since cost of both parts is zero. We have a border case when 0 + 0 = -1 here.
Not sure what to do with it (allow 2 * MaxVecRegSize for llvm vectors of tree?), but current fix is still necessary -- previously bug has exploited this "border case" accidently, for the random cases of combined stores chains.