This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Fix vector element size for the store chains
ClosedPublic

Authored by anton-afanasyev on Dec 13 2020, 10:22 PM.

Details

Summary

Vector element size could be different for different store chains.
This patch prevents wrong computation of maximum number of elements
for that case.

Diff Detail

Event Timeline

anton-afanasyev requested review of this revision.Dec 13 2020, 10:22 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 13 2020, 10:22 PM
RKSimon accepted this revision.Dec 14 2020, 4:25 AM

LGTM

This revision is now accepted and ready to land.Dec 14 2020, 4:25 AM
This revision was landed with ongoing or failed builds.Dec 14 2020, 4:54 AM
This revision was automatically updated to reflect the committed changes.

Yes, I've observed this impact too before committing. It looks suspiciously good, but actually this fix is just exploiting potential built before. It's bug fix, so I doubt it's able to make things worse.

In-short: I'm still do sure that current patch is bugfixing. But this bug had induced some of boundary vectorization cases before fixing. Below is a typical case.

I've dug into the several files of the tests above (for instance, consumer-typeset/z35.c, here is details: http://llvm-compile-time-tracker.com/compare.php?from=a6c25539c1ed6b60dc3b92a5abe25f6bdb6e9788&to=283577865a97c26852292afae4fa57c689edbbfb&stat=size-total&details=on) and checked SLP stats:

$ ~/llvm/llvm-project/build_deb_base/bin/clang ... -mllvm --stats -c consumer-typeset/z35.c -o z35.base.o 2>stat.base
$ ~/llvm/llvm-project/build_deb_exp/bin/clang ... -mllvm --stats -c consumer-typeset/z35.c -o z35.exp.o 2>stat.exp
$ size z35.base.o
   text    data     bss     dec     hex filename
   9057       0      56    9113    2399 z43.base.o
$ size z43.exp.o
   text    data     bss     dec     hex filename
   8929       0      56    8985    2319 z43.exp.o
$ grep SLP stat.*                                                                              
stat.base:   35 SLP                          - Number of vector instructions generated

So, really code size decreased, but due to the number of vector instrs _not generated_. I've investigated these instructions and also gathered vectorization costs.
Before bugfixing:

%53 = insertelement <4 x %union.rec*> undef, %union.rec* %.in, i32 0
%54 = shufflevector <4 x %union.rec*> %53, <4 x %union.rec*> undef, <4 x i32> zeroinitializer
%55 = bitcast %union.rec* %.in to <4 x %union.rec*>*
store <4 x %union.rec*> %54, <4 x %union.rec*>* %55, align 8, !tbaa !6

After bugfixing:

%53 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 1, i32 1
store %union.rec* %.in, %union.rec** %53, align 8, !tbaa !6
%54 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 1, i32 0
store %union.rec* %.in, %union.rec** %54, align 8, !tbaa !6
%55 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 0, i32 1
store %union.rec* %.in, %union.rec** %55, align 8, !tbaa !6
%56 = getelementptr inbounds %union.rec, %union.rec* %.in, i64 0, i32 0, i32 0, i64 0, i32 0
store %union.rec* %.in, %union.rec** %56, align 8, !tbaa !6

Asm code before:

movq    %rax, %xmm0
pshufd  $68, %xmm0, %xmm0               # xmm0 = xmm0[0,1,0,1]
movdqu  %xmm0, 16(%rax)
movdqu  %xmm0, (%rax)

Asm code after:

movq    %rax, 24(%rax)
movq    %rax, 16(%rax)
movq    %rax, 8(%rax)
movq    %rax, (%rax)

So actually <4 x %union.rec*> doesn't fit to maximum vector register (4x64=256), but store instr could be lowered to two movdqu. The total cost of tree is -1, allowing vectorization. But four stores hasn't been vectorized to two <2 x %union.reg*> after bugfixing, since cost of both parts is zero. We have a border case when 0 + 0 = -1 here.

Not sure what to do with it (allow 2 * MaxVecRegSize for llvm vectors of tree?), but current fix is still necessary -- previously bug has exploited this "border case" accidently, for the random cases of combined stores chains.