
[SLP] Add insertelement instructions to vectorizable tree
ClosedPublic

Authored by anton-afanasyev on Mar 16 2021, 8:21 AM.

Details

Summary

Add a new type of tree node for a chain of InsertElementInst instructions forming a vector.
These instructions can either be removed or replaced by shuffles during
vectorization, and we can add this node to the cost model, so their cost is
estimated naturally, getting rid of the CompensateCost tricks and reducing
further work for InstCombine. This fixes PR40522 and PR35732 in a natural way.
This patch is also the first step towards revectorization of partially
vectorized code (to fix PR42022 completely). After adding inserts to the tree,
the next step is to add vector instructions there (for instance, to merge
store <2 x float> and store <2 x float> into store <4 x float>).
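As a hedged illustration (types and value names hypothetical, not taken from the patch), the kind of insertelement chain this node models could look like:

```llvm
; Build-vector chain produced by scalar code building an aggregate:
%i0 = insertelement <4 x float> undef, float %a, i32 0
%i1 = insertelement <4 x float> %i0, float %b, i32 1
%i2 = insertelement <4 x float> %i1, float %c, i32 2
%i3 = insertelement <4 x float> %i2, float %d, i32 3
; With the inserts modeled as a tree node, the whole chain can be
; removed once %a..%d are vectorized: users of %i3 are rewritten to use
; the vectorized <4 x float> value directly (or a shuffle of it).
```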

Fixes PR40522 and PR35732.

Diff Detail

Event Timeline

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

Inserts here come from findBuildAggregate(), already sorted by index and therefore consecutive; the only exception is that they can have gaps, so we check for that here. I don't think we should handle this rare case within this patch.

3810–3811

Why do we need to normalize insert indices? DemandedElts is passed to getScalarizationOverhead(), which just sums up the cost of all eliminated insertelements (with non-normalized operand indices).

4380–4384

I've removed all of this, since we now use the vectorized value without extraction.

5369

Oh, sure, fixed this.

5375–5377

I've just gotten rid of this code, using the fact that Scalars are already sorted by index.

5387–5388

Hmm, here we are preparing a shuffled V with undef elements for the case where V is used before several elements are inserted (before the vectorizer runs). But undefined positions accept anything, so we can actually use the completely filled V here. Thanks, I've changed it this way and removed all the redundant code!

The only rare case would be a later instruction inserting an element at an index that was already inserted, so we have to exclude that case. But inserts coming from findBuildAggregate() are guaranteed not to do this.
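A hedged IR sketch of this point (types and values hypothetical): the partially built value differs from the fully built one only in lanes that are undef anyway, so the completely filled vector can be substituted for it.

```llvm
%p0 = insertelement <4 x float> undef, float %a, i32 0
%p1 = insertelement <4 x float> %p0, float %b, i32 1 ; partial: <a, b, undef, undef>
%p2 = insertelement <4 x float> %p1, float %c, i32 2
%V  = insertelement <4 x float> %p2, float %d, i32 3 ; full: <a, b, c, d>
; A user of %p1 relies only on lanes 0 and 1, which %V also provides,
; so %V can replace %p1 without building a shuffle with undef elements.
```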

anton-afanasyev marked 6 inline comments as done.

Address comments, refactor

ABataev added inline comments.May 10 2021, 5:52 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

Can we check for the gaps during the analysis of the build vector sequence instead?

3810–3811

Ah, you're getting it for SrcVecTy, got it.

4362–4364

I don't think this is correct. I think you need to use code from my patch D101555 for better cost estimation here.

5326

ExtractAndExtendIfNeeded

anton-afanasyev marked 4 inline comments as done.May 10 2021, 12:11 PM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

I've decided to check it here since we may get other sources of incoming insert bundles in the future.

4362–4364

Why? We don't generate any extract instructions for external uses of the vector, so there is no need to cost them.

I think you mixed up two different cases.
The patch D101555 you referenced is about cost estimation when an insert is the _user_, but here the insert is the value being _used_. We do not really need to "extract" it, since its user uses the vector value rather than a scalar one.

Also, I don't think we need to use code from D101555 in this patch, since it does the same thing in a different way. The main idea of this patch is to unify the way we process inserts (the only vector tree node for now) with ordinary tree nodes. The inserts are a tree node now, and they are sorted by index, so there is no need to shuffle them. We could use ReorderIndices if needed, but there is no need, since the operands are sorted as well.

5326

Sure, done.

anton-afanasyev marked 3 inline comments as done.

Address comments (minor)

ABataev added inline comments.May 10 2021, 12:28 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

Then there is no need to iterate through the whole list; use an early exit out of the loop with cancelled scheduling and a gather node.

4362–4364

I think you're missing a shuffle cost here. If the external user is a vector and the extracted element is inserted into a different lane, that is at least a shuffle, and we need to add its cost.

anton-afanasyev marked an inline comment as done.May 10 2021, 11:00 PM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4362–4364

I think you're missing shuffle cost here.

The cost of what shuffle? We don't generate any shuffle.

If the external user is a vector ...

Not quite: the external user itself is not a vector, its operand is. We do not need to extract any particular lane, since the _whole_ vector is used and replaced after "vectorization". In this special case (when the tree node consists of inserts) we have vector "scalars" (the inserts have vector type), and their "vectorization" is simply removal (i.e. replacement by the vectorized operands).

ABataev added inline comments.May 11 2021, 4:14 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3813–3816

Just:

return -TTI->getScalarizationOverhead(SrcVecTy, DemandedElts,
                                            /*Insert*/ true, /*Extract*/ false);
4362–4364

Ok, I see. You correctly excluded the cost of the final (sub)vector reuse.
But I suggest improving the cost model for inserts.

i1 = insertelement undef, v0
i2 = insertelement i1, v1
....
ii1 = insertelement undef, v1

If <v0,v1> gets vectorized, you need to count the cost of the extract of v1. Instead, we can count it as a shuffle and build the final shuffle. That can be much more profitable than relying on extractelement costs/instructions. But probably this can be addressed in the next patch.
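A hedged IR sketch of the shuffle-based alternative (types and value names hypothetical; the example above gives no concrete types):

```llvm
; Suppose <v0, v1> was vectorized into %vec. Extracting v1 to feed ii1
; can be avoided by rebuilding ii1 as a shuffle of the vectorized value:
%ii1 = shufflevector <2 x double> %vec, <2 x double> undef,
                     <2 x i32> <i32 1, i32 undef>   ; lane 0 gets v1
```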

4871–4879

You can do this check yourself. We have something similar for extracts, where we check whether we need to perform a shuffle/copy of a subvector. But we definitely need it.

anton-afanasyev marked 6 inline comments as done.May 11 2021, 7:19 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3813–3816

I changed this code to use the Cost variable.

4362–4364

Ok, I see your point too. Though I don't see how this case is addressed within D101555: two inserts i1 and ii1 cannot be caught into the same tree, and therefore they don't occur in one InsertUses.

Anyway, I suggest addressing this in a separate patch. It's a rather rare case in real code.

4871–4879

You can do this check yourself. We have something similar for extracts, where we check whether we need to perform a shuffle/copy of a subvector. But we definitely need it.

Ok, I've used something similar for inserts as for extracts. Do you think it would be better to do this check within getShuffleCost() in X86TargetTransformInfo.cpp and AArch64TargetTransformInfo.cpp?

anton-afanasyev marked 3 inline comments as done.

Address comments

ABataev added inline comments.May 11 2021, 7:41 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

Not done; you can exit the loop early if a non-consecutive insert is found:

int Offset = *getInsertIndex(VL[0], 0);
ValueList Operands(VL.size());
Operands[0] = cast<Instruction>(VL[0])->getOperand(1);
for (unsigned I = 1, E = VL.size(); I < E; ++I) {
  if (I != *getInsertIndex(VL[I], 0) - Offset) {
    LLVM_DEBUG(dbgs() << "SLP: skipping non-consecutive inserts.\n");
    BS.cancelScheduling(VL, VL0);
    buildTree_rec(Operands, Depth, UserTreeIdx);
    return;
  }
  Operands[I] = cast<Instruction>(VL[I])->getOperand(1);
}
TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx);
LLVM_DEBUG(dbgs() << "SLP: added inserts bundle.\n");
TE->setOperand(0, Operands);

ValueList VectorOperands;
for (Value *V : VL)
  VectorOperands.push_back(cast<Instruction>(V)->getOperand(0));

TE->setOperand(1, VectorOperands);

buildTree_rec(Operands, Depth + 1, {TE, 0});
return;
4871–4879

It would be good. Probably in a separate patch after this one

ABataev added inline comments.May 11 2021, 7:43 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4362–4364

Actually, D101555 addresses exactly the described problem. In some cases it really improves performance, especially for SSE/AVX/AVX2 targets. I will update it once this patch lands.

anton-afanasyev marked 4 inline comments as done.May 11 2021, 9:43 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

This code is not what is intended, since it exits tree building early with incompletely filled Operands.

4871–4879

Ok, I'll do it in a separate patch.

ABataev added inline comments.May 11 2021, 9:46 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

I meant newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);, just like we're doing for other nodes.

anton-afanasyev marked 3 inline comments as done.May 11 2021, 10:00 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

Hmm, the buildTree_rec() call here (with completely filled Operands) is intended: if we skip vectorization of non-consecutive inserts, we still try to vectorize starting from Operands (as was done before this patch). I think it's a rare case when such operands can be a successful seed for a vectorizable tree, but why not try?

ABataev added inline comments.May 11 2021, 10:12 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

Hmm, does it mean you're going to support something like this:

i1 = insertelement undef, v0, 0
i2 = insertelement i1, v1, 2

|
V

v = <v0, v1>
i2 = shuffle v, undef, <0, undef, 1, undef>

? How do we handle the shuffle in this case? As external uses and inserts of extractelements? And do we not subtract the costs of the insertelements in this case?

anton-afanasyev marked 2 inline comments as done.May 11 2021, 11:10 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

Yes, for that case we end up with the previous combination of inserts/extracts, and we don't subtract the cost of the inserts/extracts to be eliminated by InstCombine later, but that's better than nothing if the total cost is still good.
I've checked that neither my version nor yours affects any test case; it all looks like a rather speculative and rare case anyway.

ABataev added inline comments.May 11 2021, 11:25 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

At least add a FIXME to properly support this kind of vectorization in the future

anton-afanasyev marked 2 inline comments as done.May 11 2021, 11:33 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

Ok, added

anton-afanasyev marked an inline comment as done.

Add FIXME for non-consecutive insert case

Please can you rebase against trunk? At least some of these test diffs should go away as I've regenerated them.

ABataev accepted this revision.May 12 2021, 3:48 AM

Looks good with a nit

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3819

I think we need to check that NumElts != NumScalars.

RKSimon accepted this revision.May 12 2021, 4:16 AM

LGTM (to unblock this)

This revision is now accepted and ready to land.May 12 2021, 4:16 AM
anton-afanasyev marked an inline comment as done.May 12 2021, 4:33 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3819

I've removed this check since Offset % NumScalars != 0 implies NumElts != NumScalars. Do you think we need the check for readability?

ABataev added inline comments.May 12 2021, 4:45 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3819

I would add this check just in case

anton-afanasyev marked 2 inline comments as done.May 12 2021, 4:55 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3819

Ok, added.

anton-afanasyev marked an inline comment as done.

Address a nit

This patch introduces an assertion error we believe may be contributing to a miscompile (along with some other recent SLP patches -- this patch fixes the reduced case in http://llvm.org/PR50323, but doesn't fix the full case it was reduced from):

$ opt reduced.ll -disable-output -O1 -slp-vectorizer  # See below for reduced.ll
opt: /home/rupprecht/src/llvm-project/llvm/lib/IR/Type.cpp:648: static llvm::FixedVectorType *llvm::FixedVectorType::get(llvm::Type *, unsigned int): Assertion `isValidElementType(ElementType) && "Element type of a VectorType must " "be an integer, floating point, or " "pointer type."' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: /home/rupprecht/dev/opt reduced.ll -disable-output -O1 -slp-vectorizer
...
#10 0x000000000697c8f8 llvm::FixedVectorType::get(llvm::Type*, unsigned int) /home/rupprecht/src/llvm-project/llvm/lib/IR/Type.cpp:650:36
#11 0x0000000007752a5e llvm::slpvectorizer::BoUpSLP::getSpillCost() const /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4321:21
#12 0x0000000007753060 llvm::slpvectorizer::BoUpSLP::getTreeCost() /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4384:31
#13 0x000000000775f888 llvm::SLPVectorizerPass::tryToVectorizeList(llvm::ArrayRef<llvm::Value*>, llvm::slpvectorizer::BoUpSLP&, bool) /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:6740:32
#14 0x0000000007760dab llvm::SLPVectorizerPass::vectorizeInsertElementInst(llvm::InsertElementInst*, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&) /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:7844:3
#15 0x0000000007760f84 llvm::SLPVectorizerPass::vectorizeSimpleInstructions(llvm::SmallVectorImpl<llvm::Instruction*>&, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&, bool) /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:7858:21
#16 0x000000000775d5d8 llvm::SLPVectorizerPass::vectorizeChainsInBlock(llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&) /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:8019:21
#17 0x000000000775c6d3 llvm::SLPVectorizerPass::runImpl(llvm::Function&, llvm::ScalarEvolution*, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo*, llvm::AAResults*, llvm::LoopInfo*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::DemandedBits*, llvm::OptimizationRemarkEmitter*) /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:6395:16
#18 0x000000000775c26f llvm::SLPVectorizerPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:6332:8
...
$ cat reduced.ll
; ModuleID = 'reduced.ll'
source_filename = "repro.cc"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

%struct.widget = type { %struct.baz }
%struct.baz = type { double, double }
%struct.snork = type { <2 x double> }
%struct.spam = type { %struct.snork }

$_ZN1dC2Edd = comdat any

$_ZN1k1lE1d = comdat any

$_ZN1d1hES_ = comdat any

$_ZN1d1fEv = comdat any

$_ZN1d1eEv = comdat any

@global = external global %struct.widget, align 8

define <2 x double> @zot(%struct.widget* %arg, %struct.baz* %arg1) align 2 {
bb:
  %tmp = alloca %struct.snork, align 16
  %tmp2 = alloca %struct.widget*, align 8
  %tmp3 = alloca %struct.baz*, align 8
  store %struct.widget* %arg, %struct.widget** %tmp2, align 8, !tbaa !0
  store %struct.baz* %arg1, %struct.baz** %tmp3, align 8, !tbaa !0
  %tmp4 = load %struct.widget*, %struct.widget** %tmp2, align 8
  %tmp5 = load %struct.baz*, %struct.baz** %tmp3, align 8, !tbaa !0
  %tmp6 = getelementptr inbounds %struct.baz, %struct.baz* %tmp5, i32 0, i32 1
  %tmp7 = load double, double* %tmp6, align 8, !tbaa !4
  %tmp8 = getelementptr inbounds %struct.widget, %struct.widget* %tmp4, i32 0, i32 0
  %tmp9 = getelementptr inbounds %struct.baz, %struct.baz* %tmp8, i32 0, i32 1
  %tmp10 = load double, double* %tmp9, align 8, !tbaa !7
  %tmp11 = fsub double %tmp7, %tmp10
  %tmp12 = load %struct.baz*, %struct.baz** %tmp3, align 8, !tbaa !0
  %tmp13 = getelementptr inbounds %struct.baz, %struct.baz* %tmp12, i32 0, i32 0
  %tmp14 = load double, double* %tmp13, align 8, !tbaa !9
  %tmp15 = getelementptr inbounds %struct.widget, %struct.widget* %tmp4, i32 0, i32 0
  %tmp16 = getelementptr inbounds %struct.baz, %struct.baz* %tmp15, i32 0, i32 0
  %tmp17 = load double, double* %tmp16, align 8, !tbaa !10
  %tmp18 = fsub double %tmp14, %tmp17
  call void @wombat(%struct.snork* %tmp, double %tmp11, double %tmp18)
  %tmp19 = getelementptr inbounds %struct.snork, %struct.snork* %tmp, i32 0, i32 0
  %tmp20 = load <2 x double>, <2 x double>* %tmp19, align 16
  ret <2 x double> %tmp20
}

define linkonce_odr void @wombat(%struct.snork* %arg, double %arg1, double %arg2) unnamed_addr comdat($_ZN1dC2Edd) align 2 {
bb:
  %tmp = alloca %struct.snork*, align 8
  %tmp3 = alloca double, align 8
  %tmp4 = alloca double, align 8
  store %struct.snork* %arg, %struct.snork** %tmp, align 8, !tbaa !0
  store double %arg1, double* %tmp3, align 8, !tbaa !11
  store double %arg2, double* %tmp4, align 8, !tbaa !11
  %tmp5 = load %struct.snork*, %struct.snork** %tmp, align 8
  %tmp6 = getelementptr inbounds %struct.snork, %struct.snork* %tmp5, i32 0, i32 0
  %tmp7 = load double, double* %tmp3, align 8, !tbaa !11
  %tmp8 = insertelement <2 x double> undef, double %tmp7, i32 0
  %tmp9 = load double, double* %tmp4, align 8, !tbaa !11
  %tmp10 = insertelement <2 x double> %tmp8, double %tmp9, i32 1
  store <2 x double> %tmp10, <2 x double>* %tmp6, align 16, !tbaa !12
  ret void
}

define double @wombat.1() {
bb:
  %tmp = alloca %struct.widget, align 8
  %tmp1 = alloca %struct.spam, align 16
  %tmp2 = alloca %struct.snork, align 16
  %tmp3 = alloca %struct.baz, align 8
  %tmp4 = bitcast %struct.widget* %tmp to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %tmp4, i8* bitcast (%struct.widget* @global to i8*), i64 16, i1 false), !tbaa.struct !13
  %tmp5 = bitcast %struct.spam* %tmp1 to i8*
  call void @llvm.memset.p0i8.i64(i8* %tmp5, i8 0, i64 16, i1 false)
  call void @quux()
  %tmp6 = getelementptr inbounds %struct.baz, %struct.baz* %tmp3, i32 0, i32 0
  store double 0.000000e+00, double* %tmp6, align 8, !tbaa !9
  %tmp7 = getelementptr inbounds %struct.baz, %struct.baz* %tmp3, i32 0, i32 1
  store double 0.000000e+00, double* %tmp7, align 8, !tbaa !4
  %tmp8 = call <2 x double> @zot(%struct.widget* %tmp, %struct.baz* %tmp3)
  %tmp9 = getelementptr inbounds %struct.snork, %struct.snork* %tmp2, i32 0, i32 0
  store <2 x double> %tmp8, <2 x double>* %tmp9, align 16
  %tmp10 = getelementptr inbounds %struct.snork, %struct.snork* %tmp2, i32 0, i32 0
  %tmp11 = load <2 x double>, <2 x double>* %tmp10, align 16
  %tmp12 = call double @wobble(%struct.spam* %tmp1, <2 x double> %tmp11)
  ret double %tmp12
}

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.start.p0i8(i64 immarg, i8* nocapture) #0

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #0

; Function Attrs: argmemonly nofree nosync nounwind willreturn writeonly
declare void @llvm.memset.p0i8.i64(i8* nocapture writeonly, i8, i64, i1 immarg) #1

declare void @quux() unnamed_addr align 2

define linkonce_odr double @wobble(%struct.spam* %arg, <2 x double> %arg1) comdat($_ZN1k1lE1d) align 2 {
bb:
  %tmp = alloca %struct.snork, align 16
  %tmp2 = alloca %struct.spam*, align 8
  %tmp3 = alloca %struct.snork, align 16
  %tmp4 = alloca %struct.snork, align 16
  %tmp5 = getelementptr inbounds %struct.snork, %struct.snork* %tmp, i32 0, i32 0
  store <2 x double> %arg1, <2 x double>* %tmp5, align 16
  store %struct.spam* %arg, %struct.spam** %tmp2, align 8, !tbaa !0
  %tmp6 = load %struct.spam*, %struct.spam** %tmp2, align 8
  %tmp7 = getelementptr inbounds %struct.spam, %struct.spam* %tmp6, i32 0, i32 0
  %tmp8 = bitcast %struct.snork* %tmp3 to i8*
  %tmp9 = bitcast %struct.snork* %tmp7 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %tmp8, i8* %tmp9, i64 16, i1 false), !tbaa.struct !14
  %tmp10 = bitcast %struct.snork* %tmp4 to i8*
  %tmp11 = bitcast %struct.snork* %tmp to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %tmp10, i8* %tmp11, i64 16, i1 false), !tbaa.struct !14
  %tmp12 = getelementptr inbounds %struct.snork, %struct.snork* %tmp3, i32 0, i32 0
  %tmp13 = load <2 x double>, <2 x double>* %tmp12, align 16
  %tmp14 = getelementptr inbounds %struct.snork, %struct.snork* %tmp4, i32 0, i32 0
  %tmp15 = load <2 x double>, <2 x double>* %tmp14, align 16
  %tmp16 = call double @eggs(<2 x double> %tmp13, <2 x double> %tmp15)
  ret double %tmp16
}

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.end.p0i8(i64 immarg, i8* nocapture) #0

define linkonce_odr double @eggs(<2 x double> %arg, <2 x double> %arg1) align 2 {
bb:
  %tmp = alloca %struct.snork, align 16
  %tmp2 = alloca %struct.snork, align 16
  %tmp3 = alloca %struct.snork, align 16
  %tmp4 = getelementptr inbounds %struct.snork, %struct.snork* %tmp, i32 0, i32 0
  store <2 x double> %arg, <2 x double>* %tmp4, align 16
  %tmp5 = getelementptr inbounds %struct.snork, %struct.snork* %tmp2, i32 0, i32 0
  store <2 x double> %arg1, <2 x double>* %tmp5, align 16
  %tmp6 = bitcast %struct.snork* %tmp3 to i8*
  %tmp7 = bitcast %struct.snork* %tmp2 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %tmp6, i8* %tmp7, i64 16, i1 false), !tbaa.struct !14
  %tmp8 = getelementptr inbounds %struct.snork, %struct.snork* %tmp3, i32 0, i32 0
  %tmp9 = load <2 x double>, <2 x double>* %tmp8, align 16
  %tmp10 = call double @wobble.2(%struct.snork* %tmp, <2 x double> %tmp9)
  ret double %tmp10
}

define linkonce_odr double @wobble.2(%struct.snork* %arg, <2 x double> %arg1) comdat($_ZN1d1hES_) align 2 {
bb:
  %tmp = alloca %struct.snork, align 16
  %tmp2 = alloca %struct.snork*, align 8
  %tmp3 = alloca %struct.snork, align 16
  %tmp4 = getelementptr inbounds %struct.snork, %struct.snork* %tmp, i32 0, i32 0
  store <2 x double> %arg1, <2 x double>* %tmp4, align 16
  store %struct.snork* %arg, %struct.snork** %tmp2, align 8, !tbaa !0
  %tmp5 = load %struct.snork*, %struct.snork** %tmp2, align 8
  %tmp6 = call double @quux.3(%struct.snork* %tmp)
  %tmp7 = call double @zot.4(%struct.snork* %tmp)
  call void @wombat(%struct.snork* %tmp3, double %tmp6, double %tmp7)
  %tmp8 = getelementptr inbounds %struct.snork, %struct.snork* %tmp5, i32 0, i32 0
  %tmp9 = load <2 x double>, <2 x double>* %tmp8, align 16, !tbaa !12
  %tmp10 = getelementptr inbounds %struct.snork, %struct.snork* %tmp3, i32 0, i32 0
  %tmp11 = load <2 x double>, <2 x double>* %tmp10, align 16, !tbaa !12
  %tmp12 = fmul <2 x double> %tmp11, %tmp9
  store <2 x double> %tmp12, <2 x double>* %tmp10, align 16, !tbaa !12
  %tmp13 = call double @zot.4(%struct.snork* %tmp3)
  %tmp14 = call double @quux.3(%struct.snork* %tmp3)
  %tmp15 = fsub double %tmp13, %tmp14
  ret double %tmp15
}

define linkonce_odr double @quux.3(%struct.snork* %arg) comdat($_ZN1d1fEv) align 2 {
bb:
  %tmp = alloca %struct.snork*, align 8
  store %struct.snork* %arg, %struct.snork** %tmp, align 8, !tbaa !0
  %tmp1 = load %struct.snork*, %struct.snork** %tmp, align 8
  %tmp2 = getelementptr inbounds %struct.snork, %struct.snork* %tmp1, i32 0, i32 0
  %tmp3 = load <2 x double>, <2 x double>* %tmp2, align 16, !tbaa !12
  %tmp4 = extractelement <2 x double> %tmp3, i32 1
  ret double %tmp4
}

define linkonce_odr double @zot.4(%struct.snork* %arg) comdat($_ZN1d1eEv) align 2 {
bb:
  %tmp = alloca %struct.snork*, align 8
  store %struct.snork* %arg, %struct.snork** %tmp, align 8, !tbaa !0
  %tmp1 = load %struct.snork*, %struct.snork** %tmp, align 8
  %tmp2 = getelementptr inbounds %struct.snork, %struct.snork* %tmp1, i32 0, i32 0
  %tmp3 = load <2 x double>, <2 x double>* %tmp2, align 16, !tbaa !12
  %tmp4 = extractelement <2 x double> %tmp3, i32 0
  ret double %tmp4
}

attributes #0 = { argmemonly nofree nosync nounwind willreturn }
attributes #1 = { argmemonly nofree nosync nounwind willreturn writeonly }

!0 = !{!1, !1, i64 0}
!1 = !{!"any pointer", !2, i64 0}
!2 = !{!"omnipotent char", !3, i64 0}
!3 = !{!"Simple C++ TBAA"}
!4 = !{!5, !6, i64 8}
!5 = !{!"_ZTS1a", !6, i64 0, !6, i64 8}
!6 = !{!"double", !2, i64 0}
!7 = !{!8, !6, i64 8}
!8 = !{!"_ZTS1p", !5, i64 0}
!9 = !{!5, !6, i64 0}
!10 = !{!8, !6, i64 0}
!11 = !{!6, !6, i64 0}
!12 = !{!2, !2, i64 0}
!13 = !{i64 0, i64 8, !11, i64 8, i64 8, !11}
!14 = !{i64 0, i64 16, !12}

(Sorry for the length -- this is as far as llvm-reduce would take it)

This patch introduces an assertion error we believe may be contributing to a miscompile (along with some other recent SLP patches -- this patch fixes the reduced case in http://llvm.org/PR50323, but doesn't fix the full case it was reduced from):

Thanks for the report, fixed here: https://reviews.llvm.org/rG207cdd7ed9fc

Hi, there is another issue that can be reproduced with an existing test case:

$ opt -S -slp-vectorizer -slp-threshold=-10000 test/Transforms/SLPVectorizer/X86/insert-element-build-vector.ll -slp-min-tree-size=0
opt: .../llvm-project/llvm/lib/IR/Type.cpp:648: static llvm::FixedVectorType* llvm::FixedVectorType::get(llvm::Type*, unsigned int): Assertion `isValidElementType(ElementType) && "Element type of a VectorType must " "be an integer, floating point, or " "pointer type."' failed.

Thanks,

Valery
dyung added a subscriber: dyung.May 17 2021, 4:42 PM

Hi, this change has caused a regression in the codegen for one of our internal tests. Consider the following code:

__attribute__((noinline))
__m256d add_pd_002(__m256d a, __m256d b) {
  __m256d r = (__m256d){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, -1, 2, 3);
}

If you compile this with "-g0 -O3 -march=btver2", prior to your commit the compiler would generate the following code for the function:

# %bb.0:                                # %entry
        vinsertf128     $1, %xmm1, %ymm0, %ymm0
        vhaddpd %ymm1, %ymm0, %ymm0

But after your change it is now generating the following code:

# %bb.0:                                # %entry
        vextractf128    $1, %ymm1, %xmm2
        vhaddpd %xmm0, %xmm0, %xmm0
        vhaddpd %xmm2, %xmm1, %xmm1
        vperm2f128      $2, %ymm0, %ymm1, %ymm0 # ymm0 = ymm0[0,1],ymm1[0,1]

From your commit description, it sounds like this is expected and will be fixed in a follow-up commit. Is my understanding of this correct?


Hi, could you check if D101555 fixes this issue?

Hi, there is another issue that can be reproduced with an existing test case: [...]

Thanks for the report, fixed here, needs a quick review: https://reviews.llvm.org/D102675

dyung added a comment.May 18 2021, 2:17 AM

Hi, this change has caused a regression in the codegen for one of our internal tests. [...]

Hi, could you check if D101555 fixes this issue?

Hi, I applied the patch locally and built the compiler, but the generated assembly actually seems to be worse:

# %bb.0:                                # %entry
        vinsertf128     $1, %xmm0, %ymm1, %ymm2
        vinsertf128     $1, %xmm1, %ymm0, %ymm0
        vextractf128    $1, %ymm1, %xmm1
        vhaddpd %ymm2, %ymm0, %ymm0
        vhaddpd %xmm1, %xmm1, %xmm1
        vextractf128    $1, %ymm0, %xmm2
        vunpcklpd       %xmm1, %xmm2, %xmm1     # xmm1 = xmm2[0],xmm1[0]
        vinsertf128     $1, %xmm1, %ymm0, %ymm0
        retq

Hi, I applied the patch locally and built the compiler, but the generated assembly actually seems to be worse: [...]

Ok, thanks, will fix it later today.

Hi, this change has caused a regression in the codegen for one of our internal tests. [...]

I see a different result for btver2.

# %bb.0:
        vextractf128    $1, %ymm1, %xmm2
        vhaddpd %xmm1, %xmm0, %xmm0
        vhaddpd %xmm2, %xmm1, %xmm1
        vinsertf128     $1, %xmm1, %ymm0, %ymm0

But I used llvm-11; most probably there is a difference with llvm-12.

Currently, it is impossible to fix this issue. This problem will be fixed once non-power-of-2 vectorization support in SLP lands, since here we have a build vector of 3 elements (the second index in the shuffle is -1, so the second sum is optimized out, resulting in the build of a <4 x double> vector using just 3 insertelement instructions). It looks like this pattern was previously recognized by another transformation pass, but currently SLP tries to vectorize it.

Currently, it is impossible to fix this issue. [...]

The result I posted above came from a compiler built with this change. It is unfortunate to hear that we will have to take a regression for this, but I will update our internal test to expect it and file a bug so that it is not forgotten.