This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Add insertelement instructions to vectorizable tree
ClosedPublic

Authored by anton-afanasyev on Mar 16 2021, 8:21 AM.

Details

Summary

Add a new type of tree node for an InsertElementInst chain forming a vector.
These instructions can be either removed or replaced by shuffles during
vectorization, and adding this node to the cost model lets us estimate
their cost naturally, getting rid of the CompensateCost tricks and reducing
further work for InstCombine. This fixes PR40522 and PR35732 in a natural way.
This patch is also the first step towards revectorization of partially
vectorized code (to fix PR42022 completely). After adding inserts to the tree,
the next step is to add vector instructions there (for instance, to merge
store <2 x float> and store <2 x float> into store <4 x float>).

Fixes PR40522 and PR35732.
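
For illustration, a minimal sketch (hypothetical IR, not taken from the patch's tests) of the kind of build-vector chain the new node models and how it is expected to be handled:

; Before: scalar fadds feeding an insertelement chain (a build vector).
define <2 x float> @before(float %a0, float %a1, float %b0, float %b1) {
  %s0 = fadd float %a0, %b0
  %s1 = fadd float %a1, %b1
  %i0 = insertelement <2 x float> undef, float %s0, i32 0
  %i1 = insertelement <2 x float> %i0, float %s1, i32 1
  ret <2 x float> %i1
}

; After: with the inserts in the tree, their operands are gathered into vectors,
; the two fadds become one <2 x float> fadd, and the insert chain is replaced by
; that vector value.
define <2 x float> @after(float %a0, float %a1, float %b0, float %b1) {
  %va0 = insertelement <2 x float> undef, float %a0, i32 0
  %va = insertelement <2 x float> %va0, float %a1, i32 1
  %vb0 = insertelement <2 x float> undef, float %b0, i32 0
  %vb = insertelement <2 x float> %vb0, float %b1, i32 1
  %r = fadd <2 x float> %va, %vb
  ret <2 x float> %r
}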

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3597

Thanks, done

3807

Thanks, done

4142

If any of the elements is an InsertElementInst, they all are, so checking one is enough. And we can't move this check closer to the end, since the next check can return true.

4363

Thanks, done

4364

Thanks, done

4386–4387

It's definitely not an Identity and not a Reverse here.

4785

Thanks, done

4846–4888

Do you mean going back to generating extracts here?
I don't believe that's a good approach, for several reasons:

  1. It doesn't match the common logic of vectorization (replacing scalar operations with one vector operation),
  2. If we leave insertelements unvectorized (i.e. replaced by shuffles), these inserts may be processed again somehow,
  3. If we can deal with it in SLPVectorizer, why not do it early? It's not too hard to do.
5399–5400

Thanks, done

6712–6713

Yes, I've planned to do this, but can't: we make a decision about vectorization based on the _operands_:

 InstructionsState S = getSameOpcode(VL);
 if (!S.getOpcode())
   return false;
...
unsigned Sz = R.getVectorElementSize(I0);

but then vectorization starts from the inserts.

anton-afanasyev marked 10 inline comments as done.

Address some comments

anton-afanasyev marked 9 inline comments as done.May 5 2021, 12:45 PM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2887–2892

Ok, moved to GetConstantOperand()

2893–2900

Ok, added GetConstantOperand()

3816

Thanks, done this way.

3818–3824

I've removed this commented-out code after switching to getScalarizationOverhead() as @RKSimon suggested.

4142

Moreover, I had to move it up, into the isTreeTinyAndNotFullyVectorizable() function, since we need to explicitly exclude the cases of vectorizing inserts of gathered values. That makes no sense, and otherwise we can fall into an infinite loop of generating inserts and vectorizing them again (for instance, when -slp-min-tree-size=2 is set).

6712–6713

Ok, eventually found a way to drop InsertUses parameter.

llvm/test/Transforms/SLPVectorizer/AArch64/accelerate-vector-functions.ll
225

I've debugged this case. Without acceleration @llvm.exp.v4f32 is too expensive (call cost is 18 = 58 - 40, vs -30 = 10 - 40 with acceleration), whereas @llvm.exp.v2f32 is cheaper (call cost is 6 = 26 - 20).

308

This case is the same as above.

llvm/test/Transforms/SLPVectorizer/X86/alternate-int-inseltpoison.ll
186

Thanks, fixed

llvm/test/Transforms/SLPVectorizer/X86/sext-inseltpoison.ll
4

Ok, fixed

llvm/test/Transforms/SLPVectorizer/X86/sext.ll
4

Ok, fixed

anton-afanasyev marked 9 inline comments as done.

Addressed all comments

ABataev added inline comments.May 5 2021, 2:12 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2893–2894
for (Value *V : VL)
  MinIndex = std::min<int>(MinIndex, GetConstantOperand(V));
3811–3812

Turn GetConstantOperand into a function and use it

4363

ArrayRef<Value *>

4368

Value *

4369–4372

std::min<int> and use a function

4377–4378

Use a function to extract index

4383

Scalars.contains(Insert) or .count(Insert)

anton-afanasyev marked 7 inline comments as done.May 5 2021, 10:46 PM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2893–2894

Ok, done

3811–3812

Ok, done

4363

Ok, done

4368

Ok, done

4369–4372

Ok, done

4377–4378

Ok, done

4383

Ok, done

anton-afanasyev marked 7 inline comments as done.

Addressed comments

Reused getOperandIndex() for InsertElement instructions

Fix names, minors

ABataev added inline comments.May 6 2021, 5:06 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4869–4870
SmallVector<int, 16> Mask(NumElts, UndefMaskElem);
std::iota(Mask.begin(), std::next(Mask.begin(), NumScalars), 0);

Also, I think just std::iota(Mask.begin(), Mask.end(), 0); shall work too

4871–4879

Can we use ShuffleBuilder here?

4871–4879

Did you include these costs in the cost model?

5385

Scalars.contains(Insert)

anton-afanasyev marked 4 inline comments as done.May 6 2021, 9:14 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4869–4870

Thanks, changed to std::iota(Mask.begin(), std::next(Mask.begin(), NumScalars), 0);

Yes, std::iota(Mask.begin(), Mask.end(), 0); gives the same result, but wouldn't it lead to redundant code being lowered?

4871–4879

They are included in getScalarizationOverhead().

4871–4879

Yes, I thought about ShuffleBuilder, but here we just need to create two shuffles of a special kind for vector resizing. It would require expanding ShuffleInstructionBuilder; I don't think it's worth it.

5385

Thanks, done

anton-afanasyev marked 4 inline comments as done.

Addressed new comments

ABataev added inline comments.May 6 2021, 9:25 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4869–4870

std::iota(Mask.begin(), Mask.end(), 0); will produce the same code as std::iota(Mask.begin(), std::next(Mask.begin(), NumScalars), 0); but only if NumScalars <= Mask.size() * 2. Otherwise the compiler may crash in some cases. So, better to keep std::iota(Mask.begin(), std::next(Mask.begin(), NumScalars), 0);

4871–4879

I rather doubt it. First, you subtract the scalarization overhead cost from the vector cost, but here you need to add the costs of the subvector insert and the permutation of 2 vectors.

anton-afanasyev marked an inline comment as done.May 7 2021, 12:09 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4869–4870

Mask.size() is just NumElts and NumScalars <= NumElts (is it worth adding an assert for this?), so NumScalars <= Mask.size() * 2 in all cases.

What I meant by "the same result" is that the shuffle

%a = shufflevector <2 x float> %b, <2 x float> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>

is equivalent to

%a = shufflevector <2 x float> %b, <2 x float> undef, <4 x i32> <i32 0, i32 1, i32 2, i32 3>

(here NumScalars == 2 and NumElts == 4).

ABataev added inline comments.May 7 2021, 3:02 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4869–4870

We may have a situation where, say, NumScalars is 2 and NumElts is 8, if the same scalars are inserted several times.

anton-afanasyev marked 3 inline comments as done.May 7 2021, 3:20 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4869–4870

Do you mean "NumScalars is 8 and NumElts is 2"?
This case is excluded, we can get only unique inserts (inserts with unique index) from findBuildAggregate() for now. Also, we even assert

assert(ReuseShuffleIndicies.empty() && "All inserts should be unique");

at first line of InsertElement at the buildTree_rec() function.

anton-afanasyev marked an inline comment as done.May 7 2021, 3:31 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4871–4879

Did you include these costs in the cost model?

They are included in getScalarizationOverhead().

I rather doubt it. First, you subtract the scalarization overhead cost from the vector cost, but here you need to add the costs of the subvector insert and the permutation of 2 vectors.

These subvector inserts are actually lowered to a nop, so they cost nothing. We need this code when processing a big vector of scalars part by part, but every chunk fills a whole vector register (the condition bits >= MinVecRegSize), so there is no actual inserting. The result, consisting of several vector registers, is then returned.

ABataev added inline comments.May 7 2021, 3:34 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4869–4870

Well, maybe :) Anyway, let's keep std::iota(Mask.begin(), std::next(Mask.begin(), NumScalars), 0); just to avoid fixing something in the future when we add support for vectorization of more code patterns.

ABataev added inline comments.May 7 2021, 3:36 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4871–4879

This might not be true for some targets/code patterns. Plus, the second pattern is a permutation/combination of 2 vectors; its cost is at least 1.

anton-afanasyev marked 2 inline comments as done.May 7 2021, 3:52 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4869–4870

Sure!

4871–4879

This might not be true for some targets/code patterns. Plus, the second pattern is a permutation/combination of 2 vectors; its cost is at least 1.

Both of these shuffles together are just the single "inserting" operation; we need the first one to expand the source vector, since shufflevector needs operands of the same size.

We can add the cost of these shuffles, but that prevents several vectorizations that were performed before, since this cost is redundant (maybe TTI->getShuffleCost() should be tuned for TargetTransformInfo::SK_InsertSubvector?).
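
For reference, a sketch of the two-shuffle "resize" pair described above (the names and the offset-0 placement are illustrative, not taken from the generated code):

; The first shuffle widens the narrow vectorized chunk so both operands have the
; same length; the second blends it into the destination lanes (offset 0 here).
%widened = shufflevector <2 x float> %sub, <2 x float> undef, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
%result = shufflevector <4 x float> %dest, <4 x float> %widened, <4 x i32> <i32 4, i32 5, i32 2, i32 3>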

ABataev added inline comments.May 7 2021, 4:09 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4871–4879

We need to check that the vectorized code is really profitable and, if so, tune the cost model, yes.

RKSimon added inline comments.May 7 2021, 4:31 AM
llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
96–97

Remove the references to InsertUses in the doxygen, otherwise we'll get -Wdocumentation warnings.

anton-afanasyev marked 3 inline comments as done.

Fixed comment

llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
96–97

Thanks, done

ABataev added inline comments.May 7 2021, 6:15 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2887–2897

I think this can be simplified:

int Offset = *getInsertIndex(VL[0], 0);
ValueList Operands(VL.size());
Operands[0] = cast<Instruction>(VL[0])->getOperand(1);
bool IsConsecutive = true;
for (unsigned I = 1, E = VL.size(); I < E; ++I) {
  IsConsecutive &= (I == *getInsertIndex(VL[I], 0) - Offset);
  Operands[I] = cast<Instruction>(VL[I])->getOperand(1);
}
2899

If not consecutive, we can include the cost of a single permute and still vectorize it, no?

3810–3811

Hm, I think you need to normalize the insert indices here; the MinIndex (offset) might be non-zero but should be 0. Plus, I think we need to add (subtract from the vector cost) the cost of a subvector insertion at index MinIndex (if MinIndex != 0).

4380–4384

I think this should be:

ExtractCost += TTI->getShuffleCost(TTI::SK_InsertSubvector, VL0->getType(), MinIndex, VecTy);
5369

What if the user is a phi in the middle of other phis?

5375–5377

You're using this code in many places, make it a function.

5387–5388

Why can't we just use V here?

anton-afanasyev marked 6 inline comments as done.May 10 2021, 5:21 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2887–2897

Yes, sure, after refactoring we expect consecutive operands here and this can be simplified. Done this way.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

The inserts here come from findBuildAggregate() and are already sorted by index and therefore consecutive; the only exception is that they can have gaps, so we check for that here. I don't think we should handle this rare case within this patch.

3810–3811

Why do we need to normalize the insert indices? DemandedElts is passed to getScalarizationOverhead(), which just sums up the cost of all eliminated insertelements (with non-normalized operand indices).

4380–4384

I've removed all of this, since the vectorized value is now used without extraction.

5369

Oh, sure, fixed this.

5375–5377

I've just gotten rid of this code, using the fact that Scalars are already sorted by index.

5387–5388

Hmm, here we were preparing a shuffled V with undef elements for the case where V is used without several of the elements inserted (before the vectorizer). But undefined positions accept anything, so we can actually use the completely filled V here. Thanks, I've changed it this way and removed all the redundant code!

The only rare case would be if some later instruction inserts an element that was already inserted (with the same index), so we have to exclude this case. But uniqueness is guaranteed for inserts coming from findBuildAggregate().

anton-afanasyev marked 6 inline comments as done.

Address comments, refactor

ABataev added inline comments.May 10 2021, 5:52 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

Can we check for the gaps during the analysis of the build vector sequence instead?

3810–3811

Ah, you're getting it for SrcVecTy, got it.

4362–4364

I don't think this is correct. I think you need to use code from my patch D101555 for better cost estimation here.

5326

ExtractAndExtendIfNeeded

anton-afanasyev marked 4 inline comments as done.May 10 2021, 12:11 PM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

I've decided to check it here, since we may get other sources of the incoming insert bundle in the future.

4362–4364

Why? We don't generate any extract instructions for external uses of the vector, so there is no need to cost them.

I think you mixed two different cases up.
The patch D101555 you referenced is about cost estimation when the insert is the _user_, but here the insert is the one being _used_. We don't really need to "extract" it, since its user uses the vector value rather than a scalar one.

Also, I don't think we need to use the code from D101555 in this patch, since it does the same thing in a different way. The main idea of this patch is to unify the way we process inserts (the only vector tree node for now) with ordinary tree nodes. The inserts are a tree node now, and they are sorted by index, so there is no need to shuffle them. We could use ReorderIndices if needed, but there is no need, since the operands are sorted as well.

5326

Sure, done.

anton-afanasyev marked 3 inline comments as done.

Address comments (minor)

ABataev added inline comments.May 10 2021, 12:28 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

Then there is no need to iterate through the whole list; use an early exit out of the loop with cancelled scheduling and a gather node.

4362–4364

I think you're missing a shuffle cost here. If the external user is a vector and the extracted element is inserted into a different lane, it is at least a shuffle and we need to add its cost.

anton-afanasyev marked an inline comment as done.May 10 2021, 11:00 PM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4362–4364

I think you're missing a shuffle cost here.

The cost of what shuffle? We don't generate any shuffle.

If the external user is a vector ...

Not "the external user is a vector", but its operand is a vector. We do not need to extract any special lane, since _whole_ vector is using and replacing after "vectorization". In this special case (when tree node is inserts) we have vector "scalars" (inserts have vector type) and their "vectorization" is just removing (i.e. replacing by vectorized operands).

ABataev added inline comments.May 11 2021, 4:14 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3813–3816

Just:

return -TTI->getScalarizationOverhead(SrcVecTy, DemandedElts,
                                            /*Insert*/ true, /*Extract*/ false);
4362–4364

Ok, I see. You correctly excluded the cost of the final (sub)vector reuse.
But I suggest improving the cost model for inserts.

i1 = insertelement undef, v0
i2 = insertelement i1, v1
....
ii1 = insertelement undef, v1

If <v0,v1> gets vectorized, you need to count the cost of the extract of v1. Instead, we can count it as a shuffle and build the final shuffle. That can be much more profitable than relying on extractelement costs/instructions. But probably this can be addressed in the next patch.
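
A sketch of that suggestion (the types, lane indices, and value names are illustrative; the example above omits them):

; Relying on an extract of v1 from the vectorized pair %v01 = <v0, v1>:
%e = extractelement <2 x float> %v01, i32 1
%ii1 = insertelement <4 x float> undef, float %e, i32 0
; Counted and built as a shuffle of the vectorized value instead:
%ii1.shuf = shufflevector <2 x float> %v01, <2 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>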

4871–4879

You can do this check yourself. We have something similar for extracts, where we check whether we need to perform a shuffle/copy of a subvector. But we definitely need it.

anton-afanasyev marked 6 inline comments as done.May 11 2021, 7:19 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3813–3816

I changed this code to use a Cost variable.

4362–4364

Ok, I see. Though I don't see how this case is addressed within D101555: the two inserts i1 and ii1 cannot be caught into the same tree, so they don't occur in one InsertUses.

Anyway, I suggest addressing this in a separate patch. It's a rather rare case in practice.

4871–4879

You can do this check yourself. We have something similar for extracts, where we check whether we need to perform a shuffle/copy of a subvector. But we definitely need it.

Ok, I've used something similar for inserts as for extracts. Do you think it's better to make this check within getShuffleCost() in X86TargetTransformInfo.cpp and AArch64TargetTransformInfo.cpp?

anton-afanasyev marked 3 inline comments as done.

Address comments

ABataev added inline comments.May 11 2021, 7:41 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

Not done, you can exit early out of the loop if a non-consecutive insert is found:

int Offset = *getInsertIndex(VL[0], 0);
ValueList Operands(VL.size());
Operands[0] = cast<Instruction>(VL[0])->getOperand(1);
for (unsigned I = 1, E = VL.size(); I < E; ++I) {
  if (I != *getInsertIndex(VL[I], 0) - Offset) {
    LLVM_DEBUG(dbgs() << "SLP: skipping non-consecutive inserts.\n");
    BS.cancelScheduling(VL, VL0);
    buildTree_rec(Operands, Depth, UserTreeIdx);
    return;
  }
  Operands[I] = cast<Instruction>(VL[I])->getOperand(1);
}
TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx);
LLVM_DEBUG(dbgs() << "SLP: added inserts bundle.\n");
TE->setOperand(0, Operands);

ValueList VectorOperands;
for (Value *V : VL)
  VectorOperands.push_back(cast<Instruction>(V)->getOperand(0));

TE->setOperand(1, VectorOperands);

buildTree_rec(Operands, Depth + 1, {TE, 0});
return;
4871–4879

It would be good. Probably in a separate patch after this one

ABataev added inline comments.May 11 2021, 7:43 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4362–4364

Actually, D101555 addresses exactly the described problem. In some cases it really improves the performance, especially for SSE/AVX/AVX2 targets. I will update it once this patch has landed.

anton-afanasyev marked 4 inline comments as done.May 11 2021, 9:43 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

This code is not what was intended, since we exit tree building early with incompletely filled Operands.

4871–4879

Ok, I'll do it in a separate patch.

ABataev added inline comments.May 11 2021, 9:46 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

I meant newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);, just like we're doing for other nodes.

anton-afanasyev marked 3 inline comments as done.May 11 2021, 10:00 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

Hmm, buildTree_rec() here (with completely filled Operands) is intended: if we skip vectorization of non-consecutive inserts, we still try to vectorize starting from Operands (as it was before this patch). I think it's a rare case where such operands could be a successful seed for a vectorizable tree, but why not try?

ABataev added inline comments.May 11 2021, 10:12 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

Hmm, does it mean you're going to support something like this:

i1 = insertelement undef, v0, 0
i2 = insertelement i1, v1, 2

|
V

v = <v0, v1>
i2 = shuffle v, undef, <0, undef, 1, undef>

? How do we handle shuffle in this case? As external uses and inserts of extractelements? And we do not subtract the costs of insertelements in this case?

anton-afanasyev marked 2 inline comments as done.May 11 2021, 11:10 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

Yes, for that case we end up with the previous combination of inserts/extracts, and we don't subtract the cost of the inserts/extracts that will be eliminated by instcombine later, but that's better than nothing if the total cost is still good.
I've checked that neither my version nor yours affects any test case; it all looks like a rather speculative and rare case, though.

ABataev added inline comments.May 11 2021, 11:25 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

At least add a FIXME to properly support this kind of vectorization in the future

anton-afanasyev marked 2 inline comments as done.May 11 2021, 11:33 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2899

Ok, added

anton-afanasyev marked an inline comment as done.

Add FIXME for non-consecutive insert case

Can you please rebase against trunk? At least some of these test diffs should go away, as I've regenerated them.

ABataev accepted this revision.May 12 2021, 3:48 AM

Looks good with a nit

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3819

I think we need to check that NumElts != NumScalars

RKSimon accepted this revision.May 12 2021, 4:16 AM

LGTM (to unblock this)

This revision is now accepted and ready to land.May 12 2021, 4:16 AM
anton-afanasyev marked an inline comment as done.May 12 2021, 4:33 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3819

I've removed this check since Offset % NumScalars != 0 implies NumElts != NumScalars. Do you think we need the check for readability?

ABataev added inline comments.May 12 2021, 4:45 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3819

I would add this check just in case

anton-afanasyev marked 2 inline comments as done.May 12 2021, 4:55 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3819

Ok, added.

anton-afanasyev marked an inline comment as done.

Address a nit

This patch introduces an assertion error we believe may be contributing to a miscompile (along with some other recent SLP patches -- this patch fixes the reduced case in http://llvm.org/PR50323, but doesn't fix the full case it was reduced from):

$ opt reduced.ll -disable-output -O1 -slp-vectorizer  # See below for reduced.ll
opt: /home/rupprecht/src/llvm-project/llvm/lib/IR/Type.cpp:648: static llvm::FixedVectorType *llvm::FixedVectorType::get(llvm::Type *, unsigned int): Assertion `isValidElementType(ElementType) && "Element type of a VectorType must " "be an integer, floating point, or " "pointer type."' failed.
PLEASE submit a bug report to https://bugs.llvm.org/ and include the crash backtrace.
Stack dump:
0.      Program arguments: /home/rupprecht/dev/opt reduced.ll -disable-output -O1 -slp-vectorizer
...
#10 0x000000000697c8f8 llvm::FixedVectorType::get(llvm::Type*, unsigned int) /home/rupprecht/src/llvm-project/llvm/lib/IR/Type.cpp:650:36
#11 0x0000000007752a5e llvm::slpvectorizer::BoUpSLP::getSpillCost() const /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4321:21
#12 0x0000000007753060 llvm::slpvectorizer::BoUpSLP::getTreeCost() /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:4384:31
#13 0x000000000775f888 llvm::SLPVectorizerPass::tryToVectorizeList(llvm::ArrayRef<llvm::Value*>, llvm::slpvectorizer::BoUpSLP&, bool) /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:6740:32
#14 0x0000000007760dab llvm::SLPVectorizerPass::vectorizeInsertElementInst(llvm::InsertElementInst*, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&) /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:7844:3
#15 0x0000000007760f84 llvm::SLPVectorizerPass::vectorizeSimpleInstructions(llvm::SmallVectorImpl<llvm::Instruction*>&, llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&, bool) /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:7858:21
#16 0x000000000775d5d8 llvm::SLPVectorizerPass::vectorizeChainsInBlock(llvm::BasicBlock*, llvm::slpvectorizer::BoUpSLP&) /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:8019:21
#17 0x000000000775c6d3 llvm::SLPVectorizerPass::runImpl(llvm::Function&, llvm::ScalarEvolution*, llvm::TargetTransformInfo*, llvm::TargetLibraryInfo*, llvm::AAResults*, llvm::LoopInfo*, llvm::DominatorTree*, llvm::AssumptionCache*, llvm::DemandedBits*, llvm::OptimizationRemarkEmitter*) /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:6395:16
#18 0x000000000775c26f llvm::SLPVectorizerPass::run(llvm::Function&, llvm::AnalysisManager<llvm::Function>&) /home/rupprecht/src/llvm-project/llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp:6332:8
...
$ cat reduced.ll
; ModuleID = 'reduced.ll'
source_filename = "repro.cc"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

%struct.widget = type { %struct.baz }
%struct.baz = type { double, double }
%struct.snork = type { <2 x double> }
%struct.spam = type { %struct.snork }

$_ZN1dC2Edd = comdat any

$_ZN1k1lE1d = comdat any

$_ZN1d1hES_ = comdat any

$_ZN1d1fEv = comdat any

$_ZN1d1eEv = comdat any

@global = external global %struct.widget, align 8

define <2 x double> @zot(%struct.widget* %arg, %struct.baz* %arg1) align 2 {
bb:
  %tmp = alloca %struct.snork, align 16
  %tmp2 = alloca %struct.widget*, align 8
  %tmp3 = alloca %struct.baz*, align 8
  store %struct.widget* %arg, %struct.widget** %tmp2, align 8, !tbaa !0
  store %struct.baz* %arg1, %struct.baz** %tmp3, align 8, !tbaa !0
  %tmp4 = load %struct.widget*, %struct.widget** %tmp2, align 8
  %tmp5 = load %struct.baz*, %struct.baz** %tmp3, align 8, !tbaa !0
  %tmp6 = getelementptr inbounds %struct.baz, %struct.baz* %tmp5, i32 0, i32 1
  %tmp7 = load double, double* %tmp6, align 8, !tbaa !4
  %tmp8 = getelementptr inbounds %struct.widget, %struct.widget* %tmp4, i32 0, i32 0
  %tmp9 = getelementptr inbounds %struct.baz, %struct.baz* %tmp8, i32 0, i32 1
  %tmp10 = load double, double* %tmp9, align 8, !tbaa !7
  %tmp11 = fsub double %tmp7, %tmp10
  %tmp12 = load %struct.baz*, %struct.baz** %tmp3, align 8, !tbaa !0
  %tmp13 = getelementptr inbounds %struct.baz, %struct.baz* %tmp12, i32 0, i32 0
  %tmp14 = load double, double* %tmp13, align 8, !tbaa !9
  %tmp15 = getelementptr inbounds %struct.widget, %struct.widget* %tmp4, i32 0, i32 0
  %tmp16 = getelementptr inbounds %struct.baz, %struct.baz* %tmp15, i32 0, i32 0
  %tmp17 = load double, double* %tmp16, align 8, !tbaa !10
  %tmp18 = fsub double %tmp14, %tmp17
  call void @wombat(%struct.snork* %tmp, double %tmp11, double %tmp18)
  %tmp19 = getelementptr inbounds %struct.snork, %struct.snork* %tmp, i32 0, i32 0
  %tmp20 = load <2 x double>, <2 x double>* %tmp19, align 16
  ret <2 x double> %tmp20
}

define linkonce_odr void @wombat(%struct.snork* %arg, double %arg1, double %arg2) unnamed_addr comdat($_ZN1dC2Edd) align 2 {
bb:
  %tmp = alloca %struct.snork*, align 8
  %tmp3 = alloca double, align 8
  %tmp4 = alloca double, align 8
  store %struct.snork* %arg, %struct.snork** %tmp, align 8, !tbaa !0
  store double %arg1, double* %tmp3, align 8, !tbaa !11
  store double %arg2, double* %tmp4, align 8, !tbaa !11
  %tmp5 = load %struct.snork*, %struct.snork** %tmp, align 8
  %tmp6 = getelementptr inbounds %struct.snork, %struct.snork* %tmp5, i32 0, i32 0
  %tmp7 = load double, double* %tmp3, align 8, !tbaa !11
  %tmp8 = insertelement <2 x double> undef, double %tmp7, i32 0
  %tmp9 = load double, double* %tmp4, align 8, !tbaa !11
  %tmp10 = insertelement <2 x double> %tmp8, double %tmp9, i32 1
  store <2 x double> %tmp10, <2 x double>* %tmp6, align 16, !tbaa !12
  ret void
}

define double @wombat.1() {
bb:
  %tmp = alloca %struct.widget, align 8
  %tmp1 = alloca %struct.spam, align 16
  %tmp2 = alloca %struct.snork, align 16
  %tmp3 = alloca %struct.baz, align 8
  %tmp4 = bitcast %struct.widget* %tmp to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %tmp4, i8* bitcast (%struct.widget* @global to i8*), i64 16, i1 false), !tbaa.struct !13
  %tmp5 = bitcast %struct.spam* %tmp1 to i8*
  call void @llvm.memset.p0i8.i64(i8* %tmp5, i8 0, i64 16, i1 false)
  call void @quux()
  %tmp6 = getelementptr inbounds %struct.baz, %struct.baz* %tmp3, i32 0, i32 0
  store double 0.000000e+00, double* %tmp6, align 8, !tbaa !9
  %tmp7 = getelementptr inbounds %struct.baz, %struct.baz* %tmp3, i32 0, i32 1
  store double 0.000000e+00, double* %tmp7, align 8, !tbaa !4
  %tmp8 = call <2 x double> @zot(%struct.widget* %tmp, %struct.baz* %tmp3)
  %tmp9 = getelementptr inbounds %struct.snork, %struct.snork* %tmp2, i32 0, i32 0
  store <2 x double> %tmp8, <2 x double>* %tmp9, align 16
  %tmp10 = getelementptr inbounds %struct.snork, %struct.snork* %tmp2, i32 0, i32 0
  %tmp11 = load <2 x double>, <2 x double>* %tmp10, align 16
  %tmp12 = call double @wobble(%struct.spam* %tmp1, <2 x double> %tmp11)
  ret double %tmp12
}

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.start.p0i8(i64 immarg, i8* nocapture) #0

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #0

; Function Attrs: argmemonly nofree nosync nounwind willreturn writeonly
declare void @llvm.memset.p0i8.i64(i8* nocapture writeonly, i8, i64, i1 immarg) #1

declare void @quux() unnamed_addr align 2

define linkonce_odr double @wobble(%struct.spam* %arg, <2 x double> %arg1) comdat($_ZN1k1lE1d) align 2 {
bb:
  %tmp = alloca %struct.snork, align 16
  %tmp2 = alloca %struct.spam*, align 8
  %tmp3 = alloca %struct.snork, align 16
  %tmp4 = alloca %struct.snork, align 16
  %tmp5 = getelementptr inbounds %struct.snork, %struct.snork* %tmp, i32 0, i32 0
  store <2 x double> %arg1, <2 x double>* %tmp5, align 16
  store %struct.spam* %arg, %struct.spam** %tmp2, align 8, !tbaa !0
  %tmp6 = load %struct.spam*, %struct.spam** %tmp2, align 8
  %tmp7 = getelementptr inbounds %struct.spam, %struct.spam* %tmp6, i32 0, i32 0
  %tmp8 = bitcast %struct.snork* %tmp3 to i8*
  %tmp9 = bitcast %struct.snork* %tmp7 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %tmp8, i8* %tmp9, i64 16, i1 false), !tbaa.struct !14
  %tmp10 = bitcast %struct.snork* %tmp4 to i8*
  %tmp11 = bitcast %struct.snork* %tmp to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %tmp10, i8* %tmp11, i64 16, i1 false), !tbaa.struct !14
  %tmp12 = getelementptr inbounds %struct.snork, %struct.snork* %tmp3, i32 0, i32 0
  %tmp13 = load <2 x double>, <2 x double>* %tmp12, align 16
  %tmp14 = getelementptr inbounds %struct.snork, %struct.snork* %tmp4, i32 0, i32 0
  %tmp15 = load <2 x double>, <2 x double>* %tmp14, align 16
  %tmp16 = call double @eggs(<2 x double> %tmp13, <2 x double> %tmp15)
  ret double %tmp16
}

; Function Attrs: argmemonly nofree nosync nounwind willreturn
declare void @llvm.lifetime.end.p0i8(i64 immarg, i8* nocapture) #0

define linkonce_odr double @eggs(<2 x double> %arg, <2 x double> %arg1) align 2 {
bb:
  %tmp = alloca %struct.snork, align 16
  %tmp2 = alloca %struct.snork, align 16
  %tmp3 = alloca %struct.snork, align 16
  %tmp4 = getelementptr inbounds %struct.snork, %struct.snork* %tmp, i32 0, i32 0
  store <2 x double> %arg, <2 x double>* %tmp4, align 16
  %tmp5 = getelementptr inbounds %struct.snork, %struct.snork* %tmp2, i32 0, i32 0
  store <2 x double> %arg1, <2 x double>* %tmp5, align 16
  %tmp6 = bitcast %struct.snork* %tmp3 to i8*
  %tmp7 = bitcast %struct.snork* %tmp2 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %tmp6, i8* %tmp7, i64 16, i1 false), !tbaa.struct !14
  %tmp8 = getelementptr inbounds %struct.snork, %struct.snork* %tmp3, i32 0, i32 0
  %tmp9 = load <2 x double>, <2 x double>* %tmp8, align 16
  %tmp10 = call double @wobble.2(%struct.snork* %tmp, <2 x double> %tmp9)
  ret double %tmp10
}

define linkonce_odr double @wobble.2(%struct.snork* %arg, <2 x double> %arg1) comdat($_ZN1d1hES_) align 2 {
bb:
  %tmp = alloca %struct.snork, align 16
  %tmp2 = alloca %struct.snork*, align 8
  %tmp3 = alloca %struct.snork, align 16
  %tmp4 = getelementptr inbounds %struct.snork, %struct.snork* %tmp, i32 0, i32 0
  store <2 x double> %arg1, <2 x double>* %tmp4, align 16
  store %struct.snork* %arg, %struct.snork** %tmp2, align 8, !tbaa !0
  %tmp5 = load %struct.snork*, %struct.snork** %tmp2, align 8
  %tmp6 = call double @quux.3(%struct.snork* %tmp)
  %tmp7 = call double @zot.4(%struct.snork* %tmp)
  call void @wombat(%struct.snork* %tmp3, double %tmp6, double %tmp7)
  %tmp8 = getelementptr inbounds %struct.snork, %struct.snork* %tmp5, i32 0, i32 0
  %tmp9 = load <2 x double>, <2 x double>* %tmp8, align 16, !tbaa !12
  %tmp10 = getelementptr inbounds %struct.snork, %struct.snork* %tmp3, i32 0, i32 0
  %tmp11 = load <2 x double>, <2 x double>* %tmp10, align 16, !tbaa !12
  %tmp12 = fmul <2 x double> %tmp11, %tmp9
  store <2 x double> %tmp12, <2 x double>* %tmp10, align 16, !tbaa !12
  %tmp13 = call double @zot.4(%struct.snork* %tmp3)
  %tmp14 = call double @quux.3(%struct.snork* %tmp3)
  %tmp15 = fsub double %tmp13, %tmp14
  ret double %tmp15
}

define linkonce_odr double @quux.3(%struct.snork* %arg) comdat($_ZN1d1fEv) align 2 {
bb:
  %tmp = alloca %struct.snork*, align 8
  store %struct.snork* %arg, %struct.snork** %tmp, align 8, !tbaa !0
  %tmp1 = load %struct.snork*, %struct.snork** %tmp, align 8
  %tmp2 = getelementptr inbounds %struct.snork, %struct.snork* %tmp1, i32 0, i32 0
  %tmp3 = load <2 x double>, <2 x double>* %tmp2, align 16, !tbaa !12
  %tmp4 = extractelement <2 x double> %tmp3, i32 1
  ret double %tmp4
}

define linkonce_odr double @zot.4(%struct.snork* %arg) comdat($_ZN1d1eEv) align 2 {
bb:
  %tmp = alloca %struct.snork*, align 8
  store %struct.snork* %arg, %struct.snork** %tmp, align 8, !tbaa !0
  %tmp1 = load %struct.snork*, %struct.snork** %tmp, align 8
  %tmp2 = getelementptr inbounds %struct.snork, %struct.snork* %tmp1, i32 0, i32 0
  %tmp3 = load <2 x double>, <2 x double>* %tmp2, align 16, !tbaa !12
  %tmp4 = extractelement <2 x double> %tmp3, i32 0
  ret double %tmp4
}

attributes #0 = { argmemonly nofree nosync nounwind willreturn }
attributes #1 = { argmemonly nofree nosync nounwind willreturn writeonly }

!0 = !{!1, !1, i64 0}
!1 = !{!"any pointer", !2, i64 0}
!2 = !{!"omnipotent char", !3, i64 0}
!3 = !{!"Simple C++ TBAA"}
!4 = !{!5, !6, i64 8}
!5 = !{!"_ZTS1a", !6, i64 0, !6, i64 8}
!6 = !{!"double", !2, i64 0}
!7 = !{!8, !6, i64 8}
!8 = !{!"_ZTS1p", !5, i64 0}
!9 = !{!5, !6, i64 0}
!10 = !{!8, !6, i64 0}
!11 = !{!6, !6, i64 0}
!12 = !{!2, !2, i64 0}
!13 = !{i64 0, i64 8, !11, i64 8, i64 8, !11}
!14 = !{i64 0, i64 16, !12}

(Sorry for the length -- this is as far as llvm-reduce would take it)

This patch introduces an assertion error we believe may be contributing to a miscompile (along with some other recent SLP patches -- this patch fixes the reduced case in http://llvm.org/PR50323, but doesn't fix the full case it was reduced from):

Thanks for the report, fixed here: https://reviews.llvm.org/rG207cdd7ed9fc

Hi, there is another issue that can be reproduced with an existing test case:

$opt -S -slp-vectorizer -slp-threshold=-10000 test/Transforms/SLPVectorizer/X86/insert-element-build-vector.ll -slp-min-tree-size=0
opt: .../llvm-project/llvm/lib/IR/Type.cpp:648: static llvm::FixedVectorType* llvm::FixedVectorType::get(llvm::Type*, unsigned int): Assertion `isValidElementType(ElementType) && "Element type of a VectorType must " "be an integer, floating point, or " "pointer type."' failed.

Thanks,

Valery
dyung added a subscriber: dyung.May 17 2021, 4:42 PM

Hi, this change has caused a regression in the codegen for one of our internal tests. Consider the following code:

__attribute__((noinline))
__m256d add_pd_002(__m256d a, __m256d b) {
  __m256d r = (__m256d){ a[0] + a[1], a[2] + a[3], b[0] + b[1], b[2] + b[3] };
  return __builtin_shufflevector(r, a, 0, -1, 2, 3);
}

If you compile this with "-g0 -O3 -march=btver2", prior to your commit the compiler would generate the following code for the function:

# %bb.0:                                # %entry
        vinsertf128     $1, %xmm1, %ymm0, %ymm0
        vhaddpd %ymm1, %ymm0, %ymm0

But after your change it is now generating the following code:

# %bb.0:                                # %entry
        vextractf128    $1, %ymm1, %xmm2
        vhaddpd %xmm0, %xmm0, %xmm0
        vhaddpd %xmm2, %xmm1, %xmm1
        vperm2f128      $2, %ymm0, %ymm1, %ymm0 # ymm0 = ymm0[0,1],ymm1[0,1]

From your commit description, it sounds like this is expected and will be fixed in a follow-up commit. Is my understanding of this correct?

> Hi, this change has caused a regression in the codegen for one of our internal tests. [...] From your commit description, it sounds like this is expected and will be fixed in a follow-up commit. Is my understanding of this correct?

Hi, could you check if D101555 fixes this issue?

> Hi, there is another issue that can be reproduced with an existing test case: [...]

Thanks for the report; fixed here, needs a quick review: https://reviews.llvm.org/D102675

dyung added a comment.May 18 2021, 2:17 AM

> Hi, this change has caused a regression in the codegen for one of our internal tests. [...]
> Hi, could you check if D101555 fixes this issue?

Hi, I applied the patch locally and built the compiler, but the generated assembly actually seems like it might be worse:

# %bb.0:                                # %entry
        vinsertf128     $1, %xmm0, %ymm1, %ymm2
        vinsertf128     $1, %xmm1, %ymm0, %ymm0
        vextractf128    $1, %ymm1, %xmm1
        vhaddpd %ymm2, %ymm0, %ymm0
        vhaddpd %xmm1, %xmm1, %xmm1
        vextractf128    $1, %ymm0, %xmm2
        vunpcklpd       %xmm1, %xmm2, %xmm1     # xmm1 = xmm2[0],xmm1[0]
        vinsertf128     $1, %xmm1, %ymm0, %ymm0
        retq

> Hi, I applied the patch locally and built the compiler, but the generated assembly actually seems like it might be worse: [...]

Ok, thanks, will fix it later today.

> Hi, this change has caused a regression in the codegen for one of our internal tests. [...] From your commit description, it sounds like this is expected and will be fixed in a follow-up commit. Is my understanding of this correct?

I see a different result for btver2.

# %bb.0:
        vextractf128    $1, %ymm1, %xmm2
        vhaddpd %xmm1, %xmm0, %xmm0
        vhaddpd %xmm2, %xmm1, %xmm1
        vinsertf128     $1, %xmm1, %ymm0, %ymm0

But I used llvm-11, most probably there is a difference with llvm-12.

Currently, it is impossible to fix this issue. This problem will be fixed after non-power-of-2 vectorization support in SLP has landed, since here we have a build vector of 3 elements (the second index in the shuffle is -1, thus the second sum is optimized out, resulting in the <4 x double> vector being built using just 3 insertelement instructions). It looks like previously this pattern was recognized by another transformation pass, but currently SLP tries to vectorize it.
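
Roughly what SLP sees here (an illustrative sketch derived from the C snippet above, not the exact IR): the unused lane-1 sum is optimized out, leaving a 3-element build of a <4 x double>:

%s0 = fadd double %a0, %a1      ; a[0] + a[1]
%s2 = fadd double %b0, %b1      ; b[0] + b[1]
%s3 = fadd double %b2, %b3      ; b[2] + b[3]
%i0 = insertelement <4 x double> undef, double %s0, i32 0
%i2 = insertelement <4 x double> %i0, double %s2, i32 2
%i3 = insertelement <4 x double> %i2, double %s3, i32 3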

> I see a different result for btver2. [...] It looks like previously this pattern was recognized by another transformation pass, but currently SLP tries to vectorize it.

The new result I posted in your quote was produced by a compiler built with this change. It is unfortunate to hear that we will have to take a regression for this, but I will update our internal test to expect it and file a bug so that it is not forgotten.