This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Add support for throttling.
AbandonedPublic

Authored by dtemirbulatov on Feb 5 2019, 12:48 PM.

Details

Summary

Here is support for SLP throttling: when the cost of vectorizing the whole tree is too high, we can reduce the number of proposed vectorizable elements and partially vectorize the tree. https://www.youtube.com/watch?v=xxtA2XPmIug&t=5s
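
The core idea can be sketched as follows. This is a toy illustration only, not the real SLPVectorizer code: `throttledTreeCost`, the flat per-node cost list, and the cut-the-costliest-node heuristic are all hypothetical stand-ins for the patch's tree reduction.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Toy sketch of throttling: while the cost of vectorizing the whole
// tree exceeds the threshold, cut the costliest proposed node (leaving
// it in scalar/gathered form) and re-evaluate the reduced tree.
int throttledTreeCost(std::vector<int> NodeCosts, int Threshold) {
  int TreeCost = std::accumulate(NodeCosts.begin(), NodeCosts.end(), 0);
  while (TreeCost > Threshold && !NodeCosts.empty()) {
    // Propose the costliest node for gathering instead of vectorizing.
    auto It = std::max_element(NodeCosts.begin(), NodeCosts.end());
    TreeCost -= *It;
    NodeCosts.erase(It);
  }
  return TreeCost; // cost of the remaining, partially vectorized tree
}
```

With costs {5, 3, -2} and threshold 4, the node costing 5 is dropped and the remaining partial tree (cost 1) becomes profitable, instead of abandoning vectorization entirely.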

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden.
RKSimon added inline comments.Aug 10 2020, 6:30 AM
clang/tools/clang-tidy
2 ↗(On Diff #284313)

Remove this

llvm/tools/mlir
2 ↗(On Diff #284313)

remove this

RKSimon added inline comments.Aug 10 2020, 8:42 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
8352

This looks like a NFC clang-format change now - either pre-commit or discard from the patch?

llvm/test/Transforms/SLPVectorizer/X86/load-merge.ll
59 ↗(On Diff #284345)

rebase - this was committed at rG90f721404ff8

Rebased, Fixed.

Oh, I missed fully removing it from the diff at 7269. Fixed.

RKSimon added inline comments.Aug 11 2020, 2:36 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4924–4928

Maybe pull this NDEBUG change out into its own patch?

7666

This still looks wrong - isn't the UserCost only used locally in the CompensateUseCost path?

dtemirbulatov added inline comments.Aug 11 2020, 3:09 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4924–4928

yes, it could go as NFC.

7666

No, there is another instance of UserCost at 6476. We have to compare the cost to SLPCostThreshold inside findSubTree() and subtract UserCost.

dtemirbulatov added inline comments.Aug 11 2020, 3:12 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7666

I mean the instance of usage.

@ABataev @anton-afanasyev Any more comments on this?

llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll
2

Is it worth adding a second RUN with -slp-throttle=false ?

xbolva00 added inline comments.Aug 16 2020, 10:23 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
680

Please mention paper name:

“Throttling Automatic Vectorization: When Less Is More”

https://www.cl.cam.ac.uk/~tmj32/papers/docs/porpodas15-pact.pdf

Slides are good, but paper is paper :)

Corrected paper citation, added -slp-throttle=false to llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll, rebased.

dtemirbulatov marked 2 inline comments as done.Aug 17 2020, 4:21 AM
ABataev added inline comments.Aug 21 2020, 6:51 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3564

What if the user does not have a corresponding tree entry, i.e. it is initially scalar? What if the Scalar itself is going to remain scalar?

4884–4891

Just:

for (Value *V : Entry->Scalars) {
  auto *Inst = cast<Instruction>(V);
  if (llvm::any_of(Inst->users(), [this](User *Op) {
        return Tree->ScalarToTreeEntry.count(Op) > 0;
      }))
    return InsertCost + getEntryCost(Entry);
}

Also, check code formatting

dtemirbulatov added inline comments.Aug 21 2020, 7:38 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3564

What if the Scalar itself is going to remain scalar?

At this point, the decision to cut the tree has already been made, and the Scalar could only be one intended for vectorization. Note that at 3295 we ignore any tree entries whose State is not TreeEntry::Vectorize.

What if the user does not have a corresponding tree entry, i.e. it is initially scalar?

Ah, yes. I have to check for !UserTE at 3305 and just continue if it is true.

dtemirbulatov added inline comments.Aug 21 2020, 3:23 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4884–4891

Hmm, I think this is not a correct suggestion: there might be several tree entries with TreeEntry::ProposedToGather status, and we have to calculate the insert cost for the whole tree here.

ABataev added inline comments.Aug 21 2020, 3:27 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4884–4891

Yeah, maybe. But you can do something similar, like

InsertCost += ...
break;

instead of setting a flag and doing a check after the loop.

Fixed remarks, rebased.

dtemirbulatov marked 3 inline comments as done.Aug 21 2020, 3:52 PM

Removed unnecessary check for "UserTE" at 3305.

Rebased. Ping.

Good enough for initial implementation?

Yes, for me it looks ready.

Will be able to review it next week, after returning from vacation.

ABataev added inline comments.Sep 15 2020, 9:30 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3565–3566

Could you compare it with a similar code in BoUpSLP::buildTree? Looks like you still missed some cases for user ignoring.

Matt added a subscriber: Matt.Sep 16 2020, 8:53 AM
dtemirbulatov updated this revision to Diff 293457.EditedSep 22 2020, 8:09 AM
dtemirbulatov marked an inline comment as done.

Rebased. Moved the InternalTreeUses population out of the (UseScalar != U || !InTreeUserNeedToExtract(Scalar, UserInst, TLI)) limitation at line 2661 in BoUpSLP::buildTree(), since we have to consider every internal user for partial vectorization while calculating the cost.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3565–3566

I think those ignoring cases are related to the fact that we are doing full vectorization in BoUpSLP::buildTree and can avoid extraction for in-tree users. Here, however, we have to extract for each user of a value that was once proposed for vectorization.

dtemirbulatov added inline comments.Sep 22 2020, 8:12 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3565–3566

To clarify: here we have to extract for each user of a value once proposed for vectorization, i.e. in the partial vectorization case.

Rebased. Ping.

Rebased. Ping.

Rebased. Ping^2

ABataev added inline comments.Nov 23 2020, 9:50 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3569–3571

Either just cast without if or dyn_cast

4882

Not sure that this is the best criterion. I think you also need to include the distance from the head of the tree to the entry, because some big costs can be compensated by the vectorizable nodes in the tree.
What I would do here is just some kind of level ordering search (BFS) starting from the deepest level.
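
The suggested deepest-level-first traversal could be sketched like this. This is a toy illustration under stated assumptions: the adjacency-list tree and the `deepestFirstOrder` function are hypothetical stand-ins, not BoUpSLP's real data structures.

```cpp
#include <functional>
#include <map>
#include <queue>
#include <utility>
#include <vector>

// BFS from the root assigns each tree entry a depth; candidates are
// then visited level by level starting from the deepest level, so
// subtrees far from the head of the tree are considered for cutting
// before nodes near the root. Node 0 is assumed to be the root.
std::vector<int>
deepestFirstOrder(const std::vector<std::vector<int>> &Children) {
  std::map<int, std::vector<int>, std::greater<int>> Levels;
  std::queue<std::pair<int, int>> Work; // (node, depth)
  Work.push({0, 0});
  while (!Work.empty()) {
    auto [Node, Depth] = Work.front();
    Work.pop();
    Levels[Depth].push_back(Node);
    for (int C : Children[Node])
      Work.push({C, Depth + 1});
  }
  std::vector<int> Order;
  for (auto &[Depth, Nodes] : Levels) // std::greater -> deepest first
    Order.insert(Order.end(), Nodes.begin(), Nodes.end());
  return Order;
}
```

For a tree where node 0 has children 1 and 2, and node 1 has child 3, the visit order is 3, then 1 and 2, then 0.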

4889

I think you can also exclude entries with the number of operands <= 1.

dtemirbulatov added inline comments.Dec 3 2020, 5:55 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4882

Hmm, implemented, but I don't see any benefit from it, plus we have to do a BFS search. And we are going to throw away any non-vectorizable nodes at 4295 anyway.

4889

But why? The only thing that matters here is the cost.

dtemirbulatov marked 2 inline comments as done.Dec 3 2020, 5:57 PM
ABataev added inline comments.Dec 4 2020, 5:20 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4882

It may trigger for targets like silvermont or in future for vectorized functions.

4889

Because the main idea is to drop gathers, and dropping one gather in favor of another will not be profitable for sure. But it may improve compile time and the list of candidates. The only case you need to check for is the masked gather case: it may be profitable to convert it to plain gathers for some targets.

dtemirbulatov marked an inline comment as done.Dec 7 2020, 7:32 AM
dtemirbulatov added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4882

I measured the BFS approach against this implementation. With BFS it is ~10% less efficient on SPEC2006 INT and ~20% less on the compilable SPEC2006 FP. By efficiency, I mean the total number of reduced trees over the whole compilation.

4889

I think I can check here whether scatter/gather is supported via TargetTransformInfo, and if it is not, move all nodes with TreeEntry::ScatterVectorize to TreeEntry::Gather.

anton-afanasyev added inline comments.Dec 7 2020, 8:43 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4889

I believe it's the wrong decision to check for scatter/gather target support, for the reason mentioned here: https://reviews.llvm.org/D92701#2435573. Why couldn't we just rely on costs (node cost and the total one)?

dtemirbulatov marked an inline comment as not done.Dec 7 2020, 9:04 AM
dtemirbulatov added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4889

I agree with @anton-afanasyev here. I am not sure what @ABataev wants here? If I exclude (operands <= 1), then we would lose half of all SLP tests affected by throttling.

ABataev added inline comments.Dec 7 2020, 9:16 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4882

Could you post it anyway to check if it may be improved?

4889

I did not say anything about checking whether scatter is supported here. I just said that we can improve the criterion by checking that the entry node has at least 2 operands (because if it has just one operand, most probably we can skip it); we only need to check a node with a single operand if it is a masked gather/scatter node, because it may be better to represent it as a simple gather.

dtemirbulatov added inline comments.Dec 7 2020, 9:19 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4882

OK, I might have missed something. Thanks.

Here is the BFS version of the change. Rebased.

And I counted the total number of nodes vectorized with throttling, instead of just the number of successful tree reductions. The total number is ~25% higher for INT and FP CPU2006 (AVX2 and AVX512F) with the Cost sort compared to the Distance sort.

Discussed further improvements with @ABataev offline, and he suggested removing the throttle limiter ("slp-throttling-budget"), at least for basic blocks without calls. I am looking into the new functionality.

Removed the "slp-throttling-budget" limiter for trees without calls.
Moved the main tree reduction loop to the getTreeCost() function.
Deleted the ProposedToGather node attribute from EntryState.

Rebased. Measured the compile-time impact on CPU2006 integer; I have not noticed any significant regressions in SLP compile time compared to SLP throttling with the limiter.

I mean only the SLP time regression, measured using the "-ftime-trace" flag.

At Dinar's request, I've measured the compile-time regression: http://llvm-compile-time-tracker.com/compare.php?from=f3449ed6073cac58efd9b62d0eb285affa650238&to=39362e11add238c45a7a7d55c1e002005f396fb7&stat=instructions. The regression is visible, but it is acceptable for such a change imho. The largest regression comes from CMakeFiles/clamscan.dir/libclamav_uuencode.c.o (+11.28%), so one can investigate this particular file.

Thanks, Anton. Eh, I don't see any time difference on my side for CMakeFiles/clamscan.dir/libclamav_uuencode.c.o with -O3 and -mavx2 or -mavx512f; also, SLP didn't try to throttle any trees in this particular test, so it looks like noise to me.

Rebased, Ping.

ABataev added inline comments.Feb 10 2021, 7:40 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
124–132

Do we really need both of these options? MaxCostsRecalculations should be enough.

668–672

Does "scalar form" mean "gathered nodes"? I don't think we can currently end up with the situation in the picture: we can't have a gathered node that depends on another node (either gathered or vectorized).

726–732

Why do we need to save intermediate results? Can't it be solved in a single-iteration loop without saving the intermediate results in the class instance?

ABataev added inline comments.Feb 10 2021, 7:40 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2768–2769

Looks like unrelated change

dtemirbulatov added inline comments.Feb 10 2021, 1:21 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
124–132

ok, thanks.

726–732

I noticed many regressions if we decide right away, and rebuilding the tree afterward is expensive.

ABataev added inline comments.Feb 10 2021, 1:40 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
726–732

What is the cause of those regressions? If I understand correctly, you're just trying to find the subtree, exclude its cost, compare, and repeat if it is not profitable. What prevents doing that in a loop without saving intermediate results in the class, saving them in stack vectors instead, if needed?

dtemirbulatov added inline comments.Feb 10 2021, 3:37 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
726–732

For example, we could partially vectorize at vectorizeStoreChain(), while later it might be possible to fully vectorize the same tree with tryToVectorizeList() or tryToReduce().

ABataev added inline comments.Feb 10 2021, 3:55 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
726–732

Could you give an example, please?

Here we can see a regression: it misses vectorizing the whole tree because partial vectorization kicks in too early and the "add" instructions stay scalar:

--- a/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll
+++ b/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll
@@ -7,49 +7,65 @@ define void @test(i32) {
; CHECK-NEXT: entry:
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
-; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP15:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
-; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
-; CHECK-NEXT: [[TMP2:%.*]] = extractelement <8 x i32> [[SHUFFLE]], i32 1
-; CHECK-NEXT: [[TMP3:%.*]] = add <8 x i32> [[SHUFFLE]], <i32 0, i32 55, i32 285, i32 1240, i32 1496, i32 8555, i32 12529, i32 13685>
-; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.and.v8i32(<8 x i32> [[TMP3]])
-; CHECK-NEXT: [[OP_EXTRA:%.*]] = and i32 [[TMP4]], [[TMP0:%.*]]
-; CHECK-NEXT: [[OP_EXTRA1:%.*]] = and i32 [[OP_EXTRA]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA2:%.*]] = and i32 [[OP_EXTRA1]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA3:%.*]] = and i32 [[OP_EXTRA2]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA4:%.*]] = and i32 [[OP_EXTRA3]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA5:%.*]] = and i32 [[OP_EXTRA4]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA6:%.*]] = and i32 [[OP_EXTRA5]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA7:%.*]] = and i32 [[OP_EXTRA6]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA8:%.*]] = and i32 [[OP_EXTRA7]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA9:%.*]] = and i32 [[OP_EXTRA8]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA10:%.*]] = and i32 [[OP_EXTRA9]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA11:%.*]] = and i32 [[OP_EXTRA10]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA12:%.*]] = and i32 [[OP_EXTRA11]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA13:%.*]] = and i32 [[OP_EXTRA12]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA14:%.*]] = and i32 [[OP_EXTRA13]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA15:%.*]] = and i32 [[OP_EXTRA14]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA16:%.*]] = and i32 [[OP_EXTRA15]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA17:%.*]] = and i32 [[OP_EXTRA16]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA18:%.*]] = and i32 [[OP_EXTRA17]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA19:%.*]] = and i32 [[OP_EXTRA18]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA20:%.*]] = and i32 [[OP_EXTRA19]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA21:%.*]] = and i32 [[OP_EXTRA20]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA22:%.*]] = and i32 [[OP_EXTRA21]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA23:%.*]] = and i32 [[OP_EXTRA22]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA24:%.*]] = and i32 [[OP_EXTRA23]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA25:%.*]] = and i32 [[OP_EXTRA24]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA26:%.*]] = and i32 [[OP_EXTRA25]], [[TMP0]]
-; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> poison, i32 [[OP_EXTRA26]], i32 0
-; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x i32> [[TMP5]], i32 14910, i32 1
-; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x i32> poison, i32 [[TMP2]], i32 0
-; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x i32> [[TMP7]], i32 [[TMP2]], i32 1
-; CHECK-NEXT: [[TMP9:%.*]] = and <2 x i32> [[TMP6]], [[TMP8]]
-; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i32> [[TMP6]], [[TMP8]]
-; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> [[TMP10]], <2 x i32> <i32 0, i32 3>
-; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x i32> [[TMP11]], i32 0
-; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x i32> poison, i32 [[TMP12]], i32 0
-; CHECK-NEXT: [[TMP14:%.*]] = extractelement <2 x i32> [[TMP11]], i32 1
-; CHECK-NEXT: [[TMP15]] = insertelement <2 x i32> [[TMP13]], i32 [[TMP14]], i32 1
+; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP19:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
+; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0
+; CHECK-NEXT: [[VAL_0:%.*]] = add i32 [[TMP2]], 0
+; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1
+; CHECK-NEXT: [[VAL_1:%.*]] = and i32 [[TMP3]], [[VAL_0]]
+; CHECK-NEXT: [[VAL_2:%.*]] = and i32 [[VAL_1]], [[TMP0:%.*]]
+; CHECK-NEXT: [[VAL_3:%.*]] = and i32 [[VAL_2]], [[TMP0]]
+; CHECK-NEXT: [[VAL_4:%.*]] = and i32 [[VAL_3]], [[TMP0]]
+; CHECK-NEXT: [[VAL_5:%.*]] = and i32 [[VAL_4]], [[TMP0]]
+; CHECK-NEXT: [[VAL_6:%.*]] = add i32 [[TMP3]], 55
+; CHECK-NEXT: [[VAL_7:%.*]] = and i32 [[VAL_5]], [[VAL_6]]
+; CHECK-NEXT: [[VAL_8:%.*]] = and i32 [[VAL_7]], [[TMP0]]
+; CHECK-NEXT: [[VAL_9:%.*]] = and i32 [[VAL_8]], [[TMP0]]
+; CHECK-NEXT: [[VAL_10:%.*]] = and i32 [[VAL_9]], [[TMP0]]
+; CHECK-NEXT: [[VAL_11:%.*]] = add i32 [[TMP3]], 285
+; CHECK-NEXT: [[VAL_12:%.*]] = and i32 [[VAL_10]], [[VAL_11]]
+; CHECK-NEXT: [[VAL_13:%.*]] = and i32 [[VAL_12]], [[TMP0]]
+; CHECK-NEXT: [[VAL_14:%.*]] = and i32 [[VAL_13]], [[TMP0]]
+; CHECK-NEXT: [[VAL_15:%.*]] = and i32 [[VAL_14]], [[TMP0]]
+; CHECK-NEXT: [[VAL_16:%.*]] = and i32 [[VAL_15]], [[TMP0]]
+; CHECK-NEXT: [[VAL_17:%.*]] = and i32 [[VAL_16]], [[TMP0]]
+; CHECK-NEXT: [[VAL_18:%.*]] = add i32 [[TMP3]], 1240
+; CHECK-NEXT: [[VAL_19:%.*]] = and i32 [[VAL_17]], [[VAL_18]]
+; CHECK-NEXT: [[VAL_20:%.*]] = add i32 [[TMP3]], 1496
+; CHECK-NEXT: [[VAL_21:%.*]] = and i32 [[VAL_19]], [[VAL_20]]
+; CHECK-NEXT: [[VAL_22:%.*]] = and i32 [[VAL_21]], [[TMP0]]
+; CHECK-NEXT: [[VAL_23:%.*]] = and i32 [[VAL_22]], [[TMP0]]
+; CHECK-NEXT: [[VAL_24:%.*]] = and i32 [[VAL_23]], [[TMP0]]
+; CHECK-NEXT: [[VAL_25:%.*]] = and i32 [[VAL_24]], [[TMP0]]
+; CHECK-NEXT: [[VAL_26:%.*]] = and i32 [[VAL_25]], [[TMP0]]
+; CHECK-NEXT: [[VAL_27:%.*]] = and i32 [[VAL_26]], [[TMP0]]
+; CHECK-NEXT: [[VAL_28:%.*]] = and i32 [[VAL_27]], [[TMP0]]
+; CHECK-NEXT: [[VAL_29:%.*]] = and i32 [[VAL_28]], [[TMP0]]
+; CHECK-NEXT: [[VAL_30:%.*]] = and i32 [[VAL_29]], [[TMP0]]
+; CHECK-NEXT: [[VAL_31:%.*]] = and i32 [[VAL_30]], [[TMP0]]
+; CHECK-NEXT: [[VAL_32:%.*]] = and i32 [[VAL_31]], [[TMP0]]
+; CHECK-NEXT: [[VAL_33:%.*]] = and i32 [[VAL_32]], [[TMP0]]
+; CHECK-NEXT: [[VAL_34:%.*]] = add i32 [[TMP3]], 8555
+; CHECK-NEXT: [[VAL_35:%.*]] = and i32 [[VAL_33]], [[VAL_34]]
+; CHECK-NEXT: [[VAL_36:%.*]] = and i32 [[VAL_35]], [[TMP0]]
+; CHECK-NEXT: [[VAL_37:%.*]] = and i32 [[VAL_36]], [[TMP0]]
+; CHECK-NEXT: [[VAL_38:%.*]] = and i32 [[VAL_37]], [[TMP0]]
+; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x i32> poison, i32 [[TMP3]], i32 0
+; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> [[TMP4]], i32 [[TMP3]], i32 1
+; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i32> [[TMP5]], <i32 12529, i32 13685>
+; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x i32> [[TMP6]], i32 0
+; CHECK-NEXT: [[VAL_40:%.*]] = and i32 [[VAL_38]], [[TMP7]]
+; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i32> [[TMP6]], i32 1
+; CHECK-NEXT: [[TMP9:%.*]] = insertelement <2 x i32> poison, i32 [[VAL_40]], i32 0
+; CHECK-NEXT: [[TMP10:%.*]] = insertelement <2 x i32> [[TMP9]], i32 14910, i32 1
+; CHECK-NEXT: [[TMP11:%.*]] = insertelement <2 x i32> poison, i32 [[TMP8]], i32 0
+; CHECK-NEXT: [[TMP12:%.*]] = insertelement <2 x i32> [[TMP11]], i32 [[TMP3]], i32 1
+; CHECK-NEXT: [[TMP13:%.*]] = and <2 x i32> [[TMP10]], [[TMP12]]
+; CHECK-NEXT: [[TMP14:%.*]] = add <2 x i32> [[TMP10]], [[TMP12]]
+; CHECK-NEXT: [[TMP15:%.*]] = shufflevector <2 x i32> [[TMP13]], <2 x i32> [[TMP14]], <2 x i32> <i32 0, i32 3>
+; CHECK-NEXT: [[TMP16:%.*]] = extractelement <2 x i32> [[TMP15]], i32 0
+; CHECK-NEXT: [[TMP17:%.*]] = insertelement <2 x i32> poison, i32 [[TMP16]], i32 0
+; CHECK-NEXT: [[TMP18:%.*]] = extractelement <2 x i32> [[TMP15]], i32 1
+; CHECK-NEXT: [[TMP19]] = insertelement <2 x i32> [[TMP17]], i32 [[TMP18]], i32 1
; CHECK-NEXT: br label [[LOOP]]
;
; FORCE_REDUCTION-LABEL: @test(

To me, it just looks like we need to postpone the vectorization of phi nodes in the function rather than trying to fix all the issues in the world in a single patch.

I think I could give one simpler example without PHI nodes.

Here is another example:
source_filename = "psspread.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local void @spread_q_poisson() local_unnamed_addr #0 {
entry:

%div.i = fdiv float undef, undef
%conv13.i = fdiv float 1.000000e+00, %div.i
%conv18.i = fdiv float 1.000000e+00, undef
%conv23.i = fdiv float 1.000000e+00, undef
%conv162 = fptosi float undef to i32
%0 = load float, float* undef, align 4
%1 = load i32, i32* undef, align 4
%add187.us = add nsw i32 %1, %conv162
%add191.us = add nsw i32 undef, undef
%add195.us = add nsw i32 undef, undef
%conv196.us = sitofp i32 %add187.us to float
%mul197.us = fmul float %conv13.i, %conv196.us
%sub198.us = fsub float undef, %mul197.us
%mul.i363.us = fmul float %sub198.us, %sub198.us
%conv200.us = sitofp i32 %add191.us to float
%mul201.us = fmul float %conv18.i, %conv200.us
%sub202.us = fsub float undef, %mul201.us
%mul.i362.us = fmul float %sub202.us, %sub202.us
%conv204.us = sitofp i32 %add195.us to float
%mul205.us = fmul float %conv23.i, %conv204.us
%sub206.us = fsub float %0, %mul205.us
%mul.i.us = fmul float %sub206.us, %sub206.us
%add208.us = fadd float %mul.i363.us, %mul.i362.us
%add209.us = fadd float %add208.us, %mul.i.us
%cmp210.us = fcmp olt float %add209.us, undef
%add230.us = add nsw i32 undef, %add195.us
unreachable

}

attributes #0 = { "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 13.0.0 (/home/dtemirbulatov/llvm/llvm-project-thl/llvm/tools/clang eec04092d67b94f47439a9065b6bd4cd60165be2)"}

With the proposed change, it produces:
; ModuleID = 'bug.ll'
source_filename = "psspread.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local void @spread_q_poisson() local_unnamed_addr #0 {
entry:

%div.i = fdiv float undef, undef
%conv13.i = fdiv float 1.000000e+00, %div.i
%conv162 = fptosi float undef to i32
%0 = load float, float* undef, align 4
%1 = load i32, i32* undef, align 4
%add187.us = add nsw i32 %1, %conv162
%conv196.us = sitofp i32 %add187.us to float
%mul197.us = fmul float %conv13.i, %conv196.us
%sub198.us = fsub float undef, %mul197.us
%mul.i363.us = fmul float %sub198.us, %sub198.us
%2 = insertelement <2 x float> <float undef, float poison>, float %0, i32 1
%3 = fsub <2 x float> %2, <float 0x7FF8000000000000, float 0x7FF8000000000000>
%4 = fmul <2 x float> %3, %3
%5 = extractelement <2 x float> %4, i32 0
%add208.us = fadd float %mul.i363.us, %5
%6 = extractelement <2 x float> %4, i32 1
%add209.us = fadd float %add208.us, %6
%cmp210.us = fcmp olt float %add209.us, undef
%add230.us = add nsw i32 undef, undef
unreachable

}

attributes #0 = { "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 13.0.0 (/home/dtemirbulatov/llvm/llvm-project-thl/llvm/tools/clang eec04092d67b94f47439a9065b6bd4cd60165be2)"}

but if we immediately decide to vectorize partially, we get this output:
; ModuleID = 'bug.ll'
source_filename = "psspread.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local void @spread_q_poisson() local_unnamed_addr #0 {
entry:

%div.i = fdiv float undef, undef
%conv18.i = fdiv float 1.000000e+00, undef
%0 = insertelement <2 x float> poison, float %div.i, i32 0
%1 = insertelement <2 x float> %0, float undef, i32 1
%2 = fdiv <2 x float> <float 1.000000e+00, float 1.000000e+00>, %1
%conv162 = fptosi float undef to i32
%3 = load float, float* undef, align 4
%4 = load i32, i32* undef, align 4
%add187.us = add nsw i32 %4, %conv162
%add191.us = add nsw i32 undef, undef
%add195.us = add nsw i32 undef, undef
%conv200.us = sitofp i32 %add191.us to float
%mul201.us = fmul float %conv18.i, %conv200.us
%sub202.us = fsub float undef, %mul201.us
%mul.i362.us = fmul float %sub202.us, %sub202.us
%5 = insertelement <2 x i32> poison, i32 %add187.us, i32 0
%6 = insertelement <2 x i32> %5, i32 %add195.us, i32 1
%7 = sitofp <2 x i32> %6 to <2 x float>
%8 = fmul <2 x float> %2, %7
%9 = insertelement <2 x float> <float undef, float poison>, float %3, i32 1
%10 = fsub <2 x float> %9, %8
%11 = fmul <2 x float> %10, %10
%12 = extractelement <2 x float> %11, i32 0
%add208.us = fadd float %12, %mul.i362.us
%13 = extractelement <2 x float> %11, i32 1
%add209.us = fadd float %add208.us, %13
%cmp210.us = fcmp olt float %add209.us, undef
%add230.us = add nsw i32 undef, %add195.us
unreachable

}

attributes #0 = { "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 13.0.0 (/home/dtemirbulatov/llvm/llvm-project-thl/llvm/tools/clang eec04092d67b94f47439a9065b6bd4cd60165be2)"}

I see that immediate vectorization is better as it vectorizes more, no? Also, there is a problem that looks like it is caused by the multinode analysis. I'm trying to improve this in my non-power-2 patch and will prepare a separate patch for it.

eh, I think it is not a clear example, I have seen better examples, I will show something better.

Even this example shows that the current solution does not always produce the best result.

at least, we could avoid regressions.

dtemirbulatov added a comment. Edited Feb 12 2021, 8:55 AM

I think the next step is to compare vectorized tree heights (the number of vectorized nodes) among the possible vectorizable trees.

SLP has a greedy approach, and let's assume that full vectorization is always better than partial. We don't have the resources to save all trees and then choose the best one from those saved. I think I can now add choosing the best among the already partially vectorized trees.

  1. Again, even your example showed that this solution is worse in some cases. Why do we need to waste time and invest in a solution that is not better than the existing one, requires more time to understand, and consumes more memory?
  2. SLP implements a bottom-up approach, i.e. it always tries to vectorize the longest chain (except for PHI nodes, which should be improved). If we have a partial graph, it should not affect other vectorization graphs in the same basic block; generally speaking, some subnodes may just become subnodes of other graphs, but this is not a problem.
  3. Looks like you're trying to implement something similar to VPlan. We already have it, and it would be better to invest the time in implementing SLP vectorization support there.
  4. A redesign is completely different work; it requires correct estimation (not assumptions, but a real investigation), discussion, an RFC, approval, and a separate implementation.

Ok, Agree.

Addressed @ABataev's remarks and investigated the PHI-node regression in PR39774.ll; I have not spotted any other case involving PHI nodes. I have several other cases, but they happen quite rarely and I am not sure how to generalize them; I think VPlan might be helpful. Overall, I think it is ready.

ABataev added inline comments.Mar 19 2021, 7:15 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
658

PriorityQueue?

6716–6725

Looks like you need to implement something like reduceSchedulingRegion(), similar to the extendSchedulingRegion function, because otherwise you're going to operate on a larger scheduling region than necessary. I.e., you need to modify the ScheduleStart and ScheduleEnd data members.

7682

Why SLPThrottleBudget > 0? What if SLPThrottleBudget equals 0?

7682–7683

Why can't we do something like this:

int NumAttempts = 0;
do {
      if (R.isTreeTinyAndNotFullyVectorizable())
        break;

      R.computeMinimumValueSizes();
      InstructionCost Cost = R.getTreeCost();
      InstructionCost UserCost = 0;
      ....
      if (Cost < -SLPCostThreshold) {
        LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");
        R.getORE()->emit(OptimizationRemark(SV_NAME, "VectorizedList",
                                                    cast<Instruction>(Ops[0]))
                                 << "SLP vectorized with cost " << ore::NV("Cost", Cost)
                                 << " and with tree size "
                                 << ore::NV("TreeSize", R.getTreeSize()));

        R.vectorizeTree();
        // Move to the next bundle.
        I += VF - 1;
        NextInst = I + 1;
        Changed = true;
        break;
      }
   ...
   /// Do throttling here.
   ++NumAttempts;
} while (NumAttempts < SLPThrottleBudget);
dtemirbulatov added inline comments.Mar 21 2021, 5:04 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
658

hmm, PriorityQueue allows duplicate elements, and that might be an issue.

Rebased, addressed remarks, added reduceSchedulingRegion() function with the ability to set only ScheduleStart at this time, renamed RemovedOperations property to ProposedToGather.

dtemirbulatov marked 2 inline comments as done.Mar 29 2021, 5:39 PM
dtemirbulatov added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7682–7683

We are doing partial vectorization and we have to know UserCost to make the correct partial tree cut.

dtemirbulatov marked 2 inline comments as done.Mar 29 2021, 5:40 PM

Ping, ready to land?

Will review it on Monday.

I found an error in reduceSchedulingRegion() implementation. I am reworking the change.

Rebased, fixed an incorrect comment at 2358, fixed the wrong implementation of scheduling-region shrinking, and changed the code in tryToVectorizeList() as suggested by @ABataev.

dtemirbulatov added inline comments.Apr 8 2021, 8:33 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6749

Perhaps we could also check here for !BS->getScheduleData(I)->isPartOfBundle() and further shrink the region.

dtemirbulatov added inline comments.Apr 8 2021, 8:35 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6749

ah, no, this instruction could belong to a real gather node.

Slightly improved the scheduling-area shrinking algorithm by allowing removal of unnecessary unmaps in instruction chains.

Rebased, formatted, noticed 3x testcases affected after @ABataev landed D100495 "Add detection of shuffled/perfect matching of tree entries.", restored the "-slp-throttle" flag so that AArch64/gather-cost.ll remains functional, and manually adjusted "TMP" in minimum-sizes.ll in PR31243_sext to work around what is probably a bug in update_test_checks.py.

Fixed two format errors.

RKSimon added inline comments.Apr 26 2021, 7:34 AM
llvm/test/Transforms/SLPVectorizer/X86/uitofp.ll
683 ↗(On Diff #340425)

what happened to these checks?

Updated llvm/test/Transforms/SLPVectorizer/X86/uitofp.ll checks on request from @RKSimon

dtemirbulatov marked an inline comment as done.Apr 26 2021, 8:07 AM

@RKSimon , I have to split AVX256NODQ X86/sitofp.ll and maybe others.

ABataev added inline comments.Apr 26 2021, 8:14 AM
llvm/test/Transforms/SLPVectorizer/X86/arith-fix.ll
357–361 ↗(On Diff #340530)

Looks like it does not respect the MinTreeSize option anymore. And it is strange that such a code sequence becomes profitable for vectorization (scalar cost is 8, vector cost is 9).

Rebased. Forbid "detection of shuffled/perfect matching of tree entries" for canceled TreeEntries during throttling; replaced TEVectorizableSet with a PriorityQueue.

Fix formatting.

ABataev added inline comments.May 3 2021, 5:45 AM
llvm/test/Transforms/SLPVectorizer/X86/powof2div.ll
85–90 ↗(On Diff #342296)

Still looks like it does not respect mintreesize

dtemirbulatov added inline comments.May 4 2021, 7:00 PM
llvm/test/Transforms/SLPVectorizer/X86/powof2div.ll
85–90 ↗(On Diff #342296)

hmm, this is not the case here: the tree height is 5, the divide node cost is 20, and after deleting this node, extracting from the "add" node costs 4, inserting after the scalar divide costs 4, and the final tree cost is -4. llvm-mca for -mattr=+avx shows 1305 cycles before and 1609 cycles after.

Added check for current tree size to MinTreeSize before making the decision to vectorize.

dtemirbulatov updated this revision to Diff 343392. Edited May 6 2021, 6:50 AM
  1. Fixed an issue in getInsertCost(): I incorrectly added gather costs to nodes not related to any node proposed for vectorization. I had thought of this before and used "ScalarToTreeEntry.count(Op) > 0", but I discovered that I was not updating ScalarToTreeEntry while reducing the tree.
  2. Now I check with isTreeTinyAndNotFullyVectorizable() before deciding to vectorize.
  3. I introduced a "MinVecNodes" parameter, which sets the minimum number of vectorizable nodes we would like to keep while throttling; it currently defaults to 2. For example, if we have 3 total nodes in the tree and it satisfies MinTreeSize, we would like at least two nodes to remain vectorizable while reducing the tree in order to reach a positive decision.

Why do we need MinVecNodes? MinTreeSize and all associated analysis must be enough

it is the Transforms/SLPVectorizer/X86/tiny-tree.ll transform that scared me.
From:
define void @tiny_tree_not_fully_vectorizable(double* noalias nocapture %dst, double* noalias nocapture readonly %src, i64 %count) #0 {
entry:

%cmp12 = icmp eq i64 %count, 0
br i1 %cmp12, label %for.end, label %for.body

for.body: ; preds = %entry, %for.body

%i.015 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
%dst.addr.014 = phi double* [ %add.ptr4, %for.body ], [ %dst, %entry ]
%src.addr.013 = phi double* [ %add.ptr, %for.body ], [ %src, %entry ]
%0 = load double, double* %src.addr.013, align 8
store double %0, double* %dst.addr.014, align 8
%arrayidx2 = getelementptr inbounds double, double* %src.addr.013, i64 2
%1 = load double, double* %arrayidx2, align 8
%arrayidx3 = getelementptr inbounds double, double* %dst.addr.014, i64 1
store double %1, double* %arrayidx3, align 8
%add.ptr = getelementptr inbounds double, double* %src.addr.013, i64 %i.015
%add.ptr4 = getelementptr inbounds double, double* %dst.addr.014, i64 %i.015
%inc = add i64 %i.015, 1
%exitcond = icmp eq i64 %inc, %count
br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body, %entry

ret void

}
to:
define void @tiny_tree_not_fully_vectorizable(double* noalias nocapture %dst, double* noalias nocapture readonly %src, i64 %count) #0 {
entry:

%cmp12 = icmp eq i64 %count, 0
br i1 %cmp12, label %for.end, label %for.body

for.body: ; preds = %for.body, %entry

%i.015 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
%dst.addr.014 = phi double* [ %add.ptr4, %for.body ], [ %dst, %entry ]
%src.addr.013 = phi double* [ %add.ptr, %for.body ], [ %src, %entry ]
%0 = load double, double* %src.addr.013, align 8
%arrayidx2 = getelementptr inbounds double, double* %src.addr.013, i64 2
%1 = load double, double* %arrayidx2, align 8
%arrayidx3 = getelementptr inbounds double, double* %dst.addr.014, i64 1
%2 = insertelement <2 x double> poison, double %0, i32 0
%3 = insertelement <2 x double> %2, double %1, i32 1
%4 = bitcast double* %dst.addr.014 to <2 x double>*
store <2 x double> %3, <2 x double>* %4, align 8
%add.ptr = getelementptr inbounds double, double* %src.addr.013, i64 %i.015
%add.ptr4 = getelementptr inbounds double, double* %dst.addr.014, i64 %i.015
%inc = add i64 %i.015, 1
%exitcond = icmp eq i64 %inc, %count
br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body, %entry

ret void

}

but now, with llvm-mca and -mattr=+corei7-avx, I see a change from 1111 to 1014 cycles, so it looks good. I will check other cases.

ABataev added a comment. Edited May 7 2021, 3:05 AM

If so, it just means that our min-tree-size analysis is too strict and must be fixed in general, but not by introducing some new throttling-specific options. We may have the same situation without throttling.

Rebased. Removed the SLP parameter MinVecNodes. Added estimations of a good tree reduction: 1) if the tree contained some real operations (binary, arithmetic, calls) that were proposed for vectorization, then we don't want to reduce the tree to just load and store operations in vectorized form; 2) if the tree doesn't have any real operations, then we have to make sure that at least the root node and the node next to the root are going to be vectorized.

Allen added a subscriber: Allen.May 20 2021, 10:16 PM

Rebased. I switched to a path-aware tree reduction approach: we start from the leaves of a vectorizable tree and work toward its root.

Current status? Review stalled?

Herald added a project: Restricted Project. · View Herald Transcript Sep 7 2022, 12:55 PM
dtemirbulatov abandoned this revision.Sep 17 2022, 4:23 PM