This is an archive of the discontinued LLVM Phabricator instance.

Differential D57779

[SLP] Add support for throttling.
Abandoned, Public

Authored by dtemirbulatov on Feb 5 2019, 12:48 PM.

Details

Reviewers

ABataev
RKSimon
spatel
anton-afanasyev
hfinkel
vporpo
fhahn

Summary

This patch adds support for SLP throttling: when the cost of vectorizing the whole tree is too high, we can reduce the number of proposed vectorizable elements and vectorize the tree only partially. See https://www.youtube.com/watch?v=xxtA2XPmIug&t=5s
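
As a rough illustration of the idea (not the code from this patch), the sketch below models a tree as a chain of nodes ordered from the root, each with a cost delta where a negative value means vectorizing saves work. The names `throttle` and `ThrottleResult`, the fixed `CutCost` penalty, and the profitability threshold of zero are all assumptions made for this example; the real cost model is more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type: how many leading nodes stay vectorized and
// what that choice costs in total.
struct ThrottleResult {
  std::size_t NumVectorized;
  int Cost;
};

// Costs[i] is the assumed vector-minus-scalar cost of node i, ordered from
// the root of the tree. CutCost is an assumed fixed penalty for moving
// values between scalar and vector form at the cut.
ThrottleResult throttle(const std::vector<int> &Costs, int CutCost) {
  assert(CutCost >= 0 && "a cut can only add cost in this model");
  ThrottleResult Best = {0, 0}; // vectorizing nothing costs nothing
  int Prefix = 0;
  for (std::size_t I = 0; I < Costs.size(); ++I) {
    Prefix += Costs[I];
    // Keeping the whole tree needs no cut; any shorter prefix pays CutCost.
    bool WholeTree = (I + 1 == Costs.size());
    int Total = Prefix + (WholeTree ? 0 : CutCost);
    if (Total < Best.Cost) {
      Best.NumVectorized = I + 1;
      Best.Cost = Total;
    }
  }
  return Best;
}
```

With costs {-3, -2, 6} and a cut penalty of 1, the whole tree costs +1 and would be rejected, while keeping only the first two nodes costs -4, so the partial tree is the profitable choice in this model.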

Event Timeline

There are a very large number of changes, so older changes are hidden.

RKSimon added inline comments. Aug 10 2020, 6:30 AM

clang/tools/clang-tidy
2 ↗	(On Diff #284313)	Remove this
llvm/tools/mlir
2 ↗	(On Diff #284313)	remove this

Fixed.

RKSimon mentioned this in rG90f721404ff8: [SLP] Regenerate load-merge.ll tests. Aug 10 2020, 8:09 AM

RKSimon added inline comments. Aug 10 2020, 8:42 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
8352	This looks like a NFC clang-format change now - either pre-commit or discard from the patch?
llvm/test/Transforms/SLPVectorizer/X86/load-merge.ll
59 ↗	(On Diff #284345)	rebase - this was committed at rG90f721404ff8

Rebased, Fixed.

Oh, I failed to fully remove it from the diff at 7269. Fixed.

RKSimon added inline comments. Aug 11 2020, 2:36 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4924–4928	Maybe pull this NDEBUG change out into its own patch?
7666	This still looks wrong - isn't the UserCost only used locally in the CompensateUseCost path?

dtemirbulatov added inline comments. Aug 11 2020, 3:09 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4924–4928	yes, it could go as NFC.
7666	No, there is another instance of UserCost at 6476. We have to compare the cost to SLPCostThreshold inside findSubTree() and subtract UserCost.

dtemirbulatov added inline comments. Aug 11 2020, 3:12 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7666	I mean the instance of usage.

dtemirbulatov mentioned this in rGb1600d8b8971: [NFC] Guard the cost report block of debug outputs with NDEBUG and. Aug 11 2020, 7:35 AM

Rebased after rGb1600d8b8971

@ABataev @anton-afanasyev Any more comments on this?

llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll
2	Is it worth adding a second RUN with -slp-throttle=false ?

xbolva00 added inline comments. Aug 16 2020, 10:23 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
680	Please mention paper name: “Throttling Automatic Vectorization: When Less Is More” https://www.cl.cam.ac.uk/~tmj32/papers/docs/porpodas15-pact.pdf Slides are good, but paper is paper :)

Corrected paper citation, added -slp-throttle=false to llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll, rebased.

dtemirbulatov marked 2 inline comments as done. Aug 17 2020, 4:21 AM

ABataev added inline comments. Aug 21 2020, 6:51 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3564	What if the user does not have corresponding tree entry, i.e. it is initially scalar? What if the `Scalar` itself is going to remain scalar?
4884–4891	Just:
for (Value *V : Entry->Scalars) {
  auto *Inst = cast<Instruction>(V);
  if (llvm::any_of(Inst->users(), [this](User *Op) { return Tree->ScalarToTreeEntry.count(Op) > 0; }))
    return InsertCost + getEntryCost(Entry);
}
Also, check code formatting.

dtemirbulatov added inline comments. Aug 21 2020, 7:38 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3564	"What if the Scalar itself is going to remain scalar?" At this point, the decision to cut the tree has been made, and the Scalar can only be one that was intended to be vectorized. Note that at 3295 we ignore any tree entries whose State is not TreeEntry::Vectorize.
"What if the user does not have corresponding tree entry, i.e. it is initially scalar?" Ah, yes. I have to check for !UserTE at 3305 and just continue if it is true.

dtemirbulatov added inline comments. Aug 21 2020, 3:23 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4884–4891	hmm, I think this is not a correct suggestion, there might be several tree entries with TreeEntry::ProposedToGather status and we have to calculate Insert cost for the whole tree here.

ABataev added inline comments. Aug 21 2020, 3:27 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4884–4891	Yeah, maybe. But you can do something similar, like InsertCost += ... break; instead of setting a flag and doing a check after the loop.
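
For readers following the 4884–4891 thread, here is a minimal standalone sketch of the accumulate-and-break shape being discussed. It does not use the real SLP data structures: `Entry`, `UsersOfScalar`, `insertCost` and the integer user ids are hypothetical stand-ins, and `std::any_of` is used in place of `llvm::any_of`.

```cpp
#include <algorithm>
#include <cassert>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a tree entry that is proposed to be gathered.
struct Entry {
  std::vector<std::vector<int>> UsersOfScalar; // user ids for each scalar
  int Cost;                                    // assumed per-entry cost
};

// Sums the cost of every entry that has at least one scalar with a user
// inside the tree. The inner loop breaks on the first hit, so no flag
// variable and no check after the loop are needed.
int insertCost(const std::vector<Entry> &ProposedToGather,
               const std::unordered_set<int> &InTree) {
  int InsertCost = 0;
  for (const Entry &E : ProposedToGather) {
    assert(E.Cost >= 0 && "costs are modeled as non-negative here");
    for (const std::vector<int> &Users : E.UsersOfScalar) {
      bool HasTreeUser =
          std::any_of(Users.begin(), Users.end(),
                      [&](int U) { return InTree.count(U) > 0; });
      if (HasTreeUser) {
        InsertCost += E.Cost;
        break; // count each entry at most once
      }
    }
  }
  return InsertCost;
}
```

Unlike an early `return`, this form keeps iterating over the remaining entries, which matches the point that several entries may contribute to the insert cost of the whole tree.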

Fixed remarks, rebased.

dtemirbulatov marked 3 inline comments as done. Aug 21 2020, 3:52 PM

Removed unnecessary check for "UserTE" at 3305.

Rebased. Ping.

Good enough for initial implementation?

In D57779#2258046, @xbolva00 wrote:

Good enough for initial implementation?

Yes, to me it looks ready.

In D57779#2266618, @dtemirbulatov wrote:

In D57779#2258046, @xbolva00 wrote:

Good enough for initial implementation?

Yes, to me it looks ready.

Will be able to review it next week, after returning from vacation.

vdmitrie added a subscriber: vdmitrie. Sep 11 2020, 3:36 PM

ABataev added inline comments. Sep 15 2020, 9:30 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3565–3566	Could you compare it with a similar code in BoUpSLP::buildTree? Looks like you still missed some cases for user ignoring.

Matt added a subscriber: Matt. Sep 16 2020, 8:53 AM

Rebased. Moved the InternalTreeUses population out of the (UseScalar != U || !InTreeUserNeedToExtract(Scalar, UserInst, TLI)) condition at line 2661 in BoUpSLP::buildTree(), since we have to consider every internal user for partial vectorization when calculating the cost.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3565–3566	I think those ignored cases are related to the fact that we are doing full vectorization in BoUpSLP::buildTree, so we can avoid extracting for in-tree users. Here, however, we have to extract for each user of a value that was once proposed for vectorization.

dtemirbulatov added inline comments. Sep 22 2020, 8:12 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3565–3566	"Here, however, we have to extract for each user of a value that was once proposed for vectorization." I mean for partial vectorization.

Ping

Rebased. PING

Rebased. Ping.

Rebased. Ping^2

ABataev added inline comments. Nov 23 2020, 9:50 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3569–3571	Either just `cast` without `if` or `dyn_cast`
4882	Not sure that this is the best criterion. I think you also need to include the distance from the head of the tree to the entry, because some big costs can be compensated by the vectorizable nodes in the tree. What I would do here is just some kind of level ordering search (BFS) starting from the deepest level.
4889	I think you can also exclude entries with the number of operands <= 1.
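
The level-order search suggested for 4882 could look roughly like the sketch below. It is only an illustration under simplifying assumptions: the tree is a plain tree without shared nodes, stored as child lists with node 0 as the root, and `deepestFirst` and `Children` are hypothetical names that do not appear in the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Children[i] lists the operand nodes of node i; node 0 is the root.
// Returns all node indices ordered from the deepest level up to the root.
std::vector<std::size_t>
deepestFirst(const std::vector<std::vector<std::size_t>> &Children) {
  std::vector<std::size_t> Order;
  if (Children.empty())
    return Order;
  std::queue<std::size_t> Work;
  Work.push(0);
  while (!Work.empty()) {
    std::size_t N = Work.front();
    Work.pop();
    assert(N < Children.size() && "child index out of range");
    Order.push_back(N); // plain BFS visits nodes level by level
    for (std::size_t C : Children[N])
      Work.push(C);
  }
  // BFS order has non-decreasing depth, so reversing it puts the deepest
  // level first.
  std::reverse(Order.begin(), Order.end());
  return Order;
}
```

A throttling pass built on this ordering would consider candidates far from the root first, so that an expensive node near the root can still be compensated by cheaper vectorizable nodes elsewhere in the tree.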

anton-afanasyev mentioned this in D90445: [SLP] Make SLPVectorizer to use `llvm.masked.gather` intrinsic. Nov 25 2020, 3:20 AM

dtemirbulatov added inline comments. Dec 3 2020, 5:55 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4882	Hmm, implemented, but I don't see any benefit from that, plus we have to do BFS search. And we are going to throw away any non-vectorizable nodes at 4295.
4889	But why? The only thing that matters here is the cost.

dtemirbulatov marked 2 inline comments as done. Dec 3 2020, 5:57 PM

ABataev added inline comments. Dec 4 2020, 5:20 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4882	It may trigger for targets like Silvermont or, in the future, for vectorized functions.
4889	Because the main idea is to drop gathers, and dropping one gather in favor of another will certainly not be profitable. But it may improve compile time and the list of candidates. The only case you need to check for is the latest masked gather case; it may be profitable to convert it to gathers for some targets.

dtemirbulatov marked an inline comment as done. Dec 7 2020, 7:32 AM

dtemirbulatov added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4882	I measured the BFS approach against this implementation. With BFS, it is ~10% less efficient on SPEC2006 INT and ~20% less efficient on compilable SPEC2006 FP. By efficiency, I mean the total number of reduced trees over the whole compilation.
4889	I think I can check here whether scatter/gather is supported via TargetInfo and, if it is not, move all nodes with TreeEntry::ScatterVectorize to TreeEntry::Gather.

anton-afanasyev added inline comments. Dec 7 2020, 8:43 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4889	I believe it's the wrong decision to check scatter/gather target support, for the reason mentioned here: https://reviews.llvm.org/D92701#2435573. Why can't we just rely on costs (the node cost and the total one)?

dtemirbulatov marked an inline comment as not done. Dec 7 2020, 9:04 AM

dtemirbulatov added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4889	I agree with @anton-afanasyev here. I am not sure what @ABataev wants here. If I exclude entries with (operands <= 1), then we would lose half of all the tests in SLP affected by throttling.

ABataev added inline comments. Dec 7 2020, 9:16 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4882	Could you post it anyway to check if it may be improved?
4889	I did not say anything about checking if scatter is supported here. I just said that we can improve the criterion here by checking that the entry node has at least 2 operands (because if it has just one operand, most probably we can skip it) and we just need to check the nodes with only 1 operand if it is gather scatter node, because it may be better to represent it as simple gather.
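
One possible reading of the criterion proposed for 4889, written as a standalone predicate. This is only a sketch of the reviewer's suggestion, not code from the patch: `Node`, `State`, `isThrottleCandidate` and `filterCandidates` are hypothetical names, and the enum merely mirrors the entry states mentioned in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical mirror of the tree entry states discussed in the review.
enum class State { Vectorize, ScatterVectorize, NeedToGather };

struct Node {
  State S;
  unsigned NumOperands;
};

// Entries with two or more operands are always considered. Entries with
// fewer operands are considered only if they are scatter/gather nodes,
// because those may be cheaper when represented as a simple gather.
bool isThrottleCandidate(const Node &N) {
  if (N.NumOperands >= 2)
    return true;
  return N.S == State::ScatterVectorize;
}

// Keeps only the nodes that pass the criterion above.
std::vector<Node> filterCandidates(const std::vector<Node> &Nodes) {
  std::vector<Node> Out;
  std::copy_if(Nodes.begin(), Nodes.end(), std::back_inserter(Out),
               isThrottleCandidate);
  return Out;
}
```

The filter only narrows the candidate list; whether a cut is actually taken would still be decided by the cost comparison.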

dtemirbulatov added inline comments. Dec 7 2020, 9:19 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4882	ok, I might miss something. Thanks.

Here is the BFS version of the change. Rebased.

I also counted the total number of nodes vectorized with throttling, instead of just the number of successful tree reductions. The total is ~25% higher for INT and FP CPU2006 (AVX2 and AVX512F) when sorting by cost compared to sorting by distance.

Discussed further improvements with @ABataev offline, and he suggested removing the throttle limiter ("slp-throttling-budget"), at least for basic blocks without calls. I am looking into this new functionality.

Removed the "slp-throttling-budget" limiter for trees without calls.
Moved the main tree reduction loop to the getTreeCost() function.
Deleted the ProposedToGather node attribute from EntryState.

Rebased. I measured the compile-time impact on CPU2006 integer and have not noticed any significant regressions in SLP compile time compared to SLP throttling with the limiter.

In D57779#2525124, @dtemirbulatov wrote:

Rebased. I measured the compile-time impact on CPU2006 integer and have not noticed any significant regressions in SLP compile time compared to SLP throttling with the limiter.

I mean only the SLP time regression, measured using the "-ftime-trace" flag.

At Dinar's request, I've measured compile time regression: http://llvm-compile-time-tracker.com/compare.php?from=f3449ed6073cac58efd9b62d0eb285affa650238&to=39362e11add238c45a7a7d55c1e002005f396fb7&stat=instructions. The regression is visible, but it is acceptable for such change imho. The largest regression comes from CMakeFiles/clamscan.dir/libclamav_uuencode.c.o (+11.28%), so one can investigate this particular file.

In D57779#2525959, @anton-afanasyev wrote:

At Dinar's request, I've measured compile time regression: http://llvm-compile-time-tracker.com/compare.php?from=f3449ed6073cac58efd9b62d0eb285affa650238&to=39362e11add238c45a7a7d55c1e002005f396fb7&stat=instructions. The regression is visible, but it is acceptable for such change imho. The largest regression comes from CMakeFiles/clamscan.dir/libclamav_uuencode.c.o (+11.28%), so one can investigate this particular file.

Thanks, Anton. Eh, I don't see any time difference on my side for `CMakeFiles/clamscan.dir/libclamav_uuencode.c.o` with -O3 for -mavx2 or -mavx512f. Also, SLP didn't try to throttle any trees in this particular test, so it looks like noise to me.

Ping.

Rebased, Ping.

ABataev added inline comments. Feb 10 2021, 7:40 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
124–132	Do we really need both of these options? `MaxCostsRecalculations` should be enough.
668–672	Does "scalar form" mean "gathered nodes"? I don't think we can currently end up in a situation like the one in the picture; we can't have a gathered node that depends on another node (either gathered or vectorized).
726–732	Why do we need to save intermediate results? Cannot it be solved in a single iteration loop without saving the intermediate results in the class instance?

ABataev added inline comments. Feb 10 2021, 7:40 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2768–2769	Looks like unrelated change

dtemirbulatov added inline comments. Feb 10 2021, 1:21 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
124–132	ok, thanks.
726–732	I have noticed many regressions if we decide right away, and rebuilding the tree afterward is expensive.

ABataev added inline comments. Feb 10 2021, 1:40 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
726–732	What is the cause of those regressions? If I understand it correctly, you're just trying to find the subtree, exclude its cost, compare, and repeat if it is not profitable. What prevents doing this in a loop without saving intermediate results in the class, keeping the result in stack vectors if needed?
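
The "find the subtree, exclude its cost, compare, repeat" loop described in this comment can be sketched with purely local state, as below. This is a simplified model and not the patch's implementation: `Subtree`, `ThrottleOutcome` and `throttleLoop` are hypothetical names, `MaxRecalculations` only loosely mirrors the `MaxCostsRecalculations` option mentioned in the review, and the profitability test assumes the usual "cost below the negated threshold" comparison.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical candidate: Cost is what the subtree contributes to the tree
// cost, ExtractCost is the assumed price of cutting it off.
struct Subtree {
  int Cost;
  int ExtractCost;
};

struct ThrottleOutcome {
  bool Profitable;
  unsigned Cuts;
  int Cost;
};

// All intermediate state lives in locals; nothing is stored in a class.
ThrottleOutcome throttleLoop(int TreeCost, std::vector<Subtree> Candidates,
                             int Threshold, unsigned MaxRecalculations) {
  // Try the most expensive subtrees first.
  std::sort(Candidates.begin(), Candidates.end(),
            [](const Subtree &A, const Subtree &B) { return A.Cost > B.Cost; });
  unsigned Cuts = 0;
  for (const Subtree &S : Candidates) {
    if (TreeCost < -Threshold || Cuts == MaxRecalculations)
      break; // already profitable, or out of budget
    assert(S.ExtractCost >= 0 && "extracts never make the tree cheaper");
    TreeCost += S.ExtractCost - S.Cost; // exclude the subtree, pay the cut
    ++Cuts;
  }
  ThrottleOutcome Result = {TreeCost < -Threshold, Cuts, TreeCost};
  return Result;
}
```

Note that this model decides immediately, which is exactly the behavior the author reports as causing regressions when a later attempt could vectorize the same tree fully.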

dtemirbulatov added inline comments. Feb 10 2021, 3:37 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
726–732	For example, we could partially vectorize at vectorizeStoreChain(), while later it is possible to fully vectorize the same tree with tryToVectorizeList() or tryToReduce().

dtemirbulatov added inline comments. Feb 10 2021, 3:41 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
726–732	For example, we could partially vectorize at vectorizeStoreChain(), while later it is possible to fully vectorize the same tree with tryToVectorizeList() or tryToReduce().

ABataev added inline comments. Feb 10 2021, 3:55 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
726–732	Could you give an example, please?

Here we can see the regression: it misses vectorizing the whole tree because partial vectorization kicks in too early and the "add" instructions stay scalar:

--- a/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll
+++ b/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll
@@ -7,49 +7,65 @@ define void @test(i32) {
; CHECK-NEXT: entry:
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
-; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP15:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
-; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
-; CHECK-NEXT: [[TMP2:%.*]] = extractelement <8 x i32> [[SHUFFLE]], i32 1
-; CHECK-NEXT: [[TMP3:%.*]] = add <8 x i32> [[SHUFFLE]], <i32 0, i32 55, i32 285, i32 1240, i32 1496, i32 8555, i32 12529, i32 13685>
-; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.and.v8i32(<8 x i32> [[TMP3]])
-; CHECK-NEXT: [[OP_EXTRA:%.*]] = and i32 [[TMP4]], [[TMP0:%.*]]
-; CHECK-NEXT: [[OP_EXTRA1:%.*]] = and i32 [[OP_EXTRA]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA2:%.*]] = and i32 [[OP_EXTRA1]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA3:%.*]] = and i32 [[OP_EXTRA2]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA4:%.*]] = and i32 [[OP_EXTRA3]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA5:%.*]] = and i32 [[OP_EXTRA4]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA6:%.*]] = and i32 [[OP_EXTRA5]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA7:%.*]] = and i32 [[OP_EXTRA6]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA8:%.*]] = and i32 [[OP_EXTRA7]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA9:%.*]] = and i32 [[OP_EXTRA8]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA10:%.*]] = and i32 [[OP_EXTRA9]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA11:%.*]] = and i32 [[OP_EXTRA10]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA12:%.*]] = and i32 [[OP_EXTRA11]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA13:%.*]] = and i32 [[OP_EXTRA12]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA14:%.*]] = and i32 [[OP_EXTRA13]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA15:%.*]] = and i32 [[OP_EXTRA14]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA16:%.*]] = and i32 [[OP_EXTRA15]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA17:%.*]] = and i32 [[OP_EXTRA16]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA18:%.*]] = and i32 [[OP_EXTRA17]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA19:%.*]] = and i32 [[OP_EXTRA18]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA20:%.*]] = and i32 [[OP_EXTRA19]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA21:%.*]] = and i32 [[OP_EXTRA20]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA22:%.*]] = and i32 [[OP_EXTRA21]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA23:%.*]] = and i32 [[OP_EXTRA22]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA24:%.*]] = and i32 [[OP_EXTRA23]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA25:%.*]] = and i32 [[OP_EXTRA24]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA26:%.*]] = and i32 [[OP_EXTRA25]], [[TMP0]]
-; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> poison, i32 [[OP_EXTRA26]], i32 0
-; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x i32> [[TMP5]], i32 14910, i32 1
-; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x i32> poison, i32 [[TMP2]], i32 0
-; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x i32> [[TMP7]], i32 [[TMP2]], i32 1
-; CHECK-NEXT: [[TMP9:%.*]] = and <2 x i32> [[TMP6]], [[TMP8]]
-; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i32> [[TMP6]], [[TMP8]]
-; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> [[TMP10]], <2 x i32> <i32 0, i32 3>
-; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x i32> [[TMP11]], i32 0
-; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x i32> poison, i32 [[TMP12]], i32 0
-; CHECK-NEXT: [[TMP14:%.*]] = extractelement <2 x i32> [[TMP11]], i32 1
-; CHECK-NEXT: [[TMP15]] = insertelement <2 x i32> [[TMP13]], i32 [[TMP14]], i32 1
+; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP19:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
+; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0
+; CHECK-NEXT: [[VAL_0:%.*]] = add i32 [[TMP2]], 0
+; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1
+; CHECK-NEXT: [[VAL_1:%.*]] = and i32 [[TMP3]], [[VAL_0]]
+; CHECK-NEXT: [[VAL_2:%.*]] = and i32 [[VAL_1]], [[TMP0:%.*]]
+; CHECK-NEXT: [[VAL_3:%.*]] = and i32 [[VAL_2]], [[TMP0]]
+; CHECK-NEXT: [[VAL_4:%.*]] = and i32 [[VAL_3]], [[TMP0]]
+; CHECK-NEXT: [[VAL_5:%.*]] = and i32 [[VAL_4]], [[TMP0]]
+; CHECK-NEXT: [[VAL_6:%.*]] = add i32 [[TMP3]], 55
+; CHECK-NEXT: [[VAL_7:%.*]] = and i32 [[VAL_5]], [[VAL_6]]
+; CHECK-NEXT: [[VAL_8:%.*]] = and i32 [[VAL_7]], [[TMP0]]
+; CHECK-NEXT: [[VAL_9:%.*]] = and i32 [[VAL_8]], [[TMP0]]
+; CHECK-NEXT: [[VAL_10:%.*]] = and i32 [[VAL_9]], [[TMP0]]
+; CHECK-NEXT: [[VAL_11:%.*]] = add i32 [[TMP3]], 285
+; CHECK-NEXT: [[VAL_12:%.*]] = and i32 [[VAL_10]], [[VAL_11]]
+; CHECK-NEXT: [[VAL_13:%.*]] = and i32 [[VAL_12]], [[TMP0]]
+; CHECK-NEXT: [[VAL_14:%.*]] = and i32 [[VAL_13]], [[TMP0]]
+; CHECK-NEXT: [[VAL_15:%.*]] = and i32 [[VAL_14]], [[TMP0]]
+; CHECK-NEXT: [[VAL_16:%.*]] = and i32 [[VAL_15]], [[TMP0]]
+; CHECK-NEXT: [[VAL_17:%.*]] = and i32 [[VAL_16]], [[TMP0]]
+; CHECK-NEXT: [[VAL_18:%.*]] = add i32 [[TMP3]], 1240
+; CHECK-NEXT: [[VAL_19:%.*]] = and i32 [[VAL_17]], [[VAL_18]]
+; CHECK-NEXT: [[VAL_20:%.*]] = add i32 [[TMP3]], 1496
+; CHECK-NEXT: [[VAL_21:%.*]] = and i32 [[VAL_19]], [[VAL_20]]
+; CHECK-NEXT: [[VAL_22:%.*]] = and i32 [[VAL_21]], [[TMP0]]
+; CHECK-NEXT: [[VAL_23:%.*]] = and i32 [[VAL_22]], [[TMP0]]
+; CHECK-NEXT: [[VAL_24:%.*]] = and i32 [[VAL_23]], [[TMP0]]
+; CHECK-NEXT: [[VAL_25:%.*]] = and i32 [[VAL_24]], [[TMP0]]
+; CHECK-NEXT: [[VAL_26:%.*]] = and i32 [[VAL_25]], [[TMP0]]
+; CHECK-NEXT: [[VAL_27:%.*]] = and i32 [[VAL_26]], [[TMP0]]
+; CHECK-NEXT: [[VAL_28:%.*]] = and i32 [[VAL_27]], [[TMP0]]
+; CHECK-NEXT: [[VAL_29:%.*]] = and i32 [[VAL_28]], [[TMP0]]
+; CHECK-NEXT: [[VAL_30:%.*]] = and i32 [[VAL_29]], [[TMP0]]
+; CHECK-NEXT: [[VAL_31:%.*]] = and i32 [[VAL_30]], [[TMP0]]
+; CHECK-NEXT: [[VAL_32:%.*]] = and i32 [[VAL_31]], [[TMP0]]
+; CHECK-NEXT: [[VAL_33:%.*]] = and i32 [[VAL_32]], [[TMP0]]
+; CHECK-NEXT: [[VAL_34:%.*]] = add i32 [[TMP3]], 8555
+; CHECK-NEXT: [[VAL_35:%.*]] = and i32 [[VAL_33]], [[VAL_34]]
+; CHECK-NEXT: [[VAL_36:%.*]] = and i32 [[VAL_35]], [[TMP0]]
+; CHECK-NEXT: [[VAL_37:%.*]] = and i32 [[VAL_36]], [[TMP0]]
+; CHECK-NEXT: [[VAL_38:%.*]] = and i32 [[VAL_37]], [[TMP0]]
+; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x i32> poison, i32 [[TMP3]], i32 0
+; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> [[TMP4]], i32 [[TMP3]], i32 1
+; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i32> [[TMP5]], <i32 12529, i32 13685>
+; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x i32> [[TMP6]], i32 0
+; CHECK-NEXT: [[VAL_40:%.*]] = and i32 [[VAL_38]], [[TMP7]]
+; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i32> [[TMP6]], i32 1
+; CHECK-NEXT: [[TMP9:%.*]] = insertelement <2 x i32> poison, i32 [[VAL_40]], i32 0
+; CHECK-NEXT: [[TMP10:%.*]] = insertelement <2 x i32> [[TMP9]], i32 14910, i32 1
+; CHECK-NEXT: [[TMP11:%.*]] = insertelement <2 x i32> poison, i32 [[TMP8]], i32 0
+; CHECK-NEXT: [[TMP12:%.*]] = insertelement <2 x i32> [[TMP11]], i32 [[TMP3]], i32 1
+; CHECK-NEXT: [[TMP13:%.*]] = and <2 x i32> [[TMP10]], [[TMP12]]
+; CHECK-NEXT: [[TMP14:%.*]] = add <2 x i32> [[TMP10]], [[TMP12]]
+; CHECK-NEXT: [[TMP15:%.*]] = shufflevector <2 x i32> [[TMP13]], <2 x i32> [[TMP14]], <2 x i32> <i32 0, i32 3>
+; CHECK-NEXT: [[TMP16:%.*]] = extractelement <2 x i32> [[TMP15]], i32 0
+; CHECK-NEXT: [[TMP17:%.*]] = insertelement <2 x i32> poison, i32 [[TMP16]], i32 0
+; CHECK-NEXT: [[TMP18:%.*]] = extractelement <2 x i32> [[TMP15]], i32 1
+; CHECK-NEXT: [[TMP19]] = insertelement <2 x i32> [[TMP17]], i32 [[TMP18]], i32 1
; CHECK-NEXT: br label [[LOOP]]
;
; FORCE_REDUCTION-LABEL: @test(

In D57779#2556783, @dtemirbulatov wrote:

Here we can see the regression: it misses vectorizing the whole tree because partial vectorization kicks in too early and the "add" instructions stay scalar.

To me, it just looks like we need to postpone the vectorization of phi nodes in the function rather than trying to fix all the issues in the world in a single patch.

I think I could give one simpler example without PHI nodes.

Here is another example:
source_filename = "psspread.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local void @spread_q_poisson() local_unnamed_addr #0 {
entry:

%div.i = fdiv float undef, undef
%conv13.i = fdiv float 1.000000e+00, %div.i
%conv18.i = fdiv float 1.000000e+00, undef
%conv23.i = fdiv float 1.000000e+00, undef
%conv162 = fptosi float undef to i32
%0 = load float, float* undef, align 4
%1 = load i32, i32* undef, align 4
%add187.us = add nsw i32 %1, %conv162
%add191.us = add nsw i32 undef, undef
%add195.us = add nsw i32 undef, undef
%conv196.us = sitofp i32 %add187.us to float
%mul197.us = fmul float %conv13.i, %conv196.us
%sub198.us = fsub float undef, %mul197.us
%mul.i363.us = fmul float %sub198.us, %sub198.us
%conv200.us = sitofp i32 %add191.us to float
%mul201.us = fmul float %conv18.i, %conv200.us
%sub202.us = fsub float undef, %mul201.us
%mul.i362.us = fmul float %sub202.us, %sub202.us
%conv204.us = sitofp i32 %add195.us to float
%mul205.us = fmul float %conv23.i, %conv204.us
%sub206.us = fsub float %0, %mul205.us
%mul.i.us = fmul float %sub206.us, %sub206.us
%add208.us = fadd float %mul.i363.us, %mul.i362.us
%add209.us = fadd float %add208.us, %mul.i.us
%cmp210.us = fcmp olt float %add209.us, undef
%add230.us = add nsw i32 undef, %add195.us
unreachable

}

attributes #0 = { "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 13.0.0 (/home/dtemirbulatov/llvm/llvm-project-thl/llvm/tools/clang eec04092d67b94f47439a9065b6bd4cd60165be2)"}

With the proposed change it produces:
; ModuleID = 'bug.ll'
source_filename = "psspread.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local void @spread_q_poisson() local_unnamed_addr #0 {
entry:

%div.i = fdiv float undef, undef
%conv13.i = fdiv float 1.000000e+00, %div.i
%conv162 = fptosi float undef to i32
%0 = load float, float* undef, align 4
%1 = load i32, i32* undef, align 4
%add187.us = add nsw i32 %1, %conv162
%conv196.us = sitofp i32 %add187.us to float
%mul197.us = fmul float %conv13.i, %conv196.us
%sub198.us = fsub float undef, %mul197.us
%mul.i363.us = fmul float %sub198.us, %sub198.us
%2 = insertelement <2 x float> <float undef, float poison>, float %0, i32 1
%3 = fsub <2 x float> %2, <float 0x7FF8000000000000, float 0x7FF8000000000000>
%4 = fmul <2 x float> %3, %3
%5 = extractelement <2 x float> %4, i32 0
%add208.us = fadd float %mul.i363.us, %5
%6 = extractelement <2 x float> %4, i32 1
%add209.us = fadd float %add208.us, %6
%cmp210.us = fcmp olt float %add209.us, undef
%add230.us = add nsw i32 undef, undef
unreachable

}

attributes #0 = { "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 13.0.0 (/home/dtemirbulatov/llvm/llvm-project-thl/llvm/tools/clang eec04092d67b94f47439a9065b6bd4cd60165be2)"}

But if we immediately decide to vectorize partially, we get this output:
; ModuleID = 'bug.ll'
source_filename = "psspread.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local void @spread_q_poisson() local_unnamed_addr #0 {
entry:

%div.i = fdiv float undef, undef
%conv18.i = fdiv float 1.000000e+00, undef
%0 = insertelement <2 x float> poison, float %div.i, i32 0
%1 = insertelement <2 x float> %0, float undef, i32 1
%2 = fdiv <2 x float> <float 1.000000e+00, float 1.000000e+00>, %1
%conv162 = fptosi float undef to i32
%3 = load float, float* undef, align 4
%4 = load i32, i32* undef, align 4
%add187.us = add nsw i32 %4, %conv162
%add191.us = add nsw i32 undef, undef
%add195.us = add nsw i32 undef, undef
%conv200.us = sitofp i32 %add191.us to float
%mul201.us = fmul float %conv18.i, %conv200.us
%sub202.us = fsub float undef, %mul201.us
%mul.i362.us = fmul float %sub202.us, %sub202.us
%5 = insertelement <2 x i32> poison, i32 %add187.us, i32 0
%6 = insertelement <2 x i32> %5, i32 %add195.us, i32 1
%7 = sitofp <2 x i32> %6 to <2 x float>
%8 = fmul <2 x float> %2, %7
%9 = insertelement <2 x float> <float undef, float poison>, float %3, i32 1
%10 = fsub <2 x float> %9, %8
%11 = fmul <2 x float> %10, %10
%12 = extractelement <2 x float> %11, i32 0
%add208.us = fadd float %12, %mul.i362.us
%13 = extractelement <2 x float> %11, i32 1
%add209.us = fadd float %add208.us, %13
%cmp210.us = fcmp olt float %add209.us, undef
%add230.us = add nsw i32 undef, %add195.us
unreachable

}

attributes #0 = { "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 13.0.0 (/home/dtemirbulatov/llvm/llvm-project-thl/llvm/tools/clang eec04092d67b94f47439a9065b6bd4cd60165be2)"}

I see that immediate vectorization is better as it vectorizes more, no? Also, there is a problem that looks like it is caused by the multinode analysis. I'm trying to improve this in my non-power-2 patch and will prepare a separate patch for it.

Eh, I think it is not a clear example. I have seen better examples and will show something better.

Even this example shows that the current solution does not always produce the best result.

At least we could avoid regressions.

I think the next step is to compare vectorized tree heights (the number of vectorized nodes) among the possible vectorizable trees.

SLP has a greedy approach, so let's assume that full vectorization is always better than partial. We don't have the resources to save all trees and then choose the best one from those saved. I think I can now add choosing the best among the already partially vectorized trees.

Again, even your example showed that this solution is worse in some cases. Why do we need to waste time and invest in a solution that is not better than the existing one, requires more time to understand, and consumes more memory?
SLP implements a bottom-up approach, i.e. it always tries to vectorize the longest chain (except for PHI nodes, which should be improved). If we have a partial graph, it should not affect other vectorization graphs in the same basic block. Generally speaking, some subnodes may become subnodes of other graphs, but this is not a problem.
Looks like you're trying to implement something similar to VPlan. We have it already, and it is better to invest the time in implementing support for SLP vectorization there.
A redesign is completely different work: it requires correct estimation (not assumptions, but real investigation), discussion, an RFC, approval, and a separate implementation.

Ok, Agree.

Addressed @ABataev's remarks and investigated the regression with PHI nodes in PR39774.ll. I have not spotted any other case involving PHI nodes, but I have several other cases, which happen quite rarely. I am not sure how to generalize them, and I think VPlan might be helpful. Overall, I think it is ready.

Harbormaster completed remote builds in B94259: Diff 331286.Mar 17 2021, 9:57 AM

ABataev added inline comments.Mar 19 2021, 7:15 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
658	`PriorityQueue`?
6716–6725	Looks like you need to implement something like `reduceSchedulingRegion()`, similar to `extendSchedulingRegion` function. Because otherwise you're going to operate with the larger scheduling region. I.e. need to modify `ScheduleStart` and `ScheduleEnd` data members.
7682	Why `SLPThrottleBudget > 0`? What if `SLPThrottleBudget` equals 0?
7682–7683	Why can't we do something like this:
int NumAttempts = 0;
do {
  if (R.isTreeTinyAndNotFullyVectorizable())
    break;
  R.computeMinimumValueSizes();
  InstructionCost Cost = R.getTreeCost();
  InstructionCost UserCost = 0;
  ....
  if (Cost < -SLPCostThreshold) {
    LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");
    R.getORE()->emit(OptimizationRemark(SV_NAME, "VectorizedList",
                                        cast<Instruction>(Ops[0]))
                     << "SLP vectorized with cost " << ore::NV("Cost", Cost)
                     << " and with tree size "
                     << ore::NV("TreeSize", R.getTreeSize()));
    R.vectorizeTree();
    // Move to the next bundle.
    I += VF - 1;
    NextInst = I + 1;
    Changed = true;
    break;
  }
  ...
  /// Do throttling here.
  ++NumAttempts;
} while (NumAttempts < SLPThrottleBudget);
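The control flow suggested in that comment can be tried out in isolation. The sketch below is not the patch's code: `ToyTree`, `throttleOnce`, and `tryToVectorize` are hypothetical names, and "throttling" is simplified to dropping the most expensive node. It only shows the shape of the loop: cost the tree, vectorize if profitable, otherwise throttle and retry until the budget runs out.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for an SLP tree: one cost per node, where a negative
// cost means that vectorizing the node saves work.
struct ToyTree {
  std::vector<int> NodeCosts;

  int getTreeCost() const {
    int Cost = 0;
    for (int C : NodeCosts)
      Cost += C;
    return Cost;
  }

  // Stand-in for one throttling step: drop the most expensive node.
  // Returns false if no node with a positive cost can be dropped.
  bool throttleOnce() {
    if (NodeCosts.size() <= 1)
      return false;
    auto It = std::max_element(NodeCosts.begin(), NodeCosts.end());
    if (*It <= 0)
      return false;
    NodeCosts.erase(It);
    return true;
  }
};

// Returns how many throttling steps were needed before the tree became
// profitable, or -1 if it never did within the budget.
int tryToVectorize(ToyTree Tree, int CostThreshold, unsigned Budget) {
  assert(!Tree.NodeCosts.empty() && "expected a non-empty tree");
  unsigned NumAttempts = 0;
  do {
    if (Tree.getTreeCost() < -CostThreshold)
      return static_cast<int>(NumAttempts); // "vectorize" and stop
    if (!Tree.throttleOnce())
      break;
    ++NumAttempts;
  } while (NumAttempts < Budget);
  return -1;
}
```

Because this is a do/while loop, the untouched tree is always costed once, even with a budget of 0. In this toy version a throttled tree is only re-costed if the budget allows another iteration, so a budget of 0 or 1 never accepts a throttled tree.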

dtemirbulatov added inline comments.Mar 21 2021, 5:04 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
658	Hmm, PriorityQueue allows duplicate elements, and that might be an issue.
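The duplicate concern can be shown with the standard container alone. This sketch assumes that `PriorityQueue` behaves like `std::priority_queue` in this respect; `Candidate`, `sizeAsQueue`, and `drainUnique` are hypothetical names. It shows that a priority queue keeps every pushed element, and one possible way to skip repeats while draining it.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <set>
#include <utility>
#include <vector>

// Hypothetical candidate: (priority, node id).
using Candidate = std::pair<int, int>;

// A priority queue keeps every pushed element, duplicates included.
std::size_t sizeAsQueue(const std::vector<Candidate> &Items) {
  std::priority_queue<Candidate> Q;
  for (const Candidate &C : Items)
    Q.push(C);
  return Q.size();
}

// Drain in priority order (largest first), skipping candidates already seen.
std::vector<Candidate> drainUnique(const std::vector<Candidate> &Items) {
  std::priority_queue<Candidate> Q;
  for (const Candidate &C : Items)
    Q.push(C);
  std::set<Candidate> Seen;
  std::vector<Candidate> Out;
  while (!Q.empty()) {
    Candidate C = Q.top();
    Q.pop();
    if (Seen.insert(C).second)
      Out.push_back(C);
  }
  return Out;
}
```

Filtering on the way out costs an extra set, but it leaves the queue's ordering untouched; whether that trade-off suits the patch is not something this sketch can decide.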

Rebased, addressed remarks, added reduceSchedulingRegion() function with the ability to set only ScheduleStart at this time, renamed RemovedOperations property to ProposedToGather.
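As a rough illustration of what shrinking only the start of a scheduling region means, here is a standalone sketch. It is not the patch's `reduceSchedulingRegion()`: `ToyRegion`, `reduceRegionStart`, and the `Needed` flags are hypothetical, and how an instruction is judged "needed" is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical half-open region [Start, End) of instruction indices in a block.
struct ToyRegion {
  std::size_t Start;
  std::size_t End;
};

// Needed[I] says whether instruction I must stay inside the scheduling region
// (how that is decided is deliberately left out of this sketch).
ToyRegion reduceRegionStart(ToyRegion R, const std::vector<bool> &Needed) {
  assert(R.Start <= R.End && R.End <= Needed.size() && "malformed region");
  // Only the start is moved, mirroring the "only ScheduleStart" limitation.
  while (R.Start < R.End && !Needed[R.Start])
    ++R.Start;
  return R;
}
```

If no instruction in the region is needed, the start catches up with the end and the region becomes empty.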

dtemirbulatov marked 2 inline comments as done.Mar 29 2021, 5:39 PM

dtemirbulatov added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7682–7683	We are doing partial vectorization and we have to know UserCost to make the correct partial tree cut.

dtemirbulatov marked 2 inline comments as done.Mar 29 2021, 5:40 PM

Harbormaster completed remote builds in B96225: Diff 334018.Mar 29 2021, 5:49 PM

Ping, ready to land?

In D57779#2667679, @xbolva00 wrote:

Ping, ready to land?

Will review it on Monday.

In D57779#2667704, @ABataev wrote:

In D57779#2667679, @xbolva00 wrote:

Ping, ready to land?

Will review it on Monday.

I found an error in reduceSchedulingRegion() implementation. I am reworking the change.

Rebased, fixed incorrect comment at 2358, fixed the wrong implementation of shrink scheduling region, changed the code in tryToVectorizeList() as suggested by @ABataev.

dtemirbulatov added inline comments.Apr 8 2021, 8:33 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6749	Perhaps we could also check here for !BS->getScheduleData(I)->isPartOfBundle() and further shrink the region.

dtemirbulatov added inline comments.Apr 8 2021, 8:35 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6749	ah, no, this instruction could belong to a real gather node.

Harbormaster completed remote builds in B97744: Diff 336118.Apr 8 2021, 9:11 AM

Slightly improved the scheduler area shrinking algorithm by allowing removal of unnecessary unmaps in instruction chains.

Harbormaster completed remote builds in B97848: Diff 336272.Apr 8 2021, 5:50 PM

Rebased and formatted. Noticed that three test cases were affected after @ABataev landed D100495 "Add detection of shuffled/perfect matching of tree entries."; restored the "-slp-throttle" flag so that AArch64/gather-cost.ll stays functional; manually adjusted "TMP" in minimum-sizes.ll in PR31243_sext, probably because of a bug in update_test_checks.py.

Herald added subscribers: kerbowa, nhaehnle, jvesely.Apr 25 2021, 5:40 PM

Harbormaster completed remote builds in B100843: Diff 340406.Apr 25 2021, 6:26 PM

Fixed two format errors.

Harbormaster completed remote builds in B100858: Diff 340425.Apr 25 2021, 9:45 PM

RKSimon added inline comments.Apr 26 2021, 7:34 AM

llvm/test/Transforms/SLPVectorizer/X86/uitofp.ll
683 ↗	(On Diff #340425)	what happened to these checks?

Updated llvm/test/Transforms/SLPVectorizer/X86/uitofp.ll checks on request from @RKSimon

dtemirbulatov marked an inline comment as done.Apr 26 2021, 8:07 AM

In D57779#2716824, @dtemirbulatov wrote:

Updated llvm/test/Transforms/SLPVectorizer/X86/uitofp.ll checks on request from @RKSimon

@RKSimon, I have to split AVX256NODQ X86/sitofp.ll and maybe others.

ABataev added inline comments.Apr 26 2021, 8:14 AM

llvm/test/Transforms/SLPVectorizer/X86/arith-fix.ll
357–361 ↗	(On Diff #340530)	Looks like it does not respect `MinTreeSize` option anymore. And it is strange that such code sequence gets profitable for vectorization (scalar cost is 8, vector cost is 9)

Harbormaster completed remote builds in B100939: Diff 340530.Apr 26 2021, 9:13 AM

Rebased, forbade "detection of shuffled/perfect matching of tree entries" for canceled TreeEntries during throttling, replaced TEVectorizableSet with PriorityQueue.

Harbormaster completed remote builds in B102213: Diff 342283.May 2 2021, 3:48 PM

Fix formatting.

Harbormaster completed remote builds in B102223: Diff 342296.May 2 2021, 6:39 PM

ABataev added inline comments.May 3 2021, 5:45 AM

llvm/test/Transforms/SLPVectorizer/X86/powof2div.ll
85–90 ↗	(On Diff #342296)	Still looks like it does not respect mintreesize

dtemirbulatov added inline comments.May 4 2021, 7:00 PM

llvm/test/Transforms/SLPVectorizer/X86/powof2div.ll
85–90 ↗	(On Diff #342296)	Hmm, this is not the case here: the tree height is 5, the divide node cost is 20, and after deleting this node, extracting from the "add" node costs 4 and inserting after the scalar divide costs 4, so the final tree cost is -4. llvm-mca for -mattr=+avx shows 1305 cycles before and 1609 cycles after.
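The arithmetic in that reply can be written down as a small additive model. This is only a sketch under assumptions: `ThrottleStep`, `costAfterThrottle`, and `isProfitable` are hypothetical names, the real cost model is not necessarily a plain sum, and the full-tree cost of 8 used in the usage note is not stated in the comment. It is back-derived from the quoted numbers.

```cpp
#include <cassert>

// Hypothetical cost pieces for cancelling one node of a vectorizable tree.
struct ThrottleStep {
  int RemovedNodeCost; // vector cost of the node that is no longer vectorized
  int ExtractCost;     // extracting operands from the vectorized producer
  int InsertCost;      // re-inserting the scalar results for vectorized users
};

// Assumed additive model: drop the node's cost, pay for the extra shuffles.
int costAfterThrottle(int FullTreeCost, const ThrottleStep &S) {
  assert(S.ExtractCost >= 0 && S.InsertCost >= 0 && "shuffles are never free");
  return FullTreeCost - S.RemovedNodeCost + S.ExtractCost + S.InsertCost;
}

// Same comparison shape as `Cost < -SLPCostThreshold`.
bool isProfitable(int Cost, int Threshold) { return Cost < -Threshold; }
```

With the numbers from the comment (node cost 20, extract 4, insert 4) and an assumed full-tree cost of 8, `costAfterThrottle(8, {20, 4, 4})` gives -4, which passes the profitability check at a threshold of 0.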

Added check for current tree size to MinTreeSize before making the decision to vectorize.

Harbormaster completed remote builds in B102679: Diff 342959.May 5 2021, 1:44 AM

1) Fixed an issue in getInsertCost(): I incorrectly added gather costs to nodes that were not related to any nodes proposed for vectorization. I had thought of this and previously used "ScalarToTreeEntry.count(Op) > 0", but I discovered that I was not updating ScalarToTreeEntry while reducing the tree.
2) Now I check isTreeTinyAndNotFullyVectorizable() before deciding to vectorize.
3) I introduced a "MinVecNodes" parameter, which sets the minimum number of vectorizable nodes we want to have while throttling; it currently defaults to 2. For example, if we have 3 nodes in total in the tree and it satisfies MinTreeSize, we want at least two nodes to remain vectorizable while reducing the tree in order to reach a positive decision.
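The stale-lookup problem described for getInsertCost() can be reproduced in miniature. The sketch below is hypothetical throughout (`ToyTree`, `addEntry`, `cancelEntry`, `isStillVectorized`, and integer scalar ids are all made up for illustration); it only shows why a `count(Op) > 0` style query needs the map to be updated whenever an entry is cancelled.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical miniature of a tree whose entries own scalar value ids.
class ToyTree {
  std::vector<std::vector<int>> Entries;      // scalars per entry
  std::unordered_map<int, int> ScalarToEntry; // scalar id -> entry index

public:
  int addEntry(const std::vector<int> &Scalars) {
    int Idx = static_cast<int>(Entries.size());
    Entries.push_back(Scalars);
    for (int S : Scalars)
      ScalarToEntry[S] = Idx;
    return Idx;
  }

  // Cancel an entry during tree reduction. Erasing its scalars keeps the
  // lookup below truthful; skipping this step leaves stale mappings behind.
  void cancelEntry(int Idx) {
    assert(Idx >= 0 && Idx < static_cast<int>(Entries.size()));
    for (int S : Entries[Idx])
      ScalarToEntry.erase(S);
    Entries[Idx].clear();
  }

  bool isStillVectorized(int Scalar) const {
    return ScalarToEntry.count(Scalar) > 0;
  }
};
```

Without the erase in `cancelEntry`, a query for a scalar of a cancelled entry would still report it as vectorized, which matches the kind of miscounted gather cost the comment describes.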

Harbormaster completed remote builds in B102986: Diff 343392.May 6 2021, 7:24 AM

Why do we need MinVecNodes? MinTreeSize and all associated analysis must be enough

It is the Transforms/SLPVectorizer/X86/tiny-tree.ll transform that scared me.
From:
define void @tiny_tree_not_fully_vectorizable(double* noalias nocapture %dst, double* noalias nocapture readonly %src, i64 %count) #0 {
entry:

%cmp12 = icmp eq i64 %count, 0
br i1 %cmp12, label %for.end, label %for.body

for.body: ; preds = %entry, %for.body

%i.015 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
%dst.addr.014 = phi double* [ %add.ptr4, %for.body ], [ %dst, %entry ]
%src.addr.013 = phi double* [ %add.ptr, %for.body ], [ %src, %entry ]
%0 = load double, double* %src.addr.013, align 8
store double %0, double* %dst.addr.014, align 8
%arrayidx2 = getelementptr inbounds double, double* %src.addr.013, i64 2
%1 = load double, double* %arrayidx2, align 8
%arrayidx3 = getelementptr inbounds double, double* %dst.addr.014, i64 1
store double %1, double* %arrayidx3, align 8
%add.ptr = getelementptr inbounds double, double* %src.addr.013, i64 %i.015
%add.ptr4 = getelementptr inbounds double, double* %dst.addr.014, i64 %i.015
%inc = add i64 %i.015, 1
%exitcond = icmp eq i64 %inc, %count
br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body, %entry

ret void

}
to:
define void @tiny_tree_not_fully_vectorizable(double* noalias nocapture %dst, double* noalias nocapture readonly %src, i64 %count) #0 {
entry:

%cmp12 = icmp eq i64 %count, 0
br i1 %cmp12, label %for.end, label %for.body

for.body: ; preds = %for.body, %entry

%i.015 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
%dst.addr.014 = phi double* [ %add.ptr4, %for.body ], [ %dst, %entry ]
%src.addr.013 = phi double* [ %add.ptr, %for.body ], [ %src, %entry ]
%0 = load double, double* %src.addr.013, align 8
%arrayidx2 = getelementptr inbounds double, double* %src.addr.013, i64 2
%1 = load double, double* %arrayidx2, align 8
%arrayidx3 = getelementptr inbounds double, double* %dst.addr.014, i64 1
%2 = insertelement <2 x double> poison, double %0, i32 0
%3 = insertelement <2 x double> %2, double %1, i32 1
%4 = bitcast double* %dst.addr.014 to <2 x double>*
store <2 x double> %3, <2 x double>* %4, align 8
%add.ptr = getelementptr inbounds double, double* %src.addr.013, i64 %i.015
%add.ptr4 = getelementptr inbounds double, double* %dst.addr.014, i64 %i.015
%inc = add i64 %i.015, 1
%exitcond = icmp eq i64 %inc, %count
br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body, %entry

ret void

}

But now, with llvm-mca and -mattr=+corei7-avx, I see a change from 1111 to 1014 cycles, so it looks good. I will check other cases.

If so, it just means that our min-tree-size analysis is too strict and must be fixed in general, but not by introducing some new throttling-specific options. We may have the same situation without throttling.

Rebased. Removed the SLP parameter MinVecNodes. Added estimates of what makes a good tree reduction:
1) If the tree contained some real operations (binary, arithmetic, calls) that were proposed for vectorization, then we don't want to reduce this tree to just load and store operations in vectorized form.
2) If the tree doesn't have any real operations (binary, arithmetic, ...), then we have to make sure that at least the root node and the node next to the root are going to be vectorized.
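The two acceptance rules above can be read as a single predicate. The following is one possible reading, not the patch's implementation: `OpKind`, `ToyNode`, and `isAcceptableReduction` are hypothetical, "real operation" is narrowed to arithmetic and calls, and "the node next to the root" is assumed to be the second node in the list.

```cpp
#include <cassert>
#include <vector>

enum class OpKind { Load, Store, Arith, Call };

// Hypothetical node summary: what it is, and whether it survives throttling.
struct ToyNode {
  OpKind Kind;
  bool StaysVectorized;
};

static bool isRealOp(OpKind K) {
  return K == OpKind::Arith || K == OpKind::Call;
}

// Nodes[0] is the root; Nodes[1] is assumed to be the node next to the root.
bool isAcceptableReduction(const std::vector<ToyNode> &Nodes) {
  assert(Nodes.size() >= 2 && "expected at least a root and one more node");
  bool HadRealOps = false;
  bool KeepsRealOp = false;
  for (const ToyNode &N : Nodes) {
    if (!isRealOp(N.Kind))
      continue;
    HadRealOps = true;
    KeepsRealOp |= N.StaysVectorized;
  }
  // Rule 1: a tree with real operations must keep at least one vectorized,
  // so it is not reduced to vectorized loads and stores only.
  if (HadRealOps)
    return KeepsRealOp;
  // Rule 2: otherwise the root and the node next to it must stay vectorized.
  return Nodes[0].StaysVectorized && Nodes[1].StaysVectorized;
}
```

For example, a store/arith/load tree whose arithmetic node was cancelled is rejected by rule 1, while a plain load/store pair is accepted only if both nodes stay vectorized.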

Harbormaster completed remote builds in B105050: Diff 346180.May 18 2021, 12:17 PM

Formatting.

Harbormaster completed remote builds in B105196: Diff 346402.May 19 2021, 4:33 AM

Allen added a subscriber: Allen.May 20 2021, 10:16 PM

Rebased. I switched to a path-aware tree reduction approach: we start from the leaves of a vectorizable tree and move toward the root of that tree.
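A leaf-to-root reduction along one recorded path might look roughly like the sketch below. It is a toy under stated assumptions, not the patch's algorithm: `ToyNode`, this `findLeaf`, and `reduceAlongPath` are hypothetical, the path simply follows the first operand, and a single flat `GatherCost` is charged once at the boundary between cancelled and vectorized nodes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct ToyNode {
  int Cost;                  // negative means profitable to vectorize
  std::vector<int> Operands; // indices of operand nodes
};

// Record the path from node C down to a leaf, following the first operand.
void findLeaf(const std::vector<ToyNode> &Tree, int C, std::vector<int> &Path) {
  Path.push_back(C);
  if (!Tree[C].Operands.empty())
    findLeaf(Tree, Tree[C].Operands.front(), Path);
}

// Cancel nodes along Path from the leaf toward the root (the root is kept)
// until the remaining cost is below -Threshold. Returns the number of
// cancelled nodes, or -1 if the path is exhausted without success.
int reduceAlongPath(const std::vector<ToyNode> &Tree,
                    const std::vector<int> &Path, int GatherCost,
                    int Threshold) {
  assert(!Path.empty() && Path.front() == 0 && "path must start at the root");
  int Cost = 0;
  for (const ToyNode &N : Tree)
    Cost += N.Cost;
  int Cancelled = 0;
  for (std::size_t I = Path.size(); I > 1 && Cost >= -Threshold; --I) {
    Cost -= Tree[Path[I - 1]].Cost; // this node is no longer vectorized
    if (Cancelled++ == 0)
      Cost += GatherCost; // assumed: one gather at the cut, paid once
  }
  return Cost < -Threshold ? Cancelled : -1;
}
```

With a root of cost -6, a middle node of cost -1, and a leaf of cost 10, the full tree costs 3; cancelling just the leaf and paying a gather cost of 2 brings it to -5, so one cancellation is enough.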

Harbormaster completed remote builds in B123830: Diff 372463.Sep 14 2021, 6:39 AM

dtemirbulatov mentioned this in D110623: [SLP] Avoid calculating expensive spill cost when it is not required.Sep 28 2021, 6:16 AM

Current status? Review stalled?

Herald added a project: Restricted Project.Sep 7 2022, 12:55 PM

Herald added a subscriber: pcwang-thead.

dtemirbulatov abandoned this revision.Sep 17 2022, 4:23 PM

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	778 lines
llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll	20 lines

Diff 372463

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

#include "llvm/Transforms/Utils/InjectTLIMappings.h"		#include "llvm/Transforms/Utils/InjectTLIMappings.h"
#include "llvm/Transforms/Utils/LoopUtils.h"		#include "llvm/Transforms/Utils/LoopUtils.h"
#include "llvm/Transforms/Vectorize.h"		#include "llvm/Transforms/Vectorize.h"
#include <algorithm>		#include <algorithm>
#include <cassert>		#include <cassert>
#include <cstdint>		#include <cstdint>
#include <iterator>		#include <iterator>
#include <memory>		#include <memory>
		#include <queue>
#include <set>		#include <set>
#include <string>		#include <string>
#include <tuple>		#include <tuple>
#include <utility>		#include <utility>
#include <vector>		#include <vector>

using namespace llvm;		using namespace llvm;
using namespace llvm::PatternMatch;		using namespace llvm::PatternMatch;
...

static cl::opt<int>		static cl::opt<int>
SLPCostThreshold("slp-threshold", cl::init(0), cl::Hidden,		SLPCostThreshold("slp-threshold", cl::init(0), cl::Hidden,
cl::desc("Only vectorize if you gain more than this "		cl::desc("Only vectorize if you gain more than this "
"number "));		"number "));

static cl::opt<bool>		static cl::opt<bool>
ShouldVectorizeHor("slp-vectorize-hor", cl::init(true), cl::Hidden,		ShouldVectorizeHor("slp-vectorize-hor", cl::init(true), cl::Hidden,
cl::desc("Attempt to vectorize horizontal reductions"));		cl::desc("Attempt to vectorize horizontal reductions"));

		ABataev (Done): Tabs are added
static cl::opt<bool> ShouldStartVectorizeHorAtStore(		static cl::opt<bool> ShouldStartVectorizeHorAtStore(
"slp-vectorize-hor-store", cl::init(false), cl::Hidden,		"slp-vectorize-hor-store", cl::init(false), cl::Hidden,
cl::desc(		cl::desc(
"Attempt to vectorize horizontal reductions feeding into a store"));		"Attempt to vectorize horizontal reductions feeding into a store"));

		static cl::opt<unsigned>
		SLPThrottleBudget("slp-throttling-budget", cl::init(32), cl::Hidden,
		cl::desc("Limit the total number of nodes for cost "
		"recalculations during throttling"));
		ABataev (Not Done): Do we really need both of these options? `MaxCostsRecalculations` should be enough.
		dtemirbulatov (Author, Done): ok, thanks.

static cl::opt<int>		static cl::opt<int>
MaxVectorRegSizeOption("slp-max-reg-size", cl::init(128), cl::Hidden,		MaxVectorRegSizeOption("slp-max-reg-size", cl::init(128), cl::Hidden,
cl::desc("Attempt to vectorize for this register size in bits"));		cl::desc("Attempt to vectorize for this register size in bits"));

static cl::opt<unsigned>		static cl::opt<unsigned>
MaxVFOption("slp-max-vf", cl::init(0), cl::Hidden,		MaxVFOption("slp-max-vf", cl::init(0), cl::Hidden,
cl::desc("Maximum SLP vectorization factor (0=unlimited)"));		cl::desc("Maximum SLP vectorization factor (0=unlimited)"));

...
public:

/// Vectorize the tree but with the list of externally used values \p		/// Vectorize the tree but with the list of externally used values \p
/// ExternallyUsedValues. Values in this MapVector can be replaced but the		/// ExternallyUsedValues. Values in this MapVector can be replaced but the
/// generated extractvalue instructions.		/// generated extractvalue instructions.
Value *vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues);		Value *vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues);

/// \returns the cost incurred by unwanted spills and fills, caused by		/// \returns the cost incurred by unwanted spills and fills, caused by
/// holding live values over call sites.		/// holding live values over call sites.
InstructionCost getSpillCost() const;		InstructionCost getSpillCost();

		/// \returns the cost extracting vectorized elements.
		InstructionCost getExtractShuffleCost(InstructionCost Cost);

		/// \returns the cost of gathering canceled elements to be used
		/// by vectorized operations during throttling.
		InstructionCost getInsertCost(ArrayRef<Value *> VectorizedVals);

		/// Find a non-gathering leaf node from current node C and record the path
		/// on the way.
		void findLeaf(TreeEntry *C, SetVector<TreeEntry *> &Path) const;

		using SubTreeQueue =
		std::priority_queue<std::pair<int, std::vector<TreeEntry *>>>;
		Lint (pre-merge checks): clang-format: please reformat the code.

		ABataev (Not Done): `PriorityQueue`?
		dtemirbulatov (Author, Done): Hmm, PriorityQueue allows duplicate elements, and that might be an issue.
		/// Find a subtree of the whole tree suitable to be vectorized. When
		/// vectorizing the whole tree is not profitable, we can consider vectorizing
		/// part of that tree. SLP algorithm looks to operations to vectorize starting
		/// from seed instructions on the bottom toward the end of chains of
		/// dependencies to the top of SLP graph, it groups potentially vectorizable
		/// operations in scalar form to bundles.
		/// For example:
		///
		/// <bundle 1> vector form
		///      |
		/// <bundle 2> vector form   <bundle 3> vector form
		///               \          /
		///        <seed root bundle> vector form
		///
		ABataev: Does "scalar form" mean "gathered nodes"? I don't think we can currently end up in the situation shown in the picture: a gathered node cannot depend on another node (either gathered or vectorized).
		/// Total cost is not profitable to vectorize, hence all operations are in
		/// scalar form.
		///
		/// Here is the same tree after SLP throttling transformation:
		///
		/// <bundle 1> vector form
		///      |
		/// <bundle 2> vector form   <bundle 3> gathered nodes
		xbolva00: Please mention the paper name: “Throttling Automatic Vectorization: When Less Is More” https://www.cl.cam.ac.uk/~tmj32/papers/docs/porpodas15-pact.pdf Slides are good, but paper is paper :)
		///               \          /
		///        <seed root bundle> vector form
		///
		/// So, we can throttle some operations in such a way that it is still
		/// profitable to vectorize part of the tree, while vectorizing the whole
		/// tree does not make sense.
		/// More details:
		/// https://www.cl.cam.ac.uk/~tmj32/papers/docs/porpodas15-pact.pdf
		bool findSubTree(SubTreeQueue &SubTrees, InstructionCost TreeCost);
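
The doc comment above describes throttling only at a high level. As a rough illustration, and not the patch's actual `findSubTree()` logic, the idea can be sketched with a toy cost model: each node has a vectorization cost (negative means profitable) and an assumed penalty if it is gathered instead. The names `Node`, `GatherCost`, `treeCost`, and `throttle` are hypothetical and exist only in this sketch; the real cost model goes through TargetTransformInfo and is considerably more involved.

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a tree entry.
struct Node {
  int Cost;       // Vector cost minus scalar cost; negative is profitable.
  int GatherCost; // Assumed penalty paid if this node is gathered instead.
  bool Gathered;  // True once the node has been throttled.
};

// Sum the cost of the tree under the current vectorize/gather decisions.
int treeCost(const std::vector<Node> &Nodes) {
  int Sum = 0;
  for (const Node &N : Nodes)
    Sum += N.Gathered ? N.GatherCost : N.Cost;
  return Sum;
}

// Greedily gather the non-root nodes that save the most until the tree cost
// drops below -Threshold. Returns true if a profitable partial tree exists.
bool throttle(std::vector<Node> &Nodes, int Threshold) {
  assert(!Nodes.empty() && "expected at least a root node");
  // Max-heap keyed on how much cost gathering a node would save.
  std::priority_queue<std::pair<int, unsigned>> Candidates;
  for (unsigned I = 1; I < Nodes.size(); ++I) // Index 0 is the root; keep it.
    Candidates.push({Nodes[I].Cost - Nodes[I].GatherCost, I});

  int Cost = treeCost(Nodes);
  while (Cost >= -Threshold && !Candidates.empty()) {
    const std::pair<int, unsigned> Top = Candidates.top();
    Candidates.pop();
    if (Top.first <= 0)
      break; // Gathering anything else would not lower the cost.
    Nodes[Top.second].Gathered = true;
    Cost -= Top.first;
  }
  return Cost < -Threshold;
}
```

With a root of cost -3 and two children of cost 5 and -2 (gather penalty 1 each), the whole tree costs 0 and is rejected at threshold 0; gathering only the expensive child brings the total to -4, so the partial tree is accepted.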

		/// Get the raw sum of the costs of all elements of the tree.
		InstructionCost getRawTreeCost(ArrayRef<Value *> VectorizedVals = None);

/// \returns the vectorization cost of the subtree that starts at \p VL.		/// \returns the vectorization cost of the subtree that starts at \p VL.
/// A negative number means that this is profitable.		/// A negative number means that this is profitable.
InstructionCost getTreeCost(ArrayRef<Value *> VectorizedVals = None);		InstructionCost getTreeCost(bool TreeReduce,
		ArrayRef<Value *> VectorizedVals = None);

/// Construct a vectorizable tree that starts at \p Roots, ignoring users for		/// Construct a vectorizable tree that starts at \p Roots, ignoring users for
/// the purpose of scheduling and extraction in the \p UserIgnoreLst.		/// the purpose of scheduling and extraction in the \p UserIgnoreLst.
void buildTree(ArrayRef<Value *> Roots,		void buildTree(ArrayRef<Value *> Roots,
ArrayRef<Value *> UserIgnoreLst = None);		ArrayRef<Value *> UserIgnoreLst = None);

/// Construct a vectorizable tree that starts at \p Roots, ignoring users for		/// Construct a vectorizable tree that starts at \p Roots, ignoring users for
/// the purpose of scheduling and extraction in the \p UserIgnoreLst taking		/// the purpose of scheduling and extraction in the \p UserIgnoreLst taking
/// into account (and updating it, if required) list of externally used		/// into account (and updating it, if required) list of externally used
/// values stored in \p ExternallyUsedValues.		/// values stored in \p ExternallyUsedValues.
void buildTree(ArrayRef<Value *> Roots,		void buildTree(ArrayRef<Value *> Roots,
ExtraValueToDebugLocsMap &ExternallyUsedValues,		ExtraValueToDebugLocsMap &ExternallyUsedValues,
ArrayRef<Value *> UserIgnoreLst = None);		ArrayRef<Value *> UserIgnoreLst = None);

/// Clear the internal data structures that are created by 'buildTree'.		/// Clear the internal data structures that are created by 'buildTree'.
void deleteTree() {		void deleteTree() {
VectorizableTree.clear();		VectorizableTree.clear();
ScalarToTreeEntry.clear();		ScalarToTreeEntry.clear();
MustGather.clear();		MustGather.clear();
ExternalUses.clear();		ExternalUses.clear();
		InternalTreeUses.clear();
		ProposedToGather.clear();
NumOpsWantToKeepOrder.clear();		NumOpsWantToKeepOrder.clear();
NumOpsWantToKeepOriginalOrder = 0;		NumOpsWantToKeepOriginalOrder = 0;
for (auto &Iter : BlocksSchedules) {		for (auto &Iter : BlocksSchedules) {
BlockScheduling *BS = Iter.second.get();		BlockScheduling *BS = Iter.second.get();
BS->clear();		BS->clear();
}		}
MinBWs.clear();		MinBWs.clear();
InstrElementSize.clear();		InstrElementSize.clear();
		NoCallInst = true;
		RawTreeCost = 0;
		ReduceableCost = 0;
		IsCostSumReady = false;
}		}
		ABataev: Why do we need to save intermediate results? Can't this be solved in a single loop, without saving the intermediate results in the class instance?
		dtemirbulatov: I have noticed many regressions if we decide right away, and rebuilding the tree afterward is expensive.
		ABataev: What is the cause of those regressions? If I understand correctly, you are just trying to find the subtree, exclude its cost, compare, and repeat if it is not profitable. What prevents doing that in a loop without saving intermediate results in the class, keeping the result in stack vectors if needed?
		dtemirbulatov: For example, we could partially vectorize in vectorizeStoreChain() when it is possible later to fully vectorize the same tree with tryToVectorizeList() or tryToReduce().
		ABataev: Could you give an example, please?

unsigned getTreeSize() const { return VectorizableTree.size(); }		unsigned getTreeSize() const { return VectorizableTree.size(); }

/// Perform LICM and CSE on the newly generated gather sequences.		/// Perform LICM and CSE on the newly generated gather sequences.
void optimizeGatherSequence();		void optimizeGatherSequence();
		ABataev: Do you really need to return `Optional` here? Maybe just return `-SLPCostThreshold` if not successful?

/// \returns The best order of instructions for vectorization.		/// \returns The best order of instructions for vectorization.
Optional<ArrayRef<unsigned>> bestOrder() const {		Optional<ArrayRef<unsigned>> bestOrder() const {
assert(llvm::all_of(		assert(llvm::all_of(
NumOpsWantToKeepOrder,		NumOpsWantToKeepOrder,
[this](const decltype(NumOpsWantToKeepOrder)::value_type &D) {		[this](const decltype(NumOpsWantToKeepOrder)::value_type &D) {
return D.getFirst().size() ==		return D.getFirst().size() ==
VectorizableTree[0]->Scalars.size();		VectorizableTree[0]->Scalars.size();
▲ Show 20 Lines • Show All 140 Lines • ▼ Show 20 Lines	public:
/// can be load combined in the backend. Load combining may not be allowed in		/// can be load combined in the backend. Load combining may not be allowed in
/// the IR optimizer, so we do not want to alter the pattern. For example,		/// the IR optimizer, so we do not want to alter the pattern. For example,
/// partially transforming a scalar bswap() pattern into vector code is		/// partially transforming a scalar bswap() pattern into vector code is
/// effectively impossible for the backend to undo.		/// effectively impossible for the backend to undo.
/// TODO: If load combining is allowed in the IR optimizer, this analysis		/// TODO: If load combining is allowed in the IR optimizer, this analysis
/// may not be necessary.		/// may not be necessary.
bool isLoadCombineCandidate() const;		bool isLoadCombineCandidate() const;

		/// Cut the tree to make it partially vectorizable.
		void cutTree();

OptimizationRemarkEmitter *getORE() { return ORE; }		OptimizationRemarkEmitter *getORE() { return ORE; }

/// This structure holds any data we need about the edges being traversed		/// This structure holds any data we need about the edges being traversed
/// during buildTree_rec(). We keep track of:		/// during buildTree_rec(). We keep track of:
/// (i) the user TreeEntry index, and		/// (i) the user TreeEntry index, and
/// (ii) the index of the edge.		/// (ii) the index of the edge.
struct EdgeInfo {		struct EdgeInfo {
EdgeInfo() = default;		EdgeInfo() = default;
▲ Show 20 Lines • Show All 790 Lines • ▼ Show 20 Lines	struct TreeEntry {
}		}

/// A vector of scalars.		/// A vector of scalars.
ValueList Scalars;		ValueList Scalars;

/// The Scalars are vectorized into this value. It is initialized to Null.		/// The Scalars are vectorized into this value. It is initialized to Null.
Value *VectorizedValue = nullptr;		Value *VectorizedValue = nullptr;

/// Do we need to gather this sequence or vectorize it		/// Do we need to gather this sequence or vectorize it
/// (either with vector instruction or with scatter/gather		/// (either with vector instruction or with scatter/gather
/// intrinsics for store/load)?		/// intrinsics for store/load)?
enum EntryState { Vectorize, ScatterVectorize, NeedToGather };		enum EntryState { Vectorize, ScatterVectorize, NeedToGather };
EntryState State;		EntryState State;

		ABataev: Could you split the patch and commit this part of the change (I mean, using the enum instead of a bool) as a separate NFC patch?
/// Does this sequence require some shuffling?		/// Does this sequence require some shuffling?
SmallVector<int, 4> ReuseShuffleIndices;		SmallVector<int, 4> ReuseShuffleIndices;

/// Does this entry require reordering?		/// Does this entry require reordering?
SmallVector<unsigned, 4> ReorderIndices;		SmallVector<unsigned, 4> ReorderIndices;

		/// Cost of this tree entry.
		InstructionCost Cost = 0;

/// Points back to the VectorizableTree.		/// Points back to the VectorizableTree.
///		///
/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has		/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has
/// to be a pointer and needs to be able to initialize the child iterator.		/// to be a pointer and needs to be able to initialize the child iterator.
/// Thus we need a reference back to the container to translate the indices		/// Thus we need a reference back to the container to translate the indices
/// to entries.		/// to entries.
VecTreeTy &Container;		VecTreeTy &Container;

/// The TreeEntry index containing the user of this entry. We can actually		/// The TreeEntry index containing the user of this entry. We can actually
/// have multiple users so the data structure is not truly a tree.		/// have multiple users so the data structure is not truly a tree.
SmallVector<EdgeInfo, 1> UserTreeIndices;		SmallVector<EdgeInfo, 1> UserTreeIndices;

		/// Use of this entry.
		TinyPtrVector<TreeEntry *> UseEntries;

/// The index of this treeEntry in VectorizableTree.		/// The index of this treeEntry in VectorizableTree.
int Idx = -1;		int Idx = -1;

private:		private:
/// The operands of each instruction in each lane Operands[op_index][lane].		/// The operands of each instruction in each lane Operands[op_index][lane].
/// Note: This helps avoid the replication of the code that performs the		/// Note: This helps avoid the replication of the code that performs the
/// reordering of operands during buildTree_rec() and vectorizeTree().		/// reordering of operands during buildTree_rec() and vectorizeTree().
SmallVector<ValueList, 2> Operands;		SmallVector<ValueList, 2> Operands;
▲ Show 20 Lines • Show All 105 Lines • ▼ Show 20 Lines	int findLaneForValue(Value *V) const {
assert(FoundLane < Scalars.size() && "Couldn't find extract lane");		assert(FoundLane < Scalars.size() && "Couldn't find extract lane");
if (!ReuseShuffleIndices.empty()) {		if (!ReuseShuffleIndices.empty()) {
FoundLane = std::distance(ReuseShuffleIndices.begin(),		FoundLane = std::distance(ReuseShuffleIndices.begin(),
find(ReuseShuffleIndices, FoundLane));		find(ReuseShuffleIndices, FoundLane));
}		}
return FoundLane;		return FoundLane;
}		}

		// Find nodes with more than one use.
		bool isFork() const {
		return llvm::count_if(UseEntries, [this](TreeEntry *Next) {
		return (Next->Idx != Idx && Next->State != TreeEntry::NeedToGather);
		Lint: Pre-merge checks: clang-format: please reformat the code.
		}) > 1;
		}
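
As a standalone illustration of the `isFork()` check above, the following sketch reimplements the same counting rule with the standard library only. `Entry` and `State` are hypothetical stand-ins for `TreeEntry` and `TreeEntry::EntryState`, and `std::count_if` replaces `llvm::count_if`; this is an assumption-laden sketch and not code from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for TreeEntry::EntryState.
enum class State { Vectorize, ScatterVectorize, NeedToGather };

// Hypothetical stand-in for TreeEntry, reduced to what the check needs.
struct Entry {
  int Idx;
  State St;
  std::vector<const Entry *> UseEntries;

  // An entry is a fork if more than one entry recorded in UseEntries is a
  // different node that is not going to be gathered.
  bool isFork() const {
    return std::count_if(UseEntries.begin(), UseEntries.end(),
                         [this](const Entry *Next) -> bool {
                           assert(Next && "use entries must be non-null");
                           return Next->Idx != Idx &&
                                  Next->St != State::NeedToGather;
                         }) > 1;
  }
};
```

Gathered entries and self-references do not count, so an entry with one vectorized use and one gathered use is not considered a fork.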

#ifndef NDEBUG		#ifndef NDEBUG
/// Debug printer.		/// Debug printer.
LLVM_DUMP_METHOD void dump() const {		LLVM_DUMP_METHOD void dump() const {
dbgs() << Idx << ".\n";		dbgs() << Idx << ".\n";
for (unsigned OpI = 0, OpE = Operands.size(); OpI != OpE; ++OpI) {		for (unsigned OpI = 0, OpE = Operands.size(); OpI != OpE; ++OpI) {
dbgs() << "Operand " << OpI << ":\n";		dbgs() << "Operand " << OpI << ":\n";
for (const Value *V : Operands[OpI])		for (const Value *V : Operands[OpI])
dbgs().indent(2) << *V << "\n";		dbgs().indent(2) << *V << "\n";
}		}
dbgs() << "Scalars: \n";		dbgs() << "Scalars: \n";
for (Value *V : Scalars)		for (Value *V : Scalars)
dbgs().indent(2) << *V << "\n";		dbgs().indent(2) << *V << "\n";
dbgs() << "State: ";		dbgs() << "State: ";
switch (State) {		switch (State) {
case Vectorize:		case Vectorize:
dbgs() << "Vectorize\n";		dbgs() << "Vectorize\n";
break;		break;
case ScatterVectorize:		case ScatterVectorize:
dbgs() << "ScatterVectorize\n";		dbgs() << "ScatterVectorize\n";
break;		break;
case NeedToGather:		case NeedToGather:
dbgs() << "NeedToGather\n";		dbgs() << "NeedToGather\n";
break;		break;
}		}
		dbgs() << "Cost: ";
		dbgs() << Cost << "\n";
dbgs() << "MainOp: ";		dbgs() << "MainOp: ";
if (MainOp)		if (MainOp)
dbgs() << *MainOp << "\n";		dbgs() << *MainOp << "\n";
else		else
dbgs() << "NULL\n";		dbgs() << "NULL\n";
dbgs() << "AltOp: ";		dbgs() << "AltOp: ";
if (AltOp)		if (AltOp)
dbgs() << *AltOp << "\n";		dbgs() << *AltOp << "\n";
▲ Show 20 Lines • Show All 82 Lines • ▼ Show 20 Lines	if (Last->State != TreeEntry::NeedToGather) {
++Lane;		++Lane;
}		}
assert((!Bundle.getValue() || Lane == VL.size()) &&		assert((!Bundle.getValue() || Lane == VL.size()) &&
"Bundle and VL out of sync");		"Bundle and VL out of sync");
} else {		} else {
MustGather.insert(VL.begin(), VL.end());		MustGather.insert(VL.begin(), VL.end());
}		}

if (UserTreeIdx.UserTE)		if (UserTreeIdx.UserTE) {
Last->UserTreeIndices.push_back(UserTreeIdx);		Last->UserTreeIndices.push_back(UserTreeIdx);
		VectorizableTree[UserTreeIdx.UserTE->Idx]->UseEntries.push_back(Last);
		}

return Last;		return Last;
}		}

/// -- Vectorization State --		/// -- Vectorization State --
/// Holds all of the tree entries.		/// Holds all of the tree entries.
TreeEntry::VecTreeTy VectorizableTree;		TreeEntry::VecTreeTy VectorizableTree;

Show All 33 Lines	struct ExternalUser {
// Which user that uses the scalar.		// Which user that uses the scalar.
llvm::User *User;		llvm::User *User;

// Which lane does the scalar belong to.		// Which lane does the scalar belong to.
int Lane;		int Lane;
};		};
using UserList = SmallVector<ExternalUser, 16>;		using UserList = SmallVector<ExternalUser, 16>;

		/// \returns the cost of extracting the vectorized elements.
		InstructionCost
		getExtractOperationCost(const ExternalUser &EU,
		SmallVectorImpl<SmallVector<int>> &ShuffleMask,
		SmallVectorImpl<Value *> &FirstUsers,
		SmallVectorImpl<APInt> &DemandedElts);

/// Checks if two instructions may access the same memory.		/// Checks if two instructions may access the same memory.
///		///
/// \p Loc1 is the location of \p Inst1. It is passed explicitly because it		/// \p Loc1 is the location of \p Inst1. It is passed explicitly because it
/// is invariant in the calling loop.		/// is invariant in the calling loop.
bool isAliased(const MemoryLocation &Loc1, Instruction *Inst1,		bool isAliased(const MemoryLocation &Loc1, Instruction *Inst1,
Instruction *Inst2) {		Instruction *Inst2) {
// First check if the result is already in the cache.		// First check if the result is already in the cache.
AliasCacheKey key = std::make_pair(Inst1, Inst2);		AliasCacheKey key = std::make_pair(Inst1, Inst2);
Show All 31 Lines	#endif
DenseMap<Instruction *, bool> DeletedInstructions;		DenseMap<Instruction *, bool> DeletedInstructions;

/// A list of values that need to extracted out of the tree.		/// A list of values that need to extracted out of the tree.
/// This list holds pairs of (Internal Scalar : External User). External User		/// This list holds pairs of (Internal Scalar : External User). External User
/// can be nullptr, it means that this Internal Scalar will be used later,		/// can be nullptr, it means that this Internal Scalar will be used later,
/// after vectorization.		/// after vectorization.
UserList ExternalUses;		UserList ExternalUses;

		/// Tree entries that should not be vectorized due to throttling.
		SmallPtrSet<TreeEntry *, 2> ProposedToGather;

		/// Raw cost of all elements in the tree.
		InstructionCost RawTreeCost = 0;

		InstructionCost ReduceableCost = 0;

		/// Indicate that no CallInst found in the tree and we don't need to
		/// calculate spill cost.
		bool NoCallInst = true;

		/// True if we have calculated the cost for the tree.
		bool IsCostSumReady = false;

		/// Current operations width to vectorize.
		unsigned BundleWidth = 0;

		/// Uses, by internal tree operations, of values proposed to be vectorized.
		SmallDenseMap<Value *, UserList> InternalTreeUses;

/// Values used only by @llvm.assume calls.		/// Values used only by @llvm.assume calls.
SmallPtrSet<const Value *, 32> EphValues;		SmallPtrSet<const Value *, 32> EphValues;

/// Holds all of the instructions that we gathered.		/// Holds all of the instructions that we gathered.
SetVector<Instruction *> GatherSeq;		SetVector<Instruction *> GatherSeq;

/// A list of blocks that we are going to CSE.		/// A list of blocks that we are going to CSE.
SetVector<BasicBlock *> CSEBlocks;		SetVector<BasicBlock *> CSEBlocks;
▲ Show 20 Lines • Show All 326 Lines • ▼ Show 20 Lines	struct BlockScheduling {
/// Updates the dependency information of a bundle and of all instructions/		/// Updates the dependency information of a bundle and of all instructions/
/// bundles which depend on the original bundle.		/// bundles which depend on the original bundle.
void calculateDependencies(ScheduleData *SD, bool InsertInReadyList,		void calculateDependencies(ScheduleData *SD, bool InsertInReadyList,
BoUpSLP *SLP);		BoUpSLP *SLP);

/// Sets all instruction in the scheduling region to un-scheduled.		/// Sets all instruction in the scheduling region to un-scheduled.
void resetSchedule();		void resetSchedule();

		/// Make the scheduling region smaller.
		void reduceSchedulingRegion(Instruction *Start, Instruction *End);

BasicBlock *BB;		BasicBlock *BB;

/// Simple memory allocation for ScheduleData.		/// Simple memory allocation for ScheduleData.
std::vector<std::unique_ptr<ScheduleData[]>> ScheduleDataChunks;		std::vector<std::unique_ptr<ScheduleData[]>> ScheduleDataChunks;

/// The size of a ScheduleData array in ScheduleDataChunks.		/// The size of a ScheduleData array in ScheduleDataChunks.
int ChunkSize;		int ChunkSize;

▲ Show 20 Lines • Show All 46 Lines • ▼ Show 20 Lines	#endif

/// Attaches the BlockScheduling structures to basic blocks.		/// Attaches the BlockScheduling structures to basic blocks.
MapVector<BasicBlock *, std::unique_ptr<BlockScheduling>> BlocksSchedules;		MapVector<BasicBlock *, std::unique_ptr<BlockScheduling>> BlocksSchedules;

/// Performs the "real" scheduling. Done before vectorization is actually		/// Performs the "real" scheduling. Done before vectorization is actually
/// performed in a basic block.		/// performed in a basic block.
void scheduleBlock(BlockScheduling *BS);		void scheduleBlock(BlockScheduling *BS);

		/// Remove operations from the list of operations proposed for scheduling.
		void removeFromScheduling(BlockScheduling *BS);

/// List of users to ignore during scheduling and that don't need extracting.		/// List of users to ignore during scheduling and that don't need extracting.
ArrayRef<Value *> UserIgnoreList;		ArrayRef<Value *> UserIgnoreList;

/// A DenseMapInfo implementation for holding DenseMaps and DenseSets of		/// A DenseMapInfo implementation for holding DenseMaps and DenseSets of
/// sorted SmallVectors of unsigned.		/// sorted SmallVectors of unsigned.
struct OrdersTypeDenseMapInfo {		struct OrdersTypeDenseMapInfo {
static OrdersType getEmptyKey() {		static OrdersType getEmptyKey() {
OrdersType V;		OrdersType V;
Show All 19 Lines	#endif
/// Contains orders of operations along with the number of bundles that have		/// Contains orders of operations along with the number of bundles that have
/// operations in this order. It stores only those orders that require		/// operations in this order. It stores only those orders that require
/// reordering, if reordering is not required it is counted using \a		/// reordering, if reordering is not required it is counted using \a
/// NumOpsWantToKeepOriginalOrder.		/// NumOpsWantToKeepOriginalOrder.
DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> NumOpsWantToKeepOrder;		DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> NumOpsWantToKeepOrder;
/// Number of bundles that do not require reordering.		/// Number of bundles that do not require reordering.
unsigned NumOpsWantToKeepOriginalOrder = 0;		unsigned NumOpsWantToKeepOriginalOrder = 0;

// Analysis and block reference.		// Analysis and block reference.
		ABataev: Why store a pointer to `TreeState` rather than the `TreeState` itself?
		ABataev: Why `unique_ptr` again? Why not a `TreeState` directly? Just `SmallVector<TreeState, 2>;`
		dtemirbulatov: TreeState is a large structure, so dynamic allocation is convenient, but a static version might be faster. Do you think it is critical?
		ABataev: Reduce the number of preallocated elements to, say, 2 or 4 and store the elements directly.
		dtemirbulatov: ok
		dtemirbulatov: Hmm. TreeState is too complex, and we don't have to make it movable or copyable. Why is unique_ptr not good here?
Function *F;		Function *F;
ScalarEvolution *SE;		ScalarEvolution *SE;
TargetTransformInfo *TTI;		TargetTransformInfo *TTI;
TargetLibraryInfo *TLI;		TargetLibraryInfo *TLI;
AAResults *AA;		AAResults *AA;
LoopInfo *LI;		LoopInfo *LI;
DominatorTree *DT;		DominatorTree *DT;
AssumptionCache *AC;		AssumptionCache *AC;
DemandedBits *DB;		DemandedBits *DB;
const DataLayout *DL;		const DataLayout *DL;
OptimizationRemarkEmitter *ORE;		OptimizationRemarkEmitter *ORE;

unsigned MaxVecRegSize; // This is set by TTI or overridden by cl::opt.		unsigned MaxVecRegSize; // This is set by TTI or overridden by cl::opt.
unsigned MinVecRegSize; // Set by cl::opt (default: 128).		unsigned MinVecRegSize; // Set by cl::opt (default: 128).

/// Instruction builder to construct the vectorized tree.		/// Instruction builder to construct the vectorized tree.
IRBuilder<> Builder;		IRBuilder<> Builder;

/// A map of scalar integer values to the smallest bit width with which they		/// A map of scalar integer values to the smallest bit width with which they
		ABataev: Tabs
		dtemirbulatov: This is now part of the TreeState structure; this is LLVM's standard format (clang-format).
/// can legally be represented. The values map to (width, signed) pairs,		/// can legally be represented. The values map to (width, signed) pairs,
/// where "width" indicates the minimum bit width and "signed" is True if the		/// where "width" indicates the minimum bit width and "signed" is True if the
/// value must be signed-extended, rather than zero-extended, back to its		/// value must be signed-extended, rather than zero-extended, back to its
		ABataev: Why do you need this new set? You can get the result just by using the `ScalarToTreeEntry` data member and checking the vectorization status of the corresponding `TreeEntry`.
/// original width.		/// original width.
MapVector<Value *, std::pair<uint64_t, bool>> MinBWs;		MapVector<Value *, std::pair<uint64_t, bool>> MinBWs;
};		};

		ABataev: Seems to me this is the same story as with `ScalarsToVec`.
} // end namespace slpvectorizer		} // end namespace slpvectorizer

template <> struct GraphTraits<BoUpSLP *> {		template <> struct GraphTraits<BoUpSLP *> {
using TreeEntry = BoUpSLP::TreeEntry;		using TreeEntry = BoUpSLP::TreeEntry;

/// NodeRef has to be a pointer per the GraphWriter.		/// NodeRef has to be a pointer per the GraphWriter.
		ABataev: Currently not used.
		dtemirbulatov: Thanks, sorry for this, I missed it somehow.
using NodeRef = TreeEntry *;		using NodeRef = TreeEntry *;

using ContainerTy = BoUpSLP::TreeEntry::VecTreeTy;		using ContainerTy = BoUpSLP::TreeEntry::VecTreeTy;

/// Add the VectorizableTree to the index iterator to be able to return		/// Add the VectorizableTree to the index iterator to be able to return
/// TreeEntry pointers.		/// TreeEntry pointers.
struct ChildIteratorType		struct ChildIteratorType
: public iterator_adaptor_base<		: public iterator_adaptor_base<
▲ Show 20 Lines • Show All 49 Lines • ▼ Show 20 Lines
template <> struct DOTGraphTraits<BoUpSLP *> : public DefaultDOTGraphTraits {		template <> struct DOTGraphTraits<BoUpSLP *> : public DefaultDOTGraphTraits {
using TreeEntry = BoUpSLP::TreeEntry;		using TreeEntry = BoUpSLP::TreeEntry;

DOTGraphTraits(bool isSimple = false) : DefaultDOTGraphTraits(isSimple) {}		DOTGraphTraits(bool isSimple = false) : DefaultDOTGraphTraits(isSimple) {}

std::string getNodeLabel(const TreeEntry *Entry, const BoUpSLP *R) {		std::string getNodeLabel(const TreeEntry *Entry, const BoUpSLP *R) {
std::string Str;		std::string Str;
raw_string_ostream OS(Str);		raw_string_ostream OS(Str);
		OS << "Idx: " << Entry->Idx << "\n";
if (isSplat(Entry->Scalars)) {		if (isSplat(Entry->Scalars)) {
OS << "<splat> " << *Entry->Scalars[0];		OS << "<splat> " << *Entry->Scalars[0];
return Str;		return Str;
}		}
for (auto V : Entry->Scalars) {		for (auto V : Entry->Scalars) {
OS << *V;		OS << *V;
if (llvm::any_of(R->ExternalUses, [&](const BoUpSLP::ExternalUser &EU) {		if (llvm::any_of(R->ExternalUses, [&](const BoUpSLP::ExternalUser &EU) {
return EU.Scalar == V;		return EU.Scalar == V;
▲ Show 20 Lines • Show All 54 Lines • ▼ Show 20 Lines	void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
ArrayRef<Value *> UserIgnoreLst) {		ArrayRef<Value *> UserIgnoreLst) {
deleteTree();		deleteTree();
UserIgnoreList = UserIgnoreLst;		UserIgnoreList = UserIgnoreLst;
if (!allSameType(Roots))		if (!allSameType(Roots))
return;		return;
buildTree_rec(Roots, 0, EdgeInfo());		buildTree_rec(Roots, 0, EdgeInfo());

// Collect the values that we need to extract from the tree.		// Collect the values that we need to extract from the tree.
for (auto &TEPtr : VectorizableTree) {		for (std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
TreeEntry *Entry = TEPtr.get();		TreeEntry *Entry = TEPtr.get();

// No need to handle users of gathered values.		// No need to handle users of gathered values.
if (Entry->State == TreeEntry::NeedToGather)		if (Entry->State == TreeEntry::NeedToGather)
continue;		continue;

// For each lane:		// For each lane:
for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {		for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
Show All 18 Lines	for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
continue;		continue;

// Skip in-tree scalars that become vectors		// Skip in-tree scalars that become vectors
if (TreeEntry *UseEntry = getTreeEntry(U)) {		if (TreeEntry *UseEntry = getTreeEntry(U)) {
Value *UseScalar = UseEntry->Scalars[0];		Value *UseScalar = UseEntry->Scalars[0];
// Some in-tree scalars will remain as scalar in vectorized		// Some in-tree scalars will remain as scalar in vectorized
// instructions. If that is the case, the one in Lane 0 will		// instructions. If that is the case, the one in Lane 0 will
// be used.		// be used.
		InternalTreeUses[U].emplace_back(Scalar, U, FoundLane);
if (UseScalar != U ||		if (UseScalar != U ||
UseEntry->State == TreeEntry::ScatterVectorize ||		UseEntry->State == TreeEntry::ScatterVectorize ||
!InTreeUserNeedToExtract(Scalar, UserInst, TLI)) {		!InTreeUserNeedToExtract(Scalar, UserInst, TLI)) {
LLVM_DEBUG(dbgs() << "SLP: \tInternal user will be removed:" << *U		LLVM_DEBUG(dbgs() << "SLP: \tInternal user will be removed:" << *U
<< ".\n");		<< ".\n");
assert(UseEntry->State != TreeEntry::NeedToGather && "Bad state");		assert(UseEntry->State != TreeEntry::NeedToGather && "Bad state");
continue;		continue;
		ABataev: Looks like an unrelated change.
}		}
}		}

// Ignore users in the user ignore list.		// Ignore users in the user ignore list.
if (is_contained(UserIgnoreList, UserInst))		if (is_contained(UserIgnoreList, UserInst))
continue;		continue;

LLVM_DEBUG(dbgs() << "SLP: Need to extract:" << *U << " from lane "		LLVM_DEBUG(dbgs() << "SLP: Need to extract:" << *U << " from lane "
▲ Show 20 Lines • Show All 764 Lines • ▼ Show 20 Lines	default:
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
LLVM_DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");		LLVM_DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");
return;		return;
}		}
}		}

		void BoUpSLP::cutTree() {
		SmallVector<TreeEntry *, 4> VecNodes;

		for (std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
		TreeEntry *Entry = TEPtr.get();
		if (Entry->State == TreeEntry::NeedToGather)
		continue;
		ABataev: You don't need to push the elements to a new vector here; instead, you can perform the required actions directly.
		// For all canceled operations, we should consider possible uses by
		// non-canceled operations; that requires populating the ExternalUser
		// list with the canceled elements.
		for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
		ABataev (Not Done): Should this loop be executed only for `ProposedToGather` `Entry`s?
		dtemirbulatov (Author, Done): Yes, but we have to know the lane there.
		ABataev (Not Done): I mean that the check in the previous `if` must be `Entry->State == TreeEntry::ProposedToGather`, not `!= TreeEntry::Vectorize`.
		Value *Scalar = Entry->Scalars[Lane];
		for (User *U : Scalar->users()) {
		ABataev (Done): These two loops can be merged, no? And use `switch` instead of `if`, if possible, after merging.
		dtemirbulatov (Author, Done): Thanks.
		LLVM_DEBUG(dbgs() << "SLP: Checking user:" << *U << ".\n");
		TreeEntry *UserTE = getTreeEntry(U);
		ABataev (Done): What if the user does not have a corresponding tree entry, i.e. it is initially scalar? What if the `Scalar` itself is going to remain scalar?
		dtemirbulatov (Author, Done):
		> What if the Scalar itself is going to remain scalar?
		At this point the decision to cut the tree has been made, so the Scalar can only be one that was intended to be vectorized. Note that at 3295 we ignore any tree entry whose State is not TreeEntry::Vectorize.
		> What if the user does not have corresponding tree entry, i.e. it is initially scalar?
		Ah, yes. I have to check for !UserTE at 3305 and just continue if it is true.
		if (!UserTE || ProposedToGather.count(UserTE) == 0)
		continue;
		ABataev (Done): Could you compare it with the similar code in BoUpSLP::buildTree? Looks like you still missed some cases for user ignoring.
		dtemirbulatov (Author, Done): I think those ignoring cases are related to the fact that we are doing full vectorization in BoUpSLP::buildTree, so we can avoid extracting for in-tree users. Here we have to extract for each user of a value that was once proposed for vectorization.
		dtemirbulatov (Author, Done): "And here we have to extract to each user of once proposed to vectorized value." I mean for the partial vectorization.
		// Ignore users in the user ignore list.
		ABataev (Not Done): This does nothing except for debugging print; guard it with `#ifndef NDEBUG .. #endif`.
		auto *UserInst = dyn_cast<Instruction>(U);
		ABataev (Done): Why do you need to compare flow and operation instructions count? Also, why use hardcoded `3` as a limit of vectorizable nodes?
		dtemirbulatov (Author, Done): From the cost perspective it is good to vectorize the subtree, but on the benchmarking side it introduces regressions. Maybe this is a known issue where partial vectorization prevents full vectorization later on. For example, if I remove this limiter at 3271, for the Transforms/SLPVectorizer/X86/PR39774.ll test case I would get:
		--- a/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll
		+++ b/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll
		@@ -7,55 +7,65 @@ define void @test(i32) {
		 ; CHECK-NEXT: entry:
		 ; CHECK-NEXT: br label [[LOOP:%.*]]
		 ; CHECK: loop:
		-; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP15:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
		-; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> undef, <8 x i32> <i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
		-; CHECK-NEXT: [[TMP2:%.*]] = extractelement <8 x i32> [[SHUFFLE]], i32 1
		-; CHECK-NEXT: [[TMP3:%.*]] = add <8 x i32> [[SHUFFLE]], <i32 0, i32 55, i32 285, i32 1240, i32 1496, i32 8555, i32 12529, i32 13685>
		-; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <8 x i32> [[TMP3]], <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
		-; CHECK-NEXT: [[BIN_RDX:%.*]] = and <8 x i32> [[TMP3]], [[RDX_SHUF]]
		-; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <8 x i32> [[BIN_RDX]], <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
		-; CHECK-NEXT: [[BIN_RDX2:%.*]] = and <8 x i32> [[BIN_RDX]], [[RDX_SHUF1]]
		-; CHECK-NEXT: [[RDX_SHUF3:%.*]] = shufflevector <8 x i32> [[BIN_RDX2]], <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
		-; CHECK-NEXT: [[BIN_RDX4:%.*]] = and <8 x i32> [[BIN_RDX2]], [[RDX_SHUF3]]
		-; CHECK-NEXT: [[TMP4:%.*]] = extractelement <8 x i32> [[BIN_RDX4]], i32 0
		-; CHECK-NEXT: [[OP_EXTRA:%.*]] = and i32 [[TMP4]], [[TMP0:%.*]]
		-; CHECK-NEXT: [[OP_EXTRA5:%.*]] = and i32 [[OP_EXTRA]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA6:%.*]] = and i32 [[OP_EXTRA5]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA7:%.*]] = and i32 [[OP_EXTRA6]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA8:%.*]] = and i32 [[OP_EXTRA7]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA9:%.*]] = and i32 [[OP_EXTRA8]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA10:%.*]] = and i32 [[OP_EXTRA9]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA11:%.*]] = and i32 [[OP_EXTRA10]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA12:%.*]] = and i32 [[OP_EXTRA11]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA13:%.*]] = and i32 [[OP_EXTRA12]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA14:%.*]] = and i32 [[OP_EXTRA13]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA15:%.*]] = and i32 [[OP_EXTRA14]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA16:%.*]] = and i32 [[OP_EXTRA15]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA17:%.*]] = and i32 [[OP_EXTRA16]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA18:%.*]] = and i32 [[OP_EXTRA17]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA19:%.*]] = and i32 [[OP_EXTRA18]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA20:%.*]] = and i32 [[OP_EXTRA19]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA21:%.*]] = and i32 [[OP_EXTRA20]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA22:%.*]] = and i32 [[OP_EXTRA21]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA23:%.*]] = and i32 [[OP_EXTRA22]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA24:%.*]] = and i32 [[OP_EXTRA23]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA25:%.*]] = and i32 [[OP_EXTRA24]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA26:%.*]] = and i32 [[OP_EXTRA25]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA27:%.*]] = and i32 [[OP_EXTRA26]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA28:%.*]] = and i32 [[OP_EXTRA27]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA29:%.*]] = and i32 [[OP_EXTRA28]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA30:%.*]] = and i32 [[OP_EXTRA29]], [[TMP0]]
		-; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> undef, i32 [[OP_EXTRA30]], i32 0
		-; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x i32> [[TMP5]], i32 14910, i32 1
		-; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x i32> undef, i32 [[TMP2]], i32 0
		-; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x i32> [[TMP7]], i32 [[TMP2]], i32 1
		-; CHECK-NEXT: [[TMP9:%.*]] = and <2 x i32> [[TMP6]], [[TMP8]]
		-; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i32> [[TMP6]], [[TMP8]]
		-; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> [[TMP10]], <2 x i32> <i32 0, i32 3>
		-; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x i32> [[TMP11]], i32 0
		-; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x i32> undef, i32 [[TMP12]], i32 0
		-; CHECK-NEXT: [[TMP14:%.*]] = extractelement <2 x i32> [[TMP11]], i32 1
		-; CHECK-NEXT: [[TMP15]] = insertelement <2 x i32> [[TMP13]], i32 [[TMP14]], i32 1
		+; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP19:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
		+; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0
		+; CHECK-NEXT: [[VAL_0:%.*]] = add i32 [[TMP2]], 0
		+; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1
		+; CHECK-NEXT: [[VAL_1:%.*]] = and i32 [[TMP3]], [[VAL_0]]
		+; CHECK-NEXT: [[VAL_2:%.*]] = and i32 [[VAL_1]], [[TMP0:%.*]]
		+; CHECK-NEXT: [[VAL_3:%.*]] = and i32 [[VAL_2]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_4:%.*]] = and i32 [[VAL_3]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_5:%.*]] = and i32 [[VAL_4]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_6:%.*]] = add i32 [[TMP3]], 55
		+; CHECK-NEXT: [[VAL_7:%.*]] = and i32 [[VAL_5]], [[VAL_6]]
		+; CHECK-NEXT: [[VAL_8:%.*]] = and i32 [[VAL_7]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_9:%.*]] = and i32 [[VAL_8]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_10:%.*]] = and i32 [[VAL_9]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_11:%.*]] = add i32 [[TMP3]], 285
		+; CHECK-NEXT: [[VAL_12:%.*]] = and i32 [[VAL_10]], [[VAL_11]]
		+; CHECK-NEXT: [[VAL_13:%.*]] = and i32 [[VAL_12]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_14:%.*]] = and i32 [[VAL_13]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_15:%.*]] = and i32 [[VAL_14]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_16:%.*]] = and i32 [[VAL_15]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_17:%.*]] = and i32 [[VAL_16]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_18:%.*]] = add i32 [[TMP3]], 1240
		+; CHECK-NEXT: [[VAL_19:%.*]] = and i32 [[VAL_17]], [[VAL_18]]
		+; CHECK-NEXT: [[VAL_20:%.*]] = add i32 [[TMP3]], 1496
		+; CHECK-NEXT: [[VAL_21:%.*]] = and i32 [[VAL_19]], [[VAL_20]]
		+; CHECK-NEXT: [[VAL_22:%.*]] = and i32 [[VAL_21]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_23:%.*]] = and i32 [[VAL_22]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_24:%.*]] = and i32 [[VAL_23]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_25:%.*]] = and i32 [[VAL_24]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_26:%.*]] = and i32 [[VAL_25]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_27:%.*]] = and i32 [[VAL_26]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_28:%.*]] = and i32 [[VAL_27]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_29:%.*]] = and i32 [[VAL_28]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_30:%.*]] = and i32 [[VAL_29]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_31:%.*]] = and i32 [[VAL_30]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_32:%.*]] = and i32 [[VAL_31]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_33:%.*]] = and i32 [[VAL_32]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_34:%.*]] = add i32 [[TMP3]], 8555
		+; CHECK-NEXT: [[VAL_35:%.*]] = and i32 [[VAL_33]], [[VAL_34]]
		+; CHECK-NEXT: [[VAL_36:%.*]] = and i32 [[VAL_35]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_37:%.*]] = and i32 [[VAL_36]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_38:%.*]] = and i32 [[VAL_37]], [[TMP0]]
		+; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x i32> undef, i32 [[TMP3]], i32 0
		+; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> [[TMP4]], i32 [[TMP3]], i32 1
		+; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i32> [[TMP5]], <i32 12529, i32 13685>
		+; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x i32> [[TMP6]], i32 0
		+; CHECK-NEXT: [[VAL_40:%.*]] = and i32 [[VAL_38]], [[TMP7]]
		+; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i32> [[TMP6]], i32 1
		+; CHECK-NEXT: [[TMP9:%.*]] = insertelement <2 x i32> undef, i32 [[VAL_40]], i32 0
		+; CHECK-NEXT: [[TMP10:%.*]] = insertelement <2 x i32> [[TMP9]], i32 14910, i32 1
		+; CHECK-NEXT: [[TMP11:%.*]] = insertelement <2 x i32> undef, i32 [[TMP8]], i32 0
		+; CHECK-NEXT: [[TMP12:%.*]] = insertelement <2 x i32> [[TMP11]], i32 [[TMP3]], i32 1
		+; CHECK-NEXT: [[TMP13:%.*]] = and <2 x i32> [[TMP10]], [[TMP12]]
		+; CHECK-NEXT: [[TMP14:%.*]] = add <2 x i32> [[TMP10]], [[TMP12]]
		+; CHECK-NEXT: [[TMP15:%.*]] = shufflevector <2 x i32> [[TMP13]], <2 x i32> [[TMP14]], <2 x i32> <i32 0, i32 3>
		+; CHECK-NEXT: [[TMP16:%.*]] = extractelement <2 x i32> [[TMP15]], i32 0
		+; CHECK-NEXT: [[TMP17:%.*]] = insertelement <2 x i32> undef, i32 [[TMP16]], i32 0
		+; CHECK-NEXT: [[TMP18:%.*]] = extractelement <2 x i32> [[TMP15]], i32 1
		+; CHECK-NEXT: [[TMP19]] = insertelement <2 x i32> [[TMP17]], i32 [[TMP18]], i32 1
		Here we can see that `[[TMP3:%.*]] = add <8 x i32> [[SHUFFLE]], <i32 0, i32 55, i32 285, i32 1240, i32 1496, i32 8555, i32 12529, i32 13685>` disappears. I have to recheck those regressions on cpu2006.
		if (!UserInst)
		continue;

		ABataev (Not Done): Either just `cast` without `if` or `dyn_cast`.
		if (is_contained(UserIgnoreList, UserInst))
		continue;
		LLVM_DEBUG(dbgs() << "SLP: Need extract to canceled operation :" << *U
		<< " from lane " << Lane << " from " << *Scalar
		<< ".\n");
		ABataev (Done): Looks like the `Scalar` should be extracted only if its user is vectorized and it remains scalar in the vectorized tree, or it is not going to be vectorized.
		dtemirbulatov (Author, Done): Thanks, good catch. I think I first need to populate the ExternalUser list while the TreeEntries are still marked ProposedToGather, and then change them to NeedToGather nodes in a separate loop.
		ExternalUses.emplace_back(Scalar, U, Lane);
		}
		}
		ABataev (Not Done): I actually don't see propagation for `ProposedToGather`, and these loops can be merged, no?
		dtemirbulatov (Author, Done): No, it is not possible to merge those two loops.
		ABataev (Not Done): Why?
		dtemirbulatov (Author, Done): For example, in the first loop we could change Entry1 from TreeEntry::ProposedToGather to TreeEntry::NeedToGather, but later encounter another use of Entry1 from another entry (say Entry2) with TreeEntry::Vectorize status. At that point we could not tell the difference between a just-canceled item and an entry that was never considered for vectorization, so ExternalUses would not be properly populated.
		ABataev (Not Done): The first loop does not change the state of the tree entries.
		dtemirbulatov (Author, Done): I mean if we merge them into one loop.
		}
		ABataev (Done): The scalars are not actually removed here.
		dtemirbulatov (Author, Done):
		> The scalars are not actually removed here.
		Yes, but I was thinking there would be too much noise if this print were put in findSubTree(), and we might not get a tree to vectorize in the end. I think it is probably better to print all of the TreeEntry scalars at once, saying we are not going to vectorize those operations.
		// Canceling unprofitable elements.
		for (TreeEntry *Entry : ProposedToGather) {
		for (Value *V : Entry->Scalars) {
		ScalarToTreeEntry.erase(V);
		#ifndef NDEBUG
		LLVM_DEBUG(dbgs() << "SLP: Remove scalar " << *V
		<< " out of proposed to vectorize.\n");
		#endif
		}
		}
		}
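The thread above on whether the two loops in cutTree() can be merged may be easier to follow with a toy model. The sketch below is not LLVM code: `Node`, `Graph`, `twoPass`, and `merged` are hypothetical stand-ins, with `Canceled` playing the role of membership in `ProposedToGather` and `ScalarToNode` playing the role of `ScalarToTreeEntry`. It assumes that the user lookup goes through the same map that the cancel step erases from, as the patch's `getTreeEntry(U)` and `ScalarToTreeEntry.erase(V)` suggest.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for TreeEntry and ScalarToTreeEntry.
struct Node {
  bool Canceled;            // plays the role of "is in ProposedToGather"
  std::vector<int> Scalars; // ids of the scalars bundled in this node
};

struct Graph {
  std::vector<Node> Nodes;
  std::map<int, int> ScalarToNode;       // scalar id -> index into Nodes
  std::map<int, std::vector<int>> Users; // scalar id -> ids of its users
};

using Use = std::pair<int, int>; // (scalar, user)

// Record every use whose user still maps to a canceled node.
static void collectUses(const Graph &G, const Node &N, std::vector<Use> &Out) {
  for (int S : N.Scalars) {
    auto UI = G.Users.find(S);
    if (UI == G.Users.end())
      continue;
    for (int U : UI->second) {
      auto It = G.ScalarToNode.find(U);
      if (It != G.ScalarToNode.end() && G.Nodes[It->second].Canceled)
        Out.emplace_back(S, U);
    }
  }
}

static void eraseScalars(Graph &G, const Node &N) {
  for (int S : N.Scalars)
    G.ScalarToNode.erase(S);
}

// Pass 1 collects uses while the map is intact; pass 2 cancels.
std::vector<Use> twoPass(Graph G) {
  std::vector<Use> Out;
  for (const Node &N : G.Nodes)
    collectUses(G, N, Out);
  for (const Node &N : G.Nodes)
    if (N.Canceled)
      eraseScalars(G, N);
  return Out;
}

// Merged variant: cancels while still collecting.
std::vector<Use> merged(Graph G) {
  std::vector<Use> Out;
  for (const Node &N : G.Nodes) {
    collectUses(G, N, Out);
    if (N.Canceled)
      eraseScalars(G, N);
  }
  return Out;
}

// Node 0 is canceled and holds scalar 10; node 1 holds scalar 20,
// which is used by scalar 10.
Graph example() {
  Graph G;
  G.Nodes = {Node{true, {10}}, Node{false, {20}}};
  G.ScalarToNode = {{10, 0}, {20, 1}};
  G.Users = {{20, {10}}};
  return G;
}
```

In this toy setup the two-pass version records the use of scalar 20 by scalar 10, while the merged version has already erased scalar 10 by the time it visits node 1 and so misses it. Whether the real patch hits the same ordering depends on details not shown in this diff.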

unsigned BoUpSLP::canMapToVector(Type *T, const DataLayout &DL) const {		unsigned BoUpSLP::canMapToVector(Type *T, const DataLayout &DL) const {
		ABataev (Not Done): Can this occur at all?
unsigned N = 1;		unsigned N = 1;
Type *EltTy = T;		Type *EltTy = T;

while (isa<StructType>(EltTy) || isa<ArrayType>(EltTy) ||		while (isa<StructType>(EltTy) || isa<ArrayType>(EltTy) ||
isa<VectorType>(EltTy)) {		isa<VectorType>(EltTy)) {
if (auto *ST = dyn_cast<StructType>(EltTy)) {		if (auto *ST = dyn_cast<StructType>(EltTy)) {
// Check that struct is homogeneous.		// Check that struct is homogeneous.
for (const auto *Ty : ST->elements())		for (const auto *Ty : ST->elements())
[301 lines hidden]	if (E->State == TreeEntry::NeedToGather) {
if (allConstant(VL))		if (allConstant(VL))
return 0;		return 0;
if (isa<InsertElementInst>(VL[0]))		if (isa<InsertElementInst>(VL[0]))
return InstructionCost::getInvalid();		return InstructionCost::getInvalid();
SmallVector<int> Mask;		SmallVector<int> Mask;
SmallVector<const TreeEntry *> Entries;		SmallVector<const TreeEntry *> Entries;
Optional<TargetTransformInfo::ShuffleKind> Shuffle =		Optional<TargetTransformInfo::ShuffleKind> Shuffle =
isGatherShuffledEntry(E, Mask, Entries);		isGatherShuffledEntry(E, Mask, Entries);
if (Shuffle.hasValue()) {		if (Shuffle.hasValue() && ProposedToGather.count(E) == 0) {
InstructionCost GatherCost = 0;		InstructionCost GatherCost = 0;
if (ShuffleVectorInst::isIdentityMask(Mask)) {		if (ShuffleVectorInst::isIdentityMask(Mask)) {
// Perfect match in the graph, will reuse the previously vectorized		// Perfect match in the graph, will reuse the previously vectorized
// node. Cost is 0.		// node. Cost is 0.
LLVM_DEBUG(		LLVM_DEBUG(
dbgs()		dbgs()
<< "SLP: perfect diamond match for gather bundle that starts with "		<< "SLP: perfect diamond match for gather bundle that starts with "
<< *VL.front() << ".\n");		<< *VL.front() << ".\n");
[497 lines hidden]	if (VectorizableTree[0]->State == TreeEntry::Vectorize &&
VectorizableTree[0]->Scalars.size()) ||		VectorizableTree[0]->Scalars.size()) ||
(VectorizableTree[1]->State == TreeEntry::NeedToGather &&		(VectorizableTree[1]->State == TreeEntry::NeedToGather &&
VectorizableTree[1]->getOpcode() == Instruction::ExtractElement &&		VectorizableTree[1]->getOpcode() == Instruction::ExtractElement &&
isShuffle(VectorizableTree[1]->Scalars, Mask))))		isShuffle(VectorizableTree[1]->Scalars, Mask))))
return true;		return true;

// Gathering cost would be too much for tiny trees.		// Gathering cost would be too much for tiny trees.
if (VectorizableTree[0]->State == TreeEntry::NeedToGather ||		if (VectorizableTree[0]->State == TreeEntry::NeedToGather ||
VectorizableTree[1]->State == TreeEntry::NeedToGather)		VectorizableTree[1]->State == TreeEntry::NeedToGather)
return false;		return false;
		ABataev (Done): Maybe better to use `!= TreeEntry::Vectorize` to avoid trees with proposed gathering?

return true;		return true;
}		}

static bool isLoadCombineCandidateImpl(Value *Root, unsigned NumElts,		static bool isLoadCombineCandidateImpl(Value *Root, unsigned NumElts,
TargetTransformInfo *TTI,		TargetTransformInfo *TTI,
bool MustMatchOrInst) {		bool MustMatchOrInst) {
// Look past the root to find a source value. Arbitrarily follow the		// Look past the root to find a source value. Arbitrarily follow the
[77 lines hidden]	assert(VectorizableTree.empty()
? ExternalUses.empty()		? ExternalUses.empty()
: true && "We shouldn't have any external users");		: true && "We shouldn't have any external users");

// Otherwise, we can't vectorize the tree. It is both tiny and not fully		// Otherwise, we can't vectorize the tree. It is both tiny and not fully
// vectorizable.		// vectorizable.
return true;		return true;
}		}

InstructionCost BoUpSLP::getSpillCost() const {		InstructionCost
		BoUpSLP::getExtractOperationCost(const ExternalUser &EU,
		SmallVectorImpl<SmallVector<int>> &ShuffleMask,
		SmallVectorImpl<Value *> &FirstUsers,
		SmallVectorImpl<APInt> &DemandedElts) {
		InstructionCost ExtractCost = 0;
		SmallVector<unsigned> VF;

		// If found user is an insertelement, do not calculate extract cost but try
		// to detect it as a final shuffled/identity match.
		if (EU.User && isa<InsertElementInst>(EU.User)) {
		if (auto *FTy = dyn_cast<FixedVectorType>(EU.User->getType())) {
		Optional<int> InsertIdx = getInsertIndex(EU.User, 0);
		if (!InsertIdx || *InsertIdx == UndefMaskElem)
		return 0;
		Value *VU = EU.User;
		auto It = find_if(FirstUsers, [VU](Value *V) {
		// Checks if 2 insertelements are from the same buildvector.
		if (VU->getType() != V->getType())
		return false;
		auto *IE1 = cast<InsertElementInst>(VU);
		auto *IE2 = cast<InsertElementInst>(V);
		// Go through the insertelement instructions trying to find either VU as
		// the original vector for IE2 or V as the original vector for IE1.
		do {
		if (IE1 == VU || IE2 == V)
		return true;
		if (IE1)
		IE1 = dyn_cast<InsertElementInst>(IE1->getOperand(0));
		if (IE2)
		IE2 = dyn_cast<InsertElementInst>(IE2->getOperand(0));
		} while (IE1 || IE2);
		return false;
		});
		int VecId = -1;
		if (It == FirstUsers.end()) {
		VF.push_back(FTy->getNumElements());
		ShuffleMask.emplace_back(VF.back(), UndefMaskElem);
		FirstUsers.push_back(EU.User);
		DemandedElts.push_back(APInt::getZero(VF.back()));
		VecId = FirstUsers.size() - 1;
		} else {
		VecId = std::distance(FirstUsers.begin(), It);
		}
		int Idx = *InsertIdx;
		ShuffleMask[VecId][Idx] = EU.Lane;
		DemandedElts[VecId].setBit(Idx);
		}
		}
		// If we plan to rewrite the tree in a smaller type, we will need to sign
		// extend the extracted value back to the original type. Here, we account
		// for the extract and the added cost of the sign extend if needed.
		auto *VecTy = FixedVectorType::get(EU.Scalar->getType(), BundleWidth);
		auto *ScalarRoot = VectorizableTree[0]->Scalars[0];
		if (MinBWs.count(ScalarRoot)) {
		auto *MinTy = IntegerType::get(F->getContext(), MinBWs[ScalarRoot].first);
		auto Extend =
		MinBWs[ScalarRoot].second ? Instruction::SExt : Instruction::ZExt;
		VecTy = FixedVectorType::get(MinTy, BundleWidth);
		ExtractCost += TTI->getExtractWithExtendCost(Extend, EU.Scalar->getType(),
		VecTy, EU.Lane);
		} else {
		ExtractCost +=
		TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, EU.Lane);
		}
		return ExtractCost;
		}

		InstructionCost BoUpSLP::getExtractShuffleCost(InstructionCost Cost) {
		InstructionCost ExtractCost = 0;
		InstructionCost ShuffleCost = 0;
		SmallPtrSet<Value *, 16> ExtractCostCalculated;
		SmallVector<SmallVector<int>> ShuffleMask;
		SmallVector<Value *> FirstUsers;
		SmallVector<APInt> DemandedElts;
		for (const ExternalUser &EU : ExternalUses) {
		// We only add extract cost once for the same scalar.
		if (!ExtractCostCalculated.insert(EU.Scalar).second)
		continue;

		// Uses by ephemeral values are free (because the ephemeral value will be
		// removed prior to code generation, and so the extraction will be
		// removed as well).
		if (EphValues.count(EU.User))
		continue;

		// No extract cost for vector "scalar"
		if (isa<FixedVectorType>(EU.Scalar->getType()))
		continue;

		// Already counted the cost for external uses when tried to adjust the cost
		// for extractelements, no need to add it again.
		if (isa<ExtractElementInst>(EU.Scalar))
		continue;

		ExtractCost +=
		getExtractOperationCost(EU, ShuffleMask, FirstUsers, DemandedElts);
		}

		// Consider the possibility of extracting vectorized
		// values for canceled elements use.
		for (TreeEntry *Entry : ProposedToGather) {
		for (Value *V : Entry->Scalars) {
		// Consider the possibility of extracting vectorized
		// values for canceled elements use.
		auto It = InternalTreeUses.find(V);
		if (It != InternalTreeUses.end()) {
		const UserList &UL = It->second;
		for (const ExternalUser &IU : UL)
		if (!ExtractCostCalculated.insert(IU.Scalar).second)
		ExtractCost += getExtractOperationCost(IU, ShuffleMask, FirstUsers,
		DemandedElts);
		}
		}
		}

		for (int I = 0, E = FirstUsers.size(); I < E; ++I) {
		// For the very first element - simple shuffle of the source vector.
		int Limit = ShuffleMask[I].size() * 2;
		if (I == 0 &&
		all_of(ShuffleMask[I], [Limit](int Idx) { return Idx < Limit; }) &&
		!ShuffleVectorInst::isIdentityMask(ShuffleMask[I])) {
		InstructionCost C = TTI->getShuffleCost(
		TTI::SK_PermuteSingleSrc,
		cast<FixedVectorType>(FirstUsers[I]->getType()), ShuffleMask[I]);
		LLVM_DEBUG(dbgs() << "SLP: Adding cost " << C
		<< " for final shuffle of insertelement external users "
		<< *VectorizableTree.front()->Scalars.front() << ".\n"
		<< "SLP: Current total cost = " << Cost << "\n");
		ShuffleCost += C;
		continue;
		}
		// Other elements - permutation of 2 vectors (the initial one and the next
		// Ith incoming vector).
		unsigned VF = ShuffleMask[I].size();
		for (unsigned Idx = 0; Idx < VF; ++Idx) {
		int &Mask = ShuffleMask[I][Idx];
		Mask = Mask == UndefMaskElem ? Idx : VF + Mask;
		}
		InstructionCost C = TTI->getShuffleCost(
		TTI::SK_PermuteTwoSrc, cast<FixedVectorType>(FirstUsers[I]->getType()),
		ShuffleMask[I]);
		LLVM_DEBUG(
		dbgs()
		<< "SLP: Adding cost " << C
		<< " for final shuffle of vector node and external insertelement users "
		<< *VectorizableTree.front()->Scalars.front() << ".\n"
		<< "SLP: Current total cost = " << Cost << "\n");
		ShuffleCost += C;
		InstructionCost InsertCost = TTI->getScalarizationOverhead(
		cast<FixedVectorType>(FirstUsers[I]->getType()), DemandedElts[I],
		/*Insert*/ true,
		/*Extract*/ false);
		ShuffleCost -= InsertCost;
		LLVM_DEBUG(dbgs() << "SLP: subtracting the cost " << InsertCost
		<< " for insertelements gather.\n"
		<< "SLP: Current total cost = " << Cost << "\n");
		}

		return ExtractCost + ShuffleCost;
		}

		InstructionCost BoUpSLP::getInsertCost(ArrayRef<Value *> VectorizedVals) {
		InstructionCost InsertCost = 0;
		for (TreeEntry *Entry : ProposedToGather) {
		// Avoid already vectorized TreeEntries, it is already in a vector form and
		// we don't need to gather those operations or nodes that were once
		// considered to be vectorized but now don't have any direct relations
		// to vectorizable nodes.
		for (Value *V : Entry->Scalars) {
		auto *Inst = cast<Instruction>(V);
		if (llvm::any_of(Inst->users(), [this](User *Op) {
		if (const TreeEntry *UserTE = getTreeEntry(Op)) {
		return (ProposedToGather.count(UserTE) != 0);
		}
		return false;
		})) {
		InsertCost += getEntryCost(Entry, VectorizedVals);
		break;
		}
		}
		}
		return InsertCost;
		}

		void BoUpSLP::findLeaf(TreeEntry *C, SetVector<TreeEntry *> &Path) const {
		if (!Path.count(C))
		Path.insert(C);
		int NonGatherUse;
		do {
		NonGatherUse = 0;
		for (TreeEntry *Next : llvm::reverse(C->UseEntries)) {
		// Ignore any processed nodes to avoid cycles.
		if (Next->State == TreeEntry::NeedToGather || Path.count(Next) ||
		Next == C)
		continue;
		C = Next;
		Path.insert(C);
		NonGatherUse++;
		break;
		}
		} while (NonGatherUse != 0);
		}
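As a rough illustration of the walk in findLeaf() above, here is a standalone sketch over a toy tree. `ToyEntry` and `findLeafPath` are hypothetical names, entries are identified by index instead of pointer, and `UseEntries` is assumed to list the child entries of a node (the diff does not show how that field is populated). The traversal order mirrors the patch: children are scanned in reverse, gather nodes and already-visited nodes are skipped, and the walk stops when no eligible child remains.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for TreeEntry; only the fields the walk needs.
struct ToyEntry {
  bool Gather = false;         // plays the role of State == NeedToGather
  std::vector<int> UseEntries; // indices of child entries (assumed meaning)
};

// Follow the last eligible child repeatedly until a leaf is reached and
// return the path from Start to that leaf.
std::vector<int> findLeafPath(const std::vector<ToyEntry> &Tree, int Start) {
  std::vector<int> Path{Start};
  int Cur = Start;
  bool Advanced = true;
  while (Advanced) {
    Advanced = false;
    const std::vector<int> &Uses = Tree[Cur].UseEntries;
    for (auto It = Uses.rbegin(); It != Uses.rend(); ++It) {
      int Next = *It;
      bool Seen = std::find(Path.begin(), Path.end(), Next) != Path.end();
      // Skip gather nodes and anything already on the path to avoid cycles.
      if (Tree[Next].Gather || Seen || Next == Cur)
        continue;
      Cur = Next;
      Path.push_back(Cur);
      Advanced = true;
      break;
    }
  }
  return Path;
}
```

For a four-entry tree where entry 0 has children 1 and 2, entry 2 is a gather node, and entry 1 has child 3, the walk from entry 0 skips entry 2 and yields the path 0, 1, 3.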

		InstructionCost BoUpSLP::getSpillCost() {
// Walk from the bottom of the tree to the top, tracking which values are		// Walk from the bottom of the tree to the top, tracking which values are
// live. When we see a call instruction that is not part of our tree,		// live. When we see a call instruction that is not part of our tree,
// query TTI to see if there is a cost to keeping values live over it		// query TTI to see if there is a cost to keeping values live over it
// (for example, if spills and fills are required).		// (for example, if spills and fills are required).
unsigned BundleWidth = VectorizableTree.front()->Scalars.size();
InstructionCost Cost = 0;		InstructionCost Cost = 0;

SmallPtrSet<Instruction*, 4> LiveValues;		SmallPtrSet<Instruction*, 4> LiveValues;
Instruction *PrevInst = nullptr;		Instruction *PrevInst = nullptr;

// The entries in VectorizableTree are not necessarily ordered by their		// The entries in VectorizableTree are not necessarily ordered by their
// position in basic blocks. Collect them and order them by dominance so later		// position in basic blocks. Collect them and order them by dominance so later
// instructions are guaranteed to be visited first. For instructions in		// instructions are guaranteed to be visited first. For instructions in
[56 lines hidden]	while (InstIt != PrevInstIt) {
!isa<DbgInfoIntrinsic>(&*PrevInstIt)) &&		!isa<DbgInfoIntrinsic>(&*PrevInstIt)) &&
&*PrevInstIt != PrevInst)		&*PrevInstIt != PrevInst)
NumCalls++;		NumCalls++;

++PrevInstIt;		++PrevInstIt;
}		}

if (NumCalls) {		if (NumCalls) {
		NoCallInst = false;
SmallVector<Type*, 4> V;		SmallVector<Type*, 4> V;
for (auto *II : LiveValues) {		for (auto *II : LiveValues) {
auto *ScalarTy = II->getType();		auto *ScalarTy = II->getType();
if (auto *VectorTy = dyn_cast<FixedVectorType>(ScalarTy))		if (auto *VectorTy = dyn_cast<FixedVectorType>(ScalarTy))
ScalarTy = VectorTy->getElementType();		ScalarTy = VectorTy->getElementType();
V.push_back(FixedVectorType::get(ScalarTy, BundleWidth));		V.push_back(FixedVectorType::get(ScalarTy, BundleWidth));
}		}
Cost += NumCalls * TTI->getCostOfKeepingLiveOverCall(V);		Cost += NumCalls * TTI->getCostOfKeepingLiveOverCall(V);
}		}

PrevInst = Inst;		PrevInst = Inst;
}		}

return Cost;		return Cost;
}		}

InstructionCost BoUpSLP::getTreeCost(ArrayRef<Value *> VectorizedVals) {		bool BoUpSLP::findSubTree(SubTreeQueue &SubTrees, InstructionCost TreeCost) {
		ABataev (Done): The code in this function is very similar to the code in `getTreeCost()`. Can it be reused somehow to avoid duplication?
InstructionCost Cost = 0;		SetVector<TreeEntry *> Path;
LLVM_DEBUG(dbgs() << "SLP: Calculating cost for tree of size "		std::vector<TreeEntry *> SubPath;
<< VectorizableTree.size() << ".\n");		SetVector<TreeEntry *> Visited;
		TreeEntry *Node = VectorizableTree.front().get();

		// Avoid reducing the tree if there is no potential room to reduce.
		if ((TreeCost - ReduceableCost) >= -SLPCostThreshold)
		return false;

unsigned BundleWidth = VectorizableTree[0]->Scalars.size();		// To start we can find just one leaf node that happens to be not the root
		ABataev (Not Done): `>=`
		// node of the graph i.e. with non-zero index. Then, Path is route from the
		// root node to our leaf node.
		ABataev (Done): `VectorizableTree.front()` instead of `VectorizableTree[0]`, just like in the first statement.
		ABataev (Not Done): Does it include the cost of the whole subtree or just this particular `Entry`?
		dtemirbulatov (Author, Done): Entry->Cost is the cost of this particular Entry plus some sub-tree elements, for example if this Entry has a gathering element. Note that only TreeEntry::Vectorize entries are considered in this sum.
		dtemirbulatov (Author, Done): And we recalculate the costs of all canceled (ProposedToGather) entries in getInsertCost() every time we call getTreeCost() at line 4214.
		findLeaf(Node, Path);
		if (Node == Path.back())
		ABataev (Not Done): Why do we need to exclude `UserCost` here?
		dtemirbulatov (Done): We might want to add (or subtract) an extra value, for example at line 6071 before this change.
		return false;
		do {
		Node = Path.back();
		Visited.insert(Node);
		assert(Node->State != TreeEntry::NeedToGather && "Incorrect node state");
		// If we found a branch node i.e. node with more than one non-gathering
		// child, we could try to find set of profitable nodes in SubPath to
		// vectorize and if there is no such set of profitable nodes then we could
		// consider another leaf that is reachable from this branch node.
		if (Node->isFork()) {
		TreeEntry *NextFromFork = nullptr;
		auto It = llvm::find_if(
		llvm::reverse(Node->UseEntries), [&Node, &Visited](TreeEntry *E) {
		return (E != Node && E->State != TreeEntry::NeedToGather &&
		!Visited.count(E));
		});
		if (It != Node->UseEntries.rend())
		NextFromFork = *It;
		else
		Path.pop_back();

for (unsigned I = 0, E = VectorizableTree.size(); I < E; ++I) {		InstructionCost SubTreeCost = 0;
TreeEntry &TE = *VectorizableTree[I].get();		for (TreeEntry *Entry : SubPath)
		SubTreeCost += Entry->Cost;
		if (SubTreeCost.isValid() && SubTreeCost > 0)
		SubTrees.push(std::make_pair(*SubTreeCost.getValue(), SubPath));
		SubPath.clear();
		if (NextFromFork && NextFromFork != Node) {
		findLeaf(NextFromFork, Path);
		Node = Path.back();
		}
		} else {
		// If this node is not a branch node then we could move to another node
		// below until we reach the root node of the graph or encounter another
		// branch node.
		SubPath.push_back(Node);
		Path.pop_back();
		}
		} while (Node->Idx);
		wwei (Done): I think the input for `getGatherCost` should be `Entry->Scalars` instead of `V`. The code in `getGatherCost` looks like: `int BoUpSLP::getGatherCost(ArrayRef<Value *> VL) const { // Find the type of the operands in VL. Type *ScalarTy = VL[0]->getType(); ... ... VectorType *VecTy = VectorType::get(ScalarTy, VL.size()); ... ... return getGatherCost(VecTy, ShuffledElements);` So, if the input is `V`, `VecTy` will always be equal to `<1 x iN>`, which I think is the same as the `iN` type, and `getGatherCost(VecTy, ShuffledElements)` will return an incorrect InsertCost value.
		dtemirbulatov (Done): Yes, correct. Thanks, I will fix that.

InstructionCost C = getEntryCost(&TE, VectorizedVals);		return (SubTrees.size() > 0);
Cost += C;
LLVM_DEBUG(dbgs() << "SLP: Adding cost " << C
<< " for bundle that starts with " << *TE.Scalars[0]
<< ".\n"
<< "SLP: Current total cost = " << Cost << "\n");
}		}
		dtemirbulatov (Done): Eh, I found a typo here: it should be Inst->Users. The same goes for line 4101.

SmallPtrSet<Value *, 16> ExtractCostCalculated;		InstructionCost BoUpSLP::getRawTreeCost(ArrayRef<Value *> VectorizedVals) {
InstructionCost ExtractCost = 0;		InstructionCost CostSum = 0;
		ABataev (Not Done): Tabs
		dtemirbulatov (Done): This is LLVM's standard format (clang-format).
SmallVector<unsigned> VF;		BundleWidth = VectorizableTree.front()->Scalars.size();
SmallVector<SmallVector<int>> ShuffleMask;		LLVM_DEBUG(dbgs() << "SLP: Calculating cost for tree of size "
SmallVector<Value *> FirstUsers;		<< VectorizableTree.size() << ".\n");
		ABataev (Not Done): Tab
		dtemirbulatov (Done): This is LLVM's standard format (clang-format).
SmallVector<APInt> DemandedElts;
for (ExternalUser &EU : ExternalUses) {
// We only add extract cost once for the same scalar.
if (!ExtractCostCalculated.insert(EU.Scalar).second)
continue;

// Uses by ephemeral values are free (because the ephemeral value will be		for (std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
		ABataev (Done): Not sure that this is the best criterion. I think you also need to include the distance from the head of the tree to the entry, because some big costs can be compensated by the vectorizable nodes in the tree. What I would do here is some kind of level-order search (BFS) starting from the deepest level.
		dtemirbulatov (Done): Hmm, implemented, but I don't see any benefit from that, plus we have to do a BFS search. And we are going to throw away any non-vectorizable nodes at 4295.
		ABataev (Not Done): It may trigger for targets like Silvermont, or in the future for vectorized functions.
		dtemirbulatov (Not Done): I measured the BFS approach against this implementation. With BFS, it is ~10% less efficient on SPEC2006 INT and ~20% less on the compilable part of SPEC2006 FP. By efficiency, I mean the total number of reduced trees during the whole compilation.
		ABataev (Not Done): Could you post it anyway to check if it may be improved?
		dtemirbulatov (Done): OK, I might have missed something. Thanks.
// removed prior to code generation, and so the extraction will be		TreeEntry &TE = *TEPtr.get();
// removed as well).
if (EphValues.count(EU.User))
continue;

// No extract cost for vector "scalar"		TE.Cost = getEntryCost(&TE, VectorizedVals);
if (isa<FixedVectorType>(EU.Scalar->getType()))		LLVM_DEBUG(dbgs() << "SLP: Adding cost " << TE.Cost
continue;		<< " for bundle that starts with " << *TE.Scalars[0]
		<< ".\n");
		CostSum += TE.Cost;
		ABataev (Done): I think you can also exclude entries with the number of operands <= 1.
		dtemirbulatov (Done): But why? The only thing that matters here is the cost.
		ABataev (Not Done): Because the main idea is to drop gathers, and dropping one gather in favor of another will certainly not be profitable. But it may improve compile time and the list of candidates. The only case you need to check for is the latest masked gather case; it may be profitable to convert it to gathers for some targets.
		dtemirbulatov (Done): I think I can check here whether scatter/gather is supported via TargetInfo and, if it is not, move all nodes with TreeEntry::ScatterVectorize to TreeEntry::Gather.
		anton-afanasyev (Not Done): I believe it's the wrong decision to check scatter/gather target support, for the reason mentioned here: https://reviews.llvm.org/D92701#2435573. Why couldn't we just rely on costs (the node cost and the total one)?
		dtemirbulatov (Done): I agree with @anton-afanasyev here. I am not sure what @ABataev wants here. If I exclude (operands <= 1), then we would lose half of all tests in SLP affected by throttling.
		ABataev (Not Done): I did not say anything about checking if scatter is supported here. I just said that we can improve the criterion here by checking that the entry node has at least 2 operands (because if it has just one operand, most probably we can skip it), and we only need to check nodes with 1 operand if they are gather/scatter nodes, because it may be better to represent them as a simple gather.
		LLVM_DEBUG(dbgs() << "SLP: Current total cost = " << CostSum << "\n");
		ABataev (Done): No need for `[&]` here, just `[]`
		ABataev (Done): `llvm::any_of(Inst->users(), [Tree](User *Op){ return Tree->ScalarToTreeEntry.count(Op) > 0; })`

		ABataev (Done): Just: `for (Value *V : Entry->Scalars) { auto *Inst = cast<Instruction>(V); if (llvm::any_of(Inst->users(), [this](User *Op){ return Tree->ScalarToTreeEntry.count(Op) > 0; })) return InsertCost + getEntryCost(Entry); }` Also, check code formatting.
		dtemirbulatov (Done): Hmm, I think this is not a correct suggestion. There might be several tree entries with TreeEntry::ProposedToGather status, and we have to calculate the insert cost for the whole tree here.
		ABataev (Done): Yeah, maybe. But you can do something similar, like `InsertCost += ... break;` instead of setting a flag and doing a check after the loop.
// Already counted the cost for external uses when tried to adjust the cost		if (TE.State == TreeEntry::NeedToGather)
// for extractelements, no need to add it again.
if (isa<ExtractElementInst>(EU.Scalar))
continue;		continue;

// If found user is an insertelement, do not calculate extract cost but try		if (TE.Idx && TE.Cost > 0)
// to detect it as a final shuffled/identity match.		ReduceableCost += TE.Cost;
if (EU.User && isa<InsertElementInst>(EU.User)) {
if (auto *FTy = dyn_cast<FixedVectorType>(EU.User->getType())) {
Optional<int> InsertIdx = getInsertIndex(EU.User, 0);
if (!InsertIdx || *InsertIdx == UndefMaskElem)
continue;
Value *VU = EU.User;
auto *It = find_if(FirstUsers, [VU](Value *V) {
// Checks if 2 insertelements are from the same buildvector.
if (VU->getType() != V->getType())
return false;
auto *IE1 = cast<InsertElementInst>(VU);
auto *IE2 = cast<InsertElementInst>(V);
// Go though of insertelement instructions trying to find either VU as
// the original vector for IE2 or V as the original vector for IE1.
do {
if (IE1 == VU || IE2 == V)
return true;
if (IE1)
IE1 = dyn_cast<InsertElementInst>(IE1->getOperand(0));
if (IE2)
IE2 = dyn_cast<InsertElementInst>(IE2->getOperand(0));
} while (IE1 || IE2);
return false;
});
int VecId = -1;
if (It == FirstUsers.end()) {
VF.push_back(FTy->getNumElements());
ShuffleMask.emplace_back(VF.back(), UndefMaskElem);
FirstUsers.push_back(EU.User);
DemandedElts.push_back(APInt::getZero(VF.back()));
VecId = FirstUsers.size() - 1;
} else {
VecId = std::distance(FirstUsers.begin(), It);
}
int Idx = *InsertIdx;
ShuffleMask[VecId][Idx] = EU.Lane;
DemandedElts[VecId].setBit(Idx);
}		}
		return CostSum;
}		}
		ABataev (Done): `Cmp`

// If we plan to rewrite the tree in a smaller type, we will need to sign		InstructionCost BoUpSLP::getTreeCost(bool TreeReduce,
// extend the extracted value back to the original type. Here, we account		ArrayRef<Value *> VectorizedVals) {
// for the extract and the added cost of the sign extend if needed.		InstructionCost CostSum;
auto *VecTy = FixedVectorType::get(EU.Scalar->getType(), BundleWidth);		if (!IsCostSumReady) {
auto *ScalarRoot = VectorizableTree[0]->Scalars[0];		CostSum = getRawTreeCost(VectorizedVals);
if (MinBWs.count(ScalarRoot)) {		RawTreeCost = CostSum;
		ABataev (Not Done): `SpillCost == 0`?
		dtemirbulatov (Done): Hmm, I am wondering whether we might get SpillCost == 0 for the whole tree and then somehow, after reduction, get a non-zero value.
auto *MinTy = IntegerType::get(F->getContext(), MinBWs[ScalarRoot].first);
auto Extend =
MinBWs[ScalarRoot].second ? Instruction::SExt : Instruction::ZExt;
VecTy = FixedVectorType::get(MinTy, BundleWidth);
ExtractCost += TTI->getExtractWithExtendCost(Extend, EU.Scalar->getType(),
VecTy, EU.Lane);
} else {		} else {
		ABataev (Done): Just `assert((!Tree->NoCallInst || getSpillCost() == 0) && "Incorrect spill cost");`
ExtractCost +=		CostSum = RawTreeCost;
		ABataev (Done): Why use a SmallVector and then sort it? Better to use a set with a custom compare functor.
		dtemirbulatov (Done): Thanks, good.
TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, EU.Lane);
}
}		}

		InstructionCost ExtractCost = getExtractShuffleCost(CostSum);
InstructionCost SpillCost = getSpillCost();		InstructionCost SpillCost = getSpillCost();
Cost += SpillCost + ExtractCost;		InstructionCost InsertCost = getInsertCost(VectorizedVals);
		ABataev (Done): `[V]`
		ABataev (Not Done): Just `Vec.erase(Vec.rbegin(), Vec.rbegin() + (Vec.size() - MaxCostsRecalculations))`?
		dtemirbulatov (Done): No, we cannot use `Vec.rbegin() + ` with std::set.
		ABataev (Not Done): Then just `Vec.erase(Vec.begin() + MaxCostsRecalculations, Vec.end());`.
		dtemirbulatov (Done): Eh, no, it is `std::_Rb_tree_const_iterator<llvm::slpvectorizer::BoUpSLP::TreeEntry*>`.
		ABataev (Done): Why is this a const iterator?
		dtemirbulatov (Done): std::set iterators are bidirectional, not random-access. We have to use std::advance.
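The iterator point in this thread can be illustrated with a small standalone sketch. It uses plain `std::set` with a hypothetical `Entry` struct and `CostGreater` comparator (neither is taken from the patch) and trims the set to its first `MaxKeep` elements. Because `std::set` iterators are bidirectional, the cut position is reached with `std::next` (equivalently `std::advance`) rather than `begin() + N`.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>

// Hypothetical stand-in for a tree entry; only the fields needed here.
struct Entry {
  int Idx;
  int Cost;
};

// Hypothetical comparator: highest cost first; the index breaks ties so
// that distinct entries with equal cost are both kept in the set.
struct CostGreater {
  bool operator()(const Entry *A, const Entry *B) const {
    if (A->Cost != B->Cost)
      return A->Cost > B->Cost;
    return A->Idx < B->Idx;
  }
};

using EntrySet = std::set<const Entry *, CostGreater>;

// Keep only the first MaxKeep elements of the ordered set.
inline void keepFirst(EntrySet &S, std::size_t MaxKeep) {
  if (S.size() <= MaxKeep)
    return;
  // `S.begin() + MaxKeep` does not compile for std::set; std::next walks
  // the bidirectional iterator forward one step at a time.
  auto Cut = std::next(S.begin(), static_cast<std::ptrdiff_t>(MaxKeep));
  S.erase(Cut, S.end());
}
```

The set hands out const access to its elements because the keys determine the ordering, which is why the iterator type in the error message is a const iterator.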
for (int I = 0, E = FirstUsers.size(); I < E; ++I) {		InstructionCost Cost = CostSum + ExtractCost + SpillCost + InsertCost;
// For the very first element - simple shuffle of the source vector.		InstructionCost FullCost = Cost;
int Limit = ShuffleMask[I].size() * 2;
if (I == 0 &&
all_of(ShuffleMask[I], [Limit](int Idx) { return Idx < Limit; }) &&
!ShuffleVectorInst::isIdentityMask(ShuffleMask[I])) {
InstructionCost C = TTI->getShuffleCost(
TTI::SK_PermuteSingleSrc,
cast<FixedVectorType>(FirstUsers[I]->getType()), ShuffleMask[I]);
LLVM_DEBUG(dbgs() << "SLP: Adding cost " << C
<< " for final shuffle of insertelement external users "
<< *VectorizableTree.front()->Scalars.front() << ".\n"
<< "SLP: Current total cost = " << Cost << "\n");
Cost += C;
continue;
}
// Other elements - permutation of 2 vectors (the initial one and the next
// Ith incoming vector).
unsigned VF = ShuffleMask[I].size();
for (unsigned Idx = 0; Idx < VF; ++Idx) {
int &Mask = ShuffleMask[I][Idx];
Mask = Mask == UndefMaskElem ? Idx : VF + Mask;
}
InstructionCost C = TTI->getShuffleCost(
TTI::SK_PermuteTwoSrc, cast<FixedVectorType>(FirstUsers[I]->getType()),
ShuffleMask[I]);
LLVM_DEBUG(
dbgs()
<< "SLP: Adding cost " << C
<< " for final shuffle of vector node and external insertelement users "
<< *VectorizableTree.front()->Scalars.front() << ".\n"
<< "SLP: Current total cost = " << Cost << "\n");
Cost += C;
InstructionCost InsertCost = TTI->getScalarizationOverhead(
cast<FixedVectorType>(FirstUsers[I]->getType()), DemandedElts[I],
/*Insert*/ true,
/*Extract*/ false);
Cost -= InsertCost;
LLVM_DEBUG(dbgs() << "SLP: subtracting the cost " << InsertCost
<< " for insertelements gather.\n"
<< "SLP: Current total cost = " << Cost << "\n");
}

#ifndef NDEBUG		#ifndef NDEBUG
SmallString<256> Str;		SmallString<256> Str;
{		{
raw_svector_ostream OS(Str);		raw_svector_ostream OS(Str);
OS << "SLP: Spill Cost = " << SpillCost << ".\n"		OS << "SLP: Spill Cost = " << SpillCost << ".\n"
<< "SLP: Extract Cost = " << ExtractCost << ".\n"		<< "SLP: Extract Cost = " << ExtractCost << ".\n"
<< "SLP: Total Cost = " << Cost << ".\n";		<< "SLP: Total Cost = " << Cost << ".\n";
		ABataev (Done): `>=`
}		}
LLVM_DEBUG(dbgs() << Str);		LLVM_DEBUG(dbgs() << Str);
if (ViewSLPTree)		if (ViewSLPTree)
ViewGraph(this, "SLP" + F->getName(), false, Str);		ViewGraph(this, "SLP" + F->getName(), false, Str);
#endif		#endif
		ABataev (Done): Must all this code be active only when debug mode is on?
		RKSimon (Not Done): Maybe pull this NDEBUG change out into its own patch?
		dtemirbulatov (Done): Yes, it could go as NFC.

		if (TreeReduce && Cost >= -SLPCostThreshold) {
		ABataev (Done): `is_contained()` is `O(n)`. Maybe use a set instead of it in the loop?
		std::vector<TreeEntry *> Vec;
		SubTreeQueue SubTrees;
		if (!findSubTree(SubTrees, Cost))
		return Cost;
		unsigned NodeCounter = 0;

		for (std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
		TreeEntry *Entry = TEPtr.get();
		Entry->dump();
		}

		while (!SubTrees.empty()) {
		std::pair<int, std::vector<TreeEntry *>> SubTree = SubTrees.top();
		SubTrees.pop();
		NodeCounter++;

		bool NoConnectToVecOp = true;
		for (TreeEntry *T : SubTree.second) {
		ProposedToGather.insert(T);
		T->State = TreeEntry::NeedToGather;
		// Zero cost on any gather operations with no direct
		// connection to vector nodes. At this point those nodes stop
		// being gather nodes
		for (TreeEntry *U : T->UseEntries) {
		for (const EdgeInfo &EI : U->UserTreeIndices) {
		if (EI.UserTE->State != TreeEntry::NeedToGather) {
		NoConnectToVecOp = false;
		break;
		}
		}
		if (!NoConnectToVecOp)
		break;
		}
		if (NoConnectToVecOp)
		for (TreeEntry *U : T->UseEntries) {
		CostSum -= U->Cost;
		U->Cost = 0;
		}
		CostSum -= T->Cost;
		T->Cost = getEntryCost(T, VectorizedVals);
		CostSum += T->Cost;

		for (Value *V : T->Scalars) {
		MustGather.insert(V);
		ExternalUses.erase(
		llvm::remove_if(ExternalUses,
		[V](ExternalUser &EU) { return EU.Scalar == V; }),
		ExternalUses.end());
		}
		ExtractCost = getExtractShuffleCost(CostSum);
		if (!NoCallInst)
		SpillCost = getSpillCost();
		assert((!NoCallInst || getSpillCost() == 0) && "Incorrect spill cost");
		InstructionCost InsertCost = getInsertCost(VectorizedVals);
		Cost = CostSum + ExtractCost + SpillCost + InsertCost;
		for (std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
		TreeEntry *Entry = TEPtr.get();
		Entry->dump();
		}
		if (Cost < -SLPCostThreshold && !isTreeTinyAndNotFullyVectorizable() &&
		(VectorizableTree[0]->State != TreeEntry::NeedToGather &&
		VectorizableTree[1]->State != TreeEntry::NeedToGather)) {
		cutTree();
return Cost;		return Cost;
}		}
		}
		}
		ProposedToGather.clear();
		}
		return FullCost;
		}

Optional<TargetTransformInfo::ShuffleKind>		Optional<TargetTransformInfo::ShuffleKind>
BoUpSLP::isGatherShuffledEntry(const TreeEntry *TE, SmallVectorImpl<int> &Mask,		BoUpSLP::isGatherShuffledEntry(const TreeEntry *TE, SmallVectorImpl<int> &Mask,
SmallVectorImpl<const TreeEntry *> &Entries) {		SmallVectorImpl<const TreeEntry *> &Entries) {
// TODO: currently checking only for Scalars in the tree entry, need to count		// TODO: currently checking only for Scalars in the tree entry, need to count
// reused elements too for better cost estimation.		// reused elements too for better cost estimation.
Mask.assign(TE->Scalars.size(), UndefMaskElem);		Mask.assign(TE->Scalars.size(), UndefMaskElem);
Entries.clear();		Entries.clear();
... (792 lines hidden) ...	case Instruction::Load: {
LoadInst *LI = cast<LoadInst>(VL0);		LoadInst *LI = cast<LoadInst>(VL0);
Instruction *NewLI;		Instruction *NewLI;
unsigned AS = LI->getPointerAddressSpace();		unsigned AS = LI->getPointerAddressSpace();
Value *PO = LI->getPointerOperand();		Value *PO = LI->getPointerOperand();
if (E->State == TreeEntry::Vectorize) {		if (E->State == TreeEntry::Vectorize) {

Value *VecPtr = Builder.CreateBitCast(PO, VecTy->getPointerTo(AS));		Value *VecPtr = Builder.CreateBitCast(PO, VecTy->getPointerTo(AS));

// The pointer operand uses an in-tree scalar so we add the new BitCast		// The pointer operand uses an in-tree scalar so we add the new BitCast
// to ExternalUses list to make sure that an extract will be generated		// to ExternalUses list to make sure that an extract will be generated
// in the future.		// in the future.
if (getTreeEntry(PO))		if (getTreeEntry(PO))
ExternalUses.emplace_back(PO, cast<User>(VecPtr), 0);		ExternalUses.emplace_back(PO, cast<User>(VecPtr), 0);
		ABataev (Done): Use `emplace_back(PO, VecPtr, 0)`

NewLI = Builder.CreateAlignedLoad(VecTy, VecPtr, LI->getAlign());		NewLI = Builder.CreateAlignedLoad(VecTy, VecPtr, LI->getAlign());
} else {		} else {
assert(E->State == TreeEntry::ScatterVectorize && "Unhandled state");		assert(E->State == TreeEntry::ScatterVectorize && "Unhandled state");
Value *VecPtr = vectorizeTree(E->getOperand(0));		Value *VecPtr = vectorizeTree(E->getOperand(0));
// Use the minimum alignment of the gathered loads.		// Use the minimum alignment of the gathered loads.
Align CommonAlignment = LI->getAlign();		Align CommonAlignment = LI->getAlign();
for (Value *V : E->Scalars)		for (Value *V : E->Scalars)
... (27 lines hidden) ...	case Instruction::Store: {
ScalarPtr, VecValue->getType()->getPointerTo(AS));		ScalarPtr, VecValue->getType()->getPointerTo(AS));
StoreInst *ST = Builder.CreateAlignedStore(VecValue, VecPtr,		StoreInst *ST = Builder.CreateAlignedStore(VecValue, VecPtr,
SI->getAlign());		SI->getAlign());

// The pointer operand uses an in-tree scalar, so add the new BitCast to		// The pointer operand uses an in-tree scalar, so add the new BitCast to
// ExternalUses to make sure that an extract will be generated in the		// ExternalUses to make sure that an extract will be generated in the
// future.		// future.
if (getTreeEntry(ScalarPtr))		if (getTreeEntry(ScalarPtr))
ExternalUses.push_back(ExternalUser(ScalarPtr, cast<User>(VecPtr), 0));		ExternalUses.push_back(ExternalUser(ScalarPtr, cast<User>(VecPtr), 0));

		ABataev (Not Done): Use `emplace_back(ScalarPtr, VecPtr, 0);`
Value *V = propagateMetadata(ST, E->Scalars);		Value *V = propagateMetadata(ST, E->Scalars);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
case Instruction::GetElementPtr: {		case Instruction::GetElementPtr: {
setInsertPointAfterBundle(E);		setInsertPointAfterBundle(E);
... (85 lines hidden) ...	case Instruction::Call: {
SmallVector<OperandBundleDef, 1> OpBundles;		SmallVector<OperandBundleDef, 1> OpBundles;
CI->getOperandBundlesAsDefs(OpBundles);		CI->getOperandBundlesAsDefs(OpBundles);
Value *V = Builder.CreateCall(CF, OpVecs, OpBundles);		Value *V = Builder.CreateCall(CF, OpVecs, OpBundles);

// The scalar argument uses an in-tree scalar so we add the new vectorized		// The scalar argument uses an in-tree scalar so we add the new vectorized
// call to ExternalUses list to make sure that an extract will be		// call to ExternalUses list to make sure that an extract will be
// generated in the future.		// generated in the future.
if (ScalarArg && getTreeEntry(ScalarArg))		if (ScalarArg && getTreeEntry(ScalarArg))
ExternalUses.push_back(ExternalUser(ScalarArg, cast<User>(V), 0));		ExternalUses.push_back(ExternalUser(ScalarArg, cast<User>(V), 0));
		ABataev (Done): Use `emplace_back(ScalarArg, V, 0);`

propagateIRFlags(V, E->Scalars, VL0);		propagateIRFlags(V, E->Scalars, VL0);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
... (76 lines hidden) ...	Value *BoUpSLP::vectorizeTree() {
ExtraValueToDebugLocsMap ExternallyUsedValues;		ExtraValueToDebugLocsMap ExternallyUsedValues;
return vectorizeTree(ExternallyUsedValues);		return vectorizeTree(ExternallyUsedValues);
}		}

Value *		Value *
BoUpSLP::vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues) {		BoUpSLP::vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues) {
// All blocks must be scheduled before any instructions are inserted.		// All blocks must be scheduled before any instructions are inserted.
for (auto &BSIter : BlocksSchedules) {		for (auto &BSIter : BlocksSchedules) {
scheduleBlock(BSIter.second.get());		BlockScheduling *BS = BSIter.second.get();
		// Remove all Schedule Data from all nodes that we have changed
		// vectorization decision.
		if (!ProposedToGather.empty())
		removeFromScheduling(BS);
		scheduleBlock(BS);
}		}

Builder.SetInsertPoint(&F->getEntryBlock().front());		Builder.SetInsertPoint(&F->getEntryBlock().front());
auto *VectorRoot = vectorizeTree(VectorizableTree[0].get());		auto *VectorRoot = vectorizeTree(VectorizableTree[0].get());

		for (std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
		TreeEntry *Entry = TEPtr.get();
		if ((Entry->State != TreeEntry::NeedToGather) &&
		Lint: Pre-merge checks: clang-format: please reformat the code: `if ((Entry->State != TreeEntry::NeedToGather) &&` followed by `!Entry->VectorizedValue)` on the next line should become `if ((Entry->State != TreeEntry::NeedToGather) && !Entry->VectorizedValue)`.
		!Entry->VectorizedValue)
		vectorizeTree(Entry);
		}

// If the vectorized tree can be rewritten in a smaller type, we truncate the		// If the vectorized tree can be rewritten in a smaller type, we truncate the
// vectorized root. InstCombine will then rewrite the entire expression. We		// vectorized root. InstCombine will then rewrite the entire expression. We
// sign extend the extracted values below.		// sign extend the extracted values below.
auto *ScalarRoot = VectorizableTree[0]->Scalars[0];		auto *ScalarRoot = VectorizableTree[0]->Scalars[0];
if (MinBWs.count(ScalarRoot)) {		if (MinBWs.count(ScalarRoot)) {
if (auto *I = dyn_cast<Instruction>(VectorRoot)) {		if (auto *I = dyn_cast<Instruction>(VectorRoot)) {
// If current instr is a phi and not the last phi, insert it after the		// If current instr is a phi and not the last phi, insert it after the
// last phi node.		// last phi node.
... (125 lines hidden) ...	for (auto &TEPtr : VectorizableTree) {
assert(Entry->VectorizedValue && "Can't find vectorizable value");		assert(Entry->VectorizedValue && "Can't find vectorizable value");

// For each lane:		// For each lane:
for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {		for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
Value *Scalar = Entry->Scalars[Lane];		Value *Scalar = Entry->Scalars[Lane];

#ifndef NDEBUG		#ifndef NDEBUG
Type *Ty = Scalar->getType();		Type *Ty = Scalar->getType();
if (!Ty->isVoidTy()) {		// The tree might not be fully vectorized, so we don't have to
		// check every user.
		if (!Ty->isVoidTy() && ProposedToGather.empty()) {
for (User *U : Scalar->users()) {		for (User *U : Scalar->users()) {
LLVM_DEBUG(dbgs() << "SLP: \tvalidating user:" << *U << ".\n");		LLVM_DEBUG(dbgs() << "SLP: \tvalidating user:" << *U << ".\n");

// It is legal to delete users in the ignorelist.		// It is legal to delete users in the ignorelist.
assert((getTreeEntry(U) || is_contained(UserIgnoreList, U) ||		assert((getTreeEntry(U) || is_contained(UserIgnoreList, U) ||
(isa_and_nonnull<Instruction>(U) &&		(isa_and_nonnull<Instruction>(U) &&
isDeleted(cast<Instruction>(U)))) &&		isDeleted(cast<Instruction>(U)))) &&
"Deleting out-of-tree value");		"Deleting out-of-tree value");
}		}
}		}
#endif		#endif
LLVM_DEBUG(dbgs() << "SLP: \tErasing scalar:" << *Scalar << ".\n");		LLVM_DEBUG(dbgs() << "SLP: \tErasing scalar:" << *Scalar << ".\n");
eraseInstruction(cast<Instruction>(Scalar));		eraseInstruction(cast<Instruction>(Scalar));
}		}
}		}

Builder.ClearInsertionPoint();		Builder.ClearInsertionPoint();
InstrElementSize.clear();		InstrElementSize.clear();

return VectorizableTree[0]->VectorizedValue;		return VectorizableTree[0]->VectorizedValue;
}		}
		dtemirbulatov (Done): Hmm, I would like to use remove_if here, but we have to capture a unique_ptr.
		ABataev (Done): Use `llvm::erase_if`
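As a rough illustration of the suggestion above, the sketch below removes elements from a container of `std::unique_ptr` with a predicate that takes each pointer by const reference, so no `unique_ptr` has to be copied or captured. It uses only the standard library; `Node`, its `Gathered` flag, and `dropGathered` are hypothetical names, not code from the patch. `llvm::erase_if(Container, Pred)` from STLExtras.h wraps the same remove-then-erase pair into one call.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical node type; Gathered marks the elements to drop.
struct Node {
  int Id;
  bool Gathered;
};

// Erase every node flagged as Gathered, preserving the order of the rest.
inline void dropGathered(std::vector<std::unique_ptr<Node>> &Nodes) {
  // The predicate sees each unique_ptr by const reference, so ownership
  // never leaves the container while the test runs.
  auto NewEnd = std::remove_if(
      Nodes.begin(), Nodes.end(),
      [](const std::unique_ptr<Node> &N) { return N->Gathered; });
  // remove_if only shifts the kept elements forward; erase shrinks the
  // container and destroys the leftover tail.
  Nodes.erase(NewEnd, Nodes.end());
}
```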

		ABataev (Not Done): `[Tree]`
		dtemirbulatov (Done): Tree is a class property, not a local variable.
		ABataev (Done): Then `[this]`
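The capture question in this thread comes down to a C++ rule: a lambda capture list can name local variables, while a data member is reached through `this`. A minimal sketch, with a hypothetical `Checker` class whose `Known` member stands in for the patch's `Tree`:

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <utility>
#include <vector>

class Checker {
  std::set<int> Known; // hypothetical member, playing the role of `Tree`

public:
  explicit Checker(std::set<int> K) : Known(std::move(K)) {}

  // Returns true if any element of Users is present in Known.
  bool anyKnown(const std::vector<int> &Users) const {
    // `[Known]` would be ill-formed: Known is a member, not a local
    // variable, so the lambda captures `this` to reach it.
    return std::any_of(Users.begin(), Users.end(),
                       [this](int U) { return Known.count(U) > 0; });
  }
};
```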
void BoUpSLP::optimizeGatherSequence() {		void BoUpSLP::optimizeGatherSequence() {
LLVM_DEBUG(dbgs() << "SLP: Optimizing " << GatherSeq.size()		LLVM_DEBUG(dbgs() << "SLP: Optimizing " << GatherSeq.size()
<< " gather sequences instructions.\n");		<< " gather sequences instructions.\n");
		ABataev (Not Done): Why do you need to call `BuiltTrees.erase(` after `llvm::remove_if`?
		dtemirbulatov (Done): It is a `SmallVector<std::unique_ptr<TreeState>, 8>`, and we have to call erase(...).
// LICM InsertElementInst sequences.		// LICM InsertElementInst sequences.
for (Instruction *I : GatherSeq) {		for (Instruction *I : GatherSeq) {
if (isDeleted(I))		if (isDeleted(I))
continue;		continue;

// Check if this block is inside a loop.		// Check if this block is inside a loop.
Loop *L = LI->getLoopFor(I->getParent());		Loop *L = LI->getLoopFor(I->getParent());
if (!L)		if (!L)
(187 lines not shown)	void BoUpSLP::BlockScheduling::cancelScheduling(ArrayRef<Value *> VL,

// Un-bundle: make single instructions out of the bundle.		// Un-bundle: make single instructions out of the bundle.
ScheduleData *BundleMember = Bundle;		ScheduleData *BundleMember = Bundle;
while (BundleMember) {		while (BundleMember) {
assert(BundleMember->FirstInBundle == Bundle && "corrupt bundle links");		assert(BundleMember->FirstInBundle == Bundle && "corrupt bundle links");
BundleMember->FirstInBundle = BundleMember;		BundleMember->FirstInBundle = BundleMember;
ScheduleData *Next = BundleMember->NextInBundle;		ScheduleData *Next = BundleMember->NextInBundle;
BundleMember->NextInBundle = nullptr;		BundleMember->NextInBundle = nullptr;
		BundleMember->TE = nullptr;
BundleMember->UnscheduledDepsInBundle = BundleMember->UnscheduledDeps;		BundleMember->UnscheduledDepsInBundle = BundleMember->UnscheduledDeps;
if (BundleMember->UnscheduledDepsInBundle == 0) {		if (BundleMember->UnscheduledDepsInBundle == 0) {
ReadyInsts.insert(BundleMember);		ReadyInsts.insert(BundleMember);
}		}
BundleMember = Next;		BundleMember = Next;
}		}
}		}

(253 lines not shown)	doForAllOpcodes(I, [&](ScheduleData *SD) {
"ScheduleData not in scheduling region");		"ScheduleData not in scheduling region");
SD->IsScheduled = false;		SD->IsScheduled = false;
SD->resetUnscheduledDeps();		SD->resetUnscheduledDeps();
});		});
}		}
ReadyInsts.clear();		ReadyInsts.clear();
}		}

		void BoUpSLP::BlockScheduling::reduceSchedulingRegion(Instruction *Start,
		Instruction *End) {
		if (Start)
		ScheduleStart = Start;
		if (End)
		ScheduleEnd = End;
		}

		void BoUpSLP::removeFromScheduling(BlockScheduling *BS) {
		bool Removed = false;
		SmallPtrSet<Instruction *, 12> Gathers;
		SmallPtrSet<Instruction *, 12> Reduced;
		ABataev: Looks like you need to implement something like `reduceSchedulingRegion()`, similar to the `extendSchedulingRegion` function. Otherwise you're going to operate on the larger scheduling region, i.e. you need to modify the `ScheduleStart` and `ScheduleEnd` data members.
		Instruction *Start = nullptr;

		// We can reduce the number of instructions to be considered for scheduling
		// after cutting the tree. Here we shrink the scheduling area from the top,
		// consecutively, until we encounter the required instruction. There might be
		// unnecessary NeedToGather nodes with a relationship only to other
		// NeedToGather nodes and unmapped instructions in chains; we could safely
		// delete those.
		for (std::unique_ptr<TreeEntry> &TEPtr : reverse(VectorizableTree)) {
		TreeEntry *TE = TEPtr.get();
		ABataev: `[]`
		if (TE->State != TreeEntry::NeedToGather || !TE->getOpcode() ||
		TE->getMainOp()->getParent() != BS->BB)
		continue;
		for (const EdgeInfo &EI : TE->UserTreeIndices) {
		if (EI.UserTE && (EI.UserTE->State != TreeEntry::NeedToGather)) {
		auto InstructionsOnly =
		make_filter_range(TE->Scalars, Instruction::classof);
		for (Value *V : InstructionsOnly)
		Gathers.insert(cast<Instruction>(V));
		break;
		}
		}
		}

		dtemirbulatov (author): Perhaps we could also check here for !BS->getScheduleData(I)->isPartOfBundle() and further shrink the region.
		dtemirbulatov (author): ah, no, this instruction could belong to a real gather node.
		for (Instruction *I = BS->ScheduleStart; I != BS->ScheduleEnd;
		I = I->getNextNode()) {
		if (!getTreeEntry(I) && !Gathers.count(I)) {
		Reduced.insert(I);
		} else {
		Start = I;
		break;
		}
		}

		BS->reduceSchedulingRegion(Start, nullptr);

		for (TreeEntry *Entry : ProposedToGather) {
		ScheduleData *SD = BS->getScheduleData(Entry->Scalars[0]);
		if (SD && SD->isPartOfBundle()) {
		if (!Removed) {
		Removed = true;
		BS->resetSchedule();
		}
		SD->IsScheduled = false;
		BS->cancelScheduling(Entry->Scalars, SD->OpValue);
		}
		}
		if (!Removed)
		return;

		if (Reduced.size()) {
		for (Instruction *I : Reduced) {
		ScheduleData *SD = BS->getScheduleData(I);
		if (SD)
		SD->SchedulingRegionID = -1;
		}
		}
		BS->resetSchedule();
		BS->initialFillReadyList(BS->ReadyInsts);
		for (Instruction *I = BS->ScheduleStart; I != BS->ScheduleEnd;
		I = I->getNextNode()) {
		if (BS->ScheduleDataMap.find(I) == BS->ScheduleDataMap.end())
		continue;
		BS->doForAllOpcodes(I, [](ScheduleData *SD) { SD->clearDependencies(); });
		}
		}
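The top-down shrink performed by the scan loop in `removeFromScheduling` above can be illustrated in isolation. This is only a sketch under simplifying assumptions: instructions are modelled as plain `int` ids, the `InTree` and `Gathers` sets stand in for `getTreeEntry(I)` and the `Gathers` set of the patch, and `findNewRegionStart` is a hypothetical name. Unlike the patch, which leaves `ScheduleStart` untouched when no required instruction is found, the sketch returns `Region.size()` in that case.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Walks the region from the top and returns the index of the first
// instruction that must stay (it is in the tree or feeds a kept gather).
// Every instruction skipped before that point is recorded in Reduced.
inline std::size_t findNewRegionStart(const std::vector<int> &Region,
                                      const std::set<int> &InTree,
                                      const std::set<int> &Gathers,
                                      std::vector<int> &Reduced) {
  for (std::size_t I = 0; I < Region.size(); ++I) {
    int Inst = Region[I];
    if (InTree.count(Inst) || Gathers.count(Inst))
      return I; // First required instruction: the region now starts here.
    Reduced.push_back(Inst);
  }
  return Region.size(); // Nothing in the region is required.
}
```

For a region `{10, 11, 12, 13, 14}` with `Gathers = {12}` and `InTree = {13}`, the new start is index 2 and instructions 10 and 11 land in `Reduced`.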

void BoUpSLP::scheduleBlock(BlockScheduling *BS) {		void BoUpSLP::scheduleBlock(BlockScheduling *BS) {
if (!BS->ScheduleStart)		if (!BS->ScheduleStart)
return;		return;

LLVM_DEBUG(dbgs() << "SLP: schedule block " << BS->BB->getName() << "\n");		LLVM_DEBUG(dbgs() << "SLP: schedule block " << BS->BB->getName() << "\n");

BS->resetSchedule();		BS->resetSchedule();

(558 lines not shown)	bool SLPVectorizerPass::vectorizeStoreChain(ArrayRef<Value *> Chain, BoUpSLP &R,
}		}
if (R.isTreeTinyAndNotFullyVectorizable())		if (R.isTreeTinyAndNotFullyVectorizable())
return false;		return false;
if (R.isLoadCombineCandidate())		if (R.isLoadCombineCandidate())
return false;		return false;

R.computeMinimumValueSizes();		R.computeMinimumValueSizes();

InstructionCost Cost = R.getTreeCost();		InstructionCost Cost = R.getTreeCost(true);

LLVM_DEBUG(dbgs() << "SLP: Found cost = " << Cost << " for VF =" << VF << "\n");		LLVM_DEBUG(dbgs() << "SLP: Found cost = " << Cost << " for VF =" << VF << "\n");
if (Cost < -SLPCostThreshold) {		if (Cost < -SLPCostThreshold) {
LLVM_DEBUG(dbgs() << "SLP: Decided to vectorize cost = " << Cost << "\n");		LLVM_DEBUG(dbgs() << "SLP: Decided to vectorize cost = " << Cost << "\n");

using namespace ore;		using namespace ore;

R.getORE()->emit(OptimizationRemark(SV_NAME, "StoresVectorized",		R.getORE()->emit(OptimizationRemark(SV_NAME, "StoresVectorized",
cast<StoreInst>(Chain[0]))		cast<StoreInst>(Chain[0]))
<< "Stores SLP vectorized with cost " << NV("Cost", Cost)		<< "Stores SLP vectorized with cost " << NV("Cost", Cost)
<< " and with tree size "		<< " and with tree size "
<< NV("TreeSize", R.getTreeSize()));		<< NV("TreeSize", R.getTreeSize()));

R.vectorizeTree();		R.vectorizeTree();
return true;		return true;
}		}

		ABataev: Why not try to vectorize a partial tree right here?
		ABataev: Actually, `else` is not required here at all. Just make it a standalone `if` statement, since there is an early exit in the previous `if`.
		dtemirbulatov (author): hmm, we might have an opportunity to vectorize the whole tree with smaller Chain sizes or at vectorizeStores or while doing reductions.
		ABataev: Did you check that?
		dtemirbulatov (author): yes, I can add a testcase for that.
		dtemirbulatov (author): For example, if we allow partial vectorization straight away, we could see partial vectorization in test/Transforms/SLPVectorizer/X86/PR39774.ll without `[[TMP3:%.*]] = add <8 x i32> [[SHUFFLE]], <i32 0, i32 55, i32 285, i32 1240, i32 1496, i32 8555, i32 12529, i32 13685>`. That is because we would have the opportunity to vectorize the whole tree later.
		dtemirbulatov (author): hmm, no, I think it is required here; we don't want to reduce an already decided full-tree vectorization.
		ABataev: No, you don't need it. Read https://llvm.org/docs/CodingStandards.html#don-t-use-else-after-a-return
return false;		return false;
}		}

		ABataev: `else if`
		dtemirbulatov (author): thanks
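The "no `else` after `return`" point from the thread above can be shown with a small standalone sketch. Everything here is hypothetical: `decide`, `Outcome`, and the integer costs only mimic the shape of the `Cost < -SLPCostThreshold` check and are not the patch's API. Because the full-tree branch returns early, the partial-tree check below it can only run when full vectorization was already rejected, so a standalone `if` cannot override a full-tree decision.

```cpp
#include <cassert>

enum class Outcome { Full, Partial, None };

// Cost:        cost of vectorizing the whole tree (negative means profitable).
// PartialCost: cost after a hypothetical throttling cut.
// Threshold:   plays the role of SLPCostThreshold.
inline Outcome decide(int Cost, int PartialCost, int Threshold) {
  if (Cost < -Threshold) {
    return Outcome::Full; // Early exit: nothing below runs in this case.
  }
  // No `else` needed. Control reaches this point only when the full tree
  // was not profitable, so the partial attempt never shadows it.
  if (PartialCost < -Threshold) {
    return Outcome::Partial;
  }
  return Outcome::None;
}
```

With a threshold of 0, `decide(-5, -9, 0)` still yields `Outcome::Full`, which is the behaviour the author wanted to preserve.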
bool SLPVectorizerPass::vectorizeStores(ArrayRef<StoreInst *> Stores,		bool SLPVectorizerPass::vectorizeStores(ArrayRef<StoreInst *> Stores,
BoUpSLP &R) {		BoUpSLP &R) {
// We may run into multiple chains that merge into a single chain. We mark the		// We may run into multiple chains that merge into a single chain. We mark the
// stores that we vectorized so that we don't visit the same store twice.		// stores that we vectorized so that we don't visit the same store twice.
BoUpSLP::ValueSet VectorizedStores;		BoUpSLP::ValueSet VectorizedStores;
bool Changed = false;		bool Changed = false;

int E = Stores.size();		int E = Stores.size();
(169 lines not shown)	if (VL.size() < 2)
return false;		return false;

LLVM_DEBUG(dbgs() << "SLP: Trying to vectorize a list of length = "		LLVM_DEBUG(dbgs() << "SLP: Trying to vectorize a list of length = "
<< VL.size() << ".\n");		<< VL.size() << ".\n");

// Check that all of the parts are instructions of the same type,		// Check that all of the parts are instructions of the same type,
// we permit an alternate opcode via InstructionsState.		// we permit an alternate opcode via InstructionsState.
InstructionsState S = getSameOpcode(VL);		InstructionsState S = getSameOpcode(VL);

if (!S.getOpcode())		if (!S.getOpcode())
return false;		return false;

Instruction *I0 = cast<Instruction>(S.OpValue);		Instruction *I0 = cast<Instruction>(S.OpValue);
// Make sure invalid types (including vector type) are rejected before		// Make sure invalid types (including vector type) are rejected before
// determining vectorization factor for scalar instructions.		// determining vectorization factor for scalar instructions.
for (Value *V : VL) {		for (Value *V : VL) {
Type *Ty = V->getType();		Type *Ty = V->getType();
(75 lines not shown)	for (unsigned I = NextInst; I < MaxInst; ++I) {
[Ops](const unsigned Idx) { return Ops[Idx]; });		[Ops](const unsigned Idx) { return Ops[Idx]; });
R.buildTree(ReorderedOps);		R.buildTree(ReorderedOps);
}		}
}		}
if (R.isTreeTinyAndNotFullyVectorizable())		if (R.isTreeTinyAndNotFullyVectorizable())
continue;		continue;

R.computeMinimumValueSizes();		R.computeMinimumValueSizes();
InstructionCost Cost = R.getTreeCost();		InstructionCost Cost = R.getTreeCost(true);
CandidateFound = true;		CandidateFound = true;
		ABataev: Do you really need this new var here? I don't see where it is used except as an argument of the `R.findSubTree(UserCost)` call.
		dtemirbulatov (author): I think we need to compensate the ExtractCost with the cost of the insert sequence, as in the case of full vectorization.
		RKSimon: This still looks wrong - isn't the UserCost only used locally in the CompensateUseCost path?
		dtemirbulatov (author): No, there is another instance of UserCost at 6476. We have to compare the cost to SLPCostThreshold inside findSubTree() and subtract UserCost.
		dtemirbulatov (author): I mean the instance of usage.
MinCost = std::min(MinCost, Cost);		MinCost = std::min(MinCost, Cost);

if (Cost < -SLPCostThreshold) {		if (Cost < -SLPCostThreshold) {
LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");
R.getORE()->emit(OptimizationRemark(SV_NAME, "VectorizedList",		R.getORE()->emit(OptimizationRemark(SV_NAME, "VectorizedList",
cast<Instruction>(Ops[0]))		cast<Instruction>(Ops[0]))
<< "SLP vectorized with cost " << ore::NV("Cost", Cost)		<< "SLP vectorized with cost " << ore::NV("Cost", Cost)
<< " and with tree size "		<< " and with tree size "
<< ore::NV("TreeSize", R.getTreeSize()));		<< ore::NV("TreeSize", R.getTreeSize()));

R.vectorizeTree();		R.vectorizeTree();
// Move to the next bundle.		// Move to the next bundle.
I += VF - 1;		I += VF - 1;
NextInst = I + 1;		NextInst = I + 1;
Changed = true;		Changed = true;
}		}
		ABataev: Why `SLPThrottleBudget > 0`? What if `SLPThrottleBudget` equals 0?
}		}
		ABataev: `else if`
		ABataev: Why not try to vectorize a partial tree right here?
		ABataev: Enclose substatement into braces since the substatement in `then` branch is in braces.
		ABataev: Better to enclose this substatement in braces to make the code uniform
		ABataev: Why can't we do something like this:
```
int NumAttempts = 0;
do {
  if (R.isTreeTinyAndNotFullyVectorizable())
    break;
  R.computeMinimumValueSizes();
  InstructionCost Cost = R.getTreeCost();
  InstructionCost UserCost = 0;
  ....
  if (Cost < -SLPCostThreshold) {
    LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");
    R.getORE()->emit(OptimizationRemark(SV_NAME, "VectorizedList",
                                        cast<Instruction>(Ops[0]))
                     << "SLP vectorized with cost " << ore::NV("Cost", Cost)
                     << " and with tree size "
                     << ore::NV("TreeSize", R.getTreeSize()));
    R.vectorizeTree();
    // Move to the next bundle.
    I += VF - 1;
    NextInst = I + 1;
    Changed = true;
    break;
  }
  ...
  /// Do throttling here.
  ++NumAttempts;
} while (NumAttempts < SLPThrottleBudget);
```
		dtemirbulatov (author): thanks
		dtemirbulatov (author): we might have an opportunity to vectorize the whole tree with smaller Chain sizes at vectorizeStoreChain or while doing reductions.
		dtemirbulatov (author): We are doing partial vectorization and we have to know UserCost to make the correct partial tree cut.
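The budgeted retry loop proposed in the thread above can be modelled as a toy, assuming the cost of each successive attempt is already known. The function name `firstProfitableAttempt`, the cost vector, and the integer types are all hypothetical; real throttling would recompute the tree cost after each cut. Mirroring the `do { ... } while` shape, attempt 0 (the uncut tree) is always evaluated, even with a budget of 0, which bears on the question of what happens when `SLPThrottleBudget` equals 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CostPerAttempt[0] is the cost of the uncut tree; later entries are the
// costs after successive hypothetical throttling cuts. Returns the index of
// the first attempt whose cost beats the threshold, or -1 if none does
// before the budget or the candidates run out.
inline int firstProfitableAttempt(const std::vector<int> &CostPerAttempt,
                                  int Threshold, unsigned Budget) {
  if (CostPerAttempt.empty())
    return -1;
  std::size_t Attempt = 0;
  do {
    if (CostPerAttempt[Attempt] < -Threshold)
      return static_cast<int>(Attempt); // Profitable: vectorize and stop.
    ++Attempt;                          // Otherwise cut the tree and retry.
  } while (Attempt < CostPerAttempt.size() && Attempt < Budget);
  return -1;
}
```

With costs `{4, 1, -3}` and a threshold of 0, a budget of 3 reaches the profitable third attempt, while a budget of 2 gives up after the second.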
}		}

if (!Changed && CandidateFound) {		if (!Changed && CandidateFound) {
R.getORE()->emit([&]() {		R.getORE()->emit([&]() {
return OptimizationRemarkMissed(SV_NAME, "NotBeneficial", I0)		return OptimizationRemarkMissed(SV_NAME, "NotBeneficial", I0)
<< "List vectorization was possible but not beneficial with cost "		<< "List vectorization was possible but not beneficial with cost "
<< ore::NV("Cost", MinCost) << " >= "		<< ore::NV("Cost", MinCost) << " >= "
<< ore::NV("Treshold", -SLPCostThreshold);		<< ore::NV("Treshold", -SLPCostThreshold);
(635 lines not shown)	while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {
isBoolLogicOp(cast<Instruction>(ReductionRoot)) &&		isBoolLogicOp(cast<Instruction>(ReductionRoot)) &&
NumReducedVals != ReduxWidth)		NumReducedVals != ReduxWidth)
break;		break;

V.computeMinimumValueSizes();		V.computeMinimumValueSizes();

// Estimate cost.		// Estimate cost.
InstructionCost TreeCost =		InstructionCost TreeCost =
V.getTreeCost(makeArrayRef(&ReducedVals[i], ReduxWidth));		V.getTreeCost(false, makeArrayRef(&ReducedVals[i], ReduxWidth));
InstructionCost ReductionCost =		InstructionCost ReductionCost =
getReductionCost(TTI, ReducedVals[i], ReduxWidth, RdxFMF);		getReductionCost(TTI, ReducedVals[i], ReduxWidth, RdxFMF);
InstructionCost Cost = TreeCost + ReductionCost;		InstructionCost Cost = TreeCost + ReductionCost;
if (!Cost.isValid()) {		if (!Cost.isValid()) {
LLVM_DEBUG(dbgs() << "Encountered invalid baseline cost.\n");		LLVM_DEBUG(dbgs() << "Encountered invalid baseline cost.\n");
return false;		return false;
}		}
if (Cost >= -SLPCostThreshold) {		if (Cost >= -SLPCostThreshold) {
V.getORE()->emit([&]() {		V.getORE()->emit([&]() {
return OptimizationRemarkMissed(SV_NAME, "HorSLPNotBeneficial",		return OptimizationRemarkMissed(SV_NAME, "HorSLPNotBeneficial",
cast<Instruction>(VL[0]))		cast<Instruction>(VL[0]))
		ABataev: Looks like you missed comparing `Cost` with `-SLPCostThreshold` here. You vectorized the tree after throttling unconditionally. Plus, the `Cost` is calculated here, but not used later except for the debug prints.
		dtemirbulatov (author): we don't need to compare here; this is done inside findSubTree().
<< "Vectorizing horizontal reduction is possible"		<< "Vectorizing horizontal reduction is possible"
<< "but not beneficial with cost " << ore::NV("Cost", Cost)		<< "but not beneficial with cost " << ore::NV("Cost", Cost)
		ABataev: Just `else`?
		dtemirbulatov (author): We might try partial vectorization without success here, and then we have to report the insufficient cost and break.
<< " and threshold "		<< " and threshold "
<< ore::NV("Threshold", -SLPCostThreshold);		<< ore::NV("Threshold", -SLPCostThreshold);
});		});
break;		break;
		RKSimon: This looks like a NFC clang-format change now - either pre-commit or discard from the patch?
}		}

LLVM_DEBUG(dbgs() << "SLP: Vectorizing horizontal reduction at cost:"		LLVM_DEBUG(dbgs() << "SLP: Vectorizing horizontal reduction at cost:"
<< Cost << ". (HorRdx)\n");		<< Cost << ". (HorRdx)\n");
V.getORE()->emit([&]() {		V.getORE()->emit([&]() {
return OptimizationRemark(SV_NAME, "VectorizedHorizontalReduction",		return OptimizationRemark(SV_NAME, "VectorizedHorizontalReduction",
cast<Instruction>(VL[0]))		cast<Instruction>(VL[0]))
<< "Vectorized horizontal reduction with cost "		<< "Vectorized horizontal reduction with cost "
(973 lines not shown)

llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 < %s | FileCheck %s			; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 < %s | FileCheck %s
				RKSimon: Is it worth adding a second RUN with -slp-throttle=false?

	define dso_local void @rftbsub(double* %a) local_unnamed_addr #0 {			define dso_local void @rftbsub(double* %a) local_unnamed_addr #0 {
	; CHECK-LABEL: @rftbsub(			; CHECK-LABEL: @rftbsub(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds double, double* [[A:%.*]], i64 2			; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds double, double* [[A:%.*]], i64 2
	; CHECK-NEXT: [[TMP0:%.*]] = load double, double* [[ARRAYIDX6]], align 8			; CHECK-NEXT: [[TMP0:%.*]] = or i64 2, 1
	; CHECK-NEXT: [[TMP1:%.*]] = or i64 2, 1			; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds double, double* [[A]], i64 [[TMP0]]
	; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds double, double* [[A]], i64 [[TMP1]]			; CHECK-NEXT: [[TMP1:%.*]] = bitcast double* [[ARRAYIDX6]] to <2 x double>*
	; CHECK-NEXT: [[TMP2:%.*]] = load double, double* [[ARRAYIDX12]], align 8			; CHECK-NEXT: [[TMP2:%.*]] = load <2 x double>, <2 x double>* [[TMP1]], align 8
	; CHECK-NEXT: [[ADD16:%.*]] = fadd double [[TMP2]], undef			; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[TMP2]], i32 1
				; CHECK-NEXT: [[ADD16:%.*]] = fadd double [[TMP3]], undef
	; CHECK-NEXT: [[MUL18:%.*]] = fmul double undef, [[ADD16]]			; CHECK-NEXT: [[MUL18:%.*]] = fmul double undef, [[ADD16]]
	; CHECK-NEXT: [[ADD19:%.*]] = fadd double undef, [[MUL18]]			; CHECK-NEXT: [[ADD19:%.*]] = fadd double undef, [[MUL18]]
	; CHECK-NEXT: [[SUB22:%.*]] = fsub double undef, undef			; CHECK-NEXT: [[SUB22:%.*]] = fsub double undef, undef
	; CHECK-NEXT: [[SUB25:%.*]] = fsub double [[TMP0]], [[ADD19]]			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> poison, double [[ADD19]], i32 0
	; CHECK-NEXT: store double [[SUB25]], double* [[ARRAYIDX6]], align 8			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> [[TMP4]], double [[SUB22]], i32 1
	; CHECK-NEXT: [[SUB29:%.*]] = fsub double [[TMP2]], [[SUB22]]			; CHECK-NEXT: [[TMP6:%.*]] = fsub <2 x double> [[TMP2]], [[TMP5]]
	; CHECK-NEXT: store double [[SUB29]], double* [[ARRAYIDX12]], align 8			; CHECK-NEXT: [[TMP7:%.*]] = bitcast double* [[ARRAYIDX6]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP6]], <2 x double>* [[TMP7]], align 8
	; CHECK-NEXT: unreachable			; CHECK-NEXT: unreachable
	;			;
	entry:			entry:
	%arrayidx6 = getelementptr inbounds double, double* %a, i64 2			%arrayidx6 = getelementptr inbounds double, double* %a, i64 2
	%0 = load double, double* %arrayidx6, align 8			%0 = load double, double* %arrayidx6, align 8
	%1 = or i64 2, 1			%1 = or i64 2, 1
	%arrayidx12 = getelementptr inbounds double, double* %a, i64 %1			%arrayidx12 = getelementptr inbounds double, double* %a, i64 %1
	%2 = load double, double* %arrayidx12, align 8			%2 = load double, double* %arrayidx12, align 8
	(10 lines not shown)

Revision Contents

Diff 372463

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll
