This is an archive of the discontinued LLVM Phabricator instance.

Differential D57779

[SLP] Add support for throttling.
AbandonedPublic

Authored by dtemirbulatov on Feb 5 2019, 12:48 PM.

Details

Reviewers

ABataev
RKSimon
spatel
anton-afanasyev
hfinkel
vporpo
fhahn

Summary

This patch adds support for SLP throttling: when the cost of vectorizing the whole tree is too high, we can reduce the number of elements proposed for vectorization and vectorize the tree partially. https://www.youtube.com/watch?v=xxtA2XPmIug&t=5s
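
To make the idea in the summary concrete, here is a minimal standalone sketch of greedy throttling. It is not the code from this patch: `Node`, `treeCost`, and `throttle` are hypothetical names, the cost model is reduced to one integer per node, and the tree is assumed to be stored with the root at index 0 and every user before its operands.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for one SLP tree node (not LLVM's TreeEntry).
struct Node {
  int Parent;     // index of the user node, -1 for the root
  int Cost;       // vector cost minus scalar cost; negative means a gain
  int InsertCost; // cost of rebuilding this node's vector from scalars if cut
};

// Total cost of the tree when the nodes flagged in Cut are left scalar.
// A cut node contributes its InsertCost only if its user is still vectorized.
int treeCost(const std::vector<Node> &Tree, const std::vector<bool> &Cut) {
  int Total = 0;
  for (std::size_t I = 0; I < Tree.size(); ++I) {
    if (!Cut[I])
      Total += Tree[I].Cost;
    else if (Tree[I].Parent >= 0 && !Cut[Tree[I].Parent])
      Total += Tree[I].InsertCost;
  }
  return Total;
}

// Greedily cut the most expensive remaining node (with everything feeding it)
// until the total cost drops below Threshold or only the root is left.
// Returns true if a profitable, possibly partial, tree was found.
bool throttle(const std::vector<Node> &Tree, int Threshold,
              std::vector<bool> &Cut) {
  assert(!Tree.empty() && "expects at least a root node");
  Cut.assign(Tree.size(), false);
  while (treeCost(Tree, Cut) >= Threshold) {
    int Worst = -1;
    for (std::size_t I = 1; I < Tree.size(); ++I)
      if (!Cut[I] && (Worst < 0 || Tree[I].Cost > Tree[Worst].Cost))
        Worst = static_cast<int>(I);
    if (Worst < 0)
      return false; // nothing left to cut
    Cut[Worst] = true;
    // Users precede operands, so one forward pass marks the whole subtree.
    for (std::size_t I = Worst + 1; I < Tree.size(); ++I)
      if (Tree[I].Parent >= 0 && Cut[Tree[I].Parent])
        Cut[I] = true;
  }
  return true;
}
```

For a four-node tree `{{-1, -2, 0}, {0, -1, 1}, {0, 6, 1}, {2, 1, 1}}` and a threshold of 0, the full tree costs 4; cutting node 2 (and node 3 beneath it) leaves a partial tree with cost -2, which is profitable under this model.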

Event Timeline

There are a very large number of changes, so older changes are hidden.

RKSimon added inline comments.Aug 10 2020, 6:30 AM

clang/tools/clang-tidy
2 ↗	(On Diff #284313)	Remove this
llvm/tools/mlir
2 ↗	(On Diff #284313)	remove this

Fixed.

RKSimon mentioned this in rG90f721404ff8: [SLP] Regenerate load-merge.ll tests.Aug 10 2020, 8:09 AM

RKSimon added inline comments.Aug 10 2020, 8:42 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7146	This looks like a NFC clang-format change now - either pre-commit or discard from the patch?
llvm/test/Transforms/SLPVectorizer/X86/load-merge.ll
59 ↗	(On Diff #284345)	rebase - this was committed at rG90f721404ff8

Rebased, Fixed.

Oh, I missed fully removing it from the diff at 7269. Fixed.

RKSimon added inline comments.Aug 11 2020, 2:36 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4156–4165	Maybe pull this NDEBUG change out into its own patch?
6333	This still looks wrong - isn't the UserCost only used locally in the CompensateUseCost path?

dtemirbulatov added inline comments.Aug 11 2020, 3:09 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4156–4165	yes, it could go as NFC.
6333	No, there is another instance of UserCost at 6476. We have to compare the cost to SLPCostThreshold inside findSubTree() and subtract UserCost.

dtemirbulatov added inline comments.Aug 11 2020, 3:12 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6333	I mean the instance of usage.

dtemirbulatov mentioned this in rGb1600d8b8971: [NFC] Guard the cost report block of debug outputs with NDEBUG and.Aug 11 2020, 7:35 AM

Rebased after rGb1600d8b8971

@ABataev @anton-afanasyev Any more comments on this?

llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll
2	Is it worth adding a second RUN with -slp-throttle=false ?

xbolva00 added inline comments.Aug 16 2020, 10:23 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
621	Please mention paper name: “Throttling Automatic Vectorization: When Less Is More” https://www.cl.cam.ac.uk/~tmj32/papers/docs/porpodas15-pact.pdf Slides are good, but paper is paper :)

Corrected paper citation, added -slp-throttle=false to llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll, rebased.

dtemirbulatov marked 2 inline comments as done.Aug 17 2020, 4:21 AM

ABataev added inline comments.Aug 21 2020, 6:51 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3253	What if the user does not have corresponding tree entry, i.e. it is initially scalar? What if the `Scalar` itself is going to remain scalar?
4142–4149	Just:
    for (Value *V : Entry->Scalars) {
      auto *Inst = cast<Instruction>(V);
      if (llvm::any_of(Inst->users(), [this](User *Op) {
            return Tree->ScalarToTreeEntry.count(Op) > 0;
          }))
        return InsertCost + getEntryCost(Entry);
    }
Also, check code formatting.

dtemirbulatov added inline comments.Aug 21 2020, 7:38 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3253	"What if the Scalar itself is going to remain scalar?" At this point the decision to cut the tree has been made, so the Scalar can only be one that was intended to be vectorized. Note that at 3295 we ignore any tree entries whose State is not TreeEntry::Vectorize. "What if the user does not have corresponding tree entry, i.e. it is initially scalar?" Ah, yes. I have to check for !UserTE at 3305 and just continue if it is true.

dtemirbulatov added inline comments.Aug 21 2020, 3:23 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4142–4149	Hmm, I don't think this suggestion is correct: there might be several tree entries with TreeEntry::ProposedToGather status, and we have to calculate the insert cost for the whole tree here.

ABataev added inline comments.Aug 21 2020, 3:27 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4142–4149	Yeah, maybe. But you can do something similar, like InsertCost += ... break; instead of setting a flag and doing a check after the loop.
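
The pattern suggested above (add to the cost and break on the first hit, rather than set a flag and check it after the loop) can be sketched outside LLVM as follows. This is only an illustration: `Scalar`, `insertCostForEntries`, and `InsertCostPerEntry` are hypothetical names, values are identified by plain integers, and `std::any_of` stands in for `llvm::any_of`.

```cpp
#include <algorithm>
#include <set>
#include <vector>

// Hypothetical stand-in for a scalar value: it only lists the ids of its users.
struct Scalar {
  std::vector<int> Users;
};

// Adds InsertCostPerEntry once for every entry that has at least one scalar
// with a user inside the tree (InTree holds the ids of in-tree values).
int insertCostForEntries(const std::vector<std::vector<Scalar>> &Entries,
                         const std::set<int> &InTree, int InsertCostPerEntry) {
  int InsertCost = 0;
  for (const std::vector<Scalar> &Entry : Entries) {
    for (const Scalar &S : Entry) {
      bool UsedInTree =
          std::any_of(S.Users.begin(), S.Users.end(),
                      [&](int User) { return InTree.count(User) > 0; });
      if (UsedInTree) {
        InsertCost += InsertCostPerEntry;
        break; // one in-tree user is enough for this entry
      }
    }
  }
  return InsertCost;
}
```

Unlike an early `return`, the `break` only leaves the inner loop, so the cost keeps accumulating over all entries, which is the concern raised in the preceding comment.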

Fixed remarks, rebased.

dtemirbulatov marked 3 inline comments as done.Aug 21 2020, 3:52 PM

Removed unnecessary check for "UserTE" at 3305.

Rebased. Ping.

Good enough for initial implementation?

In D57779#2258046, @xbolva00 wrote:

Good enough for initial implementation?

Yes, to me it looks ready.

In D57779#2266618, @dtemirbulatov wrote:

In D57779#2258046, @xbolva00 wrote:

Good enough for initial implementation?

Yes, to me it looks ready.

Will be able to review it next week, after returning from vacation.

vdmitrie added a subscriber: vdmitrie.Sep 11 2020, 3:36 PM

ABataev added inline comments.Sep 15 2020, 9:30 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3254–3255	Could you compare it with a similar code in BoUpSLP::buildTree? Looks like you still missed some cases for user ignoring.

Matt added a subscriber: Matt.Sep 16 2020, 8:53 AM

Rebased. Moved the InternalTreeUses population out of the (UseScalar != U || !InTreeUserNeedToExtract(Scalar, UserInst, TLI)) limitation at line 2661 in BoUpSLP::buildTree(), since we have to consider every internal user for partial vectorization while calculating the cost.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3254–3255	I think those ignoring cases are related to the fact that we are doing full vectorization in BoUpSLP::buildTree, so we can avoid extracting for in-tree users. Here, however, we have to extract for each user of a value that was once proposed for vectorization.

dtemirbulatov added inline comments.Sep 22 2020, 8:12 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3254–3255	"Here, however, we have to extract for each user of a value that was once proposed for vectorization." I mean for the partial vectorization.

Ping

Rebased. PING

Rebased. Ping.

Rebased. Ping^2

ABataev added inline comments.Nov 23 2020, 9:50 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3258–3260	Either just `cast` without `if` or `dyn_cast`
4132	Not sure that this is the best criterion. I think you also need to include the distance from the head of the tree to the entry, because some big costs can be compensated by the vectorizable nodes in the tree. What I would do here is just some kind of level ordering search (BFS) starting from the deepest level.
4139	I think you can also exclude entries with the number of operands <= 1.
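
The level-ordering idea in the comment on line 4132 can be sketched like this. It is a standalone illustration under stated assumptions, not the patch's code: the tree is given as a hypothetical `Operands` adjacency list with the root at index 0, and breaking ties by higher cost first is an assumption of this sketch, not something the review specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Distance of every node from the root (node 0), computed with a BFS.
// Operands[I] lists the operand nodes of node I; unreachable nodes keep -1.
std::vector<int> levels(const std::vector<std::vector<int>> &Operands) {
  std::vector<int> Level(Operands.size(), -1);
  if (Operands.empty())
    return Level;
  std::queue<int> Work;
  Level[0] = 0;
  Work.push(0);
  while (!Work.empty()) {
    int N = Work.front();
    Work.pop();
    for (int Op : Operands[N]) {
      if (Level[Op] < 0) {
        Level[Op] = Level[N] + 1;
        Work.push(Op);
      }
    }
  }
  return Level;
}

// Candidate order for cutting: deepest level first, then higher cost first.
// The root is never a candidate.
std::vector<int> candidateOrder(const std::vector<std::vector<int>> &Operands,
                                const std::vector<int> &Cost) {
  assert(Cost.size() == Operands.size() && "one cost per node expected");
  std::vector<int> Level = levels(Operands);
  std::vector<int> Order;
  for (std::size_t I = 1; I < Operands.size(); ++I)
    Order.push_back(static_cast<int>(I));
  std::stable_sort(Order.begin(), Order.end(), [&](int A, int B) {
    if (Level[A] != Level[B])
      return Level[A] > Level[B];
    return Cost[A] > Cost[B];
  });
  return Order;
}
```

With `Operands = {{1, 2}, {3}, {}, {}}` and `Cost = {0, 5, 7, 1}`, node 3 sits at level 2 and comes first even though it is the cheapest; nodes 2 and 1 follow in cost order. A pure cost sort would instead try node 2 first.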

anton-afanasyev mentioned this in D90445: [SLP] Make SLPVectorizer to use `llvm.masked.gather` intrinsic.Nov 25 2020, 3:20 AM

dtemirbulatov added inline comments.Dec 3 2020, 5:55 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4132	Hmm, implemented, but I don't see any benefit from that, plus we have to do a BFS search. And we are going to throw away any non-vectorizable nodes at 4295.
4139	But why? The only thing that matters here is the cost.

dtemirbulatov marked 2 inline comments as done.Dec 3 2020, 5:57 PM

ABataev added inline comments.Dec 4 2020, 5:20 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4132	It may trigger for targets like silvermont or in future for vectorized functions.
4139	Because the main idea is to drop gathers, and dropping one gather in favor of another one will certainly not be profitable. But it may improve compile time and the list of candidates. The only case you need to check for is the latest masked gather case; it may be profitable to convert it to gathers for some targets.

dtemirbulatov marked an inline comment as done.Dec 7 2020, 7:32 AM

dtemirbulatov added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4132	I measured the BFS approach against this implementation. With BFS, it is ~10% less efficient on SPEC2006 INT and ~20% less efficient on the compilable part of SPEC2006 FP. By efficiency, I mean the total number of reduced trees during the whole compilation.
4139	I think I can check here whether scatter/gather is supported via TargetInfo and, if it is not, move all nodes with TreeEntry::ScatterVectorize to TreeEntry::Gather.

anton-afanasyev added inline comments.Dec 7 2020, 8:43 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4139	I believe it's the wrong decision to check scatter/gather target support, for the reason mentioned here: https://reviews.llvm.org/D92701#2435573. Why couldn't we just rely on costs (the node cost and the total one)?

dtemirbulatov marked an inline comment as not done.Dec 7 2020, 9:04 AM

dtemirbulatov added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4139	I agree with @anton-afanasyev here. I am not sure what @ABataev wants here. If I exclude (operands <= 1), then we would lose half of all tests in SLP affected by throttling.

ABataev added inline comments.Dec 7 2020, 9:16 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4132	Could you post it anyway to check if it may be improved?
4139	I did not say anything about checking if scatter is supported here. I just said that we can improve the criterion here by checking that the entry node has at least 2 operands (because if it has just one operand, most probably we can skip it), and that among nodes with only 1 operand we just need to check gather/scatter nodes, because it may be better to represent them as a simple gather.
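
A minimal sketch of the criterion described in the comment on line 4139, assuming a simplified entry type. `EntryState`, `Entry`, and `isThrottleCandidate` are hypothetical names that only mirror the states mentioned in this review; they are not LLVM's `TreeEntry`.

```cpp
// Hypothetical mirror of the entry states named in this review.
enum class EntryState { Vectorize, ScatterVectorize, Gather };

// Hypothetical, simplified tree entry.
struct Entry {
  EntryState State;
  unsigned NumOperands;
};

// Keep entries with at least two operands as throttling candidates; keep
// entries with at most one operand only if they are scatter/gather nodes,
// since those may be cheaper when represented as a simple gather.
bool isThrottleCandidate(const Entry &E) {
  if (E.NumOperands >= 2)
    return true;
  return E.State == EntryState::ScatterVectorize;
}
```

Note that this filter looks only at the shape of the entry and never queries target support for scatter/gather, which is the distinction drawn in the comment.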

dtemirbulatov added inline comments.Dec 7 2020, 9:19 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4132	OK, I might be missing something. Thanks.

Here is the BFS version of the change. Rebased.

And I counted the total number of nodes vectorized with throttling, instead of just the number of successful tree reductions. The total number is ~25% higher for INT and FP CPU2006 (AVX2 and AVX512F) with the Cost sort compared to Distance.

Discussed further improvements with @ABataev offline, and he suggested removing the throttle limiter ("slp-throttling-budget"), at least for basic blocks without calls. I am looking into this new functionality.

Removed the "slp-throttling-budget" limiter for trees without calls.
Moved the main tree reduction loop to the getTreeCost() function.
Deleted the ProposedToGather node attribute from EntryState.

Rebased. Measured the compile-time impact on CPU2006 integer, and I have not noticed any significant regressions in SLP compile time compared to SLP throttling with the limiter.

In D57779#2525124, @dtemirbulatov wrote:

Rebased. Measured the compile-time impact on CPU2006 integer, and I have not noticed any significant regressions in SLP compile time compared to SLP throttling with the limiter.

I mean only the SLP time regression, measured using the "-ftime-trace" flag.

At Dinar's request, I've measured compile time regression: http://llvm-compile-time-tracker.com/compare.php?from=f3449ed6073cac58efd9b62d0eb285affa650238&to=39362e11add238c45a7a7d55c1e002005f396fb7&stat=instructions. The regression is visible, but it is acceptable for such change imho. The largest regression comes from CMakeFiles/clamscan.dir/libclamav_uuencode.c.o (+11.28%), so one can investigate this particular file.

In D57779#2525959, @anton-afanasyev wrote:

At Dinar's request, I've measured compile time regression: http://llvm-compile-time-tracker.com/compare.php?from=f3449ed6073cac58efd9b62d0eb285affa650238&to=39362e11add238c45a7a7d55c1e002005f396fb7&stat=instructions. The regression is visible, but it is acceptable for such change imho. The largest regression comes from CMakeFiles/clamscan.dir/libclamav_uuencode.c.o (+11.28%), so one can investigate this particular file.

Thanks, Anton. Eh, I don't see any time difference on my side for `CMakeFiles/clamscan.dir/libclamav_uuencode.c.o` with -O3 for -mavx2 or -mavx512f. Also, SLP didn't try to throttle any trees in this particular test, so it looks like noise to me.

Ping.

Rebased, Ping.

ABataev added inline comments.Feb 10 2021, 7:40 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
134–142	Do we really need both of these options? `MaxCostsRecalculations` should be enough.
609–613	Does "scalar form" mean "gathered nodes"? I don't think that we can currently end up in a situation like the one in the picture: we can't have a gathered node that depends on another node (either gathered or vectorized).
658–665	Why do we need to save intermediate results? Can't it be solved in a single loop without saving the intermediate results in the class instance?

ABataev added inline comments.Feb 10 2021, 7:40 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2530–2532	Looks like unrelated change

dtemirbulatov added inline comments.Feb 10 2021, 1:21 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
134–142	ok, thanks.
658–665	I have noticed many regressions if we decide right away, and rebuilding the tree afterward is expensive.

ABataev added inline comments.Feb 10 2021, 1:40 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
658–665	What is the cause of those regressions? If I understand it correctly, you're just trying to find the subtree, exclude its cost, compare, and repeat if it is not profitable. What prevents doing this in the loop without saving intermediate results in the class, keeping them in stack vectors instead if needed?

dtemirbulatov added inline comments.Feb 10 2021, 3:37 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
658–665	For example, when we could partially vectorize at vectorizeStoreChain(), but it is possible later to fully vectorize the same tree with tryToVectorizeList() or tryToReduce().

ABataev added inline comments.Feb 10 2021, 3:55 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
658–665	Could you give an example, please?

Here we can see the regression: it misses vectorizing the whole tree because partial vectorization kicks in too early, and the "add" instructions stay scalar:

--- a/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll
+++ b/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll
@@ -7,49 +7,65 @@ define void @test(i32) {
; CHECK-NEXT: entry:
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
-; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP15:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
-; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
-; CHECK-NEXT: [[TMP2:%.*]] = extractelement <8 x i32> [[SHUFFLE]], i32 1
-; CHECK-NEXT: [[TMP3:%.*]] = add <8 x i32> [[SHUFFLE]], <i32 0, i32 55, i32 285, i32 1240, i32 1496, i32 8555, i32 12529, i32 13685>
-; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.and.v8i32(<8 x i32> [[TMP3]])
-; CHECK-NEXT: [[OP_EXTRA:%.*]] = and i32 [[TMP4]], [[TMP0:%.*]]
-; CHECK-NEXT: [[OP_EXTRA1:%.*]] = and i32 [[OP_EXTRA]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA2:%.*]] = and i32 [[OP_EXTRA1]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA3:%.*]] = and i32 [[OP_EXTRA2]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA4:%.*]] = and i32 [[OP_EXTRA3]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA5:%.*]] = and i32 [[OP_EXTRA4]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA6:%.*]] = and i32 [[OP_EXTRA5]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA7:%.*]] = and i32 [[OP_EXTRA6]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA8:%.*]] = and i32 [[OP_EXTRA7]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA9:%.*]] = and i32 [[OP_EXTRA8]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA10:%.*]] = and i32 [[OP_EXTRA9]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA11:%.*]] = and i32 [[OP_EXTRA10]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA12:%.*]] = and i32 [[OP_EXTRA11]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA13:%.*]] = and i32 [[OP_EXTRA12]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA14:%.*]] = and i32 [[OP_EXTRA13]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA15:%.*]] = and i32 [[OP_EXTRA14]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA16:%.*]] = and i32 [[OP_EXTRA15]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA17:%.*]] = and i32 [[OP_EXTRA16]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA18:%.*]] = and i32 [[OP_EXTRA17]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA19:%.*]] = and i32 [[OP_EXTRA18]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA20:%.*]] = and i32 [[OP_EXTRA19]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA21:%.*]] = and i32 [[OP_EXTRA20]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA22:%.*]] = and i32 [[OP_EXTRA21]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA23:%.*]] = and i32 [[OP_EXTRA22]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA24:%.*]] = and i32 [[OP_EXTRA23]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA25:%.*]] = and i32 [[OP_EXTRA24]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA26:%.*]] = and i32 [[OP_EXTRA25]], [[TMP0]]
-; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> poison, i32 [[OP_EXTRA26]], i32 0
-; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x i32> [[TMP5]], i32 14910, i32 1
-; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x i32> poison, i32 [[TMP2]], i32 0
-; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x i32> [[TMP7]], i32 [[TMP2]], i32 1
-; CHECK-NEXT: [[TMP9:%.*]] = and <2 x i32> [[TMP6]], [[TMP8]]
-; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i32> [[TMP6]], [[TMP8]]
-; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> [[TMP10]], <2 x i32> <i32 0, i32 3>
-; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x i32> [[TMP11]], i32 0
-; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x i32> poison, i32 [[TMP12]], i32 0
-; CHECK-NEXT: [[TMP14:%.*]] = extractelement <2 x i32> [[TMP11]], i32 1
-; CHECK-NEXT: [[TMP15]] = insertelement <2 x i32> [[TMP13]], i32 [[TMP14]], i32 1
+; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP19:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
+; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0
+; CHECK-NEXT: [[VAL_0:%.*]] = add i32 [[TMP2]], 0
+; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1
+; CHECK-NEXT: [[VAL_1:%.*]] = and i32 [[TMP3]], [[VAL_0]]
+; CHECK-NEXT: [[VAL_2:%.*]] = and i32 [[VAL_1]], [[TMP0:%.*]]
+; CHECK-NEXT: [[VAL_3:%.*]] = and i32 [[VAL_2]], [[TMP0]]
+; CHECK-NEXT: [[VAL_4:%.*]] = and i32 [[VAL_3]], [[TMP0]]
+; CHECK-NEXT: [[VAL_5:%.*]] = and i32 [[VAL_4]], [[TMP0]]
+; CHECK-NEXT: [[VAL_6:%.*]] = add i32 [[TMP3]], 55
+; CHECK-NEXT: [[VAL_7:%.*]] = and i32 [[VAL_5]], [[VAL_6]]
+; CHECK-NEXT: [[VAL_8:%.*]] = and i32 [[VAL_7]], [[TMP0]]
+; CHECK-NEXT: [[VAL_9:%.*]] = and i32 [[VAL_8]], [[TMP0]]
+; CHECK-NEXT: [[VAL_10:%.*]] = and i32 [[VAL_9]], [[TMP0]]
+; CHECK-NEXT: [[VAL_11:%.*]] = add i32 [[TMP3]], 285
+; CHECK-NEXT: [[VAL_12:%.*]] = and i32 [[VAL_10]], [[VAL_11]]
+; CHECK-NEXT: [[VAL_13:%.*]] = and i32 [[VAL_12]], [[TMP0]]
+; CHECK-NEXT: [[VAL_14:%.*]] = and i32 [[VAL_13]], [[TMP0]]
+; CHECK-NEXT: [[VAL_15:%.*]] = and i32 [[VAL_14]], [[TMP0]]
+; CHECK-NEXT: [[VAL_16:%.*]] = and i32 [[VAL_15]], [[TMP0]]
+; CHECK-NEXT: [[VAL_17:%.*]] = and i32 [[VAL_16]], [[TMP0]]
+; CHECK-NEXT: [[VAL_18:%.*]] = add i32 [[TMP3]], 1240
+; CHECK-NEXT: [[VAL_19:%.*]] = and i32 [[VAL_17]], [[VAL_18]]
+; CHECK-NEXT: [[VAL_20:%.*]] = add i32 [[TMP3]], 1496
+; CHECK-NEXT: [[VAL_21:%.*]] = and i32 [[VAL_19]], [[VAL_20]]
+; CHECK-NEXT: [[VAL_22:%.*]] = and i32 [[VAL_21]], [[TMP0]]
+; CHECK-NEXT: [[VAL_23:%.*]] = and i32 [[VAL_22]], [[TMP0]]
+; CHECK-NEXT: [[VAL_24:%.*]] = and i32 [[VAL_23]], [[TMP0]]
+; CHECK-NEXT: [[VAL_25:%.*]] = and i32 [[VAL_24]], [[TMP0]]
+; CHECK-NEXT: [[VAL_26:%.*]] = and i32 [[VAL_25]], [[TMP0]]
+; CHECK-NEXT: [[VAL_27:%.*]] = and i32 [[VAL_26]], [[TMP0]]
+; CHECK-NEXT: [[VAL_28:%.*]] = and i32 [[VAL_27]], [[TMP0]]
+; CHECK-NEXT: [[VAL_29:%.*]] = and i32 [[VAL_28]], [[TMP0]]
+; CHECK-NEXT: [[VAL_30:%.*]] = and i32 [[VAL_29]], [[TMP0]]
+; CHECK-NEXT: [[VAL_31:%.*]] = and i32 [[VAL_30]], [[TMP0]]
+; CHECK-NEXT: [[VAL_32:%.*]] = and i32 [[VAL_31]], [[TMP0]]
+; CHECK-NEXT: [[VAL_33:%.*]] = and i32 [[VAL_32]], [[TMP0]]
+; CHECK-NEXT: [[VAL_34:%.*]] = add i32 [[TMP3]], 8555
+; CHECK-NEXT: [[VAL_35:%.*]] = and i32 [[VAL_33]], [[VAL_34]]
+; CHECK-NEXT: [[VAL_36:%.*]] = and i32 [[VAL_35]], [[TMP0]]
+; CHECK-NEXT: [[VAL_37:%.*]] = and i32 [[VAL_36]], [[TMP0]]
+; CHECK-NEXT: [[VAL_38:%.*]] = and i32 [[VAL_37]], [[TMP0]]
+; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x i32> poison, i32 [[TMP3]], i32 0
+; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> [[TMP4]], i32 [[TMP3]], i32 1
+; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i32> [[TMP5]], <i32 12529, i32 13685>
+; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x i32> [[TMP6]], i32 0
+; CHECK-NEXT: [[VAL_40:%.*]] = and i32 [[VAL_38]], [[TMP7]]
+; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i32> [[TMP6]], i32 1
+; CHECK-NEXT: [[TMP9:%.*]] = insertelement <2 x i32> poison, i32 [[VAL_40]], i32 0
+; CHECK-NEXT: [[TMP10:%.*]] = insertelement <2 x i32> [[TMP9]], i32 14910, i32 1
+; CHECK-NEXT: [[TMP11:%.*]] = insertelement <2 x i32> poison, i32 [[TMP8]], i32 0
+; CHECK-NEXT: [[TMP12:%.*]] = insertelement <2 x i32> [[TMP11]], i32 [[TMP3]], i32 1
+; CHECK-NEXT: [[TMP13:%.*]] = and <2 x i32> [[TMP10]], [[TMP12]]
+; CHECK-NEXT: [[TMP14:%.*]] = add <2 x i32> [[TMP10]], [[TMP12]]
+; CHECK-NEXT: [[TMP15:%.*]] = shufflevector <2 x i32> [[TMP13]], <2 x i32> [[TMP14]], <2 x i32> <i32 0, i32 3>
+; CHECK-NEXT: [[TMP16:%.*]] = extractelement <2 x i32> [[TMP15]], i32 0
+; CHECK-NEXT: [[TMP17:%.*]] = insertelement <2 x i32> poison, i32 [[TMP16]], i32 0
+; CHECK-NEXT: [[TMP18:%.*]] = extractelement <2 x i32> [[TMP15]], i32 1
+; CHECK-NEXT: [[TMP19]] = insertelement <2 x i32> [[TMP17]], i32 [[TMP18]], i32 1
; CHECK-NEXT: br label [[LOOP]]
;
; FORCE_REDUCTION-LABEL: @test(

In D57779#2556783, @dtemirbulatov wrote:

Here we can see the regression: it misses vectorizing the whole tree because partial vectorization kicks in too early, and the "add" instructions stay scalar:

To me, it just looks like we need to postpone the vectorization of phi nodes in the function rather than trying to fix all the issues in the world in a single patch.

I think I could give one simpler example without PHI nodes.

Here is another example:
source_filename = "psspread.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local void @spread_q_poisson() local_unnamed_addr #0 {
entry:

%div.i = fdiv float undef, undef
%conv13.i = fdiv float 1.000000e+00, %div.i
%conv18.i = fdiv float 1.000000e+00, undef
%conv23.i = fdiv float 1.000000e+00, undef
%conv162 = fptosi float undef to i32
%0 = load float, float* undef, align 4
%1 = load i32, i32* undef, align 4
%add187.us = add nsw i32 %1, %conv162
%add191.us = add nsw i32 undef, undef
%add195.us = add nsw i32 undef, undef
%conv196.us = sitofp i32 %add187.us to float
%mul197.us = fmul float %conv13.i, %conv196.us
%sub198.us = fsub float undef, %mul197.us
%mul.i363.us = fmul float %sub198.us, %sub198.us
%conv200.us = sitofp i32 %add191.us to float
%mul201.us = fmul float %conv18.i, %conv200.us
%sub202.us = fsub float undef, %mul201.us
%mul.i362.us = fmul float %sub202.us, %sub202.us
%conv204.us = sitofp i32 %add195.us to float
%mul205.us = fmul float %conv23.i, %conv204.us
%sub206.us = fsub float %0, %mul205.us
%mul.i.us = fmul float %sub206.us, %sub206.us
%add208.us = fadd float %mul.i363.us, %mul.i362.us
%add209.us = fadd float %add208.us, %mul.i.us
%cmp210.us = fcmp olt float %add209.us, undef
%add230.us = add nsw i32 undef, %add195.us
unreachable

}

attributes #0 = { "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 13.0.0 (/home/dtemirbulatov/llvm/llvm-project-thl/llvm/tools/clang eec04092d67b94f47439a9065b6bd4cd60165be2)"}

With the proposed change it produces:
; ModuleID = 'bug.ll'
source_filename = "psspread.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local void @spread_q_poisson() local_unnamed_addr #0 {
entry:

%div.i = fdiv float undef, undef
%conv13.i = fdiv float 1.000000e+00, %div.i
%conv162 = fptosi float undef to i32
%0 = load float, float* undef, align 4
%1 = load i32, i32* undef, align 4
%add187.us = add nsw i32 %1, %conv162
%conv196.us = sitofp i32 %add187.us to float
%mul197.us = fmul float %conv13.i, %conv196.us
%sub198.us = fsub float undef, %mul197.us
%mul.i363.us = fmul float %sub198.us, %sub198.us
%2 = insertelement <2 x float> <float undef, float poison>, float %0, i32 1
%3 = fsub <2 x float> %2, <float 0x7FF8000000000000, float 0x7FF8000000000000>
%4 = fmul <2 x float> %3, %3
%5 = extractelement <2 x float> %4, i32 0
%add208.us = fadd float %mul.i363.us, %5
%6 = extractelement <2 x float> %4, i32 1
%add209.us = fadd float %add208.us, %6
%cmp210.us = fcmp olt float %add209.us, undef
%add230.us = add nsw i32 undef, undef
unreachable

}

attributes #0 = { "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 13.0.0 (/home/dtemirbulatov/llvm/llvm-project-thl/llvm/tools/clang eec04092d67b94f47439a9065b6bd4cd60165be2)"}

but if we immediately decide to vectorize partially, we get this output:
; ModuleID = 'bug.ll'
source_filename = "psspread.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local void @spread_q_poisson() local_unnamed_addr #0 {
entry:

%div.i = fdiv float undef, undef
%conv18.i = fdiv float 1.000000e+00, undef
%0 = insertelement <2 x float> poison, float %div.i, i32 0
%1 = insertelement <2 x float> %0, float undef, i32 1
%2 = fdiv <2 x float> <float 1.000000e+00, float 1.000000e+00>, %1
%conv162 = fptosi float undef to i32
%3 = load float, float* undef, align 4
%4 = load i32, i32* undef, align 4
%add187.us = add nsw i32 %4, %conv162
%add191.us = add nsw i32 undef, undef
%add195.us = add nsw i32 undef, undef
%conv200.us = sitofp i32 %add191.us to float
%mul201.us = fmul float %conv18.i, %conv200.us
%sub202.us = fsub float undef, %mul201.us
%mul.i362.us = fmul float %sub202.us, %sub202.us
%5 = insertelement <2 x i32> poison, i32 %add187.us, i32 0
%6 = insertelement <2 x i32> %5, i32 %add195.us, i32 1
%7 = sitofp <2 x i32> %6 to <2 x float>
%8 = fmul <2 x float> %2, %7
%9 = insertelement <2 x float> <float undef, float poison>, float %3, i32 1
%10 = fsub <2 x float> %9, %8
%11 = fmul <2 x float> %10, %10
%12 = extractelement <2 x float> %11, i32 0
%add208.us = fadd float %12, %mul.i362.us
%13 = extractelement <2 x float> %11, i32 1
%add209.us = fadd float %add208.us, %13
%cmp210.us = fcmp olt float %add209.us, undef
%add230.us = add nsw i32 undef, %add195.us
unreachable

}

attributes #0 = { "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 13.0.0 (/home/dtemirbulatov/llvm/llvm-project-thl/llvm/tools/clang eec04092d67b94f47439a9065b6bd4cd60165be2)"}

In D57779#2559601, @dtemirbulatov wrote:
Here is another example:
source_filename = "psspread.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local void @spread_q_poisson() local_unnamed_addr #0 {
entry:
%div.i = fdiv float undef, undef
%conv13.i = fdiv float 1.000000e+00, %div.i
%conv18.i = fdiv float 1.000000e+00, undef
%conv23.i = fdiv float 1.000000e+00, undef
%conv162 = fptosi float undef to i32
%0 = load float, float* undef, align 4
%1 = load i32, i32* undef, align 4
%add187.us = add nsw i32 %1, %conv162
%add191.us = add nsw i32 undef, undef
%add195.us = add nsw i32 undef, undef
%conv196.us = sitofp i32 %add187.us to float
%mul197.us = fmul float %conv13.i, %conv196.us
%sub198.us = fsub float undef, %mul197.us
%mul.i363.us = fmul float %sub198.us, %sub198.us
%conv200.us = sitofp i32 %add191.us to float
%mul201.us = fmul float %conv18.i, %conv200.us
%sub202.us = fsub float undef, %mul201.us
%mul.i362.us = fmul float %sub202.us, %sub202.us
%conv204.us = sitofp i32 %add195.us to float
%mul205.us = fmul float %conv23.i, %conv204.us
%sub206.us = fsub float %0, %mul205.us
%mul.i.us = fmul float %sub206.us, %sub206.us
%add208.us = fadd float %mul.i363.us, %mul.i362.us
%add209.us = fadd float %add208.us, %mul.i.us
%cmp210.us = fcmp olt float %add209.us, undef
%add230.us = add nsw i32 undef, %add195.us
unreachable
}

attributes #0 = { "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 13.0.0 (/home/dtemirbulatov/llvm/llvm-project-thl/llvm/tools/clang eec04092d67b94f47439a9065b6bd4cd60165be2)"}

With the proposed change it produces:
; ModuleID = 'bug.ll'
source_filename = "psspread.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local void @spread_q_poisson() local_unnamed_addr #0 {
entry:
%div.i = fdiv float undef, undef
%conv13.i = fdiv float 1.000000e+00, %div.i
%conv162 = fptosi float undef to i32
%0 = load float, float* undef, align 4
%1 = load i32, i32* undef, align 4
%add187.us = add nsw i32 %1, %conv162
%conv196.us = sitofp i32 %add187.us to float
%mul197.us = fmul float %conv13.i, %conv196.us
%sub198.us = fsub float undef, %mul197.us
%mul.i363.us = fmul float %sub198.us, %sub198.us
%2 = insertelement <2 x float> <float undef, float poison>, float %0, i32 1
%3 = fsub <2 x float> %2, <float 0x7FF8000000000000, float 0x7FF8000000000000>
%4 = fmul <2 x float> %3, %3
%5 = extractelement <2 x float> %4, i32 0
%add208.us = fadd float %mul.i363.us, %5
%6 = extractelement <2 x float> %4, i32 1
%add209.us = fadd float %add208.us, %6
%cmp210.us = fcmp olt float %add209.us, undef
%add230.us = add nsw i32 undef, undef
unreachable
}

attributes #0 = { "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 13.0.0 (/home/dtemirbulatov/llvm/llvm-project-thl/llvm/tools/clang eec04092d67b94f47439a9065b6bd4cd60165be2)"}

but if we immediately decide to vectorize partially, we get this output:
; ModuleID = 'bug.ll'
source_filename = "psspread.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local void @spread_q_poisson() local_unnamed_addr #0 {
entry:
%div.i = fdiv float undef, undef
%conv18.i = fdiv float 1.000000e+00, undef
%0 = insertelement <2 x float> poison, float %div.i, i32 0
%1 = insertelement <2 x float> %0, float undef, i32 1
%2 = fdiv <2 x float> <float 1.000000e+00, float 1.000000e+00>, %1
%conv162 = fptosi float undef to i32
%3 = load float, float* undef, align 4
%4 = load i32, i32* undef, align 4
%add187.us = add nsw i32 %4, %conv162
%add191.us = add nsw i32 undef, undef
%add195.us = add nsw i32 undef, undef
%conv200.us = sitofp i32 %add191.us to float
%mul201.us = fmul float %conv18.i, %conv200.us
%sub202.us = fsub float undef, %mul201.us
%mul.i362.us = fmul float %sub202.us, %sub202.us
%5 = insertelement <2 x i32> poison, i32 %add187.us, i32 0
%6 = insertelement <2 x i32> %5, i32 %add195.us, i32 1
%7 = sitofp <2 x i32> %6 to <2 x float>
%8 = fmul <2 x float> %2, %7
%9 = insertelement <2 x float> <float undef, float poison>, float %3, i32 1
%10 = fsub <2 x float> %9, %8
%11 = fmul <2 x float> %10, %10
%12 = extractelement <2 x float> %11, i32 0
%add208.us = fadd float %12, %mul.i362.us
%13 = extractelement <2 x float> %11, i32 1
%add209.us = fadd float %add208.us, %13
%cmp210.us = fcmp olt float %add209.us, undef
%add230.us = add nsw i32 undef, %add195.us
unreachable
}

attributes #0 = { "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 13.0.0 (/home/dtemirbulatov/llvm/llvm-project-thl/llvm/tools/clang eec04092d67b94f47439a9065b6bd4cd60165be2)"}

I see that immediate vectorization is better as it vectorizes more, no? Also, there is a problem, looks like it is caused by the multinode analysis. I'm trying to improve this in my non-power-2 patch, will prepare a separate patch for it.

I see that immediate vectorization is better as it vectorizes more, no? Also, there is a problem, looks like it is caused by the multinode analysis. I'm trying to improve this in my non-power-2 patch, will prepare a separate patch for it.

eh, I think it is not a clear example, I have seen better examples, I will show something better.

In D57779#2560071, @dtemirbulatov wrote:

I see that immediate vectorization is better as it vectorizes more, no? Also, there is a problem, looks like it is caused by the multinode analysis. I'm trying to improve this in my non-power-2 patch, will prepare a separate patch for it.

eh, I think it is not a clear example, I have seen better examples, I will show something better.

Even this example shows that the current solution does not always produce the best result.

Even this example shows that the current solution does not always produce the best result.

at least, we could avoid regressions.

I think the next step is to compare vectorized tree heights(number of vectorized nodes) among possible vectorizable trees.

Even this example shows that the current solution does not always produce the best result.

SLP takes a greedy approach, and let's assume that full vectorization is always better than partial. We don't have the resources to save all trees and then choose the best of the saved ones. I think I can now add choosing the best among the already partially vectorized trees.

In D57779#2564284, @dtemirbulatov wrote:

Even this example shows that the current solution does not always produce the best result.

SLP takes a greedy approach, and let's assume that full vectorization is always better than partial. We don't have the resources to save all trees and then choose the best of the saved ones. I think I can now add choosing the best among the already partially vectorized trees.

Again, even your example showed that this solution is worse in some cases. Why do we need to waste the time and invest in a solution, which is not better than the existing one, requires more time to understand, consumes more memory?
SLP implements a bottom-up approach, i.e. it always tries to vectorize the longest chain (except for PHI nodes, which should be improved). If we have a partial graph, it should not affect other vectorization graphs in the same basic block, generally speaking, just some subnodes may become the subnodes of the other graphs but this is not a problem.
Looks like you're trying to implement something similar to VPlan. We have it already and better to invest the time to implement support for SLP vectorization there.
Redesign is completely different work, it requires correct estimation (not the assumptions, but real investigation), discussion, RFC, approval, and separate implementation.

Again, even your example showed that this solution is worse in some cases. Why do we need to waste the time and invest in a solution, which is not better than the existing one, requires more time to understand, consumes more memory?

SLP implements a bottom-up approach, i.e. it always tries to vectorize the longest chain (except for PHI nodes, which should be improved). If we have a partial graph, it should not affect other vectorization graphs in the same basic block, generally speaking, just some subnodes may become the subnodes of the other graphs but this is not a problem.

Looks like you're trying to implement something similar to VPlan. We have it already and better to invest the time to implement support for SLP vectorization there.

Redesign is completely different work, it requires correct estimation (not the assumptions, but real investigation), discussion, RFC, approval, and separate implementation.

Ok, Agree.

Addressed @ABataev's remarks and investigated the regression with PHI nodes in PR39774.ll. I have not spotted any other case involving PHI nodes, but I have several other cases, and they happen quite rarely. I am not sure how to generalize them, and I think VPlan might be helpful. Overall, I think it is ready.

Harbormaster completed remote builds in B94259: Diff 331286.Mar 17 2021, 9:57 AM

ABataev added inline comments.Mar 19 2021, 7:15 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
599	`PriorityQueue`?
5546–5555	Looks like you need to implement something like `reduceSchedulingRegion()`, similar to `extendSchedulingRegion` function. Because otherwise you're going to operate with the larger scheduling region. I.e. need to modify `ScheduleStart` and `ScheduleEnd` data members.
6352	Why `SLPThrottleBudget > 0`? What if `SLPThrottleBudget` equals 0?
6352–6353	Why can't we do something like this:
int NumAttempts = 0;
do {
  if (R.isTreeTinyAndNotFullyVectorizable())
    break;
  R.computeMinimumValueSizes();
  InstructionCost Cost = R.getTreeCost();
  InstructionCost UserCost = 0;
  ....
  if (Cost < -SLPCostThreshold) {
    LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");
    R.getORE()->emit(OptimizationRemark(SV_NAME, "VectorizedList",
                                        cast<Instruction>(Ops[0]))
                     << "SLP vectorized with cost " << ore::NV("Cost", Cost)
                     << " and with tree size "
                     << ore::NV("TreeSize", R.getTreeSize()));
    R.vectorizeTree();
    // Move to the next bundle.
    I += VF - 1;
    NextInst = I + 1;
    Changed = true;
    break;
  }
  ...
  /// Do throttling here.
  ++NumAttempts;
} while (NumAttempts < SLPThrottleBudget);
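The loop shape proposed in this comment can be tried out in isolation. The following is only a toy model, not the SLPVectorizer code: `ToyTree`, `treeCost`, `throttleOnce`, and the fixed per-cancel penalty are all hypothetical stand-ins, and "throttling" here simply cancels the most expensive node.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a vectorizable tree. Each entry is one node's
// cost delta (vector cost minus scalar cost); negative means profitable.
struct ToyTree {
  std::vector<int> NodeCosts;
  int GatherPenalty = 0; // accumulated cost of re-gathering canceled nodes
};

int treeCost(const ToyTree &T) {
  return std::accumulate(T.NodeCosts.begin(), T.NodeCosts.end(),
                         T.GatherPenalty);
}

// Toy throttling step: cancel the most expensive node and charge a fixed
// penalty. Returns false when at most one node remains.
bool throttleOnce(ToyTree &T, int PenaltyPerCancel) {
  if (T.NodeCosts.size() <= 1)
    return false;
  T.NodeCosts.erase(std::max_element(T.NodeCosts.begin(), T.NodeCosts.end()));
  T.GatherPenalty += PenaltyPerCancel;
  return true;
}

// Same loop shape as in the comment: evaluate, vectorize if profitable,
// otherwise throttle and retry until the budget is exhausted.
bool tryVectorizeWithThrottling(ToyTree &T, int CostThreshold, unsigned Budget,
                                int PenaltyPerCancel) {
  assert(PenaltyPerCancel >= 0 && "penalty must not be negative");
  unsigned NumAttempts = 0;
  do {
    if (treeCost(T) < -CostThreshold)
      return true; // the real code would call vectorizeTree() here
    if (!throttleOnce(T, PenaltyPerCancel))
      break;
    ++NumAttempts;
  } while (NumAttempts < Budget);
  return false;
}
```

Because this is a do-while loop, the full tree is still evaluated once even when the budget is 0, which is one way to address the question about `SLPThrottleBudget` being 0.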

dtemirbulatov added inline comments.Mar 21 2021, 5:04 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
599	Hmm, PriorityQueue allows duplicate elements, and that might be an issue.
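The concern about duplicates can be reproduced with the standard containers alone. This sketch uses `std::priority_queue` (which, as far as I know, `llvm::PriorityQueue` derives from) next to `std::set`; the helper `insertKeys` and the `ContainerSizes` struct are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <set>
#include <vector>

struct ContainerSizes {
  std::size_t QueueSize;
  std::size_t SetSize;
};

// Insert the same keys into a heap-based queue and into an ordered set and
// report how many elements each container ends up holding.
ContainerSizes insertKeys(const std::vector<int> &Keys) {
  std::priority_queue<int> Queue;
  std::set<int> Set;
  for (int K : Keys) {
    Queue.push(K); // always stores another copy
    Set.insert(K); // ignores keys that are already present
  }
  assert(Set.size() <= Queue.size());
  return {Queue.size(), Set.size()};
}
```

If the same tree entry could be pushed twice, a queue-based worklist would need a separate "already seen" check, whereas a set deduplicates on insertion.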

Rebased, addressed remarks, added reduceSchedulingRegion() function with the ability to set only ScheduleStart at this time, renamed RemovedOperations property to ProposedToGather.
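As a rough illustration of shrinking a scheduling region from the front only (the update above says the new function can set only ScheduleStart for now), here is a toy sketch. It models a block as instruction indices and a half-open region; `ToyRegion`, `reduceRegionStart`, and the `StillNeeded` flags are hypothetical and do not reflect how the patch decides which instructions can leave the region.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open range [Start, End) of instruction indices inside one block.
struct ToyRegion {
  std::size_t Start;
  std::size_t End;
};

// Advance Start past leading instructions the reduced tree no longer needs.
// End is deliberately left untouched.
ToyRegion reduceRegionStart(ToyRegion R, const std::vector<bool> &StillNeeded) {
  assert(R.Start <= R.End && R.End <= StillNeeded.size());
  while (R.Start < R.End && !StillNeeded[R.Start])
    ++R.Start;
  return R;
}
```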

dtemirbulatov marked 2 inline comments as done.Mar 29 2021, 5:39 PM

dtemirbulatov added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6352–6353	We are doing partial vectorization and we have to know UserCost to make the correct partial tree cut.

dtemirbulatov marked 2 inline comments as done.Mar 29 2021, 5:40 PM

Harbormaster completed remote builds in B96225: Diff 334018.Mar 29 2021, 5:49 PM

Ping, ready to land?

In D57779#2667679, @xbolva00 wrote:

Ping, ready to land?

Will review it on Monday.

In D57779#2667704, @ABataev wrote:

In D57779#2667679, @xbolva00 wrote:

Ping, ready to land?

Will review it on Monday.

I found an error in reduceSchedulingRegion() implementation. I am reworking the change.

Rebased, fixed incorrect comment at 2358, fixed the wrong implementation of shrink scheduling region, changed the code in tryToVectorizeList() as suggested by @ABataev.

dtemirbulatov added inline comments.Apr 8 2021, 8:33 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5579	Perhaps we could also check here for !BS->getScheduleData(I)->isPartOfBundle() and further shrink the region.

dtemirbulatov added inline comments.Apr 8 2021, 8:35 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5579	ah, no, this instruction could belong to a real gather node.

Harbormaster completed remote builds in B97744: Diff 336118.Apr 8 2021, 9:11 AM

Slightly improved the scheduler region shrinking algorithm by allowing removal of unnecessary unmaps in instruction chains.

Harbormaster completed remote builds in B97848: Diff 336272.Apr 8 2021, 5:50 PM

Rebased and formatted. Noticed 3 test cases affected after @ABataev landed D100495 "Add detection of shuffled/perfect matching of tree entries." Restored the "-slp-throttle" flag so that AArch64/gather-cost.ll stays functional. Manually adjusted "TMP" in PR31243_sext in minimum-sizes.ll, probably because of a bug in update_test_checks.py.

Herald added subscribers: kerbowa, nhaehnle, jvesely.Apr 25 2021, 5:40 PM

Harbormaster completed remote builds in B100843: Diff 340406.Apr 25 2021, 6:26 PM

Fixed two format errors.

Harbormaster completed remote builds in B100858: Diff 340425.Apr 25 2021, 9:45 PM

RKSimon added inline comments.Apr 26 2021, 7:34 AM

llvm/test/Transforms/SLPVectorizer/X86/uitofp.ll
683 ↗	(On Diff #340425)	what happened to these checks?

Updated llvm/test/Transforms/SLPVectorizer/X86/uitofp.ll checks on request from @RKSimon

dtemirbulatov marked an inline comment as done.Apr 26 2021, 8:07 AM

In D57779#2716824, @dtemirbulatov wrote:

Updated llvm/test/Transforms/SLPVectorizer/X86/uitofp.ll checks on request from @RKSimon

@RKSimon , I have to split AVX256NODQ X86/sitofp.ll and maybe others.

ABataev added inline comments.Apr 26 2021, 8:14 AM

llvm/test/Transforms/SLPVectorizer/X86/arith-fix.ll
357–361 ↗	(On Diff #340530)	Looks like it does not respect `MinTreeSize` option anymore. And it is strange that such code sequence gets profitable for vectorization (scalar cost is 8, vector cost is 9)

Harbormaster completed remote builds in B100939: Diff 340530.Apr 26 2021, 9:13 AM

Rebased. Forbade "detection of shuffled/perfect matching of tree entries" for TreeEntries canceled during throttling, and replaced TEVectorizableSet with PriorityQueue.

Harbormaster completed remote builds in B102213: Diff 342283.May 2 2021, 3:48 PM

Fix formatting.

Harbormaster completed remote builds in B102223: Diff 342296.May 2 2021, 6:39 PM

ABataev added inline comments.May 3 2021, 5:45 AM

llvm/test/Transforms/SLPVectorizer/X86/powof2div.ll
85–91	Still looks like it does not respect mintreesize

dtemirbulatov added inline comments.May 4 2021, 7:00 PM

llvm/test/Transforms/SLPVectorizer/X86/powof2div.ll
85–91	Hmm, this is not the case here. The tree height is 5, and the divide node cost is 20. After deleting this node, extracting from the "add" node costs 4 and inserting after the scalar divide costs 4, and the final tree cost is -4. llvm-mca for -mattr=+avx shows 1305 cycles before and 1609 cycles after.
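The arithmetic in this reply can be checked mechanically if we assume the costs combine additively; the thread does not spell out the formula, so that is an assumption. Under it, the quoted figures imply a full-tree cost of 8 before the divide node is dropped. That 8 is inferred here, not stated in the review, and `reducedTreeCost` is a hypothetical helper.

```cpp
// Cost of the reduced tree under an additive model: drop the canceled node's
// cost, then pay for extracting its operands and re-inserting its results.
constexpr int reducedTreeCost(int FullTreeCost, int CanceledNodeCost,
                              int ExtractCost, int InsertCost) {
  return FullTreeCost - CanceledNodeCost + ExtractCost + InsertCost;
}

// 20, 4, 4 and -4 are the figures from the comment; 8 is the inferred
// full-tree cost that makes them consistent.
static_assert(reducedTreeCost(8, 20, 4, 4) == -4,
              "matches the final tree cost quoted in the comment");
```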

Added check for current tree size to MinTreeSize before making the decision to vectorize.

Harbormaster completed remote builds in B102679: Diff 342959.May 5 2021, 1:44 AM

1) Fixed an issue in getInsertCost(): I incorrectly added gather costs to nodes that were not related to any node proposed for vectorization. I had thought of this and previously used "ScalarToTreeEntry.count(Op) > 0", but I discovered that I am not updating ScalarToTreeEntry while reducing the tree.
2) Now I check isTreeTinyAndNotFullyVectorizable() before deciding to vectorize.
3) I introduced the "MinVecNodes" parameter, which sets the minimum number of vectorizable nodes we want to keep while throttling; it currently defaults to 2. For example, if the tree has 3 nodes in total and satisfies MinTreeSize, at least two of them must remain vectorizable after reducing the tree for the decision to be positive.

Harbormaster completed remote builds in B102986: Diff 343392.May 6 2021, 7:24 AM

In D57779#2741906, @dtemirbulatov wrote:

1) Fixed an issue in getInsertCost(): I incorrectly added gather costs to nodes that were not related to any node proposed for vectorization. I had thought of this and previously used "ScalarToTreeEntry.count(Op) > 0", but I discovered that I am not updating ScalarToTreeEntry while reducing the tree.
2) Now I check isTreeTinyAndNotFullyVectorizable() before deciding to vectorize.
3) I introduced the "MinVecNodes" parameter, which sets the minimum number of vectorizable nodes we want to keep while throttling; it currently defaults to 2. For example, if the tree has 3 nodes in total and satisfies MinTreeSize, at least two of them must remain vectorizable after reducing the tree for the decision to be positive.

Why do we need MinVecNodes? MinTreeSize and all associated analysis must be enough

Why do we need MinVecNodes? MinTreeSize and all associated analysis must be enough

it is Transforms/SLPVectorizer/X86/tiny-tree.ll transform that scared me.
From:
define void @tiny_tree_not_fully_vectorizable(double* noalias nocapture %dst, double* noalias nocapture readonly %src, i64 %count) #0 {
entry:

%cmp12 = icmp eq i64 %count, 0
br i1 %cmp12, label %for.end, label %for.body

for.body: ; preds = %entry, %for.body

%i.015 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
%dst.addr.014 = phi double* [ %add.ptr4, %for.body ], [ %dst, %entry ]
%src.addr.013 = phi double* [ %add.ptr, %for.body ], [ %src, %entry ]
%0 = load double, double* %src.addr.013, align 8
store double %0, double* %dst.addr.014, align 8
%arrayidx2 = getelementptr inbounds double, double* %src.addr.013, i64 2
%1 = load double, double* %arrayidx2, align 8
%arrayidx3 = getelementptr inbounds double, double* %dst.addr.014, i64 1
store double %1, double* %arrayidx3, align 8
%add.ptr = getelementptr inbounds double, double* %src.addr.013, i64 %i.015
%add.ptr4 = getelementptr inbounds double, double* %dst.addr.014, i64 %i.015
%inc = add i64 %i.015, 1
%exitcond = icmp eq i64 %inc, %count
br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body, %entry

ret void

}
to:
define void @tiny_tree_not_fully_vectorizable(double* noalias nocapture %dst, double* noalias nocapture readonly %src, i64 %count) #0 {
entry:

%cmp12 = icmp eq i64 %count, 0
br i1 %cmp12, label %for.end, label %for.body

for.body: ; preds = %for.body, %entry

%i.015 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
%dst.addr.014 = phi double* [ %add.ptr4, %for.body ], [ %dst, %entry ]
%src.addr.013 = phi double* [ %add.ptr, %for.body ], [ %src, %entry ]
%0 = load double, double* %src.addr.013, align 8
%arrayidx2 = getelementptr inbounds double, double* %src.addr.013, i64 2
%1 = load double, double* %arrayidx2, align 8
%arrayidx3 = getelementptr inbounds double, double* %dst.addr.014, i64 1
%2 = insertelement <2 x double> poison, double %0, i32 0
%3 = insertelement <2 x double> %2, double %1, i32 1
%4 = bitcast double* %dst.addr.014 to <2 x double>*
store <2 x double> %3, <2 x double>* %4, align 8
%add.ptr = getelementptr inbounds double, double* %src.addr.013, i64 %i.015
%add.ptr4 = getelementptr inbounds double, double* %dst.addr.014, i64 %i.015
%inc = add i64 %i.015, 1
%exitcond = icmp eq i64 %inc, %count
br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body, %entry

ret void

}

but now with llvm-mca with -mattr=+corei7-avx, I see the change from 1111 to 1014 cycles, so it looks good. I will check other cases.

In D57779#2743946, @dtemirbulatov wrote:
Why do we need MinVecNodes? MinTreeSize and all associated analysis must be enough

it is Transforms/SLPVectorizer/X86/tiny-tree.ll transform that scared me.
From:
define void @tiny_tree_not_fully_vectorizable(double* noalias nocapture %dst, double* noalias nocapture readonly %src, i64 %count) #0 {
entry:
%cmp12 = icmp eq i64 %count, 0
br i1 %cmp12, label %for.end, label %for.body
for.body: ; preds = %entry, %for.body
%i.015 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
%dst.addr.014 = phi double* [ %add.ptr4, %for.body ], [ %dst, %entry ]
%src.addr.013 = phi double* [ %add.ptr, %for.body ], [ %src, %entry ]
%0 = load double, double* %src.addr.013, align 8
store double %0, double* %dst.addr.014, align 8
%arrayidx2 = getelementptr inbounds double, double* %src.addr.013, i64 2
%1 = load double, double* %arrayidx2, align 8
%arrayidx3 = getelementptr inbounds double, double* %dst.addr.014, i64 1
store double %1, double* %arrayidx3, align 8
%add.ptr = getelementptr inbounds double, double* %src.addr.013, i64 %i.015
%add.ptr4 = getelementptr inbounds double, double* %dst.addr.014, i64 %i.015
%inc = add i64 %i.015, 1
%exitcond = icmp eq i64 %inc, %count
br i1 %exitcond, label %for.end, label %for.body
for.end: ; preds = %for.body, %entry
ret void
}
to:
define void @tiny_tree_not_fully_vectorizable(double* noalias nocapture %dst, double* noalias nocapture readonly %src, i64 %count) #0 {
entry:
%cmp12 = icmp eq i64 %count, 0
br i1 %cmp12, label %for.end, label %for.body
for.body: ; preds = %for.body, %entry
%i.015 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
%dst.addr.014 = phi double* [ %add.ptr4, %for.body ], [ %dst, %entry ]
%src.addr.013 = phi double* [ %add.ptr, %for.body ], [ %src, %entry ]
%0 = load double, double* %src.addr.013, align 8
%arrayidx2 = getelementptr inbounds double, double* %src.addr.013, i64 2
%1 = load double, double* %arrayidx2, align 8
%arrayidx3 = getelementptr inbounds double, double* %dst.addr.014, i64 1
%2 = insertelement <2 x double> poison, double %0, i32 0
%3 = insertelement <2 x double> %2, double %1, i32 1
%4 = bitcast double* %dst.addr.014 to <2 x double>*
store <2 x double> %3, <2 x double>* %4, align 8
%add.ptr = getelementptr inbounds double, double* %src.addr.013, i64 %i.015
%add.ptr4 = getelementptr inbounds double, double* %dst.addr.014, i64 %i.015
%inc = add i64 %i.015, 1
%exitcond = icmp eq i64 %inc, %count
br i1 %exitcond, label %for.end, label %for.body
for.end: ; preds = %for.body, %entry
ret void
}

but now with llvm-mca with -mattr=+corei7-avx, I see the change from 1111 to 1014 cycles, so it looks good. I will check other cases.

If so, it just means that our min-tree-size analysis is too strict and must be fixed in general, but not by introducing some new throttling-specific options. We may have the same situation without throttling.

Rebased. Removed the SLP parameter MinVecNodes. Added estimations of a good tree reduction:
1) If the tree contained real operations (binary, arithmetic, calls) that were proposed for vectorization, we don't want to reduce it to just load and store operations in vectorized form.
2) If the tree doesn't have any real operations, we have to make sure that at least the root node and the node next to the root are going to be vectorized.
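The two rules described in this update can be written down as a small predicate. This is one reading of the description, not the patch's code: `NodeKind`, `ToyNode`, and `isAcceptableReduction` are hypothetical, node 0 is assumed to be the root, and node 1 is assumed to be the node next to it.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class NodeKind { Load, Store, RealOp }; // RealOp: binary, arithmetic, call

struct ToyNode {
  NodeKind Kind;
  bool Vectorized; // false once the node has been canceled by throttling
};

// Nodes[0] is the root and Nodes[1] the node next to it (an assumption).
bool isAcceptableReduction(const std::vector<ToyNode> &Nodes) {
  assert(!Nodes.empty() && "tree must have a root");
  auto IsReal = [](const ToyNode &N) { return N.Kind == NodeKind::RealOp; };
  if (std::any_of(Nodes.begin(), Nodes.end(), IsReal)) {
    // Rule 1: a tree that had real operations must keep at least one of them
    // vectorized, so it does not degrade to vector loads and stores only.
    return std::any_of(Nodes.begin(), Nodes.end(), [&](const ToyNode &N) {
      return IsReal(N) && N.Vectorized;
    });
  }
  // Rule 2: without real operations, the root and its neighbour must both
  // stay vectorized.
  return Nodes.size() >= 2 && Nodes[0].Vectorized && Nodes[1].Vectorized;
}
```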

Harbormaster completed remote builds in B105050: Diff 346180.May 18 2021, 12:17 PM

Formatting.

Harbormaster completed remote builds in B105196: Diff 346402.May 19 2021, 4:33 AM

Allen added a subscriber: Allen.May 20 2021, 10:16 PM

Rebased. I switched to a path-aware tree reduction approach: we start from the leaves of a vectorizable tree and move toward its root.
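One plausible reading of "from the leaves toward the root" is a post-order walk, where a node is considered for cancellation only after all of its operands. The sketch below shows just that ordering; the patch's actual path-aware selection is not shown in this part of the thread, and `ChildList`, `visitLeavesFirst`, and `cancellationOrder` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Children[N] lists the operand nodes of node N in a toy tree.
using ChildList = std::vector<std::vector<std::size_t>>;

// Post-order walk: every child is emitted before its parent.
void visitLeavesFirst(const ChildList &Children, std::size_t Node,
                      std::vector<std::size_t> &Order) {
  assert(Node < Children.size());
  for (std::size_t Child : Children[Node])
    visitLeavesFirst(Children, Child, Order);
  Order.push_back(Node);
}

// Order in which nodes would be considered for cancellation, leaves first.
std::vector<std::size_t> cancellationOrder(const ChildList &Children,
                                           std::size_t Root) {
  std::vector<std::size_t> Order;
  visitLeavesFirst(Children, Root, Order);
  return Order;
}
```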

Harbormaster completed remote builds in B123830: Diff 372463.Sep 14 2021, 6:39 AM

dtemirbulatov mentioned this in D110623: [SLP] Avoid calculating expensive spill cost when it is not required.Sep 28 2021, 6:16 AM

Current status? Review stalled?

Herald added a project: Restricted Project.Sep 7 2022, 12:55 PM

Herald added a subscriber: pcwang-thead.

dtemirbulatov abandoned this revision.Sep 17 2022, 4:23 PM

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	502 lines
llvm/test/Transforms/SLPVectorizer/X86/powof2div.ll	47 lines
llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll	20 lines

Diff 255123

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

#include "llvm/ADT/None.h"		#include "llvm/ADT/None.h"
#include "llvm/ADT/Optional.h"		#include "llvm/ADT/Optional.h"
#include "llvm/ADT/PostOrderIterator.h"		#include "llvm/ADT/PostOrderIterator.h"
#include "llvm/ADT/STLExtras.h"		#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"		#include "llvm/ADT/SetVector.h"
#include "llvm/ADT/SmallBitVector.h"		#include "llvm/ADT/SmallBitVector.h"
#include "llvm/ADT/SmallPtrSet.h"		#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallSet.h"		#include "llvm/ADT/SmallSet.h"
		#include "llvm/ADT/SmallString.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
#include "llvm/ADT/iterator.h"		#include "llvm/ADT/iterator.h"
#include "llvm/ADT/iterator_range.h"		#include "llvm/ADT/iterator_range.h"
#include "llvm/Analysis/AliasAnalysis.h"		#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/CodeMetrics.h"		#include "llvm/Analysis/CodeMetrics.h"
#include "llvm/Analysis/DemandedBits.h"		#include "llvm/Analysis/DemandedBits.h"
#include "llvm/Analysis/GlobalsModRef.h"		#include "llvm/Analysis/GlobalsModRef.h"
[73 unchanged lines hidden]	llvm::RunSLPVectorization("vectorize-slp", cl::init(false), cl::Hidden,
cl::desc("Run the SLP vectorization passes"));		cl::desc("Run the SLP vectorization passes"));

static cl::opt<int>		static cl::opt<int>
SLPCostThreshold("slp-threshold", cl::init(0), cl::Hidden,		SLPCostThreshold("slp-threshold", cl::init(0), cl::Hidden,
cl::desc("Only vectorize if you gain more than this "		cl::desc("Only vectorize if you gain more than this "
"number "));		"number "));

static cl::opt<bool>		static cl::opt<bool>
ShouldVectorizeHor("slp-vectorize-hor", cl::init(true), cl::Hidden,		ShouldVectorizeHor("slp-vectorize-hor", cl::init(true), cl::Hidden,
cl::desc("Attempt to vectorize horizontal reductions"));		cl::desc("Attempt to vectorize horizontal reductions"));

		static cl::opt<bool>
		SLPThrottling("slp-throttle", cl::init(true), cl::Hidden,
		cl::desc("Enable tree partial vectorize with throttling"));

		static cl::opt<unsigned>
		MaxCostsRecalculations("slp-throttling-budget", cl::init(32), cl::Hidden,
		cl::desc("Limit the total number of nodes for cost "
		"recalculations during throttling"));

		ABataev: Tabs are added
static cl::opt<bool> ShouldStartVectorizeHorAtStore(		static cl::opt<bool> ShouldStartVectorizeHorAtStore(
"slp-vectorize-hor-store", cl::init(false), cl::Hidden,		"slp-vectorize-hor-store", cl::init(false), cl::Hidden,
cl::desc(		cl::desc(
"Attempt to vectorize horizontal reductions feeding into a store"));		"Attempt to vectorize horizontal reductions feeding into a store"));

static cl::opt<int>		static cl::opt<int>
MaxVectorRegSizeOption("slp-max-reg-size", cl::init(128), cl::Hidden,		MaxVectorRegSizeOption("slp-max-reg-size", cl::init(128), cl::Hidden,
cl::desc("Attempt to vectorize for this register size in bits"));		cl::desc("Attempt to vectorize for this register size in bits"));

		ABataev: Do we really need both of these options? `MaxCostsRecalculations` should be enough.
		dtemirbulatov: ok, thanks.
static cl::opt<int>		static cl::opt<int>
MaxStoreLookup("slp-max-store-lookup", cl::init(32), cl::Hidden,		MaxStoreLookup("slp-max-store-lookup", cl::init(32), cl::Hidden,
cl::desc("Maximum depth of the lookup for consecutive stores."));		cl::desc("Maximum depth of the lookup for consecutive stores."));

/// Limits the size of scheduling regions in a block.		/// Limits the size of scheduling regions in a block.
/// It avoid long compile times for _very_ large blocks where vector		/// It avoid long compile times for _very_ large blocks where vector
/// instructions are spread over a wide range.		/// instructions are spread over a wide range.
/// This limit is way higher than needed by real-world functions.		/// This limit is way higher than needed by real-world functions.
[425 unchanged lines hidden]	public:

/// Vectorize the tree but with the list of externally used values \p		/// Vectorize the tree but with the list of externally used values \p
/// ExternallyUsedValues. Values in this MapVector can be replaced but the		/// ExternallyUsedValues. Values in this MapVector can be replaced but the
/// generated extractvalue instructions.		/// generated extractvalue instructions.
Value *vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues);		Value *vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues);

/// \returns the cost incurred by unwanted spills and fills, caused by		/// \returns the cost incurred by unwanted spills and fills, caused by
/// holding live values over call sites.		/// holding live values over call sites.
int getSpillCost() const;		int getSpillCost();

		/// \returns the cost extracting vectorized elements.
		int getExtractCost() const;

		/// \returns the cost of gathering canceled elements to be used
		/// by vectorized operations during throttling.
		int getInsertCost();

		/// Find a subtree of the whole tree suitable to be vectorized. When
		/// vectorizing the whole tree is not profitable, we can consider vectorizing
		/// part of that tree. The SLP algorithm looks for operations to vectorize,
		/// starting from seed instructions at the bottom and following chains of
		/// dependencies toward the top of the SLP graph; it groups potentially
		/// vectorizable scalar operations into bundles.
		/// For example:
		ABataev (Not Done): `PriorityQueue`?
		dtemirbulatov (Author, Done): Hmm, `PriorityQueue` allows duplicate elements, and that might be an issue.
		///
		///          <bundle 1> scalar form
		///                |
		/// <bundle 2> scalar form   <bundle 3> scalar form
		///                \             /
		///           <seed root bundle> scalar form
		///
		/// Total cost is not profitable to vectorize, hence all operations are in
		/// scalar form.
		///
		/// Here is the same tree after SLP throttling transformation:
		///
		///          <bundle 1> vector form
		///                |
		ABataev (Not Done): Does "scalar form" mean "gathered nodes"? I don't think we can currently end up in the situation shown in the picture: a gathered node cannot depend on another node (either gathered or vectorized).
		/// <bundle 2> vector form   <bundle 3> scalar form
		///                \             /
		///           <seed root bundle> vector form
		///
		/// So, we can throttle some operations in such a way that it is still
		/// profitable to vectorize part of the tree, while vectorizing the whole
		/// tree does not make sense.
		/// More details: http://www.llvm.org/devmtg/2015-10/slides/Porpodas-ThrottlingAutomaticVectorization.pdf
		xbolva00 (Done): Please mention the paper name: “Throttling Automatic Vectorization: When Less Is More”, https://www.cl.cam.ac.uk/~tmj32/papers/docs/porpodas15-pact.pdf. Slides are good, but paper is paper :)
		bool findSubTree(int UserCost = 0);
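
The throttling idea described in the comment above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the patch's `findSubTree()`: the tree is flattened into a list, entry 0 plays the role of the seed root bundle, dependencies between bundles are ignored, and `ToyEntry`, `totalCost` and `throttle` are hypothetical names. It assumes, as in the SLP vectorizer's profitability check, that a tree is worth vectorizing when its cost is below the negated threshold.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a tree entry (bundle).
struct ToyEntry {
  ToyEntry(int VecCost, int GatherCost)
      : VecCost(VecCost), GatherCost(GatherCost), Vectorize(true) {}
  int VecCost;    // Cost delta if the bundle is vectorized (negative = gain).
  int GatherCost; // Cost of keeping it scalar and gathering it for its user.
  bool Vectorize; // Current decision for this bundle.
};

// Sum the cost of the current selection.
int totalCost(const std::vector<ToyEntry> &Tree) {
  int Cost = 0;
  for (const ToyEntry &E : Tree)
    Cost += E.Vectorize ? E.VecCost : E.GatherCost;
  return Cost;
}

// Greedily switch the most expensive non-root bundles back to scalar form
// until the remaining selection is profitable. Returns false if no
// profitable selection is found.
bool throttle(std::vector<ToyEntry> &Tree, int Threshold) {
  while (totalCost(Tree) >= -Threshold) {
    std::size_t Worst = 0; // 0 means "nothing found"; the root is never dropped.
    int BestGain = 0;
    for (std::size_t I = 1; I < Tree.size(); ++I) {
      if (!Tree[I].Vectorize)
        continue;
      int Gain = Tree[I].VecCost - Tree[I].GatherCost;
      if (Gain > BestGain) {
        BestGain = Gain;
        Worst = I;
      }
    }
    if (Worst == 0)
      return false; // No remaining bundle improves the cost when dropped.
    Tree[Worst].Vectorize = false;
  }
  return true;
}
```

With costs chosen so that one bundle dominates (for instance a root and two bundles with small gains plus a third bundle with a large positive cost), the loop drops only that third bundle, which mirrors the "bundle 3 stays scalar" picture in the comment.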

		/// Get raw summary of all elements of the tree.
		int getRawTreeCost();

/// \returns the vectorization cost of the subtree that starts at \p VL.		/// \returns the vectorization cost of the subtree that starts at \p VL.
/// A negative number means that this is profitable.		/// A negative number means that this is profitable.
int getTreeCost();		int getTreeCost();

/// Construct a vectorizable tree that starts at \p Roots, ignoring users for		/// Construct a vectorizable tree that starts at \p Roots, ignoring users for
/// the purpose of scheduling and extraction in the \p UserIgnoreLst.		/// the purpose of scheduling and extraction in the \p UserIgnoreLst.
void buildTree(ArrayRef<Value *> Roots,		void buildTree(ArrayRef<Value *> Roots,
ArrayRef<Value *> UserIgnoreLst = None);		ArrayRef<Value *> UserIgnoreLst = None);

/// Construct a vectorizable tree that starts at \p Roots, ignoring users for		/// Construct a vectorizable tree that starts at \p Roots, ignoring users for
/// the purpose of scheduling and extraction in the \p UserIgnoreLst taking		/// the purpose of scheduling and extraction in the \p UserIgnoreLst taking
/// into account (and updating it, if required) list of externally used		/// into account (and updating it, if required) list of externally used
/// values stored in \p ExternallyUsedValues.		/// values stored in \p ExternallyUsedValues.
void buildTree(ArrayRef<Value *> Roots,		void buildTree(ArrayRef<Value *> Roots,
ExtraValueToDebugLocsMap &ExternallyUsedValues,		ExtraValueToDebugLocsMap &ExternallyUsedValues,
ArrayRef<Value *> UserIgnoreLst = None);		ArrayRef<Value *> UserIgnoreLst = None);

/// Clear the internal data structures that are created by 'buildTree'.		/// Clear the internal data structures that are created by 'buildTree'.
void deleteTree() {		void deleteTree() {
VectorizableTree.clear();		VectorizableTree.clear();
ScalarToTreeEntry.clear();		ScalarToTreeEntry.clear();
MustGather.clear();		MustGather.clear();
ExternalUses.clear();		ExternalUses.clear();
		InternalTreeUses.clear();
		RemovedOperations.clear();
NumOpsWantToKeepOrder.clear();		NumOpsWantToKeepOrder.clear();
NumOpsWantToKeepOriginalOrder = 0;		NumOpsWantToKeepOriginalOrder = 0;
for (auto &Iter : BlocksSchedules) {		for (auto &Iter : BlocksSchedules) {
BlockScheduling *BS = Iter.second.get();		BlockScheduling *BS = Iter.second.get();
BS->clear();		BS->clear();
}		}
MinBWs.clear();		MinBWs.clear();
		ScalarsToVec.clear();
		VecToScalars.clear();
		VecInserts.clear();
		NoCallInst = true;
		RawTreeCost = 0;
		IsCostSumReady = false;
}		}
		ABataev (Not Done): Why do we need to save intermediate results? Can't this be solved in a single loop, without saving the intermediate results in the class instance?
		dtemirbulatov (Author, Done): I noticed many regressions if we decide right away, and rebuilding the tree afterward is expensive.
		ABataev (Not Done): What is the cause of those regressions? If I understand correctly, you're just trying to find the subtree, exclude its cost, compare, and repeat if it is not profitable. What prevents doing this in a loop without saving intermediate results in the class, keeping the result in stack vectors if needed?
		dtemirbulatov (Author, Done): For example, we might be able to partially vectorize in vectorizeStoreChain(), while later it is possible to fully vectorize the same tree with tryToVectorizeList() or tryToReduce().
		ABataev (Not Done): Could you give an example, please?

unsigned getTreeSize() const { return VectorizableTree.size(); }		unsigned getTreeSize() const { return VectorizableTree.size(); }

/// Perform LICM and CSE on the newly generated gather sequences.		/// Perform LICM and CSE on the newly generated gather sequences.
void optimizeGatherSequence();		void optimizeGatherSequence();
		ABataev (Done): Do you really need to return `Optional` here? Maybe just return `-SLPCostThreshold` if not successful?
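
The sentinel suggestion in this comment can be sketched as follows. This is a minimal illustration, not code from the patch: `subTreeCostOrSentinel` and `isProfitable` are hypothetical helpers, and the sketch assumes a tree counts as profitable only when its cost is strictly below the negated threshold, so returning the negated threshold itself always reads as "not profitable".

```cpp
// Hypothetical helper: report the cost of the subtree that was found, or a
// sentinel equal to the negated threshold when no subtree was found.
int subTreeCostOrSentinel(bool Found, int Cost, int Threshold) {
  return Found ? Cost : -Threshold;
}

// Assumed profitability rule: strictly below the negated threshold.
bool isProfitable(int Cost, int Threshold) { return Cost < -Threshold; }
```

Because the sentinel sits exactly on the boundary, callers can keep a single comparison and do not need to unwrap an optional value first.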

/// \returns The best order of instructions for vectorization.		/// \returns The best order of instructions for vectorization.
Optional<ArrayRef<unsigned>> bestOrder() const {		Optional<ArrayRef<unsigned>> bestOrder() const {
auto I = std::max_element(		auto I = std::max_element(
NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(),		NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(),
[](const decltype(NumOpsWantToKeepOrder)::value_type &D1,		[](const decltype(NumOpsWantToKeepOrder)::value_type &D1,
const decltype(NumOpsWantToKeepOrder)::value_type &D2) {		const decltype(NumOpsWantToKeepOrder)::value_type &D2) {
return D1.second < D2.second;		return D1.second < D2.second;
▲ Show 20 Lines • Show All 42 Lines • ▼ Show 20 Lines	public:
/// can be load combined in the backend. Load combining may not be allowed in		/// can be load combined in the backend. Load combining may not be allowed in
/// the IR optimizer, so we do not want to alter the pattern. For example,		/// the IR optimizer, so we do not want to alter the pattern. For example,
/// partially transforming a scalar bswap() pattern into vector code is		/// partially transforming a scalar bswap() pattern into vector code is
/// effectively impossible for the backend to undo.		/// effectively impossible for the backend to undo.
/// TODO: If load combining is allowed in the IR optimizer, this analysis		/// TODO: If load combining is allowed in the IR optimizer, this analysis
/// may not be necessary.		/// may not be necessary.
bool isLoadCombineReductionCandidate(unsigned ReductionOpcode) const;		bool isLoadCombineReductionCandidate(unsigned ReductionOpcode) const;

		/// Try to cut the tree to make it partially vectorizable.
		bool cutTree();

OptimizationRemarkEmitter *getORE() { return ORE; }		OptimizationRemarkEmitter *getORE() { return ORE; }

/// This structure holds any data we need about the edges being traversed		/// This structure holds any data we need about the edges being traversed
/// during buildTree_rec(). We keep track of:		/// during buildTree_rec(). We keep track of:
/// (i) the user TreeEntry index, and		/// (i) the user TreeEntry index, and
/// (ii) the index of the edge.		/// (ii) the index of the edge.
struct EdgeInfo {		struct EdgeInfo {
EdgeInfo() = default;		EdgeInfo() = default;
▲ Show 20 Lines • Show All 765 Lines • ▼ Show 20 Lines	struct TreeEntry {

/// A vector of scalars.		/// A vector of scalars.
ValueList Scalars;		ValueList Scalars;

/// The Scalars are vectorized into this value. It is initialized to Null.		/// The Scalars are vectorized into this value. It is initialized to Null.
Value *VectorizedValue = nullptr;		Value *VectorizedValue = nullptr;

/// Do we need to gather this sequence ?		/// Do we need to gather this sequence ?
enum EntryState { Vectorize, NeedToGather };		enum EntryState { Vectorize, NeedToGather, ProposedToGather };
EntryState State;		EntryState State;

		ABataev (Not Done): Could you split the patch and commit this part of the change (I mean, using the enum instead of bool) as a separate NFC patch?
/// Does this sequence require some shuffling?		/// Does this sequence require some shuffling?
SmallVector<unsigned, 4> ReuseShuffleIndices;		SmallVector<unsigned, 4> ReuseShuffleIndices;

/// Does this entry require reordering?		/// Does this entry require reordering?
ArrayRef<unsigned> ReorderIndices;		ArrayRef<unsigned> ReorderIndices;

		/// Cost of this tree entry.
		int Cost = 0;

/// Points back to the VectorizableTree.		/// Points back to the VectorizableTree.
///		///
/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has		/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has
/// to be a pointer and needs to be able to initialize the child iterator.		/// to be a pointer and needs to be able to initialize the child iterator.
/// Thus we need a reference back to the container to translate the indices		/// Thus we need a reference back to the container to translate the indices
/// to entries.		/// to entries.
VecTreeTy &Container;		VecTreeTy &Container;

/// The TreeEntry index containing the user of this entry. We can actually		/// The TreeEntry index containing the user of this entry. We can actually
/// have multiple users so the data structure is not truly a tree.		/// have multiple users so the data structure is not truly a tree.
SmallVector<EdgeInfo, 1> UserTreeIndices;		SmallVector<EdgeInfo, 1> UserTreeIndices;

		/// Use of this entry.
		TinyPtrVector<TreeEntry *> UseEntries;

/// The index of this treeEntry in VectorizableTree.		/// The index of this treeEntry in VectorizableTree.
int Idx = -1;		int Idx = -1;

private:		private:
/// The operands of each instruction in each lane Operands[op_index][lane].		/// The operands of each instruction in each lane Operands[op_index][lane].
/// Note: This helps avoid the replication of the code that performs the		/// Note: This helps avoid the replication of the code that performs the
/// reordering of operands during buildTree_rec() and vectorizeTree().		/// reordering of operands during buildTree_rec() and vectorizeTree().
SmallVector<ValueList, 2> Operands;		SmallVector<ValueList, 2> Operands;
▲ Show 20 Lines • Show All 114 Lines • ▼ Show 20 Lines	LLVM_DUMP_METHOD void dump() const {
dbgs() << "State: ";		dbgs() << "State: ";
switch (State) {		switch (State) {
case Vectorize:		case Vectorize:
dbgs() << "Vectorize\n";		dbgs() << "Vectorize\n";
break;		break;
case NeedToGather:		case NeedToGather:
dbgs() << "NeedToGather\n";		dbgs() << "NeedToGather\n";
break;		break;
		case ProposedToGather:
		dbgs() << "ProposedToGather\n";
		break;
}		}
dbgs() << "MainOp: ";		dbgs() << "MainOp: ";
if (MainOp)		if (MainOp)
dbgs() << *MainOp << "\n";		dbgs() << *MainOp << "\n";
else		else
dbgs() << "NULL\n";		dbgs() << "NULL\n";
dbgs() << "AltOp: ";		dbgs() << "AltOp: ";
if (AltOp)		if (AltOp)
▲ Show 20 Lines • Show All 54 Lines • ▼ Show 20 Lines	if (Vectorized) {
++Lane;		++Lane;
}		}
assert((!Bundle.getValue() || Lane == VL.size()) &&		assert((!Bundle.getValue() || Lane == VL.size()) &&
"Bundle and VL out of sync");		"Bundle and VL out of sync");
} else {		} else {
MustGather.insert(VL.begin(), VL.end());		MustGather.insert(VL.begin(), VL.end());
}		}

if (UserTreeIdx.UserTE)		if (UserTreeIdx.UserTE) {
Last->UserTreeIndices.push_back(UserTreeIdx);		Last->UserTreeIndices.push_back(UserTreeIdx);
		VectorizableTree[UserTreeIdx.UserTE->Idx]->UseEntries.push_back(Last);
		}

return Last;		return Last;
}		}

/// -- Vectorization State --		/// -- Vectorization State --
/// Holds all of the tree entries.		/// Holds all of the tree entries.
TreeEntry::VecTreeTy VectorizableTree;		TreeEntry::VecTreeTy VectorizableTree;

Show All 19 Lines	const TreeEntry *getTreeEntry(Value *V) const {
if (I != ScalarToTreeEntry.end())		if (I != ScalarToTreeEntry.end())
return I->second;		return I->second;
return nullptr;		return nullptr;
}		}

/// Maps a specific scalar to its tree entry.		/// Maps a specific scalar to its tree entry.
SmallDenseMap<Value*, TreeEntry *> ScalarToTreeEntry;		SmallDenseMap<Value*, TreeEntry *> ScalarToTreeEntry;

		/// Tree entries that should not be vectorized due to throttling.
		SmallVector<TreeEntry *, 2> RemovedOperations;

		/// Tree values proposed to be vectorized.
		ValueSet ScalarsToVec;

		/// Tree values once considered to be vectorized, but later with throttling
		/// decided to stay in a scalar form.
		ValueSet VecToScalars;

/// A list of scalars that we found that we need to keep as scalars.		/// A list of scalars that we found that we need to keep as scalars.
ValueSet MustGather;		ValueSet MustGather;

		/// Total cost of inserts in the tree for a particular value.
		SmallDenseMap<Value*, int> VecInserts;

		/// Raw cost of all elements in the tree.
		int RawTreeCost = 0;

		/// Indicate that no CallInst found in the tree and we don't need to calculate
		/// spill cost.
		bool NoCallInst = true;

		/// True if we have calculated the tree cost for the tree.
		bool IsCostSumReady = false;

/// This POD struct describes one external user in the vectorized tree.		/// This POD struct describes one external user in the vectorized tree.
struct ExternalUser {		struct ExternalUser {
ExternalUser(Value *S, llvm::User *U, int L)		ExternalUser(Value *S, llvm::User *U, int L)
: Scalar(S), User(U), Lane(L) {}		: Scalar(S), User(U), Lane(L) {}

// Which scalar in our function.		// Which scalar in our function.
Value *Scalar;		Value *Scalar;

// Which user that uses the scalar.		// Which user that uses the scalar.
llvm::User *User;		llvm::User *User;

// Which lane does the scalar belong to.		// Which lane does the scalar belong to.
int Lane;		int Lane;
};		};
using UserList = SmallVector<ExternalUser, 16>;		using UserList = SmallVector<ExternalUser, 16>;

		/// \returns the cost of extracting the vectorized elements.
		int getExtractOperationCost(const ExternalUser &EU) const;

/// Checks if two instructions may access the same memory.		/// Checks if two instructions may access the same memory.
///		///
/// \p Loc1 is the location of \p Inst1. It is passed explicitly because it		/// \p Loc1 is the location of \p Inst1. It is passed explicitly because it
/// is invariant in the calling loop.		/// is invariant in the calling loop.
bool isAliased(const MemoryLocation &Loc1, Instruction *Inst1,		bool isAliased(const MemoryLocation &Loc1, Instruction *Inst1,
Instruction *Inst2) {		Instruction *Inst2) {
// First check if the result is already in the cache.		// First check if the result is already in the cache.
AliasCacheKey key = std::make_pair(Inst1, Inst2);		AliasCacheKey key = std::make_pair(Inst1, Inst2);
Show All 34 Lines	#endif
DenseMap<Instruction *, bool> DeletedInstructions;		DenseMap<Instruction *, bool> DeletedInstructions;

/// A list of values that need to extracted out of the tree.		/// A list of values that need to extracted out of the tree.
/// This list holds pairs of (Internal Scalar : External User). External User		/// This list holds pairs of (Internal Scalar : External User). External User
/// can be nullptr, it means that this Internal Scalar will be used later,		/// can be nullptr, it means that this Internal Scalar will be used later,
/// after vectorization.		/// after vectorization.
UserList ExternalUses;		UserList ExternalUses;

		/// Current operations width to vectorize.
		unsigned BundleWidth = 0;

		/// Internal (in-tree) uses of values proposed to be vectorized.
		SmallDenseMap<Value *, UserList> InternalTreeUses;

/// Values used only by @llvm.assume calls.		/// Values used only by @llvm.assume calls.
SmallPtrSet<const Value *, 32> EphValues;		SmallPtrSet<const Value *, 32> EphValues;

/// Holds all of the instructions that we gathered.		/// Holds all of the instructions that we gathered.
SetVector<Instruction *> GatherSeq;		SetVector<Instruction *> GatherSeq;

/// A list of blocks that we are going to CSE.		/// A list of blocks that we are going to CSE.
SetVector<BasicBlock *> CSEBlocks;		SetVector<BasicBlock *> CSEBlocks;
▲ Show 20 Lines • Show All 384 Lines • ▼ Show 20 Lines	struct BlockScheduling {
// Make sure that the initial SchedulingRegionID is greater than the		// Make sure that the initial SchedulingRegionID is greater than the
// initial SchedulingRegionID in ScheduleData (which is 0).		// initial SchedulingRegionID in ScheduleData (which is 0).
int SchedulingRegionID = 1;		int SchedulingRegionID = 1;
};		};

/// Attaches the BlockScheduling structures to basic blocks.		/// Attaches the BlockScheduling structures to basic blocks.
MapVector<BasicBlock *, std::unique_ptr<BlockScheduling>> BlocksSchedules;		MapVector<BasicBlock *, std::unique_ptr<BlockScheduling>> BlocksSchedules;

		/// Remove operations from the list of proposed to schedule.
		void removeFromScheduling(BlockScheduling *BS);

/// Performs the "real" scheduling. Done before vectorization is actually		/// Performs the "real" scheduling. Done before vectorization is actually
/// performed in a basic block.		/// performed in a basic block.
void scheduleBlock(BlockScheduling *BS);		void scheduleBlock(BlockScheduling *BS);

/// List of users to ignore during scheduling and that don't need extracting.		/// List of users to ignore during scheduling and that don't need extracting.
ArrayRef<Value *> UserIgnoreList;		ArrayRef<Value *> UserIgnoreList;

using OrdersType = SmallVector<unsigned, 4>;		using OrdersType = SmallVector<unsigned, 4>;
Show All 24 Lines	#endif
/// Contains orders of operations along with the number of bundles that have		/// Contains orders of operations along with the number of bundles that have
/// operations in this order. It stores only those orders that require		/// operations in this order. It stores only those orders that require
/// reordering, if reordering is not required it is counted using \a		/// reordering, if reordering is not required it is counted using \a
/// NumOpsWantToKeepOriginalOrder.		/// NumOpsWantToKeepOriginalOrder.
DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> NumOpsWantToKeepOrder;		DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> NumOpsWantToKeepOrder;
/// Number of bundles that do not require reordering.		/// Number of bundles that do not require reordering.
unsigned NumOpsWantToKeepOriginalOrder = 0;		unsigned NumOpsWantToKeepOriginalOrder = 0;

// Analysis and block reference.		// Analysis and block reference.
		ABataev (Not Done): Why store a pointer to `TreeState` rather than the `TreeState` itself?
		dtemirbulatov (Author, Done): TreeState is a large structure, so it is convenient to allocate it dynamically, but a static version might be faster. Do you think it is critical?
		ABataev (Done): Reduce the number of preallocated elements to, say, 2 or 4 and store the elements directly.
		ABataev (Not Done): Why `unique_ptr` again? Why not a `TreeState` directly? Just `SmallVector<TreeState, 2>;`
		dtemirbulatov (Author, Done): ok
		dtemirbulatov (Author, Done): Hmm. TreeState is too complex; we don't have to make it movable or copyable. Why is unique_ptr not good here?
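
The trade-off discussed in this thread can be shown with standard containers. This is a sketch, not the patch's `TreeState`: `ToyTreeState` and `makeStates` are hypothetical, and `std::vector` stands in for `SmallVector`. A type that is neither copyable nor movable cannot be stored by value in a growable vector, but it can be stored behind `std::unique_ptr`, at the cost of one heap allocation per element.

```cpp
#include <cstddef>
#include <memory>
#include <type_traits>
#include <vector>

// Hypothetical stand-in for a large state object that is intentionally
// neither copyable nor movable.
struct ToyTreeState {
  explicit ToyTreeState(int Cost) : Cost(Cost) {}
  ToyTreeState(const ToyTreeState &) = delete;
  ToyTreeState &operator=(const ToyTreeState &) = delete;
  int Cost;
};

// Storing ToyTreeState by value in a growable container would require it to
// be move-constructible, which it is not.
static_assert(!std::is_move_constructible<ToyTreeState>::value,
              "ToyTreeState is expected to be non-movable");

// Storing owning pointers works, because only the pointers are moved when
// the container grows.
std::vector<std::unique_ptr<ToyTreeState>> makeStates(std::size_t N) {
  std::vector<std::unique_ptr<ToyTreeState>> States;
  for (std::size_t I = 0; I < N; ++I)
    States.push_back(
        std::unique_ptr<ToyTreeState>(new ToyTreeState(static_cast<int>(I))));
  return States;
}
```

Storing the elements directly, as the reviewer suggests, would instead require giving the state type move operations.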
Function *F;		Function *F;
ScalarEvolution *SE;		ScalarEvolution *SE;
TargetTransformInfo *TTI;		TargetTransformInfo *TTI;
TargetLibraryInfo *TLI;		TargetLibraryInfo *TLI;
AliasAnalysis *AA;		AliasAnalysis *AA;
LoopInfo *LI;		LoopInfo *LI;
DominatorTree *DT;		DominatorTree *DT;
AssumptionCache *AC;		AssumptionCache *AC;
DemandedBits *DB;		DemandedBits *DB;
const DataLayout *DL;		const DataLayout *DL;
OptimizationRemarkEmitter *ORE;		OptimizationRemarkEmitter *ORE;

unsigned MaxVecRegSize; // This is set by TTI or overridden by cl::opt.		unsigned MaxVecRegSize; // This is set by TTI or overridden by cl::opt.
unsigned MinVecRegSize; // Set by cl::opt (default: 128).		unsigned MinVecRegSize; // Set by cl::opt (default: 128).

/// Instruction builder to construct the vectorized tree.		/// Instruction builder to construct the vectorized tree.
IRBuilder<> Builder;		IRBuilder<> Builder;

/// A map of scalar integer values to the smallest bit width with which they		/// A map of scalar integer values to the smallest bit width with which they
		ABataev (Not Done): Tabs
		dtemirbulatov (Author, Done): This is now part of the TreeState structure; this is LLVM's standard format (clang-format).
/// can legally be represented. The values map to (width, signed) pairs,		/// can legally be represented. The values map to (width, signed) pairs,
/// where "width" indicates the minimum bit width and "signed" is True if the		/// where "width" indicates the minimum bit width and "signed" is True if the
/// value must be signed-extended, rather than zero-extended, back to its		/// value must be signed-extended, rather than zero-extended, back to its
		ABataev (Done): Why do you need this new set? You can get the result just by using the `ScalarToTreeEntry` data member and checking the vectorization status of the corresponding `TreeEntry`.
/// original width.		/// original width.
MapVector<Value *, std::pair<uint64_t, bool>> MinBWs;		MapVector<Value *, std::pair<uint64_t, bool>> MinBWs;
};		};

		ABataev (Done): Seems to me this is the same story as with `ScalarsToVec`.
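
The reviewer's point about `ScalarsToVec` (and, by the same argument, `VecToScalars`) can be sketched with a toy model. The names below are hypothetical and scalars are represented by integer ids; the sketch only assumes what the patch shows, namely a map from scalars to tree entries and a per-entry state enum. Under that assumption, membership in a separate "to be vectorized" set can be derived from the map and the entry state.

```cpp
#include <unordered_map>

// Mirrors the three states used by the patch's TreeEntry::EntryState.
enum class ToyState { Vectorize, NeedToGather, ProposedToGather };

// Hypothetical stand-in for a tree entry.
struct ToyTreeEntry {
  ToyState State;
};

// Hypothetical stand-in for ScalarToTreeEntry, keyed by a scalar id.
using ToyScalarMap = std::unordered_map<int, const ToyTreeEntry *>;

// A scalar is still proposed for vectorization if it belongs to an entry
// whose state is Vectorize; no extra set is needed to answer that.
bool isProposedForVectorization(const ToyScalarMap &Map, int ScalarId) {
  auto It = Map.find(ScalarId);
  return It != Map.end() && It->second->State == ToyState::Vectorize;
}

// A scalar was throttled if its entry was switched to ProposedToGather.
bool wasThrottled(const ToyScalarMap &Map, int ScalarId) {
  auto It = Map.find(ScalarId);
  return It != Map.end() && It->second->State == ToyState::ProposedToGather;
}
```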
} // end namespace slpvectorizer		} // end namespace slpvectorizer

template <> struct GraphTraits<BoUpSLP *> {		template <> struct GraphTraits<BoUpSLP *> {
using TreeEntry = BoUpSLP::TreeEntry;		using TreeEntry = BoUpSLP::TreeEntry;

/// NodeRef has to be a pointer per the GraphWriter.		/// NodeRef has to be a pointer per the GraphWriter.
		ABataev (Done): Currently not used.
		dtemirbulatov (Author, Done): Thanks, sorry for this, I missed it somehow.
using NodeRef = TreeEntry *;		using NodeRef = TreeEntry *;

using ContainerTy = BoUpSLP::TreeEntry::VecTreeTy;		using ContainerTy = BoUpSLP::TreeEntry::VecTreeTy;

/// Add the VectorizableTree to the index iterator to be able to return		/// Add the VectorizableTree to the index iterator to be able to return
/// TreeEntry pointers.		/// TreeEntry pointers.
struct ChildIteratorType		struct ChildIteratorType
: public iterator_adaptor_base<		: public iterator_adaptor_base<
▲ Show 20 Lines • Show All 114 Lines • ▼ Show 20 Lines	void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
ArrayRef<Value *> UserIgnoreLst) {		ArrayRef<Value *> UserIgnoreLst) {
deleteTree();		deleteTree();
UserIgnoreList = UserIgnoreLst;		UserIgnoreList = UserIgnoreLst;
if (!allSameType(Roots))		if (!allSameType(Roots))
return;		return;
buildTree_rec(Roots, 0, EdgeInfo());		buildTree_rec(Roots, 0, EdgeInfo());

// Collect the values that we need to extract from the tree.		// Collect the values that we need to extract from the tree.
for (auto &TEPtr : VectorizableTree) {		for (std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
TreeEntry *Entry = TEPtr.get();		TreeEntry *Entry = TEPtr.get();

// No need to handle users of gathered values.		// No need to handle users of gathered values.
if (Entry->State == TreeEntry::NeedToGather)		if (Entry->State == TreeEntry::NeedToGather)
continue;		continue;

// For each lane:		// For each lane:
for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {		for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
Show All 24 Lines	for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
Value *UseScalar = UseEntry->Scalars[0];		Value *UseScalar = UseEntry->Scalars[0];
// Some in-tree scalars will remain as scalar in vectorized		// Some in-tree scalars will remain as scalar in vectorized
// instructions. If that is the case, the one in Lane 0 will		// instructions. If that is the case, the one in Lane 0 will
// be used.		// be used.
if (UseScalar != U ||		if (UseScalar != U ||
!InTreeUserNeedToExtract(Scalar, UserInst, TLI)) {		!InTreeUserNeedToExtract(Scalar, UserInst, TLI)) {
LLVM_DEBUG(dbgs() << "SLP: \tInternal user will be removed:" << *U		LLVM_DEBUG(dbgs() << "SLP: \tInternal user will be removed:" << *U
<< ".\n");		<< ".\n");
assert(UseEntry->State != TreeEntry::NeedToGather && "Bad state");		assert(UseEntry->State != TreeEntry::NeedToGather && "Bad state");
		InternalTreeUses[U].emplace_back(Scalar, U, FoundLane);
continue;		continue;
		ABataev (Not Done): Looks like an unrelated change.
}		}
}		}

// Ignore users in the user ignore list.		// Ignore users in the user ignore list.
if (is_contained(UserIgnoreList, UserInst))		if (is_contained(UserIgnoreList, UserInst))
continue;		continue;

LLVM_DEBUG(dbgs() << "SLP: Need to extract:" << *U << " from lane "		LLVM_DEBUG(dbgs() << "SLP: Need to extract:" << *U << " from lane "
▲ Show 20 Lines • Show All 690 Lines • ▼ Show 20 Lines	default:
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
LLVM_DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");		LLVM_DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");
return;		return;
}		}
}		}

		bool BoUpSLP::cutTree() {
		SmallVector<TreeEntry *, 4> VecNodes;

		// Estimate the subtree not just from a cost perspective, but functional.
		bool FoundRealOp = false;
		for (const std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
		TreeEntry *Entry = TEPtr.get();
		ABataev (Not Done): You don't need to push the elements to a new vector here; instead, you can directly perform the required actions.
		if (Entry->State != TreeEntry::Vectorize)
		continue;
		Instruction *Inst = Entry->getMainOp();
		if (Inst && (isa<BinaryOperator>(Inst) || isa<FPMathOperator>(Inst) ||
		ABataev (Not Done): Should this loop be executed only for `ProposedToGather` `Entry`s?
		dtemirbulatov (Author, Done): Yes, but we have to know the lane there.
		ABataev (Not Done): I mean that the check in the previous `if` must be `Entry->State == TreeEntry::ProposedToGather`, not `!= TreeEntry::Vectorize`.
		isa<CmpInst>(Inst))) {
		FoundRealOp = true;
		ABataev (Done): These two loops can be merged, no? And use `switch` instead of `if`, if possible, after merging.
		dtemirbulatov (Author, Done): Thanks.
		break;
		}
		ABataev (Done): What if the user does not have a corresponding tree entry, i.e. it is initially scalar? What if the `Scalar` itself is going to remain scalar?
		dtemirbulatov (Author, Done): "What if the Scalar itself is going to remain scalar?" At this point, the decision to cut the tree has been made, and the Scalar can only be one that was intended to be vectorized. Note that at 3295 we ignore any tree entries whose State is not TreeEntry::Vectorize. "What if the user does not have corresponding tree entry, i.e. it is initially scalar?" Ah, yes. I have to check for !UserTE at 3305 and just continue if it is true.
		}
		if (!FoundRealOp)
		ABataev (Done): Could you compare it with the similar code in BoUpSLP::buildTree? It looks like you still missed some cases of user ignoring.
		dtemirbulatov (Author, Done): I think those ignoring cases are related to the fact that we are doing full vectorization in BoUpSLP::buildTree, so we can avoid extracting for in-tree users. Here we have to extract for each user of a value that was once proposed to be vectorized.
		dtemirbulatov (Author, Done): "And here we have to extract to each user of once proposed to vectorized value." I mean for the partial vectorization.
		return false;
		ABataev (Not Done): This does nothing except for debugging print, guard with `#ifndef NDEBUG .. #endif`.
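		The guard requested above can be sketched in isolation as follows. This is a minimal illustration and not code from the patch: `Entry` and `sumCosts` are hypothetical stand-ins, and `std::cerr` takes the place of `LLVM_DEBUG(dbgs() << ...)`.

```cpp
#include <iostream>
#include <vector>

// Hypothetical stand-in for a tree entry; not the LLVM TreeEntry type.
struct Entry {
  int Cost;
};

// Sums the entry costs. The second loop only prints, so it is compiled
// out entirely when NDEBUG is defined (i.e. in release builds).
int sumCosts(const std::vector<Entry> &Entries) {
  int Sum = 0;
  for (const Entry &E : Entries)
    Sum += E.Cost;
#ifndef NDEBUG
  for (const Entry &E : Entries)
    std::cerr << "entry cost: " << E.Cost << "\n";
#endif
  return Sum;
}
```

		With the guard in place, a release build does not even iterate over the container, whereas a loop whose body contains only a debug macro may still be left for the optimizer to remove.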

		ABataev (Done): Why do you need to compare flow and operation instructions count? Also, why use hardcoded `3` as a limit of vectorizable nodes?
		dtemirbulatov (Author, Done): I noticed that from the cost perspective it is good to vectorize the subtree, but on the benchmarking side it introduces regressions. Maybe this is a known issue where partial vectorization prevents full vectorization later on. For example, if I remove this limiter at 3271, for the Transforms/SLPVectorizer/X86/PR39774.ll testcase I would get:
		--- a/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll
		+++ b/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll
		@@ -7,55 +7,65 @@ define void @test(i32) {
		 ; CHECK-NEXT: entry:
		 ; CHECK-NEXT: br label [[LOOP:%.*]]
		 ; CHECK: loop:
		-; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP15:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
		-; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> undef, <8 x i32> <i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
		-; CHECK-NEXT: [[TMP2:%.*]] = extractelement <8 x i32> [[SHUFFLE]], i32 1
		-; CHECK-NEXT: [[TMP3:%.*]] = add <8 x i32> [[SHUFFLE]], <i32 0, i32 55, i32 285, i32 1240, i32 1496, i32 8555, i32 12529, i32 13685>
		-; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <8 x i32> [[TMP3]], <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
		-; CHECK-NEXT: [[BIN_RDX:%.*]] = and <8 x i32> [[TMP3]], [[RDX_SHUF]]
		-; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <8 x i32> [[BIN_RDX]], <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
		-; CHECK-NEXT: [[BIN_RDX2:%.*]] = and <8 x i32> [[BIN_RDX]], [[RDX_SHUF1]]
		-; CHECK-NEXT: [[RDX_SHUF3:%.*]] = shufflevector <8 x i32> [[BIN_RDX2]], <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
		-; CHECK-NEXT: [[BIN_RDX4:%.*]] = and <8 x i32> [[BIN_RDX2]], [[RDX_SHUF3]]
		-; CHECK-NEXT: [[TMP4:%.*]] = extractelement <8 x i32> [[BIN_RDX4]], i32 0
		-; CHECK-NEXT: [[OP_EXTRA:%.*]] = and i32 [[TMP4]], [[TMP0:%.*]]
		-; CHECK-NEXT: [[OP_EXTRA5:%.*]] = and i32 [[OP_EXTRA]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA6:%.*]] = and i32 [[OP_EXTRA5]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA7:%.*]] = and i32 [[OP_EXTRA6]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA8:%.*]] = and i32 [[OP_EXTRA7]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA9:%.*]] = and i32 [[OP_EXTRA8]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA10:%.*]] = and i32 [[OP_EXTRA9]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA11:%.*]] = and i32 [[OP_EXTRA10]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA12:%.*]] = and i32 [[OP_EXTRA11]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA13:%.*]] = and i32 [[OP_EXTRA12]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA14:%.*]] = and i32 [[OP_EXTRA13]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA15:%.*]] = and i32 [[OP_EXTRA14]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA16:%.*]] = and i32 [[OP_EXTRA15]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA17:%.*]] = and i32 [[OP_EXTRA16]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA18:%.*]] = and i32 [[OP_EXTRA17]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA19:%.*]] = and i32 [[OP_EXTRA18]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA20:%.*]] = and i32 [[OP_EXTRA19]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA21:%.*]] = and i32 [[OP_EXTRA20]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA22:%.*]] = and i32 [[OP_EXTRA21]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA23:%.*]] = and i32 [[OP_EXTRA22]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA24:%.*]] = and i32 [[OP_EXTRA23]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA25:%.*]] = and i32 [[OP_EXTRA24]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA26:%.*]] = and i32 [[OP_EXTRA25]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA27:%.*]] = and i32 [[OP_EXTRA26]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA28:%.*]] = and i32 [[OP_EXTRA27]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA29:%.*]] = and i32 [[OP_EXTRA28]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA30:%.*]] = and i32 [[OP_EXTRA29]], [[TMP0]]
		-; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> undef, i32 [[OP_EXTRA30]], i32 0
		-; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x i32> [[TMP5]], i32 14910, i32 1
		-; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x i32> undef, i32 [[TMP2]], i32 0
		-; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x i32> [[TMP7]], i32 [[TMP2]], i32 1
		-; CHECK-NEXT: [[TMP9:%.*]] = and <2 x i32> [[TMP6]], [[TMP8]]
		-; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i32> [[TMP6]], [[TMP8]]
		-; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> [[TMP10]], <2 x i32> <i32 0, i32 3>
		-; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x i32> [[TMP11]], i32 0
		-; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x i32> undef, i32 [[TMP12]], i32 0
		-; CHECK-NEXT: [[TMP14:%.*]] = extractelement <2 x i32> [[TMP11]], i32 1
		-; CHECK-NEXT: [[TMP15]] = insertelement <2 x i32> [[TMP13]], i32 [[TMP14]], i32 1
		+; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP19:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
		+; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0
		+; CHECK-NEXT: [[VAL_0:%.*]] = add i32 [[TMP2]], 0
		+; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1
		+; CHECK-NEXT: [[VAL_1:%.*]] = and i32 [[TMP3]], [[VAL_0]]
		+; CHECK-NEXT: [[VAL_2:%.*]] = and i32 [[VAL_1]], [[TMP0:%.*]]
		+; CHECK-NEXT: [[VAL_3:%.*]] = and i32 [[VAL_2]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_4:%.*]] = and i32 [[VAL_3]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_5:%.*]] = and i32 [[VAL_4]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_6:%.*]] = add i32 [[TMP3]], 55
		+; CHECK-NEXT: [[VAL_7:%.*]] = and i32 [[VAL_5]], [[VAL_6]]
		+; CHECK-NEXT: [[VAL_8:%.*]] = and i32 [[VAL_7]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_9:%.*]] = and i32 [[VAL_8]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_10:%.*]] = and i32 [[VAL_9]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_11:%.*]] = add i32 [[TMP3]], 285
		+; CHECK-NEXT: [[VAL_12:%.*]] = and i32 [[VAL_10]], [[VAL_11]]
		+; CHECK-NEXT: [[VAL_13:%.*]] = and i32 [[VAL_12]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_14:%.*]] = and i32 [[VAL_13]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_15:%.*]] = and i32 [[VAL_14]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_16:%.*]] = and i32 [[VAL_15]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_17:%.*]] = and i32 [[VAL_16]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_18:%.*]] = add i32 [[TMP3]], 1240
		+; CHECK-NEXT: [[VAL_19:%.*]] = and i32 [[VAL_17]], [[VAL_18]]
		+; CHECK-NEXT: [[VAL_20:%.*]] = add i32 [[TMP3]], 1496
		+; CHECK-NEXT: [[VAL_21:%.*]] = and i32 [[VAL_19]], [[VAL_20]]
		+; CHECK-NEXT: [[VAL_22:%.*]] = and i32 [[VAL_21]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_23:%.*]] = and i32 [[VAL_22]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_24:%.*]] = and i32 [[VAL_23]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_25:%.*]] = and i32 [[VAL_24]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_26:%.*]] = and i32 [[VAL_25]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_27:%.*]] = and i32 [[VAL_26]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_28:%.*]] = and i32 [[VAL_27]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_29:%.*]] = and i32 [[VAL_28]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_30:%.*]] = and i32 [[VAL_29]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_31:%.*]] = and i32 [[VAL_30]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_32:%.*]] = and i32 [[VAL_31]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_33:%.*]] = and i32 [[VAL_32]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_34:%.*]] = add i32 [[TMP3]], 8555
		+; CHECK-NEXT: [[VAL_35:%.*]] = and i32 [[VAL_33]], [[VAL_34]]
		+; CHECK-NEXT: [[VAL_36:%.*]] = and i32 [[VAL_35]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_37:%.*]] = and i32 [[VAL_36]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_38:%.*]] = and i32 [[VAL_37]], [[TMP0]]
		+; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x i32> undef, i32 [[TMP3]], i32 0
		+; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> [[TMP4]], i32 [[TMP3]], i32 1
		+; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i32> [[TMP5]], <i32 12529, i32 13685>
		+; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x i32> [[TMP6]], i32 0
		+; CHECK-NEXT: [[VAL_40:%.*]] = and i32 [[VAL_38]], [[TMP7]]
		+; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i32> [[TMP6]], i32 1
		+; CHECK-NEXT: [[TMP9:%.*]] = insertelement <2 x i32> undef, i32 [[VAL_40]], i32 0
		+; CHECK-NEXT: [[TMP10:%.*]] = insertelement <2 x i32> [[TMP9]], i32 14910, i32 1
		+; CHECK-NEXT: [[TMP11:%.*]] = insertelement <2 x i32> undef, i32 [[TMP8]], i32 0
		+; CHECK-NEXT: [[TMP12:%.*]] = insertelement <2 x i32> [[TMP11]], i32 [[TMP3]], i32 1
		+; CHECK-NEXT: [[TMP13:%.*]] = and <2 x i32> [[TMP10]], [[TMP12]]
		+; CHECK-NEXT: [[TMP14:%.*]] = add <2 x i32> [[TMP10]], [[TMP12]]
		+; CHECK-NEXT: [[TMP15:%.*]] = shufflevector <2 x i32> [[TMP13]], <2 x i32> [[TMP14]], <2 x i32> <i32 0, i32 3>
		+; CHECK-NEXT: [[TMP16:%.*]] = extractelement <2 x i32> [[TMP15]], i32 0
		+; CHECK-NEXT: [[TMP17:%.*]] = insertelement <2 x i32> undef, i32 [[TMP16]], i32 0
		+; CHECK-NEXT: [[TMP18:%.*]] = extractelement <2 x i32> [[TMP15]], i32 1
		+; CHECK-NEXT: [[TMP19]] = insertelement <2 x i32> [[TMP17]], i32 [[TMP18]], i32 1
		and we can see the disappearance of
		[[TMP3:%.*]] = add <8 x i32> [[SHUFFLE]], <i32 0, i32 55, i32 285, i32 1240, i32 1496, i32 8555, i32 12529, i32 13685>
		I have to recheck those regressions again on cpu2006.
		for (std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
		TreeEntry *Entry = TEPtr.get();
		if (Entry->State == TreeEntry::Vectorize)
		ABataev (Not Done): Either just `cast` without `if` or `dyn_cast`.
		VecNodes.push_back(Entry);
		}
		if (VecNodes.size() <= 2)
		return false;
		// Canceling unprofitable elements.
		ABataev (Done): Looks like the `Scalar` should be extracted only if its user is vectorized and it remains scalar in the vectorized tree, or if it is not going to be vectorized.
		dtemirbulatov (Author, Done): Thanks, good catch. I think I first need to populate the ExternalUser list while the TreeEntries are still marked ProposedToGather, and then change them to NeedToGather nodes in a separate loop.
		for (std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
		TreeEntry *Entry = TEPtr.get();
		if (Entry->State == TreeEntry::NeedToGather)
		ABataev (Not Done): I actually don't see propagation for `ProposedToGather`, and these loops can be merged, no?
		dtemirbulatov (Author, Done): No, it is not possible to merge those two loops.
		ABataev (Not Done): Why?
		dtemirbulatov (Author, Done): For example, in the first loop we could change Entry1 from TreeEntry::ProposedToGather to TreeEntry::NeedToGather, but later we could encounter another use of Entry1 from another entry (let's say Entry2) with TreeEntry::Vectorize status, and we could not tell the difference between a just-canceled item and an entry that was never considered for vectorization. Thus ExternalUses would not be properly populated.
		ABataev (Not Done): The first loop does not change the state of the tree entries.
		dtemirbulatov (Author, Done): I mean if we merge them into one loop.
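		One way to read the author's argument is sketched below on a deliberately simplified model. `Node`, `State`, and `countCanceledUses` are hypothetical names that do not appear in the patch, and users are plain indices rather than LLVM values. The point it tries to show is only the ordering hazard: if a node is relabeled in the same pass that inspects it, a later node can no longer tell that it was freshly canceled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical three-state model of a tree entry.
enum class State { Vectorize, ProposedToGather, NeedToGather };

struct Node {
  State S;
  std::vector<std::size_t> Users; // indices of nodes that use this node
};

// Pass 1 records uses of still-vectorized nodes by canceled
// (ProposedToGather) nodes while that label is intact.
// Pass 2 then relabels the canceled nodes as NeedToGather.
int countCanceledUses(std::vector<Node> &Nodes) {
  int Uses = 0;
  for (const Node &N : Nodes) {
    if (N.S != State::Vectorize)
      continue;
    for (std::size_t U : N.Users) {
      assert(U < Nodes.size() && "user index out of range");
      if (Nodes[U].S == State::ProposedToGather)
        ++Uses;
    }
  }
  for (Node &N : Nodes)
    if (N.S == State::ProposedToGather)
      N.S = State::NeedToGather;
  return Uses;
}
```

		In this model, a single merged loop that relabels node 0 before visiting node 1 would see node 0 as an ordinary NeedToGather node and record no use. Whether the real patch has this dependency is exactly what the reviewer is questioning.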
		continue;
		ABataev (Done): The scalars are not actually removed here.
		dtemirbulatov (Author, Done):
		> The scalars are not actually removed here.
		Yes, but I was thinking that putting this print in findSubTree() would produce too much noise, and we might not get a tree to vectorize in the end. I think it is probably better to print all of a TreeEntry's scalars at once, saying that we are not going to vectorize those operations.
		if (Entry->State == TreeEntry::ProposedToGather) {
		Entry->State = TreeEntry::NeedToGather;
		for (Value *V : Entry->Scalars) {
		LLVM_DEBUG(dbgs() << "SLP: Remove scalar " << *V
		<< " out of proposed to vectorize.\n");
		}
		}
		}
		// For all canceled operations we should consider the possibility of
		// use by with non-canceled operations and for that, it requires
		// to populate ExternalUser list with canceled elements.
		for (TreeEntry *Entry : VecNodes)
		for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
		ABataev (Not Done): Can this occur at all?
		Value *Scalar = Entry->Scalars[Lane];
		for (User *U : Scalar->users()) {
		LLVM_DEBUG(dbgs() << "SLP: Checking user:" << *U << ".\n");
		if (!VecToScalars.count(U))
		continue;
		// Ignore users in the user ignore list.
		auto *UserInst = cast<Instruction>(U);
		if (is_contained(UserIgnoreList, UserInst))
		continue;
		LLVM_DEBUG(dbgs() << "SLP: Need to extract canceled operation :" << *U
		<< " from lane " << Lane << " from " << *Scalar
		<< ".\n");
		ExternalUses.emplace_back(Scalar, U, Lane);
		}
		}
		return true;
		}
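		The loop above calls `is_contained(UserIgnoreList, UserInst)` once per user, which is a linear scan each time; elsewhere in this review the same call is flagged as `O(n)` with a suggestion to use a set. A minimal sketch of that suggestion, using standard containers and integer IDs instead of LLVM types (`countNonIgnored` is a hypothetical helper, not part of the patch):

```cpp
#include <unordered_set>
#include <vector>

// Counts the users that are not on the ignore list. The set is built
// once, so each membership test is O(1) on average instead of a
// linear scan over Ignore for every user.
int countNonIgnored(const std::vector<int> &Users,
                    const std::vector<int> &Ignore) {
  std::unordered_set<int> IgnoreSet(Ignore.begin(), Ignore.end());
  int Count = 0;
  for (int U : Users)
    if (IgnoreSet.count(U) == 0)
      ++Count;
  return Count;
}
```

		In LLVM itself the natural counterpart would presumably be a `SmallPtrSet` built from the ignore list before the loop; that choice is an assumption here, not something the patch shows.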

unsigned BoUpSLP::canMapToVector(Type *T, const DataLayout &DL) const {		unsigned BoUpSLP::canMapToVector(Type *T, const DataLayout &DL) const {
unsigned N = 1;		unsigned N = 1;
Type *EltTy = T;		Type *EltTy = T;

while (isa<StructType>(EltTy) \|\| isa<SequentialType>(EltTy)) {		while (isa<StructType>(EltTy) \|\| isa<SequentialType>(EltTy)) {
if (auto *ST = dyn_cast<StructType>(EltTy)) {		if (auto *ST = dyn_cast<StructType>(EltTy)) {
// Check that struct is homogeneous.		// Check that struct is homogeneous.
for (const auto *Ty : ST->elements())		for (const auto *Ty : ST->elements())
[140 lines not shown]	int BoUpSLP::getEntryCost(TreeEntry *E) {

unsigned ReuseShuffleNumbers = E->ReuseShuffleIndices.size();		unsigned ReuseShuffleNumbers = E->ReuseShuffleIndices.size();
bool NeedToShuffleReuses = !E->ReuseShuffleIndices.empty();		bool NeedToShuffleReuses = !E->ReuseShuffleIndices.empty();
int ReuseShuffleCost = 0;		int ReuseShuffleCost = 0;
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
ReuseShuffleCost =		ReuseShuffleCost =
TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, VecTy);		TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, VecTy);
}		}
if (E->State == TreeEntry::NeedToGather) {		if (E->State != TreeEntry::Vectorize) {
if (allConstant(VL))		if (allConstant(VL))
return 0;		return 0;
if (isSplat(VL)) {		if (isSplat(VL)) {
return ReuseShuffleCost +		return ReuseShuffleCost +
TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy, 0);		TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy, 0);
}		}
if (E->getOpcode() == Instruction::ExtractElement &&		if (E->getOpcode() == Instruction::ExtractElement &&
allSameType(VL) && allSameBlock(VL)) {		allSameType(VL) && allSameBlock(VL)) {
[349 lines not shown]	bool BoUpSLP::isFullyVectorizableTinyTree() const {
// Handle splat and all-constants stores.		// Handle splat and all-constants stores.
if (VectorizableTree[0]->State == TreeEntry::Vectorize &&		if (VectorizableTree[0]->State == TreeEntry::Vectorize &&
(allConstant(VectorizableTree[1]->Scalars) \|\|		(allConstant(VectorizableTree[1]->Scalars) \|\|
isSplat(VectorizableTree[1]->Scalars)))		isSplat(VectorizableTree[1]->Scalars)))
return true;		return true;

// Gathering cost would be too much for tiny trees.		// Gathering cost would be too much for tiny trees.
if (VectorizableTree[0]->State == TreeEntry::NeedToGather \|\|		if (VectorizableTree[0]->State == TreeEntry::NeedToGather \|\|
VectorizableTree[1]->State == TreeEntry::NeedToGather)		VectorizableTree[1]->State == TreeEntry::NeedToGather)
return false;		return false;
		ABataev (Done): Maybe better to use `!= TreeEntry::Vectorize` to avoid trees with proposed gathering?

return true;		return true;
}		}
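		The reviewer's suggestion in `isFullyVectorizableTinyTree()` comes down to how a two-state check behaves once a third state exists. A small sketch, with a hypothetical `State` enum mirroring the three states discussed in this review and hypothetical helper names:

```cpp
// Hypothetical mirror of the three entry states discussed in the review.
enum class State { Vectorize, NeedToGather, ProposedToGather };

// Written when only two states existed: misses ProposedToGather.
bool isGatherNarrow(State S) { return S == State::NeedToGather; }

// Treats every state other than Vectorize as gather-like.
bool isGatherWide(State S) { return S != State::Vectorize; }
```

		With the narrow form, an entry in the `ProposedToGather` state would be treated like a vectorized one; the wide form keeps the check correct if further non-vectorized states are added later.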

bool BoUpSLP::isLoadCombineReductionCandidate(unsigned RdxOpcode) const {		bool BoUpSLP::isLoadCombineReductionCandidate(unsigned RdxOpcode) const {
if (RdxOpcode != Instruction::Or)		if (RdxOpcode != Instruction::Or)
return false;		return false;

[45 lines not shown]	assert(VectorizableTree.empty()
? ExternalUses.empty()		? ExternalUses.empty()
: true && "We shouldn't have any external users");		: true && "We shouldn't have any external users");

// Otherwise, we can't vectorize the tree. It is both tiny and not fully		// Otherwise, we can't vectorize the tree. It is both tiny and not fully
// vectorizable.		// vectorizable.
return true;		return true;
}		}

int BoUpSLP::getSpillCost() const {		int BoUpSLP::getSpillCost() {
// Walk from the bottom of the tree to the top, tracking which values are		// Walk from the bottom of the tree to the top, tracking which values are
// live. When we see a call instruction that is not part of our tree,		// live. When we see a call instruction that is not part of our tree,
// query TTI to see if there is a cost to keeping values live over it		// query TTI to see if there is a cost to keeping values live over it
// (for example, if spills and fills are required).		// (for example, if spills and fills are required).
unsigned BundleWidth = VectorizableTree.front()->Scalars.size();
int Cost = 0;		int Cost = 0;

SmallPtrSet<Instruction*, 4> LiveValues;		SmallPtrSet<Instruction*, 4> LiveValues;
Instruction *PrevInst = nullptr;		Instruction *PrevInst = nullptr;

for (const auto &TEPtr : VectorizableTree) {		for (const std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
Instruction *Inst = dyn_cast<Instruction>(TEPtr->Scalars[0]);		Instruction *Inst = dyn_cast<Instruction>(TEPtr->Scalars[0]);
if (!Inst)		if (!Inst)
continue;		continue;

if (!PrevInst) {		if (!PrevInst) {
PrevInst = Inst;		PrevInst = Inst;
continue;		continue;
}		}

// Update LiveValues.		// Update LiveValues.
LiveValues.erase(PrevInst);		LiveValues.erase(PrevInst);
for (auto &J : PrevInst->operands()) {		for (auto &J : PrevInst->operands()) {
if (isa<Instruction>(&J) && getTreeEntry(&J))		if (isa<Instruction>(&J) && ScalarsToVec.count(&J))
LiveValues.insert(cast<Instruction>(&*J));		LiveValues.insert(cast<Instruction>(&*J));
}		}

LLVM_DEBUG({		LLVM_DEBUG({
dbgs() << "SLP: #LV: " << LiveValues.size();		dbgs() << "SLP: #LV: " << LiveValues.size();
for (auto *X : LiveValues)		for (auto *X : LiveValues)
dbgs() << " " << X->getName();		dbgs() << " " << X->getName();
dbgs() << ", Looking at ";		dbgs() << ", Looking at ";
[11 lines not shown]	while (InstIt != PrevInstIt) {
continue;		continue;
}		}

// Debug information does not impact spill cost.		// Debug information does not impact spill cost.
if ((isa<CallInst>(&*PrevInstIt) &&		if ((isa<CallInst>(&*PrevInstIt) &&
!isa<DbgInfoIntrinsic>(&*PrevInstIt)) &&		!isa<DbgInfoIntrinsic>(&*PrevInstIt)) &&
&*PrevInstIt != PrevInst)		&*PrevInstIt != PrevInst)
NumCalls++;		NumCalls++;

++PrevInstIt;		++PrevInstIt;
}		}

if (NumCalls) {		if (NumCalls) {
		NoCallInst = false;
SmallVector<Type*, 4> V;		SmallVector<Type*, 4> V;
for (auto *II : LiveValues)		for (auto *II : LiveValues)
V.push_back(VectorType::get(II->getType(), BundleWidth));		V.push_back(VectorType::get(II->getType(), BundleWidth));
Cost += NumCalls * TTI->getCostOfKeepingLiveOverCall(V);		Cost += NumCalls * TTI->getCostOfKeepingLiveOverCall(V);
}		}

PrevInst = Inst;		PrevInst = Inst;
}		}

return Cost;		return Cost;
}		}

int BoUpSLP::getTreeCost() {		int BoUpSLP::getExtractOperationCost(const ExternalUser &EU) const {
int Cost = 0;		// Uses by ephemeral values are free (because the ephemeral value will be
		// removed prior to code generation, and so the extraction will be
		// removed as well).
		if (EphValues.count(EU.User))
		return 0;

		// If we plan to rewrite the tree in a smaller type, we will need to sign
		// extend the extracted value back to the original type. Here, we account
		// for the extract and the added cost of the sign extend if needed.
		auto *VecTy = VectorType::get(EU.Scalar->getType(), BundleWidth);
		Value *ScalarRoot = VectorizableTree.front()->Scalars[0];

		auto It = MinBWs.find(ScalarRoot);
		ABataev (Done): `VectorizableTree.front()` instead of `VectorizableTree[0]`, just like in the first statement.
		if (It != MinBWs.end()) {
		uint64_t Width = It->second.first;
		bool Signed = It->second.second;
		auto *MinTy = IntegerType::get(F->getContext(), Width);
		unsigned ExtOp = Signed ? Instruction::SExt : Instruction::ZExt;
		VecTy = VectorType::get(MinTy, BundleWidth);
		return (TTI->getExtractWithExtendCost(ExtOp, EU.Scalar->getType(), VecTy,
		EU.Lane));
		}
		return TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, EU.Lane);
		}

		int BoUpSLP::getExtractCost() const {
		int ExtractCost = 0;
		SmallPtrSet<Value *, 16> ExtractCostCalculated;
		// Consider the possibility of extracting vectorized
		// values for canceled elements use.
		for (const std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
		TreeEntry *Entry = TEPtr.get();
		if (Entry->State != TreeEntry::ProposedToGather)
		continue;
		for (Value *V : Entry->Scalars) {
		// Consider the possibility of extracting vectorized
		// values for canceled elements use.
		auto It = InternalTreeUses.find(V);
		if (It != InternalTreeUses.end()) {
		const UserList &UL = It->second;
		for (const ExternalUser &IU : UL)
		ExtractCost += getExtractOperationCost(IU);
		}
		}
		}
		for (const ExternalUser &EU : ExternalUses) {
		// We only add extract cost once for the same scalar.
		if (!ExtractCostCalculated.insert(EU.Scalar).second)
		continue;

		int Cost = getExtractOperationCost(EU);
		ExtractCost += Cost;
		}
		return ExtractCost;
		}

		int BoUpSLP::getInsertCost() {
		int InsertCost = 0;
		for (const std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
		TreeEntry *Entry = TEPtr.get();
		// Avoid already vectorized TreeEntries, it is already in a vector form and
		// we don't need to gather those operations.
		if (Entry->State != TreeEntry::ProposedToGather)
		continue;
		bool NeedGather = false;
		for (Value *V : Entry->Scalars) {
		auto *Inst = cast<Instruction>(V);
		for (User *Op : Inst->users())
		if (ScalarsToVec.count(Op)) {
		NeedGather = true;
		break;
		}
		}
		if (NeedGather)
		InsertCost += getEntryCost(Entry);
		}
		return InsertCost;
		}

		bool BoUpSLP::findSubTree(int UserCost) {
		SmallVector<TreeEntry *, 64> Vec;
		for (const std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
		TreeEntry *Entry = TEPtr.get();
		if (Entry->State != TreeEntry::Vectorize || Entry->Cost <= 0 || !Entry->Idx)
		continue;
		Vec.push_back(Entry);
		if (Vec.size() > MaxCostsRecalculations)
		break;
		}
		llvm::sort(Vec, [&](const TreeEntry *LHS, const TreeEntry *RHS) {
		return LHS->Cost > RHS->Cost;
		});

		for (TreeEntry *T : Vec) {
		T->State = TreeEntry::ProposedToGather;
		for (Value *V : T->Scalars) {
		ScalarsToVec.erase(V);
		VecToScalars.insert(V);
		ScalarToTreeEntry.erase(V);
		MustGather.insert(V);
		ExternalUses.erase(
		llvm::remove_if(ExternalUses,
		[&V](ExternalUser &EU) { return EU.Scalar == V; }),
		ExternalUses.end());
		}
		int PartialCost = getTreeCost() - UserCost;
		RemovedOperations.push_back(T);
		if (PartialCost < -SLPCostThreshold && cutTree()) {
		LLVM_DEBUG(
		dbgs() << "SLP: Decided to partially vectorize tree with cost: "
		<< PartialCost << ".\n");
		return true;
		}
		ABataev (Done): `is_contained()` is `O(n)`. Maybe use a set instead of it in the loop?
		}
		return false;
		}
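		The control flow of `findSubTree()` above can be sketched on a toy cost model. Everything here is an assumption made for illustration: `Node`, `treeCost`, and `throttle` are hypothetical, the tree cost is just a sum of per-node costs, and a canceled node simply contributes a fixed gather cost. The real patch recomputes the cost through `getTreeCost()` (which also accounts for extract, insert, and spill costs), bounds the candidate list by `MaxCostsRecalculations`, and additionally requires `cutTree()` to succeed.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy node: VecCost applies while vectorized, GatherCost once canceled.
struct Node {
  int VecCost;
  int GatherCost;
  bool Canceled;
};

// Toy replacement for getTreeCost(): negative means profitable.
int treeCost(const std::vector<Node> &Tree) {
  int Cost = 0;
  for (const Node &N : Tree)
    Cost += N.Canceled ? N.GatherCost : N.VecCost;
  return Cost;
}

// Cancels the most expensive non-root nodes one at a time and stops
// as soon as the remaining tree is cheaper than -Threshold.
// Canceled nodes stay canceled even if no profitable cut is found.
bool throttle(std::vector<Node> &Tree, int Threshold) {
  std::vector<Node *> Candidates;
  for (std::size_t I = 1; I < Tree.size(); ++I) // index 0 is the root
    if (!Tree[I].Canceled && Tree[I].VecCost > 0)
      Candidates.push_back(&Tree[I]);
  std::sort(Candidates.begin(), Candidates.end(),
            [](const Node *L, const Node *R) {
              return L->VecCost > R->VecCost;
            });
  for (Node *N : Candidates) {
    N->Canceled = true;
    if (treeCost(Tree) < -Threshold)
      return true;
  }
  return false;
}
```

		Sorting by cost alone is the criterion the reviewer questions further down in this review, where a level-order (BFS) walk from the deepest nodes is proposed instead.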

		int BoUpSLP::getRawTreeCost() {
		int CostSum = 0;
		BundleWidth = VectorizableTree.front()->Scalars.size();
		ABataev (Done): The code in this function is very similar to the code in `getTreeCost()`. Can it be reused somehow to avoid duplication?
LLVM_DEBUG(dbgs() << "SLP: Calculating cost for tree of size "		LLVM_DEBUG(dbgs() << "SLP: Calculating cost for tree of size "
<< VectorizableTree.size() << ".\n");		<< VectorizableTree.size() << ".\n");

unsigned BundleWidth = VectorizableTree[0]->Scalars.size();		for (std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
		TreeEntry &TE = *TEPtr.get();
for (unsigned I = 0, E = VectorizableTree.size(); I < E; ++I) {
TreeEntry &TE = *VectorizableTree[I].get();

		ABataev (Not Done): Does it include the cost of the whole subtree or just this particular `Entry`?
		dtemirbulatov (Author, Done): Entry->Cost is the cost of this particular Entry together with some sub-tree elements, for example if this particular Entry has a gathering element. Note that we only consider TreeEntry::Vectorize entries in this sum.
		dtemirbulatov (Author, Done): And we recalculate the costs of all entries canceled from vectorization (ProposedToGather) in getInsertCost() every time we call getTreeCost() at line 4214.
// We create duplicate tree entries for gather sequences that have multiple		// We create duplicate tree entries for gather sequences that have multiple
		ABataev (Not Done): `>=`
// uses. However, we should not compute the cost of duplicate sequences.		// uses. However, we should not compute the cost of duplicate sequences.
		ABataev (Not Done): Why do we need to exclude `UserCost` here?
		dtemirbulatov (Author, Done): We might want to add (or subtract) an extra value, for example at line 6071 before this change.
// For example, if we have a build vector (i.e., insertelement sequence)		// For example, if we have a build vector (i.e., insertelement sequence)
// that is used by more than one vector instruction, we only need to		// that is used by more than one vector instruction, we only need to
// compute the cost of the insertelement instructions once. The redundant		// compute the cost of the insertelement instructions once. The redundant
// instructions will be eliminated by CSE.		// instructions will be eliminated by CSE.
//		//
// We should consider not creating duplicate tree entries for gather		// We should consider not creating duplicate tree entries for gather
// sequences, and instead add additional edges to the tree representing		// sequences, and instead add additional edges to the tree representing
// their uses. Since such an approach results in fewer total entries,		// their uses. Since such an approach results in fewer total entries,
// existing heuristics based on tree size may yield different results.		// existing heuristics based on tree size may yield different results.
//		//
if (TE.State == TreeEntry::NeedToGather &&		if (TE.State == TreeEntry::ProposedToGather)
std::any_of(std::next(VectorizableTree.begin(), I + 1),		VecToScalars.insert(TE.Scalars.begin(), TE.Scalars.end());
VectorizableTree.end(),		if (TE.State != TreeEntry::Vectorize &&
		llvm::any_of(llvm::drop_begin(VectorizableTree, TE.Idx + 1),
[TE](const std::unique_ptr<TreeEntry> &EntryPtr) {		[TE](const std::unique_ptr<TreeEntry> &EntryPtr) {
return EntryPtr->State == TreeEntry::NeedToGather &&		return EntryPtr->State != TreeEntry::Vectorize &&
		ABataev (Not Done): Tabs
		dtemirbulatov (Author, Done): This is LLVM's standard format (clang-format).
EntryPtr->isSame(TE.Scalars);		EntryPtr->isSame(TE.Scalars);
}))		}))
continue;		continue;

int C = getEntryCost(&TE);		if (TE.State == TreeEntry::Vectorize)
LLVM_DEBUG(dbgs() << "SLP: Adding cost " << C		ScalarsToVec.insert(TE.Scalars.begin(), TE.Scalars.end());

		TE.Cost = getEntryCost(&TE);
		LLVM_DEBUG(dbgs() << "SLP: Adding cost " << TE.Cost
<< " for bundle that starts with " << *TE.Scalars[0]		<< " for bundle that starts with " << *TE.Scalars[0]
<< ".\n");		<< ".\n");
Cost += C;		CostSum += TE.Cost;
}		}

SmallPtrSet<Value *, 16> ExtractCostCalculated;		if (SLPThrottling)
int ExtractCost = 0;		for (std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
for (ExternalUser &EU : ExternalUses) {		TreeEntry *TE = TEPtr.get();
// We only add extract cost once for the same scalar.		if (TE->State != TreeEntry::Vectorize)
if (!ExtractCostCalculated.insert(EU.Scalar).second)
continue;

// Uses by ephemeral values are free (because the ephemeral value will be
// removed prior to code generation, and so the extraction will be
// removed as well).
if (EphValues.count(EU.User))
continue;		continue;
		ABataev (Not Done): Tab
		dtemirbulatov (Author, Done): This is LLVM's standard format (clang-format).
		int GatherCost = 0;
		for (TreeEntry *Gather : TE->UseEntries)
		if (Gather->State != TreeEntry::Vectorize)
		GatherCost += Gather->Cost;
		wwei (Done): I think the input for `getGatherCost` should be `Entry->Scalars` instead of `V`. The code in `getGatherCost` looks like:
		int BoUpSLP::getGatherCost(ArrayRef<Value *> VL) const {
		  // Find the type of the operands in VL.
		  Type *ScalarTy = VL[0]->getType();
		  ...
		  ...
		  VectorType *VecTy = VectorType::get(ScalarTy, VL.size());
		  ...
		  ...
		  return getGatherCost(VecTy, ShuffledElements);
		So, if the input is `V`, `VecTy` will always be `<1 x iN>`, which I think is the same as the scalar type `iN`, and `getGatherCost(VecTy, ShuffledElements)` will return an incorrect InsertCost value.
		dtemirbulatov (Author, Done): Yes, correct. Thanks, I will fix that.
		TE->Cost += GatherCost;
		}
		return CostSum;
		dtemirbulatov (Author, Done): Eh, I found a typo here: it should be Inst->Users; the same goes for line 4101.
		}

// If we plan to rewrite the tree in a smaller type, we will need to sign		int BoUpSLP::getTreeCost() {
// extend the extracted value back to the original type. Here, we account		int CostSum;
// for the extract and the added cost of the sign extend if needed.		if (!IsCostSumReady) {
auto *VecTy = VectorType::get(EU.Scalar->getType(), BundleWidth);		CostSum = getRawTreeCost();
auto *ScalarRoot = VectorizableTree[0]->Scalars[0];		RawTreeCost = CostSum;
if (MinBWs.count(ScalarRoot)) {
auto *MinTy = IntegerType::get(F->getContext(), MinBWs[ScalarRoot].first);
auto Extend =
MinBWs[ScalarRoot].second ? Instruction::SExt : Instruction::ZExt;
VecTy = VectorType::get(MinTy, BundleWidth);
ExtractCost += TTI->getExtractWithExtendCost(Extend, EU.Scalar->getType(),
VecTy, EU.Lane);
} else {		} else {
		ABataev: Not sure that this is the best criterion. I think you also need to include the distance from the head of the tree to the entry, because some big costs can be compensated by the vectorizable nodes in the tree. What I would do here is some kind of level-order search (BFS) starting from the deepest level.
		dtemirbulatov: Hmm, implemented, but I don't see any benefit from that, plus we have to do a BFS search. And we are going to throw away any non-vectorizable nodes at 4295.
		ABataev: It may trigger for targets like Silvermont, or in the future for vectorized functions.
		dtemirbulatov: I measured the BFS approach against this implementation. With BFS, it is ~10% less efficient on SPEC2006 INT and ~20% less on the compilable part of SPEC2006 FP. By efficiency, I mean the total number of reduced trees during the whole compilation.
		ABataev: Could you post it anyway, to check whether it can be improved?
		dtemirbulatov: OK, I might have missed something. Thanks.
ExtractCost +=		CostSum = RawTreeCost;
TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, EU.Lane);
}		}

		if (SLPThrottling)
		for (std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
		TreeEntry *TE = TEPtr.get();
		if (TE->State == TreeEntry::ProposedToGather)
		ABataev: I think you can also exclude entries with the number of operands <= 1.
		dtemirbulatov: But why? The only thing that matters here is the cost.
		ABataev: Because the main idea is to drop gathers, and dropping one gather in favor of another one will surely not be profitable. But it may improve compile time and the list of candidates. The only case you need to check for is the latest masked gather case; it may be profitable to convert it to gathers for some targets.
		dtemirbulatov: I think I can check here whether scatter/gather is supported via TargetInfo and, if it is not, move all nodes with TreeEntry::ScatterVectorize to TreeEntry::Gather.
		anton-afanasyev: I believe it is the wrong decision to check scatter/gather target support, for the reason mentioned here: https://reviews.llvm.org/D92701#2435573. Why can't we just rely on costs (the node cost and the total one)?
		dtemirbulatov: I agree with @anton-afanasyev here. I am not sure what @ABataev wants here. If I exclude entries with operands <= 1, then we would lose half of all SLP tests affected by throttling.
		ABataev: I did not say anything about checking whether scatter is supported here. I just said that we can improve the criterion by checking that the entry node has at least 2 operands (because if it has just one operand, most probably we can skip it). We only need to check nodes with 1 operand if the node is a gather/scatter node, because it may be better to represent it as a simple gather.
		CostSum -= TE->Cost;
		ABataev: No need for `[&]` here, just `[]`
}		}

int SpillCost = getSpillCost();		int ExtractCost = getExtractCost();
Cost += SpillCost + ExtractCost;		int SpillCost = 0;
		if (!NoCallInst || !IsCostSumReady)
		SpillCost = getSpillCost();
		#ifndef NDEBUG
		if (NoCallInst)
		ABataev: `llvm::any_of(Inst->users(), [Tree](User *Op){ return Tree->ScalarToTreeEntry.count(Op) > 0; }`
		assert(getSpillCost() == 0 && "Incorrect spill cost");
		ABataev: `SpillCost == 0`?
		dtemirbulatov: Hmm, I am wondering whether we might get SpillCost == 0 for the whole tree and then somehow get a non-zero value after reduction.
		ABataev: Just: `for (Value *V : Entry->Scalars) { auto *Inst = cast<Instruction>(V); if (llvm::any_of(Inst->users(), [this](User *Op){ return Tree->ScalarToTreeEntry.count(Op) > 0; })) return InsertCost + getEntryCost(Entry); }` Also, check code formatting.
		dtemirbulatov: Hmm, I think this is not a correct suggestion. There might be several tree entries with TreeEntry::ProposedToGather status, and we have to calculate the insert cost for the whole tree here.
		ABataev: Yeah, maybe. But you can do something similar, like `InsertCost += ... break;` instead of setting a flag and doing a check after the loop.
		#endif
		ABataev: Just `assert((!Tree->NoCallInst || getSpillCost() == 0) && "Incorrect spill cost");`
		if (!IsCostSumReady)
		IsCostSumReady = true;
		int InsertCost = getInsertCost();
		int Cost = CostSum + ExtractCost + SpillCost + InsertCost;

std::string Str;		#ifndef NDEBUG
{		SmallString<256> Str;
		ABataev: `Cmp`
raw_string_ostream OS(Str);		raw_svector_ostream OS(Str);
OS << "SLP: Spill Cost = " << SpillCost << ".\n"		OS << "SLP: Spill Cost = " << SpillCost << ".\n"
<< "SLP: Extract Cost = " << ExtractCost << ".\n"		<< "SLP: Extract Cost = " << ExtractCost << ".\n"
		<< "SLP: Insert Cost = " << InsertCost << ".\n"
<< "SLP: Total Cost = " << Cost << ".\n";		<< "SLP: Total Cost = " << Cost << ".\n";
}
LLVM_DEBUG(dbgs() << Str);		LLVM_DEBUG(dbgs() << Str);
		ABataev: `[V]`

if (ViewSLPTree)		if (ViewSLPTree)
ViewGraph(this, "SLP" + F->getName(), false, Str);		ViewGraph(this, "SLP" + F->getName(), false, Str);
		ABataev: Must all this code be active only when debug mode is on?
		RKSimon: Maybe pull this NDEBUG change out into its own patch?
		dtemirbulatov: Yes, it could go in as NFC.
		#endif
		ABataev: Why use a SmallVector and then sort it? Better to use a set with a custom compare functor.
		dtemirbulatov: Thanks, good.
return Cost;		return Cost;
}		}

int BoUpSLP::getGatherCost(Type *Ty,		int BoUpSLP::getGatherCost(Type *Ty,
const DenseSet<unsigned> &ShuffledIndices) const {		const DenseSet<unsigned> &ShuffledIndices) const {
		ABataev: Just `Vec.erase(Vec.rbegin(), Vec.rbegin() + (Vec.size() - MaxCostsRecalculations)`?
		dtemirbulatov: No, we cannot use `Vec.rbegin() + ` with std::set.
		ABataev: Then just `Vec.erase(Vec.begin() + MaxCostsRecalculations, Vec.end());`.
		dtemirbulatov: Eh, no, it is `std::_Rb_tree_const_iterator<llvm::slpvectorizer::BoUpSLP::TreeEntry *>`.
		ABataev: Why is this a const iterator?
		dtemirbulatov: std::set iterators are bidirectional, not random-access, so we have to use std::advance.
int Cost = 0;		int Cost = 0;
for (unsigned i = 0, e = cast<VectorType>(Ty)->getNumElements(); i < e; ++i)		for (unsigned i = 0, e = cast<VectorType>(Ty)->getNumElements(); i < e; ++i)
if (!ShuffledIndices.count(i))		if (!ShuffledIndices.count(i))
Cost += TTI->getVectorInstrCost(Instruction::InsertElement, Ty, i);		Cost += TTI->getVectorInstrCost(Instruction::InsertElement, Ty, i);
if (!ShuffledIndices.empty())		if (!ShuffledIndices.empty())
Cost += TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, Ty);		Cost += TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, Ty);
return Cost;		return Cost;
}		}
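
The std::set exchange in the thread above (no `begin() + N` on a bidirectional iterator) can be illustrated in isolation. This is a minimal sketch in standard C++, not code from the patch: `Node`, `CostGreater`, `NodeSet` and `trimToMax` are hypothetical names, and `std::next` is used as the non-mutating counterpart of the `std::advance` call mentioned by the author.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>

// Hypothetical stand-in for a tree entry; only the cost matters here.
struct Node {
  int Id;
  int Cost;
};

// Custom compare functor: most expensive node first. Ties are broken by Id
// so that two distinct nodes with equal cost are both kept in the set.
struct CostGreater {
  bool operator()(const Node *A, const Node *B) const {
    if (A->Cost != B->Cost)
      return A->Cost > B->Cost;
    return A->Id < B->Id;
  }
};

using NodeSet = std::set<const Node *, CostGreater>;

// Keep at most MaxKeep of the most expensive nodes.
void trimToMax(NodeSet &Nodes, std::size_t MaxKeep) {
  if (Nodes.size() <= MaxKeep)
    return;
  // std::set iterators are bidirectional, so `Nodes.begin() + MaxKeep` does
  // not compile; std::next steps forward one element at a time instead.
  auto FirstDropped =
      std::next(Nodes.begin(), static_cast<std::ptrdiff_t>(MaxKeep));
  Nodes.erase(FirstDropped, Nodes.end());
  assert(Nodes.size() == MaxKeep && "unexpected size after trimming");
}
```

Because the set is already ordered by the functor, no separate sort step is needed before trimming; the price is a linear walk to find the cut point.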

int BoUpSLP::getGatherCost(ArrayRef<Value *> VL) const {		int BoUpSLP::getGatherCost(ArrayRef<Value *> VL) const {
		ABataev: `>=`
// Find the type of the operands in VL.		// Find the type of the operands in VL.
Type *ScalarTy = VL[0]->getType();		Type *ScalarTy = VL[0]->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))		if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
VectorType *VecTy = VectorType::get(ScalarTy, VL.size());		VectorType *VecTy = VectorType::get(ScalarTy, VL.size());
// Find the cost of inserting/extracting values from the vector.		// Find the cost of inserting/extracting values from the vector.
// Check if the same elements are inserted several times and count them as		// Check if the same elements are inserted several times and count them as
// shuffle candidates.		// shuffle candidates.
[469 lines not shown]	case Instruction::Load: {
Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),		Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),
VecTy->getPointerTo(AS));		VecTy->getPointerTo(AS));

// The pointer operand uses an in-tree scalar so we add the new BitCast to		// The pointer operand uses an in-tree scalar so we add the new BitCast to
// ExternalUses list to make sure that an extract will be generated in the		// ExternalUses list to make sure that an extract will be generated in the
// future.		// future.
Value *PO = LI->getPointerOperand();		Value *PO = LI->getPointerOperand();
if (getTreeEntry(PO))		if (getTreeEntry(PO))
ExternalUses.push_back(ExternalUser(PO, cast<User>(VecPtr), 0));		ExternalUses.push_back(ExternalUser(PO, cast<User>(VecPtr), 0));
		ABataev: Use `emplace_back(PO, VecPtr, 0)`

MaybeAlign Alignment = MaybeAlign(LI->getAlignment());		MaybeAlign Alignment = MaybeAlign(LI->getAlignment());
LI = Builder.CreateLoad(VecTy, VecPtr);		LI = Builder.CreateLoad(VecTy, VecPtr);
if (!Alignment)		if (!Alignment)
Alignment = MaybeAlign(DL->getABITypeAlignment(ScalarLoadTy));		Alignment = MaybeAlign(DL->getABITypeAlignment(ScalarLoadTy));
LI->setAlignment(Alignment);		LI->setAlignment(Alignment);
Value *V = propagateMetadata(LI, E->Scalars);		Value *V = propagateMetadata(LI, E->Scalars);
if (IsReorder) {		if (IsReorder) {
[32 lines not shown]	case Instruction::Store: {
Value *VecPtr = Builder.CreateBitCast(		Value *VecPtr = Builder.CreateBitCast(
ScalarPtr, VecValue->getType()->getPointerTo(AS));		ScalarPtr, VecValue->getType()->getPointerTo(AS));
StoreInst *ST = Builder.CreateStore(VecValue, VecPtr);		StoreInst *ST = Builder.CreateStore(VecValue, VecPtr);

// The pointer operand uses an in-tree scalar, so add the new BitCast to		// The pointer operand uses an in-tree scalar, so add the new BitCast to
// ExternalUses to make sure that an extract will be generated in the		// ExternalUses to make sure that an extract will be generated in the
// future.		// future.
if (getTreeEntry(ScalarPtr))		if (getTreeEntry(ScalarPtr))
ExternalUses.push_back(ExternalUser(ScalarPtr, cast<User>(VecPtr), 0));		ExternalUses.push_back(ExternalUser(ScalarPtr, cast<User>(VecPtr), 0));

if (!Alignment)		if (!Alignment)
Alignment = DL->getABITypeAlignment(SI->getValueOperand()->getType());		Alignment = DL->getABITypeAlignment(SI->getValueOperand()->getType());

ST->setAlignment(Align(Alignment));		ST->setAlignment(Align(Alignment));
		ABataev: Use `emplace_back(ScalarPtr, VecPtr, 0);`
Value *V = propagateMetadata(ST, E->Scalars);		Value *V = propagateMetadata(ST, E->Scalars);
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),		V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
E->ReuseShuffleIndices, "shuffle");		E->ReuseShuffleIndices, "shuffle");
}		}
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
[84 lines not shown]	case Instruction::Call: {
SmallVector<OperandBundleDef, 1> OpBundles;		SmallVector<OperandBundleDef, 1> OpBundles;
CI->getOperandBundlesAsDefs(OpBundles);		CI->getOperandBundlesAsDefs(OpBundles);
Value *V = Builder.CreateCall(CF, OpVecs, OpBundles);		Value *V = Builder.CreateCall(CF, OpVecs, OpBundles);

// The scalar argument uses an in-tree scalar so we add the new vectorized		// The scalar argument uses an in-tree scalar so we add the new vectorized
// call to ExternalUses list to make sure that an extract will be		// call to ExternalUses list to make sure that an extract will be
// generated in the future.		// generated in the future.
if (ScalarArg && getTreeEntry(ScalarArg))		if (ScalarArg && getTreeEntry(ScalarArg))
ExternalUses.push_back(ExternalUser(ScalarArg, cast<User>(V), 0));		ExternalUses.push_back(ExternalUser(ScalarArg, cast<User>(V), 0));
		ABataev: Use `emplace_back(ScalarArg, V, 0);`

propagateIRFlags(V, E->Scalars, VL0);		propagateIRFlags(V, E->Scalars, VL0);
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),		V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
E->ReuseShuffleIndices, "shuffle");		E->ReuseShuffleIndices, "shuffle");
}		}
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
[79 lines not shown]	Value *BoUpSLP::vectorizeTree() {
ExtraValueToDebugLocsMap ExternallyUsedValues;		ExtraValueToDebugLocsMap ExternallyUsedValues;
return vectorizeTree(ExternallyUsedValues);		return vectorizeTree(ExternallyUsedValues);
}		}

Value *		Value *
BoUpSLP::vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues) {		BoUpSLP::vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues) {
// All blocks must be scheduled before any instructions are inserted.		// All blocks must be scheduled before any instructions are inserted.
for (auto &BSIter : BlocksSchedules) {		for (auto &BSIter : BlocksSchedules) {
scheduleBlock(BSIter.second.get());		BlockScheduling *BS = BSIter.second.get();
		// Remove all Schedule Data from all nodes that we have changed
		// vectorization decision.
		if (!RemovedOperations.empty())
		removeFromScheduling(BS);
		scheduleBlock(BS);
}		}

Builder.SetInsertPoint(&F->getEntryBlock().front());		Builder.SetInsertPoint(&F->getEntryBlock().front());
auto *VectorRoot = vectorizeTree(VectorizableTree[0].get());		auto *VectorRoot = vectorizeTree(VectorizableTree.front().get());

		for (std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
		TreeEntry *Entry = TEPtr.get();
		if (Entry->State == TreeEntry::Vectorize && !Entry->VectorizedValue)
		vectorizeTree(Entry);
		}

// If the vectorized tree can be rewritten in a smaller type, we truncate the		// If the vectorized tree can be rewritten in a smaller type, we truncate the
// vectorized root. InstCombine will then rewrite the entire expression. We		// vectorized root. InstCombine will then rewrite the entire expression. We
// sign extend the extracted values below.		// sign extend the extracted values below.
auto *ScalarRoot = VectorizableTree[0]->Scalars[0];		auto *ScalarRoot = VectorizableTree[0]->Scalars[0];
if (MinBWs.count(ScalarRoot)) {		if (MinBWs.count(ScalarRoot)) {
if (auto *I = dyn_cast<Instruction>(VectorRoot))		if (auto *I = dyn_cast<Instruction>(VectorRoot))
Builder.SetInsertPoint(&*++BasicBlock::iterator(I));		Builder.SetInsertPoint(&*++BasicBlock::iterator(I));
auto BundleWidth = VectorizableTree[0]->Scalars.size();		BundleWidth = VectorizableTree.front()->Scalars.size();
auto *MinTy = IntegerType::get(F->getContext(), MinBWs[ScalarRoot].first);		auto *MinTy = IntegerType::get(F->getContext(), MinBWs[ScalarRoot].first);
auto *VecTy = VectorType::get(MinTy, BundleWidth);		auto *VecTy = VectorType::get(MinTy, BundleWidth);
auto *Trunc = Builder.CreateTrunc(VectorRoot, VecTy);		auto *Trunc = Builder.CreateTrunc(VectorRoot, VecTy);
VectorizableTree[0]->VectorizedValue = Trunc;		VectorizableTree[0]->VectorizedValue = Trunc;
}		}

LLVM_DEBUG(dbgs() << "SLP: Extracting " << ExternalUses.size()		LLVM_DEBUG(dbgs() << "SLP: Extracting " << ExternalUses.size()
<< " values .\n");		<< " values .\n");
[83 lines not shown]	if (auto *VecI = dyn_cast<Instruction>(Vec)) {
CSEBlocks.insert(&F->getEntryBlock());		CSEBlocks.insert(&F->getEntryBlock());
User->replaceUsesOfWith(Scalar, Ex);		User->replaceUsesOfWith(Scalar, Ex);
}		}

LLVM_DEBUG(dbgs() << "SLP: Replaced:" << *User << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Replaced:" << *User << ".\n");
}		}

// For each vectorized value:		// For each vectorized value:
for (auto &TEPtr : VectorizableTree) {		for (std::unique_ptr<TreeEntry> &TEPtr : VectorizableTree) {
TreeEntry *Entry = TEPtr.get();		TreeEntry *Entry = TEPtr.get();

// No need to handle users of gathered values.		// No need to handle users of gathered values.
if (Entry->State == TreeEntry::NeedToGather)		if (Entry->State == TreeEntry::NeedToGather)
continue;		continue;

assert(Entry->VectorizedValue && "Can't find vectorizable value");		assert(Entry->VectorizedValue && "Can't find vectorizable value");

// For each lane:		// For each lane:
for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {		for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
Value *Scalar = Entry->Scalars[Lane];		Value *Scalar = Entry->Scalars[Lane];

#ifndef NDEBUG		#ifndef NDEBUG
Type *Ty = Scalar->getType();		Type *Ty = Scalar->getType();
if (!Ty->isVoidTy()) {		// The tree might not be fully vectorized, so we don't have to
		// check every user.
		if (!Ty->isVoidTy() && RemovedOperations.empty()) {
for (User *U : Scalar->users()) {		for (User *U : Scalar->users()) {
LLVM_DEBUG(dbgs() << "SLP: \tvalidating user:" << *U << ".\n");		LLVM_DEBUG(dbgs() << "SLP: \tvalidating user:" << *U << ".\n");

// It is legal to delete users in the ignorelist.		// It is legal to delete users in the ignorelist.
assert((getTreeEntry(U) || is_contained(UserIgnoreList, U)) &&		assert((getTreeEntry(U) || is_contained(UserIgnoreList, U)) &&
"Deleting out-of-tree value");		"Deleting out-of-tree value");
}		}
}		}
#endif		#endif
LLVM_DEBUG(dbgs() << "SLP: \tErasing scalar:" << *Scalar << ".\n");		LLVM_DEBUG(dbgs() << "SLP: \tErasing scalar:" << *Scalar << ".\n");
eraseInstruction(cast<Instruction>(Scalar));		eraseInstruction(cast<Instruction>(Scalar));
}		}
}		}

Builder.ClearInsertionPoint();		Builder.ClearInsertionPoint();

return VectorizableTree[0]->VectorizedValue;		return VectorizableTree[0]->VectorizedValue;
}		}
		dtemirbulatov: Hmm, I would like to use remove_if here, but we have to capture a unique_ptr here.
		ABataev: Use `llvm::erase_if`
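
For readers unfamiliar with the idiom discussed in this thread, here is a minimal standard-C++ sketch of removing elements from a container of `std::unique_ptr`. It is an illustration under assumptions, not the patch's code: `TreeState` here is a hypothetical two-field struct and `dropStale` is an invented helper. `llvm::erase_if(Container, Pred)` from `llvm/ADT/STLExtras.h` performs the same two steps in a single call.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a saved tree state.
struct TreeState {
  int Cost;
  bool Stale;
};

// Drops every stale entry and returns how many were removed.
std::size_t dropStale(std::vector<std::unique_ptr<TreeState>> &Trees) {
  const std::size_t Before = Trees.size();
  // The predicate takes the unique_ptr by const reference, so nothing is
  // copied. std::remove_if only shifts the kept elements to the front and
  // returns the new logical end; the container keeps its old size until
  // erase() is called on the tail.
  auto NewEnd = std::remove_if(
      Trees.begin(), Trees.end(),
      [](const std::unique_ptr<TreeState> &T) { return T->Stale; });
  Trees.erase(NewEnd, Trees.end());
  return Before - Trees.size();
}
```

This is also why a trailing `erase(...)` is needed after a bare `remove_if`: without it, the moved-from pointers at the tail remain in the container.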

		ABataev: `[Tree]`
		dtemirbulatov: Tree is a class property, not a local variable.
		ABataev: Then `[this]`
void BoUpSLP::optimizeGatherSequence() {		void BoUpSLP::optimizeGatherSequence() {
LLVM_DEBUG(dbgs() << "SLP: Optimizing " << GatherSeq.size()		LLVM_DEBUG(dbgs() << "SLP: Optimizing " << GatherSeq.size()
<< " gather sequences instructions.\n");		<< " gather sequences instructions.\n");
		ABataev: Why do you need to call `BuiltTrees.erase(` after `llvm::remove_if`?
		dtemirbulatov: It is a `SmallVector<std::unique_ptr<TreeState>, 8>`, and we have to call erase(...).
// LICM InsertElementInst sequences.		// LICM InsertElementInst sequences.
for (Instruction *I : GatherSeq) {		for (Instruction *I : GatherSeq) {
if (isDeleted(I))		if (isDeleted(I))
continue;		continue;

// Check if this block is inside a loop.		// Check if this block is inside a loop.
Loop *L = LI->getLoopFor(I->getParent());		Loop *L = LI->getLoopFor(I->getParent());
if (!L)		if (!L)
[446 lines not shown]	doForAllOpcodes(I, [&](ScheduleData *SD) {
"ScheduleData not in scheduling region");		"ScheduleData not in scheduling region");
SD->IsScheduled = false;		SD->IsScheduled = false;
SD->resetUnscheduledDeps();		SD->resetUnscheduledDeps();
});		});
}		}
ReadyInsts.clear();		ReadyInsts.clear();
}		}

		void BoUpSLP::removeFromScheduling(BlockScheduling *BS) {
		bool Removed = false;
		for (TreeEntry *Entry : RemovedOperations) {
		ScheduleData *SD = BS->getScheduleData(Entry->Scalars[0]);
		if (SD && SD->isPartOfBundle()) {
		if (!Removed) {
		Removed = true;
		BS->resetSchedule();
		}
		BS->cancelScheduling(Entry->Scalars, SD->OpValue);
		}
		}
		ABataev: Looks like you need to implement something like `reduceSchedulingRegion()`, similar to the `extendSchedulingRegion` function, because otherwise you are going to operate on the larger scheduling region. That is, you need to modify the `ScheduleStart` and `ScheduleEnd` data members.
		if (!Removed)
		return;
		BS->resetSchedule();
		BS->initialFillReadyList(BS->ReadyInsts);
		for (Instruction *I = BS->ScheduleStart; I != BS->ScheduleEnd;
		I = I->getNextNode()) {
		if (BS->ScheduleDataMap.find(I) == BS->ScheduleDataMap.end())
		continue;
		BS->doForAllOpcodes(I,
		[&](ScheduleData *SD) { SD->clearDependencies(); });
		ABataev: `[]`
		}
		}

void BoUpSLP::scheduleBlock(BlockScheduling *BS) {		void BoUpSLP::scheduleBlock(BlockScheduling *BS) {
if (!BS->ScheduleStart)		if (!BS->ScheduleStart)
return;		return;

LLVM_DEBUG(dbgs() << "SLP: schedule block " << BS->BB->getName() << "\n");		LLVM_DEBUG(dbgs() << "SLP: schedule block " << BS->BB->getName() << "\n");

BS->resetSchedule();		BS->resetSchedule();

// For the real scheduling we use a more sophisticated ready-list: it is		// For the real scheduling we use a more sophisticated ready-list: it is
// sorted by the original instruction location. This lets the final schedule		// sorted by the original instruction location. This lets the final schedule
// be as close as possible to the original instruction order.		// be as close as possible to the original instruction order.
		dtemirbulatov: Perhaps we could also check here for !BS->getScheduleData(I)->isPartOfBundle() and further shrink the region.
		dtemirbulatov: Ah, no, this instruction could belong to a real gather node.
struct ScheduleDataCompare {		struct ScheduleDataCompare {
bool operator()(ScheduleData *SD1, ScheduleData *SD2) const {		bool operator()(ScheduleData *SD1, ScheduleData *SD2) const {
return SD2->SchedulingPriority < SD1->SchedulingPriority;		return SD2->SchedulingPriority < SD1->SchedulingPriority;
}		}
};		};
std::set<ScheduleData *, ScheduleDataCompare> ReadyInsts;		std::set<ScheduleData *, ScheduleDataCompare> ReadyInsts;

// Ensure that all dependency data is updated and fill the ready-list with		// Ensure that all dependency data is updated and fill the ready-list with
[498 lines not shown]	if (Cost < -SLPCostThreshold) {
R.getORE()->emit(OptimizationRemark(SV_NAME, "StoresVectorized",		R.getORE()->emit(OptimizationRemark(SV_NAME, "StoresVectorized",
cast<StoreInst>(Chain[0]))		cast<StoreInst>(Chain[0]))
<< "Stores SLP vectorized with cost " << NV("Cost", Cost)		<< "Stores SLP vectorized with cost " << NV("Cost", Cost)
<< " and with tree size "		<< " and with tree size "
<< NV("TreeSize", R.getTreeSize()));		<< NV("TreeSize", R.getTreeSize()));

R.vectorizeTree();		R.vectorizeTree();
return true;		return true;
		} else {
		if (SLPThrottling && R.findSubTree())
		R.vectorizeTree();
}		}

		ABataev: `else if`
		dtemirbulatov: Thanks.
		ABataev: Why not try to vectorize a partial tree right here?
		dtemirbulatov: Hmm, we might have an opportunity to vectorize the whole tree with smaller Chain sizes, or at vectorizeStores, or while doing reductions.
		ABataev: Did you check that?
		dtemirbulatov: Yes, I can add a test case for that.
		dtemirbulatov: For example, if we allow partial vectorization straight away, we could see partial vectorization in test/Transforms/SLPVectorizer/X86/PR39774.ll without `[[TMP3:%.*]] = add <8 x i32> [[SHUFFLE]], <i32 0, i32 55, i32 285, i32 1240, i32 1496, i32 8555, i32 12529, i32 13685>`. That is because we would later have an opportunity to vectorize the whole tree.
		ABataev: Actually, `else` is not required here at all. Just make it a standalone `if` statement, since there is an early exit in the previous `if`.
		dtemirbulatov: Hmm, no, I think it is required here; we don't want to reduce an already decided full-tree vectorization.
		ABataev: No, you don't need it. Read https://llvm.org/docs/CodingStandards.html#don-t-use-else-after-a-return
return false;		return false;
}		}
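
The `else`-after-`return` point from the thread above can be shown with a small standalone sketch. Everything here is hypothetical and simplified: `Outcome` and `decide` are invented for illustration, `Threshold` stands in for `SLPCostThreshold`, and `SubTreeFound` stands in for the result of `R.findSubTree()`. The real function returns a `bool` and calls `R.vectorizeTree()` rather than returning an enum.

```cpp
enum class Outcome { FullTree, PartialTree, NotVectorized };

// Decide what to do with a tree whose total cost is Cost.
Outcome decide(int Cost, int Threshold, bool Throttling, bool SubTreeFound) {
  if (Cost < -Threshold)
    return Outcome::FullTree; // Early exit: the whole tree is profitable.

  // No `else` is needed: control only reaches this point when the early
  // return above was not taken, so a full-tree decision is never reduced.
  if (Throttling && SubTreeFound)
    return Outcome::PartialTree;

  return Outcome::NotVectorized;
}
```

With this shape, the throttling branch cannot run for a tree that already qualified for full vectorization, which is the concern the author raised.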

bool SLPVectorizerPass::vectorizeStores(ArrayRef<StoreInst *> Stores,		bool SLPVectorizerPass::vectorizeStores(ArrayRef<StoreInst *> Stores,
BoUpSLP &R) {		BoUpSLP &R) {
// We may run into multiple chains that merge into a single chain. We mark the		// We may run into multiple chains that merge into a single chain. We mark the
// stores that we vectorized so that we don't visit the same store twice.		// stores that we vectorized so that we don't visit the same store twice.
BoUpSLP::ValueSet VectorizedStores;		BoUpSLP::ValueSet VectorizedStores;
[218 lines not shown]	for (unsigned I = NextInst; I < MaxInst; ++I) {
Value *ReorderedOps[] = {Ops[1], Ops[0]};		Value *ReorderedOps[] = {Ops[1], Ops[0]};
R.buildTree(ReorderedOps, None);		R.buildTree(ReorderedOps, None);
}		}
if (R.isTreeTinyAndNotFullyVectorizable())		if (R.isTreeTinyAndNotFullyVectorizable())
continue;		continue;

R.computeMinimumValueSizes();		R.computeMinimumValueSizes();
int Cost = R.getTreeCost() - UserCost;		int Cost = R.getTreeCost() - UserCost;
CandidateFound = true;		CandidateFound = true;
		ABataev: Do you really need this new var here? I don't see where it is used except as an argument of the `R.findSubTree(UserCost)` call.
		dtemirbulatov: I think we need to compensate the ExtractCost with the cost of the insert sequence, as in the case of full vectorization.
		RKSimon: This still looks wrong - isn't the UserCost only used locally in the CompensateUseCost path?
		dtemirbulatov: No, there is another instance of UserCost at 6476. We have to compare the cost to SLPCostThreshold inside findSubTree() and subtract UserCost.
		dtemirbulatov: I mean another instance of its usage.
MinCost = std::min(MinCost, Cost);		MinCost = std::min(MinCost, Cost);

if (Cost < -SLPCostThreshold) {		if (Cost < -SLPCostThreshold) {
LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");
R.getORE()->emit(OptimizationRemark(SV_NAME, "VectorizedList",		R.getORE()->emit(OptimizationRemark(SV_NAME, "VectorizedList",
cast<Instruction>(Ops[0]))		cast<Instruction>(Ops[0]))
<< "SLP vectorized with cost " << ore::NV("Cost", Cost)		<< "SLP vectorized with cost " << ore::NV("Cost", Cost)
<< " and with tree size "		<< " and with tree size "
<< ore::NV("TreeSize", R.getTreeSize()));		<< ore::NV("TreeSize", R.getTreeSize()));

R.vectorizeTree();		R.vectorizeTree();
// Move to the next bundle.		// Move to the next bundle.
I += VF - 1;		I += VF - 1;
NextInst = I + 1;		NextInst = I + 1;
Changed = true;		Changed = true;
		} else {
		if (SLPThrottling && R.findSubTree(UserCost))
		ABataev: `else if`
		dtemirbulatov: Thanks.
		R.vectorizeTree();
}		}
		ABataev: Why `SLPThrottleBudget > 0`? What if `SLPThrottleBudget` equals 0?
}		}
		ABataev: Why not try to vectorize a partial tree right here?
		dtemirbulatov: We might have an opportunity to vectorize the whole tree with smaller Chain sizes at vectorizeStoreChain or while doing reductions.
		ABataev: Enclose the substatement in braces, since the substatement in the `then` branch is in braces.
		ABataev: Better to enclose this substatement in braces to make the code uniform.
		ABataev: Why can't we do something like this:
```
int NumAttempts = 0;
do {
  if (R.isTreeTinyAndNotFullyVectorizable())
    break;
  R.computeMinimumValueSizes();
  InstructionCost Cost = R.getTreeCost();
  InstructionCost UserCost = 0;
  ....
  if (Cost < -SLPCostThreshold) {
    LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");
    R.getORE()->emit(OptimizationRemark(SV_NAME, "VectorizedList",
                                        cast<Instruction>(Ops[0]))
                     << "SLP vectorized with cost " << ore::NV("Cost", Cost)
                     << " and with tree size "
                     << ore::NV("TreeSize", R.getTreeSize()));
    R.vectorizeTree();
    // Move to the next bundle.
    I += VF - 1;
    NextInst = I + 1;
    Changed = true;
    break;
  }
  ...
  /// Do throttling here.
  ++NumAttempts;
} while (NumAttempts < SLPThrottleBudget);
```
		dtemirbulatov: We are doing partial vectorization and we have to know UserCost to make the correct partial tree cut.
}		}

if (!Changed && CandidateFound) {		if (!Changed && CandidateFound) {
R.getORE()->emit([&]() {		R.getORE()->emit([&]() {
return OptimizationRemarkMissed(SV_NAME, "NotBeneficial", I0)		return OptimizationRemarkMissed(SV_NAME, "NotBeneficial", I0)
<< "List vectorization was possible but not beneficial with cost "		<< "List vectorization was possible but not beneficial with cost "
<< ore::NV("Cost", MinCost) << " >= "		<< ore::NV("Cost", MinCost) << " >= "
<< ore::NV("Treshold", -SLPCostThreshold);		<< ore::NV("Treshold", -SLPCostThreshold);
[... 767 lines not shown ...]	while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {

V.computeMinimumValueSizes();		V.computeMinimumValueSizes();

// Estimate cost.		// Estimate cost.
int TreeCost = V.getTreeCost();		int TreeCost = V.getTreeCost();
int ReductionCost = getReductionCost(TTI, ReducedVals[i], ReduxWidth);		int ReductionCost = getReductionCost(TTI, ReducedVals[i], ReduxWidth);
int Cost = TreeCost + ReductionCost;		int Cost = TreeCost + ReductionCost;
if (Cost >= -SLPCostThreshold) {		if (Cost >= -SLPCostThreshold) {
		if (!SLPThrottling || !V.findSubTree(-ReductionCost))
		break;
V.getORE()->emit([&]() {		V.getORE()->emit([&]() {
		ABataev: Looks like you missed the comparison of `Cost` with `-SLPCostThreshold` here. You vectorized the tree after throttling unconditionally. Plus, the `Cost` is calculated here, but not used later except for the debug prints.
		dtemirbulatov: We don't need to compare here; this is done inside findSubTree().
return OptimizationRemarkMissed(		return OptimizationRemarkMissed(SV_NAME, "HorSLPNotBeneficial",
SV_NAME, "HorSLPNotBeneficial", cast<Instruction>(VL[0]))		cast<Instruction>(VL[0]))
		ABataev: Just `else`?
		dtemirbulatov: We might try partial vectorization without success here, and then we have to report the insufficient cost and break.
<< "Vectorizing horizontal reduction is possible"		<< "Vectorizing horizontal reduction is possible"
<< "but not beneficial with cost "		<< "but not beneficial with cost " << ore::NV("Cost", Cost)
<< ore::NV("Cost", Cost) << " and threshold "		<< " and threshold "
<< ore::NV("Threshold", -SLPCostThreshold);		<< ore::NV("Threshold", -SLPCostThreshold);
});		});
		RKSimon: This looks like an NFC clang-format change now - either pre-commit or discard from the patch?
break;
}		}

LLVM_DEBUG(dbgs() << "SLP: Vectorizing horizontal reduction at cost:"		LLVM_DEBUG(dbgs() << "SLP: Vectorizing horizontal reduction at cost:"
<< Cost << ". (HorRdx)\n");		<< Cost << ". (HorRdx)\n");
V.getORE()->emit([&]() {		V.getORE()->emit([&]() {
return OptimizationRemark(		return OptimizationRemark(
SV_NAME, "VectorizedHorizontalReduction", cast<Instruction>(VL[0]))		SV_NAME, "VectorizedHorizontalReduction", cast<Instruction>(VL[0]))
<< "Vectorized horizontal reduction with cost "		<< "Vectorized horizontal reduction with cost "
[... 720 lines not shown ...]
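
The decision flow debated in the threads above (accept the full tree if it beats the threshold, otherwise fall back to a throttled sub-tree within a limited number of attempts) can be modelled in isolation. The following is only a toy sketch: `Decision`, `decide`, and all of its parameters are hypothetical names, not part of `BoUpSLP`, and the candidate sub-tree costs are passed in directly rather than computed the way the patch's `findSubTree()` would compute them.

```cpp
// Toy model only; none of these names exist in SLPVectorizer.cpp.
enum class Decision { Full, Partial, None };

// FullCost:      cost of the whole tree (negative means profitable).
// SubTreeCosts:  assumed costs of successively reduced sub-trees.
// UserCost:      cost subtracted before comparing, as in the diff.
// Threshold:     stand-in for SLPCostThreshold.
// Throttling:    stand-in for the SLPThrottling flag.
// Budget:        stand-in for SLPThrottleBudget (max sub-trees tried).
constexpr Decision decide(int FullCost, const int *SubTreeCosts,
                          int NumSubTrees, int UserCost, int Threshold,
                          bool Throttling, int Budget) {
  // First choice: the whole tree is already profitable.
  if (FullCost - UserCost < -Threshold)
    return Decision::Full;
  if (!Throttling)
    return Decision::None;
  // Fallback: try reduced sub-trees until one is profitable or the
  // attempt budget runs out.
  for (int I = 0; I < NumSubTrees && I < Budget; ++I)
    if (SubTreeCosts[I] - UserCost < -Threshold)
      return Decision::Partial;
  return Decision::None;
}

// Compile-time self-checks for the sketch.
constexpr int ExampleSubTrees[] = {5, -4};
static_assert(decide(8, ExampleSubTrees, 2, 0, 0, true, 2) ==
                  Decision::Partial,
              "second sub-tree is profitable");
static_assert(decide(8, ExampleSubTrees, 2, 0, 0, true, 1) ==
                  Decision::None,
              "budget of 1 stops before the profitable sub-tree");
```

In this model a budget of 0 never tries a sub-tree, which is one way to read the reviewer's question about `SLPThrottleBudget` equal to 0; how the patch itself treats that case is not shown in the visible diff.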

llvm/test/Transforms/SLPVectorizer/X86/powof2div.ll

[... 54 lines not shown ...]	entry:
%arrayidx17 = getelementptr inbounds i32, i32* %a, i64 3		%arrayidx17 = getelementptr inbounds i32, i32* %a, i64 3
store i32 %div16, i32* %arrayidx17, align 4		store i32 %div16, i32* %arrayidx17, align 4
ret void		ret void
}		}

define void @powof2div_nonuniform(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32* noalias nocapture readonly %c){		define void @powof2div_nonuniform(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32* noalias nocapture readonly %c){
; AVX1-LABEL: @powof2div_nonuniform(		; AVX1-LABEL: @powof2div_nonuniform(
; AVX1-NEXT: entry:		; AVX1-NEXT: entry:
; AVX1-NEXT: [[TMP0:%.*]] = load i32, i32* [[B:%.*]], align 4		; AVX1-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
; AVX1-NEXT: [[TMP1:%.*]] = load i32, i32* [[C:%.*]], align 4		; AVX1-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i32, i32* [[C:%.*]], i64 1
; AVX1-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP1]], [[TMP0]]
; AVX1-NEXT: [[DIV:%.*]] = sdiv i32 [[ADD]], 2
; AVX1-NEXT: store i32 [[DIV]], i32* [[A:%.*]], align 4
; AVX1-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 1
; AVX1-NEXT: [[TMP2:%.*]] = load i32, i32* [[ARRAYIDX3]], align 4
; AVX1-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 1
; AVX1-NEXT: [[TMP3:%.*]] = load i32, i32* [[ARRAYIDX4]], align 4
; AVX1-NEXT: [[ADD5:%.*]] = add nsw i32 [[TMP3]], [[TMP2]]
; AVX1-NEXT: [[DIV6:%.*]] = sdiv i32 [[ADD5]], 4
; AVX1-NEXT: [[ARRAYIDX7:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1
; AVX1-NEXT: store i32 [[DIV6]], i32* [[ARRAYIDX7]], align 4
; AVX1-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2		; AVX1-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
; AVX1-NEXT: [[TMP4:%.*]] = load i32, i32* [[ARRAYIDX8]], align 4
; AVX1-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 2		; AVX1-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 2
; AVX1-NEXT: [[TMP5:%.*]] = load i32, i32* [[ARRAYIDX9]], align 4
; AVX1-NEXT: [[ADD10:%.*]] = add nsw i32 [[TMP5]], [[TMP4]]
; AVX1-NEXT: [[DIV11:%.*]] = sdiv i32 [[ADD10]], 8
; AVX1-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2
; AVX1-NEXT: store i32 [[DIV11]], i32* [[ARRAYIDX12]], align 4
; AVX1-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3		; AVX1-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3
; AVX1-NEXT: [[TMP6:%.*]] = load i32, i32* [[ARRAYIDX13]], align 4		; AVX1-NEXT: [[TMP0:%.*]] = bitcast i32* [[B]] to <4 x i32>*
		; AVX1-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4
; AVX1-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 3		; AVX1-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 3
; AVX1-NEXT: [[TMP7:%.*]] = load i32, i32* [[ARRAYIDX14]], align 4		; AVX1-NEXT: [[TMP2:%.*]] = bitcast i32* [[C]] to <4 x i32>*
; AVX1-NEXT: [[ADD15:%.*]] = add nsw i32 [[TMP7]], [[TMP6]]		; AVX1-NEXT: [[TMP3:%.*]] = load <4 x i32>, <4 x i32>* [[TMP2]], align 4
; AVX1-NEXT: [[DIV16:%.*]] = sdiv i32 [[ADD15]], 16		; AVX1-NEXT: [[TMP4:%.*]] = add nsw <4 x i32> [[TMP3]], [[TMP1]]
		; AVX1-NEXT: [[TMP5:%.*]] = extractelement <4 x i32> [[TMP4]], i32 0
		; AVX1-NEXT: [[DIV:%.*]] = sdiv i32 [[TMP5]], 2
		; AVX1-NEXT: [[TMP6:%.*]] = extractelement <4 x i32> [[TMP4]], i32 1
		; AVX1-NEXT: [[DIV6:%.*]] = sdiv i32 [[TMP6]], 4
		; AVX1-NEXT: [[ARRAYIDX7:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 1
		; AVX1-NEXT: [[TMP7:%.*]] = extractelement <4 x i32> [[TMP4]], i32 2
		; AVX1-NEXT: [[DIV11:%.*]] = sdiv i32 [[TMP7]], 8
		; AVX1-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2
		; AVX1-NEXT: [[TMP8:%.*]] = extractelement <4 x i32> [[TMP4]], i32 3
		; AVX1-NEXT: [[DIV16:%.*]] = sdiv i32 [[TMP8]], 16
; AVX1-NEXT: [[ARRAYIDX17:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3		; AVX1-NEXT: [[ARRAYIDX17:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3
; AVX1-NEXT: store i32 [[DIV16]], i32* [[ARRAYIDX17]], align 4		; AVX1-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> undef, i32 [[DIV]], i32 0
		; AVX1-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP9]], i32 [[DIV6]], i32 1
		; AVX1-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 [[DIV11]], i32 2
		; AVX1-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[DIV16]], i32 3
		; AVX1-NEXT: [[TMP13:%.*]] = bitcast i32* [[A]] to <4 x i32>*
		; AVX1-NEXT: store <4 x i32> [[TMP12]], <4 x i32>* [[TMP13]], align 4
; AVX1-NEXT: ret void		; AVX1-NEXT: ret void
		ABataev: Still looks like it does not respect mintreesize.
		dtemirbulatov: Hmm, this is not the case here. The tree height is 5; the divide node costs 20, and after deleting this node, extracting from the "add" node costs 4 and inserting after the scalar divide costs 4, so the final tree cost is -4. llvm-mca for -mattr=+avx shows 1305 cycles before and 1609 cycles after.
;		;
; AVX2-LABEL: @powof2div_nonuniform(		; AVX2-LABEL: @powof2div_nonuniform(
; AVX2-NEXT: entry:		; AVX2-NEXT: entry:
; AVX2-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1		; AVX2-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
; AVX2-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i32, i32* [[C:%.*]], i64 1		; AVX2-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i32, i32* [[C:%.*]], i64 1
; AVX2-NEXT: [[ARRAYIDX7:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 1		; AVX2-NEXT: [[ARRAYIDX7:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 1
; AVX2-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2		; AVX2-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
; AVX2-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 2		; AVX2-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 2
[... 47 lines not shown ...]
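
The cost figures quoted in the author's reply for `@powof2div_nonuniform` can be checked with a little arithmetic. The sketch below assumes the costs combine additively, which the comment implies but does not state outright. `ThrottleStep`, `costAfterCut`, and `impliedFullCost` are hypothetical names used only for this illustration. Only the numbers 20, 4, 4, and -4 come from the review; the full-tree cost of 8 is inferred from them.

```cpp
// Hypothetical helper types for a worked example; not LLVM API.
struct ThrottleStep {
  int RemovedNodeCost; // cost of the node cut out of the vector tree (sdiv: 20)
  int ExtractCost;     // extracts from the "add" node feeding scalar sdiv (4)
  int InsertCost;      // inserts rebuilding the vector after scalar sdiv (4)
};

// Tree cost after the cut, assuming costs are simply additive.
constexpr int costAfterCut(int FullTreeCost, ThrottleStep S) {
  return FullTreeCost - S.RemovedNodeCost + S.ExtractCost + S.InsertCost;
}

// Full-tree cost implied by a known post-cut cost (inverse of the above).
constexpr int impliedFullCost(int CostAfter, ThrottleStep S) {
  return CostAfter + S.RemovedNodeCost - S.ExtractCost - S.InsertCost;
}

// From the review: removing the divide node yields a final cost of -4.
static_assert(impliedFullCost(-4, ThrottleStep{20, 4, 4}) == 8,
              "inferred full-tree cost under the additive assumption");
static_assert(costAfterCut(8, ThrottleStep{20, 4, 4}) == -4,
              "matches the final tree cost quoted in the review");
```

Under that reading, the full tree (inferred cost 8) fails a `Cost < -SLPCostThreshold` check with a threshold of 0, while the throttled tree (cost -4) passes it, which is consistent with the partially vectorized AVX1 output shown in the right-hand column.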

llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 < %s | FileCheck %s			; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 < %s | FileCheck %s
				RKSimon: Is it worth adding a second RUN with -slp-throttle=false?

	define dso_local void @rftbsub(double* %a) local_unnamed_addr #0 {			define dso_local void @rftbsub(double* %a) local_unnamed_addr #0 {
	; CHECK-LABEL: @rftbsub(			; CHECK-LABEL: @rftbsub(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds double, double* [[A:%.*]], i64 2			; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds double, double* [[A:%.*]], i64 2
	; CHECK-NEXT: [[TMP0:%.*]] = load double, double* [[ARRAYIDX6]], align 8			; CHECK-NEXT: [[TMP0:%.*]] = or i64 2, 1
	; CHECK-NEXT: [[TMP1:%.*]] = or i64 2, 1			; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds double, double* [[A]], i64 [[TMP0]]
	; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds double, double* [[A]], i64 [[TMP1]]			; CHECK-NEXT: [[TMP1:%.*]] = bitcast double* [[ARRAYIDX6]] to <2 x double>*
	; CHECK-NEXT: [[TMP2:%.*]] = load double, double* [[ARRAYIDX12]], align 8			; CHECK-NEXT: [[TMP2:%.*]] = load <2 x double>, <2 x double>* [[TMP1]], align 8
	; CHECK-NEXT: [[ADD16:%.*]] = fadd double [[TMP2]], undef			; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[TMP2]], i32 1
				; CHECK-NEXT: [[ADD16:%.*]] = fadd double [[TMP3]], undef
	; CHECK-NEXT: [[MUL18:%.*]] = fmul double undef, [[ADD16]]			; CHECK-NEXT: [[MUL18:%.*]] = fmul double undef, [[ADD16]]
	; CHECK-NEXT: [[ADD19:%.*]] = fadd double undef, [[MUL18]]			; CHECK-NEXT: [[ADD19:%.*]] = fadd double undef, [[MUL18]]
	; CHECK-NEXT: [[SUB22:%.*]] = fsub double undef, undef			; CHECK-NEXT: [[SUB22:%.*]] = fsub double undef, undef
	; CHECK-NEXT: [[SUB25:%.*]] = fsub double [[TMP0]], [[ADD19]]			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> undef, double [[ADD19]], i32 0
	; CHECK-NEXT: store double [[SUB25]], double* [[ARRAYIDX6]], align 8			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> [[TMP4]], double [[SUB22]], i32 1
	; CHECK-NEXT: [[SUB29:%.*]] = fsub double [[TMP2]], [[SUB22]]			; CHECK-NEXT: [[TMP6:%.*]] = fsub <2 x double> [[TMP2]], [[TMP5]]
	; CHECK-NEXT: store double [[SUB29]], double* [[ARRAYIDX12]], align 8			; CHECK-NEXT: [[TMP7:%.*]] = bitcast double* [[ARRAYIDX6]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP6]], <2 x double>* [[TMP7]], align 8
	; CHECK-NEXT: unreachable			; CHECK-NEXT: unreachable
	;			;
	entry:			entry:
	%arrayidx6 = getelementptr inbounds double, double* %a, i64 2			%arrayidx6 = getelementptr inbounds double, double* %a, i64 2
	%0 = load double, double* %arrayidx6, align 8			%0 = load double, double* %arrayidx6, align 8
	%1 = or i64 2, 1			%1 = or i64 2, 1
	%arrayidx12 = getelementptr inbounds double, double* %a, i64 %1			%arrayidx12 = getelementptr inbounds double, double* %a, i64 %1
	%2 = load double, double* %arrayidx12, align 8			%2 = load double, double* %arrayidx12, align 8
	[... 10 lines not shown ...]

Diff 255123
