This is an archive of the discontinued LLVM Phabricator instance.

Differential D57779

[SLP] Add support for throttling.
AbandonedPublic

Authored by dtemirbulatov on Feb 5 2019, 12:48 PM.

Details

Reviewers

ABataev
RKSimon
spatel
anton-afanasyev
hfinkel
vporpo
fhahn

Summary

This patch adds support for SLP throttling: when the cost of vectorizing the whole tree is too high, we can reduce the number of proposed vectorizable elements and vectorize the tree partially. https://www.youtube.com/watch?v=xxtA2XPmIug&t=5s
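
As a rough illustration of the idea in the summary (not the patch's actual code), the sketch below models a tree as a flat list of per-node costs, where a negative cost means vectorizing the node is profitable. It repeatedly drops the most expensive node, charges a flat penalty for keeping that node scalar, and stops once the total falls below the threshold. `ToyNode`, `throttle`, and `ExtractCost` are made-up names for this sketch, and real trees also have dependencies between nodes that this toy ignores.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Toy model: one entry per tree node. Cost < 0 means vectorizing this node
// is cheaper than leaving it scalar. Hypothetical type, not LLVM's TreeEntry.
struct ToyNode {
  int Id;
  int Cost;
};

// Returns the ids of the nodes that stay vectorized after throttling, or an
// empty vector if no profitable subset is found. The tree is accepted when
// the total cost is below Threshold. ExtractCost is an assumed flat penalty
// for every node that is turned back into scalar code.
std::vector<int> throttle(std::vector<ToyNode> Nodes, int Threshold,
                          int ExtractCost) {
  assert(ExtractCost >= 0 && "penalty must not be negative");
  // Most expensive nodes last, so pop_back() removes the worst candidate.
  std::sort(Nodes.begin(), Nodes.end(),
            [](const ToyNode &A, const ToyNode &B) { return A.Cost < B.Cost; });
  int Total = 0;
  for (const ToyNode &N : Nodes)
    Total += N.Cost;
  while (!Nodes.empty() && Total >= Threshold) {
    Total -= Nodes.back().Cost; // stop paying for the vector form
    Total += ExtractCost;       // pay to feed the scalar form instead
    Nodes.pop_back();
  }
  std::vector<int> Kept;
  for (const ToyNode &N : Nodes)
    Kept.push_back(N.Id);
  return Kept;
}
```

With costs -2, -1 and +6, a threshold of 0 and a penalty of 1, the full tree costs 3 and is rejected, while dropping the +6 node leaves a total of -2, so the two remaining nodes are kept.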

Event Timeline

There are a very large number of changes, so older changes are hidden.

RKSimon added inline comments.Aug 10 2020, 6:30 AM

clang/tools/clang-tidy
2 ↗	(On Diff #284313)	Remove this
llvm/tools/mlir
2 ↗	(On Diff #284313)	remove this

Fixed.

RKSimon mentioned this in rG90f721404ff8: [SLP] Regenerate load-merge.ll tests.Aug 10 2020, 8:09 AM

RKSimon added inline comments.Aug 10 2020, 8:42 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7278	This looks like an NFC clang-format change now - either pre-commit it or discard it from the patch?
llvm/test/Transforms/SLPVectorizer/X86/load-merge.ll
59	rebase - this was committed at rG90f721404ff8

Rebased, Fixed.

Oh, I failed to fully remove it from the diff at 7269. Fixed.

RKSimon added inline comments.Aug 11 2020, 2:36 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4222–4232	Maybe pull this NDEBUG change out into its own patch?
6420	This still looks wrong - isn't the UserCost only used locally in the CompensateUseCost path?

dtemirbulatov added inline comments.Aug 11 2020, 3:09 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4222–4232	yes, it could go as NFC.
6420	No, there is another instance of UserCost at 6476. We have to compare the cost to SLPCostThreshold inside findSubTree() and subtract UserCost.

dtemirbulatov added inline comments.Aug 11 2020, 3:12 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6420	I mean another instance of its usage.

dtemirbulatov mentioned this in rGb1600d8b8971: [NFC] Guard the cost report block of debug outputs with NDEBUG and.Aug 11 2020, 7:35 AM

Rebased after rGb1600d8b8971

@ABataev @anton-afanasyev Any more comments on this?

llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll
2	Is it worth adding a second RUN with -slp-throttle=false ?

xbolva00 added inline comments.Aug 16 2020, 10:23 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
623	Please mention paper name: “Throttling Automatic Vectorization: When Less Is More” https://www.cl.cam.ac.uk/~tmj32/papers/docs/porpodas15-pact.pdf Slides are good, but paper is paper :)

Corrected paper citation, added -slp-throttle=false to llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll, rebased.

dtemirbulatov marked 2 inline comments as done.Aug 17 2020, 4:21 AM

ABataev added inline comments.Aug 21 2020, 6:51 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3304	What if the user does not have corresponding tree entry, i.e. it is initially scalar? What if the `Scalar` itself is going to remain scalar?
4088–4095	Just:
for (Value *V : Entry->Scalars) {
  auto *Inst = cast<Instruction>(V);
  if (llvm::any_of(Inst->users(), [this](User *Op) {
        return Tree->ScalarToTreeEntry.count(Op) > 0;
      }))
    return InsertCost + getEntryCost(Entry);
}
Also, check the code formatting.

dtemirbulatov added inline comments.Aug 21 2020, 7:38 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3304	"What if the Scalar itself is going to remain scalar?" At this point the decision to cut the tree has already been made, so the Scalar can only be one that was intended to be vectorized. Note that at 3295 we ignore any tree entry whose State is not TreeEntry::Vectorize. "What if the user does not have a corresponding tree entry, i.e. it is initially scalar?" Ah, yes. I have to check for !UserTE at 3305 and just continue if it is true.

dtemirbulatov added inline comments.Aug 21 2020, 3:23 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4088–4095	Hmm, I don't think this suggestion is correct: there might be several tree entries with TreeEntry::ProposedToGather status, and we have to calculate the insert cost for the whole tree here.

ABataev added inline comments.Aug 21 2020, 3:27 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4088–4095	Yeah, maybe. But you can do something similar, like InsertCost += ...; break; instead of setting a flag and doing a check after the loop.
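
To make the "accumulate and break" shape from the comment above concrete, here is a minimal standalone sketch that uses only the standard library. `ToyScalar`, `ToyEntry`, `EntryCost`, and `insertCostForTree` are hypothetical stand-ins rather than names from the patch. The point is only the control flow: the cost is added at the moment an in-tree user is found, so no flag or post-loop check is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <vector>

// Hypothetical stand-ins: a scalar is identified by an int, and each scalar
// lists the ids of the values that use it.
struct ToyScalar {
  int Id;
  std::vector<int> Users;
};

struct ToyEntry {
  std::vector<ToyScalar> Scalars;
  int EntryCost; // assumed precomputed cost of re-inserting this entry
};

// Sums EntryCost over every entry that has at least one scalar with a user
// inside the tree. The early break replaces a "found" flag plus a check
// after the inner loop.
int insertCostForTree(const std::vector<ToyEntry> &Entries,
                      const std::set<int> &InTree) {
  int InsertCost = 0;
  for (const ToyEntry &E : Entries) {
    for (const ToyScalar &S : E.Scalars) {
      bool UsedInTree =
          std::any_of(S.Users.begin(), S.Users.end(),
                      [&InTree](int U) { return InTree.count(U) > 0; });
      if (UsedInTree) {
        InsertCost += E.EntryCost;
        break; // count each entry once, then move on to the next entry
      }
    }
  }
  return InsertCost;
}
```

Unlike an early `return`, this form keeps walking the remaining entries, which matches the objection that the cost has to be computed for the whole tree.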

Fixed remarks, rebased.

dtemirbulatov marked 3 inline comments as done.Aug 21 2020, 3:52 PM

Removed unnecessary check for "UserTE" at 3305.

Rebased. Ping.

Good enough for initial implementation?

In D57779#2258046, @xbolva00 wrote:

Good enough for initial implementation?

Yes, to me it looks ready.

In D57779#2266618, @dtemirbulatov wrote:

In D57779#2258046, @xbolva00 wrote:

Good enough for initial implementation?

Yes, to me it looks ready.

Will be able to review it next week, after returning from vacation.

vdmitrie added a subscriber: vdmitrie.Sep 11 2020, 3:36 PM

ABataev added inline comments.Sep 15 2020, 9:30 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3305–3306	Could you compare it with a similar code in BoUpSLP::buildTree? Looks like you still missed some cases for user ignoring.

Matt added a subscriber: Matt.Sep 16 2020, 8:53 AM

Rebased. Moved the InternalTreeUses population out from under the (UseScalar != U || !InTreeUserNeedToExtract(Scalar, UserInst, TLI)) condition at line 2661 in BoUpSLP::buildTree(), since we have to consider every internal user for partial vectorization when calculating the cost.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3305–3306	I think those ignored cases exist because BoUpSLP::buildTree does full vectorization, so we can avoid extracting for in-tree users. Here, however, we have to extract for each user of a value that was once proposed for vectorization.

dtemirbulatov added inline comments.Sep 22 2020, 8:12 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3305–3306	And here we have to extract for each user of a value that was once proposed for vectorization. I mean for partial vectorization.

Ping

Rebased. PING

Rebased. Ping.

Rebased. Ping^2

ABataev added inline comments.Nov 23 2020, 9:50 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3309–3311	Either just `cast` without `if` or `dyn_cast`
4133	Not sure that this is the best criterion. I think you also need to include the distance from the head of the tree to the entry, because some big costs can be compensated by the vectorizable nodes in the tree. What I would do here is just some kind of level ordering search (BFS) starting from the deepest level.
4140	I think you can also exclude entries with the number of operands <= 1.
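
The level-ordering (BFS) idea suggested for line 4133 could be sketched as follows. This is a standalone toy, not code from the patch: the tree is assumed to be given as a hypothetical `Children` adjacency list with node 0 as the head, and `deepestFirstOrder` is an invented name. It only shows one way to visit candidates from the deepest level upward.

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <vector>

// Children[i] lists the operand nodes of node i; node 0 is the tree head.
// All child indices are assumed to be valid indices into Children.
// Returns node ids ordered from the deepest level up to the head.
std::vector<int>
deepestFirstOrder(const std::vector<std::vector<int>> &Children) {
  assert(!Children.empty() && "tree must have a head node");
  std::vector<int> Depth(Children.size(), -1);
  std::vector<int> Order;
  std::queue<int> Work;
  Depth[0] = 0;
  Work.push(0);
  while (!Work.empty()) {
    int N = Work.front();
    Work.pop();
    Order.push_back(N);
    for (int C : Children[N]) {
      if (Depth[C] != -1)
        continue; // already reached through another user
      Depth[C] = Depth[N] + 1;
      Work.push(C);
    }
  }
  // BFS visits levels in increasing depth, so a stable sort by decreasing
  // depth keeps the within-level order intact.
  std::stable_sort(Order.begin(), Order.end(),
                   [&Depth](int A, int B) { return Depth[A] > Depth[B]; });
  return Order;
}
```

For a head with two operands where only the first operand has an operand of its own, the order is the grandchild first, then the two operands, then the head.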

anton-afanasyev mentioned this in D90445: [SLP] Make SLPVectorizer to use `llvm.masked.gather` intrinsic.Nov 25 2020, 3:20 AM

dtemirbulatov added inline comments.Dec 3 2020, 5:55 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4133	Hmm, implemented, but I don't see any benefit from that, plus we have to do BFS search. And we are going to throw away any non-vectorizable nodes at 4295.
4140	But why? The only thing that matters here is the cost.

dtemirbulatov marked 2 inline comments as done.Dec 3 2020, 5:57 PM

ABataev added inline comments.Dec 4 2020, 5:20 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4133	It may trigger for targets like silvermont or in future for vectorized functions.
4140	Because the main idea is to drop gathers, and dropping one gather in favor of another will certainly not be profitable. But it may improve compile time and the list of candidates. The only case you need to check for is the latest masked gather case; it may be profitable to convert it to gathers for some targets.

dtemirbulatov marked an inline comment as done.Dec 7 2020, 7:32 AM

dtemirbulatov added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4133	I measured the BFS approach against this implementation. With BFS, it is ~10% less efficient on SPEC2006 INT and ~20% less efficient on the compilable SPEC2006 FP benchmarks. By efficiency, I mean the total number of reduced trees over the whole compilation.
4140	I think I can check here whether scatter/gather is supported via TargetrInfo and, if it is not, move all nodes with TreeEntry::ScatterVectorize to TreeEntry::Gather.

anton-afanasyev added inline comments.Dec 7 2020, 8:43 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4140	I believe it's the wrong decision to check scatter/gather target support, for the reason mentioned here: https://reviews.llvm.org/D92701#2435573. Why can't we just rely on costs (the node cost and the total one)?

dtemirbulatov marked an inline comment as not done.Dec 7 2020, 9:04 AM

dtemirbulatov added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4140	I agree with @anton-afanasyev here. I am not sure what @ABataev wants here. If I exclude (operands <= 1), then we would lose half of all the SLP tests affected by throttling.

ABataev added inline comments.Dec 7 2020, 9:16 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4133	Could you post it anyway to check if it may be improved?
4140	I did not say anything about checking if scatter is supported here. I just said that we can improve the criterion here by checking that the entry node has at least 2 operands (because if it has just one operand, most probably we can skip it) and we just need to check the nodes with only 1 operand if it is gather scatter node, because it may be better to represent it as simple gather.
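
The criterion described for line 4140 can be written down as a small predicate. The sketch below is only one reading of that comment: `ToyState`, `ToyTreeEntry`, and `isThrottleCandidate` are invented for illustration, and skipping entries that are already gathers is an extra assumption of this sketch rather than something stated above.

```cpp
#include <cassert>

// Toy mirror of the entry states named in the review; not LLVM's real type.
enum class ToyState { Vectorize, ScatterVectorize, Gather };

struct ToyTreeEntry {
  ToyState State;
  unsigned NumOperands;
};

// Consider entries with at least two operands, plus single-operand
// scatter/gather entries, which may be cheaper as a plain gather on some
// targets.
bool isThrottleCandidate(const ToyTreeEntry &E) {
  if (E.State == ToyState::Gather)
    return false; // assumption: an existing gather has nothing left to drop
  if (E.NumOperands >= 2)
    return true;
  return E.State == ToyState::ScatterVectorize;
}
```

A filter like this only narrows the candidate list; whether a candidate is actually dropped would still be decided by cost.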

dtemirbulatov added inline comments.Dec 7 2020, 9:19 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4133	ok, I might miss something. Thanks.

Here is the BFS version of the change. Rebased.

And I counted the total number of nodes vectorized with throttling, instead of just the number of successful tree reductions. The total is ~25% higher for INT and FP CPU2006 (AVX2 and AVX512F) with the Cost sort compared to Distance.

Discussed further improvements with @ABataev offline, and he suggested removing the throttle limiter ("slp-throttling-budget"), at least for basic blocks without calls. I am looking into this new functionality.

Removed the "slp-throttling-budget" limiter for trees without calls.
Moved the main tree reduction loop to the getTreeCost() function.
Deleted the ProposedToGather node attribute from EntryState.

Rebased. I measured the compile-time impact on CPU2006 integer and have not noticed any significant regressions in SLP compile time compared to SLP throttling with the limiter.

In D57779#2525124, @dtemirbulatov wrote:

Rebased. I measured the compile-time impact on CPU2006 integer and have not noticed any significant regressions in SLP compile time compared to SLP throttling with the limiter.

I mean only the SLP time regression, measured using the "-ftime-trace" flag.

At Dinar's request, I've measured the compile-time regression: http://llvm-compile-time-tracker.com/compare.php?from=f3449ed6073cac58efd9b62d0eb285affa650238&to=39362e11add238c45a7a7d55c1e002005f396fb7&stat=instructions. The regression is visible, but it is acceptable for such a change imho. The largest regression comes from CMakeFiles/clamscan.dir/libclamav_uuencode.c.o (+11.28%), so one can investigate this particular file.

In D57779#2525959, @anton-afanasyev wrote:

At Dinar's request, I've measured the compile-time regression: http://llvm-compile-time-tracker.com/compare.php?from=f3449ed6073cac58efd9b62d0eb285affa650238&to=39362e11add238c45a7a7d55c1e002005f396fb7&stat=instructions. The regression is visible, but it is acceptable for such a change imho. The largest regression comes from CMakeFiles/clamscan.dir/libclamav_uuencode.c.o (+11.28%), so one can investigate this particular file.

Thanks, Anton. Eh, I don't see any time difference on my side for `CMakeFiles/clamscan.dir/libclamav_uuencode.c.o` with -O3 for -mavx2 or -mavx512f. Also, SLP didn't try to throttle any trees in this particular test, so it looks like noise to me.

Ping.

Rebased, Ping.

ABataev added inline comments.Feb 10 2021, 7:40 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
125–133	Do we really need both of these options? `MaxCostsRecalculations` should be enough.
611–615	Does "scalar form" mean "gathered nodes"? I don't think we can currently end up in the situation shown in the picture; we can't have a gathered node that depends on another node (either gathered or vectorized).
646–650	Why do we need to save intermediate results? Cannot it be solved in a single iteration loop without saving the intermediate results in the class instance?

ABataev added inline comments.Feb 10 2021, 7:40 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2572–2574	Looks like unrelated change

dtemirbulatov added inline comments.Feb 10 2021, 1:21 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
125–133	ok, thanks.
646–650	I have noticed many regressions if we decide right away, and rebuilding the tree afterward is expensive.

ABataev added inline comments.Feb 10 2021, 1:40 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
646–650	What is the cause of those regressions? If I understand it correctly, you're just trying to find the subtree, exclude its cost, compare, and repeat if it is not profitable. What prevents doing that in a loop without saving intermediate results in the class, keeping the results in stack vectors if needed?

dtemirbulatov added inline comments.Feb 10 2021, 3:41 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
646–650	For example, we might partially vectorize in vectorizeStoreChain(), while later it would be possible to fully vectorize the same tree with tryToVectorizeList() or tryToReduce().

ABataev added inline comments.Feb 10 2021, 3:55 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
646–650	Could you give an example, please?

Here we can see the regression: it misses vectorizing the whole tree because partial vectorization kicks in too early, and the "add" instructions stay scalar:

--- a/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll
+++ b/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll
@@ -7,49 +7,65 @@ define void @test(i32) {
; CHECK-NEXT: entry:
; CHECK-NEXT: br label [[LOOP:%.*]]
; CHECK: loop:
-; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP15:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
-; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> poison, <8 x i32> <i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
-; CHECK-NEXT: [[TMP2:%.*]] = extractelement <8 x i32> [[SHUFFLE]], i32 1
-; CHECK-NEXT: [[TMP3:%.*]] = add <8 x i32> [[SHUFFLE]], <i32 0, i32 55, i32 285, i32 1240, i32 1496, i32 8555, i32 12529, i32 13685>
-; CHECK-NEXT: [[TMP4:%.*]] = call i32 @llvm.vector.reduce.and.v8i32(<8 x i32> [[TMP3]])
-; CHECK-NEXT: [[OP_EXTRA:%.*]] = and i32 [[TMP4]], [[TMP0:%.*]]
-; CHECK-NEXT: [[OP_EXTRA1:%.*]] = and i32 [[OP_EXTRA]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA2:%.*]] = and i32 [[OP_EXTRA1]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA3:%.*]] = and i32 [[OP_EXTRA2]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA4:%.*]] = and i32 [[OP_EXTRA3]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA5:%.*]] = and i32 [[OP_EXTRA4]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA6:%.*]] = and i32 [[OP_EXTRA5]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA7:%.*]] = and i32 [[OP_EXTRA6]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA8:%.*]] = and i32 [[OP_EXTRA7]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA9:%.*]] = and i32 [[OP_EXTRA8]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA10:%.*]] = and i32 [[OP_EXTRA9]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA11:%.*]] = and i32 [[OP_EXTRA10]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA12:%.*]] = and i32 [[OP_EXTRA11]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA13:%.*]] = and i32 [[OP_EXTRA12]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA14:%.*]] = and i32 [[OP_EXTRA13]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA15:%.*]] = and i32 [[OP_EXTRA14]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA16:%.*]] = and i32 [[OP_EXTRA15]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA17:%.*]] = and i32 [[OP_EXTRA16]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA18:%.*]] = and i32 [[OP_EXTRA17]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA19:%.*]] = and i32 [[OP_EXTRA18]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA20:%.*]] = and i32 [[OP_EXTRA19]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA21:%.*]] = and i32 [[OP_EXTRA20]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA22:%.*]] = and i32 [[OP_EXTRA21]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA23:%.*]] = and i32 [[OP_EXTRA22]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA24:%.*]] = and i32 [[OP_EXTRA23]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA25:%.*]] = and i32 [[OP_EXTRA24]], [[TMP0]]
-; CHECK-NEXT: [[OP_EXTRA26:%.*]] = and i32 [[OP_EXTRA25]], [[TMP0]]
-; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> poison, i32 [[OP_EXTRA26]], i32 0
-; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x i32> [[TMP5]], i32 14910, i32 1
-; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x i32> poison, i32 [[TMP2]], i32 0
-; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x i32> [[TMP7]], i32 [[TMP2]], i32 1
-; CHECK-NEXT: [[TMP9:%.*]] = and <2 x i32> [[TMP6]], [[TMP8]]
-; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i32> [[TMP6]], [[TMP8]]
-; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> [[TMP10]], <2 x i32> <i32 0, i32 3>
-; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x i32> [[TMP11]], i32 0
-; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x i32> poison, i32 [[TMP12]], i32 0
-; CHECK-NEXT: [[TMP14:%.*]] = extractelement <2 x i32> [[TMP11]], i32 1
-; CHECK-NEXT: [[TMP15]] = insertelement <2 x i32> [[TMP13]], i32 [[TMP14]], i32 1
+; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP19:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
+; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0
+; CHECK-NEXT: [[VAL_0:%.*]] = add i32 [[TMP2]], 0
+; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1
+; CHECK-NEXT: [[VAL_1:%.*]] = and i32 [[TMP3]], [[VAL_0]]
+; CHECK-NEXT: [[VAL_2:%.*]] = and i32 [[VAL_1]], [[TMP0:%.*]]
+; CHECK-NEXT: [[VAL_3:%.*]] = and i32 [[VAL_2]], [[TMP0]]
+; CHECK-NEXT: [[VAL_4:%.*]] = and i32 [[VAL_3]], [[TMP0]]
+; CHECK-NEXT: [[VAL_5:%.*]] = and i32 [[VAL_4]], [[TMP0]]
+; CHECK-NEXT: [[VAL_6:%.*]] = add i32 [[TMP3]], 55
+; CHECK-NEXT: [[VAL_7:%.*]] = and i32 [[VAL_5]], [[VAL_6]]
+; CHECK-NEXT: [[VAL_8:%.*]] = and i32 [[VAL_7]], [[TMP0]]
+; CHECK-NEXT: [[VAL_9:%.*]] = and i32 [[VAL_8]], [[TMP0]]
+; CHECK-NEXT: [[VAL_10:%.*]] = and i32 [[VAL_9]], [[TMP0]]
+; CHECK-NEXT: [[VAL_11:%.*]] = add i32 [[TMP3]], 285
+; CHECK-NEXT: [[VAL_12:%.*]] = and i32 [[VAL_10]], [[VAL_11]]
+; CHECK-NEXT: [[VAL_13:%.*]] = and i32 [[VAL_12]], [[TMP0]]
+; CHECK-NEXT: [[VAL_14:%.*]] = and i32 [[VAL_13]], [[TMP0]]
+; CHECK-NEXT: [[VAL_15:%.*]] = and i32 [[VAL_14]], [[TMP0]]
+; CHECK-NEXT: [[VAL_16:%.*]] = and i32 [[VAL_15]], [[TMP0]]
+; CHECK-NEXT: [[VAL_17:%.*]] = and i32 [[VAL_16]], [[TMP0]]
+; CHECK-NEXT: [[VAL_18:%.*]] = add i32 [[TMP3]], 1240
+; CHECK-NEXT: [[VAL_19:%.*]] = and i32 [[VAL_17]], [[VAL_18]]
+; CHECK-NEXT: [[VAL_20:%.*]] = add i32 [[TMP3]], 1496
+; CHECK-NEXT: [[VAL_21:%.*]] = and i32 [[VAL_19]], [[VAL_20]]
+; CHECK-NEXT: [[VAL_22:%.*]] = and i32 [[VAL_21]], [[TMP0]]
+; CHECK-NEXT: [[VAL_23:%.*]] = and i32 [[VAL_22]], [[TMP0]]
+; CHECK-NEXT: [[VAL_24:%.*]] = and i32 [[VAL_23]], [[TMP0]]
+; CHECK-NEXT: [[VAL_25:%.*]] = and i32 [[VAL_24]], [[TMP0]]
+; CHECK-NEXT: [[VAL_26:%.*]] = and i32 [[VAL_25]], [[TMP0]]
+; CHECK-NEXT: [[VAL_27:%.*]] = and i32 [[VAL_26]], [[TMP0]]
+; CHECK-NEXT: [[VAL_28:%.*]] = and i32 [[VAL_27]], [[TMP0]]
+; CHECK-NEXT: [[VAL_29:%.*]] = and i32 [[VAL_28]], [[TMP0]]
+; CHECK-NEXT: [[VAL_30:%.*]] = and i32 [[VAL_29]], [[TMP0]]
+; CHECK-NEXT: [[VAL_31:%.*]] = and i32 [[VAL_30]], [[TMP0]]
+; CHECK-NEXT: [[VAL_32:%.*]] = and i32 [[VAL_31]], [[TMP0]]
+; CHECK-NEXT: [[VAL_33:%.*]] = and i32 [[VAL_32]], [[TMP0]]
+; CHECK-NEXT: [[VAL_34:%.*]] = add i32 [[TMP3]], 8555
+; CHECK-NEXT: [[VAL_35:%.*]] = and i32 [[VAL_33]], [[VAL_34]]
+; CHECK-NEXT: [[VAL_36:%.*]] = and i32 [[VAL_35]], [[TMP0]]
+; CHECK-NEXT: [[VAL_37:%.*]] = and i32 [[VAL_36]], [[TMP0]]
+; CHECK-NEXT: [[VAL_38:%.*]] = and i32 [[VAL_37]], [[TMP0]]
+; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x i32> poison, i32 [[TMP3]], i32 0
+; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> [[TMP4]], i32 [[TMP3]], i32 1
+; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i32> [[TMP5]], <i32 12529, i32 13685>
+; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x i32> [[TMP6]], i32 0
+; CHECK-NEXT: [[VAL_40:%.*]] = and i32 [[VAL_38]], [[TMP7]]
+; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i32> [[TMP6]], i32 1
+; CHECK-NEXT: [[TMP9:%.*]] = insertelement <2 x i32> poison, i32 [[VAL_40]], i32 0
+; CHECK-NEXT: [[TMP10:%.*]] = insertelement <2 x i32> [[TMP9]], i32 14910, i32 1
+; CHECK-NEXT: [[TMP11:%.*]] = insertelement <2 x i32> poison, i32 [[TMP8]], i32 0
+; CHECK-NEXT: [[TMP12:%.*]] = insertelement <2 x i32> [[TMP11]], i32 [[TMP3]], i32 1
+; CHECK-NEXT: [[TMP13:%.*]] = and <2 x i32> [[TMP10]], [[TMP12]]
+; CHECK-NEXT: [[TMP14:%.*]] = add <2 x i32> [[TMP10]], [[TMP12]]
+; CHECK-NEXT: [[TMP15:%.*]] = shufflevector <2 x i32> [[TMP13]], <2 x i32> [[TMP14]], <2 x i32> <i32 0, i32 3>
+; CHECK-NEXT: [[TMP16:%.*]] = extractelement <2 x i32> [[TMP15]], i32 0
+; CHECK-NEXT: [[TMP17:%.*]] = insertelement <2 x i32> poison, i32 [[TMP16]], i32 0
+; CHECK-NEXT: [[TMP18:%.*]] = extractelement <2 x i32> [[TMP15]], i32 1
+; CHECK-NEXT: [[TMP19]] = insertelement <2 x i32> [[TMP17]], i32 [[TMP18]], i32 1
; CHECK-NEXT: br label [[LOOP]]
;
; FORCE_REDUCTION-LABEL: @test(

In D57779#2556783, @dtemirbulatov wrote:

Here we can see the regression: it misses vectorizing the whole tree because partial vectorization kicks in too early, and the "add" instructions stay scalar:

To me, it just looks like we need to postpone the vectorization of phi nodes in the function rather than trying to fix all the issues in the world in a single patch.

I think I could give one simpler example without PHI nodes.

Here is another example:
source_filename = "psspread.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local void @spread_q_poisson() local_unnamed_addr #0 {
entry:

%div.i = fdiv float undef, undef
%conv13.i = fdiv float 1.000000e+00, %div.i
%conv18.i = fdiv float 1.000000e+00, undef
%conv23.i = fdiv float 1.000000e+00, undef
%conv162 = fptosi float undef to i32
%0 = load float, float* undef, align 4
%1 = load i32, i32* undef, align 4
%add187.us = add nsw i32 %1, %conv162
%add191.us = add nsw i32 undef, undef
%add195.us = add nsw i32 undef, undef
%conv196.us = sitofp i32 %add187.us to float
%mul197.us = fmul float %conv13.i, %conv196.us
%sub198.us = fsub float undef, %mul197.us
%mul.i363.us = fmul float %sub198.us, %sub198.us
%conv200.us = sitofp i32 %add191.us to float
%mul201.us = fmul float %conv18.i, %conv200.us
%sub202.us = fsub float undef, %mul201.us
%mul.i362.us = fmul float %sub202.us, %sub202.us
%conv204.us = sitofp i32 %add195.us to float
%mul205.us = fmul float %conv23.i, %conv204.us
%sub206.us = fsub float %0, %mul205.us
%mul.i.us = fmul float %sub206.us, %sub206.us
%add208.us = fadd float %mul.i363.us, %mul.i362.us
%add209.us = fadd float %add208.us, %mul.i.us
%cmp210.us = fcmp olt float %add209.us, undef
%add230.us = add nsw i32 undef, %add195.us
unreachable

}

attributes #0 = { "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 13.0.0 (/home/dtemirbulatov/llvm/llvm-project-thl/llvm/tools/clang eec04092d67b94f47439a9065b6bd4cd60165be2)"}

With the proposed change it produces:
; ModuleID = 'bug.ll'
source_filename = "psspread.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local void @spread_q_poisson() local_unnamed_addr #0 {
entry:

%div.i = fdiv float undef, undef
%conv13.i = fdiv float 1.000000e+00, %div.i
%conv162 = fptosi float undef to i32
%0 = load float, float* undef, align 4
%1 = load i32, i32* undef, align 4
%add187.us = add nsw i32 %1, %conv162
%conv196.us = sitofp i32 %add187.us to float
%mul197.us = fmul float %conv13.i, %conv196.us
%sub198.us = fsub float undef, %mul197.us
%mul.i363.us = fmul float %sub198.us, %sub198.us
%2 = insertelement <2 x float> <float undef, float poison>, float %0, i32 1
%3 = fsub <2 x float> %2, <float 0x7FF8000000000000, float 0x7FF8000000000000>
%4 = fmul <2 x float> %3, %3
%5 = extractelement <2 x float> %4, i32 0
%add208.us = fadd float %mul.i363.us, %5
%6 = extractelement <2 x float> %4, i32 1
%add209.us = fadd float %add208.us, %6
%cmp210.us = fcmp olt float %add209.us, undef
%add230.us = add nsw i32 undef, undef
unreachable

}

attributes #0 = { "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 13.0.0 (/home/dtemirbulatov/llvm/llvm-project-thl/llvm/tools/clang eec04092d67b94f47439a9065b6bd4cd60165be2)"}

But if we decide right away to vectorize partially, we get this output:
; ModuleID = 'bug.ll'
source_filename = "psspread.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

define dso_local void @spread_q_poisson() local_unnamed_addr #0 {
entry:

%div.i = fdiv float undef, undef
%conv18.i = fdiv float 1.000000e+00, undef
%0 = insertelement <2 x float> poison, float %div.i, i32 0
%1 = insertelement <2 x float> %0, float undef, i32 1
%2 = fdiv <2 x float> <float 1.000000e+00, float 1.000000e+00>, %1
%conv162 = fptosi float undef to i32
%3 = load float, float* undef, align 4
%4 = load i32, i32* undef, align 4
%add187.us = add nsw i32 %4, %conv162
%add191.us = add nsw i32 undef, undef
%add195.us = add nsw i32 undef, undef
%conv200.us = sitofp i32 %add191.us to float
%mul201.us = fmul float %conv18.i, %conv200.us
%sub202.us = fsub float undef, %mul201.us
%mul.i362.us = fmul float %sub202.us, %sub202.us
%5 = insertelement <2 x i32> poison, i32 %add187.us, i32 0
%6 = insertelement <2 x i32> %5, i32 %add195.us, i32 1
%7 = sitofp <2 x i32> %6 to <2 x float>
%8 = fmul <2 x float> %2, %7
%9 = insertelement <2 x float> <float undef, float poison>, float %3, i32 1
%10 = fsub <2 x float> %9, %8
%11 = fmul <2 x float> %10, %10
%12 = extractelement <2 x float> %11, i32 0
%add208.us = fadd float %12, %mul.i362.us
%13 = extractelement <2 x float> %11, i32 1
%add209.us = fadd float %add208.us, %13
%cmp210.us = fcmp olt float %add209.us, undef
%add230.us = add nsw i32 undef, %add195.us
unreachable

}

attributes #0 = { "use-soft-float"="false" }

!llvm.ident = !{!0}

!0 = !{!"clang version 13.0.0 (/home/dtemirbulatov/llvm/llvm-project-thl/llvm/tools/clang eec04092d67b94f47439a9065b6bd4cd60165be2)"}

I see that immediate vectorization is better as it vectorizes more, no? Also, there is a problem, looks like it is caused by the multinode analysis. I'm trying to improve this in my non-power-2 patch, will prepare a separate patch for it.

eh, I think it is not a clear example, I have seen better examples, I will show something better.

Even this example shows that the current solution does not always produce the best result.

at least, we could avoid regressions.

I think the next step is to compare vectorized tree heights (the number of vectorized nodes) among the possible vectorizable trees.

SLP has a greedy approach, so let's assume that full vectorization is always better than partial. We don't have the resources to save all trees and then choose the best one among those saved. I think I can now add choosing the best among the already partially vectorized trees.

Again, even your example showed that this solution is worse in some cases. Why do we need to waste the time and invest in a solution, which is not better than the existing one, requires more time to understand, consumes more memory?
SLP implements a bottom-up approach, i.e. it always tries to vectorize the longest chain (except for PHI nodes, which should be improved). If we have a partial graph, it should not affect other vectorization graphs in the same basic block, generally speaking, just some subnodes may become the subnodes of the other graphs but this is not a problem.
Looks like you're trying to implement something similar to VPlan. We have it already and better to invest the time to implement support for SLP vectorization there.
Redesign is completely different work, it requires correct estimation (not the assumptions, but real investigation), discussion, RFC, approval, and separate implementation.

Ok, Agree.

Addressed @ABataev's remarks and investigated the regression with PHI nodes in PR39774.ll. I have not spotted any other case involving PHI nodes, but I do have several other cases, and they happen quite rarely. I am not sure how to generalize them, and I think VPlan might be helpful. Overall, I think it is ready.

Harbormaster completed remote builds in B94259: Diff 331286.Mar 17 2021, 9:57 AM

ABataev added inline comments.Mar 19 2021, 7:15 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
601	`PriorityQueue`?
5609–5618	Looks like you need to implement something like `reduceSchedulingRegion()`, similar to `extendSchedulingRegion` function. Because otherwise you're going to operate with the larger scheduling region. I.e. need to modify `ScheduleStart` and `ScheduleEnd` data members.
6478	Why `SLPThrottleBudget > 0`? What if `SLPThrottleBudget` equals 0?
6478–6479	Why we can't do something like this:
  int NumAttempts = 0;
  do {
    if (R.isTreeTinyAndNotFullyVectorizable())
      break;
    R.computeMinimumValueSizes();
    InstructionCost Cost = R.getTreeCost();
    InstructionCost UserCost = 0;
    ....
    if (Cost < -SLPCostThreshold) {
      LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");
      R.getORE()->emit(OptimizationRemark(SV_NAME, "VectorizedList",
                                          cast<Instruction>(Ops[0]))
                       << "SLP vectorized with cost " << ore::NV("Cost", Cost)
                       << " and with tree size "
                       << ore::NV("TreeSize", R.getTreeSize()));
      R.vectorizeTree();
      // Move to the next bundle.
      I += VF - 1;
      NextInst = I + 1;
      Changed = true;
      break;
    }
    ...
    /// Do throttling here.
    ++NumAttempts;
  } while (NumAttempts < SLPThrottleBudget);
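The `reduceSchedulingRegion()` idea from the comment on 5609–5618 can be pictured with a small standalone sketch. It is not code from the patch: `Region`, `reduceRegion` and the integer positions are hypothetical stand-ins for `ScheduleStart`/`ScheduleEnd` and the instructions of one basic block, and the only assumption is that the region should shrink to the tightest window still covering the instructions that stay in vectorizable bundles.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the scheduling window: a half-open range
// [Start, End) of instruction positions inside one basic block.
struct Region {
  int Start;
  int End;
};

// Shrink the window to the tightest range that still covers every position
// in Kept (instructions that remain in vectorizable bundles). Positions
// outside the current window are ignored. If nothing is kept, the result is
// an empty window located at the old start.
inline Region reduceRegion(Region R, const std::vector<int> &Kept) {
  assert(R.Start <= R.End && "malformed region");
  int Lo = R.End;   // smallest kept position seen so far
  int Hi = R.Start; // one past the largest kept position seen so far
  for (int Pos : Kept) {
    if (Pos < R.Start || Pos >= R.End)
      continue;
    if (Pos < Lo)
      Lo = Pos;
    if (Pos + 1 > Hi)
      Hi = Pos + 1;
  }
  if (Lo >= Hi)
    return {R.Start, R.Start};
  return {Lo, Hi};
}
```

With a window of positions 0 to 9 and only positions 4 and 7 still vectorized, the sketch returns the window [4, 8), so later scheduling work no longer walks the cancelled instructions at either end.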

dtemirbulatov added inline comments.Mar 21 2021, 5:04 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
601	hmm, PriorityQueue allows duplicates of elements and it might be an issue.
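The concern about duplicates can be reproduced with the standard library alone, since `llvm::PriorityQueue` derives from `std::priority_queue`. The sketch below is only an illustration; the helper names and the integer ids are made up and do not come from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <set>
#include <vector>

// Push every id into a priority queue and report how many elements it
// holds. std::priority_queue stores each pushed element, duplicates included.
inline std::size_t countInPriorityQueue(const std::vector<int> &Ids) {
  std::priority_queue<int> Q;
  for (int Id : Ids)
    Q.push(Id);
  return Q.size();
}

// Insert the same ids into an ordered set, which keeps one copy of each.
inline std::size_t countInSet(const std::vector<int> &Ids) {
  std::set<int> S(Ids.begin(), Ids.end());
  return S.size();
}
```

For the ids 3, 1, 3, 2 the queue ends up with four elements and the set with three, so a queue-based worklist would need its own check to avoid visiting the same tree entry twice.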

Rebased, addressed remarks, added reduceSchedulingRegion() function with the ability to set only ScheduleStart at this time, renamed RemovedOperations property to ProposedToGather.

dtemirbulatov marked 2 inline comments as done.Mar 29 2021, 5:39 PM

dtemirbulatov added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
6478–6479	We are doing partial vectorization and we have to know UserCost to make the correct partial tree cut.

dtemirbulatov marked 2 inline comments as done.Mar 29 2021, 5:40 PM

Harbormaster completed remote builds in B96225: Diff 334018.Mar 29 2021, 5:49 PM

Ping, ready to land?

Will review it on Monday.

I found an error in reduceSchedulingRegion() implementation. I am reworking the change.

Rebased, fixed incorrect comment at 2358, fixed the wrong implementation of shrink scheduling region, changed the code in tryToVectorizeList() as suggested by @ABataev.

dtemirbulatov added inline comments.Apr 8 2021, 8:33 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5642	Perhaps we could also check here for !BS->getScheduleData(I)->isPartOfBundle() and further shrink the region.

dtemirbulatov added inline comments.Apr 8 2021, 8:35 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5642	ah, no, this instruction could belong to a real gather node.

Harbormaster completed remote builds in B97744: Diff 336118.Apr 8 2021, 9:11 AM

Slightly improved the scheduler region shrinking algorithm by allowing removal of unnecessary unmaps in instruction chains.

Harbormaster completed remote builds in B97848: Diff 336272.Apr 8 2021, 5:50 PM

Rebased and formatted. Three test cases were affected after @ABataev landed D100495 "Add detection of shuffled/perfect matching of tree entries." Restored the "-slp-throttle" flag so that AArch64/gather-cost.ll stays functional, and manually adjusted "TMP" in PR31243_sext in minimum-sizes.ll, probably because of a bug in update_test_checks.py.

Herald added subscribers: kerbowa, nhaehnle, jvesely.Apr 25 2021, 5:40 PM

Harbormaster completed remote builds in B100843: Diff 340406.Apr 25 2021, 6:26 PM

Fixed two format errors.

Harbormaster completed remote builds in B100858: Diff 340425.Apr 25 2021, 9:45 PM

RKSimon added inline comments.Apr 26 2021, 7:34 AM

llvm/test/Transforms/SLPVectorizer/X86/uitofp.ll
683 ↗	(On Diff #340425)	what happened to these checks?

Updated llvm/test/Transforms/SLPVectorizer/X86/uitofp.ll checks on request from @RKSimon

dtemirbulatov marked an inline comment as done.Apr 26 2021, 8:07 AM

@RKSimon , I have to split AVX256NODQ X86/sitofp.ll and maybe others.

ABataev added inline comments.Apr 26 2021, 8:14 AM

llvm/test/Transforms/SLPVectorizer/X86/arith-fix.ll
357–361 ↗	(On Diff #340530)	Looks like it does not respect `MinTreeSize` option anymore. And it is strange that such code sequence gets profitable for vectorization (scalar cost is 8, vector cost is 9)
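The numbers in this comment can be run through the profitability test quoted in the earlier suggestion (`Cost < -SLPCostThreshold`). This is a minimal sketch, assuming the tree cost is simply vector cost minus scalar cost and the threshold keeps its default of 0; `costDelta` and `isProfitable` are illustrative names, not functions from SLPVectorizer.cpp.

```cpp
#include <cassert>

// Cost delta in the sense used in this review: vector cost minus scalar
// cost, so a negative value means the vector form is cheaper.
inline int costDelta(int ScalarCost, int VectorCost) {
  return VectorCost - ScalarCost;
}

// Same shape as the `Cost < -SLPCostThreshold` check; Threshold defaults
// to 0 here, matching the default of the slp-threshold option.
inline bool isProfitable(int Delta, int Threshold = 0) {
  return Delta < -Threshold;
}
```

With a scalar cost of 8 and a vector cost of 9 the delta is +1, which fails the check, so under these assumptions the sequence should not have been reported as profitable.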

Harbormaster completed remote builds in B100939: Diff 340530.Apr 26 2021, 9:13 AM

Rebased, forbade "detection of shuffled/perfect matching of tree entries" for TreeEntries canceled during throttling, and replaced TEVectorizableSet with PriorityQueue.

Harbormaster completed remote builds in B102213: Diff 342283.May 2 2021, 3:48 PM

Fix formatting.

Harbormaster completed remote builds in B102223: Diff 342296.May 2 2021, 6:39 PM

ABataev added inline comments.May 3 2021, 5:45 AM

llvm/test/Transforms/SLPVectorizer/X86/powof2div.ll
85–91	Still looks like it does not respect mintreesize

dtemirbulatov added inline comments.May 4 2021, 7:00 PM

llvm/test/Transforms/SLPVectorizer/X86/powof2div.ll
85–91	hmm, this is not the case here: the tree height is 5, the divide node cost is 20, and after deleting this node, extracting from the "add" node costs 4 and inserting after the scalar divide costs 4, so the final tree cost is -4. llvm-mca for -mattr=+avx shows 1305 cycles before and 1609 cycles after.
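A quick check of the arithmetic in this reply, as a sketch only. It assumes the quoted costs combine additively; the review does not state the cost of the full tree, so the value 8 used below is inferred from the quoted figures (20, 4, 4 and -4) rather than taken from the patch, and `costAfterCut` is a hypothetical helper.

```cpp
#include <cassert>

// Hypothetical helper: cost of the tree after one node is cut out of it.
//   FullCost    - cost of the tree with the node still vectorized
//   NodeCost    - cost contributed by the node being removed
//   ExtractCost - cost of extracting the node's operands from a vector
//   InsertCost  - cost of inserting the scalar results back into a vector
inline int costAfterCut(int FullCost, int NodeCost, int ExtractCost,
                        int InsertCost) {
  assert(ExtractCost >= 0 && InsertCost >= 0 && "overheads cannot be gains");
  return FullCost - NodeCost + ExtractCost + InsertCost;
}
```

Under that assumption, `costAfterCut(8, 20, 4, 4)` gives -4, which matches the final tree cost quoted above.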

Added a check of the current tree size against MinTreeSize before making the decision to vectorize.

Harbormaster completed remote builds in B102679: Diff 342959.May 5 2021, 1:44 AM

1) Fixed an issue in getInsertCost(): I incorrectly added gather costs to nodes that were not related to any node proposed for vectorization. I had thought of this and previously used "ScalarToTreeEntry.count(Op) > 0", but I discovered that I was not updating ScalarToTreeEntry while reducing the tree.
2) Now I check isTreeTinyAndNotFullyVectorizable() before deciding to vectorize.
3) I introduced the "MinVecNodes" parameter, which sets the minimum number of vectorizable nodes we want to keep while throttling; it currently defaults to 2. For example, if the tree has 3 nodes in total and satisfies MinTreeSize, at least two nodes must remain vectorizable after reducing the tree for the decision to be positive.

Harbormaster completed remote builds in B102986: Diff 343392.May 6 2021, 7:24 AM

Why do we need MinVecNodes? MinTreeSize and all associated analysis must be enough

It is the Transforms/SLPVectorizer/X86/tiny-tree.ll transform that scared me.
From:
define void @tiny_tree_not_fully_vectorizable(double* noalias nocapture %dst, double* noalias nocapture readonly %src, i64 %count) #0 {
entry:

%cmp12 = icmp eq i64 %count, 0
br i1 %cmp12, label %for.end, label %for.body

for.body: ; preds = %entry, %for.body

%i.015 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
%dst.addr.014 = phi double* [ %add.ptr4, %for.body ], [ %dst, %entry ]
%src.addr.013 = phi double* [ %add.ptr, %for.body ], [ %src, %entry ]
%0 = load double, double* %src.addr.013, align 8
store double %0, double* %dst.addr.014, align 8
%arrayidx2 = getelementptr inbounds double, double* %src.addr.013, i64 2
%1 = load double, double* %arrayidx2, align 8
%arrayidx3 = getelementptr inbounds double, double* %dst.addr.014, i64 1
store double %1, double* %arrayidx3, align 8
%add.ptr = getelementptr inbounds double, double* %src.addr.013, i64 %i.015
%add.ptr4 = getelementptr inbounds double, double* %dst.addr.014, i64 %i.015
%inc = add i64 %i.015, 1
%exitcond = icmp eq i64 %inc, %count
br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body, %entry

ret void

}
to:
define void @tiny_tree_not_fully_vectorizable(double* noalias nocapture %dst, double* noalias nocapture readonly %src, i64 %count) #0 {
entry:

%cmp12 = icmp eq i64 %count, 0
br i1 %cmp12, label %for.end, label %for.body

for.body: ; preds = %for.body, %entry

%i.015 = phi i64 [ %inc, %for.body ], [ 0, %entry ]
%dst.addr.014 = phi double* [ %add.ptr4, %for.body ], [ %dst, %entry ]
%src.addr.013 = phi double* [ %add.ptr, %for.body ], [ %src, %entry ]
%0 = load double, double* %src.addr.013, align 8
%arrayidx2 = getelementptr inbounds double, double* %src.addr.013, i64 2
%1 = load double, double* %arrayidx2, align 8
%arrayidx3 = getelementptr inbounds double, double* %dst.addr.014, i64 1
%2 = insertelement <2 x double> poison, double %0, i32 0
%3 = insertelement <2 x double> %2, double %1, i32 1
%4 = bitcast double* %dst.addr.014 to <2 x double>*
store <2 x double> %3, <2 x double>* %4, align 8
%add.ptr = getelementptr inbounds double, double* %src.addr.013, i64 %i.015
%add.ptr4 = getelementptr inbounds double, double* %dst.addr.014, i64 %i.015
%inc = add i64 %i.015, 1
%exitcond = icmp eq i64 %inc, %count
br i1 %exitcond, label %for.end, label %for.body

for.end: ; preds = %for.body, %entry

ret void

}

but now with llvm-mca with -mattr=+corei7-avx, I see the change from 1111 to 1014 cycles, so it looks good. I will check other cases.

If so, it just means that our min-tree-size analysis is too strict and must be fixed in general, but not by introducing some new throttling-specific options. We may have the same situation without throttling.

Rebased, removed the SLP parameter MinVecNodes. Added estimations of a good tree reduction:
1) If the tree contained some real operations (binary, arithmetic, calls) that were proposed for vectorization, then we don't want to reduce this tree to just load and store operations in vectorized form.
2) If the tree doesn't have any real operations (binary, arithmetic, ...), then we have to make sure that at least the root node and the node next to the root are going to be vectorized.
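The two rules above could be expressed as a single predicate. The following is a sketch under stated assumptions, not the code from the diff: `Kind`, `Node` and `isAcceptableReduction` are hypothetical, the vector lists the nodes originally proposed for vectorization with index 0 as the root and index 1 as the node next to it, and `Vectorized` records whether a node survives throttling.

```cpp
#include <cassert>
#include <vector>

// Hypothetical node categories; "real" operations are arithmetic and calls.
enum class Kind { Load, Store, Arith, Call };

struct Node {
  Kind K;
  bool Vectorized; // true if the node is still vectorized after throttling
};

inline bool isRealOp(Kind K) { return K == Kind::Arith || K == Kind::Call; }

// Rule 1: if any real operation was proposed, at least one real operation
// must stay vectorized (the tree may not degrade to loads and stores only).
// Rule 2: otherwise the root and the node next to it must stay vectorized.
inline bool isAcceptableReduction(const std::vector<Node> &Tree) {
  bool HadRealOp = false;
  bool KeptRealOp = false;
  for (const Node &N : Tree) {
    if (!isRealOp(N.K))
      continue;
    HadRealOp = true;
    KeptRealOp = KeptRealOp || N.Vectorized;
  }
  if (HadRealOp)
    return KeptRealOp;
  return Tree.size() >= 2 && Tree[0].Vectorized && Tree[1].Vectorized;
}
```

A store/arith/load tree whose only arithmetic node was cancelled is rejected by rule 1, while a plain store/load tree is accepted by rule 2 only when both of its first two nodes remain vectorized.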

Harbormaster completed remote builds in B105050: Diff 346180.May 18 2021, 12:17 PM

Formatting.

Harbormaster completed remote builds in B105196: Diff 346402.May 19 2021, 4:33 AM

Allen added a subscriber: Allen.May 20 2021, 10:16 PM

Rebased. I switched to a path-aware tree reduction approach: we start from the leaves of a vectorizable tree and move toward the root of that tree.
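A leaves-to-root reduction of this kind might look roughly like the sketch below. It is an illustration under assumptions rather than the patch's algorithm: the `Node` layout, the cost model (sum of node costs plus gather costs for cut operands of vectorized users), the choice of which leaf to cut first, and the `Budget` parameter are all hypothetical.

```cpp
#include <cassert>
#include <vector>

struct Node {
  int Parent;     // index of the user node toward the root, -1 for the root
  int Cost;       // vector-minus-scalar cost when vectorized (negative = gain)
  int GatherCost; // cost of building this operand from scalars once it is cut
  bool Vectorized;
};

// Cost of the vectorized part plus gathers that feed vectorized users.
inline int treeCost(const std::vector<Node> &T) {
  int Sum = 0;
  for (const Node &N : T) {
    if (N.Vectorized)
      Sum += N.Cost;
    else if (N.Parent >= 0 && T[N.Parent].Vectorized)
      Sum += N.GatherCost;
  }
  return Sum;
}

// A vectorized node is a leaf if none of its operand nodes is vectorized.
inline bool isVectorizedLeaf(const std::vector<Node> &T, int Idx) {
  if (!T[Idx].Vectorized)
    return false;
  for (const Node &N : T)
    if (N.Vectorized && N.Parent == Idx)
      return false;
  return true;
}

// Cut one leaf per step, preferring the leaf whose removal saves the most,
// until the cost turns negative or Budget cuts have been made. The root is
// never cut. Returns true if a profitable partial tree was found.
inline bool throttle(std::vector<Node> &T, int Budget) {
  for (int Step = 0; Step <= Budget; ++Step) {
    if (treeCost(T) < 0)
      return true;
    if (Step == Budget)
      break;
    int Best = -1;
    for (int I = 0, E = static_cast<int>(T.size()); I != E; ++I) {
      if (T[I].Parent < 0 || !isVectorizedLeaf(T, I))
        continue;
      if (Best < 0 || T[I].Cost - T[I].GatherCost >
                          T[Best].Cost - T[Best].GatherCost)
        Best = I;
    }
    if (Best < 0)
      return false;
    T[Best].Vectorized = false;
  }
  return false;
}
```

For a three-node chain with costs -2 (root), -1 and +5 (leaf, gather cost 2), the full tree costs 2 and is rejected; cutting the leaf leaves -2 - 1 + 2 = -1, so the partial tree is accepted after one step.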

Harbormaster completed remote builds in B123830: Diff 372463.Sep 14 2021, 6:39 AM

dtemirbulatov mentioned this in D110623: [SLP] Avoid calculating expensive spill cost when it is not required.Sep 28 2021, 6:16 AM

Current status? Review stalled?

Herald added a project: Restricted Project.Sep 7 2022, 12:55 PM

Herald added a subscriber: pcwang-thead.

dtemirbulatov abandoned this revision.Sep 17 2022, 4:23 PM

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	746 lines
llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll	34 lines
llvm/test/Transforms/SLPVectorizer/X86/load-merge.ll	13 lines
llvm/test/Transforms/SLPVectorizer/X86/powof2div.ll	47 lines
llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll	20 lines

Diff 284617

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

Show All 23 Lines
#include "llvm/ADT/None.h"		#include "llvm/ADT/None.h"
#include "llvm/ADT/Optional.h"		#include "llvm/ADT/Optional.h"
#include "llvm/ADT/PostOrderIterator.h"		#include "llvm/ADT/PostOrderIterator.h"
#include "llvm/ADT/STLExtras.h"		#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetVector.h"		#include "llvm/ADT/SetVector.h"
#include "llvm/ADT/SmallBitVector.h"		#include "llvm/ADT/SmallBitVector.h"
#include "llvm/ADT/SmallPtrSet.h"		#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallSet.h"		#include "llvm/ADT/SmallSet.h"
		#include "llvm/ADT/SmallString.h"
#include "llvm/ADT/SmallVector.h"		#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
#include "llvm/ADT/iterator.h"		#include "llvm/ADT/iterator.h"
#include "llvm/ADT/iterator_range.h"		#include "llvm/ADT/iterator_range.h"
#include "llvm/Analysis/AliasAnalysis.h"		#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/Analysis/CodeMetrics.h"		#include "llvm/Analysis/CodeMetrics.h"
#include "llvm/Analysis/DemandedBits.h"		#include "llvm/Analysis/DemandedBits.h"
#include "llvm/Analysis/GlobalsModRef.h"		#include "llvm/Analysis/GlobalsModRef.h"
▲ Show 20 Lines • Show All 74 Lines • ▼ Show 20 Lines

static cl::opt<int>		static cl::opt<int>
SLPCostThreshold("slp-threshold", cl::init(0), cl::Hidden,		SLPCostThreshold("slp-threshold", cl::init(0), cl::Hidden,
cl::desc("Only vectorize if you gain more than this "		cl::desc("Only vectorize if you gain more than this "
"number "));		"number "));

static cl::opt<bool>		static cl::opt<bool>
ShouldVectorizeHor("slp-vectorize-hor", cl::init(true), cl::Hidden,		ShouldVectorizeHor("slp-vectorize-hor", cl::init(true), cl::Hidden,
cl::desc("Attempt to vectorize horizontal reductions"));		cl::desc("Attempt to vectorize horizontal reductions"));

		static cl::opt<bool>
		SLPThrottling("slp-throttle", cl::init(true), cl::Hidden,
		cl::desc("Enable tree partial vectorize with throttling"));

		static cl::opt<unsigned>
		MaxCostsRecalculations("slp-throttling-budget", cl::init(32), cl::Hidden,
		cl::desc("Limit the total number of nodes for cost "
		"recalculations during throttling"));

		ABataev: Tabs are added
		ABataev: Do we really need both of these options? `MaxCostsRecalculations` should be enough.
		dtemirbulatov: ok, thanks.
static cl::opt<bool> ShouldStartVectorizeHorAtStore(		static cl::opt<bool> ShouldStartVectorizeHorAtStore(
"slp-vectorize-hor-store", cl::init(false), cl::Hidden,		"slp-vectorize-hor-store", cl::init(false), cl::Hidden,
cl::desc(		cl::desc(
"Attempt to vectorize horizontal reductions feeding into a store"));		"Attempt to vectorize horizontal reductions feeding into a store"));

static cl::opt<int>		static cl::opt<int>
MaxVectorRegSizeOption("slp-max-reg-size", cl::init(128), cl::Hidden,		MaxVectorRegSizeOption("slp-max-reg-size", cl::init(128), cl::Hidden,
cl::desc("Attempt to vectorize for this register size in bits"));		cl::desc("Attempt to vectorize for this register size in bits"));
▲ Show 20 Lines • Show All 421 Lines • ▼ Show 20 Lines	if (MaxVectorRegSizeOption.getNumOccurrences())
MaxVecRegSize = MaxVectorRegSizeOption;		MaxVecRegSize = MaxVectorRegSizeOption;
else		else
MaxVecRegSize = TTI->getRegisterBitWidth(true);		MaxVecRegSize = TTI->getRegisterBitWidth(true);

if (MinVectorRegSizeOption.getNumOccurrences())		if (MinVectorRegSizeOption.getNumOccurrences())
MinVecRegSize = MinVectorRegSizeOption;		MinVecRegSize = MinVectorRegSizeOption;
else		else
MinVecRegSize = TTI->getMinVectorRegisterBitWidth();		MinVecRegSize = TTI->getMinVectorRegisterBitWidth();
		BuiltTrees.push_back(std::make_unique<TreeState>());
		Tree = BuiltTrees.back().get();
}		}

/// Vectorize the tree that starts with the elements in \p VL.		/// Vectorize the tree that starts with the elements in \p VL.
/// Returns the vectorized root.		/// Returns the vectorized root.
Value *vectorizeTree();		Value *vectorizeTree();

/// Vectorize the tree but with the list of externally used values \p		/// Vectorize the tree but with the list of externally used values \p
/// ExternallyUsedValues. Values in this MapVector can be replaced but the		/// ExternallyUsedValues. Values in this MapVector can be replaced but the
/// generated extractvalue instructions.		/// generated extractvalue instructions.
Value *vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues);		Value *vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues);

/// \returns the cost incurred by unwanted spills and fills, caused by		/// \returns the cost incurred by unwanted spills and fills, caused by
/// holding live values over call sites.		/// holding live values over call sites.
int getSpillCost() const;		int getSpillCost();

		/// \returns the cost extracting vectorized elements.
		int getExtractCost() const;

		/// \returns the cost of gathering canceled elements to be used
		/// by vectorized operations during throttling.
		int getInsertCost();

		/// Find a subtree of the whole tree suitable to be vectorized. When
		/// vectorizing the whole tree is not profitable, we can consider vectorizing
		/// part of that tree. The SLP algorithm looks for operations to vectorize
		/// starting from the seed instructions at the bottom and following the
		/// chains of dependencies toward the top of the SLP graph, grouping
		/// potentially vectorizable scalar operations into bundles.
		/// For example:
		ABataev (Not Done): `PriorityQueue`?
		dtemirbulatov (Author, Done): hmm, PriorityQueue allows duplicate elements, and that might be an issue.
		///
		///       <bundle 1> scalar form
		///               |
		///       <bundle 2> scalar form   <bundle 3> scalar form
		///                        \         /
		///                 <seed root bundle> scalar form
		///
		/// Total cost is not profitable to vectorize, hence all operations are in
		/// scalar form.
		///
		/// Here is the same tree after SLP throttling transformation:
		///
		///       <bundle 1> vector form
		///               |
		ABataev (Not Done): Does "scalar form" mean "gathered nodes"? I don't think we can currently end up with the situation in the picture; we can't have a gathered node that depends on another node (either gathered or vectorized).
		///       <bundle 2> vector form   <bundle 3> scalar form
		///                        \         /
		///                 <seed root bundle> vector form
		///
		/// So, we can throttle some operations in such a way that it is still
		/// profitable to vectorize part of the tree, even though vectorizing the
		/// whole tree does not make sense.
		/// More details: http://www.llvm.org/devmtg/2015-10/slides/Porpodas-ThrottlingAutomaticVectorization.pdf
		xbolva00 (Done): Please mention the paper name: “Throttling Automatic Vectorization: When Less Is More” https://www.cl.cam.ac.uk/~tmj32/papers/docs/porpodas15-pact.pdf Slides are good, but paper is paper :)
		bool findSubTree(int UserCost = 0);
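The idea in the comment above can be sketched outside of LLVM. The following is only an illustration under stated assumptions, not the patch's `findSubTree()`: `Node`, `underCut`, `subtreeCost`, and `throttle` are hypothetical names, each bundle carries a made-up cost delta, and a cut bundle is charged a made-up gather cost. The sketch greedily cuts the subtree whose removal saves the most until the remaining tree costs less than the threshold.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tree entry; not the patch's TreeEntry.
struct Node {
  int Parent;     // index of the user node, -1 for the seed root
  int VectorCost; // cost delta of vectorizing this bundle (negative = gain)
  int GatherCost; // cost charged if this bundle stays scalar and is gathered
  bool Cut;       // true once throttling has excluded this bundle
};

// True if some ancestor of Idx has already been cut.
bool underCut(const std::vector<Node> &Tree, std::size_t Idx) {
  for (int P = Tree[Idx].Parent; P != -1; P = Tree[P].Parent)
    if (Tree[P].Cut)
      return true;
  return false;
}

// Cost of the subtree rooted at Idx; a cut node counts as its gather cost.
int subtreeCost(const std::vector<Node> &Tree, std::size_t Idx) {
  if (Tree[Idx].Cut)
    return Tree[Idx].GatherCost;
  int Cost = Tree[Idx].VectorCost;
  for (std::size_t I = 0; I < Tree.size(); ++I)
    if (Tree[I].Parent == static_cast<int>(Idx))
      Cost += subtreeCost(Tree, I);
  return Cost;
}

// Greedily cut the subtree with the largest saving until the total cost is
// below Threshold. Returns true if the remaining part is profitable.
bool throttle(std::vector<Node> &Tree, int Threshold) {
  assert(!Tree.empty() && Tree[0].Parent == -1 && "node 0 must be the root");
  while (subtreeCost(Tree, 0) >= Threshold) {
    int BestSaving = 0;
    std::size_t Best = 0;
    for (std::size_t I = 1; I < Tree.size(); ++I) {
      if (Tree[I].Cut || underCut(Tree, I))
        continue;
      int Saving = subtreeCost(Tree, I) - Tree[I].GatherCost;
      if (Saving > BestSaving) {
        BestSaving = Saving;
        Best = I;
      }
    }
    if (BestSaving <= 0)
      return false; // no remaining cut improves the total
    Tree[Best].Cut = true;
  }
  return true;
}
```

With the four bundles from the picture (seed root, bundle 2, bundle 3, and bundle 1 feeding bundle 2) and an expensive bundle 3, the sketch cuts only bundle 3 and keeps the rest in vector form. The greedy choice is an assumption of this sketch; the review thread does not state which selection strategy the patch uses.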

		/// Get the raw cost of all elements of the tree.
		int getRawTreeCost();

/// \returns the vectorization cost of the subtree that starts at \p VL.		/// \returns the vectorization cost of the subtree that starts at \p VL.
/// A negative number means that this is profitable.		/// A negative number means that this is profitable.
int getTreeCost();		int getTreeCost();

/// Construct a vectorizable tree that starts at \p Roots, ignoring users for		/// Construct a vectorizable tree that starts at \p Roots, ignoring users for
/// the purpose of scheduling and extraction in the \p UserIgnoreLst.		/// the purpose of scheduling and extraction in the \p UserIgnoreLst.
void buildTree(ArrayRef<Value *> Roots,		void buildTree(ArrayRef<Value *> Roots,
ArrayRef<Value *> UserIgnoreLst = None);		ArrayRef<Value *> UserIgnoreLst = None);

/// Construct a vectorizable tree that starts at \p Roots, ignoring users for		/// Construct a vectorizable tree that starts at \p Roots, ignoring users for
/// the purpose of scheduling and extraction in the \p UserIgnoreLst taking		/// the purpose of scheduling and extraction in the \p UserIgnoreLst taking
/// into account (and updating it, if required) list of externally used		/// into account (and updating it, if required) list of externally used
/// values stored in \p ExternallyUsedValues.		/// values stored in \p ExternallyUsedValues.
void buildTree(ArrayRef<Value *> Roots,		void buildTree(ArrayRef<Value *> Roots,
ExtraValueToDebugLocsMap &ExternallyUsedValues,		ExtraValueToDebugLocsMap &ExternallyUsedValues,
ArrayRef<Value *> UserIgnoreLst = None);		ArrayRef<Value *> UserIgnoreLst = None);

/// Clear the internal data structures that are created by 'buildTree'.		/// Save current tree for possible later vectorization.
void deleteTree() {		void saveTree() {
VectorizableTree.clear();		BuiltTrees.push_back(std::make_unique<TreeState>());
ScalarToTreeEntry.clear();		Tree = BuiltTrees.back().get();
MustGather.clear();
ExternalUses.clear();
NumOpsWantToKeepOrder.clear();
NumOpsWantToKeepOriginalOrder = 0;
for (auto &Iter : BlocksSchedules) {
BlockScheduling *BS = Iter.second.get();
BS->clear();
}
MinBWs.clear();
}		}
		ABataev (Not Done): Why do we need to save intermediate results? Can't it be solved in a single iteration loop without saving the intermediate results in the class instance?
		dtemirbulatov (Author, Done): I noticed many regressions if we decide right away, and rebuilding the tree afterward is expensive.
		ABataev (Not Done): What is the cause of those regressions? If I understand correctly, you're just trying to find the subtree, exclude its cost, compare, and repeat if it is not profitable. What prevents doing that in a loop without saving intermediate results in the class, keeping the results in stack vectors if needed?
		dtemirbulatov (Author, Done): For example, we could partially vectorize at vectorizeStoreChain(), but later it may be possible to fully vectorize the same tree with tryToVectorizeList() or tryToReduce().
		ABataev (Not Done): Could you give an example, please?

unsigned getTreeSize() const { return VectorizableTree.size(); }		unsigned getTreeSize() const { return Tree->VectorizableTree.size(); }

/// Perform LICM and CSE on the newly generated gather sequences.		/// Perform LICM and CSE on the newly generated gather sequences.
void optimizeGatherSequence();		void optimizeGatherSequence();

/// \returns The best order of instructions for vectorization.		/// \returns The best order of instructions for vectorization.
Optional<ArrayRef<unsigned>> bestOrder() const {		Optional<ArrayRef<unsigned>> bestOrder() const {
auto I = std::max_element(		auto I = std::max_element(
NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(),		Tree->NumOpsWantToKeepOrder.begin(), Tree->NumOpsWantToKeepOrder.end(),
[](const decltype(NumOpsWantToKeepOrder)::value_type &D1,		[](const decltype(Tree->NumOpsWantToKeepOrder)::value_type &D1,
const decltype(NumOpsWantToKeepOrder)::value_type &D2) {		const decltype(Tree->NumOpsWantToKeepOrder)::value_type &D2) {
return D1.second < D2.second;		return D1.second < D2.second;
});		});
if (I == NumOpsWantToKeepOrder.end() ||		if (I == Tree->NumOpsWantToKeepOrder.end() ||
I->getSecond() <= NumOpsWantToKeepOriginalOrder)		I->getSecond() <= Tree->NumOpsWantToKeepOriginalOrder)
return None;		return None;

return makeArrayRef(I->getFirst());		return makeArrayRef(I->getFirst());
}		}

/// \return The vector element size in bits to use when vectorizing the		/// \return The vector element size in bits to use when vectorizing the
		ABataev (Done): Do you really need to return `Optional` here? Maybe just return `-SLPCostThreshold` if not successful?
/// expression tree ending at \p V. If V is a store, the size is the width of		/// expression tree ending at \p V. If V is a store, the size is the width of
/// the stored value. Otherwise, the size is the width of the largest loaded		/// the stored value. Otherwise, the size is the width of the largest loaded
/// value reaching V. This method is used by the vectorizer to calculate		/// value reaching V. This method is used by the vectorizer to calculate
/// vectorization factors.		/// vectorization factors.
unsigned getVectorElementSize(Value *V);		unsigned getVectorElementSize(Value *V);

/// Compute the minimum type sizes required to represent the entries in a		/// Compute the minimum type sizes required to represent the entries in a
/// vectorizable tree.		/// vectorizable tree.
Show All 34 Lines	public:
/// can be load combined in the backend. Load combining may not be allowed in		/// can be load combined in the backend. Load combining may not be allowed in
/// the IR optimizer, so we do not want to alter the pattern. For example,		/// the IR optimizer, so we do not want to alter the pattern. For example,
/// partially transforming a scalar bswap() pattern into vector code is		/// partially transforming a scalar bswap() pattern into vector code is
/// effectively impossible for the backend to undo.		/// effectively impossible for the backend to undo.
/// TODO: If load combining is allowed in the IR optimizer, this analysis		/// TODO: If load combining is allowed in the IR optimizer, this analysis
/// may not be necessary.		/// may not be necessary.
bool isLoadCombineCandidate() const;		bool isLoadCombineCandidate() const;

		/// Cut the tree to make it partially vectorizable.
		void cutTree();

		/// Try to partially vectorize the tree via throttling.
		bool tryPartialVectorization();

OptimizationRemarkEmitter *getORE() { return ORE; }		OptimizationRemarkEmitter *getORE() { return ORE; }

/// This structure holds any data we need about the edges being traversed		/// This structure holds any data we need about the edges being traversed
/// during buildTree_rec(). We keep track of:		/// during buildTree_rec(). We keep track of:
/// (i) the user TreeEntry index, and		/// (i) the user TreeEntry index, and
/// (ii) the index of the edge.		/// (ii) the index of the edge.
struct EdgeInfo {		struct EdgeInfo {
EdgeInfo() = default;		EdgeInfo() = default;
▲ Show 20 Lines • Show All 766 Lines • ▼ Show 20 Lines	struct TreeEntry {

/// A vector of scalars.		/// A vector of scalars.
ValueList Scalars;		ValueList Scalars;

/// The Scalars are vectorized into this value. It is initialized to Null.		/// The Scalars are vectorized into this value. It is initialized to Null.
Value *VectorizedValue = nullptr;		Value *VectorizedValue = nullptr;

/// Do we need to gather this sequence ?		/// Do we need to gather this sequence ?
enum EntryState { Vectorize, NeedToGather };		enum EntryState { Vectorize, NeedToGather, ProposedToGather };
EntryState State;		EntryState State;

		ABataev (Not Done): Could you split the patch and commit this part of the change (I mean, using the enum instead of bool) as a separate NFC patch?
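As a small, standalone illustration of why a tri-state enum is used here instead of a bool, the sketch below mirrors the three state names from the diff. The per-state meanings in the comments and the helpers `toString` and `isVectorized` are assumptions made for this example, not code from the patch.

```cpp
#include <string>

// Mirrors the three state names in the diff; the comments are assumptions.
enum class EntryState {
  Vectorize,       // emit the bundle as a vector operation
  NeedToGather,    // keep the scalars and gather them
  ProposedToGather // assumed: throttling candidate, decision not yet final
};

// Hypothetical helper: printable name of a state, as a debug dump might need.
std::string toString(EntryState S) {
  switch (S) {
  case EntryState::Vectorize:
    return "Vectorize";
  case EntryState::NeedToGather:
    return "NeedToGather";
  case EntryState::ProposedToGather:
    return "ProposedToGather";
  }
  return "Unknown";
}

// A single bool could only answer this question; it has no way to spell the
// third state.
bool isVectorized(EntryState S) { return S == EntryState::Vectorize; }
```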
/// Does this sequence require some shuffling?		/// Does this sequence require some shuffling?
SmallVector<int, 4> ReuseShuffleIndices;		SmallVector<int, 4> ReuseShuffleIndices;

/// Does this entry require reordering?		/// Does this entry require reordering?
ArrayRef<unsigned> ReorderIndices;		ArrayRef<unsigned> ReorderIndices;

		/// Cost of this tree entry.
		int Cost = 0;

/// Points back to the VectorizableTree.		/// Points back to the VectorizableTree.
///		///
/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has		/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has
/// to be a pointer and needs to be able to initialize the child iterator.		/// to be a pointer and needs to be able to initialize the child iterator.
/// Thus we need a reference back to the container to translate the indices		/// Thus we need a reference back to the container to translate the indices
/// to entries.		/// to entries.
VecTreeTy &Container;		VecTreeTy &Container;

/// The TreeEntry index containing the user of this entry. We can actually		/// The TreeEntry index containing the user of this entry. We can actually
/// have multiple users so the data structure is not truly a tree.		/// have multiple users so the data structure is not truly a tree.
SmallVector<EdgeInfo, 1> UserTreeIndices;		SmallVector<EdgeInfo, 1> UserTreeIndices;

		/// Entries used by this entry (its operand entries).
		TinyPtrVector<TreeEntry *> UseEntries;

/// The index of this treeEntry in VectorizableTree.		/// The index of this treeEntry in VectorizableTree.
int Idx = -1;		int Idx = -1;

private:		private:
/// The operands of each instruction in each lane Operands[op_index][lane].		/// The operands of each instruction in each lane Operands[op_index][lane].
/// Note: This helps avoid the replication of the code that performs the		/// Note: This helps avoid the replication of the code that performs the
/// reordering of operands during buildTree_rec() and vectorizeTree().		/// reordering of operands during buildTree_rec() and vectorizeTree().
SmallVector<ValueList, 2> Operands;		SmallVector<ValueList, 2> Operands;
▲ Show 20 Lines • Show All 114 Lines • ▼ Show 20 Lines	LLVM_DUMP_METHOD void dump() const {
dbgs() << "State: ";		dbgs() << "State: ";
switch (State) {		switch (State) {
case Vectorize:		case Vectorize:
dbgs() << "Vectorize\n";		dbgs() << "Vectorize\n";
break;		break;
case NeedToGather:		case NeedToGather:
dbgs() << "NeedToGather\n";		dbgs() << "NeedToGather\n";
break;		break;
		case ProposedToGather:
		dbgs() << "ProposedToGather\n";
		break;
}		}
dbgs() << "MainOp: ";		dbgs() << "MainOp: ";
if (MainOp)		if (MainOp)
dbgs() << *MainOp << "\n";		dbgs() << *MainOp << "\n";
else		else
dbgs() << "NULL\n";		dbgs() << "NULL\n";
dbgs() << "AltOp: ";		dbgs() << "AltOp: ";
if (AltOp)		if (AltOp)
Show All 26 Lines	#endif

/// Create a new VectorizableTree entry.		/// Create a new VectorizableTree entry.
TreeEntry *newTreeEntry(ArrayRef<Value *> VL, Optional<ScheduleData *> Bundle,		TreeEntry *newTreeEntry(ArrayRef<Value *> VL, Optional<ScheduleData *> Bundle,
const InstructionsState &S,		const InstructionsState &S,
const EdgeInfo &UserTreeIdx,		const EdgeInfo &UserTreeIdx,
ArrayRef<unsigned> ReuseShuffleIndices = None,		ArrayRef<unsigned> ReuseShuffleIndices = None,
ArrayRef<unsigned> ReorderIndices = None) {		ArrayRef<unsigned> ReorderIndices = None) {
bool Vectorized = (bool)Bundle;		bool Vectorized = (bool)Bundle;
VectorizableTree.push_back(std::make_unique<TreeEntry>(VectorizableTree));		Tree->VectorizableTree.push_back(
TreeEntry *Last = VectorizableTree.back().get();		std::make_unique<TreeEntry>(Tree->VectorizableTree));
Last->Idx = VectorizableTree.size() - 1;		TreeEntry *Last = Tree->VectorizableTree.back().get();
		Last->Idx = Tree->VectorizableTree.size() - 1;
Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());		Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());
Last->State = Vectorized ? TreeEntry::Vectorize : TreeEntry::NeedToGather;		Last->State = Vectorized ? TreeEntry::Vectorize : TreeEntry::NeedToGather;
Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),		Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),
ReuseShuffleIndices.end());		ReuseShuffleIndices.end());
Last->ReorderIndices = ReorderIndices;		Last->ReorderIndices = ReorderIndices;
Last->setOperations(S);		Last->setOperations(S);
if (Vectorized) {		if (Vectorized) {
for (int i = 0, e = VL.size(); i != e; ++i) {		for (int i = 0, e = VL.size(); i != e; ++i) {
assert(!getTreeEntry(VL[i]) && "Scalar already in tree!");		assert(!getTreeEntry(VL[i]) && "Scalar already in tree!");
ScalarToTreeEntry[VL[i]] = Last;		Tree->ScalarToTreeEntry[VL[i]] = Last;
}		}
// Update the scheduler bundle to point to this TreeEntry.		// Update the scheduler bundle to point to this TreeEntry.
unsigned Lane = 0;		unsigned Lane = 0;
for (ScheduleData *BundleMember = Bundle.getValue(); BundleMember;		for (ScheduleData *BundleMember = Bundle.getValue(); BundleMember;
BundleMember = BundleMember->NextInBundle) {		BundleMember = BundleMember->NextInBundle) {
BundleMember->TE = Last;		BundleMember->TE = Last;
BundleMember->Lane = Lane;		BundleMember->Lane = Lane;
++Lane;		++Lane;
}		}
assert((!Bundle.getValue() || Lane == VL.size()) &&		assert((!Bundle.getValue() || Lane == VL.size()) &&
"Bundle and VL out of sync");		"Bundle and VL out of sync");
} else {		} else {
MustGather.insert(VL.begin(), VL.end());		Tree->MustGather.insert(VL.begin(), VL.end());
}		}

if (UserTreeIdx.UserTE)		if (UserTreeIdx.UserTE) {
Last->UserTreeIndices.push_back(UserTreeIdx);		Last->UserTreeIndices.push_back(UserTreeIdx);
		Tree->VectorizableTree[UserTreeIdx.UserTE->Idx]->UseEntries.push_back(
		Last);
		}

return Last;		return Last;
}		}

/// -- Vectorization State --
/// Holds all of the tree entries.
TreeEntry::VecTreeTy VectorizableTree;

#ifndef NDEBUG		#ifndef NDEBUG
/// Debug printer.		/// Debug printer.
LLVM_DUMP_METHOD void dumpVectorizableTree() const {		LLVM_DUMP_METHOD void dumpVectorizableTree() const {
for (unsigned Id = 0, IdE = VectorizableTree.size(); Id != IdE; ++Id) {		for (unsigned Id = 0, IdE = Tree->VectorizableTree.size(); Id != IdE;
VectorizableTree[Id]->dump();		++Id) {
		Tree->VectorizableTree[Id]->dump();
dbgs() << "\n";		dbgs() << "\n";
}		}
}		}
#endif		#endif

TreeEntry *getTreeEntry(Value *V) {		TreeEntry *getTreeEntry(Value *V) {
auto I = ScalarToTreeEntry.find(V);		auto I = Tree->ScalarToTreeEntry.find(V);
if (I != ScalarToTreeEntry.end())		if (I != Tree->ScalarToTreeEntry.end())
return I->second;		return I->second;
return nullptr;		return nullptr;
}		}

const TreeEntry *getTreeEntry(Value *V) const {		const TreeEntry *getTreeEntry(Value *V) const {
auto I = ScalarToTreeEntry.find(V);		auto I = Tree->ScalarToTreeEntry.find(V);
if (I != ScalarToTreeEntry.end())		if (I != Tree->ScalarToTreeEntry.end())
return I->second;		return I->second;
return nullptr;		return nullptr;
}		}

/// Maps a specific scalar to its tree entry.
SmallDenseMap<Value*, TreeEntry *> ScalarToTreeEntry;

/// Maps a value to the proposed vectorizable size.		/// Maps a value to the proposed vectorizable size.
SmallDenseMap<Value *, unsigned> InstrElementSize;		SmallDenseMap<Value *, unsigned> InstrElementSize;

/// A list of scalars that we found that we need to keep as scalars.
ValueSet MustGather;

/// This POD struct describes one external user in the vectorized tree.		/// This POD struct describes one external user in the vectorized tree.
struct ExternalUser {		struct ExternalUser {
ExternalUser(Value *S, llvm::User *U, int L)		ExternalUser(Value *S, llvm::User *U, int L)
: Scalar(S), User(U), Lane(L) {}		: Scalar(S), User(U), Lane(L) {}

// Which scalar in our function.		// Which scalar in our function.
Value *Scalar;		Value *Scalar;

// Which user that uses the scalar.		// Which user that uses the scalar.
llvm::User *User;		llvm::User *User;

// Which lane does the scalar belong to.		// Which lane does the scalar belong to.
int Lane;		int Lane;
};		};
using UserList = SmallVector<ExternalUser, 16>;		using UserList = SmallVector<ExternalUser, 16>;

		/// \returns the cost of extracting the vectorized elements.
		int getExtractOperationCost(const ExternalUser &EU) const;

/// Checks if two instructions may access the same memory.		/// Checks if two instructions may access the same memory.
///		///
/// \p Loc1 is the location of \p Inst1. It is passed explicitly because it		/// \p Loc1 is the location of \p Inst1. It is passed explicitly because it
/// is invariant in the calling loop.		/// is invariant in the calling loop.
bool isAliased(const MemoryLocation &Loc1, Instruction *Inst1,		bool isAliased(const MemoryLocation &Loc1, Instruction *Inst1,
Instruction *Inst2) {		Instruction *Inst2) {
// First check if the result is already in the cache.		// First check if the result is already in the cache.
AliasCacheKey key = std::make_pair(Inst1, Inst2);		AliasCacheKey key = std::make_pair(Inst1, Inst2);
Show All 28 Lines	void eraseInstruction(Instruction *I, bool ReplaceOpsWithUndef = false) {
auto It = DeletedInstructions.try_emplace(I, ReplaceOpsWithUndef).first;		auto It = DeletedInstructions.try_emplace(I, ReplaceOpsWithUndef).first;
It->getSecond() = It->getSecond() && ReplaceOpsWithUndef;		It->getSecond() = It->getSecond() && ReplaceOpsWithUndef;
}		}

/// Temporary store for deleted instructions. Instructions will be deleted		/// Temporary store for deleted instructions. Instructions will be deleted
/// eventually when the BoUpSLP is destructed.		/// eventually when the BoUpSLP is destructed.
DenseMap<Instruction *, bool> DeletedInstructions;		DenseMap<Instruction *, bool> DeletedInstructions;

/// A list of values that need to be extracted out of the tree.
/// This list holds pairs of (Internal Scalar : External User). External User
/// can be nullptr, it means that this Internal Scalar will be used later,
/// after vectorization.
UserList ExternalUses;

/// Values used only by @llvm.assume calls.		/// Values used only by @llvm.assume calls.
SmallPtrSet<const Value *, 32> EphValues;		SmallPtrSet<const Value *, 32> EphValues;

/// Holds all of the instructions that we gathered.		/// Holds all of the instructions that we gathered.
SetVector<Instruction *> GatherSeq;		SetVector<Instruction *> GatherSeq;

/// A list of blocks that we are going to CSE.		/// A list of blocks that we are going to CSE.
SetVector<BasicBlock *> CSEBlocks;		SetVector<BasicBlock *> CSEBlocks;
▲ Show 20 Lines • Show All 381 Lines • ▼ Show 20 Lines	struct BlockScheduling {

/// The ID of the scheduling region. For a new vectorization iteration this		/// The ID of the scheduling region. For a new vectorization iteration this
/// is incremented which "removes" all ScheduleData from the region.		/// is incremented which "removes" all ScheduleData from the region.
// Make sure that the initial SchedulingRegionID is greater than the		// Make sure that the initial SchedulingRegionID is greater than the
// initial SchedulingRegionID in ScheduleData (which is 0).		// initial SchedulingRegionID in ScheduleData (which is 0).
int SchedulingRegionID = 1;		int SchedulingRegionID = 1;
};		};

/// Attaches the BlockScheduling structures to basic blocks.		/// Remove operations from the list of those proposed for scheduling.
MapVector<BasicBlock *, std::unique_ptr<BlockScheduling>> BlocksSchedules;		void removeFromScheduling(BlockScheduling *BS);

/// Performs the "real" scheduling. Done before vectorization is actually		/// Performs the "real" scheduling. Done before vectorization is actually
/// performed in a basic block.		/// performed in a basic block.
void scheduleBlock(BlockScheduling *BS);		void scheduleBlock(BlockScheduling *BS);

/// List of users to ignore during scheduling and that don't need extracting.		/// List of users to ignore during scheduling and that don't need extracting.
ArrayRef<Value *> UserIgnoreList;		ArrayRef<Value *> UserIgnoreList;

Show All 17 Lines	static unsigned getHashValue(const OrdersType &V) {
return static_cast<unsigned>(hash_combine_range(V.begin(), V.end()));		return static_cast<unsigned>(hash_combine_range(V.begin(), V.end()));
}		}

static bool isEqual(const OrdersType &LHS, const OrdersType &RHS) {		static bool isEqual(const OrdersType &LHS, const OrdersType &RHS) {
return LHS == RHS;		return LHS == RHS;
}		}
};		};

		/// Tree state that is created by 'buildTree'.
		struct TreeState {
		using TreeStateTy = SmallVector<std::unique_ptr<TreeState>, 2>;
		ABataev (Not Done): Why store a pointer to `TreeState` rather than the `TreeState` itself?
		dtemirbulatov (Author, Done): TreeState is a large structure, so dynamic allocation is convenient, but a static version might be faster. Do you think it is critical?
		ABataev (Done): Reduce the number of preallocated elements to, say, 2 or 4 and store the elements directly.
		ABataev (Not Done): Why `unique_ptr` again? Why not a `TreeState` directly? Just `SmallVector<TreeState, 2>;`
		dtemirbulatov (Author, Done): ok
		dtemirbulatov (Author, Done): hmm. The tree state is too complex; we don't have to make it movable or copyable. Why is unique_ptr not good here?
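The trade-off discussed in this thread can be shown with a minimal sketch. It assumes a state type that is neither copyable nor movable and uses `std::vector` in place of LLVM's `SmallVector`; `State`, `Builder`, `save`, and `current` are hypothetical names, not the patch's `TreeState`, `BuiltTrees`, or `saveTree()`. Holding the states through `std::unique_ptr` lets the container grow without moving the states, so raw pointers to earlier states stay valid.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for a large state object that is not copyable and,
// because the copy operations are deleted, not movable either.
struct State {
  State() = default;
  State(const State &) = delete;
  State &operator=(const State &) = delete;
  int Cost = 0;
};

struct Builder {
  std::vector<std::unique_ptr<State>> Saved; // owns every saved state
  State *Current = nullptr;                  // state currently being built

  Builder() { save(); }

  // Keep the current state for later and start a fresh one. Only the owning
  // pointers are relocated when the vector grows, never the states.
  void save() {
    Saved.push_back(std::make_unique<State>());
    Current = Saved.back().get();
  }

  State &current() {
    assert(Current && "no state has been created");
    return *Current;
  }
};
```

Storing `State` by value in the vector would not compile for this type, since growing the vector needs a move or copy constructor; that is the constraint the author's last comment points at. Whether the extra allocation matters for performance is left open in the thread.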

		/// -- Vectorization State --
		/// Holds all of the tree entries.
		TreeEntry::VecTreeTy VectorizableTree;

		/// Maps a specific scalar to its tree entry.
		SmallDenseMap<Value *, TreeEntry *> ScalarToTreeEntry;

		/// Tree entries that should not be vectorized due to throttling.
		SmallVector<TreeEntry *, 2> RemovedOperations;

/// Contains orders of operations along with the number of bundles that have		/// Contains orders of operations along with the number of bundles that have
/// operations in this order. It stores only those orders that require		/// operations in this order. It stores only those orders that require
/// reordering, if reordering is not required it is counted using \a		/// reordering, if reordering is not required it is counted using \a
/// NumOpsWantToKeepOriginalOrder.		/// NumOpsWantToKeepOriginalOrder.
DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> NumOpsWantToKeepOrder;		DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo>
		NumOpsWantToKeepOrder;
/// Number of bundles that do not require reordering.		/// Number of bundles that do not require reordering.
unsigned NumOpsWantToKeepOriginalOrder = 0;		unsigned NumOpsWantToKeepOriginalOrder = 0;
		ABataev (Not Done): Tabs
		dtemirbulatov (Author, Done): This is now part of the TreeState structure; this is LLVM's standard format (clang-format).

		/// A list of scalars that we found that we need to keep as scalars.
		ValueSet MustGather;
		ABataev (Done): Why do you need this new set? You can get the result just by using the `ScalarToTreeEntry` data member and checking the vectorization status of the corresponding `TreeEntry`.

		/// Raw cost of all elements in the tree.
		int RawTreeCost = 0;

		ABataev (Done): Seems to me this is the same story as with `ScalarsToVec`.
		/// Final cost of the tree.
		int TreeCost = 0;

		/// Partial cost of the tree.
		int PartialCost = 0;

		ABataev (Done): Currently not used.
		dtemirbulatov (Author, Done): Thanks, sorry for this, I somehow missed it.
		/// Indicate that no CallInst found in the tree and we don't need to
		/// calculate spill cost.
		bool NoCallInst = true;

		/// True if the tree cost has already been calculated for the tree.
		bool IsCostSumReady = false;

		/// A list of values that need to be extracted out of the tree.
		/// This list holds pairs of (Internal Scalar : External User). External
		/// User can be nullptr, it means that this Internal Scalar will be used
		/// later, after vectorization.
		UserList ExternalUses;

		/// Current operations width to vectorize.
		unsigned BundleWidth = 0;

		/// Uses of the values of internal tree operations proposed to be vectorized.
		SmallDenseMap<Value *, UserList> InternalTreeUses;

		/// Attaches the BlockScheduling structures to basic blocks.
		MapVector<BasicBlock *, std::unique_ptr<BlockScheduling>> BlocksSchedules;

		/// A map of scalar integer values to the smallest bit width with which they
		/// can legally be represented. The values map to (width, signed) pairs,
		/// where "width" indicates the minimum bit width and "signed" is True if
		/// the value must be signed-extended, rather than zero-extended, back to
		/// its original width.
		MapVector<Value *, std::pair<uint64_t, bool>> MinBWs;

		/// Clear the internal data structures that are created by 'buildTree'.
		void deleteTree() {
		VectorizableTree.clear();
		ScalarToTreeEntry.clear();
		MustGather.clear();
		ExternalUses.clear();
		InternalTreeUses.clear();
		RemovedOperations.clear();
		NumOpsWantToKeepOrder.clear();
		NumOpsWantToKeepOriginalOrder = 0;
		for (auto &Iter : BlocksSchedules) {
		BlockScheduling *BS = Iter.second.get();
		BS->clear();
		}
		MinBWs.clear();
		NoCallInst = true;
		RawTreeCost = 0;
		TreeCost = 0;
		PartialCost = 0;
		IsCostSumReady = false;
		}
		};

		// Previous trees that might be worth vectorizing.
		TreeState::TreeStateTy BuiltTrees;

		// Current tree that we consider.
		TreeState *Tree = nullptr;

// Analysis and block reference.		// Analysis and block reference.
Function *F;		Function *F;
ScalarEvolution *SE;		ScalarEvolution *SE;
TargetTransformInfo *TTI;		TargetTransformInfo *TTI;
TargetLibraryInfo *TLI;		TargetLibraryInfo *TLI;
AliasAnalysis *AA;		AliasAnalysis *AA;
LoopInfo *LI;		LoopInfo *LI;
DominatorTree *DT;		DominatorTree *DT;
Show All 36 Lines	struct ChildIteratorType
ChildIteratorType(SmallVector<BoUpSLP::EdgeInfo, 1>::iterator W,		ChildIteratorType(SmallVector<BoUpSLP::EdgeInfo, 1>::iterator W,
ContainerTy &VT)		ContainerTy &VT)
: ChildIteratorType::iterator_adaptor_base(W), VectorizableTree(VT) {}		: ChildIteratorType::iterator_adaptor_base(W), VectorizableTree(VT) {}

NodeRef operator*() { return I->UserTE; }		NodeRef operator*() { return I->UserTE; }
};		};

static NodeRef getEntryNode(BoUpSLP &R) {		static NodeRef getEntryNode(BoUpSLP &R) {
return R.VectorizableTree[0].get();		return R.Tree->VectorizableTree[0].get();
}		}

static ChildIteratorType child_begin(NodeRef N) {		static ChildIteratorType child_begin(NodeRef N) {
return {N->UserTreeIndices.begin(), N->Container};		return {N->UserTreeIndices.begin(), N->Container};
}		}

static ChildIteratorType child_end(NodeRef N) {		static ChildIteratorType child_end(NodeRef N) {
return {N->UserTreeIndices.end(), N->Container};		return {N->UserTreeIndices.end(), N->Container};
Show All 11 Lines	public:
nodes_iterator operator++() {		nodes_iterator operator++() {
++It;		++It;
return *this;		return *this;
}		}
bool operator!=(const nodes_iterator &N2) const { return N2.It != It; }		bool operator!=(const nodes_iterator &N2) const { return N2.It != It; }
};		};

static nodes_iterator nodes_begin(BoUpSLP *R) {		static nodes_iterator nodes_begin(BoUpSLP *R) {
return nodes_iterator(R->VectorizableTree.begin());		return nodes_iterator(R->Tree->VectorizableTree.begin());
}		}

static nodes_iterator nodes_end(BoUpSLP *R) {		static nodes_iterator nodes_end(BoUpSLP *R) {
return nodes_iterator(R->VectorizableTree.end());		return nodes_iterator(R->Tree->VectorizableTree.end());
}		}

static unsigned size(BoUpSLP *R) { return R->VectorizableTree.size(); }		static unsigned size(BoUpSLP *R) { return R->Tree->VectorizableTree.size(); }
};		};

template <> struct DOTGraphTraits<BoUpSLP *> : public DefaultDOTGraphTraits {		template <> struct DOTGraphTraits<BoUpSLP *> : public DefaultDOTGraphTraits {
using TreeEntry = BoUpSLP::TreeEntry;		using TreeEntry = BoUpSLP::TreeEntry;

DOTGraphTraits(bool isSimple = false) : DefaultDOTGraphTraits(isSimple) {}		DOTGraphTraits(bool isSimple = false) : DefaultDOTGraphTraits(isSimple) {}

std::string getNodeLabel(const TreeEntry *Entry, const BoUpSLP *R) {		std::string getNodeLabel(const TreeEntry *Entry, const BoUpSLP *R) {
std::string Str;		std::string Str;
raw_string_ostream OS(Str);		raw_string_ostream OS(Str);
if (isSplat(Entry->Scalars)) {		if (isSplat(Entry->Scalars)) {
OS << "<splat> " << *Entry->Scalars[0];		OS << "<splat> " << *Entry->Scalars[0];
return Str;		return Str;
}		}
for (auto V : Entry->Scalars) {		for (auto V : Entry->Scalars) {
OS << *V;		OS << *V;
if (std::any_of(		if (std::any_of(
R->ExternalUses.begin(), R->ExternalUses.end(),		R->Tree->ExternalUses.begin(), R->Tree->ExternalUses.end(),
[&](const BoUpSLP::ExternalUser &EU) { return EU.Scalar == V; }))		[&](const BoUpSLP::ExternalUser &EU) { return EU.Scalar == V; }))
OS << " <extract>";		OS << " <extract>";
OS << "\n";		OS << "\n";
}		}
return Str;		return Str;
}		}

static std::string getNodeAttributes(const TreeEntry *Entry,		static std::string getNodeAttributes(const TreeEntry *Entry,
[35 unchanged lines hidden]	void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
ArrayRef<Value *> UserIgnoreLst) {		ArrayRef<Value *> UserIgnoreLst) {
ExtraValueToDebugLocsMap ExternallyUsedValues;		ExtraValueToDebugLocsMap ExternallyUsedValues;
buildTree(Roots, ExternallyUsedValues, UserIgnoreLst);		buildTree(Roots, ExternallyUsedValues, UserIgnoreLst);
}		}

void BoUpSLP::buildTree(ArrayRef<Value *> Roots,		void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
ExtraValueToDebugLocsMap &ExternallyUsedValues,		ExtraValueToDebugLocsMap &ExternallyUsedValues,
ArrayRef<Value *> UserIgnoreLst) {		ArrayRef<Value *> UserIgnoreLst) {
deleteTree();		Tree->deleteTree();
UserIgnoreList = UserIgnoreLst;		UserIgnoreList = UserIgnoreLst;
if (!allSameType(Roots))		if (!allSameType(Roots))
return;		return;
buildTree_rec(Roots, 0, EdgeInfo());		buildTree_rec(Roots, 0, EdgeInfo());

// Collect the values that we need to extract from the tree.		// Collect the values that we need to extract from the tree.
for (auto &TEPtr : VectorizableTree) {		for (std::unique_ptr<TreeEntry> &TEPtr : Tree->VectorizableTree) {
TreeEntry *Entry = TEPtr.get();		TreeEntry *Entry = TEPtr.get();

// No need to handle users of gathered values.		// No need to handle users of gathered values.
if (Entry->State == TreeEntry::NeedToGather)		if (Entry->State == TreeEntry::NeedToGather)
continue;		continue;

// For each lane:		// For each lane:
for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {		for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
Value *Scalar = Entry->Scalars[Lane];		Value *Scalar = Entry->Scalars[Lane];
int FoundLane = Lane;		int FoundLane = Lane;
if (!Entry->ReuseShuffleIndices.empty()) {		if (!Entry->ReuseShuffleIndices.empty()) {
FoundLane =		FoundLane =
std::distance(Entry->ReuseShuffleIndices.begin(),		std::distance(Entry->ReuseShuffleIndices.begin(),
llvm::find(Entry->ReuseShuffleIndices, FoundLane));		llvm::find(Entry->ReuseShuffleIndices, FoundLane));
}		}

// Check if the scalar is externally used as an extra arg.		// Check if the scalar is externally used as an extra arg.
auto ExtI = ExternallyUsedValues.find(Scalar);		auto ExtI = ExternallyUsedValues.find(Scalar);
if (ExtI != ExternallyUsedValues.end()) {		if (ExtI != ExternallyUsedValues.end()) {
LLVM_DEBUG(dbgs() << "SLP: Need to extract: Extra arg from lane "		LLVM_DEBUG(dbgs() << "SLP: Need to extract: Extra arg from lane "
<< Lane << " from " << *Scalar << ".\n");		<< Lane << " from " << *Scalar << ".\n");
ExternalUses.emplace_back(Scalar, nullptr, FoundLane);		Tree->ExternalUses.emplace_back(Scalar, nullptr, FoundLane);
}		}
for (User *U : Scalar->users()) {		for (User *U : Scalar->users()) {
LLVM_DEBUG(dbgs() << "SLP: Checking user:" << *U << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Checking user:" << *U << ".\n");

Instruction *UserInst = dyn_cast<Instruction>(U);		Instruction *UserInst = dyn_cast<Instruction>(U);
if (!UserInst)		if (!UserInst)
continue;		continue;

// Skip in-tree scalars that become vectors		// Skip in-tree scalars that become vectors
if (TreeEntry *UseEntry = getTreeEntry(U)) {		if (TreeEntry *UseEntry = getTreeEntry(U)) {
Value *UseScalar = UseEntry->Scalars[0];		Value *UseScalar = UseEntry->Scalars[0];
// Some in-tree scalars will remain as scalar in vectorized		// Some in-tree scalars will remain as scalar in vectorized
// instructions. If that is the case, the one in Lane 0 will		// instructions. If that is the case, the one in Lane 0 will
// be used.		// be used.
if (UseScalar != U ||		if (UseScalar != U ||
!InTreeUserNeedToExtract(Scalar, UserInst, TLI)) {		!InTreeUserNeedToExtract(Scalar, UserInst, TLI)) {
LLVM_DEBUG(dbgs() << "SLP: \tInternal user will be removed:" << *U		LLVM_DEBUG(dbgs() << "SLP: \tInternal user will be removed:" << *U
<< ".\n");		<< ".\n");
assert(UseEntry->State != TreeEntry::NeedToGather && "Bad state");		assert(UseEntry->State != TreeEntry::NeedToGather && "Bad state");
		Tree->InternalTreeUses[U].emplace_back(Scalar, U, FoundLane);
continue;		continue;
		ABataev [Not Done]: Looks like unrelated change.
}		}
}		}

// Ignore users in the user ignore list.		// Ignore users in the user ignore list.
if (is_contained(UserIgnoreList, UserInst))		if (is_contained(UserIgnoreList, UserInst))
continue;		continue;

LLVM_DEBUG(dbgs() << "SLP: Need to extract:" << *U << " from lane "		LLVM_DEBUG(dbgs() << "SLP: Need to extract:" << *U << " from lane "
<< Lane << " from " << *Scalar << ".\n");		<< Lane << " from " << *Scalar << ".\n");
ExternalUses.push_back(ExternalUser(Scalar, U, FoundLane));		Tree->ExternalUses.push_back(ExternalUser(Scalar, U, FoundLane));
}		}
}		}
}		}
}		}
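Note on the lane lookup in the hunk above: when `ReuseShuffleIndices` is non-empty, `FoundLane` becomes the first position in the reuse mask that refers to the scalar's lane. The following is a minimal standalone sketch of that lookup, assuming the mask is a plain vector of ints. `findVectorLane` is a hypothetical name used only for illustration and is not part of the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Hypothetical helper (not in the patch): map a lane of the unique scalars
// to the first vector lane that holds it after the reuse shuffle.
// Assumption: ReuseIndices[i] names the unique scalar placed in vector lane i.
int findVectorLane(const std::vector<int> &ReuseIndices, int Lane) {
  if (ReuseIndices.empty())
    return Lane; // No reuse shuffle: lanes map one-to-one.
  auto It = std::find(ReuseIndices.begin(), ReuseIndices.end(), Lane);
  // Same shape as std::distance(begin, llvm::find(...)) in buildTree().
  // If Lane is absent, this returns ReuseIndices.size().
  return static_cast<int>(std::distance(ReuseIndices.begin(), It));
}
```

With a mask of `{1, 0, 1, 0}`, scalar lane 0 would be extracted from vector lane 1, since that is the first mask position that refers to it.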

void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,		void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
const EdgeInfo &UserTreeIdx) {		const EdgeInfo &UserTreeIdx) {
assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");		assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");
[67 unchanged lines hidden]	if (getTreeEntry(I)) {
return;		return;
}		}
}		}

// If any of the scalars is marked as a value that needs to stay scalar, then		// If any of the scalars is marked as a value that needs to stay scalar, then
// we need to gather the scalars.		// we need to gather the scalars.
// The reduction nodes (stored in UserIgnoreList) also should stay scalar.		// The reduction nodes (stored in UserIgnoreList) also should stay scalar.
for (Value *V : VL) {		for (Value *V : VL) {
if (MustGather.count(V) || is_contained(UserIgnoreList, V)) {		if (Tree->MustGather.count(V) || is_contained(UserIgnoreList, V)) {
LLVM_DEBUG(dbgs() << "SLP: Gathering due to gathered scalar.\n");		LLVM_DEBUG(dbgs() << "SLP: Gathering due to gathered scalar.\n");
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);
return;		return;
}		}
}		}

// Check that all of the users of the scalars that we want to vectorize are		// Check that all of the users of the scalars that we want to vectorize are
// schedulable.		// schedulable.
[27 unchanged lines hidden]	if (NumUniqueScalarValues <= 1 ||
!llvm::isPowerOf2_32(NumUniqueScalarValues)) {		!llvm::isPowerOf2_32(NumUniqueScalarValues)) {
LLVM_DEBUG(dbgs() << "SLP: Scalar used twice in bundle.\n");		LLVM_DEBUG(dbgs() << "SLP: Scalar used twice in bundle.\n");
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);
return;		return;
}		}
VL = UniqueValues;		VL = UniqueValues;
}		}

auto &BSRef = BlocksSchedules[BB];		auto &BSRef = Tree->BlocksSchedules[BB];
if (!BSRef)		if (!BSRef)
BSRef = std::make_unique<BlockScheduling>(BB);		BSRef = std::make_unique<BlockScheduling>(BB);

BlockScheduling &BS = *BSRef.get();		BlockScheduling &BS = *BSRef.get();

Optional<ScheduleData *> Bundle = BS.tryScheduleBundle(VL, this, S);		Optional<ScheduleData *> Bundle = BS.tryScheduleBundle(VL, this, S);
if (!Bundle) {		if (!Bundle) {
LLVM_DEBUG(dbgs() << "SLP: We are not able to schedule this bundle!\n");		LLVM_DEBUG(dbgs() << "SLP: We are not able to schedule this bundle!\n");
[48 unchanged lines hidden]	case Instruction::PHI: {
return;		return;
}		}
case Instruction::ExtractValue:		case Instruction::ExtractValue:
case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
OrdersType CurrentOrder;		OrdersType CurrentOrder;
bool Reuse = canReuseExtract(VL, VL0, CurrentOrder);		bool Reuse = canReuseExtract(VL, VL0, CurrentOrder);
if (Reuse) {		if (Reuse) {
LLVM_DEBUG(dbgs() << "SLP: Reusing or shuffling extract sequence.\n");		LLVM_DEBUG(dbgs() << "SLP: Reusing or shuffling extract sequence.\n");
++NumOpsWantToKeepOriginalOrder;		++Tree->NumOpsWantToKeepOriginalOrder;
newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
// This is a special case, as it does not gather, but at the same time		// This is a special case, as it does not gather, but at the same time
// we are not extending buildTree_rec() towards the operands.		// we are not extending buildTree_rec() towards the operands.
ValueList Op0;		ValueList Op0;
Op0.assign(VL.size(), VL0->getOperand(0));		Op0.assign(VL.size(), VL0->getOperand(0));
VectorizableTree.back()->setOperand(0, Op0);		Tree->VectorizableTree.back()->setOperand(0, Op0);
return;		return;
}		}
if (!CurrentOrder.empty()) {		if (!CurrentOrder.empty()) {
LLVM_DEBUG({		LLVM_DEBUG({
dbgs() << "SLP: Reusing or shuffling of reordered extract sequence "		dbgs() << "SLP: Reusing or shuffling of reordered extract sequence "
"with order";		"with order";
for (unsigned Idx : CurrentOrder)		for (unsigned Idx : CurrentOrder)
dbgs() << " " << Idx;		dbgs() << " " << Idx;
dbgs() << "\n";		dbgs() << "\n";
});		});
// Insert new order with initial value 0, if it does not exist,		// Insert new order with initial value 0, if it does not exist,
// otherwise return the iterator to the existing one.		// otherwise return the iterator to the existing one.
auto StoredCurrentOrderAndNum =		auto StoredCurrentOrderAndNum =
NumOpsWantToKeepOrder.try_emplace(CurrentOrder).first;		Tree->NumOpsWantToKeepOrder.try_emplace(CurrentOrder).first;
++StoredCurrentOrderAndNum->getSecond();		++StoredCurrentOrderAndNum->getSecond();
newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies,		ReuseShuffleIndicies,
StoredCurrentOrderAndNum->getFirst());		StoredCurrentOrderAndNum->getFirst());
// This is a special case, as it does not gather, but at the same time		// This is a special case, as it does not gather, but at the same time
// we are not extending buildTree_rec() towards the operands.		// we are not extending buildTree_rec() towards the operands.
ValueList Op0;		ValueList Op0;
Op0.assign(VL.size(), VL0->getOperand(0));		Op0.assign(VL.size(), VL0->getOperand(0));
VectorizableTree.back()->setOperand(0, Op0);		Tree->VectorizableTree.back()->setOperand(0, Op0);
return;		return;
}		}
LLVM_DEBUG(dbgs() << "SLP: Gather extract sequence.\n");		LLVM_DEBUG(dbgs() << "SLP: Gather extract sequence.\n");
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
return;		return;
}		}
[48 unchanged lines hidden]	case Instruction::Load: {
const SCEV *ScevN = SE->getSCEV(PtrN);		const SCEV *ScevN = SE->getSCEV(PtrN);
const auto *Diff =		const auto *Diff =
dyn_cast<SCEVConstant>(SE->getMinusSCEV(ScevN, Scev0));		dyn_cast<SCEVConstant>(SE->getMinusSCEV(ScevN, Scev0));
uint64_t Size = DL->getTypeAllocSize(ScalarTy);		uint64_t Size = DL->getTypeAllocSize(ScalarTy);
// Check that the sorted loads are consecutive.		// Check that the sorted loads are consecutive.
if (Diff && Diff->getAPInt() == (VL.size() - 1) * Size) {		if (Diff && Diff->getAPInt() == (VL.size() - 1) * Size) {
if (CurrentOrder.empty()) {		if (CurrentOrder.empty()) {
// Original loads are consecutive and does not require reordering.		// Original loads are consecutive and does not require reordering.
++NumOpsWantToKeepOriginalOrder;		++Tree->NumOpsWantToKeepOriginalOrder;
TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S,		TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S,
UserTreeIdx, ReuseShuffleIndicies);		UserTreeIdx, ReuseShuffleIndicies);
TE->setOperandsInOrder();		TE->setOperandsInOrder();
LLVM_DEBUG(dbgs() << "SLP: added a vector of loads.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of loads.\n");
} else {		} else {
// Need to reorder.		// Need to reorder.
auto I = NumOpsWantToKeepOrder.try_emplace(CurrentOrder).first;		auto I =
		Tree->NumOpsWantToKeepOrder.try_emplace(CurrentOrder).first;
++I->getSecond();		++I->getSecond();
TreeEntry *TE =		TreeEntry *TE =
newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies, I->getFirst());		ReuseShuffleIndicies, I->getFirst());
TE->setOperandsInOrder();		TE->setOperandsInOrder();
LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled loads.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled loads.\n");
}		}
return;		return;
[238 unchanged lines hidden]	case Instruction::Store: {
const SCEV *ScevN = SE->getSCEV(PtrN);		const SCEV *ScevN = SE->getSCEV(PtrN);
const auto *Diff =		const auto *Diff =
dyn_cast<SCEVConstant>(SE->getMinusSCEV(ScevN, Scev0));		dyn_cast<SCEVConstant>(SE->getMinusSCEV(ScevN, Scev0));
uint64_t Size = DL->getTypeAllocSize(ScalarTy);		uint64_t Size = DL->getTypeAllocSize(ScalarTy);
// Check that the sorted pointer operands are consecutive.		// Check that the sorted pointer operands are consecutive.
if (Diff && Diff->getAPInt() == (VL.size() - 1) * Size) {		if (Diff && Diff->getAPInt() == (VL.size() - 1) * Size) {
if (CurrentOrder.empty()) {		if (CurrentOrder.empty()) {
// Original stores are consecutive and does not require reordering.		// Original stores are consecutive and does not require reordering.
++NumOpsWantToKeepOriginalOrder;		++Tree->NumOpsWantToKeepOriginalOrder;
TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S,		TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S,
UserTreeIdx, ReuseShuffleIndicies);		UserTreeIdx, ReuseShuffleIndicies);
TE->setOperandsInOrder();		TE->setOperandsInOrder();
buildTree_rec(Operands, Depth + 1, {TE, 0});		buildTree_rec(Operands, Depth + 1, {TE, 0});
LLVM_DEBUG(dbgs() << "SLP: added a vector of stores.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of stores.\n");
} else {		} else {
// Need to reorder.		// Need to reorder.
auto I = NumOpsWantToKeepOrder.try_emplace(CurrentOrder).first;		auto I =
		Tree->NumOpsWantToKeepOrder.try_emplace(CurrentOrder).first;
++(I->getSecond());		++(I->getSecond());
TreeEntry *TE =		TreeEntry *TE =
newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies, I->getFirst());		ReuseShuffleIndicies, I->getFirst());
TE->setOperandsInOrder();		TE->setOperandsInOrder();
buildTree_rec(Operands, Depth + 1, {TE, 0});		buildTree_rec(Operands, Depth + 1, {TE, 0});
LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled stores.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled stores.\n");
}		}
[129 unchanged lines hidden]	default:
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
LLVM_DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");		LLVM_DEBUG(dbgs() << "SLP: Gathering unknown instruction.\n");
return;		return;
}		}
}		}
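Note on the consecutive-access test used in the Load and Store cases above: after sorting, the code compares the distance between the first and last pointer against `(VL.size() - 1) * Size`. Below is a small standalone sketch of that arithmetic, with plain byte addresses standing in for the SCEV difference. `spansConsecutiveElements` and its parameters are hypothetical names, not LLVM API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the SCEV-based test in buildTree_rec():
// FirstAddr and LastAddr are the byte addresses of the first and last
// accesses after sorting, EltSize is the allocation size of one element.
// Like the original check, this only compares the end-to-end distance.
bool spansConsecutiveElements(std::uint64_t FirstAddr, std::uint64_t LastAddr,
                              std::size_t NumElts, std::uint64_t EltSize) {
  if (NumElts == 0 || LastAddr < FirstAddr)
    return false;
  return LastAddr - FirstAddr == (NumElts - 1) * EltSize;
}
```

For four 4-byte elements the last access must start exactly 12 bytes after the first. A distance of 16 would mean there is a gap somewhere in the bundle.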

		void BoUpSLP::cutTree() {
		SmallVector<TreeEntry *, 4> VecNodes;

		for (std::unique_ptr<TreeEntry> &TEPtr : Tree->VectorizableTree) {
		TreeEntry *Entry = TEPtr.get();
		if (Entry->State != TreeEntry::Vectorize)
		continue;
		ABataev [Not Done]: You don't need to push the elements to a new vector here; instead, you can perform the required actions directly.
		// For all canceled operations we should consider the possibility of
		// use by with non-canceled operations and for that, it requires
		// to populate ExternalUser list with canceled elements.
		for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
		ABataev [Not Done]: Should this loop be executed only for `ProposedToGather` `Entry`s?
		dtemirbulatov (Author) [Done]: Yes, but we have to know the lane there.
		ABataev [Not Done]: I mean that the check in the previous `if` must be `Entry->State == TreeEntry::ProposedToGather`, not `!= TreeEntry::Vectorize`.
		Value *Scalar = Entry->Scalars[Lane];
		for (User *U : Scalar->users()) {
		ABataev [Done]: These two loops can be merged, no? And use `switch` instead of `if`, if possible, after merging.
		dtemirbulatov (Author) [Done]: Thanks.
		LLVM_DEBUG(dbgs() << "SLP: Checking user:" << *U << ".\n");
		TreeEntry *UserTE = getTreeEntry(U);
		ABataev [Done]: What if the user does not have a corresponding tree entry, i.e. it is initially scalar? What if the `Scalar` itself is going to remain scalar?
		dtemirbulatov (Author) [Done]:
		> What if the Scalar itself is going to remain scalar?
		At this point the decision to cut the tree has already been made, so the Scalar can only be one that was intended to be vectorized. Note that at 3295 we ignore any tree entry whose State is not TreeEntry::Vectorize.
		> What if the user does not have corresponding tree entry, i.e. it is initially scalar?
		Ah, yes. I have to check for !UserTE at 3305 and just continue if it is true.
		if (UserTE && UserTE->State != TreeEntry::ProposedToGather)
		continue;
		ABataev [Done]: Could you compare it with the similar code in BoUpSLP::buildTree? It looks like you still missed some cases where users are ignored.
		dtemirbulatov (Author) [Done]: I think those cases exist because BoUpSLP::buildTree performs full vectorization, so we can avoid extracts for in-tree users. Here we have to extract for each user of a value that was once proposed for vectorization.
		dtemirbulatov (Author) [Done]:
		> Here we have to extract for each user of a value that was once proposed for vectorization.
		I mean for partial vectorization.
		// Ignore users in the user ignore list.
		ABataev [Not Done]: This does nothing except print debug output; guard it with `#ifndef NDEBUG .. #endif`.
		auto *UserInst = cast<Instruction>(U);
		ABataev [Done]: Why do you need to compare the counts of flow and operation instructions? Also, why use a hardcoded `3` as the limit on vectorizable nodes?
		dtemirbulatov (Author) [Done]: From the cost perspective it looks good to vectorize the subtree, but in benchmarks it introduces regressions. This may be the known issue where partial vectorization prevents full vectorization later on. For example, if I remove this limiter at 3271, then for the Transforms/SLPVectorizer/X86/PR39774.ll test case I would get:
		--- a/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll
		+++ b/llvm/test/Transforms/SLPVectorizer/X86/PR39774.ll
		@@ -7,55 +7,65 @@ define void @test(i32) {
		 ; CHECK-NEXT: entry:
		 ; CHECK-NEXT: br label [[LOOP:%.*]]
		 ; CHECK: loop:
		-; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP15:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
		-; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> undef, <8 x i32> <i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
		-; CHECK-NEXT: [[TMP2:%.*]] = extractelement <8 x i32> [[SHUFFLE]], i32 1
		-; CHECK-NEXT: [[TMP3:%.*]] = add <8 x i32> [[SHUFFLE]], <i32 0, i32 55, i32 285, i32 1240, i32 1496, i32 8555, i32 12529, i32 13685>
		-; CHECK-NEXT: [[RDX_SHUF:%.*]] = shufflevector <8 x i32> [[TMP3]], <8 x i32> undef, <8 x i32> <i32 4, i32 5, i32 6, i32 7, i32 undef, i32 undef, i32 undef, i32 undef>
		-; CHECK-NEXT: [[BIN_RDX:%.*]] = and <8 x i32> [[TMP3]], [[RDX_SHUF]]
		-; CHECK-NEXT: [[RDX_SHUF1:%.*]] = shufflevector <8 x i32> [[BIN_RDX]], <8 x i32> undef, <8 x i32> <i32 2, i32 3, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
		-; CHECK-NEXT: [[BIN_RDX2:%.*]] = and <8 x i32> [[BIN_RDX]], [[RDX_SHUF1]]
		-; CHECK-NEXT: [[RDX_SHUF3:%.*]] = shufflevector <8 x i32> [[BIN_RDX2]], <8 x i32> undef, <8 x i32> <i32 1, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef, i32 undef>
		-; CHECK-NEXT: [[BIN_RDX4:%.*]] = and <8 x i32> [[BIN_RDX2]], [[RDX_SHUF3]]
		-; CHECK-NEXT: [[TMP4:%.*]] = extractelement <8 x i32> [[BIN_RDX4]], i32 0
		-; CHECK-NEXT: [[OP_EXTRA:%.*]] = and i32 [[TMP4]], [[TMP0:%.*]]
		-; CHECK-NEXT: [[OP_EXTRA5:%.*]] = and i32 [[OP_EXTRA]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA6:%.*]] = and i32 [[OP_EXTRA5]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA7:%.*]] = and i32 [[OP_EXTRA6]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA8:%.*]] = and i32 [[OP_EXTRA7]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA9:%.*]] = and i32 [[OP_EXTRA8]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA10:%.*]] = and i32 [[OP_EXTRA9]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA11:%.*]] = and i32 [[OP_EXTRA10]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA12:%.*]] = and i32 [[OP_EXTRA11]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA13:%.*]] = and i32 [[OP_EXTRA12]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA14:%.*]] = and i32 [[OP_EXTRA13]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA15:%.*]] = and i32 [[OP_EXTRA14]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA16:%.*]] = and i32 [[OP_EXTRA15]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA17:%.*]] = and i32 [[OP_EXTRA16]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA18:%.*]] = and i32 [[OP_EXTRA17]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA19:%.*]] = and i32 [[OP_EXTRA18]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA20:%.*]] = and i32 [[OP_EXTRA19]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA21:%.*]] = and i32 [[OP_EXTRA20]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA22:%.*]] = and i32 [[OP_EXTRA21]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA23:%.*]] = and i32 [[OP_EXTRA22]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA24:%.*]] = and i32 [[OP_EXTRA23]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA25:%.*]] = and i32 [[OP_EXTRA24]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA26:%.*]] = and i32 [[OP_EXTRA25]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA27:%.*]] = and i32 [[OP_EXTRA26]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA28:%.*]] = and i32 [[OP_EXTRA27]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA29:%.*]] = and i32 [[OP_EXTRA28]], [[TMP0]]
		-; CHECK-NEXT: [[OP_EXTRA30:%.*]] = and i32 [[OP_EXTRA29]], [[TMP0]]
		-; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> undef, i32 [[OP_EXTRA30]], i32 0
		-; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x i32> [[TMP5]], i32 14910, i32 1
		-; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x i32> undef, i32 [[TMP2]], i32 0
		-; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x i32> [[TMP7]], i32 [[TMP2]], i32 1
		-; CHECK-NEXT: [[TMP9:%.*]] = and <2 x i32> [[TMP6]], [[TMP8]]
		-; CHECK-NEXT: [[TMP10:%.*]] = add <2 x i32> [[TMP6]], [[TMP8]]
		-; CHECK-NEXT: [[TMP11:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> [[TMP10]], <2 x i32> <i32 0, i32 3>
		-; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x i32> [[TMP11]], i32 0
		-; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x i32> undef, i32 [[TMP12]], i32 0
		-; CHECK-NEXT: [[TMP14:%.*]] = extractelement <2 x i32> [[TMP11]], i32 1
		-; CHECK-NEXT: [[TMP15]] = insertelement <2 x i32> [[TMP13]], i32 [[TMP14]], i32 1
		+; CHECK-NEXT: [[TMP1:%.*]] = phi <2 x i32> [ [[TMP19:%.*]], [[LOOP]] ], [ zeroinitializer, [[ENTRY:%.*]] ]
		+; CHECK-NEXT: [[TMP2:%.*]] = extractelement <2 x i32> [[TMP1]], i32 0
		+; CHECK-NEXT: [[VAL_0:%.*]] = add i32 [[TMP2]], 0
		+; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x i32> [[TMP1]], i32 1
		+; CHECK-NEXT: [[VAL_1:%.*]] = and i32 [[TMP3]], [[VAL_0]]
		+; CHECK-NEXT: [[VAL_2:%.*]] = and i32 [[VAL_1]], [[TMP0:%.*]]
		+; CHECK-NEXT: [[VAL_3:%.*]] = and i32 [[VAL_2]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_4:%.*]] = and i32 [[VAL_3]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_5:%.*]] = and i32 [[VAL_4]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_6:%.*]] = add i32 [[TMP3]], 55
		+; CHECK-NEXT: [[VAL_7:%.*]] = and i32 [[VAL_5]], [[VAL_6]]
		+; CHECK-NEXT: [[VAL_8:%.*]] = and i32 [[VAL_7]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_9:%.*]] = and i32 [[VAL_8]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_10:%.*]] = and i32 [[VAL_9]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_11:%.*]] = add i32 [[TMP3]], 285
		+; CHECK-NEXT: [[VAL_12:%.*]] = and i32 [[VAL_10]], [[VAL_11]]
		+; CHECK-NEXT: [[VAL_13:%.*]] = and i32 [[VAL_12]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_14:%.*]] = and i32 [[VAL_13]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_15:%.*]] = and i32 [[VAL_14]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_16:%.*]] = and i32 [[VAL_15]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_17:%.*]] = and i32 [[VAL_16]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_18:%.*]] = add i32 [[TMP3]], 1240
		+; CHECK-NEXT: [[VAL_19:%.*]] = and i32 [[VAL_17]], [[VAL_18]]
		+; CHECK-NEXT: [[VAL_20:%.*]] = add i32 [[TMP3]], 1496
		+; CHECK-NEXT: [[VAL_21:%.*]] = and i32 [[VAL_19]], [[VAL_20]]
		+; CHECK-NEXT: [[VAL_22:%.*]] = and i32 [[VAL_21]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_23:%.*]] = and i32 [[VAL_22]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_24:%.*]] = and i32 [[VAL_23]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_25:%.*]] = and i32 [[VAL_24]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_26:%.*]] = and i32 [[VAL_25]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_27:%.*]] = and i32 [[VAL_26]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_28:%.*]] = and i32 [[VAL_27]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_29:%.*]] = and i32 [[VAL_28]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_30:%.*]] = and i32 [[VAL_29]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_31:%.*]] = and i32 [[VAL_30]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_32:%.*]] = and i32 [[VAL_31]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_33:%.*]] = and i32 [[VAL_32]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_34:%.*]] = add i32 [[TMP3]], 8555
		+; CHECK-NEXT: [[VAL_35:%.*]] = and i32 [[VAL_33]], [[VAL_34]]
		+; CHECK-NEXT: [[VAL_36:%.*]] = and i32 [[VAL_35]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_37:%.*]] = and i32 [[VAL_36]], [[TMP0]]
		+; CHECK-NEXT: [[VAL_38:%.*]] = and i32 [[VAL_37]], [[TMP0]]
		+; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x i32> undef, i32 [[TMP3]], i32 0
		+; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x i32> [[TMP4]], i32 [[TMP3]], i32 1
		+; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i32> [[TMP5]], <i32 12529, i32 13685>
		+; CHECK-NEXT: [[TMP7:%.*]] = extractelement <2 x i32> [[TMP6]], i32 0
		+; CHECK-NEXT: [[VAL_40:%.*]] = and i32 [[VAL_38]], [[TMP7]]
		+; CHECK-NEXT: [[TMP8:%.*]] = extractelement <2 x i32> [[TMP6]], i32 1
		+; CHECK-NEXT: [[TMP9:%.*]] = insertelement <2 x i32> undef, i32 [[VAL_40]], i32 0
		+; CHECK-NEXT: [[TMP10:%.*]] = insertelement <2 x i32> [[TMP9]], i32 14910, i32 1
		+; CHECK-NEXT: [[TMP11:%.*]] = insertelement <2 x i32> undef, i32 [[TMP8]], i32 0
		+; CHECK-NEXT: [[TMP12:%.*]] = insertelement <2 x i32> [[TMP11]], i32 [[TMP3]], i32 1
		+; CHECK-NEXT: [[TMP13:%.*]] = and <2 x i32> [[TMP10]], [[TMP12]]
		+; CHECK-NEXT: [[TMP14:%.*]] = add <2 x i32> [[TMP10]], [[TMP12]]
		+; CHECK-NEXT: [[TMP15:%.*]] = shufflevector <2 x i32> [[TMP13]], <2 x i32> [[TMP14]], <2 x i32> <i32 0, i32 3>
		+; CHECK-NEXT: [[TMP16:%.*]] = extractelement <2 x i32> [[TMP15]], i32 0
		+; CHECK-NEXT: [[TMP17:%.*]] = insertelement <2 x i32> undef, i32 [[TMP16]], i32 0
		+; CHECK-NEXT: [[TMP18:%.*]] = extractelement <2 x i32> [[TMP15]], i32 1
		+; CHECK-NEXT: [[TMP19]] = insertelement <2 x i32> [[TMP17]], i32 [[TMP18]], i32 1
		and we can see that [[TMP3:%.*]] = add <8 x i32> [[SHUFFLE]], <i32 0, i32 55, i32 285, i32 1240, i32 1496, i32 8555, i32 12529, i32 13685> disappears. I have to recheck those regressions on cpu2006.
		if (is_contained(UserIgnoreList, UserInst))
		continue;
		LLVM_DEBUG(dbgs() << "SLP: Need to extract canceled operation :" << *U
		ABataev [Not Done]: Either use plain `cast` without the `if`, or use `dyn_cast`.
		<< " from lane " << Lane << " from " << *Scalar
		<< ".\n");
		Tree->ExternalUses.emplace_back(Scalar, U, Lane);
		}
		}
		ABataev [Done]: Looks like the `Scalar` should be extracted only if its user is vectorized and it remains scalar in the vectorized tree. Or it is not going to be vectorized.
		dtemirbulatov (Author) [Done]: Thanks, good catch. I think I first need to populate ExternalUser while the TreeEntries are still marked ProposedToGather, and then change them to NeedToGather nodes in a separate loop.
		}
		// Canceling unprofitable elements.
		for (std::unique_ptr<TreeEntry> &TEPtr : Tree->VectorizableTree) {
		ABataev [Not Done]: I don't actually see any propagation of `ProposedToGather`, and these loops can be merged, no?
		dtemirbulatov (Author) [Done]: No, it is not possible to merge those two loops.
		ABataev [Not Done]: Why?
		dtemirbulatov (Author) [Done]: For example, in the first loop we could change Entry1 from TreeEntry::ProposedToGather to TreeEntry::NeedToGather, but later encounter another use of Entry1 from another entry (Entry2, let's say) with TreeEntry::Vectorize status. We could then no longer tell the difference between a just-canceled entry and one that was never considered for vectorization, so ExternalUses would not be populated properly.
		ABataev [Not Done]: The first loop does not change the state of the tree entries.
		dtemirbulatov (Author) [Done]: I mean if we merge them into one loop.
		TreeEntry *Entry = TEPtr.get();
		ABataev [Done]: The scalars are not actually removed here.
		dtemirbulatov (Author) [Done]:
		> The scalars are not actually removed here.
		Yes, but I was thinking it would be too noisy to put this print in findSubTree(), and we might not end up with a tree to vectorize anyway. It is probably better to print all of a TreeEntry's scalars at once, saying that we are not going to vectorize those operations.
		if (Entry->State != TreeEntry::ProposedToGather)
		continue;
		Entry->State = TreeEntry::NeedToGather;
		#ifndef NDEBUG
		for (Value *V : Entry->Scalars)
		LLVM_DEBUG(dbgs() << "SLP: Remove scalar " << *V
		<< " out of proposed to vectorize.\n");
		#endif
		}
		}

		bool BoUpSLP::tryPartialVectorization() {
		if (BuiltTrees.size() == 1)
		ABataev: Can this occur at all?
		return false;
		Tree = BuiltTrees.front().get();
		vectorizeTree();
		LLVM_DEBUG(dbgs() << "SLP: Decided to partially vectorize tree with cost: "
		<< Tree->PartialCost << ".\n");
		return true;
		}

unsigned BoUpSLP::canMapToVector(Type *T, const DataLayout &DL) const {		unsigned BoUpSLP::canMapToVector(Type *T, const DataLayout &DL) const {
unsigned N = 1;		unsigned N = 1;
Type *EltTy = T;		Type *EltTy = T;

while (isa<StructType>(EltTy) || isa<ArrayType>(EltTy) ||		while (isa<StructType>(EltTy) || isa<ArrayType>(EltTy) ||
isa<VectorType>(EltTy)) {		isa<VectorType>(EltTy)) {
if (auto *ST = dyn_cast<StructType>(EltTy)) {		if (auto *ST = dyn_cast<StructType>(EltTy)) {
// Check that struct is homogeneous.		// Check that struct is homogeneous.
[85 lines hidden]	bool BoUpSLP::canReuseExtract(ArrayRef<Value *> VL, Value *OpValue,
}		}

return ShouldKeepOrder;		return ShouldKeepOrder;
}		}

bool BoUpSLP::areAllUsersVectorized(Instruction *I) const {		bool BoUpSLP::areAllUsersVectorized(Instruction *I) const {
return I->hasOneUse() ||		return I->hasOneUse() ||
std::all_of(I->user_begin(), I->user_end(), [this](User *U) {		std::all_of(I->user_begin(), I->user_end(), [this](User *U) {
return ScalarToTreeEntry.count(U) > 0;		return Tree->ScalarToTreeEntry.count(U) > 0;
});		});
}		}

static std::pair<unsigned, unsigned>		static std::pair<unsigned, unsigned>
getVectorCallCosts(CallInst *CI, VectorType *VecTy, TargetTransformInfo *TTI,		getVectorCallCosts(CallInst *CI, VectorType *VecTy, TargetTransformInfo *TTI,
TargetLibraryInfo *TLI) {		TargetLibraryInfo *TLI) {
Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);		Intrinsic::ID ID = getVectorIntrinsicIDForCall(CI, TLI);

[40 lines hidden]	int BoUpSLP::getEntryCost(TreeEntry *E) {

unsigned ReuseShuffleNumbers = E->ReuseShuffleIndices.size();		unsigned ReuseShuffleNumbers = E->ReuseShuffleIndices.size();
bool NeedToShuffleReuses = !E->ReuseShuffleIndices.empty();		bool NeedToShuffleReuses = !E->ReuseShuffleIndices.empty();
int ReuseShuffleCost = 0;		int ReuseShuffleCost = 0;
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
ReuseShuffleCost =		ReuseShuffleCost =
TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, VecTy);		TTI->getShuffleCost(TargetTransformInfo::SK_PermuteSingleSrc, VecTy);
}		}
if (E->State == TreeEntry::NeedToGather) {		if (E->State != TreeEntry::Vectorize) {
if (allConstant(VL))		if (allConstant(VL))
return 0;		return 0;
if (isSplat(VL)) {		if (isSplat(VL)) {
return ReuseShuffleCost +		return ReuseShuffleCost +
TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy, 0);		TTI->getShuffleCost(TargetTransformInfo::SK_Broadcast, VecTy, 0);
}		}
if (E->getOpcode() == Instruction::ExtractElement &&		if (E->getOpcode() == Instruction::ExtractElement &&
allSameType(VL) && allSameBlock(VL)) {		allSameType(VL) && allSameBlock(VL)) {
Optional<TargetTransformInfo::ShuffleKind> ShuffleKind = isShuffle(VL);		Optional<TargetTransformInfo::ShuffleKind> ShuffleKind = isShuffle(VL);
if (ShuffleKind.hasValue()) {		if (ShuffleKind.hasValue()) {
int Cost = TTI->getShuffleCost(ShuffleKind.getValue(), VecTy);		int Cost = TTI->getShuffleCost(ShuffleKind.getValue(), VecTy);
for (auto *V : VL) {		for (auto *V : VL) {
// If all users of instruction are going to be vectorized and this		// If all users of instruction are going to be vectorized and this
// instruction itself is not going to be vectorized, consider this		// instruction itself is not going to be vectorized, consider this
// instruction as dead and remove its cost from the final cost of the		// instruction as dead and remove its cost from the final cost of the
// vectorized tree.		// vectorized tree.
if (areAllUsersVectorized(cast<Instruction>(V)) &&		if (areAllUsersVectorized(cast<Instruction>(V)) &&
!ScalarToTreeEntry.count(V)) {		!Tree->ScalarToTreeEntry.count(V)) {
auto *IO = cast<ConstantInt>(		auto *IO = cast<ConstantInt>(
cast<ExtractElementInst>(V)->getIndexOperand());		cast<ExtractElementInst>(V)->getIndexOperand());
Cost -= TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy,		Cost -= TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy,
IO->getZExtValue());		IO->getZExtValue());
}		}
}		}
return ReuseShuffleCost + Cost;		return ReuseShuffleCost + Cost;
}		}
[318 lines hidden]	switch (ShuffleOrOp) {
}		}
default:		default:
llvm_unreachable("Unknown instruction");		llvm_unreachable("Unknown instruction");
}		}
}		}

bool BoUpSLP::isFullyVectorizableTinyTree() const {		bool BoUpSLP::isFullyVectorizableTinyTree() const {
LLVM_DEBUG(dbgs() << "SLP: Check whether the tree with height "		LLVM_DEBUG(dbgs() << "SLP: Check whether the tree with height "
<< VectorizableTree.size() << " is fully vectorizable .\n");		<< Tree->VectorizableTree.size()
		<< " is fully vectorizable .\n");

// We only handle trees of heights 1 and 2.		// We only handle trees of heights 1 and 2.
if (VectorizableTree.size() == 1 &&		if (Tree->VectorizableTree.size() == 1 &&
VectorizableTree[0]->State == TreeEntry::Vectorize)		Tree->VectorizableTree[0]->State == TreeEntry::Vectorize)
return true;		return true;

if (VectorizableTree.size() != 2)		if (Tree->VectorizableTree.size() != 2)
return false;		return false;

// Handle splat and all-constants stores.		// Handle splat and all-constants stores.
if (VectorizableTree[0]->State == TreeEntry::Vectorize &&		if (Tree->VectorizableTree[0]->State == TreeEntry::Vectorize &&
(allConstant(VectorizableTree[1]->Scalars) ||		(allConstant(Tree->VectorizableTree[1]->Scalars) ||
isSplat(VectorizableTree[1]->Scalars)))		isSplat(Tree->VectorizableTree[1]->Scalars)))
return true;		return true;

// Gathering cost would be too much for tiny trees.		// Gathering cost would be too much for tiny trees.
if (VectorizableTree[0]->State == TreeEntry::NeedToGather ||		if (Tree->VectorizableTree[0]->State != TreeEntry::Vectorize ||
VectorizableTree[1]->State == TreeEntry::NeedToGather)		Tree->VectorizableTree[1]->State != TreeEntry::Vectorize)
return false;		return false;
		ABataev: Maybe better to use `!= TreeEntry::Vectorize` to avoid trees with proposed gathering?

return true;		return true;
}		}

static bool isLoadCombineCandidateImpl(Value *Root, unsigned NumElts,		static bool isLoadCombineCandidateImpl(Value *Root, unsigned NumElts,
TargetTransformInfo *TTI) {		TargetTransformInfo *TTI) {
// Look past the root to find a source value. Arbitrarily follow the		// Look past the root to find a source value. Arbitrarily follow the
// path through operand 0 of any 'or'. Also, peek through optional		// path through operand 0 of any 'or'. Also, peek through optional
[24 lines hidden]	static bool isLoadCombineCandidateImpl(Value *Root, unsigned NumElts,

return true;		return true;
}		}

bool BoUpSLP::isLoadCombineReductionCandidate(unsigned RdxOpcode) const {		bool BoUpSLP::isLoadCombineReductionCandidate(unsigned RdxOpcode) const {
if (RdxOpcode != Instruction::Or)		if (RdxOpcode != Instruction::Or)
return false;		return false;

unsigned NumElts = VectorizableTree[0]->Scalars.size();		unsigned NumElts = Tree->VectorizableTree[0]->Scalars.size();
Value *FirstReduced = VectorizableTree[0]->Scalars[0];		Value *FirstReduced = Tree->VectorizableTree[0]->Scalars[0];
return isLoadCombineCandidateImpl(FirstReduced, NumElts, TTI);		return isLoadCombineCandidateImpl(FirstReduced, NumElts, TTI);
}		}

bool BoUpSLP::isLoadCombineCandidate() const {		bool BoUpSLP::isLoadCombineCandidate() const {
// Peek through a final sequence of stores and check if all operations are		// Peek through a final sequence of stores and check if all operations are
// likely to be load-combined.		// likely to be load-combined.
unsigned NumElts = VectorizableTree[0]->Scalars.size();		unsigned NumElts = Tree->VectorizableTree[0]->Scalars.size();
for (Value *Scalar : VectorizableTree[0]->Scalars) {		for (Value *Scalar : Tree->VectorizableTree[0]->Scalars) {
Value *X;		Value *X;
if (!match(Scalar, m_Store(m_Value(X), m_Value())) ||		if (!match(Scalar, m_Store(m_Value(X), m_Value())) ||
!isLoadCombineCandidateImpl(X, NumElts, TTI))		!isLoadCombineCandidateImpl(X, NumElts, TTI))
return false;		return false;
}		}
return true;		return true;
}		}

bool BoUpSLP::isTreeTinyAndNotFullyVectorizable() const {		bool BoUpSLP::isTreeTinyAndNotFullyVectorizable() const {
// We can vectorize the tree if its size is greater than or equal to the		// We can vectorize the tree if its size is greater than or equal to the
// minimum size specified by the MinTreeSize command line option.		// minimum size specified by the MinTreeSize command line option.
if (VectorizableTree.size() >= MinTreeSize)		if (Tree->VectorizableTree.size() >= MinTreeSize)
return false;		return false;

// If we have a tiny tree (a tree whose size is less than MinTreeSize), we		// If we have a tiny tree (a tree whose size is less than MinTreeSize), we
// can vectorize it if we can prove it fully vectorizable.		// can vectorize it if we can prove it fully vectorizable.
if (isFullyVectorizableTinyTree())		if (isFullyVectorizableTinyTree())
return false;		return false;

assert(VectorizableTree.empty()		assert(Tree->VectorizableTree.empty()
? ExternalUses.empty()		? Tree->ExternalUses.empty()
: true && "We shouldn't have any external users");		: true && "We shouldn't have any external users");

// Otherwise, we can't vectorize the tree. It is both tiny and not fully		// Otherwise, we can't vectorize the tree. It is both tiny and not fully
// vectorizable.		// vectorizable.
return true;		return true;
}		}

int BoUpSLP::getSpillCost() const {		int BoUpSLP::getSpillCost() {
// Walk from the bottom of the tree to the top, tracking which values are		// Walk from the bottom of the tree to the top, tracking which values are
// live. When we see a call instruction that is not part of our tree,		// live. When we see a call instruction that is not part of our tree,
// query TTI to see if there is a cost to keeping values live over it		// query TTI to see if there is a cost to keeping values live over it
// (for example, if spills and fills are required).		// (for example, if spills and fills are required).
unsigned BundleWidth = VectorizableTree.front()->Scalars.size();
int Cost = 0;		int Cost = 0;

SmallPtrSet<Instruction*, 4> LiveValues;		SmallPtrSet<Instruction*, 4> LiveValues;
Instruction *PrevInst = nullptr;		Instruction *PrevInst = nullptr;

for (const auto &TEPtr : VectorizableTree) {		for (const std::unique_ptr<TreeEntry> &TEPtr : Tree->VectorizableTree) {
Instruction *Inst = dyn_cast<Instruction>(TEPtr->Scalars[0]);		Instruction *Inst = dyn_cast<Instruction>(TEPtr->Scalars[0]);
if (!Inst)		if (!Inst)
continue;		continue;

if (!PrevInst) {		if (!PrevInst) {
PrevInst = Inst;		PrevInst = Inst;
continue;		continue;
}		}

// Update LiveValues.		// Update LiveValues.
LiveValues.erase(PrevInst);		LiveValues.erase(PrevInst);
for (auto &J : PrevInst->operands()) {		for (auto &J : PrevInst->operands()) {
if (isa<Instruction>(&J) && getTreeEntry(&J))		if (isa<Instruction>(&*J) &&
		Tree->ScalarToTreeEntry.find(&*J) != Tree->ScalarToTreeEntry.end())
LiveValues.insert(cast<Instruction>(&*J));		LiveValues.insert(cast<Instruction>(&*J));
}		}

LLVM_DEBUG({		LLVM_DEBUG({
dbgs() << "SLP: #LV: " << LiveValues.size();		dbgs() << "SLP: #LV: " << LiveValues.size();
for (auto *X : LiveValues)		for (auto *X : LiveValues)
dbgs() << " " << X->getName();		dbgs() << " " << X->getName();
dbgs() << ", Looking at ";		dbgs() << ", Looking at ";
[16 lines hidden]	while (InstIt != PrevInstIt) {
!isa<DbgInfoIntrinsic>(&*PrevInstIt)) &&		!isa<DbgInfoIntrinsic>(&*PrevInstIt)) &&
&*PrevInstIt != PrevInst)		&*PrevInstIt != PrevInst)
NumCalls++;		NumCalls++;

++PrevInstIt;		++PrevInstIt;
}		}

if (NumCalls) {		if (NumCalls) {
		Tree->NoCallInst = false;
SmallVector<Type*, 4> V;		SmallVector<Type*, 4> V;
for (auto *II : LiveValues)		for (auto *II : LiveValues)
V.push_back(FixedVectorType::get(II->getType(), BundleWidth));		V.push_back(FixedVectorType::get(II->getType(), Tree->BundleWidth));
Cost += NumCalls * TTI->getCostOfKeepingLiveOverCall(V);		Cost += NumCalls * TTI->getCostOfKeepingLiveOverCall(V);
}		}

PrevInst = Inst;		PrevInst = Inst;
}		}

return Cost;		return Cost;
}		}

int BoUpSLP::getTreeCost() {		int BoUpSLP::getExtractOperationCost(const ExternalUser &EU) const {
		ABataev: The code in this function is very similar to the code in `getTreeCost()`. Can it be reused somehow to avoid duplication?
int Cost = 0;		// Uses by ephemeral values are free (because the ephemeral value will be
LLVM_DEBUG(dbgs() << "SLP: Calculating cost for tree of size "		// removed prior to code generation, and so the extraction will be
<< VectorizableTree.size() << ".\n");		// removed as well).
		if (EphValues.count(EU.User))
		return 0;

		// If we plan to rewrite the tree in a smaller type, we will need to sign
		// extend the extracted value back to the original type. Here, we account
		// for the extract and the added cost of the sign extend if needed.
		auto *VecTy = FixedVectorType::get(EU.Scalar->getType(), Tree->BundleWidth);
		Value *ScalarRoot = Tree->VectorizableTree.front()->Scalars[0];

		ABataev: `VectorizableTree.front()` instead of `VectorizableTree[0]`, just like in the first statement.
		auto It = MinBWs.find(ScalarRoot);
		if (It != MinBWs.end()) {
		uint64_t Width = It->second.first;
		bool Signed = It->second.second;
		auto *MinTy = IntegerType::get(F->getContext(), Width);
		unsigned ExtOp = Signed ? Instruction::SExt : Instruction::ZExt;
		VecTy = FixedVectorType::get(MinTy, Tree->BundleWidth);
		return (TTI->getExtractWithExtendCost(ExtOp, EU.Scalar->getType(), VecTy,
		EU.Lane));
		}
		return TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, EU.Lane);
		}

		int BoUpSLP::getExtractCost() const {
		int ExtractCost = 0;
		SmallPtrSet<Value *, 16> ExtractCostCalculated;
		// Consider the possibility of extracting vectorized
		// values for canceled elements use.
		for (const std::unique_ptr<TreeEntry> &TEPtr : Tree->VectorizableTree) {
		TreeEntry *Entry = TEPtr.get();
		if (Entry->State != TreeEntry::ProposedToGather)
		continue;
		for (Value *V : Entry->Scalars) {
		// Consider the possibility of extracting vectorized
		// values for canceled elements use.
		auto It = Tree->InternalTreeUses.find(V);
		if (It != Tree->InternalTreeUses.end()) {
		const UserList &UL = It->second;
		for (const ExternalUser &IU : UL)
		ExtractCost += getExtractOperationCost(IU);
		}
		}
		}
		for (const ExternalUser &EU : Tree->ExternalUses) {
		// We only add extract cost once for the same scalar.
		if (!ExtractCostCalculated.insert(EU.Scalar).second)
		continue;

		int Cost = getExtractOperationCost(EU);
		ExtractCost += Cost;
		}
		return ExtractCost;
		}
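
The "add extract cost only once for the same scalar" rule in `getExtractCost()` can be illustrated outside LLVM. The following is a minimal sketch, not the patch's code: the `Use` struct and the `sumExtractCost` helper are hypothetical, `std::unordered_set` stands in for `SmallPtrSet`, and the per-use cost is supplied by the caller instead of being queried from TTI.

```cpp
#include <cassert>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an external use: the scalar (identified by its
// address) and the vector lane it would be extracted from.
struct Use {
  const void *Scalar;
  unsigned Lane;
};

// Sum the extract cost over all uses, charging each scalar only once.
// insert().second is false when the scalar has already been seen.
int sumExtractCost(const std::vector<Use> &Uses,
                   const std::function<int(const Use &)> &CostOf) {
  std::unordered_set<const void *> Seen;
  int Total = 0;
  for (const Use &U : Uses) {
    if (!Seen.insert(U.Scalar).second)
      continue; // Same scalar again: one extract serves all of its users.
    Total += CostOf(U);
  }
  return Total;
}
```

With three uses of which two refer to the same scalar, only two extracts are charged.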
		ABataev: `>=`

		int BoUpSLP::getInsertCost() {
		int InsertCost = 0;
		for (const std::unique_ptr<TreeEntry> &TEPtr : Tree->VectorizableTree) {
		ABataev: Does it include the cost of the whole subtree or just this particular `Entry`?
		dtemirbulatov: Entry->Cost is the cost of this particular entry plus some sub-tree elements, for example if this entry has a gathering element. Note that we only consider TreeEntry::Vectorize entries in this sum.
		dtemirbulatov: And we recalculate the costs of all entries canceled from vectorization (ProposedToGather) in getInsertCost() every time we call getTreeCost() at line 4214.
		TreeEntry *Entry = TEPtr.get();
		// Avoid already vectorized TreeEntries, it is already in a vector form and
		ABataev: Why do we need to exclude `UserCost` here?
		dtemirbulatov: We might want to add (or subtract) an extra value, for example at line 6071 before this change.
		// we don't need to gather those operations.
		if (Entry->State != TreeEntry::ProposedToGather)
		continue;
		bool NeedGather = false;
		for (Value *V : Entry->Scalars) {
		auto *Inst = cast<Instruction>(V);
		if (llvm::any_of(Inst->users(), [this](User *Op){ return Tree->ScalarToTreeEntry.count(Op) > 0; })) {
		NeedGather = true;
		}
		}
		if (NeedGather)
		ABataev: `llvm::any_of(Inst->users(), [Tree](User *Op){ return Tree->ScalarToTreeEntry.count(Op) > 0; })`
		InsertCost += getEntryCost(Entry);
		ABataev: Just: `for (Value *V : Entry->Scalars) { auto *Inst = cast<Instruction>(V); if (llvm::any_of(Inst->users(), [this](User *Op){ return Tree->ScalarToTreeEntry.count(Op) > 0; })) return InsertCost + getEntryCost(Entry); }` Also, check code formatting.
		dtemirbulatov: Hmm, I think this suggestion is not correct: there might be several tree entries with TreeEntry::ProposedToGather status, and we have to calculate the insert cost for the whole tree here.
		ABataev: Yeah, maybe. But you can do something similar, like `InsertCost += ...; break;` instead of setting a flag and checking it after the loop.
		}
		return InsertCost;
		}
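
The flag-versus-`break` exchange above can be checked on a simplified model. This is only a sketch under stated assumptions: the `Entry` struct, its precomputed `GatherCost`, and the per-scalar `ScalarHasTreeUser` flags are hypothetical stand-ins for `TreeEntry`, `getEntryCost()`, and the `ScalarToTreeEntry` lookup. It shows that charging an entry and breaking out of the inner loop gives the same sum as the flag-based form, while still visiting every `ProposedToGather` entry.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified model of a tree entry: a precomputed gather cost
// and, per scalar, whether some user of that scalar is still in the tree.
struct Entry {
  bool ProposedToGather;
  int GatherCost;
  std::vector<bool> ScalarHasTreeUser;
};

// Flag-based form, mirroring the structure of the loop in the patch.
int insertCostWithFlag(const std::vector<Entry> &Entries) {
  int InsertCost = 0;
  for (const Entry &E : Entries) {
    if (!E.ProposedToGather)
      continue;
    bool NeedGather = false;
    for (bool HasUser : E.ScalarHasTreeUser)
      if (HasUser)
        NeedGather = true;
    if (NeedGather)
      InsertCost += E.GatherCost;
  }
  return InsertCost;
}

// Early-exit form: charge the entry as soon as one scalar has a tree user,
// then continue with the next entry instead of returning from the function.
int insertCostWithBreak(const std::vector<Entry> &Entries) {
  int InsertCost = 0;
  for (const Entry &E : Entries) {
    if (!E.ProposedToGather)
      continue;
    for (bool HasUser : E.ScalarHasTreeUser) {
      if (HasUser) {
        InsertCost += E.GatherCost;
        break;
      }
    }
  }
  return InsertCost;
}
```

Returning from inside the loop, as in the first suggestion, would stop at the first matching entry; the `break` form keeps the per-tree accumulation the author asked for.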

unsigned BundleWidth = VectorizableTree[0]->Scalars.size();		bool BoUpSLP::findSubTree(int UserCost) {
		auto Cmp = [](const TreeEntry *LHS, const TreeEntry *RHS) {
		return LHS->Cost > RHS->Cost;
		};
		ABataev: `Cmp`
		std::set<TreeEntry *, decltype(Cmp)> Vec(Cmp);
		for (const std::unique_ptr<TreeEntry> &TEPtr : Tree->VectorizableTree) {
		TreeEntry *Entry = TEPtr.get();
		if (Entry->State != TreeEntry::Vectorize || Entry->Cost <= 0 || !Entry->Idx)
		continue;
		Vec.insert(Entry);
		}
		if (Vec.size() > MaxCostsRecalculations) {
		std::set<llvm::slpvectorizer::BoUpSLP::TreeEntry *>::iterator It =
		ABataev: Why use a SmallVector and then sort it? Better to use a set with a custom compare functor.
		dtemirbulatov: Thanks, good.
		Vec.begin();
		std::advance(It, (unsigned)MaxCostsRecalculations);
		Vec.erase(It, Vec.end());
		}

		ABataev: Just `Vec.erase(Vec.rbegin(), Vec.rbegin() + (Vec.size() - MaxCostsRecalculations))`?
		dtemirbulatov: No, we cannot use "Vec.rbegin() + " with std::set.
		ABataev: Then just `Vec.erase(Vec.begin() + MaxCostsRecalculations, Vec.end());`.
		dtemirbulatov: Eh, no, it is std::_Rb_tree_const_iterator<llvm::slpvectorizer::BoUpSLP::TreeEntry*>.
		ABataev: Why is this a const iterator?
		dtemirbulatov: std::set iterators are bidirectional, not random-access. We have to use std::advance.
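
The iterator point in this thread can be reproduced with plain standard-library types. A minimal sketch, assuming a hypothetical `Node` struct in place of `TreeEntry` and a named functor `ByCostDesc` in place of the `Cmp` lambda; `keepMostExpensive` is an illustrative helper, not a function from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>

// Hypothetical stand-in for TreeEntry; only the cost matters here.
struct Node {
  int Cost;
};

// Order nodes by descending cost, as the comparator in the patch does.
struct ByCostDesc {
  bool operator()(const Node *L, const Node *R) const {
    return L->Cost > R->Cost;
  }
};

using NodeSet = std::set<Node *, ByCostDesc>;

// Keep only the MaxKeep most expensive nodes. std::set iterators are
// bidirectional, so `S.begin() + MaxKeep` does not compile; std::next
// (like std::advance) steps the iterator forward one element at a time.
void keepMostExpensive(NodeSet &S, std::size_t MaxKeep) {
  if (S.size() <= MaxKeep)
    return;
  S.erase(std::next(S.begin(), static_cast<std::ptrdiff_t>(MaxKeep)),
          S.end());
}
```

One property of this sketch worth noting: because the comparator looks only at `Cost`, two nodes with equal cost compare as equivalent and the set keeps just one of them.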
		int Sum = 0;
		for (TreeEntry *Entry : Vec)
		Sum += Entry->Cost;
		// Avoid reducing the tree if there is no potential room to reduce.
		if ((Tree->TreeCost - UserCost - Sum) >= -SLPCostThreshold)
		wwei: I think the input for `getGatherCost` should be `Entry->Scalars` instead of `V`. The code in `getGatherCost` looks like: `int BoUpSLP::getGatherCost(ArrayRef<Value *> VL) const { // Find the type of the operands in VL. Type *ScalarTy = VL[0]->getType(); ... ... VectorType *VecTy = VectorType::get(ScalarTy, VL.size()); ... ... return getGatherCost(VecTy, ShuffledElements);` So, if the input is `V`, `VecTy` will always be `<1 x iN>`, which I think is the same as the `iN` type, and `getGatherCost(VecTy, ShuffledElements)` will return an incorrect InsertCost value.
		dtemirbulatov: Yes, correct. Thanks, I will fix that.
		return false;

		for (TreeEntry *T : Vec) {
		dtemirbulatov: Eh, I found a typo here: it should be Inst->Users, and the same for line 4101.
		T->State = TreeEntry::ProposedToGather;
		for (Value *V : T->Scalars) {
		ABataev: `>=`
		Tree->ScalarToTreeEntry.erase(V);
		Tree->MustGather.insert(V);
		Tree->ExternalUses.erase(
		llvm::remove_if(Tree->ExternalUses,
		[V](ExternalUser &EU) { return EU.Scalar == V; }),
		Tree->ExternalUses.end());
		ABataev: Not sure that this is the best criterion. I think you also need to include the distance from the head of the tree to the entry, because some big costs can be compensated by the vectorizable nodes in the tree. What I would do here is some kind of level-order search (BFS) starting from the deepest level.
		dtemirbulatov: Hmm, implemented, but I don't see any benefit from that, plus we have to do a BFS search. And we are going to throw away any non-vectorizable nodes at 4295.
		ABataev: It may trigger for targets like Silvermont, or in the future for vectorized functions.
		dtemirbulatov: I measured the BFS approach against this implementation. With BFS, it is ~10% less efficient on SPEC2006 INT and ~20% less efficient on the compilable part of SPEC2006 FP. By efficiency, I mean the total number of reduced trees over the whole compilation.
		ABataev: Could you post it anyway to check if it may be improved?
		dtemirbulatov: OK, I might have missed something. Thanks.
		}
		ABataev: `is_contained()` is `O(n)`. Maybe use a set instead of it in the loop?
		Tree->PartialCost = getTreeCost() - UserCost;
		Tree->RemovedOperations.push_back(T);
		if (Tree->PartialCost < -SLPCostThreshold) {
		cutTree();
		return true;
		}
		ABataev: I think you can also exclude entries with the number of operands <= 1.
		dtemirbulatov: But why? The only thing that matters here is the cost.
		ABataev: Because the main idea is to drop gathers, and dropping one gather in favor of another one will not be profitable for sure. But it may improve compile time and the list of candidates. The only case you need to check for is the latest masked gather case; it may be profitable to convert it to gathers for some targets.
		dtemirbulatov: I think I can check here if scatter/gather is supported via TargetInfo and, if it is not, move all nodes with TreeEntry::ScatterVectorize to TreeEntry::Gather.
		anton-afanasyev: I believe it's the wrong decision to check scatter/gather target support, for the reason mentioned here: https://reviews.llvm.org/D92701#2435573. Why can't we just rely on costs (the node cost and the total one)?
		dtemirbulatov: I agree with @anton-afanasyev here. I am not sure what @ABataev wants here. If I exclude (operands <= 1), we would lose half of all the SLP tests affected by throttling.
		ABataev: I did not say anything about checking if scatter is supported here. I just said that we can improve the criterion here by checking that the entry node has at least 2 operands (because if it has just one operand, most probably we can skip it), and we only need to check a node with 1 operand if it is a gather/scatter node, because it may be better to represent it as a simple gather.
		}
		ABataev: No need for `[&]` here, just `[]`
		return false;
		}
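
The overall shape of `findSubTree()` (an early feasibility check, then canceling the most expensive entries one at a time until the tree becomes profitable) can be sketched on a toy cost model. Everything below is an assumption made for illustration: the `Candidate` struct, the `Penalty` field that stands in for the recomputed insert and extract costs, and the `throttle` helper are hypothetical, and `UserCost` is left out. In the patch the cost after each cancellation comes from calling `getTreeCost()` again, not from a fixed penalty.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one vectorizable entry: Cost is what it adds to the
// tree when vectorized, Penalty is the extra insert/extract cost paid if it
// is gathered instead.
struct Candidate {
  int Cost;
  int Penalty;
};

// Greedily cancel candidates (assumed sorted by descending Cost) until the
// tree cost drops below -Threshold. Returns the number of candidates
// canceled, or -1 if no prefix of the list makes the tree profitable.
int throttle(int TreeCost, const std::vector<Candidate> &Sorted,
             int Threshold) {
  // Early exit: even removing every candidate for free would not help.
  int Sum = 0;
  for (const Candidate &C : Sorted)
    Sum += C.Cost;
  if (TreeCost - Sum >= -Threshold)
    return -1;

  int Cost = TreeCost;
  int Canceled = 0;
  for (const Candidate &C : Sorted) {
    Cost += C.Penalty - C.Cost; // Drop the entry, pay for gathering it.
    ++Canceled;
    if (Cost < -Threshold)
      return Canceled;
  }
  return -1;
}
```

For example, a tree of cost 3 with candidates of cost 6 and 4 (penalty 1 each) becomes profitable at threshold 0 after canceling only the first candidate, since 3 + 1 - 6 = -2.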

		int BoUpSLP::getRawTreeCost() {
		int CostSum = 0;
		Tree->BundleWidth = Tree->VectorizableTree.front()->Scalars.size();
		LLVM_DEBUG(dbgs() << "SLP: Calculating cost for tree of size "
		<< Tree->VectorizableTree.size() << ".\n");

for (unsigned I = 0, E = VectorizableTree.size(); I < E; ++I) {		for (std::unique_ptr<TreeEntry> &TEPtr : Tree->VectorizableTree) {
TreeEntry &TE = *VectorizableTree[I].get();		TreeEntry &TE = *TEPtr.get();

// We create duplicate tree entries for gather sequences that have multiple		// We create duplicate tree entries for gather sequences that have multiple
// uses. However, we should not compute the cost of duplicate sequences.		// uses. However, we should not compute the cost of duplicate sequences.
// For example, if we have a build vector (i.e., insertelement sequence)		// For example, if we have a build vector (i.e., insertelement sequence)
// that is used by more than one vector instruction, we only need to		// that is used by more than one vector instruction, we only need to
// compute the cost of the insertelement instructions once. The redundant		// compute the cost of the insertelement instructions once. The redundant
// instructions will be eliminated by CSE.		// instructions will be eliminated by CSE.
//		//
// We should consider not creating duplicate tree entries for gather		// We should consider not creating duplicate tree entries for gather
// sequences, and instead add additional edges to the tree representing		// sequences, and instead add additional edges to the tree representing
// their uses. Since such an approach results in fewer total entries,		// their uses. Since such an approach results in fewer total entries,
// existing heuristics based on tree size may yield different results.		// existing heuristics based on tree size may yield different results.
		ABataev: `[V]`
//		//
if (TE.State == TreeEntry::NeedToGather &&		if (TE.State != TreeEntry::Vectorize &&
std::any_of(std::next(VectorizableTree.begin(), I + 1),		llvm::any_of(llvm::drop_begin(Tree->VectorizableTree, TE.Idx + 1),
VectorizableTree.end(),
[TE](const std::unique_ptr<TreeEntry> &EntryPtr) {		[TE](const std::unique_ptr<TreeEntry> &EntryPtr) {
return EntryPtr->State == TreeEntry::NeedToGather &&		return EntryPtr->State != TreeEntry::Vectorize &&
EntryPtr->isSame(TE.Scalars);		EntryPtr->isSame(TE.Scalars);
}))		}))
		ABataev: Tabs
		dtemirbulatov: This is LLVM's standard format (clang-format).
continue;		continue;

int C = getEntryCost(&TE);		TE.Cost = getEntryCost(&TE);
LLVM_DEBUG(dbgs() << "SLP: Adding cost " << C		LLVM_DEBUG(dbgs() << "SLP: Adding cost " << TE.Cost
<< " for bundle that starts with " << *TE.Scalars[0]		<< " for bundle that starts with " << *TE.Scalars[0]
<< ".\n");		<< ".\n");
Cost += C;		CostSum += TE.Cost;
}		}

SmallPtrSet<Value *, 16> ExtractCostCalculated;		if (SLPThrottling)
int ExtractCost = 0;		for (std::unique_ptr<TreeEntry> &TEPtr : Tree->VectorizableTree) {
for (ExternalUser &EU : ExternalUses) {		TreeEntry *TE = TEPtr.get();
// We only add extract cost once for the same scalar.		if (TE->State != TreeEntry::Vectorize)
if (!ExtractCostCalculated.insert(EU.Scalar).second)
continue;

// Uses by ephemeral values are free (because the ephemeral value will be
// removed prior to code generation, and so the extraction will be
// removed as well).
if (EphValues.count(EU.User))
continue;		continue;
		ABataev: Tab
		dtemirbulatov: This is LLVM's standard format (clang-format).
		int GatherCost = 0;
		for (TreeEntry *Gather : TE->UseEntries)
		if (Gather->State != TreeEntry::Vectorize)
		GatherCost += Gather->Cost;
		TE->Cost += GatherCost;
		}
		return CostSum;
		}

// If we plan to rewrite the tree in a smaller type, we will need to sign		int BoUpSLP::getTreeCost() {
// extend the extracted value back to the original type. Here, we account		int CostSum;
// for the extract and the added cost of the sign extend if needed.		if (!Tree->IsCostSumReady) {
auto *VecTy = FixedVectorType::get(EU.Scalar->getType(), BundleWidth);		CostSum = getRawTreeCost();
auto *ScalarRoot = VectorizableTree[0]->Scalars[0];		Tree->RawTreeCost = CostSum;
if (MinBWs.count(ScalarRoot)) {
auto *MinTy = IntegerType::get(F->getContext(), MinBWs[ScalarRoot].first);
auto Extend =
MinBWs[ScalarRoot].second ? Instruction::SExt : Instruction::ZExt;
VecTy = FixedVectorType::get(MinTy, BundleWidth);
ExtractCost += TTI->getExtractWithExtendCost(Extend, EU.Scalar->getType(),
VecTy, EU.Lane);
} else {		} else {
ExtractCost +=		CostSum = Tree->RawTreeCost;
TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, EU.Lane);
}
}		}

int SpillCost = getSpillCost();		if (SLPThrottling)
Cost += SpillCost + ExtractCost;		for (std::unique_ptr<TreeEntry> &TEPtr : Tree->VectorizableTree) {
		TreeEntry *TE = TEPtr.get();
		if (TE->State == TreeEntry::ProposedToGather)
		CostSum -= TE->Cost;
		}

		int ExtractCost = getExtractCost();
		int SpillCost = 0;
		if (!Tree->NoCallInst || !Tree->IsCostSumReady)
		SpillCost = getSpillCost();
		assert((!Tree->NoCallInst || getSpillCost() == 0) && "Incorrect spill cost");
		if (!Tree->IsCostSumReady)
		Tree->IsCostSumReady = true;
		ABataev (Not Done): `SpillCost == 0`?
		dtemirbulatov (Author, Done): Hmm, I am wondering whether we might get SpillCost == 0 for the whole tree and then somehow get a non-zero value after reduction.
		int InsertCost = getInsertCost();
		ABataev (Done): Just `assert((!Tree->NoCallInst || getSpillCost() == 0) && "Incorrect spill cost");`
		int Cost = CostSum + ExtractCost + SpillCost + InsertCost;
		Tree->TreeCost = Cost;

std::string Str;		#ifndef NDEBUG
{		SmallString<256> Str;
raw_string_ostream OS(Str);		raw_svector_ostream OS(Str);
OS << "SLP: Spill Cost = " << SpillCost << ".\n"		OS << "SLP: Spill Cost = " << SpillCost << ".\n"
<< "SLP: Extract Cost = " << ExtractCost << ".\n"		<< "SLP: Extract Cost = " << ExtractCost << ".\n"
		<< "SLP: Insert Cost = " << InsertCost << ".\n"
<< "SLP: Total Cost = " << Cost << ".\n";		<< "SLP: Total Cost = " << Cost << ".\n";
}
LLVM_DEBUG(dbgs() << Str);		LLVM_DEBUG(dbgs() << Str);

if (ViewSLPTree)		if (ViewSLPTree)
ViewGraph(this, "SLP" + F->getName(), false, Str);		ViewGraph(this, "SLP" + F->getName(), false, Str);
		#endif
		ABataev (Done): Should all of this code be active only when debug mode is on?
		RKSimon (Not Done): Maybe pull this NDEBUG change out into its own patch?
		dtemirbulatov (Author, Done): Yes, it could go in as an NFC change.

return Cost;		return Cost;
}		}

int BoUpSLP::getGatherCost(VectorType *Ty,		int BoUpSLP::getGatherCost(VectorType *Ty,
const DenseSet<unsigned> &ShuffledIndices) const {		const DenseSet<unsigned> &ShuffledIndices) const {
unsigned NumElts = Ty->getNumElements();		unsigned NumElts = Ty->getNumElements();
APInt DemandedElts = APInt::getNullValue(NumElts);		APInt DemandedElts = APInt::getNullValue(NumElts);
(57 lines not shown)	void BoUpSLP::setInsertPointAfterBundle(TreeEntry *E) {

// The last instruction in the bundle in program order.		// The last instruction in the bundle in program order.
Instruction *LastInst = nullptr;		Instruction *LastInst = nullptr;

// Find the last instruction. The common case should be that BB has been		// Find the last instruction. The common case should be that BB has been
// scheduled, and the last instruction is VL.back(). So we start with		// scheduled, and the last instruction is VL.back(). So we start with
// VL.back() and iterate over schedule data until we reach the end of the		// VL.back() and iterate over schedule data until we reach the end of the
// bundle. The end of the bundle is marked by null ScheduleData.		// bundle. The end of the bundle is marked by null ScheduleData.
if (BlocksSchedules.count(BB)) {		if (Tree->BlocksSchedules.count(BB)) {
auto *Bundle =		auto *Bundle = Tree->BlocksSchedules[BB]->getScheduleData(
BlocksSchedules[BB]->getScheduleData(E->isOneOf(E->Scalars.back()));		E->isOneOf(E->Scalars.back()));
if (Bundle && Bundle->isPartOfBundle())		if (Bundle && Bundle->isPartOfBundle())
for (; Bundle; Bundle = Bundle->NextInBundle)		for (; Bundle; Bundle = Bundle->NextInBundle)
if (Bundle->OpValue == Bundle->Inst)		if (Bundle->OpValue == Bundle->Inst)
LastInst = Bundle->Inst;		LastInst = Bundle->Inst;
}		}

// LastInst can still be null at this point if there's either not an entry		// LastInst can still be null at this point if there's either not an entry
// for BB in BlocksSchedules or there's no ScheduleData available for		// for BB in BlocksSchedules or there's no ScheduleData available for
(51 lines not shown)	if (auto *Insrt = dyn_cast<InsertElementInst>(Vec)) {
}		}
}		}
assert(FoundLane >= 0 && "Could not find the correct lane");		assert(FoundLane >= 0 && "Could not find the correct lane");
if (!E->ReuseShuffleIndices.empty()) {		if (!E->ReuseShuffleIndices.empty()) {
FoundLane =		FoundLane =
std::distance(E->ReuseShuffleIndices.begin(),		std::distance(E->ReuseShuffleIndices.begin(),
llvm::find(E->ReuseShuffleIndices, FoundLane));		llvm::find(E->ReuseShuffleIndices, FoundLane));
}		}
ExternalUses.push_back(ExternalUser(VL[i], Insrt, FoundLane));		Tree->ExternalUses.emplace_back(VL[i], Insrt, FoundLane);
}		}
}		}
}		}

return Vec;		return Vec;
}		}

Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL) {		Value *BoUpSLP::vectorizeTree(ArrayRef<Value *> VL) {
(338 lines not shown)	case Instruction::Load: {
Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),		Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),
VecTy->getPointerTo(AS));		VecTy->getPointerTo(AS));

// The pointer operand uses an in-tree scalar so we add the new BitCast to		// The pointer operand uses an in-tree scalar so we add the new BitCast to
// ExternalUses list to make sure that an extract will be generated in the		// ExternalUses list to make sure that an extract will be generated in the
// future.		// future.
Value *PO = LI->getPointerOperand();		Value *PO = LI->getPointerOperand();
if (getTreeEntry(PO))		if (getTreeEntry(PO))
ExternalUses.push_back(ExternalUser(PO, cast<User>(VecPtr), 0));		Tree->ExternalUses.emplace_back(PO, cast<User>(VecPtr), 0);
		ABataev (Done): Use `emplace_back(PO, VecPtr, 0)`

LI = Builder.CreateAlignedLoad(VecTy, VecPtr, LI->getAlign());		LI = Builder.CreateAlignedLoad(VecTy, VecPtr, LI->getAlign());
Value *V = propagateMetadata(LI, E->Scalars);		Value *V = propagateMetadata(LI, E->Scalars);
if (IsReorder) {		if (IsReorder) {
SmallVector<int, 4> Mask;		SmallVector<int, 4> Mask;
inversePermutation(E->ReorderIndices, Mask);		inversePermutation(E->ReorderIndices, Mask);
V = Builder.CreateShuffleVector(V, UndefValue::get(V->getType()),		V = Builder.CreateShuffleVector(V, UndefValue::get(V->getType()),
Mask, "reorder_shuffle");		Mask, "reorder_shuffle");
(28 lines not shown)	case Instruction::Store: {
ScalarPtr, VecValue->getType()->getPointerTo(AS));		ScalarPtr, VecValue->getType()->getPointerTo(AS));
StoreInst *ST = Builder.CreateAlignedStore(VecValue, VecPtr,		StoreInst *ST = Builder.CreateAlignedStore(VecValue, VecPtr,
SI->getAlign());		SI->getAlign());

// The pointer operand uses an in-tree scalar, so add the new BitCast to		// The pointer operand uses an in-tree scalar, so add the new BitCast to
// ExternalUses to make sure that an extract will be generated in the		// ExternalUses to make sure that an extract will be generated in the
// future.		// future.
if (getTreeEntry(ScalarPtr))		if (getTreeEntry(ScalarPtr))
ExternalUses.push_back(ExternalUser(ScalarPtr, cast<User>(VecPtr), 0));		Tree->ExternalUses.emplace_back(ScalarPtr, cast<User>(VecPtr), 0);

		ABataev (Not Done): Use `emplace_back(ScalarPtr, VecPtr, 0);`
Value *V = propagateMetadata(ST, E->Scalars);		Value *V = propagateMetadata(ST, E->Scalars);
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),		V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
E->ReuseShuffleIndices, "shuffle");		E->ReuseShuffleIndices, "shuffle");
}		}
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
(85 lines not shown)	case Instruction::Call: {
SmallVector<OperandBundleDef, 1> OpBundles;		SmallVector<OperandBundleDef, 1> OpBundles;
CI->getOperandBundlesAsDefs(OpBundles);		CI->getOperandBundlesAsDefs(OpBundles);
Value *V = Builder.CreateCall(CF, OpVecs, OpBundles);		Value *V = Builder.CreateCall(CF, OpVecs, OpBundles);

// The scalar argument uses an in-tree scalar so we add the new vectorized		// The scalar argument uses an in-tree scalar so we add the new vectorized
// call to ExternalUses list to make sure that an extract will be		// call to ExternalUses list to make sure that an extract will be
// generated in the future.		// generated in the future.
if (ScalarArg && getTreeEntry(ScalarArg))		if (ScalarArg && getTreeEntry(ScalarArg))
ExternalUses.push_back(ExternalUser(ScalarArg, cast<User>(V), 0));		Tree->ExternalUses.emplace_back(ScalarArg, cast<User>(V), 0);
		ABataev (Done): Use `emplace_back(ScalarArg, V, 0);`

propagateIRFlags(V, E->Scalars, VL0);		propagateIRFlags(V, E->Scalars, VL0);
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),		V = Builder.CreateShuffleVector(V, UndefValue::get(VecTy),
E->ReuseShuffleIndices, "shuffle");		E->ReuseShuffleIndices, "shuffle");
}		}
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
(77 lines not shown)
Value *BoUpSLP::vectorizeTree() {		Value *BoUpSLP::vectorizeTree() {
ExtraValueToDebugLocsMap ExternallyUsedValues;		ExtraValueToDebugLocsMap ExternallyUsedValues;
return vectorizeTree(ExternallyUsedValues);		return vectorizeTree(ExternallyUsedValues);
}		}

Value *		Value *
BoUpSLP::vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues) {		BoUpSLP::vectorizeTree(ExtraValueToDebugLocsMap &ExternallyUsedValues) {
// All blocks must be scheduled before any instructions are inserted.		// All blocks must be scheduled before any instructions are inserted.
for (auto &BSIter : BlocksSchedules) {		for (auto &BSIter : Tree->BlocksSchedules) {
scheduleBlock(BSIter.second.get());		BlockScheduling *BS = BSIter.second.get();
		// Remove all Schedule Data from all nodes that we have changed
		// vectorization decision.
		if (!Tree->RemovedOperations.empty())
		removeFromScheduling(BS);
		scheduleBlock(BS);
}		}

Builder.SetInsertPoint(&F->getEntryBlock().front());		Builder.SetInsertPoint(&F->getEntryBlock().front());
auto *VectorRoot = vectorizeTree(VectorizableTree[0].get());		auto *VectorRoot = vectorizeTree(Tree->VectorizableTree.front().get());

		for (std::unique_ptr<TreeEntry> &TEPtr : Tree->VectorizableTree) {
		TreeEntry *Entry = TEPtr.get();
		if (Entry->State == TreeEntry::Vectorize && !Entry->VectorizedValue)
		vectorizeTree(Entry);
		}

// If the vectorized tree can be rewritten in a smaller type, we truncate the		// If the vectorized tree can be rewritten in a smaller type, we truncate the
// vectorized root. InstCombine will then rewrite the entire expression. We		// vectorized root. InstCombine will then rewrite the entire expression. We
// sign extend the extracted values below.		// sign extend the extracted values below.
auto *ScalarRoot = VectorizableTree[0]->Scalars[0];		auto *ScalarRoot = Tree->VectorizableTree[0]->Scalars[0];
if (MinBWs.count(ScalarRoot)) {		if (MinBWs.count(ScalarRoot)) {
if (auto *I = dyn_cast<Instruction>(VectorRoot))		if (auto *I = dyn_cast<Instruction>(VectorRoot))
Builder.SetInsertPoint(&*++BasicBlock::iterator(I));		Builder.SetInsertPoint(&*++BasicBlock::iterator(I));
auto BundleWidth = VectorizableTree[0]->Scalars.size();		Tree->BundleWidth = Tree->VectorizableTree[0]->Scalars.size();
auto *MinTy = IntegerType::get(F->getContext(), MinBWs[ScalarRoot].first);		auto *MinTy = IntegerType::get(F->getContext(), MinBWs[ScalarRoot].first);
auto *VecTy = FixedVectorType::get(MinTy, BundleWidth);		auto *VecTy = FixedVectorType::get(MinTy, Tree->BundleWidth);
auto *Trunc = Builder.CreateTrunc(VectorRoot, VecTy);		auto *Trunc = Builder.CreateTrunc(VectorRoot, VecTy);
VectorizableTree[0]->VectorizedValue = Trunc;		Tree->VectorizableTree[0]->VectorizedValue = Trunc;
}		}

LLVM_DEBUG(dbgs() << "SLP: Extracting " << ExternalUses.size()		LLVM_DEBUG(dbgs() << "SLP: Extracting " << Tree->ExternalUses.size()
<< " values .\n");		<< " values .\n");

// If necessary, sign-extend or zero-extend ScalarRoot to the larger type		// If necessary, sign-extend or zero-extend ScalarRoot to the larger type
// specified by ScalarType.		// specified by ScalarType.
auto extend = [&](Value *ScalarRoot, Value *Ex, Type *ScalarType) {		auto extend = [&](Value *ScalarRoot, Value *Ex, Type *ScalarType) {
if (!MinBWs.count(ScalarRoot))		if (!MinBWs.count(ScalarRoot))
return Ex;		return Ex;
if (MinBWs[ScalarRoot].second)		if (MinBWs[ScalarRoot].second)
return Builder.CreateSExt(Ex, ScalarType);		return Builder.CreateSExt(Ex, ScalarType);
return Builder.CreateZExt(Ex, ScalarType);		return Builder.CreateZExt(Ex, ScalarType);
};		};

// Extract all of the elements with the external uses.		// Extract all of the elements with the external uses.
for (const auto &ExternalUse : ExternalUses) {		for (const auto &ExternalUse : Tree->ExternalUses) {
Value *Scalar = ExternalUse.Scalar;		Value *Scalar = ExternalUse.Scalar;
llvm::User *User = ExternalUse.User;		llvm::User *User = ExternalUse.User;

// Skip users that we already RAUW. This happens when one instruction		// Skip users that we already RAUW. This happens when one instruction
// has multiple uses of the same value.		// has multiple uses of the same value.
if (User && !is_contained(Scalar->users(), User))		if (User && !is_contained(Scalar->users(), User))
continue;		continue;
TreeEntry *E = getTreeEntry(Scalar);		TreeEntry *E = getTreeEntry(Scalar);
(62 lines not shown)	if (auto *VecI = dyn_cast<Instruction>(Vec)) {
CSEBlocks.insert(&F->getEntryBlock());		CSEBlocks.insert(&F->getEntryBlock());
User->replaceUsesOfWith(Scalar, Ex);		User->replaceUsesOfWith(Scalar, Ex);
}		}

LLVM_DEBUG(dbgs() << "SLP: Replaced:" << *User << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Replaced:" << *User << ".\n");
}		}

// For each vectorized value:		// For each vectorized value:
for (auto &TEPtr : VectorizableTree) {		for (std::unique_ptr<TreeEntry> &TEPtr : Tree->VectorizableTree) {
TreeEntry *Entry = TEPtr.get();		TreeEntry *Entry = TEPtr.get();

// No need to handle users of gathered values.		// No need to handle users of gathered values.
if (Entry->State == TreeEntry::NeedToGather)		if (Entry->State == TreeEntry::NeedToGather)
continue;		continue;

assert(Entry->VectorizedValue && "Can't find vectorizable value");		assert(Entry->VectorizedValue && "Can't find vectorizable value");

// For each lane:		// For each lane:
for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {		for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
Value *Scalar = Entry->Scalars[Lane];		Value *Scalar = Entry->Scalars[Lane];

#ifndef NDEBUG		#ifndef NDEBUG
Type *Ty = Scalar->getType();		Type *Ty = Scalar->getType();
if (!Ty->isVoidTy()) {		// The tree might not be fully vectorized, so we don't have to
		// check every user.
		if (!Ty->isVoidTy() && Tree->RemovedOperations.empty()) {
for (User *U : Scalar->users()) {		for (User *U : Scalar->users()) {
LLVM_DEBUG(dbgs() << "SLP: \tvalidating user:" << *U << ".\n");		LLVM_DEBUG(dbgs() << "SLP: \tvalidating user:" << *U << ".\n");

// It is legal to delete users in the ignorelist.		// It is legal to delete users in the ignorelist.
assert((getTreeEntry(U) || is_contained(UserIgnoreList, U)) &&		assert((getTreeEntry(U) || is_contained(UserIgnoreList, U)) &&
"Deleting out-of-tree value");		"Deleting out-of-tree value");
}		}
}		}
#endif		#endif
LLVM_DEBUG(dbgs() << "SLP: \tErasing scalar:" << *Scalar << ".\n");		LLVM_DEBUG(dbgs() << "SLP: \tErasing scalar:" << *Scalar << ".\n");
eraseInstruction(cast<Instruction>(Scalar));		eraseInstruction(cast<Instruction>(Scalar));
}		}
}		}

Builder.ClearInsertionPoint();		Builder.ClearInsertionPoint();
InstrElementSize.clear();		InstrElementSize.clear();

return VectorizableTree[0]->VectorizedValue;		// Erase all saved trees after vectorization, except the current.
		llvm::erase_if(
		dtemirbulatov (Author, Done): Hmm, I would like to use remove_if here, but we have to capture a unique_ptr here.
		ABataev (Done): Use `llvm::erase_if`
		BuiltTrees,
		ABataev (Not Done): `[Tree]`
		dtemirbulatov (Author, Done): Tree is a class member, not a local variable.
		ABataev (Done): Then `[this]`
		[this](std::unique_ptr<TreeState> &T) { return T.get() != Tree; }),
		BuiltTrees.end();

		ABataev (Not Done): Why do you need to call `BuiltTrees.erase(` after `llvm::remove_if`?
		dtemirbulatov (Author, Done): It is a SmallVector<std::unique_ptr<TreeState>, 8>, so we have to call erase(...).
		return Tree->VectorizableTree[0]->VectorizedValue;
}		}
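
The inline thread above about `remove_if`, `erase(...)`, and `llvm::erase_if` comes down to the erase-remove idiom applied to a container of `std::unique_ptr`. The sketch below shows that idiom with the standard library only; `TreeState` here is a hypothetical stand-in holding just an id, and `keepOnlyCurrent` is an illustrative helper, not a function from the patch. `llvm::erase_if(Container, Pred)` wraps the same two steps in one call, so no trailing `erase` or `end()` is needed with it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the patch's TreeState.
struct TreeState {
  int Id;
};

// Drop every saved tree except Current and report how many remain.
// std::remove_if only moves the kept elements to the front; the erase
// call is what actually destroys the leftover unique_ptrs.
std::size_t keepOnlyCurrent(std::vector<std::unique_ptr<TreeState>> &Trees,
                            const TreeState *Current) {
  Trees.erase(std::remove_if(Trees.begin(), Trees.end(),
                             [Current](const std::unique_ptr<TreeState> &T) {
                               return T.get() != Current;
                             }),
              Trees.end());
  return Trees.size();
}
```

The predicate takes the `unique_ptr` by const reference, so nothing is copied; only the raw pointer is compared against the tree to keep.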

void BoUpSLP::optimizeGatherSequence() {		void BoUpSLP::optimizeGatherSequence() {
LLVM_DEBUG(dbgs() << "SLP: Optimizing " << GatherSeq.size()		LLVM_DEBUG(dbgs() << "SLP: Optimizing " << GatherSeq.size()
<< " gather sequences instructions.\n");		<< " gather sequences instructions.\n");
// LICM InsertElementInst sequences.		// LICM InsertElementInst sequences.
for (Instruction *I : GatherSeq) {		for (Instruction *I : GatherSeq) {
if (isDeleted(I))		if (isDeleted(I))
(451 lines not shown)	doForAllOpcodes(I, [&](ScheduleData *SD) {
"ScheduleData not in scheduling region");		"ScheduleData not in scheduling region");
SD->IsScheduled = false;		SD->IsScheduled = false;
SD->resetUnscheduledDeps();		SD->resetUnscheduledDeps();
});		});
}		}
ReadyInsts.clear();		ReadyInsts.clear();
}		}

		void BoUpSLP::removeFromScheduling(BlockScheduling *BS) {
		bool Removed = false;
		for (TreeEntry *Entry : Tree->RemovedOperations) {
		ScheduleData *SD = BS->getScheduleData(Entry->Scalars[0]);
		if (SD && SD->isPartOfBundle()) {
		if (!Removed) {
		Removed = true;
		BS->resetSchedule();
		}
		BS->cancelScheduling(Entry->Scalars, SD->OpValue);
		}
		}
		ABataev (Done): Looks like you need to implement something like `reduceSchedulingRegion()`, similar to the `extendSchedulingRegion` function. Otherwise you are going to operate with the larger scheduling region, i.e. you need to modify the `ScheduleStart` and `ScheduleEnd` data members.
		if (!Removed)
		return;
		BS->resetSchedule();
		BS->initialFillReadyList(BS->ReadyInsts);
		for (Instruction *I = BS->ScheduleStart; I != BS->ScheduleEnd;
		I = I->getNextNode()) {
		if (BS->ScheduleDataMap.find(I) == BS->ScheduleDataMap.end())
		continue;
		BS->doForAllOpcodes(I,
		[](ScheduleData *SD) { SD->clearDependencies(); });
		ABataev (Done): `[]`
		}
		}

void BoUpSLP::scheduleBlock(BlockScheduling *BS) {		void BoUpSLP::scheduleBlock(BlockScheduling *BS) {
if (!BS->ScheduleStart)		if (!BS->ScheduleStart)
return;		return;

LLVM_DEBUG(dbgs() << "SLP: schedule block " << BS->BB->getName() << "\n");		LLVM_DEBUG(dbgs() << "SLP: schedule block " << BS->BB->getName() << "\n");

BS->resetSchedule();		BS->resetSchedule();

// For the real scheduling we use a more sophisticated ready-list: it is		// For the real scheduling we use a more sophisticated ready-list: it is
// sorted by the original instruction location. This lets the final schedule		// sorted by the original instruction location. This lets the final schedule
// be as close as possible to the original instruction order.		// be as close as possible to the original instruction order.
		dtemirbulatov (Author, Done): Perhaps we could also check here for !BS->getScheduleData(I)->isPartOfBundle() and further shrink the region.
		dtemirbulatov (Author, Done): Ah, no, this instruction could belong to a real gather node.
struct ScheduleDataCompare {		struct ScheduleDataCompare {
bool operator()(ScheduleData *SD1, ScheduleData *SD2) const {		bool operator()(ScheduleData *SD1, ScheduleData *SD2) const {
return SD2->SchedulingPriority < SD1->SchedulingPriority;		return SD2->SchedulingPriority < SD1->SchedulingPriority;
}		}
};		};
std::set<ScheduleData *, ScheduleDataCompare> ReadyInsts;		std::set<ScheduleData *, ScheduleDataCompare> ReadyInsts;

// Ensure that all dependency data is updated and fill the ready-list with		// Ensure that all dependency data is updated and fill the ready-list with
(182 lines not shown)	static bool collectValuesToDemote(Value *V, SmallPtrSetImpl<Value *> &Expr,
// Record the value that we can demote.		// Record the value that we can demote.
ToDemote.push_back(V);		ToDemote.push_back(V);
return true;		return true;
}		}

void BoUpSLP::computeMinimumValueSizes() {		void BoUpSLP::computeMinimumValueSizes() {
// If there are no external uses, the expression tree must be rooted by a		// If there are no external uses, the expression tree must be rooted by a
// store. We can't demote in-memory values, so there is nothing to do here.		// store. We can't demote in-memory values, so there is nothing to do here.
if (ExternalUses.empty())		if (Tree->ExternalUses.empty())
return;		return;

// We only attempt to truncate integer expressions.		// We only attempt to truncate integer expressions.
auto &TreeRoot = VectorizableTree[0]->Scalars;		auto &TreeRoot = Tree->VectorizableTree[0]->Scalars;
auto *TreeRootIT = dyn_cast<IntegerType>(TreeRoot[0]->getType());		auto *TreeRootIT = dyn_cast<IntegerType>(TreeRoot[0]->getType());
if (!TreeRootIT)		if (!TreeRootIT)
return;		return;

// If the expression is not rooted by a store, these roots should have		// If the expression is not rooted by a store, these roots should have
// external uses. We will rely on InstCombine to rewrite the expression in		// external uses. We will rely on InstCombine to rewrite the expression in
// the narrower type. However, InstCombine only rewrites single-use values.		// the narrower type. However, InstCombine only rewrites single-use values.
// This means that if a tree entry other than a root is used externally, it		// This means that if a tree entry other than a root is used externally, it
// must have multiple uses and InstCombine will not rewrite it. The code		// must have multiple uses and InstCombine will not rewrite it. The code
// below ensures that only the roots are used externally.		// below ensures that only the roots are used externally.
SmallPtrSet<Value *, 32> Expr(TreeRoot.begin(), TreeRoot.end());		SmallPtrSet<Value *, 32> Expr(TreeRoot.begin(), TreeRoot.end());
for (auto &EU : ExternalUses)		for (auto &EU : Tree->ExternalUses)
if (!Expr.erase(EU.Scalar))		if (!Expr.erase(EU.Scalar))
return;		return;
if (!Expr.empty())		if (!Expr.empty())
return;		return;

// Collect the scalar values of the vectorizable expression. We will use this		// Collect the scalar values of the vectorizable expression. We will use this
// context to determine which values can be demoted. If we see a truncation,		// context to determine which values can be demoted. If we see a truncation,
// we mark it as seeding another demotion.		// we mark it as seeding another demotion.
for (auto &EntryPtr : VectorizableTree)		for (auto &EntryPtr : Tree->VectorizableTree)
Expr.insert(EntryPtr->Scalars.begin(), EntryPtr->Scalars.end());		Expr.insert(EntryPtr->Scalars.begin(), EntryPtr->Scalars.end());

// Ensure the roots of the vectorizable tree don't form a cycle. They must		// Ensure the roots of the vectorizable tree don't form a cycle. They must
// have a single external user that is not in the vectorizable tree.		// have a single external user that is not in the vectorizable tree.
for (auto *Root : TreeRoot)		for (auto *Root : TreeRoot)
if (!Root->hasOneUse() || Expr.count(*Root->user_begin()))		if (!Root->hasOneUse() || Expr.count(*Root->user_begin()))
return;		return;

(229 lines not shown)	for (auto BB : post_order(&F.getEntryBlock())) {
// Vectorize the index computations of getelementptr instructions. This		// Vectorize the index computations of getelementptr instructions. This
// is primarily intended to catch gather-like idioms ending at		// is primarily intended to catch gather-like idioms ending at
// non-consecutive loads.		// non-consecutive loads.
if (!GEPs.empty()) {		if (!GEPs.empty()) {
LLVM_DEBUG(dbgs() << "SLP: Found GEPs for " << GEPs.size()		LLVM_DEBUG(dbgs() << "SLP: Found GEPs for " << GEPs.size()
<< " underlying objects.\n");		<< " underlying objects.\n");
Changed |= vectorizeGEPIndices(BB, R);		Changed |= vectorizeGEPIndices(BB, R);
}		}

		// Partially vectorize trees after all full vectorization is done,
		// otherwise, we could prevent more profitable full vectorization with
		// smaller vector sizes.
		if (SLPThrottling)
		Changed |= R.tryPartialVectorization();
}		}

if (Changed) {		if (Changed) {
R.optimizeGatherSequence();		R.optimizeGatherSequence();
LLVM_DEBUG(dbgs() << "SLP: vectorized \"" << F.getName() << "\"\n");		LLVM_DEBUG(dbgs() << "SLP: vectorized \"" << F.getName() << "\"\n");
}		}
return Changed;		return Changed;
}		}
(40 lines not shown)	if (Cost < -SLPCostThreshold) {
R.getORE()->emit(OptimizationRemark(SV_NAME, "StoresVectorized",		R.getORE()->emit(OptimizationRemark(SV_NAME, "StoresVectorized",
cast<StoreInst>(Chain[0]))		cast<StoreInst>(Chain[0]))
<< "Stores SLP vectorized with cost " << NV("Cost", Cost)		<< "Stores SLP vectorized with cost " << NV("Cost", Cost)
<< " and with tree size "		<< " and with tree size "
<< NV("TreeSize", R.getTreeSize()));		<< NV("TreeSize", R.getTreeSize()));

R.vectorizeTree();		R.vectorizeTree();
return true;		return true;
}		}
		if (SLPThrottling && R.findSubTree())
		R.saveTree();

		ABataev (Done): Why not try to vectorize a partial tree right here?
		dtemirbulatov (Author, Done): Hmm, we might have an opportunity to vectorize the whole tree with smaller Chain sizes, or at vectorizeStores, or while doing reductions.
		ABataev (Not Done): Did you check that?
		dtemirbulatov (Author, Done): Yes, I can add a testcase for that.
		dtemirbulatov (Author, Done): For example, if we allow partial vectorization straight away, we could see partial vectorization in test/Transforms/SLPVectorizer/X86/PR39774.ll without [[TMP3:%.*]] = add <8 x i32> [[SHUFFLE]], <i32 0, i32 55, i32 285, i32 1240, i32 1496, i32 8555, i32 12529, i32 13685>. That is because later we would have the opportunity to vectorize the whole tree.
		ABataev (Not Done): Actually, `else` is not required here at all. Just make it a standalone `if` statement since there is an early exit in the previous `if`.
		dtemirbulatov (Author, Done): Hmm, no, I think it is required here; we don't want to reduce an already decided full-tree vectorization.
		ABataev (Done): No, you don't need it. Read https://llvm.org/docs/CodingStandards.html#don-t-use-else-after-a-return
return false;		return false;
		ABataev (Done): `else if`
		dtemirbulatov (Author, Done): Thanks.
}		}
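
The point made in the thread above (no `else` after an early `return`) can be illustrated with a reduced sketch of the control flow at the end of this function. Everything here is hypothetical: `Outcome` and `decide` do not exist in the patch, and the boolean `HasProfitableSubTree` stands in for the result of `R.findSubTree()`.

```cpp
#include <cassert>

// Hypothetical summary of what the end of the function does with a tree.
enum class Outcome { Vectorized, SavedForLater, Rejected };

// The first branch returns early, so the throttling check can be a
// standalone `if`: it is only reached when full vectorization was not
// profitable, and an already accepted full tree is never reduced.
Outcome decide(int Cost, int Threshold, bool Throttling,
               bool HasProfitableSubTree) {
  if (Cost < -Threshold)
    return Outcome::Vectorized;
  if (Throttling && HasProfitableSubTree)
    return Outcome::SavedForLater;
  return Outcome::Rejected;
}
```

In the patch the second branch saves the tree via `R.saveTree()` and the function still returns `false`, leaving the partial vectorization to the later `tryPartialVectorization()` step.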

bool SLPVectorizerPass::vectorizeStores(ArrayRef<StoreInst *> Stores,		bool SLPVectorizerPass::vectorizeStores(ArrayRef<StoreInst *> Stores,
BoUpSLP &R) {		BoUpSLP &R) {
// We may run into multiple chains that merge into a single chain. We mark the		// We may run into multiple chains that merge into a single chain. We mark the
// stores that we vectorized so that we don't visit the same store twice.		// stores that we vectorized so that we don't visit the same store twice.
BoUpSLP::ValueSet VectorizedStores;		BoUpSLP::ValueSet VectorizedStores;
bool Changed = false;		bool Changed = false;
(225 lines not shown)	for (unsigned I = NextInst; I < MaxInst; ++I) {
Value *ReorderedOps[] = {Ops[1], Ops[0]};		Value *ReorderedOps[] = {Ops[1], Ops[0]};
R.buildTree(ReorderedOps, None);		R.buildTree(ReorderedOps, None);
}		}
if (R.isTreeTinyAndNotFullyVectorizable())		if (R.isTreeTinyAndNotFullyVectorizable())
continue;		continue;

R.computeMinimumValueSizes();		R.computeMinimumValueSizes();
int Cost = R.getTreeCost();		int Cost = R.getTreeCost();
		unsigned UserCost = 0;
		ABataev (Not Done): Do you really need this new var here? I don't see where it is used except as an argument of the `R.findSubTree(UserCost)` call.
		dtemirbulatov (Author, Done): I think we need to compensate the ExtractCost with the cost of the insert sequence, as in the case of full vectorization.
		RKSimon (Not Done): This still looks wrong - isn't the UserCost only used locally in the CompensateUseCost path?
		dtemirbulatov (Author, Done): No, there is another instance of UserCost at 6476. We have to compare the cost to SLPCostThreshold inside findSubTree() and subtract UserCost.
		dtemirbulatov (Author, Done): I mean an instance of its usage.
CandidateFound = true;		CandidateFound = true;
if (CompensateUseCost) {		if (CompensateUseCost) {
// TODO: Use TTI's getScalarizationOverhead for sequence of inserts		// TODO: Use TTI's getScalarizationOverhead for sequence of inserts
// rather than sum of single inserts as the latter may overestimate		// rather than sum of single inserts as the latter may overestimate
// cost. This work should imply improving cost estimation for extracts		// cost. This work should imply improving cost estimation for extracts
// that added in for external (for vectorization tree) users,i.e. that		// that added in for external (for vectorization tree) users,i.e. that
// part should also switch to same interface.		// part should also switch to same interface.
// For example, the following case is projected code after SLP:		// For example, the following case is projected code after SLP:
[13 lines hidden]	for (unsigned I = NextInst; I < MaxInst; ++I) {
// SLP makes an assumption that such sequence will be optimized away		// SLP makes an assumption that such sequence will be optimized away
// later (instcombine) so it tries to compensate ExctractCost with		// later (instcombine) so it tries to compensate ExctractCost with
// cost of insert sequence.		// cost of insert sequence.
// Current per element cost calculation approach is not quite accurate		// Current per element cost calculation approach is not quite accurate
// and tends to create bias toward favoring vectorization.		// and tends to create bias toward favoring vectorization.
// Switching to the TTI interface might help a bit.		// Switching to the TTI interface might help a bit.
// Alternative solution could be pattern-match to detect a no-op or		// Alternative solution could be pattern-match to detect a no-op or
// shuffle.		// shuffle.
unsigned UserCost = 0;
for (unsigned Lane = 0; Lane < OpsWidth; Lane++) {		for (unsigned Lane = 0; Lane < OpsWidth; Lane++) {
auto *IE = cast<InsertElementInst>(InsertUses[I + Lane]);		auto *IE = cast<InsertElementInst>(InsertUses[I + Lane]);
if (auto *CI = dyn_cast<ConstantInt>(IE->getOperand(2)))		if (auto *CI = dyn_cast<ConstantInt>(IE->getOperand(2)))
UserCost += TTI->getVectorInstrCost(		UserCost += TTI->getVectorInstrCost(
Instruction::InsertElement, IE->getType(), CI->getZExtValue());		Instruction::InsertElement, IE->getType(), CI->getZExtValue());
}		}
LLVM_DEBUG(dbgs() << "SLP: Compensate cost of users by: " << UserCost		LLVM_DEBUG(dbgs() << "SLP: Compensate cost of users by: " << UserCost
<< ".\n");		<< ".\n");
[10 lines hidden]	for (unsigned I = NextInst; I < MaxInst; ++I) {
<< " and with tree size "		<< " and with tree size "
<< ore::NV("TreeSize", R.getTreeSize()));		<< ore::NV("TreeSize", R.getTreeSize()));

R.vectorizeTree();		R.vectorizeTree();
// Move to the next bundle.		// Move to the next bundle.
I += VF - 1;		I += VF - 1;
NextInst = I + 1;		NextInst = I + 1;
Changed = true;		Changed = true;
		} else if (SLPThrottling && R.findSubTree(UserCost)) {
		R.saveTree();
		ABataev [Done]: `else if`
		dtemirbulatov (Author) [Done]: Thanks.
}		}
		ABataev [Done]: Why `SLPThrottleBudget > 0`? What if `SLPThrottleBudget` equals 0?
}		}
		ABataev [Not Done]: Why not try to vectorize a partial tree right here?
		dtemirbulatov (Author) [Done]: We might have an opportunity to vectorize the whole tree with smaller Chain sizes at vectorizeStoreChain or while doing reductions.
		ABataev [Not Done]: Enclose the substatement in braces, since the substatement in the `then` branch is in braces.
		ABataev [Done]: Better to enclose this substatement in braces to make the code uniform.
		ABataev [Done]: Why can't we do something like this:
		```
		int NumAttempts = 0;
		do {
		  if (R.isTreeTinyAndNotFullyVectorizable())
		    break;
		  R.computeMinimumValueSizes();
		  InstructionCost Cost = R.getTreeCost();
		  InstructionCost UserCost = 0;
		  ....
		  if (Cost < -SLPCostThreshold) {
		    LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");
		    R.getORE()->emit(OptimizationRemark(SV_NAME, "VectorizedList",
		                                        cast<Instruction>(Ops[0]))
		                     << "SLP vectorized with cost " << ore::NV("Cost", Cost)
		                     << " and with tree size "
		                     << ore::NV("TreeSize", R.getTreeSize()));
		    R.vectorizeTree();
		    // Move to the next bundle.
		    I += VF - 1;
		    NextInst = I + 1;
		    Changed = true;
		    break;
		  }
		  ...
		  /// Do throttling here.
		  ++NumAttempts;
		} while (NumAttempts < SLPThrottleBudget);
		```
		dtemirbulatov (Author) [Done]: We are doing partial vectorization and we have to know UserCost to make the correct partial tree cut.
}		}

if (!Changed && CandidateFound) {		if (!Changed && CandidateFound) {
R.getORE()->emit([&]() {		R.getORE()->emit([&]() {
return OptimizationRemarkMissed(SV_NAME, "NotBeneficial", I0)		return OptimizationRemarkMissed(SV_NAME, "NotBeneficial", I0)
<< "List vectorization was possible but not beneficial with cost "		<< "List vectorization was possible but not beneficial with cost "
<< ore::NV("Cost", MinCost) << " >= "		<< ore::NV("Cost", MinCost) << " >= "
<< ore::NV("Treshold", -SLPCostThreshold);		<< ore::NV("Treshold", -SLPCostThreshold);
[765 lines hidden]	while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {
break;		break;

V.computeMinimumValueSizes();		V.computeMinimumValueSizes();

// Estimate cost.		// Estimate cost.
int TreeCost = V.getTreeCost();		int TreeCost = V.getTreeCost();
int ReductionCost = getReductionCost(TTI, ReducedVals[i], ReduxWidth);		int ReductionCost = getReductionCost(TTI, ReducedVals[i], ReduxWidth);
int Cost = TreeCost + ReductionCost;		int Cost = TreeCost + ReductionCost;
if (Cost >= -SLPCostThreshold) {		if (Cost >= -SLPCostThreshold &&
		(!SLPThrottling || !V.findSubTree(-ReductionCost))) {
V.getORE()->emit([&]() {		V.getORE()->emit([&]() {
		ABataev [Done]: Looks like you missed the comparison of `Cost` with `-SLPCostThreshold` here. You vectorized the tree after throttling unconditionally. Plus, the `Cost` is calculated here, but not used later except for the debug prints.
		dtemirbulatov (Author) [Done]: We don't need to compare here; this is done inside findSubTree().
return OptimizationRemarkMissed(		return OptimizationRemarkMissed(SV_NAME, "HorSLPNotBeneficial",
SV_NAME, "HorSLPNotBeneficial", cast<Instruction>(VL[0]))		cast<Instruction>(VL[0]))
		ABataev [Not Done]: Just `else`?
		dtemirbulatov (Author) [Done]: We might try partial vectorization without success here, and then we have to report the insufficient cost and break.
<< "Vectorizing horizontal reduction is possible"		<< "Vectorizing horizontal reduction is possible"
<< "but not beneficial with cost "		<< "but not beneficial with cost "
<< ore::NV("Cost", Cost) << " and threshold "		<< ore::NV("Cost", Cost) << " and threshold "
<< ore::NV("Threshold", -SLPCostThreshold);		<< ore::NV("Threshold", -SLPCostThreshold);
});		});
break;		break;
}		}

LLVM_DEBUG(dbgs() << "SLP: Vectorizing horizontal reduction at cost:"		LLVM_DEBUG(dbgs() << "SLP: Vectorizing horizontal reduction at cost:"
<< Cost << ". (HorRdx)\n");		<< Cost << ". (HorRdx)\n");
V.getORE()->emit([&]() {		V.getORE()->emit([&]() {
return OptimizationRemark(		return OptimizationRemark(
SV_NAME, "VectorizedHorizontalReduction", cast<Instruction>(VL[0]))		SV_NAME, "VectorizedHorizontalReduction", cast<Instruction>(VL[0]))
		RKSimon [Not Done]: This looks like a NFC clang-format change now - either pre-commit or discard from the patch?
<< "Vectorized horizontal reduction with cost "		<< "Vectorized horizontal reduction with cost "
<< ore::NV("Cost", Cost) << " and with tree size "		<< ore::NV("Cost", Cost) << " and with tree size "
<< ore::NV("TreeSize", V.getTreeSize());		<< ore::NV("TreeSize", V.getTreeSize());
});		});

// Vectorize a tree.		// Vectorize a tree.
DebugLoc Loc = cast<Instruction>(ReducedVals[i])->getDebugLoc();		DebugLoc Loc = cast<Instruction>(ReducedVals[i])->getDebugLoc();
Value *VectorizedRoot = V.vectorizeTree(ExternallyUsedValues);		Value *VectorizedRoot = V.vectorizeTree(ExternallyUsedValues);
[809 lines hidden]
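
The decision discussed in the inline comments above (vectorize the whole tree when its cost beats `-SLPCostThreshold`, otherwise try a reduced tree when throttling is enabled and a budget remains) can be modelled in isolation. The following is only a simplified sketch and is not the patch's `findSubTree()` implementation, which is not shown on this page. `Node`, `Decision`, `decide`, the `CutCost` field, and the greedy "keep the most expensive node scalar" policy are all hypothetical and introduced purely for illustration.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical model; none of these names come from SLPVectorizer.cpp.
enum class Decision { VectorizeFull, VectorizePartial, Skip };

struct Node {
  int VectorCost; // cost delta of vectorizing this node (negative = profitable)
  int CutCost;    // assumed extract/insert overhead if this node stays scalar
};

// Returns what a throttling-style driver might do for one candidate tree.
// UserCost is subtracted from the tree cost, mirroring the review discussion.
inline Decision decide(std::vector<Node> Tree, int UserCost, int Threshold,
                       int Budget) {
  int Cost = -UserCost;
  for (const Node &N : Tree)
    Cost += N.VectorCost;
  if (Cost < -Threshold)
    return Decision::VectorizeFull;

  // Assumed greedy policy: drop the most expensive node, pay its cut
  // overhead, and re-check, for at most Budget attempts. The last remaining
  // node is never dropped.
  for (int Attempt = 0; Attempt < Budget && Tree.size() > 1; ++Attempt) {
    auto Worst = std::max_element(
        Tree.begin(), Tree.end(), [](const Node &A, const Node &B) {
          return A.VectorCost < B.VectorCost;
        });
    Cost += Worst->CutCost - Worst->VectorCost;
    Tree.erase(Worst);
    if (Cost < -Threshold)
      return Decision::VectorizePartial;
  }
  return Decision::Skip;
}
```

With a budget of 0 the sketch never attempts a cut, which is one way to read the question about `SLPThrottleBudget` equal to 0 raised in the thread.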

llvm/test/Transforms/SLPVectorizer/AArch64/gather-root.ll

	[198 lines hidden]
	; GATHER-NEXT: [[TMP34:%.*]] = insertelement <8 x i32> [[TMP32]], i32 [[TMP33]], i32 7			; GATHER-NEXT: [[TMP34:%.*]] = insertelement <8 x i32> [[TMP32]], i32 [[TMP33]], i32 7
	; GATHER-NEXT: [[TMP35:%.*]] = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32> [[TMP34]])			; GATHER-NEXT: [[TMP35:%.*]] = call i32 @llvm.experimental.vector.reduce.add.v8i32(<8 x i32> [[TMP34]])
	; GATHER-NEXT: [[OP_EXTRA]] = add i32 [[TMP35]], -5			; GATHER-NEXT: [[OP_EXTRA]] = add i32 [[TMP35]], -5
	; GATHER-NEXT: br label [[FOR_BODY]]			; GATHER-NEXT: br label [[FOR_BODY]]
	;			;
	; MAX-COST-LABEL: @PR32038(			; MAX-COST-LABEL: @PR32038(
	; MAX-COST-NEXT: entry:			; MAX-COST-NEXT: entry:
	; MAX-COST-NEXT: [[TMP0:%.*]] = load <2 x i8>, <2 x i8>* bitcast (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1) to <2 x i8>*), align 1			; MAX-COST-NEXT: [[TMP0:%.*]] = load <2 x i8>, <2 x i8>* bitcast (i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1) to <2 x i8>*), align 1
	; MAX-COST-NEXT: [[TMP1:%.*]] = icmp eq <2 x i8> [[TMP0]], zeroinitializer
	; MAX-COST-NEXT: [[P4:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 3), align 1			; MAX-COST-NEXT: [[P4:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 3), align 1
	; MAX-COST-NEXT: [[P5:%.*]] = icmp eq i8 [[P4]], 0
	; MAX-COST-NEXT: [[P6:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 4), align 4			; MAX-COST-NEXT: [[P6:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 4), align 4
	; MAX-COST-NEXT: [[P7:%.*]] = icmp eq i8 [[P6]], 0			; MAX-COST-NEXT: [[TMP1:%.*]] = extractelement <2 x i8> [[TMP0]], i32 0
				; MAX-COST-NEXT: [[TMP2:%.*]] = insertelement <4 x i8> undef, i8 [[TMP1]], i32 0
				; MAX-COST-NEXT: [[TMP3:%.*]] = extractelement <2 x i8> [[TMP0]], i32 1
				; MAX-COST-NEXT: [[TMP4:%.*]] = insertelement <4 x i8> [[TMP2]], i8 [[TMP3]], i32 1
				; MAX-COST-NEXT: [[TMP5:%.*]] = insertelement <4 x i8> [[TMP4]], i8 [[P4]], i32 2
				; MAX-COST-NEXT: [[TMP6:%.*]] = insertelement <4 x i8> [[TMP5]], i8 [[P6]], i32 3
				; MAX-COST-NEXT: [[TMP7:%.*]] = icmp eq <4 x i8> [[TMP6]], zeroinitializer
	; MAX-COST-NEXT: [[P8:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 5), align 1			; MAX-COST-NEXT: [[P8:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 5), align 1
	; MAX-COST-NEXT: [[P9:%.*]] = icmp eq i8 [[P8]], 0			; MAX-COST-NEXT: [[P9:%.*]] = icmp eq i8 [[P8]], 0
	; MAX-COST-NEXT: [[P10:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 6), align 2			; MAX-COST-NEXT: [[P10:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 6), align 2
	; MAX-COST-NEXT: [[P11:%.*]] = icmp eq i8 [[P10]], 0			; MAX-COST-NEXT: [[P11:%.*]] = icmp eq i8 [[P10]], 0
	; MAX-COST-NEXT: [[P12:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 7), align 1			; MAX-COST-NEXT: [[P12:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 7), align 1
	; MAX-COST-NEXT: [[P13:%.*]] = icmp eq i8 [[P12]], 0			; MAX-COST-NEXT: [[P13:%.*]] = icmp eq i8 [[P12]], 0
	; MAX-COST-NEXT: [[P14:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 8), align 8			; MAX-COST-NEXT: [[P14:%.*]] = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 8), align 8
	; MAX-COST-NEXT: [[P15:%.*]] = icmp eq i8 [[P14]], 0			; MAX-COST-NEXT: [[P15:%.*]] = icmp eq i8 [[P14]], 0
	; MAX-COST-NEXT: br label [[FOR_BODY:%.*]]			; MAX-COST-NEXT: br label [[FOR_BODY:%.*]]
	; MAX-COST: for.body:			; MAX-COST: for.body:
	; MAX-COST-NEXT: [[P17:%.*]] = phi i32 [ [[P34:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]			; MAX-COST-NEXT: [[P17:%.*]] = phi i32 [ [[P34:%.*]], [[FOR_BODY]] ], [ 0, [[ENTRY:%.*]] ]
	; MAX-COST-NEXT: [[TMP2:%.*]] = extractelement <2 x i1> [[TMP1]], i32 0			; MAX-COST-NEXT: [[TMP8:%.*]] = extractelement <4 x i1> [[TMP7]], i32 3
	; MAX-COST-NEXT: [[TMP3:%.*]] = insertelement <4 x i1> undef, i1 [[TMP2]], i32 0			; MAX-COST-NEXT: [[TMP9:%.*]] = extractelement <4 x i1> [[TMP7]], i32 0
	; MAX-COST-NEXT: [[TMP4:%.*]] = extractelement <2 x i1> [[TMP1]], i32 1			; MAX-COST-NEXT: [[TMP10:%.*]] = insertelement <4 x i1> undef, i1 [[TMP9]], i32 0
	; MAX-COST-NEXT: [[TMP5:%.*]] = insertelement <4 x i1> [[TMP3]], i1 [[TMP4]], i32 1			; MAX-COST-NEXT: [[TMP11:%.*]] = extractelement <4 x i1> [[TMP7]], i32 1
	; MAX-COST-NEXT: [[TMP6:%.*]] = insertelement <4 x i1> [[TMP5]], i1 [[P5]], i32 2			; MAX-COST-NEXT: [[TMP12:%.*]] = insertelement <4 x i1> [[TMP10]], i1 [[TMP11]], i32 1
	; MAX-COST-NEXT: [[TMP7:%.*]] = insertelement <4 x i1> [[TMP6]], i1 [[P7]], i32 3			; MAX-COST-NEXT: [[TMP13:%.*]] = extractelement <4 x i1> [[TMP7]], i32 2
	; MAX-COST-NEXT: [[TMP8:%.*]] = select <4 x i1> [[TMP7]], <4 x i32> <i32 -720, i32 -720, i32 -720, i32 -720>, <4 x i32> <i32 -80, i32 -80, i32 -80, i32 -80>			; MAX-COST-NEXT: [[TMP14:%.*]] = insertelement <4 x i1> [[TMP12]], i1 [[TMP13]], i32 2
				; MAX-COST-NEXT: [[TMP15:%.*]] = insertelement <4 x i1> [[TMP14]], i1 [[TMP8]], i32 3
				; MAX-COST-NEXT: [[TMP16:%.*]] = select <4 x i1> [[TMP15]], <4 x i32> <i32 -720, i32 -720, i32 -720, i32 -720>, <4 x i32> <i32 -80, i32 -80, i32 -80, i32 -80>
	; MAX-COST-NEXT: [[P27:%.*]] = select i1 [[P9]], i32 -720, i32 -80			; MAX-COST-NEXT: [[P27:%.*]] = select i1 [[P9]], i32 -720, i32 -80
	; MAX-COST-NEXT: [[P29:%.*]] = select i1 [[P11]], i32 -720, i32 -80			; MAX-COST-NEXT: [[P29:%.*]] = select i1 [[P11]], i32 -720, i32 -80
	; MAX-COST-NEXT: [[TMP9:%.*]] = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> [[TMP8]])			; MAX-COST-NEXT: [[TMP17:%.*]] = call i32 @llvm.experimental.vector.reduce.add.v4i32(<4 x i32> [[TMP16]])
	; MAX-COST-NEXT: [[TMP10:%.*]] = add i32 [[TMP9]], [[P27]]			; MAX-COST-NEXT: [[TMP18:%.*]] = add i32 [[TMP17]], [[P27]]
	; MAX-COST-NEXT: [[TMP11:%.*]] = add i32 [[TMP10]], [[P29]]			; MAX-COST-NEXT: [[TMP19:%.*]] = add i32 [[TMP18]], [[P29]]
	; MAX-COST-NEXT: [[OP_EXTRA:%.*]] = add i32 [[TMP11]], -5			; MAX-COST-NEXT: [[OP_EXTRA:%.*]] = add i32 [[TMP19]], -5
	; MAX-COST-NEXT: [[P31:%.*]] = select i1 [[P13]], i32 -720, i32 -80			; MAX-COST-NEXT: [[P31:%.*]] = select i1 [[P13]], i32 -720, i32 -80
	; MAX-COST-NEXT: [[P32:%.*]] = add i32 [[OP_EXTRA]], [[P31]]			; MAX-COST-NEXT: [[P32:%.*]] = add i32 [[OP_EXTRA]], [[P31]]
	; MAX-COST-NEXT: [[P33:%.*]] = select i1 [[P15]], i32 -720, i32 -80			; MAX-COST-NEXT: [[P33:%.*]] = select i1 [[P15]], i32 -720, i32 -80
	; MAX-COST-NEXT: [[P34]] = add i32 [[P32]], [[P33]]			; MAX-COST-NEXT: [[P34]] = add i32 [[P32]], [[P33]]
	; MAX-COST-NEXT: br label [[FOR_BODY]]			; MAX-COST-NEXT: br label [[FOR_BODY]]
	;			;
	entry:			entry:
	%p0 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1), align 1			%p0 = load i8, i8* getelementptr inbounds ([80 x i8], [80 x i8]* @a, i64 0, i64 1), align 1
	[37 lines hidden]

llvm/test/Transforms/SLPVectorizer/X86/load-merge.ll

	[50 lines hidden]

	define <4 x float> @PR16739_byref(<4 x float>* nocapture readonly dereferenceable(16) %x) {			define <4 x float> @PR16739_byref(<4 x float>* nocapture readonly dereferenceable(16) %x) {
	; CHECK-LABEL: @PR16739_byref(			; CHECK-LABEL: @PR16739_byref(
	; CHECK-NEXT: [[GEP0:%.*]] = getelementptr inbounds <4 x float>, <4 x float>* [[X:%.*]], i64 0, i64 0			; CHECK-NEXT: [[GEP0:%.*]] = getelementptr inbounds <4 x float>, <4 x float>* [[X:%.*]], i64 0, i64 0
	; CHECK-NEXT: [[GEP1:%.*]] = getelementptr inbounds <4 x float>, <4 x float>* [[X]], i64 0, i64 1			; CHECK-NEXT: [[GEP1:%.*]] = getelementptr inbounds <4 x float>, <4 x float>* [[X]], i64 0, i64 1
	; CHECK-NEXT: [[GEP2:%.*]] = getelementptr inbounds <4 x float>, <4 x float>* [[X]], i64 0, i64 2			; CHECK-NEXT: [[GEP2:%.*]] = getelementptr inbounds <4 x float>, <4 x float>* [[X]], i64 0, i64 2
	; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[GEP0]] to <2 x float>*			; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[GEP0]] to <2 x float>*
	; CHECK-NEXT: [[TMP2:%.*]] = load <2 x float>, <2 x float>* [[TMP1]], align 4			; CHECK-NEXT: [[TMP2:%.*]] = load <2 x float>, <2 x float>* [[TMP1]], align 4
	; CHECK-NEXT: [[X2:%.*]] = load float, float* [[GEP2]], align 4			; CHECK-NEXT: [[X2:%.*]] = load float, float* [[GEP2]], align 4
				RKSimon [Not Done]: rebase - this was committed at rG90f721404ff8
	; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x float> [[TMP2]], i32 0			; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x float> [[TMP2]], i32 0
	; CHECK-NEXT: [[I0:%.*]] = insertelement <4 x float> undef, float [[TMP3]], i32 0			; CHECK-NEXT: [[I0:%.*]] = insertelement <4 x float> undef, float [[TMP3]], i32 0
	; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x float> [[TMP2]], i32 1			; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x float> [[TMP2]], i32 1
	; CHECK-NEXT: [[I1:%.*]] = insertelement <4 x float> [[I0]], float [[TMP4]], i32 1			; CHECK-NEXT: [[I1:%.*]] = insertelement <4 x float> [[I0]], float [[TMP4]], i32 1
	; CHECK-NEXT: [[I2:%.*]] = insertelement <4 x float> [[I1]], float [[X2]], i32 2			; CHECK-NEXT: [[I2:%.*]] = insertelement <4 x float> [[I1]], float [[X2]], i32 2
	; CHECK-NEXT: [[I3:%.*]] = insertelement <4 x float> [[I2]], float [[X2]], i32 3			; CHECK-NEXT: [[I3:%.*]] = insertelement <4 x float> [[I2]], float [[X2]], i32 3
	; CHECK-NEXT: ret <4 x float> [[I3]]			; CHECK-NEXT: ret <4 x float> [[I3]]
	;			;
	[43 lines hidden]
	; CHECK-NEXT: [[T2:%.*]] = getelementptr inbounds <4 x float>, <4 x float>* [[X]], i64 0, i64 2			; CHECK-NEXT: [[T2:%.*]] = getelementptr inbounds <4 x float>, <4 x float>* [[X]], i64 0, i64 2
	; CHECK-NEXT: [[T3:%.*]] = bitcast float* [[T2]] to i64*			; CHECK-NEXT: [[T3:%.*]] = bitcast float* [[T2]] to i64*
	; CHECK-NEXT: [[T4:%.*]] = load i64, i64* [[T3]], align 8			; CHECK-NEXT: [[T4:%.*]] = load i64, i64* [[T3]], align 8
	; CHECK-NEXT: [[T5:%.*]] = trunc i64 [[T1]] to i32			; CHECK-NEXT: [[T5:%.*]] = trunc i64 [[T1]] to i32
	; CHECK-NEXT: [[T6:%.*]] = bitcast i32 [[T5]] to float			; CHECK-NEXT: [[T6:%.*]] = bitcast i32 [[T5]] to float
	; CHECK-NEXT: [[T7:%.*]] = insertelement <4 x float> undef, float [[T6]], i32 0			; CHECK-NEXT: [[T7:%.*]] = insertelement <4 x float> undef, float [[T6]], i32 0
	; CHECK-NEXT: [[T8:%.*]] = lshr i64 [[T1]], 32			; CHECK-NEXT: [[T8:%.*]] = lshr i64 [[T1]], 32
	; CHECK-NEXT: [[T9:%.*]] = trunc i64 [[T8]] to i32			; CHECK-NEXT: [[T9:%.*]] = trunc i64 [[T8]] to i32
	; CHECK-NEXT: [[T10:%.*]] = bitcast i32 [[T9]] to float
	; CHECK-NEXT: [[T11:%.*]] = insertelement <4 x float> [[T7]], float [[T10]], i32 1
	; CHECK-NEXT: [[T12:%.*]] = trunc i64 [[T4]] to i32			; CHECK-NEXT: [[T12:%.*]] = trunc i64 [[T4]] to i32
	; CHECK-NEXT: [[T13:%.*]] = bitcast i32 [[T12]] to float			; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x i32> undef, i32 [[T9]], i32 0
	; CHECK-NEXT: [[T14:%.*]] = insertelement <4 x float> [[T11]], float [[T13]], i32 2			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x i32> [[TMP1]], i32 [[T12]], i32 1
	; CHECK-NEXT: [[T15:%.*]] = insertelement <4 x float> [[T14]], float [[T13]], i32 3			; CHECK-NEXT: [[TMP3:%.*]] = bitcast <2 x i32> [[TMP2]] to <2 x float>
				; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x float> [[TMP3]], i32 0
				; CHECK-NEXT: [[T11:%.*]] = insertelement <4 x float> [[T7]], float [[TMP4]], i32 1
				; CHECK-NEXT: [[TMP5:%.*]] = extractelement <2 x float> [[TMP3]], i32 1
				; CHECK-NEXT: [[T14:%.*]] = insertelement <4 x float> [[T11]], float [[TMP5]], i32 2
				; CHECK-NEXT: [[T15:%.*]] = insertelement <4 x float> [[T14]], float [[TMP5]], i32 3
	; CHECK-NEXT: ret <4 x float> [[T15]]			; CHECK-NEXT: ret <4 x float> [[T15]]
	;			;
	%t0 = bitcast <4 x float>* %x to i64*			%t0 = bitcast <4 x float>* %x to i64*
	%t1 = load i64, i64* %t0, align 16			%t1 = load i64, i64* %t0, align 16
	%t2 = getelementptr inbounds <4 x float>, <4 x float>* %x, i64 0, i64 2			%t2 = getelementptr inbounds <4 x float>, <4 x float>* %x, i64 0, i64 2
	%t3 = bitcast float* %t2 to i64*			%t3 = bitcast float* %t2 to i64*
	%t4 = load i64, i64* %t3, align 8			%t4 = load i64, i64* %t3, align 8
	%t5 = trunc i64 %t1 to i32			%t5 = trunc i64 %t1 to i32
	[76 lines hidden]

llvm/test/Transforms/SLPVectorizer/X86/powof2div.ll

[54 lines hidden]	entry:
%arrayidx17 = getelementptr inbounds i32, i32* %a, i64 3		%arrayidx17 = getelementptr inbounds i32, i32* %a, i64 3
store i32 %div16, i32* %arrayidx17, align 4		store i32 %div16, i32* %arrayidx17, align 4
ret void		ret void
}		}

define void @powof2div_nonuniform(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32* noalias nocapture readonly %c){		define void @powof2div_nonuniform(i32* noalias nocapture %a, i32* noalias nocapture readonly %b, i32* noalias nocapture readonly %c){
; AVX1-LABEL: @powof2div_nonuniform(		; AVX1-LABEL: @powof2div_nonuniform(
; AVX1-NEXT: entry:		; AVX1-NEXT: entry:
; AVX1-NEXT: [[TMP0:%.*]] = load i32, i32* [[B:%.*]], align 4		; AVX1-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
; AVX1-NEXT: [[TMP1:%.*]] = load i32, i32* [[C:%.*]], align 4		; AVX1-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i32, i32* [[C:%.*]], i64 1
; AVX1-NEXT: [[ADD:%.*]] = add nsw i32 [[TMP1]], [[TMP0]]
; AVX1-NEXT: [[DIV:%.*]] = sdiv i32 [[ADD]], 2
; AVX1-NEXT: store i32 [[DIV]], i32* [[A:%.*]], align 4
; AVX1-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 1
; AVX1-NEXT: [[TMP2:%.*]] = load i32, i32* [[ARRAYIDX3]], align 4
; AVX1-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 1
; AVX1-NEXT: [[TMP3:%.*]] = load i32, i32* [[ARRAYIDX4]], align 4
; AVX1-NEXT: [[ADD5:%.*]] = add nsw i32 [[TMP3]], [[TMP2]]
; AVX1-NEXT: [[DIV6:%.*]] = sdiv i32 [[ADD5]], 4
; AVX1-NEXT: [[ARRAYIDX7:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 1
; AVX1-NEXT: store i32 [[DIV6]], i32* [[ARRAYIDX7]], align 4
; AVX1-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2		; AVX1-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
; AVX1-NEXT: [[TMP4:%.*]] = load i32, i32* [[ARRAYIDX8]], align 4
; AVX1-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 2		; AVX1-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 2
; AVX1-NEXT: [[TMP5:%.*]] = load i32, i32* [[ARRAYIDX9]], align 4
; AVX1-NEXT: [[ADD10:%.*]] = add nsw i32 [[TMP5]], [[TMP4]]
; AVX1-NEXT: [[DIV11:%.*]] = sdiv i32 [[ADD10]], 8
; AVX1-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2
; AVX1-NEXT: store i32 [[DIV11]], i32* [[ARRAYIDX12]], align 4
; AVX1-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3		; AVX1-NEXT: [[ARRAYIDX13:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 3
; AVX1-NEXT: [[TMP6:%.*]] = load i32, i32* [[ARRAYIDX13]], align 4		; AVX1-NEXT: [[TMP0:%.*]] = bitcast i32* [[B]] to <4 x i32>*
		; AVX1-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4
; AVX1-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 3		; AVX1-NEXT: [[ARRAYIDX14:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 3
; AVX1-NEXT: [[TMP7:%.*]] = load i32, i32* [[ARRAYIDX14]], align 4		; AVX1-NEXT: [[TMP2:%.*]] = bitcast i32* [[C]] to <4 x i32>*
; AVX1-NEXT: [[ADD15:%.*]] = add nsw i32 [[TMP7]], [[TMP6]]		; AVX1-NEXT: [[TMP3:%.*]] = load <4 x i32>, <4 x i32>* [[TMP2]], align 4
; AVX1-NEXT: [[DIV16:%.*]] = sdiv i32 [[ADD15]], 16		; AVX1-NEXT: [[TMP4:%.*]] = add nsw <4 x i32> [[TMP3]], [[TMP1]]
		; AVX1-NEXT: [[TMP5:%.*]] = extractelement <4 x i32> [[TMP4]], i32 0
		; AVX1-NEXT: [[DIV:%.*]] = sdiv i32 [[TMP5]], 2
		; AVX1-NEXT: [[TMP6:%.*]] = extractelement <4 x i32> [[TMP4]], i32 1
		; AVX1-NEXT: [[DIV6:%.*]] = sdiv i32 [[TMP6]], 4
		; AVX1-NEXT: [[ARRAYIDX7:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 1
		; AVX1-NEXT: [[TMP7:%.*]] = extractelement <4 x i32> [[TMP4]], i32 2
		; AVX1-NEXT: [[DIV11:%.*]] = sdiv i32 [[TMP7]], 8
		; AVX1-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 2
		; AVX1-NEXT: [[TMP8:%.*]] = extractelement <4 x i32> [[TMP4]], i32 3
		; AVX1-NEXT: [[DIV16:%.*]] = sdiv i32 [[TMP8]], 16
; AVX1-NEXT: [[ARRAYIDX17:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3		; AVX1-NEXT: [[ARRAYIDX17:%.*]] = getelementptr inbounds i32, i32* [[A]], i64 3
; AVX1-NEXT: store i32 [[DIV16]], i32* [[ARRAYIDX17]], align 4		; AVX1-NEXT: [[TMP9:%.*]] = insertelement <4 x i32> undef, i32 [[DIV]], i32 0
		; AVX1-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP9]], i32 [[DIV6]], i32 1
		; AVX1-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 [[DIV11]], i32 2
		; AVX1-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[DIV16]], i32 3
		; AVX1-NEXT: [[TMP13:%.*]] = bitcast i32* [[A]] to <4 x i32>*
		; AVX1-NEXT: store <4 x i32> [[TMP12]], <4 x i32>* [[TMP13]], align 4
; AVX1-NEXT: ret void		; AVX1-NEXT: ret void
		ABataev [Not Done]: Still looks like it does not respect mintreesize.
		dtemirbulatov (Author) [Done]: Hmm, this is not the case here. The tree height is 5 here and the divide node cost is 20. After deleting this node, extracting from the "add" node costs 4 and inserting after the scalar divide costs 4, so the final tree cost is -4. llvm-mca for -mattr=+avx shows 1305 cycles before and 1609 cycles after.
;		;
; AVX2-LABEL: @powof2div_nonuniform(		; AVX2-LABEL: @powof2div_nonuniform(
; AVX2-NEXT: entry:		; AVX2-NEXT: entry:
; AVX2-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1		; AVX2-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[B:%.*]], i64 1
; AVX2-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i32, i32* [[C:%.*]], i64 1		; AVX2-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i32, i32* [[C:%.*]], i64 1
; AVX2-NEXT: [[ARRAYIDX7:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 1		; AVX2-NEXT: [[ARRAYIDX7:%.*]] = getelementptr inbounds i32, i32* [[A:%.*]], i64 1
; AVX2-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2		; AVX2-NEXT: [[ARRAYIDX8:%.*]] = getelementptr inbounds i32, i32* [[B]], i64 2
; AVX2-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 2		; AVX2-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds i32, i32* [[C]], i64 2
[47 lines hidden]
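
The cost arithmetic in the author's reply on `@powof2div_nonuniform` can be checked with a few lines. This is a minimal sketch that assumes the per-node costs combine additively. `partialTreeCost` is a hypothetical helper, and the cost of the remaining vectorized nodes (-12) is derived from the quoted figures (divide node 20, extract 4, insert 4, final cost -4), not stated in the thread.

```cpp
// Hypothetical helper: cost of a partially vectorized tree, assuming the
// costs of the kept nodes and the extract/insert overhead simply add up.
constexpr int partialTreeCost(int RemainingVectorCost, int ExtractCost,
                              int InsertCost) {
  return RemainingVectorCost + ExtractCost + InsertCost;
}

// Figures quoted in the review comment.
constexpr int DivNodeCost = 20;
constexpr int ExtractCost = 4;
constexpr int InsertCost = 4;
constexpr int FinalCost = -4;

// Derived, not quoted: what the kept nodes must cost under this model.
constexpr int Remaining = FinalCost - ExtractCost - InsertCost;

static_assert(Remaining == -12, "kept nodes cost -12 under the additive model");
static_assert(partialTreeCost(Remaining, ExtractCost, InsertCost) == FinalCost,
              "cutting the divide node yields the quoted cost of -4");
static_assert(Remaining + DivNodeCost == 8,
              "with the divide node kept, the model gives a cost of 8");
```

Under this model, and assuming a threshold of 0, the full tree (cost 8) would be rejected while the tree without the divide node (cost -4) would be accepted, which is consistent with the partially vectorized AVX1 output shown above.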

llvm/test/Transforms/SLPVectorizer/X86/slp-throttle.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 < %s | FileCheck %s			; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 < %s | FileCheck %s
				RKSimon [Done]: Is it worth adding a second RUN with -slp-throttle=false?

	define dso_local void @rftbsub(double* %a) local_unnamed_addr #0 {			define dso_local void @rftbsub(double* %a) local_unnamed_addr #0 {
	; CHECK-LABEL: @rftbsub(			; CHECK-LABEL: @rftbsub(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds double, double* [[A:%.*]], i64 2			; CHECK-NEXT: [[ARRAYIDX6:%.*]] = getelementptr inbounds double, double* [[A:%.*]], i64 2
	; CHECK-NEXT: [[TMP0:%.*]] = load double, double* [[ARRAYIDX6]], align 8			; CHECK-NEXT: [[TMP0:%.*]] = or i64 2, 1
	; CHECK-NEXT: [[TMP1:%.*]] = or i64 2, 1			; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds double, double* [[A]], i64 [[TMP0]]
	; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds double, double* [[A]], i64 [[TMP1]]			; CHECK-NEXT: [[TMP1:%.*]] = bitcast double* [[ARRAYIDX6]] to <2 x double>*
	; CHECK-NEXT: [[TMP2:%.*]] = load double, double* [[ARRAYIDX12]], align 8			; CHECK-NEXT: [[TMP2:%.*]] = load <2 x double>, <2 x double>* [[TMP1]], align 8
	; CHECK-NEXT: [[ADD16:%.*]] = fadd double [[TMP2]], undef			; CHECK-NEXT: [[TMP3:%.*]] = extractelement <2 x double> [[TMP2]], i32 1
				; CHECK-NEXT: [[ADD16:%.*]] = fadd double [[TMP3]], undef
	; CHECK-NEXT: [[MUL18:%.*]] = fmul double undef, [[ADD16]]			; CHECK-NEXT: [[MUL18:%.*]] = fmul double undef, [[ADD16]]
	; CHECK-NEXT: [[ADD19:%.*]] = fadd double undef, [[MUL18]]			; CHECK-NEXT: [[ADD19:%.*]] = fadd double undef, [[MUL18]]
	; CHECK-NEXT: [[SUB22:%.*]] = fsub double undef, undef			; CHECK-NEXT: [[SUB22:%.*]] = fsub double undef, undef
	; CHECK-NEXT: [[SUB25:%.*]] = fsub double [[TMP0]], [[ADD19]]			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> undef, double [[ADD19]], i32 0
	; CHECK-NEXT: store double [[SUB25]], double* [[ARRAYIDX6]], align 8			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> [[TMP4]], double [[SUB22]], i32 1
	; CHECK-NEXT: [[SUB29:%.*]] = fsub double [[TMP2]], [[SUB22]]			; CHECK-NEXT: [[TMP6:%.*]] = fsub <2 x double> [[TMP2]], [[TMP5]]
	; CHECK-NEXT: store double [[SUB29]], double* [[ARRAYIDX12]], align 8			; CHECK-NEXT: [[TMP7:%.*]] = bitcast double* [[ARRAYIDX6]] to <2 x double>*
				; CHECK-NEXT: store <2 x double> [[TMP6]], <2 x double>* [[TMP7]], align 8
	; CHECK-NEXT: unreachable			; CHECK-NEXT: unreachable
	;			;
	entry:			entry:
	%arrayidx6 = getelementptr inbounds double, double* %a, i64 2			%arrayidx6 = getelementptr inbounds double, double* %a, i64 2
	%0 = load double, double* %arrayidx6, align 8			%0 = load double, double* %arrayidx6, align 8
	%1 = or i64 2, 1			%1 = or i64 2, 1
	%arrayidx12 = getelementptr inbounds double, double* %a, i64 %1			%arrayidx12 = getelementptr inbounds double, double* %a, i64 %1
	%2 = load double, double* %arrayidx12, align 8			%2 = load double, double* %arrayidx12, align 8
	[10 lines hidden]

Revision Contents

Diff 284617
