This is an archive of the discontinued LLVM Phabricator instance.

Differential D105020

[SLP]Improve graph reordering.
ClosedPublic

Authored by ABataev on Jun 28 2021, 6:12 AM.

Details

Reviewers

RKSimon
spatel
vdmitrie
dtemirbulatov
anton-afanasyev
SjoerdMeijer
dmgreen

Commits

rGbc69dd62c04a: [SLP]Improve graph reordering.
rG84cbd71c9592: [SLP]Improve graph reordering.
rGa28234e37af8: [SLP]Improve graph reordering.
rGe408d1dfab42: [SLP]Improve graph reordering.

Summary

Reworked the reordering algorithm. Originally, the compiler just tried to
detect the most common order in the reorderable nodes (loads, stores,
extractelements, extractvalues) and then fully rebuilt the graph in
the best order. This was not efficient, since it required extra
memory and time for building/rebuilding the tree and doubled the use of
the scheduling budget, which could lead to missed vectorization due to
exhausted scheduling resources.

The patch provides a two-step approach to the graph reordering problem.
First of all, all reordering is done in place: it does not require
deleting/rebuilding the tree, it just rotates the scalars/orders/reuses
masks in the graph nodes.
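
To make "rotating the scalars in place" concrete, here is a minimal sketch. It is not the SLPVectorizer code: `reorderScalars` is a hypothetical helper, and the convention that lane `I` moves to position `Order[I]` is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not an LLVM API): permute the lanes of one node
// without rebuilding anything else. Assumed convention: the element in
// lane I moves to lane Order[I]; Order must be a permutation of 0..N-1.
template <typename T>
void reorderScalars(std::vector<T> &Scalars,
                    const std::vector<unsigned> &Order) {
  assert(Scalars.size() == Order.size() && "order must cover every lane");
  // Move the old contents aside, then write each lane to its new slot.
  std::vector<T> Prev(Scalars.size());
  Prev.swap(Scalars);
  for (std::size_t I = 0; I < Order.size(); ++I)
    Scalars[Order[I]] = std::move(Prev[I]);
}
```

With `Scalars = {10, 20, 30, 40}` and `Order = {2, 0, 3, 1}` the node ends up holding `{20, 40, 10, 30}`. Only the node's own storage changes, which is the property the summary relies on to avoid rebuilding the tree.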

The first step (top-to-bottom) rotates the whole graph, similarly to the
previous implementation. The compiler counts how often each order is used
by the graph nodes with the same vectorization factor and then rotates the
subgraph with the given vectorization factor to the most used order, if
it is not empty. It then repeats the same procedure for the subgraphs with
smaller vectorization factors. We can do this because we still need
to reshuffle the smaller subgraph when building operands for the graph
nodes with a larger vectorization factor, so we can rotate just the
subgraph, not the whole graph.
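
The counting part of the first step can be sketched as follows. This is an illustration under stated assumptions, not the patch's implementation: `Node`, `OrderTy` and `mostUsedOrder` are hypothetical names, an empty order is taken to mean "identity", and ties are broken arbitrarily by map order.

```cpp
#include <cassert>
#include <map>
#include <vector>

using OrderTy = std::vector<unsigned>;

// Hypothetical stand-in for a graph node: VF is its vectorization factor;
// an empty Order means the node is already in identity order (assumption).
struct Node {
  unsigned VF;
  OrderTy Order;
};

// Return the order used by the most nodes with the given VF. The result is
// empty if no node has that VF or if the identity order wins the vote.
OrderTy mostUsedOrder(const std::vector<Node> &Nodes, unsigned VF) {
  std::map<OrderTy, unsigned> Votes;
  for (const Node &N : Nodes) {
    assert((N.Order.empty() || N.Order.size() == N.VF) &&
           "a non-empty order must cover every lane");
    if (N.VF == VF)
      ++Votes[N.Order];
  }
  OrderTy Best;
  unsigned BestCount = 0;
  for (const auto &KV : Votes) {
    if (KV.second > BestCount) {
      Best = KV.first;
      BestCount = KV.second;
    }
  }
  return Best;
}
```

A caller would run this once per vectorization factor, from the largest down to the smallest, and rotate the matching subgraph only when the returned order is non-empty.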

The second step (bottom-to-top) scans through the leaves and tries to
detect the users of the leaves which can be reordered. If the leaves can
be reordered in the best fashion, they are reordered, and their users too.
This allows removing double shuffles to the same ordering of the operands
in many cases and just reordering the user operations instead. Plus, it
moves the final shuffles closer to the top of the graph and in many cases
allows removing an extra shuffle, because the same procedure is repeated
and we can again merge some reordering masks and reorder user nodes
instead of the operands.

Also, the patch improves the cost model for gathering of loads, which
improves the x264 benchmark in some cases.

Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}.x264,
+3% for 508.namd, and improves most of the other benchmarks.
The compile and link times are almost the same, though in some cases they
should be better (we're not doing an extra instruction scheduling
anymore), and we may vectorize more code for large basic blocks again
because of the saved scheduling budget.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ABataev created this revision.Jun 28 2021, 6:12 AM

Herald added subscribers: jfb, mgrang, hiraditya. Jun 28 2021, 6:12 AM

ABataev requested review of this revision.Jun 28 2021, 6:12 AM

Herald added a project: Restricted Project. Jun 28 2021, 6:12 AM

Harbormaster completed remote builds in B111270: Diff 354869.Jun 28 2021, 7:08 AM

RKSimon added inline comments.Jul 5 2021, 9:33 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3111	This is very similar to the EntryState enum - merge them?
llvm/test/Transforms/SLPVectorizer/X86/addsub.ll
340–341	Update comment? I had to do something similar on D103925, although the wording here might be different.

ABataev added inline comments.Jul 6 2021, 6:05 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3111	I would not do this. Though the values look similar, the meaning is completely different. This state handles only loads, while the entry state handles all possible entry kinds. In the future, we may have different values in these enums, which may lead to some unpredictable results. I would keep it as is.
llvm/test/Transforms/SLPVectorizer/X86/addsub.ll
340–341	Ok, will do

Rebase + address comments

xbolva00 added a reviewer: SjoerdMeijer.Jul 6 2021, 6:30 AM

Harbormaster completed remote builds in B112599: Diff 356698.Jul 6 2021, 7:00 AM

Rebase

Harbormaster completed remote builds in B113075: Diff 357337.Jul 8 2021, 2:52 PM

ABataev added a child revision: D101109: [SLP]Improve multinode analysis..Jul 9 2021, 7:42 AM

ABataev mentioned this in D101109: [SLP]Improve multinode analysis..

Rebase

Harbormaster completed remote builds in B113199: Diff 357510.Jul 9 2021, 8:35 AM

A few minor things I've noticed so far. It looks like there's some renaming / nfc(ish) refactoring in here - if that could be pre-committed to reduce the size of this patch it'd be very welcome!

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
648	Do we need a default value for vector?
2619	Do we need a default value for vector?
2623	Doesn't the IsIdentity pass have to be done after Order[] is updated?
2645	Do we need a default value for vector?

ABataev added inline comments.Jul 9 2021, 9:26 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
648	Not currently, but I'll update these functions to make them compatible with upcoming non-power-2 vectorization.
2619	No, it is swapped with `Order`, and `Order` is then updated. But I'll add default initialization because we may need it in the future for the non-pow-2 patch.
2623	Yes, you're right, will fix this. It may affect the cost in some rare cases.
2645	I'll add `UndefMaskElem` to reduce future updates for non-power-2.

Address comments

Harbormaster completed remote builds in B113254: Diff 357588.Jul 9 2021, 12:42 PM

A few minors, but I haven't spotted anything critical - does anyone else have any comments?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3197	Default value?
4397	Can we remove the loop and avoid the repeated call to TTI->getShuffleCost(TTI::SK_Select, VecTy) ?
4398	Is it worth doing the CommonCost -> ReuseShuffleCost refactor as a NFC pre-commit to simplify this patch?
6237	A lot of this is just a refactor cleanup that looks like a NFC that could be done as a pre-commit to simplify the patch?

ABataev marked an inline comment as done.Jul 14 2021, 8:17 AM

ABataev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3197	It is initialized with zeroes by default.
4398	Will check.
6237	Yes, will precommit some of these changes.

ABataev mentioned this in D106060: [SLP]Improve calculations of the cost for reused/reordered scalars..Jul 15 2021, 5:36 AM

ABataev mentioned this in rGda3dbfcacf9a: [SLP]Improve calculations of the cost for reused/reordered scalars..Jul 16 2021, 1:42 PM

Rebase

ABataev added inline comments.Jul 16 2021, 2:38 PM

llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
6–16 ↗	(On Diff #359457)	Need to adjust the cost for 4xi16 shuffle in AArch64 target.
40–50 ↗	(On Diff #359457)	Same here

Harbormaster completed remote builds in B114601: Diff 359457.Jul 16 2021, 3:18 PM

RKSimon added a reviewer: dmgreen.Jul 17 2021, 8:16 AM

RKSimon added a subscriber: dmgreen.

RKSimon added inline comments.

llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll

6–16 ↗

(On Diff #359457)

https://simd.godbolt.org/z/Pana3396f

@dmgreen Some very basic tests suggest the v4i16 shuffle cost should never be higher than 3 (which encouragingly matches what is already set for v4i32/v4f32) - do you agree?

// PermuteSingleSrc shuffle kinds.
// TODO: handle vXi8/vXi16.
{ TTI::SK_PermuteSingleSrc, MVT::v2i32, 1 }, // mov.
{ TTI::SK_PermuteSingleSrc, MVT::v4i32, 3 }, // perfectshuffle worst case.
{ TTI::SK_PermuteSingleSrc, MVT::v2i64, 1 }, // mov.
{ TTI::SK_PermuteSingleSrc, MVT::v2f32, 1 }, // mov.
{ TTI::SK_PermuteSingleSrc, MVT::v4f32, 3 }, // perfectshuffle worst case.
{ TTI::SK_PermuteSingleSrc, MVT::v2f64, 1 }, // mov.

dmgreen added inline comments.Jul 17 2021, 2:53 PM

llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
6–16 ↗	(On Diff #359457)	Yeah, I think that sounds right. GeneratePerfectShuffle applies to 4 element 64bit shuffles as well as 128bit shuffles, so the same 3 instruction worst case would apply. I don't have a lot of tests that check SLP vectorization. Let me try some things and put a patch together if it looks sensible.

dmgreen added inline comments.Jul 20 2021, 12:01 AM

llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
6–16 ↗	(On Diff #359457)	There are some patches in https://reviews.llvm.org/D106241 to add some extra worst case costs.

Matt added a subscriber: Matt.Jul 20 2021, 7:01 AM

Rebase

I think we're just waiting to confirm that D106241 fixes the regressions?

In D105020#2897064, @RKSimon wrote:

I think we're just waiting to confirm that D106241 fixes the regressions?

Yep.

Harbormaster completed remote builds in B115596: Diff 360848.Jul 22 2021, 10:19 AM

D106241 had some dependencies, but I've rebased it to get it out of the way.

Please can you rebase to confirm the aarch64 regressions have gone

In D105020#2900357, @RKSimon wrote:

Please can you rebase to confirm the aarch64 regressions have gone

Sure, will do it later.

Rebase

Harbormaster completed remote builds in B115954: Diff 361345.Jul 23 2021, 3:00 PM

LGTM

This revision is now accepted and ready to land.Jul 24 2021, 1:28 AM

This revision was landed with ongoing or failed builds.Jul 28 2021, 5:53 AM

Closed by commit rGe408d1dfab42: [SLP]Improve graph reordering. (authored by ABataev).

This revision was automatically updated to reflect the committed changes.

ABataev added a commit: rGe408d1dfab42: [SLP]Improve graph reordering..

It looks like this change might be responsible for a build failure on GreenDragon: https://green.lab.llvm.org/green/job/clang-stage1-RA/22811/console

In D105020#2910050, @fhahn wrote:

It looks like this change might be responsible for a build failure on GreenDragon: https://green.lab.llvm.org/green/job/clang-stage1-RA/22811/console

Yes, going to commit a small fix in a minute.

In D105020#2910051, @ABataev wrote:

Yes, going to commit a small fix in a minute.

It still crashes for me, https://martin.st/temp/vf_perspective-preproc.c, with clang -target aarch64-w32-mingw32 -w -c -O2 vf_perspective-preproc.c.

Hi @ABataev, I ran into an issue when running the LLVM test-suite. It seems to be a different issue than the one that @mstorsjo reported.

I got it reduced to:

target triple = "aarch64-unknown-linux-gnu"

define void @foo() local_unnamed_addr {
entry:
  %0 = load volatile double, double* poison, align 8
  %1 = load volatile double, double* poison, align 8
  %2 = load volatile double, double* poison, align 8
  %3 = load volatile double, double* poison, align 8
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %d30.0734 = phi double [ undef, %for.body ], [ %0, %entry ]
  %d01.0733 = phi double [ undef, %for.body ], [ %1, %entry ]
  %d11.0732 = phi double [ undef, %for.body ], [ %2, %entry ]
  %d21.0731 = phi double [ undef, %for.body ], [ %3, %entry ]
  br label %for.body
}

Run with: opt -slp-vectorizer -S < reduced.ll.

I had actually expected one of the aarch64-buildbots would have caught this, so not sure if there was something special about the way I ran the test-suite, but I don't believe so.

In D105020#2912688, @mstorsjo wrote:

It still crashes for me, https://martin.st/temp/vf_perspective-preproc.c, with clang -target aarch64-w32-mingw32 -w -c -O2 vf_perspective-preproc.c.

Will check and fix it ASAP, thanks for the report.

In D105020#2912776, @sdesmalen wrote:
Hi @ABataev, I ran into an issue when running the LLVM test-suite. It seems to be a different issue than the one that @mstorsjo reported.

Thanks, will check it too.

In D105020#2912776, @sdesmalen wrote:
Hi @ABataev, I ran into an issue when running the LLVM test-suite. It seems to be a different issue than the one that @mstorsjo reported.

Investigated it, the crash is not related to this patch, caused by one of the previous patches. Going to publish a fix soon.

ABataev mentioned this in D107080: [SLP]Fix an assertion for the size of user nodes..Jul 29 2021, 7:52 AM

bjope added a subscriber: bjope.Jul 29 2021, 3:40 PM

bjope added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4326	With my OOT target I ended up here with getMinVecRegSize() returning 32. E->getMainOp was a load like this `%3 = load i24, i24* getelementptr inbounds ([128 x i24], [128 x i24]* @a_ua, i16 0, i16 3)`, so Sz was 24. That gives a MinVF that is 0. And after some iterations the inner loop was entered with VF=0, which gives a slice that is empty, hitting assertions when doing Slice.front(). I haven't reduced this failure for any in-tree target (a bit short on people at the office here in the middle of summer). But maybe you should make sure MinVF doesn't go below 1 here (or maybe not even below 2) to avoid getting a VF that is less than 1 (or less than 2). Or check that VF is at least 1 (or 2) in the loop guard. (I do not know if VF=1 makes any sense. Hence the alternatives above regarding 1 or 2.)
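
The guard suggested in the comment above can be sketched like this. It is only an illustration of the "do not let MinVF drop below 2" option; it is not the fix from D107058, and `candidateVFs` as well as the halving loop shape are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical helper: list the slice widths to try, halving from MaxVF
// down to MinVF. MinVF is clamped to 2 so that a computed MinVF of 0 can
// never let the loop reach VF == 0 (which would produce an empty slice).
std::vector<unsigned> candidateVFs(unsigned MaxVF, unsigned MinVF) {
  MinVF = std::max(MinVF, 2u);
  std::vector<unsigned> VFs;
  for (unsigned VF = MaxVF; VF >= MinVF; VF /= 2) {
    assert(VF >= 2 && "slices narrower than 2 lanes are never tried");
    VFs.push_back(VF);
  }
  return VFs;
}
```

Without the clamp, a `MinVF` of 0 makes the unsigned condition `VF >= MinVF` always true, so the loop eventually runs with `VF == 0`, matching the empty-slice assertion described in the report.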

ABataev added inline comments.Jul 29 2021, 3:41 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4326	The fix is ready (see D107058), will commit it tomorrow.

bjope added inline comments.Jul 29 2021, 3:43 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4326	Ok, thanks!

ABataev mentioned this in rG4b25c113210e: [SLP]Fix an assertion for the size of user nodes..Jul 30 2021, 5:47 AM

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

In D105020#2919603, @hans wrote:

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Hi, thanks for the report, will try to investigate it ASAP.

In D105020#2919603, @hans wrote:

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

In D105020#2919627, @ABataev wrote:

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

I don't have a reproducer yet, and it will take some work to get it. I just wanted to give a heads up, especially in case others are bisecting issues and also end up at this change.

In D105020#2919848, @hans wrote:

I don't have a reproducer yet, and it will take some work to get it. I just wanted to give a heads up, especially in case others are bisecting issues and also end up at this change.

Still not sure what's going on (and it could also be our code that's broken), but we're starting to get some IR to look at now: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c15

In D105020#2922309, @hans wrote:

Still not sure what's going on (and it could also be our code that's broken), but we're starting to get some IR to look at now: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c15

Ok, thanks, I see the wrong mask for the loads, comparing these 2 examples. Looking at the problem already, this should help to fix it ASAP.

I bisected a miscompilation in XLA to this change.

Repro:

grab the three files in this gist: https://gist.github.com/hawkinsp/93de2fcb1a4d13ca01c826288bde4b9b
build clang at this revision

At this commit

clang driver.cc module_0000.primitive_computation_broadcast_in_dim.3.ir-no-opt.ll -o a.out
./a.out buffer-assignment.txt

(no optimization)

and

clang -O3 -march=haswell driver.cc module_0000.primitive_computation_broadcast_in_dim.3.ir-no-opt.ll -o a.out
./a.out buffer-assignment.txt

produce different outputs: the first 4 values differ.

At the previous revision, both produce identical outputs.

A quick inspection of the optimized IR seems to show it reading memory out of bounds. Disabling SLP vectorization also fixes the problem.

In D105020#2922419, @phawkins wrote:
I bisected a miscompilation in XLA to this change.

Thanks, I suspect what is the cause of the bug, hope to prepare a fix later today.

In D105020#2922309, @hans wrote:

Still not sure what's going on (and it could also be our code that's broken), but we're starting to get some IR to look at now: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c15

We now have analysis of the bad IR: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c16

Thanks, I suspect what is the cause of the bug, hope to prepare a fix later today.

Can you please revert to green in the meantime?

In D105020#2922483, @hans wrote:

We now have analysis of the bad IR: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c16

Thanks, I suspect what is the cause of the bug, hope to prepare a fix later today.

Can you please revert to green in the meantime?

It may take some time, there were other commits already. I'll revert it later if I won't be able to prepare a quick fix today.

Can you please revert to green in the meantime?

It may take some time, there were other commits already. I'll revert it later if I won't be able to prepare a quick fix today.

If there's already a chain of commits depending on this, that's an even stronger reason to revert. What if the quick fix doesn't fix everything? Then the commit chain has just become longer and even harder to revert.

I'd suggest reverting first, and then fixing without the stress.

In D105020#2922564, @hans wrote:

Can you please revert to green in the meantime?

It may take some time, there were other commits already. I'll revert it later if I won't be able to prepare a quick fix today.

If there's already a chain of commits depending on this, that's an even stronger reason to revert. What if the quick fix doesn't fix everything? Then the commit chain has just become longer and even harder to revert.

I'd suggest reverting first, and then fixing without the stress.

I would not call it a stress, the reason is known, but I’ll try to revert it.

if you can get alive-tv working from https://github.com/AliveToolkit/alive2

; ModuleID = '/tmp/b1.ll'                                                                                                                                                                          
source_filename = "/tmp/b1.ll"                                                                                                                                                                     
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"                                                                                                       
target triple = "x86_64-unknown-linux-gnu"                                                                                                                                                         
                                                                                                                                                                                                   
%"class.deqp::gls::ShaderEvalContext" = type { %"class.tcu::Vector", %"class.tcu::Vector", %"class.tcu::Vector", [4 x %"class.tcu::Vector"], [4 x %"struct.deqp::gls::ShaderEvalContext::ShaderSampler"], %"class.tcu::Vector", i8, %"class.deqp::gls::QuadGrid"* }
%"struct.deqp::gls::ShaderEvalContext::ShaderSampler" = type { %"class.tcu::Sampler", %"class.tcu::Texture2D"*, %"class.tcu::TextureCube"*, %"class.tcu::Texture2DArray"*, %"class.tcu::Texture3D"* }
%"class.tcu::Sampler" = type { i32, i32, i32, i32, i32, i32, float, i8, i32, i32, %"class.rr::GenericVec4", i8, i32 }                                                                              
%"class.rr::GenericVec4" = type { %union.anon }                                                                                                                                                    
%union.anon = type { [4 x i32] }                                                                                                                                                                   
%"class.tcu::Texture2D" = type { %"class.tcu::TextureLevelPyramid", i32, i32, %"class.tcu::Texture2DView" }                                                                                        
%"class.tcu::TextureLevelPyramid" = type { %"class.tcu::TextureFormat", %"class.std::__Cr::vector", %"class.std::__Cr::vector.1" }                                                                 
%"class.tcu::TextureFormat" = type { i32, i32 }                                                                                                                                                    
%"class.std::__Cr::vector" = type { %"class.std::__Cr::__vector_base" }                                                                                                                            
%"class.std::__Cr::__vector_base" = type { %"class.de::ArrayBuffer"*, %"class.de::ArrayBuffer"*, %"class.std::__Cr::__compressed_pair" }                                                           
%"class.de::ArrayBuffer" = type { i8*, i64 }                                                                                                                                                       
%"class.std::__Cr::__compressed_pair" = type { %"struct.std::__Cr::__compressed_pair_elem" }                                                                                                       
%"struct.std::__Cr::__compressed_pair_elem" = type { %"class.de::ArrayBuffer"* }                                                                                                                   
%"class.std::__Cr::vector.1" = type { %"class.std::__Cr::__vector_base.2" }                                                                                                                        
%"class.std::__Cr::__vector_base.2" = type { %"class.tcu::PixelBufferAccess"*, %"class.tcu::PixelBufferAccess"*, %"class.std::__Cr::__compressed_pair.4" }                                         
%"class.tcu::PixelBufferAccess" = type { %"class.tcu::ConstPixelBufferAccess" }                                                                                                                    
%"class.tcu::ConstPixelBufferAccess" = type { %"class.tcu::TextureFormat", %"class.tcu::Vector.3", %"class.tcu::Vector.3", %"class.tcu::Vector.3", i8* }                                           
%"class.tcu::Vector.3" = type { [3 x i32] }                                                                                                                                                        
%"class.std::__Cr::__compressed_pair.4" = type { %"struct.std::__Cr::__compressed_pair_elem.5" }                                                                                                   
%"struct.std::__Cr::__compressed_pair_elem.5" = type { %"class.tcu::PixelBufferAccess"* }                                                                                                          
%"class.tcu::Texture2DView" = type <{ i32, [4 x i8], %"class.tcu::ConstPixelBufferAccess"*, i8, [7 x i8] }>                                                                                        
%"class.tcu::TextureCube" = type { %"class.tcu::TextureFormat", i32, [6 x %"class.std::__Cr::vector"], [6 x %"class.std::__Cr::vector.1"], %"class.tcu::TextureCubeView" }                         
%"class.tcu::TextureCubeView" = type <{ i32, [4 x i8], [6 x %"class.tcu::ConstPixelBufferAccess"*], i8, [7 x i8] }>                                                                                
%"class.tcu::Texture2DArray" = type { %"class.tcu::TextureLevelPyramid", i32, i32, i32, %"class.tcu::Texture2DArrayView" }                                                                         
%"class.tcu::Texture2DArrayView" = type { i32, %"class.tcu::ConstPixelBufferAccess"* }
%"class.tcu::Texture3D" = type { %"class.tcu::TextureLevelPyramid", i32, i32, i32, %"class.tcu::Texture3DView" }
%"class.tcu::Texture3DView" = type { i32, %"class.tcu::ConstPixelBufferAccess"* }
%"class.tcu::Vector" = type { [4 x float] }
%"class.deqp::gls::QuadGrid" = type opaque

define hidden void @_ZN4deqp5gles210Functional19eval_selection_vec4ERNS_3gls17ShaderEvalContextE(%"class.deqp::gls::ShaderEvalContext"* nocapture align 8 dereferenceable(528) %0) {
  %2 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 0, i32 0, i64 2
  %3 = load float, float* %2, align 8
  %4 = fcmp ogt float %3, 0.000000e+00
  %5 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 3
  %6 = load float, float* %5, align 4
  %7 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 2
  %8 = load float, float* %7, align 8
  %9 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 1
  %10 = load float, float* %9, align 4
  %11 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 0
  %12 = load float, float* %11, align 8
  %13 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 0
  %14 = load float, float* %13, align 8
  %15 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 3
  %16 = load float, float* %15, align 4
  %17 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 2
  %18 = load float, float* %17, align 8
  %19 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 1
  %20 = load float, float* %19, align 4
  %21 = select i1 %4, float %6, float %14
  %22 = select i1 %4, float %8, float %16
  %23 = select i1 %4, float %10, float %18
  %24 = select i1 %4, float %12, float %20
  %25 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 0
  store float %21, float* %25, align 8
  %26 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 1
  store float %22, float* %26, align 4
  %27 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 2
  store float %23, float* %27, align 8
  %28 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 3
  store float %24, float* %28, align 4
  ret void
}

$ bin/opt -passes=slp-vectorizer -S /tmp/b1.ll -o /tmp/b2.ll
$ alive-tv /tmp/b1.ll /tmp/b2.ll

fails at this commit, and times out at the commit before
(sorry, couldn't get it working on https://alive2.llvm.org/ for some reason)

seems worth a revert

Yep, I'll revert. The fix is pretty simple, but I'll revert and then recommit the patch with all the fixes.

ABataev added a reverting change: rG7d9d926a1861: Revert "[SLP]Improve graph reordering.". Aug 3 2021, 12:14 PM

ABataev reopened this revision. Aug 24 2021, 9:23 AM

This revision is now accepted and ready to land. Aug 24 2021, 9:23 AM

ABataev planned changes to this revision. Aug 24 2021, 9:23 AM

Significantly reworked the patch and rebased.

This revision is now accepted and ready to land. Aug 24 2021, 9:24 AM

Harbormaster completed remote builds in B120983: Diff 368371. Aug 24 2021, 9:54 AM

LGTM with a few minors

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
540	Why didn't you move these up here instead of forward declaring?
1616–1638	This std::equal + pragma is repeated a lot in this method - worth pulling out?
3444	Indices?

ABataev added inline comments. Aug 25 2021, 10:22 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
540	Just to reduce the number of changes, will move the functions here.
1616–1638	Ok, will try to transform it into a lambda or something similar.
3444	Yep, will fix it, thanks!

Closed by commit rGa28234e37af8: [SLP]Improve graph reordering. (authored by ABataev). Aug 26 2021, 7:19 AM

This revision was automatically updated to reflect the committed changes.

ABataev added a commit: rGa28234e37af8: [SLP]Improve graph reordering..

I'm seeing another crash bisected to this reland:

./build/rel/bin/opt -passes='default<Os>' -disable-output /tmp/a.ll

reduced.ll.txt (10 KB)

Instruction does not dominate all uses!

Ok, thanks for the reproducer. I'm going to revert the patch and investigate the crash.

ABataev added a reverting change: rGb00f73d8bf3e: Revert "[SLP]Improve graph reordering.". Aug 26 2021, 9:20 AM

ABataev added a commit: rG84cbd71c9592: [SLP]Improve graph reordering. Aug 26 2021, 12:49 PM

Some of our tests started failing after rG84cbd71c95923f9912512f3051c6ab548a99e016 (and previously after rGa28234e37af877b2b4a23c2091c27fa18c155f9a). Could you revert it again while we're working on a test case?

Could you revert it yourself? I'm on vacation and will be back in 2 weeks.
I would appreciate it if you could send the reproducer. The patch is very complex and always triggers many different corner cases.

goncharov added a reverting change: rG5097b6e35291: Revert "[SLP]Improve graph reordering.". Aug 30 2021, 10:18 AM

goncharov added a subscriber: goncharov. Aug 31 2021, 5:12 AM

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

Thanks, this will help a lot!

Sorry, the reproducer is not correct. It has uninitialized variables, writes past array bounds, etc. I'm unable to use it as a reproducer for the investigation.

In D105020#2999752, @ABataev wrote:

Sorry, the reproducer is not correct. It has uninitialized variables, writes past array bounds, etc. I'm unable to use it as a reproducer for the investigation.

So if they are not able to produce a UB-free reproducer, you should recommit the patch.

In D105020#2999765, @xbolva00 wrote:

So if they are not able to produce a UB-free reproducer, you should recommit the patch.

I believe they have a reproducer and tried to reduce it, but the tool just removed too much code here :). I'll try to check some parts of the code manually and wait for the actual reproducer; I'll recommit the patch in a couple of days if I can't find a bug and no correct reproducer is provided.

In D105020#2999773, @ABataev wrote:
I believe they have a reproducer and tried to reduce it, but the tool just removed too much code here :). I'll try to check some parts of the code manually and wait for the actual reproducer; I'll recommit the patch in a couple of days if I can't find a bug and no correct reproducer is provided.

Exactly, we reduced the code quite aggressively. We'll try to do this more carefully this time. Please hold on while we're on it. Last time it took multiple hours until we got something we could feed to creduce.

In D105020#3001887, @alexfh wrote:
Exactly, we reduced the code quite aggressively. We'll try to do this more carefully this time. Please hold on while we're on it. Last time it took multiple hours until we got something we could feed to creduce.

That's what I thought. I would appreciate it if you could send a better reproducer, though even in its current form it is usable and may help to find a bug. I believe it reveals a bug in the vectorization of instructions with main/alternate opcodes; I will try to fix/improve it.

ABataev reopened this revision. Sep 17 2021, 2:33 PM

This revision is now accepted and ready to land. Sep 17 2021, 2:33 PM

Fixed reordering of the nodes with alternate opcodes.
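
For readers unfamiliar with alternate-opcode nodes, here is a small self-contained model of why reordering them is delicate. It is only a sketch under assumptions: the thread does not say what the actual defect was, and none of the names below (`permute`, `evalAltNode`, `reorderAll`, `reorderOperandsOnly`) come from SLPVectorizer.cpp. The model treats a node as a per-lane choice between a main opcode (`add`) and an alternate opcode (`sub`), and shows that if the operand lanes are permuted, the per-lane opcode selection has to be permuted with them.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Illustrative model only; these names are hypothetical, not LLVM's.
constexpr std::size_t VF = 4;
using Lanes = std::array<int, VF>;
using AltSel = std::array<bool, VF>;        // true: lane uses the alternate opcode
using Order = std::array<std::size_t, VF>;  // result[i] = input[order[i]]

// Apply a lane order to any per-lane array.
template <typename T>
std::array<T, VF> permute(const std::array<T, VF> &v, const Order &order) {
  std::array<T, VF> out{};
  for (std::size_t i = 0; i < VF; ++i) {
    assert(order[i] < VF && "order must index existing lanes");
    out[i] = v[order[i]];
  }
  return out;
}

// Node with main opcode `add` and alternate opcode `sub`, blended per lane.
Lanes evalAltNode(const Lanes &a, const Lanes &b, const AltSel &useAlt) {
  Lanes out{};
  for (std::size_t i = 0; i < VF; ++i)
    out[i] = useAlt[i] ? a[i] - b[i] : a[i] + b[i];
  return out;
}

// Consistent reordering: operands and opcode selection move together.
Lanes reorderAll(const Lanes &a, const Lanes &b, const AltSel &sel,
                 const Order &order) {
  return evalAltNode(permute(a, order), permute(b, order), permute(sel, order));
}

// Inconsistent reordering: operands move, the selection keeps its old order.
Lanes reorderOperandsOnly(const Lanes &a, const Lanes &b, const AltSel &sel,
                          const Order &order) {
  return evalAltNode(permute(a, order), permute(b, order), sel);
}
```

With operands `{10, 20, 30, 40}` and `{1, 2, 3, 4}`, selection `{add, sub, add, sub}` and order `{1, 0, 3, 2}`, `reorderAll` yields the permuted scalar result `{18, 11, 36, 33}`, while `reorderOperandsOnly` yields `{22, 9, 44, 27}`: every lane silently uses the wrong opcode.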

Harbormaster completed remote builds in B124484: Diff 373339. Sep 17 2021, 3:07 PM

Hi @ABataev ,

This time I've run creduce with asan and msan, so it should not read out of bounds.

$ cat repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | (c[4] & 3);
  uint16_t m = c[1] << 2 | 2;
  uint16_t n = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(m - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(n - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t l, uint16_t e,
         uint16_t f, float *output) {
  float *q = &output[4];
  uint16_t g = f;
  size_t r = k;
  for (size_t s = 0; s < height; s++) {
    uint8_t *o = buffer + l;
    uint8_t *t = o;
    float *p = &q[r];
    b(t, p, e, g);
  }
}
} // namespace a
int main() {
  uint8_t u[]{5, 5, 0, 0, 0};
  float output[4 * 4];
  a::bar(u, 4, 4, 0, 0, 3, output);
  int *out = reinterpret_cast<int *>(output);
  int64_t sum;
  for (size_t i = 0; i < sizeof sizeof(int); i++)
    sum = out[i];
  printf("%ld\n", sum);
}
$ # on revision before: 8441a8eea8007b9eaaaabf76055949180a702d6d
$ clang++ -Wall -Werror -Wextra -O2 -fno-exceptions  -stdlib=libc++ -std=gnu++17 repro.cc && ./a.out
1089120939
$ # revision 84cbd71c95923f9912512f3051c6ab548a99e016 
$ clang++ -Wall -Werror -Wextra -O2 -fno-exceptions  -stdlib=libc++ -std=gnu++17 repro.cc && ./a.out
1059760811
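
As a cross-check of which of the two outputs is the expected one, the printed value can be recomputed by hand. This is a sketch, assuming a 64-bit `size_t` (so `sizeof sizeof(int)` is 8 and the loop in `main` ends with `sum = out[7]`) and IEEE-754 single precision. `output[7]` is `*(d - 1)`, i.e. the lane `float(m - e) / g` with `m = c[1] << 2 | 2`. The helper names `laneValue` and `bitsOf` are made up for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Scalar model of one lane of a::b() from the reproducer:
// ((byte << 2) | lowBits) - e, converted to float and divided by g.
float laneValue(std::uint8_t byte, unsigned lowBits, std::uint16_t e,
                std::uint16_t g) {
  assert(g != 0 && "the reproducer always passes a non-zero divisor");
  std::uint16_t packed = static_cast<std::uint16_t>(byte << 2 | lowBits);
  return float(packed - e) / g;
}

// Bit pattern of a float, read via memcpy instead of a pointer cast.
std::uint32_t bitsOf(float f) {
  static_assert(sizeof(float) == sizeof(std::uint32_t),
                "sketch assumes a 32-bit float");
  std::uint32_t u;
  std::memcpy(&u, &f, sizeof u);
  return u;
}
```

With the inputs from `main` (`c[1] == 5`, `e == 0`, `g == 3`), `bitsOf(laneValue(5, 2, 0, 3))` is 1089120939, the bit pattern of 22.0f / 3, which matches the output before the patch. The value after the patch, 1059760811, is the bit pattern of 2.0f / 3, i.e. `bitsOf(laneValue(0, 2, 0, 3))`: the constant `| 2` paired with a zero byte instead of `c[1]`. That would be consistent with operands of different lanes being combined, though the thread itself does not state this.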

Thanks again for the reproducer. I checked it with the latest version uploaded on Friday, and the results are correct. I fixed the incorrect reordering of nodes with alternate instructions.

In D105020#3009190, @ABataev wrote:

Thanks again for the reproducer, checked it with the last version uploaded on Friday, the results are correct. I fixed the incorrect reordering of the nodes with the alternate instructions.

LGTM - if you can add another (useful) test case for the latest repro that'd be great.

In D105020#3009208, @RKSimon wrote:

LGTM - if you can add another (useful) test case for the latest repro that'd be great.

Actually, it's already added. Using the previous reproducer, I added test/Transforms/SLPVectorizer/X86/vectorize-reorder-alt-shuffle.ll, which revealed the bug in the previous version.

Closed by commit rGbc69dd62c04a: [SLP]Improve graph reordering. (authored by ABataev). Sep 20 2021, 8:42 AM

This revision was automatically updated to reflect the committed changes.

ABataev added a commit: rGbc69dd62c04a: [SLP]Improve graph reordering..

Hi @ABataev ,

I have found another regression on the latest version of this change:

> cat repro.cc
#include <cstdio>
struct a {
  int b;
  int c;
  int o;
  int d;
};
class e {
public:
  e(int);
  void f(a *);
  int g;
  int h;
  int i;
};
void e::f(a *p) {
  fprintf(stderr, "MiscompiledFunction\n");
  fprintf(stderr, "%d, %d, %d, %d, %d, %d\n", p->b, p->c, p->o, p->d, h, g);
  int j = (p->b + p->c) / 2, k = (p->o + p->d) / 2, l, m;
  switch (i) {
  case 0:
  case 2:
    l = h - 1 - j;
    m = g - 1 - k;
  }
  p->b = l + p->b - j;
  p->c = l + p->c - j;
  p->o = m + p->o - k;
  p->d = m + p->d - k;
}
int n = 0;
e::e(int q) : g(0), h(0), i(q) {}
int main() {
  a *bb = new a{0, 9, 4, 0}, *p = bb;
  e r(n);
  r.f(p);
  printf("%d %d %d %d\n", bb->b, bb->c, bb->o, bb->d);
}
> # at 5661317f864abf750cf893c6a4cc7a977be0995a
> clang -Wall -Werror -Wextra -O3 -fno-exceptions -stdlib=libc++ -lc++ -std=gnu++17 repro.cc && ./a.out
-9 0 -1 -5
> # at bc69dd62c04a70d29943c1c06c7effed150b70e1
> clang -Wall -Werror -Wextra -O3 -fno-exceptions -stdlib=libc++ -lc++ -std=gnu++17 repro.cc && ./a.out
-7 2 -3 -7

It's quite fragile as e.g. removing fprintf(stderr, "MiscompiledFunction\n"); or running with -fsanitize=null "fixes" the output.
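
To make the expected result easy to verify, here is a scalar re-evaluation of `e::f()` for the inputs used in `main` (`i == 0`, `g == h == 0`, `p = {0, 9, 4, 0}`). It is only a sketch: `Rect` and `evalF` are names invented for this example, and the `swapLM` flag is a hypothetical knob for exploring what a lane mix-up would produce, not a description of what the vectorizer actually did.

```cpp
#include <array>
#include <utility>

struct Rect { int b, c, o, d; };  // mirrors struct a in the reproducer

// Scalar evaluation of e::f() for the i == 0 / i == 2 switch cases.
// swapLM exchanges l and m before they are used (hypothetical lane mix-up).
std::array<int, 4> evalF(Rect p, int h, int g, bool swapLM) {
  int j = (p.b + p.c) / 2;
  int k = (p.o + p.d) / 2;
  int l = h - 1 - j;
  int m = g - 1 - k;
  if (swapLM)
    std::swap(l, m);
  return {l + p.b - j, l + p.c - j, m + p.o - k, m + p.d - k};
}
```

Here `j == 4`, `k == 2`, `l == -5` and `m == -3`, so `evalF({0, 9, 4, 0}, 0, 0, false)` gives `-9 0 -1 -5`, matching the output at 5661317f864a. With `swapLM` set, the same function gives `-7 2 -3 -7`, which is exactly the output at bc69dd62c04a.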

This revision is now accepted and ready to land. Oct 14 2021, 7:41 AM

In D105020#3064191, @goncharov wrote:

Hi @ABataev,

I have found another regression on the latest version of this change: [...]

I'll check it, thanks!

In D105020#3064191, @goncharov wrote:

Hi @ABataev,

I have found another regression on the latest version of this change: [...]

The fix is here: https://reviews.llvm.org/D111898

vporpo added a subscriber: vporpo. Nov 11 2021, 8:00 PM

ABataev closed this revision. Nov 30 2021, 6:10 AM

vdmitrie added inline comments. Feb 8 2022, 3:01 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
640–644	The description is stale.

Revision Contents

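Path	Size
llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h	3 lines
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	1364 lines
llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll	84 lines
llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll	84 lines
llvm/test/Transforms/SLPVectorizer/X86/addsub.ll	42 lines
llvm/test/Transforms/SLPVectorizer/X86/crash_cmpop.ll	6 lines
llvm/test/Transforms/SLPVectorizer/X86/extract.ll	6 lines
llvm/test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll	12 lines
llvm/test/Transforms/SLPVectorizer/X86/jumbled-load.ll	22 lines
llvm/test/Transforms/SLPVectorizer/X86/jumbled_store_crash.ll	29 lines
llvm/test/Transforms/SLPVectorizer/X86/reorder_repeated_ops.ll	4 lines
llvm/test/Transforms/SLPVectorizer/X86/split-load8_2-unord.ll	4 lines
llvm/test/Transforms/SLPVectorizer/X86/vectorize-reorder-alt-shuffle.ll	9 lines
llvm/test/Transforms/SLPVectorizer/X86/vectorize-reorder-reuse.ll	52 lines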

Diff 373608

llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h

[...]	private:
/// every time we run into a memory barrier.		/// every time we run into a memory barrier.
void collectSeedInstructions(BasicBlock *BB);		void collectSeedInstructions(BasicBlock *BB);

/// Try to vectorize a chain that starts at two arithmetic instrs.		/// Try to vectorize a chain that starts at two arithmetic instrs.
bool tryToVectorizePair(Value *A, Value *B, slpvectorizer::BoUpSLP &R);		bool tryToVectorizePair(Value *A, Value *B, slpvectorizer::BoUpSLP &R);

/// Try to vectorize a list of operands.		/// Try to vectorize a list of operands.
/// \returns true if a value was vectorized.		/// \returns true if a value was vectorized.
bool tryToVectorizeList(ArrayRef<Value *> VL, slpvectorizer::BoUpSLP &R,		bool tryToVectorizeList(ArrayRef<Value *> VL, slpvectorizer::BoUpSLP &R);
bool AllowReorder = false);

/// Try to vectorize a chain that may start at the operands of \p I.		/// Try to vectorize a chain that may start at the operands of \p I.
bool tryToVectorize(Instruction *I, slpvectorizer::BoUpSLP &R);		bool tryToVectorize(Instruction *I, slpvectorizer::BoUpSLP &R);

/// Vectorize the store instructions collected in Stores.		/// Vectorize the store instructions collected in Stores.
bool vectorizeStoreChains(slpvectorizer::BoUpSLP &R);		bool vectorizeStoreChains(slpvectorizer::BoUpSLP &R);

/// Vectorize the index computations of the getelementptr instructions		/// Vectorize the index computations of the getelementptr instructions
[...]

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

[...]
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "llvm/Transforms/Vectorize/SLPVectorizer.h"		#include "llvm/Transforms/Vectorize/SLPVectorizer.h"
#include "llvm/ADT/DenseMap.h"		#include "llvm/ADT/DenseMap.h"
#include "llvm/ADT/DenseSet.h"		#include "llvm/ADT/DenseSet.h"
#include "llvm/ADT/Optional.h"		#include "llvm/ADT/Optional.h"
#include "llvm/ADT/PostOrderIterator.h"		#include "llvm/ADT/PostOrderIterator.h"
		#include "llvm/ADT/PriorityQueue.h"
#include "llvm/ADT/STLExtras.h"		#include "llvm/ADT/STLExtras.h"
#include "llvm/ADT/SetOperations.h"		#include "llvm/ADT/SetOperations.h"
#include "llvm/ADT/SetVector.h"		#include "llvm/ADT/SetVector.h"
#include "llvm/ADT/SmallBitVector.h"		#include "llvm/ADT/SmallBitVector.h"
#include "llvm/ADT/SmallPtrSet.h"		#include "llvm/ADT/SmallPtrSet.h"
#include "llvm/ADT/SmallSet.h"		#include "llvm/ADT/SmallSet.h"
#include "llvm/ADT/SmallString.h"		#include "llvm/ADT/SmallString.h"
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
[...]	if (LoadInst *LI = dyn_cast<LoadInst>(I))
return LI->isSimple();		return LI->isSimple();
if (StoreInst *SI = dyn_cast<StoreInst>(I))		if (StoreInst *SI = dyn_cast<StoreInst>(I))
return SI->isSimple();		return SI->isSimple();
if (MemIntrinsic *MI = dyn_cast<MemIntrinsic>(I))		if (MemIntrinsic *MI = dyn_cast<MemIntrinsic>(I))
return !MI->isVolatile();		return !MI->isVolatile();
return true;		return true;
}		}

		/// Shuffles \p Mask in accordance with the given \p SubMask.
		static void addMask(SmallVectorImpl<int> &Mask, ArrayRef<int> SubMask) {
		RKSimon (inline comment): Why didn't you move these up here instead of forward declaring?
		ABataev (author, inline reply): Just to reduce the number of changes, will move the functions here.
		if (SubMask.empty())
		return;
		if (Mask.empty()) {
		Mask.append(SubMask.begin(), SubMask.end());
		return;
		}
		SmallVector<int> NewMask(SubMask.size(), UndefMaskElem);
		int TermValue = std::min(Mask.size(), SubMask.size());
		for (int I = 0, E = SubMask.size(); I < E; ++I) {
		if (SubMask[I] >= TermValue || SubMask[I] == UndefMaskElem ||
		Mask[SubMask[I]] >= TermValue)
		continue;
		NewMask[I] = Mask[SubMask[I]];
		}
		Mask.swap(NewMask);
		}
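
As a rough illustration of what the mask composition in addMask does, here is a standalone sketch using std::vector instead of LLVM's SmallVectorImpl and ArrayRef. The names composeMask and kUndef are hypothetical stand-ins (kUndef plays the role of UndefMaskElem, which is -1 in LLVM); the sketch treats any negative sub-mask entry as undefined, which is an assumption on top of the code above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kUndef = -1; // stand-in for LLVM's UndefMaskElem

// Composes Mask with SubMask in place: NewMask[I] = Mask[SubMask[I]].
// A lane stays undefined when the sub-mask entry is undefined or when either
// index falls outside the common size of the two masks.
void composeMask(std::vector<int> &Mask, const std::vector<int> &SubMask) {
  if (SubMask.empty())
    return;
  if (Mask.empty()) {
    Mask = SubMask;
    return;
  }
  std::vector<int> NewMask(SubMask.size(), kUndef);
  const int TermValue =
      static_cast<int>(std::min(Mask.size(), SubMask.size()));
  for (std::size_t I = 0; I < SubMask.size(); ++I) {
    const int Idx = SubMask[I];
    if (Idx < 0 || Idx >= TermValue || Mask[Idx] >= TermValue)
      continue;
    NewMask[I] = Mask[Idx];
  }
  Mask.swap(NewMask);
}
```

For example, composing {1, 0, 3, 2} with the sub-mask {0, 0, 2, 2} yields {1, 1, 3, 3}: each output lane reads the existing mask through the sub-mask.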

		/// Order may have elements assigned special value (size) which is out of
		/// bounds. Such indices only appear on places which correspond to undef values
		/// (see canReuseExtract for details) and used in order to avoid undef values
		/// have effect on operands ordering.
		/// The first loop below simply finds all unused indices and then the next loop
		/// nest assigns these indices for undef values positions.
		/// As an example below Order has two undef positions and they have assigned
		/// values 3 and 7 respectively:
		/// before: 6 9 5 4 9 2 1 0
		/// after: 6 3 5 4 7 2 1 0
		/// \returns Fixed ordering.
		static void fixupOrderingIndices(SmallVectorImpl<unsigned> &Order) {
		const unsigned Sz = Order.size();
		SmallBitVector UsedIndices(Sz);
		SmallVector<int> MaskedIndices;
		for (unsigned I = 0; I < Sz; ++I) {
		if (Order[I] < Sz)
		UsedIndices.set(Order[I]);
		else
		MaskedIndices.push_back(I);
		}
		if (MaskedIndices.empty())
		return;
		SmallVector<int> AvailableIndices(MaskedIndices.size());
		unsigned Cnt = 0;
		int Idx = UsedIndices.find_first();
		do {
		AvailableIndices[Cnt] = Idx;
		Idx = UsedIndices.find_next(Idx);
		++Cnt;
		} while (Idx > 0);
		assert(Cnt == MaskedIndices.size() && "Non-synced masked/available indices.");
		for (int I = 0, E = MaskedIndices.size(); I < E; ++I)
		Order[MaskedIndices[I]] = AvailableIndices[I];
		}
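
The behaviour described in the comment for fixupOrderingIndices (fill out-of-range positions with the indices nobody else uses) can be sketched outside LLVM as follows. This follows the comment and its before/after example rather than transcribing the patch line by line; fillUnusedIndices is a hypothetical name, and the sketch assumes the in-range entries are distinct.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Replaces every out-of-range entry of Order with an index in [0, size)
// that no in-range entry uses, handing them out in increasing order.
void fillUnusedIndices(std::vector<unsigned> &Order) {
  const std::size_t Sz = Order.size();
  std::vector<bool> Used(Sz, false);
  std::vector<std::size_t> Masked; // positions holding an out-of-range value
  for (std::size_t I = 0; I < Sz; ++I) {
    if (Order[I] < Sz)
      Used[Order[I]] = true;
    else
      Masked.push_back(I);
  }
  std::size_t Next = 0; // next candidate for an unused index
  for (std::size_t Pos : Masked) {
    while (Next < Sz && Used[Next])
      ++Next;
    assert(Next < Sz && "in-range entries are expected to be distinct");
    Order[Pos] = static_cast<unsigned>(Next++);
  }
}
```

On the example from the comment, 6 9 5 4 9 2 1 0 has in-range entries {6, 5, 4, 2, 1, 0}, so the unused indices are 3 and 7 and the result is 6 3 5 4 7 2 1 0.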

namespace llvm {		namespace llvm {

static void inversePermutation(ArrayRef<unsigned> Indices,		static void inversePermutation(ArrayRef<unsigned> Indices,
SmallVectorImpl<int> &Mask) {		SmallVectorImpl<int> &Mask) {
Mask.clear();		Mask.clear();
const unsigned E = Indices.size();		const unsigned E = Indices.size();
Mask.resize(E, E + 1);		Mask.resize(E, UndefMaskElem);
for (unsigned I = 0; I < E; ++I)		for (unsigned I = 0; I < E; ++I)
Mask[Indices[I]] = I;		Mask[Indices[I]] = I;
}		}
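
A small standalone sketch of the same inversion, using std::vector and returning the mask by value. invertPermutation and kUndef are hypothetical names; kUndef stands in for UndefMaskElem, which this diff now uses as the initial fill value instead of E + 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int kUndef = -1; // stand-in for LLVM's UndefMaskElem

// Builds Mask such that Mask[Indices[I]] == I. Lanes that no entry of
// Indices points at keep the undefined marker.
std::vector<int> invertPermutation(const std::vector<unsigned> &Indices) {
  std::vector<int> Mask(Indices.size(), kUndef);
  for (std::size_t I = 0; I < Indices.size(); ++I) {
    assert(Indices[I] < Indices.size() && "index out of range");
    Mask[Indices[I]] = static_cast<int>(I);
  }
  return Mask;
}
```

For Indices = {2, 0, 1} this sets Mask[2] = 0, Mask[0] = 1 and Mask[1] = 2, giving {1, 2, 0}.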

/// \returns inserting index of InsertElement or InsertValue instruction,		/// \returns inserting index of InsertElement or InsertValue instruction,
/// using Offset as base offset for index.		/// using Offset as base offset for index.
static Optional<int> getInsertIndex(Value *InsertInst, unsigned Offset) {		static Optional<int> getInsertIndex(Value *InsertInst, unsigned Offset) {
int Index = Offset;		int Index = Offset;
[...]	for (unsigned I : IV->indices()) {
} else {		} else {
return None;		return None;
}		}
Index += I;		Index += I;
}		}
return Index;		return Index;
}		}

		/// Reorders the list of scalars in accordance with the given \p Order and then
		/// the \p Mask. \p Order - is the original order of the scalars, need to
		/// reorder scalars into an unordered state at first according to the given
		/// order. Then the ordered scalars are shuffled once again in accordance with
		/// the provided mask.
		vdmitrie (inline comment): The description is stale.
		static void reorderScalars(SmallVectorImpl<Value *> &Scalars,
		ArrayRef<int> Mask) {
		assert(!Mask.empty() && "Expected non-empty mask.");
		SmallVector<Value *> Prev(Scalars.size(),
		RKSimon (inline comment): Do we need a default value for vector?
		ABataev (author, inline reply): Not currently, but I'll update these functions to make them compatible with upcoming non-power-2 vectorization.
		UndefValue::get(Scalars.front()->getType()));
		Prev.swap(Scalars);
		for (unsigned I = 0, E = Prev.size(); I < E; ++I)
		if (Mask[I] != UndefMaskElem)
		Scalars[Mask[I]] = Prev[I];
		}
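
To make the direction of the mask concrete: the loop above scatters, that is, the element at position I moves to position Mask[I]. A minimal sketch with strings standing in for Value pointers follows. reorderByMask, kUndef and the "undef" placeholder are illustrative, and the size check in the assert is an added assumption that the original assert does not make.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr int kUndef = -1; // stand-in for LLVM's UndefMaskElem

// Moves Scalars[I] to position Mask[I]. Positions that no defined mask
// entry targets are left holding the "undef" placeholder.
void reorderByMask(std::vector<std::string> &Scalars,
                   const std::vector<int> &Mask) {
  assert(!Mask.empty() && Mask.size() == Scalars.size() &&
         "expected a non-empty mask of matching size");
  std::vector<std::string> Prev(Scalars.size(), "undef");
  Prev.swap(Scalars); // Scalars now holds placeholders, Prev the old values
  for (std::size_t I = 0; I < Prev.size(); ++I)
    if (Mask[I] != kUndef)
      Scalars[Mask[I]] = Prev[I];
}
```

With Scalars = {a, b, c, d} and Mask = {2, 0, 3, 1}, a goes to lane 2, b to lane 0, c to lane 3 and d to lane 1, so the result is {b, d, a, c}.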

namespace slpvectorizer {		namespace slpvectorizer {

/// Bottom Up SLP Vectorizer.		/// Bottom Up SLP Vectorizer.
class BoUpSLP {		class BoUpSLP {
struct TreeEntry;		struct TreeEntry;
struct ScheduleData;		struct ScheduleData;

public:		public:
[...]	public:
/// A negative number means that this is profitable.		/// A negative number means that this is profitable.
InstructionCost getTreeCost(ArrayRef<Value *> VectorizedVals = None);		InstructionCost getTreeCost(ArrayRef<Value *> VectorizedVals = None);

/// Construct a vectorizable tree that starts at \p Roots, ignoring users for		/// Construct a vectorizable tree that starts at \p Roots, ignoring users for
/// the purpose of scheduling and extraction in the \p UserIgnoreLst.		/// the purpose of scheduling and extraction in the \p UserIgnoreLst.
void buildTree(ArrayRef<Value *> Roots,		void buildTree(ArrayRef<Value *> Roots,
ArrayRef<Value *> UserIgnoreLst = None);		ArrayRef<Value *> UserIgnoreLst = None);

/// Construct a vectorizable tree that starts at \p Roots, ignoring users for		/// Builds external uses of the vectorized scalars, i.e. the list of
/// the purpose of scheduling and extraction in the \p UserIgnoreLst taking		/// vectorized scalars to be extracted, their lanes and their scalar users. \p
/// into account (and updating it, if required) list of externally used		/// ExternallyUsedValues contains additional list of external uses to handle
/// values stored in \p ExternallyUsedValues.		/// vectorization of reductions.
void buildTree(ArrayRef<Value *> Roots,		void
ExtraValueToDebugLocsMap &ExternallyUsedValues,		buildExternalUses(const ExtraValueToDebugLocsMap &ExternallyUsedValues = {});
ArrayRef<Value *> UserIgnoreLst = None);

/// Clear the internal data structures that are created by 'buildTree'.		/// Clear the internal data structures that are created by 'buildTree'.
void deleteTree() {		void deleteTree() {
VectorizableTree.clear();		VectorizableTree.clear();
ScalarToTreeEntry.clear();		ScalarToTreeEntry.clear();
MustGather.clear();		MustGather.clear();
ExternalUses.clear();		ExternalUses.clear();
NumOpsWantToKeepOrder.clear();
NumOpsWantToKeepOriginalOrder = 0;
for (auto &Iter : BlocksSchedules) {		for (auto &Iter : BlocksSchedules) {
BlockScheduling *BS = Iter.second.get();		BlockScheduling *BS = Iter.second.get();
BS->clear();		BS->clear();
}		}
MinBWs.clear();		MinBWs.clear();
InstrElementSize.clear();		InstrElementSize.clear();
}		}

unsigned getTreeSize() const { return VectorizableTree.size(); }		unsigned getTreeSize() const { return VectorizableTree.size(); }

/// Perform LICM and CSE on the newly generated gather sequences.		/// Perform LICM and CSE on the newly generated gather sequences.
void optimizeGatherSequence();		void optimizeGatherSequence();

/// \returns The best order of instructions for vectorization.		/// Reorders the current graph to the most profitable order starting from the
Optional<ArrayRef<unsigned>> bestOrder() const {		/// root node to the leaf nodes. The best order is chosen only from the nodes
assert(llvm::all_of(		/// of the same size (vectorization factor). Smaller nodes are considered
NumOpsWantToKeepOrder,		/// parts of subgraph with smaller VF and they are reordered independently. We
[this](const decltype(NumOpsWantToKeepOrder)::value_type &D) {		/// can make it because we still need to extend smaller nodes to the wider VF
return D.getFirst().size() ==		/// and we can merge reordering shuffles with the widening shuffles.
VectorizableTree[0]->Scalars.size();		void reorderTopToBottom();
}) &&
"All orders must have the same size as number of instructions in "		/// Reorders the current graph to the most profitable order starting from
"tree node.");		/// leaves to the root. It allows to rotate small subgraphs and reduce the
auto I = std::max_element(		/// number of reshuffles if the leaf nodes use the same order. In this case we
NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(),		/// can merge the orders and just shuffle user node instead of shuffling its
[](const decltype(NumOpsWantToKeepOrder)::value_type &D1,		/// operands. Plus, even the leaf nodes have different orders, it allows to
const decltype(NumOpsWantToKeepOrder)::value_type &D2) {		/// sink reordering in the graph closer to the root node and merge it later
return D1.second < D2.second;		/// during analysis.
});		void reorderBottomToTop();
if (I == NumOpsWantToKeepOrder.end() ||
I->getSecond() <= NumOpsWantToKeepOriginalOrder)
return None;

return makeArrayRef(I->getFirst());
}
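
The removed bestOrder boils down to a vote: take the order requested by the most bundles and use it only if it beats the number of bundles that want to keep the original order. A standalone sketch of that selection, with std::map in place of DenseMap, is shown below. pickBestOrder and its parameters are hypothetical names, and the tie-breaking (first maximum in map order) is a property of this sketch, not something the removed code guarantees.

```cpp
#include <cassert>
#include <map>
#include <vector>

using Order = std::vector<unsigned>;

// Writes the most requested order to Out and returns true, provided its
// vote count is strictly greater than KeepOriginal (the number of bundles
// that need no reordering). Returns false otherwise.
bool pickBestOrder(const std::map<Order, unsigned> &Votes,
                   unsigned KeepOriginal, Order &Out) {
  const Order *Best = nullptr;
  unsigned BestCount = 0;
  for (const auto &Entry : Votes) {
    if (Entry.second > BestCount) {
      Best = &Entry.first;
      BestCount = Entry.second;
    }
  }
  if (!Best || BestCount <= KeepOriginal)
    return false;
  Out = *Best;
  return true;
}
```

For instance, with three votes for {1, 0, 3, 2}, one vote for {3, 2, 1, 0} and two bundles wanting the original order, the function selects {1, 0, 3, 2}; with three bundles wanting the original order it selects nothing.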

/// Builds the correct order for root instructions.
/// If some leaves have the same instructions to be vectorized, we may
/// incorrectly evaluate the best order for the root node (it is built for the
/// vector of instructions without repeated instructions and, thus, has less
/// elements than the root node). This function builds the correct order for
/// the root node.
/// For example, if the root node is <a+b, a+c, a+d, f+e>, then the leaves
/// are <a, a, a, f> and <b, c, d, e>. When we try to vectorize the first
/// leaf, it will be shrink to <a, b>. If instructions in this leaf should
/// be reordered, the best order will be <1, 0>. We need to extend this
/// order for the root node. For the root node this order should look like
/// <3, 0, 1, 2>. This function extends the order for the reused
/// instructions.
void findRootOrder(OrdersType &Order) {
// If the leaf has the same number of instructions to vectorize as the root
// - order must be set already.
unsigned RootSize = VectorizableTree[0]->Scalars.size();
if (Order.size() == RootSize)
return;
SmallVector<unsigned, 4> RealOrder(Order.size());
std::swap(Order, RealOrder);
SmallVector<int, 4> Mask;
inversePermutation(RealOrder, Mask);
Order.assign(Mask.begin(), Mask.end());
// The leaf has less number of instructions - need to find the true order of
// the root.
// Scan the nodes starting from the leaf back to the root.
const TreeEntry *PNode = VectorizableTree.back().get();
SmallVector<const TreeEntry *, 4> Nodes(1, PNode);
SmallPtrSet<const TreeEntry *, 4> Visited;
while (!Nodes.empty() && Order.size() != RootSize) {
const TreeEntry *PNode = Nodes.pop_back_val();
if (!Visited.insert(PNode).second)
continue;
const TreeEntry &Node = *PNode;
for (const EdgeInfo &EI : Node.UserTreeIndices)
if (EI.UserTE)
Nodes.push_back(EI.UserTE);
if (Node.ReuseShuffleIndices.empty())
continue;
// Build the order for the parent node.
OrdersType NewOrder(Node.ReuseShuffleIndices.size(), RootSize);
SmallVector<unsigned, 4> OrderCounter(Order.size(), 0);
// The algorithm of the order extension is:
// 1. Calculate the number of the same instructions for the order.
// 2. Calculate the index of the new order: total number of instructions
// with order less than the order of the current instruction + reuse
// number of the current instruction.
// 3. The new order is just the index of the instruction in the original
// vector of the instructions.
for (unsigned I : Node.ReuseShuffleIndices)
++OrderCounter[Order[I]];
SmallVector<unsigned, 4> CurrentCounter(Order.size(), 0);
for (unsigned I = 0, E = Node.ReuseShuffleIndices.size(); I < E; ++I) {
unsigned ReusedIdx = Node.ReuseShuffleIndices[I];
unsigned OrderIdx = Order[ReusedIdx];
unsigned NewIdx = 0;
for (unsigned J = 0; J < OrderIdx; ++J)
NewIdx += OrderCounter[J];
NewIdx += CurrentCounter[OrderIdx];
++CurrentCounter[OrderIdx];
assert(NewOrder[NewIdx] == RootSize &&
"The order index should not be written already.");
NewOrder[NewIdx] = I;
}
std::swap(Order, NewOrder);
}
assert(Order.size() == RootSize &&
"Root node is expected or the size of the order must be the same as "
"the number of elements in the root node.");
assert(llvm::all_of(Order,
[RootSize](unsigned Val) { return Val != RootSize; }) &&
"All indices must be initialized");
}
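
The order-extension step of the removed findRootOrder can be tried in isolation. The sketch below performs one extension for a single node, given the node's reuse indices and the (already inverted) order of its unique scalars. extendOrder is a hypothetical name, std::vector replaces the LLVM containers, and the sentinel is the reuse-vector size rather than RootSize, which coincides with the original only when the node is as wide as the root.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Extends Order (over the unique scalars) to an order over all lanes of a
// node whose lanes map to unique scalars through Reuse.
std::vector<unsigned> extendOrder(const std::vector<unsigned> &Order,
                                  const std::vector<unsigned> &Reuse) {
  const unsigned Unset = static_cast<unsigned>(Reuse.size());
  std::vector<unsigned> NewOrder(Reuse.size(), Unset);
  // 1. Count how many lanes share each order value.
  std::vector<unsigned> OrderCounter(Order.size(), 0);
  for (unsigned I : Reuse) {
    assert(I < Order.size() && "reuse index out of range");
    ++OrderCounter[Order[I]];
  }
  // 2. New index = lanes with a smaller order value + repeats seen so far.
  std::vector<unsigned> CurrentCounter(Order.size(), 0);
  for (std::size_t I = 0; I < Reuse.size(); ++I) {
    const unsigned OrderIdx = Order[Reuse[I]];
    unsigned NewIdx = 0;
    for (unsigned J = 0; J < OrderIdx; ++J)
      NewIdx += OrderCounter[J];
    NewIdx += CurrentCounter[OrderIdx]++;
    assert(NewOrder[NewIdx] == Unset && "slot written twice");
    // 3. The new order entry is the lane's position in the wide node.
    NewOrder[NewIdx] = static_cast<unsigned>(I);
  }
  return NewOrder;
}
```

Using the example from the comment, the leaf <a, a, a, f> has reuse indices {0, 0, 0, 1}; extending the order <1, 0> gives <3, 0, 1, 2>, the value the comment states for the root.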

/// \return The vector element size in bits to use when vectorizing the		/// \return The vector element size in bits to use when vectorizing the
/// expression tree ending at \p V. If V is a store, the size is the width of		/// expression tree ending at \p V. If V is a store, the size is the width of
/// the stored value. Otherwise, the size is the width of the largest loaded		/// the stored value. Otherwise, the size is the width of the largest loaded
/// value reaching V. This method is used by the vectorizer to calculate		/// value reaching V. This method is used by the vectorizer to calculate
/// vectorization factors.		/// vectorization factors.
unsigned getVectorElementSize(Value *V);		unsigned getVectorElementSize(Value *V);

/// Compute the minimum type sizes required to represent the entries in a		/// Compute the minimum type sizes required to represent the entries in a
/// vectorizable tree.		/// vectorizable tree.
void computeMinimumValueSizes();		void computeMinimumValueSizes();

// \returns maximum vector register size as set by TTI or overridden by cl::opt.		// \returns maximum vector register size as set by TTI or overridden by cl::opt.
unsigned getMaxVecRegSize() const {		unsigned getMaxVecRegSize() const {
return MaxVecRegSize;		return MaxVecRegSize;
}		}

// \returns minimum vector register size as set by cl::opt.		// \returns minimum vector register size as set by cl::opt.
unsigned getMinVecRegSize() const {		unsigned getMinVecRegSize() const {
return MinVecRegSize;		return MinVecRegSize;
}		}

		unsigned getMinVF(unsigned Sz) const {
		return std::max(2U, getMinVecRegSize() / Sz);
		}

unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const {		unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const {
unsigned MaxVF = MaxVFOption.getNumOccurrences() ?		unsigned MaxVF = MaxVFOption.getNumOccurrences() ?
MaxVFOption : TTI->getMaximumVF(ElemWidth, Opcode);		MaxVFOption : TTI->getMaximumVF(ElemWidth, Opcode);
return MaxVF ? MaxVF : UINT_MAX;		return MaxVF ? MaxVF : UINT_MAX;
}		}

/// Check if homogeneous aggregate is isomorphic to some VectorType.		/// Check if homogeneous aggregate is isomorphic to some VectorType.
/// Accepts homogeneous multidimensional aggregate of scalars/vectors like		/// Accepts homogeneous multidimensional aggregate of scalars/vectors like
[...]	static void reorderInputsAccordingToOpcode(ArrayRef<Value *> VL,
ScalarEvolution &SE,		ScalarEvolution &SE,
const BoUpSLP &R);		const BoUpSLP &R);
struct TreeEntry {		struct TreeEntry {
using VecTreeTy = SmallVector<std::unique_ptr<TreeEntry>, 8>;		using VecTreeTy = SmallVector<std::unique_ptr<TreeEntry>, 8>;
TreeEntry(VecTreeTy &Container) : Container(Container) {}		TreeEntry(VecTreeTy &Container) : Container(Container) {}

/// \returns true if the scalars in VL are equal to this entry.		/// \returns true if the scalars in VL are equal to this entry.
bool isSame(ArrayRef<Value *> VL) const {		bool isSame(ArrayRef<Value *> VL) const {
if (VL.size() == Scalars.size())		auto &&IsSame = [VL](ArrayRef<Value *> Scalars, ArrayRef<int> Mask) {
		if (Mask.size() != VL.size() && VL.size() == Scalars.size())
return std::equal(VL.begin(), VL.end(), Scalars.begin());		return std::equal(VL.begin(), VL.end(), Scalars.begin());
return VL.size() == ReuseShuffleIndices.size() &&		return VL.size() == Mask.size() &&
std::equal(		std::equal(
VL.begin(), VL.end(), ReuseShuffleIndices.begin(),		VL.begin(), VL.end(), Mask.begin(),
[this](Value *V, int Idx) { return V == Scalars[Idx]; });		[Scalars](Value *V, int Idx) { return V == Scalars[Idx]; });
		};
		if (!ReorderIndices.empty()) {
		// TODO: implement matching if the nodes are just reordered, still can
		// treat the vector as the same if the list of scalars matches VL
		// directly, without reordering.
		SmallVector<int> Mask;
		inversePermutation(ReorderIndices, Mask);
		if (VL.size() == Scalars.size())
		return IsSame(Scalars, Mask);
		if (VL.size() == ReuseShuffleIndices.size()) {
		::addMask(Mask, ReuseShuffleIndices);
		return IsSame(Scalars, Mask);
		}
		return false;
		}
		return IsSame(Scalars, ReuseShuffleIndices);
		RKSimon (inline comment): This std::equal + pragma is repeated a lot in this method - worth pulling out?
		ABataev (author, inline reply): Ok, will try to transform it into a lambda or something similar.
}		}

/// A vector of scalars.		/// A vector of scalars.
ValueList Scalars;		ValueList Scalars;

/// The Scalars are vectorized into this value. It is initialized to Null.		/// The Scalars are vectorized into this value. It is initialized to Null.
Value *VectorizedValue = nullptr;		Value *VectorizedValue = nullptr;

[...]	void setOperandsInOrder() {
auto *I = cast<Instruction>(Scalars[Lane]);		auto *I = cast<Instruction>(Scalars[Lane]);
assert(I->getNumOperands() == NumOperands &&		assert(I->getNumOperands() == NumOperands &&
"Expected same number of operands");		"Expected same number of operands");
Operands[OpIdx][Lane] = I->getOperand(OpIdx);		Operands[OpIdx][Lane] = I->getOperand(OpIdx);
}		}
}		}
}		}

		/// Reorders operands of the node to the given mask \p Mask.
		void reorderOperands(ArrayRef<int> Mask) {
		for (ValueList &Operand : Operands)
		reorderScalars(Operand, Mask);
		}

/// \returns the \p OpIdx operand of this TreeEntry.		/// \returns the \p OpIdx operand of this TreeEntry.
ValueList &getOperand(unsigned OpIdx) {		ValueList &getOperand(unsigned OpIdx) {
assert(OpIdx < Operands.size() && "Off bounds");		assert(OpIdx < Operands.size() && "Off bounds");
return Operands[OpIdx];		return Operands[OpIdx];
}		}

/// \returns the number of operands.		/// \returns the number of operands.
unsigned getNumOperands() const { return Operands.size(); }		unsigned getNumOperands() const { return Operands.size(); }
[...]	public:
unsigned getOpcode() const {		unsigned getOpcode() const {
return MainOp ? MainOp->getOpcode() : 0;		return MainOp ? MainOp->getOpcode() : 0;
}		}

unsigned getAltOpcode() const {		unsigned getAltOpcode() const {
return AltOp ? AltOp->getOpcode() : 0;		return AltOp ? AltOp->getOpcode() : 0;
}		}

/// Update operations state of this entry if reorder occurred.		/// When ReuseReorderShuffleIndices is empty it just returns position of \p
bool updateStateIfReorder() {		/// V within vector of Scalars. Otherwise, try to remap on its reuse index.
if (ReorderIndices.empty())
return false;
InstructionsState S = getSameOpcode(Scalars, ReorderIndices.front());
setOperations(S);
return true;
}
/// When ReuseShuffleIndices is empty it just returns position of \p V
/// within vector of Scalars. Otherwise, try to remap on its reuse index.
int findLaneForValue(Value *V) const {		int findLaneForValue(Value *V) const {
unsigned FoundLane = std::distance(Scalars.begin(), find(Scalars, V));		unsigned FoundLane = std::distance(Scalars.begin(), find(Scalars, V));
assert(FoundLane < Scalars.size() && "Couldn't find extract lane");		assert(FoundLane < Scalars.size() && "Couldn't find extract lane");
		if (!ReorderIndices.empty())
		FoundLane = ReorderIndices[FoundLane];
		assert(FoundLane < Scalars.size() && "Couldn't find extract lane");
if (!ReuseShuffleIndices.empty()) {		if (!ReuseShuffleIndices.empty()) {
FoundLane = std::distance(ReuseShuffleIndices.begin(),		FoundLane = std::distance(ReuseShuffleIndices.begin(),
find(ReuseShuffleIndices, FoundLane));		find(ReuseShuffleIndices, FoundLane));
}		}
return FoundLane;		return FoundLane;
}		}
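
The two-stage lookup in findLaneForValue (position in Scalars, then through ReorderIndices, then through ReuseShuffleIndices) can be sketched with plain containers. findLane is a hypothetical free function and strings stand in for Value pointers; an empty vector means "no reordering" or "no reuse", as in the code above.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Returns the vector lane that holds V: first locate V among the scalars,
// then remap through the reorder indices, then find the first reuse-mask
// position that refers to that lane.
int findLane(const std::vector<std::string> &Scalars,
             const std::vector<unsigned> &ReorderIndices,
             const std::vector<int> &ReuseShuffleIndices,
             const std::string &V) {
  auto It = std::find(Scalars.begin(), Scalars.end(), V);
  assert(It != Scalars.end() && "value is not part of this node");
  unsigned Lane = static_cast<unsigned>(It - Scalars.begin());
  if (!ReorderIndices.empty())
    Lane = ReorderIndices[Lane];
  assert(Lane < Scalars.size() && "reordered lane out of range");
  if (!ReuseShuffleIndices.empty()) {
    auto RIt = std::find(ReuseShuffleIndices.begin(),
                         ReuseShuffleIndices.end(), static_cast<int>(Lane));
    Lane = static_cast<unsigned>(RIt - ReuseShuffleIndices.begin());
  }
  return static_cast<int>(Lane);
}
```

For Scalars = {a, b, c, d} with ReorderIndices = {2, 0, 3, 1} and no reuse mask, b sits at position 1 and maps to lane 0. For Scalars = {a, b} with the reuse mask {1, 1, 0, 0} and no reordering, a maps to lane 2, the first mask position that refers to scalar 0.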

#ifndef NDEBUG		#ifndef NDEBUG
[...]	dbgs() << "SLP: ReuseShuffleCost + VecCost - ScalarCost = " <<
ReuseShuffleCost + VecCost - ScalarCost << "\n";		ReuseShuffleCost + VecCost - ScalarCost << "\n";
}		}
#endif		#endif

/// Create a new VectorizableTree entry.		/// Create a new VectorizableTree entry.
TreeEntry *newTreeEntry(ArrayRef<Value *> VL, Optional<ScheduleData *> Bundle,		TreeEntry *newTreeEntry(ArrayRef<Value *> VL, Optional<ScheduleData *> Bundle,
const InstructionsState &S,		const InstructionsState &S,
const EdgeInfo &UserTreeIdx,		const EdgeInfo &UserTreeIdx,
ArrayRef<unsigned> ReuseShuffleIndices = None,		ArrayRef<int> ReuseShuffleIndices = None,
ArrayRef<unsigned> ReorderIndices = None) {		ArrayRef<unsigned> ReorderIndices = None) {
TreeEntry::EntryState EntryState =		TreeEntry::EntryState EntryState =
Bundle ? TreeEntry::Vectorize : TreeEntry::NeedToGather;		Bundle ? TreeEntry::Vectorize : TreeEntry::NeedToGather;
return newTreeEntry(VL, EntryState, Bundle, S, UserTreeIdx,		return newTreeEntry(VL, EntryState, Bundle, S, UserTreeIdx,
ReuseShuffleIndices, ReorderIndices);		ReuseShuffleIndices, ReorderIndices);
}		}

TreeEntry *newTreeEntry(ArrayRef<Value *> VL,		TreeEntry *newTreeEntry(ArrayRef<Value *> VL,
TreeEntry::EntryState EntryState,		TreeEntry::EntryState EntryState,
Optional<ScheduleData *> Bundle,		Optional<ScheduleData *> Bundle,
const InstructionsState &S,		const InstructionsState &S,
const EdgeInfo &UserTreeIdx,		const EdgeInfo &UserTreeIdx,
ArrayRef<unsigned> ReuseShuffleIndices = None,		ArrayRef<int> ReuseShuffleIndices = None,
ArrayRef<unsigned> ReorderIndices = None) {		ArrayRef<unsigned> ReorderIndices = None) {
assert(((!Bundle && EntryState == TreeEntry::NeedToGather) ||		assert(((!Bundle && EntryState == TreeEntry::NeedToGather) ||
(Bundle && EntryState != TreeEntry::NeedToGather)) &&		(Bundle && EntryState != TreeEntry::NeedToGather)) &&
"Need to vectorize gather entry?");		"Need to vectorize gather entry?");
VectorizableTree.push_back(std::make_unique<TreeEntry>(VectorizableTree));		VectorizableTree.push_back(std::make_unique<TreeEntry>(VectorizableTree));
TreeEntry *Last = VectorizableTree.back().get();		TreeEntry *Last = VectorizableTree.back().get();
Last->Idx = VectorizableTree.size() - 1;		Last->Idx = VectorizableTree.size() - 1;
Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());
Last->State = EntryState;		Last->State = EntryState;
Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),		Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),
ReuseShuffleIndices.end());		ReuseShuffleIndices.end());
Last->ReorderIndices.append(ReorderIndices.begin(), ReorderIndices.end());		if (ReorderIndices.empty()) {
		Last->Scalars.assign(VL.begin(), VL.end());
		Last->setOperations(S);
		} else {
		// Reorder scalars and build final mask.
		Last->Scalars.assign(VL.size(), nullptr);
		transform(ReorderIndices, Last->Scalars.begin(),
		[VL](unsigned Idx) -> Value * {
		if (Idx >= VL.size())
		return UndefValue::get(VL.front()->getType());
		return VL[Idx];
		});
		InstructionsState S = getSameOpcode(Last->Scalars);
Last->setOperations(S);		Last->setOperations(S);
		Last->ReorderIndices.append(ReorderIndices.begin(), ReorderIndices.end());
		}
if (Last->State != TreeEntry::NeedToGather) {		if (Last->State != TreeEntry::NeedToGather) {
for (Value *V : VL) {		for (Value *V : VL) {
assert(!getTreeEntry(V) && "Scalar already in tree!");		assert(!getTreeEntry(V) && "Scalar already in tree!");
ScalarToTreeEntry[V] = Last;		ScalarToTreeEntry[V] = Last;
}		}
// Update the scheduler bundle to point to this TreeEntry.		// Update the scheduler bundle to point to this TreeEntry.
unsigned Lane = 0;		unsigned Lane = 0;
for (ScheduleData *BundleMember = Bundle.getValue(); BundleMember;		for (ScheduleData *BundleMember = Bundle.getValue(); BundleMember;
[...]	static unsigned getHashValue(const OrdersType &V) {
return static_cast<unsigned>(hash_combine_range(V.begin(), V.end()));		return static_cast<unsigned>(hash_combine_range(V.begin(), V.end()));
}		}

static bool isEqual(const OrdersType &LHS, const OrdersType &RHS) {		static bool isEqual(const OrdersType &LHS, const OrdersType &RHS) {
return LHS == RHS;		return LHS == RHS;
}		}
};		};

/// Contains orders of operations along with the number of bundles that have
/// operations in this order. It stores only those orders that require
/// reordering, if reordering is not required it is counted using \a
/// NumOpsWantToKeepOriginalOrder.
DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> NumOpsWantToKeepOrder;
/// Number of bundles that do not require reordering.
unsigned NumOpsWantToKeepOriginalOrder = 0;

// Analysis and block reference.		// Analysis and block reference.
Function *F;		Function *F;
ScalarEvolution *SE;		ScalarEvolution *SE;
TargetTransformInfo *TTI;		TargetTransformInfo *TTI;
TargetLibraryInfo *TLI;		TargetLibraryInfo *TLI;
AAResults *AA;		AAResults *AA;
LoopInfo *LI;		LoopInfo *LI;
DominatorTree *DT;		DominatorTree *DT;
[136 lines hidden]

void BoUpSLP::eraseInstructions(ArrayRef<Value *> AV) {		void BoUpSLP::eraseInstructions(ArrayRef<Value *> AV) {
for (auto *V : AV) {		for (auto *V : AV) {
if (auto *I = dyn_cast<Instruction>(V))		if (auto *I = dyn_cast<Instruction>(V))
eraseInstruction(I, /*ReplaceOpsWithUndef=*/true);		eraseInstruction(I, /*ReplaceOpsWithUndef=*/true);
};		};
}		}

void BoUpSLP::buildTree(ArrayRef<Value *> Roots,		/// Reorders the given \p Reuses mask according to the given \p Mask. \p Reuses
ArrayRef<Value *> UserIgnoreLst) {		/// contains original mask for the scalars reused in the node. Procedure
ExtraValueToDebugLocsMap ExternallyUsedValues;		/// transforms this mask in accordance with the given \p Mask.
buildTree(Roots, ExternallyUsedValues, UserIgnoreLst);		static void reorderReuses(SmallVectorImpl<int> &Reuses, ArrayRef<int> Mask) {
		assert(!Mask.empty() && Reuses.size() == Mask.size() &&
		"Expected non-empty mask.");
		SmallVector<int> Prev(Reuses.begin(), Reuses.end());
		Prev.swap(Reuses);
		for (unsigned I = 0, E = Prev.size(); I < E; ++I)
		if (Mask[I] != UndefMaskElem)
		Reuses[Mask[I]] = Prev[I];
		RKSimon: Do we need a default value for vector?
		ABataev (author): No, it is swapped with `Order` and `Order` then updated. But I'll add default initialization because we may need it in the future for non-pow-2 patch.
}		}
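To make the effect of `reorderReuses` concrete, here is a minimal standalone sketch of the same scatter step. It uses `std::vector<int>` in place of `SmallVectorImpl<int>`; the names `reorderReusesSketch` and `UndefMaskElemSketch` are illustrative stand-ins and not part of the patch, and the undefined-element marker is assumed to be -1 as in LLVM shuffle masks.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for LLVM's UndefMaskElem (assumed to be -1 here).
constexpr int UndefMaskElemSketch = -1;

// Hypothetical standalone version of the scatter done by reorderReuses:
// the element at position I moves to position Mask[I]. Positions that no
// mask element targets keep their previous value.
void reorderReusesSketch(std::vector<int> &Reuses,
                         const std::vector<int> &Mask) {
  assert(!Mask.empty() && Reuses.size() == Mask.size() &&
         "Expected non-empty mask of matching size.");
  const std::vector<int> Prev(Reuses);
  for (std::size_t I = 0, E = Prev.size(); I < E; ++I)
    if (Mask[I] != UndefMaskElemSketch)
      Reuses[Mask[I]] = Prev[I];
}
```

With `Reuses = {10, 20, 30, 40}` and `Mask = {2, 0, 3, 1}`, the call leaves `Reuses` as `{20, 40, 10, 30}`.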

void BoUpSLP::buildTree(ArrayRef<Value *> Roots,		/// Reorders the given \p Order according to the given \p Mask. \p Order - is
ExtraValueToDebugLocsMap &ExternallyUsedValues,		/// the original order of the scalars. Procedure transforms the provided order
		RKSimon: Doesn't the IsIdentity pass have to be done after Order[] is updated?
		ABataev (author): Yes, you're right, will fix this. It may affect the cost in some rare cases.
ArrayRef<Value *> UserIgnoreLst) {		/// in accordance with the given \p Mask. If the resulting \p Order is just an
deleteTree();		/// identity order, \p Order is cleared.
UserIgnoreList = UserIgnoreLst;		static void reorderOrder(SmallVectorImpl<unsigned> &Order, ArrayRef<int> Mask) {
if (!allSameType(Roots))		assert(!Mask.empty() && "Expected non-empty mask.");
		SmallVector<int> MaskOrder;
		if (Order.empty()) {
		MaskOrder.resize(Mask.size());
		std::iota(MaskOrder.begin(), MaskOrder.end(), 0);
		} else {
		inversePermutation(Order, MaskOrder);
		}
		reorderReuses(MaskOrder, Mask);
		if (ShuffleVectorInst::isIdentityMask(MaskOrder)) {
		Order.clear();
		return;
		}
		Order.assign(Mask.size(), Mask.size());
		for (unsigned I = 0, E = Mask.size(); I < E; ++I)
		if (MaskOrder[I] != UndefMaskElem)
		Order[MaskOrder[I]] = I;
		fixupOrderingIndices(Order);
		}
		RKSimon: Do we need a default value for vector?
		ABataev (author): I'll add `UndefMaskElem` to reduce future updates for non-power-2.
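The following is a simplified sketch of what `reorderOrder` does, written against `std::vector` so it can be run outside LLVM. It assumes that `Order` (when non-empty) and `Mask` are complete permutations of the same size, so the `fixupOrderingIndices` step from the patch is omitted, and the identity check is stricter than `ShuffleVectorInst::isIdentityMask` because it does not accept undefined elements. All names here (`reorderOrderSketch`, `scatterByMask`, `UndefElem`) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

constexpr int UndefElem = -1; // stand-in for UndefMaskElem (assumed -1)

// Scatter helper: the element at position I moves to position Mask[I].
static void scatterByMask(std::vector<int> &V, const std::vector<int> &Mask) {
  const std::vector<int> Prev(V);
  for (std::size_t I = 0; I < Prev.size(); ++I)
    if (Mask[I] != UndefElem)
      V[Mask[I]] = Prev[I];
}

// Hypothetical simplification of reorderOrder: compose the existing order
// with Mask and clear Order if the composition is the identity.
void reorderOrderSketch(std::vector<unsigned> &Order,
                        const std::vector<int> &Mask) {
  assert(!Mask.empty() && "Expected non-empty mask.");
  assert((Order.empty() || Order.size() == Mask.size()) &&
         "Sketch assumes matching sizes.");
  std::vector<int> MaskOrder(Mask.size());
  if (Order.empty()) {
    // An empty order means identity.
    std::iota(MaskOrder.begin(), MaskOrder.end(), 0);
  } else {
    // Invert the order into a mask: MaskOrder[Order[I]] = I.
    for (std::size_t I = 0; I < Order.size(); ++I)
      MaskOrder[Order[I]] = static_cast<int>(I);
  }
  scatterByMask(MaskOrder, Mask);
  bool IsIdentity = true;
  for (std::size_t I = 0; I < MaskOrder.size(); ++I)
    IsIdentity &= MaskOrder[I] == static_cast<int>(I);
  if (IsIdentity) {
    Order.clear();
    return;
  }
  Order.assign(Mask.size(), static_cast<unsigned>(Mask.size()));
  for (std::size_t I = 0; I < MaskOrder.size(); ++I)
    Order[MaskOrder[I]] = static_cast<unsigned>(I);
}
```

Applying the swap mask `{1, 0, 3, 2}` to an empty order gives `{1, 0, 3, 2}`; applying the same mask again cancels it out and the order is cleared, which is the "identity order is cleared" behaviour described in the doc comment.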

		void BoUpSLP::reorderTopToBottom() {
		// Maps VF to the graph nodes.
		DenseMap<unsigned, SmallPtrSet<TreeEntry *, 4>> VFToOrderedEntries;
		// ExtractElement gather nodes which can be vectorized and need to handle
		// their ordering.
		DenseMap<const TreeEntry *, OrdersType> GathersToOrders;
		// Find all reorderable nodes with the given VF.
		// Currently they are vectorized loads, extracts + some gathering of extracts.
		for_each(VectorizableTree, [this, &VFToOrderedEntries, &GathersToOrders](
		const std::unique_ptr<TreeEntry> &TE) {
		// No need to reorder if need to shuffle reuses, still need to shuffle the
		// node.
		if (!TE->ReuseShuffleIndices.empty())
return;		return;
buildTree_rec(Roots, 0, EdgeInfo());		if (TE->State == TreeEntry::Vectorize &&
		isa<LoadInst, ExtractElementInst, ExtractValueInst, StoreInst,
		InsertElementInst>(TE->getMainOp()) &&
		!TE->isAltShuffle()) {
		VFToOrderedEntries[TE->Scalars.size()].insert(TE.get());
		} else if (TE->State == TreeEntry::NeedToGather &&
		TE->getOpcode() == Instruction::ExtractElement &&
		!TE->isAltShuffle() &&
		isa<FixedVectorType>(cast<ExtractElementInst>(TE->getMainOp())
		->getVectorOperandType()) &&
		allSameType(TE->Scalars) && allSameBlock(TE->Scalars)) {
		// Check that gather of extractelements can be represented as
		// just a shuffle of a single vector.
		OrdersType CurrentOrder;
		bool Reuse = canReuseExtract(TE->Scalars, TE->getMainOp(), CurrentOrder);
		if (Reuse || !CurrentOrder.empty()) {
		VFToOrderedEntries[TE->Scalars.size()].insert(TE.get());
		GathersToOrders.try_emplace(TE.get(), CurrentOrder);
		}
		}
		});

		// Reorder the graph nodes according to their vectorization factor.
		for (unsigned VF = VectorizableTree.front()->Scalars.size(); VF > 1;
		VF /= 2) {
		auto It = VFToOrderedEntries.find(VF);
		if (It == VFToOrderedEntries.end())
		continue;
		// Try to find the most profitable order. We just are looking for the most
		// used order and reorder scalar elements in the nodes according to this
		// mostly used order.
		const SmallPtrSetImpl<TreeEntry *> &OrderedEntries = It->getSecond();
		// All operands are reordered and used only in this node - propagate the
		// most used order to the user node.
		DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> OrdersUses;
		SmallPtrSet<const TreeEntry *, 4> VisitedOps;
		for (const TreeEntry *OpTE : OrderedEntries) {
		// No need to reorder these nodes, still need to extend and to use shuffle,
		// just need to merge reordering shuffle and the reuse shuffle.
		if (!OpTE->ReuseShuffleIndices.empty())
		continue;
		// Count number of orders uses.
		const auto &Order = [OpTE, &GathersToOrders]() -> const OrdersType & {
		if (OpTE->State == TreeEntry::NeedToGather)
		return GathersToOrders.find(OpTE)->second;
		return OpTE->ReorderIndices;
		}();
		// Stores actually store the mask, not the order, need to invert.
		if (OpTE->State == TreeEntry::Vectorize && !OpTE->isAltShuffle() &&
		OpTE->getOpcode() == Instruction::Store && !Order.empty()) {
		SmallVector<int> Mask;
		inversePermutation(Order, Mask);
		unsigned E = Order.size();
		OrdersType CurrentOrder(E, E);
		transform(Mask, CurrentOrder.begin(), [E](int Idx) {
		return Idx == UndefMaskElem ? E : static_cast<unsigned>(Idx);
		});
		fixupOrderingIndices(CurrentOrder);
		++OrdersUses.try_emplace(CurrentOrder).first->getSecond();
		} else {
		++OrdersUses.try_emplace(Order).first->getSecond();
		}
		}
		// Set order of the user node.
		if (OrdersUses.empty())
		continue;
		// Choose the most used order.
		ArrayRef<unsigned> BestOrder = OrdersUses.begin()->first;
		unsigned Cnt = OrdersUses.begin()->second;
		for (const auto &Pair : llvm::drop_begin(OrdersUses)) {
		if (Cnt < Pair.second || (Cnt == Pair.second && Pair.first.empty())) {
		BestOrder = Pair.first;
		Cnt = Pair.second;
		}
		}
		// Set order of the user node.
		if (BestOrder.empty())
		continue;
		SmallVector<int> Mask;
		inversePermutation(BestOrder, Mask);
		SmallVector<int> MaskOrder(BestOrder.size(), UndefMaskElem);
		unsigned E = BestOrder.size();
		transform(BestOrder, MaskOrder.begin(), [E](unsigned I) {
		return I < E ? static_cast<int>(I) : UndefMaskElem;
		});
		// Do an actual reordering, if profitable.
		for (std::unique_ptr<TreeEntry> &TE : VectorizableTree) {
		// Just do the reordering for the nodes with the given VF.
		if (TE->Scalars.size() != VF) {
		if (TE->ReuseShuffleIndices.size() == VF) {
		// Need to reorder the reuses masks of the operands with smaller VF to
		// be able to find the match between the graph nodes and scalar
		// operands of the given node during vectorization/cost estimation.
		assert(all_of(TE->UserTreeIndices,
		[VF, &TE](const EdgeInfo &EI) {
		return EI.UserTE->Scalars.size() == VF ||
		EI.UserTE->Scalars.size() ==
		TE->Scalars.size();
		}) &&
		"All users must be of VF size.");
		// Update ordering of the operands with the smaller VF than the given
		// one.
		reorderReuses(TE->ReuseShuffleIndices, Mask);
		}
		continue;
		}
		if (TE->State == TreeEntry::Vectorize &&
		isa<ExtractElementInst, ExtractValueInst, LoadInst, StoreInst,
		InsertElementInst>(TE->getMainOp()) &&
		!TE->isAltShuffle()) {
		// Build correct orders for extract{element,value}, loads and
		// stores.
		reorderOrder(TE->ReorderIndices, Mask);
		if (isa<InsertElementInst, StoreInst>(TE->getMainOp()))
		TE->reorderOperands(Mask);
		} else {
		// Reorder the node and its operands.
		TE->reorderOperands(Mask);
		assert(TE->ReorderIndices.empty() &&
		"Expected empty reorder sequence.");
		reorderScalars(TE->Scalars, Mask);
		}
		if (!TE->ReuseShuffleIndices.empty()) {
		// Apply reversed order to keep the original ordering of the reused
		// elements to avoid extra reorder indices shuffling.
		OrdersType CurrentOrder;
		reorderOrder(CurrentOrder, MaskOrder);
		SmallVector<int> NewReuses;
		inversePermutation(CurrentOrder, NewReuses);
		addMask(NewReuses, TE->ReuseShuffleIndices);
		TE->ReuseShuffleIndices.swap(NewReuses);
		}
		}
		}
		}
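The "choose the most used order" step in `reorderTopToBottom` can be sketched in isolation as below. This is an illustration only: it uses `std::map` instead of `DenseMap` (so iteration order is deterministic, unlike in the patch), represents an order as `std::vector<unsigned>` with the empty vector meaning "keep the original order", and the function name `pickMostUsedOrderSketch` is hypothetical.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <vector>

using OrderSketch = std::vector<unsigned>;

// Counts how often each order occurs and returns the most frequent one.
// On a tie the empty (identity) order wins, mirroring the condition
// `Cnt < Pair.second || (Cnt == Pair.second && Pair.first.empty())`.
OrderSketch pickMostUsedOrderSketch(const std::vector<OrderSketch> &NodeOrders) {
  std::map<OrderSketch, unsigned> OrdersUses;
  for (const OrderSketch &O : NodeOrders)
    ++OrdersUses[O];
  if (OrdersUses.empty())
    return {};
  auto Best = OrdersUses.begin();
  for (auto It = std::next(OrdersUses.begin()); It != OrdersUses.end(); ++It)
    if (Best->second < It->second ||
        (Best->second == It->second && It->first.empty()))
      Best = It;
  return Best->first;
}
```

An empty result corresponds to the `if (BestOrder.empty()) continue;` case above: when the identity order is at least as popular as any other, nothing is rotated for that vectorization factor.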

		void BoUpSLP::reorderBottomToTop() {
		SetVector<TreeEntry *> OrderedEntries;
		DenseMap<const TreeEntry *, OrdersType> GathersToOrders;
		// Find all reorderable leaf nodes with the given VF.
		// Currently they are vectorized loads, extracts without alternate operands +
		// some gathering of extracts.
		SmallVector<TreeEntry *> NonVectorized;
		for_each(VectorizableTree, [this, &OrderedEntries, &GathersToOrders,
		&NonVectorized](
		const std::unique_ptr<TreeEntry> &TE) {
		// No need to reorder if need to shuffle reuses, still need to shuffle the
		// node.
		if (!TE->ReuseShuffleIndices.empty())
		return;
		if (TE->State == TreeEntry::Vectorize &&
		isa<LoadInst, ExtractElementInst, ExtractValueInst>(TE->getMainOp()) &&
		!TE->isAltShuffle()) {
		OrderedEntries.insert(TE.get());
		} else if (TE->State == TreeEntry::NeedToGather &&
		TE->getOpcode() == Instruction::ExtractElement &&
		!TE->isAltShuffle() &&
		isa<FixedVectorType>(cast<ExtractElementInst>(TE->getMainOp())
		->getVectorOperandType()) &&
		allSameType(TE->Scalars) && allSameBlock(TE->Scalars)) {
		// Check that gather of extractelements can be represented as
		// just a shuffle of a single vector with a single user only.
		OrdersType CurrentOrder;
		bool Reuse = canReuseExtract(TE->Scalars, TE->getMainOp(), CurrentOrder);
		if ((Reuse || !CurrentOrder.empty()) &&
		!any_of(
		VectorizableTree, [&TE](const std::unique_ptr<TreeEntry> &Entry) {
		return Entry->State == TreeEntry::NeedToGather &&
		Entry.get() != TE.get() && Entry->isSame(TE->Scalars);
		})) {
		OrderedEntries.insert(TE.get());
		GathersToOrders.try_emplace(TE.get(), CurrentOrder);
		}
		}
		if (TE->State != TreeEntry::Vectorize)
		NonVectorized.push_back(TE.get());
		});

		// Checks if the operands of the users are reorderable and have only single
		// use.
		auto &&CheckOperands =
		[this, &NonVectorized](const auto &Data,
		SmallVectorImpl<TreeEntry *> &GatherOps) {
		for (unsigned I = 0, E = Data.first->getNumOperands(); I < E; ++I) {
		if (any_of(Data.second,
		[I](const std::pair<unsigned, TreeEntry *> &OpData) {
		return OpData.first == I &&
		OpData.second->State == TreeEntry::Vectorize;
		}))
		continue;
		ArrayRef<Value *> VL = Data.first->getOperand(I);
		const TreeEntry *TE = nullptr;
		const auto It = find_if(VL, [this, &TE](Value *V) {
		TE = getTreeEntry(V);
		return TE;
		});
		if (It != VL.end() && TE->isSame(VL))
		return false;
		TreeEntry *Gather = nullptr;
		if (count_if(NonVectorized, [VL, &Gather](TreeEntry *TE) {
		assert(TE->State != TreeEntry::Vectorize &&
		"Only non-vectorized nodes are expected.");
		if (TE->isSame(VL)) {
		Gather = TE;
		return true;
		}
		return false;
		}) > 1)
		return false;
		if (Gather)
		GatherOps.push_back(Gather);
		}
		return true;
		};
		// 1. Propagate order to the graph nodes, which use only reordered nodes.
		// I.e., if the node has operands, that are reordered, try to make at least
		// one operand order in the natural order and reorder others + reorder the
		// user node itself.
		SmallPtrSet<const TreeEntry *, 4> Visited;
		while (!OrderedEntries.empty()) {
		// 1. Filter out only reordered nodes.
		// 2. If the entry has multiple uses - skip it and jump to the next node.
		MapVector<TreeEntry *, SmallVector<std::pair<unsigned, TreeEntry *>>> Users;
		SmallVector<TreeEntry *> Filtered;
		for (TreeEntry *TE : OrderedEntries) {
		if (!(TE->State == TreeEntry::Vectorize ||
		(TE->State == TreeEntry::NeedToGather &&
		TE->getOpcode() == Instruction::ExtractElement)) ||
		TE->UserTreeIndices.empty() || !TE->ReuseShuffleIndices.empty() ||
		!all_of(drop_begin(TE->UserTreeIndices),
		[TE](const EdgeInfo &EI) {
		return EI.UserTE == TE->UserTreeIndices.front().UserTE;
		}) ||
		!Visited.insert(TE).second) {
		Filtered.push_back(TE);
		continue;
		}
		// Build a map between user nodes and their operands order to speedup
		// search. The graph currently does not provide this dependency directly.
		for (EdgeInfo &EI : TE->UserTreeIndices) {
		TreeEntry *UserTE = EI.UserTE;
		auto It = Users.find(UserTE);
		if (It == Users.end())
		It = Users.insert({UserTE, {}}).first;
		It->second.emplace_back(EI.EdgeIdx, TE);
		}
		}
		// Erase filtered entries.
		for_each(Filtered,
		[&OrderedEntries](TreeEntry *TE) { OrderedEntries.remove(TE); });
		for (const auto &Data : Users) {
		// Check that operands are used only in the User node.
		SmallVector<TreeEntry *> GatherOps;
		if (!CheckOperands(Data, GatherOps)) {
		for_each(Data.second,
		[&OrderedEntries](const std::pair<unsigned, TreeEntry *> &Op) {
		OrderedEntries.remove(Op.second);
		});
		continue;
		}
		// All operands are reordered and used only in this node - propagate the
		// most used order to the user node.
		DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> OrdersUses;
		SmallPtrSet<const TreeEntry *, 4> VisitedOps;
		for (const auto &Op : Data.second) {
		TreeEntry *OpTE = Op.second;
		if (!OpTE->ReuseShuffleIndices.empty())
		continue;
		const auto &Order = [OpTE, &GathersToOrders]() -> const OrdersType & {
		if (OpTE->State == TreeEntry::NeedToGather)
		return GathersToOrders.find(OpTE)->second;
		return OpTE->ReorderIndices;
		}();
		// Stores actually store the mask, not the order, need to invert.
		if (OpTE->State == TreeEntry::Vectorize && !OpTE->isAltShuffle() &&
		OpTE->getOpcode() == Instruction::Store && !Order.empty()) {
		SmallVector<int> Mask;
		inversePermutation(Order, Mask);
		unsigned E = Order.size();
		OrdersType CurrentOrder(E, E);
		transform(Mask, CurrentOrder.begin(), [E](int Idx) {
		return Idx == UndefMaskElem ? E : static_cast<unsigned>(Idx);
		});
		fixupOrderingIndices(CurrentOrder);
		++OrdersUses.try_emplace(CurrentOrder).first->getSecond();
		} else {
		++OrdersUses.try_emplace(Order).first->getSecond();
		}
		if (VisitedOps.insert(OpTE).second)
		OrdersUses.try_emplace({}, 0).first->getSecond() +=
		OpTE->UserTreeIndices.size();
		--OrdersUses[{}];
		}
		// If no orders - skip current nodes and jump to the next one, if any.
		if (OrdersUses.empty()) {
		for_each(Data.second,
		[&OrderedEntries](const std::pair<unsigned, TreeEntry *> &Op) {
		OrderedEntries.remove(Op.second);
		});
		continue;
		}
		// Choose the best order.
		ArrayRef<unsigned> BestOrder = OrdersUses.begin()->first;
		unsigned Cnt = OrdersUses.begin()->second;
		for (const auto &Pair : llvm::drop_begin(OrdersUses)) {
		if (Cnt < Pair.second || (Cnt == Pair.second && Pair.first.empty())) {
		BestOrder = Pair.first;
		Cnt = Pair.second;
		}
		}
		// Set order of the user node (reordering of operands and user nodes).
		if (BestOrder.empty()) {
		for_each(Data.second,
		[&OrderedEntries](const std::pair<unsigned, TreeEntry *> &Op) {
		OrderedEntries.remove(Op.second);
		});
		continue;
		}
		// Erase operands from OrderedEntries list and adjust their orders.
		VisitedOps.clear();
		SmallVector<int> Mask;
		inversePermutation(BestOrder, Mask);
		SmallVector<int> MaskOrder(BestOrder.size(), UndefMaskElem);
		unsigned E = BestOrder.size();
		transform(BestOrder, MaskOrder.begin(), [E](unsigned I) {
		return I < E ? static_cast<int>(I) : UndefMaskElem;
		});
		for (const std::pair<unsigned, TreeEntry *> &Op : Data.second) {
		TreeEntry *TE = Op.second;
		OrderedEntries.remove(TE);
		if (!VisitedOps.insert(TE).second)
		continue;
		if (!TE->ReuseShuffleIndices.empty() && TE->ReorderIndices.empty()) {
		// Just reorder reuses indices.
		reorderReuses(TE->ReuseShuffleIndices, Mask);
		continue;
		}
		// Gathers are processed separately.
		if (TE->State != TreeEntry::Vectorize)
		continue;
		assert((BestOrder.size() == TE->ReorderIndices.size() ||
		TE->ReorderIndices.empty()) &&
		"Non-matching sizes of user/operand entries.");
		reorderOrder(TE->ReorderIndices, Mask);
		}
		// For gathers just need to reorder its scalars.
		for (TreeEntry *Gather : GatherOps) {
		if (!Gather->ReuseShuffleIndices.empty())
		continue;
		assert(Gather->ReorderIndices.empty() &&
		"Unexpected reordering of gathers.");
		reorderScalars(Gather->Scalars, Mask);
		OrderedEntries.remove(Gather);
		}
		// Reorder operands of the user node and set the ordering for the user
		// node itself.
		if (Data.first->State != TreeEntry::Vectorize ||
		!isa<ExtractElementInst, ExtractValueInst, LoadInst>(
		Data.first->getMainOp()) ||
		Data.first->isAltShuffle())
		Data.first->reorderOperands(Mask);
		if (!isa<InsertElementInst, StoreInst>(Data.first->getMainOp()) ||
		Data.first->isAltShuffle()) {
		reorderScalars(Data.first->Scalars, Mask);
		reorderOrder(Data.first->ReorderIndices, MaskOrder);
		if (Data.first->ReuseShuffleIndices.empty() &&
		!Data.first->ReorderIndices.empty() &&
		!Data.first->isAltShuffle()) {
		// Insert user node to the list to try to sink reordering deeper in
		// the graph.
		OrderedEntries.insert(Data.first);
		}
		} else {
		reorderOrder(Data.first->ReorderIndices, Mask);
		}
		}
		}
		}
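Both reordering passes contain the comment "Stores actually store the mask, not the order, need to invert" before counting a store node's order. A minimal sketch of that conversion is shown below. It folds the `inversePermutation` call and the following `transform` into one loop, skips out-of-range entries (an assumption of this sketch), and leaves out the subsequent `fixupOrderingIndices` call; the name `invertStoreOrderSketch` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Converts the permutation kept on a store node into the order used for
// counting. A result entry equal to the vector size marks a lane for which
// no source index was found.
std::vector<unsigned>
invertStoreOrderSketch(const std::vector<unsigned> &Order) {
  const unsigned E = static_cast<unsigned>(Order.size());
  std::vector<unsigned> Result(E, E);
  for (unsigned I = 0; I < E; ++I)
    if (Order[I] < E) // Skip "unset" entries (assumption of this sketch).
      Result[Order[I]] = I;
  return Result;
}
```

For a full permutation the conversion is its own inverse, so inverting `{2, 0, 1}` yields `{1, 2, 0}` and inverting that again restores `{2, 0, 1}`.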

		void BoUpSLP::buildExternalUses(
		const ExtraValueToDebugLocsMap &ExternallyUsedValues) {
// Collect the values that we need to extract from the tree.		// Collect the values that we need to extract from the tree.
for (auto &TEPtr : VectorizableTree) {		for (auto &TEPtr : VectorizableTree) {
TreeEntry *Entry = TEPtr.get();		TreeEntry *Entry = TEPtr.get();

// No need to handle users of gathered values.		// No need to handle users of gathered values.
if (Entry->State == TreeEntry::NeedToGather)		if (Entry->State == TreeEntry::NeedToGather)
continue;		continue;

[42 lines hidden]	for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
LLVM_DEBUG(dbgs() << "SLP: Need to extract:" << *U << " from lane "		LLVM_DEBUG(dbgs() << "SLP: Need to extract:" << *U << " from lane "
<< Lane << " from " << *Scalar << ".\n");		<< Lane << " from " << *Scalar << ".\n");
ExternalUses.push_back(ExternalUser(Scalar, U, FoundLane));		ExternalUses.push_back(ExternalUser(Scalar, U, FoundLane));
}		}
}		}
}		}
}		}

		void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
		ArrayRef<Value *> UserIgnoreLst) {
		deleteTree();
		UserIgnoreList = UserIgnoreLst;
		if (!allSameType(Roots))
		return;
		buildTree_rec(Roots, 0, EdgeInfo());
		}

		namespace {
		/// Tracks the state in which the loads in the given sequence can be represented.
		enum class LoadsState { Gather, Vectorize, ScatterVectorize };
		RKSimon: This is very similar to the EntryState enum - merge them?
		ABataev (author): I would not do this. Though the values look similar the meaning is completely different. This state handles only loads, while the entry state handles all possible entries kinds. In the future, we may have different values in these enums, which may lead to some unpredictable results. I would keep it as is.
		} // anonymous namespace

		/// Checks if the given array of loads can be represented as a vectorized,
		/// scatter or just simple gather.
		static LoadsState canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0,
		const TargetTransformInfo &TTI,
		const DataLayout &DL, ScalarEvolution &SE,
		SmallVectorImpl<unsigned> &Order,
		SmallVectorImpl<Value *> &PointerOps) {
		// Check that a vectorized load would load the same memory as a scalar
		// load. For example, we don't want to vectorize loads that are smaller
		// than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM
		// treats loading/storing it as an i8 struct. If we vectorize loads/stores
		// from such a struct, we read/write packed bits disagreeing with the
		// unvectorized version.
		Type *ScalarTy = VL0->getType();

		if (DL.getTypeSizeInBits(ScalarTy) != DL.getTypeAllocSizeInBits(ScalarTy))
		return LoadsState::Gather;

		// Make sure all loads in the bundle are simple - we can't vectorize
		// atomic or volatile loads.
		PointerOps.clear();
		PointerOps.resize(VL.size());
		auto *POIter = PointerOps.begin();
		for (Value *V : VL) {
		auto *L = cast<LoadInst>(V);
		if (!L->isSimple())
		return LoadsState::Gather;
		*POIter = L->getPointerOperand();
		++POIter;
		}

		Order.clear();
		// Check the order of pointer operands.
		if (llvm::sortPtrAccesses(PointerOps, ScalarTy, DL, SE, Order)) {
		Value *Ptr0;
		Value *PtrN;
		if (Order.empty()) {
		Ptr0 = PointerOps.front();
		PtrN = PointerOps.back();
		} else {
		Ptr0 = PointerOps[Order.front()];
		PtrN = PointerOps[Order.back()];
		}
		Optional<int> Diff =
		getPointersDiff(ScalarTy, Ptr0, ScalarTy, PtrN, DL, SE);
		// Check that the sorted loads are consecutive.
		if (static_cast<unsigned>(*Diff) == VL.size() - 1)
		return LoadsState::Vectorize;
		Align CommonAlignment = cast<LoadInst>(VL0)->getAlign();
		for (Value *V : VL)
		CommonAlignment =
		commonAlignment(CommonAlignment, cast<LoadInst>(V)->getAlign());
		if (TTI.isLegalMaskedGather(FixedVectorType::get(ScalarTy, VL.size()),
		CommonAlignment))
		return LoadsState::ScatterVectorize;
		}

		return LoadsState::Gather;
		}
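The decision made by `canVectorizeLoads` can be illustrated with a toy model in which each load is represented only by its element offset from a common base. This is a sketch under several assumptions: offsets are always known (the real code relies on `sortPtrAccesses` and `getPointersDiff`, which can fail), duplicate offsets are treated as a gather, and the legality query `TTI.isLegalMaskedGather` is replaced by a plain boolean parameter. The names `LoadsStateSketch` and `classifyLoadsSketch` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

enum class LoadsStateSketch { Gather, Vectorize, ScatterVectorize };

// Classifies a bundle of loads given their element offsets. On success,
// Order holds the sorting permutation, or stays empty if the offsets were
// already sorted.
LoadsStateSketch classifyLoadsSketch(const std::vector<long> &Offsets,
                                     bool MaskedGatherLegal,
                                     std::vector<unsigned> &Order) {
  Order.clear();
  if (Offsets.empty())
    return LoadsStateSketch::Gather;
  std::vector<unsigned> Sorted(Offsets.size());
  std::iota(Sorted.begin(), Sorted.end(), 0u);
  std::stable_sort(Sorted.begin(), Sorted.end(),
                   [&Offsets](unsigned A, unsigned B) {
                     return Offsets[A] < Offsets[B];
                   });
  // Two loads from the same address: give up and gather (assumption).
  for (std::size_t I = 1; I < Sorted.size(); ++I)
    if (Offsets[Sorted[I - 1]] == Offsets[Sorted[I]])
      return LoadsStateSketch::Gather;
  if (!std::is_sorted(Offsets.begin(), Offsets.end()))
    Order = Sorted;
  // Consecutive if the distance between first and last is exactly N - 1.
  const long Diff = Offsets[Sorted.back()] - Offsets[Sorted.front()];
  if (static_cast<std::size_t>(Diff) == Offsets.size() - 1)
    return LoadsStateSketch::Vectorize;
  return MaskedGatherLegal ? LoadsStateSketch::ScatterVectorize
                           : LoadsStateSketch::Gather;
}
```

Offsets `{2, 0, 1, 3}` are consecutive after sorting, so the sketch reports `Vectorize` with order `{1, 2, 0, 3}`; offsets `{0, 2, 4, 6}` are strided and fall back to `ScatterVectorize` or `Gather` depending on the legality flag.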

void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,		void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
const EdgeInfo &UserTreeIdx) {		const EdgeInfo &UserTreeIdx) {
assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");		assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");

InstructionsState S = getSameOpcode(VL);		InstructionsState S = getSameOpcode(VL);
if (Depth == RecursionMaxDepth) {		if (Depth == RecursionMaxDepth) {
LLVM_DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");		LLVM_DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);
return;		return;
}		}

// Don't handle scalable vectors		// Don't handle scalable vectors
if (S.getOpcode() == Instruction::ExtractElement &&		if (S.getOpcode() == Instruction::ExtractElement &&
isa<ScalableVectorType>(		isa<ScalableVectorType>(
cast<ExtractElementInst>(S.OpValue)->getVectorOperandType())) {		cast<ExtractElementInst>(S.OpValue)->getVectorOperandType())) {
LLVM_DEBUG(dbgs() << "SLP: Gathering due to scalable vector type.\n");		LLVM_DEBUG(dbgs() << "SLP: Gathering due to scalable vector type.\n");
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);
return;		return;
}		}

// Don't handle vectors.		// Don't handle vectors.
if (S.OpValue->getType()->isVectorTy() &&		if (S.OpValue->getType()->isVectorTy() &&
!isa<InsertElementInst>(S.OpValue)) {		!isa<InsertElementInst>(S.OpValue)) {
LLVM_DEBUG(dbgs() << "SLP: Gathering due to vector type.\n");		LLVM_DEBUG(dbgs() << "SLP: Gathering due to vector type.\n");
		RKSimon: Default value?
		ABataev (author): It is initialized with zeroes by default.
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);
return;		return;
}		}

if (StoreInst *SI = dyn_cast<StoreInst>(S.OpValue))		if (StoreInst *SI = dyn_cast<StoreInst>(S.OpValue))
if (SI->getValueOperand()->getType()->isVectorTy()) {		if (SI->getValueOperand()->getType()->isVectorTy()) {
LLVM_DEBUG(dbgs() << "SLP: Gathering due to store vector type.\n");		LLVM_DEBUG(dbgs() << "SLP: Gathering due to store vector type.\n");
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);
[69 lines hidden]	if (!DT->isReachableFromEntry(BB)) {
// Don't go into unreachable blocks. They may contain instructions with		// Don't go into unreachable blocks. They may contain instructions with
// dependency cycles which confuse the final scheduling.		// dependency cycles which confuse the final scheduling.
LLVM_DEBUG(dbgs() << "SLP: bundle in unreachable block.\n");		LLVM_DEBUG(dbgs() << "SLP: bundle in unreachable block.\n");
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);
return;		return;
}		}

// Check that every instruction appears once in this bundle.		// Check that every instruction appears once in this bundle.
SmallVector<unsigned, 4> ReuseShuffleIndicies;		SmallVector<int> ReuseShuffleIndicies;
SmallVector<Value *, 4> UniqueValues;		SmallVector<Value *, 4> UniqueValues;
DenseMap<Value *, unsigned> UniquePositions;		DenseMap<Value *, unsigned> UniquePositions;
for (Value *V : VL) {		for (Value *V : VL) {
auto Res = UniquePositions.try_emplace(V, UniqueValues.size());		auto Res = UniquePositions.try_emplace(V, UniqueValues.size());
ReuseShuffleIndicies.emplace_back(Res.first->second);		ReuseShuffleIndicies.emplace_back(Res.first->second);
if (Res.second)		if (Res.second)
UniqueValues.emplace_back(V);		UniqueValues.emplace_back(V);
}		}
[77 lines hidden]	case Instruction::PHI: {
return;		return;
}		}
case Instruction::ExtractValue:		case Instruction::ExtractValue:
case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
OrdersType CurrentOrder;		OrdersType CurrentOrder;
bool Reuse = canReuseExtract(VL, VL0, CurrentOrder);		bool Reuse = canReuseExtract(VL, VL0, CurrentOrder);
if (Reuse) {		if (Reuse) {
LLVM_DEBUG(dbgs() << "SLP: Reusing or shuffling extract sequence.\n");		LLVM_DEBUG(dbgs() << "SLP: Reusing or shuffling extract sequence.\n");
++NumOpsWantToKeepOriginalOrder;
newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
// This is a special case, as it does not gather, but at the same time		// This is a special case, as it does not gather, but at the same time
// we are not extending buildTree_rec() towards the operands.		// we are not extending buildTree_rec() towards the operands.
ValueList Op0;		ValueList Op0;
Op0.assign(VL.size(), VL0->getOperand(0));		Op0.assign(VL.size(), VL0->getOperand(0));
VectorizableTree.back()->setOperand(0, Op0);		VectorizableTree.back()->setOperand(0, Op0);
return;		return;
}		}
if (!CurrentOrder.empty()) {		if (!CurrentOrder.empty()) {
LLVM_DEBUG({		LLVM_DEBUG({
dbgs() << "SLP: Reusing or shuffling of reordered extract sequence "		dbgs() << "SLP: Reusing or shuffling of reordered extract sequence "
"with order";		"with order";
for (unsigned Idx : CurrentOrder)		for (unsigned Idx : CurrentOrder)
dbgs() << " " << Idx;		dbgs() << " " << Idx;
dbgs() << "\n";		dbgs() << "\n";
});		});
		fixupOrderingIndices(CurrentOrder);
// Insert new order with initial value 0, if it does not exist,		// Insert new order with initial value 0, if it does not exist,
// otherwise return the iterator to the existing one.		// otherwise return the iterator to the existing one.
newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies, CurrentOrder);		ReuseShuffleIndicies, CurrentOrder);
findRootOrder(CurrentOrder);
++NumOpsWantToKeepOrder[CurrentOrder];
// This is a special case, as it does not gather, but at the same time		// This is a special case, as it does not gather, but at the same time
// we are not extending buildTree_rec() towards the operands.		// we are not extending buildTree_rec() towards the operands.
ValueList Op0;		ValueList Op0;
Op0.assign(VL.size(), VL0->getOperand(0));		Op0.assign(VL.size(), VL0->getOperand(0));
VectorizableTree.back()->setOperand(0, Op0);		VectorizableTree.back()->setOperand(0, Op0);
return;		return;
}		}
LLVM_DEBUG(dbgs() << "SLP: Gather extract sequence.\n");		LLVM_DEBUG(dbgs() << "SLP: Gather extract sequence.\n");
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
return;		return;
}		}
case Instruction::InsertElement: {		case Instruction::InsertElement: {
assert(ReuseShuffleIndicies.empty() && "All inserts should be unique");		assert(ReuseShuffleIndicies.empty() && "All inserts should be unique");

// Check that we have a buildvector and not a shuffle of 2 or more		// Check that we have a buildvector and not a shuffle of 2 or more
// different vectors.		// different vectors.
ValueSet SourceVectors;		ValueSet SourceVectors;
for (Value *V : VL)		int MinIdx = std::numeric_limits<int>::max();
		for (Value *V : VL) {
SourceVectors.insert(cast<Instruction>(V)->getOperand(0));		SourceVectors.insert(cast<Instruction>(V)->getOperand(0));
		Optional<int> Idx = *getInsertIndex(V, 0);
		if (!Idx || *Idx == UndefMaskElem)
		continue;
		MinIdx = std::min(MinIdx, *Idx);
		}

if (count_if(VL, [&SourceVectors](Value *V) {		if (count_if(VL, [&SourceVectors](Value *V) {
return !SourceVectors.contains(V);		return !SourceVectors.contains(V);
}) >= 2) {		}) >= 2) {
// Found 2nd source vector - cancel.		// Found 2nd source vector - cancel.
LLVM_DEBUG(dbgs() << "SLP: Gather of insertelement vectors with "		LLVM_DEBUG(dbgs() << "SLP: Gather of insertelement vectors with "
"different source vectors.\n");		"different source vectors.\n");
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);
ReuseShuffleIndicies);
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
return;		return;
}		}

TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx);		auto OrdCompare = [](const std::pair<int, int> &P1,
		const std::pair<int, int> &P2) {
		return P1.first > P2.first;
		};
		PriorityQueue<std::pair<int, int>, SmallVector<std::pair<int, int>>,
		decltype(OrdCompare)>
		Indices(OrdCompare);
		RKSimon: Indices?
		ABataev (author): Yep, will fix it, thanks!
		for (int I = 0, E = VL.size(); I < E; ++I) {
		Optional<int> Idx = *getInsertIndex(VL[I], 0);
		if (!Idx || *Idx == UndefMaskElem)
		continue;
		Indices.emplace(*Idx, I);
		}
		OrdersType CurrentOrder(VL.size(), VL.size());
		bool IsIdentity = true;
		for (int I = 0, E = VL.size(); I < E; ++I) {
		CurrentOrder[Indices.top().second] = I;
		IsIdentity &= Indices.top().second == I;
		Indices.pop();
		}
		if (IsIdentity)
		CurrentOrder.clear();
		TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
		None, CurrentOrder);
LLVM_DEBUG(dbgs() << "SLP: added inserts bundle.\n");		LLVM_DEBUG(dbgs() << "SLP: added inserts bundle.\n");

constexpr int NumOps = 2;		constexpr int NumOps = 2;
ValueList VectorOperands[NumOps];		ValueList VectorOperands[NumOps];
for (int I = 0; I < NumOps; ++I) {		for (int I = 0; I < NumOps; ++I) {
for (Value *V : VL)		for (Value *V : VL)
VectorOperands[I].push_back(cast<Instruction>(V)->getOperand(I));		VectorOperands[I].push_back(cast<Instruction>(V)->getOperand(I));

TE->setOperand(I, VectorOperands[I]);		TE->setOperand(I, VectorOperands[I]);
}		}
buildTree_rec(VectorOperands[NumOps - 1], Depth + 1, {TE, 0});		buildTree_rec(VectorOperands[NumOps - 1], Depth + 1, {TE, NumOps - 1});
return;		return;
}		}
case Instruction::Load: {		case Instruction::Load: {
// Check that a vectorized load would load the same memory as a scalar		// Check that a vectorized load would load the same memory as a scalar
// load. For example, we don't want to vectorize loads that are smaller		// load. For example, we don't want to vectorize loads that are smaller
// than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM		// than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM
// treats loading/storing it as an i8 struct. If we vectorize loads/stores		// treats loading/storing it as an i8 struct. If we vectorize loads/stores
// from such a struct, we read/write packed bits disagreeing with the		// from such a struct, we read/write packed bits disagreeing with the
// unvectorized version.		// unvectorized version.
Type *ScalarTy = VL0->getType();		SmallVector<Value *> PointerOps;

if (DL->getTypeSizeInBits(ScalarTy) !=
DL->getTypeAllocSizeInBits(ScalarTy)) {
BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);
LLVM_DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");
return;
}

// Make sure all loads in the bundle are simple - we can't vectorize
// atomic or volatile loads.
SmallVector<Value *, 4> PointerOps(VL.size());
auto POIter = PointerOps.begin();
for (Value *V : VL) {
auto *L = cast<LoadInst>(V);
if (!L->isSimple()) {
BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);
LLVM_DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");
return;
}
*POIter = L->getPointerOperand();
++POIter;
}

OrdersType CurrentOrder;		OrdersType CurrentOrder;
// Check the order of pointer operands.		TreeEntry *TE = nullptr;
if (llvm::sortPtrAccesses(PointerOps, ScalarTy, DL, SE, CurrentOrder)) {		switch (canVectorizeLoads(VL, VL0, TTI, DL, *SE, CurrentOrder,
Value *Ptr0;		PointerOps)) {
Value *PtrN;		case LoadsState::Vectorize:
if (CurrentOrder.empty()) {
Ptr0 = PointerOps.front();
PtrN = PointerOps.back();
} else {
Ptr0 = PointerOps[CurrentOrder.front()];
PtrN = PointerOps[CurrentOrder.back()];
}
Optional<int> Diff = getPointersDiff(
ScalarTy, Ptr0, ScalarTy, PtrN, DL, SE);
// Check that the sorted loads are consecutive.
if (static_cast<unsigned>(*Diff) == VL.size() - 1) {
if (CurrentOrder.empty()) {		if (CurrentOrder.empty()) {
// Original loads are consecutive and does not require reordering.		// Original loads are consecutive and does not require reordering.
++NumOpsWantToKeepOriginalOrder;		TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S,		ReuseShuffleIndicies);
UserTreeIdx, ReuseShuffleIndicies);
TE->setOperandsInOrder();
LLVM_DEBUG(dbgs() << "SLP: added a vector of loads.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of loads.\n");
} else {		} else {
		fixupOrderingIndices(CurrentOrder);
// Need to reorder.		// Need to reorder.
TreeEntry *TE =		TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies, CurrentOrder);		ReuseShuffleIndicies, CurrentOrder);
TE->setOperandsInOrder();
LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled loads.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled loads.\n");
findRootOrder(CurrentOrder);
++NumOpsWantToKeepOrder[CurrentOrder];
}
return;
}		}
Align CommonAlignment = cast<LoadInst>(VL0)->getAlign();		TE->setOperandsInOrder();
for (Value *V : VL)		break;
CommonAlignment =		case LoadsState::ScatterVectorize:
commonAlignment(CommonAlignment, cast<LoadInst>(V)->getAlign());
if (TTI->isLegalMaskedGather(FixedVectorType::get(ScalarTy, VL.size()),
CommonAlignment)) {
// Vectorizing non-consecutive loads with `llvm.masked.gather`.		// Vectorizing non-consecutive loads with `llvm.masked.gather`.
TreeEntry *TE = newTreeEntry(VL, TreeEntry::ScatterVectorize, Bundle,		TE = newTreeEntry(VL, TreeEntry::ScatterVectorize, Bundle, S,
S, UserTreeIdx, ReuseShuffleIndicies);		UserTreeIdx, ReuseShuffleIndicies);
TE->setOperandsInOrder();		TE->setOperandsInOrder();
buildTree_rec(PointerOps, Depth + 1, {TE, 0});		buildTree_rec(PointerOps, Depth + 1, {TE, 0});
LLVM_DEBUG(dbgs()		LLVM_DEBUG(dbgs() << "SLP: added a vector of non-consecutive loads.\n");
<< "SLP: added a vector of non-consecutive loads.\n");		break;
return;		case LoadsState::Gather:
}
}

LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
		#ifndef NDEBUG
		Type *ScalarTy = VL0->getType();
		if (DL->getTypeSizeInBits(ScalarTy) !=
		DL->getTypeAllocSizeInBits(ScalarTy))
		LLVM_DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");
		else if (any_of(VL, [](Value *V) {
		return !cast<LoadInst>(V)->isSimple();
		}))
		LLVM_DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");
		else
		LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
		#endif // NDEBUG
		break;
		}
return;		return;
}		}
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
case Instruction::FPToUI:		case Instruction::FPToUI:
case Instruction::FPToSI:		case Instruction::FPToSI:
case Instruction::FPExt:		case Instruction::FPExt:
case Instruction::PtrToInt:		case Instruction::PtrToInt:
(230 lines not shown)		case Instruction::Store: {
PtrN = PointerOps[CurrentOrder.back()];		PtrN = PointerOps[CurrentOrder.back()];
}		}
Optional<int> Dist =		Optional<int> Dist =
getPointersDiff(ScalarTy, Ptr0, ScalarTy, PtrN, DL, SE);		getPointersDiff(ScalarTy, Ptr0, ScalarTy, PtrN, DL, SE);
// Check that the sorted pointer operands are consecutive.		// Check that the sorted pointer operands are consecutive.
if (static_cast<unsigned>(*Dist) == VL.size() - 1) {		if (static_cast<unsigned>(*Dist) == VL.size() - 1) {
if (CurrentOrder.empty()) {		if (CurrentOrder.empty()) {
// Original stores are consecutive and does not require reordering.		// Original stores are consecutive and does not require reordering.
++NumOpsWantToKeepOriginalOrder;
TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S,		TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S,
UserTreeIdx, ReuseShuffleIndicies);		UserTreeIdx, ReuseShuffleIndicies);
TE->setOperandsInOrder();		TE->setOperandsInOrder();
buildTree_rec(Operands, Depth + 1, {TE, 0});		buildTree_rec(Operands, Depth + 1, {TE, 0});
LLVM_DEBUG(dbgs() << "SLP: added a vector of stores.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of stores.\n");
} else {		} else {
		fixupOrderingIndices(CurrentOrder);
TreeEntry *TE =		TreeEntry *TE =
newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies, CurrentOrder);		ReuseShuffleIndicies, CurrentOrder);
TE->setOperandsInOrder();		TE->setOperandsInOrder();
buildTree_rec(Operands, Depth + 1, {TE, 0});		buildTree_rec(Operands, Depth + 1, {TE, 0});
LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled stores.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled stores.\n");
findRootOrder(CurrentOrder);
++NumOpsWantToKeepOrder[CurrentOrder];
}		}
return;		return;
}		}
}		}

BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
(314 lines not shown)		for (auto *V : VL) {
// cost to extract the a vector with EltsPerVector elements.		// cost to extract the a vector with EltsPerVector elements.
Cost += TTI.getShuffleCost(		Cost += TTI.getShuffleCost(
TargetTransformInfo::SK_PermuteSingleSrc,		TargetTransformInfo::SK_PermuteSingleSrc,
FixedVectorType::get(VecTy->getElementType(), EltsPerVector));		FixedVectorType::get(VecTy->getElementType(), EltsPerVector));
}		}
return Cost;		return Cost;
}		}

/// Shuffles \p Mask in accordance with the given \p SubMask.		/// Build shuffle mask for shuffle graph entries and lists of main and alternate
static void addMask(SmallVectorImpl<int> &Mask, ArrayRef<int> SubMask) {		/// operations operands.
if (SubMask.empty())		static void
return;		buildSuffleEntryMask(ArrayRef<Value *> VL, ArrayRef<unsigned> ReorderIndices,
if (Mask.empty()) {		ArrayRef<int> ReusesIndices,
Mask.append(SubMask.begin(), SubMask.end());		const function_ref<bool(Instruction *)> IsAltOp,
return;		SmallVectorImpl<int> &Mask,
}		SmallVectorImpl<Value *> *OpScalars = nullptr,
SmallVector<int, 4> NewMask(SubMask.size(), SubMask.size());		SmallVectorImpl<Value *> *AltScalars = nullptr) {
int TermValue = std::min(Mask.size(), SubMask.size());		unsigned Sz = VL.size();
for (int I = 0, E = SubMask.size(); I < E; ++I) {		Mask.assign(Sz, UndefMaskElem);
if (SubMask[I] >= TermValue || SubMask[I] == UndefMaskElem ||		SmallVector<int> OrderMask;
Mask[SubMask[I]] >= TermValue) {		if (!ReorderIndices.empty())
NewMask[I] = UndefMaskElem;		inversePermutation(ReorderIndices, OrderMask);
continue;		for (unsigned I = 0; I < Sz; ++I) {
}		unsigned Idx = I;
NewMask[I] = Mask[SubMask[I]];		if (!ReorderIndices.empty())
}		Idx = OrderMask[I];
		auto *OpInst = cast<Instruction>(VL[Idx]);
		if (IsAltOp(OpInst)) {
		Mask[I] = Sz + Idx;
		if (AltScalars)
		AltScalars->push_back(OpInst);
		} else {
		Mask[I] = Idx;
		if (OpScalars)
		OpScalars->push_back(OpInst);
		}
		}
		if (!ReusesIndices.empty()) {
		SmallVector<int> NewMask(ReusesIndices.size(), UndefMaskElem);
		transform(ReusesIndices, NewMask.begin(), [&Mask](int Idx) {
		return Idx != UndefMaskElem ? Mask[Idx] : UndefMaskElem;
		});
Mask.swap(NewMask);		Mask.swap(NewMask);
}		}
		}
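The mask construction in the new helper can be followed more easily outside of LLVM. The sketch below mirrors its three steps (invert the reorder indices, tag alternate-opcode lanes by adding the vector size, then apply the reuse indices) using plain `std::vector`. The names `UndefElem`, `inverseOf` and `buildMask` are hypothetical stand-ins, and the sketch assumes the reorder indices form a valid permutation; it is an illustration of the idea, not the code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for UndefMaskElem.
constexpr int UndefElem = -1;

// Inverts a permutation: Result[Order[I]] = I.
// Assumes Order is a valid permutation of 0..N-1.
std::vector<int> inverseOf(const std::vector<unsigned> &Order) {
  std::vector<int> Result(Order.size(), UndefElem);
  for (std::size_t I = 0; I < Order.size(); ++I)
    Result[Order[I]] = static_cast<int>(I);
  return Result;
}

// IsAlt[I] says whether scalar I uses the alternate opcode.
// Order and Reuses may be empty, meaning "identity".
std::vector<int> buildMask(const std::vector<bool> &IsAlt,
                           const std::vector<unsigned> &Order,
                           const std::vector<int> &Reuses) {
  const unsigned Sz = static_cast<unsigned>(IsAlt.size());
  std::vector<int> OrderMask;
  if (!Order.empty())
    OrderMask = inverseOf(Order);

  std::vector<int> Mask(Sz, UndefElem);
  for (unsigned I = 0; I < Sz; ++I) {
    const unsigned Idx =
        Order.empty() ? I : static_cast<unsigned>(OrderMask[I]);
    // Lanes taken from the alternate-opcode vector are offset by Sz,
    // as in a two-source shuffle mask.
    Mask[I] = static_cast<int>(IsAlt[Idx] ? Sz + Idx : Idx);
  }

  if (Reuses.empty())
    return Mask;

  // Remap through the reuse indices, keeping undefined lanes undefined.
  std::vector<int> NewMask(Reuses.size(), UndefElem);
  for (std::size_t I = 0; I < Reuses.size(); ++I)
    NewMask[I] = Reuses[I] != UndefElem ? Mask[Reuses[I]] : UndefElem;
  return NewMask;
}
```

For a four-wide add/sub pattern (`IsAlt = {false, true, false, true}`) with no reordering and no reuses, this yields the select mask `{0, 5, 2, 7}`.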

InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E,		InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E,
ArrayRef<Value *> VectorizedVals) {		ArrayRef<Value *> VectorizedVals) {
ArrayRef<Value*> VL = E->Scalars;		ArrayRef<Value*> VL = E->Scalars;

Type *ScalarTy = VL[0]->getType();		Type *ScalarTy = VL[0]->getType();
if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))		if (StoreInst *SI = dyn_cast<StoreInst>(VL[0]))
ScalarTy = SI->getValueOperand()->getType();		ScalarTy = SI->getValueOperand()->getType();
(146 lines not shown)		if (E->getOpcode() == Instruction::ExtractElement && allSameType(VL) &&
FinalVecTy, E->ReuseShuffleIndices);		FinalVecTy, E->ReuseShuffleIndices);
return Cost;		return Cost;
}		}
}		}
InstructionCost ReuseShuffleCost = 0;		InstructionCost ReuseShuffleCost = 0;
if (NeedToShuffleReuses)		if (NeedToShuffleReuses)
ReuseShuffleCost = TTI->getShuffleCost(		ReuseShuffleCost = TTI->getShuffleCost(
TTI::SK_PermuteSingleSrc, FinalVecTy, E->ReuseShuffleIndices);		TTI::SK_PermuteSingleSrc, FinalVecTy, E->ReuseShuffleIndices);
		// Improve gather cost for gather of loads, if we can group some of the
		// loads into vector loads.
		if (VL.size() > 2 && E->getOpcode() == Instruction::Load &&
		!E->isAltShuffle()) {
		BoUpSLP::ValueSet VectorizedLoads;
		unsigned StartIdx = 0;
		unsigned VF = VL.size() / 2;
		unsigned VectorizedCnt = 0;
		unsigned ScatterVectorizeCnt = 0;
		const unsigned Sz = DL->getTypeSizeInBits(E->getMainOp()->getType());
		for (unsigned MinVF = getMinVF(2 * Sz); VF >= MinVF; VF /= 2) {
		bjope: With my OOT target I ended up here with getMinVecRegSize() returning 32. E->getMainOp was a load like this `%3 = load i24, i24* getelementptr inbounds ([128 x i24], [128 x i24]* @a_ua, i16 0, i16 3)`, so Sz was 24. That gives a MinVF that is 0. And after some iterations the inner loop was entered with VF=0, which gives a slice that is empty, hitting assertions when doing Slice.front(). I haven't reduced this failure for any in-tree target (a bit short on people at the office here in the middle of summer). But maybe you should make sure MinVF doesn't go below 1 here (or maybe not even below 2) to avoid getting a VF that is less than 1 (or less than 2). Or check that VF is at least 1 (or 2) in the loop guard. (I do not know if VF=1 makes any sense. Hence the alternatives above regarding 1 or 2.)
		ABataev: The fix is ready (see D107058), will commit it tomorrow.
		bjope: Ok, thanks!
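The failure described in this thread can be reproduced with plain integers. The sketch below assumes the minimum VF is computed as an unclamped integer division of the minimum vector register size by the element size, which matches the numbers quoted in the comment (32 / (2 * 24) = 0). The helpers `minVFUnclamped`, `minVFClamped` and `visitedVFs` are hypothetical and are not part of the patch or of D107058. The clamp to 2 is only one of the alternatives suggested above.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical model of the minimum-VF computation: no lower bound, so a
// register size smaller than the element size yields 0.
unsigned minVFUnclamped(unsigned MinVecRegSize, unsigned Sz) {
  return MinVecRegSize / Sz;
}

// Same computation, clamped so it never drops below 2 (an assumption,
// not necessarily the committed fix).
unsigned minVFClamped(unsigned MinVecRegSize, unsigned Sz) {
  return std::max(2u, MinVecRegSize / Sz);
}

// Lists the VF values the halving loop visits for NumLoads scalars when no
// slice is vectorizable. The walk stops once VF reaches 0, since that is
// where the real loop would build an empty slice and assert.
std::vector<unsigned> visitedVFs(unsigned NumLoads, unsigned MinVF) {
  std::vector<unsigned> Seen;
  for (unsigned VF = NumLoads / 2; VF >= MinVF; VF /= 2) {
    Seen.push_back(VF);
    if (VF == 0)
      break;
  }
  return Seen;
}
```

With eight `i24` loads and a 32-bit minimum register size, the unclamped bound is 0 and the loop walks through VF = 4, 2, 1 and then 0. With the clamped bound it stops after VF = 2.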
		for (unsigned Cnt = StartIdx, End = VL.size(); Cnt + VF <= End;
		Cnt += VF) {
		ArrayRef<Value *> Slice = VL.slice(Cnt, VF);
		if (!VectorizedLoads.count(Slice.front()) &&
		!VectorizedLoads.count(Slice.back()) && allSameBlock(Slice)) {
		SmallVector<Value *> PointerOps;
		OrdersType CurrentOrder;
		LoadsState LS = canVectorizeLoads(Slice, Slice.front(), TTI, DL,
		*SE, CurrentOrder, PointerOps);
		switch (LS) {
		case LoadsState::Vectorize:
		case LoadsState::ScatterVectorize:
		// Mark the vectorized loads so that we don't vectorize them
		// again.
		if (LS == LoadsState::Vectorize)
		++VectorizedCnt;
		else
		++ScatterVectorizeCnt;
		VectorizedLoads.insert(Slice.begin(), Slice.end());
		// If we vectorized initial block, no need to try to vectorize it
		// again.
		if (Cnt == StartIdx)
		StartIdx += VF;
		break;
		case LoadsState::Gather:
		break;
		}
		}
		}
		// Check if the whole array was vectorized already - exit.
		if (StartIdx >= VL.size())
		break;
		// Found vectorizable parts - exit.
		if (!VectorizedLoads.empty())
		break;
		}
		if (!VectorizedLoads.empty()) {
		InstructionCost GatherCost = 0;
		// Get the cost for gathered loads.
		for (unsigned I = 0, End = VL.size(); I < End; I += VF) {
		if (VectorizedLoads.contains(VL[I]))
		continue;
		GatherCost += getGatherCost(VL.slice(I, VF));
		}
		// The cost for vectorized loads.
		InstructionCost ScalarsCost = 0;
		for (Value *V : VectorizedLoads) {
		auto *LI = cast<LoadInst>(V);
		ScalarsCost += TTI->getMemoryOpCost(
		Instruction::Load, LI->getType(), LI->getAlign(),
		LI->getPointerAddressSpace(), CostKind, LI);
		}
		auto *LI = cast<LoadInst>(E->getMainOp());
		auto *LoadTy = FixedVectorType::get(LI->getType(), VF);
		Align Alignment = LI->getAlign();
		GatherCost +=
		VectorizedCnt *
		TTI->getMemoryOpCost(Instruction::Load, LoadTy, Alignment,
		LI->getPointerAddressSpace(), CostKind, LI);
		GatherCost += ScatterVectorizeCnt *
		TTI->getGatherScatterOpCost(
		Instruction::Load, LoadTy, LI->getPointerOperand(),
		/*VariableMask=*/false, Alignment, CostKind, LI);
		// Add the cost for the subvectors shuffling.
		GatherCost += ((VL.size() - VF) / VF) *
		TTI->getShuffleCost(TTI::SK_Select, VecTy);
		return ReuseShuffleCost + GatherCost - ScalarsCost;
		}
		}
return ReuseShuffleCost + getGatherCost(VL);		return ReuseShuffleCost + getGatherCost(VL);
}		}
		RKSimon: Can we remove the loop and avoid the repeated call to TTI->getShuffleCost(TTI::SK_Select, VecTy)?
InstructionCost CommonCost = 0;		InstructionCost CommonCost = 0;
		RKSimon: Is it worth doing the CommonCost -> ReuseShuffleCost refactor as a NFC pre-commit to simplify this patch?
		ABataev: Will check.
SmallVector<int> Mask;		SmallVector<int> Mask;
if (!E->ReorderIndices.empty()) {		if (!E->ReorderIndices.empty()) {
SmallVector<int> NewMask;		SmallVector<int> NewMask;
if (E->getOpcode() == Instruction::Store) {		if (E->getOpcode() == Instruction::Store) {
// For stores the order is actually a mask.		// For stores the order is actually a mask.
NewMask.resize(E->ReorderIndices.size());		NewMask.resize(E->ReorderIndices.size());
copy(E->ReorderIndices, NewMask.begin());		copy(E->ReorderIndices, NewMask.begin());
} else {		} else {
(73 lines not shown)		case Instruction::ExtractElement: {
TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, I);		TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, I);
}		}
} else {		} else {
AdjustExtractsCost(CommonCost, /*IsGather=*/false);		AdjustExtractsCost(CommonCost, /*IsGather=*/false);
}		}
return CommonCost;		return CommonCost;
}		}
case Instruction::InsertElement: {		case Instruction::InsertElement: {
		assert(E->ReuseShuffleIndices.empty() &&
		"Unique insertelements only are expected.");
auto *SrcVecTy = cast<FixedVectorType>(VL0->getType());		auto *SrcVecTy = cast<FixedVectorType>(VL0->getType());

unsigned const NumElts = SrcVecTy->getNumElements();		unsigned const NumElts = SrcVecTy->getNumElements();
unsigned const NumScalars = VL.size();		unsigned const NumScalars = VL.size();
APInt DemandedElts = APInt::getZero(NumElts);		APInt DemandedElts = APInt::getZero(NumElts);
// TODO: Add support for Instruction::InsertValue.		// TODO: Add support for Instruction::InsertValue.
unsigned Offset = UINT_MAX;		SmallVector<int> Mask;
		if (!E->ReorderIndices.empty()) {
		inversePermutation(E->ReorderIndices, Mask);
		Mask.append(NumElts - NumScalars, UndefMaskElem);
		} else {
		Mask.assign(NumElts, UndefMaskElem);
		std::iota(Mask.begin(), std::next(Mask.begin(), NumScalars), 0);
		}
		unsigned Offset = *getInsertIndex(VL0, 0);
bool IsIdentity = true;		bool IsIdentity = true;
SmallVector<int> ShuffleMask(NumElts, UndefMaskElem);		SmallVector<int> PrevMask(NumElts, UndefMaskElem);
		Mask.swap(PrevMask);
for (unsigned I = 0; I < NumScalars; ++I) {		for (unsigned I = 0; I < NumScalars; ++I) {
Optional<int> InsertIdx = getInsertIndex(VL[I], 0);		Optional<int> InsertIdx = getInsertIndex(VL[PrevMask[I]], 0);
if (!InsertIdx || *InsertIdx == UndefMaskElem)		if (!InsertIdx || *InsertIdx == UndefMaskElem)
continue;		continue;
unsigned Idx = *InsertIdx;		DemandedElts.setBit(*InsertIdx);
DemandedElts.setBit(Idx);		IsIdentity &= *InsertIdx - Offset == I;
if (Idx < Offset) {		Mask[*InsertIdx - Offset] = I;
Offset = Idx;
IsIdentity &= I == 0;
} else {
assert(Idx >= Offset && "Failed to find vector index offset");
IsIdentity &= Idx - Offset == I;
}
ShuffleMask[Idx] = I;
}		}
assert(Offset < NumElts && "Failed to find vector index offset");		assert(Offset < NumElts && "Failed to find vector index offset");

InstructionCost Cost = 0;		InstructionCost Cost = 0;
Cost -= TTI->getScalarizationOverhead(SrcVecTy, DemandedElts,		Cost -= TTI->getScalarizationOverhead(SrcVecTy, DemandedElts,
/*Insert*/ true, /*Extract*/ false);		/*Insert*/ true, /*Extract*/ false);

if (IsIdentity && NumElts != NumScalars && Offset % NumScalars != 0) {		if (IsIdentity && NumElts != NumScalars && Offset % NumScalars != 0) {
// FIXME: Replace with SK_InsertSubvector once it is properly supported.		// FIXME: Replace with SK_InsertSubvector once it is properly supported.
unsigned Sz = PowerOf2Ceil(Offset + NumScalars);		unsigned Sz = PowerOf2Ceil(Offset + NumScalars);
Cost += TTI->getShuffleCost(		Cost += TTI->getShuffleCost(
TargetTransformInfo::SK_PermuteSingleSrc,		TargetTransformInfo::SK_PermuteSingleSrc,
FixedVectorType::get(SrcVecTy->getElementType(), Sz));		FixedVectorType::get(SrcVecTy->getElementType(), Sz));
} else if (!IsIdentity) {		} else if (!IsIdentity) {
Cost += TTI->getShuffleCost(TTI::SK_PermuteSingleSrc, SrcVecTy,		auto *FirstInsert =
ShuffleMask);		cast<Instruction>(find_if(E->Scalars, [E](Value *V) {
		return !is_contained(E->Scalars,
		cast<Instruction>(V)->getOperand(0));
		}));
		if (isa<UndefValue>(FirstInsert->getOperand(0))) {
		Cost += TTI->getShuffleCost(TTI::SK_PermuteSingleSrc, SrcVecTy, Mask);
		} else {
		SmallVector<int> InsertMask(NumElts);
		std::iota(InsertMask.begin(), InsertMask.end(), 0);
		for (unsigned I = 0; I < NumElts; I++) {
		if (Mask[I] != UndefMaskElem)
		InsertMask[Offset + I] = NumElts + I;
		}
		Cost +=
		TTI->getShuffleCost(TTI::SK_PermuteTwoSrc, SrcVecTy, InsertMask);
		}
}		}

return Cost;		return Cost;
}		}
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
case Instruction::FPToUI:		case Instruction::FPToUI:
case Instruction::FPToSI:		case Instruction::FPToSI:
(265 lines not shown)		case Instruction::ShuffleVector: {
auto *Src0Ty = FixedVectorType::get(Src0SclTy, VL.size());		auto *Src0Ty = FixedVectorType::get(Src0SclTy, VL.size());
auto *Src1Ty = FixedVectorType::get(Src1SclTy, VL.size());		auto *Src1Ty = FixedVectorType::get(Src1SclTy, VL.size());
VecCost = TTI->getCastInstrCost(E->getOpcode(), VecTy, Src0Ty,		VecCost = TTI->getCastInstrCost(E->getOpcode(), VecTy, Src0Ty,
TTI::CastContextHint::None, CostKind);		TTI::CastContextHint::None, CostKind);
VecCost += TTI->getCastInstrCost(E->getAltOpcode(), VecTy, Src1Ty,		VecCost += TTI->getCastInstrCost(E->getAltOpcode(), VecTy, Src1Ty,
TTI::CastContextHint::None, CostKind);		TTI::CastContextHint::None, CostKind);
}		}

SmallVector<int> Mask(E->Scalars.size());		SmallVector<int> Mask;
for (unsigned I = 0, End = E->Scalars.size(); I < End; ++I) {		buildSuffleEntryMask(
auto *OpInst = cast<Instruction>(E->Scalars[I]);		E->Scalars, E->ReorderIndices, E->ReuseShuffleIndices,
assert(E->isOpcodeOrAlt(OpInst) && "Unexpected main/alternate opcode");		[E](Instruction *I) {
Mask[I] = I + (OpInst->getOpcode() == E->getAltOpcode() ? End : 0);		assert(E->isOpcodeOrAlt(I) && "Unexpected main/alternate opcode");
}		return I->getOpcode() == E->getAltOpcode();
VecCost +=		},
TTI->getShuffleCost(TargetTransformInfo::SK_Select, VecTy, Mask, 0);		Mask);
		CommonCost =
		TTI->getShuffleCost(TargetTransformInfo::SK_Select, FinalVecTy, Mask);
LLVM_DEBUG(dumpTreeCosts(E, CommonCost, VecCost, ScalarCost));		LLVM_DEBUG(dumpTreeCosts(E, CommonCost, VecCost, ScalarCost));
return CommonCost + VecCost - ScalarCost;		return CommonCost + VecCost - ScalarCost;
}		}
default:		default:
llvm_unreachable("Unknown instruction");		llvm_unreachable("Unknown instruction");
}		}
}		}

(906 lines not shown)		Value *BoUpSLP::vectorizeTree(TreeEntry *E) {
Type *ScalarTy = VL0->getType();		Type *ScalarTy = VL0->getType();
if (auto *Store = dyn_cast<StoreInst>(VL0))		if (auto *Store = dyn_cast<StoreInst>(VL0))
ScalarTy = Store->getValueOperand()->getType();		ScalarTy = Store->getValueOperand()->getType();
else if (auto *IE = dyn_cast<InsertElementInst>(VL0))		else if (auto *IE = dyn_cast<InsertElementInst>(VL0))
ScalarTy = IE->getOperand(1)->getType();		ScalarTy = IE->getOperand(1)->getType();
auto *VecTy = FixedVectorType::get(ScalarTy, E->Scalars.size());		auto *VecTy = FixedVectorType::get(ScalarTy, E->Scalars.size());
switch (ShuffleOrOp) {		switch (ShuffleOrOp) {
case Instruction::PHI: {		case Instruction::PHI: {
		assert(
		(E->ReorderIndices.empty() || E != VectorizableTree.front().get()) &&
		"PHI reordering is free.");
auto *PH = cast<PHINode>(VL0);		auto *PH = cast<PHINode>(VL0);
Builder.SetInsertPoint(PH->getParent()->getFirstNonPHI());		Builder.SetInsertPoint(PH->getParent()->getFirstNonPHI());
Builder.SetCurrentDebugLocation(PH->getDebugLoc());		Builder.SetCurrentDebugLocation(PH->getDebugLoc());
PHINode *NewPhi = Builder.CreatePHI(VecTy, PH->getNumIncomingValues());		PHINode *NewPhi = Builder.CreatePHI(VecTy, PH->getNumIncomingValues());
Value *V = NewPhi;		Value *V = NewPhi;
if (NeedToShuffleReuses)		ShuffleBuilder.addInversedMask(E->ReorderIndices);
V = Builder.CreateShuffleVector(V, E->ReuseShuffleIndices, "shuffle");		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;

// PHINodes may have multiple entries from the same block. We want to		// PHINodes may have multiple entries from the same block. We want to
// visit every block once.		// visit every block once.
SmallPtrSet<BasicBlock*, 4> VisitedBBs;		SmallPtrSet<BasicBlock*, 4> VisitedBBs;

for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {		for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {
(34 lines not shown)		case Instruction::ExtractValue: {
Value *NewV = propagateMetadata(V, E->Scalars);		Value *NewV = propagateMetadata(V, E->Scalars);
ShuffleBuilder.addInversedMask(E->ReorderIndices);		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
NewV = ShuffleBuilder.finalize(NewV);		NewV = ShuffleBuilder.finalize(NewV);
E->VectorizedValue = NewV;		E->VectorizedValue = NewV;
return NewV;		return NewV;
}		}
case Instruction::InsertElement: {		case Instruction::InsertElement: {
Builder.SetInsertPoint(VL0);		assert(E->ReuseShuffleIndices.empty() && "All inserts should be unique");
		Builder.SetInsertPoint(cast<Instruction>(E->Scalars.back()));
Value *V = vectorizeTree(E->getOperand(1));		Value *V = vectorizeTree(E->getOperand(1));

		// Create InsertVector shuffle if necessary
		auto FirstInsert = cast<Instruction>(find_if(E->Scalars, [E](Value *V) {
		return !is_contained(E->Scalars, cast<Instruction>(V)->getOperand(0));
		}));
const unsigned NumElts =		const unsigned NumElts =
cast<FixedVectorType>(VL0->getType())->getNumElements();		cast<FixedVectorType>(FirstInsert->getType())->getNumElements();
const unsigned NumScalars = E->Scalars.size();		const unsigned NumScalars = E->Scalars.size();

// Create InsertVector shuffle if necessary		unsigned Offset = *getInsertIndex(VL0, 0);
Instruction *FirstInsert = nullptr;
bool IsIdentity = true;
unsigned Offset = UINT_MAX;
for (unsigned I = 0; I < NumScalars; ++I) {
Value *Scalar = E->Scalars[I];
if (!FirstInsert &&
!is_contained(E->Scalars, cast<Instruction>(Scalar)->getOperand(0)))
FirstInsert = cast<Instruction>(Scalar);
Optional<int> InsertIdx = getInsertIndex(Scalar, 0);
if (!InsertIdx || *InsertIdx == UndefMaskElem)
continue;
unsigned Idx = *InsertIdx;
if (Idx < Offset) {
Offset = Idx;
IsIdentity &= I == 0;
} else {
assert(Idx >= Offset && "Failed to find vector index offset");
IsIdentity &= Idx - Offset == I;
}
}
assert(Offset < NumElts && "Failed to find vector index offset");		assert(Offset < NumElts && "Failed to find vector index offset");

// Create shuffle to resize vector		// Create shuffle to resize vector
SmallVector<int> Mask(NumElts, UndefMaskElem);		SmallVector<int> Mask;
if (!IsIdentity) {		if (!E->ReorderIndices.empty()) {
		inversePermutation(E->ReorderIndices, Mask);
		Mask.append(NumElts - NumScalars, UndefMaskElem);
		} else {
		Mask.assign(NumElts, UndefMaskElem);
		std::iota(Mask.begin(), std::next(Mask.begin(), NumScalars), 0);
		}
		// Create InsertVector shuffle if necessary
		bool IsIdentity = true;
		SmallVector<int> PrevMask(NumElts, UndefMaskElem);
		Mask.swap(PrevMask);
for (unsigned I = 0; I < NumScalars; ++I) {		for (unsigned I = 0; I < NumScalars; ++I) {
Value *Scalar = E->Scalars[I];		Value *Scalar = E->Scalars[PrevMask[I]];
Optional<int> InsertIdx = getInsertIndex(Scalar, 0);		Optional<int> InsertIdx = getInsertIndex(Scalar, 0);
if (!InsertIdx || *InsertIdx == UndefMaskElem)		if (!InsertIdx || *InsertIdx == UndefMaskElem)
continue;		continue;
		IsIdentity &= *InsertIdx - Offset == I;
Mask[*InsertIdx - Offset] = I;		Mask[*InsertIdx - Offset] = I;
}		}
} else {
std::iota(Mask.begin(), std::next(Mask.begin(), NumScalars), 0);
}
if (!IsIdentity || NumElts != NumScalars)		if (!IsIdentity || NumElts != NumScalars)
V = Builder.CreateShuffleVector(V, Mask);		V = Builder.CreateShuffleVector(V, Mask);

if ((!IsIdentity || Offset != 0 ||		if ((!IsIdentity || Offset != 0 ||
!isa<UndefValue>(FirstInsert->getOperand(0))) &&		!isa<UndefValue>(FirstInsert->getOperand(0))) &&
NumElts != NumScalars) {		NumElts != NumScalars) {
SmallVector<int> InsertMask(NumElts);		SmallVector<int> InsertMask(NumElts);
std::iota(InsertMask.begin(), InsertMask.end(), 0);		std::iota(InsertMask.begin(), InsertMask.end(), 0);
(29 lines not shown)		case Instruction::BitCast: {

if (E->VectorizedValue) {		if (E->VectorizedValue) {
LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

auto *CI = cast<CastInst>(VL0);		auto *CI = cast<CastInst>(VL0);
Value *V = Builder.CreateCast(CI->getOpcode(), InVec, VecTy);		Value *V = Builder.CreateCast(CI->getOpcode(), InVec, VecTy);
		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
case Instruction::FCmp:		case Instruction::FCmp:
case Instruction::ICmp: {		case Instruction::ICmp: {
setInsertPointAfterBundle(E);		setInsertPointAfterBundle(E);

Value *L = vectorizeTree(E->getOperand(0));		Value *L = vectorizeTree(E->getOperand(0));
Value *R = vectorizeTree(E->getOperand(1));		Value *R = vectorizeTree(E->getOperand(1));

if (E->VectorizedValue) {		if (E->VectorizedValue) {
LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();		CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();
Value *V = Builder.CreateCmp(P0, L, R);		Value *V = Builder.CreateCmp(P0, L, R);
propagateIRFlags(V, E->Scalars, VL0);		propagateIRFlags(V, E->Scalars, VL0);
		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
case Instruction::Select: {		case Instruction::Select: {
setInsertPointAfterBundle(E);		setInsertPointAfterBundle(E);

Value *Cond = vectorizeTree(E->getOperand(0));		Value *Cond = vectorizeTree(E->getOperand(0));
Value *True = vectorizeTree(E->getOperand(1));		Value *True = vectorizeTree(E->getOperand(1));
Value *False = vectorizeTree(E->getOperand(2));		Value *False = vectorizeTree(E->getOperand(2));

if (E->VectorizedValue) {		if (E->VectorizedValue) {
LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

Value *V = Builder.CreateSelect(Cond, True, False);		Value *V = Builder.CreateSelect(Cond, True, False);
		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
case Instruction::FNeg: {		case Instruction::FNeg: {
setInsertPointAfterBundle(E);		setInsertPointAfterBundle(E);

Value *Op = vectorizeTree(E->getOperand(0));		Value *Op = vectorizeTree(E->getOperand(0));

if (E->VectorizedValue) {		if (E->VectorizedValue) {
LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

Value *V = Builder.CreateUnOp(		Value *V = Builder.CreateUnOp(
static_cast<Instruction::UnaryOps>(E->getOpcode()), Op);		static_cast<Instruction::UnaryOps>(E->getOpcode()), Op);
propagateIRFlags(V, E->Scalars, VL0);		propagateIRFlags(V, E->Scalars, VL0);
if (auto *I = dyn_cast<Instruction>(V))		if (auto *I = dyn_cast<Instruction>(V))
V = propagateMetadata(I, E->Scalars);		V = propagateMetadata(I, E->Scalars);

		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;

return V;		return V;
}		}
(27 lines not shown)	case Instruction::Xor: {

Value *V = Builder.CreateBinOp(		Value *V = Builder.CreateBinOp(
static_cast<Instruction::BinaryOps>(E->getOpcode()), LHS,		static_cast<Instruction::BinaryOps>(E->getOpcode()), LHS,
RHS);		RHS);
propagateIRFlags(V, E->Scalars, VL0);		propagateIRFlags(V, E->Scalars, VL0);
if (auto *I = dyn_cast<Instruction>(V))		if (auto *I = dyn_cast<Instruction>(V))
V = propagateMetadata(I, E->Scalars);		V = propagateMetadata(I, E->Scalars);

		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;

return V;		return V;
}		}
case Instruction::Load: {		case Instruction::Load: {
// Loads are inserted at the head of the tree because we don't want to		// Loads are inserted at the head of the tree because we don't want to
// sink them all the way down past store instructions.		// sink them all the way down past store instructions.
bool IsReorder = E->updateStateIfReorder();
if (IsReorder)
VL0 = E->getMainOp();
setInsertPointAfterBundle(E);		setInsertPointAfterBundle(E);

LoadInst *LI = cast<LoadInst>(VL0);		LoadInst *LI = cast<LoadInst>(VL0);
Instruction *NewLI;		Instruction *NewLI;
unsigned AS = LI->getPointerAddressSpace();		unsigned AS = LI->getPointerAddressSpace();
Value *PO = LI->getPointerOperand();		Value *PO = LI->getPointerOperand();
if (E->State == TreeEntry::Vectorize) {		if (E->State == TreeEntry::Vectorize) {

(21 lines not shown)	case Instruction::Load: {
ShuffleBuilder.addInversedMask(E->ReorderIndices);		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
case Instruction::Store: {		case Instruction::Store: {
bool IsReorder = !E->ReorderIndices.empty();		auto *SI = cast<StoreInst>(VL0);
auto *SI = cast<StoreInst>(
IsReorder ? E->Scalars[E->ReorderIndices.front()] : VL0);
unsigned AS = SI->getPointerAddressSpace();		unsigned AS = SI->getPointerAddressSpace();

setInsertPointAfterBundle(E);		setInsertPointAfterBundle(E);

Value *VecValue = vectorizeTree(E->getOperand(0));		Value *VecValue = vectorizeTree(E->getOperand(0));
ShuffleBuilder.addMask(E->ReorderIndices);		ShuffleBuilder.addMask(E->ReorderIndices);
VecValue = ShuffleBuilder.finalize(VecValue);		VecValue = ShuffleBuilder.finalize(VecValue);

(42 lines not shown)	case Instruction::GetElementPtr: {
OpVecs.push_back(OpVec);		OpVecs.push_back(OpVec);
}		}

Value *V = Builder.CreateGEP(		Value *V = Builder.CreateGEP(
cast<GetElementPtrInst>(VL0)->getSourceElementType(), Op0, OpVecs);		cast<GetElementPtrInst>(VL0)->getSourceElementType(), Op0, OpVecs);
if (Instruction *I = dyn_cast<Instruction>(V))		if (Instruction *I = dyn_cast<Instruction>(V))
V = propagateMetadata(I, E->Scalars);		V = propagateMetadata(I, E->Scalars);

		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;

return V;		return V;
}		}
(50 lines not shown)	case Instruction::Call: {

// The scalar argument uses an in-tree scalar so we add the new vectorized		// The scalar argument uses an in-tree scalar so we add the new vectorized
// call to ExternalUses list to make sure that an extract will be		// call to ExternalUses list to make sure that an extract will be
// generated in the future.		// generated in the future.
if (ScalarArg && getTreeEntry(ScalarArg))		if (ScalarArg && getTreeEntry(ScalarArg))
ExternalUses.push_back(ExternalUser(ScalarArg, cast<User>(V), 0));		ExternalUses.push_back(ExternalUser(ScalarArg, cast<User>(V), 0));

propagateIRFlags(V, E->Scalars, VL0);		propagateIRFlags(V, E->Scalars, VL0);
		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
case Instruction::ShuffleVector: {		case Instruction::ShuffleVector: {
(31 lines not shown)	case Instruction::ShuffleVector: {
V1 = Builder.CreateCast(		V1 = Builder.CreateCast(
static_cast<Instruction::CastOps>(E->getAltOpcode()), LHS, VecTy);		static_cast<Instruction::CastOps>(E->getAltOpcode()), LHS, VecTy);
}		}

// Create shuffle to take alternate operations from the vector.		// Create shuffle to take alternate operations from the vector.
// Also, gather up main and alt scalar ops to propagate IR flags to		// Also, gather up main and alt scalar ops to propagate IR flags to
// each vector operation.		// each vector operation.
ValueList OpScalars, AltScalars;		ValueList OpScalars, AltScalars;
unsigned Sz = E->Scalars.size();		SmallVector<int> Mask;
SmallVector<int> Mask(Sz);		buildSuffleEntryMask(
		RKSimon (inline comment): A lot of this is just a refactor cleanup that looks like a NFC that could be done as a pre-commit to simplify the patch?
		ABataev (author, inline reply): Yes, will precommit some of these changes.
for (unsigned I = 0; I < Sz; ++I) {		E->Scalars, E->ReorderIndices, E->ReuseShuffleIndices,
auto *OpInst = cast<Instruction>(E->Scalars[I]);		[E](Instruction *I) {
assert(E->isOpcodeOrAlt(OpInst) && "Unexpected main/alternate opcode");		assert(E->isOpcodeOrAlt(I) && "Unexpected main/alternate opcode");
if (OpInst->getOpcode() == E->getAltOpcode()) {		return I->getOpcode() == E->getAltOpcode();
Mask[I] = Sz + I;		},
AltScalars.push_back(E->Scalars[I]);		Mask, &OpScalars, &AltScalars);
} else {
Mask[I] = I;
OpScalars.push_back(E->Scalars[I]);
}
}

propagateIRFlags(V0, OpScalars);		propagateIRFlags(V0, OpScalars);
propagateIRFlags(V1, AltScalars);		propagateIRFlags(V1, AltScalars);

Value *V = Builder.CreateShuffleVector(V0, V1, Mask);		Value *V = Builder.CreateShuffleVector(V0, V1, Mask);
if (Instruction *I = dyn_cast<Instruction>(V))		if (Instruction *I = dyn_cast<Instruction>(V))
V = propagateMetadata(I, E->Scalars);		V = propagateMetadata(I, E->Scalars);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;

return V;		return V;
}		}
default:		default:
(1,156 lines not shown)	bool SLPVectorizerPass::runImpl(Function &F, ScalarEvolution *SE_,

if (Changed) {		if (Changed) {
R.optimizeGatherSequence();		R.optimizeGatherSequence();
LLVM_DEBUG(dbgs() << "SLP: vectorized \"" << F.getName() << "\"\n");		LLVM_DEBUG(dbgs() << "SLP: vectorized \"" << F.getName() << "\"\n");
}		}
return Changed;		return Changed;
}		}

/// Order may have elements assigned special value (size) which is out of
/// bounds. Such indices only appear on places which correspond to undef values
/// (see canReuseExtract for details) and used in order to avoid undef values
/// have effect on operands ordering.
/// The first loop below simply finds all unused indices and then the next loop
/// nest assigns these indices for undef values positions.
/// As an example below Order has two undef positions and they have assigned
/// values 3 and 7 respectively:
/// before: 6 9 5 4 9 2 1 0
/// after: 6 3 5 4 7 2 1 0
/// \returns Fixed ordering.
static BoUpSLP::OrdersType fixupOrderingIndices(ArrayRef<unsigned> Order) {
BoUpSLP::OrdersType NewOrder(Order.begin(), Order.end());
const unsigned Sz = NewOrder.size();
SmallBitVector UsedIndices(Sz);
SmallVector<int> MaskedIndices;
for (int I = 0, E = NewOrder.size(); I < E; ++I) {
if (NewOrder[I] < Sz)
UsedIndices.set(NewOrder[I]);
else
MaskedIndices.push_back(I);
}
if (MaskedIndices.empty())
return NewOrder;
SmallVector<int> AvailableIndices(MaskedIndices.size());
unsigned Cnt = 0;
int Idx = UsedIndices.find_first();
do {
AvailableIndices[Cnt] = Idx;
Idx = UsedIndices.find_next(Idx);
++Cnt;
} while (Idx > 0);
assert(Cnt == MaskedIndices.size() && "Non-synced masked/available indices.");
for (int I = 0, E = MaskedIndices.size(); I < E; ++I)
NewOrder[MaskedIndices[I]] = AvailableIndices[I];
return NewOrder;
}
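A minimal standalone sketch of the fixup that the doc comment on fixupOrderingIndices describes, for readers who want to try the "before/after" example outside of LLVM. It uses only the C++ standard library; the name fixupOrderSketch and the use of std::vector in place of OrdersType/SmallBitVector are assumptions made for illustration, and the sketch follows the comment's description (collect unused indices, then assign them to the out-of-bounds positions in ascending order) rather than reproducing the removed code line by line.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of LLVM: every entry of Order that is out of
// bounds (>= Order.size()) marks an undef position. Each such position gets
// the smallest index that no other position uses yet.
std::vector<unsigned> fixupOrderSketch(const std::vector<unsigned> &Order) {
  const std::size_t Sz = Order.size();
  std::vector<unsigned> NewOrder(Order);
  std::vector<bool> Used(Sz, false);
  std::vector<std::size_t> MaskedPositions;

  // First pass: record which indices are already taken and which positions
  // hold an out-of-bounds placeholder.
  for (std::size_t I = 0; I < Sz; ++I) {
    if (NewOrder[I] < Sz)
      Used[NewOrder[I]] = true;
    else
      MaskedPositions.push_back(I);
  }

  // Second pass: hand out the unused indices in ascending order.
  std::size_t Next = 0;
  for (std::size_t Pos : MaskedPositions) {
    while (Next < Sz && Used[Next])
      ++Next;
    assert(Next < Sz && "Expected an unused index for every masked position.");
    NewOrder[Pos] = static_cast<unsigned>(Next);
    Used[Next] = true;
  }
  return NewOrder;
}
```

Called on the order from the comment, {6, 9, 5, 4, 9, 2, 1, 0}, the sketch fills the two out-of-bounds positions with the unused indices 3 and 7, giving {6, 3, 5, 4, 7, 2, 1, 0}.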

bool SLPVectorizerPass::vectorizeStoreChain(ArrayRef<Value *> Chain, BoUpSLP &R,		bool SLPVectorizerPass::vectorizeStoreChain(ArrayRef<Value *> Chain, BoUpSLP &R,
unsigned Idx) {		unsigned Idx) {
LLVM_DEBUG(dbgs() << "SLP: Analyzing a store chain of length " << Chain.size()		LLVM_DEBUG(dbgs() << "SLP: Analyzing a store chain of length " << Chain.size()
<< "\n");		<< "\n");
const unsigned Sz = R.getVectorElementSize(Chain[0]);		const unsigned Sz = R.getVectorElementSize(Chain[0]);
const unsigned MinVF = R.getMinVecRegSize() / Sz;		const unsigned MinVF = R.getMinVecRegSize() / Sz;
unsigned VF = Chain.size();		unsigned VF = Chain.size();

if (!isPowerOf2_32(Sz) || !isPowerOf2_32(VF) || VF < 2 || VF < MinVF)		if (!isPowerOf2_32(Sz) || !isPowerOf2_32(VF) || VF < 2 || VF < MinVF)
return false;		return false;

LLVM_DEBUG(dbgs() << "SLP: Analyzing " << VF << " stores at offset " << Idx		LLVM_DEBUG(dbgs() << "SLP: Analyzing " << VF << " stores at offset " << Idx
<< "\n");		<< "\n");

R.buildTree(Chain);		R.buildTree(Chain);
Optional<ArrayRef<unsigned>> Order = R.bestOrder();
// TODO: Handle orders of size less than number of elements in the vector.
if (Order && Order->size() == Chain.size()) {
// TODO: reorder tree nodes without tree rebuilding.
SmallVector<Value *, 4> ReorderedOps(Chain.size());
transform(fixupOrderingIndices(*Order), ReorderedOps.begin(),
[Chain](const unsigned Idx) { return Chain[Idx]; });
R.buildTree(ReorderedOps);
}
if (R.isTreeTinyAndNotFullyVectorizable())		if (R.isTreeTinyAndNotFullyVectorizable())
return false;		return false;
if (R.isLoadCombineCandidate())		if (R.isLoadCombineCandidate())
return false;		return false;
		R.reorderTopToBottom();
		R.reorderBottomToTop();
		R.buildExternalUses();

R.computeMinimumValueSizes();		R.computeMinimumValueSizes();

InstructionCost Cost = R.getTreeCost();		InstructionCost Cost = R.getTreeCost();

LLVM_DEBUG(dbgs() << "SLP: Found cost = " << Cost << " for VF =" << VF << "\n");		LLVM_DEBUG(dbgs() << "SLP: Found cost = " << Cost << " for VF =" << VF << "\n");
if (Cost < -SLPCostThreshold) {		if (Cost < -SLPCostThreshold) {
LLVM_DEBUG(dbgs() << "SLP: Decided to vectorize cost = " << Cost << "\n");		LLVM_DEBUG(dbgs() << "SLP: Decided to vectorize cost = " << Cost << "\n");
(106 lines not shown)	while (I != E && !VectorizedStores.count(Stores[I])) {
I = ConsecutiveChain[I].first;		I = ConsecutiveChain[I].first;
}		}
assert(!Operands.empty() && "Expected non-empty list of stores.");		assert(!Operands.empty() && "Expected non-empty list of stores.");

unsigned MaxVecRegSize = R.getMaxVecRegSize();		unsigned MaxVecRegSize = R.getMaxVecRegSize();
unsigned EltSize = R.getVectorElementSize(Operands[0]);		unsigned EltSize = R.getVectorElementSize(Operands[0]);
unsigned MaxElts = llvm::PowerOf2Floor(MaxVecRegSize / EltSize);		unsigned MaxElts = llvm::PowerOf2Floor(MaxVecRegSize / EltSize);

unsigned MinVF = std::max(2U, R.getMinVecRegSize() / EltSize);		unsigned MinVF = R.getMinVF(EltSize);
unsigned MaxVF = std::min(R.getMaximumVF(EltSize, Instruction::Store),		unsigned MaxVF = std::min(R.getMaximumVF(EltSize, Instruction::Store),
MaxElts);		MaxElts);

// FIXME: Is division-by-2 the correct step? Should we assert that the		// FIXME: Is division-by-2 the correct step? Should we assert that the
// register size is a power-of-2?		// register size is a power-of-2?
unsigned StartIdx = 0;		unsigned StartIdx = 0;
for (unsigned Size = MaxVF; Size >= MinVF; Size /= 2) {		for (unsigned Size = MaxVF; Size >= MinVF; Size /= 2) {
for (unsigned Cnt = StartIdx, E = Operands.size(); Cnt + Size <= E;) {		for (unsigned Cnt = StartIdx, E = Operands.size(); Cnt + Size <= E;) {
(56 lines not shown)	for (Instruction &I : *BB) {
}		}
}		}
}		}

bool SLPVectorizerPass::tryToVectorizePair(Value *A, Value *B, BoUpSLP &R) {		bool SLPVectorizerPass::tryToVectorizePair(Value *A, Value *B, BoUpSLP &R) {
if (!A || !B)		if (!A || !B)
return false;		return false;
Value *VL[] = {A, B};		Value *VL[] = {A, B};
return tryToVectorizeList(VL, R, /*AllowReorder=*/true);		return tryToVectorizeList(VL, R);
}		}

bool SLPVectorizerPass::tryToVectorizeList(ArrayRef<Value *> VL, BoUpSLP &R,		bool SLPVectorizerPass::tryToVectorizeList(ArrayRef<Value *> VL, BoUpSLP &R) {
bool AllowReorder) {
if (VL.size() < 2)		if (VL.size() < 2)
return false;		return false;

LLVM_DEBUG(dbgs() << "SLP: Trying to vectorize a list of length = "		LLVM_DEBUG(dbgs() << "SLP: Trying to vectorize a list of length = "
<< VL.size() << ".\n");		<< VL.size() << ".\n");

// Check that all of the parts are instructions of the same type,		// Check that all of the parts are instructions of the same type,
// we permit an alternate opcode via InstructionsState.		// we permit an alternate opcode via InstructionsState.
(17 lines not shown)	if (!isa<InsertElementInst>(V) && !isValidElementType(Ty)) {
<< "Cannot SLP vectorize list: type "		<< "Cannot SLP vectorize list: type "
<< rso.str() + " is unsupported by vectorizer";		<< rso.str() + " is unsupported by vectorizer";
});		});
return false;		return false;
}		}
}		}

unsigned Sz = R.getVectorElementSize(I0);		unsigned Sz = R.getVectorElementSize(I0);
unsigned MinVF = std::max(2U, R.getMinVecRegSize() / Sz);		unsigned MinVF = R.getMinVF(Sz);
unsigned MaxVF = std::max<unsigned>(PowerOf2Floor(VL.size()), MinVF);		unsigned MaxVF = std::max<unsigned>(PowerOf2Floor(VL.size()), MinVF);
MaxVF = std::min(R.getMaximumVF(Sz, S.getOpcode()), MaxVF);		MaxVF = std::min(R.getMaximumVF(Sz, S.getOpcode()), MaxVF);
if (MaxVF < 2) {		if (MaxVF < 2) {
R.getORE()->emit([&]() {		R.getORE()->emit([&]() {
return OptimizationRemarkMissed(SV_NAME, "SmallVF", I0)		return OptimizationRemarkMissed(SV_NAME, "SmallVF", I0)
<< "Cannot SLP vectorize list: vectorization factor "		<< "Cannot SLP vectorize list: vectorization factor "
<< "less than 2 is not supported";		<< "less than 2 is not supported";
});		});
(36 lines not shown)	for (unsigned I = NextInst; I < MaxInst; ++I) {
return I && R.isDeleted(I);		return I && R.isDeleted(I);
}))		}))
continue;		continue;

LLVM_DEBUG(dbgs() << "SLP: Analyzing " << OpsWidth << " operations "		LLVM_DEBUG(dbgs() << "SLP: Analyzing " << OpsWidth << " operations "
<< "\n");		<< "\n");

R.buildTree(Ops);		R.buildTree(Ops);
if (AllowReorder) {
Optional<ArrayRef<unsigned>> Order = R.bestOrder();
if (Order) {
// TODO: reorder tree nodes without tree rebuilding.
SmallVector<Value *, 4> ReorderedOps(Ops.size());
transform(fixupOrderingIndices(*Order), ReorderedOps.begin(),
[Ops](const unsigned Idx) { return Ops[Idx]; });
R.buildTree(ReorderedOps);
}
}
if (R.isTreeTinyAndNotFullyVectorizable())		if (R.isTreeTinyAndNotFullyVectorizable())
continue;		continue;
		R.reorderTopToBottom();
		R.reorderBottomToTop();
		R.buildExternalUses();

R.computeMinimumValueSizes();		R.computeMinimumValueSizes();
InstructionCost Cost = R.getTreeCost();		InstructionCost Cost = R.getTreeCost();
CandidateFound = true;		CandidateFound = true;
MinCost = std::min(MinCost, Cost);		MinCost = std::min(MinCost, Cost);

if (Cost < -SLPCostThreshold) {		if (Cost < -SLPCostThreshold) {
LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");
(627 lines not shown)	if (NumReducedVals > ReduxWidth) {
return false;		return false;
});		});
}		}

Value *VectorizedTree = nullptr;		Value *VectorizedTree = nullptr;
unsigned i = 0;		unsigned i = 0;
while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {		while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {
ArrayRef<Value *> VL(&ReducedVals[i], ReduxWidth);		ArrayRef<Value *> VL(&ReducedVals[i], ReduxWidth);
V.buildTree(VL, ExternallyUsedValues, IgnoreList);		V.buildTree(VL, IgnoreList);
Optional<ArrayRef<unsigned>> Order = V.bestOrder();
if (Order) {
assert(Order->size() == VL.size() &&
"Order size must be the same as number of vectorized "
"instructions.");
// TODO: reorder tree nodes without tree rebuilding.
SmallVector<Value *, 4> ReorderedOps(VL.size());
transform(fixupOrderingIndices(*Order), ReorderedOps.begin(),
[VL](const unsigned Idx) { return VL[Idx]; });
V.buildTree(ReorderedOps, ExternallyUsedValues, IgnoreList);
}
if (V.isTreeTinyAndNotFullyVectorizable())		if (V.isTreeTinyAndNotFullyVectorizable())
break;		break;
if (V.isLoadCombineReductionCandidate(RdxKind))		if (V.isLoadCombineReductionCandidate(RdxKind))
break;		break;
		V.reorderTopToBottom();
		V.reorderBottomToTop();
		V.buildExternalUses(ExternallyUsedValues);

// For a poison-safe boolean logic reduction, do not replace select		// For a poison-safe boolean logic reduction, do not replace select
// instructions with logic ops. All reduced values will be frozen (see		// instructions with logic ops. All reduced values will be frozen (see
// below) to prevent leaking poison.		// below) to prevent leaking poison.
if (isa<SelectInst>(ReductionRoot) &&		if (isa<SelectInst>(ReductionRoot) &&
isBoolLogicOp(cast<Instruction>(ReductionRoot)) &&		isBoolLogicOp(cast<Instruction>(ReductionRoot)) &&
NumReducedVals != ReduxWidth)		NumReducedVals != ReduxWidth)
break;		break;
(456 lines not shown)	bool SLPVectorizerPass::vectorizeInsertValueInst(InsertValueInst *IVI,
SmallVector<Value *, 16> BuildVectorOpds;		SmallVector<Value *, 16> BuildVectorOpds;
SmallVector<Value *, 16> BuildVectorInsts;		SmallVector<Value *, 16> BuildVectorInsts;
if (!findBuildAggregate(IVI, TTI, BuildVectorOpds, BuildVectorInsts))		if (!findBuildAggregate(IVI, TTI, BuildVectorOpds, BuildVectorInsts))
return false;		return false;

LLVM_DEBUG(dbgs() << "SLP: array mappable to vector: " << *IVI << "\n");		LLVM_DEBUG(dbgs() << "SLP: array mappable to vector: " << *IVI << "\n");
// Aggregate value is unlikely to be processed in vector register, we need to		// Aggregate value is unlikely to be processed in vector register, we need to
// extract scalars into scalar registers, so NeedExtraction is set true.		// extract scalars into scalar registers, so NeedExtraction is set true.
return tryToVectorizeList(BuildVectorOpds, R, /*AllowReorder=*/false);		return tryToVectorizeList(BuildVectorOpds, R);
}		}

bool SLPVectorizerPass::vectorizeInsertElementInst(InsertElementInst *IEI,		bool SLPVectorizerPass::vectorizeInsertElementInst(InsertElementInst *IEI,
BasicBlock *BB, BoUpSLP &R) {		BasicBlock *BB, BoUpSLP &R) {
SmallVector<Value *, 16> BuildVectorInsts;		SmallVector<Value *, 16> BuildVectorInsts;
SmallVector<Value *, 16> BuildVectorOpds;		SmallVector<Value *, 16> BuildVectorOpds;
SmallVector<int> Mask;		SmallVector<int> Mask;
if (!findBuildAggregate(IEI, TTI, BuildVectorOpds, BuildVectorInsts) ||		if (!findBuildAggregate(IEI, TTI, BuildVectorOpds, BuildVectorInsts) ||
(llvm::all_of(BuildVectorOpds,		(llvm::all_of(BuildVectorOpds,
[](Value *V) { return isa<ExtractElementInst>(V); }) &&		[](Value *V) { return isa<ExtractElementInst>(V); }) &&
isShuffle(BuildVectorOpds, Mask)))		isShuffle(BuildVectorOpds, Mask)))
return false;		return false;

LLVM_DEBUG(dbgs() << "SLP: array mappable to vector: " << *IEI << "\n");		LLVM_DEBUG(dbgs() << "SLP: array mappable to vector: " << *IEI << "\n");
return tryToVectorizeList(BuildVectorInsts, R, /*AllowReorder=*/true);		return tryToVectorizeList(BuildVectorInsts, R);
}		}

bool SLPVectorizerPass::vectorizeSimpleInstructions(		bool SLPVectorizerPass::vectorizeSimpleInstructions(
SmallVectorImpl<Instruction *> &Instructions, BasicBlock *BB, BoUpSLP &R,		SmallVectorImpl<Instruction *> &Instructions, BasicBlock *BB, BoUpSLP &R,
bool AtTerminator) {		bool AtTerminator) {
bool OpsChanged = false;		bool OpsChanged = false;
SmallVector<Instruction *, 4> PostponedCmps;		SmallVector<Instruction *, 4> PostponedCmps;
for (auto *I : reverse(Instructions)) {		for (auto *I : reverse(Instructions)) {
(175 lines not shown)	for (SmallVector<Value *, 4>::iterator IncIt = Incoming.begin(),
// Try to vectorize them.		// Try to vectorize them.
unsigned NumElts = (SameTypeIt - IncIt);		unsigned NumElts = (SameTypeIt - IncIt);
LLVM_DEBUG(dbgs() << "SLP: Trying to vectorize starting at PHIs ("		LLVM_DEBUG(dbgs() << "SLP: Trying to vectorize starting at PHIs ("
<< NumElts << ")\n");		<< NumElts << ")\n");
// The order in which the phi nodes appear in the program does not matter.		// The order in which the phi nodes appear in the program does not matter.
// So allow tryToVectorizeList to reorder them if it is beneficial. This		// So allow tryToVectorizeList to reorder them if it is beneficial. This
// is done when there are exactly two elements since tryToVectorizeList		// is done when there are exactly two elements since tryToVectorizeList
// asserts that there are only two values when AllowReorder is true.		// asserts that there are only two values when AllowReorder is true.
if (NumElts > 1 && tryToVectorizeList(makeArrayRef(IncIt, NumElts), R,		if (NumElts > 1 && tryToVectorizeList(makeArrayRef(IncIt, NumElts), R)) {
/*AllowReorder=*/true)) {
// Success start over because instructions might have been changed.		// Success start over because instructions might have been changed.
HaveVectorizedPhiNodes = true;		HaveVectorizedPhiNodes = true;
Changed = true;		Changed = true;
} else if (NumElts < 4 &&		} else if (NumElts < 4 &&
(Candidates.empty() ||		(Candidates.empty() ||
Candidates.front()->getType() == (*IncIt)->getType())) {		Candidates.front()->getType() == (*IncIt)->getType())) {
Candidates.append(IncIt, std::next(IncIt, NumElts));		Candidates.append(IncIt, std::next(IncIt, NumElts));
}		}
// Final attempt to vectorize phis with the same types.		// Final attempt to vectorize phis with the same types.
if (SameTypeIt == E || (*SameTypeIt)->getType() != (*IncIt)->getType()) {		if (SameTypeIt == E || (*SameTypeIt)->getType() != (*IncIt)->getType()) {
if (Candidates.size() > 1 &&		if (Candidates.size() > 1 && tryToVectorizeList(Candidates, R)) {
tryToVectorizeList(Candidates, R, /*AllowReorder=*/true)) {
// Success start over because instructions might have been changed.		// Success start over because instructions might have been changed.
HaveVectorizedPhiNodes = true;		HaveVectorizedPhiNodes = true;
Changed = true;		Changed = true;
}		}
Candidates.clear();		Candidates.clear();
}		}

// Start over at the next instruction of a different type (or the end).		// Start over at the next instruction of a different type (or the end).
(311 lines not shown)

llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -slp-vectorizer -instcombine -S | FileCheck %s		; RUN: opt < %s -slp-vectorizer -instcombine -S | FileCheck %s

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"		target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"		target triple = "aarch64--linux-gnu"

define <2 x i64> @build_vec_v2i64(<2 x i64> %v0, <2 x i64> %v1) {		define <2 x i64> @build_vec_v2i64(<2 x i64> %v0, <2 x i64> %v1) {
; CHECK-LABEL: @build_vec_v2i64(		; CHECK-LABEL: @build_vec_v2i64(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i64> [[V0:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i64> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i64> [[V1:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP2:%.*]] = sub <2 x i64> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i64> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i64> [[TMP1]], <2 x i64> [[TMP2]], <2 x i32> <i32 1, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i64> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = add <2 x i64> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i64> [[TMP3]], <2 x i64> [[TMP4]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[TMP5:%.*]] = sub <2 x i64> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i64> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i64> [[TMP4]], <2 x i64> [[TMP5]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i64> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <2 x i64> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i64> [[TMP6]], <2 x i64> [[TMP7]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: ret <2 x i64> [[TMP7]]
; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i64> [[TMP8]], [[TMP5]]
; CHECK-NEXT: ret <2 x i64> [[TMP9]]
;		;
%v0.0 = extractelement <2 x i64> %v0, i32 0		%v0.0 = extractelement <2 x i64> %v0, i32 0
%v0.1 = extractelement <2 x i64> %v0, i32 1		%v0.1 = extractelement <2 x i64> %v0, i32 1
%v1.0 = extractelement <2 x i64> %v1, i32 0		%v1.0 = extractelement <2 x i64> %v1, i32 0
%v1.1 = extractelement <2 x i64> %v1, i32 1		%v1.1 = extractelement <2 x i64> %v1, i32 1
%tmp0.0 = add i64 %v0.0, %v1.0		%tmp0.0 = add i64 %v0.0, %v1.0
%tmp0.1 = add i64 %v0.1, %v1.1		%tmp0.1 = add i64 %v0.1, %v1.1
%tmp1.0 = sub i64 %v0.0, %v1.0		%tmp1.0 = sub i64 %v0.0, %v1.0
(42 lines not shown)	;
%tmp2.1 = add i64 %tmp1.0, %tmp1.1		%tmp2.1 = add i64 %tmp1.0, %tmp1.1
store i64 %tmp2.0, i64* %c.0, align 8		store i64 %tmp2.0, i64* %c.0, align 8
store i64 %tmp2.1, i64* %c.1, align 8		store i64 %tmp2.1, i64* %c.1, align 8
ret void		ret void
}		}

define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32(		; CHECK-LABEL: @build_vec_v4i32(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP1:%.*]] = add <4 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP2:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 3, i32 6>
; CHECK-NEXT: [[TMP4:%.*]] = sub <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>		; CHECK-NEXT: [[TMP5:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = add <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP5]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
; CHECK-NEXT: [[TMP7:%.*]] = sub <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>		; CHECK-NEXT: ret <4 x i32> [[TMP7]]
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]
; CHECK-NEXT: ret <4 x i32> [[TMP9]]
;		;
%v0.0 = extractelement <4 x i32> %v0, i32 0		%v0.0 = extractelement <4 x i32> %v0, i32 0
%v0.1 = extractelement <4 x i32> %v0, i32 1		%v0.1 = extractelement <4 x i32> %v0, i32 1
%v0.2 = extractelement <4 x i32> %v0, i32 2		%v0.2 = extractelement <4 x i32> %v0, i32 2
%v0.3 = extractelement <4 x i32> %v0, i32 3		%v0.3 = extractelement <4 x i32> %v0, i32 3
%v1.0 = extractelement <4 x i32> %v1, i32 0		%v1.0 = extractelement <4 x i32> %v1, i32 0
%v1.1 = extractelement <4 x i32> %v1, i32 1		%v1.1 = extractelement <4 x i32> %v1, i32 1
%v1.2 = extractelement <4 x i32> %v1, i32 2		%v1.2 = extractelement <4 x i32> %v1, i32 2
(14 lines not shown)	;
%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1		%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1
%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2		%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}

define <4 x i32> @build_vec_v4i32_reuse_0(<2 x i32> %v0, <2 x i32> %v1) {		define <4 x i32> @build_vec_v4i32_reuse_0(<2 x i32> %v0, <2 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32_reuse_0(		; CHECK-LABEL: @build_vec_v4i32_reuse_0(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i32> [[V0:%.*]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i32> [[V1:%.*]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP2:%.*]] = sub <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 1, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = add <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> [[TMP4]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[TMP5:%.*]] = sub <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP4]], <2 x i32> [[TMP5]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <2 x i32> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i32> [[TMP6]], <2 x i32> [[TMP7]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i32> [[TMP8]], [[TMP5]]
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
; CHECK-NEXT: ret <4 x i32> [[SHUFFLE]]		; CHECK-NEXT: ret <4 x i32> [[SHUFFLE]]
;		;
%v0.0 = extractelement <2 x i32> %v0, i32 0		%v0.0 = extractelement <2 x i32> %v0, i32 0
%v0.1 = extractelement <2 x i32> %v0, i32 1		%v0.1 = extractelement <2 x i32> %v0, i32 1
%v1.0 = extractelement <2 x i32> %v1, i32 0		%v1.0 = extractelement <2 x i32> %v1, i32 0
%v1.1 = extractelement <2 x i32> %v1, i32 1		%v1.1 = extractelement <2 x i32> %v1, i32 1
%tmp0.0 = add i32 %v0.0, %v1.0		%tmp0.0 = add i32 %v0.0, %v1.0
%tmp0.1 = add i32 %v0.1, %v1.1		%tmp0.1 = add i32 %v0.1, %v1.1
[89 unchanged lines not shown]
%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1		%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1
%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2		%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}

define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @reduction_v4i32(		; CHECK-LABEL: @reduction_v4i32(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP1:%.*]] = sub <4 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP2:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = sub <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 7, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>		; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = sub <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP5]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>
; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>		; CHECK-NEXT: [[TMP8:%.*]] = lshr <4 x i32> [[TMP7]], <i32 15, i32 15, i32 15, i32 15>
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]		; CHECK-NEXT: [[TMP9:%.*]] = and <4 x i32> [[TMP8]], <i32 65537, i32 65537, i32 65537, i32 65537>
; CHECK-NEXT: [[TMP10:%.*]] = lshr <4 x i32> [[TMP9]], <i32 15, i32 15, i32 15, i32 15>		; CHECK-NEXT: [[TMP10:%.*]] = mul nuw <4 x i32> [[TMP9]], <i32 65535, i32 65535, i32 65535, i32 65535>
; CHECK-NEXT: [[TMP11:%.*]] = and <4 x i32> [[TMP10]], <i32 65537, i32 65537, i32 65537, i32 65537>		; CHECK-NEXT: [[TMP11:%.*]] = add <4 x i32> [[TMP10]], [[TMP7]]
; CHECK-NEXT: [[TMP12:%.*]] = mul nuw <4 x i32> [[TMP11]], <i32 65535, i32 65535, i32 65535, i32 65535>		; CHECK-NEXT: [[TMP12:%.*]] = xor <4 x i32> [[TMP11]], [[TMP10]]
; CHECK-NEXT: [[TMP13:%.*]] = add <4 x i32> [[TMP12]], [[TMP9]]		; CHECK-NEXT: [[TMP13:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP12]])
; CHECK-NEXT: [[TMP14:%.*]] = xor <4 x i32> [[TMP13]], [[TMP12]]		; CHECK-NEXT: ret i32 [[TMP13]]
; CHECK-NEXT: [[TMP15:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP14]])
; CHECK-NEXT: ret i32 [[TMP15]]
;		;
%v0.0 = extractelement <4 x i32> %v0, i32 0		%v0.0 = extractelement <4 x i32> %v0, i32 0
%v0.1 = extractelement <4 x i32> %v0, i32 1		%v0.1 = extractelement <4 x i32> %v0, i32 1
%v0.2 = extractelement <4 x i32> %v0, i32 2		%v0.2 = extractelement <4 x i32> %v0, i32 2
%v0.3 = extractelement <4 x i32> %v0, i32 3		%v0.3 = extractelement <4 x i32> %v0, i32 3
%v1.0 = extractelement <4 x i32> %v1, i32 0		%v1.0 = extractelement <4 x i32> %v1, i32 0
%v1.1 = extractelement <4 x i32> %v1, i32 1		%v1.1 = extractelement <4 x i32> %v1, i32 1
%v1.2 = extractelement <4 x i32> %v1, i32 2		%v1.2 = extractelement <4 x i32> %v1, i32 2
[38 unchanged lines not shown]

llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -slp-vectorizer -instcombine -S | FileCheck %s		; RUN: opt < %s -slp-vectorizer -instcombine -S | FileCheck %s

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"		target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"		target triple = "aarch64--linux-gnu"

define <2 x i64> @build_vec_v2i64(<2 x i64> %v0, <2 x i64> %v1) {		define <2 x i64> @build_vec_v2i64(<2 x i64> %v0, <2 x i64> %v1) {
; CHECK-LABEL: @build_vec_v2i64(		; CHECK-LABEL: @build_vec_v2i64(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i64> [[V0:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i64> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i64> [[V1:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP2:%.*]] = sub <2 x i64> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i64> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i64> [[TMP1]], <2 x i64> [[TMP2]], <2 x i32> <i32 1, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i64> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = add <2 x i64> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i64> [[TMP3]], <2 x i64> [[TMP4]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[TMP5:%.*]] = sub <2 x i64> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i64> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i64> [[TMP4]], <2 x i64> [[TMP5]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i64> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <2 x i64> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i64> [[TMP6]], <2 x i64> [[TMP7]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: ret <2 x i64> [[TMP7]]
; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i64> [[TMP8]], [[TMP5]]
; CHECK-NEXT: ret <2 x i64> [[TMP9]]
;		;
%v0.0 = extractelement <2 x i64> %v0, i32 0		%v0.0 = extractelement <2 x i64> %v0, i32 0
%v0.1 = extractelement <2 x i64> %v0, i32 1		%v0.1 = extractelement <2 x i64> %v0, i32 1
%v1.0 = extractelement <2 x i64> %v1, i32 0		%v1.0 = extractelement <2 x i64> %v1, i32 0
%v1.1 = extractelement <2 x i64> %v1, i32 1		%v1.1 = extractelement <2 x i64> %v1, i32 1
%tmp0.0 = add i64 %v0.0, %v1.0		%tmp0.0 = add i64 %v0.0, %v1.0
%tmp0.1 = add i64 %v0.1, %v1.1		%tmp0.1 = add i64 %v0.1, %v1.1
%tmp1.0 = sub i64 %v0.0, %v1.0		%tmp1.0 = sub i64 %v0.0, %v1.0
[42 unchanged lines not shown]
%tmp2.1 = add i64 %tmp1.0, %tmp1.1		%tmp2.1 = add i64 %tmp1.0, %tmp1.1
store i64 %tmp2.0, i64* %c.0, align 8		store i64 %tmp2.0, i64* %c.0, align 8
store i64 %tmp2.1, i64* %c.1, align 8		store i64 %tmp2.1, i64* %c.1, align 8
ret void		ret void
}		}

define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32(		; CHECK-LABEL: @build_vec_v4i32(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP1:%.*]] = add <4 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP2:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 3, i32 6>
; CHECK-NEXT: [[TMP4:%.*]] = sub <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>		; CHECK-NEXT: [[TMP5:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = add <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP5]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
; CHECK-NEXT: [[TMP7:%.*]] = sub <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>		; CHECK-NEXT: ret <4 x i32> [[TMP7]]
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]
; CHECK-NEXT: ret <4 x i32> [[TMP9]]
;		;
%v0.0 = extractelement <4 x i32> %v0, i32 0		%v0.0 = extractelement <4 x i32> %v0, i32 0
%v0.1 = extractelement <4 x i32> %v0, i32 1		%v0.1 = extractelement <4 x i32> %v0, i32 1
%v0.2 = extractelement <4 x i32> %v0, i32 2		%v0.2 = extractelement <4 x i32> %v0, i32 2
%v0.3 = extractelement <4 x i32> %v0, i32 3		%v0.3 = extractelement <4 x i32> %v0, i32 3
%v1.0 = extractelement <4 x i32> %v1, i32 0		%v1.0 = extractelement <4 x i32> %v1, i32 0
%v1.1 = extractelement <4 x i32> %v1, i32 1		%v1.1 = extractelement <4 x i32> %v1, i32 1
%v1.2 = extractelement <4 x i32> %v1, i32 2		%v1.2 = extractelement <4 x i32> %v1, i32 2
[14 unchanged lines not shown]
%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1		%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1
%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2		%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}

define <4 x i32> @build_vec_v4i32_reuse_0(<2 x i32> %v0, <2 x i32> %v1) {		define <4 x i32> @build_vec_v4i32_reuse_0(<2 x i32> %v0, <2 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32_reuse_0(		; CHECK-LABEL: @build_vec_v4i32_reuse_0(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i32> [[V0:%.*]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i32> [[V1:%.*]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP2:%.*]] = sub <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 1, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = add <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> [[TMP4]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[TMP5:%.*]] = sub <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP4]], <2 x i32> [[TMP5]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <2 x i32> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i32> [[TMP6]], <2 x i32> [[TMP7]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i32> [[TMP8]], [[TMP5]]
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
; CHECK-NEXT: ret <4 x i32> [[SHUFFLE]]		; CHECK-NEXT: ret <4 x i32> [[SHUFFLE]]
;		;
%v0.0 = extractelement <2 x i32> %v0, i32 0		%v0.0 = extractelement <2 x i32> %v0, i32 0
%v0.1 = extractelement <2 x i32> %v0, i32 1		%v0.1 = extractelement <2 x i32> %v0, i32 1
%v1.0 = extractelement <2 x i32> %v1, i32 0		%v1.0 = extractelement <2 x i32> %v1, i32 0
%v1.1 = extractelement <2 x i32> %v1, i32 1		%v1.1 = extractelement <2 x i32> %v1, i32 1
%tmp0.0 = add i32 %v0.0, %v1.0		%tmp0.0 = add i32 %v0.0, %v1.0
%tmp0.1 = add i32 %v0.1, %v1.1		%tmp0.1 = add i32 %v0.1, %v1.1
[89 unchanged lines not shown]
%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1		%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1
%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2		%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}

define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @reduction_v4i32(		; CHECK-LABEL: @reduction_v4i32(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP1:%.*]] = sub <4 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP2:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = sub <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 7, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>		; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = sub <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP5]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>
; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>		; CHECK-NEXT: [[TMP8:%.*]] = lshr <4 x i32> [[TMP7]], <i32 15, i32 15, i32 15, i32 15>
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]		; CHECK-NEXT: [[TMP9:%.*]] = and <4 x i32> [[TMP8]], <i32 65537, i32 65537, i32 65537, i32 65537>
; CHECK-NEXT: [[TMP10:%.*]] = lshr <4 x i32> [[TMP9]], <i32 15, i32 15, i32 15, i32 15>		; CHECK-NEXT: [[TMP10:%.*]] = mul nuw <4 x i32> [[TMP9]], <i32 65535, i32 65535, i32 65535, i32 65535>
; CHECK-NEXT: [[TMP11:%.*]] = and <4 x i32> [[TMP10]], <i32 65537, i32 65537, i32 65537, i32 65537>		; CHECK-NEXT: [[TMP11:%.*]] = add <4 x i32> [[TMP10]], [[TMP7]]
; CHECK-NEXT: [[TMP12:%.*]] = mul nuw <4 x i32> [[TMP11]], <i32 65535, i32 65535, i32 65535, i32 65535>		; CHECK-NEXT: [[TMP12:%.*]] = xor <4 x i32> [[TMP11]], [[TMP10]]
; CHECK-NEXT: [[TMP13:%.*]] = add <4 x i32> [[TMP12]], [[TMP9]]		; CHECK-NEXT: [[TMP13:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP12]])
; CHECK-NEXT: [[TMP14:%.*]] = xor <4 x i32> [[TMP13]], [[TMP12]]		; CHECK-NEXT: ret i32 [[TMP13]]
; CHECK-NEXT: [[TMP15:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP14]])
; CHECK-NEXT: ret i32 [[TMP15]]
;		;
%v0.0 = extractelement <4 x i32> %v0, i32 0		%v0.0 = extractelement <4 x i32> %v0, i32 0
%v0.1 = extractelement <4 x i32> %v0, i32 1		%v0.1 = extractelement <4 x i32> %v0, i32 1
%v0.2 = extractelement <4 x i32> %v0, i32 2		%v0.2 = extractelement <4 x i32> %v0, i32 2
%v0.3 = extractelement <4 x i32> %v0, i32 3		%v0.3 = extractelement <4 x i32> %v0, i32 3
%v1.0 = extractelement <4 x i32> %v1, i32 0		%v1.0 = extractelement <4 x i32> %v1, i32 0
%v1.1 = extractelement <4 x i32> %v1, i32 1		%v1.1 = extractelement <4 x i32> %v1, i32 1
%v1.2 = extractelement <4 x i32> %v1, i32 2		%v1.2 = extractelement <4 x i32> %v1, i32 2
[38 unchanged lines not shown]

llvm/test/Transforms/SLPVectorizer/X86/addsub.ll

[331 unchanged lines not shown]
%10 = getelementptr inbounds double, double* %b, i64 1		%10 = getelementptr inbounds double, double* %b, i64 1
%11 = load double, double* %10		%11 = load double, double* %10
%12 = fadd double %9, %11		%12 = fadd double %9, %11
%13 = fadd double %7, %12		%13 = fadd double %7, %12
%14 = getelementptr inbounds double, double* %c, i64 1		%14 = getelementptr inbounds double, double* %c, i64 1
store double %13, double* %14		store double %13, double* %14
ret void		ret void
}		}

; Dont vectorization of following code for float data type as sub is not commutative-		define void @vec_shuff_reorder() #0 {
		RKSimon: Update comment? I had to do something similar on D103925, although the wording here might be different.
		ABataev (author): Ok, will do.
; fc[0] = fb[0]+fa[0];		; CHECK-LABEL: @vec_shuff_reorder(
; fc[1] = fa[1]-fb[1];
; fc[2] = fa[2]+fb[2];
; fc[3] = fb[3]-fa[3];
; In the above code we can swap the 1st and 2nd operation as fadd is commutative
; but not 2nd or 4th as fsub is not commutative.

define void @no_vec_shuff_reorder() #0 {
; CHECK-LABEL: @no_vec_shuff_reorder(
; CHECK-NEXT: [[TMP1:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 0), align 4		; CHECK-NEXT: [[TMP1:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 0), align 4
; CHECK-NEXT: [[TMP2:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 0), align 4		; CHECK-NEXT: [[TMP2:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 0), align 4
; CHECK-NEXT: [[TMP3:%.*]] = fadd float [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 1) to <2 x float>*), align 4
; CHECK-NEXT: store float [[TMP3]], float* getelementptr inbounds ([4 x float], [4 x float]* @fc, i32 0, i64 0), align 4		; CHECK-NEXT: [[TMP4:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 1) to <2 x float>*), align 4
; CHECK-NEXT: [[TMP4:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 1), align 4		; CHECK-NEXT: [[TMP5:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 3), align 4
; CHECK-NEXT: [[TMP5:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 1), align 4		; CHECK-NEXT: [[TMP6:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 3), align 4
; CHECK-NEXT: [[TMP6:%.*]] = fsub float [[TMP4]], [[TMP5]]		; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x float> poison, float [[TMP2]], i32 0
; CHECK-NEXT: store float [[TMP6]], float* getelementptr inbounds ([4 x float], [4 x float]* @fc, i32 0, i64 1), align 4		; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x float> [[TMP3]], <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
; CHECK-NEXT: [[TMP7:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 2), align 4		; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <4 x float> [[TMP7]], <4 x float> [[TMP8]], <4 x i32> <i32 0, i32 4, i32 5, i32 3>
; CHECK-NEXT: [[TMP8:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 2), align 4		; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x float> [[TMP9]], float [[TMP5]], i32 3
; CHECK-NEXT: [[TMP9:%.*]] = fadd float [[TMP7]], [[TMP8]]		; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x float> poison, float [[TMP1]], i32 0
; CHECK-NEXT: store float [[TMP9]], float* getelementptr inbounds ([4 x float], [4 x float]* @fc, i32 0, i64 2), align 4		; CHECK-NEXT: [[TMP12:%.*]] = shufflevector <2 x float> [[TMP4]], <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
; CHECK-NEXT: [[TMP10:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 3), align 4		; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <4 x float> [[TMP11]], <4 x float> [[TMP12]], <4 x i32> <i32 0, i32 4, i32 5, i32 3>
; CHECK-NEXT: [[TMP11:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 3), align 4		; CHECK-NEXT: [[TMP14:%.*]] = insertelement <4 x float> [[TMP13]], float [[TMP6]], i32 3
; CHECK-NEXT: [[TMP12:%.*]] = fsub float [[TMP10]], [[TMP11]]		; CHECK-NEXT: [[TMP15:%.*]] = fadd <4 x float> [[TMP10]], [[TMP14]]
; CHECK-NEXT: store float [[TMP12]], float* getelementptr inbounds ([4 x float], [4 x float]* @fc, i32 0, i64 3), align 4		; CHECK-NEXT: [[TMP16:%.*]] = fsub <4 x float> [[TMP10]], [[TMP14]]
		; CHECK-NEXT: [[TMP17:%.*]] = shufflevector <4 x float> [[TMP15]], <4 x float> [[TMP16]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
		; CHECK-NEXT: store <4 x float> [[TMP17]], <4 x float>* bitcast ([4 x float]* @fc to <4 x float>*), align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%1 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 0), align 4		%1 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 0), align 4
%2 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 0), align 4		%2 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 0), align 4
%3 = fadd float %1, %2		%3 = fadd float %1, %2
store float %3, float* getelementptr inbounds ([4 x float], [4 x float]* @fc, i32 0, i64 0), align 4		store float %3, float* getelementptr inbounds ([4 x float], [4 x float]* @fc, i32 0, i64 0), align 4
%4 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 1), align 4		%4 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 1), align 4
%5 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 1), align 4		%5 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 1), align 4
[16 unchanged lines not shown]
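
The new CHECK lines for @vec_shuff_reorder compute a full-width fadd and a full-width fsub on reordered operands and then blend the two results with the mask <0, 5, 2, 7>. The following is a minimal C++ sketch of why that matches the scalar code; it models the IR by hand and is not part of the patch. The names `shuffle2`, `scalarAddSub` and `vectorAddSub` are hypothetical, and the operand layouts are read off the TMP10 and TMP14 lines above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

using Vec4 = std::array<float, 4>;

// Two-source shuffle in the style of shufflevector: mask values 0..3 select
// lanes of `a`, mask values 4..7 select lanes of `b`.
Vec4 shuffle2(const Vec4 &a, const Vec4 &b, const std::array<int, 4> &mask) {
  Vec4 out{};
  for (std::size_t i = 0; i < 4; ++i) {
    assert(mask[i] >= 0 && mask[i] < 8);
    out[i] = mask[i] < 4 ? a[mask[i]] : b[mask[i] - 4];
  }
  return out;
}

// Scalar form described by the old test comment:
//   fc[0] = fb[0]+fa[0]; fc[1] = fa[1]-fb[1];
//   fc[2] = fa[2]+fb[2]; fc[3] = fb[3]-fa[3];
Vec4 scalarAddSub(const Vec4 &fa, const Vec4 &fb) {
  return {fb[0] + fa[0], fa[1] - fb[1], fa[2] + fb[2], fb[3] - fa[3]};
}

// Vectorized form following the new CHECK lines (hand-modelled).
Vec4 vectorAddSub(const Vec4 &fa, const Vec4 &fb) {
  const Vec4 lhs = {fa[0], fa[1], fa[2], fb[3]}; // TMP10
  const Vec4 rhs = {fb[0], fb[1], fb[2], fa[3]}; // TMP14
  Vec4 sum{}, diff{};
  for (std::size_t i = 0; i < 4; ++i) {
    sum[i] = lhs[i] + rhs[i];  // TMP15: fadd <4 x float>
    diff[i] = lhs[i] - rhs[i]; // TMP16: fsub <4 x float>
  }
  // TMP17: lanes 0 and 2 come from the fadd, lanes 1 and 3 from the fsub.
  return shuffle2(sum, diff, {0, 5, 2, 7});
}
```

Only the operands of the last lane are swapped (fb[3] on the left, fa[3] on the right), which is what keeps the non-commutative fsub in lane 3 correct; lane 0 relies on fadd being commutative.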

llvm/test/Transforms/SLPVectorizer/X86/crash_cmpop.ll

	[55 unchanged lines not shown]
	; AVX-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]			; AVX-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
	; AVX-NEXT: [[ACC1_056:%.*]] = phi float [ 0.000000e+00, [[ENTRY]] ], [ [[ADD13:%.*]], [[FOR_BODY]] ]			; AVX-NEXT: [[ACC1_056:%.*]] = phi float [ 0.000000e+00, [[ENTRY]] ], [ [[ADD13:%.*]], [[FOR_BODY]] ]
	; AVX-NEXT: [[TMP0:%.*]] = phi <2 x float> [ zeroinitializer, [[ENTRY]] ], [ [[TMP19:%.*]], [[FOR_BODY]] ]			; AVX-NEXT: [[TMP0:%.*]] = phi <2 x float> [ zeroinitializer, [[ENTRY]] ], [ [[TMP19:%.*]], [[FOR_BODY]] ]
	; AVX-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[SRC:%.*]], i64 [[INDVARS_IV]]			; AVX-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[SRC:%.*]], i64 [[INDVARS_IV]]
	; AVX-NEXT: [[TMP1:%.*]] = load float, float* [[ARRAYIDX]], align 4			; AVX-NEXT: [[TMP1:%.*]] = load float, float* [[ARRAYIDX]], align 4
	; AVX-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1			; AVX-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; AVX-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds float, float* [[DEST:%.*]], i64 [[INDVARS_IV]]			; AVX-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds float, float* [[DEST:%.*]], i64 [[INDVARS_IV]]
	; AVX-NEXT: store float [[ACC1_056]], float* [[ARRAYIDX2]], align 4			; AVX-NEXT: store float [[ACC1_056]], float* [[ARRAYIDX2]], align 4
	; AVX-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP0]], <2 x float> poison, <2 x i32> <i32 1, i32 0>
	; AVX-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[TMP1]], i32 0			; AVX-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[TMP1]], i32 0
	; AVX-NEXT: [[TMP3:%.*]] = insertelement <2 x float> [[TMP2]], float [[TMP1]], i32 1			; AVX-NEXT: [[TMP3:%.*]] = insertelement <2 x float> [[TMP2]], float [[TMP1]], i32 1
	; AVX-NEXT: [[TMP4:%.*]] = fadd <2 x float> [[SHUFFLE]], [[TMP3]]			; AVX-NEXT: [[TMP4:%.*]] = fadd <2 x float> [[TMP0]], [[TMP3]]
				; AVX-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP4]], <2 x float> poison, <2 x i32> <i32 1, i32 0>
	; AVX-NEXT: [[TMP5:%.*]] = fmul <2 x float> [[TMP0]], zeroinitializer			; AVX-NEXT: [[TMP5:%.*]] = fmul <2 x float> [[TMP0]], zeroinitializer
	; AVX-NEXT: [[TMP6:%.*]] = fadd <2 x float> [[TMP5]], [[TMP4]]			; AVX-NEXT: [[TMP6:%.*]] = fadd <2 x float> [[TMP5]], [[SHUFFLE]]
	; AVX-NEXT: [[TMP7:%.*]] = fcmp olt <2 x float> [[TMP6]], <float 1.000000e+00, float 1.000000e+00>			; AVX-NEXT: [[TMP7:%.*]] = fcmp olt <2 x float> [[TMP6]], <float 1.000000e+00, float 1.000000e+00>
	; AVX-NEXT: [[TMP8:%.*]] = select <2 x i1> [[TMP7]], <2 x float> [[TMP6]], <2 x float> <float 1.000000e+00, float 1.000000e+00>			; AVX-NEXT: [[TMP8:%.*]] = select <2 x i1> [[TMP7]], <2 x float> [[TMP6]], <2 x float> <float 1.000000e+00, float 1.000000e+00>
	; AVX-NEXT: [[TMP9:%.*]] = fcmp olt <2 x float> [[TMP8]], <float -1.000000e+00, float -1.000000e+00>			; AVX-NEXT: [[TMP9:%.*]] = fcmp olt <2 x float> [[TMP8]], <float -1.000000e+00, float -1.000000e+00>
	; AVX-NEXT: [[TMP10:%.*]] = fmul <2 x float> [[TMP8]], zeroinitializer			; AVX-NEXT: [[TMP10:%.*]] = fmul <2 x float> [[TMP8]], zeroinitializer
	; AVX-NEXT: [[TMP11:%.*]] = select <2 x i1> [[TMP9]], <2 x float> <float -0.000000e+00, float -0.000000e+00>, <2 x float> [[TMP10]]			; AVX-NEXT: [[TMP11:%.*]] = select <2 x i1> [[TMP9]], <2 x float> <float -0.000000e+00, float -0.000000e+00>, <2 x float> [[TMP10]]
	; AVX-NEXT: [[TMP12:%.*]] = extractelement <2 x float> [[TMP11]], i32 0			; AVX-NEXT: [[TMP12:%.*]] = extractelement <2 x float> [[TMP11]], i32 0
	; AVX-NEXT: [[TMP13:%.*]] = extractelement <2 x float> [[TMP11]], i32 1			; AVX-NEXT: [[TMP13:%.*]] = extractelement <2 x float> [[TMP11]], i32 1
	; AVX-NEXT: [[ADD13]] = fadd float [[TMP12]], [[TMP13]]			; AVX-NEXT: [[ADD13]] = fadd float [[TMP12]], [[TMP13]]
	[61 unchanged lines not shown]

llvm/test/Transforms/SLPVectorizer/X86/extract.ll

[24 unchanged lines not shown]
store double %A1, double* %P1, align 4		store double %A1, double* %P1, align 4
ret void		ret void
}		}

define void @fextr1(double* %ptr) {		define void @fextr1(double* %ptr) {
; CHECK-LABEL: @fextr1(		; CHECK-LABEL: @fextr1(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[LD:%.*]] = load <2 x double>, <2 x double>* undef, align 16		; CHECK-NEXT: [[LD:%.*]] = load <2 x double>, <2 x double>* undef, align 16
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[LD]], <2 x double> poison, <2 x i32> <i32 1, i32 0>
; CHECK-NEXT: [[P1:%.*]] = getelementptr inbounds double, double* [[PTR:%.*]], i64 0		; CHECK-NEXT: [[P1:%.*]] = getelementptr inbounds double, double* [[PTR:%.*]], i64 0
; CHECK-NEXT: [[TMP0:%.*]] = fadd <2 x double> [[SHUFFLE]], <double 3.400000e+00, double 1.200000e+00>		; CHECK-NEXT: [[TMP0:%.*]] = fadd <2 x double> [[LD]], <double 1.200000e+00, double 3.400000e+00>
		; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP0]], <2 x double> poison, <2 x i32> <i32 1, i32 0>
; CHECK-NEXT: [[TMP1:%.*]] = bitcast double* [[P1]] to <2 x double>*		; CHECK-NEXT: [[TMP1:%.*]] = bitcast double* [[P1]] to <2 x double>*
; CHECK-NEXT: store <2 x double> [[TMP0]], <2 x double>* [[TMP1]], align 4		; CHECK-NEXT: store <2 x double> [[SHUFFLE]], <2 x double>* [[TMP1]], align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%LD = load <2 x double>, <2 x double>* undef		%LD = load <2 x double>, <2 x double>* undef
%V0 = extractelement <2 x double> %LD, i32 0		%V0 = extractelement <2 x double> %LD, i32 0
%V1 = extractelement <2 x double> %LD, i32 1		%V1 = extractelement <2 x double> %LD, i32 1
%P0 = getelementptr inbounds double, double* %ptr, i64 1 ; <--- incorrect order		%P0 = getelementptr inbounds double, double* %ptr, i64 1 ; <--- incorrect order
%P1 = getelementptr inbounds double, double* %ptr, i64 0		%P1 = getelementptr inbounds double, double* %ptr, i64 0
[34 unchanged lines not shown]
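
In @fextr1 the lane swap <1, 0> moves from the loaded value to the value being stored, and the fadd constants change from <3.4, 1.2> to <1.2, 3.4> to compensate. The C++ sketch below illustrates why the two forms agree: a lane-wise operation commutes with a permutation as long as both operands are permuted the same way. This is a hand-written model, not code from the patch; `permute`, `add`, `shuffleThenAdd` and `addThenShuffle` are hypothetical names.

```cpp
#include <array>
#include <cstddef>

using Vec2 = std::array<double, 2>;

// Single-source shuffle: out[i] = v[mask[i]].
Vec2 permute(const Vec2 &v, const std::array<std::size_t, 2> &mask) {
  return {v[mask[0]], v[mask[1]]};
}

// Lane-wise addition, standing in for fadd <2 x double>.
Vec2 add(const Vec2 &a, const Vec2 &b) {
  return {a[0] + b[0], a[1] + b[1]};
}

// Old CHECK lines: shuffle the load, then add the already-swapped constants.
Vec2 shuffleThenAdd(const Vec2 &ld) {
  return add(permute(ld, {1, 0}), {3.4, 1.2});
}

// New CHECK lines: add in load order, then shuffle right before the store.
Vec2 addThenShuffle(const Vec2 &ld) {
  return permute(add(ld, {1.2, 3.4}), {1, 0});
}
```

Both functions perform the same two additions, ld[1] + 3.4 and ld[0] + 1.2, and place them in the same lanes, so the stored vector is identical; only the position of the shuffle in the graph differs.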

llvm/test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 | FileCheck %s			; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 | FileCheck %s

	@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4
	@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4

	define i32 @fn1() {			define i32 @fn1() {
	; CHECK-LABEL: @fn1(			; CHECK-LABEL: @fn1(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i32>, <4 x i32>* bitcast ([4 x i32]* @b to <4 x i32>*), align 4			; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i32>, <4 x i32>* bitcast ([4 x i32]* @b to <4 x i32>*), align 4
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP0]], <4 x i32> poison, <4 x i32> <i32 1, i32 2, i32 3, i32 0>			; CHECK-NEXT: [[TMP1:%.*]] = icmp sgt <4 x i32> [[TMP0]], zeroinitializer
	; CHECK-NEXT: [[TMP1:%.*]] = icmp sgt <4 x i32> [[SHUFFLE]], zeroinitializer			; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
	; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x i32> [[SHUFFLE]], i32 0			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i32> <i32 8, i32 poison, i32 ptrtoint (i32 ()* @fn1 to i32), i32 ptrtoint (i32 ()* @fn1 to i32)>, i32 [[TMP2]], i32 1
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i32> <i32 poison, i32 ptrtoint (i32 ()* @fn1 to i32), i32 ptrtoint (i32 ()* @fn1 to i32), i32 8>, i32 [[TMP2]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP1]], <4 x i32> [[TMP3]], <4 x i32> <i32 0, i32 6, i32 0, i32 0>
	; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP1]], <4 x i32> [[TMP3]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> poison, <4 x i32> <i32 1, i32 2, i32 3, i32 0>
	; CHECK-NEXT: store <4 x i32> [[TMP4]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4			; CHECK-NEXT: store <4 x i32> [[SHUFFLE]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4
	; CHECK-NEXT: ret i32 0			; CHECK-NEXT: ret i32 0
	;			;
	entry:			entry:
	%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4			%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4
	%cmp = icmp sgt i32 %0, 0			%cmp = icmp sgt i32 %0, 0
	%cond = select i1 %cmp, i32 8, i32 0			%cond = select i1 %cmp, i32 8, i32 0
	store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4			store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4
	%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4			%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4

llvm/test/Transforms/SLPVectorizer/X86/jumbled-load.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s			; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s



	define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {			define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {
	; CHECK-LABEL: @jumbled-load(			; CHECK-LABEL: @jumbled-load(
	; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0			; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
	; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
	; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
	; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
	; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
	; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4			; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
	; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0			; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0
	; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2
	; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3
	; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1
	; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*			; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*
	; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4
	; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> poison, <4 x i32> <i32 0, i32 1, i32 3, i32 2>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> poison, <4 x i32> <i32 2, i32 0, i32 3, i32 1>
	; CHECK-NEXT: [[TMP5:%.*]] = mul <4 x i32> [[SHUFFLE]], [[SHUFFLE1]]			; CHECK-NEXT: [[TMP5:%.*]] = mul <4 x i32> [[TMP2]], [[SHUFFLE]]
	; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0			; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
	; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1			; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
	; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2			; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
	; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3			; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
				; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
	; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*			; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
	; CHECK-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4			; CHECK-NEXT: store <4 x i32> [[SHUFFLE1]], <4 x i32>* [[TMP6]], align 4
	; CHECK-NEXT: ret i32 undef			; CHECK-NEXT: ret i32 undef
	;			;
	%in.addr = getelementptr inbounds i32, i32* %in, i64 0			%in.addr = getelementptr inbounds i32, i32* %in, i64 0
	%load.1 = load i32, i32* %in.addr, align 4			%load.1 = load i32, i32* %in.addr, align 4
	%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3			%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3
	%load.2 = load i32, i32* %gep.1, align 4			%load.2 = load i32, i32* %gep.1, align 4
	%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1			%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1
	%load.3 = load i32, i32* %gep.2, align 4			%load.3 = load i32, i32* %gep.2, align 4
	define i32 @jumbled-load-multiuses(i32* noalias nocapture %in, i32* noalias nocapture %out) {			define i32 @jumbled-load-multiuses(i32* noalias nocapture %in, i32* noalias nocapture %out) {
	; CHECK-LABEL: @jumbled-load-multiuses(			; CHECK-LABEL: @jumbled-load-multiuses(
	; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0			; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
	; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
	; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
	; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
	; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
	; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4			; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 2, i32 0>			; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i32> [[TMP2]], i32 1
	; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i32> [[SHUFFLE]], i32 2
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> poison, i32 [[TMP3]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> poison, i32 [[TMP3]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x i32> [[SHUFFLE]], i32 1			; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x i32> [[TMP2]], i32 2
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP4]], i32 [[TMP5]], i32 1			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP4]], i32 [[TMP5]], i32 1
	; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x i32> [[SHUFFLE]], i32 3			; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x i32> [[TMP2]], i32 0
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP7]], i32 2			; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP7]], i32 2
	; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x i32> [[SHUFFLE]], i32 0			; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x i32> [[TMP2]], i32 3
	; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP8]], i32 [[TMP9]], i32 3			; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP8]], i32 [[TMP9]], i32 3
	; CHECK-NEXT: [[TMP11:%.*]] = mul <4 x i32> [[SHUFFLE]], [[TMP10]]			; CHECK-NEXT: [[TMP11:%.*]] = mul <4 x i32> [[TMP2]], [[TMP10]]
	; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0			; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
	; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1			; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
	; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2			; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
	; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3			; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
				; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP11]], <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
	; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*			; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
	; CHECK-NEXT: store <4 x i32> [[TMP11]], <4 x i32>* [[TMP12]], align 4			; CHECK-NEXT: store <4 x i32> [[SHUFFLE]], <4 x i32>* [[TMP12]], align 4
	; CHECK-NEXT: ret i32 undef			; CHECK-NEXT: ret i32 undef
	;			;
	%in.addr = getelementptr inbounds i32, i32* %in, i64 0			%in.addr = getelementptr inbounds i32, i32* %in, i64 0
	%load.1 = load i32, i32* %in.addr, align 4			%load.1 = load i32, i32* %in.addr, align 4
	%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3			%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3
	%load.2 = load i32, i32* %gep.1, align 4			%load.2 = load i32, i32* %gep.1, align 4
	%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1			%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1
	%load.3 = load i32, i32* %gep.2, align 4			%load.3 = load i32, i32* %gep.2, align 4

llvm/test/Transforms/SLPVectorizer/X86/jumbled_store_crash.ll

	; CHECK-NEXT: [[TMP2:%.*]] = load <2 x i32>, <2 x i32>* [[TMP1]], align 4			; CHECK-NEXT: [[TMP2:%.*]] = load <2 x i32>, <2 x i32>* [[TMP1]], align 4
	; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 13			; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 13
	; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[ARRAYIDX1]] to <2 x i32>*			; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[ARRAYIDX1]] to <2 x i32>*
	; CHECK-NEXT: [[TMP4:%.*]] = load <2 x i32>, <2 x i32>* [[TMP3]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = load <2 x i32>, <2 x i32>* [[TMP3]], align 4
	; CHECK-NEXT: [[TMP5:%.*]] = add nsw <2 x i32> [[TMP4]], [[TMP2]]			; CHECK-NEXT: [[TMP5:%.*]] = add nsw <2 x i32> [[TMP4]], [[TMP2]]
	; CHECK-NEXT: [[TMP6:%.*]] = sitofp <2 x i32> [[TMP5]] to <2 x float>			; CHECK-NEXT: [[TMP6:%.*]] = sitofp <2 x i32> [[TMP5]] to <2 x float>
	; CHECK-NEXT: [[TMP7:%.*]] = fmul <2 x float> [[TMP6]], <float 1.000000e+01, float 1.000000e+01>			; CHECK-NEXT: [[TMP7:%.*]] = fmul <2 x float> [[TMP6]], <float 1.000000e+01, float 1.000000e+01>
	; CHECK-NEXT: [[TMP8:%.*]] = fsub <2 x float> <float 1.000000e+00, float 0.000000e+00>, [[TMP7]]			; CHECK-NEXT: [[TMP8:%.*]] = fsub <2 x float> <float 1.000000e+00, float 0.000000e+00>, [[TMP7]]
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP8]], <2 x float> poison, <4 x i32> <i32 0, i32 0, i32 1, i32 1>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP8]], <2 x float> poison, <4 x i32> <i32 1, i32 0, i32 1, i32 0>
	; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x float> [[SHUFFLE]], i32 0			; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x float> [[SHUFFLE]], i32 1
	; CHECK-NEXT: store float [[TMP9]], float* @g, align 4			; CHECK-NEXT: store float [[TMP9]], float* @g, align 4
	; CHECK-NEXT: [[TMP10:%.*]] = fadd <4 x float> [[SHUFFLE]], <float -1.000000e+00, float 1.000000e+00, float -1.000000e+00, float 1.000000e+00>			; CHECK-NEXT: [[TMP10:%.*]] = fadd <4 x float> [[SHUFFLE]], <float -1.000000e+00, float -1.000000e+00, float 1.000000e+00, float 1.000000e+00>
	; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x float> [[TMP10]], i32 3			; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x float> [[TMP10]], i32 2
	; CHECK-NEXT: store float [[TMP11]], float* @c, align 4			; CHECK-NEXT: store float [[TMP11]], float* @c, align 4
	; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x float> [[TMP10]], i32 2			; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x float> [[TMP10]], i32 0
	; CHECK-NEXT: store float [[TMP12]], float* @d, align 4			; CHECK-NEXT: store float [[TMP12]], float* @d, align 4
	; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x float> [[TMP10]], i32 1			; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x float> [[TMP10]], i32 3
	; CHECK-NEXT: store float [[TMP13]], float* @e, align 4			; CHECK-NEXT: store float [[TMP13]], float* @e, align 4
	; CHECK-NEXT: [[TMP14:%.*]] = extractelement <4 x float> [[TMP10]], i32 0			; CHECK-NEXT: [[TMP14:%.*]] = extractelement <4 x float> [[TMP10]], i32 1
	; CHECK-NEXT: store float [[TMP14]], float* @f, align 4			; CHECK-NEXT: store float [[TMP14]], float* @f, align 4
	; CHECK-NEXT: [[ARRAYIDX15:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 14			; CHECK-NEXT: [[ARRAYIDX15:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 14
	; CHECK-NEXT: [[ARRAYIDX18:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 15			; CHECK-NEXT: [[ARRAYIDX18:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 15
	; CHECK-NEXT: [[TMP15:%.*]] = load i32, i32* @a, align 4			; CHECK-NEXT: [[TMP15:%.*]] = load i32, i32* @a, align 4
	; CHECK-NEXT: [[CONV19:%.*]] = sitofp i32 [[TMP15]] to float			; CHECK-NEXT: [[CONV19:%.*]] = sitofp i32 [[TMP15]] to float
	; CHECK-NEXT: [[TMP16:%.*]] = insertelement <4 x float> <float -1.000000e+00, float -1.000000e+00, float poison, float poison>, float [[CONV19]], i32 2			; CHECK-NEXT: [[TMP16:%.*]] = insertelement <4 x float> <float poison, float -1.000000e+00, float poison, float -1.000000e+00>, float [[CONV19]], i32 0
	; CHECK-NEXT: [[TMP17:%.*]] = extractelement <4 x float> [[SHUFFLE]], i32 2			; CHECK-NEXT: [[TMP17:%.*]] = extractelement <4 x float> [[SHUFFLE]], i32 0
	; CHECK-NEXT: [[TMP18:%.*]] = insertelement <4 x float> [[TMP16]], float [[TMP17]], i32 3			; CHECK-NEXT: [[TMP18:%.*]] = insertelement <4 x float> [[TMP16]], float [[TMP17]], i32 2
	; CHECK-NEXT: [[TMP19:%.*]] = fadd <4 x float> [[TMP10]], [[TMP18]]			; CHECK-NEXT: [[TMP19:%.*]] = fsub <4 x float> [[TMP10]], [[TMP18]]
	; CHECK-NEXT: [[TMP20:%.*]] = fsub <4 x float> [[TMP10]], [[TMP18]]			; CHECK-NEXT: [[TMP20:%.*]] = fadd <4 x float> [[TMP10]], [[TMP18]]
	; CHECK-NEXT: [[TMP21:%.*]] = shufflevector <4 x float> [[TMP19]], <4 x float> [[TMP20]], <4 x i32> <i32 0, i32 1, i32 6, i32 7>			; CHECK-NEXT: [[TMP21:%.*]] = shufflevector <4 x float> [[TMP19]], <4 x float> [[TMP20]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
	; CHECK-NEXT: [[TMP22:%.*]] = fptosi <4 x float> [[TMP21]] to <4 x i32>			; CHECK-NEXT: [[TMP22:%.*]] = fptosi <4 x float> [[TMP21]] to <4 x i32>
	; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x i32> [[TMP22]], <4 x i32> poison, <4 x i32> <i32 2, i32 0, i32 3, i32 1>
	; CHECK-NEXT: [[TMP23:%.*]] = bitcast i32* [[ARRAYIDX1]] to <4 x i32>*			; CHECK-NEXT: [[TMP23:%.*]] = bitcast i32* [[ARRAYIDX1]] to <4 x i32>*
	; CHECK-NEXT: store <4 x i32> [[SHUFFLE1]], <4 x i32>* [[TMP23]], align 4			; CHECK-NEXT: store <4 x i32> [[TMP22]], <4 x i32>* [[TMP23]], align 4
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%0 = load i32*, i32** @b, align 8			%0 = load i32*, i32** @b, align 8
	%arrayidx = getelementptr inbounds i32, i32* %0, i64 4			%arrayidx = getelementptr inbounds i32, i32* %0, i64 4
	%1 = load i32, i32* %arrayidx, align 4			%1 = load i32, i32* %arrayidx, align 4
	%arrayidx1 = getelementptr inbounds i32, i32* %0, i64 12			%arrayidx1 = getelementptr inbounds i32, i32* %0, i64 12
	%2 = load i32, i32* %arrayidx1, align 4			%2 = load i32, i32* %arrayidx1, align 4

llvm/test/Transforms/SLPVectorizer/X86/reorder_repeated_ops.ll

	; CHECK: bb1:			; CHECK: bb1:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	; CHECK: bb2:			; CHECK: bb2:
	; CHECK-NEXT: [[T:%.*]] = select i1 undef, i16 undef, i16 15			; CHECK-NEXT: [[T:%.*]] = select i1 undef, i16 undef, i16 15
	; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x i16> <i16 poison, i16 undef>, i16 [[T]], i32 0			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x i16> <i16 poison, i16 undef>, i16 [[T]], i32 0
	; CHECK-NEXT: [[TMP1:%.*]] = sext <2 x i16> [[TMP0]] to <2 x i32>			; CHECK-NEXT: [[TMP1:%.*]] = sext <2 x i16> [[TMP0]] to <2 x i32>
	; CHECK-NEXT: [[TMP2:%.*]] = sub nsw <2 x i32> <i32 undef, i32 63>, [[TMP1]]			; CHECK-NEXT: [[TMP2:%.*]] = sub nsw <2 x i32> <i32 undef, i32 63>, [[TMP1]]
	; CHECK-NEXT: [[TMP3:%.*]] = sub <2 x i32> [[TMP2]], undef			; CHECK-NEXT: [[TMP3:%.*]] = sub <2 x i32> [[TMP2]], undef
	; CHECK-NEXT: [[SHUFFLE10:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> poison, <4 x i32> <i32 0, i32 0, i32 0, i32 1>			; CHECK-NEXT: [[SHUFFLE10:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> poison, <4 x i32> <i32 1, i32 0, i32 0, i32 0>
	; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[SHUFFLE10]], <i32 15, i32 31, i32 47, i32 poison>			; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[SHUFFLE10]], <i32 undef, i32 15, i32 31, i32 47>
	; CHECK-NEXT: [[TMP5:%.*]] = call i32 @llvm.vector.reduce.smax.v4i32(<4 x i32> [[TMP4]])			; CHECK-NEXT: [[TMP5:%.*]] = call i32 @llvm.vector.reduce.smax.v4i32(<4 x i32> [[TMP4]])
	; CHECK-NEXT: [[T19:%.*]] = select i1 undef, i32 [[TMP5]], i32 undef			; CHECK-NEXT: [[T19:%.*]] = select i1 undef, i32 [[TMP5]], i32 undef
	; CHECK-NEXT: [[T20:%.*]] = icmp sgt i32 [[T19]], 63			; CHECK-NEXT: [[T20:%.*]] = icmp sgt i32 [[T19]], 63
	; CHECK-NEXT: [[TMP6:%.*]] = sub nsw <2 x i32> undef, [[TMP1]]			; CHECK-NEXT: [[TMP6:%.*]] = sub nsw <2 x i32> undef, [[TMP1]]
	; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i32> [[TMP6]], undef			; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i32> [[TMP6]], undef
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
	; CHECK-NEXT: [[TMP8:%.*]] = add nsw <4 x i32> [[SHUFFLE]], <i32 -49, i32 -33, i32 -33, i32 -17>			; CHECK-NEXT: [[TMP8:%.*]] = add nsw <4 x i32> [[SHUFFLE]], <i32 -49, i32 -33, i32 -33, i32 -17>
	; CHECK-NEXT: [[TMP9:%.*]] = call i32 @llvm.vector.reduce.smin.v4i32(<4 x i32> [[TMP8]])			; CHECK-NEXT: [[TMP9:%.*]] = call i32 @llvm.vector.reduce.smin.v4i32(<4 x i32> [[TMP8]])

llvm/test/Transforms/SLPVectorizer/X86/split-load8_2-unord.ll

	; CHECK-NEXT: [[G21:%.*]] = getelementptr inbounds [16 x i32], [16 x i32]* [[P2]], i32 0, i64 13			; CHECK-NEXT: [[G21:%.*]] = getelementptr inbounds [16 x i32], [16 x i32]* [[P2]], i32 0, i64 13
	; CHECK-NEXT: [[G22:%.*]] = getelementptr inbounds [16 x i32], [16 x i32]* [[P2]], i32 0, i64 14			; CHECK-NEXT: [[G22:%.*]] = getelementptr inbounds [16 x i32], [16 x i32]* [[P2]], i32 0, i64 14
	; CHECK-NEXT: [[G23:%.*]] = getelementptr inbounds [16 x i32], [16 x i32]* [[P2]], i32 0, i64 15			; CHECK-NEXT: [[G23:%.*]] = getelementptr inbounds [16 x i32], [16 x i32]* [[P2]], i32 0, i64 15
	; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds [[STRUCT_S:%.*]], %struct.S* [[P:%.*]], i64 0, i32 0, i64 0			; CHECK-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds [[STRUCT_S:%.*]], %struct.S* [[P:%.*]], i64 0, i32 0, i64 0
	; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 0, i32 0, i64 1			; CHECK-NEXT: [[ARRAYIDX9:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 0, i32 0, i64 1
	; CHECK-NEXT: [[ARRAYIDX16:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 0, i32 0, i64 2			; CHECK-NEXT: [[ARRAYIDX16:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 0, i32 0, i64 2
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[G10]] to <4 x i32>*			; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[G10]] to <4 x i32>*
	; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> poison, <4 x i32> <i32 1, i32 0, i32 2, i32 3>
	; CHECK-NEXT: [[ARRAYIDX23:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 0, i32 0, i64 3			; CHECK-NEXT: [[ARRAYIDX23:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 0, i32 0, i64 3
				; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> poison, <4 x i32> <i32 1, i32 0, i32 2, i32 3>
	; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[ARRAYIDX2]] to <4 x i32>*			; CHECK-NEXT: [[TMP2:%.*]] = bitcast i32* [[ARRAYIDX2]] to <4 x i32>*
	; CHECK-NEXT: store <4 x i32> [[SHUFFLE]], <4 x i32>* [[TMP2]], align 4			; CHECK-NEXT: store <4 x i32> [[SHUFFLE]], <4 x i32>* [[TMP2]], align 4
	; CHECK-NEXT: [[ARRAYIDX30:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 0, i32 0, i64 4			; CHECK-NEXT: [[ARRAYIDX30:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 0, i32 0, i64 4
	; CHECK-NEXT: [[ARRAYIDX37:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 0, i32 0, i64 5			; CHECK-NEXT: [[ARRAYIDX37:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 0, i32 0, i64 5
	; CHECK-NEXT: [[ARRAYIDX44:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 0, i32 0, i64 6			; CHECK-NEXT: [[ARRAYIDX44:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 0, i32 0, i64 6
	; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[G20]] to <4 x i32>*			; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[G20]] to <4 x i32>*
	; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4
	; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> poison, <4 x i32> <i32 3, i32 1, i32 2, i32 0>
	; CHECK-NEXT: [[ARRAYIDX51:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 0, i32 0, i64 7			; CHECK-NEXT: [[ARRAYIDX51:%.*]] = getelementptr inbounds [[STRUCT_S]], %struct.S* [[P]], i64 0, i32 0, i64 7
				; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> poison, <4 x i32> <i32 3, i32 1, i32 2, i32 0>
	; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32* [[ARRAYIDX30]] to <4 x i32>*			; CHECK-NEXT: [[TMP5:%.*]] = bitcast i32* [[ARRAYIDX30]] to <4 x i32>*
	; CHECK-NEXT: store <4 x i32> [[SHUFFLE1]], <4 x i32>* [[TMP5]], align 4			; CHECK-NEXT: store <4 x i32> [[SHUFFLE1]], <4 x i32>* [[TMP5]], align 4
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%p1 = alloca [16 x i32], align 16			%p1 = alloca [16 x i32], align 16
	%p2 = alloca [16 x i32], align 16			%p2 = alloca [16 x i32], align 16
	%g10 = getelementptr inbounds [16 x i32], [16 x i32]* %p1, i32 0, i64 4			%g10 = getelementptr inbounds [16 x i32], [16 x i32]* %p1, i32 0, i64 4

llvm/test/Transforms/SLPVectorizer/X86/vectorize-reorder-alt-shuffle.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu < %s | FileCheck %s			; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu < %s | FileCheck %s

	define void @foo(i8* %c, float* %d) {			define void @foo(i8* %c, float* %d) {
	; CHECK-LABEL: @foo(			; CHECK-LABEL: @foo(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i8, i8* [[C:%.*]], i64 4			; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i8, i8* [[C:%.*]], i64 4
	; CHECK-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i8, i8* [[C]], i64 1			; CHECK-NEXT: [[ARRAYIDX4:%.*]] = getelementptr inbounds i8, i8* [[C]], i64 1
	; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i8, i8* [[C]], i64 2			; CHECK-NEXT: [[ARRAYIDX12:%.*]] = getelementptr inbounds i8, i8* [[C]], i64 2
	; CHECK-NEXT: [[ARRAYIDX17:%.*]] = getelementptr inbounds i8, i8* [[C]], i64 3			; CHECK-NEXT: [[ARRAYIDX17:%.*]] = getelementptr inbounds i8, i8* [[C]], i64 3
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast i8* [[ARRAYIDX4]] to <4 x i8>*			; CHECK-NEXT: [[TMP0:%.*]] = bitcast i8* [[ARRAYIDX4]] to <4 x i8>*
	; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i8>, <4 x i8>* [[TMP0]], align 1			; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i8>, <4 x i8>* [[TMP0]], align 1
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i8> [[TMP1]], <4 x i8> poison, <4 x i32> <i32 1, i32 2, i32 3, i32 0>			; CHECK-NEXT: [[TMP2:%.*]] = zext <4 x i8> [[TMP1]] to <4 x i32>
	; CHECK-NEXT: [[TMP2:%.*]] = zext <4 x i8> [[SHUFFLE]] to <4 x i32>			; CHECK-NEXT: [[TMP3:%.*]] = shl nuw nsw <4 x i32> [[TMP2]], <i32 2, i32 2, i32 2, i32 3>
	; CHECK-NEXT: [[TMP3:%.*]] = shl nuw nsw <4 x i32> [[TMP2]], <i32 2, i32 2, i32 3, i32 2>			; CHECK-NEXT: [[TMP4:%.*]] = and <4 x i32> [[TMP2]], <i32 2, i32 2, i32 2, i32 3>
	; CHECK-NEXT: [[TMP4:%.*]] = and <4 x i32> [[TMP2]], <i32 2, i32 2, i32 3, i32 2>			; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 1, i32 2, i32 7, i32 0>
	; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 1, i32 6, i32 3>
	; CHECK-NEXT: [[ADD_PTR:%.*]] = getelementptr inbounds float, float* [[D:%.*]], i64 -1			; CHECK-NEXT: [[ADD_PTR:%.*]] = getelementptr inbounds float, float* [[D:%.*]], i64 -1
	; CHECK-NEXT: [[ADD_PTR37:%.*]] = getelementptr inbounds float, float* [[D]], i64 -2			; CHECK-NEXT: [[ADD_PTR37:%.*]] = getelementptr inbounds float, float* [[D]], i64 -2
	; CHECK-NEXT: [[ADD_PTR45:%.*]] = getelementptr inbounds float, float* [[D]], i64 -3			; CHECK-NEXT: [[ADD_PTR45:%.*]] = getelementptr inbounds float, float* [[D]], i64 -3
	; CHECK-NEXT: [[TMP6:%.*]] = add nsw <4 x i32> poison, [[TMP5]]			; CHECK-NEXT: [[TMP6:%.*]] = add nsw <4 x i32> poison, [[TMP5]]
	; CHECK-NEXT: [[TMP7:%.*]] = sitofp <4 x i32> [[TMP6]] to <4 x float>			; CHECK-NEXT: [[TMP7:%.*]] = sitofp <4 x i32> [[TMP6]] to <4 x float>
	; CHECK-NEXT: [[TMP8:%.*]] = fdiv <4 x float> [[TMP7]], poison			; CHECK-NEXT: [[TMP8:%.*]] = fdiv <4 x float> [[TMP7]], poison
	; CHECK-NEXT: [[ADD_PTR53:%.*]] = getelementptr inbounds float, float* [[D]], i64 -4			; CHECK-NEXT: [[ADD_PTR53:%.*]] = getelementptr inbounds float, float* [[D]], i64 -4
	; CHECK-NEXT: [[TMP9:%.*]] = bitcast float* [[ADD_PTR53]] to <4 x float>*			; CHECK-NEXT: [[TMP9:%.*]] = bitcast float* [[ADD_PTR53]] to <4 x float>*

llvm/test/Transforms/SLPVectorizer/X86/vectorize-reorder-reuse.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 < %s | FileCheck %s			; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 < %s | FileCheck %s

	define i32 @foo(i32* nocapture readonly %arr, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %a8) {			define i32 @foo(i32* nocapture readonly %arr, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %a8) {
	; CHECK-LABEL: @foo(			; CHECK-LABEL: @foo(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[ARR:%.*]], i64 1			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[ARR:%.*]], i64 1
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARR]] to <2 x i32>*			; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARR]] to <2 x i32>*
	; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, <2 x i32>* [[TMP0]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, <2 x i32>* [[TMP0]], align 4
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> poison, <8 x i32> <i32 0, i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> poison, <8 x i32> <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 0, i32 0>
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> poison, i32 [[A7:%.*]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> poison, i32 [[A1:%.*]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[A8:%.*]], i32 1			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[A2:%.*]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[A1:%.*]], i32 2			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[A3:%.*]], i32 2
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x i32> [[TMP4]], i32 [[A2:%.*]], i32 3			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x i32> [[TMP4]], i32 [[A4:%.*]], i32 3
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x i32> [[TMP5]], i32 [[A3:%.*]], i32 4			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x i32> [[TMP5]], i32 [[A5:%.*]], i32 4
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[A4:%.*]], i32 5			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[A6:%.*]], i32 5
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[A5:%.*]], i32 6			; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[A7:%.*]], i32 6
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <8 x i32> [[TMP8]], i32 [[A6:%.*]], i32 7			; CHECK-NEXT: [[TMP9:%.*]] = insertelement <8 x i32> [[TMP8]], i32 [[A8:%.*]], i32 7
	; CHECK-NEXT: [[TMP10:%.*]] = add <8 x i32> [[SHUFFLE]], [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = add <8 x i32> [[SHUFFLE]], [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.umin.v8i32(<8 x i32> [[TMP10]])			; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.umin.v8i32(<8 x i32> [[TMP10]])
	; CHECK-NEXT: ret i32 [[TMP11]]			; CHECK-NEXT: ret i32 [[TMP11]]
	;			;
	entry:			entry:
	%arrayidx = getelementptr inbounds i32, i32* %arr, i64 1			%arrayidx = getelementptr inbounds i32, i32* %arr, i64 1
	%0 = load i32, i32* %arrayidx, align 4			%0 = load i32, i32* %arrayidx, align 4
	%add = add i32 %0, %a1			%add = add i32 %0, %a1
	define i32 @foo1(i32* nocapture readonly %arr, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %a8) {			define i32 @foo1(i32* nocapture readonly %arr, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %a8) {
	; CHECK-LABEL: @foo1(			; CHECK-LABEL: @foo1(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[ARR:%.*]], i64 1			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[ARR:%.*]], i64 1
	; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 2			; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 2
	; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 3			; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 3
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARR]] to <4 x i32>*			; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARR]] to <4 x i32>*
	; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 1, i32 1, i32 1, i32 2, i32 2, i32 3>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> poison, <8 x i32> <i32 1, i32 2, i32 3, i32 1, i32 1, i32 0, i32 2, i32 1>
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> poison, i32 [[A6:%.*]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> poison, i32 [[A1:%.*]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[A1:%.*]], i32 1			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[A2:%.*]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[A4:%.*]], i32 2			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[A3:%.*]], i32 2
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x i32> [[TMP4]], i32 [[A5:%.*]], i32 3			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x i32> [[TMP4]], i32 [[A4:%.*]], i32 3
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x i32> [[TMP5]], i32 [[A8:%.*]], i32 4			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x i32> [[TMP5]], i32 [[A5:%.*]], i32 4
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[A2:%.*]], i32 5			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[A6:%.*]], i32 5
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[A7:%.*]], i32 6			; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[A7:%.*]], i32 6
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <8 x i32> [[TMP8]], i32 [[A3:%.*]], i32 7			; CHECK-NEXT: [[TMP9:%.*]] = insertelement <8 x i32> [[TMP8]], i32 [[A8:%.*]], i32 7
	; CHECK-NEXT: [[TMP10:%.*]] = add <8 x i32> [[SHUFFLE]], [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = add <8 x i32> [[SHUFFLE]], [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.umin.v8i32(<8 x i32> [[TMP10]])			; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.umin.v8i32(<8 x i32> [[TMP10]])
	; CHECK-NEXT: ret i32 [[TMP11]]			; CHECK-NEXT: ret i32 [[TMP11]]
	;			;
	entry:			entry:
	%arrayidx = getelementptr inbounds i32, i32* %arr, i64 1			%arrayidx = getelementptr inbounds i32, i32* %arr, i64 1
	%0 = load i32, i32* %arrayidx, align 4			%0 = load i32, i32* %arrayidx, align 4
	%add = add i32 %0, %a1			%add = add i32 %0, %a1
	[... 29 lines not shown ...]
	define i32 @foo2(i32* nocapture readonly %arr, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %a8) {			define i32 @foo2(i32* nocapture readonly %arr, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %a8) {
	; CHECK-LABEL: @foo2(			; CHECK-LABEL: @foo2(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[ARR:%.*]], i64 3			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[ARR:%.*]], i64 3
	; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 2			; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 2
	; CHECK-NEXT: [[ARRAYIDX7:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 1			; CHECK-NEXT: [[ARRAYIDX7:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 1
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARR]] to <4 x i32>*			; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARR]] to <4 x i32>*
	; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> poison, <8 x i32> <i32 0, i32 0, i32 1, i32 1, i32 2, i32 2, i32 3, i32 3>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> poison, <8 x i32> <i32 3, i32 2, i32 3, i32 0, i32 1, i32 0, i32 2, i32 1>
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> poison, i32 [[A4:%.*]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> poison, i32 [[A1:%.*]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[A6:%.*]], i32 1			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[A2:%.*]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[A5:%.*]], i32 2			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[A3:%.*]], i32 2
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x i32> [[TMP4]], i32 [[A8:%.*]], i32 3			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x i32> [[TMP4]], i32 [[A4:%.*]], i32 3
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x i32> [[TMP5]], i32 [[A2:%.*]], i32 4			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x i32> [[TMP5]], i32 [[A5:%.*]], i32 4
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[A7:%.*]], i32 5			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[A6:%.*]], i32 5
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[A1:%.*]], i32 6			; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[A7:%.*]], i32 6
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <8 x i32> [[TMP8]], i32 [[A3:%.*]], i32 7			; CHECK-NEXT: [[TMP9:%.*]] = insertelement <8 x i32> [[TMP8]], i32 [[A8:%.*]], i32 7
	; CHECK-NEXT: [[TMP10:%.*]] = add <8 x i32> [[SHUFFLE]], [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = add <8 x i32> [[SHUFFLE]], [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.umin.v8i32(<8 x i32> [[TMP10]])			; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.umin.v8i32(<8 x i32> [[TMP10]])
	; CHECK-NEXT: ret i32 [[TMP11]]			; CHECK-NEXT: ret i32 [[TMP11]]
	;			;
	entry:			entry:
	%arrayidx = getelementptr inbounds i32, i32* %arr, i64 3			%arrayidx = getelementptr inbounds i32, i32* %arr, i64 3
	%0 = load i32, i32* %arrayidx, align 4			%0 = load i32, i32* %arrayidx, align 4
	%add = add i32 %0, %a1			%add = add i32 %0, %a1
	[... 28 lines not shown ...]
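
In the @foo1 and @foo2 hunks above, the two columns differ only in the shufflevector mask and in the order of the insertelement operands. As a reading aid, the sketch below checks that both columns pair every loaded element with the same scalar argument, just in different lanes. It is a plain Python model, not LLVM code: `lane_pairs` is a hypothetical helper, and the masks and operand orders are copied by hand from the CHECK lines. It assumes that, because the result is fed to an unsigned-min reduction, lane order does not affect the returned value.

```python
from collections import Counter

def lane_pairs(shuffle_mask, scalar_operands):
    """Pair each lane's source load index with the scalar added in that lane.

    Hypothetical helper: returns a multiset of (load index, argument name)
    pairs, so that two lane orderings can be compared regardless of order.
    """
    assert len(shuffle_mask) == len(scalar_operands)
    return Counter(zip(shuffle_mask, scalar_operands))

in_order = ["a1", "a2", "a3", "a4", "a5", "a6", "a7", "a8"]

# @foo1: left-hand column, then right-hand column.
foo1_left = lane_pairs([0, 1, 1, 1, 1, 2, 2, 3],
                       ["a6", "a1", "a4", "a5", "a8", "a2", "a7", "a3"])
foo1_right = lane_pairs([1, 2, 3, 1, 1, 0, 2, 1], in_order)

# @foo2: left-hand column, then right-hand column.
foo2_left = lane_pairs([0, 0, 1, 1, 2, 2, 3, 3],
                       ["a4", "a6", "a5", "a8", "a2", "a7", "a1", "a3"])
foo2_right = lane_pairs([3, 2, 3, 0, 1, 0, 2, 1], in_order)

# Both columns compute the same set of (load, argument) sums.
print(foo1_left == foo1_right)
print(foo2_left == foo2_right)
```

Read this way, one column keeps the shuffle mask sorted and permutes the scalar operands, while the other keeps the scalar operands in argument order and moves the permutation into the shuffle mask.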

Revision Contents

Diff 373608

llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll

llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll

llvm/test/Transforms/SLPVectorizer/X86/addsub.ll

llvm/test/Transforms/SLPVectorizer/X86/crash_cmpop.ll

llvm/test/Transforms/SLPVectorizer/X86/extract.ll

llvm/test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll

llvm/test/Transforms/SLPVectorizer/X86/jumbled-load.ll

llvm/test/Transforms/SLPVectorizer/X86/jumbled_store_crash.ll

llvm/test/Transforms/SLPVectorizer/X86/reorder_repeated_ops.ll

llvm/test/Transforms/SLPVectorizer/X86/split-load8_2-unord.ll

llvm/test/Transforms/SLPVectorizer/X86/vectorize-reorder-alt-shuffle.ll

llvm/test/Transforms/SLPVectorizer/X86/vectorize-reorder-reuse.ll
