This is an archive of the discontinued LLVM Phabricator instance.

Differential D105020

[SLP]Improve graph reordering.
ClosedPublic

Authored by ABataev on Jun 28 2021, 6:12 AM.

Details

Reviewers

RKSimon
spatel
vdmitrie
dtemirbulatov
anton-afanasyev
SjoerdMeijer
dmgreen

Commits

rGbc69dd62c04a: [SLP]Improve graph reordering.
rG84cbd71c9592: [SLP]Improve graph reordering.
rGa28234e37af8: [SLP]Improve graph reordering.
rGe408d1dfab42: [SLP]Improve graph reordering.

Summary

Reworked the reordering algorithm. Originally, the compiler just tried to
detect the most common order in the reorderable nodes (loads, stores,
extractelements, extractvalues) and then fully rebuilt the graph in
the best order. This was not efficient, since it required extra
memory and time for building/rebuilding the tree and doubled the use of the
scheduling budget, which could lead to missed vectorization due to
exhausted scheduling resources.

The patch provides a 2-way approach to the graph reordering problem. First, all
reordering is done in place: it does not require deleting/rebuilding the
tree, it just rotates the scalars/orders/reuses masks in
the graph node.
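
As a rough illustration of what rotating one node's lanes without rebuilding the tree can look like, here is a minimal sketch. `applyOrder` and the convention "new lane I takes old lane Order[I]" are assumptions made for this example, not the helpers or the convention used in SLPVectorizer.cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the SLPVectorizer API): permute Values so that
// new element I is old element Order[I]. Order is expected to be a
// permutation of 0..N-1. Only this node's lanes are touched; nothing else
// in the graph is rebuilt.
template <typename T>
void applyOrder(std::vector<T> &Values, const std::vector<unsigned> &Order) {
  assert(Values.size() == Order.size() && "order must cover every lane");
  std::vector<T> Prev;
  Prev.swap(Values); // Values is now empty, Prev holds the old lanes.
  Values.reserve(Order.size());
  for (unsigned Idx : Order) {
    assert(Idx < Prev.size() && "lane index out of range");
    Values.push_back(Prev[Idx]);
  }
}
```

For example, applying the order {3, 0, 2, 1} to the lanes {10, 20, 30, 40} yields {40, 10, 30, 20}.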

The first step (top-to-bottom) rotates the whole graph, similarly to the previous
implementation. The compiler counts the most used orders of
the graph nodes with the same vectorization factor and then rotates the
subgraph with the given vectorization factor to the most used order, if
it is not empty. It then repeats the same procedure for the subgraphs with
the smaller vectorization factor. We can do this because we still need
to reshuffle the smaller subgraph when building operands for the graph
nodes with the larger vectorization factor, so we can rotate just the subgraph,
not the whole graph.
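
The "most used order" selection for one vectorization factor can be sketched roughly as follows. `OrderTy`, `mostUsedOrder`, the use of an empty order to mean "identity", and the tie-breaking rule are all assumptions made for this example; the patch may count and break ties differently.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using OrderTy = std::vector<unsigned>;

// Hypothetical helper: Orders holds one preferred lane order per
// reorderable node, all for the same vectorization factor. An empty
// OrderTy stands for "already in identity order" in this sketch.
OrderTy mostUsedOrder(const std::vector<OrderTy> &Orders) {
  std::map<OrderTy, unsigned> Uses;
  std::size_t VF = 0;
  for (const OrderTy &O : Orders) {
    if (!O.empty()) {
      assert((VF == 0 || O.size() == VF) && "orders must share one VF");
      VF = O.size();
    }
    ++Uses[O];
  }
  OrderTy Best;
  unsigned BestCount = 0;
  // std::map iterates in lexicographic key order, so on a tie the empty
  // (identity) order wins, then the lexicographically smallest one.
  for (const auto &Entry : Uses) {
    if (Entry.second > BestCount) {
      Best = Entry.first;
      BestCount = Entry.second;
    }
  }
  return Best;
}
```

An empty result means no rotation is needed for that factor, which matches the "if it is not empty" condition in the description above.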

The second step (bottom-to-top) scans through the leaves and tries to
detect the users of the leaves which can be reordered. If the leaves can
be reordered in the best fashion, they are reordered and their user too.
This allows removing double shuffles to the same ordering of the operands in
many cases and just reordering the user operations instead. Plus, it moves
the final shuffles closer to the top of the graph and in many cases
allows removing an extra shuffle, because the same procedure is repeated
and we can again merge some reordering masks and reorder user nodes
instead of the operands.
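
Merging two reordering masks into one is plain mask composition. The sketch below assumes the usual shuffle convention shuffle(V, M)[I] == V[M[I]]; `composeMasks`, `isIdentityMask` and the local `UndefLane` constant are illustrative names, not code from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Local stand-in for an undefined lane; not taken from the patch.
constexpr int UndefLane = -1;

// With shuffle(V, M)[I] == V[M[I]], applying First and then Second is the
// same as one shuffle with Combined[I] == First[Second[I]].
std::vector<int> composeMasks(const std::vector<int> &First,
                              const std::vector<int> &Second) {
  std::vector<int> Combined;
  Combined.reserve(Second.size());
  for (int Lane : Second) {
    if (Lane == UndefLane) {
      Combined.push_back(UndefLane);
      continue;
    }
    assert(Lane >= 0 && static_cast<std::size_t>(Lane) < First.size() &&
           "second mask must index into the first");
    Combined.push_back(First[Lane]);
  }
  return Combined;
}

// True if every defined lane maps to itself, i.e. the shuffle is a no-op.
bool isIdentityMask(const std::vector<int> &Mask) {
  for (std::size_t I = 0; I < Mask.size(); ++I)
    if (Mask[I] != UndefLane && Mask[I] != static_cast<int>(I))
      return false;
  return true;
}
```

When the composed mask turns out to be the identity, the two shuffles cancel and neither has to be emitted, which is the kind of saving the second step is after.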

Also, the patch improves the cost model for gathering of loads, which improves
the x264 benchmark in some cases.

Gives about +2% on AVX512 + LTO (more expected for AVX/AVX2) for {625,525}x264,
+3% for 508.namd, and improves most of the other benchmarks.
The compile and link times are almost the same, though in some cases they
should be better (we're not doing an extra instruction scheduling
anymore) + we may vectorize more code for large basic blocks again
because of the saved scheduling budget.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	2,710 ms	x64 debian > libarcher.critical::critical.c
	2,650 ms	x64 debian > libarcher.parallel::parallel-simple2.c
	2,770 ms	x64 debian > libarcher.races::critical-unrelated.c
	2,630 ms	x64 debian > libarcher.races::lock-nested-unrelated.c
	2,640 ms	x64 debian > libarcher.races::lock-unrelated.c
		View Full Test Results (16 Failed)

Event Timeline

ABataev created this revision.Jun 28 2021, 6:12 AM

Herald added subscribers: jfb, mgrang, hiraditya. · View Herald TranscriptJun 28 2021, 6:12 AM

ABataev requested review of this revision.Jun 28 2021, 6:12 AM

Herald added a project: Restricted Project. · View Herald TranscriptJun 28 2021, 6:12 AM

Harbormaster completed remote builds in B111270: Diff 354869.Jun 28 2021, 7:08 AM

RKSimon added inline comments.Jul 5 2021, 9:33 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2982	This is very similar to the EntryState enum - merge them?
llvm/test/Transforms/SLPVectorizer/X86/addsub.ll
340–341	Update comment? I had to do something similar on D103925, although the wording here might be different.

ABataev added inline comments.Jul 6 2021, 6:05 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2982	I would not do this. Though the values look similar the meaning is completely different. This state handles only loads, while the entry state handles all possible entries kinds. In the future, we may have different values in these enums, which may lead to some unpredictable results. I would keep it as is.
llvm/test/Transforms/SLPVectorizer/X86/addsub.ll
340–341	Ok, will do

Rebase + address comments

xbolva00 added a reviewer: SjoerdMeijer.Jul 6 2021, 6:30 AM

Harbormaster completed remote builds in B112599: Diff 356698.Jul 6 2021, 7:00 AM

Rebase

Harbormaster completed remote builds in B113075: Diff 357337.Jul 8 2021, 2:52 PM

ABataev added a child revision: D101109: [SLP]Improve multinode analysis..Jul 9 2021, 7:42 AM

ABataev mentioned this in D101109: [SLP]Improve multinode analysis..

Rebase

Harbormaster completed remote builds in B113199: Diff 357510.Jul 9 2021, 8:35 AM

A few minor things I've noticed so far. It looks like there's some renaming / nfc(ish) refactoring in here - if that could be pre-committed to reduce the size of this patch it'd be very welcome!

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
592	Do we need a default value for vector?
2546	Do we need a default value for vector?
2550	Doesn't the IsIdentity pass have to be done after Order[] is updated?
2568	Do we need a default value for vector?

ABataev added inline comments.Jul 9 2021, 9:26 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
592	Not currently, but I'll update these functions to make them compatible with upcoming non-power-2 vectorization.
2546	No, it is swapped with `Order` and `Order` then updated. But I'll add default initialization because we may need it in the future for non-pow-2 patch.
2550	Yes, you're right, will fix this. It may affect the cost in some rare cases.
2568	I'll add `UndefMaskElem` to reduce future updates for non-power-2.

Address comments

Harbormaster completed remote builds in B113254: Diff 357588.Jul 9 2021, 12:42 PM

A few minors, but I haven't spotted anything critical - does anyone else have any comments?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3068	Default value?
4259	Can we remove the loop and avoid the repeated call to TTI->getShuffleCost(TTI::SK_Select, VecTy) ?
4261	Is it worth doing the CommonCost -> ReuseShuffleCost refactor as a NFC pre-commit to simplify this patch?
6104	A lot of this is just a refactor cleanup that looks like a NFC that could be done as a pre-commit to simplify the patch?

ABataev marked an inline comment as done.Jul 14 2021, 8:17 AM

ABataev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
3068	It is initialized with zeroes by default.
4261	Will check.
6104	Yes, will precommit some of these changes.

ABataev mentioned this in D106060: [SLP]Improve calculations of the cost for reused/reordered scalars..Jul 15 2021, 5:36 AM

ABataev mentioned this in rGda3dbfcacf9a: [SLP]Improve calculations of the cost for reused/reordered scalars..Jul 16 2021, 1:42 PM

Rebase

ABataev added inline comments.Jul 16 2021, 2:38 PM

llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
6–16 ↗	(On Diff #359457)	Need to adjust the cost for 4xi16 shuffle in AArch64 target.
40–50 ↗	(On Diff #359457)	Same here

Harbormaster completed remote builds in B114601: Diff 359457.Jul 16 2021, 3:18 PM

RKSimon added a reviewer: dmgreen.Jul 17 2021, 8:16 AM

RKSimon added a subscriber: dmgreen.

RKSimon added inline comments.

llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll

6–16 ↗

(On Diff #359457)

https://simd.godbolt.org/z/Pana3396f

@dmgreen Some very basic tests suggest the v4i16 shuffle cost should never be higher than 3 (which encouragingly matches what is already set for v4i32/v4f32) - do you agree?

// PermuteSingleSrc shuffle kinds.
// TODO: handle vXi8/vXi16.
{ TTI::SK_PermuteSingleSrc, MVT::v2i32, 1 }, // mov.
{ TTI::SK_PermuteSingleSrc, MVT::v4i32, 3 }, // perfectshuffle worst case.
{ TTI::SK_PermuteSingleSrc, MVT::v2i64, 1 }, // mov.
{ TTI::SK_PermuteSingleSrc, MVT::v2f32, 1 }, // mov.
{ TTI::SK_PermuteSingleSrc, MVT::v4f32, 3 }, // perfectshuffle worst case.
{ TTI::SK_PermuteSingleSrc, MVT::v2f64, 1 }, // mov.

dmgreen added inline comments.Jul 17 2021, 2:53 PM

llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
6–16 ↗	(On Diff #359457)	Yeah, I think that sounds right. GeneratePerfectShuffle applies to 4 element 64bit shuffles as well as 128bit shuffles, so the same 3 instruction worst case would apply. I don't have a lot of tests that check SLP vectorization. Let me try some things and put a patch together if it looks sensible.

dmgreen added inline comments.Jul 20 2021, 12:01 AM

llvm/test/Transforms/SLPVectorizer/AArch64/PR38339.ll
6–16 ↗	(On Diff #359457)	There are some patches in https://reviews.llvm.org/D106241 to add some extra worst case costs.

Matt added a subscriber: Matt.Jul 20 2021, 7:01 AM

Rebase

I think we're just waiting to confirm that D106241 fixes the regressions?

In D105020#2897064, @RKSimon wrote:

I think we're just waiting to confirm that D106241 fixes the regressions?

Yep.

Harbormaster completed remote builds in B115596: Diff 360848.Jul 22 2021, 10:19 AM

D106241 had some dependencies, but I've rebased it to get it out of the way.

Please can you rebase to confirm the aarch64 regressions have gone

In D105020#2900357, @RKSimon wrote:

Please can you rebase to confirm the aarch64 regressions have gone

Sure, will do it later.

Rebase

Harbormaster completed remote builds in B115954: Diff 361345.Jul 23 2021, 3:00 PM

LGTM

This revision is now accepted and ready to land.Jul 24 2021, 1:28 AM

This revision was landed with ongoing or failed builds.Jul 28 2021, 5:53 AM

Closed by commit rGe408d1dfab42: [SLP]Improve graph reordering. (authored by ABataev). · Explain Why

This revision was automatically updated to reflect the committed changes.

ABataev added a commit: rGe408d1dfab42: [SLP]Improve graph reordering..

It looks like this change might be responsible for a build failure on GreenDragon: https://green.lab.llvm.org/green/job/clang-stage1-RA/22811/console

In D105020#2910050, @fhahn wrote:

It looks like this change might be responsible for a build failure on GreenDragon: https://green.lab.llvm.org/green/job/clang-stage1-RA/22811/console

Yes, going to commit a small fix in a minute.

In D105020#2910051, @ABataev wrote:

Yes, going to commit a small fix in a minute.

It still crashes for me, https://martin.st/temp/vf_perspective-preproc.c, with clang -target aarch64-w32-mingw32 -w -c -O2 vf_perspective-preproc.c.

Hi @ABataev, I ran into an issue when running the LLVM test-suite. It seems to be a different issue than the one that @mstorsjo reported.

I got it reduced to:

target triple = "aarch64-unknown-linux-gnu"

define void @foo() local_unnamed_addr {
entry:
  %0 = load volatile double, double* poison, align 8
  %1 = load volatile double, double* poison, align 8
  %2 = load volatile double, double* poison, align 8
  %3 = load volatile double, double* poison, align 8
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %d30.0734 = phi double [ undef, %for.body ], [ %0, %entry ]
  %d01.0733 = phi double [ undef, %for.body ], [ %1, %entry ]
  %d11.0732 = phi double [ undef, %for.body ], [ %2, %entry ]
  %d21.0731 = phi double [ undef, %for.body ], [ %3, %entry ]
  br label %for.body
}

Run with: opt -slp-vectorizer -S < reduced.ll.

I had actually expected one of the aarch64-buildbots would have caught this, so not sure if there was something special about the way I ran the test-suite, but I don't believe so.

In D105020#2912688, @mstorsjo wrote:

It still crashes for me, https://martin.st/temp/vf_perspective-preproc.c, with clang -target aarch64-w32-mingw32 -w -c -O2 vf_perspective-preproc.c.

Will check and fix it ASAP, thanks for the report.

In D105020#2912776, @sdesmalen wrote:
Hi @ABataev, I ran into an issue when running the LLVM test-suite. It seems to be a different issue than the one that @mstorsjo reported.

I got it reduced to:
target triple = "aarch64-unknown-linux-gnu"

define void @foo() local_unnamed_addr {
entry:
  %0 = load volatile double, double* poison, align 8
  %1 = load volatile double, double* poison, align 8
  %2 = load volatile double, double* poison, align 8
  %3 = load volatile double, double* poison, align 8
  br label %for.body

for.body:                                         ; preds = %for.body, %entry
  %d30.0734 = phi double [ undef, %for.body ], [ %0, %entry ]
  %d01.0733 = phi double [ undef, %for.body ], [ %1, %entry ]
  %d11.0732 = phi double [ undef, %for.body ], [ %2, %entry ]
  %d21.0731 = phi double [ undef, %for.body ], [ %3, %entry ]
  br label %for.body
}
Run with: opt -slp-vectorizer -S < reduced.ll.

I had actually expected one of the aarch64-buildbots would have caught this, so not sure if there was something special about the way I ran the test-suite, but I don't believe so.

Thanks, will check it too.

In D105020#2912776, @sdesmalen wrote:

Investigated it, the crash is not related to this patch, caused by one of the previous patches. Going to publish a fix soon.

ABataev mentioned this in D107080: [SLP]Fix an assertion for the size of user nodes..Jul 29 2021, 7:52 AM

bjope added a subscriber: bjope.Jul 29 2021, 3:40 PM

bjope added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4188	With my OOT target I ended up here with getMinVecRegSize() returning 32. E->getMainOp was a load like this `%3 = load i24, i24* getelementptr inbounds ([128 x i24], [128 x i24]* @a_ua, i16 0, i16 3)`, so Sz was 24. That gives a MinVF that is 0. And after some iterations the inner loop was entered with VF=0, which gives a slice that is empty, hitting assertions when doing Slice.front(). I haven't reduced this failure for any in-tree target (a bit short on people at the office here in the middle of summer). But maybe you should make sure MinVF doesn't go below 1 here (or maybe not even below 2) to avoid getting a VF that is less than 1 (or less than 2). Or check that VF is at least 1 (or 2) in the loop guard. (I do not know if VF=1 makes any sense. Hence the alternatives above regarding 1 or 2.)

ABataev added inline comments.Jul 29 2021, 3:41 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4188	The fix is ready (see D107058), will commit it tomorrow.

bjope added inline comments.Jul 29 2021, 3:43 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
4188	Ok, thanks!
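
The failure mode discussed for line 4188 can be sketched in isolation. The division by twice the scalar size is an assumption chosen so that the reported numbers (minimum register size 32, scalar size 24) give 0; the actual expression in SLPVectorizer.cpp and the fix in D107058 may differ. `clampedMinVF` and `candidateVFs` are hypothetical names, and the floor of 2 is one of the two alternatives raised in the comment.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumed formula: an integer division that truncates to 0 for wide
// scalars, clamped so the result never drops below 2.
unsigned clampedMinVF(unsigned MinVecRegSizeBits, unsigned ScalarSizeBits) {
  assert(ScalarSizeBits != 0 && "scalar type must have a size");
  unsigned Raw = MinVecRegSizeBits / (2 * ScalarSizeBits);
  return std::max(2u, Raw);
}

// Halving loop over candidate vectorization factors. With MinVF == 0 the
// loop would reach VF == 0 and never terminate, hence the precondition.
std::vector<unsigned> candidateVFs(unsigned NumScalars, unsigned MinVF) {
  assert(MinVF >= 1 && "MinVF of 0 would let VF reach 0");
  std::vector<unsigned> VFs;
  for (unsigned VF = NumScalars / 2; VF >= MinVF; VF /= 2)
    VFs.push_back(VF);
  return VFs;
}
```

With the reported values, `clampedMinVF(32, 24)` returns 2 instead of 0, so an 8-element bundle is only sliced at VF 4 and VF 2 and no empty slice is ever formed.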

ABataev mentioned this in rG4b25c113210e: [SLP]Fix an assertion for the size of user nodes..Jul 30 2021, 5:47 AM

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

In D105020#2919603, @hans wrote:

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Hi, thanks for the report, will try to investigate it ASAP.

In D105020#2919603, @hans wrote:

Just a heads up that we're seeing test failures in Chromium that bisect to this revision (http://crbug.com/1235252). Not sure what's going on yet, though.

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

In D105020#2919627, @ABataev wrote:

Also, would be good if you could provide a reproducer and a command line for it. I took a look at the buildbot but was unable to get info on how to reproduce it.

I don't have a reproducer yet, and it will take some work to get it. I just wanted to give a heads up, especially in case others are bisecting issues and also end up at this change.

In D105020#2919848, @hans wrote:

I don't have a reproducer yet, and it will take some work to get it. I just wanted to give a heads up, especially in case others are bisecting issues and also end up at this change.

Still not sure what's going on (and it could also be our code that's broken), but we're starting to get some IR to look at now: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c15

In D105020#2922309, @hans wrote:

Still not sure what's going on (and it could also be our code that's broken), but we're starting to get some IR to look at now: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c15

Ok, thanks, I see the wrong mask for the loads, comparing these 2 examples. Looking at the problem already, this should help to fix it ASAP.

I bisected a miscompilation in XLA to this change.

Repro:

grab the three files in this gist: https://gist.github.com/hawkinsp/93de2fcb1a4d13ca01c826288bde4b9b
build clang at this revision

At this commit

clang driver.cc module_0000.primitive_computation_broadcast_in_dim.3.ir-no-opt.ll -o a.out
./a.out buffer-assignment.txt

(no optimization)

and

clang -O3 -march=haswell driver.cc module_0000.primitive_computation_broadcast_in_dim.3.ir-no-opt.ll -o a.out
./a.out buffer-assignment.txt

produce different outputs: the first 4 values differ.

At the previous revision, both produce identical outputs.

A quick inspection of the optimized IR seems to show it reading memory out of bounds. Disabling SLP vectorization also fixes the problem.

In D105020#2922419, @phawkins wrote:
I bisected a miscompilation in XLA to this change.

Repro:

grab the three files in this gist: https://gist.github.com/hawkinsp/93de2fcb1a4d13ca01c826288bde4b9b

build clang at this revision

At this commit
clang driver.cc module_0000.primitive_computation_broadcast_in_dim.3.ir-no-opt.ll -o a.out
./a.out buffer-assignment.txt
(no optimization)

and
clang -O3 -march=haswell driver.cc module_0000.primitive_computation_broadcast_in_dim.3.ir-no-opt.ll -o a.out
./a.out buffer-assignment.txt
produce different outputs: the first 4 values differ.

At the previous revision, both produce identical outputs.

A quick inspection of the optimized IR seems to show it reading memory out of bounds. Disabling SLP vectorization also fixes the problem.

Thanks, I suspect what is the cause of the bug, hope to prepare a fix later today.

In D105020#2922309, @hans wrote:

Still not sure what's going on (and it could also be our code that's broken), but we're starting to get some IR to look at now: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c15

We now have analysis of the bad IR: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c16

Thanks, I suspect what is the cause of the bug, hope to prepare a fix later today.

Can you please revert to green in the meantime?

In D105020#2922483, @hans wrote:

We now have analysis of the bad IR: https://bugs.chromium.org/p/chromium/issues/detail?id=1235252#c16

Thanks, I suspect what is the cause of the bug, hope to prepare a fix later today.

Can you please revert to green in the meantime?

It may take some time, there were other commits already. I'll revert it later if I won't be able to prepare a quick fix today.

Can you please revert to green in the meantime?

It may take some time, there were other commits already. I'll revert it later if I won't be able to prepare a quick fix today.

If there's already a chain of commits depending on this, that's an even stronger reason to revert. What if the quick fix doesn't fix everything? Then the commit chain has just become longer and even harder to revert.

I'd suggest reverting first, and then fixing without the stress.

In D105020#2922564, @hans wrote:

Can you please revert to green in the meantime?

It may take some time, there were other commits already. I'll revert it later if I won't be able to prepare a quick fix today.

If there's already a chain of commits depending on this, that's an even stronger reason to revert. What if the quick fix doesn't fix everything? Then the commit chain has just become longer and even harder to revert.

I'd suggest reverting first, and then fixing without the stress.

I would not call it a stress, the reason is known, but I’ll try to revert it.

if you can get alive-tv working from https://github.com/AliveToolkit/alive2

; ModuleID = '/tmp/b1.ll'
source_filename = "/tmp/b1.ll"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

%"class.deqp::gls::ShaderEvalContext" = type { %"class.tcu::Vector", %"class.tcu::Vector", %"class.tcu::Vector", [4 x %"class.tcu::Vector"], [4 x %"struct.deqp::gls::ShaderEvalContext::ShaderSampler"], %"class.tcu::Vector", i8, %"class.deqp::gls::QuadGrid"* }
%"struct.deqp::gls::ShaderEvalContext::ShaderSampler" = type { %"class.tcu::Sampler", %"class.tcu::Texture2D"*, %"class.tcu::TextureCube"*, %"class.tcu::Texture2DArray"*, %"class.tcu::Texture3D"* }
%"class.tcu::Sampler" = type { i32, i32, i32, i32, i32, i32, float, i8, i32, i32, %"class.rr::GenericVec4", i8, i32 }
%"class.rr::GenericVec4" = type { %union.anon }
%union.anon = type { [4 x i32] }
%"class.tcu::Texture2D" = type { %"class.tcu::TextureLevelPyramid", i32, i32, %"class.tcu::Texture2DView" }
%"class.tcu::TextureLevelPyramid" = type { %"class.tcu::TextureFormat", %"class.std::__Cr::vector", %"class.std::__Cr::vector.1" }
%"class.tcu::TextureFormat" = type { i32, i32 }
%"class.std::__Cr::vector" = type { %"class.std::__Cr::__vector_base" }
%"class.std::__Cr::__vector_base" = type { %"class.de::ArrayBuffer"*, %"class.de::ArrayBuffer"*, %"class.std::__Cr::__compressed_pair" }
%"class.de::ArrayBuffer" = type { i8*, i64 }
%"class.std::__Cr::__compressed_pair" = type { %"struct.std::__Cr::__compressed_pair_elem" }
%"struct.std::__Cr::__compressed_pair_elem" = type { %"class.de::ArrayBuffer"* }
%"class.std::__Cr::vector.1" = type { %"class.std::__Cr::__vector_base.2" }
%"class.std::__Cr::__vector_base.2" = type { %"class.tcu::PixelBufferAccess"*, %"class.tcu::PixelBufferAccess"*, %"class.std::__Cr::__compressed_pair.4" }
%"class.tcu::PixelBufferAccess" = type { %"class.tcu::ConstPixelBufferAccess" }
%"class.tcu::ConstPixelBufferAccess" = type { %"class.tcu::TextureFormat", %"class.tcu::Vector.3", %"class.tcu::Vector.3", %"class.tcu::Vector.3", i8* }
%"class.tcu::Vector.3" = type { [3 x i32] }
%"class.std::__Cr::__compressed_pair.4" = type { %"struct.std::__Cr::__compressed_pair_elem.5" }
%"struct.std::__Cr::__compressed_pair_elem.5" = type { %"class.tcu::PixelBufferAccess"* }
%"class.tcu::Texture2DView" = type <{ i32, [4 x i8], %"class.tcu::ConstPixelBufferAccess"*, i8, [7 x i8] }>
%"class.tcu::TextureCube" = type { %"class.tcu::TextureFormat", i32, [6 x %"class.std::__Cr::vector"], [6 x %"class.std::__Cr::vector.1"], %"class.tcu::TextureCubeView" }
%"class.tcu::TextureCubeView" = type <{ i32, [4 x i8], [6 x %"class.tcu::ConstPixelBufferAccess"*], i8, [7 x i8] }>
%"class.tcu::Texture2DArray" = type { %"class.tcu::TextureLevelPyramid", i32, i32, i32, %"class.tcu::Texture2DArrayView" }
%"class.tcu::Texture2DArrayView" = type { i32, %"class.tcu::ConstPixelBufferAccess"* }
%"class.tcu::Texture3D" = type { %"class.tcu::TextureLevelPyramid", i32, i32, i32, %"class.tcu::Texture3DView" }
%"class.tcu::Texture3DView" = type { i32, %"class.tcu::ConstPixelBufferAccess"* }
%"class.tcu::Vector" = type { [4 x float] }
%"class.deqp::gls::QuadGrid" = type opaque

define hidden void @_ZN4deqp5gles210Functional19eval_selection_vec4ERNS_3gls17ShaderEvalContextE(%"class.deqp::gls::ShaderEvalContext"* nocapture align 8 dereferenceable(528) %0) {
  %2 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 0, i32 0, i64 2
  %3 = load float, float* %2, align 8
  %4 = fcmp ogt float %3, 0.000000e+00
  %5 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 3
  %6 = load float, float* %5, align 4
  %7 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 2
  %8 = load float, float* %7, align 8
  %9 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 1
  %10 = load float, float* %9, align 4
  %11 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 1, i32 0, i64 0
  %12 = load float, float* %11, align 8
  %13 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 0
  %14 = load float, float* %13, align 8
  %15 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 3
  %16 = load float, float* %15, align 4
  %17 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 2
  %18 = load float, float* %17, align 8
  %19 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 3, i64 2, i32 0, i64 1
  %20 = load float, float* %19, align 4
  %21 = select i1 %4, float %6, float %14
  %22 = select i1 %4, float %8, float %16
  %23 = select i1 %4, float %10, float %18
  %24 = select i1 %4, float %12, float %20
  %25 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 0
  store float %21, float* %25, align 8
  %26 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 1
  store float %22, float* %26, align 4
  %27 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 2
  store float %23, float* %27, align 8
  %28 = getelementptr inbounds %"class.deqp::gls::ShaderEvalContext", %"class.deqp::gls::ShaderEvalContext"* %0, i64 0, i32 5, i32 0, i64 3
  store float %24, float* %28, align 4
  ret void
}

$ bin/opt -passes=slp-vectorizer -S /tmp/b1.ll -o /tmp/b2.ll
$ alive-tv /tmp/b1.ll /tmp/b2.ll

fails at this commit, and times out at the commit before
(sorry, couldn't get it working on https://alive2.llvm.org/ for some reason)

seems worth a revert

In D105020#2923016, @aeubanks wrote:

if you can get alive-tv working from https://github.com/AliveToolkit/alive2

$ bin/opt -passes=slp-vectorizer -S /tmp/b1.ll -o /tmp/b2.ll
$ alive-tv /tmp/b1.ll /tmp/b2.ll

fails at this commit, and times out at the commit before
(sorry, couldn't get it working on https://alive2.llvm.org/ for some reason)

seems worth a revert

Yep, I'll revert. The fix is pretty simple, but I'll revert and recommit the patch with all the fixes again.

ABataev added a reverting change: rG7d9d926a1861: Revert "[SLP]Improve graph reordering.". Aug 3 2021, 12:14 PM

ABataev reopened this revision. Aug 24 2021, 9:23 AM

This revision is now accepted and ready to land. Aug 24 2021, 9:23 AM

ABataev planned changes to this revision. Aug 24 2021, 9:23 AM

Significantly reworked the patch and rebased.

This revision is now accepted and ready to land. Aug 24 2021, 9:24 AM

Harbormaster completed remote builds in B120983: Diff 368371. Aug 24 2021, 9:54 AM

LGTM with a few minors

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
539	Why didn't you move these up here instead of forward declaring?
1570	This std::equal + pragma is repeated a lot in this method - worth pulling out?
3346	Indices?

ABataev added inline comments. Aug 25 2021, 10:22 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
539	Just to reduce the number of changes, will move the functions here.
1570	Ok, will try to transform it into a lambda or something similar.
3346	Yep, will fix it, thanks!

Closed by commit rGa28234e37af8: [SLP]Improve graph reordering. (authored by ABataev). Aug 26 2021, 7:19 AM

This revision was automatically updated to reflect the committed changes.

ABataev added a commit: rGa28234e37af8: [SLP]Improve graph reordering..

I'm seeing another crash bisected to this reland:

./build/rel/bin/opt -passes='default<Os>' -disable-output /tmp/a.ll

reduced.ll.txt (10 KB)

Instruction does not dominate all uses!

In D105020#2967449, @aeubanks wrote:

I'm seeing another crash bisected to this reland:

./build/rel/bin/opt -passes='default<Os>' -disable-output /tmp/a.ll

reduced.ll.txt (10 KB)

Instruction does not dominate all uses!

OK, thanks for the reproducer. I'm going to revert the patch and investigate the crash.

ABataev added a reverting change: rGb00f73d8bf3e: Revert "[SLP]Improve graph reordering.". Aug 26 2021, 9:20 AM

ABataev added a commit: rG84cbd71c9592: [SLP]Improve graph reordering.. Aug 26 2021, 12:49 PM

Some of our tests started failing after rG84cbd71c95923f9912512f3051c6ab548a99e016 (and previously after rGa28234e37af877b2b4a23c2091c27fa18c155f9a). Could you revert it again while we're working on a test case?

In D105020#2971986, @alexfh wrote:

Some of our tests started failing after rG84cbd71c95923f9912512f3051c6ab548a99e016 (and previously after rGa28234e37af877b2b4a23c2091c27fa18c155f9a). Could you revert it again while we're working on a test case?

Could you revert it yourself? I'm on vacation and will be back in 2 weeks.
I would appreciate it if you could send the reproducer. The patch is very complex and always triggers many different corner cases.

goncharov added a reverting change: rG5097b6e35291: Revert "[SLP]Improve graph reordering.". Aug 30 2021, 10:18 AM

goncharov added a subscriber: goncharov. Aug 31 2021, 5:12 AM

Here is a repro I found

$ cat ./repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | c[4] & 3;
  uint16_t l = c[1] << 2 | c[4] >> 2 & 3;
  uint16_t v = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(l - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(v - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t, uint16_t e,
         uint16_t f, float *m) {
  int n = 24, g = f;
  float *t = &m[(24 - 2) * 4];
  size_t u = 2 * k;
  for (int o; o < height; o++) {
    uint8_t *p = buffer;
    uint8_t *q = p;
    float *r = &t[u];
    for (int a = 0; a < u; a += 4) {
      b(q, r, e, g);
      r -= 4;
    }
    t -= n;
  }
}
}  // namespace a
int main() {
  uint8_t s[]{255, 0, 255, 0, 51};
  float m[4 * 24]{};
  a::bar(s, 4, 4, 0, 0, 1023, m);
  int *out = reinterpret_cast<int *>(m);
  int64_t sum;
  for (int i = 0; i < sizeof(m) / sizeof(int); i++) sum += out[i];
  std::cout << sum;
}
$ clang-before -O2 -std=gnu++17 repro.cc
$ ./a.out
17045651456
$ clang-after -O2 -std=gnu++17 repro.cc
$ ./a.out
16609148640

In D105020#2975046, @goncharov wrote:

Here is a repro I found

Thanks, this will help a lot!

In D105020#2975046, @goncharov wrote:

Here is a repro I found

Sorry, the reproducer is not correct. It has uninitialized variables, writes past array bounds, etc. I'm unable to use it as a reproducer for the investigation.

In D105020#2999752, @ABataev wrote:

In D105020#2975046, @goncharov wrote:

Here is a repro I found

Sorry, the reproducer is not correct. It has uninitialized variables, writes past array bounds, etc. I'm unable to use it as a reproducer for the investigation.

So if they are not able to produce a UB-free reproducer, you should recommit the patch.

In D105020#2999765, @xbolva00 wrote:

In D105020#2999752, @ABataev wrote:

In D105020#2975046, @goncharov wrote:

Here is a repro I found

Sorry, the reproducer is not correct. It has uninitialized variables, writes past array bounds, etc. I'm unable to use it as a reproducer for the investigation.

So if they are not able to produce a UB-free reproducer, you should recommit the patch.

I believe they have a reproducer and tried to reduce it, but the tool just removed too much code here :). I'll try to check some parts of the code manually and wait for the actual reproducer. I'll recommit the patch in a couple of days if I can't find a bug and no correct reproducer is provided.

In D105020#2999773, @ABataev wrote:
In D105020#2999765, @xbolva00 wrote:
In D105020#2999752, @ABataev wrote:
In D105020#2975046, @goncharov wrote:
Here is a repro I found
Sorry, the reproducer is not correct. It has uninitialized variables, writes past array bounds, etc. I'm unable to use it as a reproducer for the investigation.
So if they are not able to produce a UB-free reproducer, you should recommit the patch.
I believe they have a reproducer and tried to reduce it, but the tool just removed too much code here :). I'll try to check some parts of the code manually and wait for the actual reproducer. I'll recommit the patch in a couple of days if I can't find a bug and no correct reproducer is provided.

Exactly, we reduced the code quite aggressively. We'll try to do this more carefully this time. Please hold on while we're on it. Last time it took multiple hours until we got something we could feed to creduce.

In D105020#3001887, @alexfh wrote:
In D105020#2999773, @ABataev wrote:
In D105020#2999765, @xbolva00 wrote:
In D105020#2999752, @ABataev wrote:
In D105020#2975046, @goncharov wrote:
Here is a repro I found
Sorry, the reproducer is not correct. It has uninitialized variables, writes past array bounds, etc. I'm unable to use it as a reproducer for the investigation.
So if they are not able to produce a UB-free reproducer, you should recommit the patch.
I believe they have a reproducer and tried to reduce it, but the tool just removed too much code here :). I'll try to check some parts of the code manually and wait for the actual reproducer. I'll recommit the patch in a couple of days if I can't find a bug and no correct reproducer is provided.
Exactly, we reduced the code quite aggressively. We'll try to do this more carefully this time. Please hold on while we're on it. Last time it took multiple hours until we got something we could feed to creduce.

That's what I thought. I would appreciate it if you could send a better reproducer, though even in its current form it is usable and may help to find a bug. I believe it reveals a bug in the vectorization of instructions with main/alternate opcodes; I will try to fix/improve it.

ABataev reopened this revision. Sep 17 2021, 2:33 PM

This revision is now accepted and ready to land. Sep 17 2021, 2:33 PM

Fixed reordering of the nodes with alternate opcodes.

Harbormaster completed remote builds in B124484: Diff 373339. Sep 17 2021, 3:07 PM

Hi @ABataev,

this time I've run creduce with asan and msan, so it should not read out of bounds

$ cat repro.cc
#include <iostream>
namespace a {
__attribute__((noinline)) void b(uint8_t *c, float *d, uint16_t e, uint16_t f) {
  uint16_t g = f;
  uint16_t h = c[0] << 2 | (c[4] & 3);
  uint16_t m = c[1] << 2 | 2;
  uint16_t n = c[2] << 2 | 3;
  uint16_t j = c[3] << 2 | c[4] >> 6;
  *(d - 1) = float(m - e) / g;
  *(d - 2) = float(h - e) / g;
  *(d - 3) = float(j - e) / g;
  *(d - 4) = float(n - e) / g;
}
void bar(uint8_t *buffer, size_t k, size_t height, size_t l, uint16_t e,
         uint16_t f, float *output) {
  float *q = &output[4];
  uint16_t g = f;
  size_t r = k;
  for (size_t s = 0; s < height; s++) {
    uint8_t *o = buffer + l;
    uint8_t *t = o;
    float *p = &q[r];
    b(t, p, e, g);
  }
}
} // namespace a
int main() {
  uint8_t u[]{5, 5, 0, 0, 0};
  float output[4 * 4];
  a::bar(u, 4, 4, 0, 0, 3, output);
  int *out = reinterpret_cast<int *>(output);
  int64_t sum;
  for (size_t i = 0; i < sizeof sizeof(int); i++)
    sum = out[i];
  printf("%ld\n", sum);
}
$ # on revision before: 8441a8eea8007b9eaaaabf76055949180a702d6d
$ clang++ -Wall -Werror -Wextra -O2 -fno-exceptions  -stdlib=libc++ -std=gnu++17 repro.cc && ./a.out
1089120939
$ # revision 84cbd71c95923f9912512f3051c6ab548a99e016 
$ clang++ -Wall -Werror -Wextra -O2 -fno-exceptions  -stdlib=libc++ -std=gnu++17 repro.cc && ./a.out
1059760811

In D105020#3008717, @goncharov wrote:

Hi @ABataev,

this time I've run creduce with asan and msan, so it should not read out of bounds

Thanks again for the reproducer. I checked it with the latest version uploaded on Friday, and the results are correct. I fixed the incorrect reordering of the nodes with the alternate instructions.

In D105020#3009190, @ABataev wrote:

Thanks again for the reproducer. I checked it with the latest version uploaded on Friday, and the results are correct. I fixed the incorrect reordering of the nodes with the alternate instructions.

LGTM - if you can add another (useful) test case for the latest repro that'd be great.

In D105020#3009208, @RKSimon wrote:

In D105020#3009190, @ABataev wrote:

Thanks again for the reproducer. I checked it with the latest version uploaded on Friday, and the results are correct. I fixed the incorrect reordering of the nodes with the alternate instructions.

LGTM - if you can add another (useful) test case for the latest repro that'd be great.

Actually, it's already added. I used the previous reproducer to add test/Transforms/SLPVectorizer/X86/vectorize-reorder-alt-shuffle.ll, which revealed the bug in the previous version.

Closed by commit rGbc69dd62c04a: [SLP]Improve graph reordering. (authored by ABataev). Sep 20 2021, 8:42 AM

This revision was automatically updated to reflect the committed changes.

ABataev added a commit: rGbc69dd62c04a: [SLP]Improve graph reordering..

Hi @ABataev,

I have found another regression on the latest version of this change:

> cat repro.cc
#include <cstdio>
struct a {
  int b;
  int c;
  int o;
  int d;
};
class e {
public:
  e(int);
  void f(a *);
  int g;
  int h;
  int i;
};
void e::f(a *p) {
  fprintf(stderr, "MiscompiledFunction\n");
  fprintf(stderr, "%d, %d, %d, %d, %d, %d\n", p->b, p->c, p->o, p->d, h, g);
  int j = (p->b + p->c) / 2, k = (p->o + p->d) / 2, l, m;
  switch (i) {
  case 0:
  case 2:
    l = h - 1 - j;
    m = g - 1 - k;
  }
  p->b = l + p->b - j;
  p->c = l + p->c - j;
  p->o = m + p->o - k;
  p->d = m + p->d - k;
}
int n = 0;
e::e(int q) : g(0), h(0), i(q) {}
int main() {
  a *bb = new a{0, 9, 4, 0}, *p = bb;
  e r(n);
  r.f(p);
  printf("%d %d %d %d\n", bb->b, bb->c, bb->o, bb->d);
}
> # at 5661317f864abf750cf893c6a4cc7a977be0995a
> clang -Wall -Werror -Wextra -O3 -fno-exceptions -stdlib=libc++ -lc++ -std=gnu++17 repro.cc && ./a.out
-9 0 -1 -5
> # at bc69dd62c04a70d29943c1c06c7effed150b70e1
> clang -Wall -Werror -Wextra -O3 -fno-exceptions -stdlib=libc++ -lc++ -std=gnu++17 repro.cc && ./a.out
-7 2 -3 -7

It's quite fragile as e.g. removing fprintf(stderr, "MiscompiledFunction\n"); or running with -fsanitize=null "fixes" the output.
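
For reference, the expected output can be checked by hand. The sketch below is not part of the original report: it re-evaluates `e::f` in plain scalar code for the inputs used in `main` (`{0, 9, 4, 0}`, `g = h = 0`, `i = 0`). The helper name `evalReproSketch` and the `SwapLM` flag are made up for illustration. Under those assumptions the first output (`-9 0 -1 -5`) is the arithmetically expected one, and the second (`-7 2 -3 -7`) is what falls out if `l` and `m` trade places. That is only an observation about the numbers, not a diagnosis of the bug.

```cpp
#include <array>
#include <utility>

// Hypothetical helper: scalar re-evaluation of e::f from the reproducer for
// the case i == 0. V holds {b, c, o, d}. SwapLM models l and m trading
// places and exists only to illustrate the observed wrong output.
std::array<int, 4> evalReproSketch(std::array<int, 4> V, int G, int H,
                                   bool SwapLM) {
  int J = (V[0] + V[1]) / 2;
  int K = (V[2] + V[3]) / 2;
  int L = H - 1 - J;
  int M = G - 1 - K;
  if (SwapLM)
    std::swap(L, M);
  return {L + V[0] - J, L + V[1] - J, M + V[2] - K, M + V[3] - K};
}
```

Calling `evalReproSketch({0, 9, 4, 0}, 0, 0, false)` gives `{-9, 0, -1, -5}`, and passing `true` for `SwapLM` gives `{-7, 2, -3, -7}`.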

This revision is now accepted and ready to land. Oct 14 2021, 7:41 AM

I'll check it, thanks!

The fix is here: https://reviews.llvm.org/D111898

vporpo added a subscriber: vporpo. Nov 11 2021, 8:00 PM

ABataev closed this revision. Nov 30 2021, 6:10 AM

vdmitrie added inline comments. Feb 8 2022, 3:01 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
Lines 584–588: The description is stale.

Revision Contents

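Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	1008 lines
llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll	84 lines
llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll	84 lines
llvm/test/Transforms/SLPVectorizer/X86/addsub.ll	46 lines
llvm/test/Transforms/SLPVectorizer/X86/crash_cmpop.ll	6 lines
llvm/test/Transforms/SLPVectorizer/X86/extract.ll	6 lines
llvm/test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll	12 lines
llvm/test/Transforms/SLPVectorizer/X86/jumbled-load.ll	24 lines
llvm/test/Transforms/SLPVectorizer/X86/jumbled_store_crash.ll	29 lines
llvm/test/Transforms/SLPVectorizer/X86/reorder_repeated_ops.ll	4 lines
llvm/test/Transforms/SLPVectorizer/X86/vectorize-reorder-reuse.ll	52 lines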

Diff 361345

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

[... 530 lines not shown ...]	static bool isSimple(Instruction *I) {
if (StoreInst *SI = dyn_cast<StoreInst>(I))		if (StoreInst *SI = dyn_cast<StoreInst>(I))
return SI->isSimple();		return SI->isSimple();
if (MemIntrinsic *MI = dyn_cast<MemIntrinsic>(I))		if (MemIntrinsic *MI = dyn_cast<MemIntrinsic>(I))
return !MI->isVolatile();		return !MI->isVolatile();
return true;		return true;
}		}

namespace llvm {		namespace llvm {

		RKSimon: Why didn't you move these up here instead of forward declaring?
		ABataev: Just to reduce the number of changes, will move the functions here.
static void inversePermutation(ArrayRef<unsigned> Indices,		static void inversePermutation(ArrayRef<unsigned> Indices,
SmallVectorImpl<int> &Mask) {		SmallVectorImpl<int> &Mask) {
Mask.clear();		Mask.clear();
const unsigned E = Indices.size();		const unsigned E = Indices.size();
Mask.resize(E, E + 1);		Mask.resize(E, E + 1);
for (unsigned I = 0; I < E; ++I)		for (unsigned I = 0; I < E; ++I)
Mask[Indices[I]] = I;		Mask[Indices[I]] = I;
}		}
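
To make the helper above concrete, here is a minimal standalone sketch of the same inversion. It is written against `std::vector` rather than LLVM's `ArrayRef`/`SmallVectorImpl`; that substitution and the name `inversePermutationSketch` are assumptions for illustration, not part of the patch. The rule is `Mask[Indices[I]] = I`, so `Indices = {2, 0, 1}` produces `Mask = {1, 2, 0}`.

```cpp
#include <cassert>
#include <vector>

// Standalone analogue of inversePermutation (illustrative only).
// Every slot starts as E + 1, as in the patch, and is then overwritten
// with the position at which its index appears in Indices.
std::vector<int>
inversePermutationSketch(const std::vector<unsigned> &Indices) {
  const unsigned E = static_cast<unsigned>(Indices.size());
  std::vector<int> Mask(E, static_cast<int>(E + 1));
  for (unsigned I = 0; I < E; ++I) {
    assert(Indices[I] < E && "Indices must be a permutation of 0..E-1");
    Mask[Indices[I]] = static_cast<int>(I);
  }
  return Mask;
}
```

Applying the sketch twice returns the original permutation, which is a quick sanity check when reading the reordering code.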
[... 28 lines not shown ...]	for (unsigned I : IV->indices()) {
} else {		} else {
return None;		return None;
}		}
Index += I;		Index += I;
}		}
return Index;		return Index;
}		}

		/// Reorders the list of scalars in accordance with the given \p Order and then
		/// the \p Mask. \p Order - is the original order of the scalars, need to
		/// reorder scalars into an unordered state at first according to the given
		/// order. Then the ordered scalars are shuffled once again in accordance with
		/// the provided mask.
		vdmitrie: The description is stale.
		static void reorderScalars(SmallVectorImpl<Value *> &Scalars,
		ArrayRef<unsigned> Order, ArrayRef<int> Mask) {
		assert(!Mask.empty() && "Expected non-empty mask.");
		SmallVector<Value *> Prev(Scalars.size(),
		RKSimon: Do we need a default value for vector?
		ABataev: Not currently, but I'll update these functions to make them compatible with upcoming non-power-2 vectorization.
		UndefValue::get(Scalars.front()->getType()));
		Prev.swap(Scalars);
		if (Order.empty()) {
		for (unsigned I = 0, E = Prev.size(); I < E; ++I)
		if (Mask[I] != UndefMaskElem)
		Scalars[Mask[I]] = Prev[I];
		} else {
		for (unsigned I = 0, E = Prev.size(); I < E; ++I)
		if (Mask[Order[I]] != UndefMaskElem)
		Scalars[Mask[Order[I]]] = Prev[Order[I]];
		}
		}
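
A small standalone sketch may help when reading `reorderScalars`. It uses `std::string` in place of `Value *`, `std::vector` in place of the LLVM containers, and `-1` (named `kUndefLane` here) as a stand-in for `UndefMaskElem`; all of these, and the name `reorderScalarsSketch`, are assumptions for illustration. Each scalar at lane `Src` moves to lane `Mask[Src]`, and `Order`, when non-empty, only selects which source lanes are visited.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr int kUndefLane = -1; // stand-in for LLVM's UndefMaskElem

// Illustrative analogue of reorderScalars from the diff above.
void reorderScalarsSketch(std::vector<std::string> &Scalars,
                          const std::vector<unsigned> &Order,
                          const std::vector<int> &Mask) {
  assert(!Mask.empty() && "Expected non-empty mask.");
  assert(Mask.size() == Scalars.size() && "Mask must cover every lane.");
  // Lanes that no mask element targets keep this placeholder.
  std::vector<std::string> Prev(Scalars.size(), "undef");
  Prev.swap(Scalars);
  for (std::size_t I = 0, E = Prev.size(); I < E; ++I) {
    const std::size_t Src = Order.empty() ? I : Order[I];
    if (Mask[Src] != kUndefLane)
      Scalars[Mask[Src]] = Prev[Src];
  }
}
```

With `Scalars = {"a", "b", "c"}` and `Mask = {1, 2, 0}` the result is `{"c", "a", "b"}`. In this sketch a full permutation passed as `Order` gives the same result as an empty `Order`, because every source lane is still visited exactly once.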

namespace slpvectorizer {		namespace slpvectorizer {

/// Bottom Up SLP Vectorizer.		/// Bottom Up SLP Vectorizer.
class BoUpSLP {		class BoUpSLP {
struct TreeEntry;		struct TreeEntry;
struct ScheduleData;		struct ScheduleData;

public:		public:
[... 48 lines not shown ...]	public:
/// A negative number means that this is profitable.		/// A negative number means that this is profitable.
InstructionCost getTreeCost(ArrayRef<Value *> VectorizedVals = None);		InstructionCost getTreeCost(ArrayRef<Value *> VectorizedVals = None);

/// Construct a vectorizable tree that starts at \p Roots, ignoring users for		/// Construct a vectorizable tree that starts at \p Roots, ignoring users for
/// the purpose of scheduling and extraction in the \p UserIgnoreLst.		/// the purpose of scheduling and extraction in the \p UserIgnoreLst.
void buildTree(ArrayRef<Value *> Roots,		void buildTree(ArrayRef<Value *> Roots,
ArrayRef<Value *> UserIgnoreLst = None);		ArrayRef<Value *> UserIgnoreLst = None);

/// Construct a vectorizable tree that starts at \p Roots, ignoring users for		/// Builds external uses of the vectorized scalars, i.e. the list of
/// the purpose of scheduling and extraction in the \p UserIgnoreLst taking		/// vectorized scalars to be extracted, their lanes and their scalar users. \p
/// into account (and updating it, if required) list of externally used		/// ExternallyUsedValues contains additional list of external uses to handle
/// values stored in \p ExternallyUsedValues.		/// vectorization of reductions.
void buildTree(ArrayRef<Value *> Roots,		void
ExtraValueToDebugLocsMap &ExternallyUsedValues,		buildExternalUses(const ExtraValueToDebugLocsMap &ExternallyUsedValues = {});
ArrayRef<Value *> UserIgnoreLst = None);

/// Clear the internal data structures that are created by 'buildTree'.		/// Clear the internal data structures that are created by 'buildTree'.
void deleteTree() {		void deleteTree() {
VectorizableTree.clear();		VectorizableTree.clear();
ScalarToTreeEntry.clear();		ScalarToTreeEntry.clear();
MustGather.clear();		MustGather.clear();
ExternalUses.clear();		ExternalUses.clear();
NumOpsWantToKeepOrder.clear();
NumOpsWantToKeepOriginalOrder = 0;
for (auto &Iter : BlocksSchedules) {		for (auto &Iter : BlocksSchedules) {
BlockScheduling *BS = Iter.second.get();		BlockScheduling *BS = Iter.second.get();
BS->clear();		BS->clear();
}		}
MinBWs.clear();		MinBWs.clear();
InstrElementSize.clear();		InstrElementSize.clear();
}		}

unsigned getTreeSize() const { return VectorizableTree.size(); }		unsigned getTreeSize() const { return VectorizableTree.size(); }

/// Perform LICM and CSE on the newly generated gather sequences.		/// Perform LICM and CSE on the newly generated gather sequences.
void optimizeGatherSequence();		void optimizeGatherSequence();

/// \returns The best order of instructions for vectorization.		/// Reorders the current graph to the most profitable order starting from the
Optional<ArrayRef<unsigned>> bestOrder() const {		/// root node to the leaf nodes. The best order is chosen only from the nodes
assert(llvm::all_of(		/// of the same size (vectorization factor). Smaller nodes are considered
NumOpsWantToKeepOrder,		/// parts of subgraph with smaller VF and they are reordered independently. We
[this](const decltype(NumOpsWantToKeepOrder)::value_type &D) {		/// can make it because we still need to extend smaller nodes to the wider VF
return D.getFirst().size() ==		/// and we can merge reordering shuffles with the widening shuffles. If \p
VectorizableTree[0]->Scalars.size();		/// FreeReorder is true, the reordering of the root node is considered free
}) &&		/// and we don't need to shuffle it to restore its order.
"All orders must have the same size as number of instructions in "		void reorderTopToBottom(bool FreeReorder);
"tree node.");
auto I = std::max_element(		/// Reorders the current graph to the most profitable order starting from
NumOpsWantToKeepOrder.begin(), NumOpsWantToKeepOrder.end(),		/// leaves to the root. It allows to rotate small subgraphs and reduce the
[](const decltype(NumOpsWantToKeepOrder)::value_type &D1,		/// number of reshuffles if the leaf nodes use the same order. In this case we
const decltype(NumOpsWantToKeepOrder)::value_type &D2) {		/// can merge the orders and just shuffle user node instead of shuffling its
return D1.second < D2.second;		/// operands. Plus, even the leaf nodes have different orders, it allows to
});		/// sink reordering in the graph closer to the root node and merge it later
if (I == NumOpsWantToKeepOrder.end() ||		/// during analysis. If \p FreeReorder is true, it means that the root node of
I->getSecond() <= NumOpsWantToKeepOriginalOrder)		/// the graph is free to reorder and no need to restore its original order.
return None;		void reorderBottomToTop(bool FreeReorder);

return makeArrayRef(I->getFirst());
}

/// Builds the correct order for root instructions.
/// If some leaves have the same instructions to be vectorized, we may
/// incorrectly evaluate the best order for the root node (it is built for the
/// vector of instructions without repeated instructions and, thus, has less
/// elements than the root node). This function builds the correct order for
/// the root node.
/// For example, if the root node is \<a+b, a+c, a+d, f+e\>, then the leaves
/// are \<a, a, a, f\> and \<b, c, d, e\>. When we try to vectorize the first
/// leaf, it will be shrink to \<a, b\>. If instructions in this leaf should
/// be reordered, the best order will be \<1, 0\>. We need to extend this
/// order for the root node. For the root node this order should look like
/// \<3, 0, 1, 2\>. This function extends the order for the reused
/// instructions.
void findRootOrder(OrdersType &Order) {
// If the leaf has the same number of instructions to vectorize as the root
// - order must be set already.
unsigned RootSize = VectorizableTree[0]->Scalars.size();
if (Order.size() == RootSize)
return;
SmallVector<unsigned, 4> RealOrder(Order.size());
std::swap(Order, RealOrder);
SmallVector<int, 4> Mask;
inversePermutation(RealOrder, Mask);
Order.assign(Mask.begin(), Mask.end());
// The leaf has less number of instructions - need to find the true order of
// the root.
// Scan the nodes starting from the leaf back to the root.
const TreeEntry *PNode = VectorizableTree.back().get();
SmallVector<const TreeEntry *, 4> Nodes(1, PNode);
SmallPtrSet<const TreeEntry *, 4> Visited;
while (!Nodes.empty() && Order.size() != RootSize) {
const TreeEntry *PNode = Nodes.pop_back_val();
if (!Visited.insert(PNode).second)
continue;
const TreeEntry &Node = *PNode;
for (const EdgeInfo &EI : Node.UserTreeIndices)
if (EI.UserTE)
Nodes.push_back(EI.UserTE);
if (Node.ReuseShuffleIndices.empty())
continue;
// Build the order for the parent node.
OrdersType NewOrder(Node.ReuseShuffleIndices.size(), RootSize);
SmallVector<unsigned, 4> OrderCounter(Order.size(), 0);
// The algorithm of the order extension is:
// 1. Calculate the number of the same instructions for the order.
// 2. Calculate the index of the new order: total number of instructions
// with order less than the order of the current instruction + reuse
// number of the current instruction.
// 3. The new order is just the index of the instruction in the original
// vector of the instructions.
for (unsigned I : Node.ReuseShuffleIndices)
++OrderCounter[Order[I]];
SmallVector<unsigned, 4> CurrentCounter(Order.size(), 0);
for (unsigned I = 0, E = Node.ReuseShuffleIndices.size(); I < E; ++I) {
unsigned ReusedIdx = Node.ReuseShuffleIndices[I];
unsigned OrderIdx = Order[ReusedIdx];
unsigned NewIdx = 0;
for (unsigned J = 0; J < OrderIdx; ++J)
NewIdx += OrderCounter[J];
NewIdx += CurrentCounter[OrderIdx];
++CurrentCounter[OrderIdx];
assert(NewOrder[NewIdx] == RootSize &&
"The order index should not be written already.");
NewOrder[NewIdx] = I;
}
std::swap(Order, NewOrder);
}
assert(Order.size() == RootSize &&
"Root node is expected or the size of the order must be the same as "
"the number of elements in the root node.");
assert(llvm::all_of(Order,
[RootSize](unsigned Val) { return Val != RootSize; }) &&
"All indices must be initialized");
}

/// \return The vector element size in bits to use when vectorizing the		/// \return The vector element size in bits to use when vectorizing the
/// expression tree ending at \p V. If V is a store, the size is the width of		/// expression tree ending at \p V. If V is a store, the size is the width of
/// the stored value. Otherwise, the size is the width of the largest loaded		/// the stored value. Otherwise, the size is the width of the largest loaded
/// value reaching V. This method is used by the vectorizer to calculate		/// value reaching V. This method is used by the vectorizer to calculate
/// vectorization factors.		/// vectorization factors.
unsigned getVectorElementSize(Value *V);		unsigned getVectorElementSize(Value *V);

[... 839 lines not shown ...]	struct TreeEntry {

/// \returns true if the scalars in VL are equal to this entry.		/// \returns true if the scalars in VL are equal to this entry.
bool isSame(ArrayRef<Value *> VL) const {		bool isSame(ArrayRef<Value *> VL) const {
if (VL.size() == Scalars.size())		if (VL.size() == Scalars.size())
return std::equal(VL.begin(), VL.end(), Scalars.begin());		return std::equal(VL.begin(), VL.end(), Scalars.begin());
return VL.size() == ReuseShuffleIndices.size() &&		return VL.size() == ReuseShuffleIndices.size() &&
std::equal(		std::equal(
VL.begin(), VL.end(), ReuseShuffleIndices.begin(),		VL.begin(), VL.end(), ReuseShuffleIndices.begin(),
[this](Value *V, int Idx) { return V == Scalars[Idx]; });		[this](Value *V, int Idx) { return V == Scalars[Idx]; });
		RKSimon: This std::equal + pragma is repeated a lot in this method - worth pulling out?
		ABataev: Ok, will try to transform it into a lambda or something similar.
}		}

/// A vector of scalars.		/// A vector of scalars.
ValueList Scalars;		ValueList Scalars;

/// The Scalars are vectorized into this value. It is initialized to Null.		/// The Scalars are vectorized into this value. It is initialized to Null.
Value *VectorizedValue = nullptr;		Value *VectorizedValue = nullptr;

[... 58 lines not shown ...]	void setOperandsInOrder() {
auto *I = cast<Instruction>(Scalars[Lane]);		auto *I = cast<Instruction>(Scalars[Lane]);
assert(I->getNumOperands() == NumOperands &&		assert(I->getNumOperands() == NumOperands &&
"Expected same number of operands");		"Expected same number of operands");
Operands[OpIdx][Lane] = I->getOperand(OpIdx);		Operands[OpIdx][Lane] = I->getOperand(OpIdx);
}		}
}		}
}		}

		/// Reorders operands of the node to the given mask \p Mask.
		void reorderOperands(ArrayRef<int> Mask) {
		for (ValueList &Operand : Operands)
		reorderScalars(Operand, ReorderIndices, Mask);
		}

/// \returns the \p OpIdx operand of this TreeEntry.		/// \returns the \p OpIdx operand of this TreeEntry.
ValueList &getOperand(unsigned OpIdx) {		ValueList &getOperand(unsigned OpIdx) {
assert(OpIdx < Operands.size() && "Off bounds");		assert(OpIdx < Operands.size() && "Off bounds");
return Operands[OpIdx];		return Operands[OpIdx];
}		}

/// \returns the number of operands.		/// \returns the number of operands.
unsigned getNumOperands() const { return Operands.size(); }		unsigned getNumOperands() const { return Operands.size(); }
[... 717 lines not shown ...]	static unsigned getHashValue(const OrdersType &V) {
return static_cast<unsigned>(hash_combine_range(V.begin(), V.end()));		return static_cast<unsigned>(hash_combine_range(V.begin(), V.end()));
}		}

static bool isEqual(const OrdersType &LHS, const OrdersType &RHS) {		static bool isEqual(const OrdersType &LHS, const OrdersType &RHS) {
return LHS == RHS;		return LHS == RHS;
}		}
};		};

/// Contains orders of operations along with the number of bundles that have
/// operations in this order. It stores only those orders that require
/// reordering, if reordering is not required it is counted using \a
/// NumOpsWantToKeepOriginalOrder.
DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> NumOpsWantToKeepOrder;
/// Number of bundles that do not require reordering.
unsigned NumOpsWantToKeepOriginalOrder = 0;

// Analysis and block reference.		// Analysis and block reference.
Function *F;		Function *F;
ScalarEvolution *SE;		ScalarEvolution *SE;
TargetTransformInfo *TTI;		TargetTransformInfo *TTI;
TargetLibraryInfo *TLI;		TargetLibraryInfo *TLI;
AAResults *AA;		AAResults *AA;
LoopInfo *LI;		LoopInfo *LI;
DominatorTree *DT;		DominatorTree *DT;
[... 136 lines not shown ...]

void BoUpSLP::eraseInstructions(ArrayRef<Value *> AV) {		void BoUpSLP::eraseInstructions(ArrayRef<Value *> AV) {
for (auto *V : AV) {		for (auto *V : AV) {
if (auto *I = dyn_cast<Instruction>(V))		if (auto *I = dyn_cast<Instruction>(V))
eraseInstruction(I, /*ReplaceOpsWithUndef=*/true);		eraseInstruction(I, /*ReplaceOpsWithUndef=*/true);
};		};
}		}

void BoUpSLP::buildTree(ArrayRef<Value *> Roots,		/// Reorders the given \p Order according to the given \p Mask. \p Order - is
ArrayRef<Value *> UserIgnoreLst) {		/// the original order of the scalars. Procedure transforms the provided order
ExtraValueToDebugLocsMap ExternallyUsedValues;		/// in accordance with the given \p Mask. If the resulting \p Order is just an
buildTree(Roots, ExternallyUsedValues, UserIgnoreLst);		/// identity order, \p Order is cleared.
		static void reorderOrder(SmallVectorImpl<unsigned> &Order, ArrayRef<int> Mask) {
		assert(!Mask.empty() && "Expected non-empty mask.");
		if (Order.empty()) {
		Order.resize(Mask.size());
		std::iota(Order.begin(), Order.end(), 0);
		}
		SmallVector<unsigned> Prev(Order.size(), Order.size());
		RKSimon: Do we need a default value for vector?
		ABataev: No, it is swapped with `Order` and `Order` then updated. But I'll add default initialization because we may need it in the future for non-pow-2 patch.
		Prev.swap(Order);
		for (unsigned I = 0, E = Prev.size(); I < E; ++I)
		if (Mask[Prev[I]] != UndefMaskElem)
		Order[Mask[Prev[I]]] = I;
		RKSimon: Doesn't the IsIdentity pass have to be done after Order[] is updated?
		ABataev: Yes, you're right, will fix this. It may affect the cost in some rare cases.
		auto &&IsIdentity = [](ArrayRef<unsigned> Order) {
		for (unsigned I = 0, E = Order.size(); I < E; ++I) {
		if (Order[I] != I)
		return false;
		}
		return true;
		};
		if (IsIdentity(Order))
		Order.clear();
}		}
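
The following standalone sketch mirrors `reorderOrder` as it appears in this diff, with the identity check performed after `Order` has been rewritten, which is the point raised in the inline comment. It uses `std::vector` instead of the LLVM containers and `-1` (named `kUndefIdx`) in place of `UndefMaskElem`; those substitutions and the name `reorderOrderSketch` are assumptions for illustration only.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

constexpr int kUndefIdx = -1; // stand-in for LLVM's UndefMaskElem

// Illustrative analogue of reorderOrder: an empty Order means identity,
// and an Order that ends up as the identity is cleared again.
void reorderOrderSketch(std::vector<unsigned> &Order,
                        const std::vector<int> &Mask) {
  assert(!Mask.empty() && "Expected non-empty mask.");
  if (Order.empty()) {
    Order.resize(Mask.size());
    std::iota(Order.begin(), Order.end(), 0u);
  }
  const unsigned E = static_cast<unsigned>(Order.size());
  std::vector<unsigned> Prev(E, E);
  Prev.swap(Order);
  for (unsigned I = 0; I < E; ++I)
    if (Mask[Prev[I]] != kUndefIdx)
      Order[Mask[Prev[I]]] = I;
  // The identity test runs only after Order holds its final contents.
  bool IsIdentity = true;
  for (unsigned I = 0; I < E; ++I)
    if (Order[I] != I)
      IsIdentity = false;
  if (IsIdentity)
    Order.clear();
}
```

Starting from an empty `Order` with `Mask = {1, 2, 0}` the sketch produces `{2, 0, 1}`. Starting from `Order = {1, 2, 0}` with `Mask = {2, 0, 1}` the result is the identity, so `Order` is cleared.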

void BoUpSLP::buildTree(ArrayRef<Value *> Roots,		/// Reorders the given \p Reuses mask according to the given \p Mask. \p Reuses
ExtraValueToDebugLocsMap &ExternallyUsedValues,		/// contains original mask for the scalars reused in the node. Procedure
ArrayRef<Value *> UserIgnoreLst) {		/// transform this mask in accordance with the given \p Mask.
deleteTree();		static void reorderReuses(SmallVectorImpl<int> &Reuses, ArrayRef<int> Mask) {
UserIgnoreList = UserIgnoreLst;		assert(!Reuses.empty() && !Mask.empty() &&
if (!allSameType(Roots))		"Expected non-empty mask and reuses mask.");
return;		SmallVector<int> Prev(Reuses.size(), UndefMaskElem);
		RKSimon: Do we need a default value for vector?
		ABataev: I'll add `UndefMaskElem` to reduce future updates for non-power-2.
buildTree_rec(Roots, 0, EdgeInfo());		Prev.swap(Reuses);
		for (unsigned I = 0, E = Prev.size(); I < E; ++I)
		if (Mask[I] != UndefMaskElem)
		Reuses[Mask[I]] = Prev[I];
		}

		void BoUpSLP::reorderTopToBottom(bool FreeReorder) {
		// Maps VF to the graph nodes.
		DenseMap<unsigned, SmallPtrSet<TreeEntry *, 4>> VFToOrderedEntries;
		// ExtractElement gather nodes which can be vectorized and need to handle
		// their ordering.
		DenseMap<const TreeEntry *, OrdersType> GathersToOrders;
		// Find all reorderable nodes with the given VF.
		// Currently the are vectorized loads,extracts + some gathering of extracts.
		for_each(VectorizableTree, [this, &VFToOrderedEntries, &GathersToOrders](
		const std::unique_ptr<TreeEntry> &TE) {
		if (TE->State == TreeEntry::Vectorize &&
		isa<LoadInst, ExtractElementInst, ExtractValueInst>(TE->getMainOp())) {
		VFToOrderedEntries[TE->Scalars.size()].insert(TE.get());
		} else if (TE->State == TreeEntry::NeedToGather &&
		TE->getOpcode() == Instruction::ExtractElement &&
		isa<FixedVectorType>(cast<ExtractElementInst>(TE->getMainOp())
		->getVectorOperandType()) &&
		allSameType(TE->Scalars) && allSameBlock(TE->Scalars)) {
		// Check that gather of extractelements can be represented as
		// just a shuffle of a single vector.
		OrdersType CurrentOrder;
		bool Reuse = canReuseExtract(TE->Scalars, TE->getMainOp(), CurrentOrder);
		if (Reuse || !CurrentOrder.empty()) {
		VFToOrderedEntries[TE->Scalars.size()].insert(TE.get());
		GathersToOrders.try_emplace(TE.get(), CurrentOrder);
		}
		}
		});

		// Reorder the graph nodes according to their vectorization factor.
		for (unsigned VF = VectorizableTree.front()->Scalars.size(); VF > 1;
		VF /= 2) {
		auto It = VFToOrderedEntries.find(VF);
		if (It == VFToOrderedEntries.end())
		continue;
		// Try to find the most profitable order. We just are looking for the most
		// used order and reorder scalar elements in the nodes according to this
		// mostly used order.
		const SmallPtrSetImpl<TreeEntry *> &OrderedEntries = It->getSecond();
		// All operands are reordered and used only in this node - propagate the
		// most used order to the user node.
		DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> OrdersUses;
		SmallPtrSet<const TreeEntry *, 4> VisitedOps;
		for (const TreeEntry *OpTE : OrderedEntries) {
		// No need to reorder this nodes, still need to extend and to use shuffle,
		// just need to merge reordering shuffle and the reuse shuffle.
		if (!OpTE->ReuseShuffleIndices.empty())
		continue;
		// Count number of orders uses.
		const auto &Order = [OpTE, &GathersToOrders]() -> const OrdersType & {
		if (OpTE->State == TreeEntry::NeedToGather)
		return GathersToOrders.find(OpTE)->second;
		return OpTE->ReorderIndices;
		}();
		++OrdersUses.try_emplace(Order).first->getSecond();
		}
		// Set order of the user node.
		if (OrdersUses.empty())
		continue;
		// If need to reorder the root node, it means it also requires to keep its
		// original order.
		if (VF == VectorizableTree.front()->Scalars.size() && !FreeReorder)
		++OrdersUses[VectorizableTree.front()->ReorderIndices];
		// Choose the most used order.
		ArrayRef<unsigned> BestOrder;
		unsigned Cnt;
		std::tie(BestOrder, Cnt) = *OrdersUses.begin();
		for (const auto &Pair : llvm::drop_begin(OrdersUses)) {
		if (Cnt < Pair.second || (Cnt == Pair.second && Pair.first.empty()))
		std::tie(BestOrder, Cnt) = Pair;
		}
		// Set order of the user node.
		if (BestOrder.empty())
		continue;
		SmallVector<int> Mask;
		inversePermutation(BestOrder, Mask);
		SmallPtrSet<TreeEntry *, 4> SmallOperandsToReorder;
		// Do an actual reordering, if profitable.
		for (std::unique_ptr<TreeEntry> &TE : VectorizableTree) {
		// Just do the reordering for the nodes with the given VF.
		if (TE->Scalars.size() != VF) {
		if (TE->ReuseShuffleIndices.size() == VF) {
		// Need to reorder the reuses masks of the operands with smaller VF to
		// be able to find the math between the graph nodes and scalar
		// operands of the given node during vectorization/cost estimation.
		// Build a list of such operands for future reordering.
		assert(all_of(TE->UserTreeIndices,
		[VF](const EdgeInfo &EI) {
		return EI.UserTE->Scalars.size() == VF;
		}) &&
		"All users must be of VF size.");
		SmallOperandsToReorder.insert(TE.get());
		}
		continue;
		}
		// Reorder the node and its operands.
		TE->updateStateIfReorder();
		TE->reorderOperands(Mask);
		if (TE->ReuseShuffleIndices.empty()) {
		reorderScalars(TE->Scalars, TE->ReorderIndices, Mask);
		if (TE->State == TreeEntry::Vectorize &&
		(TE.get() != VectorizableTree.front().get() || !FreeReorder) &&
		isa<ExtractElementInst, ExtractValueInst, LoadInst, StoreInst>(
		TE->getMainOp())) {
		// Build correct orders for extract{element,value}, loads and stores.
		reorderOrder(TE->ReorderIndices, Mask);
		// For stores the order is actually a mask.
		if (isa<StoreInst>(TE->getMainOp()) && !TE->ReorderIndices.empty()) {
		SmallVector<int> StoreOrder;
		inversePermutation(TE->ReorderIndices, StoreOrder);
		copy(StoreOrder, TE->ReorderIndices.begin());
		}
		} else {
		TE->ReorderIndices.clear();
		}
		} else {
		// Build correct order for nodes with reused shuffles.
		reorderOrder(TE->ReorderIndices, Mask);
		}
		}
		// Update ordering of the operands with the smaller VF than the given one.
		for (TreeEntry *TE : SmallOperandsToReorder)
		reorderReuses(TE->ReuseShuffleIndices, Mask);
		}
		}
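
The "choose the most used order" step in `reorderTopToBottom` can be sketched in isolation. The version below uses `std::map` and `std::vector` instead of `DenseMap` and `OrdersType`, and the names `OrderSketch` and `pickBestOrderSketch` are hypothetical. It keeps the same rule as the loop in the diff: the highest use count wins, and on a tie the empty (identity) order is preferred so that no reordering happens.

```cpp
#include <cassert>
#include <map>
#include <vector>

using OrderSketch = std::vector<unsigned>;

// Illustrative analogue of the BestOrder selection loop.
OrderSketch
pickBestOrderSketch(const std::map<OrderSketch, unsigned> &OrdersUses) {
  assert(!OrdersUses.empty() && "Expected at least one candidate order.");
  auto It = OrdersUses.begin();
  const OrderSketch *Best = &It->first;
  unsigned Cnt = It->second;
  for (++It; It != OrdersUses.end(); ++It) {
    // Strictly more uses wins; on equal counts only the empty order wins.
    if (Cnt < It->second || (Cnt == It->second && It->first.empty())) {
      Best = &It->first;
      Cnt = It->second;
    }
  }
  return *Best;
}
```

One difference worth keeping in mind: `std::map` iterates in sorted key order, so the empty order is always visited first here, whereas `DenseMap` iteration order is unspecified. The explicit tie-break toward the empty order matters more in the real code than in this sketch.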

		void BoUpSLP::reorderBottomToTop(bool FreeReorder) {
		SetVector<TreeEntry *> OrderedEntries;
		DenseMap<const TreeEntry *, OrdersType> GathersToOrders;
		// Find all reorderable nodes with the given VF.
		// Currently the are vectorized loads,extracts without alternate operands +
		// some gathering of extracts.
		SmallVector<TreeEntry *> NonVectorized;
		for_each(VectorizableTree, [this, &OrderedEntries, &GathersToOrders,
		&NonVectorized](
		const std::unique_ptr<TreeEntry> &TE) {
		if (TE->State == TreeEntry::Vectorize &&
		isa<LoadInst, ExtractElementInst, ExtractValueInst>(TE->getMainOp())) {
		OrderedEntries.insert(TE.get());
		} else if (TE->State == TreeEntry::NeedToGather &&
		TE->getOpcode() == Instruction::ExtractElement &&
		isa<FixedVectorType>(cast<ExtractElementInst>(TE->getMainOp())
		->getVectorOperandType()) &&
		allSameType(TE->Scalars) && allSameBlock(TE->Scalars)) {
		// Check that gather of extractelements can be represented as
		// just a shuffle of a single vector with a single user only.
		OrdersType CurrentOrder;
		bool Reuse = canReuseExtract(TE->Scalars, TE->getMainOp(), CurrentOrder);
		if ((Reuse || !CurrentOrder.empty()) &&
		!any_of(
		VectorizableTree, [&TE](const std::unique_ptr<TreeEntry> &Entry) {
		return Entry->State == TreeEntry::NeedToGather &&
		Entry.get() != TE.get() && Entry->isSame(TE->Scalars);
		})) {
		OrderedEntries.insert(TE.get());
		GathersToOrders.try_emplace(TE.get(), CurrentOrder);
		}
		}
		if (TE->State != TreeEntry::Vectorize)
		NonVectorized.push_back(TE.get());
		});

		// Checks if the operands of the users are reordarable and have only single
		// use.
		auto &&CheckOperands =
		[this, &NonVectorized](const auto &Data,
		SmallVectorImpl<TreeEntry *> &GatherOps) {
		for (unsigned I = 0, E = Data.first->getNumOperands(); I < E; ++I) {
		if (any_of(Data.second,
		[I](const std::pair<unsigned, TreeEntry *> &OpData) {
		return OpData.first == I;
		}))
		continue;
		ArrayRef<Value *> VL = Data.first->getOperand(I);
		const TreeEntry *TE = nullptr;
		const auto It = find_if(VL, [this, &TE](Value *V) {
		TE = getTreeEntry(V);
		return TE;
		});
		if (It != VL.end() && TE->isSame(VL))
		return false;
		TreeEntry *Gather = nullptr;
		if (count_if(NonVectorized, [VL, &Gather](TreeEntry *TE) {
		assert(TE->State != TreeEntry::Vectorize &&
		"Only non-vectorized nodes are expected.");
		if (TE->isSame(VL)) {
		Gather = TE;
		return true;
		}
		return false;
		}) != 1)
		return false;
		GatherOps.push_back(Gather);
		}
		return true;
		};
		// 1. Propagate order to the graph nodes, which use only reordered nodes.
		// I.e., if the node has operands, that are reordered, try to make at least
		// one operand order in the natural order and reorder others + reorder the
		// user node itself.
		SmallPtrSet<const TreeEntry *, 4> Visited;
		while (!OrderedEntries.empty()) {
		// 1. Filter out only reordered nodes.
		// 2. If the entry has multiple uses - skip it and jump to the next node.
		MapVector<TreeEntry *, SmallVector<std::pair<unsigned, TreeEntry *>>> Users;
		SmallVector<TreeEntry *> Filtered;
		for (TreeEntry *TE : OrderedEntries) {
		if (!(TE->State == TreeEntry::Vectorize ||
		(TE->State == TreeEntry::NeedToGather &&
		TE->getOpcode() == Instruction::ExtractElement)) ||
		TE->UserTreeIndices.empty() || !TE->ReuseShuffleIndices.empty() ||
		!all_of(drop_begin(TE->UserTreeIndices),
		[TE](const EdgeInfo &EI) {
		return EI.UserTE == TE->UserTreeIndices.front().UserTE;
		}) ||
		!Visited.insert(TE).second) {
		Filtered.push_back(TE);
		continue;
		}
		// Build a map between user nodes and their operands order to speedup
		// search. The graph currently does not provide this dependency directly.
		for (EdgeInfo &EI : TE->UserTreeIndices) {
		TreeEntry *UserTE = EI.UserTE;
		auto It = Users.find(UserTE);
		if (It == Users.end())
		It = Users.insert({UserTE, {}}).first;
		It->second.emplace_back(EI.EdgeIdx, TE);
		}
		}
		// Erase filtered entries.
		for_each(Filtered,
		[&OrderedEntries](TreeEntry *TE) { OrderedEntries.remove(TE); });
		for (const auto &Data : Users) {
		// Check that operands are used only in the User node.
		SmallVector<TreeEntry *> GatherOps;
		if (!CheckOperands(Data, GatherOps)) {
		for_each(Data.second,
		[&OrderedEntries](const std::pair<unsigned, TreeEntry *> &Op) {
		OrderedEntries.remove(Op.second);
		});
		continue;
		}
		// All operands are reordered and used only in this node - propagate the
		// most used order to the user node.
		DenseMap<OrdersType, unsigned, OrdersTypeDenseMapInfo> OrdersUses;
		SmallPtrSet<const TreeEntry *, 4> VisitedOps;
		for (const auto &Op : Data.second) {
		TreeEntry *OpTE = Op.second;
		if (!OpTE->ReuseShuffleIndices.empty())
		continue;
		const auto &Order = [OpTE, &GathersToOrders]() -> const OrdersType & {
		if (OpTE->State == TreeEntry::NeedToGather)
		return GathersToOrders.find(OpTE)->second;
		return OpTE->ReorderIndices;
		}();
		++OrdersUses.try_emplace(Order).first->getSecond();
		if (VisitedOps.insert(OpTE).second)
		OrdersUses.try_emplace({}, 0).first->getSecond() +=
		OpTE->UserTreeIndices.size();
		--OrdersUses[{}];
		}
		// If no orders - skip current nodes and jump to the next one, if any.
		if (OrdersUses.empty()) {
		for_each(Data.second,
		[&OrderedEntries](const std::pair<unsigned, TreeEntry *> &Op) {
		OrderedEntries.remove(Op.second);
		});
		continue;
		}
		// Choose the best order.
		ArrayRef<unsigned> BestOrder;
		unsigned Cnt;
		std::tie(BestOrder, Cnt) = *OrdersUses.begin();
		for (const auto &Pair : llvm::drop_begin(OrdersUses)) {
		if (Cnt < Pair.second || (Cnt == Pair.second && Pair.first.empty()))
		std::tie(BestOrder, Cnt) = Pair;
		}
		// Set order of the user node (reordering of operands and user nodes).
		if (BestOrder.empty()) {
		for_each(Data.second,
		[&OrderedEntries](const std::pair<unsigned, TreeEntry *> &Op) {
		OrderedEntries.remove(Op.second);
		});
		continue;
		}
		// Erase operands from OrderedEntries list and adjust their orders.
		VisitedOps.clear();
		SmallVector<int> Mask;
		inversePermutation(BestOrder, Mask);
		for (const std::pair<unsigned, TreeEntry *> &Op : Data.second) {
		TreeEntry *TE = Op.second;
		OrderedEntries.remove(TE);
		if (!VisitedOps.insert(TE).second)
		continue;
		if (!TE->ReuseShuffleIndices.empty()) {
		// Just reorder reuses indices.
		reorderReuses(TE->ReuseShuffleIndices, Mask);
		continue;
		}
		// Gathers are processed separately.
		if (TE->State != TreeEntry::Vectorize)
		continue;
		assert((BestOrder.size() == TE->ReorderIndices.size() ||
		TE->ReorderIndices.empty()) &&
		"Non-matching sizes of user/operand entries.");
		TE->updateStateIfReorder();
		reorderScalars(TE->Scalars, TE->ReorderIndices, Mask);
		reorderOrder(TE->ReorderIndices, Mask);
		}
		// For gathers just need to reorder its scalars.
		for (TreeEntry *Gather : GatherOps) {
		if (!Gather->ReuseShuffleIndices.empty())
		continue;
		reorderScalars(Gather->Scalars, None, Mask);
		OrderedEntries.remove(Gather);
		}
		// Reorder operands of the user node and set the ordering for the user
		// node itself.
		Data.first->updateStateIfReorder();
		Data.first->reorderOperands(Mask);
		if (!FreeReorder || Data.first != VectorizableTree.front().get()) {
		reorderOrder(Data.first->ReorderIndices, Mask);
		// For stores the order is actually a mask.
		if (isa<StoreInst>(Data.first->getMainOp()) &&
		!Data.first->ReorderIndices.empty()) {
		SmallVector<int> StoreOrder;
		inversePermutation(Data.first->ReorderIndices, StoreOrder);
		copy(StoreOrder, Data.first->ReorderIndices.begin());
		}
		// Insert user node to the list to try to sink reordering deeper in the
		// graph.
		OrderedEntries.insert(Data.first);
		} else {
		reorderScalars(Data.first->Scalars, Data.first->ReorderIndices, Mask);
		}
		}
		}
		}
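
The "choose the best order" step in the loop above can be looked at in isolation. What follows is a minimal sketch rather than the patch's code: it uses `std::map` and `std::vector` in place of `DenseMap` and `OrdersType`, and the name `pickBestOrder` is hypothetical. An empty order stands for the natural (identity) order and wins ties, mirroring the `Cnt == Pair.second && Pair.first.empty()` comparison.

```cpp
#include <cassert>
#include <map>
#include <vector>

using Order = std::vector<unsigned>;

// Hypothetical helper: returns the order with the highest use count.
// On a tie the empty (identity) order is preferred, so reordering only
// happens when some non-trivial order is strictly more popular.
Order pickBestOrder(const std::map<Order, unsigned> &OrdersUses) {
  assert(!OrdersUses.empty() && "expected at least one candidate order");
  auto It = OrdersUses.begin();
  Order Best = It->first;
  unsigned Cnt = It->second;
  for (++It; It != OrdersUses.end(); ++It) {
    if (Cnt < It->second || (Cnt == It->second && It->first.empty())) {
      Best = It->first;
      Cnt = It->second;
    }
  }
  return Best;
}
```

When the returned order is empty, the caller in the diff drops the operands from `OrderedEntries` and leaves the user node untouched; otherwise the chosen order is inverted into a shuffle mask and applied to the operands and the user node.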

		void BoUpSLP::buildExternalUses(
		const ExtraValueToDebugLocsMap &ExternallyUsedValues) {
// Collect the values that we need to extract from the tree.		// Collect the values that we need to extract from the tree.
for (auto &TEPtr : VectorizableTree) {		for (auto &TEPtr : VectorizableTree) {
TreeEntry *Entry = TEPtr.get();		TreeEntry *Entry = TEPtr.get();

// No need to handle users of gathered values.		// No need to handle users of gathered values.
if (Entry->State == TreeEntry::NeedToGather)		if (Entry->State == TreeEntry::NeedToGather)
continue;		continue;

[... 39 lines not shown ...]	for (int Lane = 0, LE = Entry->Scalars.size(); Lane != LE; ++Lane) {
LLVM_DEBUG(dbgs() << "SLP: Need to extract:" << *U << " from lane "		LLVM_DEBUG(dbgs() << "SLP: Need to extract:" << *U << " from lane "
<< Lane << " from " << *Scalar << ".\n");		<< Lane << " from " << *Scalar << ".\n");
ExternalUses.push_back(ExternalUser(Scalar, U, FoundLane));		ExternalUses.push_back(ExternalUser(Scalar, U, FoundLane));
}		}
}		}
}		}
}		}

		void BoUpSLP::buildTree(ArrayRef<Value *> Roots,
		ArrayRef<Value *> UserIgnoreLst) {
		deleteTree();
		UserIgnoreList = UserIgnoreLst;
		if (!allSameType(Roots))
		return;
		buildTree_rec(Roots, 0, EdgeInfo());
		}

		namespace {
		/// Tracks the state we can represent the loads in the given sequence.
		enum class LoadsState { Gather, Vectorize, ScatterVectorize };
		RKSimon: This is very similar to the EntryState enum - merge them?
		ABataev: I would not do this. Though the values look similar the meaning is completely different. This state handles only loads, while the entry state handles all possible entries kinds. In the future, we may have different values in these enums, which may lead to some unpredictable results. I would keep it as is.
		} // anonymous namespace

		/// Checks if the given array of loads can be represented as a vectorized,
		/// scatter or just simple gather.
		static LoadsState canVectorizeLoads(ArrayRef<Value *> VL, const Value *VL0,
		const TargetTransformInfo &TTI,
		const DataLayout &DL, ScalarEvolution &SE,
		SmallVectorImpl<unsigned> &Order,
		SmallVectorImpl<Value *> &PointerOps) {
		// Check that a vectorized load would load the same memory as a scalar
		// load. For example, we don't want to vectorize loads that are smaller
		// than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM
		// treats loading/storing it as an i8 struct. If we vectorize loads/stores
		// from such a struct, we read/write packed bits disagreeing with the
		// unvectorized version.
		Type *ScalarTy = VL0->getType();

		if (DL.getTypeSizeInBits(ScalarTy) != DL.getTypeAllocSizeInBits(ScalarTy))
		return LoadsState::Gather;

		// Make sure all loads in the bundle are simple - we can't vectorize
		// atomic or volatile loads.
		PointerOps.clear();
		PointerOps.resize(VL.size());
		auto *POIter = PointerOps.begin();
		for (Value *V : VL) {
		auto *L = cast<LoadInst>(V);
		if (!L->isSimple())
		return LoadsState::Gather;
		*POIter = L->getPointerOperand();
		++POIter;
		}

		Order.clear();
		// Check the order of pointer operands.
		if (llvm::sortPtrAccesses(PointerOps, ScalarTy, DL, SE, Order)) {
		Value *Ptr0;
		Value *PtrN;
		if (Order.empty()) {
		Ptr0 = PointerOps.front();
		PtrN = PointerOps.back();
		} else {
		Ptr0 = PointerOps[Order.front()];
		PtrN = PointerOps[Order.back()];
		}
		Optional<int> Diff =
		getPointersDiff(ScalarTy, Ptr0, ScalarTy, PtrN, DL, SE);
		// Check that the sorted loads are consecutive.
		if (static_cast<unsigned>(*Diff) == VL.size() - 1)
		return LoadsState::Vectorize;
		Align CommonAlignment = cast<LoadInst>(VL0)->getAlign();
		for (Value *V : VL)
		CommonAlignment =
		commonAlignment(CommonAlignment, cast<LoadInst>(V)->getAlign());
		if (TTI.isLegalMaskedGather(FixedVectorType::get(ScalarTy, VL.size()),
		CommonAlignment))
		return LoadsState::ScatterVectorize;
		}

		return LoadsState::Gather;
		}
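
The decision `canVectorizeLoads` makes once the pointer operands are sorted can be sketched on plain integers. This is a simplified model under stated assumptions, not the LLVM implementation: each load is represented by its element offset from a common base, the `GatherIsLegal` flag stands in for `TTI.isLegalMaskedGather`, repeated offsets are simply treated as a gather, and the name `classifyLoads` is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

enum class LoadsState { Gather, Vectorize, ScatterVectorize };

// Hypothetical model: each load is described by its element offset from a
// common base pointer. GatherIsLegal stands in for the target query.
LoadsState classifyLoads(std::vector<int> Offsets, bool GatherIsLegal) {
  if (Offsets.empty())
    return LoadsState::Gather;
  std::sort(Offsets.begin(), Offsets.end());
  // Assumption of this sketch: repeated addresses are not vectorized.
  if (std::adjacent_find(Offsets.begin(), Offsets.end()) != Offsets.end())
    return LoadsState::Gather;
  // Sorted, distinct offsets are consecutive exactly when the distance
  // between the first and the last one equals size - 1.
  int Diff = Offsets.back() - Offsets.front();
  if (static_cast<std::size_t>(Diff) == Offsets.size() - 1)
    return LoadsState::Vectorize;
  return GatherIsLegal ? LoadsState::ScatterVectorize : LoadsState::Gather;
}
```

Jumbled but consecutive offsets such as `{2, 0, 3, 1}` classify as `Vectorize`; in the diff that case additionally returns a non-empty `Order` describing the permutation.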

		/// Order may have elements assigned special value (size) which is out of
		/// bounds. Such indices only appear on places which correspond to undef values
		/// (see canReuseExtract for details) and used in order to avoid undef values
		/// have effect on operands ordering.
		/// The first loop below simply finds all unused indices and then the next loop
		/// nest assigns these indices for undef values positions.
		/// As an example below Order has two undef positions and they have assigned
		/// values 3 and 7 respectively:
		/// before: 6 9 5 4 9 2 1 0
		/// after: 6 3 5 4 7 2 1 0
		/// \returns Fixed ordering.
		static void fixupOrderingIndices(SmallVectorImpl<unsigned> &Order) {
		const unsigned Sz = Order.size();
		SmallBitVector UsedIndices(Sz);
		SmallVector<int> MaskedIndices;
		for (unsigned I = 0; I < Sz; ++I) {
		if (Order[I] < Sz)
		UsedIndices.set(Order[I]);
		else
		MaskedIndices.push_back(I);
		}
		if (MaskedIndices.empty())
		return;
		SmallVector<int> AvailableIndices(MaskedIndices.size());
		RKSimon: Default value?
		ABataev: It is initialized with zeroes by default.
		unsigned Cnt = 0;
		int Idx = UsedIndices.find_first();
		do {
		AvailableIndices[Cnt] = Idx;
		Idx = UsedIndices.find_next(Idx);
		++Cnt;
		} while (Idx > 0);
		assert(Cnt == MaskedIndices.size() && "Non-synced masked/available indices.");
		for (int I = 0, E = MaskedIndices.size(); I < E; ++I)
		Order[MaskedIndices[I]] = AvailableIndices[I];
		}
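
The behaviour described in the doc comment of `fixupOrderingIndices` (out-of-bounds entries are replaced with the indices that do not yet appear in the order) can be sketched with standard containers. This is an illustration of the documented before/after example, assuming the in-bounds entries are distinct; it uses `std::vector` instead of `SmallVector`/`SmallBitVector`, and the name `fixupOrder` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Replaces every out-of-bounds entry of Order with an index that is not
// used elsewhere in Order, taking the unused indices in ascending order.
void fixupOrder(std::vector<unsigned> &Order) {
  const unsigned Sz = Order.size();
  std::vector<bool> Used(Sz, false);
  std::vector<unsigned> MaskedPositions;
  for (unsigned I = 0; I < Sz; ++I) {
    if (Order[I] < Sz)
      Used[Order[I]] = true;
    else
      MaskedPositions.push_back(I);
  }
  unsigned Next = 0; // next candidate for an unused index
  for (unsigned Pos : MaskedPositions) {
    while (Next < Sz && Used[Next])
      ++Next;
    assert(Next < Sz && "ran out of unused indices");
    Order[Pos] = Next++;
  }
}
```

Applied to the order from the comment, `6 9 5 4 9 2 1 0`, the two out-of-bounds entries receive the unused indices 3 and 7, giving `6 3 5 4 7 2 1 0`.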

void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,		void BoUpSLP::buildTree_rec(ArrayRef<Value *> VL, unsigned Depth,
const EdgeInfo &UserTreeIdx) {		const EdgeInfo &UserTreeIdx) {
assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");		assert((allConstant(VL) || allSameType(VL)) && "Invalid types!");

InstructionsState S = getSameOpcode(VL);		InstructionsState S = getSameOpcode(VL);
if (Depth == RecursionMaxDepth) {		if (Depth == RecursionMaxDepth) {
LLVM_DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");		LLVM_DEBUG(dbgs() << "SLP: Gathering due to max recursion depth.\n");
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx);
[... 187 lines not shown ...]	case Instruction::PHI: {
return;		return;
}		}
case Instruction::ExtractValue:		case Instruction::ExtractValue:
case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
OrdersType CurrentOrder;		OrdersType CurrentOrder;
bool Reuse = canReuseExtract(VL, VL0, CurrentOrder);		bool Reuse = canReuseExtract(VL, VL0, CurrentOrder);
if (Reuse) {		if (Reuse) {
LLVM_DEBUG(dbgs() << "SLP: Reusing or shuffling extract sequence.\n");		LLVM_DEBUG(dbgs() << "SLP: Reusing or shuffling extract sequence.\n");
++NumOpsWantToKeepOriginalOrder;
newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
// This is a special case, as it does not gather, but at the same time		// This is a special case, as it does not gather, but at the same time
// we are not extending buildTree_rec() towards the operands.		// we are not extending buildTree_rec() towards the operands.
ValueList Op0;		ValueList Op0;
Op0.assign(VL.size(), VL0->getOperand(0));		Op0.assign(VL.size(), VL0->getOperand(0));
VectorizableTree.back()->setOperand(0, Op0);		VectorizableTree.back()->setOperand(0, Op0);
return;		return;
}		}
if (!CurrentOrder.empty()) {		if (!CurrentOrder.empty()) {
LLVM_DEBUG({		LLVM_DEBUG({
dbgs() << "SLP: Reusing or shuffling of reordered extract sequence "		dbgs() << "SLP: Reusing or shuffling of reordered extract sequence "
"with order";		"with order";
for (unsigned Idx : CurrentOrder)		for (unsigned Idx : CurrentOrder)
dbgs() << " " << Idx;		dbgs() << " " << Idx;
dbgs() << "\n";		dbgs() << "\n";
});		});
		fixupOrderingIndices(CurrentOrder);
// Insert new order with initial value 0, if it does not exist,		// Insert new order with initial value 0, if it does not exist,
// otherwise return the iterator to the existing one.		// otherwise return the iterator to the existing one.
newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies, CurrentOrder);		ReuseShuffleIndicies, CurrentOrder);
findRootOrder(CurrentOrder);
++NumOpsWantToKeepOrder[CurrentOrder];
// This is a special case, as it does not gather, but at the same time		// This is a special case, as it does not gather, but at the same time
// we are not extending buildTree_rec() towards the operands.		// we are not extending buildTree_rec() towards the operands.
ValueList Op0;		ValueList Op0;
Op0.assign(VL.size(), VL0->getOperand(0));		Op0.assign(VL.size(), VL0->getOperand(0));
VectorizableTree.back()->setOperand(0, Op0);		VectorizableTree.back()->setOperand(0, Op0);
return;		return;
}		}
LLVM_DEBUG(dbgs() << "SLP: Gather extract sequence.\n");		LLVM_DEBUG(dbgs() << "SLP: Gather extract sequence.\n");
[... 24 lines not shown ...]	case Instruction::InsertElement: {
}		}

TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx);		TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx);
LLVM_DEBUG(dbgs() << "SLP: added inserts bundle.\n");		LLVM_DEBUG(dbgs() << "SLP: added inserts bundle.\n");

constexpr int NumOps = 2;		constexpr int NumOps = 2;
ValueList VectorOperands[NumOps];		ValueList VectorOperands[NumOps];
for (int I = 0; I < NumOps; ++I) {		for (int I = 0; I < NumOps; ++I) {
for (Value *V : VL)		for (Value *V : VL)
		RKSimon: Indices?
		ABataev: Yep, will fix it, thanks!
VectorOperands[I].push_back(cast<Instruction>(V)->getOperand(I));		VectorOperands[I].push_back(cast<Instruction>(V)->getOperand(I));

TE->setOperand(I, VectorOperands[I]);		TE->setOperand(I, VectorOperands[I]);
}		}
buildTree_rec(VectorOperands[NumOps - 1], Depth + 1, {TE, 0});		buildTree_rec(VectorOperands[NumOps - 1], Depth + 1, {TE, 0});
return;		return;
}		}
case Instruction::Load: {		case Instruction::Load: {
// Check that a vectorized load would load the same memory as a scalar		// Check that a vectorized load would load the same memory as a scalar
// load. For example, we don't want to vectorize loads that are smaller		// load. For example, we don't want to vectorize loads that are smaller
// than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM		// than 8-bit. Even though we have a packed struct {<i2, i2, i2, i2>} LLVM
// treats loading/storing it as an i8 struct. If we vectorize loads/stores		// treats loading/storing it as an i8 struct. If we vectorize loads/stores
// from such a struct, we read/write packed bits disagreeing with the		// from such a struct, we read/write packed bits disagreeing with the
// unvectorized version.		// unvectorized version.
Type *ScalarTy = VL0->getType();		SmallVector<Value *> PointerOps;

if (DL->getTypeSizeInBits(ScalarTy) !=
DL->getTypeAllocSizeInBits(ScalarTy)) {
BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);
LLVM_DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");
return;
}

// Make sure all loads in the bundle are simple - we can't vectorize
// atomic or volatile loads.
SmallVector<Value *, 4> PointerOps(VL.size());
auto POIter = PointerOps.begin();
for (Value *V : VL) {
auto *L = cast<LoadInst>(V);
if (!L->isSimple()) {
BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);
LLVM_DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");
return;
}
*POIter = L->getPointerOperand();
++POIter;
}

OrdersType CurrentOrder;		OrdersType CurrentOrder;
// Check the order of pointer operands.		TreeEntry *TE = nullptr;
if (llvm::sortPtrAccesses(PointerOps, ScalarTy, DL, SE, CurrentOrder)) {		switch (canVectorizeLoads(VL, VL0, TTI, DL, *SE, CurrentOrder,
Value *Ptr0;		PointerOps)) {
Value *PtrN;		case LoadsState::Vectorize:
if (CurrentOrder.empty()) {
Ptr0 = PointerOps.front();
PtrN = PointerOps.back();
} else {
Ptr0 = PointerOps[CurrentOrder.front()];
PtrN = PointerOps[CurrentOrder.back()];
}
Optional<int> Diff = getPointersDiff(
ScalarTy, Ptr0, ScalarTy, PtrN, DL, SE);
// Check that the sorted loads are consecutive.
if (static_cast<unsigned>(*Diff) == VL.size() - 1) {
if (CurrentOrder.empty()) {		if (CurrentOrder.empty()) {
// Original loads are consecutive and does not require reordering.		// Original loads are consecutive and does not require reordering.
++NumOpsWantToKeepOriginalOrder;		TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S,		ReuseShuffleIndicies);
UserTreeIdx, ReuseShuffleIndicies);
TE->setOperandsInOrder();
LLVM_DEBUG(dbgs() << "SLP: added a vector of loads.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of loads.\n");
} else {		} else {
		fixupOrderingIndices(CurrentOrder);
// Need to reorder.		// Need to reorder.
TreeEntry *TE =		TE = newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies, CurrentOrder);		ReuseShuffleIndicies, CurrentOrder);
TE->setOperandsInOrder();
LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled loads.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled loads.\n");
findRootOrder(CurrentOrder);
++NumOpsWantToKeepOrder[CurrentOrder];
}
return;
}		}
Align CommonAlignment = cast<LoadInst>(VL0)->getAlign();		TE->setOperandsInOrder();
for (Value *V : VL)		break;
CommonAlignment =		case LoadsState::ScatterVectorize:
commonAlignment(CommonAlignment, cast<LoadInst>(V)->getAlign());
if (TTI->isLegalMaskedGather(FixedVectorType::get(ScalarTy, VL.size()),
CommonAlignment)) {
// Vectorizing non-consecutive loads with `llvm.masked.gather`.		// Vectorizing non-consecutive loads with `llvm.masked.gather`.
TreeEntry *TE = newTreeEntry(VL, TreeEntry::ScatterVectorize, Bundle,		TE = newTreeEntry(VL, TreeEntry::ScatterVectorize, Bundle, S,
S, UserTreeIdx, ReuseShuffleIndicies);		UserTreeIdx, ReuseShuffleIndicies);
TE->setOperandsInOrder();		TE->setOperandsInOrder();
buildTree_rec(PointerOps, Depth + 1, {TE, 0});		buildTree_rec(PointerOps, Depth + 1, {TE, 0});
LLVM_DEBUG(dbgs()		LLVM_DEBUG(dbgs() << "SLP: added a vector of non-consecutive loads.\n");
<< "SLP: added a vector of non-consecutive loads.\n");		break;
return;		case LoadsState::Gather:
}
}

LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
		#ifndef NDEBUG
		Type *ScalarTy = VL0->getType();
		if (DL->getTypeSizeInBits(ScalarTy) !=
		DL->getTypeAllocSizeInBits(ScalarTy))
		LLVM_DEBUG(dbgs() << "SLP: Gathering loads of non-packed type.\n");
		else if (any_of(VL, [](Value *V) {
		return !cast<LoadInst>(V)->isSimple();
		}))
		LLVM_DEBUG(dbgs() << "SLP: Gathering non-simple loads.\n");
		else
		LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
		#endif // NDEBUG
		break;
		}
return;		return;
}		}
case Instruction::ZExt:		case Instruction::ZExt:
case Instruction::SExt:		case Instruction::SExt:
case Instruction::FPToUI:		case Instruction::FPToUI:
case Instruction::FPToSI:		case Instruction::FPToSI:
case Instruction::FPExt:		case Instruction::FPExt:
case Instruction::PtrToInt:		case Instruction::PtrToInt:
[... 230 lines not shown ...]	case Instruction::Store: {
PtrN = PointerOps[CurrentOrder.back()];		PtrN = PointerOps[CurrentOrder.back()];
}		}
Optional<int> Dist =		Optional<int> Dist =
getPointersDiff(ScalarTy, Ptr0, ScalarTy, PtrN, DL, SE);		getPointersDiff(ScalarTy, Ptr0, ScalarTy, PtrN, DL, SE);
// Check that the sorted pointer operands are consecutive.		// Check that the sorted pointer operands are consecutive.
if (static_cast<unsigned>(*Dist) == VL.size() - 1) {		if (static_cast<unsigned>(*Dist) == VL.size() - 1) {
if (CurrentOrder.empty()) {		if (CurrentOrder.empty()) {
// Original stores are consecutive and does not require reordering.		// Original stores are consecutive and does not require reordering.
++NumOpsWantToKeepOriginalOrder;
TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S,		TreeEntry *TE = newTreeEntry(VL, Bundle /*vectorized*/, S,
UserTreeIdx, ReuseShuffleIndicies);		UserTreeIdx, ReuseShuffleIndicies);
TE->setOperandsInOrder();		TE->setOperandsInOrder();
buildTree_rec(Operands, Depth + 1, {TE, 0});		buildTree_rec(Operands, Depth + 1, {TE, 0});
LLVM_DEBUG(dbgs() << "SLP: added a vector of stores.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of stores.\n");
} else {		} else {
		fixupOrderingIndices(CurrentOrder);
TreeEntry *TE =		TreeEntry *TE =
newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, Bundle /*vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies, CurrentOrder);		ReuseShuffleIndicies, CurrentOrder);
TE->setOperandsInOrder();		TE->setOperandsInOrder();
buildTree_rec(Operands, Depth + 1, {TE, 0});		buildTree_rec(Operands, Depth + 1, {TE, 0});
LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled stores.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled stores.\n");
findRootOrder(CurrentOrder);
++NumOpsWantToKeepOrder[CurrentOrder];
}		}
return;		return;
}		}
}		}

BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
[... 496 lines not shown ...]	if (E->getOpcode() == Instruction::ExtractElement && allSameType(VL) &&
FinalVecTy, E->ReuseShuffleIndices);		FinalVecTy, E->ReuseShuffleIndices);
return Cost;		return Cost;
}		}
}		}
InstructionCost ReuseShuffleCost = 0;		InstructionCost ReuseShuffleCost = 0;
if (NeedToShuffleReuses)		if (NeedToShuffleReuses)
ReuseShuffleCost = TTI->getShuffleCost(		ReuseShuffleCost = TTI->getShuffleCost(
TTI::SK_PermuteSingleSrc, FinalVecTy, E->ReuseShuffleIndices);		TTI::SK_PermuteSingleSrc, FinalVecTy, E->ReuseShuffleIndices);
		// Improve gather cost for gather of loads, if we can group some of the
		// loads into vector loads.
		if (VL.size() > 2 && E->getOpcode() == Instruction::Load &&
		!E->isAltShuffle()) {
		BoUpSLP::ValueSet VectorizedLoads;
		unsigned StartIdx = 0;
		unsigned VF = VL.size() / 2;
		unsigned VectorizedCnt = 0;
		unsigned ScatterVectorizeCnt = 0;
		const unsigned Sz = DL->getTypeSizeInBits(E->getMainOp()->getType());
		for (unsigned MinVF = getMinVecRegSize() / (2 * Sz); VF >= MinVF;
		bjope: With my OOT target I ended up here with getMinVecRegSize() returning 32. E->getMainOp was a load like this `%3 = load i24, i24* getelementptr inbounds ([128 x i24], [128 x i24]* @a_ua, i16 0, i16 3)`, so Sz was 24. That gives a MinVF that is 0. And after some iterations the inner loop was entered with VF=0, which gives a slice that is empty, hitting assertions when doing Slice.front(). I haven't reduced this failure for any in-tree target (a bit short on people at the office here in the middle of summer). But maybe you should make sure MinVF doesn't go below 1 here (or maybe not even below 2) to avoid getting a VF that is less than 1 (or less than 2). Or check that VF is at least 1 (or 2) in the loop guard. (I do not know if VF=1 makes any sense. Hence the alternatives above regarding 1 or 2.)
		ABataev: The fix is ready (see D107058), will commit it tomorrow.
		bjope: Ok, thanks!
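
The arithmetic behind bjope's report can be reproduced in a few lines. The sketch below shows only the integer division and one possible guard (clamping the minimum factor to 2, one of the alternatives bjope suggests); it does not reproduce the change made in D107058, and the names `rawMinVF`, `clampedMinVF` and `visitedVFs` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Unguarded computation, as written in the loop header above.
unsigned rawMinVF(unsigned MinVecRegSize, unsigned ScalarBits) {
  return MinVecRegSize / (2 * ScalarBits);
}

// Hypothetical guard: never let the minimum factor drop below 2.
unsigned clampedMinVF(unsigned MinVecRegSize, unsigned ScalarBits) {
  return std::max(2u, rawMinVF(MinVecRegSize, ScalarBits));
}

// Lists the vectorization factors the halving loop would visit. The
// VF != 0 check exists only so that this sketch terminates when MinVF is 0.
std::vector<unsigned> visitedVFs(unsigned NumScalars, unsigned MinVF) {
  std::vector<unsigned> VFs;
  for (unsigned VF = NumScalars / 2; VF >= MinVF && VF != 0; VF /= 2)
    VFs.push_back(VF);
  return VFs;
}
```

With a 32-bit minimum register size and 24-bit scalars, `rawMinVF` yields 0, so the guard `VF >= MinVF` in the diff stays true even after `VF` has been halved down to 0; with the clamp, the loop over 8 scalars visits only the factors 4 and 2.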
		VF /= 2) {
		for (unsigned Cnt = StartIdx, End = VL.size(); Cnt + VF <= End;
		Cnt += VF) {
		ArrayRef<Value *> Slice = VL.slice(Cnt, VF);
		if (!VectorizedLoads.count(Slice.front()) &&
		!VectorizedLoads.count(Slice.back()) && allSameBlock(Slice)) {
		SmallVector<Value *> PointerOps;
		OrdersType CurrentOrder;
		LoadsState LS = canVectorizeLoads(Slice, Slice.front(), TTI, DL,
		*SE, CurrentOrder, PointerOps);
		switch (LS) {
		case LoadsState::Vectorize:
		case LoadsState::ScatterVectorize:
		// Mark the vectorized loads so that we don't vectorize them
		// again.
		if (LS == LoadsState::Vectorize)
		++VectorizedCnt;
		else
		++ScatterVectorizeCnt;
		VectorizedLoads.insert(Slice.begin(), Slice.end());
		// If we vectorized initial block, no need to try to vectorize it
		// again.
		if (Cnt == StartIdx)
		StartIdx += VF;
		break;
		case LoadsState::Gather:
		break;
		}
		}
		}
		// Check if the whole array was vectorized already - exit.
		if (StartIdx >= VL.size())
		break;
		// Found vectorizable parts - exit.
		if (!VectorizedLoads.empty())
		break;
		}
		if (!VectorizedLoads.empty()) {
		InstructionCost GatherCost = 0;
		// Get the cost for gathered loads.
		for (unsigned I = 0, End = VL.size(); I < End; I += VF) {
		if (VectorizedLoads.contains(VL[I]))
		continue;
		GatherCost += getGatherCost(VL.slice(I, VF));
		}
		// The cost for vectorized loads.
		InstructionCost ScalarsCost = 0;
		for (Value *V : VectorizedLoads) {
		auto *LI = cast<LoadInst>(V);
		ScalarsCost += TTI->getMemoryOpCost(
		Instruction::Load, LI->getType(), LI->getAlign(),
		LI->getPointerAddressSpace(), CostKind, LI);
		}
		auto *LI = cast<LoadInst>(E->getMainOp());
		auto *LoadTy = FixedVectorType::get(LI->getType(), VF);
		Align Alignment = LI->getAlign();
		GatherCost +=
		VectorizedCnt *
		TTI->getMemoryOpCost(Instruction::Load, LoadTy, Alignment,
		LI->getPointerAddressSpace(), CostKind, LI);
		GatherCost += ScatterVectorizeCnt *
		TTI->getGatherScatterOpCost(
		Instruction::Load, LoadTy, LI->getPointerOperand(),
		/*VariableMask=*/false, Alignment, CostKind, LI);
		// Add the cost for the subvectors shuffling.
		GatherCost += ((VL.size() - VF) / VF) *
		TTI->getShuffleCost(TTI::SK_Select, VecTy);
		return ReuseShuffleCost + GatherCost - ScalarsCost;
		}
		}
return ReuseShuffleCost + getGatherCost(VL);		return ReuseShuffleCost + getGatherCost(VL);
		RKSimon: Can we remove the loop and avoid the repeated call to TTI->getShuffleCost(TTI::SK_Select, VecTy)?
}		}
InstructionCost CommonCost = 0;		InstructionCost CommonCost = 0;
		RKSimon: Is it worth doing the CommonCost -> ReuseShuffleCost refactor as a NFC pre-commit to simplify this patch?
		ABataev: Will check.
SmallVector<int> Mask;		SmallVector<int> Mask;
if (!E->ReorderIndices.empty()) {		if (!E->ReorderIndices.empty()) {
SmallVector<int> NewMask;		SmallVector<int> NewMask;
if (E->getOpcode() == Instruction::Store) {		if (E->getOpcode() == Instruction::Store) {
// For stores the order is actually a mask.		// For stores the order is actually a mask.
NewMask.resize(E->ReorderIndices.size());		NewMask.resize(E->ReorderIndices.size());
copy(E->ReorderIndices, NewMask.begin());		copy(E->ReorderIndices, NewMask.begin());
} else {		} else {
[... 16 lines not shown ...]	InstructionCost BoUpSLP::getEntryCost(const TreeEntry *E,
switch (ShuffleOrOp) {		switch (ShuffleOrOp) {
case Instruction::PHI:		case Instruction::PHI:
return 0;		return 0;

case Instruction::ExtractValue:		case Instruction::ExtractValue:
case Instruction::ExtractElement: {		case Instruction::ExtractElement: {
// The common cost of removal ExtractElement/ExtractValue instructions +		// The common cost of removal ExtractElement/ExtractValue instructions +
// the cost of shuffles, if required to resuffle the original vector.		// the cost of shuffles, if required to resuffle the original vector.
InstructionCost CommonCost = 0;
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
unsigned Idx = 0;		unsigned Idx = 0;
for (unsigned I : E->ReuseShuffleIndices) {		for (unsigned I : E->ReuseShuffleIndices) {
if (ShuffleOrOp == Instruction::ExtractElement) {		if (ShuffleOrOp == Instruction::ExtractElement) {
auto *EE = cast<ExtractElementInst>(VL[I]);		auto *EE = cast<ExtractElementInst>(VL[I]);
CommonCost -= TTI->getVectorInstrCost(Instruction::ExtractElement,		CommonCost -= TTI->getVectorInstrCost(Instruction::ExtractElement,
EE->getVectorOperandType(),		EE->getVectorOperandType(),
*getExtractIndex(EE));		*getExtractIndex(EE));
[... 41 lines not shown ...]	case Instruction::ExtractElement: {
TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, I);		TTI->getVectorInstrCost(Instruction::ExtractElement, VecTy, I);
}		}
} else {		} else {
AdjustExtractsCost(CommonCost, /*IsGather=*/false);		AdjustExtractsCost(CommonCost, /*IsGather=*/false);
}		}
return CommonCost;		return CommonCost;
}		}
case Instruction::InsertElement: {		case Instruction::InsertElement: {
		assert(E->ReuseShuffleIndices.empty() &&
		"Unique insertelements only are expected.");
		assert(E->ReorderIndices.empty() &&
		"No reordering expected for insertelements.");
auto *SrcVecTy = cast<FixedVectorType>(VL0->getType());		auto *SrcVecTy = cast<FixedVectorType>(VL0->getType());

unsigned const NumElts = SrcVecTy->getNumElements();		unsigned const NumElts = SrcVecTy->getNumElements();
unsigned const NumScalars = VL.size();		unsigned const NumScalars = VL.size();
APInt DemandedElts = APInt::getNullValue(NumElts);		APInt DemandedElts = APInt::getNullValue(NumElts);
// TODO: Add support for Instruction::InsertValue.		// TODO: Add support for Instruction::InsertValue.
unsigned Offset = UINT_MAX;		unsigned Offset = UINT_MAX;
bool IsIdentity = true;		bool IsIdentity = true;
[... 310 lines not shown ...]	case Instruction::ShuffleVector: {
}		}

SmallVector<int> Mask(E->Scalars.size());		SmallVector<int> Mask(E->Scalars.size());
for (unsigned I = 0, End = E->Scalars.size(); I < End; ++I) {		for (unsigned I = 0, End = E->Scalars.size(); I < End; ++I) {
auto *OpInst = cast<Instruction>(E->Scalars[I]);		auto *OpInst = cast<Instruction>(E->Scalars[I]);
assert(E->isOpcodeOrAlt(OpInst) && "Unexpected main/alternate opcode");		assert(E->isOpcodeOrAlt(OpInst) && "Unexpected main/alternate opcode");
Mask[I] = I + (OpInst->getOpcode() == E->getAltOpcode() ? End : 0);		Mask[I] = I + (OpInst->getOpcode() == E->getAltOpcode() ? End : 0);
}		}
VecCost +=		if (!E->ReorderIndices.empty()) {
TTI->getShuffleCost(TargetTransformInfo::SK_Select, VecTy, Mask, 0);		SmallVector<int> NewMask;
		inversePermutation(E->ReorderIndices, NewMask);
		::addMask(Mask, NewMask);
		}
		if (NeedToShuffleReuses)
		::addMask(Mask, E->ReuseShuffleIndices);
		CommonCost =
		TTI->getShuffleCost(TargetTransformInfo::SK_Select, FinalVecTy, Mask);
LLVM_DEBUG(dumpTreeCosts(E, CommonCost, VecCost, ScalarCost));		LLVM_DEBUG(dumpTreeCosts(E, CommonCost, VecCost, ScalarCost));
return CommonCost + VecCost - ScalarCost;		return CommonCost + VecCost - ScalarCost;
}		}
default:		default:
llvm_unreachable("Unknown instruction");		llvm_unreachable("Unknown instruction");
}		}
}		}

[... 908 lines not shown ...]	Value *BoUpSLP::vectorizeTree(TreeEntry *E) {
Type *ScalarTy = VL0->getType();		Type *ScalarTy = VL0->getType();
if (auto *Store = dyn_cast<StoreInst>(VL0))		if (auto *Store = dyn_cast<StoreInst>(VL0))
ScalarTy = Store->getValueOperand()->getType();		ScalarTy = Store->getValueOperand()->getType();
else if (auto *IE = dyn_cast<InsertElementInst>(VL0))		else if (auto *IE = dyn_cast<InsertElementInst>(VL0))
ScalarTy = IE->getOperand(1)->getType();		ScalarTy = IE->getOperand(1)->getType();
auto *VecTy = FixedVectorType::get(ScalarTy, E->Scalars.size());		auto *VecTy = FixedVectorType::get(ScalarTy, E->Scalars.size());
switch (ShuffleOrOp) {		switch (ShuffleOrOp) {
case Instruction::PHI: {		case Instruction::PHI: {
		assert(
		(E->ReorderIndices.empty() || E != VectorizableTree.front().get()) &&
		"PHI reordering is free.");
auto *PH = cast<PHINode>(VL0);		auto *PH = cast<PHINode>(VL0);
Builder.SetInsertPoint(PH->getParent()->getFirstNonPHI());		Builder.SetInsertPoint(PH->getParent()->getFirstNonPHI());
Builder.SetCurrentDebugLocation(PH->getDebugLoc());		Builder.SetCurrentDebugLocation(PH->getDebugLoc());
PHINode *NewPhi = Builder.CreatePHI(VecTy, PH->getNumIncomingValues());		PHINode *NewPhi = Builder.CreatePHI(VecTy, PH->getNumIncomingValues());
Value *V = NewPhi;		Value *V = NewPhi;
if (NeedToShuffleReuses)		ShuffleBuilder.addInversedMask(E->ReorderIndices);
V = Builder.CreateShuffleVector(V, E->ReuseShuffleIndices, "shuffle");		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;

// PHINodes may have multiple entries from the same block. We want to		// PHINodes may have multiple entries from the same block. We want to
// visit every block once.		// visit every block once.
SmallPtrSet<BasicBlock*, 4> VisitedBBs;		SmallPtrSet<BasicBlock*, 4> VisitedBBs;

for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {		for (unsigned i = 0, e = PH->getNumIncomingValues(); i < e; ++i) {
[... 34 lines not shown ...]	case Instruction::ExtractValue: {
Value *NewV = propagateMetadata(V, E->Scalars);		Value *NewV = propagateMetadata(V, E->Scalars);
ShuffleBuilder.addInversedMask(E->ReorderIndices);		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
NewV = ShuffleBuilder.finalize(NewV);		NewV = ShuffleBuilder.finalize(NewV);
E->VectorizedValue = NewV;		E->VectorizedValue = NewV;
return NewV;		return NewV;
}		}
case Instruction::InsertElement: {		case Instruction::InsertElement: {
		assert(E->ReorderIndices.empty() && "InsertElements reordering is free.");
Builder.SetInsertPoint(VL0);		Builder.SetInsertPoint(VL0);
Value *V = vectorizeTree(E->getOperand(1));		Value *V = vectorizeTree(E->getOperand(1));

const unsigned NumElts =		const unsigned NumElts =
cast<FixedVectorType>(VL0->getType())->getNumElements();		cast<FixedVectorType>(VL0->getType())->getNumElements();
const unsigned NumScalars = E->Scalars.size();		const unsigned NumScalars = E->Scalars.size();

// Create InsertVector shuffle if necessary		// Create InsertVector shuffle if necessary
[... 70 lines not shown ...]	case Instruction::BitCast: {

if (E->VectorizedValue) {		if (E->VectorizedValue) {
LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

auto *CI = cast<CastInst>(VL0);		auto *CI = cast<CastInst>(VL0);
Value *V = Builder.CreateCast(CI->getOpcode(), InVec, VecTy);		Value *V = Builder.CreateCast(CI->getOpcode(), InVec, VecTy);
		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
case Instruction::FCmp:		case Instruction::FCmp:
case Instruction::ICmp: {		case Instruction::ICmp: {
setInsertPointAfterBundle(E);		setInsertPointAfterBundle(E);

Value *L = vectorizeTree(E->getOperand(0));		Value *L = vectorizeTree(E->getOperand(0));
Value *R = vectorizeTree(E->getOperand(1));		Value *R = vectorizeTree(E->getOperand(1));

if (E->VectorizedValue) {		if (E->VectorizedValue) {
LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();		CmpInst::Predicate P0 = cast<CmpInst>(VL0)->getPredicate();
Value *V = Builder.CreateCmp(P0, L, R);		Value *V = Builder.CreateCmp(P0, L, R);
propagateIRFlags(V, E->Scalars, VL0);		propagateIRFlags(V, E->Scalars, VL0);
		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
case Instruction::Select: {		case Instruction::Select: {
setInsertPointAfterBundle(E);		setInsertPointAfterBundle(E);

Value *Cond = vectorizeTree(E->getOperand(0));		Value *Cond = vectorizeTree(E->getOperand(0));
Value *True = vectorizeTree(E->getOperand(1));		Value *True = vectorizeTree(E->getOperand(1));
Value *False = vectorizeTree(E->getOperand(2));		Value *False = vectorizeTree(E->getOperand(2));

if (E->VectorizedValue) {		if (E->VectorizedValue) {
LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

Value *V = Builder.CreateSelect(Cond, True, False);		Value *V = Builder.CreateSelect(Cond, True, False);
		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
case Instruction::FNeg: {		case Instruction::FNeg: {
setInsertPointAfterBundle(E);		setInsertPointAfterBundle(E);

Value *Op = vectorizeTree(E->getOperand(0));		Value *Op = vectorizeTree(E->getOperand(0));

if (E->VectorizedValue) {		if (E->VectorizedValue) {
LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Diamond merged for " << *VL0 << ".\n");
return E->VectorizedValue;		return E->VectorizedValue;
}		}

Value *V = Builder.CreateUnOp(		Value *V = Builder.CreateUnOp(
static_cast<Instruction::UnaryOps>(E->getOpcode()), Op);		static_cast<Instruction::UnaryOps>(E->getOpcode()), Op);
propagateIRFlags(V, E->Scalars, VL0);		propagateIRFlags(V, E->Scalars, VL0);
if (auto *I = dyn_cast<Instruction>(V))		if (auto *I = dyn_cast<Instruction>(V))
V = propagateMetadata(I, E->Scalars);		V = propagateMetadata(I, E->Scalars);

		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;

return V;		return V;
}		}
[... 27 lines not shown ...]	case Instruction::Xor: {

Value *V = Builder.CreateBinOp(		Value *V = Builder.CreateBinOp(
static_cast<Instruction::BinaryOps>(E->getOpcode()), LHS,		static_cast<Instruction::BinaryOps>(E->getOpcode()), LHS,
RHS);		RHS);
propagateIRFlags(V, E->Scalars, VL0);		propagateIRFlags(V, E->Scalars, VL0);
if (auto *I = dyn_cast<Instruction>(V))		if (auto *I = dyn_cast<Instruction>(V))
V = propagateMetadata(I, E->Scalars);		V = propagateMetadata(I, E->Scalars);

		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;

return V;		return V;
}		}
[... 96 lines not shown ...]	case Instruction::GetElementPtr: {
OpVecs.push_back(OpVec);		OpVecs.push_back(OpVec);
}		}

Value *V = Builder.CreateGEP(		Value *V = Builder.CreateGEP(
cast<GetElementPtrInst>(VL0)->getSourceElementType(), Op0, OpVecs);		cast<GetElementPtrInst>(VL0)->getSourceElementType(), Op0, OpVecs);
if (Instruction *I = dyn_cast<Instruction>(V))		if (Instruction *I = dyn_cast<Instruction>(V))
V = propagateMetadata(I, E->Scalars);		V = propagateMetadata(I, E->Scalars);

		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;

return V;		return V;
}		}
[... 50 lines not shown ...]	case Instruction::Call: {

// The scalar argument uses an in-tree scalar so we add the new vectorized		// The scalar argument uses an in-tree scalar so we add the new vectorized
// call to ExternalUses list to make sure that an extract will be		// call to ExternalUses list to make sure that an extract will be
// generated in the future.		// generated in the future.
if (ScalarArg && getTreeEntry(ScalarArg))		if (ScalarArg && getTreeEntry(ScalarArg))
ExternalUses.push_back(ExternalUser(ScalarArg, cast<User>(V), 0));		ExternalUses.push_back(ExternalUser(ScalarArg, cast<User>(V), 0));

propagateIRFlags(V, E->Scalars, VL0);		propagateIRFlags(V, E->Scalars, VL0);
		ShuffleBuilder.addInversedMask(E->ReorderIndices);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);		ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
}		}
case Instruction::ShuffleVector: {		case Instruction::ShuffleVector: {
[... 33 lines not shown ...]	case Instruction::ShuffleVector: {
}		}

// Create shuffle to take alternate operations from the vector.		// Create shuffle to take alternate operations from the vector.
// Also, gather up main and alt scalar ops to propagate IR flags to		// Also, gather up main and alt scalar ops to propagate IR flags to
// each vector operation.		// each vector operation.
ValueList OpScalars, AltScalars;		ValueList OpScalars, AltScalars;
unsigned Sz = E->Scalars.size();		unsigned Sz = E->Scalars.size();
SmallVector<int> Mask(Sz);		SmallVector<int> Mask(Sz);
for (unsigned I = 0; I < Sz; ++I) {		for (unsigned I = 0; I < Sz; ++I) {
		RKSimon: A lot of this is just a refactor cleanup that looks like a NFC that could be done as a pre-commit to simplify the patch?
		ABataev (author): Yes, will precommit some of these changes.
auto *OpInst = cast<Instruction>(E->Scalars[I]);		unsigned Idx = I;
		if (!E->ReorderIndices.empty())
		Idx = E->ReorderIndices[I];
		auto *OpInst = cast<Instruction>(E->Scalars[Idx]);
assert(E->isOpcodeOrAlt(OpInst) && "Unexpected main/alternate opcode");		assert(E->isOpcodeOrAlt(OpInst) && "Unexpected main/alternate opcode");
if (OpInst->getOpcode() == E->getAltOpcode()) {		if (OpInst->getOpcode() == E->getAltOpcode()) {
Mask[I] = Sz + I;		Mask[Idx] = Sz + I;
AltScalars.push_back(E->Scalars[I]);		AltScalars.push_back(OpInst);
} else {		} else {
Mask[I] = I;		Mask[Idx] = I;
OpScalars.push_back(E->Scalars[I]);		OpScalars.push_back(OpInst);
		}
}		}
		if (!E->ReuseShuffleIndices.empty()) {
		SmallVector<int> NewMask(E->ReuseShuffleIndices.size());
		transform(E->ReuseShuffleIndices, NewMask.begin(),
		[&Mask](int Idx) { return Mask[Idx]; });
		Mask.swap(NewMask);
}		}

propagateIRFlags(V0, OpScalars);		propagateIRFlags(V0, OpScalars);
propagateIRFlags(V1, AltScalars);		propagateIRFlags(V1, AltScalars);

Value *V = Builder.CreateShuffleVector(V0, V1, Mask);		Value *V = Builder.CreateShuffleVector(V0, V1, Mask);
if (Instruction *I = dyn_cast<Instruction>(V))		if (Instruction *I = dyn_cast<Instruction>(V))
V = propagateMetadata(I, E->Scalars);		V = propagateMetadata(I, E->Scalars);
ShuffleBuilder.addMask(E->ReuseShuffleIndices);
V = ShuffleBuilder.finalize(V);		V = ShuffleBuilder.finalize(V);

E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;

return V;		return V;
}		}
default:		default:
[... 1,153 lines not shown ...]	bool SLPVectorizerPass::runImpl(Function &F, ScalarEvolution *SE_,

if (Changed) {		if (Changed) {
R.optimizeGatherSequence();		R.optimizeGatherSequence();
LLVM_DEBUG(dbgs() << "SLP: vectorized \"" << F.getName() << "\"\n");		LLVM_DEBUG(dbgs() << "SLP: vectorized \"" << F.getName() << "\"\n");
}		}
return Changed;		return Changed;
}		}

/// Order may have elements assigned special value (size) which is out of
/// bounds. Such indices only appear on places which correspond to undef values
/// (see canReuseExtract for details) and used in order to avoid undef values
/// have effect on operands ordering.
/// The first loop below simply finds all unused indices and then the next loop
/// nest assigns these indices for undef values positions.
/// As an example below Order has two undef positions and they have assigned
/// values 3 and 7 respectively:
/// before: 6 9 5 4 9 2 1 0
/// after: 6 3 5 4 7 2 1 0
/// \returns Fixed ordering.
static BoUpSLP::OrdersType fixupOrderingIndices(ArrayRef<unsigned> Order) {
BoUpSLP::OrdersType NewOrder(Order.begin(), Order.end());
const unsigned Sz = NewOrder.size();
SmallBitVector UsedIndices(Sz);
SmallVector<int> MaskedIndices;
for (int I = 0, E = NewOrder.size(); I < E; ++I) {
if (NewOrder[I] < Sz)
UsedIndices.set(NewOrder[I]);
else
MaskedIndices.push_back(I);
}
if (MaskedIndices.empty())
return NewOrder;
SmallVector<int> AvailableIndices(MaskedIndices.size());
unsigned Cnt = 0;
int Idx = UsedIndices.find_first();
do {
AvailableIndices[Cnt] = Idx;
Idx = UsedIndices.find_next(Idx);
++Cnt;
} while (Idx > 0);
assert(Cnt == MaskedIndices.size() && "Non-synced masked/available indices.");
for (int I = 0, E = MaskedIndices.size(); I < E; ++I)
NewOrder[MaskedIndices[I]] = AvailableIndices[I];
return NewOrder;
}
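
The doc comment on the removed `fixupOrderingIndices` describes replacing out-of-bounds entries with the indices that are not yet used. A minimal standalone sketch of that described behavior follows. It uses `std::vector` instead of LLVM containers, the name `fixupOrderSketch` is hypothetical, and it assumes unused indices are handed out in ascending order, which is what the comment's before/after example suggests. It is not a copy of the removed function.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: replace every out-of-bounds entry of Order
// (value >= Order.size()) with the smallest index not used so far.
std::vector<unsigned> fixupOrderSketch(const std::vector<unsigned> &Order) {
  const std::size_t Sz = Order.size();
  std::vector<unsigned> NewOrder(Order);
  std::vector<bool> Used(Sz, false);
  std::vector<std::size_t> MaskedPositions;

  // First pass: record which indices are taken and which positions
  // hold an out-of-bounds placeholder.
  for (std::size_t I = 0; I < Sz; ++I) {
    if (NewOrder[I] < Sz)
      Used[NewOrder[I]] = true;
    else
      MaskedPositions.push_back(I);
  }

  // Second pass: fill each placeholder with the next unused index.
  // There are always at least as many unused indices as placeholders.
  std::size_t Next = 0;
  for (std::size_t Pos : MaskedPositions) {
    while (Next < Sz && Used[Next])
      ++Next;
    NewOrder[Pos] = static_cast<unsigned>(Next);
    ++Next;
  }
  return NewOrder;
}
```

With the comment's input `6 9 5 4 9 2 1 0`, the unused indices are 3 and 7, so this sketch returns `6 3 5 4 7 2 1 0`, matching the "after" row in the comment.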

bool SLPVectorizerPass::vectorizeStoreChain(ArrayRef<Value *> Chain, BoUpSLP &R,		bool SLPVectorizerPass::vectorizeStoreChain(ArrayRef<Value *> Chain, BoUpSLP &R,
unsigned Idx) {		unsigned Idx) {
LLVM_DEBUG(dbgs() << "SLP: Analyzing a store chain of length " << Chain.size()		LLVM_DEBUG(dbgs() << "SLP: Analyzing a store chain of length " << Chain.size()
<< "\n");		<< "\n");
const unsigned Sz = R.getVectorElementSize(Chain[0]);		const unsigned Sz = R.getVectorElementSize(Chain[0]);
const unsigned MinVF = R.getMinVecRegSize() / Sz;		const unsigned MinVF = R.getMinVecRegSize() / Sz;
unsigned VF = Chain.size();		unsigned VF = Chain.size();

if (!isPowerOf2_32(Sz) || !isPowerOf2_32(VF) || VF < 2 || VF < MinVF)		if (!isPowerOf2_32(Sz) || !isPowerOf2_32(VF) || VF < 2 || VF < MinVF)
return false;		return false;

LLVM_DEBUG(dbgs() << "SLP: Analyzing " << VF << " stores at offset " << Idx		LLVM_DEBUG(dbgs() << "SLP: Analyzing " << VF << " stores at offset " << Idx
<< "\n");		<< "\n");

R.buildTree(Chain);		R.buildTree(Chain);
Optional<ArrayRef<unsigned>> Order = R.bestOrder();
// TODO: Handle orders of size less than number of elements in the vector.
if (Order && Order->size() == Chain.size()) {
// TODO: reorder tree nodes without tree rebuilding.
SmallVector<Value *, 4> ReorderedOps(Chain.size());
transform(fixupOrderingIndices(*Order), ReorderedOps.begin(),
[Chain](const unsigned Idx) { return Chain[Idx]; });
R.buildTree(ReorderedOps);
}
if (R.isTreeTinyAndNotFullyVectorizable())		if (R.isTreeTinyAndNotFullyVectorizable())
return false;		return false;
if (R.isLoadCombineCandidate())		if (R.isLoadCombineCandidate())
return false;		return false;
		R.reorderTopToBottom(/*FreeReorder=*/false);
		R.reorderBottomToTop(/*FreeReorder=*/false);
		R.buildExternalUses();

R.computeMinimumValueSizes();		R.computeMinimumValueSizes();

InstructionCost Cost = R.getTreeCost();		InstructionCost Cost = R.getTreeCost();

LLVM_DEBUG(dbgs() << "SLP: Found cost = " << Cost << " for VF =" << VF << "\n");		LLVM_DEBUG(dbgs() << "SLP: Found cost = " << Cost << " for VF =" << VF << "\n");
if (Cost < -SLPCostThreshold) {		if (Cost < -SLPCostThreshold) {
LLVM_DEBUG(dbgs() << "SLP: Decided to vectorize cost = " << Cost << "\n");		LLVM_DEBUG(dbgs() << "SLP: Decided to vectorize cost = " << Cost << "\n");
[... 270 lines not shown ...]	for (unsigned I = NextInst; I < MaxInst; ++I) {
return I && R.isDeleted(I);		return I && R.isDeleted(I);
}))		}))
continue;		continue;

LLVM_DEBUG(dbgs() << "SLP: Analyzing " << OpsWidth << " operations "		LLVM_DEBUG(dbgs() << "SLP: Analyzing " << OpsWidth << " operations "
<< "\n");		<< "\n");

R.buildTree(Ops);		R.buildTree(Ops);
if (AllowReorder) {
Optional<ArrayRef<unsigned>> Order = R.bestOrder();
if (Order) {
// TODO: reorder tree nodes without tree rebuilding.
SmallVector<Value *, 4> ReorderedOps(Ops.size());
transform(fixupOrderingIndices(*Order), ReorderedOps.begin(),
[Ops](const unsigned Idx) { return Ops[Idx]; });
R.buildTree(ReorderedOps);
}
}
if (R.isTreeTinyAndNotFullyVectorizable())		if (R.isTreeTinyAndNotFullyVectorizable())
continue;		continue;
		R.reorderTopToBottom(AllowReorder);
		R.reorderBottomToTop(AllowReorder);
		R.buildExternalUses();

R.computeMinimumValueSizes();		R.computeMinimumValueSizes();
InstructionCost Cost = R.getTreeCost();		InstructionCost Cost = R.getTreeCost();
CandidateFound = true;		CandidateFound = true;
MinCost = std::min(MinCost, Cost);		MinCost = std::min(MinCost, Cost);

if (Cost < -SLPCostThreshold) {		if (Cost < -SLPCostThreshold) {
LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");		LLVM_DEBUG(dbgs() << "SLP: Vectorizing list at cost:" << Cost << ".\n");
[... 627 lines not shown ...]	if (NumReducedVals > ReduxWidth) {
return false;		return false;
});		});
}		}

Value *VectorizedTree = nullptr;		Value *VectorizedTree = nullptr;
unsigned i = 0;		unsigned i = 0;
while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {		while (i < NumReducedVals - ReduxWidth + 1 && ReduxWidth > 2) {
ArrayRef<Value *> VL(&ReducedVals[i], ReduxWidth);		ArrayRef<Value *> VL(&ReducedVals[i], ReduxWidth);
V.buildTree(VL, ExternallyUsedValues, IgnoreList);		V.buildTree(VL, IgnoreList);
Optional<ArrayRef<unsigned>> Order = V.bestOrder();
if (Order) {
assert(Order->size() == VL.size() &&
"Order size must be the same as number of vectorized "
"instructions.");
// TODO: reorder tree nodes without tree rebuilding.
SmallVector<Value *, 4> ReorderedOps(VL.size());
transform(fixupOrderingIndices(*Order), ReorderedOps.begin(),
[VL](const unsigned Idx) { return VL[Idx]; });
V.buildTree(ReorderedOps, ExternallyUsedValues, IgnoreList);
}
if (V.isTreeTinyAndNotFullyVectorizable())		if (V.isTreeTinyAndNotFullyVectorizable())
break;		break;
if (V.isLoadCombineReductionCandidate(RdxKind))		if (V.isLoadCombineReductionCandidate(RdxKind))
break;		break;
		V.reorderTopToBottom(/*FreeReorder=*/true);
		V.reorderBottomToTop(/*FreeReorder=*/true);
		V.buildExternalUses(ExternallyUsedValues);

// For a poison-safe boolean logic reduction, do not replace select		// For a poison-safe boolean logic reduction, do not replace select
// instructions with logic ops. All reduced values will be frozen (see		// instructions with logic ops. All reduced values will be frozen (see
// below) to prevent leaking poison.		// below) to prevent leaking poison.
if (isa<SelectInst>(ReductionRoot) &&		if (isa<SelectInst>(ReductionRoot) &&
isBoolLogicOp(cast<Instruction>(ReductionRoot)) &&		isBoolLogicOp(cast<Instruction>(ReductionRoot)) &&
NumReducedVals != ReduxWidth)		NumReducedVals != ReduxWidth)
break;		break;
[... 456 lines not shown ...]	bool SLPVectorizerPass::vectorizeInsertValueInst(InsertValueInst *IVI,
SmallVector<Value *, 16> BuildVectorOpds;		SmallVector<Value *, 16> BuildVectorOpds;
SmallVector<Value *, 16> BuildVectorInsts;		SmallVector<Value *, 16> BuildVectorInsts;
if (!findBuildAggregate(IVI, TTI, BuildVectorOpds, BuildVectorInsts))		if (!findBuildAggregate(IVI, TTI, BuildVectorOpds, BuildVectorInsts))
return false;		return false;

LLVM_DEBUG(dbgs() << "SLP: array mappable to vector: " << *IVI << "\n");		LLVM_DEBUG(dbgs() << "SLP: array mappable to vector: " << *IVI << "\n");
// Aggregate value is unlikely to be processed in vector register, we need to		// Aggregate value is unlikely to be processed in vector register, we need to
// extract scalars into scalar registers, so NeedExtraction is set true.		// extract scalars into scalar registers, so NeedExtraction is set true.
return tryToVectorizeList(BuildVectorOpds, R, /*AllowReorder=*/false);		return tryToVectorizeList(BuildVectorOpds, R, /*AllowReorder=*/true);
}		}

bool SLPVectorizerPass::vectorizeInsertElementInst(InsertElementInst *IEI,		bool SLPVectorizerPass::vectorizeInsertElementInst(InsertElementInst *IEI,
BasicBlock *BB, BoUpSLP &R) {		BasicBlock *BB, BoUpSLP &R) {
SmallVector<Value *, 16> BuildVectorInsts;		SmallVector<Value *, 16> BuildVectorInsts;
SmallVector<Value *, 16> BuildVectorOpds;		SmallVector<Value *, 16> BuildVectorOpds;
SmallVector<int> Mask;		SmallVector<int> Mask;
if (!findBuildAggregate(IEI, TTI, BuildVectorOpds, BuildVectorInsts) ||		if (!findBuildAggregate(IEI, TTI, BuildVectorOpds, BuildVectorInsts) ||
[... 531 lines not shown ...]

llvm/test/Transforms/SLPVectorizer/AArch64/transpose-inseltpoison.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -slp-vectorizer -instcombine -S | FileCheck %s		; RUN: opt < %s -slp-vectorizer -instcombine -S | FileCheck %s

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"		target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"		target triple = "aarch64--linux-gnu"

define <2 x i64> @build_vec_v2i64(<2 x i64> %v0, <2 x i64> %v1) {		define <2 x i64> @build_vec_v2i64(<2 x i64> %v0, <2 x i64> %v1) {
; CHECK-LABEL: @build_vec_v2i64(		; CHECK-LABEL: @build_vec_v2i64(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i64> [[V0:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i64> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i64> [[V1:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP2:%.*]] = sub <2 x i64> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i64> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i64> [[TMP1]], <2 x i64> [[TMP2]], <2 x i32> <i32 1, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i64> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = add <2 x i64> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i64> [[TMP3]], <2 x i64> [[TMP4]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[TMP5:%.*]] = sub <2 x i64> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i64> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i64> [[TMP4]], <2 x i64> [[TMP5]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i64> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <2 x i64> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i64> [[TMP6]], <2 x i64> [[TMP7]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: ret <2 x i64> [[TMP7]]
; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i64> [[TMP8]], [[TMP5]]
; CHECK-NEXT: ret <2 x i64> [[TMP9]]
;		;
%v0.0 = extractelement <2 x i64> %v0, i32 0		%v0.0 = extractelement <2 x i64> %v0, i32 0
%v0.1 = extractelement <2 x i64> %v0, i32 1		%v0.1 = extractelement <2 x i64> %v0, i32 1
%v1.0 = extractelement <2 x i64> %v1, i32 0		%v1.0 = extractelement <2 x i64> %v1, i32 0
%v1.1 = extractelement <2 x i64> %v1, i32 1		%v1.1 = extractelement <2 x i64> %v1, i32 1
%tmp0.0 = add i64 %v0.0, %v1.0		%tmp0.0 = add i64 %v0.0, %v1.0
%tmp0.1 = add i64 %v0.1, %v1.1		%tmp0.1 = add i64 %v0.1, %v1.1
%tmp1.0 = sub i64 %v0.0, %v1.0		%tmp1.0 = sub i64 %v0.0, %v1.0
[... 42 lines not shown ...]	;
%tmp2.1 = add i64 %tmp1.0, %tmp1.1		%tmp2.1 = add i64 %tmp1.0, %tmp1.1
store i64 %tmp2.0, i64* %c.0, align 8		store i64 %tmp2.0, i64* %c.0, align 8
store i64 %tmp2.1, i64* %c.1, align 8		store i64 %tmp2.1, i64* %c.1, align 8
ret void		ret void
}		}

define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32(		; CHECK-LABEL: @build_vec_v4i32(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP1:%.*]] = add <4 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP2:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 3, i32 6>
; CHECK-NEXT: [[TMP4:%.*]] = sub <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>		; CHECK-NEXT: [[TMP5:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = add <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP5]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
; CHECK-NEXT: [[TMP7:%.*]] = sub <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>		; CHECK-NEXT: ret <4 x i32> [[TMP7]]
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]
; CHECK-NEXT: ret <4 x i32> [[TMP9]]
;		;
%v0.0 = extractelement <4 x i32> %v0, i32 0		%v0.0 = extractelement <4 x i32> %v0, i32 0
%v0.1 = extractelement <4 x i32> %v0, i32 1		%v0.1 = extractelement <4 x i32> %v0, i32 1
%v0.2 = extractelement <4 x i32> %v0, i32 2		%v0.2 = extractelement <4 x i32> %v0, i32 2
%v0.3 = extractelement <4 x i32> %v0, i32 3		%v0.3 = extractelement <4 x i32> %v0, i32 3
%v1.0 = extractelement <4 x i32> %v1, i32 0		%v1.0 = extractelement <4 x i32> %v1, i32 0
%v1.1 = extractelement <4 x i32> %v1, i32 1		%v1.1 = extractelement <4 x i32> %v1, i32 1
%v1.2 = extractelement <4 x i32> %v1, i32 2		%v1.2 = extractelement <4 x i32> %v1, i32 2
[... 14 lines not shown ...]	;
%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1		%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1
%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2		%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}

define <4 x i32> @build_vec_v4i32_reuse_0(<2 x i32> %v0, <2 x i32> %v1) {		define <4 x i32> @build_vec_v4i32_reuse_0(<2 x i32> %v0, <2 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32_reuse_0(		; CHECK-LABEL: @build_vec_v4i32_reuse_0(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i32> [[V0:%.*]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i32> [[V1:%.*]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP2:%.*]] = sub <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 1, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = add <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> [[TMP4]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[TMP5:%.*]] = sub <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP4]], <2 x i32> [[TMP5]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <2 x i32> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i32> [[TMP6]], <2 x i32> [[TMP7]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i32> [[TMP8]], [[TMP5]]
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
; CHECK-NEXT: ret <4 x i32> [[SHUFFLE]]		; CHECK-NEXT: ret <4 x i32> [[SHUFFLE]]
;		;
%v0.0 = extractelement <2 x i32> %v0, i32 0		%v0.0 = extractelement <2 x i32> %v0, i32 0
%v0.1 = extractelement <2 x i32> %v0, i32 1		%v0.1 = extractelement <2 x i32> %v0, i32 1
%v1.0 = extractelement <2 x i32> %v1, i32 0		%v1.0 = extractelement <2 x i32> %v1, i32 0
%v1.1 = extractelement <2 x i32> %v1, i32 1		%v1.1 = extractelement <2 x i32> %v1, i32 1
%tmp0.0 = add i32 %v0.0, %v1.0		%tmp0.0 = add i32 %v0.0, %v1.0
%tmp0.1 = add i32 %v0.1, %v1.1		%tmp0.1 = add i32 %v0.1, %v1.1
[... 89 lines not shown ...]	;
%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1		%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1
%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2		%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}

define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @reduction_v4i32(		; CHECK-LABEL: @reduction_v4i32(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP1:%.*]] = sub <4 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP2:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = sub <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 7, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>		; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = sub <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP5]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>
; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>		; CHECK-NEXT: [[TMP8:%.*]] = lshr <4 x i32> [[TMP7]], <i32 15, i32 15, i32 15, i32 15>
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]		; CHECK-NEXT: [[TMP9:%.*]] = and <4 x i32> [[TMP8]], <i32 65537, i32 65537, i32 65537, i32 65537>
; CHECK-NEXT: [[TMP10:%.*]] = lshr <4 x i32> [[TMP9]], <i32 15, i32 15, i32 15, i32 15>		; CHECK-NEXT: [[TMP10:%.*]] = mul nuw <4 x i32> [[TMP9]], <i32 65535, i32 65535, i32 65535, i32 65535>
; CHECK-NEXT: [[TMP11:%.*]] = and <4 x i32> [[TMP10]], <i32 65537, i32 65537, i32 65537, i32 65537>		; CHECK-NEXT: [[TMP11:%.*]] = add <4 x i32> [[TMP10]], [[TMP7]]
; CHECK-NEXT: [[TMP12:%.*]] = mul nuw <4 x i32> [[TMP11]], <i32 65535, i32 65535, i32 65535, i32 65535>		; CHECK-NEXT: [[TMP12:%.*]] = xor <4 x i32> [[TMP11]], [[TMP10]]
; CHECK-NEXT: [[TMP13:%.*]] = add <4 x i32> [[TMP12]], [[TMP9]]		; CHECK-NEXT: [[TMP13:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP12]])
; CHECK-NEXT: [[TMP14:%.*]] = xor <4 x i32> [[TMP13]], [[TMP12]]		; CHECK-NEXT: ret i32 [[TMP13]]
; CHECK-NEXT: [[TMP15:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP14]])
; CHECK-NEXT: ret i32 [[TMP15]]
;		;
%v0.0 = extractelement <4 x i32> %v0, i32 0		%v0.0 = extractelement <4 x i32> %v0, i32 0
%v0.1 = extractelement <4 x i32> %v0, i32 1		%v0.1 = extractelement <4 x i32> %v0, i32 1
%v0.2 = extractelement <4 x i32> %v0, i32 2		%v0.2 = extractelement <4 x i32> %v0, i32 2
%v0.3 = extractelement <4 x i32> %v0, i32 3		%v0.3 = extractelement <4 x i32> %v0, i32 3
%v1.0 = extractelement <4 x i32> %v1, i32 0		%v1.0 = extractelement <4 x i32> %v1, i32 0
%v1.1 = extractelement <4 x i32> %v1, i32 1		%v1.1 = extractelement <4 x i32> %v1, i32 1
%v1.2 = extractelement <4 x i32> %v1, i32 2		%v1.2 = extractelement <4 x i32> %v1, i32 2

llvm/test/Transforms/SLPVectorizer/AArch64/transpose.ll

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -slp-vectorizer -instcombine -S | FileCheck %s		; RUN: opt < %s -slp-vectorizer -instcombine -S | FileCheck %s

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"		target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"		target triple = "aarch64--linux-gnu"

define <2 x i64> @build_vec_v2i64(<2 x i64> %v0, <2 x i64> %v1) {		define <2 x i64> @build_vec_v2i64(<2 x i64> %v0, <2 x i64> %v1) {
; CHECK-LABEL: @build_vec_v2i64(		; CHECK-LABEL: @build_vec_v2i64(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i64> [[V0:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i64> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i64> [[V1:%.*]], <2 x i64> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP2:%.*]] = sub <2 x i64> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i64> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i64> [[TMP1]], <2 x i64> [[TMP2]], <2 x i32> <i32 1, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i64> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = add <2 x i64> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i64> [[TMP3]], <2 x i64> [[TMP4]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[TMP5:%.*]] = sub <2 x i64> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i64> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i64> [[TMP4]], <2 x i64> [[TMP5]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i64> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <2 x i64> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i64> [[TMP6]], <2 x i64> [[TMP7]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: ret <2 x i64> [[TMP7]]
; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i64> [[TMP8]], [[TMP5]]
; CHECK-NEXT: ret <2 x i64> [[TMP9]]
;		;
%v0.0 = extractelement <2 x i64> %v0, i32 0		%v0.0 = extractelement <2 x i64> %v0, i32 0
%v0.1 = extractelement <2 x i64> %v0, i32 1		%v0.1 = extractelement <2 x i64> %v0, i32 1
%v1.0 = extractelement <2 x i64> %v1, i32 0		%v1.0 = extractelement <2 x i64> %v1, i32 0
%v1.1 = extractelement <2 x i64> %v1, i32 1		%v1.1 = extractelement <2 x i64> %v1, i32 1
%tmp0.0 = add i64 %v0.0, %v1.0		%tmp0.0 = add i64 %v0.0, %v1.0
%tmp0.1 = add i64 %v0.1, %v1.1		%tmp0.1 = add i64 %v0.1, %v1.1
%tmp1.0 = sub i64 %v0.0, %v1.0		%tmp1.0 = sub i64 %v0.0, %v1.0
%tmp2.1 = add i64 %tmp1.0, %tmp1.1		%tmp2.1 = add i64 %tmp1.0, %tmp1.1
store i64 %tmp2.0, i64* %c.0, align 8		store i64 %tmp2.0, i64* %c.0, align 8
store i64 %tmp2.1, i64* %c.1, align 8		store i64 %tmp2.1, i64* %c.1, align 8
ret void		ret void
}		}

define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define <4 x i32> @build_vec_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32(		; CHECK-LABEL: @build_vec_v4i32(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP1:%.*]] = add <4 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP2:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 3, i32 6>
; CHECK-NEXT: [[TMP4:%.*]] = sub <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>		; CHECK-NEXT: [[TMP5:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = add <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP5]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
; CHECK-NEXT: [[TMP7:%.*]] = sub <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>		; CHECK-NEXT: ret <4 x i32> [[TMP7]]
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]
; CHECK-NEXT: ret <4 x i32> [[TMP9]]
;		;
%v0.0 = extractelement <4 x i32> %v0, i32 0		%v0.0 = extractelement <4 x i32> %v0, i32 0
%v0.1 = extractelement <4 x i32> %v0, i32 1		%v0.1 = extractelement <4 x i32> %v0, i32 1
%v0.2 = extractelement <4 x i32> %v0, i32 2		%v0.2 = extractelement <4 x i32> %v0, i32 2
%v0.3 = extractelement <4 x i32> %v0, i32 3		%v0.3 = extractelement <4 x i32> %v0, i32 3
%v1.0 = extractelement <4 x i32> %v1, i32 0		%v1.0 = extractelement <4 x i32> %v1, i32 0
%v1.1 = extractelement <4 x i32> %v1, i32 1		%v1.1 = extractelement <4 x i32> %v1, i32 1
%v1.2 = extractelement <4 x i32> %v1, i32 2		%v1.2 = extractelement <4 x i32> %v1, i32 2
%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1		%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1
%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2		%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}

define <4 x i32> @build_vec_v4i32_reuse_0(<2 x i32> %v0, <2 x i32> %v1) {		define <4 x i32> @build_vec_v4i32_reuse_0(<2 x i32> %v0, <2 x i32> %v1) {
; CHECK-LABEL: @build_vec_v4i32_reuse_0(		; CHECK-LABEL: @build_vec_v4i32_reuse_0(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <2 x i32> [[V0:%.*]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP1:%.*]] = add <2 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <2 x i32> [[V1:%.*]], <2 x i32> undef, <2 x i32> <i32 1, i32 0>		; CHECK-NEXT: [[TMP2:%.*]] = sub <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = add <2 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> [[TMP2]], <2 x i32> <i32 1, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = sub <2 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = add <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> [[TMP4]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[TMP5:%.*]] = sub <2 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = add <2 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <2 x i32> [[TMP4]], <2 x i32> [[TMP5]], <2 x i32> <i32 0, i32 3>
; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <2 x i32> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x i32> [[TMP6]], <2 x i32> [[TMP7]], <2 x i32> <i32 0, i32 3>		; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
; CHECK-NEXT: [[TMP9:%.*]] = add <2 x i32> [[TMP8]], [[TMP5]]
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP9]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
; CHECK-NEXT: ret <4 x i32> [[SHUFFLE]]		; CHECK-NEXT: ret <4 x i32> [[SHUFFLE]]
;		;
%v0.0 = extractelement <2 x i32> %v0, i32 0		%v0.0 = extractelement <2 x i32> %v0, i32 0
%v0.1 = extractelement <2 x i32> %v0, i32 1		%v0.1 = extractelement <2 x i32> %v0, i32 1
%v1.0 = extractelement <2 x i32> %v1, i32 0		%v1.0 = extractelement <2 x i32> %v1, i32 0
%v1.1 = extractelement <2 x i32> %v1, i32 1		%v1.1 = extractelement <2 x i32> %v1, i32 1
%tmp0.0 = add i32 %v0.0, %v1.0		%tmp0.0 = add i32 %v0.0, %v1.0
%tmp0.1 = add i32 %v0.1, %v1.1		%tmp0.1 = add i32 %v0.1, %v1.1
%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1		%tmp3.1 = insertelement <4 x i32> %tmp3.0, i32 %tmp2.1, i32 1
%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2		%tmp3.2 = insertelement <4 x i32> %tmp3.1, i32 %tmp2.2, i32 2
%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3		%tmp3.3 = insertelement <4 x i32> %tmp3.2, i32 %tmp2.3, i32 3
ret <4 x i32> %tmp3.3		ret <4 x i32> %tmp3.3
}		}

define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {		define i32 @reduction_v4i32(<4 x i32> %v0, <4 x i32> %v1) {
; CHECK-LABEL: @reduction_v4i32(		; CHECK-LABEL: @reduction_v4i32(
; CHECK-NEXT: [[TMP1:%.*]] = shufflevector <4 x i32> [[V0:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP1:%.*]] = sub <4 x i32> [[V0:%.*]], [[V1:%.*]]
; CHECK-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32> [[V1:%.*]], <4 x i32> undef, <4 x i32> <i32 1, i32 0, i32 3, i32 2>		; CHECK-NEXT: [[TMP2:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP3:%.*]] = sub <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> [[TMP2]], <4 x i32> <i32 1, i32 4, i32 7, i32 2>
; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP4:%.*]] = sub <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP5:%.*]] = shufflevector <4 x i32> [[TMP3]], <4 x i32> [[TMP4]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>		; CHECK-NEXT: [[TMP5:%.*]] = add <4 x i32> [[V0]], [[V1]]
; CHECK-NEXT: [[TMP6:%.*]] = sub <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP6:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> [[TMP5]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>
; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[V0]], [[V1]]		; CHECK-NEXT: [[TMP7:%.*]] = add <4 x i32> [[TMP6]], [[TMP3]]
; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <4 x i32> [[TMP6]], <4 x i32> [[TMP7]], <4 x i32> <i32 0, i32 5, i32 6, i32 3>		; CHECK-NEXT: [[TMP8:%.*]] = lshr <4 x i32> [[TMP7]], <i32 15, i32 15, i32 15, i32 15>
; CHECK-NEXT: [[TMP9:%.*]] = add <4 x i32> [[TMP8]], [[TMP5]]		; CHECK-NEXT: [[TMP9:%.*]] = and <4 x i32> [[TMP8]], <i32 65537, i32 65537, i32 65537, i32 65537>
; CHECK-NEXT: [[TMP10:%.*]] = lshr <4 x i32> [[TMP9]], <i32 15, i32 15, i32 15, i32 15>		; CHECK-NEXT: [[TMP10:%.*]] = mul nuw <4 x i32> [[TMP9]], <i32 65535, i32 65535, i32 65535, i32 65535>
; CHECK-NEXT: [[TMP11:%.*]] = and <4 x i32> [[TMP10]], <i32 65537, i32 65537, i32 65537, i32 65537>		; CHECK-NEXT: [[TMP11:%.*]] = add <4 x i32> [[TMP10]], [[TMP7]]
; CHECK-NEXT: [[TMP12:%.*]] = mul nuw <4 x i32> [[TMP11]], <i32 65535, i32 65535, i32 65535, i32 65535>		; CHECK-NEXT: [[TMP12:%.*]] = xor <4 x i32> [[TMP11]], [[TMP10]]
; CHECK-NEXT: [[TMP13:%.*]] = add <4 x i32> [[TMP12]], [[TMP9]]		; CHECK-NEXT: [[TMP13:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP12]])
; CHECK-NEXT: [[TMP14:%.*]] = xor <4 x i32> [[TMP13]], [[TMP12]]		; CHECK-NEXT: ret i32 [[TMP13]]
; CHECK-NEXT: [[TMP15:%.*]] = call i32 @llvm.vector.reduce.add.v4i32(<4 x i32> [[TMP14]])
; CHECK-NEXT: ret i32 [[TMP15]]
;		;
%v0.0 = extractelement <4 x i32> %v0, i32 0		%v0.0 = extractelement <4 x i32> %v0, i32 0
%v0.1 = extractelement <4 x i32> %v0, i32 1		%v0.1 = extractelement <4 x i32> %v0, i32 1
%v0.2 = extractelement <4 x i32> %v0, i32 2		%v0.2 = extractelement <4 x i32> %v0, i32 2
%v0.3 = extractelement <4 x i32> %v0, i32 3		%v0.3 = extractelement <4 x i32> %v0, i32 3
%v1.0 = extractelement <4 x i32> %v1, i32 0		%v1.0 = extractelement <4 x i32> %v1, i32 0
%v1.1 = extractelement <4 x i32> %v1, i32 1		%v1.1 = extractelement <4 x i32> %v1, i32 1
%v1.2 = extractelement <4 x i32> %v1, i32 2		%v1.2 = extractelement <4 x i32> %v1, i32 2

llvm/test/Transforms/SLPVectorizer/X86/addsub.ll

%10 = getelementptr inbounds double, double* %b, i64 1		%10 = getelementptr inbounds double, double* %b, i64 1
%11 = load double, double* %10		%11 = load double, double* %10
%12 = fadd double %9, %11		%12 = fadd double %9, %11
%13 = fadd double %7, %12		%13 = fadd double %7, %12
%14 = getelementptr inbounds double, double* %c, i64 1		%14 = getelementptr inbounds double, double* %c, i64 1
store double %13, double* %14		store double %13, double* %14
ret void		ret void
}		}

; Dont vectorization of following code for float data type as sub is not commutative-		define void @vec_shuff_reorder() #0 {
		RKSimon: Update comment? I had to do something similar on D103925, although the wording here might be different.
		ABataev: Ok, will do
; fc[0] = fb[0]+fa[0];		; CHECK-LABEL: @vec_shuff_reorder(
; fc[1] = fa[1]-fb[1];
; fc[2] = fa[2]+fb[2];
; fc[3] = fb[3]-fa[3];
; In the above code we can swap the 1st and 2nd operation as fadd is commutative
; but not 2nd or 4th as fsub is not commutative.

define void @no_vec_shuff_reorder() #0 {
; CHECK-LABEL: @no_vec_shuff_reorder(
; CHECK-NEXT: [[TMP1:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 0), align 4		; CHECK-NEXT: [[TMP1:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 0), align 4
; CHECK-NEXT: [[TMP2:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 0), align 4		; CHECK-NEXT: [[TMP2:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 0), align 4
; CHECK-NEXT: [[TMP3:%.*]] = fadd float [[TMP1]], [[TMP2]]		; CHECK-NEXT: [[TMP3:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 1) to <2 x float>*), align 4
; CHECK-NEXT: store float [[TMP3]], float* getelementptr inbounds ([4 x float], [4 x float]* @fc, i32 0, i64 0), align 4		; CHECK-NEXT: [[TMP4:%.*]] = load <2 x float>, <2 x float>* bitcast (float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 1) to <2 x float>*), align 4
; CHECK-NEXT: [[TMP4:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 1), align 4		; CHECK-NEXT: [[TMP5:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 3), align 4
; CHECK-NEXT: [[TMP5:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 1), align 4		; CHECK-NEXT: [[TMP6:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 3), align 4
; CHECK-NEXT: [[TMP6:%.*]] = fsub float [[TMP4]], [[TMP5]]		; CHECK-NEXT: [[TMP7:%.*]] = insertelement <4 x float> poison, float [[TMP2]], i32 0
; CHECK-NEXT: store float [[TMP6]], float* getelementptr inbounds ([4 x float], [4 x float]* @fc, i32 0, i64 1), align 4		; CHECK-NEXT: [[TMP8:%.*]] = shufflevector <2 x float> [[TMP3]], <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
; CHECK-NEXT: [[TMP7:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 2), align 4		; CHECK-NEXT: [[TMP9:%.*]] = shufflevector <4 x float> [[TMP7]], <4 x float> [[TMP8]], <4 x i32> <i32 0, i32 4, i32 5, i32 3>
; CHECK-NEXT: [[TMP8:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 2), align 4		; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x float> [[TMP9]], float [[TMP5]], i32 3
; CHECK-NEXT: [[TMP9:%.*]] = fadd float [[TMP7]], [[TMP8]]		; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x float> poison, float [[TMP1]], i32 0
; CHECK-NEXT: store float [[TMP9]], float* getelementptr inbounds ([4 x float], [4 x float]* @fc, i32 0, i64 2), align 4		; CHECK-NEXT: [[TMP12:%.*]] = shufflevector <2 x float> [[TMP4]], <2 x float> poison, <4 x i32> <i32 0, i32 1, i32 undef, i32 undef>
; CHECK-NEXT: [[TMP10:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 3), align 4		; CHECK-NEXT: [[TMP13:%.*]] = shufflevector <4 x float> [[TMP11]], <4 x float> [[TMP12]], <4 x i32> <i32 0, i32 4, i32 5, i32 3>
; CHECK-NEXT: [[TMP11:%.*]] = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 3), align 4		; CHECK-NEXT: [[TMP14:%.*]] = insertelement <4 x float> [[TMP13]], float [[TMP6]], i32 3
; CHECK-NEXT: [[TMP12:%.*]] = fsub float [[TMP10]], [[TMP11]]		; CHECK-NEXT: [[TMP15:%.*]] = fadd <4 x float> [[TMP10]], [[TMP14]]
; CHECK-NEXT: store float [[TMP12]], float* getelementptr inbounds ([4 x float], [4 x float]* @fc, i32 0, i64 3), align 4		; CHECK-NEXT: [[TMP16:%.*]] = fsub <4 x float> [[TMP10]], [[TMP14]]
		; CHECK-NEXT: [[TMP17:%.*]] = shufflevector <4 x float> [[TMP15]], <4 x float> [[TMP16]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
		; CHECK-NEXT: [[TMP18:%.*]] = extractelement <2 x float> [[TMP4]], i32 1
		; CHECK-NEXT: [[TMP19:%.*]] = extractelement <2 x float> [[TMP3]], i32 1
		; CHECK-NEXT: [[TMP20:%.*]] = extractelement <2 x float> [[TMP4]], i32 0
		; CHECK-NEXT: [[TMP21:%.*]] = extractelement <2 x float> [[TMP3]], i32 0
		; CHECK-NEXT: store <4 x float> [[TMP17]], <4 x float>* bitcast ([4 x float]* @fc to <4 x float>*), align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
%1 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 0), align 4		%1 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 0), align 4
%2 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 0), align 4		%2 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 0), align 4
%3 = fadd float %1, %2		%3 = fadd float %1, %2
store float %3, float* getelementptr inbounds ([4 x float], [4 x float]* @fc, i32 0, i64 0), align 4		store float %3, float* getelementptr inbounds ([4 x float], [4 x float]* @fc, i32 0, i64 0), align 4
%4 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 1), align 4		%4 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fa, i32 0, i64 1), align 4
%5 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 1), align 4		%5 = load float, float* getelementptr inbounds ([4 x float], [4 x float]* @fb, i32 0, i64 1), align 4

llvm/test/Transforms/SLPVectorizer/X86/crash_cmpop.ll

	; AVX-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]			; AVX-NEXT: [[INDVARS_IV:%.*]] = phi i64 [ 0, [[ENTRY:%.*]] ], [ [[INDVARS_IV_NEXT:%.*]], [[FOR_BODY]] ]
	; AVX-NEXT: [[ACC1_056:%.*]] = phi float [ 0.000000e+00, [[ENTRY]] ], [ [[ADD13:%.*]], [[FOR_BODY]] ]			; AVX-NEXT: [[ACC1_056:%.*]] = phi float [ 0.000000e+00, [[ENTRY]] ], [ [[ADD13:%.*]], [[FOR_BODY]] ]
	; AVX-NEXT: [[TMP0:%.*]] = phi <2 x float> [ zeroinitializer, [[ENTRY]] ], [ [[TMP19:%.*]], [[FOR_BODY]] ]			; AVX-NEXT: [[TMP0:%.*]] = phi <2 x float> [ zeroinitializer, [[ENTRY]] ], [ [[TMP19:%.*]], [[FOR_BODY]] ]
	; AVX-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[SRC:%.*]], i64 [[INDVARS_IV]]			; AVX-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds float, float* [[SRC:%.*]], i64 [[INDVARS_IV]]
	; AVX-NEXT: [[TMP1:%.*]] = load float, float* [[ARRAYIDX]], align 4			; AVX-NEXT: [[TMP1:%.*]] = load float, float* [[ARRAYIDX]], align 4
	; AVX-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1			; AVX-NEXT: [[INDVARS_IV_NEXT]] = add nuw nsw i64 [[INDVARS_IV]], 1
	; AVX-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds float, float* [[DEST:%.*]], i64 [[INDVARS_IV]]			; AVX-NEXT: [[ARRAYIDX2:%.*]] = getelementptr inbounds float, float* [[DEST:%.*]], i64 [[INDVARS_IV]]
	; AVX-NEXT: store float [[ACC1_056]], float* [[ARRAYIDX2]], align 4			; AVX-NEXT: store float [[ACC1_056]], float* [[ARRAYIDX2]], align 4
	; AVX-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP0]], <2 x float> poison, <2 x i32> <i32 1, i32 0>
	; AVX-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[TMP1]], i32 0			; AVX-NEXT: [[TMP2:%.*]] = insertelement <2 x float> poison, float [[TMP1]], i32 0
	; AVX-NEXT: [[TMP3:%.*]] = insertelement <2 x float> [[TMP2]], float [[TMP1]], i32 1			; AVX-NEXT: [[TMP3:%.*]] = insertelement <2 x float> [[TMP2]], float [[TMP1]], i32 1
	; AVX-NEXT: [[TMP4:%.*]] = fadd <2 x float> [[SHUFFLE]], [[TMP3]]			; AVX-NEXT: [[TMP4:%.*]] = fadd <2 x float> [[TMP0]], [[TMP3]]
				; AVX-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP4]], <2 x float> poison, <2 x i32> <i32 1, i32 0>
	; AVX-NEXT: [[TMP5:%.*]] = fmul <2 x float> [[TMP0]], zeroinitializer			; AVX-NEXT: [[TMP5:%.*]] = fmul <2 x float> [[TMP0]], zeroinitializer
	; AVX-NEXT: [[TMP6:%.*]] = fadd <2 x float> [[TMP5]], [[TMP4]]			; AVX-NEXT: [[TMP6:%.*]] = fadd <2 x float> [[TMP5]], [[SHUFFLE]]
	; AVX-NEXT: [[TMP7:%.*]] = fcmp olt <2 x float> [[TMP6]], <float 1.000000e+00, float 1.000000e+00>			; AVX-NEXT: [[TMP7:%.*]] = fcmp olt <2 x float> [[TMP6]], <float 1.000000e+00, float 1.000000e+00>
	; AVX-NEXT: [[TMP8:%.*]] = select <2 x i1> [[TMP7]], <2 x float> [[TMP6]], <2 x float> <float 1.000000e+00, float 1.000000e+00>			; AVX-NEXT: [[TMP8:%.*]] = select <2 x i1> [[TMP7]], <2 x float> [[TMP6]], <2 x float> <float 1.000000e+00, float 1.000000e+00>
	; AVX-NEXT: [[TMP9:%.*]] = fcmp olt <2 x float> [[TMP8]], <float -1.000000e+00, float -1.000000e+00>			; AVX-NEXT: [[TMP9:%.*]] = fcmp olt <2 x float> [[TMP8]], <float -1.000000e+00, float -1.000000e+00>
	; AVX-NEXT: [[TMP10:%.*]] = fmul <2 x float> [[TMP8]], zeroinitializer			; AVX-NEXT: [[TMP10:%.*]] = fmul <2 x float> [[TMP8]], zeroinitializer
	; AVX-NEXT: [[TMP11:%.*]] = select <2 x i1> [[TMP9]], <2 x float> <float -0.000000e+00, float -0.000000e+00>, <2 x float> [[TMP10]]			; AVX-NEXT: [[TMP11:%.*]] = select <2 x i1> [[TMP9]], <2 x float> <float -0.000000e+00, float -0.000000e+00>, <2 x float> [[TMP10]]
	; AVX-NEXT: [[TMP12:%.*]] = extractelement <2 x float> [[TMP11]], i32 0			; AVX-NEXT: [[TMP12:%.*]] = extractelement <2 x float> [[TMP11]], i32 0
	; AVX-NEXT: [[TMP13:%.*]] = extractelement <2 x float> [[TMP11]], i32 1			; AVX-NEXT: [[TMP13:%.*]] = extractelement <2 x float> [[TMP11]], i32 1
	; AVX-NEXT: [[ADD13]] = fadd float [[TMP12]], [[TMP13]]			; AVX-NEXT: [[ADD13]] = fadd float [[TMP12]], [[TMP13]]

llvm/test/Transforms/SLPVectorizer/X86/extract.ll

store double %A1, double* %P1, align 4		store double %A1, double* %P1, align 4
ret void		ret void
}		}

define void @fextr1(double* %ptr) {		define void @fextr1(double* %ptr) {
; CHECK-LABEL: @fextr1(		; CHECK-LABEL: @fextr1(
; CHECK-NEXT: entry:		; CHECK-NEXT: entry:
; CHECK-NEXT: [[LD:%.*]] = load <2 x double>, <2 x double>* undef, align 16		; CHECK-NEXT: [[LD:%.*]] = load <2 x double>, <2 x double>* undef, align 16
; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[LD]], <2 x double> poison, <2 x i32> <i32 1, i32 0>
; CHECK-NEXT: [[P1:%.*]] = getelementptr inbounds double, double* [[PTR:%.*]], i64 0		; CHECK-NEXT: [[P1:%.*]] = getelementptr inbounds double, double* [[PTR:%.*]], i64 0
; CHECK-NEXT: [[TMP0:%.*]] = fadd <2 x double> [[SHUFFLE]], <double 3.400000e+00, double 1.200000e+00>		; CHECK-NEXT: [[TMP0:%.*]] = fadd <2 x double> [[LD]], <double 1.200000e+00, double 3.400000e+00>
		; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x double> [[TMP0]], <2 x double> poison, <2 x i32> <i32 1, i32 0>
; CHECK-NEXT: [[TMP1:%.*]] = bitcast double* [[P1]] to <2 x double>*		; CHECK-NEXT: [[TMP1:%.*]] = bitcast double* [[P1]] to <2 x double>*
; CHECK-NEXT: store <2 x double> [[TMP0]], <2 x double>* [[TMP1]], align 4		; CHECK-NEXT: store <2 x double> [[SHUFFLE]], <2 x double>* [[TMP1]], align 4
; CHECK-NEXT: ret void		; CHECK-NEXT: ret void
;		;
entry:		entry:
%LD = load <2 x double>, <2 x double>* undef		%LD = load <2 x double>, <2 x double>* undef
%V0 = extractelement <2 x double> %LD, i32 0		%V0 = extractelement <2 x double> %LD, i32 0
%V1 = extractelement <2 x double> %LD, i32 1		%V1 = extractelement <2 x double> %LD, i32 1
%P0 = getelementptr inbounds double, double* %ptr, i64 1 ; <--- incorrect order		%P0 = getelementptr inbounds double, double* %ptr, i64 1 ; <--- incorrect order
%P1 = getelementptr inbounds double, double* %ptr, i64 0		%P1 = getelementptr inbounds double, double* %ptr, i64 0

llvm/test/Transforms/SLPVectorizer/X86/jumbled-load-multiuse.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 | FileCheck %s			; RUN: opt < %s -slp-vectorizer -S -mtriple=x86_64-unknown-linux -mattr=+sse4.2 | FileCheck %s

	@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@a = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4
	@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4			@b = common local_unnamed_addr global [4 x i32] zeroinitializer, align 4

	define i32 @fn1() {			define i32 @fn1() {
	; CHECK-LABEL: @fn1(			; CHECK-LABEL: @fn1(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i32>, <4 x i32>* bitcast ([4 x i32]* @b to <4 x i32>*), align 4			; CHECK-NEXT: [[TMP0:%.*]] = load <4 x i32>, <4 x i32>* bitcast ([4 x i32]* @b to <4 x i32>*), align 4
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP0]], <4 x i32> poison, <4 x i32> <i32 1, i32 2, i32 3, i32 0>			; CHECK-NEXT: [[TMP1:%.*]] = icmp sgt <4 x i32> [[TMP0]], zeroinitializer
	; CHECK-NEXT: [[TMP1:%.*]] = icmp sgt <4 x i32> [[SHUFFLE]], zeroinitializer			; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x i32> [[TMP0]], i32 1
	; CHECK-NEXT: [[TMP2:%.*]] = extractelement <4 x i32> [[SHUFFLE]], i32 0			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i32> <i32 8, i32 poison, i32 ptrtoint (i32 ()* @fn1 to i32), i32 ptrtoint (i32 ()* @fn1 to i32)>, i32 [[TMP2]], i32 1
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <4 x i32> <i32 poison, i32 ptrtoint (i32 ()* @fn1 to i32), i32 ptrtoint (i32 ()* @fn1 to i32), i32 8>, i32 [[TMP2]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP1]], <4 x i32> [[TMP3]], <4 x i32> <i32 0, i32 6, i32 0, i32 0>
	; CHECK-NEXT: [[TMP4:%.*]] = select <4 x i1> [[TMP1]], <4 x i32> [[TMP3]], <4 x i32> <i32 6, i32 0, i32 0, i32 0>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> poison, <4 x i32> <i32 1, i32 2, i32 3, i32 0>
	; CHECK-NEXT: store <4 x i32> [[TMP4]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4			; CHECK-NEXT: store <4 x i32> [[SHUFFLE]], <4 x i32>* bitcast ([4 x i32]* @a to <4 x i32>*), align 4
	; CHECK-NEXT: ret i32 0			; CHECK-NEXT: ret i32 0
	;			;
	entry:			entry:
	%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4			%0 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 0), align 4
	%cmp = icmp sgt i32 %0, 0			%cmp = icmp sgt i32 %0, 0
	%cond = select i1 %cmp, i32 8, i32 0			%cond = select i1 %cmp, i32 8, i32 0
	store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4			store i32 %cond, i32* getelementptr inbounds ([4 x i32], [4 x i32]* @a, i64 0, i32 3), align 4
	%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4			%1 = load i32, i32* getelementptr ([4 x i32], [4 x i32]* @b, i64 0, i32 1), align 4

llvm/test/Transforms/SLPVectorizer/X86/jumbled-load.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s			; RUN: opt < %s -S -mtriple=x86_64-unknown -mattr=+avx -slp-vectorizer | FileCheck %s



	define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {			define i32 @jumbled-load(i32* noalias nocapture %in, i32* noalias nocapture %inn, i32* noalias nocapture %out) {
	; CHECK-LABEL: @jumbled-load(			; CHECK-LABEL: @jumbled-load(
	; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0			; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
	; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
	; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
	; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
	; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
	; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4			; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
	; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0			; CHECK-NEXT: [[INN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[INN:%.*]], i64 0
	; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_4:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 2
	; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_5:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 3
	; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_6:%.*]] = getelementptr inbounds i32, i32* [[INN_ADDR]], i64 1
	; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[INN_ADDR]] to <4 x i32>*			; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[GEP_5]] to <4 x i32>*
	; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = load <4 x i32>, <4 x i32>* [[TMP3]], align 4
	; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> poison, <4 x i32> <i32 0, i32 1, i32 3, i32 2>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP4]], <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 0, i32 2>
	; CHECK-NEXT: [[TMP5:%.*]] = mul <4 x i32> [[SHUFFLE]], [[SHUFFLE1]]			; CHECK-NEXT: [[TMP5:%.*]] = mul <4 x i32> [[TMP2]], [[SHUFFLE]]
				; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x i32> [[TMP5]], <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
	; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0			; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
	; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1			; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
	; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2			; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
	; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3			; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
	; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*			; CHECK-NEXT: [[TMP6:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
	; CHECK-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4			; CHECK-NEXT: store <4 x i32> [[SHUFFLE1]], <4 x i32>* [[TMP6]], align 4
	; CHECK-NEXT: ret i32 undef			; CHECK-NEXT: ret i32 undef
	;			;
	%in.addr = getelementptr inbounds i32, i32* %in, i64 0			%in.addr = getelementptr inbounds i32, i32* %in, i64 0
	%load.1 = load i32, i32* %in.addr, align 4			%load.1 = load i32, i32* %in.addr, align 4
	%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3			%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3
	%load.2 = load i32, i32* %gep.1, align 4			%load.2 = load i32, i32* %gep.1, align 4
	%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1			%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1
	%load.3 = load i32, i32* %gep.2, align 4			%load.3 = load i32, i32* %gep.2, align 4
	Show All 27 Lines
	define i32 @jumbled-load-multiuses(i32* noalias nocapture %in, i32* noalias nocapture %out) {			define i32 @jumbled-load-multiuses(i32* noalias nocapture %in, i32* noalias nocapture %out) {
	; CHECK-LABEL: @jumbled-load-multiuses(			; CHECK-LABEL: @jumbled-load-multiuses(
	; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0			; CHECK-NEXT: [[IN_ADDR:%.*]] = getelementptr inbounds i32, i32* [[IN:%.*]], i64 0
	; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3			; CHECK-NEXT: [[GEP_1:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 3
	; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1			; CHECK-NEXT: [[GEP_2:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 1
	; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2			; CHECK-NEXT: [[GEP_3:%.*]] = getelementptr inbounds i32, i32* [[IN_ADDR]], i64 2
	; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[IN_ADDR]] to <4 x i32>*
	; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4			; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 2, i32 0>			; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i32> [[TMP2]], i32 1
	; CHECK-NEXT: [[TMP3:%.*]] = extractelement <4 x i32> [[SHUFFLE]], i32 2
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> poison, i32 [[TMP3]], i32 0			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <4 x i32> poison, i32 [[TMP3]], i32 0
	; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x i32> [[SHUFFLE]], i32 1			; CHECK-NEXT: [[TMP5:%.*]] = extractelement <4 x i32> [[TMP2]], i32 2
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP4]], i32 [[TMP5]], i32 1			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <4 x i32> [[TMP4]], i32 [[TMP5]], i32 1
	; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x i32> [[SHUFFLE]], i32 3			; CHECK-NEXT: [[TMP7:%.*]] = extractelement <4 x i32> [[TMP2]], i32 0
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP7]], i32 2			; CHECK-NEXT: [[TMP8:%.*]] = insertelement <4 x i32> [[TMP6]], i32 [[TMP7]], i32 2
	; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x i32> [[SHUFFLE]], i32 0			; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x i32> [[TMP2]], i32 3
	; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP8]], i32 [[TMP9]], i32 3			; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> [[TMP8]], i32 [[TMP9]], i32 3
	; CHECK-NEXT: [[TMP11:%.*]] = mul <4 x i32> [[SHUFFLE]], [[TMP10]]			; CHECK-NEXT: [[TMP11:%.*]] = mul <4 x i32> [[TMP2]], [[TMP10]]
				; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP11]], <4 x i32> poison, <4 x i32> <i32 1, i32 3, i32 2, i32 0>
	; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0			; CHECK-NEXT: [[GEP_7:%.*]] = getelementptr inbounds i32, i32* [[OUT:%.*]], i64 0
	; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1			; CHECK-NEXT: [[GEP_8:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 1
	; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2			; CHECK-NEXT: [[GEP_9:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 2
	; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3			; CHECK-NEXT: [[GEP_10:%.*]] = getelementptr inbounds i32, i32* [[OUT]], i64 3
	; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*			; CHECK-NEXT: [[TMP12:%.*]] = bitcast i32* [[GEP_7]] to <4 x i32>*
	; CHECK-NEXT: store <4 x i32> [[TMP11]], <4 x i32>* [[TMP12]], align 4			; CHECK-NEXT: store <4 x i32> [[SHUFFLE]], <4 x i32>* [[TMP12]], align 4
	; CHECK-NEXT: ret i32 undef			; CHECK-NEXT: ret i32 undef
	;			;
	%in.addr = getelementptr inbounds i32, i32* %in, i64 0			%in.addr = getelementptr inbounds i32, i32* %in, i64 0
	%load.1 = load i32, i32* %in.addr, align 4			%load.1 = load i32, i32* %in.addr, align 4
	%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3			%gep.1 = getelementptr inbounds i32, i32* %in.addr, i64 3
	%load.2 = load i32, i32* %gep.1, align 4			%load.2 = load i32, i32* %gep.1, align 4
	%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1			%gep.2 = getelementptr inbounds i32, i32* %in.addr, i64 1
	%load.3 = load i32, i32* %gep.2, align 4			%load.3 = load i32, i32* %gep.2, align 4
	Show All 17 Lines

llvm/test/Transforms/SLPVectorizer/X86/jumbled_store_crash.ll

	Show All 20 Lines
	; CHECK-NEXT: [[TMP2:%.*]] = load <2 x i32>, <2 x i32>* [[TMP1]], align 4			; CHECK-NEXT: [[TMP2:%.*]] = load <2 x i32>, <2 x i32>* [[TMP1]], align 4
	; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 13			; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 13
	; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[ARRAYIDX1]] to <2 x i32>*			; CHECK-NEXT: [[TMP3:%.*]] = bitcast i32* [[ARRAYIDX1]] to <2 x i32>*
	; CHECK-NEXT: [[TMP4:%.*]] = load <2 x i32>, <2 x i32>* [[TMP3]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = load <2 x i32>, <2 x i32>* [[TMP3]], align 4
	; CHECK-NEXT: [[TMP5:%.*]] = add nsw <2 x i32> [[TMP4]], [[TMP2]]			; CHECK-NEXT: [[TMP5:%.*]] = add nsw <2 x i32> [[TMP4]], [[TMP2]]
	; CHECK-NEXT: [[TMP6:%.*]] = sitofp <2 x i32> [[TMP5]] to <2 x float>			; CHECK-NEXT: [[TMP6:%.*]] = sitofp <2 x i32> [[TMP5]] to <2 x float>
	; CHECK-NEXT: [[TMP7:%.*]] = fmul <2 x float> [[TMP6]], <float 1.000000e+01, float 1.000000e+01>			; CHECK-NEXT: [[TMP7:%.*]] = fmul <2 x float> [[TMP6]], <float 1.000000e+01, float 1.000000e+01>
	; CHECK-NEXT: [[TMP8:%.*]] = fsub <2 x float> <float 1.000000e+00, float 0.000000e+00>, [[TMP7]]			; CHECK-NEXT: [[TMP8:%.*]] = fsub <2 x float> <float 1.000000e+00, float 0.000000e+00>, [[TMP7]]
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP8]], <2 x float> poison, <4 x i32> <i32 0, i32 0, i32 1, i32 1>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x float> [[TMP8]], <2 x float> poison, <4 x i32> <i32 1, i32 0, i32 1, i32 0>
	; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x float> [[SHUFFLE]], i32 0			; CHECK-NEXT: [[TMP9:%.*]] = extractelement <4 x float> [[SHUFFLE]], i32 1
	; CHECK-NEXT: store float [[TMP9]], float* @g, align 4			; CHECK-NEXT: store float [[TMP9]], float* @g, align 4
	; CHECK-NEXT: [[TMP10:%.*]] = fadd <4 x float> [[SHUFFLE]], <float -1.000000e+00, float 1.000000e+00, float -1.000000e+00, float 1.000000e+00>			; CHECK-NEXT: [[TMP10:%.*]] = fadd <4 x float> [[SHUFFLE]], <float -1.000000e+00, float -1.000000e+00, float 1.000000e+00, float 1.000000e+00>
	; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x float> [[TMP10]], i32 3			; CHECK-NEXT: [[TMP11:%.*]] = extractelement <4 x float> [[TMP10]], i32 2
	; CHECK-NEXT: store float [[TMP11]], float* @c, align 4			; CHECK-NEXT: store float [[TMP11]], float* @c, align 4
	; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x float> [[TMP10]], i32 2			; CHECK-NEXT: [[TMP12:%.*]] = extractelement <4 x float> [[TMP10]], i32 0
	; CHECK-NEXT: store float [[TMP12]], float* @d, align 4			; CHECK-NEXT: store float [[TMP12]], float* @d, align 4
	; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x float> [[TMP10]], i32 1			; CHECK-NEXT: [[TMP13:%.*]] = extractelement <4 x float> [[TMP10]], i32 3
	; CHECK-NEXT: store float [[TMP13]], float* @e, align 4			; CHECK-NEXT: store float [[TMP13]], float* @e, align 4
	; CHECK-NEXT: [[TMP14:%.*]] = extractelement <4 x float> [[TMP10]], i32 0			; CHECK-NEXT: [[TMP14:%.*]] = extractelement <4 x float> [[TMP10]], i32 1
	; CHECK-NEXT: store float [[TMP14]], float* @f, align 4			; CHECK-NEXT: store float [[TMP14]], float* @f, align 4
	; CHECK-NEXT: [[ARRAYIDX15:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 14			; CHECK-NEXT: [[ARRAYIDX15:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 14
	; CHECK-NEXT: [[ARRAYIDX18:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 15			; CHECK-NEXT: [[ARRAYIDX18:%.*]] = getelementptr inbounds i32, i32* [[TMP0]], i64 15
	; CHECK-NEXT: [[TMP15:%.*]] = load i32, i32* @a, align 4			; CHECK-NEXT: [[TMP15:%.*]] = load i32, i32* @a, align 4
	; CHECK-NEXT: [[CONV19:%.*]] = sitofp i32 [[TMP15]] to float			; CHECK-NEXT: [[CONV19:%.*]] = sitofp i32 [[TMP15]] to float
	; CHECK-NEXT: [[TMP16:%.*]] = insertelement <4 x float> <float -1.000000e+00, float -1.000000e+00, float poison, float poison>, float [[CONV19]], i32 2			; CHECK-NEXT: [[TMP16:%.*]] = insertelement <4 x float> <float poison, float -1.000000e+00, float poison, float -1.000000e+00>, float [[CONV19]], i32 0
	; CHECK-NEXT: [[TMP17:%.*]] = extractelement <4 x float> [[SHUFFLE]], i32 2			; CHECK-NEXT: [[TMP17:%.*]] = extractelement <4 x float> [[SHUFFLE]], i32 0
	; CHECK-NEXT: [[TMP18:%.*]] = insertelement <4 x float> [[TMP16]], float [[TMP17]], i32 3			; CHECK-NEXT: [[TMP18:%.*]] = insertelement <4 x float> [[TMP16]], float [[TMP17]], i32 2
	; CHECK-NEXT: [[TMP19:%.*]] = fadd <4 x float> [[TMP10]], [[TMP18]]			; CHECK-NEXT: [[TMP19:%.*]] = fsub <4 x float> [[TMP10]], [[TMP18]]
	; CHECK-NEXT: [[TMP20:%.*]] = fsub <4 x float> [[TMP10]], [[TMP18]]			; CHECK-NEXT: [[TMP20:%.*]] = fadd <4 x float> [[TMP10]], [[TMP18]]
	; CHECK-NEXT: [[TMP21:%.*]] = shufflevector <4 x float> [[TMP19]], <4 x float> [[TMP20]], <4 x i32> <i32 0, i32 1, i32 6, i32 7>			; CHECK-NEXT: [[TMP21:%.*]] = shufflevector <4 x float> [[TMP19]], <4 x float> [[TMP20]], <4 x i32> <i32 0, i32 5, i32 2, i32 7>
	; CHECK-NEXT: [[TMP22:%.*]] = fptosi <4 x float> [[TMP21]] to <4 x i32>			; CHECK-NEXT: [[TMP22:%.*]] = fptosi <4 x float> [[TMP21]] to <4 x i32>
	; CHECK-NEXT: [[SHUFFLE1:%.*]] = shufflevector <4 x i32> [[TMP22]], <4 x i32> poison, <4 x i32> <i32 2, i32 0, i32 3, i32 1>
	; CHECK-NEXT: [[TMP23:%.*]] = bitcast i32* [[ARRAYIDX1]] to <4 x i32>*			; CHECK-NEXT: [[TMP23:%.*]] = bitcast i32* [[ARRAYIDX1]] to <4 x i32>*
	; CHECK-NEXT: store <4 x i32> [[SHUFFLE1]], <4 x i32>* [[TMP23]], align 4			; CHECK-NEXT: store <4 x i32> [[TMP22]], <4 x i32>* [[TMP23]], align 4
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%0 = load i32, i32* @b, align 8			%0 = load i32, i32* @b, align 8
	%arrayidx = getelementptr inbounds i32, i32* %0, i64 4			%arrayidx = getelementptr inbounds i32, i32* %0, i64 4
	%1 = load i32, i32* %arrayidx, align 4			%1 = load i32, i32* %arrayidx, align 4
	%arrayidx1 = getelementptr inbounds i32, i32* %0, i64 12			%arrayidx1 = getelementptr inbounds i32, i32* %0, i64 12
	%2 = load i32, i32* %arrayidx1, align 4			%2 = load i32, i32* %arrayidx1, align 4
	Show All 39 Lines

llvm/test/Transforms/SLPVectorizer/X86/reorder_repeated_ops.ll

	Show All 9 Lines
	; CHECK: bb1:			; CHECK: bb1:
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	; CHECK: bb2:			; CHECK: bb2:
	; CHECK-NEXT: [[T:%.*]] = select i1 undef, i16 undef, i16 15			; CHECK-NEXT: [[T:%.*]] = select i1 undef, i16 undef, i16 15
	; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x i16> <i16 poison, i16 undef>, i16 [[T]], i32 0			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x i16> <i16 poison, i16 undef>, i16 [[T]], i32 0
	; CHECK-NEXT: [[TMP1:%.*]] = sext <2 x i16> [[TMP0]] to <2 x i32>			; CHECK-NEXT: [[TMP1:%.*]] = sext <2 x i16> [[TMP0]] to <2 x i32>
	; CHECK-NEXT: [[TMP2:%.*]] = sub nsw <2 x i32> <i32 undef, i32 63>, [[TMP1]]			; CHECK-NEXT: [[TMP2:%.*]] = sub nsw <2 x i32> <i32 undef, i32 63>, [[TMP1]]
	; CHECK-NEXT: [[TMP3:%.*]] = sub <2 x i32> [[TMP2]], undef			; CHECK-NEXT: [[TMP3:%.*]] = sub <2 x i32> [[TMP2]], undef
	; CHECK-NEXT: [[SHUFFLE10:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> poison, <4 x i32> <i32 0, i32 0, i32 0, i32 1>			; CHECK-NEXT: [[SHUFFLE10:%.*]] = shufflevector <2 x i32> [[TMP3]], <2 x i32> poison, <4 x i32> <i32 1, i32 0, i32 0, i32 0>
	; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[SHUFFLE10]], <i32 15, i32 31, i32 47, i32 poison>			; CHECK-NEXT: [[TMP4:%.*]] = add <4 x i32> [[SHUFFLE10]], <i32 undef, i32 15, i32 31, i32 47>
	; CHECK-NEXT: [[TMP5:%.*]] = call i32 @llvm.vector.reduce.smax.v4i32(<4 x i32> [[TMP4]])			; CHECK-NEXT: [[TMP5:%.*]] = call i32 @llvm.vector.reduce.smax.v4i32(<4 x i32> [[TMP4]])
	; CHECK-NEXT: [[T19:%.*]] = select i1 undef, i32 [[TMP5]], i32 undef			; CHECK-NEXT: [[T19:%.*]] = select i1 undef, i32 [[TMP5]], i32 undef
	; CHECK-NEXT: [[T20:%.*]] = icmp sgt i32 [[T19]], 63			; CHECK-NEXT: [[T20:%.*]] = icmp sgt i32 [[T19]], 63
	; CHECK-NEXT: [[TMP6:%.*]] = sub nsw <2 x i32> undef, [[TMP1]]			; CHECK-NEXT: [[TMP6:%.*]] = sub nsw <2 x i32> undef, [[TMP1]]
	; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i32> [[TMP6]], undef			; CHECK-NEXT: [[TMP7:%.*]] = sub <2 x i32> [[TMP6]], undef
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP7]], <2 x i32> poison, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
	; CHECK-NEXT: [[TMP8:%.*]] = add nsw <4 x i32> [[SHUFFLE]], <i32 -49, i32 -33, i32 -33, i32 -17>			; CHECK-NEXT: [[TMP8:%.*]] = add nsw <4 x i32> [[SHUFFLE]], <i32 -49, i32 -33, i32 -33, i32 -17>
	; CHECK-NEXT: [[TMP9:%.*]] = call i32 @llvm.vector.reduce.smin.v4i32(<4 x i32> [[TMP8]])			; CHECK-NEXT: [[TMP9:%.*]] = call i32 @llvm.vector.reduce.smin.v4i32(<4 x i32> [[TMP8]])
	▲ Show 20 Lines • Show All 67 Lines • Show Last 20 Lines

llvm/test/Transforms/SLPVectorizer/X86/vectorize-reorder-reuse.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 < %s | FileCheck %s			; RUN: opt -slp-vectorizer -S -mtriple=x86_64-unknown-linux-gnu -mcpu=bdver2 < %s | FileCheck %s

	define i32 @foo(i32* nocapture readonly %arr, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %a8) {			define i32 @foo(i32* nocapture readonly %arr, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %a8) {
	; CHECK-LABEL: @foo(			; CHECK-LABEL: @foo(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[ARR:%.*]], i64 1			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[ARR:%.*]], i64 1
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARR]] to <2 x i32>*			; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARR]] to <2 x i32>*
	; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, <2 x i32>* [[TMP0]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = load <2 x i32>, <2 x i32>* [[TMP0]], align 4
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> poison, <8 x i32> <i32 0, i32 0, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> poison, <8 x i32> <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 0, i32 0>
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> poison, i32 [[A7:%.*]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> poison, i32 [[A1:%.*]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[A8:%.*]], i32 1			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[A2:%.*]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[A1:%.*]], i32 2			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[A3:%.*]], i32 2
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x i32> [[TMP4]], i32 [[A2:%.*]], i32 3			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x i32> [[TMP4]], i32 [[A4:%.*]], i32 3
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x i32> [[TMP5]], i32 [[A3:%.*]], i32 4			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x i32> [[TMP5]], i32 [[A5:%.*]], i32 4
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[A4:%.*]], i32 5			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[A6:%.*]], i32 5
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[A5:%.*]], i32 6			; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[A7:%.*]], i32 6
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <8 x i32> [[TMP8]], i32 [[A6:%.*]], i32 7			; CHECK-NEXT: [[TMP9:%.*]] = insertelement <8 x i32> [[TMP8]], i32 [[A8:%.*]], i32 7
	; CHECK-NEXT: [[TMP10:%.*]] = add <8 x i32> [[SHUFFLE]], [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = add <8 x i32> [[SHUFFLE]], [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.umin.v8i32(<8 x i32> [[TMP10]])			; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.umin.v8i32(<8 x i32> [[TMP10]])
	; CHECK-NEXT: ret i32 [[TMP11]]			; CHECK-NEXT: ret i32 [[TMP11]]
	;			;
	entry:			entry:
	%arrayidx = getelementptr inbounds i32, i32* %arr, i64 1			%arrayidx = getelementptr inbounds i32, i32* %arr, i64 1
	%0 = load i32, i32* %arrayidx, align 4			%0 = load i32, i32* %arrayidx, align 4
	%add = add i32 %0, %a1			%add = add i32 %0, %a1
	Show All 25 Lines
	define i32 @foo1(i32* nocapture readonly %arr, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %a8) {			define i32 @foo1(i32* nocapture readonly %arr, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %a8) {
	; CHECK-LABEL: @foo1(			; CHECK-LABEL: @foo1(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[ARR:%.*]], i64 1			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[ARR:%.*]], i64 1
	; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 2			; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 2
	; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 3			; CHECK-NEXT: [[ARRAYIDX3:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 3
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARR]] to <4 x i32>*			; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARR]] to <4 x i32>*
	; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> poison, <8 x i32> <i32 0, i32 1, i32 1, i32 1, i32 1, i32 2, i32 2, i32 3>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> poison, <8 x i32> <i32 1, i32 2, i32 3, i32 1, i32 1, i32 0, i32 2, i32 1>
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> poison, i32 [[A6:%.*]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> poison, i32 [[A1:%.*]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[A1:%.*]], i32 1			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[A2:%.*]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[A4:%.*]], i32 2			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[A3:%.*]], i32 2
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x i32> [[TMP4]], i32 [[A5:%.*]], i32 3			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x i32> [[TMP4]], i32 [[A4:%.*]], i32 3
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x i32> [[TMP5]], i32 [[A8:%.*]], i32 4			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x i32> [[TMP5]], i32 [[A5:%.*]], i32 4
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[A2:%.*]], i32 5			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[A6:%.*]], i32 5
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[A7:%.*]], i32 6			; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[A7:%.*]], i32 6
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <8 x i32> [[TMP8]], i32 [[A3:%.*]], i32 7			; CHECK-NEXT: [[TMP9:%.*]] = insertelement <8 x i32> [[TMP8]], i32 [[A8:%.*]], i32 7
	; CHECK-NEXT: [[TMP10:%.*]] = add <8 x i32> [[SHUFFLE]], [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = add <8 x i32> [[SHUFFLE]], [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.umin.v8i32(<8 x i32> [[TMP10]])			; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.umin.v8i32(<8 x i32> [[TMP10]])
	; CHECK-NEXT: ret i32 [[TMP11]]			; CHECK-NEXT: ret i32 [[TMP11]]
	;			;
	entry:			entry:
	%arrayidx = getelementptr inbounds i32, i32* %arr, i64 1			%arrayidx = getelementptr inbounds i32, i32* %arr, i64 1
	%0 = load i32, i32* %arrayidx, align 4			%0 = load i32, i32* %arrayidx, align 4
	%add = add i32 %0, %a1			%add = add i32 %0, %a1
	Show All 29 Lines
	define i32 @foo2(i32* nocapture readonly %arr, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %a8) {			define i32 @foo2(i32* nocapture readonly %arr, i32 %a1, i32 %a2, i32 %a3, i32 %a4, i32 %a5, i32 %a6, i32 %a7, i32 %a8) {
	; CHECK-LABEL: @foo2(			; CHECK-LABEL: @foo2(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[ARR:%.*]], i64 3			; CHECK-NEXT: [[ARRAYIDX:%.*]] = getelementptr inbounds i32, i32* [[ARR:%.*]], i64 3
	; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 2			; CHECK-NEXT: [[ARRAYIDX1:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 2
	; CHECK-NEXT: [[ARRAYIDX7:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 1			; CHECK-NEXT: [[ARRAYIDX7:%.*]] = getelementptr inbounds i32, i32* [[ARR]], i64 1
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARR]] to <4 x i32>*			; CHECK-NEXT: [[TMP0:%.*]] = bitcast i32* [[ARR]] to <4 x i32>*
	; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[TMP0]], align 4
	; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> poison, <8 x i32> <i32 0, i32 0, i32 1, i32 1, i32 2, i32 2, i32 3, i32 3>			; CHECK-NEXT: [[SHUFFLE:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> poison, <8 x i32> <i32 3, i32 2, i32 3, i32 0, i32 1, i32 0, i32 2, i32 1>
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> poison, i32 [[A4:%.*]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> poison, i32 [[A1:%.*]], i32 0
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[A6:%.*]], i32 1			; CHECK-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[A2:%.*]], i32 1
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[A5:%.*]], i32 2			; CHECK-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[A3:%.*]], i32 2
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x i32> [[TMP4]], i32 [[A8:%.*]], i32 3			; CHECK-NEXT: [[TMP5:%.*]] = insertelement <8 x i32> [[TMP4]], i32 [[A4:%.*]], i32 3
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x i32> [[TMP5]], i32 [[A2:%.*]], i32 4			; CHECK-NEXT: [[TMP6:%.*]] = insertelement <8 x i32> [[TMP5]], i32 [[A5:%.*]], i32 4
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[A7:%.*]], i32 5			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[A6:%.*]], i32 5
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[A1:%.*]], i32 6			; CHECK-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[A7:%.*]], i32 6
	; CHECK-NEXT: [[TMP9:%.*]] = insertelement <8 x i32> [[TMP8]], i32 [[A3:%.*]], i32 7			; CHECK-NEXT: [[TMP9:%.*]] = insertelement <8 x i32> [[TMP8]], i32 [[A8:%.*]], i32 7
	; CHECK-NEXT: [[TMP10:%.*]] = add <8 x i32> [[SHUFFLE]], [[TMP9]]			; CHECK-NEXT: [[TMP10:%.*]] = add <8 x i32> [[SHUFFLE]], [[TMP9]]
	; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.umin.v8i32(<8 x i32> [[TMP10]])			; CHECK-NEXT: [[TMP11:%.*]] = call i32 @llvm.vector.reduce.umin.v8i32(<8 x i32> [[TMP10]])
	; CHECK-NEXT: ret i32 [[TMP11]]			; CHECK-NEXT: ret i32 [[TMP11]]
	;			;
	entry:			entry:
	%arrayidx = getelementptr inbounds i32, i32* %arr, i64 3			%arrayidx = getelementptr inbounds i32, i32* %arr, i64 3
	%0 = load i32, i32* %arrayidx, align 4			%0 = load i32, i32* %arrayidx, align 4
	%add = add i32 %0, %a1			%add = add i32 %0, %a1
	Show All 28 Lines


Unit Tests: Failed


Diff 361345
