This is an archive of the discontinued LLVM Phabricator instance.

Differential D90445

[SLP] Make SLPVectorizer to use `llvm.masked.gather` intrinsic
ClosedPublic

Authored by anton-afanasyev on Oct 30 2020, 12:40 AM.

Details

Reviewers

RKSimon
dtemirbulatov
rengolin
ABataev
fhahn

Commits

rGfcad8d3635cf: [SLP] Make SLPVectorizer to use `llvm.masked.gather` intrinsic

Summary

For the scattered operands of load instructions it makes sense
to use gathering load intrinsic, which can lower to native instruction
for X86/AVX512 and ARM/SVE. This also enables building
vectorization tree with entries containing scattered operands.
The next step is to add scattered store.

Fixes PR47629 and PR47623
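
To make the summary concrete, here is a scalar reference model of what a masked gather computes: each enabled lane loads from its own address, and each disabled lane keeps the pass-through value. This is only a sketch of the intrinsic's semantics, not code from the patch; the function name `maskedGather` and its signature are invented for illustration. As noted later in this review, SLP is expected to emit gathers with a constant mask.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar reference model of a masked gather, lane by lane.
// Illustration only: this is not code from SLPVectorizer.cpp, and the
// name maskedGather is hypothetical.
template <typename T, std::size_t N>
std::array<T, N> maskedGather(const std::array<const T *, N> &Ptrs,
                              const std::array<bool, N> &Mask,
                              const std::array<T, N> &PassThru) {
  std::array<T, N> Result = PassThru; // Disabled lanes keep the pass-through.
  for (std::size_t Lane = 0; Lane < N; ++Lane) {
    if (Mask[Lane]) {
      assert(Ptrs[Lane] && "enabled lane must have a valid address");
      Result[Lane] = *Ptrs[Lane]; // Load only from enabled lanes.
    }
  }
  return Result;
}
```

With an all-true mask this degenerates to one load per lane from arbitrary (non-consecutive) addresses, which is the pattern the patch targets.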

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

Updated comments and LaneIndex

llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
24	Hmm, investigated this: no, I was wrong, the calculated cost is correct (the `insertelement`s for the gather instruction are compensated by the `insertelement`s for the buildvector). Investigating further... This looks more like a codegen issue.

Harbormaster completed remote builds in B77128: Diff 302068.Oct 31 2020, 12:06 AM

Please can you rebase after rG1eeae4310771d8a6896fe09effe88883998f34e8?

llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
24	Hmm - does the X86 TTI handle the costs of buildvectors of pointers, or does it fall back to the generic implementation (which is almost certainly lower)?

The MVE gather costs were certainly written with the loop vectorizer in mind. The number of gather/scatter instructions we have is quite limited and it's not always simple to match llvm.gathers to mve gathers. I've not seen any issues from this so far though, other than what was just fixed up in rG30ad7426442e.

anton-afanasyev added inline comments.Oct 31 2020, 6:46 AM

llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
24	Although the codegen output for Skylake looks complicated, I believe it's still better optimized, since the `gather` is cheaper than four `load`s from memory, isn't it?

RKSimon added inline comments.Oct 31 2020, 7:02 AM

llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
24	My concern is that -march=avx512 does nothing so you're just getting raw sse2 costs here - hence why I updated the tests at rG1eeae4310771d8a6896fe09effe88883998f34e8

Rebased against changed test file

Rebased, but the optimization is eliminated for -mattr=.... The option -mattr=+avx512vl should be set explicitly to activate the costs.

Harbormaster completed remote builds in B77142: Diff 302094.Oct 31 2020, 8:40 AM

xbolva00 added a subscriber: xbolva00.Oct 31 2020, 8:46 AM

This comment was removed by xbolva00.

RKSimon mentioned this in rGb51b424c679f: [SLP][X86] Add AVX512VL test target coverage for PR47629.Nov 2 2020, 3:41 AM

anton-afanasyev mentioned this in rGe8d67ef2dc93: [SLP][X86][Test] Extend test coverage for PR47629.Nov 3 2020, 6:51 AM

RKSimon added reviewers: ABataev, fhahn.Nov 3 2020, 7:08 AM

I've extended the test case again: https://reviews.llvm.org/rGe8d67ef2dc93. Also, @xbolva00 precommitted a test for PR47623, which is fixed by this patch.

Rebased

ABataev added inline comments.Nov 3 2020, 8:46 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1583	I would think about adding a new `EntryState` instead of adding a new data member.
3722	Comment for `false` argument
4565	`auto *`
4567	`auto *`
4574	`.emplace_back(PO, cast<User>(VecPtr), InsIndex);`
4579	Try to merge this code line with the one in line 4539

Harbormaster completed remote builds in B77414: Diff 302592.Nov 3 2020, 9:06 AM

anton-afanasyev marked 9 inline comments as done.Nov 3 2020, 10:46 PM

anton-afanasyev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1583	I've thought about it and discussed this with @dtemirbulatov (his throttling patch extends `EntryState` as well). I've concluded that `IsScattered` is more like `ReuseShuffleIndices`/`ReorderIndices` (data members) than an "entry state". Also, the current entry state `NeedToGather` would add ambiguity in that case.
3722	Ok.
4565	Ok
4567	`UndefValue::get()` returns `UndefValue` type, but we need `Value` here.
4574	Ok
4579	Done.

Addressed notes

Harbormaster completed remote builds in B77509: Diff 302754.Nov 3 2020, 11:24 PM

ABataev added inline comments.Nov 4 2020, 7:31 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1583	It is a different thing. `ReuseShuffleIndices/ReorderIndices` can be used by many different entries, while the `IsScattered` member is used only for a single particular kind of node. It increases memory usage because we add an extra field which is just ignored in many cases. Plus, to me, this field is just a mark of the new kind of gather.
2866	Why is it `vectorized` if this is actually a kind of gather?

anton-afanasyev marked 2 inline comments as done.Nov 4 2020, 11:57 PM

anton-afanasyev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1583	Ok, sounds convincing, I can change it. But I don't think this field is "just the new kind of gather". I'm planning to use `IsScattered` for scattered stores (the `@llvm.masked.scatter` intrinsic) at the next step, of course. This is not a gather like the `NeedToGather` entry state.
2866	Actually I think `vectorized` is more appropriate here than `gathered`, in the sense in which it is used throughout the SLPVectorizer module, since we actually replace several instructions with _one_. I can change it to `scatter-vectorized`.

ABataev added inline comments.Nov 5 2020, 6:14 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2866	Could you give a bit more explanation why it should be treated as `vectorized`?

anton-afanasyev marked 3 inline comments as done.Nov 5 2020, 7:45 AM

anton-afanasyev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2866	Ok. For now, `enum EntryState {Vectorize, NeedToGather}` has two states: the first is for a bundle that is to be vectorized, the second for one that is to be gathered. But here "gathered" means the bundle stays untouched after tree vectorization: we do not need to replace several scalar instructions with one vector instruction, and we do not need to handle the users of the gathered instructions. For the "scattered" entry we have the opposite case: several scalar instructions are to be replaced with the `@llvm.masked.gather.*` and `@llvm.masked.scatter.*` intrinsics, and we need to handle their users (at least for the `store` = `@llvm.masked.scatter.*` case, which is the next patch and also uses the `IsScattered` field). Am I clear?

anton-afanasyev marked an inline comment as done.Nov 5 2020, 7:46 AM

ABataev added inline comments.Nov 5 2020, 7:48 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2866	Shall we handle the users for loads?

anton-afanasyev marked an inline comment as done.Nov 5 2020, 8:04 AM

anton-afanasyev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2866	No, only for stores, of course. Vectorized `load`s are leaves of vectorized tree, whereas stores are seed points. Scattered stores can also be "vectorized" in the sense of being replaced with the one intrinsic.

ABataev added inline comments.Nov 5 2020, 8:14 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2866	Then for loads it is more just a kind of gather, rather than vectorize. Is it required to mark the scalars as vectorized, or is it enough to mark them as gathered? Do we need all the stuff that is required for the vectorized instructions, like inorder, etc.?

anton-afanasyev marked 2 inline comments as done.Nov 5 2020, 8:33 AM

anton-afanasyev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2866	Yes, we need this stuff: at least separate cost estimation and scheduling. We treat this as a generalization of vectorization: just as `load <4 x i32>` does, `@llvm.masked.gather` loads into `<4 x i32>`, for instance.

ABataev added inline comments.Nov 5 2020, 8:47 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2866	Why can't it be modeled via gather? We gather the scalar loads here, but into a different form. Instead of a direct gather we just gather the addresses and then do a load. But this is still a kind of gather. Do you need to continue scheduling here? No, you don't, you just need to extend the `Gather` function to generate a gather of addresses + an `llvm.masked.gather` call.

anton-afanasyev marked 2 inline comments as done.Nov 5 2020, 9:17 AM

anton-afanasyev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2866	It can be modeled via gather, ok, but only for the load case, not for the store one. But we already treat an ordinary `load <4 x i32>` as a vectorized entry, so what is the difference with `@llvm.masked.gather <4 x i32>`? I see a consistent symmetry here:
ordinary load of vector <-> ordinary store of vector
          |                            |
  gather of vector      <-> scatter of vector

ABataev added inline comments.Nov 5 2020, 9:23 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2866	Because the model is not based on the final outcome. Plus, you missed the real gather of addresses. Did you think about adding a new gather tree entry for the addresses? Looks like you missed the cost of the address gather.

anton-afanasyev marked 2 inline comments as done.Nov 5 2020, 10:00 AM

anton-afanasyev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2866	Why is the model not based on the final outcome? The cost is a separate issue -- I can fix this in `getEntryCost()`.

ABataev added inline comments.Nov 5 2020, 10:04 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2866	Because later this code can be transformed into something else anyway; for example, a gather can be transformed into a single instruction. The model itself is based on the scalars and how they are transformed into a vector form. I would suggest just adding 2 tree entries in this case: one for the new scatter vectorization form and one for the gathering of addresses. It would simplify handling of this in the future and won't break the model. Plus, you would not need to add a new cost calculation and could rely on the existing cost model for the gather of addresses.

anton-afanasyev marked 2 inline comments as done.Nov 5 2020, 10:43 AM

anton-afanasyev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2866	Sounds good, thanks!

Added second tree entry for addresses
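
The two-entry modeling agreed on above can be sketched with simplified stand-in types. Only the `EntryState` enumerator names come from the patch; `MiniTreeEntry`, `buildScatteredLoadEntries`, and the `UserIdx` field are hypothetical and exist only to show the shape of the idea: one `ScatterVectorize` entry for the loads, plus a `NeedToGather` operand entry for their addresses.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-ins for the SLPVectorizer types. Only the enumerator
// names mirror the patch; everything else is hypothetical.
enum class EntryState { Vectorize, ScatterVectorize, NeedToGather };

struct MiniTreeEntry {
  EntryState State;
  std::vector<std::string> Scalars; // Names of the bundled scalar values.
  int UserIdx;                      // Index of the using entry, -1 if none.
};

// Model a bundle of loads whose addresses are not consecutive: one entry
// for the loads themselves, one operand entry for the gathered addresses.
std::vector<MiniTreeEntry>
buildScatteredLoadEntries(const std::vector<std::string> &Loads,
                          const std::vector<std::string> &Addrs) {
  assert(Loads.size() == Addrs.size() && "one address per load");
  std::vector<MiniTreeEntry> Tree;
  Tree.push_back({EntryState::ScatterVectorize, Loads, -1});
  Tree.push_back({EntryState::NeedToGather, Addrs, 0}); // Used by entry 0.
  return Tree;
}
```

Because the address entry is an ordinary gather node, its cost can come from the existing gather cost model, which is the benefit ABataev points out.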

Harbormaster completed remote builds in B77939: Diff 303556.Nov 6 2020, 3:18 PM

In D90445#2380023, @anton-afanasyev wrote:

Added second tree entry for addresses

Yep, this looks much better now. What about adding new vectorization kind for scatter-gather?

dtemirbulatov added inline comments.Nov 8 2020, 8:31 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1583	Just to avoid any confusion, I think it is better to change the name to IsScatterGatherOps?

anton-afanasyev marked an inline comment as done.Nov 9 2020, 1:40 AM

anton-afanasyev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1583	The current `IsScatteredOps` means that the _operands_ of a load/store are _scattered_, so we need either to gather or to scatter them. I'm going to change the name to `IsScatteredOpds` to avoid confusion with `Operations`.

Renamed IsScatteredOps to IsScatteredOpds

In D90445#2380322, @ABataev wrote:

In D90445#2380023, @anton-afanasyev wrote:

Added second tree entry for addresses

Yep, this looks much better now. What about adding new vectorization kind for scatter-gather?

What do you mean by "vectorization kind"? Vectorization is differentiated by bundle instructions opcode.

Harbormaster completed remote builds in B78079: Diff 303782.Nov 9 2020, 2:43 AM

In D90445#2382105, @anton-afanasyev wrote:

In D90445#2380322, @ABataev wrote:

In D90445#2380023, @anton-afanasyev wrote:

Added second tree entry for addresses

Yep, this looks much better now. What about adding new vectorization kind for scatter-gather?

What do you mean by "vectorization kind"? Vectorization is differentiated by bundle instructions opcode.

I meant adding a new vectorization kind for this node, something like Scatter or ScatterVectorized. We discussed before that this new boolean field is going to be unused in many cases, and that it is better to introduce a new vectorization id (extend the EntryState enum).

Use of EntryState

I meant adding a new vectorization kind for this node, something like Scatter or ScatterVectorized. We discussed before that this new boolean field is going to be unused in many cases, and that it is better to introduce a new vectorization id (extend the EntryState enum).

Ok, changed. Do you think this is better now?

ABataev added inline comments.Nov 9 2020, 9:15 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2868	Better to modify `Bundle` parameter in `newTreeEntry` function, something like `Optional<std::pair<EntryState, Bundle>>` and pass `EntryState` directly at function call. You can implement it as a separate NFC patch.
3720	Add assert here for `E->State == TreeEntry::ScatterVectorize`
4562	Add assert here for `E->State == TreeEntry::ScatterVectorize`

Added asserts

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2868	Ok, I can fix it in a separate patch. But now I doubt that initial memory optimization with `IsScatteredOpds` elimination was worse than this final `EntryState` expansion requiring changes for all `newTreeEntry()` calls.

ABataev added inline comments.Nov 9 2020, 9:50 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2868	Yes, it still will be better. Here you will allocate the value on the stack, which will be cleared after function termination. With an extra data member each TreeEntry will retain it till the end of life of the whole tree.

Harbormaster completed remote builds in B78142: Diff 303897.Nov 9 2020, 9:53 AM

Assert fix

anton-afanasyev marked an inline comment as done.Nov 9 2020, 10:04 AM

anton-afanasyev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2868	Ok

Harbormaster completed remote builds in B78148: Diff 303909.Nov 9 2020, 10:08 AM

Harbormaster completed remote builds in B78150: Diff 303916.Nov 9 2020, 10:52 AM

Fixed asserts again

Small fix of debug printer

Harbormaster completed remote builds in B78572: Diff 304737.Nov 12 2020, 1:39 AM

Also, please add ScatterVectorize to TreeEntry.dump()

In D90445#2391375, @dtemirbulatov wrote:

Also, please add ScatterVectorize to TreeEntry.dump()

It's already here, line 1704.

Looks good, any other objections?

This revision is now accepted and ready to land.Nov 14 2020, 6:57 AM

LGTM with one minor

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1555–1558	Minor - please add comment explaining the purpose of each enum

ABataev added inline comments.Nov 14 2020, 9:32 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2868	Looks like you still do not pass the vectorization state as an argument. I thought we agreed to modify `Bundle` related param to be an `Optional` of EntryState and Bundle types? Better to create the entry with the correctly set inner state rather than modify them outside of the class. It breaks encapsulation.

Add comment

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1555–1558	Thank you, done!
2868	Sorry, missed this comment. Yes, we agreed to do this, but as a separate NFC patch.
3725	In general, yes, we can end up with reordering, but this could be optimized with preliminary shuffling of the pointer vector. Though I think it's a rather rare case (load-gather and reordering occurring together), I can prepare a separate patch to optimize this after investigating its impact.

ABataev added inline comments.Nov 14 2020, 9:59 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2868	Yes, thanks. Could you prepare this NFC patch before committing this patch?

Harbormaster completed remote builds in B78857: Diff 305321.Nov 14 2020, 11:15 AM

Added optional EntryState param to newTreeEntry()

anton-afanasyev marked an inline comment as done.Nov 14 2020, 9:38 PM

anton-afanasyev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2868	Hmm, I've decided to get rid of the scary `make_pair` inside all `newTreeEntry()` calls, leaving most of the code untouched. So I added an optional `EntryState` param to the `newTreeEntry()` function. The changes are minimal, so there is no need for a separate patch. Is it good now?

anton-afanasyev marked an inline comment as done.Nov 14 2020, 9:38 PM

Harbormaster completed remote builds in B78872: Diff 305343.Nov 14 2020, 10:06 PM

RKSimon mentioned this in rG119e4550dded: [SLPVectorizer][X86] Remove unused check-prefixes.Nov 15 2020, 12:57 AM

Fix tests after "--check-prefixes" change (D90445) and fix one assert

Harbormaster completed remote builds in B78878: Diff 305350.Nov 15 2020, 5:42 AM

LGTM with one minor

llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
6	Sorry, I think you need to remove the 'CHECK' from all of these and just use SSE/AVX

Fix test, CHECK prefix is unused

llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
6	Ok, done

anton-afanasyev edited the summary of this revision.Nov 15 2020, 7:18 AM

Harbormaster completed remote builds in B78884: Diff 305359.Nov 15 2020, 7:44 AM

ABataev added inline comments.Nov 16 2020, 8:51 AM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1753–1769	It may conflict with the `State` param value and may lead to incorrect understanding/decisions in the future. That's why I proposed to merge `EntryState` and `Bundle` into one `Optional<std::pair<>>`, though this solution is also not quite optimal. Maybe it is better to split this function into 2 overloaded copies, one for the gather state and one for the vectorized state?

anton-afanasyev marked an inline comment as done.Nov 16 2020, 1:19 PM

anton-afanasyev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1753–1769	Would it be better to leave all as it was before and just add `setEntryState()` method to accurately set `ScatterVectorize` (or other state in future)?

anton-afanasyev marked an inline comment as done.Nov 16 2020, 1:19 PM

ABataev added inline comments.Nov 16 2020, 1:25 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1753–1769	Nah, it is similar to what was there before. It is better to set the correct state when constructing the entry than to construct it with an incorrect flag and then try to fix it on the fly. That is a bad design decision and always a source of bugs.

Add overloaded newTreeEntry() for common case

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1753–1769	Ok, thanks! So I created overloaded copy of `newTreeEntry()` especially for `EntryState` given explicitly. Is it good now?

ABataev added inline comments.Nov 16 2020, 2:29 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1755	I think it will crash if Bundle is `None`
1761	Better to make `Optional<std::pair<EntryState, ScheduleData *>>`. I would add `assert((!StateAndBundle || StateAndBundle.getValue().first != NeedToGather) && "...");`

Harbormaster completed remote builds in B79005: Diff 305595.Nov 16 2020, 2:52 PM

anton-afanasyev marked 2 inline comments as done.Nov 16 2020, 3:22 PM

anton-afanasyev added inline comments.

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1755	Thank you, fixed!
1761	I tried to get rid of `std::pair` again, now looks better, doesn't it? Thanks, added.

anton-afanasyev marked 2 inline comments as done.Nov 16 2020, 3:22 PM

Fixes and assert

ABataev added inline comments.Nov 16 2020, 3:36 PM

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1766	What if `!Bundle && EntryState != NeedToGather`, i.e. trying to vectorize gather entry?

Harbormaster completed remote builds in B79021: Diff 305614.Nov 16 2020, 3:44 PM

Added another assert

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1766	Thanks, added second assert.
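
The overload design discussed in this thread can be sketched as follows. This is a minimal model under stated assumptions, not the committed code: `MiniEntry` and `isConsistent` are hypothetical names, `ScheduleData` is an empty placeholder, and a nullable pointer stands in for the `Optional<ScheduleData *>` bundle of the real function (a null pointer plays the role of `None`). The two asserts requested above are folded into one consistency check.

```cpp
#include <cassert>

enum class EntryState { Vectorize, ScatterVectorize, NeedToGather };

struct ScheduleData {}; // Placeholder for the real scheduling bundle.

struct MiniEntry {      // Hypothetical, heavily reduced TreeEntry.
  EntryState State;
  const ScheduleData *Bundle;
};

// A bundle must be present for every state except NeedToGather, and must be
// absent for NeedToGather.
inline bool isConsistent(const ScheduleData *Bundle, EntryState State) {
  return Bundle ? State != EntryState::NeedToGather
                : State == EntryState::NeedToGather;
}

// Overload for callers that know the state explicitly (for example
// ScatterVectorize). The entry is created with the right state at once.
MiniEntry newTreeEntry(const ScheduleData *Bundle, EntryState State) {
  assert(isConsistent(Bundle, State) && "state does not match bundle");
  return {State, Bundle};
}

// Overload for the common case: the state is derived from the bundle.
MiniEntry newTreeEntry(const ScheduleData *Bundle) {
  return newTreeEntry(Bundle, Bundle ? EntryState::Vectorize
                                     : EntryState::NeedToGather);
}
```

Setting the state inside the factory keeps the invariant in one place, which is the encapsulation argument made earlier in the review.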

Harbormaster completed remote builds in B79056: Diff 305672.Nov 17 2020, 12:46 AM

This revision was landed with ongoing or failed builds.Nov 17 2020, 7:12 AM

Closed by commit rGfcad8d3635cf: [SLP] Make SLPVectorizer to use `llvm.masked.gather` intrinsic (authored by anton-afanasyev).

This revision was automatically updated to reflect the committed changes.

anton-afanasyev added a commit: rGfcad8d3635cf: [SLP] Make SLPVectorizer to use `llvm.masked.gather` intrinsic.

It looks like this change may cause a crash when building LNT, e.g. http://lab.llvm.org:8011/#/builders/105/builds/1899/steps/7/logs/stdio

In D90445#2399926, @fhahn wrote:

It looks like this change may cause a crash when building LNT, e.g. http://lab.llvm.org:8011/#/builders/105/builds/1899/steps/7/logs/stdio

Thank you, fixing!

Fixed

Saw some random miscompiles after this. Fixed by 4dbe12e86649ba6b5f03a9ba97e84d718727f7a7, can you check if I got it right?

In D90445#2402312, @bkramer wrote:

Saw some random miscompiles after this. Fixed by 4dbe12e86649ba6b5f03a9ba97e84d718727f7a7, can you check if I got it right?

Yes, thanks, looks good!

anton-afanasyev mentioned this in rG6f1c07b23a1c: [SLP][Test] Update pr47269.ll test. NFC.Nov 20 2020, 7:35 AM

anton-afanasyev mentioned this in D91919: [SLP] Make SLPVectorizer to use `llvm.masked.scatter` intrinsic.Nov 21 2020, 11:05 AM

It appears this is causing a ~2-3% regression on AArch64 for some benchmarks, including CINT2000/256.bzip2. Any ideas? It might be caused by underestimating the cost of gather/scatter on AArch64.

In D90445#2411311, @fhahn wrote:

It appears this is causing a ~2-3% regression on AArch64 for some benchmarks, including CINT2000/256.bzip2. Any ideas? It might be caused by underestimating the cost of gather/scatter on AArch64.

Poor target specific costs is likely to be the problem - either for explicit gather intrinsic costs or the scalarization overhead fallback.

If isLegalMaskedGather is returning false (as it will do for AArch64 NEON), then should this even be attempting to use gather/scatter? I would have thought it will never be worth it if it is just going to scalarize again?

In D90445#2411347, @dmgreen wrote:

If isLegalMaskedGather is returning false (as it will do for AArch64 NEON), then should this even be attempting to use gather/scatter? I would have thought it will never be worth it if it is just going to scalarize again?

SLP should always create gathers with constant masks, while isLegalMaskedGather covers the more general variable masks as well. We can always legalize constant mask gathers back to a buildvector or insertelement chain, so it's often worthwhile using the gather intrinsic for that case as well, hence it is important that we correctly handle the scalarization overhead fallback.

I believe this issue is related to the default cost for getGatherScatterOpCost(). For the arch not having gather/scatter instrs we use TargetTransformInfoImplBase::getGatherScatterOpCost() which returns 1 unconditionally: https://github.com/llvm/llvm-project/blob/release/11.x/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h#L480
I'd fix it by setting something like 1024 for the default cost.

In D90445#2411663, @anton-afanasyev wrote:

I believe this issue is related to the default cost for getGatherScatterOpCost(). For the arch not having gather/scatter instrs we use TargetTransformInfoImplBase::getGatherScatterOpCost() which returns 1 unconditionally: https://github.com/llvm/llvm-project/blob/release/11.x/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h#L480
I'd fix it by setting something like 1024 for the default cost.

yes, I managed to track it down to that. I pushed some additional tests and D91984 to add a more realistic estimate for the cost when scalarizing.
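
The problem described here, a flat default cost of 1 for a gather that will actually be scalarized, can be illustrated with a toy cost function. This is a sketch only: it is not the formula from D91984 or from any TTI implementation, and `ScalarCosts`, `gatherCost`, and all the numbers passed to it are hypothetical.

```cpp
#include <cassert>

// Hypothetical per-operation costs. Real values come from the target's TTI.
struct ScalarCosts {
  int Load;         // One scalar load.
  int ExtractAddr;  // Extracting one address from the pointer vector.
  int InsertResult; // Inserting one loaded value into the result vector.
};

// Toy estimate of a gather over NumElts lanes. With a native gather
// instruction the target-provided cost is used; otherwise the gather is
// assumed to be scalarized lane by lane.
int gatherCost(bool HasNativeGather, int NativeCost, int NumElts,
               const ScalarCosts &C) {
  assert(NumElts > 0 && "expected at least one lane");
  if (HasNativeGather)
    return NativeCost;
  return NumElts * (C.ExtractAddr + C.Load + C.InsertResult);
}
```

With unit costs, a scalarized 4-lane gather comes out at 12 instead of 1, so a vectorizer that relies on the estimate no longer sees the scalarized form as nearly free.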

The current SLP has a significant drawback with regard to its cost modeling, and this patch highlights it.
Consider four scalar loads of i8 type. With the prior approach (vectorization overhead), the cost for such an entry was 4 (x86 target).
With the new approach we have two entries instead of one: ScatterVectorize loads + NeedToGather GEPs. The costs for these entries are 6 and 10 respectively, so the cost increased from 4 to 16.
And the problem here is that once we put this pattern into the tree, it pulls the cost up for the entire tree. If we have multiple such patterns over the tree, their effect is magnified. These entries finally outweigh the possible profit of vectorization for the remaining portion of the tree, and we end up not vectorizing it at all (even if downstream optimizations could probably change it into optimal code). If SLP could choose between vectorization overhead and the gather intrinsic based on their costs while building the vectorizable tree, the outcome could be different.
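
The arithmetic in this comment can be worked through with a small sketch. The entry costs 4, 6 and 10 are the ones quoted above; the cost of -10 for the rest of the tree is a made-up number chosen only to show the flip, and the rule "profitable when the total is below zero" assumes a cost threshold of zero. The helper names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Total cost of a tree given the cost of each entry (negative = saving).
int treeCost(const std::vector<int> &EntryCosts) {
  assert(!EntryCosts.empty() && "expected at least one tree entry");
  return std::accumulate(EntryCosts.begin(), EntryCosts.end(), 0);
}

// Assumption: the tree is vectorized only if its total cost is below zero.
bool isProfitable(const std::vector<int> &EntryCosts) {
  return treeCost(EntryCosts) < 0;
}

// The choice suggested above: model the load bundle whichever way is
// cheaper, plain gather or ScatterVectorize plus address gather.
int cheaperLoadBundleCost(int PlainGatherCost, int ScatterVectorizeCost,
                          int AddrGatherCost) {
  return std::min(PlainGatherCost, ScatterVectorizeCost + AddrGatherCost);
}
```

With the hypothetical remainder of -10, the old modeling gives -10 + 4 = -6 (vectorized), while the new one gives -10 + 6 + 10 = 6 (not vectorized); picking the cheaper alternative per bundle would restore the cost of 4.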

In D90445#2411961, @vdmitrie wrote:

The current SLP has a significant drawback with regard to its cost modeling, and this patch highlights it.
Consider four scalar loads of i8 type. With the prior approach (vectorization overhead), the cost for such an entry was 4 (x86 target).
With the new approach we have two entries instead of one: ScatterVectorize loads + NeedToGather GEPs. The costs for these entries are 6 and 10 respectively, so the cost increased from 4 to 16.
And the problem here is that once we put this pattern into the tree, it pulls the cost up for the entire tree. If we have multiple such patterns over the tree, their effect is magnified. These entries finally outweigh the possible profit of vectorization for the remaining portion of the tree, and we end up not vectorizing it at all (even if downstream optimizations could probably change it into optimal code). If SLP could choose between vectorization overhead and the gather intrinsic based on their costs while building the vectorizable tree, the outcome could be different.

Good point, thank you! As you said, this problem is not specific to this patch. One can fix it with a hacky cost comparison at the tree-building stage, but I do believe a more general solution is preferable. Does the patch https://reviews.llvm.org/D57779 (vectorization throttling) fix this? After the greedy strategy of building the maximum tree, we choose the cheapest part of it for vectorization.

Good point, thank you! As you said, this problem is not specific to this patch. One can fix it with a hacky cost comparison at the tree-building stage, but I do believe a more general solution is preferable. Does the patch https://reviews.llvm.org/D57779 (vectorization throttling) fix this? After the greedy strategy of building the maximum tree, we choose the cheapest part of it for vectorization.

No, I think https://reviews.llvm.org/D57779 is about a different thing. Here, we have new functionality which allows us to build the tree with gather-loads; otherwise we just ignore them and thus get a different tree. I am not sure how to handle the case where it accumulates those expensive operations. Maybe guard this new functionality with a flag for now?

Talked with @dtemirbulatov privately and reached a consensus that his patch reviews.llvm.org/D57779 fixes the issue defined above by @vdmitrie in general.

It sounds like the throttling patch should resolve this issue, as cutting out a ScatterVectorize entry with a high cost will effectively return to the previous behavior.

In D90445#2416578, @vdmitrie wrote:

It sounds like the throttling patch should resolve this issue, as cutting out a ScatterVectorize entry with a high cost will effectively return to the previous behavior.

Yes, exactly. The only difference from the previous behavior could arise if the new tree accumulates other instructions starting from the ScatterVectorize and NeedToGather GEP entries, preventing them from being contained in other parts of the tree. But these entries are terminal, with such rare speculative exceptions that I believe this is a good heuristic for this case, as well as for the general SLP drawback you mentioned: build the maximum tree and choose the cheapest subtree.

It looks like this commit may be causing a mis-compile: https://bugs.llvm.org/show_bug.cgi?id=50015

spatel mentioned this in D101297: [SLP]Allow masked gathers only if allowed by target..Apr 27 2021, 6:03 AM

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp	89 lines
llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll	65 lines
llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll	40 lines
llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll	98 lines

Diff 305785

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

[...]	struct TreeEntry {
}		}

/// A vector of scalars.		/// A vector of scalars.
ValueList Scalars;		ValueList Scalars;

/// The Scalars are vectorized into this value. It is initialized to Null.		/// The Scalars are vectorized into this value. It is initialized to Null.
Value *VectorizedValue = nullptr;		Value *VectorizedValue = nullptr;

/// Do we need to gather this sequence ?		/// Do we need to gather this sequence or vectorize it
enum EntryState { Vectorize, NeedToGather };		/// (either with vector instruction or with scatter/gather
		/// intrinsics for store/load)?
		enum EntryState { Vectorize, ScatterVectorize, NeedToGather };
EntryState State;		EntryState State;

/// Does this sequence require some shuffling?		/// Does this sequence require some shuffling?
SmallVector<int, 4> ReuseShuffleIndices;		SmallVector<int, 4> ReuseShuffleIndices;

/// Does this entry require reordering?		/// Does this entry require reordering?
SmallVector<unsigned, 4> ReorderIndices;		SmallVector<unsigned, 4> ReorderIndices;

/// Points back to the VectorizableTree.		/// Points back to the VectorizableTree.
///		///
/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has		/// Only used for Graphviz right now. Unfortunately GraphTrait::NodeRef has
/// to be a pointer and needs to be able to initialize the child iterator.		/// to be a pointer and needs to be able to initialize the child iterator.
/// Thus we need a reference back to the container to translate the indices		/// Thus we need a reference back to the container to translate the indices
/// to entries.		/// to entries.
VecTreeTy &Container;		VecTreeTy &Container;

/// The TreeEntry index containing the user of this entry. We can actually		/// The TreeEntry index containing the user of this entry. We can actually
/// have multiple users so the data structure is not truly a tree.		/// have multiple users so the data structure is not truly a tree.
SmallVector<EdgeInfo, 1> UserTreeIndices;		SmallVector<EdgeInfo, 1> UserTreeIndices;

/// The index of this treeEntry in VectorizableTree.		/// The index of this treeEntry in VectorizableTree.
int Idx = -1;		int Idx = -1;

private:		private:
		RKSimon: scattered.
		anton-afanasyev: Ok
/// The operands of each instruction in each lane Operands[op_index][lane].		/// The operands of each instruction in each lane Operands[op_index][lane].
/// Note: This helps avoid the replication of the code that performs the		/// Note: This helps avoid the replication of the code that performs the
/// reordering of operands during buildTree_rec() and vectorizeTree().		/// reordering of operands during buildTree_rec() and vectorizeTree().
SmallVector<ValueList, 2> Operands;		SmallVector<ValueList, 2> Operands;

/// The main/alternate instruction.		/// The main/alternate instruction.
Instruction *MainOp = nullptr;		Instruction *MainOp = nullptr;
Instruction *AltOp = nullptr;		Instruction *AltOp = nullptr;

[... 106 lines hidden ...]	LLVM_DUMP_METHOD void dump() const {
dbgs() << "Scalars: \n";		dbgs() << "Scalars: \n";
for (Value *V : Scalars)		for (Value *V : Scalars)
dbgs().indent(2) << *V << "\n";		dbgs().indent(2) << *V << "\n";
dbgs() << "State: ";		dbgs() << "State: ";
switch (State) {		switch (State) {
case Vectorize:		case Vectorize:
dbgs() << "Vectorize\n";		dbgs() << "Vectorize\n";
break;		break;
		case ScatterVectorize:
		dbgs() << "ScatterVectorize\n";
		break;
case NeedToGather:		case NeedToGather:
dbgs() << "NeedToGather\n";		dbgs() << "NeedToGather\n";
break;		break;
}		}
dbgs() << "MainOp: ";		dbgs() << "MainOp: ";
if (MainOp)		if (MainOp)
dbgs() << *MainOp << "\n";		dbgs() << *MainOp << "\n";
else		else
[... 28 lines hidden ...]	#endif
};		};

/// Create a new VectorizableTree entry.		/// Create a new VectorizableTree entry.
TreeEntry *newTreeEntry(ArrayRef<Value *> VL, Optional<ScheduleData *> Bundle,		TreeEntry *newTreeEntry(ArrayRef<Value *> VL, Optional<ScheduleData *> Bundle,
const InstructionsState &S,		const InstructionsState &S,
const EdgeInfo &UserTreeIdx,		const EdgeInfo &UserTreeIdx,
ArrayRef<unsigned> ReuseShuffleIndices = None,		ArrayRef<unsigned> ReuseShuffleIndices = None,
ArrayRef<unsigned> ReorderIndices = None) {		ArrayRef<unsigned> ReorderIndices = None) {
bool Vectorized = (bool)Bundle;		TreeEntry::EntryState EntryState =
		Bundle ? TreeEntry::Vectorize : TreeEntry::NeedToGather;
		return newTreeEntry(VL, EntryState, Bundle, S, UserTreeIdx,
		ABataev: I think it will crash if Bundle is `None`.
		anton-afanasyev (author): Thank you, fixed!
		ReuseShuffleIndices, ReorderIndices);
		}

		TreeEntry *newTreeEntry(ArrayRef<Value *> VL,
		TreeEntry::EntryState EntryState,
		Optional<ScheduleData *> Bundle,
		ABataev: 1. Better to make `Optional<std::pair<EntryState, ScheduleData *>>`. 2. I would add `assert(!StateAndBundle || StateAndBundle.getValue().first != NeedToGather && "...");`
		anton-afanasyev (author): 1. I tried to get rid of `std::pair` again, now looks better, doesn't it? 2. Thanks, added.
		const InstructionsState &S,
		const EdgeInfo &UserTreeIdx,
		ArrayRef<unsigned> ReuseShuffleIndices = None,
		ArrayRef<unsigned> ReorderIndices = None) {
		assert(!(Bundle && EntryState == TreeEntry::NeedToGather) &&
		ABataev: What if `!Bundle && EntryState != NeedToGather`, i.e. trying to vectorize a gather entry?
		anton-afanasyev (author): Thanks, added a second assert.
		"Need to gather vectorized entry?");
		assert(!(!Bundle && EntryState != TreeEntry::NeedToGather) &&
		"Need to vectorize gather entry?");
		ABataev: It may conflict with the `State` param value and may lead to incorrect understanding/decisions in future. That's why I proposed to merge `EntryState` and `Bundle` into one `Optional<std::pair<>>`, though this solution is also not quite optimal. Maybe better to split this function into 2 overloaded copies, one for the gather state and one for the vectorized state?
		anton-afanasyev (author): Would it be better to leave everything as it was before and just add a `setEntryState()` method to accurately set `ScatterVectorize` (or another state in future)?
		ABataev: Nah, it is similar to what was before. Better to set the correct state when constructing the entry rather than construct it with the incorrect flag and then try to fix it on the fly. That is a bad design decision and always a source of bugs.
		anton-afanasyev (author): Ok, thanks! So I created an overloaded copy of `newTreeEntry()` specifically for an explicitly given `EntryState`. Is it good now?
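The invariant this thread converges on can be sketched outside LLVM. The following is a minimal illustration only: `EntryState`, `ScheduleData`, `isConsistent` and `deduceState` are hypothetical stand-ins written for this example (using `std::optional` rather than `llvm::Optional`), not the definitions from `SLPVectorizer.cpp`. It assumes the rule discussed above: a `NeedToGather` entry has no scheduling bundle, and every other state requires one.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-ins for the SLP vectorizer types; not the LLVM definitions.
enum class EntryState { Vectorize, ScatterVectorize, NeedToGather };
struct ScheduleData {};

// True when the (state, bundle) pair is consistent: gather entries carry no
// scheduling bundle, every other state requires one.
inline bool isConsistent(EntryState State,
                         std::optional<ScheduleData *> Bundle) {
  const bool HasBundle = Bundle.has_value();
  return HasBundle == (State != EntryState::NeedToGather);
}

// Derives the state from the bundle alone, mirroring the overload that takes
// no explicit state: a bundle means Vectorize, no bundle means NeedToGather.
inline EntryState deduceState(std::optional<ScheduleData *> Bundle) {
  return Bundle ? EntryState::Vectorize : EntryState::NeedToGather;
}
```

With this shape, a caller that wants `ScatterVectorize` has to pass the state explicitly together with a bundle, while existing callers keep the deduced two-state behaviour.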
VectorizableTree.push_back(std::make_unique<TreeEntry>(VectorizableTree));		VectorizableTree.push_back(std::make_unique<TreeEntry>(VectorizableTree));
TreeEntry *Last = VectorizableTree.back().get();		TreeEntry *Last = VectorizableTree.back().get();
Last->Idx = VectorizableTree.size() - 1;		Last->Idx = VectorizableTree.size() - 1;
Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());		Last->Scalars.insert(Last->Scalars.begin(), VL.begin(), VL.end());
Last->State = Vectorized ? TreeEntry::Vectorize : TreeEntry::NeedToGather;		Last->State = EntryState;
Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),		Last->ReuseShuffleIndices.append(ReuseShuffleIndices.begin(),
ReuseShuffleIndices.end());		ReuseShuffleIndices.end());
Last->ReorderIndices.append(ReorderIndices.begin(), ReorderIndices.end());		Last->ReorderIndices.append(ReorderIndices.begin(), ReorderIndices.end());
Last->setOperations(S);		Last->setOperations(S);
if (Vectorized) {		if (Last->State != TreeEntry::NeedToGather) {
for (Value *V : VL) {		for (Value *V : VL) {
assert(!getTreeEntry(V) && "Scalar already in tree!");		assert(!getTreeEntry(V) && "Scalar already in tree!");
ScalarToTreeEntry[V] = Last;		ScalarToTreeEntry[V] = Last;
}		}
// Update the scheduler bundle to point to this TreeEntry.		// Update the scheduler bundle to point to this TreeEntry.
unsigned Lane = 0;		unsigned Lane = 0;
for (ScheduleData *BundleMember = Bundle.getValue(); BundleMember;		for (ScheduleData *BundleMember = Bundle.getValue(); BundleMember;
BundleMember = BundleMember->NextInBundle) {		BundleMember = BundleMember->NextInBundle) {
[... 1,069 lines hidden ...]	case Instruction::Load: {
ReuseShuffleIndicies, CurrentOrder);		ReuseShuffleIndicies, CurrentOrder);
TE->setOperandsInOrder();		TE->setOperandsInOrder();
LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled loads.\n");		LLVM_DEBUG(dbgs() << "SLP: added a vector of jumbled loads.\n");
findRootOrder(CurrentOrder);		findRootOrder(CurrentOrder);
++NumOpsWantToKeepOrder[CurrentOrder];		++NumOpsWantToKeepOrder[CurrentOrder];
}		}
return;		return;
}		}
		// Vectorizing non-consecutive loads with `llvm.masked.gather`.
		TreeEntry *TE = newTreeEntry(VL, TreeEntry::ScatterVectorize, Bundle, S,
		RKSimon: 'llvm.masked.gather'.
		anton-afanasyev (author): Ok
		ABataev: Why is it `vectorized` if this is actually a kind of gather?
		anton-afanasyev (author): Actually I think `vectorized` is more appropriate here than `gathered`, in the sense it is used throughout the SLPVectorizer module, since we actually turn several instructions into _one_. I can change it to `scatter-vectorized`.
		ABataev: Could you give a bit more explanation why it should be treated as `vectorized`?
		anton-afanasyev (author): Ok. For now, `enum EntryState {Vectorize, NeedToGather}` has two states: the first is for a bundle that is to be vectorized, the second for one that is to be gathered. But here "gathered" means the bundle stays untouched after tree vectorization: we do not need to replace several scalar instructions with one vector instruction, and we do not need to handle the users of gathered instructions. For the "scattered" entry we have the opposite case: several scalar instructions are to be replaced with the `@llvm.masked.gather.*` and `@llvm.masked.scatter.*` intrinsics, and we need to handle their users (at least for the `store` = `@llvm.masked.scatter.*` case, which is the next patch and also uses the `IsScattered` field). Am I clear?
		ABataev: Shall we handle the users for loads?
		anton-afanasyev (author): No, only for stores, of course. Vectorized `load`s are leaves of the vectorized tree, whereas stores are seed points. Scattered stores can also be "vectorized" in the sense of being replaced with a single intrinsic.
		ABataev: Then for loads it is more just a kind of gather, rather than vectorize. Is it required to mark the scalars as vectorized, or is it enough to mark them as gathered? Do we need all this stuff that is required for the vectorized instructions, like inorder, etc.?
		anton-afanasyev (author): Yes, we need this stuff: at least separate cost estimation and scheduling. We treat this as a generalization of vectorization: like `load <4 x i32>`, `@llvm.masked.gather` loads into `<4 x i32>`, for instance.
		ABataev: Why can it not be modeled via gather? We gather the scalar loads here, but into a different form. Instead of a direct gather we just gather the addresses and then do a load. But this is still a kind of gather. Do you need to continue scheduling here? No, you don't; you just need to extend the `Gather` function to generate a gather of addresses + an `llvm.masked.gather` call.
		anton-afanasyev (author): It can be modeled via gather, ok, but only for the load case, not for the store one. But we already treat an ordinary `load <4 x i32>` as a vectorized entry, so what is the difference with `@llvm.masked.gather <4 x i32>`? I see a consistent symmetry here:
		    ordinary load of vector <-> ordinary store of vector
		              |                          |
		    gather of vector        <-> scatter of vector
		ABataev: Because the model is not based on the final outcome. Plus, you missed the real gather of addresses. Did you think about adding a new gather tree entry for addresses? Looks like you missed the cost for the address gather.
		anton-afanasyev (author): Why is the model not based on the final outcome? The cost is a separate issue; I can fix this in `getEntryCost()`.
		ABataev: Because later this code can be transformed into something else anyway; for example, the gather can be transformed into a single instruction. The model itself is based on the scalars and how they are transformed into a vector form. I would suggest just adding 2 tree entries in this case: one for the new scatter vectorization form and one for the gathering of addresses. It would simplify handling of this in future and won't break the model. Plus, you would not need to add a new cost calculation and could rely on the existing cost model for the gather of addresses.
		anton-afanasyev (author): Sounds good, thanks!
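For readers unfamiliar with the intrinsic under discussion, the lane-wise behaviour of a masked gather can be emulated in scalar C++. This is a sketch of the semantics only, not code from the patch; `maskedGather` is a hypothetical helper name, and fixed-size `std::array`s stand in for LLVM vector types. Each lane reads through its own pointer when its mask bit is set and takes the pass-through value otherwise.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar emulation of a fixed-width masked gather: lane I reads *Ptrs[I] when
// Mask[I] is set and takes PassThru[I] otherwise. Pointers of disabled lanes
// are never dereferenced.
template <typename T, std::size_t N>
std::array<T, N> maskedGather(const std::array<const T *, N> &Ptrs,
                              const std::array<bool, N> &Mask,
                              const std::array<T, N> &PassThru) {
  std::array<T, N> Result{};
  for (std::size_t I = 0; I < N; ++I)
    Result[I] = Mask[I] ? *Ptrs[I] : PassThru[I];
  return Result;
}
```

In the two-entry model agreed on above, the pointer array corresponds to the separate tree entry that gathers the addresses, and the call itself corresponds to the `ScatterVectorize` load entry. The patch always uses an all-true mask, so the pass-through operand is irrelevant there.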
		UserTreeIdx, ReuseShuffleIndicies);
		TE->setOperandsInOrder();
		ABataev: Better to modify the `Bundle` parameter in the `newTreeEntry` function, something like `Optional<std::pair<EntryState, Bundle>>`, and pass `EntryState` directly at the function call. You can implement it as a separate NFC patch.
		anton-afanasyev (author): Ok, I can fix it in a separate patch. But now I doubt that the initial memory optimization of eliminating `IsScatteredOpds` was worth this final `EntryState` expansion, which requires changes to all `newTreeEntry()` calls.
		ABataev: Yes, it still will be better. Here you will allocate the value on the stack, which will be cleared after function termination. With an extra data member each TreeEntry will retain it till the end of life of the whole tree.
		anton-afanasyev (author): Ok
		ABataev: Looks like you still do not pass the vectorization state as an argument. I thought we agreed to modify the `Bundle`-related param to be an `Optional` of the EntryState and Bundle types? Better to create the entry with the correctly set inner state rather than modify it outside of the class. It breaks encapsulation.
		anton-afanasyev (author): Sorry, missed this comment. Yes, we agreed to do this, but as a separate NFC patch.
		ABataev: Yes, thanks. Could you prepare this NFC patch before committing this patch?
		anton-afanasyev (author): Hmm, I've decided to get rid of the scary `make_pair` inside all `newTreeEntry()` calls, leaving most of the code untouched. So I added an optional `EntryState` param to the `newTreeEntry()` function. Minimal changes, so no need for a separate patch. Is it good now?
		buildTree_rec(PointerOps, Depth + 1, {TE, 0});
		LLVM_DEBUG(dbgs() << "SLP: added a vector of non-consecutive loads.\n");
		return;
}		}

LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");		LLVM_DEBUG(dbgs() << "SLP: Gathering non-consecutive loads.\n");
BS.cancelScheduling(VL, VL0);		BS.cancelScheduling(VL, VL0);
newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,		newTreeEntry(VL, None /*not vectorized*/, S, UserTreeIdx,
ReuseShuffleIndicies);		ReuseShuffleIndicies);
return;		return;
}		}
[... 570 lines hidden ...]	if (E->getOpcode() == Instruction::ExtractElement &&
IO->getZExtValue());		IO->getZExtValue());
}		}
}		}
return ReuseShuffleCost + Cost;		return ReuseShuffleCost + Cost;
}		}
}		}
return ReuseShuffleCost + getGatherCost(VL);		return ReuseShuffleCost + getGatherCost(VL);
}		}
assert(E->State == TreeEntry::Vectorize && "Unhandled state");		assert((E->State == TreeEntry::Vectorize ||
		E->State == TreeEntry::ScatterVectorize) &&
		"Unhandled state");
assert(E->getOpcode() && allSameType(VL) && allSameBlock(VL) && "Invalid VL");		assert(E->getOpcode() && allSameType(VL) && allSameBlock(VL) && "Invalid VL");
Instruction *VL0 = E->getMainOp();		Instruction *VL0 = E->getMainOp();
unsigned ShuffleOrOp =		unsigned ShuffleOrOp =
E->isAltShuffle() ? (unsigned)Instruction::ShuffleVector : E->getOpcode();		E->isAltShuffle() ? (unsigned)Instruction::ShuffleVector : E->getOpcode();
switch (ShuffleOrOp) {		switch (ShuffleOrOp) {
case Instruction::PHI:		case Instruction::PHI:
return 0;		return 0;

[... 238 lines hidden ...]	case Instruction::Load: {
Align alignment = cast<LoadInst>(VL0)->getAlign();		Align alignment = cast<LoadInst>(VL0)->getAlign();
int ScalarEltCost =		int ScalarEltCost =
TTI->getMemoryOpCost(Instruction::Load, ScalarTy, alignment, 0,		TTI->getMemoryOpCost(Instruction::Load, ScalarTy, alignment, 0,
CostKind, VL0);		CostKind, VL0);
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
ReuseShuffleCost -= (ReuseShuffleNumbers - VL.size()) * ScalarEltCost;		ReuseShuffleCost -= (ReuseShuffleNumbers - VL.size()) * ScalarEltCost;
}		}
int ScalarLdCost = VecTy->getNumElements() * ScalarEltCost;		int ScalarLdCost = VecTy->getNumElements() * ScalarEltCost;
int VecLdCost =		int VecLdCost;
TTI->getMemoryOpCost(Instruction::Load, VecTy, alignment, 0,		if (E->State == TreeEntry::Vectorize) {
		VecLdCost = TTI->getMemoryOpCost(Instruction::Load, VecTy, alignment, 0,
CostKind, VL0);		CostKind, VL0);
		} else {
		assert(E->State == TreeEntry::ScatterVectorize && "Unknown EntryState");
		ABataev: Add an assert here for `E->State == TreeEntry::ScatterVectorize`.
		VecLdCost = TTI->getGatherScatterOpCost(
		Instruction::Load, VecTy, cast<LoadInst>(VL0)->getPointerOperand(),
		ABataev: Comment for the `false` argument.
		anton-afanasyev (author): Ok.
		/*VariableMask=*/false, alignment, CostKind, VL0);
		}
if (!E->ReorderIndices.empty()) {		if (!E->ReorderIndices.empty()) {
		RKSimon: Do we have cases where we end up shuffling scattered ops?
		anton-afanasyev (author): In general, yes, we can end up with reordering, but this could be optimized by shuffling the pointer vector beforehand. Though I think it's a rather rare case (a load-gather and a reordering meeting together), I can prepare a separate patch to optimize this after investigating its impact.
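The optimization hinted at in this reply relies on a simple identity: permuting the lanes of a gathered vector gives the same result as permuting the pointer vector first and then gathering. The sketch below demonstrates that identity with plain arrays; `permute` and `gather` are hypothetical helpers written for this illustration and say nothing about how the follow-up patch would implement it or what the shuffle would cost on a given target.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Single-source permutation: Out[I] = In[Mask[I]].
template <typename T, std::size_t N>
std::array<T, N> permute(const std::array<T, N> &In,
                         const std::array<std::size_t, N> &Mask) {
  std::array<T, N> Out{};
  for (std::size_t I = 0; I < N; ++I)
    Out[I] = In[Mask[I]];
  return Out;
}

// Unmasked gather: lane I reads *Ptrs[I].
template <typename T, std::size_t N>
std::array<T, N> gather(const std::array<const T *, N> &Ptrs) {
  std::array<T, N> Out{};
  for (std::size_t I = 0; I < N; ++I)
    Out[I] = *Ptrs[I];
  return Out;
}
```

Because the two orders agree lane by lane, a reorder shuffle applied to the loaded values could in principle be folded into the construction of the address vector, which is already built element by element.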
// TODO: Merge this shuffle with the ReuseShuffleCost.		// TODO: Merge this shuffle with the ReuseShuffleCost.
VecLdCost += TTI->getShuffleCost(		VecLdCost += TTI->getShuffleCost(
TargetTransformInfo::SK_PermuteSingleSrc, VecTy);		TargetTransformInfo::SK_PermuteSingleSrc, VecTy);
}		}
return ReuseShuffleCost + VecLdCost - ScalarLdCost;		return ReuseShuffleCost + VecLdCost - ScalarLdCost;
}		}
case Instruction::Store: {		case Instruction::Store: {
// We know that we can merge the stores. Calculate the cost.		// We know that we can merge the stores. Calculate the cost.
[... 574 lines hidden ...]	if (NeedToShuffleReuses) {
GatherSeq.insert(I);		GatherSeq.insert(I);
CSEBlocks.insert(I->getParent());		CSEBlocks.insert(I->getParent());
}		}
}		}
E->VectorizedValue = Vec;		E->VectorizedValue = Vec;
return Vec;		return Vec;
}		}

assert(E->State == TreeEntry::Vectorize && "Unhandled state");		assert((E->State == TreeEntry::Vectorize ||
		E->State == TreeEntry::ScatterVectorize) &&
		"Unhandled state");
unsigned ShuffleOrOp =		unsigned ShuffleOrOp =
E->isAltShuffle() ? (unsigned)Instruction::ShuffleVector : E->getOpcode();		E->isAltShuffle() ? (unsigned)Instruction::ShuffleVector : E->getOpcode();
Instruction *VL0 = E->getMainOp();		Instruction *VL0 = E->getMainOp();
Type *ScalarTy = VL0->getType();		Type *ScalarTy = VL0->getType();
if (auto *Store = dyn_cast<StoreInst>(VL0))		if (auto *Store = dyn_cast<StoreInst>(VL0))
ScalarTy = Store->getValueOperand()->getType();		ScalarTy = Store->getValueOperand()->getType();
auto *VecTy = FixedVectorType::get(ScalarTy, E->Scalars.size());		auto *VecTy = FixedVectorType::get(ScalarTy, E->Scalars.size());
switch (ShuffleOrOp) {		switch (ShuffleOrOp) {
[... 212 lines hidden ...]	case Instruction::Load: {
// Loads are inserted at the head of the tree because we don't want to		// Loads are inserted at the head of the tree because we don't want to
// sink them all the way down past store instructions.		// sink them all the way down past store instructions.
bool IsReorder = E->updateStateIfReorder();		bool IsReorder = E->updateStateIfReorder();
if (IsReorder)		if (IsReorder)
VL0 = E->getMainOp();		VL0 = E->getMainOp();
setInsertPointAfterBundle(E);		setInsertPointAfterBundle(E);

LoadInst *LI = cast<LoadInst>(VL0);		LoadInst *LI = cast<LoadInst>(VL0);
		Instruction *NewLI;
unsigned AS = LI->getPointerAddressSpace();		unsigned AS = LI->getPointerAddressSpace();
		Value *PO = LI->getPointerOperand();
		if (E->State == TreeEntry::Vectorize) {

Value *VecPtr = Builder.CreateBitCast(LI->getPointerOperand(),		Value *VecPtr = Builder.CreateBitCast(PO, VecTy->getPointerTo(AS));
VecTy->getPointerTo(AS));

// The pointer operand uses an in-tree scalar so we add the new BitCast to		// The pointer operand uses an in-tree scalar so we add the new BitCast
// ExternalUses list to make sure that an extract will be generated in the		// to ExternalUses list to make sure that an extract will be generated
// future.		// in the future.
Value *PO = LI->getPointerOperand();
if (getTreeEntry(PO))		if (getTreeEntry(PO))
ExternalUses.push_back(ExternalUser(PO, cast<User>(VecPtr), 0));		ExternalUses.emplace_back(PO, cast<User>(VecPtr), 0);

		NewLI = Builder.CreateAlignedLoad(VecTy, VecPtr, LI->getAlign());
		} else {
		assert(E->State == TreeEntry::ScatterVectorize && "Unhandled state");
		ABataev: Add an assert here for `E->State == TreeEntry::ScatterVectorize`.
		Value *VecPtr = vectorizeTree(E->getOperand(0));
		NewLI = Builder.CreateMaskedGather(VecPtr, LI->getAlign());
		}
		ABataev: `auto *`
		anton-afanasyev (author): Ok
		Value *V = propagateMetadata(NewLI, E->Scalars);

		ABataev: `auto *`
		anton-afanasyev (author): `UndefValue::get()` returns the `UndefValue *` type, but we need `Value *` here.
LI = Builder.CreateAlignedLoad(VecTy, VecPtr, LI->getAlign());
Value *V = propagateMetadata(LI, E->Scalars);
if (IsReorder) {		if (IsReorder) {
SmallVector<int, 4> Mask;		SmallVector<int, 4> Mask;
inversePermutation(E->ReorderIndices, Mask);		inversePermutation(E->ReorderIndices, Mask);
V = Builder.CreateShuffleVector(V, Mask, "reorder_shuffle");		V = Builder.CreateShuffleVector(V, Mask, "reorder_shuffle");
}		}
if (NeedToShuffleReuses) {		if (NeedToShuffleReuses) {
// TODO: Merge this shuffle with the ReorderShuffleMask.		// TODO: Merge this shuffle with the ReorderShuffleMask.
		RKSimon: Should this be 0? Isn't it supposed to be the LaneIndex?
		anton-afanasyev (author): Sure, thanks!
		ABataev: `.emplace_back(PO, cast<User>(VecPtr), InsIndex);`
		anton-afanasyev (author): Ok
V = Builder.CreateShuffleVector(V, E->ReuseShuffleIndices, "shuffle");		V = Builder.CreateShuffleVector(V, E->ReuseShuffleIndices, "shuffle");
}		}
E->VectorizedValue = V;		E->VectorizedValue = V;
++NumVectorInstructions;		++NumVectorInstructions;
return V;		return V;
		ABataev: Try to merge this code line with the one in line 4539.
		anton-afanasyev (author): Done.
}		}
case Instruction::Store: {		case Instruction::Store: {
bool IsReorder = !E->ReorderIndices.empty();		bool IsReorder = !E->ReorderIndices.empty();
auto *SI = cast<StoreInst>(		auto *SI = cast<StoreInst>(
IsReorder ? E->Scalars[E->ReorderIndices.front()] : VL0);		IsReorder ? E->Scalars[E->ReorderIndices.front()] : VL0);
unsigned AS = SI->getPointerAddressSpace();		unsigned AS = SI->getPointerAddressSpace();

setInsertPointAfterBundle(E);		setInsertPointAfterBundle(E);
[... 248 lines hidden ...]	for (const auto &ExternalUse : ExternalUses) {
llvm::User *User = ExternalUse.User;		llvm::User *User = ExternalUse.User;

// Skip users that we already RAUW. This happens when one instruction		// Skip users that we already RAUW. This happens when one instruction
// has multiple uses of the same value.		// has multiple uses of the same value.
if (User && !is_contained(Scalar->users(), User))		if (User && !is_contained(Scalar->users(), User))
continue;		continue;
TreeEntry *E = getTreeEntry(Scalar);		TreeEntry *E = getTreeEntry(Scalar);
assert(E && "Invalid scalar");		assert(E && "Invalid scalar");
assert(E->State == TreeEntry::Vectorize && "Extracting from a gather list");		assert(E->State != TreeEntry::NeedToGather &&
		"Extracting from a gather list");

Value *Vec = E->VectorizedValue;		Value *Vec = E->VectorizedValue;
assert(Vec && "Can't find vectorizable value");		assert(Vec && "Can't find vectorizable value");

Value *Lane = Builder.getInt32(ExternalUse.Lane);		Value *Lane = Builder.getInt32(ExternalUse.Lane);
// If User == nullptr, the Scalar is used as extra arg. Generate		// If User == nullptr, the Scalar is used as extra arg. Generate
// ExtractElement instruction and update the record for this scalar in		// ExtractElement instruction and update the record for this scalar in
// ExternallyUsedValues.		// ExternallyUsedValues.
[... 3,006 lines hidden ...]

llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll

	[... 223 lines hidden ...]
	; S[0] S[1]			; S[0] S[1]
	;			;
	; SLP should reorder the operands of the RHS add taking into consideration the cost of external uses.			; SLP should reorder the operands of the RHS add taking into consideration the cost of external uses.
	; It is more profitable to reorder the operands of the RHS add, because A[1] has an external use.			; It is more profitable to reorder the operands of the RHS add, because A[1] has an external use.

	define void @lookahead_external_uses(double* %A, double *%B, double *%C, double *%D, double *%S, double *%Ext1, double *%Ext2) {			define void @lookahead_external_uses(double* %A, double *%B, double *%C, double *%D, double *%S, double *%Ext1, double *%Ext2) {
	; CHECK-LABEL: @lookahead_external_uses(			; CHECK-LABEL: @lookahead_external_uses(
	; CHECK-NEXT: entry:			; CHECK-NEXT: entry:
	; CHECK-NEXT: [[IDXA0:%.*]] = getelementptr inbounds double, double* [[A:%.*]], i64 0
	; CHECK-NEXT: [[IDXB0:%.*]] = getelementptr inbounds double, double* [[B:%.*]], i64 0			; CHECK-NEXT: [[IDXB0:%.*]] = getelementptr inbounds double, double* [[B:%.*]], i64 0
	; CHECK-NEXT: [[IDXC0:%.*]] = getelementptr inbounds double, double* [[C:%.*]], i64 0			; CHECK-NEXT: [[IDXC0:%.*]] = getelementptr inbounds double, double* [[C:%.*]], i64 0
	; CHECK-NEXT: [[IDXD0:%.*]] = getelementptr inbounds double, double* [[D:%.*]], i64 0			; CHECK-NEXT: [[IDXD0:%.*]] = getelementptr inbounds double, double* [[D:%.*]], i64 0
	; CHECK-NEXT: [[IDXA1:%.*]] = getelementptr inbounds double, double* [[A]], i64 1			; CHECK-NEXT: [[IDXA1:%.*]] = getelementptr inbounds double, double* [[A:%.*]], i64 1
	; CHECK-NEXT: [[IDXB2:%.*]] = getelementptr inbounds double, double* [[B]], i64 2			; CHECK-NEXT: [[IDXB2:%.*]] = getelementptr inbounds double, double* [[B]], i64 2
	; CHECK-NEXT: [[IDXA2:%.*]] = getelementptr inbounds double, double* [[A]], i64 2			; CHECK-NEXT: [[TMP0:%.*]] = insertelement <2 x double*> undef, double* [[A]], i32 0
				; CHECK-NEXT: [[TMP1:%.*]] = insertelement <2 x double*> [[TMP0]], double* [[A]], i32 1
				; CHECK-NEXT: [[TMP2:%.*]] = getelementptr double, <2 x double*> [[TMP1]], <2 x i64> <i64 0, i64 2>
	; CHECK-NEXT: [[IDXB1:%.*]] = getelementptr inbounds double, double* [[B]], i64 1			; CHECK-NEXT: [[IDXB1:%.*]] = getelementptr inbounds double, double* [[B]], i64 1
	; CHECK-NEXT: [[A0:%.*]] = load double, double* [[IDXA0]], align 8
	; CHECK-NEXT: [[C0:%.*]] = load double, double* [[IDXC0]], align 8			; CHECK-NEXT: [[C0:%.*]] = load double, double* [[IDXC0]], align 8
	; CHECK-NEXT: [[D0:%.*]] = load double, double* [[IDXD0]], align 8			; CHECK-NEXT: [[D0:%.*]] = load double, double* [[IDXD0]], align 8
	; CHECK-NEXT: [[A1:%.*]] = load double, double* [[IDXA1]], align 8			; CHECK-NEXT: [[A1:%.*]] = load double, double* [[IDXA1]], align 8
	; CHECK-NEXT: [[B2:%.*]] = load double, double* [[IDXB2]], align 8			; CHECK-NEXT: [[B2:%.*]] = load double, double* [[IDXB2]], align 8
	; CHECK-NEXT: [[A2:%.*]] = load double, double* [[IDXA2]], align 8			; CHECK-NEXT: [[TMP3:%.*]] = call <2 x double> @llvm.masked.gather.v2f64.v2p0f64(<2 x double*> [[TMP2]], i32 8, <2 x i1> <i1 true, i1 true>, <2 x double> undef)
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[IDXB0]] to <2 x double>*			; CHECK-NEXT: [[TMP4:%.*]] = extractelement <2 x double*> [[TMP2]], i32 0
	; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8			; CHECK-NEXT: [[TMP5:%.*]] = bitcast double* [[IDXB0]] to <2 x double>*
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[C0]], i32 0			; CHECK-NEXT: [[TMP6:%.*]] = load <2 x double>, <2 x double>* [[TMP5]], align 8
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[A1]], i32 1			; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> undef, double [[C0]], i32 0
	; CHECK-NEXT: [[TMP4:%.*]] = insertelement <2 x double> undef, double [[D0]], i32 0			; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> [[TMP7]], double [[A1]], i32 1
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> [[TMP4]], double [[B2]], i32 1			; CHECK-NEXT: [[TMP9:%.*]] = insertelement <2 x double> undef, double [[D0]], i32 0
	; CHECK-NEXT: [[TMP6:%.*]] = fsub fast <2 x double> [[TMP3]], [[TMP5]]			; CHECK-NEXT: [[TMP10:%.*]] = insertelement <2 x double> [[TMP9]], double [[B2]], i32 1
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> undef, double [[A0]], i32 0			; CHECK-NEXT: [[TMP11:%.*]] = fsub fast <2 x double> [[TMP8]], [[TMP10]]
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> [[TMP7]], double [[A2]], i32 1			; CHECK-NEXT: [[TMP12:%.*]] = fsub fast <2 x double> [[TMP3]], [[TMP6]]
	; CHECK-NEXT: [[TMP9:%.*]] = fsub fast <2 x double> [[TMP8]], [[TMP1]]			; CHECK-NEXT: [[TMP13:%.*]] = fadd fast <2 x double> [[TMP12]], [[TMP11]]
	; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP9]], [[TMP6]]
	; CHECK-NEXT: [[IDXS0:%.*]] = getelementptr inbounds double, double* [[S:%.*]], i64 0			; CHECK-NEXT: [[IDXS0:%.*]] = getelementptr inbounds double, double* [[S:%.*]], i64 0
	; CHECK-NEXT: [[IDXS1:%.*]] = getelementptr inbounds double, double* [[S]], i64 1			; CHECK-NEXT: [[IDXS1:%.*]] = getelementptr inbounds double, double* [[S]], i64 1
	; CHECK-NEXT: [[TMP11:%.*]] = bitcast double* [[IDXS0]] to <2 x double>*			; CHECK-NEXT: [[TMP14:%.*]] = bitcast double* [[IDXS0]] to <2 x double>*
	; CHECK-NEXT: store <2 x double> [[TMP10]], <2 x double>* [[TMP11]], align 8			; CHECK-NEXT: store <2 x double> [[TMP13]], <2 x double>* [[TMP14]], align 8
	; CHECK-NEXT: store double [[A1]], double* [[EXT1:%.*]], align 8			; CHECK-NEXT: store double [[A1]], double* [[EXT1:%.*]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%IdxA0 = getelementptr inbounds double, double* %A, i64 0			%IdxA0 = getelementptr inbounds double, double* %A, i64 0
	%IdxB0 = getelementptr inbounds double, double* %B, i64 0			%IdxB0 = getelementptr inbounds double, double* %B, i64 0
	%IdxC0 = getelementptr inbounds double, double* %C, i64 0			%IdxC0 = getelementptr inbounds double, double* %C, i64 0
	%IdxD0 = getelementptr inbounds double, double* %D, i64 0			%IdxD0 = getelementptr inbounds double, double* %D, i64 0
	[54 lines not shown]
	; CHECK-NEXT: [[IDXA0:%.*]] = getelementptr inbounds double, double* [[A:%.*]], i64 0			; CHECK-NEXT: [[IDXA0:%.*]] = getelementptr inbounds double, double* [[A:%.*]], i64 0
	; CHECK-NEXT: [[IDXB0:%.*]] = getelementptr inbounds double, double* [[B:%.*]], i64 0			; CHECK-NEXT: [[IDXB0:%.*]] = getelementptr inbounds double, double* [[B:%.*]], i64 0
	; CHECK-NEXT: [[IDXC0:%.*]] = getelementptr inbounds double, double* [[C:%.*]], i64 0			; CHECK-NEXT: [[IDXC0:%.*]] = getelementptr inbounds double, double* [[C:%.*]], i64 0
	; CHECK-NEXT: [[IDXD0:%.*]] = getelementptr inbounds double, double* [[D:%.*]], i64 0			; CHECK-NEXT: [[IDXD0:%.*]] = getelementptr inbounds double, double* [[D:%.*]], i64 0
	; CHECK-NEXT: [[IDXA1:%.*]] = getelementptr inbounds double, double* [[A]], i64 1			; CHECK-NEXT: [[IDXA1:%.*]] = getelementptr inbounds double, double* [[A]], i64 1
	; CHECK-NEXT: [[IDXB2:%.*]] = getelementptr inbounds double, double* [[B]], i64 2			; CHECK-NEXT: [[IDXB2:%.*]] = getelementptr inbounds double, double* [[B]], i64 2
	; CHECK-NEXT: [[IDXA2:%.*]] = getelementptr inbounds double, double* [[A]], i64 2			; CHECK-NEXT: [[IDXA2:%.*]] = getelementptr inbounds double, double* [[A]], i64 2
	; CHECK-NEXT: [[IDXB1:%.*]] = getelementptr inbounds double, double* [[B]], i64 1			; CHECK-NEXT: [[IDXB1:%.*]] = getelementptr inbounds double, double* [[B]], i64 1
				; CHECK-NEXT: [[A0:%.*]] = load double, double* [[IDXA0]], align 8
	; CHECK-NEXT: [[B0:%.*]] = load double, double* [[IDXB0]], align 8			; CHECK-NEXT: [[B0:%.*]] = load double, double* [[IDXB0]], align 8
	; CHECK-NEXT: [[C0:%.*]] = load double, double* [[IDXC0]], align 8			; CHECK-NEXT: [[C0:%.*]] = load double, double* [[IDXC0]], align 8
	; CHECK-NEXT: [[D0:%.*]] = load double, double* [[IDXD0]], align 8			; CHECK-NEXT: [[D0:%.*]] = load double, double* [[IDXD0]], align 8
	; CHECK-NEXT: [[TMP0:%.*]] = bitcast double* [[IDXA0]] to <2 x double>*			; CHECK-NEXT: [[A1:%.*]] = load double, double* [[IDXA1]], align 8
	; CHECK-NEXT: [[TMP1:%.*]] = load <2 x double>, <2 x double>* [[TMP0]], align 8
	; CHECK-NEXT: [[B2:%.*]] = load double, double* [[IDXB2]], align 8			; CHECK-NEXT: [[B2:%.*]] = load double, double* [[IDXB2]], align 8
	; CHECK-NEXT: [[A2:%.*]] = load double, double* [[IDXA2]], align 8			; CHECK-NEXT: [[A2:%.*]] = load double, double* [[IDXA2]], align 8
	; CHECK-NEXT: [[B1:%.*]] = load double, double* [[IDXB1]], align 8			; CHECK-NEXT: [[B1:%.*]] = load double, double* [[IDXB1]], align 8
	; CHECK-NEXT: [[TMP2:%.*]] = insertelement <2 x double> undef, double [[B0]], i32 0			; CHECK-NEXT: [[SUBA0B0:%.*]] = fsub fast double [[A0]], [[B0]]
	; CHECK-NEXT: [[TMP3:%.*]] = insertelement <2 x double> [[TMP2]], double [[B2]], i32 1			; CHECK-NEXT: [[SUBC0D0:%.*]] = fsub fast double [[C0]], [[D0]]
	; CHECK-NEXT: [[TMP4:%.*]] = fsub fast <2 x double> [[TMP1]], [[TMP3]]			; CHECK-NEXT: [[SUBA1B2:%.*]] = fsub fast double [[A1]], [[B2]]
	; CHECK-NEXT: [[TMP5:%.*]] = insertelement <2 x double> undef, double [[C0]], i32 0			; CHECK-NEXT: [[SUBA2B1:%.*]] = fsub fast double [[A2]], [[B1]]
	; CHECK-NEXT: [[TMP6:%.*]] = insertelement <2 x double> [[TMP5]], double [[A2]], i32 1			; CHECK-NEXT: [[ADD0:%.*]] = fadd fast double [[SUBA0B0]], [[SUBC0D0]]
	; CHECK-NEXT: [[TMP7:%.*]] = insertelement <2 x double> undef, double [[D0]], i32 0			; CHECK-NEXT: [[ADD1:%.*]] = fadd fast double [[SUBA1B2]], [[SUBA2B1]]
	; CHECK-NEXT: [[TMP8:%.*]] = insertelement <2 x double> [[TMP7]], double [[B1]], i32 1
	; CHECK-NEXT: [[TMP9:%.*]] = fsub fast <2 x double> [[TMP6]], [[TMP8]]
	; CHECK-NEXT: [[TMP10:%.*]] = fadd fast <2 x double> [[TMP4]], [[TMP9]]
	; CHECK-NEXT: [[IDXS0:%.*]] = getelementptr inbounds double, double* [[S:%.*]], i64 0			; CHECK-NEXT: [[IDXS0:%.*]] = getelementptr inbounds double, double* [[S:%.*]], i64 0
	; CHECK-NEXT: [[IDXS1:%.*]] = getelementptr inbounds double, double* [[S]], i64 1			; CHECK-NEXT: [[IDXS1:%.*]] = getelementptr inbounds double, double* [[S]], i64 1
	; CHECK-NEXT: [[TMP11:%.*]] = bitcast double* [[IDXS0]] to <2 x double>*			; CHECK-NEXT: store double [[ADD0]], double* [[IDXS0]], align 8
	; CHECK-NEXT: store <2 x double> [[TMP10]], <2 x double>* [[TMP11]], align 8			; CHECK-NEXT: store double [[ADD1]], double* [[IDXS1]], align 8
	; CHECK-NEXT: [[TMP12:%.*]] = extractelement <2 x double> [[TMP1]], i32 1			; CHECK-NEXT: store double [[A1]], double* [[EXT1:%.*]], align 8
	; CHECK-NEXT: store double [[TMP12]], double* [[EXT1:%.*]], align 8			; CHECK-NEXT: store double [[A1]], double* [[EXT2:%.*]], align 8
	; CHECK-NEXT: store double [[TMP12]], double* [[EXT2:%.*]], align 8			; CHECK-NEXT: store double [[A1]], double* [[EXT3:%.*]], align 8
	; CHECK-NEXT: store double [[TMP12]], double* [[EXT3:%.*]], align 8
	; CHECK-NEXT: store double [[B1]], double* [[EXT4:%.*]], align 8			; CHECK-NEXT: store double [[B1]], double* [[EXT4:%.*]], align 8
	; CHECK-NEXT: store double [[B1]], double* [[EXT5:%.*]], align 8			; CHECK-NEXT: store double [[B1]], double* [[EXT5:%.*]], align 8
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
	;			;
	entry:			entry:
	%IdxA0 = getelementptr inbounds double, double* %A, i64 0			%IdxA0 = getelementptr inbounds double, double* %A, i64 0
	%IdxB0 = getelementptr inbounds double, double* %B, i64 0			%IdxB0 = getelementptr inbounds double, double* %B, i64 0
	%IdxC0 = getelementptr inbounds double, double* %C, i64 0			%IdxC0 = getelementptr inbounds double, double* %C, i64 0
	[283 lines not shown]

llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+sse2 | FileCheck %s			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+sse2 | FileCheck %s --check-prefixes=SSE
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx | FileCheck %s			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx | FileCheck %s --check-prefixes=AVX
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx2 | FileCheck %s			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx2 | FileCheck %s --check-prefixes=AVX
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512f | FileCheck %s			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512f | FileCheck %s --check-prefixes=AVX
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512vl | FileCheck %s			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512vl | FileCheck %s --check-prefixes=AVX
				RKSimon: Sorry, I think you need to remove the 'CHECK' from all of these and just use SSE/AVX
				anton-afanasyev (Author): Ok, done


	@b = global [8 x i32] zeroinitializer, align 16			@b = global [8 x i32] zeroinitializer, align 16
	@a = global [8 x i32] zeroinitializer, align 16			@a = global [8 x i32] zeroinitializer, align 16

	define void @foo() {			define void @foo() {
	; CHECK-LABEL: @foo(			; SSE-LABEL: @foo(
	; CHECK-NEXT: [[TMP1:%.*]] = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 0), align 16			; SSE-NEXT: [[TMP1:%.*]] = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 0), align 16
	; CHECK-NEXT: store i32 [[TMP1]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 0), align 16			; SSE-NEXT: store i32 [[TMP1]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 0), align 16
	; CHECK-NEXT: [[TMP2:%.*]] = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 2), align 8			; SSE-NEXT: [[TMP2:%.*]] = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 2), align 8
	; CHECK-NEXT: store i32 [[TMP2]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 1), align 4			; SSE-NEXT: store i32 [[TMP2]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 1), align 4
	; CHECK-NEXT: store i32 [[TMP1]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 2), align 8			; SSE-NEXT: store i32 [[TMP1]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 2), align 8
	; CHECK-NEXT: store i32 [[TMP2]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 3), align 4			; SSE-NEXT: store i32 [[TMP2]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 3), align 4
	; CHECK-NEXT: store i32 [[TMP1]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 4), align 16			; SSE-NEXT: store i32 [[TMP1]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 4), align 16
	; CHECK-NEXT: store i32 [[TMP2]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 5), align 4			; SSE-NEXT: store i32 [[TMP2]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 5), align 4
	; CHECK-NEXT: store i32 [[TMP1]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 6), align 8			; SSE-NEXT: store i32 [[TMP1]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 6), align 8
	; CHECK-NEXT: store i32 [[TMP2]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 7), align 4			; SSE-NEXT: store i32 [[TMP2]], i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 7), align 4
	; CHECK-NEXT: ret void			; SSE-NEXT: ret void
				;
				; AVX-LABEL: @foo(
				; AVX-NEXT: [[TMP1:%.*]] = call <2 x i32> @llvm.masked.gather.v2i32.v2p0i32(<2 x i32*> <i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 0), i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 2)>, i32 16, <2 x i1> <i1 true, i1 true>, <2 x i32> undef)
				; AVX-NEXT: [[SHUFFLE:%.*]] = shufflevector <2 x i32> [[TMP1]], <2 x i32> undef, <8 x i32> <i32 0, i32 1, i32 0, i32 1, i32 0, i32 1, i32 0, i32 1>
				; AVX-NEXT: store <8 x i32> [[SHUFFLE]], <8 x i32>* bitcast ([8 x i32]* @a to <8 x i32>*), align 16
				; AVX-NEXT: ret void
	;			;
	%1 = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 0), align 16			%1 = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 0), align 16
	store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 0), align 16			store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 0), align 16
	%2 = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 2), align 8			%2 = load i32, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @b, i64 0, i64 2), align 8
	store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 1), align 4			store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 1), align 4
	store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 2), align 8			store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 2), align 8
	store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 3), align 4			store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 3), align 4
	store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 4), align 16			store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 4), align 16
	store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 5), align 4			store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 5), align 4
	store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 6), align 8			store i32 %1, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 6), align 8
	store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 7), align 4			store i32 %2, i32* getelementptr inbounds ([8 x i32], [8 x i32]* @a, i64 0, i64 7), align 4
	ret void			ret void
	}			}
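
The AVX check lines for `@foo` above replace two scalar loads and eight scalar stores with one `llvm.masked.gather` of `b[0]` and `b[2]`, a `shufflevector` that repeats the two lanes four times, and a single `<8 x i32>` store. As a rough illustration only, here is a scalar Python model of those two intrinsics' semantics. It is not LLVM code: memory is modelled as a plain list, `undef` as `None`, the alignment operand is ignored, and the helper names are hypothetical.

```python
# Scalar model of the IR in pr47623.ll (illustrative sketch, not LLVM code).

def masked_gather(memory, indices, mask, passthru):
    """Model llvm.masked.gather: lane i loads memory[indices[i]] when mask[i]
    is true, otherwise it keeps passthru[i]."""
    return [memory[idx] if m else p
            for idx, m, p in zip(indices, mask, passthru)]


def shufflevector(v1, v2, shuffle_mask):
    """Model shufflevector: mask value k selects element k of v1 ++ v2."""
    joined = v1 + v2
    return [joined[k] for k in shuffle_mask]


def foo_scalar(b):
    """The original scalar body: a[i] alternates between b[0] and b[2]."""
    a = [0] * 8
    t1, t2 = b[0], b[2]
    for i in range(0, 8, 2):
        a[i], a[i + 1] = t1, t2
    return a


def foo_gather(b):
    """The AVX form: one all-true gather of two lanes, then a repeating shuffle."""
    gathered = masked_gather(b, [0, 2], [True, True], [None, None])
    return shufflevector(gathered, [None, None], [0, 1, 0, 1, 0, 1, 0, 1])
```

Under this model, `foo_gather(b)` and `foo_scalar(b)` return the same eight values for any `b` with at least three elements, which is the equivalence the SSE and AVX check prefixes encode.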

llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll

	; NOTE: Assertions have been autogenerated by utils/update_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+sse2 | FileCheck %s --check-prefixes=CHECK,SSE			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+sse2 | FileCheck %s --check-prefixes=CHECK,SSE
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx | FileCheck %s --check-prefixes=CHECK,AVX			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx | FileCheck %s --check-prefixes=CHECK,AVX
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx2 | FileCheck %s --check-prefixes=CHECK,AVX			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx2 | FileCheck %s --check-prefixes=CHECK,AVX2
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512f | FileCheck %s --check-prefixes=CHECK,AVX			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512f | FileCheck %s --check-prefixes=CHECK,AVX512
	; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512vl | FileCheck %s --check-prefixes=CHECK,AVX			; RUN: opt < %s -slp-vectorizer -instcombine -S -mtriple=x86_64-unknown-linux -mattr=+avx512vl | FileCheck %s --check-prefixes=CHECK,AVX512

	define void @gather_load(i32* %0, i32* readonly %1) {			define void @gather_load(i32* %0, i32* readonly %1) {
	; CHECK-LABEL: @gather_load(			; CHECK-LABEL: @gather_load(
	; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, i32* [[TMP1:%.*]], i64 1			; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, i32* [[TMP1:%.*]], i64 1
	; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[TMP1]], align 4			; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[TMP1]], align 4
	; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 11			; CHECK-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 11
	; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[TMP5]], align 4			; CHECK-NEXT: [[TMP6:%.*]] = load i32, i32* [[TMP5]], align 4
	; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 4			; CHECK-NEXT: [[TMP7:%.*]] = getelementptr inbounds i32, i32* [[TMP1]], i64 4
	; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32* [[TMP7]], align 4			; CHECK-NEXT: [[TMP8:%.*]] = load i32, i32* [[TMP7]], align 4
	; CHECK-NEXT: [[TMP9:%.*]] = load i32, i32* [[TMP3]], align 4			; CHECK-NEXT: [[TMP9:%.*]] = load i32, i32* [[TMP3]], align 4
	; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> undef, i32 [[TMP4]], i32 0			; CHECK-NEXT: [[TMP10:%.*]] = insertelement <4 x i32> undef, i32 [[TMP4]], i32 0
	; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 [[TMP6]], i32 1			; CHECK-NEXT: [[TMP11:%.*]] = insertelement <4 x i32> [[TMP10]], i32 [[TMP6]], i32 1
	; CHECK-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[TMP8]], i32 2			; CHECK-NEXT: [[TMP12:%.*]] = insertelement <4 x i32> [[TMP11]], i32 [[TMP8]], i32 2
	; CHECK-NEXT: [[TMP13:%.*]] = insertelement <4 x i32> [[TMP12]], i32 [[TMP9]], i32 3			; CHECK-NEXT: [[TMP13:%.*]] = insertelement <4 x i32> [[TMP12]], i32 [[TMP9]], i32 3
	; CHECK-NEXT: [[TMP14:%.*]] = add nsw <4 x i32> [[TMP13]], <i32 1, i32 2, i32 3, i32 4>			; CHECK-NEXT: [[TMP14:%.*]] = add nsw <4 x i32> [[TMP13]], <i32 1, i32 2, i32 3, i32 4>
	; CHECK-NEXT: [[TMP15:%.*]] = bitcast i32* [[TMP0:%.*]] to <4 x i32>*			; CHECK-NEXT: [[TMP15:%.*]] = bitcast i32* [[TMP0:%.*]] to <4 x i32>*
	; CHECK-NEXT: store <4 x i32> [[TMP14]], <4 x i32>* [[TMP15]], align 4			; CHECK-NEXT: store <4 x i32> [[TMP14]], <4 x i32>* [[TMP15]], align 4
	; CHECK-NEXT: ret void			; CHECK-NEXT: ret void
				RKSimon: This doesn't look great in the final codegen: https://gcc.godbolt.org/z/vE9Yoe. That suggests either the costs aren't correct or we're not correctly including the cost of something. The buildvector of the pointers? Are we missing getelementptr vectorization?
				anton-afanasyev (Author): Oops, thanks, it looks like I've missed the buildvector cost.
				anton-afanasyev (Author): Hmm, investigated this: no, I was wrong, the calculated cost is correct (the `insertelement`s for the gather instruction are compensated by the `insertelement`s for the buildvector). Investigating further... This looks more like a codegen issue.
				RKSimon: Hmm - does the X86 TTI handle the cost of buildvectors of pointers, or does it fall back to the generic implementation (which is almost certainly lower)?
				anton-afanasyev (Author): Although the codegen output for Skylake looks complicated, I believe it's still more optimized, since the `gather` is cheaper than four `load`s from memory, isn't it?
				RKSimon: My concern is that -march=avx512 does nothing, so you're just getting raw sse2 costs here - hence why I updated the tests at rG1eeae4310771d8a6896fe09effe88883998f34e8
	;			;
	%3 = getelementptr inbounds i32, i32* %1, i64 1			%3 = getelementptr inbounds i32, i32* %1, i64 1
	%4 = load i32, i32* %1, align 4			%4 = load i32, i32* %1, align 4
	%5 = getelementptr inbounds i32, i32* %0, i64 1			%5 = getelementptr inbounds i32, i32* %0, i64 1
	%6 = getelementptr inbounds i32, i32* %1, i64 11			%6 = getelementptr inbounds i32, i32* %1, i64 11
	%7 = load i32, i32* %6, align 4			%7 = load i32, i32* %6, align 4
	%8 = getelementptr inbounds i32, i32* %0, i64 2			%8 = getelementptr inbounds i32, i32* %0, i64 2
	%9 = getelementptr inbounds i32, i32* %1, i64 4			%9 = getelementptr inbounds i32, i32* %1, i64 4
	[177 lines not shown]
	; SSE-NEXT: store i32 [[T16]], i32* [[T13]], align 4			; SSE-NEXT: store i32 [[T16]], i32* [[T13]], align 4
	; SSE-NEXT: store i32 [[T20]], i32* [[T17]], align 4			; SSE-NEXT: store i32 [[T20]], i32* [[T17]], align 4
	; SSE-NEXT: store i32 [[T24]], i32* [[T21]], align 4			; SSE-NEXT: store i32 [[T24]], i32* [[T21]], align 4
	; SSE-NEXT: store i32 [[T28]], i32* [[T25]], align 4			; SSE-NEXT: store i32 [[T28]], i32* [[T25]], align 4
	; SSE-NEXT: store i32 [[T32]], i32* [[T29]], align 4			; SSE-NEXT: store i32 [[T32]], i32* [[T29]], align 4
	; SSE-NEXT: ret void			; SSE-NEXT: ret void
	;			;
	; AVX-LABEL: @gather_load_4(			; AVX-LABEL: @gather_load_4(
				; AVX-NEXT: [[T5:%.*]] = getelementptr inbounds i32, i32* [[T0:%.*]], i64 1
	; AVX-NEXT: [[T6:%.*]] = getelementptr inbounds i32, i32* [[T1:%.*]], i64 11			; AVX-NEXT: [[T6:%.*]] = getelementptr inbounds i32, i32* [[T1:%.*]], i64 11
				; AVX-NEXT: [[T9:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 2
	; AVX-NEXT: [[T10:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 4			; AVX-NEXT: [[T10:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 4
				; AVX-NEXT: [[T13:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 3
	; AVX-NEXT: [[T14:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 15			; AVX-NEXT: [[T14:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 15
				; AVX-NEXT: [[T17:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 4
	; AVX-NEXT: [[T18:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 18			; AVX-NEXT: [[T18:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 18
				; AVX-NEXT: [[T21:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 5
	; AVX-NEXT: [[T22:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 9			; AVX-NEXT: [[T22:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 9
				; AVX-NEXT: [[T25:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 6
	; AVX-NEXT: [[T26:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 6			; AVX-NEXT: [[T26:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 6
				; AVX-NEXT: [[T29:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 7
	; AVX-NEXT: [[T30:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 21			; AVX-NEXT: [[T30:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 21
	; AVX-NEXT: [[T3:%.*]] = load i32, i32* [[T1]], align 4			; AVX-NEXT: [[T3:%.*]] = load i32, i32* [[T1]], align 4
	; AVX-NEXT: [[T7:%.*]] = load i32, i32* [[T6]], align 4			; AVX-NEXT: [[T7:%.*]] = load i32, i32* [[T6]], align 4
	; AVX-NEXT: [[T11:%.*]] = load i32, i32* [[T10]], align 4			; AVX-NEXT: [[T11:%.*]] = load i32, i32* [[T10]], align 4
	; AVX-NEXT: [[T15:%.*]] = load i32, i32* [[T14]], align 4			; AVX-NEXT: [[T15:%.*]] = load i32, i32* [[T14]], align 4
	; AVX-NEXT: [[T19:%.*]] = load i32, i32* [[T18]], align 4			; AVX-NEXT: [[T19:%.*]] = load i32, i32* [[T18]], align 4
	; AVX-NEXT: [[T23:%.*]] = load i32, i32* [[T22]], align 4			; AVX-NEXT: [[T23:%.*]] = load i32, i32* [[T22]], align 4
	; AVX-NEXT: [[T27:%.*]] = load i32, i32* [[T26]], align 4			; AVX-NEXT: [[T27:%.*]] = load i32, i32* [[T26]], align 4
	; AVX-NEXT: [[T31:%.*]] = load i32, i32* [[T30]], align 4			; AVX-NEXT: [[T31:%.*]] = load i32, i32* [[T30]], align 4
	; AVX-NEXT: [[TMP1:%.*]] = insertelement <8 x i32> undef, i32 [[T3]], i32 0			; AVX-NEXT: [[T4:%.*]] = add i32 [[T3]], 1
	; AVX-NEXT: [[TMP2:%.*]] = insertelement <8 x i32> [[TMP1]], i32 [[T7]], i32 1			; AVX-NEXT: [[T8:%.*]] = add i32 [[T7]], 2
	; AVX-NEXT: [[TMP3:%.*]] = insertelement <8 x i32> [[TMP2]], i32 [[T11]], i32 2			; AVX-NEXT: [[T12:%.*]] = add i32 [[T11]], 3
	; AVX-NEXT: [[TMP4:%.*]] = insertelement <8 x i32> [[TMP3]], i32 [[T15]], i32 3			; AVX-NEXT: [[T16:%.*]] = add i32 [[T15]], 4
	; AVX-NEXT: [[TMP5:%.*]] = insertelement <8 x i32> [[TMP4]], i32 [[T19]], i32 4			; AVX-NEXT: [[T20:%.*]] = add i32 [[T19]], 1
	; AVX-NEXT: [[TMP6:%.*]] = insertelement <8 x i32> [[TMP5]], i32 [[T23]], i32 5			; AVX-NEXT: [[T24:%.*]] = add i32 [[T23]], 2
	; AVX-NEXT: [[TMP7:%.*]] = insertelement <8 x i32> [[TMP6]], i32 [[T27]], i32 6			; AVX-NEXT: [[T28:%.*]] = add i32 [[T27]], 3
	; AVX-NEXT: [[TMP8:%.*]] = insertelement <8 x i32> [[TMP7]], i32 [[T31]], i32 7			; AVX-NEXT: [[T32:%.*]] = add i32 [[T31]], 4
	; AVX-NEXT: [[TMP9:%.*]] = add <8 x i32> [[TMP8]], <i32 1, i32 2, i32 3, i32 4, i32 1, i32 2, i32 3, i32 4>			; AVX-NEXT: store i32 [[T4]], i32* [[T0]], align 4
	; AVX-NEXT: [[TMP10:%.*]] = bitcast i32* [[T0:%.*]] to <8 x i32>*			; AVX-NEXT: store i32 [[T8]], i32* [[T5]], align 4
	; AVX-NEXT: store <8 x i32> [[TMP9]], <8 x i32>* [[TMP10]], align 4			; AVX-NEXT: store i32 [[T12]], i32* [[T9]], align 4
				; AVX-NEXT: store i32 [[T16]], i32* [[T13]], align 4
				; AVX-NEXT: store i32 [[T20]], i32* [[T17]], align 4
				; AVX-NEXT: store i32 [[T24]], i32* [[T21]], align 4
				; AVX-NEXT: store i32 [[T28]], i32* [[T25]], align 4
				; AVX-NEXT: store i32 [[T32]], i32* [[T29]], align 4
	; AVX-NEXT: ret void			; AVX-NEXT: ret void
	;			;
				; AVX2-LABEL: @gather_load_4(
				; AVX2-NEXT: [[T5:%.*]] = getelementptr inbounds i32, i32* [[T0:%.*]], i64 1
				; AVX2-NEXT: [[TMP1:%.*]] = insertelement <4 x i32*> undef, i32* [[T1:%.*]], i32 0
				; AVX2-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32*> [[TMP1]], <4 x i32*> undef, <4 x i32> zeroinitializer
				; AVX2-NEXT: [[TMP3:%.*]] = getelementptr i32, <4 x i32*> [[TMP2]], <4 x i64> <i64 11, i64 4, i64 15, i64 18>
				; AVX2-NEXT: [[T21:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 5
				; AVX2-NEXT: [[T22:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 9
				; AVX2-NEXT: [[T25:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 6
				; AVX2-NEXT: [[T26:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 6
				; AVX2-NEXT: [[T29:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 7
				; AVX2-NEXT: [[T30:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 21
				; AVX2-NEXT: [[T3:%.*]] = load i32, i32* [[T1]], align 4
				; AVX2-NEXT: [[TMP4:%.*]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32*> [[TMP3]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef)
				; AVX2-NEXT: [[T23:%.*]] = load i32, i32* [[T22]], align 4
				; AVX2-NEXT: [[T27:%.*]] = load i32, i32* [[T26]], align 4
				; AVX2-NEXT: [[T31:%.*]] = load i32, i32* [[T30]], align 4
				; AVX2-NEXT: [[T4:%.*]] = add i32 [[T3]], 1
				; AVX2-NEXT: [[TMP5:%.*]] = add <4 x i32> [[TMP4]], <i32 2, i32 3, i32 4, i32 1>
				; AVX2-NEXT: [[T24:%.*]] = add i32 [[T23]], 2
				; AVX2-NEXT: [[T28:%.*]] = add i32 [[T27]], 3
				; AVX2-NEXT: [[T32:%.*]] = add i32 [[T31]], 4
				; AVX2-NEXT: store i32 [[T4]], i32* [[T0]], align 4
				; AVX2-NEXT: [[TMP6:%.*]] = bitcast i32* [[T5]] to <4 x i32>*
				; AVX2-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4
				; AVX2-NEXT: store i32 [[T24]], i32* [[T21]], align 4
				; AVX2-NEXT: store i32 [[T28]], i32* [[T25]], align 4
				; AVX2-NEXT: store i32 [[T32]], i32* [[T29]], align 4
				; AVX2-NEXT: ret void
				;
				; AVX512-LABEL: @gather_load_4(
				; AVX512-NEXT: [[T5:%.*]] = getelementptr inbounds i32, i32* [[T0:%.*]], i64 1
				; AVX512-NEXT: [[TMP1:%.*]] = insertelement <4 x i32*> undef, i32* [[T1:%.*]], i32 0
				; AVX512-NEXT: [[TMP2:%.*]] = shufflevector <4 x i32*> [[TMP1]], <4 x i32*> undef, <4 x i32> zeroinitializer
				; AVX512-NEXT: [[TMP3:%.*]] = getelementptr i32, <4 x i32*> [[TMP2]], <4 x i64> <i64 11, i64 4, i64 15, i64 18>
				; AVX512-NEXT: [[T21:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 5
				; AVX512-NEXT: [[T22:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 9
				; AVX512-NEXT: [[T25:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 6
				; AVX512-NEXT: [[T26:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 6
				; AVX512-NEXT: [[T29:%.*]] = getelementptr inbounds i32, i32* [[T0]], i64 7
				; AVX512-NEXT: [[T30:%.*]] = getelementptr inbounds i32, i32* [[T1]], i64 21
				; AVX512-NEXT: [[T3:%.*]] = load i32, i32* [[T1]], align 4
				; AVX512-NEXT: [[TMP4:%.*]] = call <4 x i32> @llvm.masked.gather.v4i32.v4p0i32(<4 x i32*> [[TMP3]], i32 4, <4 x i1> <i1 true, i1 true, i1 true, i1 true>, <4 x i32> undef)
				; AVX512-NEXT: [[T23:%.*]] = load i32, i32* [[T22]], align 4
				; AVX512-NEXT: [[T27:%.*]] = load i32, i32* [[T26]], align 4
				; AVX512-NEXT: [[T31:%.*]] = load i32, i32* [[T30]], align 4
				; AVX512-NEXT: [[T4:%.*]] = add i32 [[T3]], 1
				; AVX512-NEXT: [[TMP5:%.*]] = add <4 x i32> [[TMP4]], <i32 2, i32 3, i32 4, i32 1>
				; AVX512-NEXT: [[T24:%.*]] = add i32 [[T23]], 2
				; AVX512-NEXT: [[T28:%.*]] = add i32 [[T27]], 3
				; AVX512-NEXT: [[T32:%.*]] = add i32 [[T31]], 4
				; AVX512-NEXT: store i32 [[T4]], i32* [[T0]], align 4
				; AVX512-NEXT: [[TMP6:%.*]] = bitcast i32* [[T5]] to <4 x i32>*
				; AVX512-NEXT: store <4 x i32> [[TMP5]], <4 x i32>* [[TMP6]], align 4
				; AVX512-NEXT: store i32 [[T24]], i32* [[T21]], align 4
				; AVX512-NEXT: store i32 [[T28]], i32* [[T25]], align 4
				; AVX512-NEXT: store i32 [[T32]], i32* [[T29]], align 4
				; AVX512-NEXT: ret void
				;
	%t5 = getelementptr inbounds i32, i32* %t0, i64 1			%t5 = getelementptr inbounds i32, i32* %t0, i64 1
	%t6 = getelementptr inbounds i32, i32* %t1, i64 11			%t6 = getelementptr inbounds i32, i32* %t1, i64 11
	%t9 = getelementptr inbounds i32, i32* %t0, i64 2			%t9 = getelementptr inbounds i32, i32* %t0, i64 2
	%t10 = getelementptr inbounds i32, i32* %t1, i64 4			%t10 = getelementptr inbounds i32, i32* %t1, i64 4
	%t13 = getelementptr inbounds i32, i32* %t0, i64 3			%t13 = getelementptr inbounds i32, i32* %t0, i64 3
	%t14 = getelementptr inbounds i32, i32* %t1, i64 15			%t14 = getelementptr inbounds i32, i32* %t1, i64 15
	%t17 = getelementptr inbounds i32, i32* %t0, i64 4			%t17 = getelementptr inbounds i32, i32* %t0, i64 4
	%t18 = getelementptr inbounds i32, i32* %t1, i64 18			%t18 = getelementptr inbounds i32, i32* %t1, i64 18
	[36 lines not shown]

Revision Contents

Diff 305785

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp

llvm/test/Transforms/SLPVectorizer/X86/lookahead.ll

llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll

llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
