This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Make SLPVectorizer to use `llvm.masked.gather` intrinsic
ClosedPublic

Authored by anton-afanasyev on Oct 30 2020, 12:40 AM.

Details

Summary

For the scattered operands of load instructions it makes sense
to use gathering load intrinsic, which can lower to native instruction
for X86/AVX512 and ARM/SVE. This also enables building
vectorization tree with entries containing scattered operands.
The next step is to add scattered store.

Fixes PR47629 and PR47623

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
anton-afanasyev marked 3 inline comments as done.Oct 30 2020, 11:22 PM

Updated comments and LaneIndex

llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
17 ↗(On Diff #301826)

Hmm, investigated this: no, I was wrong, the calculated cost is correct (insertelemetss for gather instr are compesated by insertelements for buildvector). Further investigating this... Looks like codegen issue more.

Please can you rebase after rG1eeae4310771d8a6896fe09effe88883998f34e8?

llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
17 ↗(On Diff #301826)

Hmm - does the X86 TTI handle buildvectors of pointers costs or does it fallback to the generic implementation (which is almost certainly lower)?

The MVE gather costs were certainly written with the loop vectorizer in mind. The number of gather/scatter instructions we have is quite limited and it's not always simple to match llvm.gathers to mve gathers. I've not seen any issues from this so far though, other than what was just fixed up in rG30ad7426442e.

llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
17 ↗(On Diff #301826)

Despite of codegen output for Skylake looking complicated I believe it's still more optimized, since the gather is cheaper than four loads from memory, isn't it?

RKSimon added inline comments.Oct 31 2020, 7:02 AM
llvm/test/Transforms/SLPVectorizer/X86/pr47629.ll
17 ↗(On Diff #301826)

My concern is that -march=avx512 does nothing so you're just getting raw sse2 costs here - hence why I updated the tests at rG1eeae4310771d8a6896fe09effe88883998f34e8

Rebased against changed test file

anton-afanasyev added a comment.EditedOct 31 2020, 8:02 AM

Rebased, but optimization eliminated for -mattr=.... Option -mattr=+avx512vl should be set explicitly to activate costs.

This comment was removed by xbolva00.

I've extended test case again: https://reviews.llvm.org/rGe8d67ef2dc93. Also, @xbolva00 precommited test for PR47263, which is fixed with this patch.

ABataev added inline comments.Nov 3 2020, 8:46 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1581

I would think about adding a new EntryState instead of adding a new data member.

3668

Comment for false argument

4508

auto *

4510

auto *

4517

.emplace_back(PO, cast<User>(VecPtr), InsIndex);

4522

Try to merge this code line with the one in line 4539

anton-afanasyev marked 9 inline comments as done.Nov 3 2020, 10:46 PM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1581

I've thought about it and discussed this with @dtemirbulatov (his throttling patch is extending EntryState as well). I've come that IsScattered is like ReuseShuffleIndices/ReorderIndices (data members) more than an "entry state". Also, the current entry state NeedToGather would add ambiguity in that case.

3668

Ok.

4508

Ok

4510

UndefValue::get() returns UndefValue* type, but we need Value* here.

4517

Ok

4522

Done.

anton-afanasyev marked 6 inline comments as done.

Addressed notes

ABataev added inline comments.Nov 4 2020, 7:31 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1581

It is different thing. ReuseShuffleIndices/ReorderIndices can be used by many different entries while IsScattered member is used only for a single particular kind of node. It increases memory usage because we add an extra field which is just ignored in many cases. Plus, to me, this field just a mark of the new kind of gather.

2848

Why it is vectorized if this is actually kind of gather?

anton-afanasyev marked 2 inline comments as done.Nov 4 2020, 11:57 PM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1581

Ok, sounds convincing, I can change it. But I don't think this field is "just the new kind of gather". I'm planning to use IsScattered for scattered stores (@llvm.store.scatter intrinsic) at the next step, of course. This is not the gather like NeedToGather entry state.

2848

Actually I think vectorized is more appropriate here than gathered in sense it is used throughout SLPVectorizer module. Since we actually modify several instruction to _one_. I can change it to scatter-vectorized.

ABataev added inline comments.Nov 5 2020, 6:14 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2848

Could you give a bit more explanation why it should be treated as vectorized?

anton-afanasyev marked 3 inline comments as done.Nov 5 2020, 7:45 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2848

Ok.
For now, enum EntryState {Vectorize, NeedToGather} has two states: the first is for the bundle that is to be vectorized, the second is to be gathered, but here "gathered" means this bundle will stay untouched after tree vectorization, we need no replace several scalar instructions with one vector instruction. Also we need no handle users of gathered instruction.
For the "scattered" entry we have opposite case: several scalar instructios to be replaced with @llvm.masked.gather.* and @llvm.masked.scatter.* intrinsics and we need to handle its users (at least, for store = @llvm.masked.scatter.* case (the next patch, also using IsScattered field)).
Am I clear?

anton-afanasyev marked an inline comment as done.Nov 5 2020, 7:46 AM
ABataev added inline comments.Nov 5 2020, 7:48 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2848

Shall we handle the users for loads?

anton-afanasyev marked an inline comment as done.Nov 5 2020, 8:04 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2848

No, only for stores, of course. Vectorized loads are leaves of vectorized tree, whereas stores are seed points. Scattered stores can also be "vectorized" in the sense of being replaced with the one intrinsic.

ABataev added inline comments.Nov 5 2020, 8:14 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2848

Then for loads it is more just a kind a gather, rather than vectorize. Is it required to mark the scalars as vectorized or just enough to mark as gathered? Do we need all this stuff that is required for the vectorized instructions, like inorder, etc.?

anton-afanasyev marked 2 inline comments as done.Nov 5 2020, 8:33 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2848

Yes, we need this stuff: at least, separate cost estimation, scheduling. We treat this as generalization of vectorization -- like load <4 x i32>, @llvm.masked.gather loads to <4 x i32>, for instance.

ABataev added inline comments.Nov 5 2020, 8:47 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2848

Why it cannot be model via gather? We gather the scalar loads here but into a different form. Instead of direct gather we just gather the addresses and then do a load. But this still a kind of gather. Do you need to continue scheduling here? No, you don't, just need to extend Gather function to generate gather of addresses + llvm.masked.gather call.

anton-afanasyev marked 2 inline comments as done.Nov 5 2020, 9:17 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2848

It can be model via gather, ok, but only for load case, not for store one. But we already treat ordinary load <4 x i32> as vectorized entry, what is the difference with @llvm.masked.gather <4 x i32>? I see here consistent symmetry:

ordinary load of vector <-> ordinary store of vector
             |                 |
gather of vector <-> scatter of vector
ABataev added inline comments.Nov 5 2020, 9:23 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2848

Because the model is not based on the final outcome. Plus, you missed the real gather of addresses.
Did you think about adding a new gather treeentry for addresses? Looks like you missed the cost for addresses gather.

anton-afanasyev marked 2 inline comments as done.Nov 5 2020, 10:00 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2848

Why is model not based on the final outcome? The cost is separate issue -- I can fix this at getEntryCost().

ABataev added inline comments.Nov 5 2020, 10:04 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2848

Because later this code can be transformed into something else, anyway, for example gather can be transformed into a single instruction. The model itself is based on the scalars, how they are transformed into a vector form.
I would suggest just to add 2 tree entries in this case. One for new scatter vectorization form and one for gathering of addresses. It would simplify handling of this in future and won't break the model. Plus, you would not need to add a new cost calculation and rely on the exisiting cost model for the gather of addresses.

anton-afanasyev marked 2 inline comments as done.Nov 5 2020, 10:43 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2848

Sounds good, thanks!

anton-afanasyev marked an inline comment as done.

Added second tree entry for addresses

Added second tree entry for addresses

Yep, this looks much better now. What about adding new vectorization kind for scatter-gather?

dtemirbulatov added inline comments.Nov 8 2020, 8:31 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1581

Just to avoid any confusion, I think it is to change the name to IsScatterGatherOps?

anton-afanasyev marked an inline comment as done.Nov 9 2020, 1:40 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1581

Current IsScatteredOps means that _operands_ of load/store are _scattered_, so need either to gather, or to scatter them. I'm to change name to IsScatterOpds to avoid confusion with Operations.

anton-afanasyev marked an inline comment as done.

Renamed IsScatteredOps to IsScatteredOpds

Added second tree entry for addresses

Yep, this looks much better now. What about adding new vectorization kind for scatter-gather?

What do you mean by "vectorization kind"? Vectorization is differentiated by bundle instructions opcode.

Added second tree entry for addresses

Yep, this looks much better now. What about adding new vectorization kind for scatter-gather?

What do you mean by "vectorization kind"? Vectorization is differentiated by bundle instructions opcode.

I meant, to add a new vectorization kind for this node, something like Scatter or ScatterVectorized. We discussed before, that this new boolean field is going to e unused in many cases and better to introduce a new vectorization id (extend EntryState enum)

Use of EntryState

I meant, to add a new vectorization kind for this node, something like Scatter or ScatterVectorized. We discussed before, that this new boolean field is going to e unused in many cases and better to introduce a new vectorization id (extend EntryState enum)

Ok, changed. Do you think this is better now?

ABataev added inline comments.Nov 9 2020, 9:15 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2850

Better to modify Bundle parameter in newTreeEntry function, something like Optional<std::pair<EntryState, Bundle>> and pass EntryState directly at function call. You can implement it as a separate NFC patch.

3666

Add assert here for E->State == TreeEntry::ScatterVectorize

4505

Add assert here for E->State == TreeEntry::ScatterVectorize

anton-afanasyev marked 3 inline comments as done.

Added asserts

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2850

Ok, I can fix it in a separate patch. But now I doubt that initial memory optimization with IsScatteredOpds elimination was worse than this final EntryState expansion requiring changes for all newTreeEntry() calls.

ABataev added inline comments.Nov 9 2020, 9:50 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2850

Yes, it still will be better. Here you will allocate the value on the stack, which will be cleared after function termination. With an extra data member each TreeEntry will retain it till the end of life of the whole tree.

anton-afanasyev marked an inline comment as done.Nov 9 2020, 10:04 AM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2850

Ok

anton-afanasyev marked an inline comment as done.

Fixed asserts again

Small fix of debug printer

Also, please add ScatterVectorize to TreeEntry.dump()

Also, please add ScatterVectorize to TreeEntry.dump()

It's already here, line 1704.

dtemirbulatov accepted this revision.Nov 14 2020, 6:57 AM

Looks good, any other objections?

This revision is now accepted and ready to land.Nov 14 2020, 6:57 AM
RKSimon accepted this revision.Nov 14 2020, 9:01 AM

LGTM with one minor

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1556

Minor - please add comment explaining the purpose of each enum

ABataev added inline comments.Nov 14 2020, 9:32 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2850

Looks like you still do not pass the vectorization state as an argument. I thought we agreed to modify Bundle related param to be an Optional of EntryState and Bundle types? Better to create the entry with the correctly set inner state rather than modify them outside of the class. It breaks encapsulation.

anton-afanasyev marked 3 inline comments as done.

Add comment

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1556

Thank you, done!

2850

Sorry, missed this comment. Yes, we agreed to do this, but as a separate NFC patch.

3670

In general, yes, we can end up with reordering, but this could be optimized with preliminary pointers vector shuffling. Though I think it's rather rare case (load-gather plus reordering meeting together), I can prepare separate patch to optimize this after investigation of its impact.

ABataev added inline comments.Nov 14 2020, 9:59 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2850

Yes, thanks. Could you prepare this NFC patch before commiting this patch?

Added optional EntryState param to newTreeEntry()

anton-afanasyev marked an inline comment as done.Nov 14 2020, 9:38 PM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2850

Hmm, I've decided to get rid of scary make_pair inside all newTreeEntry() callings, leaving most of the code untouched. So, added optional EntryState param to newTreeEntry() function. Minimal changes, so no need for separate patch. Is it good now?

anton-afanasyev marked an inline comment as done.Nov 14 2020, 9:38 PM

Fix tests after "--check-prefixes" change (D90445) and fix one assert

RKSimon accepted this revision.Nov 15 2020, 7:02 AM

LGTM with one minor

llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
6 ↗(On Diff #305350)

Sorry, I think you need to remove the 'CHECK' from all of these and just use SSE/AVX

anton-afanasyev marked an inline comment as done.

Fix test, CHECK prefix is unused

llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
6 ↗(On Diff #305350)

Ok, done

anton-afanasyev edited the summary of this revision. (Show Details)Nov 15 2020, 7:18 AM
ABataev added inline comments.Nov 16 2020, 8:51 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1751

It may conflict with State param value and may lead to incorrect undrstanding/decisions in future. That's why I proposed to merge EntryState and Bundle into one Optional<std::pair<>>, though this solution also is not quite optimal.
Maybe better to split this function into 2 overloaded copies, one for gather and one for vectorized state?

anton-afanasyev marked an inline comment as done.Nov 16 2020, 1:19 PM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1751

Would it be better to leave all as it was before and just add setEntryState() method to accurately set ScatterVectorize (or other state in future)?

anton-afanasyev marked an inline comment as done.Nov 16 2020, 1:19 PM
ABataev added inline comments.Nov 16 2020, 1:25 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1751

Nah, it is similar to what was before. Better to set the correct state when constructing the state rather than construct it with the incorrect flag and then trying to fix it on-the-fly. It is bad design decision and always a source of bugs.

anton-afanasyev marked an inline comment as done.

Add overloaded newTreeEntry() for common case

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1751

Ok, thanks! So I created overloaded copy of newTreeEntry() especially for EntryState given explicitly. Is it good now?

ABataev added inline comments.Nov 16 2020, 2:29 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1753

I think it will crash if Bundle is None

1759
  1. Better to make Optional<std::pair<EntryState, ScheduleData *>>.
  2. I would add `assert(!StateAndBundle || StateAndBundle.getValue().first != NeedToGather && "...");
anton-afanasyev marked 2 inline comments as done.Nov 16 2020, 3:22 PM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1753

Thank you, fixed!

1759
  1. I tried to get rid of std::pair again, now looks better, doesn't it?
  2. Thanks, added.
anton-afanasyev marked 2 inline comments as done.Nov 16 2020, 3:22 PM

Fixes and assert

ABataev added inline comments.Nov 16 2020, 3:36 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1764

What if !Bundle && EntryState != NeedToGather, i.e. trying to vectorize gather entry?

anton-afanasyev marked an inline comment as done.

Added another assert

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1764

Thanks, added second assert.

This revision was landed with ongoing or failed builds.Nov 17 2020, 7:12 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a comment.Nov 17 2020, 7:19 AM

It looks like this change may cause a crash when building LNT, e.g. http://lab.llvm.org:8011/#/builders/105/builds/1899/steps/7/logs/stdio

It looks like this change may cause a crash when building LNT, e.g. http://lab.llvm.org:8011/#/builders/105/builds/1899/steps/7/logs/stdio

Thank you, fixing!

Saw some random miscompiles after this. Fixed by 4dbe12e86649ba6b5f03a9ba97e84d718727f7a7, can you check if I got it right?

Saw some random miscompiles after this. Fixed by 4dbe12e86649ba6b5f03a9ba97e84d718727f7a7, can you check if I got it right?

Yes, thanks, looks good!

fhahn added a comment.Nov 23 2020, 6:09 AM

It appears this is causing a ~2-3% regression on AArch64 for some benchmarks, including CINT2000/256.bzip2. Any ideas? It might be caused by underestimating the cost of gather/scatter on AArch64.

It appears this is causing a ~2-3% regression on AArch64 for some benchmarks, including CINT2000/256.bzip2. Any ideas? It might be caused by underestimating the cost of gather/scatter on AArch64.

Poor target specific costs is likely to be the problem - either for explicit gather intrinsic costs or the scalarization overhead fallback.

If isLegalMaskedGather is returning false (as it will do for AArch64 NEON), then should this even be attempting to use gather/scatter? I would have thought that will never do worth it if it is just going to scalarize again?

If isLegalMaskedGather is returning false (as it will do for AArch64 NEON), then should this even be attempting to use gather/scatter? I would have thought that will never do worth it if it is just going to scalarize again?

SLP should always create gathers with constant masks while isLegalMaskedGather covers the more general variable masks as well - we can always legalize constant mask gathers back to a buildvector or insertelement chain so its often worthwhile using the gather intrinsic to act for that case as well, hence it being important we correctly handle the scalarization overhead fallback.

I believe this issue is related to the default cost for getGatherScatterOpCost(). For the arch not having gather/scatter instrs we use TargetTransformInfoImplBase::getGatherScatterOpCost() which returns 1 unconditionally: https://github.com/llvm/llvm-project/blob/release/11.x/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h#L480
I'd fix it by setting something like 1024 for the default cost.

I believe this issue is related to the default cost for getGatherScatterOpCost(). For the arch not having gather/scatter instrs we use TargetTransformInfoImplBase::getGatherScatterOpCost() which returns 1 unconditionally: https://github.com/llvm/llvm-project/blob/release/11.x/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h#L480
I'd fix it by setting something like 1024 for the default cost.

yes, I managed to track it down to that. I pushed some additional tests and D91984 to add a more realistic estimate for the cost when scalarizing.

Current SLP has significant drawback with regard to its cost modeling. And this patch highlights it.
Consider we have four scalar loads of i8 type. With prior approach (vectorization overhead) we had cost for such entry 4 (x86 target).
With this new approach we have two entries instead of one: ScatterVectorize loads + NeedToGather GEPs. And costs for these entries are 6 and 10 respectively, thus cost increased from 4 to 16.
And the problem here is once we put this pattern into the tree it pulls cost up for the entire tree. If we have multiple such patterns over the tree their effect is magnified. These entries finally outweigh possible profit of vectorization for remaining portion of the tree and we end up not vectorizing it at all (even if downstream optimizations could probably change it into optimal code). If SLP could make choice vectorization overhead vs gather intrinsic based in their costs while building vectorizable tree the outcome could be different.

anton-afanasyev added a comment.EditedNov 25 2020, 3:20 AM

Current SLP has significant drawback with regard to its cost modeling. And this patch highlights it.
Consider we have four scalar loads of i8 type. With prior approach (vectorization overhead) we had cost for such entry 4 (x86 target).
With this new approach we have two entries instead of one: ScatterVectorize loads + NeedToGather GEPs. And costs for these entries are 6 and 10 respectively, thus cost increased from 4 to 16.
And the problem here is once we put this pattern into the tree it pulls cost up for the entire tree. If we have multiple such patterns over the tree their effect is magnified. These entries finally outweigh possible profit of vectorization for remaining portion of the tree and we end up not vectorizing it at all (even if downstream optimizations could probably change it into optimal code). If SLP could make choice vectorization overhead vs gather intrinsic based in their costs while building vectorizable tree the outcome could be different.

Good point, thank you! As you said, that is not the problem specific for this patch exclusively. One can fix it by hacky cost comparing at the buildind tree stage, but I do believe the more general solution is preferable. Does this patch https://reviews.llvm.org/D57779 (vectorization throttling) fixes this? After greedy strategy of building the maximum tree we choose the cheapest part of it for vectorization.

Good point, thank you! As you said, that is not the problem specific for this patch exclusively. One can fix it by hacky cost comparing at the buildind tree stage, but I do believe the more general solution is preferable. Does this patch https://reviews.llvm.org/D57779 (vectorization throttling) fixes this? After greedy strategy of building the maximum tree we choose the cheapest part of it for vectorization.

No, I think https://reviews.llvm.org/D57779 is about a different thing. Here, we have new functionality which allows us to built the tree with gather-loads otherwise we just ignore it and thus have a different tree. I am not sure how to handle the case if it is accumulating those expensive operations. Maybe guard this new functionality by a flag for now?

Talked with @dtemirbulatov privately and reached a consensus that his patch reviews.llvm.org/D57779 fixes the issue defined above by @vdmitrie in general.

It sounds like throttling patch should resolve this issue as cutting out ScatterVectorize entry with high cost will effectively return to previous behavior.

It sounds like throttling patch should resolve this issue as cutting out ScatterVectorize entry with high cost will effectively return to previous behavior.

Yes, exactly. The only difference with previous behavior could arise in case of the new tree accumulating other instructions starting from ScatterVectorize and NeedToGather GEPs entries, preventing them from being contained in other parts of tree. But these entries are terminal, with so rare speculative exceptions that I believe it's good euristics for this case as well as for the general SLP drawback you mentioned: build the maximum tree and choose the cheapest subtree.

fhahn added a comment.Apr 18 2021, 3:14 AM

It looks like this commit may be causing a mis-compile: https://bugs.llvm.org/show_bug.cgi?id=50015