Page MenuHomePhabricator

[SLP] Make SLPVectorizer to use `llvm.masked.gather` intrinsic
ClosedPublic

Authored by anton-afanasyev on Oct 30 2020, 12:40 AM.

Details

Summary

For the scattered operands of load instructions it makes sense
to use gathering load intrinsic, which can lower to native instruction
for X86/AVX512 and ARM/SVE. This also enables building
vectorization tree with entries containing scattered operands.
The next step is to add scattered store.

Fixes PR47629 and PR47623

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
anton-afanasyev marked an inline comment as done.

Fixed asserts again

Small fix of debug printer

Also, please add ScatterVectorize to TreeEntry.dump()

Also, please add ScatterVectorize to TreeEntry.dump()

It's already here, line 1704.

dtemirbulatov accepted this revision.Sat, Nov 14, 6:57 AM

Looks good, any other objections?

This revision is now accepted and ready to land.Sat, Nov 14, 6:57 AM
RKSimon accepted this revision.Sat, Nov 14, 9:01 AM

LGTM with one minor

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1555–1558

Minor - please add comment explaining the purpose of each enum

ABataev added inline comments.Sat, Nov 14, 9:32 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2868

Looks like you still do not pass the vectorization state as an argument. I thought we agreed to modify Bundle related param to be an Optional of EntryState and Bundle types? Better to create the entry with the correctly set inner state rather than modify them outside of the class. It breaks encapsulation.

anton-afanasyev marked 3 inline comments as done.

Add comment

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1555–1558

Thank you, done!

2868

Sorry, missed this comment. Yes, we agreed to do this, but as a separate NFC patch.

3725

In general, yes, we can end up with reordering, but this could be optimized with preliminary pointers vector shuffling. Though I think it's rather rare case (load-gather plus reordering meeting together), I can prepare separate patch to optimize this after investigation of its impact.

ABataev added inline comments.Sat, Nov 14, 9:59 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2868

Yes, thanks. Could you prepare this NFC patch before commiting this patch?

Added optional EntryState param to newTreeEntry()

anton-afanasyev marked an inline comment as done.Sat, Nov 14, 9:38 PM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2868

Hmm, I've decided to get rid of scary make_pair inside all newTreeEntry() callings, leaving most of the code untouched. So, added optional EntryState param to newTreeEntry() function. Minimal changes, so no need for separate patch. Is it good now?

anton-afanasyev marked an inline comment as done.Sat, Nov 14, 9:38 PM

Fix tests after "--check-prefixes" change (D90445) and fix one assert

RKSimon accepted this revision.Sun, Nov 15, 7:02 AM

LGTM with one minor

llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
6

Sorry, I think you need to remove the 'CHECK' from all of these and just use SSE/AVX

anton-afanasyev marked an inline comment as done.

Fix test, CHECK prefix is unused

llvm/test/Transforms/SLPVectorizer/X86/pr47623.ll
6

Ok, done

anton-afanasyev edited the summary of this revision. (Show Details)Sun, Nov 15, 7:18 AM
ABataev added inline comments.Mon, Nov 16, 8:51 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1753–1769

It may conflict with State param value and may lead to incorrect undrstanding/decisions in future. That's why I proposed to merge EntryState and Bundle into one Optional<std::pair<>>, though this solution also is not quite optimal.
Maybe better to split this function into 2 overloaded copies, one for gather and one for vectorized state?

anton-afanasyev marked an inline comment as done.Mon, Nov 16, 1:19 PM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1753–1769

Would it be better to leave all as it was before and just add setEntryState() method to accurately set ScatterVectorize (or other state in future)?

anton-afanasyev marked an inline comment as done.Mon, Nov 16, 1:19 PM
ABataev added inline comments.Mon, Nov 16, 1:25 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1753–1769

Nah, it is similar to what was before. Better to set the correct state when constructing the state rather than construct it with the incorrect flag and then trying to fix it on-the-fly. It is bad design decision and always a source of bugs.

anton-afanasyev marked an inline comment as done.

Add overloaded newTreeEntry() for common case

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1753–1769

Ok, thanks! So I created overloaded copy of newTreeEntry() especially for EntryState given explicitly. Is it good now?

ABataev added inline comments.Mon, Nov 16, 2:29 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1755

I think it will crash if Bundle is None

1761
  1. Better to make Optional<std::pair<EntryState, ScheduleData *>>.
  2. I would add `assert(!StateAndBundle || StateAndBundle.getValue().first != NeedToGather && "...");
anton-afanasyev marked 2 inline comments as done.Mon, Nov 16, 3:22 PM
anton-afanasyev added inline comments.
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1755

Thank you, fixed!

1761
  1. I tried to get rid of std::pair again, now looks better, doesn't it?
  2. Thanks, added.
anton-afanasyev marked 2 inline comments as done.Mon, Nov 16, 3:22 PM

Fixes and assert

ABataev added inline comments.Mon, Nov 16, 3:36 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1766

What if !Bundle && EntryState != NeedToGather, i.e. trying to vectorize gather entry?

anton-afanasyev marked an inline comment as done.

Added another assert

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
1766

Thanks, added second assert.

This revision was landed with ongoing or failed builds.Tue, Nov 17, 7:12 AM
This revision was automatically updated to reflect the committed changes.
fhahn added a comment.Tue, Nov 17, 7:19 AM

It looks like this change may cause a crash when building LNT, e.g. http://lab.llvm.org:8011/#/builders/105/builds/1899/steps/7/logs/stdio

It looks like this change may cause a crash when building LNT, e.g. http://lab.llvm.org:8011/#/builders/105/builds/1899/steps/7/logs/stdio

Thank you, fixing!

Saw some random miscompiles after this. Fixed by 4dbe12e86649ba6b5f03a9ba97e84d718727f7a7, can you check if I got it right?

Saw some random miscompiles after this. Fixed by 4dbe12e86649ba6b5f03a9ba97e84d718727f7a7, can you check if I got it right?

Yes, thanks, looks good!

fhahn added a comment.Mon, Nov 23, 6:09 AM

It appears this is causing a ~2-3% regression on AArch64 for some benchmarks, including CINT2000/256.bzip2. Any ideas? It might be caused by underestimating the cost of gather/scatter on AArch64.

It appears this is causing a ~2-3% regression on AArch64 for some benchmarks, including CINT2000/256.bzip2. Any ideas? It might be caused by underestimating the cost of gather/scatter on AArch64.

Poor target specific costs is likely to be the problem - either for explicit gather intrinsic costs or the scalarization overhead fallback.

If isLegalMaskedGather is returning false (as it will do for AArch64 NEON), then should this even be attempting to use gather/scatter? I would have thought that will never do worth it if it is just going to scalarize again?

If isLegalMaskedGather is returning false (as it will do for AArch64 NEON), then should this even be attempting to use gather/scatter? I would have thought that will never do worth it if it is just going to scalarize again?

SLP should always create gathers with constant masks while isLegalMaskedGather covers the more general variable masks as well - we can always legalize constant mask gathers back to a buildvector or insertelement chain so its often worthwhile using the gather intrinsic to act for that case as well, hence it being important we correctly handle the scalarization overhead fallback.

I believe this issue is related to the default cost for getGatherScatterOpCost(). For the arch not having gather/scatter instrs we use TargetTransformInfoImplBase::getGatherScatterOpCost() which returns 1 unconditionally: https://github.com/llvm/llvm-project/blob/release/11.x/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h#L480
I'd fix it by setting something like 1024 for the default cost.

I believe this issue is related to the default cost for getGatherScatterOpCost(). For the arch not having gather/scatter instrs we use TargetTransformInfoImplBase::getGatherScatterOpCost() which returns 1 unconditionally: https://github.com/llvm/llvm-project/blob/release/11.x/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h#L480
I'd fix it by setting something like 1024 for the default cost.

yes, I managed to track it down to that. I pushed some additional tests and D91984 to add a more realistic estimate for the cost when scalarizing.

Current SLP has significant drawback with regard to its cost modeling. And this patch highlights it.
Consider we have four scalar loads of i8 type. With prior approach (vectorization overhead) we had cost for such entry 4 (x86 target).
With this new approach we have two entries instead of one: ScatterVectorize loads + NeedToGather GEPs. And costs for these entries are 6 and 10 respectively, thus cost increased from 4 to 16.
And the problem here is once we put this pattern into the tree it pulls cost up for the entire tree. If we have multiple such patterns over the tree their effect is magnified. These entries finally outweigh possible profit of vectorization for remaining portion of the tree and we end up not vectorizing it at all (even if downstream optimizations could probably change it into optimal code). If SLP could make choice vectorization overhead vs gather intrinsic based in their costs while building vectorizable tree the outcome could be different.

anton-afanasyev added a comment.EditedWed, Nov 25, 3:20 AM

Current SLP has significant drawback with regard to its cost modeling. And this patch highlights it.
Consider we have four scalar loads of i8 type. With prior approach (vectorization overhead) we had cost for such entry 4 (x86 target).
With this new approach we have two entries instead of one: ScatterVectorize loads + NeedToGather GEPs. And costs for these entries are 6 and 10 respectively, thus cost increased from 4 to 16.
And the problem here is once we put this pattern into the tree it pulls cost up for the entire tree. If we have multiple such patterns over the tree their effect is magnified. These entries finally outweigh possible profit of vectorization for remaining portion of the tree and we end up not vectorizing it at all (even if downstream optimizations could probably change it into optimal code). If SLP could make choice vectorization overhead vs gather intrinsic based in their costs while building vectorizable tree the outcome could be different.

Good point, thank you! As you said, that is not the problem specific for this patch exclusively. One can fix it by hacky cost comparing at the buildind tree stage, but I do believe the more general solution is preferable. Does this patch https://reviews.llvm.org/D57779 (vectorization throttling) fixes this? After greedy strategy of building the maximum tree we choose the cheapest part of it for vectorization.

Good point, thank you! As you said, that is not the problem specific for this patch exclusively. One can fix it by hacky cost comparing at the buildind tree stage, but I do believe the more general solution is preferable. Does this patch https://reviews.llvm.org/D57779 (vectorization throttling) fixes this? After greedy strategy of building the maximum tree we choose the cheapest part of it for vectorization.

No, I think https://reviews.llvm.org/D57779 is about a different thing. Here, we have new functionality which allows us to built the tree with gather-loads otherwise we just ignore it and thus have a different tree. I am not sure how to handle the case if it is accumulating those expensive operations. Maybe guard this new functionality by a flag for now?

Talked with @dtemirbulatov privately and reached a consensus that his patch reviews.llvm.org/D57779 fixes the issue defined above by @vdmitrie in general.

It sounds like throttling patch should resolve this issue as cutting out ScatterVectorize entry with high cost will effectively return to previous behavior.

It sounds like throttling patch should resolve this issue as cutting out ScatterVectorize entry with high cost will effectively return to previous behavior.

Yes, exactly. The only difference with previous behavior could arise in case of the new tree accumulating other instructions starting from ScatterVectorize and NeedToGather GEPs entries, preventing them from being contained in other parts of tree. But these entries are terminal, with so rare speculative exceptions that I believe it's good euristics for this case as well as for the general SLP drawback you mentioned: build the maximum tree and choose the cheapest subtree.