This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Vectorize loads in horizontal reductions when they happen to be in the reverse order
ClosedPublic

Authored by mkuper on Jul 19 2016, 5:05 PM.

Details

Summary

This fixes PR28474.

Admittedly, this is probably not the "right way" to fix this. We already had one case we handle by rebuilding the entire tree with the roots reversed (line 3816), and this adds another one, but this does a lot of redundant work, and obviously can't work for arbitrary orders of the loads.

What we'd really want is to sort the loads "on-the-fly" while building the tree, at least in the cases where the order of the scalars doesn't matter (i.e. reductions). In theory, we could also have a load + shuffle when the order does matter, or even prefer sorting loads to sorting stores when we have a store-rooted reduction, but that'd require additional cost modeling. Unfortunately, I still don't grok the SLP vectorizer enough to understand how to do that correctly. It's not just a question of making the tree mutable - it seems like there are plenty of places that assume the order doesn't change (treating VL[0] as special, isSame() that cares about the order, external users, etc.).

Advice will be appreciated.

Diff Detail

Repository
rL LLVM

Event Timeline

mkuper updated this revision to Diff 64606.Jul 19 2016, 5:05 PM
mkuper retitled this revision from to [SLP] Vectorize loads in horizontal reductions when they happen to be in the reverse order.
mkuper updated this object.
mkuper added reviewers: mzolotukhin, mssimpso.
mkuper added a subscriber: llvm-commits.
mssimpso edited edge metadata.Jul 21 2016, 1:18 PM

Hi Michael,

Thanks for working on this. The issue of unordered loads seems like an arbitrary restriction to me that needs fixing. The patch basically looks okay; I inlined a few comments.

I agree with most of your high-level points. We shouldn't really need to pay the cost of rebuilding the entire tree when we find that the loads are unordered. In fact, except for loads and stores, the order of the instructions shouldn't really matter at all. The fact that we assume an order in some cases is a problem that should be fixable. For example, I think VL[0] is commonly used as a representative of the bundle for things like getting the type, not to imply an ordering. I bet we could eliminate most instances of VL[0] with appropriate additions to TreeEntry. And we keep track of a Lane index for ExternalUsers just so we can extract the correct value from the vectors. But this is already kind of redundant since ExternalUser also holds a pointer to the scalar value and we maintain a list of the scalar values in TreeEntry.

So The difficult case seems to be the store-rooted trees with unordered loads. For that, we could use shuffles like you mentioned.

I think it's fine to only check for the reverse-consecutive case since that's all we are currently doing, but a more general solution would eventually be nice to have. Compile-time might be a concern there, though.

Matt.

lib/Transforms/Vectorize/SLPVectorizer.cpp
884 ↗(On Diff #64606)

Please update the comment since we now track reversed bundles of size greater than 2.

1158 ↗(On Diff #64606)

We also check for the isSimple case here as well. It would be nice to make the comment more detailed since you're already changing it.

1174 ↗(On Diff #64606)

We can break out of the loop when setting Consecutive to false to avoid unneeded calls to isConsecutiveAccess.

But I'm wondering if there other ways to make this faster. We are looking for the all consecutive or all reverse-consecutive cases. So for example if we find at least one consecutive pair in this first loop, there's no chance the list can be all reverse-consecutive.

3818 ↗(On Diff #64606)

This assert is confusing now. I think it only makes sense as-is because allowReorder is only true when coming from tryToVectorizePair. But you've removed the size-equal-2 restriction when checking for the reverse consecutive case. I think we should add a more explanatory comment. What do you think?

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
4–6 ↗(On Diff #64606)

Please use regular expressions for the unnamed instructions to avoid future breakage.

Thanks a lot for the review, Matt!

I agree with most of your high-level points. We shouldn't really need to pay the cost of rebuilding the entire tree when we find that the loads are unordered. In fact, except for loads and stores, the order of the instructions shouldn't really matter at all. The fact that we assume an order in some cases is a problem that should be fixable. For example, I think VL[0] is commonly used as a representative of the bundle for things like getting the type, not to imply an ordering. I bet we could eliminate most instances of VL[0] with appropriate additions to TreeEntry. And we keep track of a Lane index for ExternalUsers just so we can extract the correct value from the vectors. But this is already kind of redundant since ExternalUser also holds a pointer to the scalar value and we maintain a list of the scalar values in TreeEntry.

Yes, that seems fairly simple in theory, but I tried to actually implement something like this - without rewriting the whole thing from scratch :-) - and ran into more trouble than I expected.
Then again, it may just be my lack of familiarity with the code.

So The difficult case seems to be the store-rooted trees with unordered loads. For that, we could use shuffles like you mentioned.

I think It's a bit more complex than that.
I agree the most difficult case is store-rooted trees with unordered loads, but there are at least two other issues that aren't related to the technical problems with getting in-tree sorting to work:

  1. Even for non-store-rooted trees, we may have several load leaves in the tree, and they may need different orders.
  2. There may be a cost modeling issue - I'm not sure a wide load + shuffles is always better than scalar loads. The decision may depend on the width of the vector and on how expensive the specific shuffle is.
lib/Transforms/Vectorize/SLPVectorizer.cpp
884 ↗(On Diff #64606)

Right, thanks!

1158 ↗(On Diff #64606)

Sure.

1174 ↗(On Diff #64606)

Good point. I'll try to make it a bit more efficient.

3818 ↗(On Diff #64606)

Right. At first, I removed the assert, and replaced this code with std::reverse, like the other case, but decided to keep the assert because it made which cases are currently supposed to get here clear. I'll add a comment explaining this.

test/Transforms/SLPVectorizer/X86/reduction_loads.ll
4–6 ↗(On Diff #64606)

Sure.

mkuper updated this revision to Diff 65000.Jul 21 2016, 5:14 PM
mkuper edited edge metadata.

Addressed Matt's comments.

mssimpso accepted this revision.Jul 22 2016, 1:50 PM
mssimpso edited edge metadata.

LGTM after a few more changes. Thanks!

lib/Transforms/Vectorize/SLPVectorizer.cpp
1206–1209 ↗(On Diff #65000)

I think this would be more clear if instead of the !Consecutive check, we have the following after setting Consecutive/ReverseConsecutive

if (Consecutive) {
  ++NumLoadsWantToKeepOrder;
  newTreeEntry(VL, true);
  DEBUG(dbgs() << "SLP: added a vector of loads.\n");
  return;
}

// If none of the load pairs...
test/Transforms/SLPVectorizer/X86/reduction_loads.ll
1 ↗(On Diff #65000)

You don't need the basicaa or dce flags for this test.

This revision is now accepted and ready to land.Jul 22 2016, 1:50 PM
This revision was automatically updated to reflect the committed changes.