This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Vectorize mutual horizontal reductions.
AbandonedPublic

Authored by labrinea on Sep 7 2022, 11:06 AM.

Details

Summary

When we try to reduce a horizontal reduction which has external users, the vectorization may be considered unprofitable because of the scalar extraction cost. However, we sometimes encounter mutual reductions cancelling each other out without actually having to extract any scalars. This fixes issue #55350.

Diff Detail

Event Timeline

labrinea created this revision.Sep 7 2022, 11:06 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 7 2022, 11:06 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
labrinea requested review of this revision.Sep 7 2022, 11:06 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 7 2022, 11:06 AM
ABataev added inline comments.Sep 7 2022, 11:38 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070

Not sure it is a good idea to store reduction ops here, you're working with reduced values actually.

11799

Why do you need this one? In case of successful reduction, the vectorizer restarts the analysis and rebuilds the reduction graph.

Well, this issue shows up in other cases too, not just reductions, see for example https://reviews.llvm.org/D132773. Ideally I would prefer a more generic solution rather than workarounds for each specific case because each workaround is rather limited. For example this patch won't help vectorize the code if we have more than two reductions, or if the external users are not reductions like the stores in the following example:

define i64 @foo(ptr %ptr.0) {
entry:
  %ptr.1 = getelementptr inbounds i32, ptr %ptr.0, i64 1
  %ptr.2 = getelementptr inbounds i32, ptr %ptr.0, i64 2
  %ptr.3 = getelementptr inbounds i32, ptr %ptr.0, i64 3

  %0 = load i32, ptr %ptr.0, align 4, !tbaa !5
  %1 = load i32, ptr %ptr.1, align 4, !tbaa !5
  %2 = load i32, ptr %ptr.2, align 4, !tbaa !5
  %3 = load i32, ptr %ptr.3, align 4, !tbaa !5

  %add.1 = add i32 %1, %0
  %add.2 = add i32 %2, %add.1
  %add.3 = add i32 %3, %add.2

  %mul = mul i32 %0, %0
  %mul.1 = mul i32 %1, %1
  %mul.2 = mul i32 %2, %2
  %mul.3 = mul i32 %3, %3

  %add5.1 = add i32 %mul.1, %mul
  %add5.2 = add i32 %mul.2, %add5.1
  %add5.3 = add i32 %mul.3, %add5.2

  ; Non-reduction external users
  store i32 %0, ptr %ptr.0
  store i32 %1, ptr %ptr.1
  store i32 %2, ptr %ptr.2
  store i32 %3, ptr %ptr.3

  %conv = zext i32 %add.3 to i64
  %conv6 = zext i32 %add5.3 to i64
  %shl = shl nuw i64 %conv6, 32
  %add7 = or i64 %shl, %conv
  ret i64 %add7
}
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2656

Perhaps we could name this SkipCost or something along those lines, because that way we can reuse it for other similar workarounds?

Ideally I would prefer a more generic solution rather than workarounds for each specific case

Fair enough. Can you recommend a better solution that could fix both D132773 and this one?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070

If you look at the reproducer:

for i = ...
  sm += x[i];
  sq += x[i] * x[i];

the addition sm += x[i] (which is a reduction op of the first horizontal reduction) is an external user of the scalar load from the multiplication x[i] * x[i] (which is a reduction value of the second horizontal reduction)

11799

The idea is to separate the "matching" of a horizontal reduction (matchAssociativeReduction) from the "processing" of it (tryToReduce). That way we can postpone the processing until we have found at least another one. This allows us to identify mutual reductions and ignore the extraction cost of the common scalar values. As Vasilis mentioned this limits the amount of mutual reductions to two, which is not ideal.

ABataev added inline comments.Sep 8 2022, 4:21 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070

Aha, this is for external users. Then why do you need to store there reduced values?

11799

No need to postpone it, the pass will just repeat the same steps once again after the changes.

labrinea added inline comments.Sep 8 2022, 5:16 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070

Because similarly the multiplication (reduced value) is an external user of the scalar load from the other horizontal reduction.

11799

I am not following. What should the code look like then? How can we solve the problem of mutual reductions without looking ahead? PendingReduction serves the purpose of one element buffer if that makes it clearer.

ABataev added inline comments.Sep 8 2022, 6:21 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070

Ah, I see what do you mean. But this is not the best place for gathering the reduced values, they must be gathered during an attempt to build the tree, since they might be not vectorized but gathered. If they are gathered, you cannot treat them as a future vector.

11799

Just adjust the cost, everything else should be handled by the SLP vectorizer pass.

labrinea added inline comments.Sep 8 2022, 6:57 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070

I see. In other words we cannot speculate that the vectorization will eventually happen before trying to actually reduce, so the look ahead approach I am suggesting is rather unsafe? By the time we are attempting to build the tree it is already too late.

ABataev added inline comments.Sep 8 2022, 7:01 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070

You can gather such "vectorizable" scalars during/after the SLP graph building and store them in the set for the future analysis. Later, next SLP graphs will check if such scalars are potentially vectorizable and exclude the extract cost for them. The, SLP sees, that some part of the basic block was vectorized and restarts to try to vectorize other parts of the same block.

vporpo added a comment.Sep 8 2022, 9:52 AM

Ah, I see what do you mean. But this is not the best place for gathering the reduced values, they must be gathered during an attempt to build the tree, since they might be not vectorized but gathered. If they are gathered, you cannot treat them as a future vector.

Unlike the similar situation with store seeds, the reduction does not have the issue with the "good" lane ordering, and there are no dependency issues that could prohibit them to be scheduled in a specific way, so this looks a lot safer. If the reduction user is connected to a vectorizable TreeEntry and the current tree gets vectorized, isn't it guaranteed that the tree rooted at the reduction will also get vectorized?

Fair enough. Can you recommend a better solution that could fix both D132773 and this one?

One solution could be to collect the seeds that resulted in failed trees due to external users (and perhaps filter them more by checking if the users are actually seeds), and revisit the seeds once again. I think this is more or less what Alexey is describing.

Ah, I see what do you mean. But this is not the best place for gathering the reduced values, they must be gathered during an attempt to build the tree, since they might be not vectorized but gathered. If they are gathered, you cannot treat them as a future vector.

Unlike the similar situation with store seeds, the reduction does not have the issue with the "good" lane ordering, and there are no dependency issues that could prohibit them to be scheduled in a specific way, so this looks a lot safer.

No quite, e.g. if the reduced values are loads or something like this, they still might be not scheduled.

If the reduction user is connected to a vectorizable TreeEntry and the current tree gets vectorized, isn't it guaranteed that the tree rooted at the reduction will also get vectorized?

Yes, but need a check that the tree entry is going to be vectorized.

Fair enough. Can you recommend a better solution that could fix both D132773 and this one?

One solution could be to collect the seeds that resulted in failed trees due to external users (and perhaps filter them more by checking if the users are actually seeds), and revisit the seeds once again. I think this is more or less what Alexey is describing.

Yep, exactly! At least looks so :)

One solution could be to collect the seeds that resulted in failed trees due to external users (and perhaps filter them more by checking if the users are actually seeds), and revisit the seeds once again. I think this is more or less what Alexey is describing.

Yep, exactly! At least looks so :)

I am not sure how would I revisit the seeds though without changing the code. Also can you clarify what do you mean by seeds in this context? The root instruction of the reduction, the list of reduced values passed to buildTree, something else?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
11067–11070

That doesn't happen with horizontal reductions. If a reduction fails then its root instruction is put in the postponed list for later processing, where it's not treated as a reduction any more. At least that's my understanding.

I could indeed cache the scalars of a vectorizable tree entry later on, maybe inside getTreeCost (where the extract cost is available, allowing a more informed decision). However the reduction ops themselves are not part of the tree entry, and so we would still need to cache them, perhaps via the ignore list.

Also, the horizontal reduction is iteratively modeled by several vectorization trees of different factor. Which one do I pick to cache? All of them?

11799

As I explained on the other thread of comments, the unsuccessful reductions are not rescheduled the way you describe. I am still looking into how else could I make it work.

One solution could be to collect the seeds that resulted in failed trees due to external users (and perhaps filter them more by checking if the users are actually seeds), and revisit the seeds once again. I think this is more or less what Alexey is describing.

Yep, exactly! At least looks so :)

I am not sure how would I revisit the seeds though without changing the code. Also can you clarify what do you mean by seeds in this context? The root instruction of the reduction, the list of reduced values passed to buildTree, something else?

This should require code changes in at least a couple of places:

  1. Checking that a tree failed to vectorize because of external uses (changes mainly in`getTreeCost()`).
  2. Collecting the roots that failed to vectorize because of external uses (these are the "seeds" that I am referring to). This should require changes after the invocations of getTreeCost() and the checks against SLPCostThreshold.
  3. Adding some logic to retry vectorizing the missed sseds. This is probably the trickiest part because we don't really know when it is best to retry vectorizing these trees (unless perhaps we also keep track of when the external uses get vectorized?) . I was having a similar issue with the other patch: the vectorizer would succeed vectorizing the code 2-wide, before I would retry vectorizing from the missed 4-wide seed, which resulted in worse than optimal code. But anyway, retrying to vectorize such missed opportunities is definitely better than not retrying at all.

What you don't need to change is the cost calculation for external uses, as in this way the vectorizer is already calculating the costs correctly for you, as this time the uses will have already been vectorized. I think this is what Alexey meant with his comment (line 11778).

One solution could be to collect the seeds that resulted in failed trees due to external users (and perhaps filter them more by checking if the users are actually seeds), and revisit the seeds once again. I think this is more or less what Alexey is describing.

Yep, exactly! At least looks so :)

I am not sure how would I revisit the seeds though without changing the code. Also can you clarify what do you mean by seeds in this context? The root instruction of the reduction, the list of reduced values passed to buildTree, something else?

This should require code changes in at least a couple of places:

  1. Checking that a tree failed to vectorize because of external uses (changes mainly in`getTreeCost()`).
  2. Collecting the roots that failed to vectorize because of external uses (these are the "seeds" that I am referring to). This should require changes after the invocations of getTreeCost() and the checks against SLPCostThreshold.
  3. Adding some logic to retry vectorizing the missed sseds. This is probably the trickiest part because we don't really know when it is best to retry vectorizing these trees (unless perhaps we also keep track of when the external uses get vectorized?) . I was having a similar issue with the other patch: the vectorizer would succeed vectorizing the code 2-wide, before I would retry vectorizing from the missed 4-wide seed, which resulted in worse than optimal code. But anyway, retrying to vectorize such missed opportunities is definitely better than not retrying at all.

What you don't need to change is the cost calculation for external uses, as in this way the vectorizer is already calculating the costs correctly for you, as this time the uses will have already been vectorized. I think this is what Alexey meant with his comment (line 11778).

Yep. Current solution is very limited, handles only 2 (possible) reductions while the described change may be able to handle more reductions and other vectorizable patterns too.

labrinea updated this revision to Diff 461599.Sep 20 2022, 10:09 AM
labrinea retitled this revision from [SLP] Look ahead for mutual horizontal reductions. to [SLP] Vectorize mutual horizontal reductions..

I tried to follow the review suggestions as close as possible. However I am not sure about the conditions inside cacheVectorizableValues() and those that determine the value of SkipCost. Relaxing them makes the change to trigger way too broadly resulting additional insert/extract element instructions in the unit tests.

ABataev added inline comments.Sep 20 2022, 10:37 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
921

How can the user node be equal to the operand node?

2710–2713

Not sure this is the best option to keep external vectorizable values, again it does not allow to handle values across different graphs, only across the same instance.

7123–7125

This again is too optimistic, you need to try to estimate the cost of extra shuffles, required for the vectorization.

7303–7306

Maybe put it to deleteTree() function and check if the graph-to-be-deleted was built but was not vectorized?

11320

Why return?

labrinea added inline comments.Sep 21 2022, 10:29 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2710–2713

I am not sure I am following your line of thought. As far as I understand the vectorizer keeps one active graph at a time, created by buildTree(), which either gets vectorized or destroyed and then another bundle of instructions is scheduled and so on. The DenseMap I added works as a cache of tree entries that failed to vectorize due to extraction cost. Therefore it holds values across different graphs if that makes sense. As I said earlier, the change triggers in quite a few unit tests if the conditions are relaxed a bit, but I couldn't tell for sure if the new codegen was better or worse than before. If that's not what you meant, can you please explain and maybe suggest an alternative approach?

7123–7125

I thought that the shuffle cost is already accounted inside getEntryCost(), am I wrong? Or maybe that logic needs to be patched too?

11320

Because returning a non null value means the result will be put back in the queue and rescheduled later, which is what we want for trees that failed due to extraction cost. See line 11831 below.

11831

When I == Inst here it means that tryToReduce() returned the root of the horizontal reduction and not a newly vectorized tree.

ABataev added inline comments.Sep 21 2022, 10:33 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
2710–2713

Yes, but it is able to repeat this process and build new graphs with the adjusted costs. And we can reuse this cache for other cases, where could not vectorize the graph because of the external costs

7123–7125

Nope, for external uses you need to add this cost somehow.

labrinea added inline comments.Sep 29 2022, 8:03 AM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7123–7125

Another thing I tried was to cache the result of getEntryCost() for the tree entries that correspond to the external users, and then use that cost here instead of the extraction cost estimate. I suppose that's not correct either judging from the produced codegen. I am not familiar with the SLP vectorizer and so my ideas are based on intuition rather than actual understanding I am affraid. I'd be open for an offline discussion if you think that would be enlightening/beneficial.

I am not entirely sure but it doesn't seem possible to predict the extraction cost without actually vectorizing first and then retrospectively reflecting upon it. While reading the debug output I noticed that insert/extract-element code is being added when vectorizing. Then this code either gets fully vectorized and removed, in which case the cost estimate ought to be zero, or it remains in the final codegen.

ABataev added inline comments.Sep 30 2022, 1:39 PM
llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
7123–7125

You cannot estimate it when you see the instruction for the first time, but if it is the second time, you can have the info already about its position in the previous tree/graph. And thus you can get info about possible extra shuffles.

Also, you should be careful with extractelement instructions emitting, they may led to gather nodes and again too optimistic cost.

labrinea abandoned this revision.Sep 4 2023, 2:09 AM

Fixed in D155459.