This is an archive of the discontinued LLVM Phabricator instance.

[LV] Avoid vectorizing first order recurrences when phi used outside loop
ClosedPublic

Authored by anna on Apr 10 2017, 3:14 PM.

Details

Summary

In the recurrence logic, there is an assumption that we vectorize such that the
last element in the vector will be the one extracted to pass into the scalar
remainder loop. However, this is not true when there is a phi used outside the loop.
In such a case, we need the value from the second last iteration (i.e. the phi value), not the last iteration (which would be the phi update). I've added a test case for
this as well. Also see PR32396.

A follow up patch would generate the correct code gen for such cases, and turn this vectorization on.

Event Timeline

anna created this revision.Apr 10 2017, 3:14 PM
anna added a comment.Apr 10 2017, 3:18 PM

@mssimpso : Hi Mathew, is there a reason to have the first order recurrence test (https://reviews.llvm.org/D16197) in the AArch folder? Could we move this out, since the logic is shared by all platforms? Thanks

mssimpso edited edge metadata.Apr 11 2017, 7:50 AM
In D31910#723084, @anna wrote:

@mssimpso : Hi Mathew, is there a reason to have the first order recurrence test (https://reviews.llvm.org/D16197) in the AArch folder? Could we move this out, since the logic is shared by all platforms? Thanks

Yes, those tests should definitely be moved to the common directory. Thanks! Feel free to go ahead and do so if you want. I will try and take a look at your patch soon.

Hi Anna,

I think we can probably solve this without putting a limitation on the trip count. But I'm wondering if it would be better for now to just avoid vectorizing this case, at least until we can work out a more robust code generation strategy for the first-order recurrences. Why not change RecurrenceDescriptor::isFirstOrderRecurrence to ensure the phi has no users outside the loop? I think that's the underlying issue, right? We're currently assuming that only the phi update can be used externally. But in your test case, it's the phi that's used externally. Just to make sure I understand things correctly, after vectorization we end up with something like the following for the external use:

%r.lcssa = phi i32 [ %scalar.recur, %for.body ], [ %vector.recur.extract, %middle.block ]

So if the scalar loop is executed, %r.lcssa is the phi, but if the trip count is evenly divided and the scalar loop is not executed, %r.lcssa is the phi's last update. And we currently assume we'll only encounter something like:

%r.lcssa = phi i32 [ %update, %for.body ], [ %vector.recur.extract, %middle.block ]

where it's the update that's used externally. I think the solution would eventually be something like:

middle.block:
  %vector.recur.extract = extractelement <4 x i32> %vector.update, i32 3
  %new.extract = extractelement <4 x i32> %vector.recur, i32 3
scalar.ph:
  %scalar.recur.init = phi i32 [ %vector.recur.extract, %middle.block ], ...
for.end:
  %r.lcssa = phi i32 [ %scalar.recur, %for.body ], [ %new.extract, %middle.block ]

where we add an additional extract for the last element of the vector phi, if the phi is used externally. What do you think? We could disable the external use for now to fix the correctness issue and then re-enable it once we get the code generation right.

anna added a comment.Apr 11 2017, 9:22 AM

Hi Mathew,

Hi Anna,

I think we can probably solve this without putting a limitation on the trip count. But I'm wondering if it would be better for now to just avoid vectorizing this case, at least until we can work out a more robust code generation strategy for the first-order recurrences. Why not change RecurrenceDescriptor::isFirstOrderRecurrence to ensure the phi has no users outside the loop? I think that's the underlying issue, right? We're currently assuming that only the phi update can be used externally. But in your test case, it's the phi that's used externally. Just to make sure I understand things correctly, after vectorization we end up with something like the following for the external use:

%r.lcssa = phi i32 [ %scalar.recur, %for.body ], [ %vector.recur.extract, %middle.block ]

So if the scalar loop is executed, %r.lcssa is the phi, but if the trip count is evenly divided and the scalar loop is not executed, %r.lcssa is the phi's last update. And we currently assume we'll only encounter something like:

%r.lcssa = phi i32 [ %update, %for.body ], [ %vector.recur.extract, %middle.block ]

where it's the update that's used externally.

Yes, that's right. The problem only occurs when we need the phi and not it's update at the end of loop. But there's 2 checks if we want to vectorize all cases except for the incorrect code gen cases:

  1. a phi node is used externally, and
  2. that phi node is not the loop termination condition.

If the phi node is the loop terminating condition, we would get the correct value for the phi even if we vectorize the loop completely (Trip count% step= 0).
This is the case (for VF = 4):

for.body:
  %inc.phi = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %val.phi = phi i32 [ 0, %entry ], [ %addx, %for.body ]
  %inc = add nsw i32 %inc.phi, 1
  %bc = zext i32 %inc.phi to i64.
  %addx = add nuw nsw i32 %inc.phi, %x
  %cmp = icmp eq i32 %inc.phi, 95. 
  br i1 %cmp, label %for.end, label %for.body

for.end:
  ret i32 %inc.phi

I think the solution would eventually be something like:

middle.block:
  %vector.recur.extract = extractelement <4 x i32> %vector.update, i32 3
  %new.extract = extractelement <4 x i32> %vector.recur, i32 3
scalar.ph:
  %scalar.recur.init = phi i32 [ %vector.recur.extract, %middle.block ], ...
for.end:
  %r.lcssa = phi i32 [ %scalar.recur, %for.body ], [ %new.extract, %middle.block ]

where we add an additional extract for the last element of the vector phi, if the phi is used externally. What do you think? We could disable the external use for now to fix the correctness issue and then re-enable it once we get the code generation right.

Actually, this is a good idea, it would have a lesser chance of causing any perf regression than the limiting of trip count. The reason I went with trip count limitation was I was thinking of this bug as extracting from the "correct index" i.e. the last or the second last, instead of generating 2 extracts from the phi and it's last update. And currently, we dont have a way of specifying a runtime index for the extract operation.
With the code gen change there would also be a change to the cost model to consider this added shuffle instruction.

I'll work on the disabling this part as the first step (and work on the robust code gen as a follow up change). Thanks for the input!

anna updated this revision to Diff 94877.Apr 11 2017, 12:53 PM

Updated diff to avoid vectorizing first order recurrences when phi used outside the loop
and the phi is not part of the loop terminating condition.
Based on Mathew's comments above, we have a better way to fix this rather than limiting the loop trip
count in the vectorized case.

anna retitled this revision from [LV] Fix logic to extract correct element in first order recurrences to [LV] Avoid vectorizing first order recurrences when phi used outside loop.Apr 11 2017, 12:56 PM
anna edited the summary of this revision. (Show Details)
anna updated this revision to Diff 94883.Apr 11 2017, 1:21 PM

Change AArch64 test case that already showcases a failing test case.
We should not be vectorizing the loop in this test case.

anna added a comment.Apr 11 2017, 1:26 PM

@mssimpso: Hi Matt, I've updated the diff to avoid vectorizing the loop when it generates incorrect code gen. Also, there's an AArch64 test case that already showcases this exact problem: generate incorrect code gen in first order recurrence. Please take a look. It's also aptly named phi_fail2!

mssimpso added inline comments.Apr 11 2017, 1:27 PM
lib/Transforms/Utils/LoopUtils.cpp
556–559 ↗(On Diff #94877)

I don't think I understand this. Why do we need to check if the phi is used by the back-edge condition? If a phi is used by the back-edge condition, we should probably recognize it as an induction variable, not a first-order recurrence. I think you can remove this, and just check that the phi has no external users.

test/Transforms/LoopVectorize/first-order-recurrence.ll
390–401

In this test, "%inc.phi" is the induction variable and "%val.phi" is the potential first-order recurrence. We already allow induction variables to have external users, so we don't need to worry about this case. I think this test will pass without your patch.

In D31910#724159, @anna wrote:

@mssimpso: Hi Matt, I've updated the diff to avoid vectorizing the loop when it generates incorrect code gen. Also, there's an AArch64 test case that already showcases this exact problem: generate incorrect code gen in first order recurrence. Please take a look. It's also aptly named phi_fail2!

Yeah, that's true. Thanks very much for sorting this out!

anna updated this revision to Diff 94884.Apr 11 2017, 1:40 PM
anna marked 2 inline comments as done.

remove the check for IV having outside users. That is already handled (i.e. vectorized correctly) by the vectorizer.

anna added inline comments.Apr 11 2017, 1:40 PM
lib/Transforms/Utils/LoopUtils.cpp
556–559 ↗(On Diff #94877)

removed this, and the test case corresponding to this passes as required.

test/Transforms/LoopVectorize/first-order-recurrence.ll
390–401

yes, thats right. Thanks for the clarification.

mssimpso accepted this revision.Apr 11 2017, 1:51 PM

The added @extract_last_iteration test isn't really checking anything - it get's completely simplified by InstCombine, and the first-order recurrence isn't even used outside the loop so it's not affected by the patch. I would just remove it (you might have already meant to do this). Otherwise, this patch LGTM. Thanks!

This revision is now accepted and ready to land.Apr 11 2017, 1:51 PM
This revision was automatically updated to reflect the committed changes.
anna added a comment.Apr 11 2017, 2:15 PM

The added @extract_last_iteration test isn't really checking anything - it get's completely simplified by InstCombine, and the first-order recurrence isn't even used outside the loop so it's not affected by the patch. I would just remove it (you might have already meant to do this). Otherwise, this patch LGTM. Thanks!

Yes, done.

anna added a comment.Apr 12 2017, 5:17 AM

FYI: The disabling patch is checked in. I'm starting work on the robust code gen for this case.