This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Fix for a bug in jumbled memory access vectorization introduced in r293386
ClosedPublic

Authored by ashahid on Feb 19 2017, 10:24 PM.

Details

Summary

The bug is due to the absence of the in-order list of scalar uses, which needs to be available to the VectorizeTree() API. The API uses it to compute the proper mask for the "shufflevector" IR.

The fix is to compute the mask for out-of-order memory accesses while building the vectorizable tree, rather than during the actual vectorization of the tree.
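
As a rough illustration of that approach, here is a minimal standalone sketch (plain C++ with standard-library types, not the actual LLVM patch): given the byte offsets of a bundle of jumbled loads, it computes the memory order in which they could be vectorized and the shuffle mask that maps each original scalar to its lane in the sorted vector. The names and the mask convention are assumptions made for the example.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

struct JumbledInfo {
  std::vector<unsigned> SortedOrder; // SortedOrder[Lane] = original index of the
                                     // scalar placed in that lane of the sorted vector.
  std::vector<unsigned> Mask;        // Mask[OrigIdx] = lane of the sorted vector
                                     // holding the scalar with original index OrigIdx.
};

JumbledInfo analyzeJumbledLoads(const std::vector<int64_t> &Offsets) {
  JumbledInfo Info;
  Info.SortedOrder.resize(Offsets.size());
  std::iota(Info.SortedOrder.begin(), Info.SortedOrder.end(), 0u);
  std::sort(Info.SortedOrder.begin(), Info.SortedOrder.end(),
            [&](unsigned L, unsigned R) { return Offsets[L] < Offsets[R]; });
  Info.Mask.resize(Offsets.size());
  for (unsigned Lane = 0; Lane < Info.SortedOrder.size(); ++Lane)
    Info.Mask[Info.SortedOrder[Lane]] = Lane;
  return Info;
}

// For offsets {8, 0, 12, 4} this yields SortedOrder {1, 3, 0, 2} and
// Mask {2, 0, 3, 1}: the scalar at original position 0 ends up in lane 2
// of the sorted (vectorized) load, and so on.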

Diff Detail

Event Timeline

ashahid created this revision. Feb 19 2017, 10:24 PM
mkuper edited edge metadata. Feb 20 2017, 3:17 PM

This is getting a bit hairy, I think - we have too many valuelists flying around.

How about something like this:
When we sort the vectors, we can also use the sort to generate the corresponding shuffle mask (instead of recomputing it in O(|VL|^2) time on lines 2623-2634), by sorting pairs of <Value, Index>.
We can then keep an optional shuffle mask instead of the NeedToShuffle bit as part of the tree node, and use it to get the correctly ordered list of scalars back from VectorizableTree[0].Scalars when we need to.

What do you think? My gut feeling is that keeping the mask around is worth it just to avoid the N^2 loop in 2623, and the fix to the bug falls out of it.
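
A hedged sketch of the second half of that suggestion, using hypothetical stand-in names (the struct and fields below are illustrative, not the SLP vectorizer's actual TreeEntry): the tree node carries an optional mask, and the in-order scalar list can be recovered from it on demand.

#include <cassert>
#include <string>
#include <vector>

struct TreeEntrySketch {
  std::vector<std::string> Scalars;  // scalars in sorted (vector-lane) order
  std::vector<unsigned> ShuffleMask; // empty => the bundle was already in order;
                                     // otherwise ShuffleMask[OrigIdx] = lane of
                                     // the scalar with original index OrigIdx.
};

// Recover the scalars in their original order when a mask is present,
// replacing the need for a separate NeedToShuffle flag or a second list.
std::vector<std::string> inOrderScalars(const TreeEntrySketch &E) {
  if (E.ShuffleMask.empty())
    return E.Scalars;
  assert(E.ShuffleMask.size() == E.Scalars.size());
  std::vector<std::string> InOrder(E.Scalars.size());
  for (unsigned OrigIdx = 0; OrigIdx < E.Scalars.size(); ++OrigIdx)
    InOrder[OrigIdx] = E.Scalars[E.ShuffleMask[OrigIdx]];
  return InOrder;
}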

This is getting a bit hairy, I think - we have too many valuelists flying around.

How about something like this:
When we sort the vectors, we can also use the sort to generate the corresponding shuffle mask (instead of recomputing it in O(|VL|^2) time on lines 2623-2634), by sorting pairs of <Value, Index>.

The basic idea behind keeping the mask computation in its current place was that we wanted to compute the mask (the N^2 loop) only if vectorization is found to be beneficial.

On the other hand, if we do the mask computation while sorting the memory accesses, the utility of the SortMemAccesses API may be reduced. However, this is not a big problem and can be worked around by keeping a bit that decides whether to compute the mask or not.

We can then keep an optional shuffle mask instead of the NeedToShuffle bit as part of the tree node, and use it to get the correctly ordered list of scalars back from VectorizableTree[0].Scalars when we need to.

What do you think? My gut feeling is that keeping the mask around is worth it just to avoid the N^2 loop in 2623, and the fix to the bug falls out of it.

Although I am not an expert, I think keeping another ValueList for unordered scalars is not a bad design. However, I am open to other views.

Regards,
Shahid

This is getting a bit hairy, I think - we have too many valuelists flying around.

How about something like this:
When we sort the vectors, we can also use the sort to generate the corresponding shuffle mask (instead of recomputing it in O(|VL|^2) time on lines 2623-2634), by sorting pairs of <Value, Index>.

The basic idea behind keeping the mask computation in its current place was that we wanted to compute the mask (the N^2 loop) only if vectorization is found to be beneficial.

Yes, but we can avoid the O(N^2) computation altogether, right? I mean, doing that doesn't add any cost to the sort. Or am I missing something?

On the other hand, if we do the mask computation while sorting the memory accesses, the utility of the SortMemAccesses API may be reduced. However, this is not a big problem and can be worked around by keeping a bit that decides whether to compute the mask or not.

Don't we sort before we know whether we're going to need the mask?

We can then keep an optional shuffle mask instead of the NeedToShuffle bit as part of the tree node, and use it to get the correctly ordered list of scalars back from VectorizableTree[0].Scalars when we need to.

What do you think? My gut feeling is that keeping the mask around is worth it just to avoid the N^2 loop in 2623, and the fix to the bug falls out of it.

Although I am not an expert, I think keeping another ValueList for unordered scalars is not a bad design. However, I am open to other views.

It seems odd to have both the sorted and the unsorted list. The other thing that bothers me is that we have some invariants which don't seem to be well-defined. Is the guarantee that "Scalars" is ordered to have the "best" order to allow vectorization, while InOrdScalars is ordered according to the original IR order? Also, preserving "program order" is kind of meaningless. The fact we care about order at all is just an artifact of the way we build the tree. What we really care about is mapping each scalar leaf to the vector lane it eventually ends up in, which is precisely the shuffle mask.

Anyway, I'm also open to other opinions, if more people care about this.
What I'm really trying to do here is to make the code make a bit more sense - right now, it makes almost zero sense to me. (This isn't the fault of your patch, I'm talking about the general state of the SLP vectorizer).
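
To make the point about the mask being the leaf-to-lane mapping concrete, here is a tiny sketch using the same assumed mask convention as in the sketches above (Mask[OrigIdx] = vector lane); finding the lane a given scalar, for example an external use, should be extracted from then becomes a single array read, while the reverse lookup is a short scan.

#include <vector>

// Assumed convention: Mask[OrigIdx] is the vector lane holding the scalar
// whose original (IR) position in the bundle was OrigIdx.
unsigned laneForScalar(const std::vector<unsigned> &Mask, unsigned OrigIdx) {
  return Mask[OrigIdx];
}

// The other direction: which original scalar landed in a given lane.
unsigned scalarForLane(const std::vector<unsigned> &Mask, unsigned Lane) {
  for (unsigned OrigIdx = 0; OrigIdx < Mask.size(); ++OrigIdx)
    if (Mask[OrigIdx] == Lane)
      return OrigIdx;
  return Lane; // identity fallback if the mask does not mention the lane
}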

Thanks Michael, my bad - I completely missed that we can avoid the O(N^2) mask computation altogether. Moreover, we don't need the extra VL and the NeedToShuffle bit.
I will update the patch accordingly.

ashahid edited the summary of this revision.

Updated the patch to incorporate the review comments.

mkuper added inline comments. Feb 24 2017, 6:44 PM
lib/Analysis/LoopAccessAnalysis.cpp
1063–1064

The API seems a bit odd - having to pass a SmallVector when you know you don't want it to be filled doesn't look right.
I'd prefer something like passing in a pointer, and filling the vector in only if it's not null (or passing in an Optional<> reference, to similar effect).
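
A sketch of the API shape being suggested, with illustrative names and standard-library stand-ins (this is not the actual LoopAccessAnalysis signature): the mask is an optional out-parameter that is only filled when the caller passes a non-null pointer.

#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

void sortAccessesSketch(const std::vector<int64_t> &Offsets,
                        std::vector<unsigned> &SortedOrder,
                        std::vector<unsigned> *Mask = nullptr) {
  SortedOrder.resize(Offsets.size());
  std::iota(SortedOrder.begin(), SortedOrder.end(), 0u);
  std::sort(SortedOrder.begin(), SortedOrder.end(),
            [&](unsigned L, unsigned R) { return Offsets[L] < Offsets[R]; });
  if (!Mask)
    return; // caller does not need the shuffle mask
  Mask->assign(Offsets.size(), 0);
  for (unsigned Lane = 0; Lane < SortedOrder.size(); ++Lane)
    (*Mask)[SortedOrder[Lane]] = Lane;
}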

1097–1108

I was thinking along the lines of directly sorting an array of tuples <Offset, Value*, Index> - and then you can pull the mask out of it.
But that may turn out uglier than what you have here. Your call.

1111–1119

Redundant braces.

lib/Transforms/Vectorize/SLPVectorizer.cpp
2626

This seems unrelated.
Should it have been here before as well?

ashahid added inline comments. Feb 27 2017, 10:36 PM
lib/Analysis/LoopAccessAnalysis.cpp
1063–1064

Ok

1097–1108

With a tuple, I think we would still need two std::sort() calls. I am opting for the simpler option over the uglier one.

lib/Transforms/Vectorize/SLPVectorizer.cpp
2626

Earlier this path was not tested properly, but thanks to this bug we found it.

Since in this particular case the vectorized value is used from the tree entry data instead of from the return value, if we don't update E->VectorizedValue, the "shufflevector" IR will not be used; the vectorized load IR will be used instead, which is not the intention.

Updated the patch to incorporate the review comments.

mkuper accepted this revision. Feb 27 2017, 11:57 PM

LGTM, with some style comments inline.

lib/Analysis/LoopAccessAnalysis.cpp
1097

IIRC, you can do something like "SmallVector<unsigned, 4> UseOrder(VL.size())", and then "UseOrder[i] = i".
This is slightly shorter, and maybe a bit prettier, but feel free to leave as is if you prefer it.
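
The same suggestion spelled out with standard-library stand-ins (std::vector in place of SmallVector); std::iota is another option with the same effect:

#include <numeric>
#include <vector>

std::vector<unsigned> makeIdentityOrder(size_t N) {
  std::vector<unsigned> UseOrder(N);               // sized up front
  std::iota(UseOrder.begin(), UseOrder.end(), 0u); // UseOrder[i] = i
  return UseOrder;
}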

1097–1108

Ok. I think we could make do with one, but I'm still not sure it's an improvement, so we can keep this as is.

1115

I think I would group everything that has to do with Mask together - something like:

if (Mask) {
  Mask->reserve(VL.size());
  for (unsigned i = 0; i < VL.size(); i++)
    Mask->emplace_back(i);
  std::sort(Mask->begin(), Mask->end(),
            [&UseOrder](unsigned Left, unsigned Right) {
              return UseOrder[Left] < UseOrder[Right];
            });
}

Instead of having three "if(Mask)"-s.
Sure, you get another for loop, but I think it's cleaner this way.

lib/Transforms/Vectorize/SLPVectorizer.cpp
2626

Right, that's more or less what I thought.

This revision is now accepted and ready to land. Feb 27 2017, 11:57 PM
ashahid added inline comments. Feb 28 2017, 2:33 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2626

Oh, I missed one thing: I think you added a test, jumbled-same.ll, for a recent fix. Doesn't that test now need to be updated to account for the use of the shuffled value of the loaded vector in the "extractelement"?

mkuper added inline comments. Feb 28 2017, 8:55 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2626

Argh, yes, I didn't notice the shuffle is used in the icmp, but not in the extract.
Go ahead and update it. Thanks!

This revision was automatically updated to reflect the committed changes.
ashahid updated this revision to Diff 90429. Mar 2 2017, 9:23 PM

Updated the patch to do a proper lane computation for external uses.

mkuper reopened this revision. Mar 2 2017, 11:13 PM
This revision is now accepted and ready to land. Mar 2 2017, 11:13 PM
mkuper added inline comments. Mar 2 2017, 11:24 PM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2860

This is the only change in the patch w.r.t the previous version, right?

2863

Isn't this condition equivalent to E->ShuffleMask[i] == ExternalUse.Lane? I think that's a bit clearer, maybe.
(I'm not really certain it's clearer, but I'd like to make sure I understand what's going on here)

Regardless, please add a comment describing what this whole thing does.

test/Transforms/SLPVectorizer/X86/jumbled-same.ll
16 (On Diff #90429)

Ok, now this makes sense. :-)

ashahid added inline comments. Mar 3 2017, 1:22 AM
lib/Transforms/Vectorize/SLPVectorizer.cpp
2860

Yes

2863

Yes, this condition is equivalent to E->ShuffleMask[i] == ExternalUse.Lane, and it does look clearer. Sure, I will add the comment.

This revision was automatically updated to reflect the committed changes.