This is an archive of the discontinued LLVM Phabricator instance.

[LV] Reallow interleaved load groups with gaps
ClosedPublic

Authored by mssimpso on Apr 25 2016, 10:27 AM.

Details

Summary

We previously disallowed interleaved load groups that may cause us to speculatively access memory out-of-bounds (D17332). We did this by ensuring each load group had an access corresponding to the first and last member. Instead of bailing out for these interleaved groups, this patch enables us to peel off the last vector iteration, ensuring that we execute at least one iteration of the scalar remainder loop. This solution was proposed in the review of the previous patch.

Diff Detail

Event Timeline

mssimpso updated this revision to Diff 54876.Apr 25 2016, 10:27 AM
mssimpso retitled this revision from to [LV] Reallow interleaved load groups with gaps.
mssimpso updated this object.
mssimpso added subscribers: llvm-commits, mcrosier.
sbaranga edited edge metadata.Apr 25 2016, 12:33 PM

Thanks for working on this! I have a single comment so far (see inline).

-Silviu

lib/Transforms/Vectorize/LoopVectorize.cpp
2897

This doesn't look right at first sight. Let's say we have 0 iterations in total (TC = 0), and Step is 4. If I'm not mistaken, this would underflow the VectorTripCount and give a result of MAX_UINT - 3?

Also, if the result of the URem is 0 you don't need to peel the loop (although this wouldn't be that important and perhaps not even profitable to account for).

mssimpso added inline comments.Apr 25 2016, 12:48 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
2897

Ah, I think you're right! Thanks for catching the underflow. For the other comment, I think you mean if the result of the URem is *not* 0, we don't need to peel. This is right, and it's something I thought about. I agree that it's probably not that important to account for, but it's definitely something we can consider. The trade-off would be something like 1 select instruction in the preheader vs. VF * UF scalar iterations.

Thanks very much for the quick feedback! I'll upload a new version soon.

mssimpso updated this revision to Diff 54909.Apr 25 2016, 2:33 PM
mssimpso edited edge metadata.

Addressed Silviu's comments.

Silviu, I think this revision addresses both of your comments about the peeling. I now only peel if TC % Step is zero. If the mod is zero, we know that TC >= Step, so the subtraction should not overflow. And we also know that there wouldn't already be scalar iterations due to Step not evenly dividing TC. Thanks for the feedback!

mssimpso marked an inline comment as done.Apr 25 2016, 2:33 PM

I might be missing something obvious, but why is TC >= Step if TC % Step is 0?

I think we also need to check that TC is not 0.

lib/Transforms/Vectorize/LoopVectorize.cpp
2897

You're right, that should have been *not* 0, thanks!

The comment above needs to be updated.

I might be missing something obvious, but why is TC >= Step if TC % Step is 0?

Sorry, I didn't mean for that to read as an implication. The minimum iterations check is supposed to ensure that TC >= Step. We can add the check for TC is not zero, but I think instcombine will probably remove it. What do you think?

I might be missing something obvious, but why is TC >= Step if TC % Step is 0?

Sorry, I didn't mean for that to read as an implication. The minimum iterations check is supposed to ensure that TC >= Step. We can add the check for TC is not zero, but I think instcombine will probably remove it. What do you think?

Do you mean the emitMinimumIterationCountCheck? Looks like the minimum check is using whatever getOrCreateVectorTripCount returns, so it probably won't work as expected. I think checking for the zero case here would be ok.

I might be missing something obvious, but why is TC >= Step if TC % Step is 0?

Sorry, I didn't mean for that to read as an implication. The minimum iterations check is supposed to ensure that TC >= Step. We can add the check for TC is not zero, but I think instcombine will probably remove it. What do you think?

Do you mean the emitMinimumIterationCountCheck? Looks like the minimum check is using whatever getOrCreateVectorTripCount returns, so it probably won't work as expected. I think checking for the zero case here would be ok.

Sorry, I misread that. It's using the actual trip count, so this should be ok.

mssimpso updated this revision to Diff 55004.Apr 26 2016, 8:03 AM

Updated comments about the vector trip count calculation.

mssimpso marked an inline comment as done.Apr 26 2016, 8:03 AM
sbaranga accepted this revision.Apr 26 2016, 8:09 AM
sbaranga edited edge metadata.

LGTM!

This revision is now accepted and ready to land.Apr 26 2016, 8:09 AM

Thanks very much for the review, Silviu!

sbaranga requested changes to this revision.Apr 26 2016, 10:04 AM
sbaranga edited edge metadata.
sbaranga added inline comments.
lib/Transforms/Vectorize/LoopVectorize.cpp
5142–5149

Sorry to backtrack on this, but it appears that you also require to have a positive stride for the group in order for this to work.
Otherwise, you might get an out of bounds access in the first iteration of the vector loop.

This revision now requires changes to proceed.Apr 26 2016, 10:04 AM
mssimpso updated this revision to Diff 55058.Apr 26 2016, 12:13 PM
mssimpso edited edge metadata.

Addressed Silviu's comments.

Thanks for catching the negative stride. I now check that a group is not reversed (stride is negative) before allowing it. I also added a reversed test case and check that we do not generate vector loads. For negative strides, we will need a scalar prologue iteration rather than an epilogue, but I'd rather tackle that in a follow-on patch. In the meantime, I've refactored the current patch to distinguish the epilogue and prologue cases. Thanks!

sbaranga accepted this revision.Apr 27 2016, 3:19 AM
sbaranga edited edge metadata.

Addressed Silviu's comments.

Thanks for catching the negative stride. I now check that a group is not reversed (stride is negative) before allowing it. I also added a reversed test case and check that we do not generate vector loads. For negative strides, we will need a scalar prologue iteration rather than an epilogue, but I'd rather tackle that in a follow-on patch. In the meantime, I've refactored the current patch to distinguish the epilogue and prologue cases. Thanks!

Thanks! LGTM now.

FWIW, for the reverse group we don't need a scalar prologue. It would be enough to "shift right" the interleaved access group such that we have a load at the last position in the group. I've added a comment in the test case.

test/Transforms/LoopVectorize/interleaved-accesses.ll
376

This would work if the wide load started at &A[i].x: it would solve the OOB in the initial iteration problem and the scalar iterations at the end would solve the OOB at the final iteration of the vector loop.

This revision is now accepted and ready to land.Apr 27 2016, 3:19 AM

FWIW, for the reverse group we don't need a scalar prologue. It would be enough to "shift right" the interleaved access group such that we have a load at the last position in the group.

Good point. Yes, that should work, and it would allow us to avoid inserting a new loop. Thanks!

This revision was automatically updated to reflect the committed changes.