We previously disallowed interleaved load groups that may cause us to speculatively access memory out-of-bounds (D17332). We did this by ensuring each load group had an access corresponding to the first and last member. Instead of bailing out for these interleaved groups, this patch enables us to peel off the last vector iteration, ensuring that we execute at least one iteration of the scalar remainder loop. This solution was proposed in the review of the previous patch.
Details
Diff Detail
Event Timeline
Thanks for working on this! I have a single comment so far (see inline).
-Silviu
| lib/Transforms/Vectorize/LoopVectorize.cpp | ||
|---|---|---|
| 2897 | This doesn't look right at first sight. Let's say we have 0 iterations in total (TC = 0), and Step is 4. If I'm not mistaken, this would underflow the VectorTripCount and give a result of MAX_UINT - 3? Also, if the result of the URem is 0 you don't need to peel the loop (although this wouldn't be that important and perhaps not even profitable to account for). | |
| lib/Transforms/Vectorize/LoopVectorize.cpp | ||
|---|---|---|
| 2897 | Ah, I think you're right! Thanks for catching the underflow. For the other comment, I think you mean if the result of the URem is *not* 0, we don't need to peel. This is right, and it's something I thought about. I agree that it's probably not that important to account for, but it's definitely something we can consider. The trade-off would be something like 1 select instruction in the preheader vs. VF * UF scalar iterations. Thanks very much for the quick feedback! I'll upload a new version soon. | |
Addressed Silviu's comments.
Silviu, I think this revision addresses both of your comments about the peeling. I now only peel if TC % Step is zero. If the mod is zero, we know that TC >= Step, so the subtraction should not overflow. And we also know that there wouldn't already be scalar iterations due to Step not evenly dividing TC. Thanks for the feedback!
I might be missing something obvious, but why is TC >= Step if TC % Step is 0?
I think we also need to check that TC is not 0.
| lib/Transforms/Vectorize/LoopVectorize.cpp | ||
|---|---|---|
| 2897 | You're right, that should have been *not* 0, thanks! The comment above needs to be updated. | |
I might be missing something obvious, but why is TC >= Step if TC % Step is 0?
Sorry, I didn't mean for that to read as an implication. The minimum iterations check is supposed to ensure that TC >= Step. We can add the check for TC is not zero, but I think instcombine will probably remove it. What do you think?
Do you mean the emitMinimumIterationCountCheck? Looks like the minimum check is using whatever getOrCreateVectorTripCount returns, so it probably won't work as expected. I think checking for the zero case here would be ok.
| lib/Transforms/Vectorize/LoopVectorize.cpp | ||
|---|---|---|
| 5142–5149 | Sorry to backtrack on this, but it appears that you also require to have a positive stride for the group in order for this to work. | |
Addressed Silviu's comments.
Thanks for catching the negative stride. I now check that a group is not reversed (stride is negative) before allowing it. I also added a reversed test case and check that we do not generate vector loads. For negative strides, we will need a scalar prologue iteration rather than an epilogue, but I'd rather tackle that in a follow-on patch. In the meantime, I've refactored the current patch to distinguish the epilogue and prologue cases. Thanks!
Thanks! LGTM now.
FWIW, for the reverse group we don't need a scalar prologue. It would be enough to "shift right" the interleaved access group such that we have a load at the last position in the group. I've added a comment in the test case.
| test/Transforms/LoopVectorize/interleaved-accesses.ll | ||
|---|---|---|
| 376 | This would work if the wide load started at &A[i].x: it would solve the OOB in the initial iteration problem and the scalar iterations at the end would solve the OOB at the final iteration of the vector loop. | |
FWIW, for the reverse group we don't need a scalar prologue. It would be enough to "shift right" the interleaved access group such that we have a load at the last position in the group.
Good point. Yes, that should work, and it would allow us to avoid inserting a new loop. Thanks!
This doesn't look right at first sight. Let's say we have 0 iterations in total (TC = 0), and Step is 4. If I'm not mistaken, this would underflow the VectorTripCount and give a result of MAX_UINT - 3?
Also, if the result of the URem is 0 you don't need to peel the loop (although this wouldn't be that important and perhaps not even profitable to account for).