This is an archive of the discontinued LLVM Phabricator instance.

Replace overflow check in loop vectorization with minimum loop iterations check
AcceptedPublic

Authored by wmi on Aug 18 2015, 10:37 AM.

Details

Summary

llvm vectorizer will vectorize loop when loop iteration numbers are unknown. When the iteration number of a loop is very small and the preparation and rounding-off cost for vectorization is high, such vectorization may degrade performance, sometimes significantly. We find gcc vectorization generates runtime check to ensure loop iteration number is big enough before every vectorized loop.

Now llvm loop vectorizer generates overflow check before every vectorized loop. The overflow check is to check whether BackedgeCount equals to -1, .i.e ExitCount equals to 0. If the return of the check is true, jump to scalar remainder loop. This check can be covered by the minimum loop iterations check, because if overflow happens, the ExitCount number will be 0 and it will be less than the minimum loop iterations threshold.

This patch turns the original overflow check to a minimum loop iterations check, which has no change on correctness and runtime cost, and can be beneficial for performance in some cases.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi updated this revision to Diff 32422.Aug 18 2015, 10:37 AM
wmi retitled this revision from to Replace overflow check in loop vectorization with minimum loop iterations check.
wmi updated this object.
wmi added reviewers: aschwaighofer, nadav.
wmi set the repository for this revision to rL LLVM.
wmi added a subscriber: davidxl.
aschwaighofer edited edge metadata.

This make sense to me.

Please add tests that this also works with interleave factor that is not 1.

Don't use MaxVectorSize but rather the selected unroll factor times the selected vectorization factor (InnerLoopVectorizer::UF * InnerLoopVectorizer::VF) this is the step the loop will iterate by.

wmi edited edge metadata.Aug 18 2015, 11:04 AM
wmi added a subscriber: llvm-commits.
anemet added inline comments.Aug 18 2015, 1:08 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
2538–2540

Please don't remove the explanation for the need of checking overflow.

I like your explanation in the review description. Please include an appropriate variant of that here as a comment.

wmi updated this revision to Diff 32445.Aug 18 2015, 1:46 PM

Thanks for the quick response.

Please add tests that this also works with interleave factor that is not 1.

Done

Don't use MaxVectorSize but rather the selected unroll factor times the selected vectorization factor (InnerLoopVectorizer::UF * InnerLoopVectorizer::VF) this is the step the loop will iterate by.

Ah, that is right. Fixed.

Please don't remove the explanation for the need of checking overflow.

Done.

anemet accepted this revision.Aug 24 2015, 2:48 PM
anemet edited edge metadata.

LGTM with some minor changes below.

Also did you see any change in performance?

lib/Transforms/Vectorize/LoopVectorize.cpp
2528–2536

The second paragraph is unnecessarily complicated. It should say something like this:

The minimum iteration check also covers case where the backedge-taken count is uint##_max. Adding one to it will cause overflow and an incorrect loop trip count being generated in the vector body. In this case we also want to directly jump to the scalar remainder loop.

test/Transforms/LoopVectorize/miniters.ll
12

You still have MaxVectorSize mentioned here.

This revision is now accepted and ready to land.Aug 24 2015, 2:48 PM
anemet added a subscriber: jmolloy.Aug 24 2015, 2:50 PM

FYI, James, I think you had a similar change in your series.

wmi updated this revision to Diff 33047.Aug 24 2015, 10:47 PM
wmi edited edge metadata.

The second paragraph is unnecessarily complicated. It should say something like this:

Fixed.

You still have MaxVectorSize mentioned here.

Fixed.

did you see any change in performance?

I saw two testcases with perf change steadily:
MultiSource/Benchmarks/FreeBench/pcompress2 +2%
MultiSource/Benchmarks/Ptrdist/ks -15%

The improvement for the first test is caused by the minimum loop iterations check and perf stat shows its dynamic instruction number decreases.
The big regression for the second test is caused by a minor alignment change. The alignment change prevents a cmp and jmp pair inside the kernel loop of the test from being macro-fused, which increases uops-retired events a lot. So the regression is caused by the side-effect of the patch and may not exist after some other changes.

Ok. Thanks for the detailed analysis.

Hi,

We do actually already emit such a check - we just emit it late. This new check now makes the later check redundant, so we should remove it.

Cheers,

James

wmi added a subscriber: wmi.Aug 31 2015, 10:55 AM

We do actually already emit such a check - we just emit it late. This new check now makes the later check redundant, so we should remove it.

Hi James,

Could you show me where is the check you mentioned? I didn't notice it before.

Thanks,
Wei.