This is an archive of the discontinued LLVM Phabricator instance.

[LV] Interleaving should not exceed estimated loop trip count.
ClosedPublic

Authored by ebrevnov on Sep 23 2019, 11:17 PM.

Details

Summary

Currently we may do iterleaving by more than estimated trip count
coming from the profile or computed maximum trip count. The solution is to
use "best known" trip count instead of exact one in interleaving analysis.

Diff Detail

Event Timeline

ebrevnov created this revision.Sep 23 2019, 11:17 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 23 2019, 11:17 PM
ebrevnov marked an inline comment as done.Sep 24 2019, 4:41 AM
ebrevnov added inline comments.
llvm/test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll
6

according to profile original loop has 99 iterations thus interleaving is disabled by short trip count heuristics controlled by tiny-trip-count-interleave-threshold

This looks reasonable to me, but I'm not a qualified reviewer in this area.

llvm/test/Transforms/LoopVectorize/interleave_short_tc.ll
2

I would suggest using an autogenerate checks for these. It'll be more verbose, but also more clear about the expected output.

See utils/update_test_checks.py and FileCheck's --check-prefix option for the three variations.

66

Please remove unnecessary aspects of test.

ebrevnov marked an inline comment as done.Sep 26 2019, 9:39 PM
ebrevnov added inline comments.
llvm/test/Transforms/LoopVectorize/interleave_short_tc.ll
2

See utils/update_test_checks.py and FileCheck's --check-prefix option for the three variations.

Not sure I understand what exactly you mean but this.

I know it is possible to have a dedicated prefix for common checks of two run lines and provide several prefixes to FileCheck. For example:
; RUN: opt <cmd1> | FileCheck --check-prefix=CHECK-COMMON,CHECK-SPECIFIC1
; RUN: opt <cmd2> | FileCheck --check-prefix=CHECK-COMMON,CHECK-SPECIFIC2

In my case I have three different runs with one common check. Is there a way to combine them?

ebrevnov updated this revision to Diff 222122.Sep 27 2019, 3:29 AM

Minor test update

ebrevnov marked 2 inline comments as done and an inline comment as not done.Sep 27 2019, 3:36 AM
ebrevnov added inline comments.
llvm/test/Transforms/LoopVectorize/interleave_short_tc.ll
2

I would suggest using an autogenerate checks for these. It'll be more verbose, but also more clear about the expected output.

In fact I did use update_test_checks.py to generate initial checks and then manually removed all unrelated ones. I think that makes test less sensitive to side changes and it's easy to see what we actually care about.

On the other hand I do see benefit of having wider context without the need to run 'opt' by hands.

It would be interesting to hear what others think in this regard.

66

Done

xbolva00 added inline comments.
llvm/test/Transforms/LoopVectorize/interleave_short_tc.ll
66

+1

Just leave metadata which you really need for this test. (Lines 56-70 can be removed)

hsaito added a comment.Oct 4 2019, 2:16 PM

Vectorizer code change looks fine with me. I'd like to see the comments updated, though. Any more changes needed for the LIT tests?

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5221–5222

small estimated or constant trip count

5287–5288

If trip count is expected to be small, limit the interleave count to be less than the trip count divided by VF

hsaito added inline comments.Oct 4 2019, 4:40 PM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5222–5225

This assumes constant trip count case is handled well by getSmallConstantMaxTripCount called from getSmallBestKnownTC ---- but if that is not the case, that would be a bug on the SCEV side. I traced a little bit but could not verify it myself as I'm not familiar with SCEV code. As such, I'm just pointing out a different SCEV function will be called as a result of this change.

hsaito accepted this revision.Oct 7 2019, 11:32 AM

Vectorizer code change looks fine with me. I'd like to see the comments updated, though. Any more changes needed for the LIT tests?

LGTM. Please wait a few more days to give others a chance for another look.

This revision is now accepted and ready to land.Oct 7 2019, 11:32 AM
ebrevnov updated this revision to Diff 225176.Oct 16 2019, 2:24 AM
ebrevnov marked 4 inline comments as done.

Minor changes requested by reviewers

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5221–5222

Done

5222–5225

Maybe I'm missing your concern but constant trip count case is processed first in getSmallBestKnownTC by a call to getSmallConstantTripCount. Thus I don't see any change for constant trip count case at all.

5287–5288

There is some ambiguity in using "small" through out the code. For getSmallBestKnownTC "small" is if it fits 32-bit. For "if (BestKnownTC && *BestKnownTC < TinyTripCountInterleaveThreshold)" check "small" is what less than TinyTripCountInterleaveThreshold. Here "small" should refer to the meaning defined by getSmallBestKnownTC .
I think we better avoid using "small" one more time here to minimize the confusion.

llvm/test/Transforms/LoopVectorize/interleave_short_tc.ll
66

I tried to remove as much as I can but not all of them can be actually removed.

@hsaito, Please commit if you find this version acceptable.

@hsaito, Please commit if you find this version acceptable.

@ebrevnov, I recommend you visit http://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access and obtain your own commit access. Your committed and under review patches deserve it.
Thank you very much for your contribution. Let me know if you still would like me to commit this one.

hsaito added inline comments.Oct 16 2019, 10:13 AM
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5287–5288

Sorry for being unclear. I was suggesting an update to the comment. With this patch, BestKnownTC is not constant, right?

ebrevnov marked an inline comment as done.Oct 16 2019, 10:29 PM

@hsaito, Please commit if you find this version acceptable.

@ebrevnov, I recommend you visit http://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access and obtain your own commit access. Your committed and under review patches deserve it.
Thank you very much for your contribution. Let me know if you still would like me to commit this one.

I was going to do that right after this patch lands. Please assist me (hopefully last time) in landing the patch.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5287–5288

I see what you are talking about. Another ambiguity here :-) BestKnownTC returns compile time constant value which may be exact runtime constant or estimated non-constant. In this case I believe "constant" means that we were able to get a compile time constant value for the trip count. How about the following wording?

// If trip count is known or estimated compile time constant, limit ....

@hsaito, Please commit if you find this version acceptable.

@ebrevnov, I recommend you visit http://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access and obtain your own commit access. Your committed and under review patches deserve it.
Thank you very much for your contribution. Let me know if you still would like me to commit this one.

I was going to do that right after this patch lands. Please assist me (hopefully last time) in landing the patch.

OK. Please update the comment and then I'll land the patch.

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
5287–5288

That's fine as well.

ebrevnov updated this revision to Diff 225999.Oct 21 2019, 11:07 PM

Updated comment as agreed.

Updated comment as agreed.

Got it. First time committing through git (SVN is read-only now). Expecting some learning curve. If you'd like this in sooner, you might want to ask someone familiar with the process. Will see if I hit any issues.

Updated comment as agreed.

Got it. First time committing through git (SVN is read-only now). Expecting some learning curve. If you'd like this in sooner, you might want to ask someone familiar with the process. Will see if I hit any issues.

Needed to have done this process before SVN went readonly. Please ask @reames to commit. He must be better prepared. Sorry about that.
https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access-to-the-github-repository

Updated comment as agreed.

Got it. First time committing through git (SVN is read-only now). Expecting some learning curve. If you'd like this in sooner, you might want to ask someone familiar with the process. Will see if I hit any issues.

Needed to have done this process before SVN went readonly. Please ask @reames to commit. He must be better prepared. Sorry about that.
https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access-to-the-github-repository

@craig.topper agreed to commit this. Should happen today. Sorry for the mess.

This revision was automatically updated to reflect the committed changes.

I failed to commit interleave_short_tc.ll initially. I fixed that yesterday, but then had to move it into the X86 test directory to appease the build bots.

hsaito added a comment.Nov 1 2019, 5:09 PM

We have observed some performance regressions, presumably because the vectorized code started to kick-in on short estimated trip count loops (as opposed to skipping vector code and execute scalar code). We'll try following up with cost model tuning. I'm not too surprised if others also hit a similar issue. Overall, though, that's the right direction to head to.

This comment was removed by AaronLiu.

Accidentally removed the message that I posted above. Re-post here: basically what I want say is to request reviewers for this patch to review another patch D81416 that touch the same file. Thanks!

We have observed some performance regressions, presumably because the vectorized code started to kick-in on short estimated trip count loops (as opposed to skipping vector code and execute scalar code). We'll try following up with cost model tuning. I'm not too surprised if others also hit a similar issue. Overall, though, that's the right direction to head to.

This was fixed?

We have observed some performance regressions, presumably because the vectorized code started to kick-in on short estimated trip count loops (as opposed to skipping vector code and execute scalar code). We'll try following up with cost model tuning. I'm not too surprised if others also hit a similar issue. Overall, though, that's the right direction to head to.

This was fixed?

From my testing, one of the bmk degraded 20+% at peak last year, but now with the patch in D81416 it is confirmed that it can get 50+% performance gain.

Wow, significant improvement!

And please post these benchmark results in your patch too.

I test it on PowerPC, I request it to be tested by the community on different platform. Currently it is disabled by default, and will enable it in the next step.