Currently we may do iterleaving by more than estimated trip count
coming from the profile or computed maximum trip count. The solution is to
use "best known" trip count instead of exact one in interleaving analysis.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/test/Transforms/LoopVectorize/X86/no_fpmath_with_hotness.ll | ||
---|---|---|
6 | according to profile original loop has 99 iterations thus interleaving is disabled by short trip count heuristics controlled by tiny-trip-count-interleave-threshold |
This looks reasonable to me, but I'm not a qualified reviewer in this area.
llvm/test/Transforms/LoopVectorize/interleave_short_tc.ll | ||
---|---|---|
1 ↗ | (On Diff #221470) | I would suggest using an autogenerate checks for these. It'll be more verbose, but also more clear about the expected output. See utils/update_test_checks.py and FileCheck's --check-prefix option for the three variations. |
65 ↗ | (On Diff #221470) | Please remove unnecessary aspects of test. |
llvm/test/Transforms/LoopVectorize/interleave_short_tc.ll | ||
---|---|---|
1 ↗ | (On Diff #221470) |
Not sure I understand what exactly you mean but this. I know it is possible to have a dedicated prefix for common checks of two run lines and provide several prefixes to FileCheck. For example: In my case I have three different runs with one common check. Is there a way to combine them? |
llvm/test/Transforms/LoopVectorize/interleave_short_tc.ll | ||
---|---|---|
1 ↗ | (On Diff #221470) |
In fact I did use update_test_checks.py to generate initial checks and then manually removed all unrelated ones. I think that makes test less sensitive to side changes and it's easy to see what we actually care about. On the other hand I do see benefit of having wider context without the need to run 'opt' by hands. It would be interesting to hear what others think in this regard. |
65 ↗ | (On Diff #221470) | Done |
llvm/test/Transforms/LoopVectorize/interleave_short_tc.ll | ||
---|---|---|
65 ↗ | (On Diff #221470) | +1 Just leave metadata which you really need for this test. (Lines 56-70 can be removed) |
Vectorizer code change looks fine with me. I'd like to see the comments updated, though. Any more changes needed for the LIT tests?
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5146–5147 | small estimated or constant trip count | |
5212–5213 | If trip count is expected to be small, limit the interleave count to be less than the trip count divided by VF |
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5147–5150 | This assumes constant trip count case is handled well by getSmallConstantMaxTripCount called from getSmallBestKnownTC ---- but if that is not the case, that would be a bug on the SCEV side. I traced a little bit but could not verify it myself as I'm not familiar with SCEV code. As such, I'm just pointing out a different SCEV function will be called as a result of this change. |
Minor changes requested by reviewers
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5146–5147 | Done | |
5147–5150 | Maybe I'm missing your concern but constant trip count case is processed first in getSmallBestKnownTC by a call to getSmallConstantTripCount. Thus I don't see any change for constant trip count case at all. | |
5212–5213 | There is some ambiguity in using "small" through out the code. For getSmallBestKnownTC "small" is if it fits 32-bit. For "if (BestKnownTC && *BestKnownTC < TinyTripCountInterleaveThreshold)" check "small" is what less than TinyTripCountInterleaveThreshold. Here "small" should refer to the meaning defined by getSmallBestKnownTC . | |
llvm/test/Transforms/LoopVectorize/interleave_short_tc.ll | ||
65 ↗ | (On Diff #221470) | I tried to remove as much as I can but not all of them can be actually removed. |
@ebrevnov, I recommend you visit http://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access and obtain your own commit access. Your committed and under review patches deserve it.
Thank you very much for your contribution. Let me know if you still would like me to commit this one.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5212–5213 | Sorry for being unclear. I was suggesting an update to the comment. With this patch, BestKnownTC is not constant, right? |
I was going to do that right after this patch lands. Please assist me (hopefully last time) in landing the patch.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5212–5213 | I see what you are talking about. Another ambiguity here :-) BestKnownTC returns compile time constant value which may be exact runtime constant or estimated non-constant. In this case I believe "constant" means that we were able to get a compile time constant value for the trip count. How about the following wording? // If trip count is known or estimated compile time constant, limit .... |
OK. Please update the comment and then I'll land the patch.
llvm/lib/Transforms/Vectorize/LoopVectorize.cpp | ||
---|---|---|
5212–5213 | That's fine as well. |
Got it. First time committing through git (SVN is read-only now). Expecting some learning curve. If you'd like this in sooner, you might want to ask someone familiar with the process. Will see if I hit any issues.
Needed to have done this process before SVN went readonly. Please ask @reames to commit. He must be better prepared. Sorry about that.
https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access-to-the-github-repository
I failed to commit interleave_short_tc.ll initially. I fixed that yesterday, but then had to move it into the X86 test directory to appease the build bots.
We have observed some performance regressions, presumably because the vectorized code started to kick-in on short estimated trip count loops (as opposed to skipping vector code and execute scalar code). We'll try following up with cost model tuning. I'm not too surprised if others also hit a similar issue. Overall, though, that's the right direction to head to.
Accidentally removed the message that I posted above. Re-post here: basically what I want say is to request reviewers for this patch to review another patch D81416 that touch the same file. Thanks!
From my testing, one of the bmk degraded 20+% at peak last year, but now with the patch in D81416 it is confirmed that it can get 50+% performance gain.
Wow, significant improvement!
And please post these benchmark results in your patch too.
I test it on PowerPC, I request it to be tested by the community on different platform. Currently it is disabled by default, and will enable it in the next step.
small estimated or constant trip count