To be able to maximize the bandwidth during vectorization, this update provides a new option vectorizer-maximize-bandwidth. When it is turned on, the vectorizer will determine the vectorization factor (VF) using the smallest instead of widest type in the loop. To avoid increasing register pressure too much, estimates of the register usage for different VFs are calculated so that we only choose a VF when its register usage doesn't exceed the number of available registers.
Diff Detail
- Repository: rL LLVM

Event Timeline
[+Arnold, Nadav, Chandler]
If I understand this correctly, this will cause us to potentially generate wider vectors than we have underlying vector registers, and I think that, generically, this makes sense. Now that our X86 shuffle handling is sane, the splitting of wide vectors and the shuffling you get from vector extends/truncates are hopefully not too bad. Other opinions?
Did you see any performance changes on the test suite?
We might need to update the register-pressure heuristic (LoopVectorizationCostModel::calculateRegisterUsage()) to understand that very-wide vectors use multiple vector registers.
I generally agree.
The key is that we should maximize the load/store bandwidth provided we have sufficient registers (see below).
> Did you see any performance changes on the test suite?
> We might need to update the register-pressure heuristic (LoopVectorizationCostModel::calculateRegisterUsage()) to understand that very-wide vectors use multiple vector registers.
Yes, I think it is going to be important to watch the register pressure heuristics.
Just updating this revision as it seems a bit stalled. I think there are a few things going on here...
- I think it would be good to first at least mostly address the problem of identifying places where we can hoist truncs to narrow the width at which we're doing operations within the vector. Without this, I think measuring the performance impact of this change will be hard -- we'll see wins that could be realized with a less register-pressure-intensive change.
- I think this needs some more high-level tests -- we should actually add a loop test case that should vectorize differently as a consequence.
- The fp64_to_uint32-cost-model.ll change seems odd -- either the update to the test or the comments in the test are wrong... Don't know which.
- I think we would need numbers on non-x86 architectures in order to be confident that the register pressure increase wasn't problematic. This might mean using a temporary debug flag to enable this until we can hear back from other backend maintainers. I don't imagine any of the backends outside of ARM, x86, and PPC have enough autovectorization users to really care, so it shouldn't be too bad.
Update the patch.
I am so sorry for being away from this patch for so long!
In the updated patch I estimated the register usage of larger VFs to ensure that it isn't too large. This should help to reduce register pressure.
> - I think this needs some more high-level tests -- we should actually add a loop test case that should vectorize differently as a consequence.
I added a test case to this patch for larger VFs. However, I found that the cost estimation of many operations is inaccurate when the VF is large, at least for X86, so the maximum VF could not be chosen due to the unreasonably large cost. I will fix those issues later and add more test cases.
> - The fp64_to_uint32-cost-model.ll change seems odd -- either the update to the test or the comments in the test are wrong... Don't know which.
As the change is guarded by an option, many updates to test cases are no longer necessary.
> - I think we would need numbers on non-x86 architectures in order to be confident that the register pressure increase wasn't problematic. This might mean using a temporary debug flag to enable this until we can hear back from other backend maintainers. I don't imagine any of the backends outside of ARM, x86, and PPC have enough autovectorization users to really care, so it shouldn't be too bad.
Following your suggestion, I have added an option in this patch. The register pressure problem is also alleviated as described above.
Have you run LLVM's test suite with this turned on? Are there any significant performance changes? [I'm happy for this to go in, given that it's disabled by default, even if there are regressions to fix, but I'd like to know where we stand].
lib/Transforms/Vectorize/LoopVectorize.cpp
- 4585 ↗ (On Diff #35321): I'd make this 8 instead of 4 (we might have 7 VF for 8-bit integers in AVX-512, for example).
Fix a test failure (a potential bug in LLVM) when the new flag is turned on by default.
I ran the regression tests and only three failed, one of which (assertion fail) is already fixed in the updated patch, and the other two are caused by larger VF and are understood.
The performance test is still running and I will give the result later.
For all 498 test cases in the test suite, the average speed-up is 6.66% with this patch. There are several significant performance changes (>100%) and most are positive. I will keep investigating those significant negative changes.
I applied this patch on top of r248957 and ran the benchmarking subset of test-suite on an AMD Jaguar 1.5 GHz + Ubuntu 14.04 test system. The baseline is -O3 -march=btver2 while the comparison run added -mllvm -vectorizer-maximize-bandwidth (data attached).
I see very little performance difference on any test: almost everything is +/- 2% which is within the noise for most tests.
Cong, I would be interested to know if you saw any large diffs on these tests on your test system or if the bigger wins/losses all occurred on the non-benchmarking tests in test-suite?
Thank you for the performance test! I think there may be two reasons why we could not observe a big performance difference in the LLVM test suite:
- There is no hotspot that includes a loop with types of different sizes (which is what this patch optimizes).
- There are some problems with the cost model in LLVM. Even when we can choose a larger VF, the cost model shows that the larger VF has a larger cost. I will deal with this issue later.
I don't have a test in my codebase that benefits from this patch, but it is quite easy to synthesize one:
const int N = 1024 * 32;
int a[N];
char b[N];
int main() {
  for (int j = 0; j < N; ++j) { // outer loop repeats the work so the run time is measurable
    for (int i = 0; i < N; ++i) {
      a[i]++;
      b[i]++;
    }
  }
}
For the code shown above, the original running time is ~0.35s and with this patch the running time is reduced to ~0.228s.
Thanks, Cong. I am confused by the above statement versus your earlier one:
"For all 498 test cases in the test suite, the average speed-up is 6.66% with this patch. There are several significant performance changes (>100%) and most are positive. I will keep investigating those significant negative changes."
Did something in the patch or external code change such that there used to be significant performance changes but now there are not?
Sorry for forgetting to clarify this. After investigating those test results, I found that many of the large numbers are due to flaky tests: many tests showing 5%+ speed-ups don't even contain any code that is vectorized differently. However, I think your test results make more sense. So do you want me to run those tests again on my machine?
Ah...that matches my experience with test-suite then. :)
I just wanted to confirm that I wasn't missing some important step with test-suite or with your patch.
I don't see any value in running it all over again. However, for future changes, it would be good to know if your patch is firing on any of those tests, and if so, is there any perf difference? If not, we should locate some new tests!
Since the change is hidden behind a flag, I think this patch is fine. But I will let someone who knows the vectorizers better provide the final approval.
Hi Cong,
Please find some comments inline.
Michael
lib/Target/X86/X86TargetTransformInfo.cpp
- 851–853 ↗ (On Diff #36275): I believe it's an independent fix from the rest of the patch. Please commit it separately.

lib/Transforms/Vectorize/LoopVectorize.cpp
- 1401 ↗ (On Diff #36275): Nitpick: redundant whitespace after \return
- 4605 ↗ (On Diff #36275): I would prefer not changing the signature of calculateRegisterUsage and instead building the array of RUs here. I think the original interface (takes a single vectorization factor, returns a register usage object for it) is more intuitive than the one operating on arrays.
- 4607 ↗ (On Diff #36275): Typo: doen't
- 4994–4995 ↗ (On Diff #36275): Please commit such formatting fixes separately if you feel that you need them.

test/Transforms/LoopVectorize/X86/vector_max_bandwidth.ll
- 1 ↗ (On Diff #36275): You'll need REQUIRES: asserts if you scan debug dumps.
Thank you for the review, Michael! Please see my inline reply.
lib/Target/X86/X86TargetTransformInfo.cpp
- 851–853 ↗ (On Diff #36275): This is a dependent fix; without it the test case will crash. But I could commit this fix ahead of this patch.

lib/Transforms/Vectorize/LoopVectorize.cpp
- 4605 ↗ (On Diff #36275): The change of the signature is in consideration of performance: the part that calculates the highest number of values that are alive at each location is shared by different VFs (actually most parts are shared across different VFs). If we always consider many VFs for a given loop, then this signature also makes sense, right?
LGTM.
After you commit, send a message to llvmdev asking people to test with the flag enabled.
lib/Transforms/Vectorize/LoopVectorize.cpp
- 5166 ↗ (On Diff #38592): Okay, that makes sense. The register will be accounted for at its source.