This is an archive of the discontinued LLVM Phabricator instance.

Calculate vectorization factor using the narrowest type instead of widest type
ClosedPublic

Authored by congh on Apr 9 2015, 5:22 PM.

Details

Summary

To maximize bandwidth during vectorization, this update provides a new option, vectorizer-maximize-bandwidth. When it is turned on, the vectorizer determines the vectorization factor (VF) using the smallest instead of the widest type in the loop. To avoid increasing register pressure too much, the register usage for each candidate VF is estimated, and a VF is chosen only if its register usage doesn't exceed the number of available registers.
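
A quick sketch of the difference (illustrative only, assuming 128-bit vector registers; not code from the patch):

unsigned vfFor(unsigned RegBits, unsigned EltBits) {
  return RegBits / EltBits; // elements of width EltBits that fit in one register
}
// vfFor(128, 32) == 4:  VF from the widest type (the old behavior)
// vfFor(128, 8)  == 16: VF from the smallest type (with the new option)

At VF = 16, each i32 value spans 16 * 32 / 128 = 4 registers, which is why the register-usage estimate matters.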

Diff Detail

Repository
rL LLVM

Event Timeline

congh updated this revision to Diff 23545.Apr 9 2015, 5:22 PM
congh retitled this revision from to Calculate vectorization factor using the narrowest type instead of widest type.
congh updated this object.
congh edited the test plan for this revision. (Show Details)
congh added a reviewer: hfinkel.
congh added a subscriber: Unknown Object (MLST).
hfinkel added a reviewer: chandlerc.
hfinkel added a subscriber: nadav.
hfinkel edited edge metadata.Apr 11 2015, 6:41 PM

[+Arnold, Nadav, Chandler]

If I understand this correctly, this will cause us to potentially generate wider vectors than we have underlying vector registers, and I think that, generically, this makes sense. Now that our X86 shuffle handling is sane, the splitting of wide vectors, and the shuffling you get from vector extends/truncates, are hopefully not too bad. Other opinions?

Did you see any performance changes on the test suite?

We might need to update the register-pressure heuristic (LoopVectorizationCostModel::calculateRegisterUsage()) to understand that very-wide vectors use multiple vector registers.
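
A sketch of what that accounting could look like (hypothetical helper, not the patch's code):

unsigned regsForValue(unsigned VF, unsigned EltBits, unsigned RegBits) {
  return (VF * EltBits + RegBits - 1) / RegBits; // round up to whole registers
}
// e.g. regsForValue(16, 32, 128) == 4: a <16 x i32> needs four 128-bit registers.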

spatel added a subscriber: spatel.Apr 13 2015, 9:30 PM
chandlerc edited edge metadata.Apr 14 2015, 8:44 AM

[+Arnold, Nadav, Chandler]

If I understand this correctly, this will cause us to potentially generate wider vectors than we have underlying vector registers, and I think that, generically, this makes sense. Now that our X86 shuffle handling is sane, the splitting of wide vectors, and the shuffling you get from vector extends/truncates, are hopefully not too bad. Other opinions?

I generally agree.

The key is that we should maximize the load/store bandwidth provided we have sufficient registers (see below).

Did you see any performance changes on the test suite?

We might need to update the register-pressure heuristic (LoopVectorizationCostModel::calculateRegisterUsage()) to understand that very-wide vectors use multiple vector registers.

Yes, I think it is going to be important to watch the register-pressure heuristics.

chandlerc requested changes to this revision.May 27 2015, 2:14 AM
chandlerc edited edge metadata.

Just updating this revision as it seems a bit stalled. I think there are a few things going on here...

  1. I think it would be good to first at least mostly address the problem of identifying places where we can hoist truncs to narrow the width at which we're doing operations within the vector. Without this, measuring the performance impact of this change will be hard -- we'll see wins that could be realized with a less register-pressure-intensive change.
  2. I think this needs some more high-level tests -- we should add a loop test case that vectorizes differently as a consequence.
  3. The fp64_to_uint32-cost-model.ll change seems odd -- either the update to the test or the comments in the test are wrong... I don't know which.
  4. I think we would need numbers on non-x86 architectures to be confident that the register-pressure increase isn't problematic. This might mean using a temporary debug flag to enable this until we hear back from other backend maintainers. I don't imagine any of the backends outside of ARM, x86, and PPC have enough autovectorization users to really care, so it shouldn't be too bad.
This revision now requires changes to proceed.May 27 2015, 2:14 AM
congh updated this revision to Diff 35321.Sep 21 2015, 4:54 PM
congh edited edge metadata.

Update the patch.

To maximize bandwidth during vectorization, this update provides a new option, vectorizer-maximize-bandwidth. When it is turned on, the vectorizer determines the vectorization factor (VF) using the smallest instead of the widest type in the loop. To avoid increasing register pressure too much, the register usage for each candidate VF is estimated, and a VF is chosen only if its register usage doesn't exceed the number of available registers.

congh updated this object.Sep 21 2015, 4:54 PM
congh added a reviewer: davidxl.
congh added a comment.Sep 21 2015, 5:01 PM

I am so sorry for being away from this patch for so long!

Just updating this revision as it seems a bit stalled. I think there are a few things going on here...

  1. I think it would be good to first at least mostly address the problem of identifying places where we can hoist truncs to narrow the width at which we're doing operations within the vector. Without this, measuring the performance impact of this change will be hard -- we'll see wins that could be realized with a less register-pressure-intensive change.

In the updated patch I estimate the register usage for larger VFs to ensure that it isn't too high. This should help reduce register pressure.
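
A minimal sketch of the selection this enables, with hypothetical names (the real estimation is in the patch's cost-model changes):

#include <cstddef>
#include <vector>

// Pick the largest candidate VF whose estimated peak register usage still
// fits in the target's register file; MaxUsages[i] is the estimate for VFs[i].
unsigned chooseVF(const std::vector<unsigned> &VFs,
                  const std::vector<unsigned> &MaxUsages,
                  unsigned NumAvailableRegs) {
  unsigned Best = 1;
  for (std::size_t i = 0; i < VFs.size(); ++i)
    if (MaxUsages[i] <= NumAvailableRegs && VFs[i] > Best)
      Best = VFs[i];
  return Best;
}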

  2. I think this needs some more high-level tests -- we should add a loop test case that vectorizes differently as a consequence.

I added a test case to this patch for a larger VF. However, I found that the cost estimation of many operations is inaccurate when the VF is large, at least for X86, so the maximum VF could not be chosen due to its unreasonably large cost. I will fix those issues later and add more test cases.

  3. The fp64_to_uint32-cost-model.ll change seems odd -- either the update to the test or the comments in the test are wrong... I don't know which.

As the change is guarded by an option, many updates to test cases are no longer necessary.

  4. I think we would need numbers on non-x86 architectures to be confident that the register-pressure increase isn't problematic. This might mean using a temporary debug flag to enable this until we hear back from other backend maintainers. I don't imagine any of the backends outside of ARM, x86, and PPC have enough autovectorization users to really care, so it shouldn't be too bad.

Following your suggestion, I have added an option in this patch. The register-pressure problem is also alleviated as described above.

Have you run LLVM's test suite with this turned on? Are there any significant performance changes? [I'm happy for this to go in, given that it's disabled by default, even if there are regressions to fix, but I'd like to know where we stand].


lib/Transforms/Vectorize/LoopVectorize.cpp
4585 ↗(On Diff #35321)

I'd make this 8 instead of 4 (we might have 7 VFs for 8-bit integers with AVX-512, for example).
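
For reference, the 7 comes from counting the power-of-two VF candidates; a quick check (assuming 512-bit registers and 8-bit elements):

unsigned numCandidateVFs(unsigned RegBits, unsigned EltBits) {
  unsigned Count = 0;
  for (unsigned VF = 1; VF * EltBits <= RegBits; VF *= 2)
    ++Count; // counts VF = 1, 2, 4, ..., MaxVF
  return Count; // numCandidateVFs(512, 8) == 7: {1, 2, 4, 8, 16, 32, 64}
}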

congh updated this revision to Diff 35570.Sep 23 2015, 4:20 PM

Fix a test failure (a potential bug in LLVM) when the new flag is turned on by default.

congh added a comment.Sep 23 2015, 4:27 PM

Have you run LLVM's test suite with this turned on? Are there any significant performance changes? [I'm happy for this to go in, given that it's disabled by default, even if there are regressions to fix, but I'd like to know where we stand].


I ran the regression tests and only three failed: one (an assertion failure) is already fixed in the updated patch, and the other two are caused by the larger VF and are understood.

The performance test is still running and I will give the result later.

congh added a comment.Sep 24 2015, 4:46 PM

Have you run LLVM's test suite with this turned on? Are there any significant performance changes? [I'm happy for this to go in, given that it's disabled by default, even if there are regressions to fix, but I'd like to know where we stand].


For all 498 test cases in the test suite, the average speed-up is 6.66% with this patch. There are several significant performance changes (>100%) and most are positive. I will keep investigating those significant negative changes.

congh updated this revision to Diff 36272.Oct 1 2015, 10:58 AM

Update the patch according to hfinkel's comment.

congh updated this revision to Diff 36275.Oct 1 2015, 11:00 AM

Fix a default flag.

congh added a comment.Oct 2 2015, 3:22 PM

hfinkel, is this patch ready to be checked in?

spatel added a comment.Oct 5 2015, 1:48 PM

I applied this patch on top of r248957 and ran the benchmarking subset of test-suite on an AMD Jaguar 1.5 GHz + Ubuntu 14.04 test system. The baseline is -O3 -march=btver2 while the comparison run added -mllvm -vectorizer-maximize-bandwidth (data attached).

I see very little performance difference on any test: almost everything is +/- 2% which is within the noise for most tests.

Cong, I would be interested to know if you saw any large diffs on these tests on your test system or if the bigger wins/losses all occurred on the non-benchmarking tests in test-suite?

congh added a comment.Oct 13 2015, 2:48 PM

I applied this patch on top of r248957 and ran the benchmarking subset of test-suite on an AMD Jaguar 1.5 GHz + Ubuntu 14.04 test system. The baseline is -O3 -march=btver2 while the comparison run added -mllvm -vectorizer-maximize-bandwidth (data attached).

I see very little performance difference on any test: almost everything is +/- 2% which is within the noise for most tests.

Cong, I would be interested to know if you saw any large diffs on these tests on your test system or if the bigger wins/losses all occurred on the non-benchmarking tests in test-suite?

Thank you for the performance test! I think there may be two reasons why we could not observe a big performance difference in the llvm test suite:

  1. There is no hotspot that includes a loop with types of different sizes (this is what this patch is optimizing).
  2. There are some problems with the cost model in llvm. Even if we can choose a larger VF, the cost model shows that the larger VF has a larger cost. I will deal with this issue later.

I don't have a test in my codebase that benefits from this patch, but it is quite easy to synthesize one:

const int N = 1024 * 32;
int a[N];
char b[N];

int main() {
  // The inner loop mixes 32-bit and 8-bit element types -- the pattern this
  // patch targets; the outer loop repeats the work to make timing measurable.
  for (int j = 0; j < N; ++j) {
    for (int i = 0; i < N; ++i) {
      a[i]++;
      b[i]++;
    }
  }
}

For the code shown above, the original running time is ~0.35s; with this patch it is reduced to ~0.228s (roughly a 1.5x speed-up).

In D8943#266404, @congh wrote:

I applied this patch on top of r248957 and ran the benchmarking subset of test-suite on an AMD Jaguar 1.5 GHz + Ubuntu 14.04 test system. The baseline is -O3 -march=btver2 while the comparison run added -mllvm -vectorizer-maximize-bandwidth (data attached).

I see very little performance difference on any test: almost everything is +/- 2% which is within the noise for most tests.

Cong, I would be interested to know if you saw any large diffs on these tests on your test system or if the bigger wins/losses all occurred on the non-benchmarking tests in test-suite?

Thank you for the performance test! I think there may be two reasons why we could not observe a big performance difference in the llvm test suite:

  1. There is no hotspot that includes a loop with types of different sizes (this is what this patch is optimizing).
  2. There are some problems with the cost model in llvm. Even if we can choose a larger VF, the cost model shows that the larger VF has a larger cost. I will deal with this issue later.

I don't have a test in my codebase that benefits from this patch, but it is quite easy to synthesize one:

Thanks, Cong. I am confused by the above statement versus your earlier one:
"For all 498 test cases in the test suite, the average speed-up is 6.66% with this patch. There are several significant performance changes (>100%) and most are positive. I will keep investigating those significant negative changes."

Did something in the patch or external code change such that there used to be significant performance changes but now there are not?

congh added a comment.Oct 14 2015, 3:27 PM
In D8943#266404, @congh wrote:

I applied this patch on top of r248957 and ran the benchmarking subset of test-suite on an AMD Jaguar 1.5 GHz + Ubuntu 14.04 test system. The baseline is -O3 -march=btver2 while the comparison run added -mllvm -vectorizer-maximize-bandwidth (data attached).

I see very little performance difference on any test: almost everything is +/- 2% which is within the noise for most tests.

Cong, I would be interested to know if you saw any large diffs on these tests on your test system or if the bigger wins/losses all occurred on the non-benchmarking tests in test-suite?

Thank you for the performance test! I think there may be two reasons why we could not observe a big performance difference in the llvm test suite:

  1. There is no hotspot that includes a loop with types of different sizes (this is what this patch is optimizing).
  2. There are some problems with the cost model in llvm. Even if we can choose a larger VF, the cost model shows that the larger VF has a larger cost. I will deal with this issue later.

I don't have a test in my codebase that benefits from this patch, but it is quite easy to synthesize one:

Thanks, Cong. I am confused by the above statement versus your earlier one:
"For all 498 test cases in the test suite, the average speed-up is 6.66% with this patch. There are several significant performance changes (>100%) and most are positive. I will keep investigating those significant negative changes."

Did something in the patch or external code change such that there used to be significant performance changes but now there are not?

Sorry for forgetting to clarify this. After investigating those test results, I found that many of the large numbers are due to flaky tests: many tests showing 5+% speed-ups don't have any change in what gets vectorized. However, I think your test results make more sense. So do you want me to run those tests again on my machine?

In D8943#267459, @congh wrote:

Sorry for forgetting to clarify this. After investigating those test results, I found that many of the large numbers are due to flaky tests: many tests showing 5+% speed-ups don't have any change in what gets vectorized. However, I think your test results make more sense. So do you want me to run those tests again on my machine?

Ah...that matches my experience with test-suite then. :)
I just wanted to confirm that I wasn't missing some important step with test-suite or with your patch.
I don't see any value in running it all over again. However, for future changes, it would be good to know if your patch is firing on any of those tests, and if so, is there any perf difference? If not, we should locate some new tests!

Since the change is hidden behind a flag, I think this patch is fine. But I will let someone who knows the vectorizers better provide the final approval.

congh added a comment.Oct 14 2015, 3:41 PM
In D8943#267459, @congh wrote:

Sorry for forgetting to clarify this. After investigating those test results, I found that many of the large numbers are due to flaky tests: many tests showing 5+% speed-ups don't have any change in what gets vectorized. However, I think your test results make more sense. So do you want me to run those tests again on my machine?

Ah...that matches my experience with test-suite then. :)
I just wanted to confirm that I wasn't missing some important step with test-suite or with your patch.
I don't see any value in running it all over again. However, for future changes, it would be good to know if your patch is firing on any of those tests, and if so, is there any perf difference? If not, we should locate some new tests!

Since the change is hidden behind a flag, I think this patch is fine. But I will let someone who knows the vectorizers better provide the final approval.

OK. Thanks a lot for the review!

Hi Cong,

Please find some comments inline.

Michael

lib/Target/X86/X86TargetTransformInfo.cpp
851–853 ↗(On Diff #36275)

I believe it's an independent fix from the rest of the patch. Please commit it separately.

lib/Transforms/Vectorize/LoopVectorize.cpp
1401 ↗(On Diff #36275)

Nitpick: redundant whitespace after \return

4605 ↗(On Diff #36275)

I would prefer not changing the signature of calculateRegisterUsage, and building the array of RUs here instead. I think the original interface (takes a single vectorization factor, returns a register usage object for it) is more intuitive than the one operating on arrays.

4607 ↗(On Diff #36275)

Typo: doen't

4994–4995 ↗(On Diff #36275)

Please commit such formatting fixes separately if you feel that you need them.

test/Transforms/LoopVectorize/X86/vector_max_bandwidth.ll
1 ↗(On Diff #36275)

You'll need REQUIRES: asserts if you scan debug dumps.

congh marked 4 inline comments as done.Oct 19 2015, 10:51 AM

Hi Cong,

Please find some comments inline.

Michael

Thank you for the review, Michael! Please see my inline reply.

lib/Target/X86/X86TargetTransformInfo.cpp
851–853 ↗(On Diff #36275)

This is a fix the patch depends on: without it, the test case will crash. But I could commit it ahead of this patch.

lib/Transforms/Vectorize/LoopVectorize.cpp
4605 ↗(On Diff #36275)

The signature change is in consideration of performance: the part that calculates the highest number of values that are alive at each location is shared by different VFs (actually, most of the work is shared across VFs). If we always consider many VFs for a given loop, then this signature also makes sense, right?
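
Roughly, the two interface shapes under discussion (hypothetical signatures using LLVM types; the real declarations are in the diff):

// Per-VF shape: one factor in, one usage estimate out.
RegisterUsage calculateRegisterUsage(unsigned VF);

// Batched shape in the patch: all candidate VFs at once, so the live-interval
// analysis that dominates the cost runs a single time and is reused per VF.
SmallVector<RegisterUsage, 8> calculateRegisterUsage(ArrayRef<unsigned> VFs);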

congh updated this revision to Diff 38592.Oct 27 2015, 2:23 PM

Update the patch according to Michael's comments.

hfinkel added inline comments.Oct 27 2015, 9:51 PM
lib/Target/X86/X86TargetTransformInfo.cpp
903–905 ↗(On Diff #38592)

In that case, please do commit it.

lib/Transforms/Vectorize/LoopVectorize.cpp
5166 ↗(On Diff #38592)

Does this do the right thing for stores? For stores, you want the type of the stored value, right?

congh added inline comments.Oct 28 2015, 4:14 PM
lib/Target/X86/X86TargetTransformInfo.cpp
903–905 ↗(On Diff #38592)

This is committed now.

lib/Transforms/Vectorize/LoopVectorize.cpp
5011 ↗(On Diff #35321)

I think stores won't appear here, as OpenIntervals only collects instructions that are used. See line 4992 above.
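
A toy model of that point (illustrative only; the real logic is in calculateRegisterUsage):

#include <set>
#include <string>
#include <vector>

struct Inst { std::string Name; bool ProducesValue; };

int main() {
  // A store produces no value, so it never opens an interval; the register
  // for %v is charged at the add that defines it, not at the store.
  std::vector<Inst> Block = {{"%v = add", true}, {"store %v", false}};
  std::set<std::string> OpenIntervals;
  for (const Inst &I : Block)
    if (I.ProducesValue)
      OpenIntervals.insert(I.Name);
  // OpenIntervals now holds only "%v = add".
}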

hfinkel accepted this revision.Oct 28 2015, 5:58 PM
hfinkel edited edge metadata.

LGTM.

After you commit, send a message to llvmdev asking people to test with the flag enabled.

lib/Transforms/Vectorize/LoopVectorize.cpp
5166 ↗(On Diff #38592)

Okay, that makes sense. The register will be accounted for at its source.

congh added a comment.Oct 28 2015, 6:06 PM

LGTM.

After you commit, send a message to llvmdev asking people to test with the flag enabled.

OK. Thanks for the review!

This revision was automatically updated to reflect the committed changes.