This is an archive of the discontinued LLVM Phabricator instance.

[SLP] Enable 64-bit wide vectorization for Cyclone
Closed, Public

Authored by anemet on Apr 11 2017, 5:28 PM.

Details

Summary

ARM NEON has native support for half-sized vector registers (64 bits). This
is beneficial, for example, for 2D and 3D graphics. This patch adds the option
to lower MinVecRegSize from 128 via a TTI hook in the SLP Vectorizer.
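
For illustration, the shape of such a hook can be modelled with a small standalone sketch: a base interface keeps the historical 128-bit minimum, and an AArch64-style override lets the subtarget lower it to 64 bits. This is only a model under assumed names (apart from MinVectorRegisterBitWidth, which is mentioned later in the thread); it is not the committed LLVM code.

```cpp
#include <iostream>

// Model of the default behaviour: keep the SLP vectorizer's historical
// 128-bit minimum so targets that do nothing see no change.
struct TTIModel {
  virtual ~TTIModel() = default;
  virtual unsigned getMinVectorRegisterBitWidth() const { return 128; }
};

// Model of an AArch64-style override: the subtarget decides whether
// 64-bit-wide (two-lane) SLP vectorization is allowed.
struct AArch64TTIModel : TTIModel {
  unsigned SubtargetMinWidth; // e.g. 64 for a Cyclone-like core
  explicit AArch64TTIModel(unsigned Width) : SubtargetMinWidth(Width) {}
  unsigned getMinVectorRegisterBitWidth() const override {
    return SubtargetMinWidth;
  }
};

int main() {
  TTIModel Generic;
  AArch64TTIModel Cyclone(64);
  // The SLP vectorizer would read its minimum register size from TTI
  // instead of a hard-coded 128, so 64-bit trees (e.g. <2 x float>)
  // become candidates on cores that lower the minimum.
  std::cout << "generic: " << Generic.getMinVectorRegisterBitWidth()
            << " bits, cyclone: " << Cyclone.getMinVectorRegisterBitWidth()
            << " bits\n";
  return 0;
}
```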

  • Performance Analysis

This change was motivated by some internal benchmarks but it is also
beneficial on SPEC and the LLVM testsuite.

The results are with -O3 and PGO. A negative percentage is an improvement.
The testsuite was run with a sample size of 4.

  • SPEC
  • CFP2006/482.sphinx3 -3.34%

A pretty hot loop is SLP-vectorized, resulting in a nice instruction reduction.
This used to be a +22% regression before rL299482.

  • CFP2000/177.mesa -3.34%
  • CINT2000/256.bzip2 +6.97%

My current plan is to extend the fix in rL299482 to i16, which brings the
regression down to +2.5%. There are also other problems with the codegen in
this loop, so there is further room for improvement.

  • LLVM testsuite
  • SingleSource/Benchmarks/Misc/ReedSolomon -10.75%

There are multiple small SLP vectorizations outside the hot code. It's a bit
surprising that it adds up to 10%. Some of this may be code-layout noise.

  • MultiSource/Benchmarks/VersaBench/beamformer/beamformer -8.40%

The opt-viewer screenshot can be seen at F3218284. We start at a colder store
but the tree leads us into the hottest loop.

  • MultiSource/Applications/lambda-0.1.3/lambda -2.68%
  • MultiSource/Benchmarks/Bullet/bullet -2.18%

This is using 3D vectors.

  • SingleSource/Benchmarks/Shootout-C++/Shootout-C++-lists +6.67%

Noise, binary is unchanged.

  • MultiSource/Benchmarks/Ptrdist/anagram/anagram +4.90%

There is an additional SLP vectorization in the cold code. The test runs for
~1 sec and prints out over 2000 lines. This is most likely noise.

  • MultiSource/Applications/aha/aha +1.63%
  • MultiSource/Applications/JM/lencod/lencod +1.41%
  • SingleSource/Benchmarks/Misc/richards_benchmark +1.15%

Event Timeline

anemet created this revision. Apr 11 2017, 5:28 PM
rengolin edited edge metadata.
rengolin added a subscriber: kristof.beyls.

Hi Adam,

Interesting results! But it doesn't sound like this is Cyclone specific.

@kristof.beyls Can you check on A57?

cheers,
--renato

include/llvm/Analysis/TargetTransformInfoImpl.h
308

Is this value really the best default for all targets?

lib/Target/AArch64/AArch64Subtarget.cpp
62

Is this really Cyclone-specific? Have you benchmarked on other cores?

sdardis added inline comments.
include/llvm/Analysis/TargetTransformInfoImpl.h
308

My quick survey of vector register widths suggests this is double the minimum.

SPARC's VIS extension uses the double-precision floating-point register set (64 bits wide), as do Intel's MMX and MIPS' MIPS-3D (though currently unimplemented in LLVM).

The S/390 vector registers appear to be 128 bits wide, like basic Intel SSE, MIPS MSA, ARM NEON, and PowerPC AltiVec.

Hi Renato,

Hi Adam,

Interesting results! But it doesn't sound like this is Cyclone specific.

Sure it's not, it is just a deployment strategy for this change. See the FIXME in the code.

Rolling it out for Cyclone-only is just a way to get this going in a controllable manner. Other subtargets can roll it out as people find the time to benchmark and tune this.

As the results section shows, I have done and am still doing some tuning on this. This mostly allows 2-lane vectorization for 32-bit types, so the benefit of vectorization is not that great; thus, enabling this really tests the accuracy of the cost model.

@kristof.beyls Can you check on A57?

That would be great. Thanks!

Adam


include/llvm/Analysis/TargetTransformInfoImpl.h
308

This does not change the default from SLP. It just brings it to TTI so that targets can change it as they see fit after careful benchmarking (it will need careful benchmarking!). It is *not* the goal of this patch to find a new proper default across targets.

lib/Target/AArch64/AArch64Subtarget.cpp
62

See above. It's not Cyclone-specific; it is just a deployment strategy to enable it only for Cyclone, since I have no access to other cores.
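
As a rough, hypothetical sketch of that deployment strategy (not the actual AArch64Subtarget.cpp code; the CPU-string check is an assumption), the minimum width would only be lowered for Cyclone and left at 128 everywhere else:

```cpp
#include <string>

// Standalone model of per-CPU gating: only Cyclone opts into the 64-bit
// minimum; every other core keeps the conservative 128-bit default.
// FIXME-style deployment strategy: widen this (or flip the default)
// once other cores have been benchmarked.
unsigned minVectorRegisterBitWidthFor(const std::string &CPU) {
  if (CPU == "cyclone")
    return 64;
  return 128;
}
```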

rengolin added a subscriber: evandro.

Rolling it out for Cyclone-only is just a way to get this going in a controllable manner. Other subtargets can roll it out as people find the time to benchmark and tune this.

Right. I'd like to at least try a more generic approach first, and only fall back to Cyclone-only if we get odd results on other cores.

ARM runs the test-suite in benchmarking mode on AArch64, so they can pick up spikes if, later on, we add less generic code in a generic section.

So, if the approach is generally good overall (I think it is), and we validate that on some cores, then we should let it in for all targets and monitor the performance as we go.

In that sense, @evandro, can you also have a look at the Exynos range? @mssimpso, maybe check the Kryo line?

My rationale is that too many improvements that would benefit the whole architecture are done by specific companies for specific cores, using the same reasoning that "others can tune later if they wish". This encourages a culture of "works on my hardware", which makes the back-ends stiff to change and proliferates decisions like relying on "isCyclone", the creation of target features that are only ever used for one target, and other things that we have been cleaning up since last year.

If we can get people from different sub-arches involved at an early stage, everybody wins.

cheers,
--renato

kristof.beyls edited edge metadata. Apr 12 2017, 8:08 AM

Hi Renato,

Hi Adam,

Interesting results! But it doesn't sound like this is Cyclone specific.

Sure it's not, it is just a deployment strategy for this change. See the FIXME in the code.

Rolling it out for Cyclone-only is just a way to get this going in a controllable manner. Other subtargets can roll it out as people find the time to benchmark and tune this.

As the results section shows, I have done and am still doing some tuning on this. This mostly allows 2-lane vectorization for 32-bit types, so the benefit of vectorization is not that great; thus, enabling this really tests the accuracy of the cost model.

I also got the impression that this is a change that is somewhat (but only somewhat) independent of micro-architecture, as I assume this is mostly about trading off the overhead that may be introduced to get data into the vector registers vs the gain by doing arithmetic in a SIMD fashion.
Of course, the cost of getting data in vector registers and the gain of doing arithmetic in a SIMD fashion is somewhat micro-architecture dependent.

I noticed that Adam points to a number of other patches improving things - I'm assuming these other patches lower the cost of getting data into the vector registers?

I've started to notice a trend where at least for AArch64, specific transformations are enabled/disabled for specific cores only, even when the transformation seems beneficial for most cores, so should probably also be enabled for "-mcpu=generic".
I don't think there is a straightforward answer on the best way to strike the right balance between enabling only for specific cores vs enabling for all cores.
I also talked about this with @evandro at EuroLLVM, who might also be interested in evaluating this patch on the cores he has access to?

@kristof.beyls Can you check on A57?

That would be great. Thanks!

So indeed I kicked off a run on Cortex-A57 to see what results I got (-O3, non-PGO), including the test-suite and SPEC2000 but not SPEC2006, running every program 3 times.
Apart from the mesa, bzip2 and bullet result Adam mentions, the results I see are on a few different programs:

Performance Regressions - Execution Time
MultiSource/Benchmarks/VersaBench/beamformer/beamformer 8.71%: In this case, the overhead of getting data into vector registers seems to outweigh the gain from simd processing in the hot loops in function "begin".
External/SPEC/CINT2000/256.bzip2/256.bzip2 2.51%: I see a codegen difference in the hot loop in "sendMTFValues" - probably the same loop Adam refers to earlier.
External/SPEC/CINT2000/255.vortex/255.vortex 2.35%: I only noticed a slight code layout change in the hot functions, not any different instructions, so this is very likely noise due to sensitivity of code layout.

Performance Improvements - Execution Time
MultiSource/Benchmarks/Bullet/bullet -3.95%: seems to be mainly due to SLP vectorization now kicking in on a big basic block in function btSequentialImpulseConstraintSolver::resolveSingleConstraintRowLowerLimit(btSolverBody&, btSolverBody&, btSolverConstraint const&)
External/SPEC/CFP2000/177.mesa/177.mesa -1.69%: vectorization now happens in some of the hottest basic blocks.
External/SPEC/CINT2000/176.gcc/176.gcc -1.42%: I didn't have time to analyze this one further.

In summary, with these results and with more patches in progress to lower the overhead of 2-lane vectorization, I think it's fine to enable this on Cortex-A57 too. I hope we'll be able to decide to just enable this generically for AArch64.


Rolling it out for Cyclone-only is just a way to get this going in a controllable manner. Other subtargets can roll it out as people find the time to benchmark and tune this.

Right. I'd like to at least try a more generic approach first, and only fall back to Cyclone-only if we get odd results on other cores.

Sure, this is a good compromise.

ARM runs the test-suite in benchmarking mode on AArch64, so they can pick up spikes if, later on, we add less generic code in a generic section.

So, if the approach is generally good overall (I think it is), and we validate that on some cores, then we should let it in for all targets and monitor the performance as we go.

In that sense, @evandro, can you also have a look at the Exynos range? @mssimpso maybe check the Kyro line?

My rationale is that too many improvements that would benefit the whole architecture are done by specific companies for specific cores, using the same reasoning that "others can tune later if they wish". This encourages a culture of "works on my hardware", which makes the back-ends stiff to change and proliferates decisions like relying on "isCyclone", the creation of target features that are only ever used for one target, and other things that we have been cleaning up since last year.

Just as a positive counter example, when I added SW prefetch support last year, half of the ARM subtargets followed suit and added the knobs to enable it on their target.

Of course for that patch, we had to enable per subtarget since each needed to supply micro-architectural details.

If we can get people from different sub-arches involved at an early stage, everybody wins.

Agreed!

Adam


Hi Kristof,

Hi Renato,

Hi Adam,

Interesting results! But it doesn't sound like this is Cyclone specific.

Sure it's not, it is just a deployment strategy for this change. See the FIXME in the code.

Rolling it out for Cyclone-only is just a way to get this going in a controllable manner. Other subtargets can roll it out as people find the time to benchmark and tune this.

As the results section shows, I have done and am still doing some tuning on this. This mostly allows 2-lane vectorization for 32-bit types, so the benefit of vectorization is not that great; thus, enabling this really tests the accuracy of the cost model.

I also got the impression that this is a change that is somewhat (but only somewhat) independent of micro-architecture, as I assume this is mostly about trading off the overhead that may be introduced to get data into the vector registers vs the gain by doing arithmetic in a SIMD fashion.
Of course, the cost of getting data in vector registers and the gain of doing arithmetic in a SIMD fashion is somewhat micro-architecture dependent.

I noticed that Adam points to a number of other patches improving things - I'm assuming these other patches lower the cost of getting data into the vector registers?

Yes, do you have rL299482?

I've started to notice a trend where at least for AArch64, specific transformations are enabled/disabled for specific cores only, even when the transformation seems beneficial for most cores, so should probably also be enabled for "-mcpu=generic".
I don't think there is a straightforward answer on the best way to strike the right balance between enabling only for specific cores vs enabling for all cores.
I also talked about this with @evandro at EuroLLVM, who might also be interested in evaluating this patch on the cores he has access to?

I am wondering if we should have subtarget "owners" and then we could just file bugs (tasks) to enable such features on the "other" subtargets. As I said, I had good results with SW data prefetching, but the ARM micro-architectures, for example, didn't add support for it.

@kristof.beyls Can you check on A57?

That would be great. Thanks!

So indeed I kicked off a run on Cortex-A57 to see what results I got (-O3, non-PGO), including the test-suite and SPEC2000 but not SPEC2006, running every program 3 times.
Apart from the mesa, bzip2 and bullet result Adam mentions, the results I see are on a few different programs:

Thanks!

Performance Regressions - Execution Time
MultiSource/Benchmarks/VersaBench/beamformer/beamformer 8.71%: In this case, the overhead of getting data into vector registers seems to outweigh the gain from simd processing in the hot loops in function "begin".

This is very interesting if you have rL299482. This is a reversal from Cyclone where this results in a nice gain. The cost model is at the threshold and the perceived benefit is minimal (-1, 0 being the threshold). FTR, beyond rL299482, I have no further plans to work on this.

External/SPEC/CINT2000/256.bzip2/256.bzip2 2.51%: I see a codegen difference in the hot loop in "sendMTFValues" - probably the same loop Adam refers to earlier.
External/SPEC/CINT2000/255.vortex/255.vortex 2.35%: I only noticed a slight code layout change in the hot functions, not any different instructions, so this is very likely noise due to sensitivity of code layout.

Performance Improvements - Execution Time
MultiSource/Benchmarks/Bullet/bullet -3.95%: seems to be mainly due to SLP vectorization now kicking in on a big basic block in function btSequentialImpulseConstraintSolver::resolveSingleConstraintRowLowerLimit(btSolverBody&, btSolverBody&, btSolverConstraint const&)
External/SPEC/CFP2000/177.mesa/177.mesa -1.69%: vectorization now happens in some of the hottest basic blocks.
External/SPEC/CINT2000/176.gcc/176.gcc -1.42%: I didn't have time to analyze this one further.

In summary, with these results and with more patches in progress to lower the overhead of 2-lane vectorization, I think it's fine to enable this on Cortex-A57 too. I hope we'll be able to decide to just enable this generically for AArch64.

From the above results, I expect bzip2 to improve, but unfortunately nothing else, before I'd like to enable this. Are you still OK with this?

Thanks again for running this and the analysis!!

Adam


Hi Kristof,

Hi Renato,

Hi Adam,

Interesting results! But it doesn't sound like this is Cyclone specific.

Sure it's not, it is just a deployment strategy for this change. See the FIXME in the code.

Rolling it out for Cyclone-only is just a way to get this going in a controllable manner. Other subtargets can roll it out as people find the time to benchmark and tune this.

As the results section shows, I have done and am still doing some tuning on this. This mostly allows 2-lane vectorization for 32-bit types, so the benefit of vectorization is not that great; thus, enabling this really tests the accuracy of the cost model.

I also got the impression that this is a change that is somewhat (but only somewhat) independent of micro-architecture, as I assume this is mostly about trading off the overhead that may be introduced to get data into the vector registers vs the gain by doing arithmetic in a SIMD fashion.
Of course, the cost of getting data in vector registers and the gain of doing arithmetic in a SIMD fashion is somewhat micro-architecture dependent.

I noticed that Adam points to a number of other patches improving things - I'm assuming these other patches lower the cost of getting data into the vector registers?

Yes, do you have rL299482?

Yes, I ran this on top of r299981.

I've started to notice a trend where at least for AArch64, specific transformations are enabled/disabled for specific cores only, even when the transformation seems beneficial for most cores, so should probably also be enabled for "-mcpu=generic".
I don't think there is a straightforward answer on the best way to strike the right balance between enabling only for specific cores vs enabling for all cores.
I also talked about this with @evandro at EuroLLVM, who might also be interested in evaluating this patch on the cores he has access to?

I am wondering if we should have subtarget "owners" and then we could just file bugs (tasks) to enable such features on the "other" subtargets. As I said, I had good results with SW data prefetching, but the ARM micro-architectures, for example, didn't add support for it.

I am open to trying out any idea that improves over the current situation, and your idea seems to do so! If other subtarget "owners" also like this idea, let's try it out!

@kristof.beyls Can you check on A57?

That would be great. Thanks!

So indeed I kicked off a run on Cortex-A57 to see what results I got (-O3, non-PGO), including the test-suite and SPEC2000 but not SPEC2006, running every program 3 times.
Apart from the mesa, bzip2 and bullet result Adam mentions, the results I see are on a few different programs:

Thanks!

Performance Regressions - Execution Time
MultiSource/Benchmarks/VersaBench/beamformer/beamformer 8.71%: In this case, the overhead of getting data into vector registers seems to outweigh the gain from simd processing in the hot loops in function "begin".

This is very interesting if you have rL299482. This is a reversal from Cyclone where this results in a nice gain. The cost model is at the threshold and the perceived benefit is minimal (-1, 0 being the threshold). FTR, beyond rL299482, I have no further plans to work on this.

That's a useful insight, thanks for sharing!

External/SPEC/CINT2000/256.bzip2/256.bzip2 2.51%: I see a codegen difference in the hot loop in "sendMTFValues" - probably the same loop Adam refers to earlier.
External/SPEC/CINT2000/255.vortex/255.vortex 2.35%: I only noticed a slight code layout change in the hot functions, not any different instructions, so this is very likely noise due to sensitivity of code layout.

Performance Improvements - Execution Time
MultiSource/Benchmarks/Bullet/bullet -3.95%: seems to be mainly due to SLP vectorization now kicking in on a big basic block in function btSequentialImpulseConstraintSolver::resolveSingleConstraintRowLowerLimit(btSolverBody&, btSolverBody&, btSolverConstraint const&)
External/SPEC/CFP2000/177.mesa/177.mesa -1.69%: vectorization now happens in some of the hottest basic blocks.
External/SPEC/CINT2000/176.gcc/176.gcc -1.42%: I didn't have time to analyze this one further.

In summary, with these results and with more patches in progress to lower the overhead of 2-lane vectorization, I think it's fine to enable this on Cortex-A57 too. I hope we'll be able to decide to just enable this generically for AArch64.

From the above results, I expect bzip2 to improve, but unfortunately nothing else, before I'd like to enable this. Are you still OK with this?

Yes, I'd still be fine with enabling this.

Thanks for all your efforts on this!

Kristof

I am wondering if we should have subtarget "owners" and then we could just file bugs (tasks) to enable such features on the "other" subtargets. As I said, I had good results with SW data prefetching, but the ARM micro-architectures, for example, didn't add support for it.

Hi Adam,

We have something similar for the release process (RELEASE_TESTERS.TXT), so it should be fine to have a list of people that volunteered to be bugged when sub-architectural decisions are needed.

If we go down that route, I suggest not to re-use CODE_OWNERS.TXT, because that's already a big mess. Maybe we could move all target owners from CODE_OWNERS to lib/Target/OWNERS.TXT and keep it hierarchical, so we can have multiple owners per target (as we already have in some).

Feel free to start this RFC on the mailing list, though, as this is not a discussion for this patch. :)

cheers,
--renato

evandro edited edge metadata. Apr 12 2017, 8:58 AM

I'll kick off some benchmarks and get back to y'all.

mssimpso edited edge metadata. May 1 2017, 9:55 AM

Hi Adam,

I'm not sure where the conversation about this patch landed, but I'm fine with it being a Cyclone-only change for now if that's what you prefer. I haven't had a chance to evaluate it on our cores yet. But when I do, I can easily set MinVectorRegisterBitWidth if there's any benefit. How does compile-time look?

anemet added a comment. May 2 2017, 3:44 PM

Hey Matt,

Hi Adam,

I'm not sure where the conversation about this patch landed, but I'm fine with it being a Cyclone-only change for now if that's what you prefer. I haven't had a chance to evaluate it on our cores yet. But when I do, I can easily set MinVectorRegisterBitWidth if there's any benefit. How does compile-time look?

There is no measurable compile-time change for AArch64 (testsuite, ctmark, spec).

I believe the idea was to try to enable this for all subtargets, assuming you and @evandro can test this. On the other hand, it's been almost a month and I'd like to wrap this up. So perhaps we should enable this for Cyclone and A57 for now and then file bugs for the remaining subtargets to evaluate it.

How does that sound, @rengolin and others?

Our automated test infrastructure croaked, so I'm still getting the SPEC results. In other tests, it's looking promising on Exynos M1 and M2.

anemet added a comment. May 2 2017, 4:00 PM

Thanks @evandro, let me know.

Hey Matt,

Hi Adam,

I'm not sure where the conversation about this patch landed, but I'm fine with it being a Cyclone-only change for now if that's what you prefer. I haven't had a chance to evaluate it on our cores yet. But when I do, I can easily set MinVectorRegisterBitWidth if there's any benefit. How does compile-time look?

There is no measurable compile-time change for AArch64 (testsuite, ctmark, spec).

I believe the idea was to try to enable this for all subtargets, assuming you and @evandro can test this. On the other hand, it's been almost a month and I'd like to wrap this up. So perhaps we should enable this for Cyclone and A57 for now and then file bugs for the remaining subtargets to evaluate it.

How does that sound, @rengolin and others?

That sounds reasonable to me, but I would do it the other way around: enable it by default and explicitly disable it for the cores that we know have a chance of being evaluated and decided on later.
Otherwise, I'm afraid that we'll forever have an ever-growing whitelist of cores to enable this on, while it looks like the right thing to do in the end is to just enable it by default.

There is no measurable compile-time change for AArch64 (testsuite, ctmark, spec).

Awesome!

That sounds reasonable to me, but I would do it the other way around: enable it by default and explicitly disable it for the cores that we know have a chance of being evaluated and decided on later. Otherwise, I'm afraid that we'll forever have an ever-growing whitelist of cores to enable this on, while it looks like the right thing to do in the end is to just enable it by default.

I'm fine with that plan as well.

That sounds reasonable to me, but I would do it the other way around: enable it by default and explicitly disable it for the cores that we know have a chance of being evaluated and decided on later.
Otherwise, I'm afraid that we'll forever have an ever-growing whitelist of cores to enable this on, while it looks like the right thing to do in the end is to just enable it by default.

Couldn't agree more with you, @kristof.beyls .

anemet added a comment. May 3 2017, 8:44 AM

That sounds reasonable to me, but I would do it the other way around: enable it by default and explicitly disable it for the cores that we know have a chance of being evaluated and decided on later.
Otherwise, I'm afraid that we'll forever have an ever-growing whitelist of cores to enable this on, while it looks like the right thing to do in the end is to just enable it by default.

Sounds great, let me update the patch.

evandro added a comment. Edited May 3 2017, 12:28 PM

The results for Exynos M1 and M2 are in and, except for a couple of workloads which improved by between 2% and 5%, any difference in the other workloads was at the noise level, with no significant regression.

IOW, it's OK for the Exynos subtargets.

arsenm added a subscriber: arsenm. May 3 2017, 12:30 PM
arsenm added inline comments.
include/llvm/Analysis/TargetTransformInfoImpl.h
308

I had D32714 because AMDGPU wants 32, but this is probably better.

anemet added a comment. May 3 2017, 1:46 PM

The results for Exynos M1 and M2 are in and, except for a couple of workloads which improved by between 2% and 5%, any difference in the other workloads was at the noise level, with no significant regression.

IOW, it's OK for the Exynos subtargets.

Thanks!

anemet updated this revision to Diff 98170. May 8 2017, 8:19 AM

Updated according to Kristof's idea: rather than whitelist, blacklist the
subtargets (Qualcomm, Cavium) that haven't had a chance to benchmark this yet.
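
For illustration, the inverted scheme could look roughly like the sketch below: 64 bits becomes the default on AArch64, and only the not-yet-benchmarked subtargets stay at 128. The CPU strings are hypothetical stand-ins for the Qualcomm and Cavium cores, not the exact list in the patch.

```cpp
#include <string>

// Standalone model of the inverted (blacklist) approach: 64-bit-wide SLP
// vectorization becomes the AArch64 default, and only the subtargets that
// have not been benchmarked yet keep the conservative 128-bit minimum.
unsigned minVectorRegisterBitWidthFor(const std::string &CPU) {
  // Hypothetical stand-ins for the Qualcomm and Cavium cores.
  if (CPU == "kryo" || CPU == "thunderx")
    return 128;
  return 64;
}
```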

kristof.beyls added inline comments. May 9 2017, 5:24 AM
lib/Target/AArch64/AArch64TargetTransformInfo.h
88–89

This comment can be removed now?

test/Transforms/SLPVectorizer/AArch64/64-bit-vector.ll
1–3

Given that we think this change is good for generic AArch64 code generation, wouldn't it be good to also have a test for -mcpu=generic, or without an -mcpu specified at all?

anemet marked 2 inline comments as done. May 9 2017, 9:07 AM
anemet updated this revision to Diff 98300. May 9 2017, 9:12 AM

Address Kristof's comments. Thanks, Kristof!

This revision was automatically updated to reflect the committed changes.