This is an archive of the discontinued LLVM Phabricator instance.

Differential D31965

[SLP] Enable 64-bit wide vectorization for Cyclone
ClosedPublic

Authored by anemet on Apr 11 2017, 5:28 PM.

Details

Reviewers

kristof.beyls
rengolin
mkuper
mssimpso
evandro
hfinkel

Commits

rGe29686e5c17e: [SLP] Enable 64-bit wide vectorization on AArch64
rL303116: [SLP] Enable 64-bit wide vectorization on AArch64

Summary

ARM NEON has native support for half-sized vector registers (64 bits). This
is beneficial, for example, for 2D and 3D graphics. This patch adds the option
to lower MinVecRegSize from 128 via a TTI hook in the SLP Vectorizer.
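
As a rough sketch of the mechanism described in the summary (not LLVM's actual classes): a base implementation keeps 128 as the default minimum vector register width, a target with native 64-bit vectors overrides it, and the vectorizer reads the value through the hook. BaseTTI, NeonLikeTTI, and minVecRegSize are hypothetical names used only for illustration; the 128 default and the 64-bit width are the only values taken from this review.

```cpp
// Hypothetical stand-ins for the TTI interface; not LLVM's real classes.
struct BaseTTI {
  virtual ~BaseTTI() = default;
  // Default stays at 128, so targets that do not opt in see no change.
  virtual unsigned getMinVectorRegisterBitWidth() const { return 128; }
};

struct NeonLikeTTI : BaseTTI {
  // A target with native 64-bit vector registers lowers the minimum.
  unsigned getMinVectorRegisterBitWidth() const override { return 64; }
};

// The vectorizer asks the hook instead of hard-coding 128.
unsigned minVecRegSize(const BaseTTI &TTI) {
  return TTI.getMinVectorRegisterBitWidth();
}
```

Keeping the default at the old value means the hook by itself changes nothing; behaviour only differs for targets that override it.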

Performance Analysis

This change was motivated by some internal benchmarks but it is also
beneficial on SPEC and the LLVM testsuite.

The results are with -O3 and PGO. A negative percentage is an improvement.
The testsuite was run with a sample size of 4.
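
To make the sign convention concrete, here is a small sketch that assumes the figures below are relative changes in execution time between a baseline and a patched build (the review only states that negative is an improvement). The function names are hypothetical.

```cpp
// Relative change in percent; a negative value means the patched run
// took less time than the baseline.
double percentChange(double BaselineSeconds, double PatchedSeconds) {
  return (PatchedSeconds - BaselineSeconds) / BaselineSeconds * 100.0;
}

// Under the convention used in this review, negative is an improvement.
bool isImprovement(double Percent) { return Percent < 0.0; }
```

For example, a run that drops from 4.0 s to 3.0 s yields -25, which counts as an improvement.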

SPEC

CFP2006/482.sphinx3 -3.34%

A pretty hot loop is SLP vectorized resulting in nice instruction reduction.
This used to be a +22% regression before rL299482.

CFP2000/177.mesa -3.34%
CINT2000/256.bzip2 +6.97%

My current plan is to extend the fix in rL299482 to i16 which brings the
regression down to +2.5%. There are also other problems with the codegen in
this loop so there is further room for improvement.

LLVM testsuite

SingleSource/Benchmarks/Misc/ReedSolomon -10.75%

There are multiple small SLP vectorizations outside the hot code. It's a bit
surprising that it adds up to 10%. Some of this may be code-layout noise.

MultiSource/Benchmarks/VersaBench/beamformer/beamformer -8.40%

The opt-viewer screenshot can be seen at F3218284. We start at a colder store
but the tree leads us into the hottest loop.

MultiSource/Applications/lambda-0.1.3/lambda -2.68%
MultiSource/Benchmarks/Bullet/bullet -2.18%

This is using 3D vectors.

SingleSource/Benchmarks/Shootout-C++/Shootout-C++-lists +6.67%

Noise, binary is unchanged.

MultiSource/Benchmarks/Ptrdist/anagram/anagram +4.90%

There is an additional SLP in the cold code. The test runs for ~1sec and
prints out over 2000 lines. This is most likely noise.

MultiSource/Applications/aha/aha +1.63%
MultiSource/Applications/JM/lencod/lencod +1.41%
SingleSource/Benchmarks/Misc/richards_benchmark +1.15%

Diff Detail

Repository: rL LLVM

Event Timeline

anemet created this revision. Apr 11 2017, 5:28 PM

Herald added subscribers: mzolotukhin, aemerson. Apr 11 2017, 5:28 PM

Hi Adam,

Interesting results! But it doesn't sound like this is Cyclone specific.

@kristof.beyls Can you check on A57?

cheers,
--renato

include/llvm/Analysis/TargetTransformInfoImpl.h
306 ↗	(On Diff #94912)	Is this value really the best default to all targets?
lib/Target/AArch64/AArch64Subtarget.cpp
62 ↗	(On Diff #94912)	Is this really Cyclone specific? Have you benchmarked on other cores?

sdardis added a subscriber: sdardis. Apr 12 2017, 3:47 AM

sdardis added inline comments.

include/llvm/Analysis/TargetTransformInfoImpl.h
306 ↗	(On Diff #94912)	My quick survey of vector register widths suggests this is double the minimum. SPARC's VIS extension uses the double-precision floating-point register set (64 bits wide), as do Intel's MMX and MIPS' MIPS-3D (though the latter is currently unimplemented in LLVM). The S/390 vector registers appear to be 128 bits, like basic Intel SSE, MIPS MSA, ARM NEON, and PowerPC Altivec.

Hi Renato,

In D31965#724619, @rengolin wrote:

Hi Adam,

Interesting results! But it doesn't sound like this is Cyclone specific.

Sure it's not, it is just a deployment strategy for this change. See the FIXME in the code.

Rolling it out for Cyclone-only is just a way to get this going in a controllable manner. Other subtargets can roll this out as people find the time to benchmark and tune this.

As the results section shows, I have done and am still doing some tuning on this. This mostly allows 2-lane vectorization for 32-bit types, so the benefit of vectorization is not so great; thus the accuracy of the cost model is really tested by enabling this.
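
The "2-lane" figure follows from simple arithmetic: the lane count is the register width divided by the element width. A minimal sketch (lanes is a hypothetical helper, not an LLVM function):

```cpp
// Number of elements of ElementBits each that fit in one vector register.
constexpr unsigned lanes(unsigned RegisterBits, unsigned ElementBits) {
  return RegisterBits / ElementBits;
}

// 64-bit registers hold two 32-bit elements; 128-bit registers hold four.
static_assert(lanes(64, 32) == 2, "64-bit register, 32-bit elements");
static_assert(lanes(128, 32) == 4, "128-bit register, 32-bit elements");
```

With only two lanes, the arithmetic saved is small, so any overhead for moving data into vector registers weighs more heavily in the cost model.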

@kristof.beyls Can you check on A57?

That would be great. Thanks!

Adam

cheers,
--renato

include/llvm/Analysis/TargetTransformInfoImpl.h
306 ↗	(On Diff #94912)	This does not change the default from SLP. It just brings it to TTI so that targets can change it as they see fit after careful benchmarking (it will need careful benchmarking!). It is not the goal of this patch to find a new proper default across targets.
lib/Target/AArch64/AArch64Subtarget.cpp
62 ↗	(On Diff #94912)	See above. It's not Cyclone-specific, it is just a deployment strategy to only enable for Cyclone since I have no access to other cores.
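
To illustrate why the default is unchanged unless a target opts in, here is a sketch that assumes a vectorizer tries register widths from the maximum down to the minimum, halving each time. This is a simplified model, not the SLP vectorizer's actual code, and candidateWidths is a hypothetical helper.

```cpp
#include <vector>

// Candidate vector widths, from the largest register size down to the
// minimum, halving at each step. Assumes both values are powers of two.
std::vector<unsigned> candidateWidths(unsigned MaxBits, unsigned MinBits) {
  std::vector<unsigned> Widths;
  for (unsigned Size = MaxBits; Size >= MinBits && Size != 0; Size /= 2)
    Widths.push_back(Size);
  return Widths;
}
```

With a minimum of 128 only the 128-bit width is tried; lowering the minimum to 64 adds the 64-bit width as a further candidate.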

In D31965#724804, @anemet wrote:

Rolling it out for Cyclone-only is just a way to get this going in a controllable manner. Other subtargets can roll this out as people find the time to benchmark and tune this.

Right. I'd like to at least try a more generic approach first, and only fall back to Cyclone-only if we get odd results on other cores.

ARM run the test-suite in benchmarking mode on AArch64, so they can pick up spikes if later on we add less generic code in generic section.

So, if the approach is generally good overall (I think it is), and we validate that on some cores, then we should let it in for all targets and monitor the performance as we go.

In that sense, @evandro, can you also have a look at the Exynos range? @mssimpso maybe check the Kryo line?

My rationale is that too many improvements are done by specific companies that are beneficial to the whole architecture, using the same reason "others can tune later if they wish". This encourages a culture of "works on my hardware", which makes the back-ends stiff to changes and proliferate decisions like relying on "isCyclone", or creation of target features that are only ever used for one target and other things that we have been cleaning up since last year.

If we can get people from different sub-arches involved at an early stage, everybody wins.

cheers,
--renato

In D31965#724804, @anemet wrote:

Hi Renato,

In D31965#724619, @rengolin wrote:

Hi Adam,

Interesting results! But it doesn't sound like this is Cyclone specific.

Sure it's not, it is just a deployment strategy for this change. See the FIXME in the code.

Rolling it out for Cyclone-only is just a way to get this going in a controllable manner. Other subtargets can roll this out as people find the time to benchmark and tune this.

As the results section shows, I have done and am still doing some tuning on this. This mostly allows 2-lane vectorization for 32-bit types, so the benefit of vectorization is not so great; thus the accuracy of the cost model is really tested by enabling this.

I also got the impression that this is a change that is somewhat (but only somewhat) independent of micro-architecture, as I assume this is mostly about trading off the overhead that may be introduced to get data into the vector registers vs the gain by doing arithmetic in a SIMD fashion.
Of course, the cost of getting data in vector registers and the gain of doing arithmetic in a SIMD fashion is somewhat micro-architecture dependent.

I noticed that Adam points to a number of other patches improving things - I'm assuming these other patches lower the cost of getting data into the vector registers?

I've started to notice a trend where at least for AArch64, specific transformations are enabled/disabled for specific cores only, even when the transformation seems beneficial for most cores, so should probably also be enabled for "-mcpu=generic".
I don't think there is a straightforward answer on what the best way is to achieve making the right balanced tradeoff between enabling only for specific cores vs enabling for all cores.
I also talked about this with @evandro at EuroLLVM, who might also be interested in evaluating this patch on the cores he has access to?

@kristof.beyls Can you check on A57?

That would be great. Thanks!

So indeed I kicked off a run on Cortex-A57 to see what results I got (-O3, non-PGO), including test-suite and SPEC2000, but not SPEC2006, with running every program 3 times.
Apart from the mesa, bzip2 and bullet result Adam mentions, the results I see are on a few different programs:

Performance Regressions - Execution Time
MultiSource/Benchmarks/VersaBench/beamformer/beamformer 8.71%: In this case, the overhead of getting data into vector registers seems to outweigh the gain from simd processing in the hot loops in function "begin".
External/SPEC/CINT2000/256.bzip2/256.bzip2 2.51%: I see a codegen difference in the hot loop in "sendMTFValues" - probably the same loop Adam refers to earlier.
External/SPEC/CINT2000/255.vortex/255.vortex 2.35%: I only noticed a slight code layout change in the hot functions, not any different instructions, so this is very likely noise due to sensitivity of code layout.

Performance Improvements - Execution
MultiSource/Benchmarks/Bullet/bullet -3.95%: seems to be mainly due to SLP vectorization now kicking in on a big basic block in function btSequentialImpulseConstraintSolver::resolveSingleConstraintRowLowerLimit(btSolverBody&, btSolverBody&, btSolverConstraint const&)
External/SPEC/CFP2000/177.mesa/177.mesa -1.69%: vectorization now happens in some of the hottest basic blocks.
External/SPEC/CINT2000/176.gcc/176.gcc -1.42%: I didn't have time to analyze this one further.

In summary, with these results and with more patches in progress to lower the overhead of 2-lane vectorization, I think it's fine to enable this on Cortex-A57 too. I hope we'll be able to decide to just enable this generically for AArch64.

Adam

In D31965#724841, @rengolin wrote:

In D31965#724804, @anemet wrote:

Rolling it out for Cyclone-only is just a way to get this going in a controllable manner. Other subtargets can roll this out as people find the time to benchmark and tune this.

Right. I'd like to at least try a more generic approach first, and only fall back to Cyclone-only if we get odd results on other cores.

Sure, this is a good compromise.

ARM run the test-suite in benchmarking mode on AArch64, so they can pick up spikes if later on we add less generic code in generic section.

So, if the approach is generally good overall (I think it is), and we validate that on some cores, then we should let it in for all targets and monitor the performance as we go.

In that sense, @evandro, can you also have a look at the Exynos range? @mssimpso maybe check the Kryo line?

My rationale is that too many improvements are done by specific companies that are beneficial to the whole architecture, using the same reason "others can tune later if they wish". This encourages a culture of "works on my hardware", which makes the back-ends stiff to changes and proliferate decisions like relying on "isCyclone", or creation of target features that are only ever used for one target and other things that we have been cleaning up since last year.

Just as a positive counter example, when I added SW prefetch support last year, half of the ARM subtargets followed suit and added the knobs to enable it on their target.

Of course for that patch, we had to enable per subtarget since each needed to supply micro-architectural details.

If we can get people from different sub-arches involved at an early stage, everybody wins.

Agreed!

Adam

cheers,
--renato

Hi Kristof,

In D31965#724860, @kristof.beyls wrote:

In D31965#724804, @anemet wrote:

Hi Renato,

In D31965#724619, @rengolin wrote:

Hi Adam,

Interesting results! But it doesn't sound like this is Cyclone specific.

Sure it's not, it is just a deployment strategy for this change. See the FIXME in the code.

Rolling it out for Cyclone-only is just a way to get this going in a controllable manner. Other subtargets can roll this out as people find the time to benchmark and tune this.

As the results section shows, I have done and am still doing some tuning on this. This mostly allows 2-lane vectorization for 32-bit types, so the benefit of vectorization is not so great; thus the accuracy of the cost model is really tested by enabling this.

I also got the impression that this is a change that is somewhat (but only somewhat) independent of micro-architecture, as I assume this is mostly about trading off the overhead that may be introduced to get data into the vector registers vs the gain by doing arithmetic in a SIMD fashion.
Of course, the cost of getting data in vector registers and the gain of doing arithmetic in a SIMD fashion is somewhat micro-architecture dependent.

I noticed that Adam points to a number of other patches improving things - I'm assuming these other patches lower the cost of getting data into the vector registers?

Yes, do you have rL299482?

I've started to notice a trend where at least for AArch64, specific transformations are enabled/disabled for specific cores only, even when the transformation seems beneficial for most cores, so should probably also be enabled for "-mcpu=generic".
I don't think there is a straightforward answer on what the best way is to achieve making the right balanced tradeoff between enabling only for specific cores vs enabling for all cores.
I also talked about this with @evandro at EuroLLVM, who might also be interested in evaluating this patch on the cores he has access to?

I am wondering if we should have subtarget "owners" and then we could just file bugs (tasks) to enable such features on the "other" subtargets. As I said, I had good results with SW data prefetching, but for example the ARM microarchs didn't add support for this.

@kristof.beyls Can you check on A57?

That would be great. Thanks!

So indeed I kicked off a run on Cortex-A57 to see what results I got (-O3, non-PGO), including test-suite and SPEC2000, but not SPEC2006, with running every program 3 times.
Apart from the mesa, bzip2 and bullet result Adam mentions, the results I see are on a few different programs:

Thanks!

Performance Regressions - Execution Time
MultiSource/Benchmarks/VersaBench/beamformer/beamformer 8.71%: In this case, the overhead of getting data into vector registers seems to outweigh the gain from simd processing in the hot loops in function "begin".

This is very interesting if you have rL299482. This is a reversal from Cyclone where this results in a nice gain. The cost model is at the threshold and the perceived benefit is minimal (-1, 0 being the threshold). FTR, beyond rL299482, I have no further plans to work on this.
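
The threshold remark can be pictured with a small sketch. It assumes the tree cost is the vector cost minus the scalar cost and that a tree is vectorized only when that delta is strictly below a threshold of 0, which is how the "-1, 0 being the threshold" comment reads. TreeCost, costDelta, and shouldVectorize are hypothetical names, not the vectorizer's real API.

```cpp
// Hypothetical cost summary for one SLP tree.
struct TreeCost {
  int ScalarCost; // estimated cost of the original scalar instructions
  int VectorCost; // estimated cost of the vector form, including the
                  // inserts/extracts needed to move data between forms
};

// Negative result: the vector form is estimated to be cheaper.
int costDelta(const TreeCost &C) { return C.VectorCost - C.ScalarCost; }

// Vectorize only when the delta is strictly below the threshold.
bool shouldVectorize(const TreeCost &C, int Threshold = 0) {
  return costDelta(C) < Threshold;
}
```

A delta of -1 passes by the smallest possible margin, so a one-unit error in either estimate flips the decision. That would be consistent with the same transformation helping on one core and hurting on another.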

External/SPEC/CINT2000/256.bzip2/256.bzip2 2.51%: I see a codegen difference in the hot loop in "sendMTFValues" - probably the same loop Adam refers to earlier.
External/SPEC/CINT2000/255.vortex/255.vortex 2.35%: I only noticed a slight code layout change in the hot functions, not any different instructions, so this is very likely noise due to sensitivity of code layout.

Performance Improvements - Execution
MultiSource/Benchmarks/Bullet/bullet -3.95%: seems to be mainly due to SLP vectorization now kicking in on a big basic block in function btSequentialImpulseConstraintSolver::resolveSingleConstraintRowLowerLimit(btSolverBody&, btSolverBody&, btSolverConstraint const&)
External/SPEC/CFP2000/177.mesa/177.mesa -1.69%: vectorization now happens in some of the hottest basic blocks.
External/SPEC/CINT2000/176.gcc/176.gcc -1.42%: I didn't have time to analyze this one further.

In summary, with these results and with more patches in progress to lower the overhead of 2-lane vectorization, I think it's fine to enable this on Cortex-A57 too. I hope we'll be able to decide to just enable this generically for AArch64.

From the above results, I expect bzip2 to improve but unfortunately nothing else before I'd like to enable this. Are you still OK with this?

Thanks again for running this and the analysis!!

Adam

Adam

In D31965#724869, @anemet wrote:

Hi Kristof,

In D31965#724860, @kristof.beyls wrote:

In D31965#724804, @anemet wrote:

Hi Renato,

In D31965#724619, @rengolin wrote:

Hi Adam,

Interesting results! But it doesn't sound like this is Cyclone specific.

Sure it's not, it is just a deployment strategy for this change. See the FIXME in the code.

Rolling it out for Cyclone-only is just a way to get this going in a controllable manner. Other subtargets can roll this out as people find the time to benchmark and tune this.

As the results section shows, I have done and am still doing some tuning on this. This mostly allows 2-lane vectorization for 32-bit types, so the benefit of vectorization is not so great; thus the accuracy of the cost model is really tested by enabling this.

I also got the impression that this is a change that is somewhat (but only somewhat) independent of micro-architecture, as I assume this is mostly about trading off the overhead that may be introduced to get data into the vector registers vs the gain by doing arithmetic in a SIMD fashion.
Of course, the cost of getting data in vector registers and the gain of doing arithmetic in a SIMD fashion is somewhat micro-architecture dependent.

I noticed that Adam points to a number of other patches improving things - I'm assuming these other patches lower the cost of getting data into the vector registers?

Yes, do you have rL299482?

Yes, I ran this on top of r299981.

I've started to notice a trend where at least for AArch64, specific transformations are enabled/disabled for specific cores only, even when the transformation seems beneficial for most cores, so should probably also be enabled for "-mcpu=generic".
I don't think there is a straightforward answer on what the best way is to achieve making the right balanced tradeoff between enabling only for specific cores vs enabling for all cores.
I also talked about this with @evandro at EuroLLVM, who might also be interested in evaluating this patch on the cores he has access to?

I am wondering if we should have subtarget "owners" and then we could just file bugs (tasks) to enable such features on the "other" subtargets. As I said, I had good results with SW data prefetching, but for example the ARM microarchs didn't add support for this.

I am open to trying out any idea that improves over the current situation, and your idea seems to do so! If other subtarget "owners" also like this idea, let's try it out!

@kristof.beyls Can you check on A57?

That would be great. Thanks!

So indeed I kicked off a run on Cortex-A57 to see what results I got (-O3, non-PGO), including test-suite and SPEC2000, but not SPEC2006, with running every program 3 times.
Apart from the mesa, bzip2 and bullet result Adam mentions, the results I see are on a few different programs:

Thanks!

Performance Regressions - Execution Time
MultiSource/Benchmarks/VersaBench/beamformer/beamformer 8.71%: In this case, the overhead of getting data into vector registers seems to outweigh the gain from simd processing in the hot loops in function "begin".

This is very interesting if you have rL299482. This is a reversal from Cyclone where this results in a nice gain. The cost model is at the threshold and the perceived benefit is minimal (-1, 0 being the threshold). FTR, beyond rL299482, I have no further plans to work on this.

That's a useful insight, thanks for sharing!

External/SPEC/CINT2000/256.bzip2/256.bzip2 2.51%: I see a codegen difference in the hot loop in "sendMTFValues" - probably the same loop Adam refers to earlier.
External/SPEC/CINT2000/255.vortex/255.vortex 2.35%: I only noticed a slight code layout change in the hot functions, not any different instructions, so this is very likely noise due to sensitivity of code layout.

Performance Improvements - Execution
MultiSource/Benchmarks/Bullet/bullet -3.95%: seems to be mainly due to SLP vectorization now kicking in on a big basic block in function btSequentialImpulseConstraintSolver::resolveSingleConstraintRowLowerLimit(btSolverBody&, btSolverBody&, btSolverConstraint const&)
External/SPEC/CFP2000/177.mesa/177.mesa -1.69%: vectorization now happens in some of the hottest basic blocks.
External/SPEC/CINT2000/176.gcc/176.gcc -1.42%: I didn't have time to analyze this one further.

In summary, with these results and with more patches in progress to lower the overhead of 2-lane vectorization, I think it's fine to enable this on Cortex-A57 too. I hope we'll be able to decide to just enable this generically for AArch64.

From the above results, I expect bzip2 to improve but unfortunately nothing else before I'd like to enable this. Are you still OK with this?

Yes, I'd still be fine with enabling this.

Thanks for all your efforts on this!

Kristof

In D31965#724869, @anemet wrote:

I am wondering if we should have subtarget "owners" and then we could just file bugs (tasks) to enable such features on the "other" subtargets. As I said, I had good results with SW data prefetching, but for example the ARM microarchs didn't add support for this.

Hi Adam,

We have something similar for the release process (RELEASE_TESTERS.TXT), so it should be fine to have a list of people that volunteered to be bugged when sub-architectural decisions are needed.

If we go down that route, I suggest not to re-use CODE_OWNERS.TXT, because that's already a big mess. Maybe we could move all target owners from CODE_OWNERS to lib/Target/OWNERS.TXT and keep it hierarchical, so we can have multiple owners per target (as we already have in some).

Feel free to start this RFC on the mailing list, though, as this is not a discussion for this patch. :)

cheers,
--renato

I'll kick off some benchmarks and get back to y'all.

anemet mentioned this in D32028: [AArch64] Avoid partial register writes on lane 0 of BUILD_VECTOR for i8/i16/f16. Apr 13 2017, 10:00 AM

anemet mentioned this in rL300276: [AArch64] Avoid partial register writes on lane 0 of BUILD_VECTOR for i8/i16/f16. Apr 13 2017, 4:45 PM

mssimpso mentioned this in D32533: [SLPVectorizer] Limit the number of block chain instructions to max register size. Apr 26 2017, 8:56 AM

Hi Adam,

I'm not sure where the conversation about this patch landed, but I'm fine with it being a Cyclone-only change for now if that's what you prefer. I haven't had a chance to evaluate it on our cores yet. But when I do, I can easily set MinVectorRegisterBitWidth if there's any benefit. How does compile-time look?

Hey Matt,

In D31965#742295, @mssimpso wrote:

Hi Adam,

I'm not sure where the conversation about this patch landed, but I'm fine with it being a Cyclone-only change for now if that's what you prefer. I haven't had a chance to evaluate it on our cores yet. But when I do, I can easily set MinVectorRegisterBitWidth if there's any benefit. How does compile-time look?

There is no measurable compile-time change for AArch64 (testsuite, ctmark, spec).

I believe the idea was to try to enable this for all subtargets, assuming you and @evandro can test this. On the other hand, it's been almost a month and I'd like to wrap this up. So perhaps we should enable this for Cyclone and A57 for now and then perhaps file bugs for the remaining subtargets to evaluate this.

How does that sound, @rengolin and others?

Our automated test infrastructure croaked, so I'm still getting the SPEC results. In other tests, it's looking promising on Exynos M1 and M2.

Thanks @evandro, let me know.

In D31965#743954, @anemet wrote:

Hey Matt,

In D31965#742295, @mssimpso wrote:

Hi Adam,

I'm not sure where the conversation about this patch landed, but I'm fine with it being a Cyclone-only change for now if that's what you prefer. I haven't had a chance to evaluate it on our cores yet. But when I do, I can easily set MinVectorRegisterBitWidth if there's any benefit. How does compile-time look?

There is no measurable compile-time change for AArch64 (testsuite, ctmark, spec).

I believe the idea was to try to enable this for all subtargets, assuming you and @evandro can test this. On the other hand, it's been almost a month and I'd like to wrap this up. So perhaps we should enable this for Cyclone and A57 for now and then perhaps file bugs for the remaining subtargets to evaluate this.

How does that sound, @rengolin and others?

That sounds reasonable to me, but I would do it the other way around: enable it by default and explicitly disable it for the cores that we know have a chance of being evaluated and decided on later.
Otherwise, I'm afraid that we'll forever have an ever-growing whitelist of cores to enable this on, while it looks like the right thing to do in the end is to just enable it by default.

In D31965#743954, @anemet wrote:

There is no measurable compile-time change for AArch64 (testsuite, ctmark, spec).

Awesome!

In D31965#744313, @kristof.beyls wrote:

That sounds reasonable to me, but I would do it the other way around: enable it by default and explicitly disable it for the cores that we know have a chance of being evaluated and decided on later. Otherwise, I'm afraid that we'll forever have an ever-growing whitelist of cores to enable this on, while it looks like the right thing to do in the end is to just enable it by default.

I'm fine with that plan as well.

In D31965#744313, @kristof.beyls wrote:

That sounds reasonable to me, but I would do it the other way around: enable it by default and explicitly disable it for the cores that we know have a chance of being evaluated and decided on later.
Otherwise, I'm afraid that we'll forever have an ever-growing whitelist of cores to enable this on, while it looks like the right thing to do in the end is to just enable it by default.

Couldn't agree more with you, @kristof.beyls .

In D31965#744313, @kristof.beyls wrote:

That sounds reasonable to me, but I would do it the other way around: enable it by default and explicitly disable it for the cores that we know have a chance of being evaluated and decided on later.
Otherwise, I'm afraid that we'll forever have an ever-growing whitelist of cores to enable this on, while it looks like the right thing to do in the end is to just enable it by default.

Sounds great, let me update the patch.

The results for Exynos M1 and M2 are in and, except for a couple of workloads which improved between 2 and 5%, any difference in workloads was in the noise level with no significant regression.

IOW, it's OK for the Exynos subtargets.

arsenm added a subscriber: arsenm. May 3 2017, 12:30 PM

arsenm added inline comments.

include/llvm/Analysis/TargetTransformInfoImpl.h
306 ↗	(On Diff #94912)	I had D32714 because AMDGPU wants 32, but this is probably better

In D31965#745161, @evandro wrote:

The results for Exynos M1 and M2 are in and, except for a couple of workloads which improved between 2 and 5%, any difference in workloads was in the noise level with no significant regression.

IOW, it's OK for the Exynos subtargets.

Thanks!

Updated according to Kristof's idea: rather than whitelist, blacklist
subtargets (Qualcomm, Cavium) that didn't get a chance to benchmark this yet.
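
The opt-out scheme described in this update could be sketched as follows. The enum and function are hypothetical and do not reproduce the committed AArch64Subtarget code; only the idea (64 by default, 128 for cores not yet benchmarked) comes from the discussion.

```cpp
// Hypothetical core kinds; the real subtarget list is larger.
enum class CoreKind { Generic, Benchmarked, NotYetBenchmarked };

// Opt-out scheme: 64 is the default, and only cores that have not been
// benchmarked with the change are held at the old value of 128.
unsigned minVectorRegisterBitWidth(CoreKind Core) {
  unsigned Width = 64;
  switch (Core) {
  case CoreKind::NotYetBenchmarked:
    Width = 128; // temporary; drop once the core has been evaluated
    break;
  case CoreKind::Generic:
  case CoreKind::Benchmarked:
    break;
  }
  return Width;
}
```

Compared with a whitelist, the exception list here shrinks over time as cores are evaluated, and new or generic targets get the 64-bit minimum without further changes.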

Herald added subscribers: krytarowski, javed.absar. May 8 2017, 8:19 AM

kristof.beyls added inline comments. May 9 2017, 5:24 AM

lib/Target/AArch64/AArch64TargetTransformInfo.h
88–89 ↗	(On Diff #98170)	This comment can be removed now?
test/Transforms/SLPVectorizer/AArch64/64-bit-vector.ll
1–3 ↗	(On Diff #98170)	Given that we think this change is good for generic AArch64 code generation, wouldn't it be good to also have a test for -mcpu=generic, or without an -mcpu specified at all?

anemet marked 2 inline comments as done. May 9 2017, 9:07 AM

Address Kristof's comments. Thanks, Kristof!

Harbormaster completed remote builds in B6283: Diff 98300. May 9 2017, 9:12 AM

Ping

LGTM!

Closed by commit rL303116: [SLP] Enable 64-bit wide vectorization on AArch64 (authored by anemet). May 15 2017, 2:28 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                                   Size
llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h                 7 lines
llvm/trunk/include/llvm/Analysis/TargetTransformInfoImpl.h             2 lines
llvm/trunk/lib/Analysis/TargetTransformInfo.cpp                        4 lines
llvm/trunk/lib/Target/AArch64/AArch64Subtarget.h                       7 lines
llvm/trunk/lib/Target/AArch64/AArch64Subtarget.cpp                     8 lines
llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.h             4 lines
llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp                  5 lines
llvm/trunk/test/Transforms/SLPVectorizer/AArch64/64-bit-vector.ll      22 lines

Diff 99063

llvm/trunk/include/llvm/Analysis/TargetTransformInfo.h

@@ ... @@ public:
   /// \return The number of scalar or vector registers that the target has.
   /// If 'Vectors' is true, it returns the number of vector registers. If it is
   /// set to false, it returns the number of scalar registers.
   unsigned getNumberOfRegisters(bool Vector) const;
 
   /// \return The width of the largest scalar or vector register type.
   unsigned getRegisterBitWidth(bool Vector) const;
 
+  /// \return The width of the smallest vector register type.
+  unsigned getMinVectorRegisterBitWidth() const;
+
   /// \return True if it should be considered for address type promotion.
   /// \p AllowPromotionWithoutCommonHeader Set true if promoting \p I is
   /// profitable without finding other extensions fed by the same input.
   bool shouldConsiderAddressTypePromotion(
       const Instruction &I, bool &AllowPromotionWithoutCommonHeader) const;
 
   /// \return The size of a cache line in bytes.
   unsigned getCacheLineSize() const;
@@ ... @@ virtual int getIntImmCodeSizeCost(unsigned Opc, unsigned Idx, const APInt &Imm,
                                     Type *Ty) = 0;
   virtual int getIntImmCost(const APInt &Imm, Type *Ty) = 0;
   virtual int getIntImmCost(unsigned Opc, unsigned Idx, const APInt &Imm,
                             Type *Ty) = 0;
   virtual int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
                             Type *Ty) = 0;
   virtual unsigned getNumberOfRegisters(bool Vector) = 0;
   virtual unsigned getRegisterBitWidth(bool Vector) = 0;
+  virtual unsigned getMinVectorRegisterBitWidth() = 0;
   virtual bool shouldConsiderAddressTypePromotion(
       const Instruction &I, bool &AllowPromotionWithoutCommonHeader) = 0;
   virtual unsigned getCacheLineSize() = 0;
   virtual unsigned getPrefetchDistance() = 0;
   virtual unsigned getMinPrefetchStride() = 0;
   virtual unsigned getMaxPrefetchIterationsAhead() = 0;
   virtual unsigned getMaxInterleaveFactor(unsigned VF) = 0;
   virtual unsigned
@@ ... @@ int getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
     return Impl.getIntImmCost(IID, Idx, Imm, Ty);
   }
   unsigned getNumberOfRegisters(bool Vector) override {
     return Impl.getNumberOfRegisters(Vector);
   }
   unsigned getRegisterBitWidth(bool Vector) override {
     return Impl.getRegisterBitWidth(Vector);
   }
+  unsigned getMinVectorRegisterBitWidth() override {
+    return Impl.getMinVectorRegisterBitWidth();
+  }
   bool shouldConsiderAddressTypePromotion(
       const Instruction &I, bool &AllowPromotionWithoutCommonHeader) override {
     return Impl.shouldConsiderAddressTypePromotion(
         I, AllowPromotionWithoutCommonHeader);
   }
   unsigned getCacheLineSize() override {
     return Impl.getCacheLineSize();
   }

llvm/trunk/include/llvm/Analysis/TargetTransformInfoImpl.h

@@ ... @@ unsigned getIntImmCost(Intrinsic::ID IID, unsigned Idx, const APInt &Imm,
                          Type *Ty) {
     return TTI::TCC_Free;
   }

   unsigned getNumberOfRegisters(bool Vector) { return 8; }

   unsigned getRegisterBitWidth(bool Vector) { return 32; }

+  unsigned getMinVectorRegisterBitWidth() { return 128; }
+
   bool
   shouldConsiderAddressTypePromotion(const Instruction &I,
                                      bool &AllowPromotionWithoutCommonHeader) {
     AllowPromotionWithoutCommonHeader = false;
     return false;
   }

   unsigned getCacheLineSize() { return 0; }

llvm/trunk/lib/Analysis/TargetTransformInfo.cpp

@@ ... @@
 unsigned TargetTransformInfo::getNumberOfRegisters(bool Vector) const {
   return TTIImpl->getNumberOfRegisters(Vector);
 }

 unsigned TargetTransformInfo::getRegisterBitWidth(bool Vector) const {
   return TTIImpl->getRegisterBitWidth(Vector);
 }

+unsigned TargetTransformInfo::getMinVectorRegisterBitWidth() const {
+  return TTIImpl->getMinVectorRegisterBitWidth();
+}
+
 bool TargetTransformInfo::shouldConsiderAddressTypePromotion(
     const Instruction &I, bool &AllowPromotionWithoutCommonHeader) const {
   return TTIImpl->shouldConsiderAddressTypePromotion(
       I, AllowPromotionWithoutCommonHeader);
 }

 unsigned TargetTransformInfo::getCacheLineSize() const {
   return TTIImpl->getCacheLineSize();

llvm/trunk/lib/Target/AArch64/AArch64Subtarget.h

@@ ... @@ protected:
   bool HasZeroCycleZeroing = false;

   // StrictAlign - Disallow unaligned memory accesses.
   bool StrictAlign = false;

   // NegativeImmediates - transform instructions with negative immediates
   bool NegativeImmediates = true;

+  // Enable 64-bit vectorization in SLP.
+  unsigned MinVectorRegisterBitWidth = 64;
+
   bool UseAA = false;
   bool PredictableSelectIsExpensive = false;
   bool BalanceFPOps = false;
   bool CustomAsCheapAsMove = false;
   bool UsePostRAScheduler = false;
   bool Misaligned128StoreIsSlow = false;
   bool Paired128IsSlow = false;
   bool UseAlternateSExtLoadCVTF32Pattern = false;
@@ ... @@ public:
   bool hasZeroCycleRegMove() const { return HasZeroCycleRegMove; }

   bool hasZeroCycleZeroing() const { return HasZeroCycleZeroing; }

   bool requiresStrictAlign() const { return StrictAlign; }

   bool isXRaySupported() const override { return true; }

+  unsigned getMinVectorRegisterBitWidth() const {
+    return MinVectorRegisterBitWidth;
+  }
+
   bool isX18Reserved() const { return ReserveX18; }
   bool hasFPARMv8() const { return HasFPARMv8; }
   bool hasNEON() const { return HasNEON; }
   bool hasCrypto() const { return HasCrypto; }
   bool hasCRC() const { return HasCRC; }
   bool hasLSE() const { return HasLSE; }
   bool hasRAS() const { return HasRAS; }
   bool hasRDM() const { return HasRDM; }

llvm/trunk/lib/Target/AArch64/AArch64Subtarget.cpp

@@ ... @@ case ExynosM1:
     MaxInterleaveFactor = 4;
     MaxJumpTableSize = 8;
     PrefFunctionAlignment = 4;
     PrefLoopAlignment = 3;
     break;
   case Falkor:
     MaxInterleaveFactor = 4;
     VectorInsertExtractBaseCost = 2;
+    // FIXME: remove this to enable 64-bit SLP if performance looks good.
+    MinVectorRegisterBitWidth = 128;
     break;
   case Kryo:
     MaxInterleaveFactor = 4;
     VectorInsertExtractBaseCost = 2;
     CacheLineSize = 128;
     PrefetchDistance = 740;
     MinPrefetchStride = 1024;
     MaxPrefetchIterationsAhead = 11;
+    // FIXME: remove this to enable 64-bit SLP if performance looks good.
+    MinVectorRegisterBitWidth = 128;
     break;
   case ThunderX2T99:
     CacheLineSize = 64;
     PrefFunctionAlignment = 3;
     PrefLoopAlignment = 2;
     MaxInterleaveFactor = 4;
     PrefetchDistance = 128;
     MinPrefetchStride = 1024;
     MaxPrefetchIterationsAhead = 4;
+    // FIXME: remove this to enable 64-bit SLP if performance looks good.
+    MinVectorRegisterBitWidth = 128;
     break;
   case ThunderX:
   case ThunderXT88:
   case ThunderXT81:
   case ThunderXT83:
     CacheLineSize = 128;
     PrefFunctionAlignment = 3;
     PrefLoopAlignment = 2;
+    // FIXME: remove this to enable 64-bit SLP if performance looks good.
+    MinVectorRegisterBitWidth = 128;
     break;
   case CortexA35: break;
   case CortexA53: break;
   case CortexA72: break;
   case CortexA73: break;
   case Others: break;
   }
 }
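
The per-CPU opt-out above can be read as "default to 64 bits, and let a few families restore 128". Below is a minimal standalone sketch of that policy. `ProcFamily` and `minVectorRegisterBitWidthFor` are hypothetical names invented for illustration and are not part of the patch. `Cyclone` is listed only because the revision title and the test's `-mcpu=cyclone` RUN line mention it.

```cpp
#include <cassert>

// Hypothetical stand-in for the AArch64 processor families named in this diff.
enum class ProcFamily {
  Others, CortexA35, CortexA53, CortexA72, CortexA73, Cyclone, ExynosM1,
  Falkor, Kryo, ThunderX2T99, ThunderX, ThunderXT81, ThunderXT83, ThunderXT88
};

// Sketch of the policy in the diff: the field defaults to 64 and a handful
// of families set it back to 128 until their performance has been evaluated.
unsigned minVectorRegisterBitWidthFor(ProcFamily Family) {
  unsigned MinVectorRegisterBitWidth = 64; // default from AArch64Subtarget.h
  switch (Family) {
  case ProcFamily::Falkor:
  case ProcFamily::Kryo:
  case ProcFamily::ThunderX2T99:
  case ProcFamily::ThunderX:
  case ProcFamily::ThunderXT81:
  case ProcFamily::ThunderXT83:
  case ProcFamily::ThunderXT88:
    MinVectorRegisterBitWidth = 128; // opted out, see the FIXME comments
    break;
  default:
    break; // every other family keeps the 64-bit default
  }
  // Sanity check for the sketch: only these two widths are ever produced.
  assert(MinVectorRegisterBitWidth == 64 || MinVectorRegisterBitWidth == 128);
  return MinVectorRegisterBitWidth;
}
```

Keeping the default at 64 means a newly added CPU gets 64-bit SLP unless it explicitly opts out, which matches the direction of the FIXME comments.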

llvm/trunk/lib/Target/AArch64/AArch64TargetTransformInfo.h

@@ ... @@ unsigned getRegisterBitWidth(bool Vector) {
     if (Vector) {
       if (ST->hasNEON())
         return 128;
       return 0;
     }
     return 64;
   }

+  unsigned getMinVectorRegisterBitWidth() {
+    return ST->getMinVectorRegisterBitWidth();
+  }
+
   unsigned getMaxInterleaveFactor(unsigned VF);

   int getCastInstrCost(unsigned Opcode, Type *Dst, Type *Src,
                        const Instruction *I = nullptr);

   int getExtractWithExtendCost(unsigned Opcode, Type *Dst, VectorType *VecTy,
                                unsigned Index);


llvm/trunk/lib/Transforms/Vectorize/SLPVectorizer.cpp

@@ ... @@ BoUpSLP(Function *Func, ScalarEvolution *Se, TargetTransformInfo *Tti,
     // data type rather than just register size. For example, x86 AVX has
     // 256-bit registers, but it does not support integer operations
     // at that width (that requires AVX2).
     if (MaxVectorRegSizeOption.getNumOccurrences())
       MaxVecRegSize = MaxVectorRegSizeOption;
     else
       MaxVecRegSize = TTI->getRegisterBitWidth(true);

+    if (MinVectorRegSizeOption.getNumOccurrences())
       MinVecRegSize = MinVectorRegSizeOption;
+    else
+      MinVecRegSize = TTI->getMinVectorRegisterBitWidth();
   }

   /// \brief Vectorize the tree that starts with the elements in \p VL.
   /// Returns the vectorized root.
   Value *vectorizeTree();
   /// Vectorize the tree but with the list of externally used values \p
   /// ExternallyUsedValues. Values in this MapVector can be replaced but the
   /// generated extractvalue instructions.
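
The constructor change gives an explicit `-slp-min-reg-size` on the command line priority over the target's answer. The following is a self-contained sketch of that precedence rule only. `OptionSketch` and `TargetInfoSketch` are hypothetical stand-ins for the `cl::opt` object and the TTI interface, reduced to the single member each needs here.

```cpp
#include <cassert>

// Hypothetical stand-in for a command-line option: how many times it was
// given, and the value it currently holds.
struct OptionSketch {
  unsigned NumOccurrences;
  unsigned Value;
  unsigned getNumOccurrences() const { return NumOccurrences; }
};

// Hypothetical stand-in for the target hook added in this diff.
struct TargetInfoSketch {
  unsigned MinVectorRegisterBitWidth;
  unsigned getMinVectorRegisterBitWidth() const {
    return MinVectorRegisterBitWidth;
  }
};

// An option given explicitly wins; otherwise the target decides.
unsigned chooseMinVecRegSize(const OptionSketch &Opt,
                             const TargetInfoSketch &TTI) {
  assert(TTI.getMinVectorRegisterBitWidth() != 0 && "sketch expects a width");
  if (Opt.getNumOccurrences())
    return Opt.Value;
  return TTI.getMinVectorRegisterBitWidth();
}
```

With this ordering the last RUN line of the test, which passes `-slp-min-reg-size=128` together with `-mcpu=generic`, still ends up with a 128-bit minimum even though the target reports 64.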

llvm/trunk/test/Transforms/SLPVectorizer/AArch64/64-bit-vector.ll

+; RUN: opt -S -slp-vectorizer -mtriple=aarch64--linux-gnu -mcpu=generic < %s | FileCheck %s
+; RUN: opt -S -slp-vectorizer -mtriple=aarch64-apple-ios -mcpu=cyclone < %s | FileCheck %s
+; Currently disabled for a few subtargets (e.g. Kryo):
+; RUN: opt -S -slp-vectorizer -mtriple=aarch64--linux-gnu -mcpu=kryo < %s | FileCheck --check-prefix=NO_SLP %s
+; RUN: opt -S -slp-vectorizer -mtriple=aarch64--linux-gnu -mcpu=generic -slp-min-reg-size=128 < %s | FileCheck --check-prefix=NO_SLP %s
+
+define void @f(float* %r, float* %w) {
+  %r0 = getelementptr inbounds float, float* %r, i64 0
+  %r1 = getelementptr inbounds float, float* %r, i64 1
+  %f0 = load float, float* %r0
+  %f1 = load float, float* %r1
+  %add0 = fadd float %f0, %f0
+; CHECK: fadd <2 x float>
+; NO_SLP: fadd float
+; NO_SLP: fadd float
+  %add1 = fadd float %f1, %f1
+  %w0 = getelementptr inbounds float, float* %w, i64 0
+  %w1 = getelementptr inbounds float, float* %w, i64 1
+  store float %add0, float* %w0
+  store float %add1, float* %w1
+  ret void
+}
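
The test relies on simple arithmetic: two `float` values occupy 2 x 32 = 64 bits, so they fill a 64-bit vector but only half of a 128-bit one. The sketch below is a simplified model of that size check, not the pass's actual logic, which also involves a cost model and other legality checks. The names `minVF` and `canVectorizeBundle` are hypothetical, and the assumption that the smallest vectorization factor is `max(2, MinVecRegSize / element width)` is a simplification made for this example.

```cpp
#include <algorithm>
#include <cassert>

// Smallest number of lanes the model will consider: enough elements to fill
// the minimum vector register, and never fewer than two.
unsigned minVF(unsigned MinVecRegSizeBits, unsigned ElemBits) {
  assert(ElemBits != 0 && "element width must be non-zero");
  return std::max(2u, MinVecRegSizeBits / ElemBits);
}

// A group of adjacent scalars is a candidate only if it has at least minVF
// elements. Profitability is deliberately ignored in this model.
bool canVectorizeBundle(unsigned NumScalars, unsigned ElemBits,
                        unsigned MinVecRegSizeBits) {
  return NumScalars >= minVF(MinVecRegSizeBits, ElemBits);
}
```

Under this model `canVectorizeBundle(2, 32, 64)` holds, which corresponds to the `CHECK: fadd <2 x float>` line, while `canVectorizeBundle(2, 32, 128)` does not, which corresponds to the two scalar `NO_SLP: fadd float` lines.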

Diff 99063