This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][CostModel] Improve cost for fsqrt intrinsics.
AbandonedPublic

Authored by mcrosier on Jan 3 2017, 2:08 PM.

Details

Summary

This patch changes the cost of sqrt instrinsics for AArch64.

I have very limited knowledge of the cost model, so I tried to pick fairly conservative values as a starting point. In looking at the ongoing work on the X86 side it appears these values reflect the latency of the instruction. However, a discussion with @mssimpso led me to believe the AArch64 cost model doesn't directly use the instruction latencies.

Any input here would be greatly appreciated..

This change causes a hand full of additional cases in SPEC (e.g., povray) to be SLP vectorized. In fact, this may only change codegen when targeting Kryo where insert and extract element operations are cheaper than most other sub-targets.

Chad

Diff Detail

Event Timeline

mcrosier updated this revision to Diff 82945.Jan 3 2017, 2:08 PM
mcrosier retitled this revision from to [AArch64][CostModel] Improve cost for fsqrt intrinsics..
mcrosier updated this object.
mcrosier added reviewers: mkuper, RKSimon, mssimpso, MatzeB.
mcrosier added subscribers: llvm-commits, mssimpso.
RKSimon edited edge metadata.Jan 4 2017, 12:25 AM

I have very limited knowledge of the cost model, so I tried to pick fairly conservative values as a starting point. In looking at the ongoing work on the X86 side it appears these values reflect the latency of the instruction. However, a discussion with @mssimpso led me to believe the AArch64 cost model doesn't directly use the instruction latencies.

At the TTI level the costs should represent throughput, as that is more useful for determining the benefit of vectorization; for fsqrt/fdiv units this is often equal (or almost equal) to the latency. Additionally it might get more complicated if the sqrt/div unit is only a 64-bit ALU and you're processing 128-bit vectors but from your example costs it doesn't look it.

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
504

I don't know the range of costs that AARCH64 cores can have here - for x86 we tend to qualify these by mentioning the core type that we used for the costs in a comment. But AARCH64 is younger so might still be more consistent!

fhahn edited edge metadata.Jan 4 2017, 4:16 AM
fhahn added a subscriber: fhahn.

Thanks for the feedback, Simon. I'll experiment with using costs that are more representative of throughput as you suggest.

lib/Target/AArch64/AArch64TargetTransformInfo.cpp
504

I think this makes sense, but the AArch64 backend has only a single generation of SIMD instructions (excluding the v8.[1|2]a extensions and the recently announced scalable vector extensions (VSE)).

We might consider predicating the logic based on the specific sub-target (e.g., Kryo, Cortex-A57) as opposed to the SIMD generation (e.g., SSE, MMX, AVX2) for AArch64.. at least in those cases where the latencies vary greatly between subtargets.

RKSimon added inline comments.Jan 6 2017, 5:39 AM
test/Transforms/SLPVectorizer/AArch64/intrinsic-cost-model.ll
1

Not sure if you're interested but utils\update_test_checks.py could be used to auto-generate this.

Also, why call it intrinsic-cost-model.ll and not just sqrt.ll or intrinsics.ll ?

20

Since this is a SLPVectorizer test are you gaining anything by giving it a loop to vectorizer instead of (simpler) flat code?

44

Add a float test?

mcrosier updated this revision to Diff 84320.Jan 13 2017, 9:02 AM

-Address a few of Simons comments.
-Specialize the cost specifically for Kryo latencies, which are fairly different from the other targets I've looked at.

mcrosier planned changes to this revision.Jan 13 2017, 9:06 AM
mcrosier added a subscriber: bmakam.

I'm going to hand this work off to @bmakam, so I'm blocking this until he has had time to review the current state and decide if this patch is reasonable.

mcrosier abandoned this revision.Jan 24 2017, 9:49 AM