This is an archive of the discontinued LLVM Phabricator instance.

[TTI][AArch64] Update vector extract cost for Neoverse-N1.
Abandoned · Public

Authored by vporpo on Aug 18 2022, 3:52 PM.

Details

Summary

According to the ARM Neoverse-N1 Software Optimization Guide, the extract
instructions have a latency of 2 and a throughput of 2, yet TTI returns a
cost of 3, which seems too high.

For comparison, PEXTR on x86 has a latency of 3 and a throughput of 1 on
Skylake, according to https://www.agner.org/optimize/instruction_tables.pdf,
yet TTI returns a cost of 1.

This patch sets the vector extract cost to 1.
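
For context, a sketch of what the change would look like (paraphrased, not the verbatim diff; the VectorInsertExtractBaseCost member and its default of 3 come from AArch64Subtarget.h, but the exact details may differ):

```cpp
// In AArch64Subtarget::initializeProperties(), the Neoverse-N1 case would
// override the default lane-move cost declared in AArch64Subtarget.h.
case NeoverseN1:
  // ... existing Neoverse-N1 tuning ...
  VectorInsertExtractBaseCost = 1; // default is 3
  break;
```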

Diff Detail

Event Timeline

vporpo created this revision. · Aug 18 2022, 3:52 PM
Herald added a project: Restricted Project. · View Herald Transcript · Aug 18 2022, 3:52 PM
vporpo requested review of this revision. · Aug 18 2022, 3:52 PM
mingmingl added inline comments.
llvm/lib/Target/AArch64/AArch64Subtarget.cpp
191

nit: this changes the cost for both extract and insert, while the summary mostly mentions the EXT instruction cost. It might be good to call out that INS also has a latency of 2 and a throughput of 2 (unless it's a common assumption that the extract and insert instructions have the same cost).

Also, from the studies in D128302, I think the cost of extract/insert is better modeled by taking the user instruction into account (e.g., if the user instruction can access the lane directly, the extract could be folded into the user in the emitted code and have no cost). Nevertheless, my gut feeling is that 3 is a high number for instructions with a latency of 2 and a throughput of 2; I'm not sure whether 1 is too small. A toy sketch of the user-aware idea follows.
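
Hypothetical code, not the model from D128302 or LLVM's actual API:

```cpp
// Toy sketch of user-aware extract costing; hypothetical, not LLVM code.
// An extract whose user can address the lane directly (e.g. a by-lane
// multiply-accumulate) folds away in codegen; otherwise it pays the base
// vector-to-GPR move cost.
unsigned extractCostGivenUser(bool UserReadsLaneDirectly, unsigned BaseCost) {
  return UserReadsLaneDirectly ? 0 : BaseCost;
}
```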

vporpo added inline comments.Aug 18 2022, 8:21 PM
llvm/lib/Target/AArch64/AArch64Subtarget.cpp
191

Yes, I need to update the description and add a test for the insertelement instructions too.

Yeah, considering the user instruction is definitely more precise.

I think a cost of 1 may be all right as long as only one instruction is needed for the extraction. I believe this is the logic behind the x86 extract cost calculation: it returns 1 if a single instruction suffices, and a higher cost if more instructions are needed.
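
A toy sketch of that logic (hypothetical; not the actual X86TTIImpl code, and the HasPEXTR flag is an assumption for illustration):

```cpp
// Toy model of the x86-style extract cost logic described above.
unsigned vectorExtractCost(unsigned Index, bool HasPEXTR) {
  if (Index == 0)
    return 1; // lane 0 is a plain MOVD/MOVQ out of the vector register
  if (HasPEXTR)
    return 1; // a single PEXTRB/W/D/Q suffices (SSE4.1 and later)
  return 2;   // shuffle the lane down to position 0, then move it out
}
```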

david-arm added a comment.

Hi @vporpo, the change seems sensible to me. Have you built/run any benchmarks to see what effect this change has on Neoverse-N1? I imagine it might make a difference to the output from the SLPVectorizer, for example.

fhahn added a subscriber: fhahn. · Aug 19 2022, 2:25 AM
fhahn added inline comments.
llvm/lib/Target/AArch64/AArch64Subtarget.cpp
191

I'd expect the extract cost to be similar on most recent-ish AArch64 cores. Should this be changed for more cores than just a single one?

dmgreen added a comment.

I've tried setting the default VectorInsertExtractBaseCost to 2 in the past, but have only ever seen performance regressions. From what I remember, they were large and persuasive enough to discourage me from considering it any further.

The VectorInsertExtractBaseCost (used via getVectorInstrCost) is, to an extent, not really a measure of the throughput of vector<->GPR moves. More importantly, it controls how much vector shuffling you are willing to accept at the expense of just using scalar code. On a machine like the N1, with 4 scalar pipelines and 2 SIMD pipelines, I'm not sure it makes a lot of sense to target SIMD more aggressively.
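
To make that concrete, a toy model (assumed numbers, not LLVM code) of how the base lane-move cost scales what a helper like getScalarizationOverhead would report for a small vector:

```cpp
#include <cstdio>
#include <initializer_list>

// Toy model, not LLVM code: scalarizing an N-lane value needs roughly one
// lane move per lane, so the base insert/extract cost linearly scales how
// expensive the vector<->scalar boundary looks to the vectorizer.
static unsigned scalarizationOverhead(unsigned NumLanes, unsigned BaseCost) {
  return NumLanes * BaseCost;
}

int main() {
  for (unsigned Base : {1u, 2u, 3u}) // 3 is the AArch64 default at issue
    std::printf("4 lanes at base cost %u -> overhead %u\n", Base,
                scalarizationOverhead(4, Base));
  return 0;
}
```

With a base cost of 3, a 4-lane round trip already looks like 12 units, which is why lowering it to 1 can flip many SLP decisions toward SIMD.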

The cost model under AArch64 is pretty rough in places, and codegen is slowly getting better over time, so there is certainly room for improvement, but I think it would take quite a lot to convince me that this is the right way forward.

vporpo abandoned this revision. · Aug 19 2022, 10:00 AM

@david-arm I have not done extensive testing; it just looked strange to me that the cost is so high compared to x86. But I think @dmgreen's point makes sense, so we should probably not change the cost for now.

llvm/lib/Target/AArch64/AArch64Subtarget.cpp
191

Yeah, it should be similar, but I think I would agree with @dmgreen.

@dmgreen The extract cost's default value and how the cost is used are two orthogonal things. I suspect that the regression you saw with a cost of 2 might be caused by a hidden bug somewhere else -- in other words, setting the cost to 3 simply papers over the real issue.

I think this patch is probably the right way to go. An alternative is to introduce an internal option to control the default value.

dmgreen added a comment.

> @dmgreen The extract cost's default value and how the cost is used are two orthogonal things. I suspect that the regression you saw with a cost of 2 might be caused by a hidden bug somewhere else -- in other words, setting the cost to 3 simply papers over the real issue.

If I'm understanding what you are saying correctly, I somewhat agree. The mid-end is made up of a lot of imperfect heuristics, with a cost model that is not exact, and some of the decisions made are not always optimal. Setting the cost of extracts to 1 or 2 makes the compiler favour SIMD over scalar code, through the costs reported by methods like getScalarizationOverhead and getVectorInstrCost. That can be beneficial in places, but it can also hurt performance. And in practice, on average over many benchmarks, it seems to hurt performance more than it helps.

> I think this patch is probably the right way to go. An alternative is to introduce an internal option to control the default value.

There was an -aarch64-insert-extract-base-cost option added in D124835. Does that serve your purpose?
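
For reference, such knobs follow the usual cl::opt pattern; a sketch of roughly how it would be declared (see D124835 for the real code, which may differ in naming and defaults):

```cpp
#include "llvm/Support/CommandLine.h"
using namespace llvm;

// Sketch of the cl::opt pattern for the knob mentioned above; the exact
// declaration lives in the AArch64 backend and may differ in its details.
static cl::opt<unsigned> OverrideVectorInsertExtractBaseCost(
    "aarch64-insert-extract-base-cost", cl::init(~0U), cl::Hidden,
    cl::desc("Base cost of vector insert/extract element"));
```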

@dmgreen Yes, that is a good summary. The option seems good enough for now.