This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][CodeGen] Reduce cost of indexed ld1 instructions for Neoverse V1/V2 cores
Needs ReviewPublic

Authored by sushgokh on Mar 23 2023, 3:15 AM.

Details

Summary

Indexed ld1 instructions are costed relatively high for Neoverse V1/V2 cores compared to N1/N2 cores.

This can also be gauged by comparing the V1/V2 cost against that of simple integer loads.

e.g., the cost of ldrb/ldrh/ldr is 1, as shown below:

define i8 @load(ptr %x) {
; CHECK-LABEL: 'load'
; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i8, ptr %x, align 1
; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret i8 %tmp

%tmp = load i8, ptr %x, align 1
ret i8 %tmp

}

while an indexed ld1 is costed much higher, at 5, for V1/V2 cores (see 'LD1_X' in llvm/test/Analysis/CostModel/AArch64/insert-extract.ll).
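For illustration, the kind of sequence being costed here is a scalar load feeding an insertelement, which the AArch64 backend lowers to an indexed ld1. This is a hypothetical reduction of that pattern (the actual LD1_X tests in the file may differ in types and lane indices):

```llvm
define <4 x i32> @ld1_lane(ptr %p, <4 x i32> %v) {
  ; Scalar load whose only use is a lane insert; on AArch64 this
  ; typically lowers to "ld1 { v0.s }[2], [x0]".
  %s = load i32, ptr %p, align 4
  %r = insertelement <4 x i32> %v, i32 %s, i64 2
  ret <4 x i32> %r
}
```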

From the Software Optimization Guide for Neoverse V1:
ldrb --> Latency=4, Throughput=3
indexed ld1 --> Latency=8, Throughput=3

So an indexed ld1 should be at most 2x as costly as a simple load, and not more than that.

This patch tries to reduce the cost for indexed ld1 instructions.

Tested the patch with SPEC2017 on Neoverse V1; no regressions were observed.

(All tests in insert-extract.ll have been split into two files for this patch.)

Diff Detail

Event Timeline

sushgokh created this revision.Mar 23 2023, 3:15 AM
sushgokh requested review of this revision.Mar 23 2023, 3:15 AM
Herald added a project: Restricted Project.Mar 23 2023, 3:15 AM

Hello - I'm not a fan of this patch, but I am interested in why you are making it. You mention that "Tested the patch for SPEC2017 on neoverse-V1 and no regressions were observed." Do you have other cases where this does show improvement?

I am fairly strongly against the cost model having a lot of very cpu-specific micro adjustments, at least without good reason for them. It creates a maintenance burden that I have not yet seen justified.

Note that for the V1/V2 _all_ vector operations have double the throughput of N1/N2, not just these particular lane-insert operations. But we probably don't get the relative costs of scalar vs load/store vs vector precisely correct at the moment. The cost model in AArch64 is fairly generic; I've not seen in the past any strong reason to change that. Some parts (like the costs of inserts/extracts) are the way they are less due to the exact relative costs of various operations on particular cpus, and more to try to control the type of code produced by the SLP vectorizer. But in the case of the ld1 lane-insert cost, I believe it is set high because those instructions require both the L and V pipelines.

It's strange that we first increased the cost, and this patch now proposes to lower it. Clearly we need some alignment and clarity on this. But most importantly, we need some motivating examples before we continue with any of this.

khchen added a subscriber: khchen.Apr 12 2023, 11:38 PM
llvm/test/Analysis/CostModel/AArch64/extract-from-vector.ll