Indexed ld1 instructions are costed relatively high for Neoverse V1/V2 cores compared to N1/N2 cores.
This can be seen by comparing the V1/V2 cost with that of simple integer loads.
For example, the cost of ldrb/ldrh/ldr is 1, as shown below:
define i8 @load(ptr %x) {
; CHECK-LABEL: 'load'
; CHECK-NEXT: Cost Model: Found an estimated cost of 1 for instruction: %tmp = load i8, ptr %x, align 1
; CHECK-NEXT: Cost Model: Found an estimated cost of 0 for instruction: ret i8 %tmp
  %tmp = load i8, ptr %x, align 1
  ret i8 %tmp
}
whereas indexed ld1 is costed at 5 for V1/V2 cores, which is too high (see 'LD1_X' in llvm/test/Analysis/CostModel/AArch64/insert-extract.ll).
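For reference, a minimal sketch of the IR shape that is expected to select to an indexed (lane) ld1: a scalar load whose result is inserted into a vector lane. The function name, vector type and lane index here are illustrative assumptions, not copied from insert-extract.ll, and no CHECK lines are given because the exact costs depend on the target CPU.

```llvm
; Hypothetical example: scalar load followed by an insert into lane 1.
; On AArch64 this pair is expected to lower to something like
;   ld1 { v0.b }[1], [x0]
define <8 x i8> @ld1_lane_sketch(<8 x i8> %vec, ptr %p) {
  ; Load a single byte from memory.
  %elt = load i8, ptr %p, align 1
  ; Insert the loaded byte into lane 1 of the existing vector.
  %res = insertelement <8 x i8> %vec, i8 %elt, i32 1
  ret <8 x i8> %res
}
```

The per-instruction costs for such a function can be inspected with the cost-model printer, e.g. opt -passes="print<cost-model>" -disable-output -mtriple=aarch64 -mcpu=neoverse-v1 (assuming an opt build with the AArch64 target enabled).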
From the Software Optimization Guide for Neoverse V1:
ldrb --> Latency=4 Throughput=3
indexed ld1 --> Latency=8 Throughput=3
So an indexed ld1 has twice the latency of a simple load (8 vs. 4) at the same throughput, and should be costed at no more than 2x a simple load.
This patch reduces the cost of indexed ld1 instructions accordingly.
Tested the patch with SPEC2017 on Neoverse V1; no regressions were observed.
(All the tests in insert-extract.ll have been split into two files for this patch.)