TTI::prefersVectorizedAddressing() try to vectorize the addresses that lead to loads. For aarch64, only gather/scatter (supported by SVE) can deal with vectors of addresses. This patch specializes the hook for AArch64, to return true only when we enable SVE.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Did you do any performance measurements to get an impression of the performance impact of this change?
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h | ||
---|---|---|
151 | Intuitively I would think that false would be a more sensible default anyway. That wouldn't make much difference to this patch, because we still want to distinguish SVE and NEON. | |
llvm/test/Transforms/LoopVectorize/AArch64/gather-do-not-vectorize-addressing.ll | ||
63 | For a scalable VF there will be no difference in practice, because it won't try to scalarise the addresses. |
Yes - do you have benchmarking results for this patch? This option makes sense, but I'm not sure what it's doing is always optimal. There's something going on with how it alters interleaving group costs, that doesn't look like it should be related to vector addresses. One such case was cleaned up (or maybe hidden) by D124786, but more problems might be present.
llvm/test/Transforms/LoopVectorize/AArch64/gather-do-not-vectorize-addressing.ll | ||
---|---|---|
63 | It might be OK in this case, but in general just _having_ the SVE architecture feature ideally shouldn't make fixed-length NEON vectorization worse. I guess with something that needs a gather, we would always expect it to use VLA vectorization, so have the gather instruction? In that case here it sounds reasonable to base it on the arch feature. | |
llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll | ||
13 | It is hard to see why this is now correct.. the vector body looks pretty empty? |
llvm/test/Transforms/LoopVectorize/AArch64/gather-do-not-vectorize-addressing.ll | ||
---|---|---|
84 | does the body here need all those reductions or could it be reduced a bit? It would probably also be good to precommit the test and have only the changes/improvements in the diff here. |
Hi, @dmgreen, thanks for the comment!
In fact, this patch is an optimization for 505.lbm_t in spec hpc.
spec hpc
items | base(s) | D124612 | affect(%) |
---|---|---|---|
505.lbm_t | 792 | 673 | 17.7 |
I also test spec2017 on llvm-testsuit-main, and the results doesn't show much impact overall. Do you think these statistics can assess the impact of this patch?
CFP2017rate
items | base(s) | D124612 | affect(%) |
---|---|---|---|
554.roms_r/554.roms_r.test | 289.4838 | 283.622 | 2.06676492 |
526.blender_r/526.blender_r.test | 197.1358 | 195.0169 | 1.086521219 |
544.nab_r/544.nab_r.test | 157.0759 | 156.4937 | 0.372027756 |
521.wrf_r/521.wrf_r.test | 100.7203 | 101.583 | -0.849256273 |
510.parest_r/510.parest_r.test | 70.1389 | 70.5425 | -0.572137364 |
503.bwaves_r/503.bwaves_r.test | 80.3388 | 81.0353 | -0.85950197 |
549.fotonik3d_r/549.fotonik3d_r.test | 66.388 | 63.7539 | 4.131668808 |
538.imagick_r/538.imagick_r.test | 53.5874 | 53.5616 | 0.048168837 |
527.cam4_r/527.cam4_r.test | 61.1036 | 61.2236 | -0.196002849 |
508.namd_r/508.namd_r.test | 40.3674 | 40.622 | -0.626753976 |
519.lbm_r/519.lbm_r.test | 37.726 | 37.8211 | -0.251446944 |
507.cactuBSSN_r/507.cactuBSSN_r.test | 35.404 | 35.3594 | 0.126133362 |
511.povray_r/511.povray_r.test | 5.9146 | 5.9722 | -0.964468705 |
CINT2017rate
items | base(s) | D124612 | affect(%) |
---|---|---|---|
520.omnetpp_r/520.omnetpp_r.test | 90.2297 | 89.7905 | 0.489138606 |
541.leela_r/541.leela_r.test | 89.2462 | 89.6644 | -0.466405842 |
505.mcf_r/505.mcf_r.test | 84.7131 | 83.7455 | 1.155405365 |
531.deepsjeng_r/531.deepsjeng_r.test | 64.9424 | 65.4395 | -0.759632943 |
523.xalancbmk_r/523.xalancbmk_r.test | 63.8734 | 63.4812 | 0.617820709 |
502.gcc_r/502.gcc_r.test | 57.0005 | 57.1864 | -0.325077291 |
557.xz_r/557.xz_r.test | 43.8289 | 43.9124 | -0.190151301 |
548.exchange2_r/548.exchange2_r.test | 40.916 | 40.3618 | 1.373080487 |
500.perlbench_r/500.perlbench_r.test | 27.7047 | 27.802 | -0.349974822 |
525.x264_r/525.x264_r.test | 22.1032 | 22.199 | -0.431550971 |
Hi, @sdesmalen, thanks for the comment!
I have post statistics in the comment, do you think the data can be used to assess the impact of the patch?
Thanks for getting the numbers.
llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll | ||
---|---|---|
13 | Do you know what is going on in this case? |
Thanks for the comment, @dmgreen !
This patch may affect the widening decision (It actually affects ScalarizationCost) in setCostBasedWideningDecision for these loads . NEON cannot process vectorized addresses and requires exectelement support. Therefore, ScalarizationCost will add the overhead of this instruction (3 x InterleaveGroupSize). After the patch, ScalarizationCost excludes this overhead, and the cost becomes smaller.
Before the patch:
ScalarizationCost: {Value = 20, State = llvm::InstructionCost::Valid} InterleaveCost: {Value = 17, State = llvm::InstructionCost::Valid} Final Decision and Cost: CM_Interleave, 17
After the patch
ScalarizationCost: {Value = 14, State = llvm::InstructionCost::Valid} InterleaveCost: {Value = 17, State = llvm::InstructionCost::Valid} Final Decision and Cost: CM_Scalarize, 14
llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll | ||
---|---|---|
13 |
|
Thanks. If you can update the test case, then this looks sensible to me.
llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll | ||
---|---|---|
13 | OK I see what is going on - the values %tmp1 and %tmp3 are never used, the test not very meaningful in that regard. The vector body being empty isn't an issue in that case. It's a bit of a funny test, but I agree with you that the things it is testing are OK. Can you change the test to this, to be more "glued together": ; REQUIRES: asserts ; RUN: opt < %s -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -S --debug-only=loop-vectorize 2>&1 | FileCheck %s target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128" target triple = "aarch64--linux-gnu" %pair = type { i8, i8 } ; CHECK-LABEL: test ; CHECK: Found an estimated cost of 14 for VF 2 For instruction: {{.*}} load i8 ; CHECK: Found an estimated cost of 0 for VF 2 For instruction: {{.*}} load i8 ; CHECK-LABEL: entry: ; CHECK-LABEL: vector.body: ; CHECK: [[LOAD1:%.*]] = load i8 ; CHECK: [[LOAD2:%.*]] = load i8 ; CHECK: [[INSERT:%.*]] = insertelement <2 x i8> poison, i8 [[LOAD1]], i32 0 ; CHECK: insertelement <2 x i8> [[INSERT]], i8 [[LOAD2]], i32 1 ; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body define void @test(%pair* %p, i8* %q, i64 %n) { entry: br label %for.body for.body: %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ] %tmp0 = getelementptr %pair, %pair* %p, i64 %i, i32 0 %tmp1 = load i8, i8* %tmp0, align 1 %tmp2 = getelementptr %pair, %pair* %p, i64 %i, i32 1 %tmp3 = load i8, i8* %tmp2, align 1 %add = add i8 %tmp1, %tmp3 %qi = getelementptr i8, i8* %q, i64 %i store i8 %add, i8* %qi, align 1 %i.next = add nuw nsw i64 %i, 1 %cond = icmp eq i64 %i.next, %n br i1 %cond, label %for.end, label %for.body for.end: ret void } |
Thanks, @dmgreen! "interleaved-vs-scalar.ll" has been updated, is there any problem with the other test case?
Intuitively I would think that false would be a more sensible default anyway.
That wouldn't make much difference to this patch, because we still want to distinguish SVE and NEON.