Page MenuHomePhabricator

[AArch64][LV] AArch64 does not prefer vectorized addressing
ClosedPublic

Authored by TiehuZhang on Apr 28 2022, 6:38 AM.

Details

Summary

TTI::prefersVectorizedAddressing() try to vectorize the addresses that lead to loads. For aarch64, only gather/scatter (supported by SVE) can deal with vectors of addresses. This patch specializes the hook for AArch64, to return true only when we enable SVE.

Diff Detail

Event Timeline

TiehuZhang created this revision.Apr 28 2022, 6:38 AM
TiehuZhang requested review of this revision.Apr 28 2022, 6:38 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 28 2022, 6:38 AM

Did you do any performance measurements to get an impression of the performance impact of this change?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
153

Intuitively I would think that false would be a more sensible default anyway.

That wouldn't make much difference to this patch, because we still want to distinguish SVE and NEON.

llvm/test/Transforms/LoopVectorize/AArch64/gather-do-not-vectorize-addressing.ll
64

For a scalable VF there will be no difference in practice, because it won't try to scalarise the addresses.
If you want to test the difference between SVE and NEON, you'll need to force the VF using -force-vector-width=2 for both RUN lines.

Matt added a subscriber: Matt.May 1 2022, 10:15 AM

Yes - do you have benchmarking results for this patch? This option makes sense, but I'm not sure what it's doing is always optimal. There's something going on with how it alters interleaving group costs, that doesn't look like it should be related to vector addresses. One such case was cleaned up (or maybe hidden) by D124786, but more problems might be present.

llvm/test/Transforms/LoopVectorize/AArch64/gather-do-not-vectorize-addressing.ll
64

It might be OK in this case, but in general just _having_ the SVE architecture feature ideally shouldn't make fixed-length NEON vectorization worse. I guess with something that needs a gather, we would always expect it to use VLA vectorization, so have the gather instruction? In that case here it sounds reasonable to base it on the arch feature.

llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll
12–17

It is hard to see why this is now correct.. the vector body looks pretty empty?

fhahn added inline comments.May 2 2022, 11:46 AM
llvm/test/Transforms/LoopVectorize/AArch64/gather-do-not-vectorize-addressing.ll
85

does the body here need all those reductions or could it be reduced a bit? It would probably also be good to precommit the test and have only the changes/improvements in the diff here.

Yes - do you have benchmarking results for this patch? This option makes sense, but I'm not sure what it's doing is always optimal. There's something going on with how it alters interleaving group costs, that doesn't look like it should be related to vector addresses. One such case was cleaned up (or maybe hidden) by D124786, but more problems might be present.

Hi, @dmgreen, thanks for the comment!
In fact, this patch is an optimization for 505.lbm_t in spec hpc.

spec hpc

itemsbase(s)D124612affect(%)
505.lbm_t79267317.7

I also test spec2017 on llvm-testsuit-main, and the results doesn't show much impact overall. Do you think these statistics can assess the impact of this patch?

CFP2017rate

itemsbase(s)D124612affect(%)
554.roms_r/554.roms_r.test289.4838283.6222.06676492
526.blender_r/526.blender_r.test197.1358195.01691.086521219
544.nab_r/544.nab_r.test157.0759156.49370.372027756
521.wrf_r/521.wrf_r.test100.7203101.583-0.849256273
510.parest_r/510.parest_r.test70.138970.5425-0.572137364
503.bwaves_r/503.bwaves_r.test80.338881.0353-0.85950197
549.fotonik3d_r/549.fotonik3d_r.test66.38863.75394.131668808
538.imagick_r/538.imagick_r.test53.587453.56160.048168837
527.cam4_r/527.cam4_r.test61.103661.2236-0.196002849
508.namd_r/508.namd_r.test40.367440.622-0.626753976
519.lbm_r/519.lbm_r.test37.72637.8211-0.251446944
507.cactuBSSN_r/507.cactuBSSN_r.test35.40435.35940.126133362
511.povray_r/511.povray_r.test5.91465.9722-0.964468705

CINT2017rate

itemsbase(s)D124612affect(%)
520.omnetpp_r/520.omnetpp_r.test90.229789.79050.489138606
541.leela_r/541.leela_r.test89.246289.6644-0.466405842
505.mcf_r/505.mcf_r.test84.713183.74551.155405365
531.deepsjeng_r/531.deepsjeng_r.test64.942465.4395-0.759632943
523.xalancbmk_r/523.xalancbmk_r.test63.873463.48120.617820709
502.gcc_r/502.gcc_r.test57.000557.1864-0.325077291
557.xz_r/557.xz_r.test43.828943.9124-0.190151301
548.exchange2_r/548.exchange2_r.test40.91640.36181.373080487
500.perlbench_r/500.perlbench_r.test27.704727.802-0.349974822
525.x264_r/525.x264_r.test22.103222.199-0.431550971
TiehuZhang added a comment.EditedMay 24 2022, 9:39 PM

Did you do any performance measurements to get an impression of the performance impact of this change?

Hi, @sdesmalen, thanks for the comment!
I have post statistics in the comment, do you think the data can be used to assess the impact of the patch?

Thanks for getting the numbers.

llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll
12–17

Do you know what is going on in this case?

Thanks for getting the numbers.

Thanks for the comment, @dmgreen !
This patch may affect the widening decision (It actually affects ScalarizationCost) in setCostBasedWideningDecision for these loads . NEON cannot process vectorized addresses and requires exectelement support. Therefore, ScalarizationCost will add the overhead of this instruction (3 x InterleaveGroupSize). After the patch, ScalarizationCost excludes this overhead, and the cost becomes smaller.

Before the patch:

ScalarizationCost: {Value = 20, State = llvm::InstructionCost::Valid}
InterleaveCost: {Value = 17, State = llvm::InstructionCost::Valid}
Final Decision and Cost: CM_Interleave, 17

After the patch

ScalarizationCost: {Value = 14, State = llvm::InstructionCost::Valid}
InterleaveCost: {Value = 17, State = llvm::InstructionCost::Valid}
Final Decision and Cost: CM_Scalarize, 14
llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll
12–17

Do you know what is going on in this case?

Thanks. If you can update the test case, then this looks sensible to me.

llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll
12–17

OK I see what is going on - the values %tmp1 and %tmp3 are never used, the test not very meaningful in that regard. The vector body being empty isn't an issue in that case. It's a bit of a funny test, but I agree with you that the things it is testing are OK.

Can you change the test to this, to be more "glued together":

; REQUIRES: asserts
; RUN: opt < %s -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -S --debug-only=loop-vectorize 2>&1 | FileCheck %s

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"

%pair = type { i8, i8 }

; CHECK-LABEL: test
; CHECK: Found an estimated cost of 14 for VF 2 For instruction:   {{.*}} load i8
; CHECK: Found an estimated cost of 0 for VF 2 For instruction:   {{.*}} load i8
; CHECK-LABEL: entry:
; CHECK-LABEL: vector.body:
; CHECK: [[LOAD1:%.*]] = load i8
; CHECK: [[LOAD2:%.*]] = load i8
; CHECK: [[INSERT:%.*]] = insertelement <2 x i8> poison, i8 [[LOAD1]], i32 0
; CHECK: insertelement <2 x i8> [[INSERT]], i8 [[LOAD2]], i32 1
; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body

define void @test(%pair* %p, i8* %q, i64 %n) {
entry:
  br label %for.body

for.body:
  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
  %tmp0 = getelementptr %pair, %pair* %p, i64 %i, i32 0
  %tmp1 = load i8, i8* %tmp0, align 1
  %tmp2 = getelementptr %pair, %pair* %p, i64 %i, i32 1
  %tmp3 = load i8, i8* %tmp2, align 1
  %add = add i8 %tmp1, %tmp3
  %qi = getelementptr i8, i8* %q, i64 %i
  store i8 %add, i8* %qi, align 1
  %i.next = add nuw nsw i64 %i, 1
  %cond = icmp eq i64 %i.next, %n
  br i1 %cond, label %for.end, label %for.body

for.end:
  ret void
}
This comment was removed by TiehuZhang.

Thanks. If you can update the test case, then this looks sensible to me.

Thanks, @dmgreen! "interleaved-vs-scalar.ll" has been updated, is there any problem with the other test case?

dmgreen accepted this revision.Tue, Jun 14, 5:23 AM

Thanks, LGTM

This revision is now accepted and ready to land.Tue, Jun 14, 5:23 AM
This revision was landed with ongoing or failed builds.Fri, Jun 17, 3:37 AM
This revision was automatically updated to reflect the committed changes.