This is an archive of the discontinued LLVM Phabricator instance.

Differential D124612

[AArch64][LV] AArch64 does not prefer vectorized addressing
Closed, Public

Authored by TiehuZhang on Apr 28 2022, 6:38 AM.

Details

Reviewers

sdesmalen
fhahn
dmgreen
mdchen

Commits

rGb329156f4f14: [AArch64][LV] AArch64 does not prefer vectorized addressing

Summary

TTI::prefersVectorizedAddressing() controls whether the vectorizer tries to vectorize the addresses that lead to loads. On AArch64, only gather/scatter (supported by SVE) can handle vectors of addresses. This patch specializes the hook for AArch64 so that it returns true only when SVE is enabled.
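
As a rough illustration of the behaviour described in the summary, the hook can be modelled outside of LLVM as below. This is only a sketch: `MockSubtarget` and `MockAArch64TTI` are hypothetical stand-ins for the real subtarget and TTI classes, and only the SVE query is modelled.

```cpp
#include <cassert>

// Hypothetical stand-in for the AArch64 subtarget; only the SVE feature
// query used by the patch is modelled here.
struct MockSubtarget {
  bool HasSVE;
  bool hasSVE() const { return HasSVE; }
};

// Hypothetical stand-in for AArch64TTIImpl. The body of the hook mirrors the
// one-line implementation shown in the diff: prefer vectorized addressing
// only when SVE (and therefore gather/scatter) is available.
struct MockAArch64TTI {
  const MockSubtarget *ST;
  bool prefersVectorizedAddressing() const { return ST->hasSVE(); }
};
```

With a NEON-only mock subtarget the hook returns false, so addresses would be kept scalar; with SVE enabled it returns true.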

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

TiehuZhang created this revision. Apr 28 2022, 6:38 AM

Herald added a project: Restricted Project. Apr 28 2022, 6:38 AM

Herald added subscribers: ctetreau, hiraditya, kristof.beyls.

TiehuZhang requested review of this revision. Apr 28 2022, 6:38 AM

Herald added a project: Restricted Project. Apr 28 2022, 6:38 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B161783: Diff 425763. Apr 28 2022, 7:52 AM

TiehuZhang added a reviewer: mdchen. Apr 29 2022, 2:12 AM

sdesmalen added a comment.

Did you do any performance measurements to get an impression of the performance impact of this change?

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h
151	Intuitively I would think that `false` would be a more sensible default anyway. That wouldn't make much difference to this patch, because we still want to distinguish SVE and NEON.
llvm/test/Transforms/LoopVectorize/AArch64/gather-do-not-vectorize-addressing.ll
63	For a scalable VF there will be no difference in practice, because it won't try to scalarise the addresses. If you want to test the difference between SVE and NEON, you'll need to force the VF using `-force-vector-width=2` for both RUN lines.

Matt added a subscriber: Matt. May 1 2022, 10:15 AM

dmgreen added a comment.

Yes - do you have benchmarking results for this patch? This option makes sense, but I'm not sure that what it's doing is always optimal. There's something going on with how it alters interleaving group costs that doesn't look like it should be related to vector addresses. One such case was cleaned up (or maybe hidden) by D124786, but more problems might be present.

llvm/test/Transforms/LoopVectorize/AArch64/gather-do-not-vectorize-addressing.ll
63	It might be OK in this case, but in general just _having_ the SVE architecture feature ideally shouldn't make fixed-length NEON vectorization worse. I guess with something that needs a gather, we would always expect it to use VLA vectorization, so have the gather instruction? In that case here it sounds reasonable to base it on the arch feature.
llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll
13	It is hard to see why this is now correct. The vector body looks pretty empty?

fhahn added inline comments. May 2 2022, 11:46 AM

llvm/test/Transforms/LoopVectorize/AArch64/gather-do-not-vectorize-addressing.ll
84	Does the body here need all those reductions, or could it be reduced a bit? It would probably also be good to precommit the test and have only the changes/improvements in the diff here.

dmgreen mentioned this in D124786: [AArch64] Add extra reverse costs. May 5 2022, 1:54 AM

In D124612#3486176, @dmgreen wrote:

Yes - do you have benchmarking results for this patch? This option makes sense, but I'm not sure what it's doing is always optimal. There's something going on with how it alters interleaving group costs, that doesn't look like it should be related to vector addresses. One such case was cleaned up (or maybe hidden) by D124786, but more problems might be present.

Hi, @dmgreen, thanks for the comment!
In fact, this patch is an optimization for 505.lbm_t in SPEChpc.

SPEChpc

Benchmark	Base (s)	D124612 (s)	Improvement (%)
505.lbm_t	792	673	17.7

I also tested SPEC CPU 2017 using llvm-test-suite (main), and the results don't show much impact overall. Do you think these statistics are enough to assess the impact of this patch?

CFP2017rate

Benchmark	Base (s)	D124612 (s)	Improvement (%)
554.roms_r/554.roms_r.test	289.4838	283.622	2.06676492
526.blender_r/526.blender_r.test	197.1358	195.0169	1.086521219
544.nab_r/544.nab_r.test	157.0759	156.4937	0.372027756
521.wrf_r/521.wrf_r.test	100.7203	101.583	-0.849256273
510.parest_r/510.parest_r.test	70.1389	70.5425	-0.572137364
503.bwaves_r/503.bwaves_r.test	80.3388	81.0353	-0.85950197
549.fotonik3d_r/549.fotonik3d_r.test	66.388	63.7539	4.131668808
538.imagick_r/538.imagick_r.test	53.5874	53.5616	0.048168837
527.cam4_r/527.cam4_r.test	61.1036	61.2236	-0.196002849
508.namd_r/508.namd_r.test	40.3674	40.622	-0.626753976
519.lbm_r/519.lbm_r.test	37.726	37.8211	-0.251446944
507.cactuBSSN_r/507.cactuBSSN_r.test	35.404	35.3594	0.126133362
511.povray_r/511.povray_r.test	5.9146	5.9722	-0.964468705

CINT2017rate

Benchmark	Base (s)	D124612 (s)	Improvement (%)
520.omnetpp_r/520.omnetpp_r.test	90.2297	89.7905	0.489138606
541.leela_r/541.leela_r.test	89.2462	89.6644	-0.466405842
505.mcf_r/505.mcf_r.test	84.7131	83.7455	1.155405365
531.deepsjeng_r/531.deepsjeng_r.test	64.9424	65.4395	-0.759632943
523.xalancbmk_r/523.xalancbmk_r.test	63.8734	63.4812	0.617820709
502.gcc_r/502.gcc_r.test	57.0005	57.1864	-0.325077291
557.xz_r/557.xz_r.test	43.8289	43.9124	-0.190151301
548.exchange2_r/548.exchange2_r.test	40.916	40.3618	1.373080487
500.perlbench_r/500.perlbench_r.test	27.7047	27.802	-0.349974822
525.x264_r/525.x264_r.test	22.1032	22.199	-0.431550971
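
The last column of the tables above is not defined in the comment, but the posted numbers appear consistent with (base - patched) / patched x 100, so a positive value means the patched build ran faster. A minimal sketch of that calculation follows; the function name `improvementPercent` is hypothetical and the formula is an inference from the posted data, not something the author stated.

```cpp
#include <cassert>
#include <cmath>

// Relative change of the patched run time against the base run time, in
// percent. Positive values mean the patched build took less time.
// Assumption: this matches how the last table column was derived.
double improvementPercent(double BaseSeconds, double PatchedSeconds) {
  return (BaseSeconds - PatchedSeconds) / PatchedSeconds * 100.0;
}
```

For 505.lbm_t, `improvementPercent(792, 673)` gives roughly 17.7, and for 521.wrf_r the result is slightly negative, matching the sign convention in the tables.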

In D124612#3482219, @sdesmalen wrote:

Did you do any performance measurements to get an impression of the performance impact of this change?

Hi, @sdesmalen, thanks for the comment!
I have posted the statistics in the comment above. Do you think the data can be used to assess the impact of the patch?

dmgreen added a comment.

Thanks for getting the numbers.

llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll
13	Do you know what is going on in this case?

In D124612#3536665, @dmgreen wrote:

Thanks for getting the numbers.

Thanks for the comment, @dmgreen!
This patch may affect the widening decision for these loads in setCostBasedWideningDecision (specifically, it affects ScalarizationCost). NEON cannot process vectorized addresses and requires extractelement instructions to get scalar addresses out of the vector. Before the patch, ScalarizationCost therefore includes the overhead of these instructions (3 x InterleaveGroupSize). After the patch, ScalarizationCost excludes this overhead, so the cost becomes smaller.

Before the patch:

ScalarizationCost: {Value = 20, State = llvm::InstructionCost::Valid}
InterleaveCost: {Value = 17, State = llvm::InstructionCost::Valid}
Final Decision and Cost: CM_Interleave, 17

After the patch:

ScalarizationCost: {Value = 14, State = llvm::InstructionCost::Valid}
InterleaveCost: {Value = 17, State = llvm::InstructionCost::Valid}
Final Decision and Cost: CM_Scalarize, 14
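
The comparison behind these two outcomes can be sketched as follows. This is an illustrative model only, not LLVM's implementation: `Choice`, `chooseWidening`, and `addressExtractOverhead` are hypothetical names, the tie-breaking rule is an assumption, and the interleave group size of 2 is assumed from the two-member %pair type used in interleaved-vs-scalar.ll.

```cpp
#include <cassert>

enum class WideningDecision { Interleave, Scalarize };

struct Choice {
  WideningDecision Decision;
  int Cost;
};

// Pick the cheaper of the two options. Assumption: ties go to interleaving;
// the numbers in the comment above never exercise a tie.
Choice chooseWidening(int InterleaveCost, int ScalarizationCost) {
  if (InterleaveCost <= ScalarizationCost)
    return {WideningDecision::Interleave, InterleaveCost};
  return {WideningDecision::Scalarize, ScalarizationCost};
}

// Overhead attributed to extracting scalar addresses, using the
// "3 x InterleaveGroupSize" figure quoted in the comment above.
int addressExtractOverhead(int InterleaveGroupSize) {
  return 3 * InterleaveGroupSize;
}
```

With a group size of 2 the overhead is 6, which matches the drop in ScalarizationCost from 20 to 14 and explains why the decision flips from CM_Interleave (17) to CM_Scalarize (14).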

llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll
13	Do you know what is going on in this case?

dmgreen added a comment.

Thanks. If you can update the test case, then this looks sensible to me.

llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll

OK, I see what is going on: the values %tmp1 and %tmp3 are never used, so the test is not very meaningful in that regard. The vector body being empty isn't an issue in that case. It's a bit of a funny test, but I agree with you that the things it is testing are OK.

Can you change the test to this, to be more "glued together":

; REQUIRES: asserts
; RUN: opt < %s -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -S --debug-only=loop-vectorize 2>&1 | FileCheck %s

target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64--linux-gnu"

%pair = type { i8, i8 }

; CHECK-LABEL: test
; CHECK: Found an estimated cost of 14 for VF 2 For instruction:   {{.*}} load i8
; CHECK: Found an estimated cost of 0 for VF 2 For instruction:   {{.*}} load i8
; CHECK-LABEL: entry:
; CHECK-LABEL: vector.body:
; CHECK: [[LOAD1:%.*]] = load i8
; CHECK: [[LOAD2:%.*]] = load i8
; CHECK: [[INSERT:%.*]] = insertelement <2 x i8> poison, i8 [[LOAD1]], i32 0
; CHECK: insertelement <2 x i8> [[INSERT]], i8 [[LOAD2]], i32 1
; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body

define void @test(%pair* %p, i8* %q, i64 %n) {
entry:
  br label %for.body

for.body:
  %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
  %tmp0 = getelementptr %pair, %pair* %p, i64 %i, i32 0
  %tmp1 = load i8, i8* %tmp0, align 1
  %tmp2 = getelementptr %pair, %pair* %p, i64 %i, i32 1
  %tmp3 = load i8, i8* %tmp2, align 1
  %add = add i8 %tmp1, %tmp3
  %qi = getelementptr i8, i8* %q, i64 %i
  store i8 %add, i8* %qi, align 1
  %i.next = add nuw nsw i64 %i, 1
  %cond = icmp eq i64 %i.next, %n
  br i1 %cond, label %for.end, label %for.body

for.end:
  ret void
}

TiehuZhang updated this revision to Diff 436666. Jun 13 2022, 11:44 PM

TiehuZhang added a comment. Jun 13 2022, 11:49 PM

This comment was removed by TiehuZhang.

TiehuZhang updated this revision to Diff 436675. Jun 14 2022, 12:02 AM

In D124612#3559695, @dmgreen wrote:

Thanks. If you can update the test case, then this looks sensible to me.

Thanks, @dmgreen! "interleaved-vs-scalar.ll" has been updated. Is there any problem with the other test case?

Harbormaster completed remote builds in B169639: Diff 436675. Jun 14 2022, 1:17 AM

Thanks, LGTM

This revision is now accepted and ready to land. Jun 14 2022, 5:23 AM

This revision was landed with ongoing or failed builds. Jun 17 2022, 3:37 AM

Closed by commit rGb329156f4f14: [AArch64][LV] AArch64 does not prefer vectorized addressing (authored by TiehuZhang, committed by Allen).

This revision was automatically updated to reflect the committed changes.

Allen added a commit: rGb329156f4f14: [AArch64][LV] AArch64 does not prefer vectorized addressing.

Allen added a subscriber: Allen. Nov 24 2022, 3:53 AM

Herald added a subscriber: pcwang-thead. Nov 24 2022, 3:53 AM

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h	2 lines
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp	4 lines
llvm/test/Transforms/LoopVectorize/AArch64/gather-do-not-vectorize-addressing.ll	113 lines
llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll	4 lines

Diff 425763

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.h

[142 lines not shown] unsigned getMaxNumElements(ElementCount VF) const {
      if (!VF.isScalable())
        return VF.getFixedValue();

      return VF.getKnownMinValue() * ST->getVScaleForTuning();
    }

    unsigned getMaxInterleaveFactor(unsigned VF);

+   bool prefersVectorizedAddressing() const;

    InstructionCost getMaskedMemoryOpCost(unsigned Opcode, Type *Src,
                                          Align Alignment, unsigned AddressSpace,
                                          TTI::TargetCostKind CostKind);

    InstructionCost getGatherScatterOpCost(unsigned Opcode, Type *DataTy,
                                           const Value *Ptr, bool VariableMask,
                                           Align Alignment,
                                           TTI::TargetCostKind CostKind,
[199 lines not shown]

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

[1,974 lines not shown] AArch64TTIImpl::enableMemCmpExpansion(bool OptSize, bool IsZeroCmp) const {
    Options.NumLoadsPerBlock = Options.MaxNumLoads;
    // TODO: Though vector loads usually perform well on AArch64, in some targets
    // they may wake up the FP unit, which raises the power consumption. Perhaps
    // they could be used with no holds barred (-O3).
    Options.LoadSizes = {8, 4, 2, 1};
    return Options;
  }

+ bool AArch64TTIImpl::prefersVectorizedAddressing() const {
+   return ST->hasSVE();
+ }

  InstructionCost
  AArch64TTIImpl::getMaskedMemoryOpCost(unsigned Opcode, Type *Src,
                                        Align Alignment, unsigned AddressSpace,
                                        TTI::TargetCostKind CostKind) {
    if (useNeonVector(Src))
      return BaseT::getMaskedMemoryOpCost(Opcode, Src, Alignment, AddressSpace,
                                          CostKind);
    auto LT = TLI->getTypeLegalizationCost(DL, Src);
[738 lines not shown]

llvm/test/Transforms/LoopVectorize/AArch64/gather-do-not-vectorize-addressing.ll

This file was added.

; NOTE: Assertions have been autogenerated by utils/update_test_checks.py
; RUN: opt < %s -loop-vectorize -mtriple=aarch64--linux-gnu -mattr=+neon -force-vector-interleave=1 -S -o - | FileCheck %s
; RUN: opt < %s -loop-vectorize -mtriple=aarch64--linux-gnu -mattr=+sve -force-vector-interleave=1 -S -o - | FileCheck --check-prefix=SVE %s

%struct.stu = type { [128 x double], [128 x double], [128 x double], [128 x double] }

define dso_local double @test(double* nocapture readonly %data, i32* nocapture readonly %offset, %struct.stu* nocapture readonly %param) local_unnamed_addr {
; CHECK-LABEL: @test(
; CHECK-NEXT: entry:
; CHECK-NEXT: br i1 false, label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; CHECK: vector.ph:
; CHECK-NEXT: br label [[VECTOR_BODY:%.*]]
; CHECK: vector.body:
; CHECK-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI:%.*]] = phi <2 x double> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP33:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI1:%.*]] = phi <2 x double> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP28:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI2:%.*]] = phi <2 x double> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP23:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[VEC_PHI3:%.*]] = phi <2 x double> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP18:%.*]], [[VECTOR_BODY]] ]
; CHECK-NEXT: [[TMP0:%.*]] = add i64 [[INDEX]], 0
; CHECK-NEXT: [[TMP1:%.*]] = add i64 [[INDEX]], 1
; CHECK-NEXT: [[TMP2:%.*]] = getelementptr inbounds i32, i32* [[OFFSET:%.*]], i64 [[TMP0]]
; CHECK-NEXT: [[TMP3:%.*]] = getelementptr inbounds i32, i32* [[OFFSET]], i64 [[TMP1]]
; CHECK-NEXT: [[TMP4:%.*]] = load i32, i32* [[TMP2]], align 4
; CHECK-NEXT: [[TMP5:%.*]] = load i32, i32* [[TMP3]], align 4
; CHECK-NEXT: [[TMP6:%.*]] = sext i32 [[TMP4]] to i64
; CHECK-NEXT: [[TMP7:%.*]] = sext i32 [[TMP5]] to i64
; CHECK-NEXT: [[TMP8:%.*]] = getelementptr inbounds double, double* [[DATA:%.*]], i64 [[TMP6]]
; CHECK-NEXT: [[TMP9:%.*]] = getelementptr inbounds double, double* [[DATA]], i64 [[TMP7]]
; CHECK-NEXT: [[TMP10:%.*]] = load double, double* [[TMP8]], align 8
; CHECK-NEXT: [[TMP11:%.*]] = load double, double* [[TMP9]], align 8
; CHECK-NEXT: [[TMP12:%.*]] = insertelement <2 x double> poison, double [[TMP10]], i32 0
; CHECK-NEXT: [[TMP13:%.*]] = insertelement <2 x double> [[TMP12]], double [[TMP11]], i32 1
; CHECK-NEXT: [[TMP14:%.*]] = getelementptr inbounds [[STRUCT_STU:%.*]], %struct.stu* [[PARAM:%.*]], i64 0, i32 0, i64 [[TMP0]]
; CHECK-NEXT: [[TMP15:%.*]] = getelementptr inbounds double, double* [[TMP14]], i32 0
; CHECK-NEXT: [[TMP16:%.*]] = bitcast double* [[TMP15]] to <2 x double>*
; CHECK-NEXT: [[WIDE_LOAD:%.*]] = load <2 x double>, <2 x double>* [[TMP16]], align 8
;
; SVE-LABEL: @test(
; SVE-NEXT: entry:
; SVE-NEXT: [[TMP0:%.*]] = call i64 @llvm.vscale.i64()
; SVE-NEXT: [[TMP1:%.*]] = mul i64 [[TMP0]], 2
; SVE-NEXT: [[MIN_ITERS_CHECK:%.*]] = icmp ult i64 128, [[TMP1]]
; SVE-NEXT: br i1 [[MIN_ITERS_CHECK]], label [[SCALAR_PH:%.*]], label [[VECTOR_PH:%.*]]
; SVE: vector.ph:
; SVE-NEXT: [[TMP2:%.*]] = call i64 @llvm.vscale.i64()
; SVE-NEXT: [[TMP3:%.*]] = mul i64 [[TMP2]], 2
; SVE-NEXT: [[N_MOD_VF:%.*]] = urem i64 128, [[TMP3]]
; SVE-NEXT: [[N_VEC:%.*]] = sub i64 128, [[N_MOD_VF]]
; SVE-NEXT: br label [[VECTOR_BODY:%.*]]
; SVE: vector.body:
; SVE-NEXT: [[INDEX:%.*]] = phi i64 [ 0, [[VECTOR_PH]] ], [ [[INDEX_NEXT:%.*]], [[VECTOR_BODY]] ]
; SVE-NEXT: [[VEC_PHI:%.*]] = phi <vscale x 2 x double> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP29:%.*]], [[VECTOR_BODY]] ]
; SVE-NEXT: [[VEC_PHI1:%.*]] = phi <vscale x 2 x double> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP24:%.*]], [[VECTOR_BODY]] ]
; SVE-NEXT: [[VEC_PHI2:%.*]] = phi <vscale x 2 x double> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP19:%.*]], [[VECTOR_BODY]] ]
; SVE-NEXT: [[VEC_PHI3:%.*]] = phi <vscale x 2 x double> [ zeroinitializer, [[VECTOR_PH]] ], [ [[TMP14:%.*]], [[VECTOR_BODY]] ]
; SVE-NEXT: [[TMP4:%.*]] = add i64 [[INDEX]], 0
; SVE-NEXT: [[TMP5:%.*]] = getelementptr inbounds i32, i32* [[OFFSET:%.*]], i64 [[TMP4]]
; SVE-NEXT: [[TMP6:%.*]] = getelementptr inbounds i32, i32* [[TMP5]], i32 0
; SVE-NEXT: [[TMP7:%.*]] = bitcast i32* [[TMP6]] to <vscale x 2 x i32>*
; SVE-NEXT: [[WIDE_LOAD:%.*]] = load <vscale x 2 x i32>, <vscale x 2 x i32>* [[TMP7]], align 4
; SVE-NEXT: [[TMP8:%.*]] = sext <vscale x 2 x i32> [[WIDE_LOAD]] to <vscale x 2 x i64>
; SVE-NEXT: [[TMP9:%.*]] = getelementptr inbounds double, double* [[DATA:%.*]], <vscale x 2 x i64> [[TMP8]]
; SVE-NEXT: [[WIDE_MASKED_GATHER:%.*]] = call <vscale x 2 x double> @llvm.masked.gather.nxv2f64.nxv2p0f64(<vscale x 2 x double*> [[TMP9]], i32 8, <vscale x 2 x i1> shufflevector (<vscale x 2 x i1> insertelement (<vscale x 2 x i1> poison, i1 true, i32 0), <vscale x 2 x i1> poison, <vscale x 2 x i32> zeroinitializer), <vscale x 2 x double> undef)
; SVE-NEXT: [[TMP10:%.*]] = getelementptr inbounds [[STRUCT_STU:%.*]], %struct.stu* [[PARAM:%.*]], i64 0, i32 0, i64 [[TMP4]]
; SVE-NEXT: [[TMP11:%.*]] = getelementptr inbounds double, double* [[TMP10]], i32 0
; SVE-NEXT: [[TMP12:%.*]] = bitcast double* [[TMP11]] to <vscale x 2 x double>*
; SVE-NEXT: [[WIDE_LOAD4:%.*]] = load <vscale x 2 x double>, <vscale x 2 x double>* [[TMP12]], align 8
;
entry:
  br label %for.body

for.cond.cleanup: ; preds = %for.body
  %add.lcssa = phi double [ %add, %for.body ]
  %add8.lcssa = phi double [ %add8, %for.body ]
  %add12.lcssa = phi double [ %add12, %for.body ]
  %add16.lcssa = phi double [ %add16, %for.body ]
  %add17 = fadd fast double %add8.lcssa, %add.lcssa
  %add18 = fadd fast double %add17, %add12.lcssa
  %add19 = fadd fast double %add18, %add16.lcssa
  ret double %add19

for.body: ; preds = %entry, %for.body
  %indvars.iv = phi i64 [ 0, %entry ], [ %indvars.iv.next, %for.body ]
  %val4.046 = phi double [ 0.000000e+00, %entry ], [ %add16, %for.body ]
  %val3.045 = phi double [ 0.000000e+00, %entry ], [ %add12, %for.body ]
  %val2.044 = phi double [ 0.000000e+00, %entry ], [ %add8, %for.body ]
  %val1.043 = phi double [ 0.000000e+00, %entry ], [ %add, %for.body ]
  %arrayidx = getelementptr inbounds i32, i32* %offset, i64 %indvars.iv
  %0 = load i32, i32* %arrayidx, align 4
  %idxprom1 = sext i32 %0 to i64
  %arrayidx2 = getelementptr inbounds double, double* %data, i64 %idxprom1
  %1 = load double, double* %arrayidx2, align 8
  %arrayidx4 = getelementptr inbounds %struct.stu, %struct.stu* %param, i64 0, i32 0, i64 %indvars.iv
  %2 = load double, double* %arrayidx4, align 8
  %mul = fmul fast double %2, %1
  %add = fadd fast double %mul, %val1.043
  %arrayidx6 = getelementptr inbounds %struct.stu, %struct.stu* %param, i64 0, i32 1, i64 %indvars.iv
  %3 = load double, double* %arrayidx6, align 8
  %mul7 = fmul fast double %3, %1
  %add8 = fadd fast double %mul7, %val2.044
  %arrayidx10 = getelementptr inbounds %struct.stu, %struct.stu* %param, i64 0, i32 2, i64 %indvars.iv
  %4 = load double, double* %arrayidx10, align 8
  %mul11 = fmul fast double %4, %1
  %add12 = fadd fast double %mul11, %val3.045
  %arrayidx14 = getelementptr inbounds %struct.stu, %struct.stu* %param, i64 0, i32 3, i64 %indvars.iv
  %5 = load double, double* %arrayidx14, align 8
  %mul15 = fmul fast double %5, %1
  %add16 = fadd fast double %mul15, %val4.046
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond.not = icmp eq i64 %indvars.iv.next, 128
  br i1 %exitcond.not, label %for.cond.cleanup, label %for.body
}

llvm/test/Transforms/LoopVectorize/AArch64/interleaved-vs-scalar.ll

  ; REQUIRES: asserts
  ; RUN: opt < %s -force-vector-width=2 -force-vector-interleave=1 -loop-vectorize -S --debug-only=loop-vectorize 2>&1 | FileCheck %s

  target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
  target triple = "aarch64--linux-gnu"

  %pair = type { i8, i8 }

  ; CHECK-LABEL: test
- ; CHECK: Found an estimated cost of 17 for VF 2 For instruction: {{.*}} load i8
+ ; CHECK: Found an estimated cost of 14 for VF 2 For instruction: {{.*}} load i8
  ; CHECK: Found an estimated cost of 0 for VF 2 For instruction: {{.*}} load i8
  ; CHECK: vector.body
- ; CHECK: load <4 x i8>
+ ; CHECK: load i8
  ; CHECK: br i1 {{.*}}, label %middle.block, label %vector.body

  define void @test(%pair* %p, i64 %n) {
  entry:
    br label %for.body

  for.body:
    %i = phi i64 [ 0, %entry ], [ %i.next, %for.body ]
[12 lines not shown]
