Details

Reviewers

bsmith
efriedma
paulwalker-arm
peterwaller-arm

Summary

Reduce the cost of VLS loads/stores to make the vectorizer emit them more frequently.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

MattDevereau created this revision. Aug 6 2021, 8:19 AM

Herald added a reviewer: efriedma. Aug 6 2021, 8:19 AM

Herald added subscribers: psnobl, hiraditya, kristof.beyls, tschuett.

MattDevereau requested review of this revision. Aug 6 2021, 8:19 AM

Herald added a project: Restricted Project. Aug 6 2021, 8:19 AM

Herald added a subscriber: llvm-commits.

updated the diff to show more context

bsmith added reviewers: paulwalker-arm, peterwaller-arm. Aug 6 2021, 8:48 AM

bsmith added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1519	What's the rationale behind this cost for fixed types?
llvm/test/Analysis/CostModel/AArch64/masked_ldst.ll
2 ↗	(On Diff #364808)	This test is for scalable types, hence the changes in here are testing the cost of scalable masked loads/stores rather than fixed ones. You'll need to add functions (probably in another test file) that use the vscale_range attribute to specify various different fixed vector lengths.

Matt added a subscriber: Matt. Aug 6 2021, 8:59 AM

david-arm added a subscriber: david-arm. Aug 6 2021, 9:02 AM

david-arm added inline comments.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1505	I don't think we should be removing this because it's needed when we don't know the SVE vector length, in which case we will have to scalarise the masked ops. I think you can add an extra check here to see if we're using SVE for fixed width vectors.
1519	Is this due to the extra predicate we have to create? If so, that's also true for SVE since we'll need a ptrue and it's highly likely to get reused anyway or hoisted out of a loop.
llvm/test/Analysis/CostModel/AArch64/masked_ldst.ll
6 ↗	(On Diff #364808)	Again, I think these changes look wrong - they should be high because we are going to scalarise the operations, since we haven't specified the SVE vector length so we can't use SVE or NEON.

MattDevereau added inline comments. Aug 6 2021, 9:08 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1519	A cost of 3*vscale was suggested as a suitable cost for VLS compared to the current VLA cost of 2*vscale. Rather than edit `TLI->getTypeLegalizationCost(DL, Src)` directly, this seemed like the simplest way of achieving it.
llvm/test/Analysis/CostModel/AArch64/masked_ldst.ll
2 ↗	(On Diff #364808)	What is the difference between this test called "fixed" and the test called "scalable" further down then?

paulwalker-arm added inline comments. Aug 6 2021, 9:11 AM

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1505	Perhaps just add something like `&& !ST->useSVEForFixedLengthVectors()`
llvm/test/Analysis/CostModel/AArch64/masked_ldst.ll
2 ↗	(On Diff #364808)	@bsmith Can this not be done using multiple RUN lines instead? I'm hoping that a single set of CHECK lines can be used by utilising FileCheck's math functions.

Harbormaster completed remote builds in B118391: Diff 364808. Aug 6 2021, 9:19 AM

bsmith added inline comments. Aug 6 2021, 9:41 AM

llvm/test/Analysis/CostModel/AArch64/masked_ldst.ll
2 ↗	(On Diff #364808)	That's probably a better approach yes, to avoid duplicating things.
2 ↗	(On Diff #364808)	The 'fixed' test is testing the fixed LLVM IR types when using scalable codegen (i.e. fixed types are treated as Neon types, not SVE), the 'scalable' test is testing scalable types using scalable codegen. You need a test that checks fixed types using fixed codegen (of various sizes). That is to say, as per Paul's approach, you need some additional run lines that specify `-aarch64-sve-vector-bits-min=<value>`, which result in different costs for the masked load/stores depending on the value of `<value>`.

junparser added a subscriber: junparser. Aug 10 2021, 2:34 AM

Changed the cost model to keep the scalarised NEON costs for 128-bit vectors but use the SVE costs for larger VLS sizes. Added a new regression test that asserts the cost-model estimates for each VLS width.
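
As a rough illustration of the behaviour described in this update, the following standalone C++ sketch models the decision outside of LLVM. The `MaskedCost` struct and the `modelMaskedFixedCost` function are hypothetical names, not LLVM APIs. The 256-bit threshold and the power-of-two vector length are assumptions inferred from the description above and from the RUN lines of the regression test. The factor of 2 per legalised part mirrors the `LT.first * 2` return in the patch.

```cpp
// Hypothetical result type: either defer to the generic (scalarising) base
// cost, or report a modelled SVE cost.
struct MaskedCost {
  bool Scalarised; // true: fall back to the scalarised NEON/base cost
  unsigned Cost;   // meaningful only when Scalarised is false
};

// Models the cost of a masked load/store of a fixed-length vector.
//   TypeBits:   total width of the fixed-length vector type in bits
//   MinSVEBits: guaranteed minimum SVE register width in bits, assumed to be
//               a power of two; 0 means the vector length is unknown
constexpr MaskedCost modelMaskedFixedCost(unsigned TypeBits,
                                          unsigned MinSVEBits) {
  // Assumption: SVE is only used for fixed-length vectors when the minimum
  // vector length is known to be at least 256 bits.
  if (MinSVEBits < 256)
    return {true, 0};

  // Number of SVE registers needed to hold the type (rounded up).
  unsigned Parts = (TypeBits + MinSVEBits - 1) / MinSVEBits;

  // Two units of cost per legalised part, as in `LT.first * 2`.
  return {false, Parts * 2};
}
```

Under these assumptions, a 512-bit fixed vector is modelled at cost 4 with a 256-bit minimum vector length, and defers to the scalarised base cost when the minimum is 128 bits or unset.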

Harbormaster completed remote builds in B118865: Diff 365445. Aug 10 2021, 6:09 AM

paulwalker-arm added inline comments. Aug 16 2021, 3:36 AM

llvm/test/Analysis/CostModel/AArch64/masked_ldst_vls.ll
2	This should be removed as I imagine it's not true and you just inherited it from a file you copied.
3	Rather than having `-mtriple=aarch64-linux-gnu -mattr=+sve` on every `RUN` line, can you use LLVM IR directly? For example:
	target triple = "aarch64-unknown-linux-gnu"
	define void @fixed-sve-vls() #0 { .... }
	attributes #0 = { "target-features"="+sve" }
It just means you need to know less of the details when running the test manually.
30	Storing vectors of i1 is a thorny issue so I think it best to not bother testing them just yet.

Removed i1 vector from regression test, added some LLVM IR syntax cleanup to regression test

Removed -mattr=sve from RUN lines in regression test

Harbormaster completed remote builds in B119695: Diff 366604. Aug 16 2021, 6:06 AM

peterwaller-arm added inline comments. Aug 17 2021, 5:35 AM

llvm/test/Analysis/CostModel/AArch64/masked_ldst_vls.ll
20	Minor nit: functions with dashes in the name are quite rare. I'm a little surprised it's allowed! Out of ~240k function definitions in the LLVM test suite, only 28 of them have dashes, and ~187k have underscores, so I'd go with the majority here.

renamed fixed-sve-vl to fixed_sve_vl, use useNeonVector() function instead of verbose if statement

Harbormaster completed remote builds in B120086: Diff 367157. Aug 18 2021, 4:07 AM

paulwalker-arm accepted this revision. Aug 18 2021, 4:51 AM

This revision is now accepted and ready to land. Aug 18 2021, 4:51 AM

Submitted in 734708e04f84b72f1ae7c8b35c002b8bf97dc064.

Diff 367157

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp

[...]
 AArch64TTIImpl::enableMemCmpExpansion(bool OptSize, bool IsZeroCmp) const {
[...]
   Options.LoadSizes = {8, 4, 2, 1};
   return Options;
 }

 InstructionCost
 AArch64TTIImpl::getMaskedMemoryOpCost(unsigned Opcode, Type *Src,
                                       Align Alignment, unsigned AddressSpace,
                                       TTI::TargetCostKind CostKind) {
-  if (!isa<ScalableVectorType>(Src))
+  if (useNeonVector(Src))
     return BaseT::getMaskedMemoryOpCost(Opcode, Src, Alignment, AddressSpace,
                                         CostKind);
   auto LT = TLI->getTypeLegalizationCost(DL, Src);
   if (!LT.first.isValid())
     return InstructionCost::getInvalid();

   // The code-generator is currently not able to handle scalable vectors
   // of <vscale x 1 x eltty> yet, so return an invalid cost to avoid selecting
   // it. This change will be removed when code-generation for these types is
   // sufficiently reliable.
   if (cast<VectorType>(Src)->getElementCount() == ElementCount::getScalable(1))
     return InstructionCost::getInvalid();

   return LT.first * 2;
 }

 InstructionCost AArch64TTIImpl::getGatherScatterOpCost(
     unsigned Opcode, Type *DataTy, const Value *Ptr, bool VariableMask,
     Align Alignment, TTI::TargetCostKind CostKind, const Instruction *I) {

   if (!isa<ScalableVectorType>(DataTy))
     return BaseT::getGatherScatterOpCost(Opcode, DataTy, Ptr, VariableMask,
[...]

llvm/test/Analysis/CostModel/AArch64/masked_ldst_vls.ll

This file was added.

				; RUN: opt < %s -cost-model -analyze -aarch64-sve-vector-bits-min=256 | FileCheck %s -D#VBITS=256
				; RUN: opt < %s -cost-model -analyze -aarch64-sve-vector-bits-min=384 | FileCheck %s -D#VBITS=256
				; RUN: opt < %s -cost-model -analyze -aarch64-sve-vector-bits-min=512 | FileCheck %s -D#VBITS=512
				; RUN: opt < %s -cost-model -analyze -aarch64-sve-vector-bits-min=640 | FileCheck %s -D#VBITS=512
				; RUN: opt < %s -cost-model -analyze -aarch64-sve-vector-bits-min=768 | FileCheck %s -D#VBITS=512
				; RUN: opt < %s -cost-model -analyze -aarch64-sve-vector-bits-min=896 | FileCheck %s -D#VBITS=512
				; RUN: opt < %s -cost-model -analyze -aarch64-sve-vector-bits-min=1024 | FileCheck %s -D#VBITS=1024
				; RUN: opt < %s -cost-model -analyze -aarch64-sve-vector-bits-min=1152 | FileCheck %s -D#VBITS=1024
				; RUN: opt < %s -cost-model -analyze -aarch64-sve-vector-bits-min=1280 | FileCheck %s -D#VBITS=1024
				; RUN: opt < %s -cost-model -analyze -aarch64-sve-vector-bits-min=1408 | FileCheck %s -D#VBITS=1024
				; RUN: opt < %s -cost-model -analyze -aarch64-sve-vector-bits-min=1536 | FileCheck %s -D#VBITS=1024
				; RUN: opt < %s -cost-model -analyze -aarch64-sve-vector-bits-min=1664 | FileCheck %s -D#VBITS=1024
				; RUN: opt < %s -cost-model -analyze -aarch64-sve-vector-bits-min=1792 | FileCheck %s -D#VBITS=1024
				; RUN: opt < %s -cost-model -analyze -aarch64-sve-vector-bits-min=1920 | FileCheck %s -D#VBITS=1024
				; RUN: opt < %s -cost-model -analyze -aarch64-sve-vector-bits-min=2048 | FileCheck %s -D#VBITS=2048

				target triple = "aarch64-unknown-linux-gnu"

				define void @fixed_sve_vls() #0 {
				; CHECK-LABEL: 'fixed_sve_vls'
				; CHECK: Cost Model: Found an estimated cost of [[#mul(div(2047,VBITS)+1,2)]] for instruction: %v256i8 = call <256 x i8> @llvm.masked.load.v256i8.p0v256i8(<256 x i8>* undef, i32 8, <256 x i1> undef, <256 x i8> undef)
				; CHECK: Cost Model: Found an estimated cost of [[#mul(div(4091,VBITS)+1,2)]] for instruction: %v256i16 = call <256 x i16> @llvm.masked.load.v256i16.p0v256i16(<256 x i16>* undef, i32 8, <256 x i1> undef, <256 x i16> undef)
				; CHECK: Cost Model: Found an estimated cost of [[#mul(div(511,VBITS)+1,2)]] for instruction: %v16i32 = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>* undef, i32 8, <16 x i1> undef, <16 x i32> undef)
				; CHECK: Cost Model: Found an estimated cost of [[#mul(div(1023,VBITS)+1,2)]] for instruction: %v16i64 = call <16 x i64> @llvm.masked.load.v16i64.p0v16i64(<16 x i64>* undef, i32 8, <16 x i1> undef, <16 x i64> undef)
				; CHECK: Cost Model: Found an estimated cost of [[#mul(div(8191,VBITS)+1,2)]] for instruction: %v512f16 = call <512 x half> @llvm.masked.load.v512f16.p0v512f16(<512 x half>* undef, i32 8, <512 x i1> undef, <512 x half> undef)
				; CHECK: Cost Model: Found an estimated cost of [[#mul(div(8191,VBITS)+1,2)]] for instruction: %v256f32 = call <256 x float> @llvm.masked.load.v256f32.p0v256f32(<256 x float>* undef, i32 8, <256 x i1> undef, <256 x float> undef)
				; CHECK: Cost Model: Found an estimated cost of [[#mul(div(8191,VBITS)+1,2)]] for instruction: %v128f64 = call <128 x double> @llvm.masked.load.v128f64.p0v128f64(<128 x double>* undef, i32 8, <128 x i1> undef, <128 x double> undef)
				; CHECK: Cost Model: Found an estimated cost of 0 for instruction: ret void
				entry:
				%v256i8 = call <256 x i8> @llvm.masked.load.v256i8.p0v256i8(<256 x i8> *undef, i32 8, <256 x i1> undef, <256 x i8> undef)
				%v256i16 = call <256 x i16> @llvm.masked.load.v256i16.p0v256i16(<256 x i16> *undef, i32 8, <256 x i1> undef, <256 x i16> undef)
				%v16i32 = call <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32> *undef, i32 8, <16 x i1> undef, <16 x i32> undef)
				%v16i64 = call <16 x i64> @llvm.masked.load.v16i64.p0v16i64(<16 x i64> *undef, i32 8, <16 x i1> undef, <16 x i64> undef)

				%v512f16 = call <512 x half> @llvm.masked.load.v512f16.p0v512f16(<512 x half> *undef, i32 8, <512 x i1> undef, <512 x half> undef)
				%v256f32 = call <256 x float> @llvm.masked.load.v256f32.p0v256f32(<256 x float> *undef, i32 8, <256 x i1> undef, <256 x float> undef)
				%v128f64 = call <128 x double> @llvm.masked.load.v128f64.p0v128f64(<128 x double> *undef, i32 8, <128 x i1> undef, <128 x double> undef)

				ret void
				}

				declare <256 x i8> @llvm.masked.load.v256i8.p0v256i8(<256 x i8>*, i32, <256 x i1>, <256 x i8>)
				declare <256 x i16> @llvm.masked.load.v256i16.p0v256i16(<256 x i16>*, i32, <256 x i1>, <256 x i16>)
				declare <16 x i32> @llvm.masked.load.v16i32.p0v16i32(<16 x i32>*, i32, <16 x i1>, <16 x i32>)
				declare <16 x i64> @llvm.masked.load.v16i64.p0v16i64(<16 x i64>*, i32, <16 x i1>, <16 x i64>)

				declare <512 x half> @llvm.masked.load.v512f16.p0v512f16(<512 x half>*, i32, <512 x i1>, <512 x half>)
				declare <256 x float> @llvm.masked.load.v256f32.p0v256f32(<256 x float>*, i32, <256 x i1>, <256 x float>)
				declare <128 x double> @llvm.masked.load.v128f64.p0v128f64(<128 x double>*, i32, <128 x i1>, <128 x double>)

				attributes #0 = { "target-features"="+sve" }
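
The expected costs in the CHECK lines above can be reproduced by hand. The sketch below is a minimal standalone C++ model and is not part of the patch: `floorPowerOfTwo` and `expectedMaskedCost` are hypothetical helper names. It assumes, based only on the pairing of values in the RUN lines, that the `-aarch64-sve-vector-bits-min` value is rounded down to a power of two to obtain `VBITS`, and it evaluates the FileCheck expression `mul(div(TypeBits - 1, VBITS) + 1, 2)` using integer division.

```cpp
// Rounds V down to the nearest power of two (assumes V >= 1).
// Assumption: this is how each RUN line's vector-bits-min value relates to
// the VBITS value it passes to FileCheck (for example 384 -> 256).
constexpr unsigned floorPowerOfTwo(unsigned V) {
  unsigned P = 1;
  while (P <= V / 2) // equivalent to 2 * P <= V, without overflow
    P *= 2;
  return P;
}

// Mirrors the FileCheck expression mul(div(TypeBits - 1, VBITS) + 1, 2):
// the number of VBITS-wide registers needed for the type, times two.
constexpr unsigned expectedMaskedCost(unsigned TypeBits, unsigned VBits) {
  return ((TypeBits - 1) / VBits + 1) * 2;
}

// Worked examples using widths that appear in the test.
static_assert(floorPowerOfTwo(384) == 256, "384 maps to VBITS=256");
static_assert(expectedMaskedCost(2048, 256) == 16, "<256 x i8>, VBITS=256");
static_assert(expectedMaskedCost(512, 2048) == 2, "<16 x i32>, VBITS=2048");
```

A type that fits in a single register therefore costs 2, and the cost doubles each time the register width halves below the type width.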

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SVE] Teach cost model that masked loads/stores are cheap
ClosedPublic
