Add costs for extends to illegal scalable vector types.
Add a test file for the extend operations.
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
Could you please update the title to be more descriptive?
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2063
nit: -too wide-?
Thanks for this patch @hassnaa-arm! It's definitely an improvement on the existing cost model. However, I just have a few comments on the costs ...
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2066
For this example, something like

    define <vscale x 16 x i16> @sve_ext_i8_i16(<vscale x 16 x i8> %a) {
      %r = zext <vscale x 16 x i8> %a to <vscale x 16 x i16>
      ret <vscale x 16 x i16> %r
    }

would end up as

    uunpklo z2.h, z0.b
    uunpkhi z1.h, z0.b
    mov z0.d, z2.d
    ret

So I think perhaps the comment above might be better written as something like

    // zero/sign extend are implemented by multiple unpack operations,
    // where each unpack operation has a cost of 2.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2067
I wonder if this cost should be even higher? Extending from nxv16i8 to nxv16i16 leads to 2 unpk instructions, where you have a cost of 4. However, going from nxv16i8 to nxv16i32 requires 6 unpk instructions, which I would expect to be 2*6=12. For example, I get:

    sve_ext_i8_i32:
      uunpklo z1.h, z0.b
      uunpkhi z3.h, z0.b
      uunpklo z0.s, z1.h
      uunpkhi z1.s, z1.h
      uunpklo z2.s, z3.h
      uunpkhi z3.s, z3.h
      ret

The costs are probably even worse extending from nxv16i8 to nxv16i64!

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2072
It would be good to complete the other extends from legal types here too, such as

    { ISD::ZERO_EXTEND, MVT::nxv8i32, MVT::nxv8i16, ?},
    { ISD::ZERO_EXTEND, MVT::nxv8i64, MVT::nxv8i16, ?},
    { ISD::ZERO_EXTEND, MVT::nxv4i64, MVT::nxv4i32, ?},
    { ISD::SIGN_EXTEND, MVT::nxv8i32, MVT::nxv8i16, ?},
    { ISD::SIGN_EXTEND, MVT::nxv8i64, MVT::nxv8i16, ?},
    { ISD::SIGN_EXTEND, MVT::nxv4i64, MVT::nxv4i32, ?},
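To make the unpack arithmetic above concrete, here is a small standalone sketch (illustrative only; the helper name and shape are assumptions, not code from the patch) of how the number of unpack instructions grows with each doubling of the element width. It reproduces the counts discussed above: 2 for nxv16i8 -> nxv16i16, 6 for nxv16i8 -> nxv16i32, and 14 for nxv16i8 -> nxv16i64.

    // Illustrative helper, not part of the patch: count the unpack
    // instructions needed to widen a fully-packed SVE vector by repeated
    // element-width doubling. Each step turns every existing part into a
    // lo/hi pair, so k doubling steps need 2 + 4 + ... + 2^k unpacks.
    unsigned unpacksForExtend(unsigned SrcBits, unsigned DstBits) {
      unsigned Parts = 1; // number of legal parts after each step
      unsigned Count = 0; // unpack instructions emitted so far
      while (SrcBits < DstBits) {
        Parts *= 2;       // every part is split into a lo and a hi half
        Count += Parts;   // one unpack instruction per resulting part
        SrcBits *= 2;
      }
      return Count;       // i8->i16: 2, i8->i32: 6, i8->i64: 14
    }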
Recalculate the costs. In the code-generation test file, use a real variable instead of undef to get accurate costs.
Because that is what is mentioned here: https://developer.arm.com/documentation/pjdoc466751330-9685/latest/ in section 3.25 (SVE integer instructions).
Hi @hassnaa-arm, the cost-model shouldn't be hardcoding the number of cycles for one specific micro-architecture, because the cost-model should be accurate for other micro-architectures as well.
The cost requested here is the throughput cost, not the latency. The throughput cost is closer to the number of instructions required for the operation.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2073
It's probably also worth adding cases for extending:

    nxv8i16 -> nxv8i32
    nxv8i16 -> nxv8i64
    nxv4i32 -> nxv4i64
Yep - I would expect most extend operations to be fairly cheap in terms of throughput; at least it looks that way from the optimization guides I looked at. They are in line with other instructions, so they would usually get a cost of 1.
There is always a chance that getting a cost "wrong" produces better results. This change will probably make the vectorizer prefer lower vectorization factors or prefer NEON over SVE. That could help in places, but it is usually better to get the costs more correct. We would need a good justification for why they were set 2x higher.
Hi @hassnaa-arm, the confusing aspect here is that the throughput in the cost model is effectively the inverse of the throughput that optimisation guides sometimes talk about, which doesn't help. In the optimisation guide you referenced earlier, a high throughput is essentially a good thing, whereas in the vectoriser cost model a high value means the opposite. If you take the inverse of the guide's throughput (2), you get 1/2, but the cost has to be an integer, so we can just round up to 1.
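As a concrete illustration of that inversion (a sketch only; the function is hypothetical and not an LLVM API), a guide's "instructions per cycle" throughput maps to an integer cost roughly like this:

    #include <cmath>

    // Hypothetical helper: convert an optimisation-guide throughput
    // (instructions completed per cycle) into the integer reciprocal-
    // throughput cost used by the vectoriser. A guide throughput of 2
    // gives 1/2 of a cycle per instruction, which rounds up to a cost of 1.
    unsigned costFromGuideThroughput(double InstrsPerCycle) {
      return static_cast<unsigned>(std::ceil(1.0 / InstrsPerCycle));
    }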
Yeah - it's also relative to other instructions, so our baseline would be around 2 for any SVE operation on V1. Note that the V1 is a bit of an oddball in this regard: the SVE vector length is 256 bits, but you can execute either 4 NEON instructions per cycle or 2 SVE instructions (they use the same vector pipelines, with each SVE operation taking two pipelines, so they end up with the same throughput in terms of bits). Really, the reciprocal throughput cost of _all_ SVE instructions should be twice that of NEON instructions on V1. The end result would just be to double the costs so that VScaleForTuning (2 for V1) divides them by 2 again, so it would be simpler (and more maintainable) to consider setting VScaleForTuning to 1 for that core. Unfortunately, when I tried that in the past the performance hit some snags, and it didn't solve the problem I was trying to fix.
That aside, the "Unpack and extend" instructions are generally listed as cheap in other optimization guides, in line with other instructions, so our generic cost should probably line up with that.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2066
I think perhaps this can just be a cost of 2 because there are only 2 unpacks needed? Same for the SIGN_EXTEND case.

llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2072
Hi @hassnaa-arm, I think this patch is still missing the other extends that @sdesmalen and I suggested?
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2066
The needed instructions are 2 unpacks and 1 mov; that's why I made it 3.
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp:2066
In this case the mov is inserted by the register allocator and isn't caused by lowering of the extend itself, so this should still have a cost of '2'.
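Putting the thread's suggestions together, here is a sketch of how the discussed entries might look in the conversion-cost table in AArch64TargetTransformInfo.cpp, assuming a cost equal to the number of unpack instructions the lowering needs. The values and the final shape of the table are assumptions based on the discussion, not the contents of the patch itself.

    // Sketch only, not the final patch: extend entries in the
    // TypeConversionCostTblEntry format used by the existing conversion
    // cost table in AArch64TargetTransformInfo::getCastInstrCost.
    // One widening step needs 2 unpacks; two steps need 2 + 4 = 6.
    { ISD::ZERO_EXTEND, MVT::nxv16i16, MVT::nxv16i8, 2 }, // uunpklo + uunpkhi
    { ISD::SIGN_EXTEND, MVT::nxv16i16, MVT::nxv16i8, 2 }, // sunpklo + sunpkhi
    { ISD::ZERO_EXTEND, MVT::nxv8i32,  MVT::nxv8i16, 2 },
    { ISD::SIGN_EXTEND, MVT::nxv8i32,  MVT::nxv8i16, 2 },
    { ISD::ZERO_EXTEND, MVT::nxv4i64,  MVT::nxv4i32, 2 },
    { ISD::SIGN_EXTEND, MVT::nxv4i64,  MVT::nxv4i32, 2 },
    { ISD::ZERO_EXTEND, MVT::nxv16i32, MVT::nxv16i8, 6 }, // two widening steps
    { ISD::SIGN_EXTEND, MVT::nxv16i32, MVT::nxv16i8, 6 },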
LGTM with nit addressed
llvm/test/Analysis/CostModel/AArch64/sve-ext.ll:2
We have quite a few different files for the different kinds of extends; it might be nice to move them all into a single sve-cast.ll at some point (in a separate patch), similar to what was done for cast.ll.

llvm/test/Analysis/CostModel/AArch64/sve-ext.ll:3
I don't think this is needed, as there is no IR output (because of -disable-output).
This patch indirectly causes the vectoriser to choose a lower VF due to the high cost of extending nxv16i8 -> nxv16i16, and that caused a regression.
Dave was investigating that issue and has created a patch to fix it.
So right now this patch should work well.
I will rebase it and run the checks to make sure everything is okay.