Download Raw Diff

Details

Reviewers

efriedma
fhahn
dmgreen
nikic
david-arm
bsmith
sdesmalen
paulwalker-arm

Commits

rGb092dc0bb9e4: [AArch64ISelLowering] Avoid duplane in some cases when sve enabled

Diff Detail

Event Timeline

guopeilin created this revision.Sep 27 2021, 1:18 AM

Herald added subscribers: hiraditya, kristof.beyls, tschuett. · View Herald TranscriptSep 27 2021, 1:18 AM

guopeilin requested review of this revision.Sep 27 2021, 1:18 AM

Herald added a project: Restricted Project. · View Herald TranscriptSep 27 2021, 1:18 AM

Herald added a subscriber: llvm-commits. · View Herald Transcript

guopeilin added reviewers: efriedma, fhahn, dmgreen, nikic.Sep 27 2021, 1:22 AM

dmgreen added reviewers: david-arm, bsmith, sdesmalen.Sep 27 2021, 1:25 AM

Harbormaster completed remote builds in B125809: Diff 375166.Sep 27 2021, 1:38 AM

Can you change the name of the existing test from sve-limit-duplane.ll -> sve-fixed-length-limit-duplane.ll? This is the convention we use when using SVE for fixed-length vectors.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9519	This isn't safe for truly scalable vectors where the size is unknown at runtime. I think this should probably be written as: auto ExtractedValType = V.getOperand(0).getValueType(); if (ExtractedValType.isFixedLengthVector() && ExtractedValType.getFixedSizeInBits() <= 128) { ... }
llvm/test/CodeGen/AArch64/sve-limit-duplane.ll
30 ↗	(On Diff #375166)	If possible I think it would be better to try to reduce this test case to the raw operations using proper scalable vectors, i.e. something like: define <4 x i32> @test_work_knt_val(<16 x i32> %arg) { %shvec = shufflevector <16 x i32> %arg, <16 x i32> undef, <4 x i32> <i32 14, i32 14, i32 14, i32 14> ret <4 x i32> %shvec }

guopeilin updated this revision to Diff 375521.Sep 28 2021, 4:03 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9519	Fixed, please review

Tried to simplify the test case but failed. @david-arm

Harbormaster completed remote builds in B126068: Diff 375521.Sep 28 2021, 4:21 AM

Hi @guopeilin, thanks for making the changes. I'm still a little concerned about the test case though as I'm not sure how reliable it will be over time, especially with the undef values present in the code. I did manage to reduce this test case to:

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mattr=+sve -aarch64-sve-vector-bits-min=256 < %s | FileCheck %s
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-unknown-linux-gnu"

define <4 x i32> @test_work_knt_val(<16 x i32>* %arg) {
entry:
  %0 = load <16 x i32>, <16 x i32>* %arg, align 256
  %shvec = shufflevector <16 x i32> %0, <16 x i32> undef, <4 x i32> <i32 14, i32 14, i32 14, i32 14>
  %1 = add <16 x i32> %0, %0
  store <16 x i32> %1, <16 x i32>* %arg, align 256
  ret <4 x i32> %shvec
}

I think the problem is that there has to be multiple uses of the loaded value (%0) in order for the DAG combine to trigger.

In D110524#3027244, @david-arm wrote:
Hi @guopeilin, thanks for making the changes. I'm still a little concerned about the test case though as I'm not sure how reliable it will be over time, especially with the undef values present in the code. I did manage to reduce this test case to:
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mattr=+sve -aarch64-sve-vector-bits-min=256 < %s | FileCheck %s
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-unknown-linux-gnu"

define <4 x i32> @test_work_knt_val(<16 x i32>* %arg) {
entry:
  %0 = load <16 x i32>, <16 x i32>* %arg, align 256
  %shvec = shufflevector <16 x i32> %0, <16 x i32> undef, <4 x i32> <i32 14, i32 14, i32 14, i32 14>
  %1 = add <16 x i32> %0, %0
  store <16 x i32> %1, <16 x i32>* %arg, align 256
  ret <4 x i32> %shvec
}
I think the problem is that there has to be multiple uses of the loaded value (%0) in order for the DAG combine to trigger.

Yes, that's great, thanks a lot.
And I use two arguments to avoid the usage of the undef value.
Please review, Thanks!

Harbormaster completed remote builds in B126261: Diff 375795.Sep 29 2021, 12:43 AM

LGTM! Thanks for making the changes.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9520	nit: Before committing could you change this to: ExtractedValType.getFixedSizeInBits() It's a bit shorter and it also asserts that the TypeSize is fixed.

This revision is now accepted and ready to land.Sep 29 2021, 1:52 AM

Hi @guopeilin, sorry for the stern 'requested changes' on your patch, but I think the code isn't entirely correct just yet.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9518–9523	nit: The name `ExtractedValType` isn't exactly accurate, because this doesn't represent the value type of the extracted value, but rather the (input) value that a subvector is extracted from.
9520	It should be fine to extract: v2f32 (extract v16f32 X, 2), 1 -> dup X, 3 The current code seems to check that sizeof(v16f32) <= 128, whereas it should use the index `ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1))` to determine whether the lane can be indexed.. Additionally, the 128 seems conservative as well, since SVE indexed DUP allows an index < 512bits.
llvm/test/CodeGen/AArch64/sve-fixed-length-limit-duplane.ll
34	It seems we can use SVE's indexed DUP instructions for this, see `sve_int_perm_dup_i`, which only has patterns for `AArch64dup` (which is a different ISD node than DUPLANE32).

This revision now requires changes to proceed.Sep 29 2021, 3:35 AM

guopeilin added inline comments.Sep 30 2021, 12:30 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9520	Thanks for reviewing. And there is something that confused me, I just wonder what exactly is the value of `ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1))` represents for? For your example, here I guess the value if `f32 x 3` (Correct me if I am wrong), which represents neither the `v2f32`(return value) nor the `v16f32`(input value). And what should the `f32 x 3` compare with?

mdchen added a subscriber: mdchen.Sep 30 2021, 2:29 AM

sdesmalen added inline comments.Oct 1 2021, 12:48 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9520	AIUI, for: v2f32 (extract v16f32 X, 2), 1 -> dup X, 3 The expression: ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1)) Represents: (v2f32.getVectorElementType().getSizeInBits() * (1 + 2) <=> 32 * 3 Which must not exceed the native (Neon/SVE) vector length or the range of the immediate, otherwise the extract_subvector needs doing first.

guopeilin added inline comments.Oct 2 2021, 10:22 PM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9520	Thanks for the answer, but honestly speaking, I still not get any idea about the meaning of the `32 * 3`, why should we care about the `3` rather than the type of the operand of the DUP instruction. I still believe that for this issue, the root cause is that the type of the operands of DUP instruction, which has been specified in the related `.td` file or `.inc` file, should not beyond 128 bits. In other words, we should check the legality of the new instruction, which is `dup X, 3` in your example. And since (v16f32)X is not legal for DUP, so we just ignore this optimization. Besides, I am wondering what benefits can we get if we duplicate a value from a SVE register to a SIMD register? Suppose that `v4f32 %res = dup v8f32 %X, 3` is legal, to represent %res, we might either return SVE registers with an extra P register to indicate which lane is valid, or need another extra copy instruction to move the result from a sve register to the simd register. Please correct me if I am wrong, and I don't think it is useful to do such optimization if it is fix-length sve register.

pinging reviewers...

bsmith added a reviewer: paulwalker-arm.Oct 12 2021, 4:01 AM

The work to handle DUP from wider vectors than 128bits would be a possible improvement for a future patch.
To move this patch forward, please address the two other comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9518–9523	Can you please address this comment? Maybe just `s/ExtractedValType/VecVT/` ?
9520	I think there are two things here: The DUP instruction must be able to handle the index, so the index must fit the immediate. There must be patterns to handle a DUP for some index and I guess these patterns are only implemented for Neon at the moment. If the type being extracted from is always a Neon register (<= 128bits), then the index always happens to fit the immediate, so by saying "input vector must be less or equal to 128bits", you implicitly satisfy criteria 1 (and I was concerned about it being implicit). If you'd implement DUP patterns to dup lanes from a vector > 128bits (i.e. from a Z register) then you could allow `v2f32 (extract v16f32 X, 2), 1 -> dup X, 3`, but it would mean adding a condition to check that the new index fits the immediate. Besides, I am wondering what benefits can we get if we duplicate a value from a SVE register to a SIMD register? Suppose that v4f32 %res = dup v8f32 %X, 3 is legal, to represent %res, we might either return SVE registers with an extra P register to indicate which lane is valid, or need another extra copy instruction to move the result from a sve register to the simd register. Please correct me if I am wrong, and I don't think it is useful to do such optimization if it is fix-length sve register. You wouldn't need an extra P register to indicate which lane is valid, the other lanes can just be considered 'undef'. You also don't need an extra mov instruction to move an SVE register to a Neon register, because the two registers overlap (i.e. v0 is a subregister of z0)
9520	Please change: s/getValue/getFixedValue/

guopeilin updated this revision to Diff 379264.Oct 12 2021, 8:13 PM

Harbormaster completed remote builds in B128529: Diff 379264.Oct 12 2021, 8:49 PM

guopeilin updated this revision to Diff 379277.Oct 12 2021, 10:06 PM

guopeilin added inline comments.Oct 12 2021, 10:08 PM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9518–9523	Fixed
9520	fixed

Harbormaster completed remote builds in B128540: Diff 379277.Oct 12 2021, 11:08 PM

sdesmalen accepted this revision.Oct 13 2021, 4:55 AM

This revision is now accepted and ready to land.Oct 13 2021, 4:55 AM

update test file

Harbormaster completed remote builds in B128793: Diff 379617.Oct 14 2021, 1:37 AM

Closed by commit rGb092dc0bb9e4: [AArch64ISelLowering] Avoid duplane in some cases when sve enabled (authored by guopeilin). · Explain WhyOct 15 2021, 12:35 AM

This revision was automatically updated to reflect the committed changes.

guopeilin added a commit: rGb092dc0bb9e4: [AArch64ISelLowering] Avoid duplane in some cases when sve enabled.

paulwalker-arm mentioned this in D116442: [AArch64ISelLowering] Don't look through scalable extract_subvector when optimising DUPLANE..Dec 31 2021, 6:27 AM

paulwalker-arm mentioned this in rG4325fd7402bf: [AArch64ISelLowering] Don't look through scalable extract_subvector when….Jan 5 2022, 3:59 AM

Diff 379264

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 9,509 Lines • ▼ Show 20 Lines	auto getScaledOffsetDup = [](SDValue BitCast, int &LaneC, MVT &CastVT) {
return true;		return true;
};		};
MVT CastVT;		MVT CastVT;
if (getScaledOffsetDup(V, Lane, CastVT)) {		if (getScaledOffsetDup(V, Lane, CastVT)) {
V = DAG.getBitcast(CastVT, V.getOperand(0).getOperand(0));		V = DAG.getBitcast(CastVT, V.getOperand(0).getOperand(0));
} else if (V.getOpcode() == ISD::EXTRACT_SUBVECTOR) {		} else if (V.getOpcode() == ISD::EXTRACT_SUBVECTOR) {
// The lane is incremented by the index of the extract.		// The lane is incremented by the index of the extract.
// Example: dup v2f32 (extract v4f32 X, 2), 1 --> dup v4f32 X, 3		// Example: dup v2f32 (extract v4f32 X, 2), 1 --> dup v4f32 X, 3
		auto ExtractedValType = V.getOperand(0).getValueType();
		if (ExtractedValType.isFixedLengthVector() &&
		david-armUnsubmitted Not Done Reply Inline Actions This isn't safe for truly scalable vectors where the size is unknown at runtime. I think this should probably be written as: auto ExtractedValType = V.getOperand(0).getValueType(); if (ExtractedValType.isFixedLengthVector() && ExtractedValType.getFixedSizeInBits() <= 128) { ... } david-arm: This isn't safe for truly scalable vectors where the size is unknown at runtime. I think this…
		guopeilinAuthorUnsubmitted Done Reply Inline Actions Fixed, please review guopeilin: Fixed, please review
		ExtractedValType.getFixedSizeInBits() <= 128) {
		david-armUnsubmitted Not Done Reply Inline Actions nit: Before committing could you change this to: ExtractedValType.getFixedSizeInBits() It's a bit shorter and it also asserts that the TypeSize is fixed. david-arm: nit: Before committing could you change this to: ExtractedValType.getFixedSizeInBits() It's…
		sdesmalenUnsubmitted Not Done Reply Inline Actions It should be fine to extract: v2f32 (extract v16f32 X, 2), 1 -> dup X, 3 The current code seems to check that sizeof(v16f32) <= 128, whereas it should use the index `ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1))` to determine whether the lane can be indexed.. Additionally, the 128 seems conservative as well, since SVE indexed DUP allows an index < 512bits. sdesmalen: It should be fine to extract: v2f32 (extract v16f32 X, 2), 1 -> dup X, 3 The current code…
		guopeilinAuthorUnsubmitted Not Done Reply Inline Actions Thanks for reviewing. And there is something that confused me, I just wonder what exactly is the value of `ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1))` represents for? For your example, here I guess the value if `f32 x 3` (Correct me if I am wrong), which represents neither the `v2f32`(return value) nor the `v16f32`(input value). And what should the `f32 x 3` compare with? guopeilin: Thanks for reviewing. And there is something that confused me, I just wonder what exactly is…
		sdesmalenUnsubmitted Not Done Reply Inline Actions AIUI, for: v2f32 (extract v16f32 X, 2), 1 -> dup X, 3 The expression: ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1)) Represents: (v2f32.getVectorElementType().getSizeInBits() * (1 + 2) <=> 32 * 3 Which must not exceed the native (Neon/SVE) vector length or the range of the immediate, otherwise the extract_subvector needs doing first. sdesmalen: AIUI, for: v2f32 (extract v16f32 X, 2), 1 -> dup X, 3 The expression: ExtractedValType.
		guopeilinAuthorUnsubmitted Done Reply Inline Actions Thanks for the answer, but honestly speaking, I still not get any idea about the meaning of the `32 * 3`, why should we care about the `3` rather than the type of the operand of the DUP instruction. I still believe that for this issue, the root cause is that the type of the operands of DUP instruction, which has been specified in the related `.td` file or `.inc` file, should not beyond 128 bits. In other words, we should check the legality of the new instruction, which is `dup X, 3` in your example. And since (v16f32)X is not legal for DUP, so we just ignore this optimization. Besides, I am wondering what benefits can we get if we duplicate a value from a SVE register to a SIMD register? Suppose that `v4f32 %res = dup v8f32 %X, 3` is legal, to represent %res, we might either return SVE registers with an extra P register to indicate which lane is valid, or need another extra copy instruction to move the result from a sve register to the simd register. Please correct me if I am wrong, and I don't think it is useful to do such optimization if it is fix-length sve register. guopeilin: Thanks for the answer, but honestly speaking, I still not get any idea about the meaning of the…
		sdesmalenUnsubmitted Not Done Reply Inline Actions I think there are two things here: The DUP instruction must be able to handle the index, so the index must fit the immediate. There must be patterns to handle a DUP for some index and I guess these patterns are only implemented for Neon at the moment. If the type being extracted from is always a Neon register (<= 128bits), then the index always happens to fit the immediate, so by saying "input vector must be less or equal to 128bits", you implicitly satisfy criteria 1 (and I was concerned about it being implicit). If you'd implement DUP patterns to dup lanes from a vector > 128bits (i.e. from a Z register) then you could allow `v2f32 (extract v16f32 X, 2), 1 -> dup X, 3`, but it would mean adding a condition to check that the new index fits the immediate. Besides, I am wondering what benefits can we get if we duplicate a value from a SVE register to a SIMD register? Suppose that v4f32 %res = dup v8f32 %X, 3 is legal, to represent %res, we might either return SVE registers with an extra P register to indicate which lane is valid, or need another extra copy instruction to move the result from a sve register to the simd register. Please correct me if I am wrong, and I don't think it is useful to do such optimization if it is fix-length sve register. You wouldn't need an extra P register to indicate which lane is valid, the other lanes can just be considered 'undef'. You also don't need an extra mov instruction to move an SVE register to a Neon register, because the two registers overlap (i.e. v0 is a subregister of z0) sdesmalen: I think there are two things here: 1. The DUP instruction must be able to handle the index, so…
		sdesmalenUnsubmitted Not Done Reply Inline Actions Please change: s/getValue/getFixedValue/ sdesmalen: Please change: s/getValue/getFixedValue/
		guopeilinAuthorUnsubmitted Done Reply Inline Actions fixed guopeilin: fixed
Lane += V.getConstantOperandVal(1);		Lane += V.getConstantOperandVal(1);
V = V.getOperand(0);		V = V.getOperand(0);
		}
		sdesmalenUnsubmitted Not Done Reply Inline Actions nit: The name `ExtractedValType` isn't exactly accurate, because this doesn't represent the value type of the extracted value, but rather the (input) value that a subvector is extracted from. sdesmalen: nit: The name `ExtractedValType` isn't exactly accurate, because this doesn't represent the…
		sdesmalenUnsubmitted Not Done Reply Inline Actions Can you please address this comment? Maybe just `s/ExtractedValType/VecVT/` ? sdesmalen: Can you please address this comment? Maybe just `s/ExtractedValType/VecVT/` ?
		guopeilinAuthorUnsubmitted Done Reply Inline Actions Fixed guopeilin: Fixed
} else if (V.getOpcode() == ISD::CONCAT_VECTORS) {		} else if (V.getOpcode() == ISD::CONCAT_VECTORS) {
// The lane is decremented if we are splatting from the 2nd operand.		// The lane is decremented if we are splatting from the 2nd operand.
// Example: dup v4i32 (concat v2i32 X, v2i32 Y), 3 --> dup v4i32 Y, 1		// Example: dup v4i32 (concat v2i32 X, v2i32 Y), 3 --> dup v4i32 Y, 1
unsigned Idx = Lane >= (int)VT.getVectorNumElements() / 2;		unsigned Idx = Lane >= (int)VT.getVectorNumElements() / 2;
Lane -= Idx * VT.getVectorNumElements() / 2;		Lane -= Idx * VT.getVectorNumElements() / 2;
V = WidenVector(V.getOperand(Idx), DAG);		V = WidenVector(V.getOperand(Idx), DAG);
} else if (VT.getSizeInBits() == 64) {		} else if (VT.getSizeInBits() == 64) {
// Widen the operand to 128-bit register with undef.		// Widen the operand to 128-bit register with undef.
▲ Show 20 Lines • Show All 9,544 Lines • Show Last 20 Lines

llvm/test/CodeGen/AArch64/sve-fixed-length-limit-duplane.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mattr=+sve -aarch64-sve-vector-bits-min=256 < %s \| FileCheck %s
				target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
				target triple = "aarch64-unknown-linux-gnu"

				define <4 x i32> @test(<16 x i32>* %arg1, <16 x i32>* %arg2) {
				; CHECK-LABEL: test:
				; CHECK: // %bb.0: // %entry
				; CHECK-NEXT: stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
				; CHECK-NEXT: sub x9, sp, #48
				; CHECK-NEXT: mov x29, sp
				; CHECK-NEXT: and sp, x9, #0xffffffffffffffe0
				; CHECK-NEXT: .cfi_def_cfa w29, 16
				; CHECK-NEXT: .cfi_offset w30, -8
				; CHECK-NEXT: .cfi_offset w29, -16
				; CHECK-NEXT: mov x8, #8
				; CHECK-NEXT: ptrue p0.s, vl8
				; CHECK-NEXT: ld1w { z0.s }, p0/z, [x0, x8, lsl #2]
				; CHECK-NEXT: mov x9, sp
				; CHECK-NEXT: ld1w { z1.s }, p0/z, [x0]
				; CHECK-NEXT: st1w { z0.s }, p0, [x9]
				; CHECK-NEXT: ldr q2, [sp, #16]
				; CHECK-NEXT: add z1.s, p0/m, z1.s, z1.s
				; CHECK-NEXT: movprfx z3, z0
				; CHECK-NEXT: add z3.s, p0/m, z3.s, z0.s
				; CHECK-NEXT: st1w { z3.s }, p0, [x0, x8, lsl #2]
				; CHECK-NEXT: dup v0.4s, v2.s[2]
				; CHECK-NEXT: st1w { z1.s }, p0, [x0]
				; CHECK-NEXT: mov sp, x29
				; CHECK-NEXT: ldp x29, x30, [sp], #16 // 16-byte Folded Reload
				; CHECK-NEXT: ret
				entry:
				%0 = load <16 x i32>, <16 x i32>* %arg1, align 256
				%1 = load <16 x i32>, <16 x i32>* %arg2, align 256
				sdesmalenUnsubmitted Not Done Reply Inline Actions It seems we can use SVE's indexed DUP instructions for this, see `sve_int_perm_dup_i`, which only has patterns for `AArch64dup` (which is a different ISD node than DUPLANE32). sdesmalen: It seems we can use SVE's indexed DUP instructions for this, see `sve_int_perm_dup_i`, which…
				%shvec = shufflevector <16 x i32> %0, <16 x i32> %1, <4 x i32> <i32 14, i32 14, i32 14, i32 14>
				%2 = add <16 x i32> %0, %0
				store <16 x i32> %2, <16 x i32>* %arg1, align 256
				ret <4 x i32> %shvec
				}

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64ISelLowering] Avoid duplane in some cases when sve enabled
ClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 379264

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

llvm/test/CodeGen/AArch64/sve-fixed-length-limit-duplane.ll

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64ISelLowering] Avoid duplane in some cases when sve enabledClosedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 379264

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

llvm/test/CodeGen/AArch64/sve-fixed-length-limit-duplane.ll

[AArch64ISelLowering] Avoid duplane in some cases when sve enabled
ClosedPublic