This is an archive of the discontinued LLVM Phabricator instance.

[AArch64ISelLowering] Avoid duplane in some cases when sve enabled
ClosedPublic

Authored by guopeilin on Sep 27 2021, 1:18 AM.

Event Timeline

guopeilin created this revision. Sep 27 2021, 1:18 AM
guopeilin requested review of this revision. Sep 27 2021, 1:18 AM
Herald added a project: Restricted Project. Sep 27 2021, 1:18 AM

Can you change the name of the existing test from sve-limit-duplane.ll to sve-fixed-length-limit-duplane.ll? This is the convention we use when using SVE for fixed-length vectors.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9519

This isn't safe for truly scalable vectors, where the size isn't known at compile time. I think this should probably be written as:

auto ExtractedValType = V.getOperand(0).getValueType();
if (ExtractedValType.isFixedLengthVector() &&
    ExtractedValType.getFixedSizeInBits() <= 128) {
  ...
}
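
For context, a minimal sketch of how such a guard could sit inside the fold that rewrites a DUPLANE of an EXTRACT_SUBVECTOR; the surrounding structure and the lane rebasing are illustrative assumptions, not code quoted from this patch:

// Hedged sketch: fold dup(extract_subvector(X, Idx), Lane) into
// dup(X, Lane + Idx) only when X is a fixed-length vector that fits in a
// 128-bit Neon register; a truly scalable source has no compile-time size.
if (V.getOpcode() == ISD::EXTRACT_SUBVECTOR) {
  auto ExtractedValType = V.getOperand(0).getValueType();
  if (ExtractedValType.isFixedLengthVector() &&
      ExtractedValType.getFixedSizeInBits() <= 128) {
    Lane += V.getConstantOperandVal(1); // rebase the lane onto the source
    V = V.getOperand(0);                // and dup directly from the source
  }
}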
llvm/test/CodeGen/AArch64/sve-limit-duplane.ll
30 (On Diff #375166)

If possible I think it would be better to try to reduce this test case to the raw operations using proper scalable vectors, i.e. something like:

define <4 x i32> @test_work_knt_val(<16 x i32> %arg) {
  %shvec = shufflevector <16 x i32> %arg, <16 x i32> undef, <4 x i32> <i32 14, i32 14, i32 14, i32 14>
  ret <4 x i32> %shvec
}
guopeilin updated this revision to Diff 375521. Sep 28 2021, 4:03 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9519

Fixed, please review

I tried to simplify the test case but failed. @david-arm

Hi @guopeilin, thanks for making the changes. I'm still a little concerned about the test case, though, as I'm not sure how reliable it will be over time, especially with the undef values present in the code. I did manage to reduce this test case to:

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mattr=+sve -aarch64-sve-vector-bits-min=256 < %s | FileCheck %s
target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128"
target triple = "aarch64-unknown-linux-gnu"

define <4 x i32> @test_work_knt_val(<16 x i32>* %arg) {
entry:
  %0 = load <16 x i32>, <16 x i32>* %arg, align 256
  %shvec = shufflevector <16 x i32> %0, <16 x i32> undef, <4 x i32> <i32 14, i32 14, i32 14, i32 14>
  %1 = add <16 x i32> %0, %0
  store <16 x i32> %1, <16 x i32>* %arg, align 256
  ret <4 x i32> %shvec
}

I think the problem is that there have to be multiple uses of the loaded value (%0) in order for the DAG combine to trigger.

Yes, that's great, thanks a lot.
I used two arguments to avoid relying on an undef value.
Please review, thanks!

david-arm accepted this revision. Sep 29 2021, 1:52 AM

LGTM! Thanks for making the changes.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9520

nit: Before committing could you change this to:

ExtractedValType.getFixedSizeInBits()

It's a bit shorter and it also asserts that the TypeSize is fixed.

This revision is now accepted and ready to land. Sep 29 2021, 1:52 AM
sdesmalen requested changes to this revision. Sep 29 2021, 3:35 AM

Hi @guopeilin, sorry for the stern 'requested changes' on your patch, but I think the code isn't entirely correct just yet.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9518–9522

nit: The name ExtractedValType isn't exactly accurate, because it doesn't represent the value type of the extracted value, but rather that of the (input) vector from which the subvector is extracted.

9520

It should be fine to extract:

v2f32 (extract v16f32 X, 2), 1 -> dup X, 3

The current code seems to check that sizeof(v16f32) <= 128, whereas it should use the index ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1)) to determine whether the lane can be indexed.

Additionally, the 128 seems conservative as well, since SVE's indexed DUP allows an index within the first 512 bits.

llvm/test/CodeGen/AArch64/sve-fixed-length-limit-duplane.ll
34

It seems we can use SVE's indexed DUP instructions for this, see sve_int_perm_dup_i, which only has patterns for AArch64dup (which is a different ISD node than DUPLANE32).

This revision now requires changes to proceed. Sep 29 2021, 3:35 AM
guopeilin added inline comments. Sep 30 2021, 12:30 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9520

Thanks for reviewing.
There is something that confuses me: what exactly does the value of ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1)) represent? For your example, I guess the value is f32 x 3 (correct me if I am wrong), which represents neither v2f32 (the return value) nor v16f32 (the input value). What should the f32 x 3 be compared with?

mdchen added a subscriber: mdchen. Sep 30 2021, 2:29 AM
sdesmalen added inline comments. Oct 1 2021, 12:48 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9520

AIUI, for:

v2f32 (extract v16f32 X, 2), 1 -> dup X, 3

The expression:

ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1))

Represents:

v16f32.getVectorElementType().getSizeInBits() * (1 + 2)
<=>
32 * 3 = 96

This must not exceed the native (Neon/SVE) vector length or the range of the immediate; otherwise the extract_subvector needs to be performed first.
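
To make the arithmetic concrete, here is a small self-contained sketch of the bound being described; the variable names are illustrative and mimic the EVT expressions above rather than using LLVM's APIs:

#include <cstdint>
#include <cstdio>

// Worked example from the comment above: for
//   v2f32 (extract v16f32 X, 2), 1 -> dup X, 3
// the dup reads the element at bit offset EltBits * (Lane + ExtIdx).
int main() {
  uint64_t EltBits = 32; // f32 element size in bits
  uint64_t Lane = 1;     // duplane index into the extracted subvector
  uint64_t ExtIdx = 2;   // extract_subvector start index
  uint64_t BitOffset = EltBits * (Lane + ExtIdx); // 32 * 3 = 96
  // The fold is valid only if the dup'd element is addressable:
  // within 128 bits for a Neon DUP, within 512 bits for SVE indexed DUP.
  printf("bit offset = %llu, fits a Neon DUP: %s\n",
         (unsigned long long)BitOffset,
         BitOffset + EltBits <= 128 ? "yes" : "no");
  return 0;
}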

guopeilin added inline comments. Oct 2 2021, 10:22 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9520

Thanks for the answer, but honestly speaking, I still don't understand the meaning of the 32 * 3; why should we care about the 3 rather than the type of the operand of the DUP instruction?

I still believe the root cause of this issue is that the type of the operands of the DUP instruction, as specified in the related .td or .inc files, must not exceed 128 bits. In other words, we should check the legality of the new instruction, which is dup X, 3 in your example. Since (v16f32)X is not legal for DUP, we should simply skip this optimization.

Besides, I am wondering what benefit we get from duplicating a value from an SVE register into a SIMD register. Suppose v4f32 %res = dup v8f32 %X, 3 were legal: to represent %res, we would either need to return an SVE register plus an extra P register to indicate which lanes are valid, or need an extra copy instruction to move the result from an SVE register to a SIMD register. Please correct me if I am wrong, but I don't think this optimization is useful for fixed-length SVE registers.

pinging reviewers...

The work to handle DUP from vectors wider than 128 bits would be a possible improvement for a future patch.
To move this patch forward, please address the two other comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9518–9522

Can you please address this comment? Maybe just s/ExtractedValType/VecVT/ ?

9520

I think there are two things here:

  1. The DUP instruction must be able to handle the index, so the index must fit the immediate.
  2. There must be patterns to handle a DUP for some index, and I guess these patterns are only implemented for Neon at the moment.

If the type being extracted from is always a Neon register (<= 128 bits), then the index always happens to fit the immediate, so by saying "the input vector must be less than or equal to 128 bits" you implicitly satisfy criterion 1 (and I was concerned about it being implicit).

If you'd implement DUP patterns to dup lanes from a vector wider than 128 bits (i.e. from a Z register), then you could allow v2f32 (extract v16f32 X, 2), 1 -> dup X, 3, but it would mean adding a condition to check that the new index fits the immediate.
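
A hedged sketch of what that extra condition might look like if wider sources were supported — the 512-bit bound (SVE indexed DUP's addressable range) and all names here are illustrative assumptions, not code from this patch:

// Hypothetical guard for dup'ing a lane from a source wider than 128 bits:
// the rebased lane must still be encodable in the DUP immediate.
EVT VecVT = V.getOperand(0).getValueType();
uint64_t EltBits = VecVT.getVectorElementType().getFixedSizeInBits();
uint64_t NewLane = Lane + V.getConstantOperandVal(1);
// SVE's indexed DUP can address elements within the first 512 bits.
bool IndexFitsImm = EltBits * (NewLane + 1) <= 512;
if (VecVT.isFixedLengthVector() && IndexFitsImm) {
  // ... emit the dup from the wider source, assuming suitable
  // ISel patterns for a Z-register DUPLANE exist (they don't yet).
}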

> Besides, I am wondering what benefit we get from duplicating a value from an SVE register into a SIMD register. Suppose v4f32 %res = dup v8f32 %X, 3 were legal: to represent %res, we would either need to return an SVE register plus an extra P register to indicate which lanes are valid, or need an extra copy instruction to move the result from an SVE register to a SIMD register.

You wouldn't need an extra P register to indicate which lanes are valid; the other lanes can just be considered 'undef'. You also don't need an extra mov instruction to move an SVE register to a Neon register, because the two register files overlap (i.e. v0 is a subregister of z0).

9520

Please change: s/getValue/getFixedValue/

guopeilin updated this revision to Diff 379264. Oct 12 2021, 8:13 PM
guopeilin added inline comments. Oct 12 2021, 10:08 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
9518–9522

Fixed

9520

Fixed

sdesmalen accepted this revision. Oct 13 2021, 4:55 AM
This revision is now accepted and ready to land. Oct 13 2021, 4:55 AM

update test file