Details
Diff Detail
Event Timeline
Can you change the name of the existing test from sve-limit-duplane.ll -> sve-fixed-length-limit-duplane.ll? This is the convention we use when using SVE for fixed-length vectors.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
9519 | This isn't safe for truly scalable vectors where the size is unknown at runtime. I think this should probably be written as: auto ExtractedValType = V.getOperand(0).getValueType(); if (ExtractedValType.isFixedLengthVector() && ExtractedValType.getFixedSizeInBits() <= 128) { ... } | |
llvm/test/CodeGen/AArch64/sve-limit-duplane.ll | ||
30 ↗ | (On Diff #375166) | If possible I think it would be better to try to reduce this test case to the raw operations using proper scalable vectors, i.e. something like: define <4 x i32> @test_work_knt_val(<16 x i32> %arg) { %shvec = shufflevector <16 x i32> %arg, <16 x i32> undef, <4 x i32> <i32 14, i32 14, i32 14, i32 14> ret <4 x i32> %shvec } |
Hi @guopeilin, thanks for making the changes. I'm still a little concerned about the test case though as I'm not sure how reliable it will be over time, especially with the undef values present in the code. I did manage to reduce this test case to:
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py ; RUN: llc -mattr=+sve -aarch64-sve-vector-bits-min=256 < %s | FileCheck %s target datalayout = "e-m:e-i8:8:32-i16:16:32-i64:64-i128:128-n32:64-S128" target triple = "aarch64-unknown-linux-gnu" define <4 x i32> @test_work_knt_val(<16 x i32>* %arg) { entry: %0 = load <16 x i32>, <16 x i32>* %arg, align 256 %shvec = shufflevector <16 x i32> %0, <16 x i32> undef, <4 x i32> <i32 14, i32 14, i32 14, i32 14> %1 = add <16 x i32> %0, %0 store <16 x i32> %1, <16 x i32>* %arg, align 256 ret <4 x i32> %shvec }
I think the problem is that there has to be multiple uses of the loaded value (%0) in order for the DAG combine to trigger.
Yes, that's great, thanks a lot.
And I use two arguments to avoid the usage of the undef value.
Please review, Thanks!
LGTM! Thanks for making the changes.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
9520 | nit: Before committing could you change this to: ExtractedValType.getFixedSizeInBits() It's a bit shorter and it also asserts that the TypeSize is fixed. |
Hi @guopeilin, sorry for the stern 'requested changes' on your patch, but I think the code isn't entirely correct just yet.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
9518–9523 | nit: The name ExtractedValType isn't exactly accurate, because this doesn't represent the value type of the extracted value, but rather the (input) value that a subvector is extracted from. | |
9520 | It should be fine to extract: v2f32 (extract v16f32 X, 2), 1 -> dup X, 3 The current code seems to check that sizeof(v16f32) <= 128, whereas it should use the index ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1)) to determine whether the lane can be indexed.. Additionally, the 128 seems conservative as well, since SVE indexed DUP allows an index < 512bits. | |
llvm/test/CodeGen/AArch64/sve-fixed-length-limit-duplane.ll | ||
34 | It seems we can use SVE's indexed DUP instructions for this, see sve_int_perm_dup_i, which only has patterns for AArch64dup (which is a different ISD node than DUPLANE32). |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
9520 | Thanks for reviewing. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
9520 | AIUI, for: v2f32 (extract v16f32 X, 2), 1 -> dup X, 3 The expression: ExtractedValType.getVectorElementType().getSizeInBits() * (Lane + V.getConstantOperandVal(1)) Represents: (v2f32.getVectorElementType().getSizeInBits() * (1 + 2) <=> 32 * 3 Which must not exceed the native (Neon/SVE) vector length or the range of the immediate, otherwise the extract_subvector needs doing first. |
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
9520 | Thanks for the answer, but honestly speaking, I still not get any idea about the meaning of the 32 * 3, why should we care about the 3 rather than the type of the operand of the DUP instruction. I still believe that for this issue, the root cause is that the type of the operands of DUP instruction, which has been specified in the related .td file or .inc file, should not beyond 128 bits. In other words, we should check the legality of the new instruction, which is dup X, 3 in your example. And since (v16f32)X is not legal for DUP, so we just ignore this optimization. Besides, I am wondering what benefits can we get if we duplicate a value from a SVE register to a SIMD register? Suppose that v4f32 %res = dup v8f32 %X, 3 is legal, to represent %res, we might either return SVE registers with an extra P register to indicate which lane is valid, or need another extra copy instruction to move the result from a sve register to the simd register. Please correct me if I am wrong, and I don't think it is useful to do such optimization if it is fix-length sve register. |
The work to handle DUP from wider vectors than 128bits would be a possible improvement for a future patch.
To move this patch forward, please address the two other comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp | ||
---|---|---|
9518–9523 | Can you please address this comment? Maybe just s/ExtractedValType/VecVT/ ? | |
9520 | I think there are two things here:
If the type being extracted from is always a Neon register (<= 128bits), then the index always happens to fit the immediate, so by saying "input vector must be less or equal to 128bits", you implicitly satisfy criteria 1 (and I was concerned about it being implicit). If you'd implement DUP patterns to dup lanes from a vector > 128bits (i.e. from a Z register) then you could allow v2f32 (extract v16f32 X, 2), 1 -> dup X, 3, but it would mean adding a condition to check that the new index fits the immediate.
You wouldn't need an extra P register to indicate which lane is valid, the other lanes can just be considered 'undef'. You also don't need an extra mov instruction to move an SVE register to a Neon register, because the two registers overlap (i.e. v0 is a subregister of z0) | |
9520 | Please change: s/getValue/getFixedValue/ |
This isn't safe for truly scalable vectors where the size is unknown at runtime. I think this should probably be written as: