llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13965–13966	This also allows the case where the total VecSize == 32 (e.g. `<4 x i8>`, which is currently not supported by Neon), or where the number of elements is not a power of 2 (e.g. `<6 x i8>`). Can you add a test for this case?
13966	The indentation seems weird, did you use clang-format?
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-ld2-alloca.ll
163	nit: can you add `nounwind` as one of the parameters, to avoid the .cfi directives in the output?

hassnaa-arm marked an inline comment as done. Nov 28 2022, 9:24 AM

hassnaa-arm added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13965–13966	In the IR, I don't understand why it allocates 16 elements while it only loads 8. Is there a reason for that?

Add nounwind to avoid generating .cfi directives.

Harbormaster completed remote builds in B199796: Diff 478282. Nov 28 2022, 10:50 AM

Add extra test cases

Harbormaster completed remote builds in B199975: Diff 478510. Nov 29 2022, 3:50 AM

sdesmalen added inline comments. Nov 30 2022, 1:18 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13965–13966	Based on the code you've added, I expected the tests you've added to use an ld2, do you know why it doesn't?

hassnaa-arm marked an inline comment as done. Nov 30 2022, 5:56 AM

Change the shuffle mask used in the test cases so that the interleaved load gets used.

Harbormaster completed remote builds in B200257: Diff 478925. Nov 30 2022, 5:57 AM
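
The mask change matters because an interleaved load can only be formed when the shuffle picks every second element. The original `<0, 2, 1, 3>` mask does not have that shape, while `<0, 2>` and `<1, 3, 5>` do. A minimal sketch of such a check follows, assuming a hypothetical helper `stridedMaskStart` (not an LLVM function) and ignoring undef lanes, which a real implementation would also have to handle:

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical helper, for illustration only (not part of LLVM).
// Returns the starting lane if Mask selects Start, Start+Factor,
// Start+2*Factor, ... and std::nullopt otherwise.
inline std::optional<unsigned> stridedMaskStart(const std::vector<int> &Mask,
                                                unsigned Factor) {
  if (Mask.empty() || Factor < 2)
    return std::nullopt;
  int Start = Mask[0];
  // The first index must name one of the Factor interleaved streams.
  if (Start < 0 || static_cast<unsigned>(Start) >= Factor)
    return std::nullopt;
  for (std::size_t I = 0; I < Mask.size(); ++I)
    if (Mask[I] != Start + static_cast<int>(I * Factor))
      return std::nullopt;
  return static_cast<unsigned>(Start);
}

// Usage: the masks from the old and new versions of the tests.
inline void stridedMaskExamples() {
  assert(!stridedMaskStart({0, 2, 1, 3}, 2)); // old alloc_v4i8 mask
  assert(stridedMaskStart({0, 2}, 2) == 0u);  // new alloc_v4i8 mask
  assert(stridedMaskStart({1, 3, 5}, 2) == 1u); // new alloc_v6i8 mask
}
```

A strided mask is necessary but not sufficient: the legality check in AArch64ISelLowering.cpp can still reject the resulting vector type.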

sdesmalen added inline comments. Nov 30 2022, 6:09 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13967	Sorry, I just realised you'll also need to add a check that we can generate a predicate pattern for the number of elements (e.g. vl2, vl3, ...), because e.g. a <9 x i8> has no corresponding predicate pattern. You can use `Optional<unsigned> getSVEPredPatternFromNumElements()` for this (defined in Utils/AArch64BaseInfo.h). Can you also add a test for <9 x i8>?

hassnaa-arm added inline comments. Nov 30 2022, 6:13 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13967	Sorry, I don't understand why I should check for e.g. a <9 x i8>. How is that related to the condition in the code? And what do you mean by "You can use Optional<unsigned> getSVEPredPatternFromNumElements() for this"? Do you mean that, in addition to adding the check, I should also make a code change?

sdesmalen added inline comments. Nov 30 2022, 6:20 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13967	What I meant was that if you do a shuffle mask like this:

	%load = load <6 x i8>, ptr %alloc
	%strided.vec = shufflevector <6 x i8> %load, <6 x i8> poison, <3 x i32> <i32 1, i32 3, i32 5>

you get the following LD2 instruction:

	ptrue p0.b, vl3
	ld2b { z0.b, z1.b }, p0/z, [x8]

which uses `vl3` for the predicate, meaning that 3 lanes are enabled. But there is no `vl9`, so if you ended up with NumElements == 9, you couldn't code-generate the interleaved access using LD2. To ask whether there is a ptrue predicate for `NumElements`, you can use `getSVEPredPatternFromNumElements`. (I meant <9 x i8> as the result type of the shuffle, by the way.)

hassnaa-arm added inline comments. Nov 30 2022, 6:31 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13967	You mean that if the result type of the shuffle is <9 x i8> there will be a problem, and you are asking me to add a test case for that and fix the problem, correct?

sdesmalen added inline comments. Nov 30 2022, 6:32 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13967	Correct.

hassnaa-arm added inline comments. Nov 30 2022, 6:44 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13967	But why did you choose vl9 specifically? What about other values, e.g. vl10?

sdesmalen added inline comments. Nov 30 2022, 6:50 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13967	Because the available predicate patterns are vl1, vl2, vl3, ..., vl8, vl16, ... So vl9 is the first non-power-of-2 vector length that can't be represented.
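
To make the list of patterns concrete, here is a minimal sketch of the query being described. `predPatternName` is a hypothetical stand-in, not the LLVM helper `getSVEPredPatternFromNumElements`, and the set of counts above 8 (16, 32, 64, 128, 256) reflects the fixed-count `ptrue` patterns as I understand the SVE architecture; treat it as an assumption and check the architecture reference before relying on it.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical stand-in for the query discussed above (not the LLVM helper).
// Returns the name of the fixed-count ptrue pattern that enables exactly
// NumElements lanes, or std::nullopt when no such pattern exists.
inline std::optional<std::string> predPatternName(unsigned NumElements) {
  // vl1 .. vl8 cover every count from 1 to 8.
  if (NumElements >= 1 && NumElements <= 8)
    return "vl" + std::to_string(NumElements);
  // Assumption: above 8, only these power-of-two counts have a pattern.
  switch (NumElements) {
  case 16:
  case 32:
  case 64:
  case 128:
  case 256:
    return "vl" + std::to_string(NumElements);
  default:
    return std::nullopt;
  }
}

// Usage: 3 lanes map to vl3, while 9 and 10 lanes have no pattern.
inline void predPatternExamples() {
  assert(predPatternName(3).value() == "vl3");
  assert(!predPatternName(9).has_value());
  assert(!predPatternName(10).has_value());
}
```

Under this model vl10 is just as unavailable as vl9; 9 is simply the smallest count that fails.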

hassnaa-arm marked an inline comment as done. Nov 30 2022, 7:08 AM

Skip using interleaved_load for vl9, because the vl9 predicate is not available.

Harbormaster completed remote builds in B200267: Diff 478939. Nov 30 2022, 7:08 AM

LGTM with nit addressed

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13955	nit: you can just do `!getSVEPredPatternFromNumElements(NumElements))`

This revision is now accepted and ready to land. Nov 30 2022, 7:11 AM

hassnaa-arm marked an inline comment as done. Nov 30 2022, 7:57 AM

Add check for hasSVE() to avoid affecting NEON interleaved-access tests

Harbormaster completed remote builds in B200400: Diff 479131. Nov 30 2022, 6:09 PM

This revision was landed with ongoing or failed builds. Nov 30 2022, 6:31 PM

Closed by commit rG279c0a83aa22: [AArch64][SME]: Generate streaming-compatible code for ld2-alloca. (authored by Hassnaa Hamdi <hassnaa.hamdi@arm.com>).

This revision was automatically updated to reflect the committed changes.

Hassnaa Hamdi <hassnaa.hamdi@arm.com> mentioned this in rG45adca0f52af: [AArch64][SME]: Add precursory tests for D138791.

Hassnaa Hamdi <hassnaa.hamdi@arm.com> added a commit: rG279c0a83aa22: [AArch64][SME]: Generate streaming-compatible code for ld2-alloca..

Failed Tests (1):

LLVM :: CodeGen/AArch64/sve-streaming-mode-fixed-length-ld2-alloca.ll

Testing Time: 345.12s

Skipped          :    38
Unsupported      :  1357
Passed           : 88906
Expectedly Failed:   195
Failed           :     1

https://lab.llvm.org/buildbot/#/builders/85/builds/12665

david-arm added a reverting change: rG4a5ccf4e9342: Revert "[AArch64][SME]: Generate streaming-compatible code for ld2-alloca.". Dec 1 2022, 2:22 AM

david-arm mentioned this in rG06846596eb17: Revert "[AArch64][SME]: Add precursory tests for D138791". Dec 1 2022, 3:14 AM

uabelho added a subscriber: uabelho. Dec 1 2022, 3:37 AM

sdesmalen mentioned this in rGea11f4ff0ada: Reland "[AArch64][SME]: Add precursory tests for D138791". Dec 1 2022, 6:49 AM

sdesmalen mentioned this in rGd32c9e8384e9: Reland "[AArch64][SME]: Generate streaming-compatible code for ld2-alloca.".

Diff 479133

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp


bool AArch64TargetLowering::isLegalInterleavedAccessType(		bool AArch64TargetLowering::isLegalInterleavedAccessType(
VectorType *VecTy, const DataLayout &DL, bool &UseScalable) const {		VectorType *VecTy, const DataLayout &DL, bool &UseScalable) const {

unsigned VecSize = DL.getTypeSizeInBits(VecTy);		unsigned VecSize = DL.getTypeSizeInBits(VecTy);
unsigned ElSize = DL.getTypeSizeInBits(VecTy->getElementType());		unsigned ElSize = DL.getTypeSizeInBits(VecTy->getElementType());
unsigned NumElements = cast<FixedVectorType>(VecTy)->getNumElements();		unsigned NumElements = cast<FixedVectorType>(VecTy)->getNumElements();

UseScalable = false;		UseScalable = false;

		// Ensure that the predicate for this elemnts num is available.
		if (Subtarget->hasSVE() && !getSVEPredPatternFromNumElements(NumElements))
		return false;

// Ensure the number of vector elements is greater than 1.		// Ensure the number of vector elements is greater than 1.
if (NumElements < 2)		if (NumElements < 2)
return false;		return false;

// Ensure the element type is legal.		// Ensure the element type is legal.
if (ElSize != 8 && ElSize != 16 && ElSize != 32 && ElSize != 64)		if (ElSize != 8 && ElSize != 16 && ElSize != 32 && ElSize != 64)
return false;		return false;

if (Subtarget->useSVEForFixedLengthVectors() &&		if (Subtarget->forceStreamingCompatibleSVE() ||
		(Subtarget->useSVEForFixedLengthVectors() &&
(VecSize % Subtarget->getMinSVEVectorSizeInBits() == 0 ||		(VecSize % Subtarget->getMinSVEVectorSizeInBits() == 0 ||
(VecSize < Subtarget->getMinSVEVectorSizeInBits() &&		(VecSize < Subtarget->getMinSVEVectorSizeInBits() &&
isPowerOf2_32(NumElements) && VecSize > 128))) {		isPowerOf2_32(NumElements) && VecSize > 128)))) {
UseScalable = true;		UseScalable = true;
return true;		return true;
}		}

// Ensure the total vector size is 64 or a multiple of 128. Types larger than		// Ensure the total vector size is 64 or a multiple of 128. Types larger than
// 128 will be split into multiple interleaved accesses.		// 128 will be split into multiple interleaved accesses.
return VecSize == 64 || VecSize % 128 == 0;		return VecSize == 64 || VecSize % 128 == 0;
}		}
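
The combined effect of the new condition can be hard to read from the side-by-side view, so here is a standalone model of the updated check. It is a sketch only: `SubtargetModel`, `hasPredPattern` and `isLegalInterleavedModel` are hypothetical names, the subtarget queries are reduced to plain fields, and the vector size is approximated as `NumElements * ElSize` rather than taken from the data layout.

```cpp
#include <cassert>

// Hypothetical stand-in for the AArch64Subtarget queries used in the diff.
struct SubtargetModel {
  bool HasSVE;
  bool ForceStreamingCompatibleSVE;
  bool UseSVEForFixedLengthVectors;
  unsigned MinSVEVectorSizeInBits;
};

// Assumption: fixed-count predicate patterns exist for 1..8 and for
// 16, 32, 64, 128 and 256 lanes.
inline bool hasPredPattern(unsigned N) {
  return (N >= 1 && N <= 8) || N == 16 || N == 32 || N == 64 || N == 128 ||
         N == 256;
}

inline bool isPowerOf2(unsigned N) { return N != 0 && (N & (N - 1)) == 0; }

// Mirrors the structure of the updated check in the diff above.
inline bool isLegalInterleavedModel(const SubtargetModel &ST,
                                    unsigned NumElements, unsigned ElSize,
                                    bool &UseScalable) {
  // Assumption: no padding, so the size is a simple product.
  unsigned VecSize = NumElements * ElSize;
  UseScalable = false;

  // New in this revision: SVE needs a predicate pattern for the lane count.
  if (ST.HasSVE && !hasPredPattern(NumElements))
    return false;
  if (NumElements < 2)
    return false;
  if (ElSize != 8 && ElSize != 16 && ElSize != 32 && ElSize != 64)
    return false;

  // Streaming-compatible mode always takes the scalable (SVE) path.
  if (ST.ForceStreamingCompatibleSVE ||
      (ST.UseSVEForFixedLengthVectors &&
       (VecSize % ST.MinSVEVectorSizeInBits == 0 ||
        (VecSize < ST.MinSVEVectorSizeInBits && isPowerOf2(NumElements) &&
         VecSize > 128)))) {
    UseScalable = true;
    return true;
  }
  // Neon path: 64 bits or a multiple of 128 bits.
  return VecSize == 64 || VecSize % 128 == 0;
}

// Usage: <3 x i8> is accepted in streaming-compatible mode, <9 x i8> is not.
inline void legalityExamples() {
  SubtargetModel Streaming{true, true, false, 0};
  bool UseScalable = false;
  assert(isLegalInterleavedModel(Streaming, 3, 8, UseScalable) && UseScalable);
  assert(!isLegalInterleavedModel(Streaming, 9, 8, UseScalable));
}
```

Because the predicate check is guarded by `HasSVE`, a Neon-only subtarget in this model still follows the original size rule, which is the intent stated in the final update ("avoid affecting NEON interleaved-access tests").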

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-ld2-alloca.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -force-streaming-compatible-sve < %s | FileCheck %s			; RUN: llc -force-streaming-compatible-sve < %s | FileCheck %s

	target triple = "aarch64-unknown-linux-gnu"			target triple = "aarch64-unknown-linux-gnu"

	declare void @def(ptr)			declare void @def(ptr)

	define void @alloc_v4i8(ptr %st_ptr) #0 {			define void @alloc_v4i8(ptr %st_ptr) #0 {
	; CHECK-LABEL: alloc_v4i8:			; CHECK-LABEL: alloc_v4i8:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: sub sp, sp, #32			; CHECK-NEXT: sub sp, sp, #32
	; CHECK-NEXT: stp x30, x19, [sp, #16] // 16-byte Folded Spill			; CHECK-NEXT: stp x30, x19, [sp, #16] // 16-byte Folded Spill
	; CHECK-NEXT: mov x19, x0			; CHECK-NEXT: mov x19, x0
	; CHECK-NEXT: add x0, sp, #12			; CHECK-NEXT: add x0, sp, #12
	; CHECK-NEXT: bl def			; CHECK-NEXT: bl def
	; CHECK-NEXT: ldr s0, [sp, #12]			; CHECK-NEXT: add x8, sp, #12
	; CHECK-NEXT: ptrue p0.h, vl4			; CHECK-NEXT: ptrue p0.b, vl2
	; CHECK-NEXT: uunpklo z0.h, z0.b			; CHECK-NEXT: ld2b { z0.b, z1.b }, p0/z, [x8]
	; CHECK-NEXT: fmov w8, s0			; CHECK-NEXT: ptrue p0.s, vl2
	; CHECK-NEXT: mov z1.h, z0.h[3]			; CHECK-NEXT: mov z2.b, z0.b[1]
	; CHECK-NEXT: mov z2.h, z0.h[1]
	; CHECK-NEXT: mov z0.h, z0.h[2]
	; CHECK-NEXT: fmov w9, s1
	; CHECK-NEXT: fmov w10, s2
	; CHECK-NEXT: strh w8, [sp]
	; CHECK-NEXT: fmov w8, s0			; CHECK-NEXT: fmov w8, s0
	; CHECK-NEXT: strh w9, [sp, #6]			; CHECK-NEXT: fmov w9, s2
	; CHECK-NEXT: strh w10, [sp, #4]			; CHECK-NEXT: stp w8, w9, [sp]
	; CHECK-NEXT: strh w8, [sp, #2]
	; CHECK-NEXT: ldr d0, [sp]			; CHECK-NEXT: ldr d0, [sp]
	; CHECK-NEXT: st1b { z0.h }, p0, [x19]			; CHECK-NEXT: st1b { z0.s }, p0, [x19]
	; CHECK-NEXT: ldp x30, x19, [sp, #16] // 16-byte Folded Reload			; CHECK-NEXT: ldp x30, x19, [sp, #16] // 16-byte Folded Reload
	; CHECK-NEXT: add sp, sp, #32			; CHECK-NEXT: add sp, sp, #32
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%alloc = alloca [4 x i8]			%alloc = alloca [4 x i8]
	call void @def(ptr %alloc)			call void @def(ptr %alloc)
	%load = load <4 x i8>, ptr %alloc			%load = load <4 x i8>, ptr %alloc
	%strided.vec = shufflevector <4 x i8> %load, <4 x i8> poison, <4 x i32> <i32 0, i32 2, i32 1, i32 3>			%strided.vec = shufflevector <4 x i8> %load, <4 x i8> poison, <2 x i32> <i32 0, i32 2>
	store <4 x i8> %strided.vec, ptr %st_ptr			store <2 x i8> %strided.vec, ptr %st_ptr
	ret void			ret void
	}			}

	define void @alloc_v6i8(ptr %st_ptr) #0 {			define void @alloc_v6i8(ptr %st_ptr) #0 {
	; CHECK-LABEL: alloc_v6i8:			; CHECK-LABEL: alloc_v6i8:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: sub sp, sp, #32			; CHECK-NEXT: sub sp, sp, #48
	; CHECK-NEXT: stp x30, x19, [sp, #16] // 16-byte Folded Spill			; CHECK-NEXT: stp x30, x19, [sp, #32] // 16-byte Folded Spill
	; CHECK-NEXT: mov x19, x0			; CHECK-NEXT: mov x19, x0
	; CHECK-NEXT: add x0, sp, #8			; CHECK-NEXT: add x0, sp, #24
	; CHECK-NEXT: bl def			; CHECK-NEXT: bl def
	; CHECK-NEXT: ldr d0, [sp, #8]			; CHECK-NEXT: add x8, sp, #24
	; CHECK-NEXT: mov z1.b, z0.b[4]			; CHECK-NEXT: ptrue p0.b, vl3
	; CHECK-NEXT: mov z2.b, z0.b[5]			; CHECK-NEXT: ld2b { z0.b, z1.b }, p0/z, [x8]
	; CHECK-NEXT: fmov w8, s0			; CHECK-NEXT: ptrue p0.h, vl4
	; CHECK-NEXT: fmov w9, s1			; CHECK-NEXT: fmov w8, s1
	; CHECK-NEXT: fmov w10, s2			; CHECK-NEXT: mov z2.b, z1.b[3]
	; CHECK-NEXT: mov z3.b, z0.b[3]			; CHECK-NEXT: mov z3.b, z1.b[2]
	; CHECK-NEXT: mov z4.b, z0.b[1]			; CHECK-NEXT: mov z0.b, z1.b[1]
	; CHECK-NEXT: mov z0.b, z0.b[2]			; CHECK-NEXT: fmov w9, s2
	; CHECK-NEXT: strb w8, [sp]			; CHECK-NEXT: fmov w10, s3
	; CHECK-NEXT: fmov w8, s3			; CHECK-NEXT: strh w8, [sp, #8]
	; CHECK-NEXT: strb w9, [sp, #5]
	; CHECK-NEXT: fmov w9, s4
	; CHECK-NEXT: strb w10, [sp, #4]
	; CHECK-NEXT: fmov w10, s0
	; CHECK-NEXT: strb w8, [sp, #3]
	; CHECK-NEXT: strb w9, [sp, #2]
	; CHECK-NEXT: strb w10, [sp, #1]
	; CHECK-NEXT: ldr d0, [sp]
	; CHECK-NEXT: mov z1.h, z0.h[2]
	; CHECK-NEXT: fmov w8, s0			; CHECK-NEXT: fmov w8, s0
	; CHECK-NEXT: fmov w9, s1			; CHECK-NEXT: strh w9, [sp, #14]
	; CHECK-NEXT: str w8, [x19]			; CHECK-NEXT: strh w10, [sp, #12]
	; CHECK-NEXT: strh w9, [x19, #4]			; CHECK-NEXT: strh w8, [sp, #10]
	; CHECK-NEXT: ldp x30, x19, [sp, #16] // 16-byte Folded Reload			; CHECK-NEXT: add x8, sp, #20
	; CHECK-NEXT: add sp, sp, #32			; CHECK-NEXT: ldr d0, [sp, #8]
				; CHECK-NEXT: st1b { z0.h }, p0, [x8]
				; CHECK-NEXT: ldrh w8, [sp, #20]
				; CHECK-NEXT: strb w10, [x19, #2]
				; CHECK-NEXT: strh w8, [x19]
				; CHECK-NEXT: ldp x30, x19, [sp, #32] // 16-byte Folded Reload
				; CHECK-NEXT: add sp, sp, #48
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%alloc = alloca [6 x i8]			%alloc = alloca [6 x i8]
	call void @def(ptr %alloc)			call void @def(ptr %alloc)
	%load = load <6 x i8>, ptr %alloc			%load = load <6 x i8>, ptr %alloc
	%strided.vec = shufflevector <6 x i8> %load, <6 x i8> poison, <6 x i32> <i32 0, i32 2, i32 1, i32 3, i32 5, i32 4>			%strided.vec = shufflevector <6 x i8> %load, <6 x i8> poison, <3 x i32> <i32 1, i32 3, i32 5>
	store <6 x i8> %strided.vec, ptr %st_ptr			store <3 x i8> %strided.vec, ptr %st_ptr
	ret void			ret void
	}			}

				define void @alloc_v32i8(ptr %st_ptr) #0 {
				; CHECK-LABEL: alloc_v32i8:
				; CHECK: // %bb.0:
				; CHECK-NEXT: sub sp, sp, #64
				; CHECK-NEXT: stp x30, x19, [sp, #48] // 16-byte Folded Spill
				; CHECK-NEXT: mov x19, x0
				; CHECK-NEXT: add x0, sp, #16
				; CHECK-NEXT: bl def
				; CHECK-NEXT: ldp q0, q1, [sp, #16]
				; CHECK-NEXT: mov z2.b, z0.b[14]
				; CHECK-NEXT: mov z3.b, z0.b[12]
				; CHECK-NEXT: fmov w8, s0
				; CHECK-NEXT: fmov w9, s2
				; CHECK-NEXT: fmov w10, s3
				; CHECK-NEXT: mov z4.b, z0.b[10]
				; CHECK-NEXT: mov z5.b, z0.b[8]
				; CHECK-NEXT: mov z6.b, z0.b[6]
				; CHECK-NEXT: strb w8, [sp]
				; CHECK-NEXT: fmov w8, s4
				; CHECK-NEXT: strb w9, [sp, #7]
				; CHECK-NEXT: fmov w9, s5
				; CHECK-NEXT: strb w10, [sp, #6]
				; CHECK-NEXT: fmov w10, s6
				; CHECK-NEXT: mov z7.b, z0.b[4]
				; CHECK-NEXT: mov z0.b, z0.b[2]
				; CHECK-NEXT: strb w8, [sp, #5]
				; CHECK-NEXT: fmov w8, s7
				; CHECK-NEXT: strb w9, [sp, #4]
				; CHECK-NEXT: fmov w9, s0
				; CHECK-NEXT: strb w10, [sp, #3]
				; CHECK-NEXT: fmov w10, s1
				; CHECK-NEXT: strb w8, [sp, #2]
				; CHECK-NEXT: strb w9, [sp, #1]
				; CHECK-NEXT: strb w10, [x19, #8]
				; CHECK-NEXT: ldr q0, [sp]
				; CHECK-NEXT: fmov x8, d0
				; CHECK-NEXT: str x8, [x19]
				; CHECK-NEXT: ldp x30, x19, [sp, #48] // 16-byte Folded Reload
				; CHECK-NEXT: add sp, sp, #64
				; CHECK-NEXT: ret
				%alloc = alloca [32 x i8]
				call void @def(ptr %alloc)
				%load = load <32 x i8>, ptr %alloc
				%strided.vec = shufflevector <32 x i8> %load, <32 x i8> poison, <9 x i32> <i32 0, i32 2, i32 4, i32 6, i32 8, i32 10, i32 12, i32 14, i32 16>
				store <9 x i8> %strided.vec, ptr %st_ptr
				ret void
				}


	define void @alloc_v8f64(ptr %st_ptr) #0 {			define void @alloc_v8f64(ptr %st_ptr) #0 {
	; CHECK-LABEL: alloc_v8f64:			; CHECK-LABEL: alloc_v8f64:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: sub sp, sp, #96			; CHECK-NEXT: str x29, [sp, #-32]! // 8-byte Folded Spill
	; CHECK-NEXT: stp x20, x19, [sp, #80] // 16-byte Folded Spill			; CHECK-NEXT: stp x30, x19, [sp, #16] // 16-byte Folded Spill
				; CHECK-NEXT: addvl sp, sp, #-1
				; CHECK-NEXT: sub sp, sp, #64
	; CHECK-NEXT: mov x19, x0			; CHECK-NEXT: mov x19, x0
	; CHECK-NEXT: mov x0, sp			; CHECK-NEXT: mov x0, sp
	; CHECK-NEXT: str x30, [sp, #64] // 8-byte Folded Spill
	; CHECK-NEXT: mov x20, sp
	; CHECK-NEXT: bl def			; CHECK-NEXT: bl def
	; CHECK-NEXT: ld2 { v0.2d, v1.2d }, [x20], #32			; CHECK-NEXT: cntd x8
	; CHECK-NEXT: ldr x30, [sp, #64] // 8-byte Folded Reload			; CHECK-NEXT: ptrue p0.d, vl4
	; CHECK-NEXT: ld2 { v2.2d, v3.2d }, [x20]			; CHECK-NEXT: sub x8, x8, #2
				; CHECK-NEXT: ld2d { z0.d, z1.d }, p0/z, [sp]
				; CHECK-NEXT: mov w9, #2
				; CHECK-NEXT: cmp x8, #2
				; CHECK-NEXT: csel x8, x8, x9, lo
				; CHECK-NEXT: add x10, sp, #64
				; CHECK-NEXT: lsl x8, x8, #3
				; CHECK-NEXT: ptrue p0.d
				; CHECK-NEXT: add x9, sp, #64
				; CHECK-NEXT: st1d { z0.d }, p0, [x10]
				; CHECK-NEXT: ldr q2, [x9, x8]
	; CHECK-NEXT: stp q0, q2, [x19]			; CHECK-NEXT: stp q0, q2, [x19]
	; CHECK-NEXT: ldp x20, x19, [sp, #80] // 16-byte Folded Reload			; CHECK-NEXT: addvl sp, sp, #1
	; CHECK-NEXT: add sp, sp, #96			; CHECK-NEXT: add sp, sp, #64
				; CHECK-NEXT: ldp x30, x19, [sp, #16] // 16-byte Folded Reload
				; CHECK-NEXT: ldr x29, [sp], #32 // 8-byte Folded Reload
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%alloc = alloca [8 x double]			%alloc = alloca [8 x double]
	call void @def(ptr %alloc)			call void @def(ptr %alloc)
	%load = load <8 x double>, ptr %alloc			%load = load <8 x double>, ptr %alloc
	%strided.vec = shufflevector <8 x double> %load, <8 x double> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>			%strided.vec = shufflevector <8 x double> %load, <8 x double> poison, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
	store <4 x double> %strided.vec, ptr %st_ptr			store <4 x double> %strided.vec, ptr %st_ptr
	ret void			ret void
	}			}

	attributes #0 = { "target-features"="+sve" nounwind}			attributes #0 = { "target-features"="+sve" nounwind}

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64][SME]: Generate streaming-compatible code for ld2-alloca.
ClosedPublic

