This is an archive of the discontinued LLVM Phabricator instance.

Differential D147040

[AArch64][CodeGen] Use interleave store for streaming compatible functions
ClosedPublic

Authored by CarolineConcatto on Mar 28 2023, 4:16 AM.

Details

Reviewers

hassnaa-arm
david-arm
sdesmalen

Commits

rGc8192670ecc7: [AArch64][CodeGen] Use interleave store for streaming compatible functions

Summary

The previous patch, D135564, was too conservative: it disabled interleaved stores
entirely for streaming-compatible functions/mode.

This patch allows interleaved stores in that mode, but lowers them using scalable vectors.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

CarolineConcatto created this revision. Mar 28 2023, 4:16 AM

Herald added a project: Restricted Project. Mar 28 2023, 4:16 AM

Herald added subscribers: hiraditya, kristof.beyls.

CarolineConcatto requested review of this revision. Mar 28 2023, 4:16 AM

Herald added subscribers: llvm-commits, alextsao1999.

CarolineConcatto added reviewers: hassnaa-arm, david-arm, sdesmalen. Mar 28 2023, 4:16 AM

sdesmalen added inline comments. Mar 28 2023, 4:31 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14583	I assume we'll need a similar capability for interleaved loads?
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-shuffle.ll
13	It would be nice to have some more test coverage. Maybe something where the test doesn't depend on a splat, and some different vector lengths (e.g. where vector legalisation is required).
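
The legalisation case can be pictured with a small scalar model. The sketch below assumes (this is not stated in the comment) that a single interleaved store handles four 32-bit lanes per input, so two 8-lane inputs need two stores, the second starting 8 elements further into memory. `storeInterleaved2` and `storeInterleavedSplit` are hypothetical helper names, not LLVM functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model of one factor-2 interleaved store: writes `lanes`
// elements taken from a[first..] and b[first..] alternately to mem[offset..].
void storeInterleaved2(std::vector<int32_t> &mem, std::size_t offset,
                       const std::vector<int32_t> &a,
                       const std::vector<int32_t> &b, std::size_t first,
                       std::size_t lanes) {
  assert(first + lanes <= a.size() && first + lanes <= b.size());
  assert(offset + 2 * lanes <= mem.size());
  for (std::size_t i = 0; i < lanes; ++i) {
    mem[offset + 2 * i] = a[first + i];
    mem[offset + 2 * i + 1] = b[first + i];
  }
}

// Models legalisation by splitting: the wide interleaved store is emitted as
// several narrower ones, each covering `chunkLanes` lanes of both inputs.
// Chunk k starts at lane k * chunkLanes and at memory element 2 * k * chunkLanes.
std::vector<int32_t> storeInterleavedSplit(const std::vector<int32_t> &a,
                                           const std::vector<int32_t> &b,
                                           std::size_t chunkLanes) {
  assert(a.size() == b.size() && "inputs must have the same lane count");
  assert(chunkLanes != 0 && a.size() % chunkLanes == 0);
  std::vector<int32_t> mem(2 * a.size(), 0);
  for (std::size_t first = 0; first < a.size(); first += chunkLanes)
    storeInterleaved2(mem, 2 * first, a, b, first, chunkLanes);
  return mem;
}
```

In this model, splitting two 8-lane inputs into 4-lane chunks produces the same memory image as one 8-lane interleave, which is the property a legalised test would exercise.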

Harbormaster completed remote builds in B222210: Diff 508960. Mar 28 2023, 5:03 AM

david-arm added inline comments. Mar 29 2023, 1:51 AM

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-shuffle.ll
15	Interestingly this code is faster than using `st2w`! That makes me wonder if we're missing some DAG combines somewhere for interleaving stores of splats. It's probably something unlikely to happen in practice though. I don't think you have to do anything in this patch, but it's something we may want to revisit at some point. I agree with @sdesmalen here - it would be good to have a more generic test case (without splats) here because otherwise this particular test is fragile.

Add more tests to interleave store with streaming compatible function

CarolineConcatto added inline comments. Mar 29 2023, 5:13 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14583	Thank you for pointing that out. We don't need to do anything in either function, because isLegalInterleavedAccessType already sets UseScalable to true when Subtarget->forceStreamingCompatibleSVE() returns true.
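
The behaviour described in this reply can be sketched as a standalone model. Everything below is hypothetical: `SubtargetModel` and `isLegalInterleavedAccessTypeModel` are illustration-only names, and the non-streaming fallback (64 bits or a multiple of 128 bits) is an assumed simplification, not the full set of conditions in LLVM.

```cpp
#include <cassert>

// Hypothetical stand-in for the subtarget query mentioned in the reply.
struct SubtargetModel {
  bool ForceStreamingCompatibleSVE;
};

// Simplified model of the legality check. The return value says whether the
// interleaved access is considered legal; UseScalable reports whether it
// should be lowered with scalable (SVE) vectors.
bool isLegalInterleavedAccessTypeModel(const SubtargetModel &ST,
                                       unsigned VecSizeInBits,
                                       bool &UseScalable) {
  assert(VecSizeInBits != 0 && "vector type must have a non-zero size");
  UseScalable = false;

  // Streaming-compatible functions: always request the scalable lowering,
  // so callers do not need a separate early exit.
  if (ST.ForceStreamingCompatibleSVE) {
    UseScalable = true;
    return true;
  }

  // Assumed placeholder for the non-streaming path: accept 64-bit vectors
  // and multiples of 128 bits.
  return VecSizeInBits == 64 || VecSizeInBits % 128 == 0;
}
```

Because the flag is set inside the legality check, both the load and the store lowering can share it, which is the point made in the reply.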

Thanks for the new tests @CarolineConcatto! I just had a couple more suggestions on possibly improving the tests a bit more ...

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-shuffle.ll

I don't think you need the second %v2 argument here, since it's never actually used. You can rewrite the IR below to just be:

%interleaved = shufflevector <8 x i32> %v1, <8 x i32> undef, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
store <8 x i32> %interleaved, ptr %a, align 1

This test has the same problem as @hang_when_merging_stores_after_legalisation, because it's using splats. I think you can do this:

define void @interleave_store_legalization(ptr %p, <8 x i32> %a, <8 x i32> %b) #0 {
  %interleaved = shufflevector <8 x i32> %a, <8 x i32> %b, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11, i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
  store <16 x i32> %interleaved, ptr %p, align 1
  ret void
}
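
Both masks suggested above follow the same pattern: element `i*Factor + j` of the mask selects lane `i` of sub-vector `j`. A minimal sketch of that check, assuming sub-vectors that start at lane 0; `isSimpleInterleaveMask` is a hypothetical name and is deliberately narrower than whatever LLVM's interleaved access pass accepts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified check (not LLVM's implementation): a mask of
// Factor * LaneLen indices interleaves Factor sub-vectors of LaneLen lanes
// each when mask element i*Factor + j equals j*LaneLen + i.
bool isSimpleInterleaveMask(const std::vector<int> &Mask, unsigned Factor) {
  assert(Factor >= 2 && "interleave factor must be at least 2");
  if (Mask.empty() || Mask.size() % Factor != 0)
    return false;

  const std::size_t LaneLen = Mask.size() / Factor;
  for (std::size_t i = 0; i < LaneLen; ++i)
    for (std::size_t j = 0; j < Factor; ++j)
      if (Mask[i * Factor + j] != static_cast<int>(j * LaneLen + i))
        return false;
  return true;
}
```

Under this model the 8-element mask `<0, 4, 1, 5, 2, 6, 3, 7>` and the 16-element mask `<0, 8, 1, 9, ...>` are both factor-2 interleaves, while an identity mask is not.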

Fix the test with wrong size of the input vector and remove the splat in the second test

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-shuffle.ll
21	Yes, you are right. I should have changed the size of the input vectors. I think now it is better. Right?
35	I don't fully understand why the splat is not a good example, but I did change it. Hope it is better now.

LGTM! Eccelente! Thanks for making the changes @CarolineConcatto.

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-shuffle.ll
35	Well, it's for the same reason as `@hang_when_merging_stores_after_legalisation`, because the splat version may get optimised in future to be a pair of `stp` instructions, that's all.
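
The reason a splat makes the test fragile can be seen in a scalar model: interleaving the two halves of a vector whose lanes are all equal leaves the data unchanged, so an optimiser is free to replace the interleaved store with plain contiguous stores. The helpers `applyShuffle` and `interleaveHalves` below are hypothetical names used only for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Scalar model of a single-input shufflevector: Out[k] = In[Mask[k]].
std::vector<int32_t> applyShuffle(const std::vector<int32_t> &In,
                                  const std::vector<std::size_t> &Mask) {
  std::vector<int32_t> Out;
  Out.reserve(Mask.size());
  for (std::size_t Idx : Mask) {
    assert(Idx < In.size() && "mask index out of range");
    Out.push_back(In[Idx]);
  }
  return Out;
}

// Interleaves the low and high halves of an 8-lane vector using the mask
// <0, 4, 1, 5, 2, 6, 3, 7> from the tests in this review.
std::vector<int32_t> interleaveHalves(const std::vector<int32_t> &V) {
  assert(V.size() == 8 && "this sketch only models 8-lane vectors");
  return applyShuffle(V, {0, 4, 1, 5, 2, 6, 3, 7});
}
```

For a splat input the result equals the input, whereas an input with distinct lanes is genuinely reordered, so only the latter forces an interleaving store to appear in the output.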

This revision is now accepted and ready to land. Mar 29 2023, 6:06 AM

Harbormaster completed remote builds in B222492: Diff 509330. Mar 29 2023, 9:45 AM

dtemirbulatov added a subscriber: dtemirbulatov. Apr 3 2023, 6:08 AM

sdesmalen accepted this revision. Apr 12 2023, 5:36 AM

This revision was landed with ongoing or failed builds. Apr 13 2023, 1:44 AM

Closed by commit rGc8192670ecc7: [AArch64][CodeGen] Use interleave store for streaming compatible functions (authored by CarolineConcatto).

This revision was automatically updated to reflect the committed changes.

CarolineConcatto added a commit: rGc8192670ecc7: [AArch64][CodeGen] Use interleave store for streaming compatible functions.

Revision Contents

Path                                                                    Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp                         4 lines
llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-shuffle.ll    40 lines

Diff 513103

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

 /// <4, 32, 16, 5, 33, 17, 6, 34, 18, 7, 35, 19>
 /// store <12 x i32> %i.vec, <12 x i32>* %ptr
 ///
 /// Into:
 /// %sub.v0 = shuffle <32 x i32> %v0, <32 x i32> v1, <4, 5, 6, 7>
 /// %sub.v1 = shuffle <32 x i32> %v0, <32 x i32> v1, <32, 33, 34, 35>
 /// %sub.v2 = shuffle <32 x i32> %v0, <32 x i32> v1, <16, 17, 18, 19>
 /// call void llvm.aarch64.neon.st3(%sub.v0, %sub.v1, %sub.v2, %ptr)
 bool AArch64TargetLowering::lowerInterleavedStore(StoreInst *SI,
                                                   ShuffleVectorInst *SVI,
                                                   unsigned Factor) const {
-  // Skip if streaming compatible SVE is enabled, because it generates invalid
-  // code in streaming mode when SVE length is not specified.
-  if (Subtarget->forceStreamingCompatibleSVE())
-    return false;

   assert(Factor >= 2 && Factor <= getMaxSupportedInterleaveFactor() &&
          "Invalid interleave factor");

   auto *VecTy = cast<FixedVectorType>(SVI->getType());
   assert(VecTy->getNumElements() % Factor == 0 && "Invalid interleaved store");

   unsigned LaneLen = VecTy->getNumElements() / Factor;

llvm/test/CodeGen/AArch64/sve-streaming-mode-fixed-length-shuffle.ll

 ; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
 ; RUN: llc -force-streaming-compatible-sve < %s | FileCheck %s

 target triple = "aarch64-unknown-linux-gnu"

-; Currently there is no custom lowering for vector shuffles operating on types
-; bigger than NEON. However, having no support opens us up to a code generator
-; hang when expanding BUILD_VECTOR. Here we just validate the promblematic case
-; successfully exits code generation.
 define void @hang_when_merging_stores_after_legalisation(ptr %a, <2 x i32> %b) #0 {
 ; CHECK-LABEL: hang_when_merging_stores_after_legalisation:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: // kill: def $d0 killed $d0 def $z0
+; CHECK-NEXT: ptrue p0.s, vl4
 ; CHECK-NEXT: mov z0.s, s0
-; CHECK-NEXT: stp q0, q0, [x0]
+; CHECK-NEXT: mov z1.d, z0.d
+; CHECK-NEXT: st2w { z0.s, z1.s }, p0, [x0]
 ; CHECK-NEXT: ret
   %splat = shufflevector <2 x i32> %b, <2 x i32> undef, <8 x i32> zeroinitializer
   %interleaved.vec = shufflevector <8 x i32> %splat, <8 x i32> undef, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
   store <8 x i32> %interleaved.vec, ptr %a, align 4
   ret void
 }

+define void @interleave_store_without_splat(ptr %a, <4 x i32> %v1, <4 x i32> %v2) #0 {
+; CHECK-LABEL: interleave_store_without_splat:
+; CHECK: // %bb.0:
+; CHECK-NEXT: // kill: def $q1 killed $q1 killed $z0_z1 def $z0_z1
+; CHECK-NEXT: ptrue p0.s, vl4
+; CHECK-NEXT: // kill: def $q0 killed $q0 killed $z0_z1 def $z0_z1
+; CHECK-NEXT: st2w { z0.s, z1.s }, p0, [x0]
+; CHECK-NEXT: ret
+  %shuffle = shufflevector <4 x i32> %v1, <4 x i32> %v2, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
+  %interleaved = shufflevector <8 x i32> %shuffle, <8 x i32> undef, <8 x i32> <i32 0, i32 4, i32 1, i32 5, i32 2, i32 6, i32 3, i32 7>
+  store <8 x i32> %interleaved, ptr %a, align 1
+  ret void
+}
+
+define void @interleave_store_legalization(ptr %a, <8 x i32> %v1, <8 x i32> %v2) #0 {
+; CHECK-LABEL: interleave_store_legalization:
+; CHECK: // %bb.0:
+; CHECK-NEXT: mov x8, #8 // =0x8
+; CHECK-NEXT: // kill: def $q3 killed $q3 killed $z2_z3 def $z2_z3
+; CHECK-NEXT: mov z5.d, z2.d
+; CHECK-NEXT: mov z2.d, z1.d
+; CHECK-NEXT: mov z4.d, z0.d
+; CHECK-NEXT: ptrue p0.s, vl4
+; CHECK-NEXT: st2w { z4.s, z5.s }, p0, [x0]
+; CHECK-NEXT: st2w { z2.s, z3.s }, p0, [x0, x8, lsl #2]
+; CHECK-NEXT: ret
+  %interleaved.vec = shufflevector <8 x i32> %v1, <8 x i32> %v2, <16 x i32> <i32 0, i32 8, i32 1, i32 9, i32 2, i32 10, i32 3, i32 11,
+                                                                             i32 4, i32 12, i32 5, i32 13, i32 6, i32 14, i32 7, i32 15>
+  store <16 x i32> %interleaved.vec, ptr %a, align 4
+  ret void
+}
+
 ; Ensure we don't crash when trying to lower a shuffle via an extract
 define void @crash_when_lowering_extract_shuffle(ptr %dst, i1 %cond) #0 {
 ; CHECK-LABEL: crash_when_lowering_extract_shuffle:
 ; CHECK: // %bb.0:
 ; CHECK-NEXT: ret
   %broadcast.splat = shufflevector <32 x i1> zeroinitializer, <32 x i1> zeroinitializer, <32 x i32> zeroinitializer
   br i1 %cond, label %exit, label %vector.body

(remaining 11 lines not shown)
