This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Reuse larger DUPLANE if available
Closed, Public

Authored by jaykang10 on Jul 18 2023, 6:00 AM.

Details

Summary

When combining a DUP node, try to reuse a larger DUPLANE if one is available.
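
For illustration, a minimal IR sketch of the kind of input this targets (function and value names are hypothetical): both shufflevectors splat the same lane of %a, so instead of emitting two DUPs, the narrow splat can be taken as the low half of the wider DUPLANE.

define <8 x i16> @dup_reuse(<8 x i16> %a, <8 x i16> %b, <4 x i16> %c, ptr %p) {
  ; A wide splat of lane 0 and a narrow splat of the same lane.
  %splat.wide = shufflevector <8 x i16> %a, <8 x i16> poison, <8 x i32> zeroinitializer
  %splat.low = shufflevector <8 x i16> %a, <8 x i16> poison, <4 x i32> zeroinitializer
  ; One use at each width keeps both splats alive.
  %r1 = mul <8 x i16> %b, %splat.wide
  %r2 = mul <4 x i16> %c, %splat.low
  store <4 x i16> %r2, ptr %p
  ret <8 x i16> %r1
}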

Diff Detail

Event Timeline

jaykang10 created this revision. Jul 18 2023, 6:00 AM
Herald added a project: Restricted Project. Jul 18 2023, 6:00 AM
jaykang10 requested review of this revision. Jul 18 2023, 6:00 AM
dmgreen added inline comments. Jul 18 2023, 7:48 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
22119–22122

Should this be excluded for everything that isn't DUP?

llvm/lib/Target/AArch64/AArch64InstrInfo.td
6950 (On Diff #541489)

Would this be needed for all the "AdvSIMD indexed element" operations?

Is there some way to make that fairly generic? Maybe a PatFrag that matches either v4i16 (AArch64duplane16(.. or extract_subvector (AArch64duplane16 (.., that can be used in the instruction patterns like those in SIMDVectorIndexedHSTied.

jaykang10 added inline comments. Jul 18 2023, 8:27 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
22119–22122

I do not know how the performPostLD1Combine function works in detail, but it checks whether the input node is an ISD::LOAD and returns SDValue() if it is not.
To be safe, let me update the code so that performPostLD1Combine is not run on DUPLANE nodes.

llvm/lib/Target/AArch64/AArch64InstrInfo.td
6950 (On Diff #541489)

I was just not sure whether the patterns are needed for all SIMD indexed operations.
If it is OK, let me move the patterns into SIMDIndexedLongSD.

jaykang10 added inline comments. Jul 18 2023, 8:55 AM
llvm/lib/Target/AArch64/AArch64InstrInfo.td
6950 (On Diff #541489)

Additionally, when I added def v2i32_indexed_low in SIMDVectorIndexedLongSD, I got a decoding conflict, shown below, because the new definition produces the same encoding as the existing one, so I added Pats instead.

Decoding Conflict:
		0000111110......1010.0..........
		0000111110......1010............
		0000111110......................
		...0111110......................
		...011..........................
		................................
	SMULLv2i32_indexed 0000111110______1010_0__________
	SMULLv2i32_indexed_low 0000111110______1010_0__________
Decoding Conflict:
		0010111110......1010.0..........
		0010111110......1010............
		0010111110......................
		...0111110......................
		...011..........................
		................................
	UMULLv2i32_indexed 0010111110______1010_0__________
	UMULLv2i32_indexed_low 0010111110______1010_0__________
jaykang10 updated this revision to Diff 541591. Jul 18 2023, 9:07 AM

Following @dmgreen's comment, updated patch.

Could we add a PatFrags that looks something like this:

def dup_v8i16 : PatFrags<(ops node:$LHS, node:$RHS),
                         [(v4i16 (extract_subvector (v8i16 (AArch64duplane16 (v8i16 node:$LHS), node:$RHS)), (i64 0))),
                          (v4i16 (AArch64duplane16 (v8i16 node:$LHS), node:$RHS))]>;

So that it matches either a v4i16 DUPLANE or a subvector_extract of a v8i16 DUPLANE. Maybe call it something like AArch64duplanev416, I'm not sure. It could then be used in the existing set patterns for all the lane instructions. I think many of them might hit the same problems we see in the tests, but we don't happen to have tests for every instruction.

[(set (v4i32 V128:$Rd),
    (OpNode (v4i16 V64:$Rn),
            (dup_v8i16 (v8i16 V128_lo:$Rm), VectorIndexH:$idx)))]> {

Yep, I agree with you. If possible, I did not want to touch the existing patterns because that could cause other regressions.
Let me update the patterns with PatFrags.

jaykang10 updated this revision to Diff 542072. Jul 19 2023, 9:28 AM

Following @dmgreen's comment, updated patterns with PatFrags.

Thanks. Looks pretty good. Do we need to handle the other "indexing" operations in the same way? For example, something like the code below, which is for fmul. I would guess that the extending operations (umull) are more likely to see the problem, but you can imagine that the others in the "AdvSIMD indexed element" section of AArch64InstrInfo.td might all be better off using dup_v8i16 now.

define <4 x float> @sel.v8i16(ptr %p, ptr %q, <4 x float> %a, <4 x float> %b, <2 x float> %c) {
  %splat = shufflevector <4 x float> %a, <4 x float> poison, <4 x i32> zeroinitializer
  %splat2 = shufflevector <4 x float> %a, <4 x float> poison, <2 x i32> zeroinitializer
  
  %r = fmul <4 x float> %b, %splat
  %r2 = fmul <2 x float> %c, %splat2
  store <2 x float> %r2, ptr %p
  ret <4 x float> %r
}

I have added dup_v8i16 and dup_v4i32 to the SIMDVectorIndexedLongSD and SIMDIndexedLongSD multiclasses.
The umull patterns use SIMDVectorIndexedLongSD like smull, so they will also be matched with dup_v8i16 and dup_v4i32.
Let me check the fmul example and the 64-bit AArch64duplane with VectorIndexS:$idx.
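
As a sketch of that 64-bit duplane32 case (a hypothetical test, not from the patch), the zext/zext/mul chain below should select an indexed umull, and the v2i32 splat should be able to reuse the low half of the v4i32 DUP that the wider use keeps alive:

define <2 x i64> @umull_dup_reuse(<4 x i32> %a, <2 x i32> %b, <4 x i32> %c, ptr %p) {
  ; A v4i32 splat and a v2i32 splat of the same lane.
  %splat.wide = shufflevector <4 x i32> %a, <4 x i32> poison, <4 x i32> zeroinitializer
  %splat.low = shufflevector <4 x i32> %a, <4 x i32> poison, <2 x i32> zeroinitializer
  ; zext of both multiplicands followed by mul is the umull pattern.
  %b.ext = zext <2 x i32> %b to <2 x i64>
  %s.ext = zext <2 x i32> %splat.low to <2 x i64>
  %r = mul <2 x i64> %b.ext, %s.ext
  %wide.use = add <4 x i32> %c, %splat.wide
  store <4 x i32> %wide.use, ptr %p
  ret <2 x i64> %r
}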

jaykang10 updated this revision to Diff 542419. Jul 20 2023, 4:02 AM

Following @dmgreen's comment, updated more patterns with PatFrags.

dmgreen accepted this revision. Jul 20 2023, 5:13 AM

Cheers. LGTM, but can you try to add some tests for some of the other cases, maybe based on the code in the comment above? Then, if we need to change this again in the future, we will notice if they are no longer as good. Thanks.

This revision is now accepted and ready to land. Jul 20 2023, 5:13 AM

Yep, let me add some tests.
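
For instance, a test along these lines might cover one of the tied indexed operations (the SIMDVectorIndexedHSTied class mentioned earlier); this is an illustrative sketch, assuming the mul+add pairs select indexed mla:

define <4 x i32> @mla_dup_reuse(<4 x i32> %a, <4 x i32> %b, <4 x i32> %acc, <2 x i32> %c, <2 x i32> %acc2, ptr %p) {
  ; The same splat at both widths, feeding mul+add (mla) at each width.
  %splat.wide = shufflevector <4 x i32> %a, <4 x i32> poison, <4 x i32> zeroinitializer
  %splat.low = shufflevector <4 x i32> %a, <4 x i32> poison, <2 x i32> zeroinitializer
  %m = mul <4 x i32> %b, %splat.wide
  %r = add <4 x i32> %acc, %m
  %m2 = mul <2 x i32> %c, %splat.low
  %r2 = add <2 x i32> %acc2, %m2
  store <2 x i32> %r2, ptr %p
  ret <4 x i32> %r
}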

jaykang10 updated this revision to Diff 542520. Jul 20 2023, 7:48 AM

Added tests

This revision was landed with ongoing or failed builds. Jul 20 2023, 7:59 AM
This revision was automatically updated to reflect the committed changes.