This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Only mark cost 1 perfect shuffles as legal
ClosedPublic

Authored by dmgreen on Apr 8 2022, 3:39 AM.

Details

Reviewers

SjoerdMeijer
labrinea
samtebbs
jaykang10

Commits

rGcc9495f6791a: [AArch64] Only mark cost 1 perfect shuffles as legal

Summary

The perfect shuffle tables encode a cost of either 0 (a nop-copy) or 1 (a single instruction) with a cost encoding of 0 in the upper 2 bits. However, all perfect shuffles are currently marked as legal regardless of their cost (the maximum encoded cost is 3), which can confuse the DAG combiner into thinking the shuffles are cheaper than they should be.

Limiting legal shuffles to single instructions seems to do better in most cases, producing fewer instructions for complex shuffles. Some cases now become tbl, which may be better or worse depending on whether the instruction is in a loop and the tbl load can be hoisted out.
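As an illustration of the check described above, here is a minimal, standalone C++ sketch. It mirrors the base-9 index computation and the `PFEntry >> 30` cost extraction visible in the diff, but the helper names (`perfectShuffleIndex`, `encodedCost`, `legalBefore`, `legalAfter`) are hypothetical, the mapping of negative (undef) mask elements to the digit 8 is an assumption, and the entries passed in are made-up values rather than real `PerfectShuffleTable` contents.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper: build the base-9 table index for a 4-element mask.
// Digits 0..7 select a lane from the two concatenated inputs; a negative
// mask element is assumed to mean "undef" and is mapped to the digit 8.
unsigned perfectShuffleIndex(const std::array<int, 4> &Mask) {
  unsigned Index = 0;
  for (int M : Mask) {
    unsigned Digit = M < 0 ? 8u : static_cast<unsigned>(M);
    Index = Index * 9 + Digit;
  }
  return Index;
}

// The encoded cost lives in the upper 2 bits of a 32-bit entry, so it can
// only take the values 0..3.
unsigned encodedCost(uint32_t PFEntry) { return PFEntry >> 30; }

// Old predicate: a 2-bit field is always <= 4, so every entry was legal.
bool legalBefore(uint32_t PFEntry) { return encodedCost(PFEntry) <= 4; }

// New predicate: only an encoded cost of 0 (nop-copy or a single
// instruction, per the summary) is treated as legal.
bool legalAfter(uint32_t PFEntry) { return encodedCost(PFEntry) == 0; }
```

Under these assumptions, an entry whose top two bits are `11` passes the old check but fails the new one, which is the behavioural change this patch makes.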

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

dmgreen created this revision. Apr 8 2022, 3:39 AM

Herald added a project: Restricted Project. Apr 8 2022, 3:39 AM

Herald added subscribers: hiraditya, kristof.beyls.

dmgreen requested review of this revision. Apr 8 2022, 3:39 AM

Harbormaster completed remote builds in B158665: Diff 421475. Apr 8 2022, 3:40 AM

Remove NFC change in LowerVECTOR_SHUFFLE.

Harbormaster completed remote builds in B158666: Diff 421481. Apr 8 2022, 3:42 AM

dmgreen added a parent revision: D123379: [AArch64] Cost all perfect shuffles entries as cost 1. Apr 8 2022, 5:12 AM

dmgreen added a child revision: D123386: [AArch64] Add lane moves to PerfectShuffle tables. Apr 8 2022, 6:17 AM

That's quite a lot of change in the test cases. It's easy to see that the smaller ones are improvements. For the bigger changes, that isn't as obvious. But I trust you have run numbers and this is overall better. So LGTM, let's give this a try.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
11492–11494	This comment is a bit cryptic to me. The description of this ticket is clear, though; perhaps you could adopt some of that rationale here.

This revision is now accepted and ready to land. Apr 13 2022, 12:52 AM

This revision was landed with ongoing or failed builds. Apr 19 2022, 4:59 AM

Closed by commit rGcc9495f6791a: [AArch64] Only mark cost 1 perfect shuffles as legal (authored by dmgreen).

This revision was automatically updated to reflect the committed changes.

dmgreen added a commit: rGcc9495f6791a: [AArch64] Only mark cost 1 perfect shuffles as legal.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	4 lines
llvm/test/CodeGen/AArch64/aarch64-wide-shuffle.ll	7 lines
llvm/test/CodeGen/AArch64/build-vector-extract.ll	20 lines
llvm/test/CodeGen/AArch64/neon-reverseshuffle.ll	8 lines
llvm/test/CodeGen/AArch64/neon-widen-shuffle.ll	20 lines
llvm/test/CodeGen/AArch64/shuffles.ll	41 lines

Diff 423597

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

Show First 20 Lines • Show All 11,483 Lines • ▼ Show 20 Lines	if (VT.getVectorNumElements() == 4 &&
}		}

// Compute the index in the perfect shuffle table.		// Compute the index in the perfect shuffle table.
unsigned PFTableIndex = PFIndexes[0] * 9 * 9 * 9 + PFIndexes[1] * 9 * 9 +		unsigned PFTableIndex = PFIndexes[0] * 9 * 9 * 9 + PFIndexes[1] * 9 * 9 +
PFIndexes[2] * 9 + PFIndexes[3];		PFIndexes[2] * 9 + PFIndexes[3];
unsigned PFEntry = PerfectShuffleTable[PFTableIndex];		unsigned PFEntry = PerfectShuffleTable[PFTableIndex];
unsigned Cost = (PFEntry >> 30);		unsigned Cost = (PFEntry >> 30);

if (Cost <= 4)		// The cost tables encode cost 0 or cost 1 shuffles using the value 0 in
		// the top 2 bits.
		if (Cost == 0)
return true;		return true;
}		}

bool DummyBool;		bool DummyBool;
int DummyInt;		int DummyInt;
unsigned DummyUnsigned;		unsigned DummyUnsigned;

return (ShuffleVectorSDNode::isSplatMask(&M[0], VT) || isREVMask(M, VT, 64) ||		return (ShuffleVectorSDNode::isSplatMask(&M[0], VT) || isREVMask(M, VT, 64) ||

llvm/test/CodeGen/AArch64/aarch64-wide-shuffle.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc < %s | FileCheck %s			; RUN: llc < %s | FileCheck %s

	target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"			target datalayout = "e-m:e-i64:64-i128:128-n32:64-S128"
	target triple = "aarch64-unknown-linux-gnu"			target triple = "aarch64-unknown-linux-gnu"

	define <4 x i16> @f(<4 x i32> %vqdmlal_v3.i, <8 x i16> %x5) {			define <4 x i16> @f(<4 x i32> %vqdmlal_v3.i, <8 x i16> %x5) {
	; CHECK-LABEL: f:			; CHECK-LABEL: f:
	; CHECK: // %bb.0: // %entry			; CHECK: // %bb.0: // %entry
	; CHECK-NEXT: ext v0.16b, v0.16b, v0.16b, #8			; CHECK-NEXT: dup v0.4h, v0.h[4]
	; CHECK-NEXT: uzp1 v0.4h, v0.4h, v0.4h			; CHECK-NEXT: mov v0.h[1], v1.h[0]
	; CHECK-NEXT: ext v1.8b, v0.8b, v1.8b, #4			; CHECK-NEXT: // kill: def $d0 killed $d0 killed $q0
	; CHECK-NEXT: uzp1 v0.4h, v1.4h, v0.4h
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	entry:			entry:
	; Check that we don't just dup the input vector. The code emitted is ext, dup, ext, ext			; Check that we don't just dup the input vector. The code emitted is ext, dup, ext, ext
	; but only match the last three instructions as the first two could be combined to			; but only match the last three instructions as the first two could be combined to
	; a dup2 at some stage.			; a dup2 at some stage.
	%x4 = extractelement <4 x i32> %vqdmlal_v3.i, i32 2			%x4 = extractelement <4 x i32> %vqdmlal_v3.i, i32 2
	%vgetq_lane = trunc i32 %x4 to i16			%vgetq_lane = trunc i32 %x4 to i16
	%vecinit.i = insertelement <4 x i16> undef, i16 %vgetq_lane, i32 0			%vecinit.i = insertelement <4 x i16> undef, i16 %vgetq_lane, i32 0
	%vecinit2.i = insertelement <4 x i16> %vecinit.i, i16 %vgetq_lane, i32 2			%vecinit2.i = insertelement <4 x i16> %vecinit.i, i16 %vgetq_lane, i32 2
	%vecinit3.i = insertelement <4 x i16> %vecinit2.i, i16 %vgetq_lane, i32 3			%vecinit3.i = insertelement <4 x i16> %vecinit2.i, i16 %vgetq_lane, i32 3
	%vgetq_lane261 = extractelement <8 x i16> %x5, i32 0			%vgetq_lane261 = extractelement <8 x i16> %x5, i32 0
	%vset_lane267 = insertelement <4 x i16> %vecinit3.i, i16 %vgetq_lane261, i32 1			%vset_lane267 = insertelement <4 x i16> %vecinit3.i, i16 %vgetq_lane261, i32 1
	ret <4 x i16> %vset_lane267			ret <4 x i16> %vset_lane267
	}			}

llvm/test/CodeGen/AArch64/build-vector-extract.ll

Show All 24 Lines	; CHECK-NEXT: ret
%z = zext i32 %e to i64		%z = zext i32 %e to i64
%r = insertelement <2 x i64> zeroinitializer, i64 %z, i32 0		%r = insertelement <2 x i64> zeroinitializer, i64 %z, i32 0
ret <2 x i64> %r		ret <2 x i64> %r
}		}

define <2 x i64> @extract1_i32_zext_insert0_i64_undef(<4 x i32> %x) {		define <2 x i64> @extract1_i32_zext_insert0_i64_undef(<4 x i32> %x) {
; CHECK-LABEL: extract1_i32_zext_insert0_i64_undef:		; CHECK-LABEL: extract1_i32_zext_insert0_i64_undef:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: movi v1.2d, #0000000000000000		; CHECK-NEXT: mov w8, v0.s[1]
; CHECK-NEXT: zip1 v1.4s, v0.4s, v1.4s		; CHECK-NEXT: fmov d0, x8
; CHECK-NEXT: trn2 v0.4s, v0.4s, v1.4s
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%e = extractelement <4 x i32> %x, i32 1		%e = extractelement <4 x i32> %x, i32 1
%z = zext i32 %e to i64		%z = zext i32 %e to i64
%r = insertelement <2 x i64> undef, i64 %z, i32 0		%r = insertelement <2 x i64> undef, i64 %z, i32 0
ret <2 x i64> %r		ret <2 x i64> %r
}		}

define <2 x i64> @extract1_i32_zext_insert0_i64_zero(<4 x i32> %x) {		define <2 x i64> @extract1_i32_zext_insert0_i64_zero(<4 x i32> %x) {
; CHECK-LABEL: extract1_i32_zext_insert0_i64_zero:		; CHECK-LABEL: extract1_i32_zext_insert0_i64_zero:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: movi v1.2d, #0000000000000000		; CHECK-NEXT: movi v1.2d, #0000000000000000
; CHECK-NEXT: mov w8, v0.s[1]		; CHECK-NEXT: mov w8, v0.s[1]
; CHECK-NEXT: mov v1.d[0], x8		; CHECK-NEXT: mov v1.d[0], x8
; CHECK-NEXT: mov v0.16b, v1.16b		; CHECK-NEXT: mov v0.16b, v1.16b
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%e = extractelement <4 x i32> %x, i32 1		%e = extractelement <4 x i32> %x, i32 1
%z = zext i32 %e to i64		%z = zext i32 %e to i64
%r = insertelement <2 x i64> zeroinitializer, i64 %z, i32 0		%r = insertelement <2 x i64> zeroinitializer, i64 %z, i32 0
ret <2 x i64> %r		ret <2 x i64> %r
}		}

define <2 x i64> @extract2_i32_zext_insert0_i64_undef(<4 x i32> %x) {		define <2 x i64> @extract2_i32_zext_insert0_i64_undef(<4 x i32> %x) {
; CHECK-LABEL: extract2_i32_zext_insert0_i64_undef:		; CHECK-LABEL: extract2_i32_zext_insert0_i64_undef:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: movi v1.2d, #0000000000000000		; CHECK-NEXT: mov w8, v0.s[2]
; CHECK-NEXT: uzp1 v1.4s, v0.4s, v1.4s		; CHECK-NEXT: fmov d0, x8
; CHECK-NEXT: zip2 v0.4s, v0.4s, v1.4s
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%e = extractelement <4 x i32> %x, i32 2		%e = extractelement <4 x i32> %x, i32 2
%z = zext i32 %e to i64		%z = zext i32 %e to i64
%r = insertelement <2 x i64> undef, i64 %z, i32 0		%r = insertelement <2 x i64> undef, i64 %z, i32 0
ret <2 x i64> %r		ret <2 x i64> %r
}		}

define <2 x i64> @extract2_i32_zext_insert0_i64_zero(<4 x i32> %x) {		define <2 x i64> @extract2_i32_zext_insert0_i64_zero(<4 x i32> %x) {
Show All 34 Lines	; CHECK-NEXT: ret
%z = zext i32 %e to i64		%z = zext i32 %e to i64
%r = insertelement <2 x i64> zeroinitializer, i64 %z, i32 0		%r = insertelement <2 x i64> zeroinitializer, i64 %z, i32 0
ret <2 x i64> %r		ret <2 x i64> %r
}		}

define <2 x i64> @extract0_i32_zext_insert1_i64_undef(<4 x i32> %x) {		define <2 x i64> @extract0_i32_zext_insert1_i64_undef(<4 x i32> %x) {
; CHECK-LABEL: extract0_i32_zext_insert1_i64_undef:		; CHECK-LABEL: extract0_i32_zext_insert1_i64_undef:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: movi v1.2d, #0000000000000000		; CHECK-NEXT: fmov w8, s0
; CHECK-NEXT: zip1 v1.4s, v0.4s, v1.4s		; CHECK-NEXT: dup v0.2d, x8
; CHECK-NEXT: ext v0.16b, v0.16b, v1.16b, #8
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%e = extractelement <4 x i32> %x, i32 0		%e = extractelement <4 x i32> %x, i32 0
%z = zext i32 %e to i64		%z = zext i32 %e to i64
%r = insertelement <2 x i64> undef, i64 %z, i32 1		%r = insertelement <2 x i64> undef, i64 %z, i32 1
ret <2 x i64> %r		ret <2 x i64> %r
}		}

define <2 x i64> @extract0_i32_zext_insert1_i64_zero(<4 x i32> %x) {		define <2 x i64> @extract0_i32_zext_insert1_i64_zero(<4 x i32> %x) {
; CHECK-LABEL: extract0_i32_zext_insert1_i64_zero:		; CHECK-LABEL: extract0_i32_zext_insert1_i64_zero:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: movi v1.2d, #0000000000000000		; CHECK-NEXT: movi v1.2d, #0000000000000000
; CHECK-NEXT: fmov w8, s0		; CHECK-NEXT: fmov w8, s0
; CHECK-NEXT: mov v1.d[1], x8		; CHECK-NEXT: mov v1.d[1], x8
; CHECK-NEXT: mov v0.16b, v1.16b		; CHECK-NEXT: mov v0.16b, v1.16b
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%e = extractelement <4 x i32> %x, i32 0		%e = extractelement <4 x i32> %x, i32 0
%z = zext i32 %e to i64		%z = zext i32 %e to i64
%r = insertelement <2 x i64> zeroinitializer, i64 %z, i32 1		%r = insertelement <2 x i64> zeroinitializer, i64 %z, i32 1
ret <2 x i64> %r		ret <2 x i64> %r
}		}

define <2 x i64> @extract1_i32_zext_insert1_i64_undef(<4 x i32> %x) {		define <2 x i64> @extract1_i32_zext_insert1_i64_undef(<4 x i32> %x) {
; CHECK-LABEL: extract1_i32_zext_insert1_i64_undef:		; CHECK-LABEL: extract1_i32_zext_insert1_i64_undef:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: movi v1.2d, #0000000000000000		; CHECK-NEXT: mov w8, v0.s[1]
; CHECK-NEXT: zip1 v0.4s, v0.4s, v0.4s		; CHECK-NEXT: dup v0.2d, x8
; CHECK-NEXT: ext v0.16b, v0.16b, v1.16b, #4
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%e = extractelement <4 x i32> %x, i32 1		%e = extractelement <4 x i32> %x, i32 1
%z = zext i32 %e to i64		%z = zext i32 %e to i64
%r = insertelement <2 x i64> undef, i64 %z, i32 1		%r = insertelement <2 x i64> undef, i64 %z, i32 1
ret <2 x i64> %r		ret <2 x i64> %r
}		}

define <2 x i64> @extract1_i32_zext_insert1_i64_zero(<4 x i32> %x) {		define <2 x i64> @extract1_i32_zext_insert1_i64_zero(<4 x i32> %x) {

llvm/test/CodeGen/AArch64/neon-reverseshuffle.ll

	Show All 40 Lines
	entry:			entry:
	%V128 = shufflevector <8 x i16> %a, <8 x i16> undef, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>			%V128 = shufflevector <8 x i16> %a, <8 x i16> undef, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
	ret <8 x i16> %V128			ret <8 x i16> %V128
	}			}

	define <8 x i16> @v8i16_2(<4 x i16> %a, <4 x i16> %b) {			define <8 x i16> @v8i16_2(<4 x i16> %a, <4 x i16> %b) {
	; CHECK-LABEL: v8i16_2:			; CHECK-LABEL: v8i16_2:
	; CHECK: // %bb.0: // %entry			; CHECK: // %bb.0: // %entry
	; CHECK-NEXT: rev64 v2.4h, v0.4h			; CHECK-NEXT: adrp x8, .LCPI4_0
	; CHECK-NEXT: rev64 v0.4h, v1.4h			; CHECK-NEXT: // kill: def $d1 killed $d1 killed $q0_q1 def $q0_q1
	; CHECK-NEXT: mov v0.d[1], v2.d[0]			; CHECK-NEXT: // kill: def $d0 killed $d0 killed $q0_q1 def $q0_q1
				; CHECK-NEXT: ldr q2, [x8, :lo12:.LCPI4_0]
				; CHECK-NEXT: tbl v0.16b, { v0.16b, v1.16b }, v2.16b
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	entry:			entry:
	%V128 = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>			%V128 = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 7, i32 6, i32 5, i32 4, i32 3, i32 2, i32 1, i32 0>
	ret <8 x i16> %V128			ret <8 x i16> %V128
	}			}

	define <4 x i16> @v4i16(<4 x i16> %a) {			define <4 x i16> @v4i16(<4 x i16> %a) {
	; CHECK-LABEL: v4i16:			; CHECK-LABEL: v4i16:

llvm/test/CodeGen/AArch64/neon-widen-shuffle.ll

Show First 20 Lines • Show All 130 Lines • ▼ Show 20 Lines	entry:
%res = shufflevector <8 x i8> %a, <8 x i8> %b, <8 x i32> <i32 4, i32 5, i32 4, i32 undef,		%res = shufflevector <8 x i8> %a, <8 x i8> %b, <8 x i32> <i32 4, i32 5, i32 4, i32 undef,
i32 undef, i32 13, i32 12, i32 undef>		i32 undef, i32 13, i32 12, i32 undef>
ret <8 x i8> %res		ret <8 x i8> %res
}		}

define <8 x i16> @shuffle_widen_faili1(<4 x i16> %a, <4 x i16> %b) {		define <8 x i16> @shuffle_widen_faili1(<4 x i16> %a, <4 x i16> %b) {
; CHECK-LABEL: shuffle_widen_faili1:		; CHECK-LABEL: shuffle_widen_faili1:
; CHECK: // %bb.0: // %entry		; CHECK: // %bb.0: // %entry
; CHECK-NEXT: rev32 v2.4h, v0.4h		; CHECK-NEXT: adrp x8, .LCPI12_0
; CHECK-NEXT: rev32 v3.4h, v1.4h		; CHECK-NEXT: // kill: def $d1 killed $d1 killed $q0_q1 def $q0_q1
; CHECK-NEXT: ext v1.8b, v2.8b, v1.8b, #4		; CHECK-NEXT: // kill: def $d0 killed $d0 killed $q0_q1 def $q0_q1
; CHECK-NEXT: ext v0.8b, v3.8b, v0.8b, #4		; CHECK-NEXT: ldr q2, [x8, :lo12:.LCPI12_0]
; CHECK-NEXT: mov v0.d[1], v1.d[0]		; CHECK-NEXT: tbl v0.16b, { v0.16b, v1.16b }, v2.16b
; CHECK-NEXT: ret		; CHECK-NEXT: ret
entry:		entry:
%res = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 7, i32 6, i32 0, i32 1,		%res = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 7, i32 6, i32 0, i32 1,
i32 3, i32 2, i32 4, i32 5>		i32 3, i32 2, i32 4, i32 5>
ret <8 x i16> %res		ret <8 x i16> %res
}		}

define <8 x i16> @shuffle_widen_fail2(<4 x i16> %a, <4 x i16> %b) {		define <8 x i16> @shuffle_widen_fail2(<4 x i16> %a, <4 x i16> %b) {
; CHECK-LABEL: shuffle_widen_fail2:		; CHECK-LABEL: shuffle_widen_fail2:
; CHECK: // %bb.0: // %entry		; CHECK: // %bb.0: // %entry
; CHECK-NEXT: uzp1 v2.4h, v0.4h, v0.4h		; CHECK-NEXT: adrp x8, .LCPI13_0
; CHECK-NEXT: trn1 v3.4h, v1.4h, v1.4h		; CHECK-NEXT: // kill: def $d1 killed $d1 killed $q0_q1 def $q0_q1
; CHECK-NEXT: ext v1.8b, v2.8b, v1.8b, #4		; CHECK-NEXT: // kill: def $d0 killed $d0 killed $q0_q1 def $q0_q1
; CHECK-NEXT: ext v0.8b, v3.8b, v0.8b, #4		; CHECK-NEXT: ldr q2, [x8, :lo12:.LCPI13_0]
; CHECK-NEXT: mov v0.d[1], v1.d[0]		; CHECK-NEXT: tbl v0.16b, { v0.16b, v1.16b }, v2.16b
; CHECK-NEXT: ret		; CHECK-NEXT: ret
entry:		entry:
%res = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 6, i32 6, i32 0, i32 1,		%res = shufflevector <4 x i16> %a, <4 x i16> %b, <8 x i32> <i32 6, i32 6, i32 0, i32 1,
i32 undef, i32 2, i32 4, i32 5>		i32 undef, i32 2, i32 4, i32 5>
ret <8 x i16> %res		ret <8 x i16> %res
}		}

define <8 x i16> @shuffle_widen_fail3(<8 x i16> %a, <8 x i16> %b) {		define <8 x i16> @shuffle_widen_fail3(<8 x i16> %a, <8 x i16> %b) {
Show All 13 Lines

llvm/test/CodeGen/AArch64/shuffles.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc < %s -mtriple=aarch64--linux-gnu | FileCheck %s			; RUN: llc < %s -mtriple=aarch64--linux-gnu | FileCheck %s

	define <16 x i32> @test_shuf1(<16 x i32> %x, <16 x i32> %y) {			define <16 x i32> @test_shuf1(<16 x i32> %x, <16 x i32> %y) {
	; CHECK-LABEL: test_shuf1:			; CHECK-LABEL: test_shuf1:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: zip2 v3.4s, v7.4s, v6.4s			; CHECK-NEXT: uzp1 v16.4s, v1.4s, v0.4s
	; CHECK-NEXT: ext v5.16b, v6.16b, v4.16b, #12			; CHECK-NEXT: ext v3.16b, v6.16b, v4.16b, #12
	; CHECK-NEXT: uzp1 v6.4s, v1.4s, v0.4s			; CHECK-NEXT: zip2 v6.4s, v7.4s, v6.4s
	; CHECK-NEXT: uzp2 v4.4s, v2.4s, v4.4s			; CHECK-NEXT: uzp2 v17.4s, v2.4s, v4.4s
	; CHECK-NEXT: trn2 v3.4s, v7.4s, v3.4s			; CHECK-NEXT: trn2 v16.4s, v16.4s, v1.4s
	; CHECK-NEXT: ext v5.16b, v7.16b, v5.16b, #8			; CHECK-NEXT: ext v1.16b, v1.16b, v1.16b, #4
	; CHECK-NEXT: trn2 v6.4s, v6.4s, v1.4s			; CHECK-NEXT: trn2 v4.4s, v7.4s, v6.4s
	; CHECK-NEXT: trn1 v2.4s, v4.4s, v2.4s			; CHECK-NEXT: rev64 v5.4s, v7.4s
	; CHECK-NEXT: ext v4.16b, v1.16b, v1.16b, #12			; CHECK-NEXT: trn1 v2.4s, v17.4s, v2.4s
	; CHECK-NEXT: ext v3.16b, v1.16b, v3.16b, #8			; CHECK-NEXT: dup v6.4s, v7.s[0]
	; CHECK-NEXT: rev64 v16.4s, v5.4s			; CHECK-NEXT: mov v4.d[1], v1.d[1]
	; CHECK-NEXT: dup v7.4s, v7.s[0]			; CHECK-NEXT: mov v3.d[1], v5.d[1]
	; CHECK-NEXT: ext v1.16b, v0.16b, v6.16b, #12			; CHECK-NEXT: ext v1.16b, v0.16b, v16.16b, #12
	; CHECK-NEXT: mov v2.s[3], v7.s[3]			; CHECK-NEXT: mov v2.s[3], v6.s[3]
	; CHECK-NEXT: ext v0.16b, v3.16b, v4.16b, #8			; CHECK-NEXT: mov v0.16b, v4.16b
	; CHECK-NEXT: ext v3.16b, v5.16b, v16.16b, #8
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%s3 = shufflevector <16 x i32> %x, <16 x i32> %y, <16 x i32> <i32 29, i32 26, i32 7, i32 4, i32 3, i32 6, i32 5, i32 2, i32 9, i32 8, i32 17, i32 28, i32 27, i32 16, i32 31, i32 30>			%s3 = shufflevector <16 x i32> %x, <16 x i32> %y, <16 x i32> <i32 29, i32 26, i32 7, i32 4, i32 3, i32 6, i32 5, i32 2, i32 9, i32 8, i32 17, i32 28, i32 27, i32 16, i32 31, i32 30>
	ret <16 x i32> %s3			ret <16 x i32> %s3
	}			}

	define <4 x i32> @test_shuf2(<16 x i32> %x, <16 x i32> %y) {			define <4 x i32> @test_shuf2(<16 x i32> %x, <16 x i32> %y) {
	; CHECK-LABEL: test_shuf2:			; CHECK-LABEL: test_shuf2:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	; CHECK-NEXT: zip2 v0.4s, v7.4s, v6.4s			; CHECK-NEXT: zip2 v0.4s, v7.4s, v6.4s
	; CHECK-NEXT: ext v2.16b, v1.16b, v1.16b, #12			; CHECK-NEXT: ext v1.16b, v1.16b, v1.16b, #4
	; CHECK-NEXT: trn2 v0.4s, v7.4s, v0.4s			; CHECK-NEXT: trn2 v0.4s, v7.4s, v0.4s
	; CHECK-NEXT: ext v0.16b, v1.16b, v0.16b, #8			; CHECK-NEXT: mov v0.d[1], v1.d[1]
	; CHECK-NEXT: ext v0.16b, v0.16b, v2.16b, #8
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%s3 = shufflevector <16 x i32> %x, <16 x i32> %y, <4 x i32> <i32 29, i32 26, i32 7, i32 4>			%s3 = shufflevector <16 x i32> %x, <16 x i32> %y, <4 x i32> <i32 29, i32 26, i32 7, i32 4>
	ret <4 x i32> %s3			ret <4 x i32> %s3
	}			}

	define <4 x i32> @test_shuf3(<16 x i32> %x, <16 x i32> %y) {			define <4 x i32> @test_shuf3(<16 x i32> %x, <16 x i32> %y) {
	; CHECK-LABEL: test_shuf3:			; CHECK-LABEL: test_shuf3:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	Show All 15 Lines
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%s3 = shufflevector <16 x i32> %x, <16 x i32> %y, <4 x i32> <i32 9, i32 8, i32 17, i32 28>			%s3 = shufflevector <16 x i32> %x, <16 x i32> %y, <4 x i32> <i32 9, i32 8, i32 17, i32 28>
	ret <4 x i32> %s3			ret <4 x i32> %s3
	}			}

	define <4 x i32> @test_shuf5(<16 x i32> %x, <16 x i32> %y) {			define <4 x i32> @test_shuf5(<16 x i32> %x, <16 x i32> %y) {
	; CHECK-LABEL: test_shuf5:			; CHECK-LABEL: test_shuf5:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
				; CHECK-NEXT: rev64 v1.4s, v7.4s
	; CHECK-NEXT: ext v0.16b, v6.16b, v4.16b, #12			; CHECK-NEXT: ext v0.16b, v6.16b, v4.16b, #12
	; CHECK-NEXT: ext v0.16b, v7.16b, v0.16b, #8			; CHECK-NEXT: mov v0.d[1], v1.d[1]
	; CHECK-NEXT: rev64 v1.4s, v0.4s
	; CHECK-NEXT: ext v0.16b, v0.16b, v1.16b, #8
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	%s3 = shufflevector <16 x i32> %x, <16 x i32> %y, <4 x i32> <i32 27, i32 16, i32 31, i32 30>			%s3 = shufflevector <16 x i32> %x, <16 x i32> %y, <4 x i32> <i32 27, i32 16, i32 31, i32 30>
	ret <4 x i32> %s3			ret <4 x i32> %s3
	}			}

	define <4 x i32> @test1503(<4 x i32> %a, <4 x i32> %b)			define <4 x i32> @test1503(<4 x i32> %a, <4 x i32> %b)
	; CHECK-LABEL: test1503:			; CHECK-LABEL: test1503:
	; CHECK: // %bb.0:			; CHECK: // %bb.0:
	Show All 33 Lines