This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Generate LD1 for anyext i8 or i16 vector load
Closed · Public

Authored by asavonic on May 21 2021, 11:41 AM.

Details

Reviewers

dmgreen
SjoerdMeijer
jmolloy
t.p.northover

Commits

rG8ac66d61eab3: [AArch64] Generate LD1 for anyext i8 or i16 vector load

Summary

The existing LD1 patterns do not cover cases where the result type does
not match the memory type. This happens when illegal vector types are
extended and scalarized, for example:

%v2i16 = load <2 x i16>, <2 x i16>* %v2i16_ptr

is lowered into:

// first element
(v4i32 (insert_subvector (v2i32 (scalar_to_vector (load anyext from i16)))))
// other elements
(v4i32 (insert_vector_elt (i32 (load anyext from i16)) idx))

Before this patch these patterns were compiled into LDR + INS.
Now they are compiled into LD1.
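
As a rough illustration of why the two-element case ends up in lanes 0 and 2 (the lane numbers checked in aarch64-load-ext.ll), the C++ sketch below models a 128-bit register as 16 bytes in little-endian lane order. The names `QReg`, `ld1LaneH`, `readLaneS` and `loadV2I16` are made up for this example and are not LLVM APIs. The register is zero-initialized only to make the result deterministic; with anyext the upper bits of each 32-bit lane are unspecified.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical model of a 128-bit vector register: 16 bytes, lane 0 first.
using QReg = std::array<std::uint8_t, 16>;

// Model of "ld1 { v.h }[lane], [ptr]": overwrite one 16-bit lane and
// leave every other byte of the register untouched.
inline void ld1LaneH(QReg &reg, unsigned lane, const std::uint8_t *ptr) {
  assert(lane < 8 && "a 128-bit register has eight 16-bit lanes");
  std::memcpy(reg.data() + lane * 2, ptr, 2);
}

// Read one 32-bit lane, assuming little-endian byte order inside the lane.
inline std::uint32_t readLaneS(const QReg &reg, unsigned lane) {
  assert(lane < 4 && "a 128-bit register has four 32-bit lanes");
  std::uint32_t value = 0;
  for (unsigned b = 0; b < 4; ++b)
    value |= static_cast<std::uint32_t>(reg[lane * 4 + b]) << (8 * b);
  return value;
}

// Load <2 x i16> as two widened 32-bit elements using only lane loads.
inline QReg loadV2I16(const std::uint8_t *mem) {
  QReg reg{};                // zeroed for determinism (assumption of this model)
  ld1LaneH(reg, 0, mem);     // element 0 -> H lane 0, low half of S lane 0
  ld1LaneH(reg, 2, mem + 2); // element 1 -> H lane 2, low half of S lane 1
  return reg;
}
```

In this model, loading the bytes 34 12 EF BE leaves 0x1234 in 32-bit lane 0 and 0xBEEF in 32-bit lane 1, with no separate LDR + INS step per element.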

There is a separate patch to combine these sequences with an ADD
into an LD1_POST: https://reviews.llvm.org/D102939

The problem was reported in
PR24820: LLVM Generates abysmal code in simple situation.

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	780 ms	x64 debian > libomp.lock::omp_init_lock.c

Event Timeline

asavonic created this revision. May 21 2021, 11:41 AM

Herald added subscribers: danielkiss, hiraditya, kristof.beyls. May 21 2021, 11:41 AM

asavonic requested review of this revision. May 21 2021, 11:41 AM

Herald added a project: Restricted Project. May 21 2021, 11:41 AM

Herald added a subscriber: llvm-commits.

asavonic added reviewers: SjoerdMeijer, jmolloy. May 21 2021, 11:42 AM

asavonic added a reviewer: t.p.northover.

asavonic edited the summary of this revision. May 21 2021, 11:53 AM

asavonic edited the summary of this revision.

Harbormaster completed remote builds in B105678: Diff 347087. May 21 2021, 12:17 PM

Sounds good. Can we make sure there is test coverage for big endian too?

In D102938#2776765, @dmgreen wrote:

Sounds good. Can we make sure there is test coverage for big endian too?

Thanks. Added BE checks to CodeGen/AArch64/aarch64-load-ext.ll.

asavonic edited the summary of this revision. May 24 2021, 9:05 AM

Harbormaster completed remote builds in B105929: Diff 347412. May 24 2021, 9:57 AM

Thanks. LGTM.

This revision is now accepted and ready to land. May 25 2021, 2:20 PM

This revision was landed with ongoing or failed builds. May 26 2021, 4:49 AM

Closed by commit rG8ac66d61eab3: [AArch64] Generate LD1 for anyext i8 or i16 vector load (authored by asavonic).

This revision was automatically updated to reflect the committed changes.

asavonic added a commit: rG8ac66d61eab3: [AArch64] Generate LD1 for anyext i8 or i16 vector load.

dmgreen mentioned this in D103629: [AArch64] Cost-model i8 vector loads/stores. Jun 3 2021, 10:18 PM

Revision Contents

	Path	Size
	llvm/lib/Target/AArch64/AArch64InstrInfo.td	40 lines
	llvm/test/CodeGen/AArch64/aarch64-load-ext.ll	112 lines
	llvm/test/CodeGen/AArch64/sadd_sat_vec.ll	36 lines
	llvm/test/CodeGen/AArch64/ssub_sat_vec.ll	36 lines

Diff 347412

llvm/lib/Target/AArch64/AArch64InstrInfo.td

 def : Ld1Lane128Pat<extloadi16, VectorIndexH, v8i16, i32, LD1i16>;
 def : Ld1Lane128Pat<load, VectorIndexS, v4i32, i32, LD1i32>;
 def : Ld1Lane128Pat<load, VectorIndexS, v4f32, f32, LD1i32>;
 def : Ld1Lane128Pat<load, VectorIndexD, v2i64, i64, LD1i64>;
 def : Ld1Lane128Pat<load, VectorIndexD, v2f64, f64, LD1i64>;
 def : Ld1Lane128Pat<load, VectorIndexH, v8f16, f16, LD1i16>;
 def : Ld1Lane128Pat<load, VectorIndexH, v8bf16, bf16, LD1i16>;

+// Generate LD1 for extload if memory type does not match the
+// destination type, for example:
+//
+//   (v4i32 (insert_vector_elt (load anyext from i8) idx))
+//
+// In this case, the index must be adjusted to match LD1 type.
+//
+class Ld1Lane128IdxOpPat<SDPatternOperator scalar_load, Operand
+                         VecIndex, ValueType VTy, ValueType STy,
+                         Instruction LD1, SDNodeXForm IdxOp>
+  : Pat<(vector_insert (VTy VecListOne128:$Rd),
+                       (STy (scalar_load GPR64sp:$Rn)), VecIndex:$idx),
+        (LD1 VecListOne128:$Rd, (IdxOp VecIndex:$idx), GPR64sp:$Rn)>;
+
+def VectorIndexStoH : SDNodeXForm<imm, [{
+  return CurDAG->getTargetConstant(N->getZExtValue() * 2, SDLoc(N), MVT::i64);
+}]>;
+def VectorIndexStoB : SDNodeXForm<imm, [{
+  return CurDAG->getTargetConstant(N->getZExtValue() * 4, SDLoc(N), MVT::i64);
+}]>;
+def VectorIndexHtoB : SDNodeXForm<imm, [{
+  return CurDAG->getTargetConstant(N->getZExtValue() * 2, SDLoc(N), MVT::i64);
+}]>;
+
+def : Ld1Lane128IdxOpPat<extloadi16, VectorIndexS, v4i32, i32, LD1i16, VectorIndexStoH>;
+def : Ld1Lane128IdxOpPat<extloadi8, VectorIndexS, v4i32, i32, LD1i8, VectorIndexStoB>;
+def : Ld1Lane128IdxOpPat<extloadi8, VectorIndexH, v8i16, i32, LD1i8, VectorIndexHtoB>;
+
+// Same as above, but the first element is populated using
+// scalar_to_vector + insert_subvector instead of insert_vector_elt.
+class Ld1Lane128FirstElm<ValueType ResultTy, ValueType VecTy,
+                         SDPatternOperator ExtLoad, Instruction LD1>
+  : Pat<(ResultTy (scalar_to_vector (i32 (ExtLoad GPR64sp:$Rn)))),
+        (ResultTy (EXTRACT_SUBREG
+          (LD1 (VecTy (IMPLICIT_DEF)), 0, GPR64sp:$Rn), dsub))>;
+
+def : Ld1Lane128FirstElm<v2i32, v8i16, extloadi16, LD1i16>;
+def : Ld1Lane128FirstElm<v2i32, v16i8, extloadi8, LD1i8>;
+def : Ld1Lane128FirstElm<v4i16, v16i8, extloadi8, LD1i8>;
+
 class Ld1Lane64Pat<SDPatternOperator scalar_load, Operand VecIndex,
                    ValueType VTy, ValueType STy, Instruction LD1>
   : Pat<(vector_insert (VTy VecListOne64:$Rd),
                        (STy (scalar_load GPR64sp:$Rn)), VecIndex:$idx),
         (EXTRACT_SUBREG
           (LD1 (SUBREG_TO_REG (i32 0), VecListOne64:$Rd, dsub),
                VecIndex:$idx, GPR64sp:$Rn),
           dsub)>;
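
The three index transforms in this hunk each multiply the lane index by the ratio between the vector element width and the memory element width. A minimal C++ sketch of that arithmetic, outside of TableGen, is shown here. `scaleLaneIndex` and the three wrappers are hypothetical names that only mirror the `SDNodeXForm` definitions; they are not part of the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: convert an index counted in vector elements of
// vecElemBits into an index counted in LD1 lanes of memElemBits.
constexpr std::uint64_t scaleLaneIndex(std::uint64_t idx, unsigned vecElemBits,
                                       unsigned memElemBits) {
  return idx * (vecElemBits / memElemBits);
}

// Counterparts of VectorIndexStoH, VectorIndexStoB and VectorIndexHtoB.
constexpr std::uint64_t indexStoH(std::uint64_t idx) {
  return scaleLaneIndex(idx, 32, 16); // multiply by 2
}
constexpr std::uint64_t indexStoB(std::uint64_t idx) {
  return scaleLaneIndex(idx, 32, 8);  // multiply by 4
}
constexpr std::uint64_t indexHtoB(std::uint64_t idx) {
  return scaleLaneIndex(idx, 16, 8);  // multiply by 2
}

// Element 1 of a widened <2 x i16> lands in H lane 2, element 1 of a
// widened <2 x i8> in B lane 4, and element 3 of a widened <4 x i8> in
// B lane 6, matching the lane numbers in aarch64-load-ext.ll.
static_assert(indexStoH(1) == 2, "S index 1 maps to H lane 2");
static_assert(indexStoB(1) == 4, "S index 1 maps to B lane 4");
static_assert(indexHtoB(3) == 6, "H index 3 maps to B lane 6");
```

Writing the factor as a width ratio is only a way to read the constants 2, 4 and 2; the patch itself hard-codes them per transform.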

llvm/test/CodeGen/AArch64/aarch64-load-ext.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=aarch64-unknown-linux-gnu < %s | FileCheck %s --check-prefix CHECK-LE
				; RUN: llc -mtriple=aarch64_be-unknown-linux-gnu < %s | FileCheck %s --check-prefix CHECK-BE

				define <2 x i16> @test0(i16* %i16_ptr, i64 %inc) {
				; CHECK-LE-LABEL: test0:
				; CHECK-LE: // %bb.0:
				; CHECK-LE-NEXT: ld1 { v0.h }[0], [x0]
				; CHECK-LE-NEXT: // kill: def $d0 killed $d0 killed $q0
				; CHECK-LE-NEXT: ret
				;
				; CHECK-BE-LABEL: test0:
				; CHECK-BE: // %bb.0:
				; CHECK-BE-NEXT: ld1 { v0.h }[0], [x0]
				; CHECK-BE-NEXT: rev64 v0.2s, v0.2s
				; CHECK-BE-NEXT: ret
				%i_0 = load i16, i16* %i16_ptr
				%v0 = insertelement <2 x i16> undef, i16 %i_0, i32 0
				ret <2 x i16> %v0
				}

				define <2 x i16> @test1(<2 x i16>* %v2i16_ptr) {
				; CHECK-LE-LABEL: test1:
				; CHECK-LE: // %bb.0:
				; CHECK-LE-NEXT: ld1 { v0.h }[0], [x0]
				; CHECK-LE-NEXT: add x8, x0, #2 // =2
				; CHECK-LE-NEXT: ld1 { v0.h }[2], [x8]
				; CHECK-LE-NEXT: // kill: def $d0 killed $d0 killed $q0
				; CHECK-LE-NEXT: ret
				;
				; CHECK-BE-LABEL: test1:
				; CHECK-BE: // %bb.0:
				; CHECK-BE-NEXT: ld1 { v0.h }[0], [x0]
				; CHECK-BE-NEXT: add x8, x0, #2 // =2
				; CHECK-BE-NEXT: ld1 { v0.h }[2], [x8]
				; CHECK-BE-NEXT: rev64 v0.2s, v0.2s
				; CHECK-BE-NEXT: ret
				%v2i16 = load <2 x i16>, <2 x i16>* %v2i16_ptr
				ret <2 x i16> %v2i16
				}

				define <2 x i16> @test2(i16* %i16_ptr, i64 %inc) {
				; CHECK-LE-LABEL: test2:
				; CHECK-LE: // %bb.0:
				; CHECK-LE-NEXT: ld1 { v0.h }[0], [x0]
				; CHECK-LE-NEXT: add x8, x0, x1, lsl #1
				; CHECK-LE-NEXT: ld1 { v0.h }[2], [x8]
				; CHECK-LE-NEXT: // kill: def $d0 killed $d0 killed $q0
				; CHECK-LE-NEXT: ret
				;
				; CHECK-BE-LABEL: test2:
				; CHECK-BE: // %bb.0:
				; CHECK-BE-NEXT: ld1 { v0.h }[0], [x0]
				; CHECK-BE-NEXT: add x8, x0, x1, lsl #1
				; CHECK-BE-NEXT: ld1 { v0.h }[2], [x8]
				; CHECK-BE-NEXT: rev64 v0.2s, v0.2s
				; CHECK-BE-NEXT: ret
				%i_0 = load i16, i16* %i16_ptr
				%i16_ptr_inc = getelementptr i16, i16* %i16_ptr, i64 %inc
				%i_1 = load i16, i16* %i16_ptr_inc
				%v0 = insertelement <2 x i16> undef, i16 %i_0, i32 0
				%v1 = insertelement <2 x i16> %v0, i16 %i_1, i32 1
				ret <2 x i16> %v1
				}

				define <2 x i8> @test3(<2 x i8>* %v2i8_ptr) {
				; CHECK-LE-LABEL: test3:
				; CHECK-LE: // %bb.0:
				; CHECK-LE-NEXT: ld1 { v0.b }[0], [x0]
				; CHECK-LE-NEXT: add x8, x0, #1 // =1
				; CHECK-LE-NEXT: ld1 { v0.b }[4], [x8]
				; CHECK-LE-NEXT: // kill: def $d0 killed $d0 killed $q0
				; CHECK-LE-NEXT: ret
				;
				; CHECK-BE-LABEL: test3:
				; CHECK-BE: // %bb.0:
				; CHECK-BE-NEXT: ld1 { v0.b }[0], [x0]
				; CHECK-BE-NEXT: add x8, x0, #1 // =1
				; CHECK-BE-NEXT: ld1 { v0.b }[4], [x8]
				; CHECK-BE-NEXT: rev64 v0.2s, v0.2s
				; CHECK-BE-NEXT: ret
				%v2i8 = load <2 x i8>, <2 x i8>* %v2i8_ptr
				ret <2 x i8> %v2i8
				}

				define <4 x i8> @test4(<4 x i8>* %v4i8_ptr) {
				; CHECK-LE-LABEL: test4:
				; CHECK-LE: // %bb.0:
				; CHECK-LE-NEXT: ld1 { v0.b }[0], [x0]
				; CHECK-LE-NEXT: add x8, x0, #1 // =1
				; CHECK-LE-NEXT: ld1 { v0.b }[2], [x8]
				; CHECK-LE-NEXT: add x8, x0, #2 // =2
				; CHECK-LE-NEXT: ld1 { v0.b }[4], [x8]
				; CHECK-LE-NEXT: add x8, x0, #3 // =3
				; CHECK-LE-NEXT: ld1 { v0.b }[6], [x8]
				; CHECK-LE-NEXT: // kill: def $d0 killed $d0 killed $q0
				; CHECK-LE-NEXT: ret
				;
				; CHECK-BE-LABEL: test4:
				; CHECK-BE: // %bb.0:
				; CHECK-BE-NEXT: ld1 { v0.b }[0], [x0]
				; CHECK-BE-NEXT: add x8, x0, #1 // =1
				; CHECK-BE-NEXT: ld1 { v0.b }[2], [x8]
				; CHECK-BE-NEXT: add x8, x0, #2 // =2
				; CHECK-BE-NEXT: ld1 { v0.b }[4], [x8]
				; CHECK-BE-NEXT: add x8, x0, #3 // =3
				; CHECK-BE-NEXT: ld1 { v0.b }[6], [x8]
				; CHECK-BE-NEXT: rev64 v0.4h, v0.4h
				; CHECK-BE-NEXT: ret
				%v4i8 = load <4 x i8>, <4 x i8>* %v4i8_ptr
				ret <4 x i8> %v4i8
				}

llvm/test/CodeGen/AArch64/sadd_sat_vec.ll

 %z = call <4 x i8> @llvm.sadd.sat.v4i8(<4 x i8> %x, <4 x i8> %y)
 store <4 x i8> %z, <4 x i8>* %pz
 ret void
 }

 define void @v2i8(<2 x i8>* %px, <2 x i8>* %py, <2 x i8>* %pz) nounwind {
 ; CHECK-LABEL: v2i8:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: ldrb w8, [x0]
-; CHECK-NEXT: ldrb w9, [x1]
-; CHECK-NEXT: ldrb w10, [x0, #1]
-; CHECK-NEXT: ldrb w11, [x1, #1]
-; CHECK-NEXT: fmov s0, w8
-; CHECK-NEXT: fmov s1, w9
-; CHECK-NEXT: mov v0.s[1], w10
-; CHECK-NEXT: mov v1.s[1], w11
-; CHECK-NEXT: shl v1.2s, v1.2s, #24
+; CHECK-NEXT: ld1 { v0.b }[0], [x1]
+; CHECK-NEXT: ld1 { v1.b }[0], [x0]
+; CHECK-NEXT: add x8, x0, #1 // =1
+; CHECK-NEXT: add x9, x1, #1 // =1
+; CHECK-NEXT: ld1 { v0.b }[4], [x9]
+; CHECK-NEXT: ld1 { v1.b }[4], [x8]
 ; CHECK-NEXT: shl v0.2s, v0.2s, #24
-; CHECK-NEXT: sqadd v0.2s, v0.2s, v1.2s
+; CHECK-NEXT: shl v1.2s, v1.2s, #24
+; CHECK-NEXT: sqadd v0.2s, v1.2s, v0.2s
 ; CHECK-NEXT: ushr v0.2s, v0.2s, #24
 ; CHECK-NEXT: mov w8, v0.s[1]
 ; CHECK-NEXT: fmov w9, s0
 ; CHECK-NEXT: strb w9, [x2]
 ; CHECK-NEXT: strb w8, [x2, #1]
 ; CHECK-NEXT: ret
 %x = load <2 x i8>, <2 x i8>* %px
 %y = load <2 x i8>, <2 x i8>* %py

[15 lines not shown]

 %z = call <4 x i16> @llvm.sadd.sat.v4i16(<4 x i16> %x, <4 x i16> %y)
 store <4 x i16> %z, <4 x i16>* %pz
 ret void
 }

 define void @v2i16(<2 x i16>* %px, <2 x i16>* %py, <2 x i16>* %pz) nounwind {
 ; CHECK-LABEL: v2i16:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: ldrh w8, [x0]
-; CHECK-NEXT: ldrh w9, [x1]
-; CHECK-NEXT: ldrh w10, [x0, #2]
-; CHECK-NEXT: ldrh w11, [x1, #2]
-; CHECK-NEXT: fmov s0, w8
-; CHECK-NEXT: fmov s1, w9
-; CHECK-NEXT: mov v0.s[1], w10
-; CHECK-NEXT: mov v1.s[1], w11
-; CHECK-NEXT: shl v1.2s, v1.2s, #16
+; CHECK-NEXT: ld1 { v0.h }[0], [x1]
+; CHECK-NEXT: ld1 { v1.h }[0], [x0]
+; CHECK-NEXT: add x8, x0, #2 // =2
+; CHECK-NEXT: add x9, x1, #2 // =2
+; CHECK-NEXT: ld1 { v0.h }[2], [x9]
+; CHECK-NEXT: ld1 { v1.h }[2], [x8]
 ; CHECK-NEXT: shl v0.2s, v0.2s, #16
-; CHECK-NEXT: sqadd v0.2s, v0.2s, v1.2s
+; CHECK-NEXT: shl v1.2s, v1.2s, #16
+; CHECK-NEXT: sqadd v0.2s, v1.2s, v0.2s
 ; CHECK-NEXT: ushr v0.2s, v0.2s, #16
 ; CHECK-NEXT: mov w8, v0.s[1]
 ; CHECK-NEXT: fmov w9, s0
 ; CHECK-NEXT: strh w9, [x2]
 ; CHECK-NEXT: strh w8, [x2, #2]
 ; CHECK-NEXT: ret
 %x = load <2 x i16>, <2 x i16>* %px
 %y = load <2 x i16>, <2 x i16>* %py

llvm/test/CodeGen/AArch64/ssub_sat_vec.ll

 %z = call <4 x i8> @llvm.ssub.sat.v4i8(<4 x i8> %x, <4 x i8> %y)
 store <4 x i8> %z, <4 x i8>* %pz
 ret void
 }

 define void @v2i8(<2 x i8>* %px, <2 x i8>* %py, <2 x i8>* %pz) nounwind {
 ; CHECK-LABEL: v2i8:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: ldrb w8, [x0]
-; CHECK-NEXT: ldrb w9, [x1]
-; CHECK-NEXT: ldrb w10, [x0, #1]
-; CHECK-NEXT: ldrb w11, [x1, #1]
-; CHECK-NEXT: fmov s0, w8
-; CHECK-NEXT: fmov s1, w9
-; CHECK-NEXT: mov v0.s[1], w10
-; CHECK-NEXT: mov v1.s[1], w11
-; CHECK-NEXT: shl v1.2s, v1.2s, #24
+; CHECK-NEXT: ld1 { v0.b }[0], [x1]
+; CHECK-NEXT: ld1 { v1.b }[0], [x0]
+; CHECK-NEXT: add x8, x0, #1 // =1
+; CHECK-NEXT: add x9, x1, #1 // =1
+; CHECK-NEXT: ld1 { v0.b }[4], [x9]
+; CHECK-NEXT: ld1 { v1.b }[4], [x8]
 ; CHECK-NEXT: shl v0.2s, v0.2s, #24
-; CHECK-NEXT: sqsub v0.2s, v0.2s, v1.2s
+; CHECK-NEXT: shl v1.2s, v1.2s, #24
+; CHECK-NEXT: sqsub v0.2s, v1.2s, v0.2s
 ; CHECK-NEXT: ushr v0.2s, v0.2s, #24
 ; CHECK-NEXT: mov w8, v0.s[1]
 ; CHECK-NEXT: fmov w9, s0
 ; CHECK-NEXT: strb w9, [x2]
 ; CHECK-NEXT: strb w8, [x2, #1]
 ; CHECK-NEXT: ret
 %x = load <2 x i8>, <2 x i8>* %px
 %y = load <2 x i8>, <2 x i8>* %py

[15 lines not shown]

 %z = call <4 x i16> @llvm.ssub.sat.v4i16(<4 x i16> %x, <4 x i16> %y)
 store <4 x i16> %z, <4 x i16>* %pz
 ret void
 }

 define void @v2i16(<2 x i16>* %px, <2 x i16>* %py, <2 x i16>* %pz) nounwind {
 ; CHECK-LABEL: v2i16:
 ; CHECK: // %bb.0:
-; CHECK-NEXT: ldrh w8, [x0]
-; CHECK-NEXT: ldrh w9, [x1]
-; CHECK-NEXT: ldrh w10, [x0, #2]
-; CHECK-NEXT: ldrh w11, [x1, #2]
-; CHECK-NEXT: fmov s0, w8
-; CHECK-NEXT: fmov s1, w9
-; CHECK-NEXT: mov v0.s[1], w10
-; CHECK-NEXT: mov v1.s[1], w11
-; CHECK-NEXT: shl v1.2s, v1.2s, #16
+; CHECK-NEXT: ld1 { v0.h }[0], [x1]
+; CHECK-NEXT: ld1 { v1.h }[0], [x0]
+; CHECK-NEXT: add x8, x0, #2 // =2
+; CHECK-NEXT: add x9, x1, #2 // =2
+; CHECK-NEXT: ld1 { v0.h }[2], [x9]
+; CHECK-NEXT: ld1 { v1.h }[2], [x8]
 ; CHECK-NEXT: shl v0.2s, v0.2s, #16
-; CHECK-NEXT: sqsub v0.2s, v0.2s, v1.2s
+; CHECK-NEXT: shl v1.2s, v1.2s, #16
+; CHECK-NEXT: sqsub v0.2s, v1.2s, v0.2s
 ; CHECK-NEXT: ushr v0.2s, v0.2s, #16
 ; CHECK-NEXT: mov w8, v0.s[1]
 ; CHECK-NEXT: fmov w9, s0
 ; CHECK-NEXT: strh w9, [x2]
 ; CHECK-NEXT: strh w8, [x2, #2]
 ; CHECK-NEXT: ret
 %x = load <2 x i16>, <2 x i16>* %px
 %y = load <2 x i16>, <2 x i16>* %py