This is an archive of the discontinued LLVM Phabricator instance.

Differential D68700

[ARM] Add IR intrinsics for MVE VLD[24] and VST[24].
ClosedPublic

Authored by simon_tatham on Oct 9 2019, 6:01 AM.

Details

Reviewers

dmgreen
miyuki
ostannard

Commits

rGe0ef4ebe2f6a: [ARM] Add IR intrinsics for MVE VLD[24] and VST[24].

Summary

The VST2 and VST4 instructions take two or four vector registers as
input, and store part of each register to memory in an interleaved
pattern. They come in variants indicating which part of each register
they store (VST20 and VST21; VST40 to VST43 inclusive); the intention
is that issuing each of those variants in turn has the combined effect
of storing the whole set of registers to a memory block of
equal size. The corresponding VLD2 and VLD4 instructions load from
memory in the same interleaved format: each one overwrites only part
of its output register set, and again, the idea is that if you use
VLD4{0,1,2,3} or VLD2{0,1} together, you end up having written to the
whole of each register.
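
To make the combined effect concrete, here is a small C++ model of what both VLD2x stages (or both VST2x stages) achieve together for 32-bit lanes. It assumes the usual de-interleaving layout, in which memory element i belongs to register i % 2, lane i / 2. The type Q32 and the function names are invented for this sketch, and it deliberately does not model which lanes each individual stage touches.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model type: one 128-bit vector register as four 32-bit lanes.
using Q32 = std::array<std::uint32_t, 4>;

// Modelled combined effect of VLD20.32 followed by VLD21.32 (assumed layout):
// memory element i ends up in register i % 2, lane i / 2.
inline std::array<Q32, 2> model_vld2q_u32(const std::uint32_t *mem) {
  assert(mem != nullptr);
  std::array<Q32, 2> regs{};
  for (std::size_t i = 0; i < 8; ++i)
    regs[i % 2][i / 2] = mem[i];
  return regs;
}

// Modelled combined effect of VST20.32 followed by VST21.32: the inverse
// mapping, writing a memory block the same size as the register pair.
inline void model_vst2q_u32(std::uint32_t *mem,
                            const std::array<Q32, 2> &regs) {
  assert(mem != nullptr);
  for (std::size_t i = 0; i < 8; ++i)
    mem[i] = regs[i % 2][i / 2];
}
```

Under that assumption, storing the result of a modelled load reproduces the original eight-element block.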

I've implemented the stores and loads quite differently. The loads
were easiest to implement as a single intrinsic that expands to all
four VLD4x instructions or both VLD2x, delivering four complete output
registers. (Implementing each individual load as a separate
instruction taking four input registers to partially overwrite is
possible in theory, but pointless, and when I tried it, I found it
would need extra work to get the register allocation not to be
horrible.) Since that intrinsic delivers multiple outputs, it has to
be instruction-selected in custom C++.
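
As a rough illustration of that expansion, the sketch below models one load intrinsic call turning into NumVecs instructions, one per stage, chosen by lane size. The function expand_vld and its use of mnemonic strings are invented for this example. The real selector works with opcode tables and SelectionDAG nodes, and feeds each stage's register tuple into the next stage.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model: list the mnemonics that a single vld2q/vld4q call
// expands to, in stage order.
inline std::vector<std::string> expand_vld(unsigned numVecs,
                                           unsigned laneBits) {
  assert((numVecs == 2 || numVecs == 4) && "NumVecs should be 2 or 4");
  assert((laneBits == 8 || laneBits == 16 || laneBits == 32) &&
         "bad lane size");
  std::vector<std::string> insts;
  for (unsigned stage = 0; stage < numVecs; ++stage)
    insts.push_back("vld" + std::to_string(numVecs) + std::to_string(stage) +
                    "." + std::to_string(laneBits));
  return insts;
}
```

For example, expand_vld(4, 8) yields vld40.8, vld41.8, vld42.8 and vld43.8, the sequence the test file checks for the 8-bit vld4q case.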

But the store instructions are easier to model individually, because
they don't overwrite any register at all and you can write a DAG Isel
pattern in Tablegen for each one.

Hence, my new intrinsic int_arm_mve_vld4q expands to four load
instructions, delivers four full output vectors, and is handled by C++
code, whereas int_arm_mve_vst4q expands to just one store
instruction, takes four input vectors and a constant indicating which
lanes to store, and is handled entirely in Tablegen. (And similarly
for vld2q/vst2q.) This is asymmetric, but it was the easiest way to do
each one.
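
The store side of this asymmetry can be sketched in the same style. The model below assumes, as the Tablegen patterns in this patch do, that the instruction name is built from the vector count, the stage constant and the lane size. The functions select_vst and store_all are hypothetical helpers, not part of LLVM.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model: one store intrinsic call, carrying a constant stage
// operand, selects exactly one instruction.
inline std::string select_vst(unsigned numVecs, unsigned laneBits,
                              unsigned stage) {
  assert((numVecs == 2 || numVecs == 4) && "NumVecs should be 2 or 4");
  assert(stage < numVecs && "stage constant out of range");
  return "MVE_VST" + std::to_string(numVecs) + std::to_string(stage) + "_" +
         std::to_string(laneBits);
}

// A complete interleaved store therefore takes one call per stage, each
// passing the same input vectors.
inline std::vector<std::string> store_all(unsigned numVecs,
                                          unsigned laneBits) {
  std::vector<std::string> insts;
  for (unsigned stage = 0; stage < numVecs; ++stage)
    insts.push_back(select_vst(numVecs, laneBits, stage));
  return insts;
}
```

So a caller that wants the whole register set written out has to issue every stage itself, whereas a single load call already covers all stages.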

Diff Detail

Repository

rG LLVM Github Monorepo

Build Status

Buildable 39990
Build 40062: arc lint + arc unit

Event Timeline

simon_tatham created this revision. Oct 9 2019, 6:01 AM

Herald added a project: Restricted Project. Oct 9 2019, 6:01 AM

Herald added subscribers: llvm-commits, hiraditya, kristof.beyls.

Harbormaster completed remote builds in B39225: Diff 224029. Oct 9 2019, 6:03 AM

simon_tatham added a parent revision: D68699: [ARM] Add some sample IR MVE intrinsics with C++ isel. Oct 9 2019, 6:07 AM

simon_tatham added a child revision: D67159: [clang] New __attribute__((__clang_arm_mve_alias)).

Looks good. I think we can use this for autovec codegen too (there is a pre-isel pass that lets us convert the load+shuffle combos that the vectorizer produces into these intrinsics). That being the case, it would probably be worth making sure we have lots of test coverage of the different types. (I'm happy enough to do that later if you wish, but adding them here sounds more sensible, as this is where they are being introduced.)

llvm/lib/Target/ARM/ARMInstrMVE.td
4374	Do we need floating point types?

simon_tatham removed a child revision: D67159: [clang] New __attribute__((__clang_arm_mve_alias)). Oct 10 2019, 1:31 AM

simon_tatham added a child revision: D67161: [clang,ARM] Initial ACLE intrinsics for MVE. Oct 10 2019, 9:26 AM

Added floating-point forms of the interleaving loads. (They just
expand to the same instructions as the correspondingly sized integers,
of course.)

Harbormaster completed remote builds in B39990: Diff 226237. Oct 24 2019, 6:03 AM

Nice one. LGTM

This revision is now accepted and ready to land. Oct 24 2019, 7:14 AM

Closed by commit rGe0ef4ebe2f6a: [ARM] Add IR intrinsics for MVE VLD[24] and VST[24]. (authored by simon_tatham). Oct 24 2019, 8:35 AM

This revision was automatically updated to reflect the committed changes.

simon_tatham mentioned this in rG08074cc96557: [clang,ARM] Initial ACLE intrinsics for MVE..

Revision Contents

Path                                                Size
llvm/include/llvm/IR/IntrinsicsARM.td               7 lines
llvm/lib/Target/ARM/ARMISelDAGToDAG.cpp             74 lines
llvm/lib/Target/ARM/ARMInstrMVE.td                  23 lines
llvm/test/CodeGen/Thumb2/mve-intrinsics/vld24.ll    109 lines

Diff 226237

llvm/include/llvm/IR/IntrinsicsARM.td

[...]
 def int_arm_mve_vadc: Intrinsic<
    [llvm_anyvector_ty, llvm_i32_ty],
    [LLVMMatchType<0>, LLVMMatchType<0>, llvm_i32_ty], [IntrNoMem]>;
 def int_arm_mve_vadc_predicated: Intrinsic<
    [llvm_anyvector_ty, llvm_i32_ty],
    [LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>,
     llvm_i32_ty, llvm_anyvector_ty], [IntrNoMem]>;

+def int_arm_mve_vld2q: Intrinsic<[llvm_anyvector_ty, LLVMMatchType<0>], [llvm_anyptr_ty], [IntrReadMem]>;
+def int_arm_mve_vld4q: Intrinsic<[llvm_anyvector_ty, LLVMMatchType<0>, LLVMMatchType<0>, LLVMMatchType<0>], [llvm_anyptr_ty], [IntrReadMem]>;
+
+def int_arm_mve_vst2q: Intrinsic<[], [llvm_anyptr_ty, llvm_anyvector_ty, LLVMMatchType<1>, llvm_i32_ty], [IntrWriteMem]>;
+def int_arm_mve_vst4q: Intrinsic<[], [llvm_anyptr_ty, llvm_anyvector_ty, LLVMMatchType<1>, LLVMMatchType<1>, LLVMMatchType<1>, llvm_i32_ty], [IntrWriteMem]>;
+
 } // end TargetPrefix

llvm/lib/Target/ARM/ARMISelDAGToDAG.cpp

[...]
 private:
[...]
   /// SelectMVE_LongShift - Select MVE 64-bit scalar shift intrinsics.
   void SelectMVE_LongShift(SDNode *N, uint16_t Opcode, bool Immediate);

   /// SelectMVE_VADCSBC - Select MVE vector add/sub-with-carry intrinsics.
   void SelectMVE_VADCSBC(SDNode *N, uint16_t OpcodeWithCarry,
                          uint16_t OpcodeWithNoCarry, bool Add, bool Predicated);

+  /// SelectMVE_VLD - Select MVE interleaving load intrinsics. NumVecs
+  /// should be 2 or 4. The opcode array specifies the instructions
+  /// used for 8, 16 and 32-bit lane sizes respectively, and each
+  /// pointer points to a set of NumVecs sub-opcodes used for the
+  /// different stages (e.g. VLD20 versus VLD21) of each load family.
+  void SelectMVE_VLD(SDNode *N, unsigned NumVecs,
+                     const uint16_t *const *Opcodes);
+
   /// SelectVLDDup - Select NEON load-duplicate intrinsics. NumVecs
   /// should be 1, 2, 3 or 4. The opcode array specifies the instructions used
   /// for loading D registers.
   void SelectVLDDup(SDNode *N, bool IsIntrinsic, bool isUpdating,
                     unsigned NumVecs, const uint16_t *DOpcodes,
                     const uint16_t *QOpcodes0 = nullptr,
                     const uint16_t *QOpcodes1 = nullptr);

[...]
     AddMVEPredicateToOps(Ops, Loc,
                          N->getOperand(FirstInputOp + 3),  // predicate
                          N->getOperand(FirstInputOp - 1)); // inactive
   else
     AddEmptyMVEPredicateToOps(Ops, Loc, N->getValueType(0));

   CurDAG->SelectNodeTo(N, Opcode, N->getVTList(), makeArrayRef(Ops));
 }

+void ARMDAGToDAGISel::SelectMVE_VLD(SDNode *N, unsigned NumVecs,
+                                    const uint16_t *const *Opcodes) {
+  EVT VT = N->getValueType(0);
+  SDLoc Loc(N);
+
+  const uint16_t *OurOpcodes;
+  switch (VT.getVectorElementType().getSizeInBits()) {
+  case 8:
+    OurOpcodes = Opcodes[0];
+    break;
+  case 16:
+    OurOpcodes = Opcodes[1];
+    break;
+  case 32:
+    OurOpcodes = Opcodes[2];
+    break;
+  default:
+    llvm_unreachable("bad vector element size in SelectMVE_VLD");
+  }
+
+  EVT DataTy = EVT::getVectorVT(*CurDAG->getContext(), MVT::i64, NumVecs * 2);
+  EVT ResultTys[] = {DataTy, MVT::Other};
+
+  auto Data = SDValue(
+      CurDAG->getMachineNode(TargetOpcode::IMPLICIT_DEF, Loc, DataTy), 0);
+  SDValue Chain = N->getOperand(0);
+  for (unsigned Stage = 0; Stage < NumVecs; ++Stage) {
+    SDValue Ops[] = {Data, N->getOperand(2), Chain};
+    auto LoadInst =
+        CurDAG->getMachineNode(OurOpcodes[Stage], Loc, ResultTys, Ops);
+    Data = SDValue(LoadInst, 0);
+    Chain = SDValue(LoadInst, 1);
+  }
+
+  for (unsigned i = 0; i < NumVecs; i++)
+    ReplaceUses(SDValue(N, i),
+                CurDAG->getTargetExtractSubreg(ARM::qsub_0 + i, Loc, VT, Data));
+  ReplaceUses(SDValue(N, NumVecs), Chain);
+  CurDAG->RemoveDeadNode(N);
+}
+
 void ARMDAGToDAGISel::SelectVLDDup(SDNode *N, bool IsIntrinsic,
                                    bool isUpdating, unsigned NumVecs,
                                    const uint16_t *DOpcodes,
                                    const uint16_t *QOpcodes0,
                                    const uint16_t *QOpcodes1) {
   assert(NumVecs >= 1 && NumVecs <= 4 && "VLDDup NumVecs out-of-range");
   SDLoc dl(N);

[...]
   case ISD::INTRINSIC_W_CHAIN: {
[...]
     case Intrinsic::arm_mve_vldr_gather_base_wb:
     case Intrinsic::arm_mve_vldr_gather_base_wb_predicated: {
       static const uint16_t Opcodes[] = {ARM::MVE_VLDRWU32_qi_pre,
                                          ARM::MVE_VLDRDU64_qi_pre};
       SelectMVE_WB(N, Opcodes,
                    IntNo == Intrinsic::arm_mve_vldr_gather_base_wb_predicated);
       return;
     }
+
+    case Intrinsic::arm_mve_vld2q: {
+      static const uint16_t Opcodes8[] = {ARM::MVE_VLD20_8, ARM::MVE_VLD21_8};
+      static const uint16_t Opcodes16[] = {ARM::MVE_VLD20_16,
+                                           ARM::MVE_VLD21_16};
+      static const uint16_t Opcodes32[] = {ARM::MVE_VLD20_32,
+                                           ARM::MVE_VLD21_32};
+      static const uint16_t *const Opcodes[] = {Opcodes8, Opcodes16, Opcodes32};
+      SelectMVE_VLD(N, 2, Opcodes);
+      return;
+    }
+
+    case Intrinsic::arm_mve_vld4q: {
+      static const uint16_t Opcodes8[] = {ARM::MVE_VLD40_8, ARM::MVE_VLD41_8,
+                                          ARM::MVE_VLD42_8, ARM::MVE_VLD43_8};
+      static const uint16_t Opcodes16[] = {ARM::MVE_VLD40_16, ARM::MVE_VLD41_16,
+                                           ARM::MVE_VLD42_16,
+                                           ARM::MVE_VLD43_16};
+      static const uint16_t Opcodes32[] = {ARM::MVE_VLD40_32, ARM::MVE_VLD41_32,
+                                           ARM::MVE_VLD42_32,
+                                           ARM::MVE_VLD43_32};
+      static const uint16_t *const Opcodes[] = {Opcodes8, Opcodes16, Opcodes32};
+      SelectMVE_VLD(N, 4, Opcodes);
+      return;
+    }
     }
     break;
   }

   case ISD::INTRINSIC_WO_CHAIN: {
     unsigned IntNo = cast<ConstantSDNode>(N->getOperand(0))->getZExtValue();
     switch (IntNo) {
     default:
[...]

llvm/lib/Target/ARM/ARMInstrMVE.td

[...]
     def "MVE_VLD" # n.nvecs # stage # "_" # s.lanesize # wb.id_suffix
       : MVE_vld24_base<n, stage, s.sizebits, wb,
                        "vld" # n.nvecs # stage # "." # s.lanesize>;

     def "MVE_VST" # n.nvecs # stage # "_" # s.lanesize # wb.id_suffix
       : MVE_vst24_base<n, stage, s.sizebits, wb,
                        "vst" # n.nvecs # stage # "." # s.lanesize>;
   }

+multiclass MVE_vst24_patterns<int lanesize, ValueType VT> {
+  foreach stage = [0,1] in
+    def : Pat<(int_arm_mve_vst2q i32:$addr,
+                (VT MQPR:$v0), (VT MQPR:$v1), (i32 stage)),
+              (!cast<Instruction>("MVE_VST2"#stage#"_"#lanesize)
+                (REG_SEQUENCE QQPR, VT:$v0, qsub_0, VT:$v1, qsub_1),
+                t2_addr_offset_none:$addr)>;
+
+  foreach stage = [0,1,2,3] in
+    def : Pat<(int_arm_mve_vst4q i32:$addr,
+                (VT MQPR:$v0), (VT MQPR:$v1),
+                (VT MQPR:$v2), (VT MQPR:$v3), (i32 stage)),
+              (!cast<Instruction>("MVE_VST4"#stage#"_"#lanesize)
+                (REG_SEQUENCE QQQQPR, VT:$v0, qsub_0, VT:$v1, qsub_1,
+                                      VT:$v2, qsub_2, VT:$v3, qsub_3),
+                t2_addr_offset_none:$addr)>;
+}
+defm : MVE_vst24_patterns<8, v16i8>;
+defm : MVE_vst24_patterns<16, v8i16>;
+defm : MVE_vst24_patterns<32, v4i32>;
[Inline comment from dmgreen: Do we need floating point types?]
+defm : MVE_vst24_patterns<16, v8f16>;
+defm : MVE_vst24_patterns<32, v4f32>;
+
 // end of MVE interleaving load/store

 // start of MVE predicable load/store

 // A parameter class for the direction of transfer.
 class MVE_ldst_direction<bit b, dag Oo, dag Io, string c=""> {
   bit load = b;
   dag Oops = Oo;
[...]

llvm/test/CodeGen/Thumb2/mve-intrinsics/vld24.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=thumbv8.1m.main -mattr=+mve.fp -verify-machineinstrs -o - %s | FileCheck %s

				%struct.float16x8x2_t = type { [2 x <8 x half>] }
				%struct.uint8x16x4_t = type { [4 x <16 x i8>] }
				%struct.uint32x4x2_t = type { [2 x <4 x i32>] }
				%struct.int8x16x4_t = type { [4 x <16 x i8>] }

				define arm_aapcs_vfpcc %struct.float16x8x2_t @test_vld2q_f16(half* %addr) {
				; CHECK-LABEL: test_vld2q_f16:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vld20.16 {q0, q1}, [r0]
				; CHECK-NEXT: vld21.16 {q0, q1}, [r0]
				; CHECK-NEXT: bx lr
				entry:
				%0 = tail call { <8 x half>, <8 x half> } @llvm.arm.mve.vld2q.v8f16.p0f16(half* %addr)
				%1 = extractvalue { <8 x half>, <8 x half> } %0, 0
				%2 = insertvalue %struct.float16x8x2_t undef, <8 x half> %1, 0, 0
				%3 = extractvalue { <8 x half>, <8 x half> } %0, 1
				%4 = insertvalue %struct.float16x8x2_t %2, <8 x half> %3, 0, 1
				ret %struct.float16x8x2_t %4
				}

				declare { <8 x half>, <8 x half> } @llvm.arm.mve.vld2q.v8f16.p0f16(half*)

				define arm_aapcs_vfpcc %struct.uint8x16x4_t @test_vld4q_u8(i8* %addr) {
				; CHECK-LABEL: test_vld4q_u8:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vld40.8 {q0, q1, q2, q3}, [r0]
				; CHECK-NEXT: vld41.8 {q0, q1, q2, q3}, [r0]
				; CHECK-NEXT: vld42.8 {q0, q1, q2, q3}, [r0]
				; CHECK-NEXT: vld43.8 {q0, q1, q2, q3}, [r0]
				; CHECK-NEXT: bx lr
				entry:
				%0 = tail call { <16 x i8>, <16 x i8>, <16 x i8>, <16 x i8> } @llvm.arm.mve.vld4q.v16i8.p0i8(i8* %addr)
				%1 = extractvalue { <16 x i8>, <16 x i8>, <16 x i8>, <16 x i8> } %0, 0
				%2 = insertvalue %struct.uint8x16x4_t undef, <16 x i8> %1, 0, 0
				%3 = extractvalue { <16 x i8>, <16 x i8>, <16 x i8>, <16 x i8> } %0, 1
				%4 = insertvalue %struct.uint8x16x4_t %2, <16 x i8> %3, 0, 1
				%5 = extractvalue { <16 x i8>, <16 x i8>, <16 x i8>, <16 x i8> } %0, 2
				%6 = insertvalue %struct.uint8x16x4_t %4, <16 x i8> %5, 0, 2
				%7 = extractvalue { <16 x i8>, <16 x i8>, <16 x i8>, <16 x i8> } %0, 3
				%8 = insertvalue %struct.uint8x16x4_t %6, <16 x i8> %7, 0, 3
				ret %struct.uint8x16x4_t %8
				}

				declare { <16 x i8>, <16 x i8>, <16 x i8>, <16 x i8> } @llvm.arm.mve.vld4q.v16i8.p0i8(i8*)

				define arm_aapcs_vfpcc void @test_vst2q_u32(i32* %addr, %struct.uint32x4x2_t %value.coerce) {
				; CHECK-LABEL: test_vst2q_u32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: @ kill: def $q1 killed $q1 killed $q0_q1 def $q0_q1
				; CHECK-NEXT: @ kill: def $q0 killed $q0 killed $q0_q1 def $q0_q1
				; CHECK-NEXT: vst20.32 {q0, q1}, [r0]
				; CHECK-NEXT: vst21.32 {q0, q1}, [r0]
				; CHECK-NEXT: bx lr
				entry:
				%value.coerce.fca.0.0.extract = extractvalue %struct.uint32x4x2_t %value.coerce, 0, 0
				%value.coerce.fca.0.1.extract = extractvalue %struct.uint32x4x2_t %value.coerce, 0, 1
				tail call void @llvm.arm.mve.vst2q.p0i32.v4i32(i32* %addr, <4 x i32> %value.coerce.fca.0.0.extract, <4 x i32> %value.coerce.fca.0.1.extract, i32 0)
				tail call void @llvm.arm.mve.vst2q.p0i32.v4i32(i32* %addr, <4 x i32> %value.coerce.fca.0.0.extract, <4 x i32> %value.coerce.fca.0.1.extract, i32 1)
				ret void
				}

				declare void @llvm.arm.mve.vst2q.p0i32.v4i32(i32*, <4 x i32>, <4 x i32>, i32)

				define arm_aapcs_vfpcc void @test_vst2q_f16(half* %addr, %struct.float16x8x2_t %value.coerce) {
				; CHECK-LABEL: test_vst2q_f16:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: @ kill: def $q1 killed $q1 killed $q0_q1 def $q0_q1
				; CHECK-NEXT: @ kill: def $q0 killed $q0 killed $q0_q1 def $q0_q1
				; CHECK-NEXT: vst20.16 {q0, q1}, [r0]
				; CHECK-NEXT: vst21.16 {q0, q1}, [r0]
				; CHECK-NEXT: bx lr
				entry:
				%value.coerce.fca.0.0.extract = extractvalue %struct.float16x8x2_t %value.coerce, 0, 0
				%value.coerce.fca.0.1.extract = extractvalue %struct.float16x8x2_t %value.coerce, 0, 1
				call void @llvm.arm.mve.vst2q.p0f16.v8f16(half* %addr, <8 x half> %value.coerce.fca.0.0.extract, <8 x half> %value.coerce.fca.0.1.extract, i32 0)
				call void @llvm.arm.mve.vst2q.p0f16.v8f16(half* %addr, <8 x half> %value.coerce.fca.0.0.extract, <8 x half> %value.coerce.fca.0.1.extract, i32 1)
				ret void
				}

				declare void @llvm.arm.mve.vst2q.p0f16.v8f16(half*, <8 x half>, <8 x half>, i32)

				define arm_aapcs_vfpcc void @test_vst4q_s8(i8* %addr, %struct.int8x16x4_t %value.coerce) {
				; CHECK-LABEL: test_vst4q_s8:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: @ kill: def $q3 killed $q3 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
				; CHECK-NEXT: @ kill: def $q2 killed $q2 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
				; CHECK-NEXT: @ kill: def $q1 killed $q1 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
				; CHECK-NEXT: @ kill: def $q0 killed $q0 killed $q0_q1_q2_q3 def $q0_q1_q2_q3
				; CHECK-NEXT: vst40.8 {q0, q1, q2, q3}, [r0]
				; CHECK-NEXT: vst41.8 {q0, q1, q2, q3}, [r0]
				; CHECK-NEXT: vst42.8 {q0, q1, q2, q3}, [r0]
				; CHECK-NEXT: vst43.8 {q0, q1, q2, q3}, [r0]
				; CHECK-NEXT: bx lr
				entry:
				%value.coerce.fca.0.0.extract = extractvalue %struct.int8x16x4_t %value.coerce, 0, 0
				%value.coerce.fca.0.1.extract = extractvalue %struct.int8x16x4_t %value.coerce, 0, 1
				%value.coerce.fca.0.2.extract = extractvalue %struct.int8x16x4_t %value.coerce, 0, 2
				%value.coerce.fca.0.3.extract = extractvalue %struct.int8x16x4_t %value.coerce, 0, 3
				tail call void @llvm.arm.mve.vst4q.p0i8.v16i8(i8* %addr, <16 x i8> %value.coerce.fca.0.0.extract, <16 x i8> %value.coerce.fca.0.1.extract, <16 x i8> %value.coerce.fca.0.2.extract, <16 x i8> %value.coerce.fca.0.3.extract, i32 0)
				tail call void @llvm.arm.mve.vst4q.p0i8.v16i8(i8* %addr, <16 x i8> %value.coerce.fca.0.0.extract, <16 x i8> %value.coerce.fca.0.1.extract, <16 x i8> %value.coerce.fca.0.2.extract, <16 x i8> %value.coerce.fca.0.3.extract, i32 1)
				tail call void @llvm.arm.mve.vst4q.p0i8.v16i8(i8* %addr, <16 x i8> %value.coerce.fca.0.0.extract, <16 x i8> %value.coerce.fca.0.1.extract, <16 x i8> %value.coerce.fca.0.2.extract, <16 x i8> %value.coerce.fca.0.3.extract, i32 2)
				tail call void @llvm.arm.mve.vst4q.p0i8.v16i8(i8* %addr, <16 x i8> %value.coerce.fca.0.0.extract, <16 x i8> %value.coerce.fca.0.1.extract, <16 x i8> %value.coerce.fca.0.2.extract, <16 x i8> %value.coerce.fca.0.3.extract, i32 3)
				ret void
				}

				declare void @llvm.arm.mve.vst4q.p0i8.v16i8(i8*, <16 x i8>, <16 x i8>, <16 x i8>, <16 x i8>, i32)