This is an archive of the discontinued LLVM Phabricator instance.

Differential D77871

[AArch64] Armv8.6-a Matrix Mult Assembly + Intrinsics
ClosedPublic

Authored by LukeGeeson on Apr 10 2020, 6:33 AM.

Details

Reviewers

ostannard
t.p.northover
rengolin
kmclaughlin

Commits

rG832cd749131b: [AArch64] Armv8.6-a Matrix Mult Assembly + Intrinsics

Summary

This patch upstreams support for the Armv8.6-a Matrix Multiplication
Extension. A summary of the features can be found here:

https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/arm-architecture-developments-armv8-6-a

This patch includes:

Assembly support for AArch64 only (no SVE or Neon)
Intrinsics Support for AArch64 Armv8.6a Matrix Multiplication Instructions (No bfloat16 matrix multiplication)

No IR types or C Types are needed for this extension.
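
For readers unfamiliar with the instructions, the following is a minimal scalar sketch of what the three 8-bit matrix multiply-accumulate operations compute, assuming the usual Arm description of SMMLA, UMMLA and USMMLA: a 2x8 matrix times an 8x2 matrix, accumulated into a 2x2 matrix of 32-bit values. The function names (`smmla_ref`, `ummla_ref`, `usmmla_ref`) are hypothetical and are not part of this patch.

```c
#include <stdint.h>

/* Scalar reference models (hypothetical helpers, not part of the patch).
   `acc` is a 2x2 matrix in row-major order. `a` holds a 2x8 matrix as two
   rows of eight bytes; `b` holds the 8x2 right-hand matrix as two columns of
   eight bytes, so both operands are indexed as [8 * row_or_col + k].
   Sums are formed in uint32_t so that overflow wraps instead of being
   undefined behaviour in C. */

void smmla_ref(int32_t acc[4], const int8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 2; ++i) {
        for (int j = 0; j < 2; ++j) {
            uint32_t sum = (uint32_t)acc[2 * i + j];
            for (int k = 0; k < 8; ++k)
                sum += (uint32_t)((int32_t)a[8 * i + k] * (int32_t)b[8 * j + k]);
            acc[2 * i + j] = (int32_t)sum;
        }
    }
}

void ummla_ref(uint32_t acc[4], const uint8_t a[16], const uint8_t b[16]) {
    for (int i = 0; i < 2; ++i) {
        for (int j = 0; j < 2; ++j) {
            uint32_t sum = acc[2 * i + j];
            for (int k = 0; k < 8; ++k)
                sum += (uint32_t)a[8 * i + k] * (uint32_t)b[8 * j + k];
            acc[2 * i + j] = sum;
        }
    }
}

/* Mixed sign: first operand unsigned, second operand signed. */
void usmmla_ref(int32_t acc[4], const uint8_t a[16], const int8_t b[16]) {
    for (int i = 0; i < 2; ++i) {
        for (int j = 0; j < 2; ++j) {
            uint32_t sum = (uint32_t)acc[2 * i + j];
            for (int k = 0; k < 8; ++k)
                sum += (uint32_t)((int32_t)a[8 * i + k] * (int32_t)b[8 * j + k]);
            acc[2 * i + j] = (int32_t)sum;
        }
    }
}
```

The argument order follows the `vmmlaq_s32(r, a, b)`, `vmmlaq_u32(r, a, b)` and `vusmmlaq_s32(r, a, b)` intrinsics added by this revision, with plain arrays standing in for the vector types.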

This is part of a patch series, starting with BFloat16 support and
the other components in the Armv8.6-a extension (in previous patches
linked in Phabricator).

Based on work by:

Luke Geeson
Oliver Stannard
Luke Cheeseman

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

LukeGeeson created this revision. Apr 10 2020, 6:33 AM

Herald added a reviewer: rengolin. Apr 10 2020, 6:33 AM

Herald added a project: Restricted Project.

Herald added subscribers: cfe-commits, danielkiss, hiraditya, kristof.beyls.

LukeGeeson added a parent revision: D77540: [PATCH] [ARM]: Armv8.6-a Matrix Mul Asm and Intrinsics Support. Apr 10 2020, 6:34 AM

LukeGeeson added a child revision: D77872: [AArch32] Armv8.6-a Matrix Mult Assembly + Intrinsics. Apr 10 2020, 6:39 AM

Harbormaster failed remote builds in B52667: Diff 256563! Apr 10 2020, 6:57 AM

LukeGeeson removed a parent revision: D77540: [PATCH] [ARM]: Armv8.6-a Matrix Mul Asm and Intrinsics Support. Apr 10 2020, 7:23 AM

Removed reliance on parent revision, harbormaster now builds with unit tests passing

kmclaughlin added a subscriber: kmclaughlin. Apr 22 2020, 7:10 AM

kmclaughlin added inline comments.

clang/test/CodeGen/aarch64-v8.6a-neon-intrinsics.c
4	Is it possible to use -sroa here as you did for the tests added in D77872? If so, I think this might make some of the `_lane` tests below a bit easier to follow.
llvm/test/MC/AArch64/armv8.6a-simd-matmul-error.s
18	The arrangement specifiers of the first two operands don't match for these tests, which is what the next set of tests below is checking for. It might be worth keeping these tests specific to just the index being out of range.
27	muct -> must :)

fixed typos
added sroa as mem2reg arg to reduce redundant mem accesses in tests, refactored test
addressed other comments

Thanks for the updates, @LukeGeeson, LGTM

This revision is now accepted and ready to land. Apr 23 2020, 2:35 AM

Closed by commit rG832cd749131b: [AArch64] Armv8.6-a Matrix Mult Assembly + Intrinsics (authored by LukeGeeson). Apr 24 2020, 8:05 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                          Size
clang/include/clang/Basic/arm_neon.td                         34 lines
clang/lib/Basic/Targets/AArch64.h                             1 line
clang/lib/Basic/Targets/AArch64.cpp                           6 lines
clang/lib/CodeGen/CGBuiltin.cpp                               24 lines
clang/test/CodeGen/aarch64-matmul.cpp                         8 lines
clang/test/CodeGen/aarch64-v8.6a-neon-intrinsics.c            147 lines
llvm/include/llvm/IR/IntrinsicsAArch64.td                     11 lines
llvm/lib/Target/AArch64/AArch64.td                            12 lines
llvm/lib/Target/AArch64/AArch64InstrFormats.td                35 lines
llvm/lib/Target/AArch64/AArch64InstrInfo.td                   48 lines
llvm/lib/Target/AArch64/AArch64Subtarget.h                    6 lines
llvm/test/CodeGen/AArch64/aarch64-matmul.ll                   136 lines
llvm/test/MC/AArch64/armv8.6a-simd-matmul-error.s             34 lines
llvm/test/MC/AArch64/armv8.6a-simd-matmul.s                   43 lines
llvm/test/MC/Disassembler/AArch64/armv8.6a-simd-matmul.txt    34 lines

Diff 259890

clang/include/clang/Basic/arm_neon.td

@@ 215 lines hidden @@ def OP_FMLAL_LN : Op<(call "vfmlal_low", $p0, $p1,
      (dup_typed $p1, (call "vget_lane", $p2, $p3)))>;
  def OP_FMLSL_LN : Op<(call "vfmlsl_low", $p0, $p1,
      (dup_typed $p1, (call "vget_lane", $p2, $p3)))>;
  def OP_FMLAL_LN_Hi : Op<(call "vfmlal_high", $p0, $p1,
      (dup_typed $p1, (call "vget_lane", $p2, $p3)))>;
  def OP_FMLSL_LN_Hi : Op<(call "vfmlsl_high", $p0, $p1,
      (dup_typed $p1, (call "vget_lane", $p2, $p3)))>;

+ def OP_USDOT_LN
+     : Op<(call "vusdot", $p0, $p1,
+         (cast "8", "S", (call_mangled "splat_lane", (bitcast "int32x2_t", $p2), $p3)))>;
+ def OP_USDOT_LNQ
+     : Op<(call "vusdot", $p0, $p1,
+         (cast "8", "S", (call_mangled "splat_lane", (bitcast "int32x4_t", $p2), $p3)))>;
+
+ // sudot splats the second vector and then calls vusdot
+ def OP_SUDOT_LN
+     : Op<(call "vusdot", $p0,
+         (cast "8", "U", (call_mangled "splat_lane", (bitcast "int32x2_t", $p2), $p3)), $p1)>;
+ def OP_SUDOT_LNQ
+     : Op<(call "vusdot", $p0,
+         (cast "8", "U", (call_mangled "splat_lane", (bitcast "int32x4_t", $p2), $p3)), $p1)>;
+
  //===----------------------------------------------------------------------===//
  // Auxiliary Instructions
  //===----------------------------------------------------------------------===//

  // Splat operation - performs a range-checked splat over a vector
  def SPLAT : WInst<"splat_lane", ".(!q)I",
      "UcUsUicsilPcPsfQUcQUsQUiQcQsQiQPcQPsQflUlQlQUlhdQhQdPlQPl">;
  def SPLATQ : WInst<"splat_laneq", ".(!Q)I",
@@ 1,555 lines hidden @@ let ArchGuard = "defined(__ARM_FEATURE_FP16FML) && defined(__aarch64__)" in {
    def VFMLAL_LANEQ_HIGH : SOpInst<"vfmlal_laneq_high", "(F>)(F>)F(FQ)I", "hQh", OP_FMLAL_LN_Hi> {
      let isLaneQ = 1;
    }
    def VFMLSL_LANEQ_HIGH : SOpInst<"vfmlsl_laneq_high", "(F>)(F>)F(FQ)I", "hQh", OP_FMLSL_LN_Hi> {
      let isLaneQ = 1;
    }
  }

+ let ArchGuard = "defined(__ARM_FEATURE_MATMUL_INT8)" in {
+   def VMMLA : SInst<"vmmla", "..(<<)(<<)", "QUiQi">;
+   def VUSMMLA : SInst<"vusmmla", "..(<<U)(<<)", "Qi">;
+
+   def VUSDOT : SInst<"vusdot", "..(<<U)(<<)", "iQi">;
+
+   def VUSDOT_LANE : SOpInst<"vusdot_lane", "..(<<U)(<<q)I", "iQi", OP_USDOT_LN>;
+   def VSUDOT_LANE : SOpInst<"vsudot_lane", "..(<<)(<<qU)I", "iQi", OP_SUDOT_LN>;
+
+   let ArchGuard = "defined(__aarch64__)" in {
+     let isLaneQ = 1 in {
+       def VUSDOT_LANEQ : SOpInst<"vusdot_laneq", "..(<<U)(<<Q)I", "iQi", OP_USDOT_LNQ>;
+       def VSUDOT_LANEQ : SOpInst<"vsudot_laneq", "..(<<)(<<QU)I", "iQi", OP_SUDOT_LNQ>;
+     }
+   }
+ }
+
  // v8.3-A Vector complex addition intrinsics
  let ArchGuard = "defined(__ARM_FEATURE_COMPLEX) && defined(__ARM_FEATURE_FP16_VECTOR_ARITHMETIC)" in {
    def VCADD_ROT90_FP16 : SInst<"vcadd_rot90", "...", "h">;
    def VCADD_ROT270_FP16 : SInst<"vcadd_rot270", "...", "h">;
    def VCADDQ_ROT90_FP16 : SInst<"vcaddq_rot90", "QQQ", "h">;
    def VCADDQ_ROT270_FP16 : SInst<"vcaddq_rot270", "QQQ", "h">;
  }
  let ArchGuard = "defined(__ARM_FEATURE_COMPLEX)" in {
    def VCADD_ROT90 : SInst<"vcadd_rot90", "...", "f">;
    def VCADD_ROT270 : SInst<"vcadd_rot270", "...", "f">;
    def VCADDQ_ROT90 : SInst<"vcaddq_rot90", "QQQ", "f">;
    def VCADDQ_ROT270 : SInst<"vcaddq_rot270", "QQQ", "f">;
  }
  let ArchGuard = "defined(__ARM_FEATURE_COMPLEX) && defined(__aarch64__)" in {
    def VCADDQ_ROT90_FP64 : SInst<"vcaddq_rot90", "QQQ", "d">;
    def VCADDQ_ROT270_FP64 : SInst<"vcaddq_rot270", "QQQ", "d">;
  }
No newline at end of file
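
The `OP_USDOT_LN` and `OP_SUDOT_LN` definitions in arm_neon.td express the lane variants in terms of plain `vusdot`: the selected 32-bit lane of the last operand is splatted, and for sudot the two 8-bit operands are also swapped. A minimal scalar sketch of that rewriting for the 64-bit forms follows; it assumes the usual Arm definition of USDOT (four unsigned-by-signed byte products accumulated per 32-bit lane), and the helper names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* usdot: each 32-bit lane accumulates four (unsigned byte * signed byte)
   products. Sums use uint32_t so overflow wraps. */
void usdot_ref(int32_t *acc, const uint8_t *a, const int8_t *b, int lanes) {
    for (int i = 0; i < lanes; ++i) {
        uint32_t sum = (uint32_t)acc[i];
        for (int k = 0; k < 4; ++k)
            sum += (uint32_t)((int32_t)a[4 * i + k] * (int32_t)b[4 * i + k]);
        acc[i] = (int32_t)sum;
    }
}

/* Copy the 4-byte group `lane` of `src` into every 4-byte group of `dst`,
   i.e. a splat of one 32-bit lane viewed as bytes. */
static void splat_group(void *dst, const void *src, int lane, int lanes) {
    for (int i = 0; i < lanes; ++i)
        memcpy((unsigned char *)dst + 4 * i,
               (const unsigned char *)src + 4 * lane, 4);
}

/* Models OP_USDOT_LN: splat a lane of the signed operand, then usdot. */
void usdot_lane_ref(int32_t acc[2], const uint8_t a[8], const int8_t b[8], int lane) {
    int8_t splat[8];
    splat_group(splat, b, lane, 2);
    usdot_ref(acc, a, splat, 2);
}

/* Models OP_SUDOT_LN: splat a lane of the unsigned operand, then call usdot
   with the two 8-bit operands swapped. */
void sudot_lane_ref(int32_t acc[2], const int8_t a[8], const uint8_t b[8], int lane) {
    uint8_t splat[8];
    splat_group(splat, b, lane, 2);
    usdot_ref(acc, splat, a, 2);
}
```

Because multiplication is commutative, swapping the operands is enough to reuse the unsigned-by-signed operation for the signed-by-unsigned case, which is why only a single `usdot` LLVM intrinsic is needed for both.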

clang/lib/Basic/Targets/AArch64.h

@@ 30 lines hidden @@ class LLVM_LIBRARY_VISIBILITY AArch64TargetInfo : public TargetInfo {
    bool HasCRC;
    bool HasCrypto;
    bool HasUnaligned;
    bool HasFullFP16;
    bool HasDotProd;
    bool HasFP16FML;
    bool HasMTE;
    bool HasTME;
+   bool HasMatMul;

    llvm::AArch64::ArchKind ArchKind;

    static const Builtin::Info BuiltinInfo[];

    std::string ABI;

  public:
@@ 161 lines hidden @@

clang/lib/Basic/Targets/AArch64.cpp

@@ 274 lines hidden @@ if (HasDotProd)
      Builder.defineMacro("__ARM_FEATURE_DOTPROD", "1");

    if (HasMTE)
      Builder.defineMacro("__ARM_FEATURE_MEMORY_TAGGING", "1");

    if (HasTME)
      Builder.defineMacro("__ARM_FEATURE_TME", "1");

+   if (HasMatMul)
+     Builder.defineMacro("__ARM_FEATURE_MATMUL_INT8", "1");
+
    if ((FPU & NeonMode) && HasFP16FML)
      Builder.defineMacro("__ARM_FEATURE_FP16FML", "1");

    if (Opts.hasSignReturnAddress()) {
      // Bitmask:
      // 0: Protection using the A key
      // 1: Protection using the B key
      // 2: Protection including leaf functions
@@ 60 lines hidden @@ bool AArch64TargetInfo::handleTargetFeatures(std::vector<std::string> &Features,
    HasCRC = false;
    HasCrypto = false;
    HasUnaligned = true;
    HasFullFP16 = false;
    HasDotProd = false;
    HasFP16FML = false;
    HasMTE = false;
    HasTME = false;
+   HasMatMul = false;
    ArchKind = llvm::AArch64::ArchKind::ARMV8A;

    for (const auto &Feature : Features) {
      if (Feature == "+neon")
        FPU |= NeonMode;
      if (Feature == "+sve")
        FPU |= SveMode;
      if (Feature == "+crc")
@@ 19 lines hidden @@ for (const auto &Feature : Features) {
      if (Feature == "+dotprod")
        HasDotProd = true;
      if (Feature == "+fp16fml")
        HasFP16FML = true;
      if (Feature == "+mte")
        HasMTE = true;
      if (Feature == "+tme")
        HasTME = true;
+     if (Feature == "+i8mm")
+       HasMatMul = true;
    }

    setDataLayout();

    return true;
  }

  TargetInfo::CallingConvCheckResult
@@ 383 lines hidden @@

clang/lib/CodeGen/CGBuiltin.cpp

@@ 5,003 lines hidden @@ static const ARMVectorIntrinsicInfo AArch64SIMDIntrinsicMap[] = {
    NEONMAP2(vhsub_v, aarch64_neon_uhsub, aarch64_neon_shsub, Add1ArgType | UnsignedAlts),
    NEONMAP2(vhsubq_v, aarch64_neon_uhsub, aarch64_neon_shsub, Add1ArgType | UnsignedAlts),
    NEONMAP1(vld1_x2_v, aarch64_neon_ld1x2, 0),
    NEONMAP1(vld1_x3_v, aarch64_neon_ld1x3, 0),
    NEONMAP1(vld1_x4_v, aarch64_neon_ld1x4, 0),
    NEONMAP1(vld1q_x2_v, aarch64_neon_ld1x2, 0),
    NEONMAP1(vld1q_x3_v, aarch64_neon_ld1x3, 0),
    NEONMAP1(vld1q_x4_v, aarch64_neon_ld1x4, 0),
+   NEONMAP2(vmmlaq_v, aarch64_neon_ummla, aarch64_neon_smmla, 0),
    NEONMAP0(vmovl_v),
    NEONMAP0(vmovn_v),
    NEONMAP1(vmul_v, aarch64_neon_pmul, Add1ArgType),
    NEONMAP1(vmulq_v, aarch64_neon_pmul, Add1ArgType),
    NEONMAP1(vpadd_v, aarch64_neon_addp, Add1ArgType),
    NEONMAP2(vpaddl_v, aarch64_neon_uaddlp, aarch64_neon_saddlp, UnsignedAlts),
    NEONMAP2(vpaddlq_v, aarch64_neon_uaddlp, aarch64_neon_saddlp, UnsignedAlts),
    NEONMAP1(vpaddq_v, aarch64_neon_addp, Add1ArgType),
@@ 66 lines hidden @@ static const ARMVectorIntrinsicInfo AArch64SIMDIntrinsicMap[] = {
    NEONMAP1(vst1_x3_v, aarch64_neon_st1x3, 0),
    NEONMAP1(vst1_x4_v, aarch64_neon_st1x4, 0),
    NEONMAP1(vst1q_x2_v, aarch64_neon_st1x2, 0),
    NEONMAP1(vst1q_x3_v, aarch64_neon_st1x3, 0),
    NEONMAP1(vst1q_x4_v, aarch64_neon_st1x4, 0),
    NEONMAP0(vsubhn_v),
    NEONMAP0(vtst_v),
    NEONMAP0(vtstq_v),
+   NEONMAP1(vusdot_v, aarch64_neon_usdot, 0),
+   NEONMAP1(vusdotq_v, aarch64_neon_usdot, 0),
+   NEONMAP1(vusmmlaq_v, aarch64_neon_usmmla, 0),
  };

  static const ARMVectorIntrinsicInfo AArch64SISDIntrinsicMap[] = {
    NEONMAP1(vabdd_f64, aarch64_sisd_fabd, Add1ArgType),
    NEONMAP1(vabds_f32, aarch64_sisd_fabd, Add1ArgType),
    NEONMAP1(vabsd_s64, aarch64_neon_abs, Add1ArgType),
    NEONMAP1(vaddlv_s32, aarch64_neon_saddlv, AddRetType | Add1ArgType),
    NEONMAP1(vaddlv_u32, aarch64_neon_uaddlv, AddRetType | Add1ArgType),
@@ 969 lines hidden @@ Value *CodeGenFunction::EmitCommonNeonBuiltinExpr(
    }
    case NEON::BI__builtin_neon_vfmlsl_high_v:
    case NEON::BI__builtin_neon_vfmlslq_high_v: {
      llvm::Type *InputTy =
          llvm::VectorType::get(HalfTy, Ty->getPrimitiveSizeInBits() / 16);
      llvm::Type *Tys[2] = { Ty, InputTy };
      return EmitNeonCall(CGM.getIntrinsic(Int, Tys), Ops, "vfmlsl_high");
    }
+   case NEON::BI__builtin_neon_vmmlaq_v: {
+     llvm::Type *InputTy =
+         llvm::VectorType::get(Int8Ty, Ty->getPrimitiveSizeInBits() / 8);
+     llvm::Type *Tys[2] = { Ty, InputTy };
+     Int = Usgn ? LLVMIntrinsic : AltLLVMIntrinsic;
+     return EmitNeonCall(CGM.getIntrinsic(Int, Tys), Ops, "vmmla");
+   }
+   case NEON::BI__builtin_neon_vusmmlaq_v: {
+     llvm::Type *InputTy =
+         llvm::VectorType::get(Int8Ty, Ty->getPrimitiveSizeInBits() / 8);
+     llvm::Type *Tys[2] = { Ty, InputTy };
+     return EmitNeonCall(CGM.getIntrinsic(Int, Tys), Ops, "vusmmla");
+   }
+   case NEON::BI__builtin_neon_vusdot_v:
+   case NEON::BI__builtin_neon_vusdotq_v: {
+     llvm::Type *InputTy =
+         llvm::VectorType::get(Int8Ty, Ty->getPrimitiveSizeInBits() / 8);
+     llvm::Type *Tys[2] = { Ty, InputTy };
+     return EmitNeonCall(CGM.getIntrinsic(Int, Tys), Ops, "vusdot");
+   }
    }

    assert(Int && "Expected valid intrinsic number");

    // Determine the type(s) of this overloaded AArch64 intrinsic.
    Function *F = LookupNeonLLVMIntrinsic(Int, Modifier, Ty, E);

    Value *Result = EmitNeonCall(F, Ops, NameHint);
@@ 9,907 lines hidden @@
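
In the new CGBuiltin.cpp cases, the 8-bit input vector type is derived from the width of the 32-bit result type `Ty` (`Ty->getPrimitiveSizeInBits() / 8`), and `vmmlaq` picks `ummla` or `smmla` from `Usgn`. The arithmetic can be sketched in plain C as below; this is only an illustration of how the overloaded names seen in the tests (for example `llvm.aarch64.neon.smmla.v4i32.v16i8`) come about, and both helper functions are hypothetical rather than anything in Clang.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: picks the operation the way the vmmlaq case does with
   `Int = Usgn ? LLVMIntrinsic : AltLLVMIntrinsic`, given that the NEONMAP2
   entry lists ummla first and smmla second. */
const char *vmmla_op(int is_unsigned) {
    return is_unsigned ? "ummla" : "smmla";
}

/* Hypothetical helper: builds the overloaded intrinsic name for a result
   vector of `result_bits` bits (64 or 128). */
int i8mm_intrinsic_name(char *buf, size_t n, const char *op, unsigned result_bits) {
    unsigned i32_lanes = result_bits / 32; /* lanes in the result type Ty */
    unsigned i8_lanes = result_bits / 8;   /* mirrors getPrimitiveSizeInBits() / 8 */
    return snprintf(buf, n, "llvm.aarch64.neon.%s.v%ui32.v%ui8",
                    op, i32_lanes, i8_lanes);
}
```

A 128-bit result therefore pairs `<4 x i32>` with `<16 x i8>`, and a 64-bit result pairs `<2 x i32>` with `<8 x i8>`.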

clang/test/CodeGen/aarch64-matmul.cpp

This file was added.

				// RUN: %clang_cc1 -triple aarch64-eabi -target-feature +neon -target-feature +i8mm -S -emit-llvm %s -o - | FileCheck %s

				#ifdef __ARM_FEATURE_MATMUL_INT8
				extern "C" void arm_feature_matmulint8_defined() {}
				#endif
				// CHECK: define void @arm_feature_matmulint8_defined()
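
User code can key off the same macro that this test checks. The following is a minimal sketch of a feature-guarded wrapper with a portable scalar fallback; `usdot2` is a hypothetical name, and the intrinsic path assumes `vusdot_s32` with the signature exercised in aarch64-v8.6a-neon-intrinsics.c.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_MATMUL_INT8)
#include <arm_neon.h>
#endif

/* Hypothetical wrapper: acc[i] += sum over k of a[4*i+k] * b[4*i+k], with an
   unsigned first operand and a signed second operand. */
void usdot2(int32_t acc[2], const uint8_t a[8], const int8_t b[8]) {
#if defined(__ARM_FEATURE_MATMUL_INT8)
    /* Only compiled when the target enables +i8mm. */
    int32x2_t r = vusdot_s32(vld1_s32(acc), vld1_u8(a), vld1_s8(b));
    vst1_s32(acc, r);
#else
    /* Portable fallback; sums use uint32_t so overflow wraps. */
    for (int i = 0; i < 2; ++i) {
        uint32_t sum = (uint32_t)acc[i];
        for (int k = 0; k < 4; ++k)
            sum += (uint32_t)((int32_t)a[4 * i + k] * (int32_t)b[4 * i + k]);
        acc[i] = (int32_t)sum;
    }
#endif
}
```

Guarding the include as well as the call keeps the file buildable on targets that have no `arm_neon.h` at all.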

clang/test/CodeGen/aarch64-v8.6a-neon-intrinsics.c

This file was added.

				// RUN: %clang_cc1 -triple arm64-none-linux-gnu -target-feature +neon -target-feature +fullfp16 -target-feature +v8.6a -target-feature +i8mm \
				// RUN: -fallow-half-arguments-and-returns -S -disable-O0-optnone -emit-llvm -o - %s \
				// RUN: | opt -S -mem2reg -sroa \
				// RUN: | FileCheck %s

				// REQUIRES: aarch64-registered-target

				#include <arm_neon.h>

				// CHECK-LABEL: test_vmmlaq_s32
				// CHECK: [[VAL:%.*]] = call <4 x i32> @llvm.aarch64.neon.smmla.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %b)
				// CHECK: ret <4 x i32> [[VAL]]
				int32x4_t test_vmmlaq_s32(int32x4_t r, int8x16_t a, int8x16_t b) {
				return vmmlaq_s32(r, a, b);
				}

				// CHECK-LABEL: test_vmmlaq_u32
				// CHECK: [[VAL:%.*]] = call <4 x i32> @llvm.aarch64.neon.ummla.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %b)
				// CHECK: ret <4 x i32> [[VAL]]
				uint32x4_t test_vmmlaq_u32(uint32x4_t r, uint8x16_t a, uint8x16_t b) {
				return vmmlaq_u32(r, a, b);
				}

				// CHECK-LABEL: test_vusmmlaq_s32
				// CHECK: [[VAL:%.*]] = call <4 x i32> @llvm.aarch64.neon.usmmla.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %b)
				// CHECK: ret <4 x i32> [[VAL]]
				int32x4_t test_vusmmlaq_s32(int32x4_t r, uint8x16_t a, int8x16_t b) {
				return vusmmlaq_s32(r, a, b);
				}

				// CHECK-LABEL: test_vusdot_s32
				// CHECK: [[VAL:%.*]] = call <2 x i32> @llvm.aarch64.neon.usdot.v2i32.v8i8(<2 x i32> %r, <8 x i8> %a, <8 x i8> %b)
				// CHECK: ret <2 x i32> [[VAL]]
				int32x2_t test_vusdot_s32(int32x2_t r, uint8x8_t a, int8x8_t b) {
				return vusdot_s32(r, a, b);
				}

				// CHECK-LABEL: test_vusdot_lane_s32
				// CHECK: [[TMP0:%.*]] = bitcast <8 x i8> %b to <2 x i32>
				// CHECK: [[TMP1:%.*]] = bitcast <2 x i32> [[TMP0]] to <8 x i8>
				// CHECK: [[TMP2:%.*]] = bitcast <8 x i8> [[TMP1]] to <2 x i32>
				// CHECK: [[LANE:%.*]] = shufflevector <2 x i32> [[TMP2]], <2 x i32> [[TMP2]], <2 x i32> zeroinitializer
				// CHECK: [[TMP4:%.*]] = bitcast <2 x i32> [[LANE]] to <8 x i8>
				// CHECK: [[TMP5:%.*]] = bitcast <2 x i32> %r to <8 x i8>
				// CHECK: [[OP:%.*]] = call <2 x i32> @llvm.aarch64.neon.usdot.v2i32.v8i8(<2 x i32> %r, <8 x i8> %a, <8 x i8> [[TMP4]])
				// CHECK: ret <2 x i32> [[OP]]
				int32x2_t test_vusdot_lane_s32(int32x2_t r, uint8x8_t a, int8x8_t b) {
				return vusdot_lane_s32(r, a, b, 0);
				}

				// CHECK-LABEL: test_vsudot_lane_s32
				// CHECK: [[TMP0:%.*]] = bitcast <8 x i8> %b to <2 x i32>
				// CHECK: [[TMP1:%.*]] = bitcast <2 x i32> %0 to <8 x i8>
				// CHECK: [[TMP2:%.*]] = bitcast <8 x i8> %1 to <2 x i32>
				// CHECK: [[LANE:%.*]] = shufflevector <2 x i32> [[TMP2]], <2 x i32> [[TMP2]], <2 x i32> zeroinitializer
				// CHECK: [[TMP4:%.*]] = bitcast <2 x i32> [[LANE]] to <8 x i8>
				// CHECK: [[TMP5:%.*]] = bitcast <2 x i32> %r to <8 x i8>
				// CHECK: [[OP:%.*]] = call <2 x i32> @llvm.aarch64.neon.usdot.v2i32.v8i8(<2 x i32> %r, <8 x i8> [[TMP4]], <8 x i8> %a)
				// CHECK: ret <2 x i32> [[OP]]
				int32x2_t test_vsudot_lane_s32(int32x2_t r, int8x8_t a, uint8x8_t b) {
				return vsudot_lane_s32(r, a, b, 0);
				}

				// CHECK-LABEL: test_vusdot_laneq_s32
				// CHECK: [[TMP0:%.*]] = bitcast <16 x i8> %b to <4 x i32>
				// CHECK: [[TMP1:%.*]] = bitcast <4 x i32> [[TMP0]] to <16 x i8>
				// CHECK: [[TMP2:%.*]] = bitcast <16 x i8> [[TMP1]] to <4 x i32>
				// CHECK: [[LANE:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> [[TMP2]], <2 x i32> zeroinitializer
				// CHECK: [[TMP4:%.*]] = bitcast <2 x i32> [[LANE]] to <8 x i8>
				// CHECK: [[TMP5:%.*]] = bitcast <2 x i32> %r to <8 x i8>
				// CHECK: [[OP:%.*]] = call <2 x i32> @llvm.aarch64.neon.usdot.v2i32.v8i8(<2 x i32> %r, <8 x i8> %a, <8 x i8> [[TMP4]])
				// CHECK: ret <2 x i32> [[OP]]
				int32x2_t test_vusdot_laneq_s32(int32x2_t r, uint8x8_t a, int8x16_t b) {
				return vusdot_laneq_s32(r, a, b, 0);
				}

				// CHECK-LABEL: test_vsudot_laneq_s32
				// CHECK: [[TMP0:%.*]] = bitcast <16 x i8> %b to <4 x i32>
				// CHECK: [[TMP1:%.*]] = bitcast <4 x i32> [[TMP0]] to <16 x i8>
				// CHECK: [[TMP2:%.*]] = bitcast <16 x i8> [[TMP1]] to <4 x i32>
				// CHECK: [[LANE:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> [[TMP2]], <2 x i32> zeroinitializer
				// CHECK: [[TMP4:%.*]] = bitcast <2 x i32> [[LANE]] to <8 x i8>
				// CHECK: [[TMP5:%.*]] = bitcast <2 x i32> %r to <8 x i8>
				// CHECK: [[OP:%.*]] = call <2 x i32> @llvm.aarch64.neon.usdot.v2i32.v8i8(<2 x i32> %r, <8 x i8> [[TMP4]], <8 x i8> %a)
				// CHECK: ret <2 x i32> [[OP]]
				int32x2_t test_vsudot_laneq_s32(int32x2_t r, int8x8_t a, uint8x16_t b) {
				return vsudot_laneq_s32(r, a, b, 0);
				}

				// CHECK-LABEL: test_vusdotq_s32
				// CHECK: [[VAL:%.*]] = call <4 x i32> @llvm.aarch64.neon.usdot.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %b)
				// CHECK: ret <4 x i32> [[VAL]]
				int32x4_t test_vusdotq_s32(int32x4_t r, uint8x16_t a, int8x16_t b) {
				return vusdotq_s32(r, a, b);
				}

				// CHECK-LABEL: test_vusdotq_lane_s32
				// CHECK: [[TMP0:%.*]] = bitcast <8 x i8> %b to <2 x i32>
				// CHECK: [[TMP1:%.*]] = bitcast <2 x i32> [[TMP0]] to <8 x i8>
				// CHECK: [[TMP2:%.*]] = bitcast <8 x i8> [[TMP1]] to <2 x i32>
				// CHECK: [[LANE:%.*]] = shufflevector <2 x i32> [[TMP2]], <2 x i32> [[TMP2]], <4 x i32> zeroinitializer
				// CHECK: [[TMP4:%.*]] = bitcast <4 x i32> [[LANE]] to <16 x i8>
				// CHECK: [[TMP5:%.*]] = bitcast <4 x i32> %r to <16 x i8>
				// CHECK: [[OP:%.*]] = call <4 x i32> @llvm.aarch64.neon.usdot.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> [[TMP4]])
				// CHECK: ret <4 x i32> [[OP]]
				int32x4_t test_vusdotq_lane_s32(int32x4_t r, uint8x16_t a, int8x8_t b) {
				return vusdotq_lane_s32(r, a, b, 0);
				}

				// CHECK-LABEL: test_vsudotq_lane_s32
				// CHECK: [[TMP0:%.*]] = bitcast <8 x i8> %b to <2 x i32>
				// CHECK: [[TMP1:%.*]] = bitcast <2 x i32> [[TMP0]] to <8 x i8>
				// CHECK: [[TMP2:%.*]] = bitcast <8 x i8> [[TMP1]] to <2 x i32>
				// CHECK: [[LANE:%.*]] = shufflevector <2 x i32> [[TMP2]], <2 x i32> [[TMP2]], <4 x i32> zeroinitializer
				// CHECK: [[TMP4:%.*]] = bitcast <4 x i32> [[LANE]] to <16 x i8>
				// CHECK: [[TMP5:%.*]] = bitcast <4 x i32> %r to <16 x i8>
				// CHECK: [[OP:%.*]] = call <4 x i32> @llvm.aarch64.neon.usdot.v4i32.v16i8(<4 x i32> %r, <16 x i8> [[TMP4]], <16 x i8> %a)
				// CHECK: ret <4 x i32> [[OP]]
				int32x4_t test_vsudotq_lane_s32(int32x4_t r, int8x16_t a, uint8x8_t b) {
				return vsudotq_lane_s32(r, a, b, 0);
				}

				// CHECK-LABEL: test_vusdotq_laneq_s32
				// CHECK: [[TMP0:%.*]] = bitcast <16 x i8> %b to <4 x i32>
				// CHECK: [[TMP1:%.*]] = bitcast <4 x i32> [[TMP0]] to <16 x i8>
				// CHECK: [[TMP2:%.*]] = bitcast <16 x i8> [[TMP1]] to <4 x i32>
				// CHECK: [[LANE:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> [[TMP2]], <4 x i32> zeroinitializer
				// CHECK: [[TMP4:%.*]] = bitcast <4 x i32> [[LANE]] to <16 x i8>
				// CHECK: [[TMP5:%.*]] = bitcast <4 x i32> %r to <16 x i8>
				// CHECK: [[OP:%.*]] = call <4 x i32> @llvm.aarch64.neon.usdot.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> [[TMP4]])
				// CHECK: ret <4 x i32> [[OP]]
				int32x4_t test_vusdotq_laneq_s32(int32x4_t r, uint8x16_t a, int8x16_t b) {
				return vusdotq_laneq_s32(r, a, b, 0);
				}

				// CHECK-LABEL: test_vsudotq_laneq_s32
				// CHECK: [[TMP0:%.*]] = bitcast <16 x i8> %b to <4 x i32>
				// CHECK: [[TMP1:%.*]] = bitcast <4 x i32> [[TMP0]] to <16 x i8>
				// CHECK: [[TMP2:%.*]] = bitcast <16 x i8> [[TMP1]] to <4 x i32>
				// CHECK: [[LANE:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> [[TMP2]], <4 x i32> zeroinitializer
				// CHECK: [[TMP4:%.*]] = bitcast <4 x i32> [[LANE]] to <16 x i8>
				// CHECK: [[TMP5:%.*]] = bitcast <4 x i32> %r to <16 x i8>
				// CHECK: [[OP:%.*]] = call <4 x i32> @llvm.aarch64.neon.usdot.v4i32.v16i8(<4 x i32> %r, <16 x i8> [[TMP4]], <16 x i8> %a)
				// CHECK: ret <4 x i32> [[OP]]
				int32x4_t test_vsudotq_laneq_s32(int32x4_t r, int8x16_t a, uint8x16_t b) {
				return vsudotq_laneq_s32(r, a, b, 0);
				}

llvm/include/llvm/IR/IntrinsicsAArch64.td

@@ 167 lines hidden @@ class AdvSIMD_Dot_Intrinsic
      : Intrinsic<[llvm_anyvector_ty],
                  [LLVMMatchType<0>, llvm_anyvector_ty, LLVMMatchType<1>],
                  [IntrNoMem]>;

    class AdvSIMD_FP16FML_Intrinsic
      : Intrinsic<[llvm_anyvector_ty],
                  [LLVMMatchType<0>, llvm_anyvector_ty, LLVMMatchType<1>],
                  [IntrNoMem]>;
+
+   class AdvSIMD_MatMul_Intrinsic
+     : Intrinsic<[llvm_anyvector_ty],
+                 [LLVMMatchType<0>, llvm_anyvector_ty, LLVMMatchType<1>],
+                 [IntrNoMem]>;
  }

  // Arithmetic ops

  let TargetPrefix = "aarch64", IntrProperties = [IntrNoMem] in {
    // Vector Add Across Lanes
    def int_aarch64_neon_saddv : AdvSIMD_1VectorArg_Int_Across_Intrinsic;
    def int_aarch64_neon_uaddv : AdvSIMD_1VectorArg_Int_Across_Intrinsic;
@@ 260 lines hidden @@ let TargetPrefix = "aarch64", IntrProperties = [IntrNoMem] in {
    // Scalar FP Inexact Narrowing
    def int_aarch64_sisd_fcvtxn : Intrinsic<[llvm_float_ty], [llvm_double_ty],
                                            [IntrNoMem]>;

    // v8.2-A Dot Product
    def int_aarch64_neon_udot : AdvSIMD_Dot_Intrinsic;
    def int_aarch64_neon_sdot : AdvSIMD_Dot_Intrinsic;

+   // v8.6-A Matrix Multiply Intrinsics
+   def int_aarch64_neon_ummla : AdvSIMD_MatMul_Intrinsic;
+   def int_aarch64_neon_smmla : AdvSIMD_MatMul_Intrinsic;
+   def int_aarch64_neon_usmmla : AdvSIMD_MatMul_Intrinsic;
+   def int_aarch64_neon_usdot : AdvSIMD_Dot_Intrinsic;
+
    // v8.2-A FP16 Fused Multiply-Add Long
    def int_aarch64_neon_fmlal : AdvSIMD_FP16FML_Intrinsic;
    def int_aarch64_neon_fmlsl : AdvSIMD_FP16FML_Intrinsic;
    def int_aarch64_neon_fmlal2 : AdvSIMD_FP16FML_Intrinsic;
    def int_aarch64_neon_fmlsl2 : AdvSIMD_FP16FML_Intrinsic;

    // v8.3-A Floating-point complex add
    def int_aarch64_neon_vcadd_rot90 : AdvSIMD_2VectorArg_Intrinsic;
@@ 1,813 lines hidden @@

llvm/lib/Target/AArch64/AArch64.td

Show First 20 Lines • Show All 367 Lines • ▼ Show 20 Lines
def FeatureTaggedGlobals : SubtargetFeature<"tagged-globals",		def FeatureTaggedGlobals : SubtargetFeature<"tagged-globals",
"AllowTaggedGlobals",		"AllowTaggedGlobals",
"true", "Use an instruction sequence for taking the address of a global "		"true", "Use an instruction sequence for taking the address of a global "
"that allows a memory tag in the upper address bits">;		"that allows a memory tag in the upper address bits">;

def FeatureBF16 : SubtargetFeature<"bf16", "HasBF16",		def FeatureBF16 : SubtargetFeature<"bf16", "HasBF16",
"true", "Enable BFloat16 Extension" >;		"true", "Enable BFloat16 Extension" >;

		def FeatureMatMulInt8 : SubtargetFeature<"i8mm", "HasMatMulInt8",
		"true", "Enable Matrix Multiply Int8 Extension">;

		def FeatureMatMulFP32 : SubtargetFeature<"f32mm", "HasMatMulFP32",
		"true", "Enable Matrix Multiply FP32 Extension", [FeatureSVE]>;

		def FeatureMatMulFP64 : SubtargetFeature<"f64mm", "HasMatMulFP64",
		"true", "Enable Matrix Multiply FP64 Extension", [FeatureSVE]>;

def FeatureFineGrainedTraps : SubtargetFeature<"fgt", "HasFineGrainedTraps",		def FeatureFineGrainedTraps : SubtargetFeature<"fgt", "HasFineGrainedTraps",
"true", "Enable fine grained virtualization traps extension">;		"true", "Enable fine grained virtualization traps extension">;

def FeatureEnhancedCounterVirtualization :		def FeatureEnhancedCounterVirtualization :
SubtargetFeature<"ecv", "HasEnhancedCounterVirtualization",		SubtargetFeature<"ecv", "HasEnhancedCounterVirtualization",
"true", "Enable enhanced counter virtualization extension">;		"true", "Enable enhanced counter virtualization extension">;


//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Architectures.		// Architectures.
//		//

def HasV8_1aOps : SubtargetFeature<"v8.1a", "HasV8_1aOps", "true",		def HasV8_1aOps : SubtargetFeature<"v8.1a", "HasV8_1aOps", "true",
"Support ARM v8.1a instructions", [FeatureCRC, FeatureLSE, FeatureRDM,		"Support ARM v8.1a instructions", [FeatureCRC, FeatureLSE, FeatureRDM,
FeaturePAN, FeatureLOR, FeatureVH]>;		FeaturePAN, FeatureLOR, FeatureVH]>;

Show All 16 Lines	def HasV8_5aOps : SubtargetFeature<
[HasV8_4aOps, FeatureAltFPCmp, FeatureFRInt3264, FeatureSpecRestrict,		[HasV8_4aOps, FeatureAltFPCmp, FeatureFRInt3264, FeatureSpecRestrict,
FeatureSSBS, FeatureSB, FeaturePredRes, FeatureCacheDeepPersist,		FeatureSSBS, FeatureSB, FeaturePredRes, FeatureCacheDeepPersist,
FeatureBranchTargetId]>;		FeatureBranchTargetId]>;

def HasV8_6aOps : SubtargetFeature<		def HasV8_6aOps : SubtargetFeature<
"v8.6a", "HasV8_6aOps", "true", "Support ARM v8.6a instructions",		"v8.6a", "HasV8_6aOps", "true", "Support ARM v8.6a instructions",

[HasV8_5aOps, FeatureAMVS, FeatureBF16, FeatureFineGrainedTraps,		[HasV8_5aOps, FeatureAMVS, FeatureBF16, FeatureFineGrainedTraps,
FeatureEnhancedCounterVirtualization]>;		FeatureEnhancedCounterVirtualization, FeatureMatMulInt8]>;

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Register File Description		// Register File Description
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

include "AArch64RegisterInfo.td"		include "AArch64RegisterInfo.td"
include "AArch64RegisterBanks.td"		include "AArch64RegisterBanks.td"
include "AArch64CallingConvention.td"		include "AArch64CallingConvention.td"
▲ Show 20 Lines • Show All 570 Lines • Show Last 20 Lines
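
The feature definitions above attach implication lists to `f32mm`, `f64mm` and `v8.6a`. The following is a minimal sketch of how such implications compose, not LLVM's actual resolver. The table `IMPLIES` and the function `expand` are hypothetical names, only feature strings visible in this diff are listed (the `v8.6a` entry is therefore a subset of the real list), and `"sve"` is assumed to be the feature string behind `FeatureSVE`.

```python
# Hypothetical, simplified model of SubtargetFeature implication lists.
# Only feature strings that appear in this diff are included.
IMPLIES = {
    "v8.6a": ["bf16", "fgt", "ecv", "i8mm"],  # subset of the HasV8_6aOps list
    "f32mm": ["sve"],                         # [FeatureSVE]; "sve" string assumed
    "f64mm": ["sve"],
}


def expand(features):
    """Return every feature enabled by `features`, following implications."""
    enabled = set()
    stack = list(features)
    while stack:
        feature = stack.pop()
        if feature not in enabled:
            enabled.add(feature)
            stack.extend(IMPLIES.get(feature, []))
    return enabled
```

Under this model, `expand(["v8.6a"])` contains `i8mm` but neither `f32mm` nor `f64mm`, which matches the diff: only `FeatureMatMulInt8` is added to `HasV8_6aOps`.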

llvm/lib/Target/AArch64/AArch64InstrFormats.td


Show First 20 Lines • Show All 5,544 Lines • ▼ Show 20 Lines	multiclass SIMDLogicalThreeVectorTied<bit U, bits<2> size,
def : Pat<(v2i64 (OpNode (v2i64 V128:$LHS), (v2i64 V128:$MHS),		def : Pat<(v2i64 (OpNode (v2i64 V128:$LHS), (v2i64 V128:$MHS),
(v2i64 V128:$RHS))),		(v2i64 V128:$RHS))),
(!cast<Instruction>(NAME#"v16i8")		(!cast<Instruction>(NAME#"v16i8")
V128:$LHS, V128:$MHS, V128:$RHS)>;		V128:$LHS, V128:$MHS, V128:$RHS)>;
}		}

// ARMv8.2-A Dot Product Instructions (Vector): These instructions extract		// ARMv8.2-A Dot Product Instructions (Vector): These instructions extract
// bytes from S-sized elements.		// bytes from S-sized elements.
class BaseSIMDThreeSameVectorDot<bit Q, bit U, string asm, string kind1,		class BaseSIMDThreeSameVectorDot<bit Q, bit U, bit Mixed, string asm, string kind1,
string kind2, RegisterOperand RegType,		string kind2, RegisterOperand RegType,
ValueType AccumType, ValueType InputType,		ValueType AccumType, ValueType InputType,
SDPatternOperator OpNode> :		SDPatternOperator OpNode> :
BaseSIMDThreeSameVectorTied<Q, U, 0b100, 0b10010, RegType, asm, kind1,		BaseSIMDThreeSameVectorTied<Q, U, 0b100, {0b1001, Mixed}, RegType, asm, kind1,
[(set (AccumType RegType:$dst),		[(set (AccumType RegType:$dst),
(OpNode (AccumType RegType:$Rd),		(OpNode (AccumType RegType:$Rd),
(InputType RegType:$Rn),		(InputType RegType:$Rn),
(InputType RegType:$Rm)))]> {		(InputType RegType:$Rm)))]> {
let AsmString = !strconcat(asm, "{\t$Rd" # kind1 # ", $Rn" # kind2 # ", $Rm" # kind2 # "}");		let AsmString = !strconcat(asm, "{\t$Rd" # kind1 # ", $Rn" # kind2 # ", $Rm" # kind2 # "}");
}		}

multiclass SIMDThreeSameVectorDot<bit U, string asm, SDPatternOperator OpNode> {		multiclass SIMDThreeSameVectorDot<bit U, bit Mixed, string asm, SDPatternOperator OpNode> {
def v8i8 : BaseSIMDThreeSameVectorDot<0, U, asm, ".2s", ".8b", V64,		def v8i8 : BaseSIMDThreeSameVectorDot<0, U, Mixed, asm, ".2s", ".8b", V64,
v2i32, v8i8, OpNode>;		v2i32, v8i8, OpNode>;
def v16i8 : BaseSIMDThreeSameVectorDot<1, U, asm, ".4s", ".16b", V128,		def v16i8 : BaseSIMDThreeSameVectorDot<1, U, Mixed, asm, ".4s", ".16b", V128,
v4i32, v16i8, OpNode>;		v4i32, v16i8, OpNode>;
}		}

// ARMv8.2-A Fused Multiply Add-Long Instructions (Vector): These instructions		// ARMv8.2-A Fused Multiply Add-Long Instructions (Vector): These instructions
// select inputs from 4H vectors and accumulate outputs to a 2S vector (or from		// select inputs from 4H vectors and accumulate outputs to a 2S vector (or from
// 8H to 4S, when Q=1).		// 8H to 4S, when Q=1).
class BaseSIMDThreeSameVectorFML<bit Q, bit U, bit b13, bits<3> size, string asm, string kind1,		class BaseSIMDThreeSameVectorFML<bit Q, bit U, bit b13, bits<3> size, string asm, string kind1,
string kind2, RegisterOperand RegType,		string kind2, RegisterOperand RegType,
▲ Show 20 Lines • Show All 2,321 Lines • ▼ Show 20 Lines	class BF16ToSinglePrecision<string asm>
bits<5> Rd;		bits<5> Rd;
bits<5> Rn;		bits<5> Rn;
let Inst{31-10} = 0b0001111001100011010000;		let Inst{31-10} = 0b0001111001100011010000;
let Inst{9-5} = Rn;		let Inst{9-5} = Rn;
let Inst{4-0} = Rd;		let Inst{4-0} = Rd;
}		}
} // End of let mayStore = 0, mayLoad = 0, hasSideEffects = 0		} // End of let mayStore = 0, mayLoad = 0, hasSideEffects = 0

		//----------------------------------------------------------------------------
		// Armv8.6 Matrix Multiply Extension
		//----------------------------------------------------------------------------

		class SIMDThreeSameVectorMatMul<bit B, bit U, string asm, SDPatternOperator OpNode>
		: BaseSIMDThreeSameVectorTied<1, U, 0b100, {0b1010, B}, V128, asm, ".4s",
		[(set (v4i32 V128:$dst), (OpNode (v4i32 V128:$Rd),
		(v16i8 V128:$Rn),
		(v16i8 V128:$Rm)))]> {
		let AsmString = asm # "{\t$Rd.4s, $Rn.16b, $Rm.16b}";
		}

		//----------------------------------------------------------------------------
// ARMv8.2-A Dot Product Instructions (Indexed)		// ARMv8.2-A Dot Product Instructions (Indexed)
class BaseSIMDThreeSameVectorDotIndex<bit Q, bit U, string asm, string dst_kind,		class BaseSIMDThreeSameVectorDotIndex<bit Q, bit U, bit Mixed, bits<2> size, string asm,
string lhs_kind, string rhs_kind,		string dst_kind, string lhs_kind, string rhs_kind,
RegisterOperand RegType,		RegisterOperand RegType,
ValueType AccumType, ValueType InputType,		ValueType AccumType, ValueType InputType,
SDPatternOperator OpNode> :		SDPatternOperator OpNode> :
BaseSIMDIndexedTied<Q, U, 0b0, 0b10, 0b1110, RegType, RegType, V128,		BaseSIMDIndexedTied<Q, U, 0b0, size, {0b111, Mixed}, RegType, RegType, V128,
VectorIndexS, asm, "", dst_kind, lhs_kind, rhs_kind,		VectorIndexS, asm, "", dst_kind, lhs_kind, rhs_kind,
[(set (AccumType RegType:$dst),		[(set (AccumType RegType:$dst),
(AccumType (OpNode (AccumType RegType:$Rd),		(AccumType (OpNode (AccumType RegType:$Rd),
(InputType RegType:$Rn),		(InputType RegType:$Rn),
(InputType (bitconvert (AccumType		(InputType (bitconvert (AccumType
(AArch64duplane32 (v4i32 V128:$Rm),		(AArch64duplane32 (v4i32 V128:$Rm),
VectorIndexS:$idx)))))))]> {		VectorIndexS:$idx)))))))]> {
bits<2> idx;		bits<2> idx;
let Inst{21} = idx{0}; // L		let Inst{21} = idx{0}; // L
let Inst{11} = idx{1}; // H		let Inst{11} = idx{1}; // H
}		}

multiclass SIMDThreeSameVectorDotIndex<bit U, string asm,		multiclass SIMDThreeSameVectorDotIndex<bit U, bit Mixed, bits<2> size, string asm,
SDPatternOperator OpNode> {		SDPatternOperator OpNode> {
def v8i8 : BaseSIMDThreeSameVectorDotIndex<0, U, asm, ".2s", ".8b", ".4b",		def v8i8 : BaseSIMDThreeSameVectorDotIndex<0, U, Mixed, size, asm, ".2s", ".8b", ".4b",
V64, v2i32, v8i8, OpNode>;		V64, v2i32, v8i8, OpNode>;
def v16i8 : BaseSIMDThreeSameVectorDotIndex<1, U, asm, ".4s", ".16b", ".4b",		def v16i8 : BaseSIMDThreeSameVectorDotIndex<1, U, Mixed, size, asm, ".4s", ".16b", ".4b",
V128, v4i32, v16i8, OpNode>;		V128, v4i32, v16i8, OpNode>;
}		}

// ARMv8.2-A Fused Multiply Add-Long Instructions (Indexed)		// ARMv8.2-A Fused Multiply Add-Long Instructions (Indexed)
class BaseSIMDThreeSameVectorFMLIndex<bit Q, bit U, bits<4> opc, string asm,		class BaseSIMDThreeSameVectorFMLIndex<bit Q, bit U, bits<4> opc, string asm,
string dst_kind, string lhs_kind,		string dst_kind, string lhs_kind,
string rhs_kind, RegisterOperand RegType,		string rhs_kind, RegisterOperand RegType,
ValueType AccumType, ValueType InputType,		ValueType AccumType, ValueType InputType,
▲ Show 20 Lines • Show All 3,252 Lines • Show Last 20 Lines
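
The `Mixed` and `B` bits introduced above become the low bit of the 5-bit opcode field. As a rough illustration, the sketch below packs the vector (non-indexed) forms into a 32-bit word. The bit positions inside `encode_three_same_tied` are an assumption about `BaseSIMDThreeSameVectorTied`, whose body is not shown in this diff; the helper names are hypothetical. The assumed layout does reproduce the encodings listed in `armv8.6a-simd-matmul.s` from this revision.

```python
def encode_three_same_tied(q, u, size, opcode, rm, rn, rd):
    """Pack fields into a 32-bit word and return little-endian bytes.

    Assumed layout: Q=bit 30, U=bit 29, 0b01110 in bits 28-24,
    size in bits 23-21, Rm in 20-16, opcode in 15-11, bit 10 set,
    Rn in 9-5, Rd in 4-0.
    """
    word = (q << 30) | (u << 29) | (0b01110 << 24) | (size << 21)
    word |= (rm << 16) | (opcode << 11) | (1 << 10) | (rn << 5) | rd
    return list(word.to_bytes(4, "little"))


def mmla(b, u, rd, rn, rm):
    # SIMDThreeSameVectorMatMul: Q=1, size=0b100, opcode={0b1010, B}
    return encode_three_same_tied(1, u, 0b100, (0b1010 << 1) | b, rm, rn, rd)


def dot(q, u, mixed, rd, rn, rm):
    # BaseSIMDThreeSameVectorDot: size=0b100, opcode={0b1001, Mixed}
    return encode_three_same_tied(q, u, 0b100, (0b1001 << 1) | mixed, rm, rn, rd)
```

For example, `mmla(0, 0, 1, 16, 31)` corresponds to `smmla v1.4s, v16.16b, v31.16b`, and `dot(0, 0, 1, 3, 15, 30)` to `usdot v3.2s, v15.8b, v30.8b`.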

llvm/lib/Target/AArch64/AArch64InstrInfo.td


Show First 20 Lines • Show All 140 Lines • ▼ Show 20 Lines
def HasTME : Predicate<"Subtarget->hasTME()">,		def HasTME : Predicate<"Subtarget->hasTME()">,
AssemblerPredicate<(all_of FeatureTME), "tme">;		AssemblerPredicate<(all_of FeatureTME), "tme">;
def HasETE : Predicate<"Subtarget->hasETE()">,		def HasETE : Predicate<"Subtarget->hasETE()">,
AssemblerPredicate<(all_of FeatureETE), "ete">;		AssemblerPredicate<(all_of FeatureETE), "ete">;
def HasTRBE : Predicate<"Subtarget->hasTRBE()">,		def HasTRBE : Predicate<"Subtarget->hasTRBE()">,
AssemblerPredicate<(all_of FeatureTRBE), "trbe">;		AssemblerPredicate<(all_of FeatureTRBE), "trbe">;
def HasBF16 : Predicate<"Subtarget->hasBF16()">,		def HasBF16 : Predicate<"Subtarget->hasBF16()">,
AssemblerPredicate<(all_of FeatureBF16), "bf16">;		AssemblerPredicate<(all_of FeatureBF16), "bf16">;
		def HasMatMulInt8 : Predicate<"Subtarget->hasMatMulInt8()">,
		AssemblerPredicate<(all_of FeatureMatMulInt8), "i8mm">;
		def HasMatMulFP32 : Predicate<"Subtarget->hasMatMulFP32()">,
		AssemblerPredicate<(all_of FeatureMatMulFP32), "f32mm">;
		def HasMatMulFP64 : Predicate<"Subtarget->hasMatMulFP64()">,
		AssemblerPredicate<(all_of FeatureMatMulFP64), "f64mm">;
def IsLE : Predicate<"Subtarget->isLittleEndian()">;		def IsLE : Predicate<"Subtarget->isLittleEndian()">;
def IsBE : Predicate<"!Subtarget->isLittleEndian()">;		def IsBE : Predicate<"!Subtarget->isLittleEndian()">;
def IsWindows : Predicate<"Subtarget->isTargetWindows()">;		def IsWindows : Predicate<"Subtarget->isTargetWindows()">;
def UseAlternateSExtLoadCVTF32		def UseAlternateSExtLoadCVTF32
: Predicate<"Subtarget->useAlternateSExtLoadCVTF32Pattern()">;		: Predicate<"Subtarget->useAlternateSExtLoadCVTF32Pattern()">;

def UseNegativeImmediates		def UseNegativeImmediates
: Predicate<"false">, AssemblerPredicate<(all_of (not FeatureNoNegativeImmediates)),		: Predicate<"false">, AssemblerPredicate<(all_of (not FeatureNoNegativeImmediates)),
▲ Show 20 Lines • Show All 583 Lines • ▼ Show 20 Lines	def TSB : CRmSystemI<barrier_op, 0b010, "tsb", []> {
let CRm = 0b0010;		let CRm = 0b0010;
let Inst{12} = 0;		let Inst{12} = 0;
let Predicates = [HasTRACEV8_4];		let Predicates = [HasTRACEV8_4];
}		}
}		}

// ARMv8.2-A Dot Product		// ARMv8.2-A Dot Product
let Predicates = [HasDotProd] in {		let Predicates = [HasDotProd] in {
defm SDOT : SIMDThreeSameVectorDot<0, "sdot", int_aarch64_neon_sdot>;		defm SDOT : SIMDThreeSameVectorDot<0, 0, "sdot", int_aarch64_neon_sdot>;
defm UDOT : SIMDThreeSameVectorDot<1, "udot", int_aarch64_neon_udot>;		defm UDOT : SIMDThreeSameVectorDot<1, 0, "udot", int_aarch64_neon_udot>;
defm SDOTlane : SIMDThreeSameVectorDotIndex<0, "sdot", int_aarch64_neon_sdot>;		defm SDOTlane : SIMDThreeSameVectorDotIndex<0, 0, 0b10, "sdot", int_aarch64_neon_sdot>;
defm UDOTlane : SIMDThreeSameVectorDotIndex<1, "udot", int_aarch64_neon_udot>;		defm UDOTlane : SIMDThreeSameVectorDotIndex<1, 0, 0b10, "udot", int_aarch64_neon_udot>;
}		}

// ARMv8.6-A BFloat		// ARMv8.6-A BFloat
let Predicates = [HasBF16] in {		let Predicates = [HasBF16] in {
defm BFDOT : SIMDThreeSameVectorBFDot<1, "bfdot">;		defm BFDOT : SIMDThreeSameVectorBFDot<1, "bfdot">;
defm BF16DOTlane : SIMDThreeSameVectorBF16DotI<0, "bfdot">;		defm BF16DOTlane : SIMDThreeSameVectorBF16DotI<0, "bfdot">;
def BFMMLA : SIMDThreeSameVectorBF16MatrixMul<"bfmmla">;		def BFMMLA : SIMDThreeSameVectorBF16MatrixMul<"bfmmla">;
def BFMLALB : SIMDBF16MLAL<0, "bfmlalb">;		def BFMLALB : SIMDBF16MLAL<0, "bfmlalb">;
def BFMLALT : SIMDBF16MLAL<1, "bfmlalt">;		def BFMLALT : SIMDBF16MLAL<1, "bfmlalt">;
def BFMLALBIdx : SIMDBF16MLALIndex<0, "bfmlalb">;		def BFMLALBIdx : SIMDBF16MLALIndex<0, "bfmlalb">;
def BFMLALTIdx : SIMDBF16MLALIndex<1, "bfmlalt">;		def BFMLALTIdx : SIMDBF16MLALIndex<1, "bfmlalt">;
def BFCVTN : SIMD_BFCVTN;		def BFCVTN : SIMD_BFCVTN;
def BFCVTN2 : SIMD_BFCVTN2;		def BFCVTN2 : SIMD_BFCVTN2;
def BFCVT : BF16ToSinglePrecision<"bfcvt">;		def BFCVT : BF16ToSinglePrecision<"bfcvt">;
}		}

		// ARMv8.6A AArch64 matrix multiplication
		let Predicates = [HasMatMulInt8] in {
		def SMMLA : SIMDThreeSameVectorMatMul<0, 0, "smmla", int_aarch64_neon_smmla>;
		def UMMLA : SIMDThreeSameVectorMatMul<0, 1, "ummla", int_aarch64_neon_ummla>;
		def USMMLA : SIMDThreeSameVectorMatMul<1, 0, "usmmla", int_aarch64_neon_usmmla>;
		defm USDOT : SIMDThreeSameVectorDot<0, 1, "usdot", int_aarch64_neon_usdot>;
		defm USDOTlane : SIMDThreeSameVectorDotIndex<0, 1, 0b10, "usdot", int_aarch64_neon_usdot>;

		// sudot lane has a pattern where usdot is expected (there is no sudot).
		// The second operand is used in the dup operation to repeat the indexed
		// element.
		class BaseSIMDSUDOTIndex<bit Q, string dst_kind, string lhs_kind,
		string rhs_kind, RegisterOperand RegType,
		ValueType AccumType, ValueType InputType>
		: BaseSIMDThreeSameVectorDotIndex<Q, 0, 1, 0b00, "sudot", dst_kind,
		lhs_kind, rhs_kind, RegType, AccumType,
		InputType, null_frag> {
		let Pattern = [(set (AccumType RegType:$dst),
		(AccumType (int_aarch64_neon_usdot (AccumType RegType:$Rd),
		(InputType (bitconvert (AccumType
		(AArch64duplane32 (v4i32 V128:$Rm),
		VectorIndexS:$idx)))),
		(InputType RegType:$Rn))))];
		}

		multiclass SIMDSUDOTIndex {
		def v8i8 : BaseSIMDSUDOTIndex<0, ".2s", ".8b", ".4b", V64, v2i32, v8i8>;
		def v16i8 : BaseSIMDSUDOTIndex<1, ".4s", ".16b", ".4b", V128, v4i32, v16i8>;
		}

		defm SUDOTlane : SIMDSUDOTIndex;

		}

// ARMv8.2-A FP16 Fused Multiply-Add Long		// ARMv8.2-A FP16 Fused Multiply-Add Long
let Predicates = [HasNEON, HasFP16FML] in {		let Predicates = [HasNEON, HasFP16FML] in {
defm FMLAL : SIMDThreeSameVectorFML<0, 1, 0b001, "fmlal", int_aarch64_neon_fmlal>;		defm FMLAL : SIMDThreeSameVectorFML<0, 1, 0b001, "fmlal", int_aarch64_neon_fmlal>;
defm FMLSL : SIMDThreeSameVectorFML<0, 1, 0b101, "fmlsl", int_aarch64_neon_fmlsl>;		defm FMLSL : SIMDThreeSameVectorFML<0, 1, 0b101, "fmlsl", int_aarch64_neon_fmlsl>;
defm FMLAL2 : SIMDThreeSameVectorFML<1, 0, 0b001, "fmlal2", int_aarch64_neon_fmlal2>;		defm FMLAL2 : SIMDThreeSameVectorFML<1, 0, 0b001, "fmlal2", int_aarch64_neon_fmlal2>;
defm FMLSL2 : SIMDThreeSameVectorFML<1, 0, 0b101, "fmlsl2", int_aarch64_neon_fmlsl2>;		defm FMLSL2 : SIMDThreeSameVectorFML<1, 0, 0b101, "fmlsl2", int_aarch64_neon_fmlsl2>;
defm FMLALlane : SIMDThreeSameVectorFMLIndex<0, 0b0000, "fmlal", int_aarch64_neon_fmlal>;		defm FMLALlane : SIMDThreeSameVectorFMLIndex<0, 0b0000, "fmlal", int_aarch64_neon_fmlal>;
defm FMLSLlane : SIMDThreeSameVectorFMLIndex<0, 0b0100, "fmlsl", int_aarch64_neon_fmlsl>;		defm FMLSLlane : SIMDThreeSameVectorFMLIndex<0, 0b0100, "fmlsl", int_aarch64_neon_fmlsl>;
▲ Show 20 Lines • Show All 6,592 Lines • Show Last 20 Lines
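
The `BaseSIMDSUDOTIndex` pattern above selects `sudot` from a call to the `usdot` intrinsic whose second and third operands are swapped. The sketch below models why that is sound: each product is commutative, so only which operand is treated as unsigned matters. It is an illustration under stated assumptions, not the backend's code. The function names are hypothetical, vectors are plain lists of byte values, and 32-bit accumulator wrap-around is ignored.

```python
def to_signed8(x):
    """Reinterpret a byte value (0..255) as a signed 8-bit integer."""
    return x - 256 if x >= 128 else x


def usdot(acc, unsigned_bytes, signed_bytes):
    """Each 32-bit lane accumulates four unsigned-by-signed byte products."""
    out = list(acc)
    for i in range(len(acc)):
        for k in range(4):
            u = unsigned_bytes[4 * i + k] & 0xFF
            s = to_signed8(signed_bytes[4 * i + k] & 0xFF)
            out[i] += u * s
    return out


def dup_lane32(vec_bytes, lane, n_lanes):
    """Repeat the 4-byte group at `lane` across n_lanes 32-bit lanes."""
    return vec_bytes[4 * lane:4 * lane + 4] * n_lanes


def sudot_lane(acc, signed_bytes, indexed_unsigned_bytes, lane):
    # Same operand order as the Pattern: usdot(Rd, dup(Rm, idx), Rn).
    dup = dup_lane32(indexed_unsigned_bytes, lane, len(acc))
    return usdot(acc, dup, signed_bytes)
```

With `signed_bytes = [0xFF] * 8` (eight values of -1) and an indexed operand whose lane 1 holds `[200, 0, 0, 0]`, `sudot_lane` yields -200 in every lane, because 200 is read as unsigned and 0xFF as -1.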

llvm/lib/Target/AArch64/AArch64Subtarget.h

Show First 20 Lines • Show All 142 Lines • ▼ Show 20 Lines	protected:
bool HasCCDP = false;		bool HasCCDP = false;
bool HasBTI = false;		bool HasBTI = false;
bool HasRandGen = false;		bool HasRandGen = false;
bool HasMTE = false;		bool HasMTE = false;
bool HasTME = false;		bool HasTME = false;

// Armv8.6-A Extensions		// Armv8.6-A Extensions
bool HasBF16 = false;		bool HasBF16 = false;
		bool HasMatMulInt8 = false;
		bool HasMatMulFP32 = false;
		bool HasMatMulFP64 = false;
bool HasAMVS = false;		bool HasAMVS = false;
bool HasFineGrainedTraps = false;		bool HasFineGrainedTraps = false;
bool HasEnhancedCounterVirtualization = false;		bool HasEnhancedCounterVirtualization = false;

// Arm SVE2 extensions		// Arm SVE2 extensions
bool HasSVE2AES = false;		bool HasSVE2AES = false;
bool HasSVE2SM4 = false;		bool HasSVE2SM4 = false;
bool HasSVE2SHA3 = false;		bool HasSVE2SHA3 = false;
▲ Show 20 Lines • Show All 253 Lines • ▼ Show 20 Lines	public:
bool hasRandGen() const { return HasRandGen; }		bool hasRandGen() const { return HasRandGen; }
bool hasMTE() const { return HasMTE; }		bool hasMTE() const { return HasMTE; }
bool hasTME() const { return HasTME; }		bool hasTME() const { return HasTME; }
// Arm SVE2 extensions		// Arm SVE2 extensions
bool hasSVE2AES() const { return HasSVE2AES; }		bool hasSVE2AES() const { return HasSVE2AES; }
bool hasSVE2SM4() const { return HasSVE2SM4; }		bool hasSVE2SM4() const { return HasSVE2SM4; }
bool hasSVE2SHA3() const { return HasSVE2SHA3; }		bool hasSVE2SHA3() const { return HasSVE2SHA3; }
bool hasSVE2BitPerm() const { return HasSVE2BitPerm; }		bool hasSVE2BitPerm() const { return HasSVE2BitPerm; }
		bool hasMatMulInt8() const { return HasMatMulInt8; }
		bool hasMatMulFP32() const { return HasMatMulFP32; }
		bool hasMatMulFP64() const { return HasMatMulFP64; }

// Armv8.6-A Extensions		// Armv8.6-A Extensions
bool hasBF16() const { return HasBF16; }		bool hasBF16() const { return HasBF16; }
bool hasFineGrainedTraps() const { return HasFineGrainedTraps; }		bool hasFineGrainedTraps() const { return HasFineGrainedTraps; }
bool hasEnhancedCounterVirtualization() const {		bool hasEnhancedCounterVirtualization() const {
return HasEnhancedCounterVirtualization;		return HasEnhancedCounterVirtualization;
}		}

▲ Show 20 Lines • Show All 100 Lines • Show Last 20 Lines

llvm/test/CodeGen/AArch64/aarch64-matmul.ll

This file was added.

				; RUN: llc -mtriple=aarch64-none-linux-gnu -mattr=+neon,+i8mm < %s -o - | FileCheck %s

				define <4 x i32> @smmla.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %b) {
				entry:
				; CHECK-LABEL: smmla.v4i32.v16i8
				; CHECK: smmla v0.4s, v1.16b, v2.16b
				%vmmla1.i = tail call <4 x i32> @llvm.aarch64.neon.smmla.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %b)
				ret <4 x i32> %vmmla1.i
				}

				define <4 x i32> @ummla.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %b) {
				entry:
				; CHECK-LABEL: ummla.v4i32.v16i8
				; CHECK: ummla v0.4s, v1.16b, v2.16b
				%vmmla1.i = tail call <4 x i32> @llvm.aarch64.neon.ummla.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %b)
				ret <4 x i32> %vmmla1.i
				}

				define <4 x i32> @usmmla.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %b) {
				entry:
				; CHECK-LABEL: usmmla.v4i32.v16i8
				; CHECK: usmmla v0.4s, v1.16b, v2.16b
				%vusmmla1.i = tail call <4 x i32> @llvm.aarch64.neon.usmmla.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %b) #3
				ret <4 x i32> %vusmmla1.i
				}

				define <2 x i32> @usdot.v2i32.v8i8(<2 x i32> %r, <8 x i8> %a, <8 x i8> %b) {
				entry:
				; CHECK-LABEL: usdot.v2i32.v8i8
				; CHECK: usdot v0.2s, v1.8b, v2.8b
				%vusdot1.i = tail call <2 x i32> @llvm.aarch64.neon.usdot.v2i32.v8i8(<2 x i32> %r, <8 x i8> %a, <8 x i8> %b)
				ret <2 x i32> %vusdot1.i
				}

				define <2 x i32> @usdot_lane.v2i32.v8i8(<2 x i32> %r, <8 x i8> %a, <8 x i8> %b) {
				entry:
				; CHECK-LABEL: usdot_lane.v2i32.v8i8
				; CHECK: usdot v0.2s, v1.8b, v2.4b[0]
				%0 = bitcast <8 x i8> %b to <2 x i32>
				%shuffle = shufflevector <2 x i32> %0, <2 x i32> undef, <2 x i32> zeroinitializer
				%1 = bitcast <2 x i32> %shuffle to <8 x i8>
				%vusdot1.i = tail call <2 x i32> @llvm.aarch64.neon.usdot.v2i32.v8i8(<2 x i32> %r, <8 x i8> %a, <8 x i8> %1)
				ret <2 x i32> %vusdot1.i
				}

				define <2 x i32> @sudot_lane.v2i32.v8i8(<2 x i32> %r, <8 x i8> %a, <8 x i8> %b) {
				entry:
				; CHECK-LABEL: sudot_lane.v2i32.v8i8
				; CHECK: sudot v0.2s, v1.8b, v2.4b[0]
				%0 = bitcast <8 x i8> %b to <2 x i32>
				%shuffle = shufflevector <2 x i32> %0, <2 x i32> undef, <2 x i32> zeroinitializer
				%1 = bitcast <2 x i32> %shuffle to <8 x i8>
				%vusdot1.i = tail call <2 x i32> @llvm.aarch64.neon.usdot.v2i32.v8i8(<2 x i32> %r, <8 x i8> %1, <8 x i8> %a)
				ret <2 x i32> %vusdot1.i
				}

				define <2 x i32> @usdot_lane.v2i32.v16i8(<2 x i32> %r, <8 x i8> %a, <16 x i8> %b) {
				entry:
				; CHECK-LABEL: usdot_lane.v2i32.v16i8
				; CHECK: usdot v0.2s, v1.8b, v2.4b[0]
				%0 = bitcast <16 x i8> %b to <4 x i32>
				%shuffle = shufflevector <4 x i32> %0, <4 x i32> undef, <2 x i32> zeroinitializer
				%1 = bitcast <2 x i32> %shuffle to <8 x i8>
				%vusdot1.i = tail call <2 x i32> @llvm.aarch64.neon.usdot.v2i32.v8i8(<2 x i32> %r, <8 x i8> %a, <8 x i8> %1)
				ret <2 x i32> %vusdot1.i
				}

				define <2 x i32> @sudot_lane.v2i32.v16i8(<2 x i32> %r, <8 x i8> %a, <16 x i8> %b) {
				entry:
				; CHECK-LABEL: sudot_lane.v2i32.v16i8
				; CHECK: sudot v0.2s, v1.8b, v2.4b[0]
				%0 = bitcast <16 x i8> %b to <4 x i32>
				%shuffle = shufflevector <4 x i32> %0, <4 x i32> undef, <2 x i32> zeroinitializer
				%1 = bitcast <2 x i32> %shuffle to <8 x i8>
				%vusdot1.i = tail call <2 x i32> @llvm.aarch64.neon.usdot.v2i32.v8i8(<2 x i32> %r, <8 x i8> %1, <8 x i8> %a) #3
				ret <2 x i32> %vusdot1.i
				}

				define <4 x i32> @usdot.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %b) {
				entry:
				; CHECK-LABEL: usdot.v4i32.v16i8
				; CHECK: usdot v0.4s, v1.16b, v2.16b
				%vusdot1.i = tail call <4 x i32> @llvm.aarch64.neon.usdot.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %b) #3
				ret <4 x i32> %vusdot1.i
				}

				define <4 x i32> @usdot_lane.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <8 x i8> %b) {
				entry:
				; CHECK-LABEL: usdot_lane.v4i32.v16i8
				; CHECK: usdot v0.4s, v1.16b, v2.4b[0]
				%0 = bitcast <8 x i8> %b to <2 x i32>
				%shuffle = shufflevector <2 x i32> %0, <2 x i32> undef, <4 x i32> zeroinitializer
				%1 = bitcast <4 x i32> %shuffle to <16 x i8>
				%vusdot1.i = tail call <4 x i32> @llvm.aarch64.neon.usdot.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %1) #3
				ret <4 x i32> %vusdot1.i
				}

				define <4 x i32> @sudot_lane.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <8 x i8> %b) {
				entry:
				; CHECK-LABEL: sudot_lane.v4i32.v16i8
				; CHECK: sudot v0.4s, v1.16b, v2.4b[0]
				%0 = bitcast <8 x i8> %b to <2 x i32>
				%shuffle = shufflevector <2 x i32> %0, <2 x i32> undef, <4 x i32> zeroinitializer
				%1 = bitcast <4 x i32> %shuffle to <16 x i8>
				%vusdot1.i = tail call <4 x i32> @llvm.aarch64.neon.usdot.v4i32.v16i8(<4 x i32> %r, <16 x i8> %1, <16 x i8> %a) #3
				ret <4 x i32> %vusdot1.i
				}

				define <4 x i32> @usdot_laneq.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %b) {
				entry:
				; CHECK-LABEL: usdot_laneq.v4i32.v16i8
				; CHECK: usdot v0.4s, v1.16b, v2.4b[0]
				%0 = bitcast <16 x i8> %b to <4 x i32>
				%shuffle = shufflevector <4 x i32> %0, <4 x i32> undef, <4 x i32> zeroinitializer
				%1 = bitcast <4 x i32> %shuffle to <16 x i8>
				%vusdot1.i = tail call <4 x i32> @llvm.aarch64.neon.usdot.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %1) #3
				ret <4 x i32> %vusdot1.i
				}

				define <4 x i32> @sudot_laneq.v4i32.v16i8(<4 x i32> %r, <16 x i8> %a, <16 x i8> %b) {
				entry:
				; CHECK-LABEL: sudot_laneq.v4i32.v16i8
				; CHECK: sudot v0.4s, v1.16b, v2.4b[0]
				%0 = bitcast <16 x i8> %b to <4 x i32>
				%shuffle = shufflevector <4 x i32> %0, <4 x i32> undef, <4 x i32> zeroinitializer
				%1 = bitcast <4 x i32> %shuffle to <16 x i8>
				%vusdot1.i = tail call <4 x i32> @llvm.aarch64.neon.usdot.v4i32.v16i8(<4 x i32> %r, <16 x i8> %1, <16 x i8> %a) #3
				ret <4 x i32> %vusdot1.i
				}

				declare <4 x i32> @llvm.aarch64.neon.smmla.v4i32.v16i8(<4 x i32>, <16 x i8>, <16 x i8>) #2
				declare <4 x i32> @llvm.aarch64.neon.ummla.v4i32.v16i8(<4 x i32>, <16 x i8>, <16 x i8>) #2
				declare <4 x i32> @llvm.aarch64.neon.usmmla.v4i32.v16i8(<4 x i32>, <16 x i8>, <16 x i8>) #2
				declare <2 x i32> @llvm.aarch64.neon.usdot.v2i32.v8i8(<2 x i32>, <8 x i8>, <8 x i8>) #2
				declare <4 x i32> @llvm.aarch64.neon.usdot.v4i32.v16i8(<4 x i32>, <16 x i8>, <16 x i8>) #2
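
The tests above only check instruction selection, so a reference model of what the `[us]mmla` intrinsics compute may help when reading them. The sketch below follows the Arm architecture description of these instructions as I understand it (a 2x8 matrix times the transpose of a second 2x8 matrix, accumulated into a 2x2 matrix held row-major in four 32-bit lanes). It is not taken from this patch, the function names are hypothetical, and 32-bit wrap-around is ignored.

```python
def _elem(value, signed):
    """Read a byte value as signed or unsigned 8-bit."""
    value &= 0xFF
    return value - 256 if signed and value >= 128 else value


def mmla(acc, a, b, a_signed, b_signed):
    """acc[2*i + j] += dot(row i of a, row j of b), rows of 8 bytes each."""
    out = list(acc)
    for i in range(2):
        for j in range(2):
            out[2 * i + j] += sum(
                _elem(a[8 * i + k], a_signed) * _elem(b[8 * j + k], b_signed)
                for k in range(8)
            )
    return out


def smmla(acc, a, b):
    return mmla(acc, a, b, True, True)


def ummla(acc, a, b):
    return mmla(acc, a, b, False, False)


def usmmla(acc, a, b):
    # First source unsigned, second source signed.
    return mmla(acc, a, b, False, True)
```

The three variants differ only in how the source bytes are interpreted, which mirrors the single `SIMDThreeSameVectorMatMul` class parameterised by the `B` and `U` bits.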

llvm/test/MC/AArch64/armv8.6a-simd-matmul-error.s

This file was added.

				// RUN: not llvm-mc -triple aarch64 -show-encoding -mattr=+i8mm < %s 2>&1 | FileCheck %s

				// No interesting edge cases for [US]MMLA, except for the fact that the data
				// types are fixed (no 64-bit version), and USMMLA exists, but SUMMLA does not.
				smmla v1.2s, v16.8b, v31.8b
				// CHECK: [[@LINE-1]]:{{[0-9]+}}: error: invalid operand for instruction
				summla v1.4s, v16.16b, v31.16b
				// CHECK: [[@LINE-1]]:{{[0-9]+}}: error: unrecognized instruction mnemonic, did you mean: smmla, ummla, usmmla?

				// USDOT (vector) has two valid data type combinations, others are rejected.
				usdot v3.4s, v15.8b, v30.8b
				// CHECK: [[@LINE-1]]:{{[0-9]+}}: error: invalid operand for instruction
				usdot v3.2s, v15.16b, v30.16b
				// CHECK: [[@LINE-1]]:{{[0-9]+}}: error: invalid operand for instruction

				// For USDOT and SUDOT (indexed), the index is in range [0,3] (regardless of data types)
				usdot v31.2s, v1.8b, v2.4b[4]
				// CHECK: [[@LINE-1]]:{{[0-9]+}}: error: vector lane must be an integer in range [0, 3].
				[Inline comment, kmclaughlin] The arrangement specifiers of the first two operands don't match for these tests, which is what the next set of tests below is checking for. It might be worth keeping these tests specific to just the index being out of range.
				usdot v31.4s, v1.16b, v2.4b[4]
				// CHECK: [[@LINE-1]]:{{[0-9]+}}: error: vector lane must be an integer in range [0, 3].
				sudot v31.2s, v1.8b, v2.4b[4]
				// CHECK: [[@LINE-1]]:{{[0-9]+}}: error: vector lane must be an integer in range [0, 3].
				sudot v31.4s, v1.16b, v2.4b[4]
				// CHECK: [[@LINE-1]]:{{[0-9]+}}: error: vector lane must be an integer in range [0, 3].

				// The arrangement specifiers of the first two operands must match.
				usdot v31.4s, v1.8b, v2.4b[0]
				[Inline comment, kmclaughlin] muct -> must :)
				// CHECK: [[@LINE-1]]:{{[0-9]+}}: error: invalid operand for instruction
				usdot v31.2s, v1.16b, v2.4b[0]
				// CHECK: [[@LINE-1]]:{{[0-9]+}}: error: invalid operand for instruction
				sudot v31.4s, v1.8b, v2.4b[0]
				// CHECK: [[@LINE-1]]:{{[0-9]+}}: error: invalid operand for instruction
				sudot v31.2s, v1.16b, v2.4b[0]
				// CHECK: [[@LINE-1]]:{{[0-9]+}}: error: invalid operand for instruction
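
The indexed-form rules exercised above (destination and first source arrangements must pair up, the indexed operand is `.4b`, and the lane is in [0, 3]) can be summarised in a few lines. This is a hypothetical validator written for illustration, not the AArch64 assembly parser. In particular, the order in which it reports errors when several rules are violated at once is an arbitrary choice.

```python
# Valid destination -> first-source arrangement pairs for usdot/sudot.
VALID_PAIRS = {".2s": ".8b", ".4s": ".16b"}


def check_dot_lane(dst_kind, lhs_kind, rhs_kind, lane):
    """Return None if the operands are accepted, else an error string."""
    if VALID_PAIRS.get(dst_kind) != lhs_kind or rhs_kind != ".4b":
        return "invalid operand for instruction"
    if not 0 <= lane <= 3:
        return "vector lane must be an integer in range [0, 3]."
    return None
```

For instance, `check_dot_lane(".2s", ".8b", ".4b", 4)` reports the lane error, while `check_dot_lane(".4s", ".8b", ".4b", 0)` reports an invalid operand.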

llvm/test/MC/AArch64/armv8.6a-simd-matmul.s

This file was added.

				// RUN: llvm-mc -triple aarch64 -show-encoding -mattr=+i8mm < %s | FileCheck %s
				// RUN: llvm-mc -triple aarch64 -show-encoding -mattr=+v8.6a < %s | FileCheck %s
				// RUN: not llvm-mc -triple aarch64 -show-encoding -mattr=+v8.6a-i8mm < %s 2>&1 | FileCheck %s --check-prefix=NOMATMUL

				smmla v1.4s, v16.16b, v31.16b
				ummla v1.4s, v16.16b, v31.16b
				usmmla v1.4s, v16.16b, v31.16b
				// CHECK: smmla v1.4s, v16.16b, v31.16b // encoding: [0x01,0xa6,0x9f,0x4e]
				// CHECK: ummla v1.4s, v16.16b, v31.16b // encoding: [0x01,0xa6,0x9f,0x6e]
				// CHECK: usmmla v1.4s, v16.16b, v31.16b // encoding: [0x01,0xae,0x9f,0x4e]
				// NOMATMUL: instruction requires: i8mm
				// NOMATMUL-NEXT: smmla v1.4s, v16.16b, v31.16b
				// NOMATMUL: instruction requires: i8mm
				// NOMATMUL-NEXT: ummla v1.4s, v16.16b, v31.16b
				// NOMATMUL: instruction requires: i8mm
				// NOMATMUL-NEXT: usmmla v1.4s, v16.16b, v31.16b

				usdot v3.2s, v15.8b, v30.8b
				usdot v3.4s, v15.16b, v30.16b
				// CHECK: usdot v3.2s, v15.8b, v30.8b // encoding: [0xe3,0x9d,0x9e,0x0e]
				// CHECK: usdot v3.4s, v15.16b, v30.16b // encoding: [0xe3,0x9d,0x9e,0x4e]
				// NOMATMUL: instruction requires: i8mm
				// NOMATMUL-NEXT: usdot v3.2s, v15.8b, v30.8b
				// NOMATMUL: instruction requires: i8mm
				// NOMATMUL-NEXT: usdot v3.4s, v15.16b, v30.16b

				usdot v31.2s, v1.8b, v2.4b[3]
				usdot v31.4s, v1.16b, v2.4b[3]
				// CHECK: usdot v31.2s, v1.8b, v2.4b[3] // encoding: [0x3f,0xf8,0xa2,0x0f]
				// CHECK: usdot v31.4s, v1.16b, v2.4b[3] // encoding: [0x3f,0xf8,0xa2,0x4f]
				// NOMATMUL: instruction requires: i8mm
				// NOMATMUL-NEXT: usdot v31.2s, v1.8b, v2.4b[3]
				// NOMATMUL: instruction requires: i8mm
				// NOMATMUL-NEXT: usdot v31.4s, v1.16b, v2.4b[3]

				sudot v31.2s, v1.8b, v2.4b[3]
				sudot v31.4s, v1.16b, v2.4b[3]
				// CHECK: sudot v31.2s, v1.8b, v2.4b[3] // encoding: [0x3f,0xf8,0x22,0x0f]
				// CHECK: sudot v31.4s, v1.16b, v2.4b[3] // encoding: [0x3f,0xf8,0x22,0x4f]
				// NOMATMUL: instruction requires: i8mm
				// NOMATMUL-NEXT: sudot v31.2s, v1.8b, v2.4b[3]
				// NOMATMUL: instruction requires: i8mm
				// NOMATMUL-NEXT: sudot v31.4s, v1.16b, v2.4b[3]
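
The indexed `usdot` and `sudot` encodings above differ only in the `size` field (`0b10` versus `0b00`), and the 2-bit lane index is split across bits 21 (L) and 11 (H) as in `BaseSIMDThreeSameVectorDotIndex`. The sketch below reconstructs those bytes. The remaining bit positions are an assumption about `BaseSIMDIndexedTied`, which this diff does not show, and `encode_dot_index` is a hypothetical helper name.

```python
def encode_dot_index(q, u, mixed, size, rd, rn, rm, idx):
    """Pack an indexed dot-product instruction; returns little-endian bytes.

    Assumed layout: Q=bit 30, U=bit 29, 0b1111 in bits 27-24,
    size in 23-22, L in 21, Rm in 20-16, opc in 15-12, H in 11,
    Rn in 9-5, Rd in 4-0.
    """
    word = (q << 30) | (u << 29) | (0b1111 << 24) | (size << 22)
    word |= ((idx & 1) << 21) | (rm << 16)        # Inst{21} = idx{0} (L)
    word |= ((0b111 << 1) | mixed) << 12          # opc = {0b111, Mixed}
    word |= ((idx >> 1) & 1) << 11                # Inst{11} = idx{1} (H)
    word |= (rn << 5) | rd
    return list(word.to_bytes(4, "little"))
```

Here `encode_dot_index(0, 0, 1, 0b10, 31, 1, 2, 3)` corresponds to `usdot v31.2s, v1.8b, v2.4b[3]`, and the same call with `size=0b00` to the `sudot` form.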

llvm/test/MC/Disassembler/AArch64/armv8.6a-simd-matmul.txt

This file was added.

				# RUN: llvm-mc -triple=aarch64 -mattr=+i8mm -disassemble < %s | FileCheck %s
				# RUN: llvm-mc -triple=aarch64 -mattr=+v8.6a -disassemble < %s | FileCheck %s
				# RUN: not llvm-mc -triple=aarch64 -mattr=+v8.5a -disassemble < %s 2>&1 | FileCheck %s --check-prefix=NOI8MM

				[0x01,0xa6,0x9f,0x4e]
				[0x01,0xa6,0x9f,0x6e]
				[0x01,0xae,0x9f,0x4e]
				# CHECK: smmla v1.4s, v16.16b, v31.16b
				# CHECK: ummla v1.4s, v16.16b, v31.16b
				# CHECK: usmmla v1.4s, v16.16b, v31.16b
				# NOI8MM: [[@LINE-6]]:{{[0-9]+}}: warning: invalid instruction encoding
				# NOI8MM: [[@LINE-6]]:{{[0-9]+}}: warning: invalid instruction encoding
				# NOI8MM: [[@LINE-6]]:{{[0-9]+}}: warning: invalid instruction encoding

				[0xe3,0x9d,0x9e,0x0e]
				[0xe3,0x9d,0x9e,0x4e]
				# CHECK: usdot v3.2s, v15.8b, v30.8b
				# CHECK: usdot v3.4s, v15.16b, v30.16b
				# NOI8MM: [[@LINE-4]]:{{[0-9]+}}: warning: invalid instruction encoding
				# NOI8MM: [[@LINE-4]]:{{[0-9]+}}: warning: invalid instruction encoding

				[0x3f,0xf8,0xa2,0x0f]
				[0x3f,0xf8,0xa2,0x4f]
				# CHECK: usdot v31.2s, v1.8b, v2.4b[3]
				# CHECK: usdot v31.4s, v1.16b, v2.4b[3]
				# NOI8MM: [[@LINE-4]]:{{[0-9]+}}: warning: invalid instruction encoding
				# NOI8MM: [[@LINE-4]]:{{[0-9]+}}: warning: invalid instruction encoding

				[0x3f,0xf8,0x22,0x0f]
				[0x3f,0xf8,0x22,0x4f]
				# CHECK: sudot v31.2s, v1.8b, v2.4b[3]
				# CHECK: sudot v31.4s, v1.16b, v2.4b[3]
				# NOI8MM: [[@LINE-4]]:{{[0-9]+}}: warning: invalid instruction encoding
				# NOI8MM: [[@LINE-4]]:{{[0-9]+}}: warning: invalid instruction encoding

Revision Contents

Diff 259890

clang/include/clang/Basic/arm_neon.td

clang/lib/Basic/Targets/AArch64.h

clang/lib/Basic/Targets/AArch64.cpp

clang/lib/CodeGen/CGBuiltin.cpp

clang/test/CodeGen/aarch64-matmul.cpp

clang/test/CodeGen/aarch64-v8.6a-neon-intrinsics.c

llvm/include/llvm/IR/IntrinsicsAArch64.td

llvm/lib/Target/AArch64/AArch64.td

llvm/lib/Target/AArch64/AArch64InstrFormats.td

llvm/lib/Target/AArch64/AArch64InstrInfo.td

llvm/lib/Target/AArch64/AArch64Subtarget.h

llvm/test/CodeGen/AArch64/aarch64-matmul.ll

llvm/test/MC/AArch64/armv8.6a-simd-matmul-error.s

llvm/test/MC/AArch64/armv8.6a-simd-matmul.s

llvm/test/MC/Disassembler/AArch64/armv8.6a-simd-matmul.txt
