This is an archive of the discontinued LLVM Phabricator instance.


Differential D81740

[ARM] BFloat MatMul Intrinsics&CodeGen
Closed · Public

Authored by miyuki on Jun 12 2020, 7:19 AM.


Details

Reviewers

stuij
t.p.northover
SjoerdMeijer
sdesmalen
fpetrogalli
LukeGeeson
simon_tatham
dmgreen
MarkMurrayARM

Commits

rG9c579540ff69: [ARM] BFloat MatMul Intrinsics&CodeGen

Summary

This patch adds BFloat16 matrix multiplication intrinsics and the corresponding
code generation for __bf16 on AArch32, including the IR intrinsics. Tests are
provided as needed.
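
As a rough illustration of what the dot-product intrinsics compute, the following C sketch models `vbfdot_f32` (2 lanes) and `vbfdotq_f32` (4 lanes) in scalar code. The names `bf16_to_f32` and `bfdot_ref` are hypothetical and not part of the patch. The sketch assumes plain float arithmetic and does not try to reproduce the instruction's exact rounding or denormal handling.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Widen one bfloat16 bit pattern to float. bfloat16 is the upper 16 bits
   of an IEEE-754 binary32 value, so widening is a 16-bit left shift. */
static float bf16_to_f32(uint16_t bits) {
  uint32_t wide = (uint32_t)bits << 16;
  float out;
  memcpy(&out, &wide, sizeof out);
  return out;
}

/* Hypothetical scalar model of the bf16 dot product: each float lane i
   accumulates the products of the two adjacent bf16 pairs (2i, 2i+1).
   Rounding here is ordinary float rounding, which is an assumption. */
static void bfdot_ref(float *r, const uint16_t *a, const uint16_t *b,
                      int lanes) {
  assert(lanes == 2 || lanes == 4);
  for (int i = 0; i < lanes; ++i) {
    float even = bf16_to_f32(a[2 * i]) * bf16_to_f32(b[2 * i]);
    float odd = bf16_to_f32(a[2 * i + 1]) * bf16_to_f32(b[2 * i + 1]);
    r[i] += even + odd;
  }
}
```

A model like this is only useful for checking small hand-written cases against the intrinsic; it is not a bit-exact emulation.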

This patch is part of a series implementing the Bfloat16 extension of
the Armv8.6-a architecture, as detailed here:

https://community.arm.com/developer/ip-products/processors/b/processors-ip-blog/posts/arm-architecture-developments-armv8-6-a

The bfloat type and its properties are specified in the Arm Architecture
Reference Manual:

https://developer.arm.com/docs/ddi0487/latest/arm-architecture-reference-manual-armv8-for-armv8-a-architecture-profile

The following people contributed to this patch:

Luke Geeson
Momchil Velikov
Mikhail Maltsev
Luke Cheeseman
Simon Tatham

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

LukeGeeson created this revision. · Jun 12 2020, 7:19 AM

Herald added projects: Restricted Project, Restricted Project. · Jun 12 2020, 7:19 AM

Herald added subscribers: llvm-commits, cfe-commits, hiraditya, kristof.beyls.

LukeGeeson added a parent revision: D81486: [ARM][BFloat] Implement lowering of bf16 load/store intrinsics. · Jun 12 2020, 7:21 AM

Harbormaster failed remote builds in B60117: Diff 270393! · Jun 12 2020, 7:35 AM

removed redundancy in patch

miyuki added inline comments. · Jun 15 2020, 5:36 AM

clang/test/CodeGen/arm-bf16-dotprod-intrinsics.c
3	Would it be sufficient to run through `opt -mem2reg -instcombine` instead of the whole -O2 pipeline?
13	The check prefixes are misleading. How about `CHECK-FPABI-HARD` and `CHECK-FPABI-SOFT`?
llvm/lib/Target/ARM/ARMInstrNEON.td
9196	Will this work if the selected element is in the top half of the Q register (`$lane >= 4`)?
llvm/test/CodeGen/ARM/arm-bf16-dotprod-intrinsics.ll
2	`--check-prefix=CHECK` is redundant
6	Could you please get rid of `local_unnamed_addr #0`? `#0` is referring to an attribute that is not defined anywhere.
10	Same for `#3`
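
The lane question above comes down to how a Q-register lane index maps onto the two D registers that alias it on AArch32 (qN overlaps d(2N) and d(2N+1)). The sketch below is a hypothetical helper, `split_q_lane`, that performs this split with an integer divide and modulo. It assumes that this is the behaviour wanted from helpers such as `DSubReg_i16_reg` and `SubReg_i16_lane`, and it does not model any encoding restrictions on which D registers an indexed instruction accepts.

```c
#include <assert.h>

/* A D register number and a lane index inside that D register. */
struct d_lane {
  unsigned dreg;
  unsigned lane;
};

/* Hypothetical helper: split a lane of Q register `qreg` into the D
   register that holds it and the lane index within that D register.
   elem_bits is the element width (16 for bf16, 32 for float). */
static struct d_lane split_q_lane(unsigned qreg, unsigned lane,
                                  unsigned elem_bits) {
  assert(elem_bits == 16 || elem_bits == 32);
  unsigned per_d = 64 / elem_bits; /* elements per 64-bit D register */
  assert(lane < 2 * per_d);
  struct d_lane out = {2 * qreg + lane / per_d, lane % per_d};
  return out;
}
```

For example, 16-bit lane 6 of q2 lands in d5 at lane 2, and 32-bit lane 3 of q2 lands in d5 at lane 1, which agrees with the `vdup.16 q8, d5[2]` and `vdup.32 q8, d5[1]` lines in the llc test of this revision.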

miyuki commandeered this revision. · Jun 22 2020, 7:21 AM

miyuki added a reviewer: LukeGeeson.

miyuki added a parent revision: D82206: [ARM][BFloat] Implement bf16 get/set_lane without casts to i16 vectors. · Jun 22 2020, 9:17 AM

Addressed the review comments

Herald added a subscriber: danielkiss. · Jun 23 2020, 3:54 AM

miyuki set the repository for this revision to rG LLVM Github Monorepo. · Jun 23 2020, 3:54 AM

miyuki marked 6 inline comments as done.

miyuki added reviewers: simon_tatham, dmgreen. · Jun 23 2020, 4:09 AM

LGTM.

This revision is now accepted and ready to land. · Jun 23 2020, 4:13 AM

Closed by commit rG9c579540ff69: [ARM] BFloat MatMul Intrinsics&CodeGen (authored by miyuki). · Jun 23 2020, 5:17 AM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                                    Size
clang/lib/CodeGen/CGBuiltin.cpp                         5 lines
clang/test/CodeGen/arm-bf16-dotprod-intrinsics.c        166 lines
llvm/include/llvm/IR/IntrinsicsARM.td                   9 lines
llvm/lib/Target/ARM/ARMInstrNEON.td                     56 lines
llvm/test/CodeGen/ARM/arm-bf16-dotprod-intrinsics.ll    194 lines

Diff 272687

clang/lib/CodeGen/CGBuiltin.cpp

  static const ARMVectorIntrinsicInfo ARMSIMDIntrinsicMap [] = {
  NEONMAP2(vabdq_v, arm_neon_vabdu, arm_neon_vabds, Add1ArgType | UnsignedAlts),
  NEONMAP1(vabs_v, arm_neon_vabs, 0),
  NEONMAP1(vabsq_v, arm_neon_vabs, 0),
  NEONMAP0(vaddhn_v),
  NEONMAP1(vaesdq_v, arm_neon_aesd, 0),
  NEONMAP1(vaeseq_v, arm_neon_aese, 0),
  NEONMAP1(vaesimcq_v, arm_neon_aesimc, 0),
  NEONMAP1(vaesmcq_v, arm_neon_aesmc, 0),
+ NEONMAP1(vbfdot_v, arm_neon_bfdot, 0),
+ NEONMAP1(vbfdotq_v, arm_neon_bfdot, 0),
+ NEONMAP1(vbfmlalbq_v, arm_neon_bfmlalb, 0),
+ NEONMAP1(vbfmlaltq_v, arm_neon_bfmlalt, 0),
+ NEONMAP1(vbfmmlaq_v, arm_neon_bfmmla, 0),
  NEONMAP1(vbsl_v, arm_neon_vbsl, AddRetType),
  NEONMAP1(vbslq_v, arm_neon_vbsl, AddRetType),
  NEONMAP1(vcadd_rot270_v, arm_neon_vcadd_rot270, Add1ArgType),
  NEONMAP1(vcadd_rot90_v, arm_neon_vcadd_rot90, Add1ArgType),
  NEONMAP1(vcaddq_rot270_v, arm_neon_vcadd_rot270, Add1ArgType),
  NEONMAP1(vcaddq_rot90_v, arm_neon_vcadd_rot90, Add1ArgType),
  NEONMAP1(vcage_v, arm_neon_vacge, 0),
  NEONMAP1(vcageq_v, arm_neon_vacge, 0),

clang/test/CodeGen/arm-bf16-dotprod-intrinsics.c

This file was added.

// NOTE: Assertions have been autogenerated by utils/update_cc_test_checks.py
// RUN: %clang_cc1 -triple armv8-arm-none-eabi \
// RUN: -target-feature +neon -target-feature +bf16 -mfloat-abi soft \
// RUN: -disable-O0-optnone -S -emit-llvm -o - %s \
// RUN: | opt -S -mem2reg -instcombine | FileCheck %s
// RUN: %clang_cc1 -triple armv8-arm-none-eabi \
// RUN: -target-feature +neon -target-feature +bf16 -mfloat-abi hard \
// RUN: -disable-O0-optnone -S -emit-llvm -o - %s \
// RUN: | opt -S -mem2reg -instcombine | FileCheck %s

#include <arm_neon.h>

// CHECK-LABEL: @test_vbfdot_f32(
// CHECK-NEXT: entry:
// CHECK-NEXT: [[TMP0:%.*]] = bitcast <4 x bfloat> [[A:%.*]] to <8 x i8>
// CHECK-NEXT: [[TMP1:%.*]] = bitcast <4 x bfloat> [[B:%.*]] to <8 x i8>
// CHECK-NEXT: [[VBFDOT1_I:%.*]] = call <2 x float> @llvm.arm.neon.bfdot.v2f32.v8i8(<2 x float> [[R:%.*]], <8 x i8> [[TMP0]], <8 x i8> [[TMP1]]) #3
// CHECK-NEXT: ret <2 x float> [[VBFDOT1_I]]
//
float32x2_t test_vbfdot_f32(float32x2_t r, bfloat16x4_t a, bfloat16x4_t b) {
  return vbfdot_f32(r, a, b);
}

// CHECK-LABEL: @test_vbfdotq_f32(
// CHECK-NEXT: entry:
// CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x bfloat> [[A:%.*]] to <16 x i8>
// CHECK-NEXT: [[TMP1:%.*]] = bitcast <8 x bfloat> [[B:%.*]] to <16 x i8>
// CHECK-NEXT: [[VBFDOT1_I:%.*]] = call <4 x float> @llvm.arm.neon.bfdot.v4f32.v16i8(<4 x float> [[R:%.*]], <16 x i8> [[TMP0]], <16 x i8> [[TMP1]]) #3
// CHECK-NEXT: ret <4 x float> [[VBFDOT1_I]]
//
float32x4_t test_vbfdotq_f32(float32x4_t r, bfloat16x8_t a, bfloat16x8_t b){
  return vbfdotq_f32(r, a, b);
}

// CHECK-LABEL: @test_vbfdot_lane_f32(
// CHECK-NEXT: entry:
// CHECK-NEXT: [[DOTCAST:%.*]] = bitcast <4 x bfloat> [[B:%.*]] to <2 x float>
// CHECK-NEXT: [[LANE:%.*]] = shufflevector <2 x float> [[DOTCAST]], <2 x float> undef, <2 x i32> zeroinitializer
// CHECK-NEXT: [[DOTCAST1:%.*]] = bitcast <2 x float> [[LANE]] to <8 x i8>
// CHECK-NEXT: [[TMP0:%.*]] = bitcast <4 x bfloat> [[A:%.*]] to <8 x i8>
// CHECK-NEXT: [[VBFDOT1_I:%.*]] = call <2 x float> @llvm.arm.neon.bfdot.v2f32.v8i8(<2 x float> [[R:%.*]], <8 x i8> [[TMP0]], <8 x i8> [[DOTCAST1]]) #3
// CHECK-NEXT: ret <2 x float> [[VBFDOT1_I]]
//
float32x2_t test_vbfdot_lane_f32(float32x2_t r, bfloat16x4_t a, bfloat16x4_t b){
  return vbfdot_lane_f32(r, a, b, 0);
}

// CHECK-LABEL: @test_vbfdotq_laneq_f32(
// CHECK-NEXT: entry:
// CHECK-NEXT: [[DOTCAST:%.*]] = bitcast <8 x bfloat> [[B:%.*]] to <4 x float>
// CHECK-NEXT: [[LANE:%.*]] = shufflevector <4 x float> [[DOTCAST]], <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>
// CHECK-NEXT: [[DOTCAST1:%.*]] = bitcast <4 x float> [[LANE]] to <16 x i8>
// CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x bfloat> [[A:%.*]] to <16 x i8>
// CHECK-NEXT: [[VBFDOT1_I:%.*]] = call <4 x float> @llvm.arm.neon.bfdot.v4f32.v16i8(<4 x float> [[R:%.*]], <16 x i8> [[TMP0]], <16 x i8> [[DOTCAST1]]) #3
// CHECK-NEXT: ret <4 x float> [[VBFDOT1_I]]
//
float32x4_t test_vbfdotq_laneq_f32(float32x4_t r, bfloat16x8_t a, bfloat16x8_t b) {
  return vbfdotq_laneq_f32(r, a, b, 3);
}

// CHECK-LABEL: @test_vbfdot_laneq_f32(
// CHECK-NEXT: entry:
// CHECK-NEXT: [[DOTCAST:%.*]] = bitcast <8 x bfloat> [[B:%.*]] to <4 x float>
// CHECK-NEXT: [[LANE:%.*]] = shufflevector <4 x float> [[DOTCAST]], <4 x float> undef, <2 x i32> <i32 3, i32 3>
// CHECK-NEXT: [[DOTCAST1:%.*]] = bitcast <2 x float> [[LANE]] to <8 x i8>
// CHECK-NEXT: [[TMP0:%.*]] = bitcast <4 x bfloat> [[A:%.*]] to <8 x i8>
// CHECK-NEXT: [[VBFDOT1_I:%.*]] = call <2 x float> @llvm.arm.neon.bfdot.v2f32.v8i8(<2 x float> [[R:%.*]], <8 x i8> [[TMP0]], <8 x i8> [[DOTCAST1]]) #3
// CHECK-NEXT: ret <2 x float> [[VBFDOT1_I]]
//
float32x2_t test_vbfdot_laneq_f32(float32x2_t r, bfloat16x4_t a, bfloat16x8_t b) {
  return vbfdot_laneq_f32(r, a, b, 3);
}

// CHECK-LABEL: @test_vbfdotq_lane_f32(
// CHECK-NEXT: entry:
// CHECK-NEXT: [[DOTCAST:%.*]] = bitcast <4 x bfloat> [[B:%.*]] to <2 x float>
// CHECK-NEXT: [[LANE:%.*]] = shufflevector <2 x float> [[DOTCAST]], <2 x float> undef, <4 x i32> zeroinitializer
// CHECK-NEXT: [[DOTCAST1:%.*]] = bitcast <4 x float> [[LANE]] to <16 x i8>
// CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x bfloat> [[A:%.*]] to <16 x i8>
// CHECK-NEXT: [[VBFDOT1_I:%.*]] = call <4 x float> @llvm.arm.neon.bfdot.v4f32.v16i8(<4 x float> [[R:%.*]], <16 x i8> [[TMP0]], <16 x i8> [[DOTCAST1]]) #3
// CHECK-NEXT: ret <4 x float> [[VBFDOT1_I]]
//
float32x4_t test_vbfdotq_lane_f32(float32x4_t r, bfloat16x8_t a, bfloat16x4_t b) {
  return vbfdotq_lane_f32(r, a, b, 0);
}

// CHECK-LABEL: @test_vbfmmlaq_f32(
// CHECK-NEXT: entry:
// CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x bfloat> [[A:%.*]] to <16 x i8>
// CHECK-NEXT: [[TMP1:%.*]] = bitcast <8 x bfloat> [[B:%.*]] to <16 x i8>
// CHECK-NEXT: [[VBFMMLA1_I:%.*]] = call <4 x float> @llvm.arm.neon.bfmmla.v4f32.v16i8(<4 x float> [[R:%.*]], <16 x i8> [[TMP0]], <16 x i8> [[TMP1]]) #3
// CHECK-NEXT: ret <4 x float> [[VBFMMLA1_I]]
//
float32x4_t test_vbfmmlaq_f32(float32x4_t r, bfloat16x8_t a, bfloat16x8_t b) {
  return vbfmmlaq_f32(r, a, b);
}

// CHECK-LABEL: @test_vbfmlalbq_f32(
// CHECK-NEXT: entry:
// CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x bfloat> [[A:%.*]] to <16 x i8>
// CHECK-NEXT: [[TMP1:%.*]] = bitcast <8 x bfloat> [[B:%.*]] to <16 x i8>
// CHECK-NEXT: [[VBFMLALB1_I:%.*]] = call <4 x float> @llvm.arm.neon.bfmlalb.v4f32.v16i8(<4 x float> [[R:%.*]], <16 x i8> [[TMP0]], <16 x i8> [[TMP1]]) #3
// CHECK-NEXT: ret <4 x float> [[VBFMLALB1_I]]
//
float32x4_t test_vbfmlalbq_f32(float32x4_t r, bfloat16x8_t a, bfloat16x8_t b) {
  return vbfmlalbq_f32(r, a, b);
}

// CHECK-LABEL: @test_vbfmlaltq_f32(
// CHECK-NEXT: entry:
// CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x bfloat> [[A:%.*]] to <16 x i8>
// CHECK-NEXT: [[TMP1:%.*]] = bitcast <8 x bfloat> [[B:%.*]] to <16 x i8>
// CHECK-NEXT: [[VBFMLALT1_I:%.*]] = call <4 x float> @llvm.arm.neon.bfmlalt.v4f32.v16i8(<4 x float> [[R:%.*]], <16 x i8> [[TMP0]], <16 x i8> [[TMP1]]) #3
// CHECK-NEXT: ret <4 x float> [[VBFMLALT1_I]]
//
float32x4_t test_vbfmlaltq_f32(float32x4_t r, bfloat16x8_t a, bfloat16x8_t b) {
  return vbfmlaltq_f32(r, a, b);
}

// CHECK-LABEL: @test_vbfmlalbq_lane_f32(
// CHECK-NEXT: entry:
// CHECK-NEXT: [[VECINIT35:%.*]] = shufflevector <4 x bfloat> [[B:%.*]], <4 x bfloat> undef, <8 x i32> zeroinitializer
// CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x bfloat> [[A:%.*]] to <16 x i8>
// CHECK-NEXT: [[TMP1:%.*]] = bitcast <8 x bfloat> [[VECINIT35]] to <16 x i8>
// CHECK-NEXT: [[VBFMLALB1_I:%.*]] = call <4 x float> @llvm.arm.neon.bfmlalb.v4f32.v16i8(<4 x float> [[R:%.*]], <16 x i8> [[TMP0]], <16 x i8> [[TMP1]]) #3
// CHECK-NEXT: ret <4 x float> [[VBFMLALB1_I]]
//
float32x4_t test_vbfmlalbq_lane_f32(float32x4_t r, bfloat16x8_t a, bfloat16x4_t b) {
  return vbfmlalbq_lane_f32(r, a, b, 0);
}

// CHECK-LABEL: @test_vbfmlalbq_laneq_f32(
// CHECK-NEXT: entry:
// CHECK-NEXT: [[VECINIT35:%.*]] = shufflevector <8 x bfloat> [[B:%.*]], <8 x bfloat> undef, <8 x i32> <i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
// CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x bfloat> [[A:%.*]] to <16 x i8>
// CHECK-NEXT: [[TMP1:%.*]] = bitcast <8 x bfloat> [[VECINIT35]] to <16 x i8>
// CHECK-NEXT: [[VBFMLALB1_I:%.*]] = call <4 x float> @llvm.arm.neon.bfmlalb.v4f32.v16i8(<4 x float> [[R:%.*]], <16 x i8> [[TMP0]], <16 x i8> [[TMP1]]) #3
// CHECK-NEXT: ret <4 x float> [[VBFMLALB1_I]]
//
float32x4_t test_vbfmlalbq_laneq_f32(float32x4_t r, bfloat16x8_t a, bfloat16x8_t b) {
  return vbfmlalbq_laneq_f32(r, a, b, 3);
}

// CHECK-LABEL: @test_vbfmlaltq_lane_f32(
// CHECK-NEXT: entry:
// CHECK-NEXT: [[VECINIT35:%.*]] = shufflevector <4 x bfloat> [[B:%.*]], <4 x bfloat> undef, <8 x i32> zeroinitializer
// CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x bfloat> [[A:%.*]] to <16 x i8>
// CHECK-NEXT: [[TMP1:%.*]] = bitcast <8 x bfloat> [[VECINIT35]] to <16 x i8>
// CHECK-NEXT: [[VBFMLALT1_I:%.*]] = call <4 x float> @llvm.arm.neon.bfmlalt.v4f32.v16i8(<4 x float> [[R:%.*]], <16 x i8> [[TMP0]], <16 x i8> [[TMP1]]) #3
// CHECK-NEXT: ret <4 x float> [[VBFMLALT1_I]]
//
float32x4_t test_vbfmlaltq_lane_f32(float32x4_t r, bfloat16x8_t a, bfloat16x4_t b) {
  return vbfmlaltq_lane_f32(r, a, b, 0);
}

// CHECK-LABEL: @test_vbfmlaltq_laneq_f32(
// CHECK-NEXT: entry:
// CHECK-NEXT: [[VECINIT35:%.*]] = shufflevector <8 x bfloat> [[B:%.*]], <8 x bfloat> undef, <8 x i32> <i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
// CHECK-NEXT: [[TMP0:%.*]] = bitcast <8 x bfloat> [[A:%.*]] to <16 x i8>
// CHECK-NEXT: [[TMP1:%.*]] = bitcast <8 x bfloat> [[VECINIT35]] to <16 x i8>
// CHECK-NEXT: [[VBFMLALT1_I:%.*]] = call <4 x float> @llvm.arm.neon.bfmlalt.v4f32.v16i8(<4 x float> [[R:%.*]], <16 x i8> [[TMP0]], <16 x i8> [[TMP1]]) #3
// CHECK-NEXT: ret <4 x float> [[VBFMLALT1_I]]
//
float32x4_t test_vbfmlaltq_laneq_f32(float32x4_t r, bfloat16x8_t a, bfloat16x8_t b) {
  return vbfmlaltq_laneq_f32(r, a, b, 3);
}
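
The `vbfmlalbq`/`vbfmlaltq` tests above differ only in which half of each bf16 pair feeds the multiply: "b" (bottom) uses the even-numbered elements and "t" (top) the odd-numbered ones. The C sketch below is a scalar model of that behaviour. The function names are hypothetical, and the lane variant assumes the broadcast semantics implied by the `shufflevector` splat in the CHECK lines. Rounding is plain float arithmetic, which is an assumption rather than a statement about the hardware.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Widen one bfloat16 bit pattern to float (bf16 is the top half of f32). */
static float bf16_to_f32(uint16_t bits) {
  uint32_t wide = (uint32_t)bits << 16;
  float out;
  memcpy(&out, &wide, sizeof out);
  return out;
}

/* Hypothetical model of vbfmlalbq_f32 (top == 0) and vbfmlaltq_f32
   (top == 1): lane i accumulates a[2i + top] * b[2i + top]. */
static void bfmlal_ref(float r[4], const uint16_t a[8], const uint16_t b[8],
                       int top) {
  assert(top == 0 || top == 1);
  for (int i = 0; i < 4; ++i)
    r[i] += bf16_to_f32(a[2 * i + top]) * bf16_to_f32(b[2 * i + top]);
}

/* Hypothetical model of the _lane/_laneq forms: the selected element of b
   is broadcast, so every lane multiplies by b[lane]. b_len is 4 or 8. */
static void bfmlal_lane_ref(float r[4], const uint16_t a[8],
                            const uint16_t *b, unsigned b_len, unsigned lane,
                            int top) {
  assert(top == 0 || top == 1);
  assert((b_len == 4 || b_len == 8) && lane < b_len);
  float scalar = bf16_to_f32(b[lane]);
  for (int i = 0; i < 4; ++i)
    r[i] += bf16_to_f32(a[2 * i + top]) * scalar;
}
```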

llvm/include/llvm/IR/IntrinsicsARM.td

  : Intrinsic<[llvm_anyvector_ty],
  LLVMMatchType<1>],
  [IntrNoMem]>;
  def int_arm_neon_ummla : Neon_MatMul_Intrinsic;
  def int_arm_neon_smmla : Neon_MatMul_Intrinsic;
  def int_arm_neon_usmmla : Neon_MatMul_Intrinsic;
  def int_arm_neon_usdot : Neon_Dot_Intrinsic;

  // v8.6-A Bfloat Intrinsics
+ def int_arm_neon_bfdot : Neon_Dot_Intrinsic;
+ def int_arm_neon_bfmmla : Neon_MatMul_Intrinsic;
+
+ class Neon_FML_Intrinsic
+ : Intrinsic<[llvm_anyvector_ty],
+ [LLVMMatchType<0>, llvm_anyvector_ty, LLVMMatchType<1>],
+ [IntrNoMem]>;
+ def int_arm_neon_bfmlalb : Neon_FML_Intrinsic;
+ def int_arm_neon_bfmlalt : Neon_FML_Intrinsic;

  def int_arm_cls: Intrinsic<[llvm_i32_ty], [llvm_i32_ty], [IntrNoMem]>;
  def int_arm_cls64: Intrinsic<[llvm_i32_ty], [llvm_i64_ty], [IntrNoMem]>;

  def int_arm_mve_vctp8 : Intrinsic<[llvm_v16i1_ty], [llvm_i32_ty], [IntrNoMem]>;
  def int_arm_mve_vctp16 : Intrinsic<[llvm_v8i1_ty], [llvm_i32_ty], [IntrNoMem]>;
  def int_arm_mve_vctp32 : Intrinsic<[llvm_v4i1_ty], [llvm_i32_ty], [IntrNoMem]>;
  // vctp64 takes v4i1, to work around v2i1 not being a legal MVE type

llvm/lib/Target/ARM/ARMInstrNEON.td

  def : NEONInstAlias<"vmov${p}.f32 $Vd, $imm",
  (VMOVv4i32 QPR:$Vd, nImmVMOVI32:$imm, pred:$p)>;
  def : NEONInstAlias<"vmov${p}.f32 $Vd, $imm",
  (VMOVv2i32 DPR:$Vd, nImmVMOVI32:$imm, pred:$p)>;

  // ARMv8.6a BFloat16 instructions.
  let Predicates = [HasBF16, HasNEON] in {
  class BF16VDOT<bits<5> op27_23, bits<2> op21_20, bit op6,
- dag oops, dag iops>
+ dag oops, dag iops, list<dag> pattern>
  : N3Vnp<op27_23, op21_20, 0b1101, op6, 0, oops, iops,
- N3RegFrm, IIC_VDOTPROD, "", "", []> {
+ N3RegFrm, IIC_VDOTPROD, "", "", pattern>
+ {
  let DecoderNamespace = "VFPV8";
  }

  class BF16VDOTS<bit Q, RegisterClass RegTy, string opc, ValueType AccumTy, ValueType InputTy>
  : BF16VDOT<0b11000, 0b00, Q, (outs RegTy:$dst),
- (ins RegTy:$Vd, RegTy:$Vn, RegTy:$Vm)> {
+ (ins RegTy:$Vd, RegTy:$Vn, RegTy:$Vm),
+ [(set (AccumTy RegTy:$dst),
+ (int_arm_neon_bfdot (AccumTy RegTy:$Vd),
+ (InputTy RegTy:$Vn),
+ (InputTy RegTy:$Vm)))]> {
  let Constraints = "$dst = $Vd";
  let AsmString = !strconcat(opc, ".bf16", "\t$Vd, $Vn, $Vm");
  let DecoderNamespace = "VFPV8";
  }

  multiclass BF16VDOTI<bit Q, RegisterClass RegTy, string opc, ValueType AccumTy,
  ValueType InputTy, dag RHS> {

  def "" : BF16VDOT<0b11100, 0b00, Q, (outs RegTy:$dst),
  (ins RegTy:$Vd, RegTy:$Vn,
- DPR_VFP2:$Vm, VectorIndex32:$lane)> {
+ DPR_VFP2:$Vm, VectorIndex32:$lane), []> {
  bit lane;
  let Inst{5} = lane;
  let Constraints = "$dst = $Vd";
  let AsmString = !strconcat(opc, ".bf16", "\t$Vd, $Vn, $Vm$lane");
  let DecoderNamespace = "VFPV8";
  }

+ def : Pat<
+ (AccumTy (int_arm_neon_bfdot (AccumTy RegTy:$Vd),
+ (InputTy RegTy:$Vn),
+ (InputTy (bitconvert (AccumTy
+ (ARMvduplane (AccumTy RegTy:$Vm),
+ VectorIndex32:$lane)))))),
+ (!cast<Instruction>(NAME) RegTy:$Vd, RegTy:$Vn, RHS, VectorIndex32:$lane)>;
  }

  def BF16VDOTS_VDOTD : BF16VDOTS<0, DPR, "vdot", v2f32, v8i8>;
  def BF16VDOTS_VDOTQ : BF16VDOTS<1, QPR, "vdot", v4f32, v16i8>;

  defm BF16VDOTI_VDOTD : BF16VDOTI<0, DPR, "vdot", v2f32, v8i8, (v2f32 DPR_VFP2:$Vm)>;
  defm BF16VDOTI_VDOTQ : BF16VDOTI<1, QPR, "vdot", v4f32, v16i8, (EXTRACT_SUBREG QPR:$Vm, dsub_0)>;

  class BF16MM<bit Q, RegisterClass RegTy,
  string opc>
  : N3Vnp<0b11000, 0b00, 0b1100, Q, 0,
  (outs RegTy:$dst), (ins RegTy:$Vd, RegTy:$Vn, RegTy:$Vm),
- N3RegFrm, IIC_VDOTPROD, "", "", []> {
+ N3RegFrm, IIC_VDOTPROD, "", "",
+ [(set (v4f32 QPR:$dst), (int_arm_neon_bfmmla (v4f32 QPR:$Vd),
+ (v16i8 QPR:$Vn),
+ (v16i8 QPR:$Vm)))]> {
  let Constraints = "$dst = $Vd";
  let AsmString = !strconcat(opc, ".bf16", "\t$Vd, $Vn, $Vm");
  let DecoderNamespace = "VFPV8";
  }

  def VMMLA : BF16MM<1, QPR, "vmmla">;

- class VBF16MALQ<bit T, string suffix>
+ class VBF16MALQ<bit T, string suffix, SDPatternOperator OpNode>
  : N3VCP8<0b00, 0b11, T, 1,
  (outs QPR:$dst), (ins QPR:$Vd, QPR:$Vn, QPR:$Vm),
  NoItinerary, "vfma" # suffix, "bf16", "$Vd, $Vn, $Vm", "",
- []> { // TODO: Add intrinsics
+ [(set (v4f32 QPR:$dst),
+ (OpNode (v4f32 QPR:$Vd),
+ (v16i8 QPR:$Vn),
+ (v16i8 QPR:$Vm)))]> {
  let Constraints = "$dst = $Vd";
  let DecoderNamespace = "VFPV8";
  }

- def VBF16MALTQ: VBF16MALQ<1, "t">;
- def VBF16MALBQ: VBF16MALQ<0, "b">;
+ def VBF16MALTQ: VBF16MALQ<1, "t", int_arm_neon_bfmlalt>;
+ def VBF16MALBQ: VBF16MALQ<0, "b", int_arm_neon_bfmlalb>;

- multiclass VBF16MALQI<bit T, string suffix> {
+ multiclass VBF16MALQI<bit T, string suffix, SDPatternOperator OpNode> {
  def "" : N3VLaneCP8<0, 0b11, T, 1, (outs QPR:$dst),
  (ins QPR:$Vd, QPR:$Vn, DPR_8:$Vm, VectorIndex16:$idx),
  IIC_VMACD, "vfma" # suffix, "bf16", "$Vd, $Vn, $Vm$idx", "", []> {
  bits<2> idx;
  let Inst{5} = idx{1};
  let Inst{3} = idx{0};
  let Constraints = "$dst = $Vd";
  let DecoderNamespace = "VFPV8";
  }

+ def : Pat<
+ (v4f32 (OpNode (v4f32 QPR:$Vd),
+ (v16i8 QPR:$Vn),
+ (v16i8 (bitconvert (v8bf16 (ARMvduplane (v8bf16 QPR:$Vm),
+ VectorIndex16:$lane)))))),
+ (!cast<Instruction>(NAME) QPR:$Vd,
+ QPR:$Vn,
+ (EXTRACT_SUBREG QPR:$Vm,
+ (DSubReg_i16_reg VectorIndex16:$lane)),
+ (SubReg_i16_lane VectorIndex16:$lane))>;
  }

- defm VBF16MALTQI: VBF16MALQI<1, "t">;
- defm VBF16MALBQI: VBF16MALQI<0, "b">;
+ defm VBF16MALTQI: VBF16MALQI<1, "t", int_arm_neon_bfmlalt>;
+ defm VBF16MALBQI: VBF16MALQI<0, "b", int_arm_neon_bfmlalb>;

  def BF16_VCVT : N2V<0b11, 0b11, 0b01, 0b10, 0b01100, 1, 0,
  (outs DPR:$Vd), (ins QPR:$Vm),
  NoItinerary, "vcvt", "bf16.f32", "$Vd, $Vm", "", []>;
  }
  // End of BFloat16 instructions

llvm/test/CodeGen/ARM/arm-bf16-dotprod-intrinsics.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple armv8.6a-arm-none-eabi -mattr=+neon,+bf16 -float-abi=hard -verify-machineinstrs < %s -o - | FileCheck %s

				define arm_aapcs_vfpcc <2 x float> @test_vbfdot_f32(<2 x float> %r, <4 x bfloat> %a, <4 x bfloat> %b) {
				; CHECK-LABEL: test_vbfdot_f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vdot.bf16 d0, d1, d2
				; CHECK-NEXT: bx lr
				entry:
				%0 = bitcast <4 x bfloat> %a to <8 x i8>
				%1 = bitcast <4 x bfloat> %b to <8 x i8>
				%vbfdot1.i = call <2 x float> @llvm.arm.neon.bfdot.v2f32.v8i8(<2 x float> %r, <8 x i8> %0, <8 x i8> %1)
				ret <2 x float> %vbfdot1.i
				}

				define <4 x float> @test_vbfdotq_f32(<4 x float> %r, <8 x bfloat> %a, <8 x bfloat> %b) {
				; CHECK-LABEL: test_vbfdotq_f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vdot.bf16 q0, q1, q2
				; CHECK-NEXT: bx lr
				entry:
				%0 = bitcast <8 x bfloat> %a to <16 x i8>
				%1 = bitcast <8 x bfloat> %b to <16 x i8>
				%vbfdot1.i = call <4 x float> @llvm.arm.neon.bfdot.v4f32.v16i8(<4 x float> %r, <16 x i8> %0, <16 x i8> %1)
				ret <4 x float> %vbfdot1.i
				}

				define <2 x float> @test_vbfdot_lane_f32(<2 x float> %r, <4 x bfloat> %a, <4 x bfloat> %b) {
				; CHECK-LABEL: test_vbfdot_lane_f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vdot.bf16 d0, d1, d2[0]
				; CHECK-NEXT: bx lr
				entry:
				%0 = bitcast <4 x bfloat> %b to <2 x float>
				%shuffle = shufflevector <2 x float> %0, <2 x float> undef, <2 x i32> zeroinitializer
				%1 = bitcast <4 x bfloat> %a to <8 x i8>
				%2 = bitcast <2 x float> %shuffle to <8 x i8>
				%vbfdot1.i = call <2 x float> @llvm.arm.neon.bfdot.v2f32.v8i8(<2 x float> %r, <8 x i8> %1, <8 x i8> %2)
				ret <2 x float> %vbfdot1.i
				}

				define <4 x float> @test_vbfdotq_laneq_f32(<4 x float> %r, <8 x bfloat> %a, <8 x bfloat> %b) {
				; CHECK-LABEL: test_vbfdotq_laneq_f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vdup.32 q8, d5[1]
				; CHECK-NEXT: vdot.bf16 q0, q1, q8
				; CHECK-NEXT: bx lr
				entry:
				%0 = bitcast <8 x bfloat> %b to <4 x float>
				%shuffle = shufflevector <4 x float> %0, <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>
				%1 = bitcast <8 x bfloat> %a to <16 x i8>
				%2 = bitcast <4 x float> %shuffle to <16 x i8>
				%vbfdot1.i = call <4 x float> @llvm.arm.neon.bfdot.v4f32.v16i8(<4 x float> %r, <16 x i8> %1, <16 x i8> %2)
				ret <4 x float> %vbfdot1.i
				}

				define <2 x float> @test_vbfdot_laneq_f32(<2 x float> %r, <4 x bfloat> %a, <8 x bfloat> %b) {
				; CHECK-LABEL: test_vbfdot_laneq_f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vdot.bf16 d0, d1, d3[1]
				; CHECK-NEXT: bx lr
				entry:
				%0 = bitcast <8 x bfloat> %b to <4 x float>
				%shuffle = shufflevector <4 x float> %0, <4 x float> undef, <2 x i32> <i32 3, i32 3>
				%1 = bitcast <4 x bfloat> %a to <8 x i8>
				%2 = bitcast <2 x float> %shuffle to <8 x i8>
				%vbfdot1.i = call <2 x float> @llvm.arm.neon.bfdot.v2f32.v8i8(<2 x float> %r, <8 x i8> %1, <8 x i8> %2)
				ret <2 x float> %vbfdot1.i
				}

				define <4 x float> @test_vbfdotq_lane_f32(<4 x float> %r, <8 x bfloat> %a, <4 x bfloat> %b) {
				; CHECK-LABEL: test_vbfdotq_lane_f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: @ kill: def $d4 killed $d4 def $q2
				; CHECK-NEXT: vdot.bf16 q0, q1, d4[0]
				; CHECK-NEXT: bx lr
				entry:
				%0 = bitcast <4 x bfloat> %b to <2 x float>
				%shuffle = shufflevector <2 x float> %0, <2 x float> undef, <4 x i32> zeroinitializer
				%1 = bitcast <8 x bfloat> %a to <16 x i8>
				%2 = bitcast <4 x float> %shuffle to <16 x i8>
				%vbfdot1.i = call <4 x float> @llvm.arm.neon.bfdot.v4f32.v16i8(<4 x float> %r, <16 x i8> %1, <16 x i8> %2)
				ret <4 x float> %vbfdot1.i
				}

				define <4 x float> @test_vbfmmlaq_f32(<4 x float> %r, <8 x bfloat> %a, <8 x bfloat> %b) {
				; CHECK-LABEL: test_vbfmmlaq_f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vmmla.bf16 q0, q1, q2
				; CHECK-NEXT: bx lr
				entry:
				%0 = bitcast <8 x bfloat> %a to <16 x i8>
				%1 = bitcast <8 x bfloat> %b to <16 x i8>
				%vbfmmla1.i = call <4 x float> @llvm.arm.neon.bfmmla.v4f32.v16i8(<4 x float> %r, <16 x i8> %0, <16 x i8> %1)
				ret <4 x float> %vbfmmla1.i
				}

				define <4 x float> @test_vbfmlalbq_f32(<4 x float> %r, <8 x bfloat> %a, <8 x bfloat> %b) {
				; CHECK-LABEL: test_vbfmlalbq_f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vfmab.bf16 q0, q1, q2
				; CHECK-NEXT: bx lr
				entry:
				%0 = bitcast <8 x bfloat> %a to <16 x i8>
				%1 = bitcast <8 x bfloat> %b to <16 x i8>
				%vbfmlalb1.i = call <4 x float> @llvm.arm.neon.bfmlalb.v4f32.v16i8(<4 x float> %r, <16 x i8> %0, <16 x i8> %1)
				ret <4 x float> %vbfmlalb1.i
				}

				define <4 x float> @test_vbfmlaltq_f32(<4 x float> %r, <8 x bfloat> %a, <8 x bfloat> %b) {
				; CHECK-LABEL: test_vbfmlaltq_f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vfmat.bf16 q0, q1, q2
				; CHECK-NEXT: bx lr
				entry:
				%0 = bitcast <8 x bfloat> %a to <16 x i8>
				%1 = bitcast <8 x bfloat> %b to <16 x i8>
				%vbfmlalt1.i = call <4 x float> @llvm.arm.neon.bfmlalt.v4f32.v16i8(<4 x float> %r, <16 x i8> %0, <16 x i8> %1)
				ret <4 x float> %vbfmlalt1.i
				}

				define <4 x float> @test_vbfmlalbq_lane_f32(<4 x float> %r, <8 x bfloat> %a, <4 x bfloat> %b) {
				; CHECK-LABEL: test_vbfmlalbq_lane_f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: @ kill: def $d4 killed $d4 def $q2
				; CHECK-NEXT: vfmab.bf16 q0, q1, d4[0]
				; CHECK-NEXT: bx lr
				entry:
				%vecinit35 = shufflevector <4 x bfloat> %b, <4 x bfloat> undef, <8 x i32> zeroinitializer
				%0 = bitcast <8 x bfloat> %a to <16 x i8>
				%1 = bitcast <8 x bfloat> %vecinit35 to <16 x i8>
				%vbfmlalb1.i = call <4 x float> @llvm.arm.neon.bfmlalb.v4f32.v16i8(<4 x float> %r, <16 x i8> %0, <16 x i8> %1)
				ret <4 x float> %vbfmlalb1.i
				}

				define <4 x float> @test_vbfmlalbq_laneq_f32(<4 x float> %r, <8 x bfloat> %a, <8 x bfloat> %b) {
				; CHECK-LABEL: test_vbfmlalbq_laneq_f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vfmab.bf16 q0, q1, d4[3]
				; CHECK-NEXT: bx lr
				entry:
				%vecinit35 = shufflevector <8 x bfloat> %b, <8 x bfloat> undef, <8 x i32> <i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
				%0 = bitcast <8 x bfloat> %a to <16 x i8>
				%1 = bitcast <8 x bfloat> %vecinit35 to <16 x i8>
				%vbfmlalb1.i = call <4 x float> @llvm.arm.neon.bfmlalb.v4f32.v16i8(<4 x float> %r, <16 x i8> %0, <16 x i8> %1)
				ret <4 x float> %vbfmlalb1.i
				}

				define <4 x float> @test_vbfmlaltq_lane_f32(<4 x float> %r, <8 x bfloat> %a, <4 x bfloat> %b) {
				; CHECK-LABEL: test_vbfmlaltq_lane_f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: @ kill: def $d4 killed $d4 def $q2
				; CHECK-NEXT: vfmat.bf16 q0, q1, d4[0]
				; CHECK-NEXT: bx lr
				entry:
				%vecinit35 = shufflevector <4 x bfloat> %b, <4 x bfloat> undef, <8 x i32> zeroinitializer
				%0 = bitcast <8 x bfloat> %a to <16 x i8>
				%1 = bitcast <8 x bfloat> %vecinit35 to <16 x i8>
				%vbfmlalt1.i = call <4 x float> @llvm.arm.neon.bfmlalt.v4f32.v16i8(<4 x float> %r, <16 x i8> %0, <16 x i8> %1)
				ret <4 x float> %vbfmlalt1.i
				}

				define <4 x float> @test_vbfmlaltq_laneq_f32(<4 x float> %r, <8 x bfloat> %a, <8 x bfloat> %b) {
				; CHECK-LABEL: test_vbfmlaltq_laneq_f32:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vfmat.bf16 q0, q1, d4[3]
				; CHECK-NEXT: bx lr
				entry:
				%vecinit35 = shufflevector <8 x bfloat> %b, <8 x bfloat> undef, <8 x i32> <i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3, i32 3>
				%0 = bitcast <8 x bfloat> %a to <16 x i8>
				%1 = bitcast <8 x bfloat> %vecinit35 to <16 x i8>
				%vbfmlalt1.i = call <4 x float> @llvm.arm.neon.bfmlalt.v4f32.v16i8(<4 x float> %r, <16 x i8> %0, <16 x i8> %1)
				ret <4 x float> %vbfmlalt1.i
				}

				define <4 x float> @test_vbfmlaltq_laneq_f32_v2(<4 x float> %r, <8 x bfloat> %a, <8 x bfloat> %b) {
				; CHECK-LABEL: test_vbfmlaltq_laneq_f32_v2:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: vdup.16 q8, d5[2]
				; CHECK-NEXT: vfmat.bf16 q0, q1, q8
				; CHECK-NEXT: bx lr
				entry:
				%vecinit35 = shufflevector <8 x bfloat> %b, <8 x bfloat> undef, <8 x i32> <i32 6, i32 6, i32 6, i32 6, i32 6, i32 6, i32 6, i32 6>
				%0 = bitcast <8 x bfloat> %a to <16 x i8>
				%1 = bitcast <8 x bfloat> %vecinit35 to <16 x i8>
				%vbfmlalt1.i = call <4 x float> @llvm.arm.neon.bfmlalt.v4f32.v16i8(<4 x float> %r, <16 x i8> %0, <16 x i8> %1)
				ret <4 x float> %vbfmlalt1.i
				}

				declare <2 x float> @llvm.arm.neon.bfdot.v2f32.v8i8(<2 x float>, <8 x i8>, <8 x i8>)
				declare <4 x float> @llvm.arm.neon.bfdot.v4f32.v16i8(<4 x float>, <16 x i8>, <16 x i8>)
				declare <4 x float> @llvm.arm.neon.bfmmla.v4f32.v16i8(<4 x float>, <16 x i8>, <16 x i8>)
				declare <4 x float> @llvm.arm.neon.bfmlalb.v4f32.v16i8(<4 x float>, <16 x i8>, <16 x i8>)
				declare <4 x float> @llvm.arm.neon.bfmlalt.v4f32.v16i8(<4 x float>, <16 x i8>, <16 x i8>)
