This is an archive of the discontinued LLVM Phabricator instance.

Differential D60015

[NVPTX] Added intrinsics/instructions for MMA ops on (sub-)integers
Closed · Public

Authored by tra on Mar 29 2019, 2:55 PM.

Details

Reviewers

timshen

Commits

rG16737538f4fc: PTX 6.3 extends `wmma` instruction to support s8/u8/s4/u4/b1 -> s32.
rL359247: PTX 6.3 extends `wmma` instruction to support s8/u8/s4/u4/b1 -> s32.

Summary

PTX 6.3 (CUDA-10.0) extends the `wmma` instruction to support s8/u8/s4/u4/b1 -> s32.

All of the new instructions are still handled mostly by TableGen. I've slightly
refactored the code to drive intrinsic/instruction generation from a master
list of supported variants, so all irregularities have to be implemented in one place only.

The test generation script wmma.py has been refactored in a similar way.
I've added checks that verify the sanity of the set of tests the script
generates for a particular PTX and SM combination.
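As a rough illustration of the "master list" idea, the sketch below models in Python how such a list can be enumerated from per-family geometry and type lists. It is not the patch's TableGen; the names `mma_ops`, `WMMA_GEOMS` and `ALL_MMA_OPS` are hypothetical, and the type lists are copied from the `NVVM_MMA_OPS` class in the diff. An empty B or D list is taken to mean "same type as A or C".

```python
from itertools import product


def mma_ops(geoms, types_a, types_b, types_c, types_d):
    """Enumerate (a, b, c, d) fragment tuples for one family of MMA ops.

    An empty types_b / types_d list means "use the same type as A / C".
    Each fragment is rendered as a "geom:frag:type" string.
    """
    ops = []
    for geom, type_a in product(geoms, types_a):
        for type_b in (types_b or [type_a]):
            for type_c in types_c:
                for type_d in (types_d or [type_c]):
                    types = (type_a, type_b, type_c, type_d)
                    ops.append(tuple(
                        f"{geom}:{frag}:{ty}" for frag, ty in zip("abcd", types)))
    return ops


WMMA_GEOMS = ["m16n16k16", "m32n8k16", "m8n32k16"]

# One call per family of ops; the concatenation plays the role of the master list.
ALL_MMA_OPS = (
    mma_ops(WMMA_GEOMS, ["f16"], [], ["f16", "f32"], ["f16", "f32"])
    + mma_ops(WMMA_GEOMS, ["s8", "u8"], [], ["s32"], [])
    + mma_ops(["m8n8k32"], ["s4", "u4"], [], ["s32"], [])
    + mma_ops(["m8n8k128"], ["b1"], [], ["s32"], [])
)

for op in ALL_MMA_OPS[:3]:
    print(op)
```

Keeping every family in one list means an irregular case (for example the single geometry available to b1) is expressed once and every consumer of the list picks it up.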

Diff Detail

Build Status

Buildable 30069
Build 30068: arc lint + arc unit

Event Timeline

tra created this revision. Mar 29 2019, 2:55 PM

Herald added a project: Restricted Project. Mar 29 2019, 2:55 PM

Herald added subscribers: jdoerfert, bixia, hiraditya and 3 others.

Harbormaster completed remote builds in B29850: Diff 192921. Mar 29 2019, 2:56 PM

Discussed with Art offline. The tablegen code is still not readable, but it's considerably better than before, and inventing new tools (e.g. a Cartesian product) may be hard.

llvm/include/llvm/IR/IntrinsicsNVVM.td
Line 155: Can you add a few examples of the generated regs?
Line 155: Can you document the meanings of Type{A,B,C,D}?

This revision is now accepted and ready to land. Apr 1 2019, 3:24 PM

tra marked an inline comment as done. Apr 1 2019, 4:37 PM

tra added inline comments.

llvm/include/llvm/IR/IntrinsicsNVVM.td
Line 155: A typical WMMA_REGS record looks like this:

    def anonymous_58 { // WMMA_REGS
      string geom = "m16n16k16";
      string frag = "a";
      string ptx_elt_type = "f16";
      string gft = "m16n16k16:a:f16";
      string ft = "a:f16";
      list<LLVMType> regs = [llvm_v2f16_ty, llvm_v2f16_ty, llvm_v2f16_ty, llvm_v2f16_ty,
                             llvm_v2f16_ty, llvm_v2f16_ty, llvm_v2f16_ty, llvm_v2f16_ty];
    }

It carries the information necessary to generate the relevant bits of the intrinsics and instructions, e.g. how many registers we need for the fragment, what we call them, and what the corresponding LLVM type is. Details on the supported fragment formats can be found in the latest PTX ISA docs: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-fragment

I'll add the link to the comment.

The string lists in TypeX carry the PTX types supported by MMA ops with the geometries specified by `Geom`.
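To make the "how many registers and of what LLVM type" lookup concrete, here is a minimal Python sketch of the table that `WMMA_REGS.regs` encodes. The function name `frag_regs` and the dictionary layout are hypothetical; the register counts are copied from the `!cond` table in the diff, and the real definition remains the TableGen class.

```python
# fp16 ops: all supported geometries share one fragment format, so the
# lookup key is just "frag:type" -> (register count, LLVM type).
FP_REGS = {
    "a:f16": (8, "llvm_v2f16_ty"),
    "b:f16": (8, "llvm_v2f16_ty"),
    "c:f16": (4, "llvm_v2f16_ty"),
    "d:f16": (4, "llvm_v2f16_ty"),
    "c:f32": (8, "llvm_float_ty"),
    "d:f32": (8, "llvm_float_ty"),
}

# Integer ops depend on the geometry as well:
# geom -> (allowed A/B element types, regs for a, regs for b, regs for c/d).
INT_REGS = {
    "m16n16k16": ({"u8", "s8"}, 2, 2, 8),
    "m8n32k16": ({"u8", "s8"}, 1, 4, 8),
    "m32n8k16": ({"u8", "s8"}, 4, 1, 8),
    "m8n8k32": ({"u4", "s4"}, 1, 1, 2),
    "m8n8k128": ({"b1"}, 1, 1, 2),
}


def frag_regs(geom, frag, ptx_elt_type):
    """Return the list of LLVM type names that carry one fragment."""
    ft = f"{frag}:{ptx_elt_type}"
    if ft in FP_REGS:
        count, llvm_type = FP_REGS[ft]
        return [llvm_type] * count
    if geom in INT_REGS:
        ab_types, a_regs, b_regs, cd_regs = INT_REGS[geom]
        if frag in ("a", "b") and ptx_elt_type in ab_types:
            return ["llvm_i32_ty"] * (a_regs if frag == "a" else b_regs)
        if frag in ("c", "d") and ptx_elt_type == "s32":
            return ["llvm_i32_ty"] * cd_regs
    raise ValueError(f"unsupported fragment {geom}:{frag}:{ptx_elt_type}")


print(frag_regs("m16n16k16", "a", "f16"))
print(frag_regs("m8n32k16", "a", "u8"))
```

The first call reproduces the `regs` list of the record quoted above; the second shows that an integer A fragment for m8n32k16 fits in a single 32-bit register.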

tra added a child revision: D60279: [CUDA] Implemented _[bi]mma* builtins. Apr 4 2019, 11:30 AM

tra added a parent revision: D59393: [NVPTX] generate correct MMA instruction mnemonics with PTX63+.

Enabled .satf for s4/u4.
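A small Python sketch of the filtering rule for MMA ops may help here. It mirrors my reading of the MMA branch of `NVVM_MMA_SUPPORTED` in this diff: b1 is only accepted with row/col layouts and without .satf, s4/u4 only with row/col layouts (with or without .satf), and everything else is accepted. The function name `mma_supported` and the `OPS_PER_TYPE` counts are hypothetical helpers; the counts are tallied by hand from the master list in the patch.

```python
from itertools import product


def mma_supported(type_a, layout_a, layout_b, satf):
    """Return True if this layout/satf combination should get an MMA intrinsic."""
    if type_a == "b1":
        # Single-bit ops: fixed row/col layout and no .satf variant.
        return (layout_a, layout_b, satf) == ("row", "col", 0)
    if type_a in ("s4", "u4"):
        # 4-bit ops: fixed row/col layout, .satf allowed.
        return (layout_a, layout_b) == ("row", "col")
    # f16, s8 and u8 ops accept every layout/satf combination.
    return True


# Master-list entries per A-fragment type: f16 has 3 geometries x 2 C types
# x 2 D types, s8/u8 have 3 geometries each, the rest have one geometry.
OPS_PER_TYPE = {"f16": 12, "s8": 3, "u8": 3, "s4": 1, "u4": 1, "b1": 1}

variants = [
    (type_a, layout_a, layout_b, satf)
    for type_a in OPS_PER_TYPE
    for layout_a, layout_b, satf in product(["row", "col"], ["row", "col"], [0, 1])
    if mma_supported(type_a, layout_a, layout_b, satf)
]
total = sum(OPS_PER_TYPE[type_a] for type_a, *_ in variants)
print(total)
```

Returning a yes/no answer per combination is what lets the generation loops stay a plain nested iteration over layouts, satf and ops, with unsupported variants simply skipped.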

Harbormaster completed remote builds in B30069: Diff 193755. Apr 4 2019, 11:35 AM

Closed by commit rL359247: PTX 6.3 extends `wmma` instruction to support s8/u8/s4/u4/b1 -> s32. (authored by tra). Apr 25 2019, 3:26 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path                                          Size
llvm/include/llvm/IR/IntrinsicsNVVM.td        250 lines
llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp   164 lines
llvm/lib/Target/NVPTX/NVPTXInstrInfo.td       3 lines
llvm/lib/Target/NVPTX/NVPTXIntrinsics.td      130 lines
llvm/test/CodeGen/NVPTX/wmma.py               423 lines

Diff 193755

llvm/include/llvm/IR/IntrinsicsNVVM.td

...
// Helper class that represents a 'fragment' of an NVPTX *MMA instruction.		// Helper class that represents a 'fragment' of an NVPTX *MMA instruction.
// Geom: m<M>n<N>k<K>. E.g. m8n32k16		// Geom: m<M>n<N>k<K>. E.g. m8n32k16
// Frag: [abcd]		// Frag: [abcd]
// PtxEltType: PTX type for the element.		// PtxEltType: PTX type for the element.
class WMMA_REGS<string Geom, string Frag, string PtxEltType> {		class WMMA_REGS<string Geom, string Frag, string PtxEltType> {
string geom = Geom;		string geom = Geom;
string frag = Frag;		string frag = Frag;
string ptx_elt_type = PtxEltType;		string ptx_elt_type = PtxEltType;
		string gft = Geom#":"#Frag#":"#ptx_elt_type;
string ft = frag#":"#ptx_elt_type;		string ft = frag#":"#ptx_elt_type;
list<LLVMType> regs = !cond(		list<LLVMType> regs = !cond(
// fp16 -> fp16/fp32 @ m16n16k16/m8n32k16/m32n8k16		// fp16 -> fp16/fp32 @ m16n16k16/m8n32k16/m32n8k16
// All currently supported geometries use the same fragment format,		// All currently supported geometries use the same fragment format,
// so we only need to consider {fragment, type}.		// so we only need to consider {fragment, type}.
!eq(ft,"a:f16") : RepLLVMType<8, llvm_v2f16_ty>.ret,		!eq(ft,"a:f16") : RepLLVMType<8, llvm_v2f16_ty>.ret,
!eq(ft,"b:f16") : RepLLVMType<8, llvm_v2f16_ty>.ret,		!eq(ft,"b:f16") : RepLLVMType<8, llvm_v2f16_ty>.ret,
!eq(ft,"c:f16") : RepLLVMType<4, llvm_v2f16_ty>.ret,		!eq(ft,"c:f16") : RepLLVMType<4, llvm_v2f16_ty>.ret,
!eq(ft,"d:f16") : RepLLVMType<4, llvm_v2f16_ty>.ret,		!eq(ft,"d:f16") : RepLLVMType<4, llvm_v2f16_ty>.ret,
!eq(ft,"c:f32") : RepLLVMType<8, llvm_float_ty>.ret,		!eq(ft,"c:f32") : RepLLVMType<8, llvm_float_ty>.ret,
!eq(ft,"d:f32") : RepLLVMType<8, llvm_float_ty>.ret);		!eq(ft,"d:f32") : RepLLVMType<8, llvm_float_ty>.ret,

		// u8/s8 -> s32 @ m16n16k16/m8n32k16/m32n8k16
		!eq(gft,"m16n16k16:a:u8") : RepLLVMType<2, llvm_i32_ty>.ret,
		!eq(gft,"m16n16k16:a:s8") : RepLLVMType<2, llvm_i32_ty>.ret,
		!eq(gft,"m16n16k16:b:u8") : RepLLVMType<2, llvm_i32_ty>.ret,
		!eq(gft,"m16n16k16:b:s8") : RepLLVMType<2, llvm_i32_ty>.ret,
		!eq(gft,"m16n16k16:c:s32") : RepLLVMType<8, llvm_i32_ty>.ret,
		!eq(gft,"m16n16k16:d:s32") : RepLLVMType<8, llvm_i32_ty>.ret,

		!eq(gft,"m8n32k16:a:u8") : [llvm_i32_ty],
		!eq(gft,"m8n32k16:a:s8") : [llvm_i32_ty],
		!eq(gft,"m8n32k16:b:u8") : RepLLVMType<4, llvm_i32_ty>.ret,
		!eq(gft,"m8n32k16:b:s8") : RepLLVMType<4, llvm_i32_ty>.ret,
		!eq(gft,"m8n32k16:c:s32") : RepLLVMType<8, llvm_i32_ty>.ret,
		!eq(gft,"m8n32k16:d:s32") : RepLLVMType<8, llvm_i32_ty>.ret,

		!eq(gft,"m32n8k16:a:u8") : RepLLVMType<4, llvm_i32_ty>.ret,
		!eq(gft,"m32n8k16:a:s8") : RepLLVMType<4, llvm_i32_ty>.ret,
		!eq(gft,"m32n8k16:b:u8") : [llvm_i32_ty],
		!eq(gft,"m32n8k16:b:s8") : [llvm_i32_ty],
		!eq(gft,"m32n8k16:c:s32") : RepLLVMType<8, llvm_i32_ty>.ret,
		!eq(gft,"m32n8k16:d:s32") : RepLLVMType<8, llvm_i32_ty>.ret,

		// u4/s4/b1 -> s32 @ m8n8k32 (u4/s4), m8n8k128(b1)
		!eq(gft,"m8n8k128:a:b1") : [llvm_i32_ty],
		!eq(gft,"m8n8k32:a:u4") : [llvm_i32_ty],
		!eq(gft,"m8n8k32:a:s4") : [llvm_i32_ty],
		!eq(gft,"m8n8k128:b:b1") : [llvm_i32_ty],
		!eq(gft,"m8n8k32:b:u4") : [llvm_i32_ty],
		!eq(gft,"m8n8k32:b:s4") : [llvm_i32_ty],
		!eq(gft,"m8n8k128:c:s32") : RepLLVMType<2, llvm_i32_ty>.ret,
		!eq(gft,"m8n8k128:d:s32") : RepLLVMType<2, llvm_i32_ty>.ret,
		!eq(gft,"m8n8k32:c:s32") : RepLLVMType<2, llvm_i32_ty>.ret,
		!eq(gft,"m8n8k32:d:s32") : RepLLVMType<2, llvm_i32_ty>.ret,
		);
}		}

class WMMA_NAME_LDST<string Op, WMMA_REGS Frag, string Layout, int WithStride> {		class WMMA_NAME_LDST<string Op, WMMA_REGS Frag, string Layout, int WithStride> {
string intr = "llvm.nvvm.wmma."		string intr = "llvm.nvvm.wmma."
# Frag.geom		# Frag.geom
# "." # Op		# "." # Op
# "." # Frag.frag		# "." # Frag.frag
# "." # Layout		# "." # Layout
# !if(WithStride, ".stride", "")		# !if(WithStride, ".stride", "")
# "." # Frag.ptx_elt_type		# "." # Frag.ptx_elt_type
;		;
// TODO(tra): record name should ideally use the same field order as the intrinsic.		// TODO(tra): record name should ideally use the same field order as the intrinsic.
// E.g. string record = !subst("llvm", "int",		// E.g. string record = !subst("llvm", "int",
// !subst(".", "_", llvm));		// !subst(".", "_", llvm));
string record = "int_nvvm_wmma_"		string record = "int_nvvm_wmma_"
# Frag.geom		# Frag.geom
# "_" # Op		# "_" # Op
# "_" # Frag.frag		# "_" # Frag.frag
# "_" # Frag.ptx_elt_type		# "_" # Frag.ptx_elt_type
# "_" # Layout		# "_" # Layout
# !if(WithStride, "_stride", "");		# !if(WithStride, "_stride", "");
}		}

class WMMA_NAME_MMA<string ALayout, string BLayout,		class MMA_SIGNATURE<WMMA_REGS A, WMMA_REGS B, WMMA_REGS C, WMMA_REGS D> {
WMMA_REGS C, WMMA_REGS D,		list<WMMA_REGS> id_frags = !cond(
int Satfinite> {		// int and sub-int ops are identified by input type.
		!eq(A.ptx_elt_type, "s8") : [A],
		!eq(A.ptx_elt_type, "u8") : [A],
		!eq(A.ptx_elt_type, "s4") : [A],
		!eq(A.ptx_elt_type, "u4") : [A],
		!eq(A.ptx_elt_type, "b1") : [A],
		// the rest are FP ops identified by accumulator & result type.
		1: [D, C]
		);
		string ret = !foldl("", id_frags, a, b, !strconcat(a, ".", b.ptx_elt_type));
		}

		class WMMA_NAME_MMA<string ALayout, string BLayout, int Satfinite,
		WMMA_REGS A, WMMA_REGS B, WMMA_REGS C, WMMA_REGS D> {
		string signature = MMA_SIGNATURE<A, B, C, D>.ret;
string llvm = "llvm.nvvm.wmma."		string llvm = "llvm.nvvm.wmma."
# C.geom		# A.geom
# ".mma"		# ".mma"
# "." # ALayout		# "." # ALayout
# "." # BLayout		# "." # BLayout
# "." # D.ptx_elt_type // Intrinsic encodes 'd' first.		# signature
# "." # C.ptx_elt_type
# !if(Satfinite, ".satfinite", "");		# !if(Satfinite, ".satfinite", "");

string record = !subst(".", "_",		string record = !subst(".", "_",
!subst("llvm.", "int_", llvm));		!subst("llvm.", "int_", llvm));
}		}

		// Generates list of 4-tuples of WMMA_REGS representing a valid MMA op.
		// Geom: list of supported geometries.
		// TypeN: PTX type of the corresponding fragment's element.
		// TypeB and TypeD may be empty if it must match that of TypeA or TypeC.
		class MMA_OPS<list<string> Geom, list<string> TypeA, list<string> TypeB,
		list<string> TypeC, list<string> TypeD> {
		list<list<WMMA_REGS>> ret =
		!foldl([]<list<WMMA_REGS>>, Geom, t1, geom, !listconcat(t1,
		!foldl([]<list<WMMA_REGS>>, TypeA, t2, type_a, !listconcat(t2,
		!foldl([]<list<WMMA_REGS>>, !if(!size(TypeB), TypeB, [type_a]), t3, type_b, !listconcat(t3,
		!foldl([]<list<WMMA_REGS>>, TypeC, t4, type_c, !listconcat(t4,
		!foldl([]<list<WMMA_REGS>>, !if(!size(TypeD), TypeD, [type_c]), t5, type_d, !listconcat(t5,
		[[WMMA_REGS<geom, "a", type_a>,
		WMMA_REGS<geom, "b", type_b>,
		WMMA_REGS<geom, "c", type_c>,
		WMMA_REGS<geom, "d", type_d>]]))))))))));
		// Debugging aid for readable representation of the list above.
		list<list<string>> ops = !foreach(x, ret, [x[0].gft, x[1].gft, x[2].gft, x[3].gft]);
		}

		class MMA_LDST_OPS<list<string> Geom, list<string> Frags, list<string> Types> {
		list<WMMA_REGS> ret =
		!foldl([]<WMMA_REGS>, Geom, t1, geom, !listconcat(t1,
		!foldl([]<WMMA_REGS>, Frags, t2, frag, !listconcat(t2,
		!foldl([]<WMMA_REGS>, Types, t3, type, !listconcat(t3,
		[WMMA_REGS<geom, frag, type>]))))));
		// Debugging aid for readable representation of the list above.
		list<string> ops = !foreach(x, ret, x.gft);
		}



		// Creates list of valid combinations of fragments. This is the master list that
		// drives generation of corresponding intrinsics and instructions.
		class NVVM_MMA_OPS<int _ = 0> {
		list<list<WMMA_REGS>> fp_mma_ops = MMA_OPS<
		["m16n16k16", "m32n8k16", "m8n32k16"],
		["f16"], [], ["f16", "f32"], ["f16", "f32"]>.ret;
		list<list<WMMA_REGS>> int_mma_ops = MMA_OPS<
		["m16n16k16", "m32n8k16", "m8n32k16"],
		["s8", "u8"], [], ["s32"], []>.ret;
		list<list<WMMA_REGS>> subint_mma_ops = MMA_OPS<
		["m8n8k32"],
		["s4", "u4"], [], ["s32"], []>.ret;
		list<list<WMMA_REGS>> bit_mma_ops = MMA_OPS<
		["m8n8k128"],
		["b1"], [], ["s32"], []>.ret;
		list<list<WMMA_REGS>> all_mma_ops = !listconcat(fp_mma_ops, int_mma_ops,
		subint_mma_ops, bit_mma_ops);

		list<WMMA_REGS> ldst_ab_ops = MMA_LDST_OPS<
		["m16n16k16", "m32n8k16", "m8n32k16"],
		["a", "b"], ["f16", "u8", "s8"]>.ret;
		list<WMMA_REGS> ldst_cd_ops = MMA_LDST_OPS<
		["m16n16k16", "m32n8k16", "m8n32k16"],
		["c", "d"], ["f16", "f32", "s32"]>.ret;
		list<WMMA_REGS> ldst_subint_ab_ops = MMA_LDST_OPS<
		["m8n8k32"], ["a", "b"], ["s4","u4"]>.ret;
		list<WMMA_REGS> ldst_bit_ab_ops = MMA_LDST_OPS<
		["m8n8k128"], ["a", "b"], ["b1"]>.ret;
		list<WMMA_REGS> ldst_subint_cd_ops = MMA_LDST_OPS<
		["m8n8k32", "m8n8k128"], ["c", "d"], ["s32"]>.ret;
		list<WMMA_REGS> all_ldst_ops = !listconcat(ldst_ab_ops, ldst_cd_ops,
		ldst_subint_ab_ops,
		ldst_bit_ab_ops,
		ldst_subint_cd_ops);
		// Separate A/B/C fragments (loads) from D (stores).
		list<WMMA_REGS> all_ld_ops = !foldl([]<WMMA_REGS>, all_ldst_ops, a, b,
		!listconcat(a, !if(!eq(b.frag,"d"), [],[b])));
		list<WMMA_REGS> all_st_ops = !foldl([]<WMMA_REGS>, all_ldst_ops, a, b,
		!listconcat(a, !if(!eq(b.frag,"d"), [b],[])));
		}

		def NVVM_MMA_OPS : NVVM_MMA_OPS;

		// Returns [1] if this combination of layout/satf is supported, [] otherwise.
		// MMA ops must provide all parameters. Loads and stores -- only frags and layout_a.
		// The class is used to prevent generation of records for the unsupported variants.
		// E.g.
		// foreach _ = NVVM_MMA_SUPPORTED<...>.ret in
		// def : FOO<>; // The record will only be defined for supported ops.
		//
		class NVVM_MMA_SUPPORTED<list<WMMA_REGS> frags, string layout_a, string layout_b="-", int satf=-1> {
		// MMA ops check both layouts.
		string mma = frags[0].ptx_elt_type
		# ":" # layout_a
		# ":" # layout_b;
		// Load ops only need type/fragment/layout.
		string ld = frags[0].ptx_elt_type
		# ":" # frags[0].frag
		# ":" # layout_a
		;
		string ldf = frags[0].ptx_elt_type
		# ":" # frags[0].frag
		;
		string t = frags[0].ptx_elt_type;
		list<int> ret = !cond(
		// Sub-int MMA only supports fixed A/B layout.
		// b1 does not support .satf.
		!eq(mma#":"#satf, "b1:row:col:0") : [1],
		!eq(mma, "s4:row:col") : [1],
		!eq(mma, "u4:row:col") : [1],
		// Sub-int load/stores have fixed layout for A and B.
		!and(!eq(layout_b, "-"), // It's a Load or Store op
		!or(!eq(ld, "b1:a:row"),
		!eq(ld, "b1:b:col"),
		!eq(ldf, "b1:c"),
		!eq(ldf, "b1:d"),
		!eq(ld, "s4:a:row"),
		!eq(ld, "s4:b:col"),
		!eq(ldf, "s4:c"),
		!eq(ldf, "s4:d"),
		!eq(ld, "u4:a:row"),
		!eq(ld, "u4:b:col"),
		!eq(ldf, "u4:c"),
		!eq(ldf, "u4:d"))) : [1],
		// All other sub-int ops are not supported.
		!eq(t, "b1") : [],
		!eq(t, "s4") : [],
		!eq(t, "u4") : [],
		// All other (non sub-int) are OK.
		1: [1]
		);
		}

let TargetPrefix = "nvvm" in {		let TargetPrefix = "nvvm" in {
def int_nvvm_prmt : GCCBuiltin<"__nvvm_prmt">,		def int_nvvm_prmt : GCCBuiltin<"__nvvm_prmt">,
Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],		Intrinsic<[llvm_i32_ty], [llvm_i32_ty, llvm_i32_ty, llvm_i32_ty],
[IntrNoMem, Commutative]>;		[IntrNoMem, Commutative]>;

//		//
// Min Max		// Min Max
//		//
...
: Intrinsic<[],
!listconcat(		!listconcat(
[llvm_anyptr_ty],		[llvm_anyptr_ty],
Frag.regs,		Frag.regs,
!if(WithStride, [llvm_i32_ty], [])),		!if(WithStride, [llvm_i32_ty], [])),
[IntrWriteMem, IntrArgMemOnly, WriteOnly<0>, NoCapture<0>],		[IntrWriteMem, IntrArgMemOnly, WriteOnly<0>, NoCapture<0>],
WMMA_NAME_LDST<"store", Frag, Layout, WithStride>.intr>;		WMMA_NAME_LDST<"store", Frag, Layout, WithStride>.intr>;

// Create all load/store variants		// Create all load/store variants
foreach geom = ["m16n16k16", "m32n8k16", "m8n32k16" ] in {
foreach layout = ["row", "col"] in {		foreach layout = ["row", "col"] in {
foreach stride = [0, 1] in {		foreach stride = [0, 1] in {
foreach frag = [WMMA_REGS<geom, "a", "f16">,		foreach frag = NVVM_MMA_OPS.all_ld_ops in
WMMA_REGS<geom, "b", "f16">,		foreach _ = NVVM_MMA_SUPPORTED<[frag], layout>.ret in
WMMA_REGS<geom, "c", "f16">,
WMMA_REGS<geom, "c", "f32">] in {
def WMMA_NAME_LDST<"load", frag, layout, stride>.record		def WMMA_NAME_LDST<"load", frag, layout, stride>.record
: NVVM_WMMA_LD<frag, layout, stride>;		: NVVM_WMMA_LD<frag, layout, stride>;
}		foreach frag = NVVM_MMA_OPS.all_st_ops in
foreach frag = [WMMA_REGS<geom, "d", "f16">,		foreach _ = NVVM_MMA_SUPPORTED<[frag], layout>.ret in
WMMA_REGS<geom, "d", "f32">] in {
def WMMA_NAME_LDST<"store", frag, layout, stride>.record		def WMMA_NAME_LDST<"store", frag, layout, stride>.record
: NVVM_WMMA_ST<frag, layout, stride>;		: NVVM_WMMA_ST<frag, layout, stride>;
}		}
}		}
}
}

// WMMA.MMA		// WMMA.MMA
class NVVM_WMMA_MMA<string ALayout, string BLayout,		class NVVM_WMMA_MMA<string ALayout, string BLayout, int Satfinite,
WMMA_REGS C, WMMA_REGS D, int Satfinite>		WMMA_REGS A, WMMA_REGS B,
		WMMA_REGS C, WMMA_REGS D>
: Intrinsic<D.regs,		: Intrinsic<D.regs,
!listconcat(		!listconcat(A.regs, B.regs, C.regs),
WMMA_REGS<C.geom, "a", "f16">.regs,
WMMA_REGS<C.geom, "b", "f16">.regs,
C.regs),
[IntrNoMem],		[IntrNoMem],
WMMA_NAME_MMA<ALayout, BLayout, C, D, Satfinite>.llvm>;		WMMA_NAME_MMA<ALayout, BLayout, Satfinite, A, B, C, D>.llvm>;

foreach geom = ["m16n16k16", "m32n8k16", "m8n32k16" ] in {
foreach layout_a = ["row", "col"] in {		foreach layout_a = ["row", "col"] in {
foreach layout_b = ["row", "col"] in {		foreach layout_b = ["row", "col"] in {
foreach frag_c = [WMMA_REGS<geom, "c", "f16">,
WMMA_REGS<geom, "c", "f32">] in {
foreach frag_d = [WMMA_REGS<geom, "d", "f16">,
WMMA_REGS<geom, "d", "f32">] in {
foreach satf = [0, 1] in {		foreach satf = [0, 1] in {
def WMMA_NAME_MMA<layout_a, layout_b, frag_c, frag_d, satf>.record		foreach op = NVVM_MMA_OPS.all_mma_ops in {
: NVVM_WMMA_MMA<layout_a, layout_b, frag_c, frag_d, satf>;		foreach _ = NVVM_MMA_SUPPORTED<op, layout_a, layout_b, satf>.ret in {
}		def WMMA_NAME_MMA<layout_a, layout_b, satf,
}		op[0], op[1], op[2], op[3]>.record
}		: NVVM_WMMA_MMA<layout_a, layout_b, satf,
}		op[0], op[1], op[2], op[3]>;
}		}
}		}
		} // satf
		} // layout_b
		} // layout_a

} // let TargetPrefix = "nvvm"		} // let TargetPrefix = "nvvm"
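The naming scheme used by `WMMA_NAME_LDST`, `MMA_SIGNATURE` and `WMMA_NAME_MMA` can be summarised with a short Python sketch. The function names below are hypothetical; the string layout follows my reading of the right-hand (new) side of the diff. Integer and sub-integer MMA ops are identified by the A type alone, floating-point ops by result type followed by accumulator type.

```python
INT_TYPES = ("s8", "u8", "s4", "u4", "b1")


def mma_signature(type_a, type_c, type_d):
    """Type suffix of an MMA intrinsic: '.<a>' for int ops, '.<d>.<c>' for FP ops."""
    ids = [type_a] if type_a in INT_TYPES else [type_d, type_c]
    return "".join("." + t for t in ids)


def mma_intrinsic_name(geom, layout_a, layout_b, satf, type_a, type_c, type_d):
    """Full LLVM intrinsic name for one MMA variant."""
    return (f"llvm.nvvm.wmma.{geom}.mma.{layout_a}.{layout_b}"
            + mma_signature(type_a, type_c, type_d)
            + (".satfinite" if satf else ""))


def record_name(llvm_name):
    """TableGen record name: replace the 'llvm.' prefix, then dots with underscores."""
    return llvm_name.replace("llvm.", "int_").replace(".", "_")


def ldst_names(op, geom, frag, layout, with_stride, ptx_elt_type):
    """(intrinsic name, record name) for a load or store.

    Note that the two names order layout/stride and element type differently.
    """
    intr = (f"llvm.nvvm.wmma.{geom}.{op}.{frag}.{layout}"
            + (".stride" if with_stride else "")
            + f".{ptx_elt_type}")
    record = (f"int_nvvm_wmma_{geom}_{op}_{frag}_{ptx_elt_type}_{layout}"
              + ("_stride" if with_stride else ""))
    return intr, record


name = mma_intrinsic_name("m16n16k16", "row", "col", 1, "s8", "s32", "s32")
print(name)
print(record_name(name))
print(ldst_names("load", "m16n16k16", "a", "col", True, "s8"))
```

The load/store record name is the one that shows up as `Intrinsic::nvvm_wmma_..._load_a_s8_col_stride` in the C++ lowering code, which is why its field order matters there.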

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

...
case Intrinsic::nvvm_wmma_m8n32k16_load_b_f16_row_stride: {
Info.opc = ISD::INTRINSIC_W_CHAIN;		Info.opc = ISD::INTRINSIC_W_CHAIN;
Info.memVT = MVT::v8f16;		Info.memVT = MVT::v8f16;
Info.ptrVal = I.getArgOperand(0);		Info.ptrVal = I.getArgOperand(0);
Info.offset = 0;		Info.offset = 0;
Info.flags = MachineMemOperand::MOLoad;		Info.flags = MachineMemOperand::MOLoad;
Info.align = 16;		Info.align = 16;
return true;		return true;
}		}
		case Intrinsic::nvvm_wmma_m16n16k16_load_a_s8_col:
		case Intrinsic::nvvm_wmma_m16n16k16_load_a_s8_col_stride:
		case Intrinsic::nvvm_wmma_m16n16k16_load_a_u8_col_stride:
		case Intrinsic::nvvm_wmma_m16n16k16_load_a_u8_col:
		case Intrinsic::nvvm_wmma_m16n16k16_load_a_s8_row:
		case Intrinsic::nvvm_wmma_m16n16k16_load_a_s8_row_stride:
		case Intrinsic::nvvm_wmma_m16n16k16_load_a_u8_row_stride:
		case Intrinsic::nvvm_wmma_m16n16k16_load_a_u8_row:
		case Intrinsic::nvvm_wmma_m16n16k16_load_b_s8_col:
		case Intrinsic::nvvm_wmma_m16n16k16_load_b_s8_col_stride:
		case Intrinsic::nvvm_wmma_m16n16k16_load_b_u8_col_stride:
		case Intrinsic::nvvm_wmma_m16n16k16_load_b_u8_col:
		case Intrinsic::nvvm_wmma_m16n16k16_load_b_s8_row:
		case Intrinsic::nvvm_wmma_m16n16k16_load_b_s8_row_stride:
		case Intrinsic::nvvm_wmma_m16n16k16_load_b_u8_row_stride:
		case Intrinsic::nvvm_wmma_m16n16k16_load_b_u8_row: {
		Info.opc = ISD::INTRINSIC_W_CHAIN;
		Info.memVT = MVT::v2i32;
		Info.ptrVal = I.getArgOperand(0);
		Info.offset = 0;
		Info.flags = MachineMemOperand::MOLoad;
		Info.align = 8;
		return true;
		}

		case Intrinsic::nvvm_wmma_m32n8k16_load_a_s8_col:
		case Intrinsic::nvvm_wmma_m32n8k16_load_a_s8_col_stride:
		case Intrinsic::nvvm_wmma_m32n8k16_load_a_u8_col_stride:
		case Intrinsic::nvvm_wmma_m32n8k16_load_a_u8_col:
		case Intrinsic::nvvm_wmma_m32n8k16_load_a_s8_row:
		case Intrinsic::nvvm_wmma_m32n8k16_load_a_s8_row_stride:
		case Intrinsic::nvvm_wmma_m32n8k16_load_a_u8_row_stride:
		case Intrinsic::nvvm_wmma_m32n8k16_load_a_u8_row:

		case Intrinsic::nvvm_wmma_m8n32k16_load_b_s8_col:
		case Intrinsic::nvvm_wmma_m8n32k16_load_b_s8_col_stride:
		case Intrinsic::nvvm_wmma_m8n32k16_load_b_u8_col_stride:
		case Intrinsic::nvvm_wmma_m8n32k16_load_b_u8_col:
		case Intrinsic::nvvm_wmma_m8n32k16_load_b_s8_row:
		case Intrinsic::nvvm_wmma_m8n32k16_load_b_s8_row_stride:
		case Intrinsic::nvvm_wmma_m8n32k16_load_b_u8_row_stride:
		case Intrinsic::nvvm_wmma_m8n32k16_load_b_u8_row: {
		Info.opc = ISD::INTRINSIC_W_CHAIN;
		Info.memVT = MVT::v4i32;
		Info.ptrVal = I.getArgOperand(0);
		Info.offset = 0;
		Info.flags = MachineMemOperand::MOLoad;
		Info.align = 16;
		return true;
		}

		case Intrinsic::nvvm_wmma_m32n8k16_load_b_s8_col:
		case Intrinsic::nvvm_wmma_m32n8k16_load_b_s8_col_stride:
		case Intrinsic::nvvm_wmma_m32n8k16_load_b_u8_col_stride:
		case Intrinsic::nvvm_wmma_m32n8k16_load_b_u8_col:
		case Intrinsic::nvvm_wmma_m32n8k16_load_b_s8_row:
		case Intrinsic::nvvm_wmma_m32n8k16_load_b_s8_row_stride:
		case Intrinsic::nvvm_wmma_m32n8k16_load_b_u8_row_stride:
		case Intrinsic::nvvm_wmma_m32n8k16_load_b_u8_row:

		case Intrinsic::nvvm_wmma_m8n32k16_load_a_s8_col:
		case Intrinsic::nvvm_wmma_m8n32k16_load_a_s8_col_stride:
		case Intrinsic::nvvm_wmma_m8n32k16_load_a_u8_col_stride:
		case Intrinsic::nvvm_wmma_m8n32k16_load_a_u8_col:
		case Intrinsic::nvvm_wmma_m8n32k16_load_a_s8_row:
		case Intrinsic::nvvm_wmma_m8n32k16_load_a_s8_row_stride:
		case Intrinsic::nvvm_wmma_m8n32k16_load_a_u8_row_stride:
		case Intrinsic::nvvm_wmma_m8n32k16_load_a_u8_row:
		case Intrinsic::nvvm_wmma_m8n8k128_load_a_b1_row:
		case Intrinsic::nvvm_wmma_m8n8k128_load_a_b1_row_stride:
		case Intrinsic::nvvm_wmma_m8n8k128_load_b_b1_col:
		case Intrinsic::nvvm_wmma_m8n8k128_load_b_b1_col_stride:
		case Intrinsic::nvvm_wmma_m8n8k32_load_a_s4_row:
		case Intrinsic::nvvm_wmma_m8n8k32_load_a_s4_row_stride:
		case Intrinsic::nvvm_wmma_m8n8k32_load_a_u4_row_stride:
		case Intrinsic::nvvm_wmma_m8n8k32_load_a_u4_row:
		case Intrinsic::nvvm_wmma_m8n8k32_load_b_s4_col:
		case Intrinsic::nvvm_wmma_m8n8k32_load_b_s4_col_stride:
		case Intrinsic::nvvm_wmma_m8n8k32_load_b_u4_col_stride:
		case Intrinsic::nvvm_wmma_m8n8k32_load_b_u4_col: {
		Info.opc = ISD::INTRINSIC_W_CHAIN;
		Info.memVT = MVT::i32;
		Info.ptrVal = I.getArgOperand(0);
		Info.offset = 0;
		Info.flags = MachineMemOperand::MOLoad;
		Info.align = 4;
		return true;
		}

case Intrinsic::nvvm_wmma_m16n16k16_load_c_f16_col:		case Intrinsic::nvvm_wmma_m16n16k16_load_c_f16_col:
case Intrinsic::nvvm_wmma_m16n16k16_load_c_f16_row:		case Intrinsic::nvvm_wmma_m16n16k16_load_c_f16_row:
case Intrinsic::nvvm_wmma_m16n16k16_load_c_f16_col_stride:		case Intrinsic::nvvm_wmma_m16n16k16_load_c_f16_col_stride:
case Intrinsic::nvvm_wmma_m16n16k16_load_c_f16_row_stride:		case Intrinsic::nvvm_wmma_m16n16k16_load_c_f16_row_stride:
case Intrinsic::nvvm_wmma_m32n8k16_load_c_f16_col:		case Intrinsic::nvvm_wmma_m32n8k16_load_c_f16_col:
case Intrinsic::nvvm_wmma_m32n8k16_load_c_f16_row:		case Intrinsic::nvvm_wmma_m32n8k16_load_c_f16_row:
case Intrinsic::nvvm_wmma_m32n8k16_load_c_f16_col_stride:		case Intrinsic::nvvm_wmma_m32n8k16_load_c_f16_col_stride:
...
case Intrinsic::nvvm_wmma_m8n32k16_load_c_f32_row_stride: {
Info.memVT = MVT::v8f32;		Info.memVT = MVT::v8f32;
Info.ptrVal = I.getArgOperand(0);		Info.ptrVal = I.getArgOperand(0);
Info.offset = 0;		Info.offset = 0;
Info.flags = MachineMemOperand::MOLoad;		Info.flags = MachineMemOperand::MOLoad;
Info.align = 16;		Info.align = 16;
return true;		return true;
}		}

		case Intrinsic::nvvm_wmma_m16n16k16_load_c_s32_col:
		case Intrinsic::nvvm_wmma_m16n16k16_load_c_s32_col_stride:
		case Intrinsic::nvvm_wmma_m16n16k16_load_c_s32_row:
		case Intrinsic::nvvm_wmma_m16n16k16_load_c_s32_row_stride:
		case Intrinsic::nvvm_wmma_m32n8k16_load_c_s32_col:
		case Intrinsic::nvvm_wmma_m32n8k16_load_c_s32_col_stride:
		case Intrinsic::nvvm_wmma_m32n8k16_load_c_s32_row:
		case Intrinsic::nvvm_wmma_m32n8k16_load_c_s32_row_stride:
		case Intrinsic::nvvm_wmma_m8n32k16_load_c_s32_col:
		case Intrinsic::nvvm_wmma_m8n32k16_load_c_s32_col_stride:
		case Intrinsic::nvvm_wmma_m8n32k16_load_c_s32_row:
		case Intrinsic::nvvm_wmma_m8n32k16_load_c_s32_row_stride: {
		Info.opc = ISD::INTRINSIC_W_CHAIN;
		Info.memVT = MVT::v8i32;
		Info.ptrVal = I.getArgOperand(0);
		Info.offset = 0;
		Info.flags = MachineMemOperand::MOLoad;
		Info.align = 16;
		return true;
		}

		case Intrinsic::nvvm_wmma_m8n8k128_load_c_s32_col:
		case Intrinsic::nvvm_wmma_m8n8k128_load_c_s32_col_stride:
		case Intrinsic::nvvm_wmma_m8n8k128_load_c_s32_row:
		case Intrinsic::nvvm_wmma_m8n8k128_load_c_s32_row_stride:
		case Intrinsic::nvvm_wmma_m8n8k32_load_c_s32_col:
		case Intrinsic::nvvm_wmma_m8n8k32_load_c_s32_col_stride:
		case Intrinsic::nvvm_wmma_m8n8k32_load_c_s32_row:
		case Intrinsic::nvvm_wmma_m8n8k32_load_c_s32_row_stride: {
		Info.opc = ISD::INTRINSIC_W_CHAIN;
		Info.memVT = MVT::v2i32;
		Info.ptrVal = I.getArgOperand(0);
		Info.offset = 0;
		Info.flags = MachineMemOperand::MOLoad;
		Info.align = 8;
		return true;
		}

case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_col:		case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_col:
case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_row:		case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_row:
case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_col_stride:		case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_col_stride:
case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_row_stride:		case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_row_stride:
case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_col:		case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_col:
case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_row:		case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_row:
case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_col_stride:		case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_col_stride:
case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_row_stride:		case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_row_stride:
...
case Intrinsic::nvvm_wmma_m8n32k16_store_d_f32_row_stride: {
Info.memVT = MVT::v8f32;		Info.memVT = MVT::v8f32;
Info.ptrVal = I.getArgOperand(0);		Info.ptrVal = I.getArgOperand(0);
Info.offset = 0;		Info.offset = 0;
Info.flags = MachineMemOperand::MOStore;		Info.flags = MachineMemOperand::MOStore;
Info.align = 16;		Info.align = 16;
return true;		return true;
}		}

		case Intrinsic::nvvm_wmma_m16n16k16_store_d_s32_col:
		case Intrinsic::nvvm_wmma_m16n16k16_store_d_s32_col_stride:
		case Intrinsic::nvvm_wmma_m16n16k16_store_d_s32_row:
		case Intrinsic::nvvm_wmma_m16n16k16_store_d_s32_row_stride:
		case Intrinsic::nvvm_wmma_m32n8k16_store_d_s32_col:
		case Intrinsic::nvvm_wmma_m32n8k16_store_d_s32_col_stride:
		case Intrinsic::nvvm_wmma_m32n8k16_store_d_s32_row:
		case Intrinsic::nvvm_wmma_m32n8k16_store_d_s32_row_stride:
		case Intrinsic::nvvm_wmma_m8n32k16_store_d_s32_col:
		case Intrinsic::nvvm_wmma_m8n32k16_store_d_s32_col_stride:
		case Intrinsic::nvvm_wmma_m8n32k16_store_d_s32_row:
		case Intrinsic::nvvm_wmma_m8n32k16_store_d_s32_row_stride: {
		Info.opc = ISD::INTRINSIC_VOID;
		Info.memVT = MVT::v8i32;
		Info.ptrVal = I.getArgOperand(0);
		Info.offset = 0;
		Info.flags = MachineMemOperand::MOStore;
		Info.align = 16;
		return true;
		}

		case Intrinsic::nvvm_wmma_m8n8k128_store_d_s32_col:
		case Intrinsic::nvvm_wmma_m8n8k128_store_d_s32_col_stride:
		case Intrinsic::nvvm_wmma_m8n8k128_store_d_s32_row:
		case Intrinsic::nvvm_wmma_m8n8k128_store_d_s32_row_stride:
		case Intrinsic::nvvm_wmma_m8n8k32_store_d_s32_col:
		case Intrinsic::nvvm_wmma_m8n8k32_store_d_s32_col_stride:
		case Intrinsic::nvvm_wmma_m8n8k32_store_d_s32_row:
		case Intrinsic::nvvm_wmma_m8n8k32_store_d_s32_row_stride: {
		Info.opc = ISD::INTRINSIC_VOID;
		Info.memVT = MVT::v2i32;
		Info.ptrVal = I.getArgOperand(0);
		Info.offset = 0;
		Info.flags = MachineMemOperand::MOStore;
		Info.align = 8;
		return true;
		}

case Intrinsic::nvvm_atomic_load_add_f32:		case Intrinsic::nvvm_atomic_load_add_f32:
case Intrinsic::nvvm_atomic_load_add_f64:		case Intrinsic::nvvm_atomic_load_add_f64:
case Intrinsic::nvvm_atomic_load_inc_32:		case Intrinsic::nvvm_atomic_load_inc_32:
case Intrinsic::nvvm_atomic_load_dec_32:		case Intrinsic::nvvm_atomic_load_dec_32:

case Intrinsic::nvvm_atomic_add_gen_f_cta:		case Intrinsic::nvvm_atomic_add_gen_f_cta:
case Intrinsic::nvvm_atomic_add_gen_f_sys:		case Intrinsic::nvvm_atomic_add_gen_f_sys:
case Intrinsic::nvvm_atomic_add_gen_i_cta:		case Intrinsic::nvvm_atomic_add_gen_i_cta:
...
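The integer load/store cases added in this file all follow one pattern: the memory type is a vector of as many i32 elements as the fragment has registers, and the alignment is the fragment size in bytes capped at 16. The Python sketch below states that pattern; it is my own summary of the four case groups (1, 2, 4 and 8 registers), not something the patch computes, and `int_frag_mem_info` is a hypothetical name.

```python
def int_frag_mem_info(num_regs):
    """Return (memVT, alignment in bytes) for an integer fragment of num_regs i32 registers."""
    mem_vt = "MVT::i32" if num_regs == 1 else f"MVT::v{num_regs}i32"
    # Each register is 4 bytes; the cases in the patch never use more than 16.
    align = min(4 * num_regs, 16)
    return mem_vt, align


for regs in (1, 2, 4, 8):
    print(regs, int_frag_mem_info(regs))
```

Read together with the register counts in `WMMA_REGS`, this explains why, for example, the m16n16k16 s8/u8 A and B loads use `MVT::v2i32` with alignment 8 while the s32 accumulator loads use `MVT::v8i32` with alignment 16.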

llvm/lib/Target/NVPTX/NVPTXInstrInfo.td

...
	def hasHWROT32 : Predicate<"Subtarget->hasHWROT32()">;			def hasHWROT32 : Predicate<"Subtarget->hasHWROT32()">;
	def noHWROT32 : Predicate<"!Subtarget->hasHWROT32()">;			def noHWROT32 : Predicate<"!Subtarget->hasHWROT32()">;

	def true : Predicate<"true">;			def true : Predicate<"true">;

	def hasPTX31 : Predicate<"Subtarget->getPTXVersion() >= 31">;			def hasPTX31 : Predicate<"Subtarget->getPTXVersion() >= 31">;
	def hasPTX60 : Predicate<"Subtarget->getPTXVersion() >= 60">;			def hasPTX60 : Predicate<"Subtarget->getPTXVersion() >= 60">;
	def hasPTX61 : Predicate<"Subtarget->getPTXVersion() >= 61">;			def hasPTX61 : Predicate<"Subtarget->getPTXVersion() >= 61">;
				def hasPTX63 : Predicate<"Subtarget->getPTXVersion() >= 63">;

	def hasSM30 : Predicate<"Subtarget->getSmVersion() >= 30">;			def hasSM30 : Predicate<"Subtarget->getSmVersion() >= 30">;
	def hasSM70 : Predicate<"Subtarget->getSmVersion() >= 70">;			def hasSM70 : Predicate<"Subtarget->getSmVersion() >= 70">;
				def hasSM72 : Predicate<"Subtarget->getSmVersion() >= 72">;
				def hasSM75 : Predicate<"Subtarget->getSmVersion() >= 75">;

	def useShortPtr : Predicate<"useShortPointers()">;			def useShortPtr : Predicate<"useShortPointers()">;
	def useFP16Math: Predicate<"Subtarget->allowFP16Math()">;			def useFP16Math: Predicate<"Subtarget->allowFP16Math()">;

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	// Some Common Instruction Class Templates			// Some Common Instruction Class Templates
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

...
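The new predicates are plain "version at least N" checks on the subtarget. The Python sketch below shows how such checks could gate the new op families. The pairing of element types with SM versions (sm_72 for s8/u8, sm_75 for s4/u4/b1) is an assumption based on my reading of the PTX ISA documentation; the part of the patch that attaches predicates to instructions is not shown in this excerpt. `NEW_OP_REQUIREMENTS` and `is_available` are hypothetical names.

```python
# A-fragment element type -> (minimum PTX version, minimum SM version).
# ASSUMPTION: thresholds taken from the PTX ISA docs, not from this excerpt.
NEW_OP_REQUIREMENTS = {
    "s8": (63, 72),
    "u8": (63, 72),
    "s4": (63, 75),
    "u4": (63, 75),
    "b1": (63, 75),
}


def is_available(type_a, ptx_version, sm_version):
    """Mimic hasPTX63 combined with hasSM72 / hasSM75 style predicates."""
    min_ptx, min_sm = NEW_OP_REQUIREMENTS[type_a]
    return ptx_version >= min_ptx and sm_version >= min_sm


print(is_available("s8", 63, 72))
print(is_available("s4", 63, 72))
```

Because the predicates are monotonic (`>=`), a newer PTX version or GPU automatically keeps every older variant enabled.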

llvm/lib/Target/NVPTX/NVPTXIntrinsics.td

def INT_PTX_SREG_WARPSIZE :		def INT_PTX_SREG_WARPSIZE :
NVPTXInst<(outs Int32Regs:$dst), (ins), "mov.u32 \t$dst, WARP_SZ;",		NVPTXInst<(outs Int32Regs:$dst), (ins), "mov.u32 \t$dst, WARP_SZ;",
[(set Int32Regs:$dst, (int_nvvm_read_ptx_sreg_warpsize))]>;		[(set Int32Regs:$dst, (int_nvvm_read_ptx_sreg_warpsize))]>;

// Helper class that represents a 'fragment' of an NVPTX *MMA instruction.		// Helper class that represents a 'fragment' of an NVPTX *MMA instruction.
// In addition to target-independent fields provided by WMMA_REGS, it adds		// In addition to target-independent fields provided by WMMA_REGS, it adds
// the fields commonly used to implement specific PTX instruction -- register		// the fields commonly used to implement specific PTX instruction -- register
// types and names, constraints, parts of assembly, etc.		// types and names, constraints, parts of assembly, etc.
class WMMA_REGINFO<string Geom, string Frag, string PtxEltType>		class WMMA_REGINFO<WMMA_REGS r>
: WMMA_REGS<Geom, Frag, PtxEltType> {		: WMMA_REGS<r.geom, r.frag, r.ptx_elt_type> {
// NVPTX register types used to carry fragment data.		// NVPTX register types used to carry fragment data.
NVPTXRegClass regclass = !cond(		NVPTXRegClass regclass = !cond(
!eq(PtxEltType, "f16") : Float16x2Regs,		!eq(ptx_elt_type, "f16") : Float16x2Regs,
!eq(PtxEltType, "f32") : Float32Regs);		!eq(ptx_elt_type, "f32") : Float32Regs,
		!eq(ptx_elt_type, "s32") : Int32Regs,
		!eq(ptx_elt_type, "s8") : Int32Regs,
		!eq(ptx_elt_type, "u8") : Int32Regs,
		!eq(ptx_elt_type, "s4") : Int32Regs,
		!eq(ptx_elt_type, "u4") : Int32Regs,
		!eq(ptx_elt_type, "b1") : Int32Regs);

// Instruction input/output arguments for the fragment.		// Instruction input/output arguments for the fragment.
list<NVPTXRegClass> ptx_regs = !foreach(tmp, regs, regclass);		list<NVPTXRegClass> ptx_regs = !foreach(tmp, regs, regclass);

// List of register names for the fragment -- ["ra0", "ra1",...]		// List of register names for the fragment -- ["ra0", "ra1",...]
list<string> reg_names = RegSeq<!size(ptx_regs), "r"#frag>.ret;		list<string> reg_names = RegSeq<!size(ptx_regs), "r"#frag>.ret;

// Generates "{{$r0, $r1,.... $rN-1}}" for use in asm string construction.		// Generates "{{$r0, $r1,.... $rN-1}}" for use in asm string construction.
string regstring = "{{$" # !head(reg_names)		string regstring = "{{$" # !head(reg_names)
# !foldl("", !tail(reg_names), a, b,		# !foldl("", !tail(reg_names), a, b,
!strconcat(a, ", $", b))		!strconcat(a, ", $", b))
# "}}";		# "}}";

// Predicates for particular fragment variant. Technically those are		// Predicates for particular fragment variant. Technically those are
// per-instruction predicates, but currently all fragments that can be used in		// per-instruction predicates, but currently all fragments that can be used in
// a given instruction are subject to the same constraints, so an instruction		// a given instruction are subject to the same constraints, so an instruction
// can use predicates from any of its fragments. If/when this is no		// can use predicates from any of its fragments. If/when this is no
// longer the case, we can concat all per-fragment predicates to enforce that		// longer the case, we can concat all per-fragment predicates to enforce that
// all fragments of the instruction are viable.		// all fragments of the instruction are viable.
list<Predicate> Predicates = !cond(		list<Predicate> Predicates = !cond(
// fp16 -> fp16/fp32 @ m16n16k16		// fp16 -> fp16/fp32 @ m16n16k16
!and(!eq(Geom, "m16n16k16"),		!and(!eq(geom, "m16n16k16"),
!or(!eq(PtxEltType, "f16"),		!or(!eq(ptx_elt_type, "f16"),
!eq(PtxEltType, "f32"))) : [hasSM70, hasPTX60],		!eq(ptx_elt_type, "f32"))) : [hasSM70, hasPTX60],

// fp16 -> fp16/fp32 @ m8n32k16/m32n8k16		// fp16 -> fp16/fp32 @ m8n32k16/m32n8k16
!and(!or(!eq(Geom, "m8n32k16"),		!and(!or(!eq(geom, "m8n32k16"),
!eq(Geom, "m32n8k16")),		!eq(geom, "m32n8k16")),
!or(!eq(PtxEltType, "f16"),		!or(!eq(ptx_elt_type, "f16"),
!eq(PtxEltType, "f32"))) : [hasSM70, hasPTX61]);		!eq(ptx_elt_type, "f32"))) : [hasSM70, hasPTX61],

		// u8/s8 -> s32 @ m16n16k16/m8n32k16/m32n8k16
		!and(!or(!eq(geom,"m16n16k16"),
		!eq(geom,"m8n32k16"),
		!eq(geom,"m32n8k16")),
		!or(!eq(ptx_elt_type, "u8"),
		!eq(ptx_elt_type, "s8"),
		!eq(ptx_elt_type, "s32"))) : [hasSM72, hasPTX63],

		// u4/s4/b1 -> s32 @ m8n8k32 (u4/s4), m8n8k128(b1)
		!or(!eq(geom,"m8n8k128"),
		!eq(geom,"m8n8k32")) : [hasSM75, hasPTX63]);

// template DAGs for instruction inputs/output.		// template DAGs for instruction inputs/output.
dag Outs = !dag(outs, ptx_regs, reg_names);		dag Outs = !dag(outs, ptx_regs, reg_names);
dag Ins = !dag(ins, ptx_regs, reg_names);		dag Ins = !dag(ins, ptx_regs, reg_names);
}		}
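
Note on the `Predicates` table above: the `!cond` clauses are tried in order, so the last clause applies to any fragment with an m8n8k32 or m8n8k128 geometry regardless of element type. A minimal Python sketch of the same decision table, for illustration only. The function name `wmma_requirements` and the `(min_sm, min_ptx)` tuple are hypothetical and not part of the patch, and raising on an unmatched combination is an assumption about how an unmatched `!cond` would surface.

```python
FP_TYPES = ("f16", "f32")
INT_TYPES = ("u8", "s8", "s32")
WIDE_GEOMS = ("m16n16k16", "m8n32k16", "m32n8k16")
SUBINT_GEOMS = ("m8n8k128", "m8n8k32")


def wmma_requirements(geom, ptx_elt_type):
    """Return a hypothetical (min_sm, min_ptx) pair for one fragment.

    The clauses are checked in the same order as the !cond in WMMA_REGINFO.
    """
    # fp16 -> fp16/fp32 @ m16n16k16
    if geom == "m16n16k16" and ptx_elt_type in FP_TYPES:
        return (70, 60)
    # fp16 -> fp16/fp32 @ m8n32k16/m32n8k16
    if geom in ("m8n32k16", "m32n8k16") and ptx_elt_type in FP_TYPES:
        return (70, 61)
    # u8/s8 -> s32 @ m16n16k16/m8n32k16/m32n8k16
    if geom in WIDE_GEOMS and ptx_elt_type in INT_TYPES:
        return (72, 63)
    # u4/s4/b1 -> s32 @ m8n8k32 (u4/s4), m8n8k128 (b1); type is not checked here
    if geom in SUBINT_GEOMS:
        return (75, 63)
    # Assumption: no clause matched, so treat it as an error.
    raise ValueError("no predicate for %s/%s" % (geom, ptx_elt_type))


if __name__ == "__main__":
    for geom, ty in [("m16n16k16", "f16"), ("m32n8k16", "s8"), ("m8n8k128", "b1")]:
        print(geom, ty, wmma_requirements(geom, ty))
```

The s32 accumulator fragments of the sub-integer geometries land in the last clause, which is why that clause does not need to list element types.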

// Convert dag of arguments into a dag to match given intrinsic.		// Convert dag of arguments into a dag to match given intrinsic.
class BuildPatternI<Intrinsic Intr, dag Ins> {		class BuildPatternI<Intrinsic Intr, dag Ins> {
▲ Show 20 Lines • Show All 101 Lines • ▼ Show 20 Lines	let AsmString = "wmma.store.d.sync"
# " \t[$dst],"		# " \t[$dst],"
# Frag.regstring		# Frag.regstring
# !if(WithStride, ", $ldm", "")		# !if(WithStride, ", $ldm", "")
# ";";		# ";";
}		}

// Create all load/store variants		// Create all load/store variants
defset list<WMMA_INSTR> MMA_LDSTs = {		defset list<WMMA_INSTR> MMA_LDSTs = {
foreach geom = ["m16n16k16", "m32n8k16", "m8n32k16" ] in {
foreach layout = ["row", "col"] in {		foreach layout = ["row", "col"] in {
foreach stride = [0, 1] in {		foreach stride = [0, 1] in {
foreach space = [".global", ".shared", ""] in {		foreach space = [".global", ".shared", ""] in {
foreach addr = [imem, Int32Regs, Int64Regs, MEMri, MEMri64] in {		foreach addr = [imem, Int32Regs, Int64Regs, MEMri, MEMri64] in {
foreach frag = [WMMA_REGINFO<geom, "a", "f16">,		foreach frag = NVVM_MMA_OPS.all_ld_ops in
WMMA_REGINFO<geom, "b", "f16">,		foreach _ = NVVM_MMA_SUPPORTED<[frag], layout>.ret in
WMMA_REGINFO<geom, "c", "f16">,		def : WMMA_LOAD<WMMA_REGINFO<frag>, layout, space, stride, addr>;
WMMA_REGINFO<geom, "c", "f32">] in {		foreach frag = NVVM_MMA_OPS.all_st_ops in
def : WMMA_LOAD<frag, layout, space, stride, addr>;		foreach _ = NVVM_MMA_SUPPORTED<[frag], layout>.ret in
}		def : WMMA_STORE_D<WMMA_REGINFO<frag>, layout, space, stride, addr>;
foreach frag = [WMMA_REGINFO<geom, "d", "f16">,
WMMA_REGINFO<geom, "d", "f32">] in {
def : WMMA_STORE_D<frag, layout, space, stride, addr>;
}
} // addr		} // addr
} // space		} // space
} // stride		} // stride
} // layout		} // layout
} // geom
} // defset		} // defset

// WMMA.MMA		// WMMA.MMA
class WMMA_MMA<WMMA_REGINFO FragA, WMMA_REGINFO FragB,		class WMMA_MMA<WMMA_REGINFO FragA, WMMA_REGINFO FragB,
WMMA_REGINFO FragC, WMMA_REGINFO FragD,		WMMA_REGINFO FragC, WMMA_REGINFO FragD,
string ALayout, string BLayout, int Satfinite>		string ALayout, string BLayout, int Satfinite>
: WMMA_INSTR<WMMA_NAME_MMA<ALayout, BLayout, FragC, FragD, Satfinite>.record,		: WMMA_INSTR<WMMA_NAME_MMA<ALayout, BLayout, Satfinite, FragA, FragB, FragC, FragD>.record,
[FragA.Ins, FragB.Ins, FragC.Ins]>,		[FragA.Ins, FragB.Ins, FragC.Ins]>,
Requires<FragC.Predicates> {		// Requires does not seem to have effect on Instruction w/o Patterns.
		// We set it here anyways and propagate to the Pat<> we construct below.
		Requires<FragA.Predicates> {
let OutOperandList = FragD.Outs;		let OutOperandList = FragD.Outs;
let InOperandList = !con(Args, (ins MmaCode:$ptx));		let InOperandList = !con(Args, (ins MmaCode:$ptx));
let AsmString = "wmma.mma.sync"		string TypeList = !cond(
		!eq(FragD.ptx_elt_type, "s32") : ".s32"
		# "." # FragA.ptx_elt_type
		# "." # FragB.ptx_elt_type
		# ".s32",
		1: "." # FragD.ptx_elt_type # "." # FragC.ptx_elt_type,
		);
		let AsmString = "wmma.mma"
		# !if(!eq(FragA.ptx_elt_type, "b1"), ".xor.popc", "")
		# ".sync"
# "${ptx:aligned}"		# "${ptx:aligned}"
# "." # ALayout		# "." # ALayout
# "." # BLayout		# "." # BLayout
# "." # FragA.geom		# "." # FragA.geom
# "." # FragD.ptx_elt_type		# TypeList
# "." # FragC.ptx_elt_type
# !if(Satfinite, ".satfinite", "") # "\n\t\t"		# !if(Satfinite, ".satfinite", "") # "\n\t\t"
# FragD.regstring # ",\n\t\t"		# FragD.regstring # ",\n\t\t"
# FragA.regstring # ",\n\t\t"		# FragA.regstring # ",\n\t\t"
# FragB.regstring # ",\n\t\t"		# FragB.regstring # ",\n\t\t"
# FragC.regstring # ";";		# FragC.regstring # ";";
}		}
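
To make the new `AsmString` easier to follow, here is a rough Python sketch of how the mnemonic is assembled, without the register lists. The helper name `wmma_mma_mnemonic` is hypothetical. It also assumes that the `${ptx:aligned}` operand modifier prints `.aligned` for PTX 6.3 and later, which is what the checks in wmma.py expect.

```python
def wmma_mma_mnemonic(geom, alayout, blayout, type_a, type_b, type_c, type_d,
                      satfinite=False, ptx_version=63):
    """Build the wmma.mma mnemonic the way WMMA_MMA concatenates it (sketch)."""
    if type_d == "s32":
        # Integer and sub-integer ops spell out all four types as D.A.B.C.
        type_list = ".s32.%s.%s.s32" % (type_a, type_b)
    else:
        # FP ops only encode the result and accumulator types, D.C.
        type_list = ".%s.%s" % (type_d, type_c)
    return ("wmma.mma"
            + (".xor.popc" if type_a == "b1" else "")
            + ".sync"
            # Assumption: stands in for the ${ptx:aligned} operand modifier.
            + (".aligned" if ptx_version >= 63 else "")
            + "." + alayout
            + "." + blayout
            + "." + geom
            + type_list
            + (".satfinite" if satfinite else ""))


if __name__ == "__main__":
    print(wmma_mma_mnemonic("m8n8k128", "row", "col", "b1", "b1", "s32", "s32"))
    print(wmma_mma_mnemonic("m16n16k16", "row", "row", "f16", "f16", "f32", "f16",
                            ptx_version=60))
```

Only the b1 variant gets the `.xor.popc` suffix, and it is placed before `.sync`, which is why the patch splits the former `"wmma.mma.sync"` literal into pieces.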

defset list<WMMA_INSTR> MMAs = {		defset list<WMMA_INSTR> MMAs = {
foreach geom = ["m16n16k16", "m32n8k16", "m8n32k16" ] in {
foreach layout_a = ["row", "col"] in {		foreach layout_a = ["row", "col"] in {
foreach layout_b = ["row", "col"] in {		foreach layout_b = ["row", "col"] in {
foreach frag_c = [WMMA_REGINFO<geom, "c", "f16">,
WMMA_REGINFO<geom, "c", "f32">] in {
foreach frag_d = [WMMA_REGINFO<geom, "d", "f16">,
WMMA_REGINFO<geom, "d", "f32">] in {
foreach satf = [0, 1] in {		foreach satf = [0, 1] in {
def : WMMA_MMA<WMMA_REGINFO<geom, "a", "f16">,		foreach op = NVVM_MMA_OPS.all_mma_ops in {
WMMA_REGINFO<geom, "b", "f16">,		foreach _ = NVVM_MMA_SUPPORTED<op, layout_a, layout_b, satf>.ret in {
frag_c, frag_d, layout_a, layout_b, satf>;		def : WMMA_MMA<WMMA_REGINFO<op[0]>,
		WMMA_REGINFO<op[1]>,
		WMMA_REGINFO<op[2]>,
		WMMA_REGINFO<op[3]>,
		layout_a, layout_b, satf>;
		}
		} // op
} // satf		} // satf
} // frag_d
} // frag_c
} // layout_b		} // layout_b
} // layout_a		} // layout_a
} // geom
} // defset		} // defset


// Constructing non-flat DAGs is still a pain. I can't !subst a dag node with a		// Constructing non-flat DAGs is still a pain. I can't !subst a dag node with a
// dag, so the ptx.version must be appended after foreach replaces 'ins' with		// dag, so the ptx.version must be appended after foreach replaces 'ins' with
// the instruction record.		// the instruction record.
class WMMA_PAT<WMMA_INSTR wi>		class WMMA_PAT<WMMA_INSTR wi>
: Pat<wi.IntrinsicPattern,		: Pat<wi.IntrinsicPattern,
!con(!foreach(tmp, wi.Args, !subst(ins, wi, tmp)),		!con(!foreach(tmp, wi.Args, !subst(ins, wi, tmp)),
(wi ptx.version))>;		(wi ptx.version))>,
		Requires<wi.Predicates>;

// Build intrinsic->instruction patterns for all MMA instructions.		// Build intrinsic->instruction patterns for all MMA instructions.
foreach mma = !listconcat(MMAs, MMA_LDSTs) in		foreach mma = !listconcat(MMAs, MMA_LDSTs) in
def : WMMA_PAT<mma>;		def : WMMA_PAT<mma>;

llvm/test/CodeGen/NVPTX/wmma.py

# This test generates all variants of wmma intrinsics and verifies that LLVM		# This test generates all variants of wmma intrinsics and verifies that LLVM
# generates correct instructions for them.		# generates correct instructions for them.

# RUN: python %s > %t.ll		# Check all variants of instructions supported by PTX60 on SM70
# RUN: llc < %t.ll -march=nvptx64 -mcpu=sm_70 -mattr=+ptx61 \| FileCheck %t.ll		# RUN: python %s --ptx=60 --gpu-arch=70 > %t-ptx60-sm_70.ll
# RUN: python %s --ptx=63 > %t-ptx63.ll		# RUN: FileCheck %t-ptx60-sm_70.ll < %t-ptx60-sm_70.ll \
# RUN: llc < %t-ptx63.ll -march=nvptx64 -mcpu=sm_70 -mattr=+ptx63 \| FileCheck %t-ptx63.ll		# RUN: --check-prefixes=INTRINSICS,PTX60,SM70
		# RUN: FileCheck %t-ptx60-sm_70.ll < %t-ptx60-sm_70.ll \
		# RUN: --check-prefixes=INTRINSICS,PTX60U,SM70U
		# RUN: llc < %t-ptx60-sm_70.ll -march=nvptx64 -mcpu=sm_70 -mattr=+ptx60 \
		# RUN: \| FileCheck %t-ptx60-sm_70.ll

		# Check all variants of instructions supported by PTX61 on SM70
		# RUN: python %s --ptx=61 --gpu-arch=70 > %t-ptx61-sm_70.ll
		# RUN: FileCheck %t-ptx61-sm_70.ll < %t-ptx61-sm_70.ll \
		# RUN: --check-prefixes=INTRINSICS,PTX60,PTX61,SM70
		# RUN: FileCheck %t-ptx61-sm_70.ll < %t-ptx61-sm_70.ll \
		# RUN: --check-prefixes=INTRINSICS,PTX61U,SM70U
		# RUN: llc < %t-ptx61-sm_70.ll -march=nvptx64 -mcpu=sm_70 -mattr=+ptx61 \
		# RUN: \| FileCheck %t-ptx61-sm_70.ll

		# Check all variants of instructions supported by PTX63 on SM72
		# RUN: python %s --ptx=63 --gpu-arch=72 > %t-ptx63-sm_72.ll
		# RUN: FileCheck %t-ptx63-sm_72.ll < %t-ptx63-sm_72.ll \
		# RUN: --check-prefixes=INTRINSICS,PTX60,PTX61,PTX63,SM70,SM72
		# RUN: FileCheck %t-ptx63-sm_72.ll < %t-ptx63-sm_72.ll \
		# RUN: --check-prefixes=INTRINSICS,PTX63U,SM72U
		# RUN: llc < %t-ptx63-sm_72.ll -march=nvptx64 -mcpu=sm_72 -mattr=+ptx63 \
		# RUN: \| FileCheck %t-ptx63-sm_72.ll

		# Check all variants of instructions supported by PTX63 on SM75
		# RUN: python %s --ptx=63 --gpu-arch=75 > %t-ptx63-sm_75.ll
		# RUN: FileCheck %t-ptx63-sm_75.ll < %t-ptx63-sm_75.ll \
		# RUN: --check-prefixes=INTRINSICS,PTX60,PTX61,PTX63,SM70,SM72,SM75
		# RUN: FileCheck %t-ptx63-sm_75.ll < %t-ptx63-sm_75.ll \
		# RUN: --check-prefixes=INTRINSICS,PTX63U,SM75U
		# RUN: llc < %t-ptx63-sm_75.ll -march=nvptx64 -mcpu=sm_75 -mattr=+ptx63 \
		# RUN: \| FileCheck %t-ptx63-sm_75.ll


from __future__ import print_function		from __future__ import print_function

import argparse		import argparse
from itertools import product		from itertools import product
from string import Template		from string import Template

def make_wmma_slice_ty(abcd, itype):		class MMAType:
elt_ty = "<2 x half>" if itype == "f16" else "float"		def __init__(self, ptx_type):
num_elts = 4 if abcd in "cd" and itype == "f16" else 8;		self.ptx_type = ptx_type
return [elt_ty] * num_elts		self.llvm_type = {
		"f16" : "<2 x half>",
def make_wmma_ld_ret_ty(abc, itype):		"f32" : "float",
return "{%s}" % ", ".join(make_wmma_slice_ty(abc, itype))		"s32" : "i32",
		"s8" : "i32",
		"u8" : "i32",
		"s4" : "i32",
		"u4" : "i32",
		"b1" : "i32",
		}[ptx_type];

		self.ptx_reg_pattern = {
		"f16" : "%hh[0-9]+",
		"f32" : "%f[0-9]+",
		}.get(ptx_type, "%r[0-9]+")

		def __repr__(self):
		return "%s/%s" % (self.ptx_type, self.llvm_type)

		class MMAFrag:
		def __init__(self, geom, frag, ptx_elt_type):
		self.geom = geom
		self.frag = frag
		self.mma_type = MMAType(ptx_elt_type);
		self.nregs = {
		"a:f16" : 8,
		"b:f16" : 8,
		"c:f16" : 4,
		"d:f16" : 4,
		"c:f32" : 8,
		"d:f32" : 8,
		}.get("%s:%s" % (frag, ptx_elt_type), {
		# u8/s8 -> s32 @ m16n16k16/m8n32k16/m32n8k16
		"m16n16k16:a:u8" : 2,
		"m16n16k16:a:s8" : 2,
		"m16n16k16:b:u8" : 2,
		"m16n16k16:b:s8" : 2,
		"m16n16k16:c:s32" : 8,
		"m16n16k16:d:s32" : 8,

		"m8n32k16:a:u8" : 1,
		"m8n32k16:a:s8" : 1,
		"m8n32k16:b:u8" : 4,
		"m8n32k16:b:s8" : 4,
		"m8n32k16:c:s32" : 8,
		"m8n32k16:d:s32" : 8,

		"m32n8k16:a:u8" : 4,
		"m32n8k16:a:s8" : 4,
		"m32n8k16:b:u8" : 1,
		"m32n8k16:b:s8" : 1,
		"m32n8k16:c:s32" : 8,
		"m32n8k16:d:s32" : 8,

		# u4/s4/b1 -> s32 @ m8n8k32 (u4/s4), m8n8k128(b1)
		"m8n8k128:a:b1" : 1,
		"m8n8k32:a:u4" : 1,
		"m8n8k32:a:s4" : 1,
		"m8n8k128:b:b1" : 1,
		"m8n8k32:b:u4" : 1,
		"m8n8k32:b:s4" : 1,
		"m8n8k128:c:s32" : 2,
		"m8n8k128:d:s32" : 2,
		"m8n8k32:c:s32" : 2,
		"m8n8k32:d:s32" : 2,
		}.get("%s:%s:%s" % (geom, frag, ptx_elt_type), None));
		assert(self.nregs);

		def __repr__(self):
		return "%s:%s:%s%s" % (self.geom, self.frag, self.mma_type,
		"" if self.nregs == 1 else ("*%d" % self.nregs))

		class MMAOp:
		def __init__(self, a, b, c, d):
		self.a = a
		self.b = b
		self.c = c
		self.d = d

		def __repr__(self):
		return ("{A:%s, B:%s, C:%s, D:%s}" % (self.a, self.b, self.c, self.d ))

		def make_mma_ops(geoms, types_a, types_b, types_c, types_d):
		ops = []
		for geom, type_a, type_c in product( geoms, types_a, types_c):
		for type_b, type_d in product(types_b if types_b else [type_a],
		types_d if types_d else [type_c]):
		ops.append(MMAOp(MMAFrag(geom, "a", type_a),
		MMAFrag(geom, "b", type_b),
		MMAFrag(geom, "c", type_c),
		MMAFrag(geom, "d", type_d)))
		return ops

		def make_ldst_ops(geoms, frags, types):
		return [MMAFrag(geom, frag, ptx_type) for (geom, frag, ptx_type)
		in product(geoms, frags, types)]

		def get_mma_ops():
		return (make_mma_ops(["m16n16k16", "m32n8k16", "m8n32k16"],
		["f16"], [], ["f16", "f32"], ["f16", "f32"]) +
		make_mma_ops(["m16n16k16", "m32n8k16", "m8n32k16"],
		["s8", "u8"], [], ["s32"], []) +
		make_mma_ops(["m8n8k32"],
		["s4", "u4"], [], ["s32"], []) +
		make_mma_ops(["m8n8k128"],
		["b1"], [], ["s32"], []))
		def get_ldst_ops(kind):
		ldst_ops = (make_ldst_ops(["m16n16k16", "m32n8k16", "m8n32k16"],
		["a", "b"], ["f16", "u8", "s8"]) +
		make_ldst_ops(["m16n16k16", "m32n8k16", "m8n32k16"],
		["c", "d"], ["f16", "f32", "s32"]) +
		make_ldst_ops(["m8n8k32"], ["a", "b"], ["s4","u4"]) +
		make_ldst_ops(["m8n8k128"], ["a", "b"], ["b1"]) +
		make_ldst_ops(["m8n8k32", "m8n8k128"], ["c", "d"], ["s32"]))
		return [ x for x in ldst_ops if (x.frag == "d") == (kind == "store")]

		def is_geom_supported(geom):
		# geometries for FP and ints.
		if geom in ["m8n32k16", "m32n8k16"]:
		return ptx_version >= 61
		# geometries for sub-ints.
		if geom in ["m8n8k32", "m8n8k128"]:
		return ptx_version >= 63 and gpu_arch >= 75
		if geom == "m16n16k16":
		return ptx_version >= 60
		assert(False) # Unexpected geometry.

		def is_type_supported(ptx_type):
		if ptx_type in ["s8", "u8", "s32"]:
		return ptx_version >= 63 and gpu_arch >= 72
		if ptx_type in ["s4", "u4", "b1"]:
		return ptx_version >= 63 and gpu_arch >= 75
		return ptx_version >= 60 and gpu_arch >= 70


		def is_mma_variant_supported(op, layout_a, layout_b, satf):
		if not (is_type_supported(op.a.mma_type.ptx_type)
		and is_geom_supported(op.a.geom)):
		return False
		# sub-integer require row/col layout, and no satf.
		if op.a.mma_type.ptx_type in ["s4", "u4", "b1"]:
		if op.a.mma_type.ptx_type == "b1" and satf:
		return False
		return layout_a == "row" and layout_b == "col"
		return True

		def is_ldst_variant_supported(frag, layout):
		if not (is_type_supported(frag.mma_type.ptx_type)
		and is_geom_supported(frag.geom)):
		return False
		if frag.mma_type.ptx_type in ["s4", "u4", "b1"]:
		# sub-integer require sm_75 and ptx63, row/col layout for a/b.
		return ((frag.frag == "a" and layout == "row")
		or (frag.frag == "b" and layout == "col")
		or frag.frag in ["c", "d"])
		return True

		def make_wmma_slice_ty(frag):
		return [frag.mma_type.llvm_type] * frag.nregs

		def make_wmma_ld_ret_ty(frag):
		results = make_wmma_slice_ty(frag)
		if len(results) == 1:
		return "%s" % results[0]
		return "{%s}" % ", ".join(results)

# returns address space		# returns address space
def get_aspace(space):		def get_aspace(space):
space_map = {		space_map = {
".global" : 1,		".global" : 1,
".shared" : 3,		".shared" : 3,
".const" : 4,		".const" : 4,
".local" : 5,		".local" : 5,
".param" : 101,		".param" : 101,
"" : 0,		"" : 0,
".generic": 0		".generic": 0
}		}
return space_map[space];		return space_map[space];

def get_pspace(space):		def get_pspace(space):
return "p%di8" % get_aspace(space);		return "p%di8" % get_aspace(space);

# Convenient test patterns.		def check_pattern(frag):
check_f16_8 = "{{%s}}" % ", ".join(["%hh[0-9]+"] * 8)		return "{{%s}}" % ", ".join([frag.mma_type.ptx_reg_pattern] * frag.nregs)
check_f16_4 = "{{%s}}" % ", ".join(["%hh[0-9]+"] * 4)
check_f32_8 = "{{%s}}" % ", ".join(["%f[0-9]+"] * 8)

known_geoms = ["m16n16k16", "m8n32k16", "m32n8k16"]		known_geoms = ["m16n16k16", "m8n32k16", "m32n8k16"]

def gen_wmma_load_tests():		def gen_wmma_load_tests():
load_template = """		load_template = """
declare ${ret_ty} @${intrinsic}(i8 ${as}* %src ${extra_args});		declare ${ret_ty} @${intrinsic}(i8 ${as}* %src ${extra_args});

; CHECK-LABEL: .func {{.*}}test_${function}(		; CHECK-LABEL: .func {{.*}}test_${function}(
Show All 13 Lines	; CHECK: [%rd{{[0-9]+}}+128]${stride_pattern}
%src1 = getelementptr i8, i8 ${as}* %src, i32 128;		%src1 = getelementptr i8, i8 ${as}* %src, i32 128;
%v0 = call ${ret_ty} @${intrinsic}(i8 ${as}* %src1 ${extra_args});		%v0 = call ${ret_ty} @${intrinsic}(i8 ${as}* %src1 ${extra_args});
ret ${ret_ty} %v0;		ret ${ret_ty} %v0;
}		}
"""		"""
intrinsic_template = "llvm.nvvm.wmma.${geom}.load.${abc}.${layout}${stride}.${itype}.${pspace}"		intrinsic_template = "llvm.nvvm.wmma.${geom}.load.${abc}.${layout}${stride}.${itype}.${pspace}"
instruction_template = "wmma.load.${abc}.sync${aligned}.${layout}.${geom}${space}.${itype}"		instruction_template = "wmma.load.${abc}.sync${aligned}.${layout}.${geom}${space}.${itype}"

for geom, abc, layout, space, stride, itype in product(		generated_items = []
known_geoms,
"abc",		for frag, layout, space, stride in product(
		get_ldst_ops("load"),
["row","col"],		["row","col"],
["",".shared",".global"],		["",".shared",".global"],
["", ".stride"],		["", ".stride"],
["f16", "f32"]):		):
		if not is_ldst_variant_supported(frag, layout):
		continue

params = {		params = {
"abc" : abc,		"abc" : frag.frag,
"aligned" : ".aligned" if ptx_version >= 63 else "",		"aligned" : ".aligned" if ptx_version >= 63 else "",
"layout" : layout,		"layout" : layout,
"space" : space,		"space" : space,
"stride" : stride,		"stride" : stride,
"itype" : itype,		"itype" : frag.mma_type.ptx_type,
"pspace" : get_pspace(space),		"pspace" : get_pspace(space),
"as" : "addrspace(%d)" % get_aspace(space),		"as" : "addrspace(%d)" % get_aspace(space),
"geom" : geom,		"geom" : frag.geom,
}		}

if itype == "f32" and abc != "c":
continue

test_params = params		test_params = params
test_params["intrinsic"] = Template(intrinsic_template).substitute(params)		test_params["intrinsic"] = Template(intrinsic_template).substitute(params)
test_params["function"] = test_params["intrinsic"].replace(".","_")		test_params["function"] = test_params["intrinsic"].replace(".","_")
test_params["instruction"] = Template(instruction_template).substitute(params)		test_params["instruction"] = Template(instruction_template).substitute(params)
test_params["ret_ty"] = make_wmma_ld_ret_ty(abc, itype)		test_params["ret_ty"] = make_wmma_ld_ret_ty(frag)
if abc == "c" :		test_params["check_result"] = check_pattern(frag)
test_params["check_result"] = check_f16_4 if itype == "f16" else check_f32_8
else:
test_params["check_result"] = check_f16_8

if stride:		if stride:
test_params["extra_args"] = ", i32 %stride";		test_params["extra_args"] = ", i32 %stride";
test_params["stride_pattern"] = ", %r{{[0-9]+}}"		test_params["stride_pattern"] = ", %r{{[0-9]+}}"
else:		else:
test_params["extra_args"] = ""		test_params["extra_args"] = ""
test_params["stride_pattern"] = ""		test_params["stride_pattern"] = ""

print(Template(load_template).substitute(test_params))		print(Template(load_template).substitute(test_params))

def make_wmma_slice_args(itype, abcd, prefix="v"):		generated_items.append((test_params["intrinsic"],
return ", ".join(["%s %%%s%d" % (t, prefix, i) for i,t		test_params["instruction"]))
in enumerate(make_wmma_slice_ty(abcd, itype))])
		return generated_items

		def make_wmma_slice_args(frag):
		return ", ".join(["%s %%%s%d" % (t, frag.frag, i) for i,t
		in enumerate(make_wmma_slice_ty(frag))])

def gen_wmma_store_tests():		def gen_wmma_store_tests():
store_template = """		store_template = """
declare void @${intrinsic}(i8 ${as}* %src, ${args}${extra_args});		declare void @${intrinsic}(i8 ${as}* %src, ${args}${extra_args});

; CHECK-LABEL: .func {{.*}}test_${function}(		; CHECK-LABEL: .func {{.*}}test_${function}(
define void @test_${function}(i8 ${as}* %src, ${args}${extra_args}) {		define void @test_${function}(i8 ${as}* %src, ${args}${extra_args}) {
; CHECK: ${instruction} {{.*}}[%rd{{[0-9+]}}		; CHECK: ${instruction} {{.*}}[%rd{{[0-9+]}}
Show All 11 Lines	; CHECK: ${stride_pattern}
%src1 = getelementptr i8, i8 ${as}* %src, i32 128;		%src1 = getelementptr i8, i8 ${as}* %src, i32 128;
call void @${intrinsic}(i8 ${as}* %src1, ${args}${extra_args});		call void @${intrinsic}(i8 ${as}* %src1, ${args}${extra_args});
ret void		ret void
}		}
"""		"""
intrinsic_template = "llvm.nvvm.wmma.${geom}.store.${abc}.${layout}${stride}.${itype}.${pspace}"		intrinsic_template = "llvm.nvvm.wmma.${geom}.store.${abc}.${layout}${stride}.${itype}.${pspace}"
instruction_template = "wmma.store.${abc}.sync${aligned}.${layout}.${geom}${space}.${itype}"		instruction_template = "wmma.store.${abc}.sync${aligned}.${layout}.${geom}${space}.${itype}"

for geom, abc, layout, space, stride, itype in product(		generated_items = []
known_geoms,
"d",		for frag, layout, space, stride in product(
		get_ldst_ops("store"),
["row","col"],		["row","col"],
["",".shared",".global"],		["",".shared",".global"],
["", ".stride"],		["", ".stride"]):
["f16", "f32"]):
		if not is_ldst_variant_supported(frag, layout):
		continue

params = {		params = {
"abc" : abc,		"abc" : frag.frag,
"aligned" : ".aligned" if ptx_version >= 63 else "",		"aligned" : ".aligned" if ptx_version >= 63 else "",
"layout" : layout,		"layout" : layout,
"space" : space,		"space" : space,
"stride" : stride,		"stride" : stride,
"itype" : itype,		"itype" : frag.mma_type.ptx_type,
"pspace" : get_pspace(space),		"pspace" : get_pspace(space),
"as" : "addrspace(%d)" % get_aspace(space),		"as" : "addrspace(%d)" % get_aspace(space),
"geom" : geom,		"geom" : frag.geom,
}		}

test_params = params		test_params = params
test_params["intrinsic"] = Template(intrinsic_template).substitute(params)		test_params["intrinsic"] = Template(intrinsic_template).substitute(params)
test_params["function"] = test_params["intrinsic"].replace(".","_")		test_params["function"] = test_params["intrinsic"].replace(".","_")
test_params["instruction"] = Template(instruction_template).substitute(params)		test_params["instruction"] = Template(instruction_template).substitute(params)
test_params["ret_ty"] = make_wmma_ld_ret_ty(abc, itype)		test_params["ret_ty"] = make_wmma_ld_ret_ty(frag)
test_params["check_args"] = check_f16_4 if itype == "f16" else check_f32_8		test_params["check_args"] = check_pattern(frag)
if stride:		if stride:
test_params["extra_args"] = ", i32 %stride";		test_params["extra_args"] = ", i32 %stride";
test_params["stride_pattern"] = ", %r{{[0-9]+}};"		test_params["stride_pattern"] = ", %r{{[0-9]+}};"
else:		else:
test_params["extra_args"] = ""		test_params["extra_args"] = ""
test_params["stride_pattern"] = ";"		test_params["stride_pattern"] = ";"
test_params["args"] = make_wmma_slice_args(itype, "d");		test_params["args"] = make_wmma_slice_args(frag);

print(Template(store_template).substitute(test_params))		print(Template(store_template).substitute(test_params))
		generated_items.append((test_params["intrinsic"],
		test_params["instruction"]))

		return generated_items

		def mma_signature(op):
		if op.a.mma_type.ptx_type in ["s8", "u8", "s4", "u4", "b1"]:
		# int and sub-int ops are identified by input type.
		return op.a.mma_type.ptx_type
		else:
		# the rest are FP ops identified by accumulator & result type.
		return "%s.%s" % (op.d.mma_type.ptx_type, op.c.mma_type.ptx_type)

		def mma_ptx_signature(op):
		if op.a.mma_type.ptx_type in ["s8", "u8", "s4", "u4", "b1"]:
		# int and sub-int instructions encode all four types as D.A.B.C
		return ".".join(x.mma_type.ptx_type for x in (op.d, op.a, op.b, op.c))
		else:
		# the rest are FP instructions, which use D.C
		return "%s.%s" % (op.d.mma_type.ptx_type, op.c.mma_type.ptx_type)

def gen_wmma_mma_tests():		def gen_wmma_mma_tests():
mma_template = """		mma_template = """
declare ${ret_ty} @${intrinsic}(		declare ${ret_ty} @${intrinsic}(
${args});		${args});

; CHECK-LABEL: .func {{.*}}test_${function}(		; CHECK-LABEL: .func {{.*}}test_${function}(
define ${ret_ty} @test_${function}(		define ${ret_ty} @test_${function}(
${args}) {		${args}) {
; CHECK: ${instruction}		; CHECK: ${instruction}
; CHECK-NEXT: ${check_d}		; CHECK-NEXT: ${check_d}
; CHECK-NEXT: ${check_ab}		; CHECK-NEXT: ${check_a}
; CHECK-NEXT: ${check_ab}		; CHECK-NEXT: ${check_b}
; CHECK-NEXT: ${check_c}		; CHECK-NEXT: ${check_c}
%r = call ${ret_ty} @${intrinsic}(		%r = call ${ret_ty} @${intrinsic}(
${args});		${args});
ret ${ret_ty} %r;		ret ${ret_ty} %r;
}		}
"""		"""
intrinsic_template = "llvm.nvvm.wmma.${geom}.mma.${alayout}.${blayout}.${dtype}.${ctype}${satf}"		intrinsic_template = "llvm.nvvm.wmma.${geom}.mma.${alayout}.${blayout}.${intrinsic_signature}${satf}"
instruction_template = "wmma.mma.sync${aligned}.${alayout}.${blayout}.${geom}.${dtype}.${ctype}${satf}"		instruction_template = "wmma.mma${mma_variant}.sync${aligned}.${alayout}.${blayout}.${geom}.${ptx_signature}${satf}"

		generated_items=[]

for geom, alayout, blayout, ctype, dtype, satf in product(		for op, alayout, blayout, satf in product(
known_geoms,		get_mma_ops(),
["row","col"],		["row","col"],
["row","col"],		["row","col"],
["f16", "f32"],
["f16", "f32"],
[".satfinite", ""]):		[".satfinite", ""]):

		if not is_mma_variant_supported(op, alayout, blayout, satf):
		continue

params = {		params = {
"aligned" : ".aligned" if ptx_version >= 63 else "",		"aligned" : ".aligned" if ptx_version >= 63 else "",
"alayout" : alayout,		"alayout" : alayout,
"blayout" : blayout,		"blayout" : blayout,
"ctype" : ctype,		"intrinsic_signature" : mma_signature(op),
"dtype" : dtype,		"ptx_signature" : mma_ptx_signature(op),
"satf" : satf,		"satf" : satf,
"geom" : geom,		"geom" : op.a.geom,
		"mma_variant" : ".xor.popc" if op.a.mma_type.ptx_type == "b1" else "",
}		}

test_params = params		test_params = params
test_params["intrinsic"] = Template(intrinsic_template).substitute(params)		test_params["intrinsic"] = Template(intrinsic_template).substitute(params)
test_params["function"] = test_params["intrinsic"].replace(".", "_")		test_params["function"] = test_params["intrinsic"].replace(".", "_")
test_params["instruction"] = Template(instruction_template).substitute(params)		test_params["instruction"] = Template(instruction_template).substitute(params)
test_params["ret_ty"] = make_wmma_ld_ret_ty("d", dtype)		test_params["ret_ty"] = make_wmma_ld_ret_ty(op.d)
test_params["check_ab"] = check_f16_8		test_params["check_a"] = check_pattern(op.a)
test_params["check_c"] = check_f16_4 if ctype == "f16" else check_f32_8		test_params["check_b"] = check_pattern(op.b)
test_params["check_d"] = check_f16_4 if dtype == "f16" else check_f32_8		test_params["check_c"] = check_pattern(op.c)
args = ",\n ".join(make_wmma_slice_args(t, abcd, prefix=abcd)		test_params["check_d"] = check_pattern(op.d)
for abcd, t in (("a", "f16"),		args = ",\n ".join(make_wmma_slice_args(frag)
("b", "f16"),		for frag in (op.a, op.b, op.c))
("c", ctype)))
test_params["args"] = args		test_params["args"] = args
print(Template(mma_template).substitute(test_params))		print(Template(mma_template).substitute(test_params))
		generated_items.append((test_params["intrinsic"],
		test_params["instruction"]))

		return generated_items

def main():		# Append complete list of intrinsics and instructions we've generated tests for.
gen_wmma_load_tests()		# Generate set of checks to verify that we did generate sensible set of
gen_wmma_store_tests()		# tests for the given combination of PTX and SM variants.
gen_wmma_mma_tests()		#
		# PTX<N>: verifies that we did generate tests for correct classes of intrinsics.
		# PTX<N>U: verifies that we did not generate intrinsics unsupported by
		# the PTX version.
		# SM<N>: verifies that we did generate correct classes of instructions for the SM.
		# SM<N>U: verifies that we did not generate instructions unsupported by the SM
		#
		# Note that SM/PTX constraints overlap, but DAG checks do not allow overlapping
		# matches. We implicitly rely that we generate multiple variants of most of the
		# instructions and usually have enough input data to find more than one match of
		# the same kind, if necessary. When it's not possible (e.g. there's only one
		# m8n8k128.mma.row.col.b1), we may need to match PTX instruction instead.
		def gen_check_unsupported_ops(items):
		print("; Complete list of intrinsics supported by PTX%d on sm_%d"
		% (ptx_version, gpu_arch))
		print("; INTRINSICS: {{^; INTRINSICS_LIST_BEGIN}}")
		print("""
		; PTX60-DAG: m16n16k16.load.{{[ab].*}}.f16.p
		; PTX60-DAG: m16n16k16.{{load\|store}}.{{[cd].*\.(f16\|f32)}}.p
		; PTX60U-NOT: m32n8k16
		; PTX60U-NOT: m8n32k16
		; PTX60U-NOT: .{{s32\|s[48]\|u[48]\|b1}}

		; All features of PTX60, plus m32n8k16/m8n32k16 geometries.
		; PTX61-DAG: m32n8k16.load.{{[ab].*}}.f16.p
		; PTX61-DAG: m32n8k16.{{load\|store}}.{{[cd].*\.(f16\|f32)}}.p
		; PTX61-DAG: m8n32k16.load.{{[ab].*}}.f16.p
		; PTX61-DAG: m8n32k16.{{load\|store}}.{{[cd].*\.(f16\|f32)}}.p
		; PTX61U-NOT: .{{s32\|s[48]\|u[48]\|b1}}

		; SM70U-NOT: .{{s32\|s[48]\|u[48]\|b1}}

		; PTX63 supports all features of PTX60+PTX61, plus support for integers.
		; Alas we can't just use PTX<N> checks for that as available instructions
		; depend on SM: integers need sm72+ and subinteger ops need sm75, so we
		; transition to SM<N> checks.
		; SM72-DAG: m16n16k16.load.{{[ab].*}}.s8.p
		; SM72-DAG: m8n32k16.load.{{[ab].*}}.s8.p
		; SM72-DAG: m32n8k16.load.{{[ab].*}}.s8.p
		; SM72-DAG: m16n16k16.load.{{[ab].*}}.u8.p
		; SM72-DAG: m8n32k16.load.{{[ab].*}}.u8.p
		; SM72-DAG: m32n8k16.load.{{[ab].*}}.u8.p
		; SM72-DAG: m32n8k16.{{load\|store}}.{{[cd].*\.s32}}.p
		; SM72U-NOT: .{{s4\|u4\|b1}}

		; SM75-DAG: m8n8k128.load.{{[ab].*}}.b1.p
		; SM75-DAG: m8n8k32.load.{{[ab].*}}.s4.p
		; SM75-DAG: m8n8k32.load.{{[ab].*}}.u4.p
		; SM75-DAG: m8n8k128.{{load\|store}}.{{[cd].*\.s32}}.p
		; SM75-DAG: m8n8k32.{{load\|store}}.{{[cd].*\.s32}}.p
		""")

		print("; INTRINSICS_LIST_BEGIN")
		for intrinsic, instruction in sorted(items):
		print("; ", intrinsic, " -> ", instruction,"")
		print("; INTRINSICS_LIST_END")
		print("; INTRINSICS: ; INTRINSICS_LIST_END")

		def gen_tests():
		items = gen_wmma_load_tests()
		items += gen_wmma_store_tests()
		items += gen_wmma_mma_tests()
		gen_check_unsupported_ops(items)

parser = argparse.ArgumentParser()		parser = argparse.ArgumentParser()
parser.add_argument('--ptx', type=int, default=60)		parser.add_argument("--ptx", type=int, default=60)
		parser.add_argument("--gpu-arch", type=int, default=70)
args = parser.parse_args()		args = parser.parse_args()
ptx_version = args.ptx		ptx_version = args.ptx
		gpu_arch = args.gpu_arch

main()		gen_tests()
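
As a sanity check on the refactored generator, the sketch below re-creates the op list and the support filter with plain tuples and counts how many `wmma.mma` variants survive for a given PTX/SM pair. It is a simplified stand-in, not the script itself: ops are `(geom, a, b, c, d)` tuples instead of `MMAOp` objects, the PTX version and GPU arch are passed as arguments rather than read from globals, and `count_mma_variants` is a hypothetical helper.

```python
from itertools import product

WIDE_GEOMS = ["m16n16k16", "m32n8k16", "m8n32k16"]


def make_mma_ops(geoms, types_a, types_b, types_c, types_d):
    # An empty types_b / types_d list means "same as A" / "same as C".
    ops = []
    for geom, type_a, type_c in product(geoms, types_a, types_c):
        for type_b, type_d in product(types_b or [type_a], types_d or [type_c]):
            ops.append((geom, type_a, type_b, type_c, type_d))
    return ops


def get_mma_ops():
    return (make_mma_ops(WIDE_GEOMS, ["f16"], [], ["f16", "f32"], ["f16", "f32"]) +
            make_mma_ops(WIDE_GEOMS, ["s8", "u8"], [], ["s32"], []) +
            make_mma_ops(["m8n8k32"], ["s4", "u4"], [], ["s32"], []) +
            make_mma_ops(["m8n8k128"], ["b1"], [], ["s32"], []))


def is_geom_supported(geom, ptx_version, gpu_arch):
    if geom in ("m8n32k16", "m32n8k16"):
        return ptx_version >= 61
    if geom in ("m8n8k32", "m8n8k128"):
        return ptx_version >= 63 and gpu_arch >= 75
    if geom == "m16n16k16":
        return ptx_version >= 60
    raise ValueError("unexpected geometry: %s" % geom)


def is_type_supported(ptx_type, ptx_version, gpu_arch):
    if ptx_type in ("s8", "u8", "s32"):
        return ptx_version >= 63 and gpu_arch >= 72
    if ptx_type in ("s4", "u4", "b1"):
        return ptx_version >= 63 and gpu_arch >= 75
    return ptx_version >= 60 and gpu_arch >= 70


def count_mma_variants(ptx_version, gpu_arch):
    count = 0
    for op, layout_a, layout_b, satf in product(
            get_mma_ops(), ["row", "col"], ["row", "col"], [".satfinite", ""]):
        geom, type_a = op[0], op[1]
        if not (is_type_supported(type_a, ptx_version, gpu_arch)
                and is_geom_supported(geom, ptx_version, gpu_arch)):
            continue
        if type_a in ("s4", "u4", "b1"):
            # Sub-integer ops need row/col layout; b1 has no .satfinite form.
            if type_a == "b1" and satf:
                continue
            if not (layout_a == "row" and layout_b == "col"):
                continue
        count += 1
    return count


if __name__ == "__main__":
    for ptx, sm in [(60, 70), (61, 70), (63, 72), (63, 75)]:
        print("ptx%d sm_%d: %d mma variants" % (ptx, sm, count_mma_variants(ptx, sm)))
```

The four PTX/SM pairs in the `__main__` block are the same ones used by the RUN lines, so the counts grow as each pair enables more geometries and element types.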