This is an archive of the discontinued LLVM Phabricator instance.

Differential D104847

[Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX 6.5 and 7.0 WMMA and MMA instructions
ClosedPublic

Authored by steffenlarsen on Jun 24 2021, 4:30 AM.


Details

Reviewers

tra

Commits

rG3644726a78e3: [Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX 6.5 and 7.0 WMMA…

Summary

Adds NVPTX builtins and intrinsics for the CUDA PTX wmma.load, wmma.store, wmma.mma, and mma instructions added in PTX 6.5 and 7.0.

PTX ISA description of

Overview of wmma.mma and mma matrix shape/type combinations added with specific PTX versions: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-shape

Authored-by: Steffen Larsen <steffen.larsen@codeplay.com>
Co-Authored-by: Stuart Adams <stuart.adams@codeplay.com>

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

steffenlarsen created this revision. Jun 24 2021, 4:30 AM

Herald added subscribers: hiraditya, yaxunl, jholewinski. Jun 24 2021, 4:30 AM

steffenlarsen requested review of this revision. Jun 24 2021, 4:30 AM

Herald added projects: Restricted Project, Restricted Project. Jun 24 2021, 4:30 AM

Herald added subscribers: llvm-commits, cfe-commits, jdoerfert.

Harbormaster completed remote builds in B110796: Diff 354199. Jun 24 2021, 5:13 AM

Nice. Thank you for adding support for these missing instructions!
LGTM, modulo a few cosmetic nits.

clang/include/clang/Basic/BuiltinsNVPTX.def
762	Is this a sufficient set of builtins to compile mma.h in CUDA-11.x?
clang/lib/CodeGen/CGBuiltin.cpp
16503–16513	Nit: satf variants are in the minority. We could move them to the end of the variant list and rely on default initialization to 0. E.g. something like this:

    unsigned getMMAIntrinsic(int Layout, bool Satf) {
      unsigned Index = Layout + 4*Satf;
      if (Index >= Variants.size())
        return 0;
      return Variants[Index];
    }

    #define MMA_VARIANTS(geom, type) Intrinsic::nvvm_wmma_##geom##_mma_row_row_##type, \
      Intrinsic::nvvm_wmma_##geom##_mma_row_col_##type, \
      Intrinsic::nvvm_wmma_##geom##_mma_col_row_##type, \
      Intrinsic::nvvm_wmma_##geom##_mma_col_col_##type
    #define MMA_SATF_VARIANTS(geom, type) MMA_VARIANTS(geom, type), \
      Intrinsic::nvvm_wmma_##geom##_mma_row_row_##type##_satfinite, \
      Intrinsic::nvvm_wmma_##geom##_mma_row_col_##type##_satfinite, \
      Intrinsic::nvvm_wmma_##geom##_mma_col_row_##type##_satfinite, \
      Intrinsic::nvvm_wmma_##geom##_mma_col_col_##type##_satfinite
    ...
    case NVPTX::BI__hmma_m16n16k16_mma_f16f16:
      return {8, 8, 4, 4, {{ MMA_SATF_VARIANTS(m16n16k16, f16_f16) }}};
    ...
    case NVPTX::BI__dmma_m8n8k4_mma_f64:
      return {1, 1, 2, 2, {{MMA_VARIANTS(m8n8k4, f64)}}};

Up to you.
clang/test/CodeGen/builtins-nvptx-mma.py
111	typo in the original code: `sub-integers` or `sub-integer types`
142–146	It's not obvious why frag `d` is `__mma` and not `__mma_tf32`. Can we use frag.ptx_type to make that decision?
llvm/include/llvm/IR/IntrinsicsNVVM.td
55	Nit: I'd drop `some`.
219–223	Nit: `are identified`
221	Nit: `types` as both A and B are considered.
4474	We're often using an empty string to represent a `none`. Comparisons with `-` where we check `rnd` look like we're doing something special there. I'd use an empty string for `rnd`, too.
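
The indexing scheme proposed for CGBuiltin.cpp in the comments above can be tried in isolation. What follows is only a minimal sketch under stated assumptions, not the code from the patch: `MmaInfoSketch`, `noSatfSketch`, and `satfSketch` are hypothetical names, and the numeric IDs are placeholders standing in for real `Intrinsic::` values. It shows why listing only the four non-satf variants is enough: the remaining slots are zero-filled, so a satf request falls through to 0.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for NVPTXMmaInfo. The numeric IDs used below are
// placeholders, not real llvm::Intrinsic values.
struct MmaInfoSketch {
  // Slots 0-3: non-satf variants; slots 4-7: the matching .satfinite variants.
  std::array<unsigned, 8> Variants;

  unsigned getMMAIntrinsic(int Layout, bool Satf) const {
    unsigned Index = Layout + 4 * Satf;
    if (Index >= Variants.size())
      return 0; // Out-of-range request: no such variant.
    return Variants[Index];
  }
};

// Only four initializers are given, so aggregate initialization zero-fills
// slots 4-7 and every satf lookup yields 0 ("unsupported").
inline MmaInfoSketch noSatfSketch() { return {{{101, 102, 103, 104}}}; }

// All eight slots populated, as a type with .satfinite support would be.
inline MmaInfoSketch satfSketch() {
  return {{{101, 102, 103, 104, 201, 202, 203, 204}}};
}
```

With these placeholders, `noSatfSketch().getMMAIntrinsic(1, true)` returns 0, while `satfSketch().getMMAIntrinsic(1, true)` returns the placeholder stored in slot 5.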

This revision is now accepted and ready to land. Jun 24 2021, 11:46 AM

Adjusted for comments and fixed formatting issues.

@tra Thank you for the quick response! I agree with your comments and have made changes accordingly.

clang/include/clang/Basic/BuiltinsNVPTX.def
762	mma.h was my frame-of-reference for the builtins and I have CUDA 11.3 (465.19.01) installed, so I would expect it to be.
clang/lib/CodeGen/CGBuiltin.cpp
16503–16513	I agree, I like this better. In case other options will use the satf parameter (e.g. rnd which isn't indicated as being in use from mma.h) this solution is also easier to extend in the future.
clang/test/CodeGen/builtins-nvptx-mma.py
142–146	We absolutely can. I don't know why that wasn't my first solution.
llvm/include/llvm/IR/IntrinsicsNVVM.td
4474	Empty string works for me. I think there are/were some places that used "-" as a default parameter meaning `none`, but I agree with your assessment.

Harbormaster completed remote builds in B110991: Diff 354487. Jun 25 2021, 7:44 AM

LGTM. Would you like me to land the patch on your behalf?

clang/lib/CodeGen/CGBuiltin.cpp
16489	A comment here describing the expected arrangement of the variants would be helpful. E.g. `ordered by layout-A/layout-B/satf, where 'row' has priority over 'col' for layout. The index of non-satf variants is expected to match the undocumented layout constants used by CUDA's mma.hpp`. It could be cleaner if we could use designated initializers, but we can't use them until LLVM switches to C++20.
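
To make the requested ordering concrete, here is a small sketch that decodes a slot index back into a variant name. It is an illustration only: `variantNameSketch` is a hypothetical helper that does not exist in the patch, and the bit assignments are simply read off the order in which the variant macros list their entries (row_row, row_col, col_row, col_col, then the same four with `_satfinite`).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: spells out which variant occupies slot `Index` of the
// 8-entry table, following the ordering described in the comment above.
inline std::string variantNameSketch(unsigned Index, const std::string &Type) {
  if (Index >= 8)
    return "";
  const char *LayoutA = (Index & 2) ? "col" : "row"; // A varies slowest.
  const char *LayoutB = (Index & 1) ? "col" : "row"; // B varies fastest.
  std::string Name = std::string("mma_") + LayoutA + "_" + LayoutB + "_" + Type;
  if (Index & 4) // Slots 4-7 repeat the four layouts with .satfinite.
    Name += "_satfinite";
  return Name;
}
```

For example, slot 5 with type `s8` decodes to `mma_row_col_s8_satfinite`, which is the entry a row/col layout with satf enabled would select.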

Added comment about variant ordering.

In D104847#2841242, @tra wrote:

LGTM. Would you like me to land the patch on your behalf?

If it isn't too much of a bother. Thank you. 😄

Harbormaster completed remote builds in B111232: Diff 354814. Jun 28 2021, 3:04 AM

Closed by commit rG3644726a78e3: [Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX 6.5 and 7.0 WMMA… (authored by steffenlarsen, committed by tra). Jun 29 2021, 3:45 PM

This revision was automatically updated to reflect the committed changes.

tra added a commit: rG3644726a78e3: [Clang][NVPTX] Add NVPTX intrinsics and builtins for CUDA PTX 6.5 and 7.0 WMMA….

tra added inline comments. Jul 2 2021, 11:15 AM

clang/include/clang/Basic/BuiltinsNVPTX.def
727	Bummer. mma.h in CUDA-11.3 still does not compile for Ampere. We appear to be missing the new `__bmma_m8n8k128_mma_and_popc_b1` builtin for the `.and` variant of 1-bit `mma` introduced in PTX 7.1 and not included in this patch. https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#warp-level-matrix-instructions-wmma-mma Do you, by any chance, have an upcoming patch for PTX 7.1, too? :-)

tra added inline comments. Jul 2 2021, 3:25 PM

clang/test/CodeGen/builtins-nvptx-mma.cu
781–786	This looks rather odd. We're calling a `tf32` builtin, but expect to see an `f32` load intrinsic. Is that expected?
clang/test/CodeGen/builtins-nvptx-mma.py
74	This does not seem to match the generated `builtins-nvptx-mma.cu`, which does have `__mma_tf32_m16n16k8_ld_c`. If I regenerate the test I see a somewhat different set of tests, possibly related to the oddity I've pointed out in the generated test changes in this patch.

tra added inline comments. Jul 2 2021, 4:21 PM

clang/test/CodeGen/builtins-nvptx-mma.cu
781–786	Never mind. I think I understand what's going on now. CUDA headers use __mma_tf32 builtins. `A` and `B` operate on opaque integer types. `C` and `D` operate on floats. However, on the PTX front we have `wmma.load.{a,b}...tf32` but `wmma.load.c...f32`. I guess it does make sense to keep LLVM intrinsic names close to the instructions they produce.
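
The naming split described in that comment can be written down as a small lookup. This is a hedged sketch rather than anything from the patch: `tf32FragTypeSketch` and `tf32LoadSuffixSketch` are hypothetical helpers, and the suffix strings are taken from the `MMA_LDST` arguments that appear in the CGBuiltin.cpp diff (`m16n16k8_load_a_tf32`, `m16n16k8_load_b_tf32`, `m16n16k8_load_c_f32`).

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: PTX element type used for each fragment of the
// m16n16k8 tf32 geometry. A and B are tf32; C and D are plain f32.
inline std::string tf32FragTypeSketch(char Frag) {
  switch (Frag) {
  case 'a':
  case 'b':
    return "tf32";
  case 'c':
  case 'd':
    return "f32";
  default:
    return "";
  }
}

// Hypothetical helper: suffix of the load intrinsic for a fragment. Loads
// exist only for fragments a, b, and c; anything else yields an empty string.
inline std::string tf32LoadSuffixSketch(char Frag) {
  std::string Type = tf32FragTypeSketch(Frag);
  if (Type.empty() || Frag == 'd')
    return "";
  return std::string("m16n16k8_load_") + Frag + "_" + Type;
}
```

So a call through the `__mma_tf32_m16n16k8_ld_c` builtin is expected to land on a `load_c_f32` intrinsic, even though the builtin name itself says `tf32`.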

tra mentioned this in D105384: [NVPTX, CUDA] Add .and.popc variant of the b1 MMA instruction. Jul 2 2021, 5:07 PM

Sorry for the late response. Looks like you have handled the issues and more in your patch. Thank you for fixing my blunders. 😄

clang/include/clang/Basic/BuiltinsNVPTX.def
727	Haha, I didn't think of that one. Sadly we don't have any plans to work on extending support for PTX 7.1+ in the next couple of months, but it looks like your new patch (D105384) takes care of it anyway. 😆
clang/test/CodeGen/builtins-nvptx-mma.cu
781–786	Yeah, it was definitely confusing to write. I think the current state is the best solution, as it prioritizes consistency within the sub-projects. Not a big fan of the inconsistency though, but if we want to follow CUDA's example I suppose we're stuck with this.
clang/test/CodeGen/builtins-nvptx-mma.py
74	You are absolutely right. That's a mistake on my part. Looks like you've got it under control in your patch. Thanks!

Revision Contents

Path                                            Size
clang/include/clang/Basic/BuiltinsNVPTX.def     23 lines
clang/lib/CodeGen/CGBuiltin.cpp                 220 lines
clang/test/CodeGen/builtins-nvptx-mma.cu        169 lines
clang/test/CodeGen/builtins-nvptx-mma.py        114 lines
llvm/include/llvm/IR/IntrinsicsNVVM.td          409 lines
llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp     94 lines
llvm/lib/Target/NVPTX/NVPTXInstrInfo.td         1 line
llvm/lib/Target/NVPTX/NVPTXIntrinsics.td        207 lines
llvm/test/CodeGen/NVPTX/lit.local.cfg           1 line
llvm/test/CodeGen/NVPTX/wmma.py                 454 lines

Diff 355390

clang/include/clang/Basic/BuiltinsNVPTX.def

@@ 718 lines not shown @@
 TARGET_BUILTIN(__hmma_m8n32k16_mma_f32f16, "vfiCiCiCIiIi", "", AND(SM_70,PTX61))
 TARGET_BUILTIN(__hmma_m8n32k16_mma_f32f32, "vfiCiCfCIiIi", "", AND(SM_70,PTX61))
 TARGET_BUILTIN(__hmma_m8n32k16_mma_f16f32, "viiCiCfCIiIi", "", AND(SM_70,PTX61))

 // Builtins to support integer and sub-integer WMMA instructions on sm_72/sm_75
 TARGET_BUILTIN(__bmma_m8n8k128_ld_a_b1, "viiCUiIi", "", AND(SM_75,PTX63))
 TARGET_BUILTIN(__bmma_m8n8k128_ld_b_b1, "viiCUiIi", "", AND(SM_75,PTX63))
 TARGET_BUILTIN(__bmma_m8n8k128_ld_c, "viiCUiIi", "", AND(SM_75,PTX63))
 TARGET_BUILTIN(__bmma_m8n8k128_mma_xor_popc_b1, "viiCiCiCIi", "", AND(SM_75,PTX63))
 TARGET_BUILTIN(__bmma_m8n8k128_st_c_i32, "viiCUiIi", "", AND(SM_75,PTX63))
 TARGET_BUILTIN(__imma_m16n16k16_ld_a_s8, "viiCUiIi", "", AND(SM_72,PTX63))
 TARGET_BUILTIN(__imma_m16n16k16_ld_a_u8, "viiCUiIi", "", AND(SM_72,PTX63))
 TARGET_BUILTIN(__imma_m16n16k16_ld_b_s8, "viiCUiIi", "", AND(SM_72,PTX63))
 TARGET_BUILTIN(__imma_m16n16k16_ld_b_u8, "viiCUiIi", "", AND(SM_72,PTX63))
 TARGET_BUILTIN(__imma_m16n16k16_ld_c, "viiCUiIi", "", AND(SM_72,PTX63))
 TARGET_BUILTIN(__imma_m16n16k16_mma_s8, "viiCiCiCIiIi", "", AND(SM_72,PTX63))
 TARGET_BUILTIN(__imma_m16n16k16_mma_u8, "viiCiCiCIiIi", "", AND(SM_72,PTX63))
@@ 18 lines not shown @@
 TARGET_BUILTIN(__imma_m8n8k32_ld_a_u4, "viiCUiIi", "", AND(SM_75,PTX63))
 TARGET_BUILTIN(__imma_m8n8k32_ld_b_s4, "viiCUiIi", "", AND(SM_75,PTX63))
 TARGET_BUILTIN(__imma_m8n8k32_ld_b_u4, "viiCUiIi", "", AND(SM_75,PTX63))
 TARGET_BUILTIN(__imma_m8n8k32_ld_c, "viiCUiIi", "", AND(SM_75,PTX63))
 TARGET_BUILTIN(__imma_m8n8k32_mma_s4, "viiCiCiCIiIi", "", AND(SM_75,PTX63))
 TARGET_BUILTIN(__imma_m8n8k32_mma_u4, "viiCiCiCIiIi", "", AND(SM_75,PTX63))
 TARGET_BUILTIN(__imma_m8n8k32_st_c_i32, "viiCUiIi", "", AND(SM_75,PTX63))

+// Builtins to support double and alternate float WMMA instructions on sm_80
+TARGET_BUILTIN(__dmma_m8n8k4_ld_a, "vddCUiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__dmma_m8n8k4_ld_b, "vddCUiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__dmma_m8n8k4_ld_c, "vddCUiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__dmma_m8n8k4_st_c_f64, "vddCUiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__dmma_m8n8k4_mma_f64, "vddCdCdCIiIi", "", AND(SM_80,PTX70))
+
+TARGET_BUILTIN(__mma_bf16_m16n16k16_ld_a, "viiCUiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__mma_bf16_m16n16k16_ld_b, "viiCUiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__mma_bf16_m16n16k16_mma_f32, "vfiCiCfCIiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__mma_bf16_m8n32k16_ld_a, "viiCUiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__mma_bf16_m8n32k16_ld_b, "viiCUiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__mma_bf16_m8n32k16_mma_f32, "vfiCiCfCIiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__mma_bf16_m32n8k16_ld_a, "viiCUiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__mma_bf16_m32n8k16_ld_b, "viiCUiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__mma_bf16_m32n8k16_mma_f32, "vfiCiCfCIiIi", "", AND(SM_80,PTX70))
+
+TARGET_BUILTIN(__mma_tf32_m16n16k8_ld_a, "viiCUiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__mma_tf32_m16n16k8_ld_b, "viiCUiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__mma_tf32_m16n16k8_ld_c, "vffCUiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__mma_m16n16k8_st_c_f32, "vffCUiIi", "", AND(SM_80,PTX70))
+TARGET_BUILTIN(__mma_tf32_m16n16k8_mma_f32, "vfiCiCfCIiIi", "", AND(SM_80,PTX70))
 // Async Copy
 TARGET_BUILTIN(__nvvm_cp_async_mbarrier_arrive, "vWi*", "", AND(SM_80,PTX70))
 TARGET_BUILTIN(__nvvm_cp_async_mbarrier_arrive_shared, "vWi*3", "", AND(SM_80,PTX70))
 TARGET_BUILTIN(__nvvm_cp_async_mbarrier_arrive_noinc, "vWi*", "", AND(SM_80,PTX70))
 TARGET_BUILTIN(__nvvm_cp_async_mbarrier_arrive_noinc_shared, "vWi*3", "", AND(SM_80,PTX70))

 TARGET_BUILTIN(__nvvm_cp_async_ca_shared_global_4, "vv3vC1", "", AND(SM_80,PTX70))
 TARGET_BUILTIN(__nvvm_cp_async_ca_shared_global_8, "vv3vC1", "", AND(SM_80,PTX70))
@@ 24 lines not shown @@

clang/lib/CodeGen/CGBuiltin.cpp


@@ 16,396 lines not shown @@ case NVPTX::BI__imma_m8n8k32_ld_c:
     return MMA_LDST(2, m8n8k32_load_c_s32);
   case NVPTX::BI__bmma_m8n8k128_ld_a_b1:
     return {1, 0, MMA_INTR(m8n8k128_load_a_b1, row)};
   case NVPTX::BI__bmma_m8n8k128_ld_b_b1:
     return {1, MMA_INTR(m8n8k128_load_b_b1, col), 0};
   case NVPTX::BI__bmma_m8n8k128_ld_c:
     return MMA_LDST(2, m8n8k128_load_c_s32);
+
+  // Double MMA loads
+  case NVPTX::BI__dmma_m8n8k4_ld_a:
+    return MMA_LDST(1, m8n8k4_load_a_f64);
+  case NVPTX::BI__dmma_m8n8k4_ld_b:
+    return MMA_LDST(1, m8n8k4_load_b_f64);
+  case NVPTX::BI__dmma_m8n8k4_ld_c:
+    return MMA_LDST(2, m8n8k4_load_c_f64);
+
+  // Alternate float MMA loads
+  case NVPTX::BI__mma_bf16_m16n16k16_ld_a:
+    return MMA_LDST(4, m16n16k16_load_a_bf16);
+  case NVPTX::BI__mma_bf16_m16n16k16_ld_b:
+    return MMA_LDST(4, m16n16k16_load_b_bf16);
+  case NVPTX::BI__mma_bf16_m8n32k16_ld_a:
+    return MMA_LDST(2, m8n32k16_load_a_bf16);
+  case NVPTX::BI__mma_bf16_m8n32k16_ld_b:
+    return MMA_LDST(8, m8n32k16_load_b_bf16);
+  case NVPTX::BI__mma_bf16_m32n8k16_ld_a:
+    return MMA_LDST(8, m32n8k16_load_a_bf16);
+  case NVPTX::BI__mma_bf16_m32n8k16_ld_b:
+    return MMA_LDST(2, m32n8k16_load_b_bf16);
+  case NVPTX::BI__mma_tf32_m16n16k8_ld_a:
+    return MMA_LDST(4, m16n16k8_load_a_tf32);
+  case NVPTX::BI__mma_tf32_m16n16k8_ld_b:
+    return MMA_LDST(2, m16n16k8_load_b_tf32);
+  case NVPTX::BI__mma_tf32_m16n16k8_ld_c:
+    return MMA_LDST(8, m16n16k8_load_c_f32);

   // NOTE: We need to follow inconsitent naming scheme used by NVCC. Unlike
   // PTX and LLVM IR where stores always use fragment D, NVCC builtins always
   // use fragment C for both loads and stores.
   // FP MMA stores.
   case NVPTX::BI__hmma_m16n16k16_st_c_f16:
     return MMA_LDST(4, m16n16k16_store_d_f16);
   case NVPTX::BI__hmma_m16n16k16_st_c_f32:
     return MMA_LDST(8, m16n16k16_store_d_f32);
@@ 15 lines not shown @@ case NVPTX::BI__imma_m32n8k16_st_c_i32:
     return MMA_LDST(8, m32n8k16_store_d_s32);
   case NVPTX::BI__imma_m8n32k16_st_c_i32:
     return MMA_LDST(8, m8n32k16_store_d_s32);
   case NVPTX::BI__imma_m8n8k32_st_c_i32:
     return MMA_LDST(2, m8n8k32_store_d_s32);
   case NVPTX::BI__bmma_m8n8k128_st_c_i32:
     return MMA_LDST(2, m8n8k128_store_d_s32);
+
+  // Double MMA store
+  case NVPTX::BI__dmma_m8n8k4_st_c_f64:
+    return MMA_LDST(2, m8n8k4_store_d_f64);
+
+  // Alternate float MMA store
+  case NVPTX::BI__mma_m16n16k8_st_c_f32:
+    return MMA_LDST(8, m16n16k8_store_d_f32);

   default:
     llvm_unreachable("Unknown MMA builtin");
   }
 }
 #undef MMA_LDST
 #undef MMA_INTR


 struct NVPTXMmaInfo {
   unsigned NumEltsA;
   unsigned NumEltsB;
   unsigned NumEltsC;
   unsigned NumEltsD;
+
+  // Variants are ordered by layout-A/layout-B/satf, where 'row' has priority
+  // over 'col' for layout. The index of non-satf variants is expected to match
+  // the undocumented layout constants used by CUDA's mma.hpp.
   std::array<unsigned, 8> Variants;

   unsigned getMMAIntrinsic(int Layout, bool Satf) {
-    unsigned Index = Layout * 2 + Satf;
+    unsigned Index = Layout + 4 * Satf;
     if (Index >= Variants.size())
       return 0;
     return Variants[Index];
   }
 };

 // Returns an intrinsic that matches Layout and Satf for valid combinations of
 // Layout and Satf, 0 otherwise.
 static NVPTXMmaInfo getNVPTXMmaInfo(unsigned BuiltinID) {
   // clang-format off
-#define MMA_VARIANTS(geom, type) {{ \
+#define MMA_VARIANTS(geom, type) \
       Intrinsic::nvvm_wmma_##geom##_mma_row_row_##type, \
-      Intrinsic::nvvm_wmma_##geom##_mma_row_row_##type##_satfinite, \
       Intrinsic::nvvm_wmma_##geom##_mma_row_col_##type, \
-      Intrinsic::nvvm_wmma_##geom##_mma_row_col_##type##_satfinite, \
       Intrinsic::nvvm_wmma_##geom##_mma_col_row_##type, \
+      Intrinsic::nvvm_wmma_##geom##_mma_col_col_##type
+#define MMA_SATF_VARIANTS(geom, type) \
+      MMA_VARIANTS(geom, type), \
+      Intrinsic::nvvm_wmma_##geom##_mma_row_row_##type##_satfinite, \
+      Intrinsic::nvvm_wmma_##geom##_mma_row_col_##type##_satfinite, \
       Intrinsic::nvvm_wmma_##geom##_mma_col_row_##type##_satfinite, \
-      Intrinsic::nvvm_wmma_##geom##_mma_col_col_##type, \
-      Intrinsic::nvvm_wmma_##geom##_mma_col_col_##type##_satfinite \
-    }}
+      Intrinsic::nvvm_wmma_##geom##_mma_col_col_##type##_satfinite
 // Sub-integer MMA only supports row.col layout.
-#define MMA_VARIANTS_I4(geom, type) {{ \
-      0, \
+#define MMA_VARIANTS_I4(geom, type) \
       0, \
       Intrinsic::nvvm_wmma_##geom##_mma_row_col_##type, \
-      Intrinsic::nvvm_wmma_##geom##_mma_row_col_##type##_satfinite, \
       0, \
       0, \
       0, \
-      0 \
-    }}
-// b1 MMA does not support .satfinite.
-#define MMA_VARIANTS_B1(geom, type) {{ \
+      Intrinsic::nvvm_wmma_##geom##_mma_row_col_##type##_satfinite, \
       0, \
+      0
+// b1 MMA does not support .satfinite.
+#define MMA_VARIANTS_B1(geom, type) \
       0, \
       Intrinsic::nvvm_wmma_##geom##_mma_row_col_##type, \
       0, \
       0, \
       0, \
       0, \
-      0 \
-    }}
+      0, \
+      0
   // clang-format on
   switch (BuiltinID) {
   // FP MMA
-  // Note that 'type' argument of MMA_VARIANT uses D_C notation, while
+  // Note that 'type' argument of MMA_SATF_VARIANTS uses D_C notation, while
   // NumEltsN of return value are ordered as A,B,C,D.
   case NVPTX::BI__hmma_m16n16k16_mma_f16f16:
-    return {8, 8, 4, 4, MMA_VARIANTS(m16n16k16, f16_f16)};
+    return {8, 8, 4, 4, {{MMA_SATF_VARIANTS(m16n16k16, f16_f16)}}};
   case NVPTX::BI__hmma_m16n16k16_mma_f32f16:
-    return {8, 8, 4, 8, MMA_VARIANTS(m16n16k16, f32_f16)};
+    return {8, 8, 4, 8, {{MMA_SATF_VARIANTS(m16n16k16, f32_f16)}}};
   case NVPTX::BI__hmma_m16n16k16_mma_f16f32:
-    return {8, 8, 8, 4, MMA_VARIANTS(m16n16k16, f16_f32)};
+    return {8, 8, 8, 4, {{MMA_SATF_VARIANTS(m16n16k16, f16_f32)}}};
   case NVPTX::BI__hmma_m16n16k16_mma_f32f32:
-    return {8, 8, 8, 8, MMA_VARIANTS(m16n16k16, f32_f32)};
+    return {8, 8, 8, 8, {{MMA_SATF_VARIANTS(m16n16k16, f32_f32)}}};
   case NVPTX::BI__hmma_m32n8k16_mma_f16f16:
-    return {8, 8, 4, 4, MMA_VARIANTS(m32n8k16, f16_f16)};
+    return {8, 8, 4, 4, {{MMA_SATF_VARIANTS(m32n8k16, f16_f16)}}};
   case NVPTX::BI__hmma_m32n8k16_mma_f32f16:
-    return {8, 8, 4, 8, MMA_VARIANTS(m32n8k16, f32_f16)};
+    return {8, 8, 4, 8, {{MMA_SATF_VARIANTS(m32n8k16, f32_f16)}}};
   case NVPTX::BI__hmma_m32n8k16_mma_f16f32:
-    return {8, 8, 8, 4, MMA_VARIANTS(m32n8k16, f16_f32)};
+    return {8, 8, 8, 4, {{MMA_SATF_VARIANTS(m32n8k16, f16_f32)}}};
   case NVPTX::BI__hmma_m32n8k16_mma_f32f32:
-    return {8, 8, 8, 8, MMA_VARIANTS(m32n8k16, f32_f32)};
+    return {8, 8, 8, 8, {{MMA_SATF_VARIANTS(m32n8k16, f32_f32)}}};
   case NVPTX::BI__hmma_m8n32k16_mma_f16f16:
-    return {8, 8, 4, 4, MMA_VARIANTS(m8n32k16, f16_f16)};
+    return {8, 8, 4, 4, {{MMA_SATF_VARIANTS(m8n32k16, f16_f16)}}};
   case NVPTX::BI__hmma_m8n32k16_mma_f32f16:
-    return {8, 8, 4, 8, MMA_VARIANTS(m8n32k16, f32_f16)};
+    return {8, 8, 4, 8, {{MMA_SATF_VARIANTS(m8n32k16, f32_f16)}}};
   case NVPTX::BI__hmma_m8n32k16_mma_f16f32:
-    return {8, 8, 8, 4, MMA_VARIANTS(m8n32k16, f16_f32)};
+    return {8, 8, 8, 4, {{MMA_SATF_VARIANTS(m8n32k16, f16_f32)}}};
   case NVPTX::BI__hmma_m8n32k16_mma_f32f32:
-    return {8, 8, 8, 8, MMA_VARIANTS(m8n32k16, f32_f32)};
+    return {8, 8, 8, 8, {{MMA_SATF_VARIANTS(m8n32k16, f32_f32)}}};

   // Integer MMA
   case NVPTX::BI__imma_m16n16k16_mma_s8:
-    return {2, 2, 8, 8, MMA_VARIANTS(m16n16k16, s8)};
+    return {2, 2, 8, 8, {{MMA_SATF_VARIANTS(m16n16k16, s8)}}};
   case NVPTX::BI__imma_m16n16k16_mma_u8:
-    return {2, 2, 8, 8, MMA_VARIANTS(m16n16k16, u8)};
+    return {2, 2, 8, 8, {{MMA_SATF_VARIANTS(m16n16k16, u8)}}};
   case NVPTX::BI__imma_m32n8k16_mma_s8:
-    return {4, 1, 8, 8, MMA_VARIANTS(m32n8k16, s8)};
+    return {4, 1, 8, 8, {{MMA_SATF_VARIANTS(m32n8k16, s8)}}};
   case NVPTX::BI__imma_m32n8k16_mma_u8:
-    return {4, 1, 8, 8, MMA_VARIANTS(m32n8k16, u8)};
+    return {4, 1, 8, 8, {{MMA_SATF_VARIANTS(m32n8k16, u8)}}};
   case NVPTX::BI__imma_m8n32k16_mma_s8:
-    return {1, 4, 8, 8, MMA_VARIANTS(m8n32k16, s8)};
+    return {1, 4, 8, 8, {{MMA_SATF_VARIANTS(m8n32k16, s8)}}};
   case NVPTX::BI__imma_m8n32k16_mma_u8:
-    return {1, 4, 8, 8, MMA_VARIANTS(m8n32k16, u8)};
+    return {1, 4, 8, 8, {{MMA_SATF_VARIANTS(m8n32k16, u8)}}};

   // Sub-integer MMA
   case NVPTX::BI__imma_m8n8k32_mma_s4:
-    return {1, 1, 2, 2, MMA_VARIANTS_I4(m8n8k32, s4)};
+    return {1, 1, 2, 2, {{MMA_VARIANTS_I4(m8n8k32, s4)}}};
   case NVPTX::BI__imma_m8n8k32_mma_u4:
-    return {1, 1, 2, 2, MMA_VARIANTS_I4(m8n8k32, u4)};
+    return {1, 1, 2, 2, {{MMA_VARIANTS_I4(m8n8k32, u4)}}};
   case NVPTX::BI__bmma_m8n8k128_mma_xor_popc_b1:
-    return {1, 1, 2, 2, MMA_VARIANTS_B1(m8n8k128, b1)};
+    return {1, 1, 2, 2, {{MMA_VARIANTS_B1(m8n8k128, b1)}}};

+  // Double MMA
+  case NVPTX::BI__dmma_m8n8k4_mma_f64:
+    return {1, 1, 2, 2, {{MMA_VARIANTS(m8n8k4, f64)}}};
+
+  // Alternate FP MMA
+  case NVPTX::BI__mma_bf16_m16n16k16_mma_f32:
+    return {4, 4, 8, 8, {{MMA_VARIANTS(m16n16k16, bf16)}}};
+  case NVPTX::BI__mma_bf16_m8n32k16_mma_f32:
+    return {2, 8, 8, 8, {{MMA_VARIANTS(m8n32k16, bf16)}}};
+  case NVPTX::BI__mma_bf16_m32n8k16_mma_f32:
+    return {8, 2, 8, 8, {{MMA_VARIANTS(m32n8k16, bf16)}}};
+  case NVPTX::BI__mma_tf32_m16n16k8_mma_f32:
+    return {4, 4, 8, 8, {{MMA_VARIANTS(m16n16k8, tf32)}}};
   default:
     llvm_unreachable("Unexpected builtin ID.");
   }
 #undef MMA_VARIANTS
+#undef MMA_SATF_VARIANTS
 #undef MMA_VARIANTS_I4
 #undef MMA_VARIANTS_B1
 }

 } // namespace

 Value *
 CodeGenFunction::EmitNVPTXBuiltinExpr(unsigned BuiltinID, const CallExpr *E) {
@@ 281 lines not shown @@ CodeGenFunction::EmitNVPTXBuiltinExpr(unsigned BuiltinID, const CallExpr *E) {
   case NVPTX::BI__imma_m8n8k32_ld_a_s4:
   case NVPTX::BI__imma_m8n8k32_ld_a_u4:
   case NVPTX::BI__imma_m8n8k32_ld_b_s4:
   case NVPTX::BI__imma_m8n8k32_ld_b_u4:
   case NVPTX::BI__imma_m8n8k32_ld_c:
   case NVPTX::BI__bmma_m8n8k128_ld_a_b1:
   case NVPTX::BI__bmma_m8n8k128_ld_b_b1:
   case NVPTX::BI__bmma_m8n8k128_ld_c:
-  {
+  // Double MMA loads.
+  case NVPTX::BI__dmma_m8n8k4_ld_a:
+  case NVPTX::BI__dmma_m8n8k4_ld_b:
+  case NVPTX::BI__dmma_m8n8k4_ld_c:
+  // Alternate float MMA loads.
+  case NVPTX::BI__mma_bf16_m16n16k16_ld_a:
+  case NVPTX::BI__mma_bf16_m16n16k16_ld_b:
+  case NVPTX::BI__mma_bf16_m8n32k16_ld_a:
+  case NVPTX::BI__mma_bf16_m8n32k16_ld_b:
+  case NVPTX::BI__mma_bf16_m32n8k16_ld_a:
+  case NVPTX::BI__mma_bf16_m32n8k16_ld_b:
+  case NVPTX::BI__mma_tf32_m16n16k8_ld_a:
+  case NVPTX::BI__mma_tf32_m16n16k8_ld_b:
+  case NVPTX::BI__mma_tf32_m16n16k8_ld_c: {
     Address Dst = EmitPointerWithAlignment(E->getArg(0));
     Value *Src = EmitScalarExpr(E->getArg(1));
     Value *Ldm = EmitScalarExpr(E->getArg(2));
     Optional<llvm::APSInt> isColMajorArg =
         E->getArg(3)->getIntegerConstantExpr(getContext());
     if (!isColMajorArg)
       return nullptr;
     bool isColMajor = isColMajorArg->getSExtValue();
(28 lines not shown)	CodeGenFunction::EmitNVPTXBuiltinExpr(unsigned BuiltinID, const CallExpr *E) {
case NVPTX::BI__hmma_m32n8k16_st_c_f16:		case NVPTX::BI__hmma_m32n8k16_st_c_f16:
case NVPTX::BI__hmma_m32n8k16_st_c_f32:		case NVPTX::BI__hmma_m32n8k16_st_c_f32:
case NVPTX::BI__hmma_m8n32k16_st_c_f16:		case NVPTX::BI__hmma_m8n32k16_st_c_f16:
case NVPTX::BI__hmma_m8n32k16_st_c_f32:		case NVPTX::BI__hmma_m8n32k16_st_c_f32:
case NVPTX::BI__imma_m16n16k16_st_c_i32:		case NVPTX::BI__imma_m16n16k16_st_c_i32:
case NVPTX::BI__imma_m32n8k16_st_c_i32:		case NVPTX::BI__imma_m32n8k16_st_c_i32:
case NVPTX::BI__imma_m8n32k16_st_c_i32:		case NVPTX::BI__imma_m8n32k16_st_c_i32:
case NVPTX::BI__imma_m8n8k32_st_c_i32:		case NVPTX::BI__imma_m8n8k32_st_c_i32:
case NVPTX::BI__bmma_m8n8k128_st_c_i32: {		case NVPTX::BI__bmma_m8n8k128_st_c_i32:
		case NVPTX::BI__dmma_m8n8k4_st_c_f64:
		case NVPTX::BI__mma_m16n16k8_st_c_f32: {
Value *Dst = EmitScalarExpr(E->getArg(0));		Value *Dst = EmitScalarExpr(E->getArg(0));
Address Src = EmitPointerWithAlignment(E->getArg(1));		Address Src = EmitPointerWithAlignment(E->getArg(1));
Value *Ldm = EmitScalarExpr(E->getArg(2));		Value *Ldm = EmitScalarExpr(E->getArg(2));
Optional<llvm::APSInt> isColMajorArg =		Optional<llvm::APSInt> isColMajorArg =
E->getArg(3)->getIntegerConstantExpr(getContext());		E->getArg(3)->getIntegerConstantExpr(getContext());
if (!isColMajorArg)		if (!isColMajorArg)
return nullptr;		return nullptr;
bool isColMajor = isColMajorArg->getSExtValue();		bool isColMajor = isColMajorArg->getSExtValue();
(35 lines not shown)	CodeGenFunction::EmitNVPTXBuiltinExpr(unsigned BuiltinID, const CallExpr *E) {
case NVPTX::BI__imma_m16n16k16_mma_s8:		case NVPTX::BI__imma_m16n16k16_mma_s8:
case NVPTX::BI__imma_m16n16k16_mma_u8:		case NVPTX::BI__imma_m16n16k16_mma_u8:
case NVPTX::BI__imma_m32n8k16_mma_s8:		case NVPTX::BI__imma_m32n8k16_mma_s8:
case NVPTX::BI__imma_m32n8k16_mma_u8:		case NVPTX::BI__imma_m32n8k16_mma_u8:
case NVPTX::BI__imma_m8n32k16_mma_s8:		case NVPTX::BI__imma_m8n32k16_mma_s8:
case NVPTX::BI__imma_m8n32k16_mma_u8:		case NVPTX::BI__imma_m8n32k16_mma_u8:
case NVPTX::BI__imma_m8n8k32_mma_s4:		case NVPTX::BI__imma_m8n8k32_mma_s4:
case NVPTX::BI__imma_m8n8k32_mma_u4:		case NVPTX::BI__imma_m8n8k32_mma_u4:
case NVPTX::BI__bmma_m8n8k128_mma_xor_popc_b1: {		case NVPTX::BI__bmma_m8n8k128_mma_xor_popc_b1:
		case NVPTX::BI__dmma_m8n8k4_mma_f64:
		case NVPTX::BI__mma_bf16_m16n16k16_mma_f32:
		case NVPTX::BI__mma_bf16_m8n32k16_mma_f32:
		case NVPTX::BI__mma_bf16_m32n8k16_mma_f32:
		case NVPTX::BI__mma_tf32_m16n16k8_mma_f32: {
Address Dst = EmitPointerWithAlignment(E->getArg(0));		Address Dst = EmitPointerWithAlignment(E->getArg(0));
Address SrcA = EmitPointerWithAlignment(E->getArg(1));		Address SrcA = EmitPointerWithAlignment(E->getArg(1));
Address SrcB = EmitPointerWithAlignment(E->getArg(2));		Address SrcB = EmitPointerWithAlignment(E->getArg(2));
Address SrcC = EmitPointerWithAlignment(E->getArg(3));		Address SrcC = EmitPointerWithAlignment(E->getArg(3));
Optional<llvm::APSInt> LayoutArg =		Optional<llvm::APSInt> LayoutArg =
E->getArg(4)->getIntegerConstantExpr(getContext());		E->getArg(4)->getIntegerConstantExpr(getContext());
if (!LayoutArg)		if (!LayoutArg)
return nullptr;		return nullptr;
(1,136 lines not shown)

clang/test/CodeGen/builtins-nvptx-mma.cu


//		//
// * DO NOT EDIT *		// * DO NOT EDIT *
//		//
// This test has been automatically generated by		// This test has been automatically generated by
// builtins-nvtx-mma.py --ptx=63 --gpu-arch=75		// builtins-nvtx-mma.py --ptx=70 --gpu-arch=80
//		//
// Make sure we can handle all builtins available on sm_75 with PTX63		// Make sure we can handle all builtins available on sm_80 with PTX70
// RUN: %clang_cc1 -triple nvptx64-unknown-unknown -target-cpu sm_75 \		// RUN: %clang_cc1 -triple nvptx64-unknown-unknown -target-cpu sm_80 \
// RUN: -fcuda-is-device -target-feature +ptx63 \		// RUN: -fcuda-is-device -target-feature +ptx70 \
// RUN: -DPTX=63 -DSM=75 \		// RUN: -DPTX=70 -DSM=80 \
// RUN: -S -emit-llvm -o - -x cuda %s \		// RUN: -S -emit-llvm -o - -x cuda %s \
// RUN: \| FileCheck -check-prefixes=CHECK_PTX61_SM70,CHECK_PTX63_SM75,CHECK_PTX63_SM72,CHECK_PTX60_SM70 %s		// RUN: \| FileCheck -check-prefixes=CHECK_PTX70_SM80,CHECK_PTX60_SM70,CHECK_PTX63_SM72,CHECK_PTX61_SM70,CHECK_PTX63_SM75 %s
// Verify that all builtins have correct constraints.		// Verify that all builtins have correct constraints.
// RUN: %clang_cc1 -triple nvptx-unknown-unknown \		// RUN: %clang_cc1 -triple nvptx-unknown-unknown \
// RUN: -target-cpu sm_60 -target-feature +ptx42 \		// RUN: -target-cpu sm_60 -target-feature +ptx42 \
// RUN: -DPTX=63 -DSM=75 -fcuda-is-device -S -o /dev/null -x cuda \		// RUN: -DPTX=70 -DSM=80 -fcuda-is-device -S -o /dev/null -x cuda \
// RUN: -verify %s		// RUN: -verify %s


#if !defined(CUDA_VERSION)		#if !defined(CUDA_VERSION)
#define __device__ __attribute__((device))		#define __device__ __attribute__((device))
#define __global__ __attribute__((global))		#define __global__ __attribute__((global))
#define __shared__ __attribute__((shared))		#define __shared__ __attribute__((shared))
#define __constant__ __attribute__((constant))		#define __constant__ __attribute__((constant))

typedef unsigned long long uint64_t;		typedef unsigned long long uint64_t;
#endif		#endif

// CHECK-LABEL: test_wmma_buitins		// CHECK-LABEL: test_wmma_buitins
__device__ void test_wmma_buitins(int src, int dst,		__device__ void test_wmma_buitins(int src, int dst,
float fsrc, float fdst, int ldm) {		float fsrc, float fdst,
		double dsrc, double ddst, int ldm) {

#if (PTX >= 60) && (SM >= 70)		#if (PTX >= 60) && (SM >= 70)

// CHECK_PTX60_SM70: call {{.*}} @llvm.nvvm.wmma.m16n16k16.load.a.col.stride.f16		// CHECK_PTX60_SM70: call {{.*}} @llvm.nvvm.wmma.m16n16k16.load.a.col.stride.f16
// expected-error-re@+1 {{'__hmma_m16n16k16_ld_a' needs target feature (sm_70{{.}},(ptx60{{.}}}}		// expected-error-re@+1 {{'__hmma_m16n16k16_ld_a' needs target feature (sm_70{{.}},(ptx60{{.}}}}
__hmma_m16n16k16_ld_a(dst, src, ldm, 1);		__hmma_m16n16k16_ld_a(dst, src, ldm, 1);
// CHECK_PTX60_SM70: call {{.*}} @llvm.nvvm.wmma.m16n16k16.load.a.row.stride.f16		// CHECK_PTX60_SM70: call {{.*}} @llvm.nvvm.wmma.m16n16k16.load.a.row.stride.f16
// expected-error-re@+1 {{'__hmma_m16n16k16_ld_a' needs target feature (sm_70{{.}},(ptx60{{.}}}}		// expected-error-re@+1 {{'__hmma_m16n16k16_ld_a' needs target feature (sm_70{{.}},(ptx60{{.}}}}
(704 lines not shown)	#if (PTX >= 63) && (SM >= 75)
// expected-error-re@+1 {{'__imma_m8n8k32_mma_s4' needs target feature (sm_75{{.}},(ptx63{{.}}}}		// expected-error-re@+1 {{'__imma_m8n8k32_mma_s4' needs target feature (sm_75{{.}},(ptx63{{.}}}}
__imma_m8n8k32_mma_s4(dst, src, src, src, 1, 1);		__imma_m8n8k32_mma_s4(dst, src, src, src, 1, 1);
// CHECK_PTX63_SM75: call {{.*}} @llvm.nvvm.wmma.m8n8k32.mma.row.col.u4		// CHECK_PTX63_SM75: call {{.*}} @llvm.nvvm.wmma.m8n8k32.mma.row.col.u4
// expected-error-re@+1 {{'__imma_m8n8k32_mma_u4' needs target feature (sm_75{{.}},(ptx63{{.}}}}		// expected-error-re@+1 {{'__imma_m8n8k32_mma_u4' needs target feature (sm_75{{.}},(ptx63{{.}}}}
__imma_m8n8k32_mma_u4(dst, src, src, src, 1, 0);		__imma_m8n8k32_mma_u4(dst, src, src, src, 1, 0);
// CHECK_PTX63_SM75: call {{.*}} @llvm.nvvm.wmma.m8n8k32.mma.row.col.u4.satfinite		// CHECK_PTX63_SM75: call {{.*}} @llvm.nvvm.wmma.m8n8k32.mma.row.col.u4.satfinite
// expected-error-re@+1 {{'__imma_m8n8k32_mma_u4' needs target feature (sm_75{{.}},(ptx63{{.}}}}		// expected-error-re@+1 {{'__imma_m8n8k32_mma_u4' needs target feature (sm_75{{.}},(ptx63{{.}}}}
__imma_m8n8k32_mma_u4(dst, src, src, src, 1, 1);		__imma_m8n8k32_mma_u4(dst, src, src, src, 1, 1);
#endif // (PTX >= 63) && (SM >= 75)		#endif // (PTX >= 63) && (SM >= 75)

		#if (PTX >= 70) && (SM >= 80)

		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k16.load.a.col.stride.bf16
		// expected-error-re@+1 {{'__mma_bf16_m16n16k16_ld_a' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m16n16k16_ld_a(dst, src, ldm, 1);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k16.load.a.row.stride.bf16
		// expected-error-re@+1 {{'__mma_bf16_m16n16k16_ld_a' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m16n16k16_ld_a(dst, src, ldm, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k16.load.b.col.stride.bf16
		// expected-error-re@+1 {{'__mma_bf16_m16n16k16_ld_b' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m16n16k16_ld_b(dst, src, ldm, 1);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k16.load.b.row.stride.bf16
		// expected-error-re@+1 {{'__mma_bf16_m16n16k16_ld_b' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m16n16k16_ld_b(dst, src, ldm, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.load.a.col.stride.tf32
		// expected-error-re@+1 {{'__mma_tf32_m16n16k8_ld_a' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_tf32_m16n16k8_ld_a(dst, src, ldm, 1);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.load.a.row.stride.tf32
		// expected-error-re@+1 {{'__mma_tf32_m16n16k8_ld_a' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_tf32_m16n16k8_ld_a(dst, src, ldm, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.load.b.col.stride.tf32
		// expected-error-re@+1 {{'__mma_tf32_m16n16k8_ld_b' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_tf32_m16n16k8_ld_b(dst, src, ldm, 1);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.load.b.row.stride.tf32
		// expected-error-re@+1 {{'__mma_tf32_m16n16k8_ld_b' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_tf32_m16n16k8_ld_b(dst, src, ldm, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.load.c.col.stride.f32
		// expected-error-re@+1 {{'__mma_tf32_m16n16k8_ld_c' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_tf32_m16n16k8_ld_c(fdst, fsrc, ldm, 1);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.load.c.row.stride.f32
		// expected-error-re@+1 {{'__mma_tf32_m16n16k8_ld_c' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_tf32_m16n16k8_ld_c(fdst, fsrc, ldm, 0);
		tra: This looks rather odd. We're calling a `tf32` builtin, but expect to see an `f32` load intrinsic. Is that expected?
		tra: Never mind. I think I understand what's going on now. CUDA headers use `__mma_tf32` builtins. `A` and `B` operate on opaque integer types. `C` and `D` operate on floats. However, on the PTX front we have `wmma.load.{a,b}...tf32` but `wmma.load.c...f32`. I guess it does make sense to keep LLVM intrinsic names close to the instructions they produce.
		steffenlarsen (author): Yeah, it was definitely confusing to write. I think the current state is the best solution, as it prioritizes consistency within the sub-projects. I'm not a big fan of the inconsistency, but if we want to follow CUDA's example I suppose we're stuck with this.
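The naming split discussed in the thread above can be sketched in Python, the language of the test generator. This is only an illustration and not code from the patch: `M16N16K8_FRAG_TYPES` and `ldst_intrinsic_name` are hypothetical names, and the template string is modelled on `intrinsic_template` in `builtins-nvptx-mma.py`.

```python
from string import Template

# Hypothetical table (not from the patch): the PTX element type that each
# m16n16k8 fragment carries. A and B are tf32; C and D are plain f32.
M16N16K8_FRAG_TYPES = {"a": "tf32", "b": "tf32", "c": "f32", "d": "f32"}

# Same shape as intrinsic_template in builtins-nvptx-mma.py.
INTRINSIC_TEMPLATE = Template(
    "llvm.nvvm.wmma.${geom}.${op}.${frag}.${ilayout}.stride.${itype}")


def ldst_intrinsic_name(frag, layout, geom="m16n16k8"):
    """Return the load/store intrinsic name for one fragment and layout."""
    return INTRINSIC_TEMPLATE.substitute(
        geom=geom,
        # Only the D fragment is stored; A, B and C are loaded.
        op="store" if frag == "d" else "load",
        frag=frag,
        ilayout=layout,
        itype=M16N16K8_FRAG_TYPES[frag])


if __name__ == "__main__":
    for frag in ["a", "b", "c", "d"]:
        print(ldst_intrinsic_name(frag, "col"))
```

Under these assumptions, the `a` and `b` fragments yield `...tf32` names while `c` and `d` yield `...f32` names, which is the pattern the `CHECK_PTX70_SM80` lines for `__mma_tf32_m16n16k8_ld_a`, `__mma_tf32_m16n16k8_ld_c` and `__mma_m16n16k8_st_c_f32` expect.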
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.store.d.col.stride.f32
		// expected-error-re@+1 {{'__mma_m16n16k8_st_c_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_m16n16k8_st_c_f32(fdst, fsrc, ldm, 1);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.store.d.row.stride.f32
		// expected-error-re@+1 {{'__mma_m16n16k8_st_c_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_m16n16k8_st_c_f32(fdst, fsrc, ldm, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m32n8k16.load.a.col.stride.bf16
		// expected-error-re@+1 {{'__mma_bf16_m32n8k16_ld_a' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m32n8k16_ld_a(dst, src, ldm, 1);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m32n8k16.load.a.row.stride.bf16
		// expected-error-re@+1 {{'__mma_bf16_m32n8k16_ld_a' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m32n8k16_ld_a(dst, src, ldm, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m32n8k16.load.b.col.stride.bf16
		// expected-error-re@+1 {{'__mma_bf16_m32n8k16_ld_b' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m32n8k16_ld_b(dst, src, ldm, 1);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m32n8k16.load.b.row.stride.bf16
		// expected-error-re@+1 {{'__mma_bf16_m32n8k16_ld_b' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m32n8k16_ld_b(dst, src, ldm, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n32k16.load.a.col.stride.bf16
		// expected-error-re@+1 {{'__mma_bf16_m8n32k16_ld_a' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m8n32k16_ld_a(dst, src, ldm, 1);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n32k16.load.a.row.stride.bf16
		// expected-error-re@+1 {{'__mma_bf16_m8n32k16_ld_a' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m8n32k16_ld_a(dst, src, ldm, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n32k16.load.b.col.stride.bf16
		// expected-error-re@+1 {{'__mma_bf16_m8n32k16_ld_b' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m8n32k16_ld_b(dst, src, ldm, 1);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n32k16.load.b.row.stride.bf16
		// expected-error-re@+1 {{'__mma_bf16_m8n32k16_ld_b' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m8n32k16_ld_b(dst, src, ldm, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n8k4.load.a.col.stride.f64
		// expected-error-re@+1 {{'__dmma_m8n8k4_ld_a' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__dmma_m8n8k4_ld_a(ddst, dsrc, ldm, 1);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n8k4.load.a.row.stride.f64
		// expected-error-re@+1 {{'__dmma_m8n8k4_ld_a' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__dmma_m8n8k4_ld_a(ddst, dsrc, ldm, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n8k4.load.b.col.stride.f64
		// expected-error-re@+1 {{'__dmma_m8n8k4_ld_b' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__dmma_m8n8k4_ld_b(ddst, dsrc, ldm, 1);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n8k4.load.b.row.stride.f64
		// expected-error-re@+1 {{'__dmma_m8n8k4_ld_b' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__dmma_m8n8k4_ld_b(ddst, dsrc, ldm, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n8k4.load.c.col.stride.f64
		// expected-error-re@+1 {{'__dmma_m8n8k4_ld_c' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__dmma_m8n8k4_ld_c(ddst, dsrc, ldm, 1);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n8k4.load.c.row.stride.f64
		// expected-error-re@+1 {{'__dmma_m8n8k4_ld_c' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__dmma_m8n8k4_ld_c(ddst, dsrc, ldm, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n8k4.store.d.col.stride.f64
		// expected-error-re@+1 {{'__dmma_m8n8k4_st_c_f64' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__dmma_m8n8k4_st_c_f64(ddst, dsrc, ldm, 1);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n8k4.store.d.row.stride.f64
		// expected-error-re@+1 {{'__dmma_m8n8k4_st_c_f64' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__dmma_m8n8k4_st_c_f64(ddst, dsrc, ldm, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k16.mma.col.col.bf16
		// expected-error-re@+1 {{'__mma_bf16_m16n16k16_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m16n16k16_mma_f32(fdst, src, src, fsrc, 3, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k16.mma.col.row.bf16
		// expected-error-re@+1 {{'__mma_bf16_m16n16k16_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m16n16k16_mma_f32(fdst, src, src, fsrc, 2, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k16.mma.row.col.bf16
		// expected-error-re@+1 {{'__mma_bf16_m16n16k16_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m16n16k16_mma_f32(fdst, src, src, fsrc, 1, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k16.mma.row.row.bf16
		// expected-error-re@+1 {{'__mma_bf16_m16n16k16_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m16n16k16_mma_f32(fdst, src, src, fsrc, 0, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.mma.col.col.tf32
		// expected-error-re@+1 {{'__mma_tf32_m16n16k8_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_tf32_m16n16k8_mma_f32(fdst, src, src, fsrc, 3, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.mma.col.row.tf32
		// expected-error-re@+1 {{'__mma_tf32_m16n16k8_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_tf32_m16n16k8_mma_f32(fdst, src, src, fsrc, 2, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.mma.row.col.tf32
		// expected-error-re@+1 {{'__mma_tf32_m16n16k8_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_tf32_m16n16k8_mma_f32(fdst, src, src, fsrc, 1, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m16n16k8.mma.row.row.tf32
		// expected-error-re@+1 {{'__mma_tf32_m16n16k8_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_tf32_m16n16k8_mma_f32(fdst, src, src, fsrc, 0, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m32n8k16.mma.col.col.bf16
		// expected-error-re@+1 {{'__mma_bf16_m32n8k16_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m32n8k16_mma_f32(fdst, src, src, fsrc, 3, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m32n8k16.mma.col.row.bf16
		// expected-error-re@+1 {{'__mma_bf16_m32n8k16_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m32n8k16_mma_f32(fdst, src, src, fsrc, 2, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m32n8k16.mma.row.col.bf16
		// expected-error-re@+1 {{'__mma_bf16_m32n8k16_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m32n8k16_mma_f32(fdst, src, src, fsrc, 1, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m32n8k16.mma.row.row.bf16
		// expected-error-re@+1 {{'__mma_bf16_m32n8k16_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m32n8k16_mma_f32(fdst, src, src, fsrc, 0, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n32k16.mma.col.col.bf16
		// expected-error-re@+1 {{'__mma_bf16_m8n32k16_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m8n32k16_mma_f32(fdst, src, src, fsrc, 3, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n32k16.mma.col.row.bf16
		// expected-error-re@+1 {{'__mma_bf16_m8n32k16_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m8n32k16_mma_f32(fdst, src, src, fsrc, 2, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n32k16.mma.row.col.bf16
		// expected-error-re@+1 {{'__mma_bf16_m8n32k16_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m8n32k16_mma_f32(fdst, src, src, fsrc, 1, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n32k16.mma.row.row.bf16
		// expected-error-re@+1 {{'__mma_bf16_m8n32k16_mma_f32' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__mma_bf16_m8n32k16_mma_f32(fdst, src, src, fsrc, 0, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n8k4.mma.col.col.f64
		// expected-error-re@+1 {{'__dmma_m8n8k4_mma_f64' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__dmma_m8n8k4_mma_f64(ddst, dsrc, dsrc, dsrc, 3, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n8k4.mma.col.row.f64
		// expected-error-re@+1 {{'__dmma_m8n8k4_mma_f64' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__dmma_m8n8k4_mma_f64(ddst, dsrc, dsrc, dsrc, 2, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n8k4.mma.row.col.f64
		// expected-error-re@+1 {{'__dmma_m8n8k4_mma_f64' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__dmma_m8n8k4_mma_f64(ddst, dsrc, dsrc, dsrc, 1, 0);
		// CHECK_PTX70_SM80: call {{.*}} @llvm.nvvm.wmma.m8n8k4.mma.row.row.f64
		// expected-error-re@+1 {{'__dmma_m8n8k4_mma_f64' needs target feature (sm_80{{.}},(ptx70{{.}}}}
		__dmma_m8n8k4_mma_f64(ddst, dsrc, dsrc, dsrc, 0, 0);
		#endif // (PTX >= 70) && (SM >= 80)
}		}

clang/test/CodeGen/builtins-nvptx-mma.py

(41 lines not shown)	for type_b, type_d in product(types_b if types_b else [type_a],
MMAFrag(geom, "d", type_d)))		MMAFrag(geom, "d", type_d)))
return ops		return ops

def make_ldst_ops(geoms, frags, types):		def make_ldst_ops(geoms, frags, types):
return [MMAFrag(geom, frag, ptx_type) for (geom, frag, ptx_type)		return [MMAFrag(geom, frag, ptx_type) for (geom, frag, ptx_type)
in product(geoms, frags, types)]		in product(geoms, frags, types)]

def get_mma_ops():		def get_mma_ops():
return (make_mma_ops(["m16n16k16", "m32n8k16", "m8n32k16"],		return (make_mma_ops(["m16n16k8"],
		["tf32"], [], ["f32"], []) +
		make_mma_ops(["m16n16k16", "m32n8k16", "m8n32k16"],
		["bf16"], [], ["f32"], []) +
		make_mma_ops(["m8n8k4"],
		["f64"], [], ["f64"], []) +
		make_mma_ops(["m16n16k16", "m32n8k16", "m8n32k16"],
["f16"], [], ["f16", "f32"], ["f16", "f32"]) +		["f16"], [], ["f16", "f32"], ["f16", "f32"]) +
make_mma_ops(["m16n16k16", "m32n8k16", "m8n32k16"],		make_mma_ops(["m16n16k16", "m32n8k16", "m8n32k16"],
["s8", "u8"], [], ["s32"], []) +		["s8", "u8"], [], ["s32"], []) +
make_mma_ops(["m8n8k32"],		make_mma_ops(["m8n8k32"],
["s4", "u4"], [], ["s32"], []) +		["s4", "u4"], [], ["s32"], []) +
make_mma_ops(["m8n8k128"],		make_mma_ops(["m8n8k128"],
["b1"], [], ["s32"], []))		["b1"], [], ["s32"], []))

def get_ldst_ops():		def get_ldst_ops():
return (make_ldst_ops(["m16n16k16", "m32n8k16", "m8n32k16"],		return (make_ldst_ops(["m16n16k16", "m32n8k16", "m8n32k16"],
["a", "b"], ["f16", "u8", "s8"]) +		["a", "b"], ["f16", "u8", "s8", "bf16"]) +
make_ldst_ops(["m16n16k16", "m32n8k16", "m8n32k16"],		make_ldst_ops(["m16n16k16", "m32n8k16", "m8n32k16"],
["c", "d"], ["f16", "f32", "s32"]) +		["c", "d"], ["f16", "f32", "s32"]) +
make_ldst_ops(["m8n8k32"], ["a", "b"], ["s4","u4"]) +		make_ldst_ops(["m8n8k32"], ["a", "b"], ["s4","u4"]) +
make_ldst_ops(["m8n8k128"], ["a", "b"], ["b1"]) +		make_ldst_ops(["m8n8k128"], ["a", "b"], ["b1"]) +
make_ldst_ops(["m8n8k32", "m8n8k128"], ["c", "d"], ["s32"]))		make_ldst_ops(["m8n8k32", "m8n8k128"], ["c", "d"], ["s32"]) +
		make_ldst_ops(["m8n8k4"], ["a", "b", "c", "d"], ["f64"]) +
		make_ldst_ops(["m16n16k8"], ["a", "b"], ["tf32"]) +
		tra: This does not seem to match the generated `builtins-nvptx-mma.cu`, which does have `__mma_tf32_m16n16k8_ld_c`. If I regenerate the test I see a somewhat different set of tests, possibly related to the oddity I've pointed out in the generated test changes in this patch.
		steffenlarsen (author): You are absolutely right. That's a mistake on my part. Looks like you've got it under control in your patch. Thanks!
		make_ldst_ops(["m16n16k8"], ["c", "d"], ["f32"]))

def is_geom_supported(geom):		def is_geom_supported(geom):
# geometries for FP and ints.		# geometries for FP and ints.
if geom in ["m8n32k16", "m32n8k16"]:		if geom in ["m8n32k16", "m32n8k16"]:
return ptx_version >= 61		return ptx_version >= 61
# geometries for sub-ints.		# geometries for sub-ints.
if geom in ["m8n8k32", "m8n8k128"]:		if geom in ["m8n8k32", "m8n8k128"]:
return ptx_version >= 63 and gpu_arch >= 75		return ptx_version >= 63 and gpu_arch >= 75
if geom == "m16n16k16":		if geom == "m16n16k16":
return ptx_version >= 60		return ptx_version >= 60
		if geom in ["m16n16k8", "m8n8k4"]:
		return ptx_version >= 70 and gpu_arch >= 80
assert(False) # Unexpected geometry.		assert(False) # Unexpected geometry.

def is_type_supported(ptx_type):		def is_type_supported(ptx_type):
if ptx_type in ["s8", "u8", "s32"]:		if ptx_type in ["s8", "u8", "s32"]:
return ptx_version >= 63 and gpu_arch >= 72		return ptx_version >= 63 and gpu_arch >= 72
if ptx_type in ["s4", "u4", "b1"]:		if ptx_type in ["s4", "u4", "b1"]:
return ptx_version >= 63 and gpu_arch >= 75		return ptx_version >= 63 and gpu_arch >= 75
		if ptx_type in ["bf16", "tf32", "f64"]:
		return ptx_version >= 70 and gpu_arch >= 80
return ptx_version >= 60 and gpu_arch >= 70		return ptx_version >= 60 and gpu_arch >= 70

		def is_rnd_supported(op):
		# rnd is only supported for FP64 WMMA
		return op.a.ptx_type == "f64"

def is_mma_variant_supported(op, layout_a, layout_b, satf):		def is_mma_variant_supported(op, layout_a, layout_b, satf):
if not (is_type_supported(op.a.ptx_type)		if not (is_type_supported(op.a.ptx_type)
and is_geom_supported(op.a.geom)):		and is_geom_supported(op.a.geom)):
return False		return False
# sub-integer require row/col layout, and no satf.
if op.a.ptx_type in ["s4", "u4", "b1"]:		if satf and not op.a.ptx_type in ["f16", "s8", "u8", "s4", "u4"]:
if op.a.ptx_type == "b1" and satf:
return False		return False

		# sub-integer types require row/col layout.
		tra: Typo in the original code: `sub-integers` or `sub-integer types`.
		if op.a.ptx_type in ["s4", "u4", "b1"]:
return layout_a == "row" and layout_b == "col"		return layout_a == "row" and layout_b == "col"
return True		return True

def is_ldst_variant_supported(frag, layout):		def is_ldst_variant_supported(frag, layout):
if not (is_type_supported(frag.ptx_type)		if not (is_type_supported(frag.ptx_type)
and is_geom_supported(frag.geom)):		and is_geom_supported(frag.geom)):
return False		return False
if frag.ptx_type in ["s4", "u4", "b1"]:		if frag.ptx_type in ["s4", "u4", "b1"]:
# sub-integer require sm_75 and ptx63, row/col layout for a/b.		# sub-integer types require sm_75 and ptx63, row/col layout for a/b.
return ((frag.frag == "a" and layout == "row")		return ((frag.frag == "a" and layout == "row")
or (frag.frag == "b" and layout == "col")		or (frag.frag == "b" and layout == "col")
or frag.frag in ["c", "d"])		or frag.frag in ["c", "d"])
return True		return True

def get_builtin_prefix(frag):		def get_builtin_prefix(frag):
prefix = None		prefix = None
if frag.geom in ["m16n16k16", "m32n8k16", "m8n32k16"]:		if frag.geom in ["m16n16k16", "m32n8k16", "m8n32k16"]:
if frag.ptx_type in ["f16", "f32"]:		if frag.ptx_type in ["f16", "f32"]:
prefix = "__hmma"		prefix = "__hmma"
		elif frag.ptx_type == "bf16":
		prefix = "__mma_bf16"
else:		else:
prefix = "__imma"		prefix = "__imma"
elif frag.geom == "m8n8k32":		elif frag.geom == "m8n8k32":
prefix = "__imma" # sub-integers		prefix = "__imma" # sub-integers
elif frag.geom == "m8n8k128":		elif frag.geom == "m8n8k128":
prefix = "__bmma"		prefix = "__bmma"
		elif frag.geom == "m8n8k4":
		prefix = "__dmma"
		elif frag.geom == "m16n16k8":
		if frag.ptx_type == "f32":
		prefix = "__mma"
		else:
		prefix = "__mma_tf32"
		tra: It's not obvious why frag `d` is `__mma` and not `__mma_tf32`. Can we use `frag.ptx_type` to make that decision?
		steffenlarsen (author): We absolutely can. I don't know why that wasn't my first solution.
assert prefix		assert prefix
return prefix		return prefix

def get_ldst_builtin_name(frag):		def get_ldst_builtin_name(frag):
prefix = get_builtin_prefix(frag)		prefix = get_builtin_prefix(frag)

if prefix == "__hmma":		if prefix == "__hmma":
suffix = "" if frag.frag in ["a","b"] else frag.ptx_type		suffix = "" if frag.frag in ["a","b"] else frag.ptx_type
elif prefix in ["__imma", "__bmma"]:		elif prefix in ["__dmma", "__mma_bf16", "__mma_tf32"]:
suffix = "" if frag.frag in ["c"] else frag.ptx_type		suffix = "" if frag.frag in ["a","b","c"] else frag.ptx_type
		else:
		suffix = "" if frag.frag == "c" else frag.ptx_type
if suffix == "s32":		if suffix == "s32":
suffix = "i32"		suffix = "i32"

if frag.frag == "d":		if frag.frag == "d":
ifrag = "c"		ifrag = "c"
op = "st"		op = "st"
else:		else:
ifrag = frag.frag		ifrag = frag.frag
op = "ld"		op = "ld"

name = "%s_%s_%s_%s%s" % (prefix, frag.geom, op, ifrag,		name = "%s_%s_%s_%s%s" % (prefix, frag.geom, op, ifrag,
"_" + suffix if suffix else "")		"_" + suffix if suffix else "")
return name		return name

def get_mma_builtin_name(op):		def get_mma_builtin_name(op):
prefix = get_builtin_prefix(op.a)		prefix = get_builtin_prefix(op.a)

if prefix == "__hmma":		if prefix == "__hmma":
suffix = op.d.ptx_type + op.c.ptx_type		suffix = op.d.ptx_type + op.c.ptx_type
		elif prefix in ["__mma_bf16", "__mma_tf32"]:
		suffix = op.d.ptx_type
else:		else:
suffix = op.a.ptx_type		suffix = op.a.ptx_type

name = "%s_%s_mma%s_%s" % (prefix, op.a.geom,		name = "%s_%s_mma%s_%s" % (prefix, op.a.geom,
"_xor_popc" if op.a.ptx_type == "b1" else "",		"_xor_popc" if op.a.ptx_type == "b1" else "",
suffix)		suffix)
return name		return name


def get_required_sm(frag):		def get_required_sm(frag):
		if frag.ptx_type in ["f64", "bf16", "tf32"]:
		return 80
if frag.ptx_type in ["u4", "s4", "b1"]:		if frag.ptx_type in ["u4", "s4", "b1"]:
return 75		return 75
if frag.ptx_type in ["s8", "u8"]:		if frag.ptx_type in ["s8", "u8"]:
return 72		return 72
if frag.ptx_type == "s32":		if frag.ptx_type == "s32":
if frag.geom in ["m8n8k32", "m8n8k128"]: # s4/u4/b1		if frag.geom in ["m8n8k32", "m8n8k128"]: # s4/u4/b1
return 75		return 75
else: # s8/u8		else: # s8/u8
return 72		return 72
if frag.ptx_type in ["f16", "f32"]:		if frag.ptx_type in ["f16", "f32"]:
		if frag.geom == "m16n16k8":
		return 80
		else:
return 70		return 70
assert(False)		assert(False)

def get_required_ptx(frag):		def get_required_ptx(frag):
		if frag.ptx_type in ["f64", "bf16", "tf32"]:
		return 70
if frag.ptx_type in ["f16", "f32"]:		if frag.ptx_type in ["f16", "f32"]:
return 60 if frag.geom == "m16n16k16" else 61		if frag.geom == "m16n16k16":
		return 60
		if frag.geom == "m16n16k8":
		return 70
		return 61
return 63		return 63

		def get_src_dst_prefix(ptx_type):
		if ptx_type == "f32":
		return "f"
		if ptx_type == "f64":
		return "d"
		return ""

def gen_wmma_ldst_tests(results):		def gen_wmma_ldst_tests(results):
load_template = """		load_template = """
// CHECK${check_suffix}: call {{.*}} @${intrinsic}		// CHECK${check_suffix}: call {{.*}} @${intrinsic}
// expected-error-re@+1 {{'${builtin}' needs target feature sm_${min_sm}{{.}},ptx${min_ptx}{{.}}}}		// expected-error-re@+1 {{'${builtin}' needs target feature (sm_${min_sm}{{.}},(ptx${min_ptx}{{.}}}}
${builtin}(${dst}, ${src}, ldm, ${blayout});		${builtin}(${dst}, ${src}, ldm, ${blayout});
""".rstrip()		""".rstrip()
intrinsic_template = "llvm.nvvm.wmma.${geom}.${op}.${frag}.${ilayout}.stride.${itype}"		intrinsic_template = "llvm.nvvm.wmma.${geom}.${op}.${frag}.${ilayout}.stride.${itype}"

for frag, layout in sorted(product(get_ldst_ops(), ["row","col"]), key=str):		for frag, layout in sorted(product(get_ldst_ops(), ["row","col"]), key=str):

if not is_ldst_variant_supported(frag, layout):		if not is_ldst_variant_supported(frag, layout):
continue		continue

is_fp = frag.ptx_type == "f32"		src_dst_prefix = get_src_dst_prefix(frag.ptx_type)
min_sm = get_required_sm(frag)		min_sm = get_required_sm(frag)
min_ptx = get_required_ptx(frag)		min_ptx = get_required_ptx(frag)
params = {		params = {
"check_suffix" : "_PTX%d_SM%d" % (min_ptx, min_sm),		"check_suffix" : "_PTX%d_SM%d" % (min_ptx, min_sm),
"builtin" : get_ldst_builtin_name(frag),		"builtin" : get_ldst_builtin_name(frag),
"min_ptx" : min_ptx,		"min_ptx" : min_ptx,
"min_sm" : min_sm,		"min_sm" : min_sm,
"dst": "fdst" if is_fp else "dst",		"dst": src_dst_prefix + "dst",
"src": "fsrc" if is_fp else "src",		"src": src_dst_prefix + "src",
"blayout" : 0 if layout == "row" else 1,		"blayout" : 0 if layout == "row" else 1,
"intrinsic" : Template(intrinsic_template).substitute({		"intrinsic" : Template(intrinsic_template).substitute({
"frag" : frag.frag,		"frag" : frag.frag,
"geom" : frag.geom,		"geom" : frag.geom,
"ilayout" : layout,		"ilayout" : layout,
"itype" : frag.ptx_type,		"itype" : frag.ptx_type,
"op" : "store" if frag.frag == "d" else "load",		"op" : "store" if frag.frag == "d" else "load",
})		})
}		}
results[(min_ptx,min_sm)] += Template(load_template).substitute(params)		results[(min_ptx,min_sm)] += Template(load_template).substitute(params)

return results		return results

def mma_signature(op):		def mma_signature(op):
if op.a.ptx_type in ["s8", "u8", "s4", "u4", "b1"]:		if op.a.ptx_type == "f16":
# int and sub-int ops are identified by input type.		# FP16 ops identified by accumulator & result type.
return op.a.ptx_type
else:
# the rest are FP ops identified by accumulator & result type.
return "%s.%s" % (op.d.ptx_type, op.c.ptx_type)		return "%s.%s" % (op.d.ptx_type, op.c.ptx_type)
		else:
		# other ops are identified by input type.
		return op.a.ptx_type

# Get numeric value for rowcol parameter of the builtin		# Get numeric value for rowcol parameter of the builtin
# AFAICT it uses the encoding accepted by NVVM intrinsics:		# AFAICT it uses the encoding accepted by NVVM intrinsics:
# https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#nvvm-intrin-warp-level-matrix-mma		# https://docs.nvidia.com/cuda/nvvm-ir-spec/index.html#nvvm-intrin-warp-level-matrix-mma
def get_ilayout(a, b):		def get_ilayout(a, b):
return {		return {
"row.row" : 0,		"row.row" : 0,
"row.col" : 1,		"row.col" : 1,
"col.row" : 2,		"col.row" : 2,
"col.col" : 3		"col.col" : 3
}[a + "." + b]		}[a + "." + b]
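To make the naming scheme concrete, here is a small self-contained sketch of how an intrinsic name and the numeric layout argument could be derived for one WMMA MMA variant. It reuses the template string and the `get_ilayout` table from the script; the wrapper `wmma_mma_intrinsic` and the sample arguments are assumptions made for illustration.

```python
from string import Template

# Same template string as intrinsic_template in gen_wmma_mma_tests.
INTRINSIC_TEMPLATE = ("llvm.nvvm.wmma.${geom}.mma.${alayout}.${blayout}"
                      ".${intrinsic_signature}${satf}")

def get_ilayout(a, b):
    # Numeric rowcol encoding passed to the builtin.
    return {
        "row.row": 0,
        "row.col": 1,
        "col.row": 2,
        "col.col": 3,
    }[a + "." + b]

def wmma_mma_intrinsic(geom, alayout, blayout, signature, satf):
    # Hypothetical wrapper around the Template substitution.
    return Template(INTRINSIC_TEMPLATE).substitute({
        "geom": geom,
        "alayout": alayout,
        "blayout": blayout,
        "intrinsic_signature": signature,
        "satf": satf,
    })

# f16 inputs with an f16 accumulator and f32 result: signature is "<d>.<c>".
name = wmma_mma_intrinsic("m16n16k16", "row", "col", "f32.f16", ".satfinite")
layout_arg = get_ilayout("row", "col")
```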

def gen_wmma_mma_tests(results):		def gen_wmma_mma_tests(results):
mma_template = """		mma_template = """
// CHECK${check_suffix}: call {{.*}} @${intrinsic}		// CHECK${check_suffix}: call {{.*}} @${intrinsic}
// expected-error-re@+1 {{'${builtin}' needs target feature sm_${min_sm}{{.}},ptx${min_ptx}{{.}}}}		// expected-error-re@+1 {{'${builtin}' needs target feature (sm_${min_sm}{{.}},(ptx${min_ptx}{{.}}}}
${builtin}(${dst}, ${asrc}, ${asrc}, ${csrc}, ${ilayout}${maybe_isatf});		${builtin}(${dst}, ${asrc}, ${asrc}, ${csrc}, ${ilayout}${maybe_satf});
""".rstrip()		""".rstrip()
intrinsic_template = "llvm.nvvm.wmma.${geom}.mma.${alayout}.${blayout}.${intrinsic_signature}${satf}"		intrinsic_template = "llvm.nvvm.wmma.${geom}.mma.${alayout}.${blayout}.${intrinsic_signature}${satf}"

for op, alayout, blayout, satf in sorted(product( get_mma_ops(),		for op, alayout, blayout, satf in sorted(product( get_mma_ops(),
["row","col"],		["row","col"],
["row","col"],		["row","col"],
[".satfinite", ""]),		[".satfinite", ""]),
key=str):		key=str):

if not is_mma_variant_supported(op, alayout, blayout, satf):		if not is_mma_variant_supported(op, alayout, blayout, satf):
continue		continue

a_is_fp = op.a.ptx_type == "f32"		asrc_prefix = get_src_dst_prefix(op.a.ptx_type)
c_is_fp = op.c.ptx_type == "f32"		csrc_prefix = get_src_dst_prefix(op.c.ptx_type)
d_is_fp = op.d.ptx_type == "f32"		ddst_prefix = get_src_dst_prefix(op.d.ptx_type)
min_sm = get_required_sm(op.a)		min_sm = get_required_sm(op.a)
min_ptx = get_required_ptx(op.a)		min_ptx = get_required_ptx(op.a)
if op.a.ptx_type == "b1": # .b1 MMA has no satf argument.		if op.a.ptx_type == "b1": # .b1 MMA has no satf argument.
isatf_arg = ""		isatf_arg = ""
else:		else:
isatf_arg = ", 1" if satf else ", 0"		isatf_arg = ", 1" if satf else ", 0"
params = {		params = {
"check_suffix" : "_PTX%d_SM%d" % (min_ptx, min_sm),		"check_suffix" : "_PTX%d_SM%d" % (min_ptx, min_sm),
"builtin" : get_mma_builtin_name(op),		"builtin" : get_mma_builtin_name(op),
"min_ptx" : min_ptx,		"min_ptx" : min_ptx,
"min_sm" : min_sm,		"min_sm" : min_sm,
"dst": "fdst" if d_is_fp else "dst",		"dst": ddst_prefix + "dst",
"asrc": "fsrc" if a_is_fp else "src",		"asrc": asrc_prefix + "src",
"csrc": "fsrc" if c_is_fp else "src",		"csrc": csrc_prefix + "src",
"ilayout" : get_ilayout(alayout, blayout),		"ilayout" : get_ilayout(alayout, blayout),
"maybe_isatf" : isatf_arg,		"maybe_satf" : isatf_arg,
"intrinsic" : Template(intrinsic_template).substitute({		"intrinsic" : Template(intrinsic_template).substitute({
"geom" : op.a.geom,		"geom" : op.a.geom,
"alayout" : alayout,		"alayout" : alayout,
"blayout" : blayout,		"blayout" : blayout,
"intrinsic_signature" : mma_signature(op),		"intrinsic_signature" : mma_signature(op),
"satf" : satf,		"satf" : satf,
})		})
}		}
#define __shared__ __attribute__((shared))		#define __shared__ __attribute__((shared))
#define __constant__ __attribute__((constant))		#define __constant__ __attribute__((constant))

typedef unsigned long long uint64_t;		typedef unsigned long long uint64_t;
#endif		#endif

// CHECK-LABEL: test_wmma_buitins		// CHECK-LABEL: test_wmma_buitins
__device__ void test_wmma_buitins(int src, int dst,		__device__ void test_wmma_buitins(int src, int dst,
float fsrc, float fdst, int ldm) {		float fsrc, float fdst,
		double dsrc, double ddst, int ldm) {
""");		""");

for (ptx, sm), tests in sorted(results.items()):		for (ptx, sm), tests in sorted(results.items()):
print()		print()
print("#if (PTX >= %d) && (SM >= %d)" % (ptx, sm))		print("#if (PTX >= %d) && (SM >= %d)" % (ptx, sm))
print(tests)		print(tests)
print("#endif // (PTX >= %d) && (SM >= %d) "% (ptx, sm))		print("#endif // (PTX >= %d) && (SM >= %d) "% (ptx, sm))
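The grouping of generated tests by minimum PTX and SM version can be sketched in isolation as follows. This is a simplified model: it assumes `results` behaves like a `defaultdict(str)` keyed by `(ptx, sm)` tuples (suggested by the `+=` accumulation in the generator functions), and `render_guards` is a hypothetical helper that returns the text instead of printing it.

```python
from collections import defaultdict

def render_guards(results):
    # Emit one preprocessor-guarded block per (ptx, sm) bucket,
    # ordered from the lowest requirement to the highest.
    lines = []
    for (ptx, sm), tests in sorted(results.items()):
        lines.append("")
        lines.append("#if (PTX >= %d) && (SM >= %d)" % (ptx, sm))
        lines.append(tests)
        lines.append("#endif // (PTX >= %d) && (SM >= %d)" % (ptx, sm))
    return "\n".join(lines)

# Assumed container type; placeholder strings stand in for test bodies.
results = defaultdict(str)
results[(70, 80)] += "\n  // tf32 test body"
results[(60, 70)] += "\n  // f16 test body"
output = render_guards(results)
```

Sorting the dictionary items keeps the emitted test file stable between runs, with the least demanding configuration first.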


llvm/include/llvm/IR/IntrinsicsNVVM.td

// PtxEltType: PTX type for the element.		// PtxEltType: PTX type for the element.
class WMMA_REGS<string Geom, string Frag, string PtxEltType> {		class WMMA_REGS<string Geom, string Frag, string PtxEltType> {
string geom = Geom;		string geom = Geom;
string frag = Frag;		string frag = Frag;
string ptx_elt_type = PtxEltType;		string ptx_elt_type = PtxEltType;
string gft = Geom#":"#Frag#":"#ptx_elt_type;		string gft = Geom#":"#Frag#":"#ptx_elt_type;
string ft = frag#":"#ptx_elt_type;		string ft = frag#":"#ptx_elt_type;
list<LLVMType> regs = !cond(		list<LLVMType> regs = !cond(
// mma.sync.m8n8k4 uses smaller a/b fragments than wmma fp ops		// mma fp ops use smaller fragments than wmma fp ops
		tra: Nit: I'd drop `some`.
!eq(gft,"m8n8k4:a:f16") : !listsplat(llvm_v2f16_ty, 2),		!eq(gft,"m8n8k4:a:f16") : !listsplat(llvm_v2f16_ty, 2),
!eq(gft,"m8n8k4:b:f16") : !listsplat(llvm_v2f16_ty, 2),		!eq(gft,"m8n8k4:b:f16") : !listsplat(llvm_v2f16_ty, 2),
		!eq(gft,"m16n8k8:a:f16") : !listsplat(llvm_v2f16_ty, 2),
// fp16 -> fp16/fp32 @ m16n16k16/m8n32k16/m32n8k16		!eq(gft,"m16n8k8:b:f16") : [llvm_v2f16_ty],
// All currently supported geometries use the same fragment format,		!eq(gft,"m16n8k8:c:f16") : !listsplat(llvm_v2f16_ty, 2),
// so we only need to consider {fragment, type}.		!eq(gft,"m16n8k8:d:f16") : !listsplat(llvm_v2f16_ty, 2),
		!eq(gft,"m16n8k8:c:f32") : !listsplat(llvm_float_ty, 4),
		!eq(gft,"m16n8k8:d:f32") : !listsplat(llvm_float_ty, 4),
		!eq(gft,"m16n8k16:a:f16") : !listsplat(llvm_v2f16_ty, 4),
		!eq(gft,"m16n8k16:b:f16") : !listsplat(llvm_v2f16_ty, 2),
		!eq(gft,"m16n8k16:c:f16") : !listsplat(llvm_v2f16_ty, 2),
		!eq(gft,"m16n8k16:d:f16") : !listsplat(llvm_v2f16_ty, 2),
		!eq(gft,"m16n8k16:c:f32") : !listsplat(llvm_float_ty, 4),
		!eq(gft,"m16n8k16:d:f32") : !listsplat(llvm_float_ty, 4),
		!eq(gft,"m16n8k4:c:f32") : !listsplat(llvm_float_ty, 4),
		!eq(gft,"m16n8k4:d:f32") : !listsplat(llvm_float_ty, 4),

		// wmma fp16 -> fp16/fp32 @ m16n16k16/m8n32k16/m32n8k16
		// All other supported geometries use the same fragment format for f32 and
		// f16, so we only need to consider {fragment, type}.
!eq(ft,"a:f16") : !listsplat(llvm_v2f16_ty, 8),		!eq(ft,"a:f16") : !listsplat(llvm_v2f16_ty, 8),
!eq(ft,"b:f16") : !listsplat(llvm_v2f16_ty, 8),		!eq(ft,"b:f16") : !listsplat(llvm_v2f16_ty, 8),
!eq(ft,"c:f16") : !listsplat(llvm_v2f16_ty, 4),		!eq(ft,"c:f16") : !listsplat(llvm_v2f16_ty, 4),
!eq(ft,"d:f16") : !listsplat(llvm_v2f16_ty, 4),		!eq(ft,"d:f16") : !listsplat(llvm_v2f16_ty, 4),
!eq(ft,"c:f32") : !listsplat(llvm_float_ty, 8),		!eq(ft,"c:f32") : !listsplat(llvm_float_ty, 8),
!eq(ft,"d:f32") : !listsplat(llvm_float_ty, 8),		!eq(ft,"d:f32") : !listsplat(llvm_float_ty, 8),

// u8/s8 -> s32 @ m16n16k16/m8n32k16/m32n8k16		// wmma tf32 -> s32 @ m16n16k8
		!eq(gft,"m16n16k8:a:tf32") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m16n16k8:b:tf32") : !listsplat(llvm_i32_ty, 4),

		// mma tf32 -> s32 @ m16n16k8/m16n8k8
		!eq(gft,"m16n8k4:a:tf32") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m16n8k4:b:tf32") : [llvm_i32_ty],
		!eq(gft,"m16n8k8:a:tf32") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m16n8k8:b:tf32") : !listsplat(llvm_i32_ty, 2),

		!eq(gft,"m8n8k4:a:f64") : [llvm_double_ty],
		!eq(gft,"m8n8k4:b:f64") : [llvm_double_ty],
		!eq(gft,"m8n8k4:c:f64") : !listsplat(llvm_double_ty, 2),
		!eq(gft,"m8n8k4:d:f64") : !listsplat(llvm_double_ty, 2),

		// wmma bf16 -> s32 @ m16n16k16/m8n32k16/m32n8k16
		!eq(gft,"m16n16k16:a:bf16") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m16n16k16:b:bf16") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m8n32k16:a:bf16") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m8n32k16:b:bf16") : !listsplat(llvm_i32_ty, 8),
		!eq(gft,"m32n8k16:a:bf16") : !listsplat(llvm_i32_ty, 8),
		!eq(gft,"m32n8k16:b:bf16") : !listsplat(llvm_i32_ty, 2),

		// mma bf16 -> s32 @ m16n8k16/m16n8k8
		!eq(gft,"m16n8k16:a:bf16") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m16n8k16:b:bf16") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m16n8k8:a:bf16") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m16n8k8:b:bf16") : [llvm_i32_ty],

		// wmma u8/s8 -> s32 @ m16n16k16/m8n32k16/m32n8k16
!eq(gft,"m16n16k16:a:u8") : !listsplat(llvm_i32_ty, 2),		!eq(gft,"m16n16k16:a:u8") : !listsplat(llvm_i32_ty, 2),
!eq(gft,"m16n16k16:a:s8") : !listsplat(llvm_i32_ty, 2),		!eq(gft,"m16n16k16:a:s8") : !listsplat(llvm_i32_ty, 2),
!eq(gft,"m16n16k16:b:u8") : !listsplat(llvm_i32_ty, 2),		!eq(gft,"m16n16k16:b:u8") : !listsplat(llvm_i32_ty, 2),
!eq(gft,"m16n16k16:b:s8") : !listsplat(llvm_i32_ty, 2),		!eq(gft,"m16n16k16:b:s8") : !listsplat(llvm_i32_ty, 2),
!eq(gft,"m16n16k16:c:s32") : !listsplat(llvm_i32_ty, 8),		!eq(gft,"m16n16k16:c:s32") : !listsplat(llvm_i32_ty, 8),
!eq(gft,"m16n16k16:d:s32") : !listsplat(llvm_i32_ty, 8),		!eq(gft,"m16n16k16:d:s32") : !listsplat(llvm_i32_ty, 8),

!eq(gft,"m8n32k16:a:u8") : [llvm_i32_ty],		!eq(gft,"m8n32k16:a:u8") : [llvm_i32_ty],
!eq(gft,"m8n32k16:a:s8") : [llvm_i32_ty],		!eq(gft,"m8n32k16:a:s8") : [llvm_i32_ty],
!eq(gft,"m8n32k16:b:u8") : !listsplat(llvm_i32_ty, 4),		!eq(gft,"m8n32k16:b:u8") : !listsplat(llvm_i32_ty, 4),
!eq(gft,"m8n32k16:b:s8") : !listsplat(llvm_i32_ty, 4),		!eq(gft,"m8n32k16:b:s8") : !listsplat(llvm_i32_ty, 4),
!eq(gft,"m8n32k16:c:s32") : !listsplat(llvm_i32_ty, 8),		!eq(gft,"m8n32k16:c:s32") : !listsplat(llvm_i32_ty, 8),
!eq(gft,"m8n32k16:d:s32") : !listsplat(llvm_i32_ty, 8),		!eq(gft,"m8n32k16:d:s32") : !listsplat(llvm_i32_ty, 8),

!eq(gft,"m32n8k16:a:u8") : !listsplat(llvm_i32_ty, 4),		!eq(gft,"m32n8k16:a:u8") : !listsplat(llvm_i32_ty, 4),
!eq(gft,"m32n8k16:a:s8") : !listsplat(llvm_i32_ty, 4),		!eq(gft,"m32n8k16:a:s8") : !listsplat(llvm_i32_ty, 4),
!eq(gft,"m32n8k16:b:u8") : [llvm_i32_ty],		!eq(gft,"m32n8k16:b:u8") : [llvm_i32_ty],
!eq(gft,"m32n8k16:b:s8") : [llvm_i32_ty],		!eq(gft,"m32n8k16:b:s8") : [llvm_i32_ty],
!eq(gft,"m32n8k16:c:s32") : !listsplat(llvm_i32_ty, 8),		!eq(gft,"m32n8k16:c:s32") : !listsplat(llvm_i32_ty, 8),
!eq(gft,"m32n8k16:d:s32") : !listsplat(llvm_i32_ty, 8),		!eq(gft,"m32n8k16:d:s32") : !listsplat(llvm_i32_ty, 8),

// u4/s4/b1 -> s32 @ m8n8k32 (u4/s4), m8n8k128(b1)		// mma u8/s8 -> s32 @ m8n8k16/m16n8k16/m16n8k32
!eq(gft,"m8n8k128:a:b1") : [llvm_i32_ty],		!eq(gft,"m8n8k16:a:u8") : [llvm_i32_ty],
		!eq(gft,"m8n8k16:a:s8") : [llvm_i32_ty],
		!eq(gft,"m8n8k16:b:u8") : [llvm_i32_ty],
		!eq(gft,"m8n8k16:b:s8") : [llvm_i32_ty],
		!eq(gft,"m8n8k16:c:s32") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m8n8k16:d:s32") : !listsplat(llvm_i32_ty, 2),

		!eq(gft,"m16n8k16:a:u8") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m16n8k16:a:s8") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m16n8k16:b:u8") : [llvm_i32_ty],
		!eq(gft,"m16n8k16:b:s8") : [llvm_i32_ty],
		!eq(gft,"m16n8k16:c:s32") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m16n8k16:d:s32") : !listsplat(llvm_i32_ty, 4),

		!eq(gft,"m16n8k32:a:u8") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m16n8k32:a:s8") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m16n8k32:b:u8") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m16n8k32:b:s8") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m16n8k32:c:s32") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m16n8k32:d:s32") : !listsplat(llvm_i32_ty, 4),

		// wmma/mma u4/s4 -> s32 @ m8n8k32 (u4/s4)
!eq(gft,"m8n8k32:a:u4") : [llvm_i32_ty],		!eq(gft,"m8n8k32:a:u4") : [llvm_i32_ty],
!eq(gft,"m8n8k32:a:s4") : [llvm_i32_ty],		!eq(gft,"m8n8k32:a:s4") : [llvm_i32_ty],
!eq(gft,"m8n8k128:b:b1") : [llvm_i32_ty],
!eq(gft,"m8n8k32:b:u4") : [llvm_i32_ty],		!eq(gft,"m8n8k32:b:u4") : [llvm_i32_ty],
!eq(gft,"m8n8k32:b:s4") : [llvm_i32_ty],		!eq(gft,"m8n8k32:b:s4") : [llvm_i32_ty],
!eq(gft,"m8n8k128:c:s32") : !listsplat(llvm_i32_ty, 2),
!eq(gft,"m8n8k128:d:s32") : !listsplat(llvm_i32_ty, 2),
!eq(gft,"m8n8k32:c:s32") : !listsplat(llvm_i32_ty, 2),		!eq(gft,"m8n8k32:c:s32") : !listsplat(llvm_i32_ty, 2),
!eq(gft,"m8n8k32:d:s32") : !listsplat(llvm_i32_ty, 2),		!eq(gft,"m8n8k32:d:s32") : !listsplat(llvm_i32_ty, 2),

		!eq(gft,"m16n8k32:a:u4") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m16n8k32:a:s4") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m16n8k32:b:u4") : [llvm_i32_ty],
		!eq(gft,"m16n8k32:b:s4") : [llvm_i32_ty],
		!eq(gft,"m16n8k32:c:s32") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m16n8k32:d:s32") : !listsplat(llvm_i32_ty, 4),

		!eq(gft,"m16n8k64:a:u4") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m16n8k64:a:s4") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m16n8k64:b:u4") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m16n8k64:b:s4") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m16n8k64:c:s32") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m16n8k64:d:s32") : !listsplat(llvm_i32_ty, 4),

		// wmma/mma b1 -> s32 @ m8n8k128(b1)
		!eq(gft,"m8n8k128:a:b1") : [llvm_i32_ty],
		!eq(gft,"m8n8k128:b:b1") : [llvm_i32_ty],
		!eq(gft,"m8n8k128:c:s32") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m8n8k128:d:s32") : !listsplat(llvm_i32_ty, 2),

		!eq(gft,"m16n8k128:a:b1") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m16n8k128:b:b1") : [llvm_i32_ty],
		!eq(gft,"m16n8k128:c:s32") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m16n8k128:d:s32") : !listsplat(llvm_i32_ty, 4),

		!eq(gft,"m16n8k256:a:b1") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m16n8k256:b:b1") : !listsplat(llvm_i32_ty, 2),
		!eq(gft,"m16n8k256:c:s32") : !listsplat(llvm_i32_ty, 4),
		!eq(gft,"m16n8k256:d:s32") : !listsplat(llvm_i32_ty, 4),
);		);
}		}

class WMMA_NAME_LDST<string Op, WMMA_REGS Frag, string Layout, int WithStride> {		class WMMA_NAME_LDST<string Op, WMMA_REGS Frag, string Layout, int WithStride> {
string intr = "llvm.nvvm.wmma."		string intr = "llvm.nvvm.wmma."
# Frag.geom		# Frag.geom
# "." # Op		# "." # Op
# "." # Frag.frag		# "." # Frag.frag
		string record = "int_nvvm_wmma_"
# "_" # Frag.frag		# "_" # Frag.frag
# "_" # Frag.ptx_elt_type		# "_" # Frag.ptx_elt_type
# "_" # Layout		# "_" # Layout
# !if(WithStride, "_stride", "");		# !if(WithStride, "_stride", "");
}		}

class MMA_SIGNATURE<WMMA_REGS A, WMMA_REGS B, WMMA_REGS C, WMMA_REGS D> {		class MMA_SIGNATURE<WMMA_REGS A, WMMA_REGS B, WMMA_REGS C, WMMA_REGS D> {
list<WMMA_REGS> id_frags = !cond(		list<WMMA_REGS> id_frags = !cond(
// int and sub-int ops are identified by input type.		// FP16 ops are identified by accumulator & result type.
!eq(A.ptx_elt_type, "s8") : [A],		!eq(A.ptx_elt_type, "f16") : [D, C],
!eq(A.ptx_elt_type, "u8") : [A],		// other ops are identified by input types.
		tra: Nit: `types` as both A and B are considered.
!eq(A.ptx_elt_type, "s4") : [A],		!ne(A.ptx_elt_type, B.ptx_elt_type): [A, B],
!eq(A.ptx_elt_type, "u4") : [A],		true: [A]
		tra: Nit: `are identified`
!eq(A.ptx_elt_type, "b1") : [A],
// the rest are FP ops identified by accumulator & result type.
true: [D, C]
);		);
string ret = !foldl("", id_frags, a, b, !strconcat(a, ".", b.ptx_elt_type));		string ret = !foldl("", id_frags, a, b, !strconcat(a, ".", b.ptx_elt_type));
}		}
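The new `MMA_SIGNATURE` selection can be modelled in a few lines of Python. This is only a sketch of the right-hand (new) version, with fragments reduced to their PTX element type strings; the function name and argument order are chosen here for illustration.

```python
def mma_signature(a, b, c, d):
    # a, b, c, d are the PTX element types of the four fragments.
    if a == "f16":
        # FP16 ops are identified by result and accumulator type.
        id_types = [d, c]
    elif a != b:
        # Mixed input types: both A and B are part of the signature.
        id_types = [a, b]
    else:
        # Everything else is identified by the input type alone.
        id_types = [a]
    # Equivalent of the !foldl: each identifying type gets a leading ".".
    return "".join("." + t for t in id_types)
```

Note that the result always starts with a dot, which is why the name classes append `signature` without inserting a separator of their own.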

class WMMA_NAME_MMA<string ALayout, string BLayout, int Satfinite,		class WMMA_NAME<string ALayout, string BLayout, int Satfinite, string Rnd,
WMMA_REGS A, WMMA_REGS B, WMMA_REGS C, WMMA_REGS D> {		WMMA_REGS A, WMMA_REGS B, WMMA_REGS C, WMMA_REGS D> {
string signature = MMA_SIGNATURE<A, B, C, D>.ret;		string signature = MMA_SIGNATURE<A, B, C, D>.ret;
string llvm = !if(		string llvm = "llvm.nvvm.wmma."
!eq(A.geom, "m8n8k4"),
"llvm.nvvm.mma.m8n8k4"
# "." # ALayout
# "." # BLayout
# signature,
"llvm.nvvm.wmma."
# A.geom		# A.geom
# ".mma"		# ".mma"
# "." # ALayout		# "." # ALayout
# "." # BLayout		# "." # BLayout
		# !if(!ne(Rnd, ""), !strconcat(".", Rnd), "")
# signature		# signature
# !if(Satfinite, ".satfinite", ""));		# !if(Satfinite, ".satfinite", "");

		string record = !subst(".", "_",
		!subst("llvm.", "int_", llvm));
		}

		class MMA_NAME<string ALayout, string BLayout, int Satfinite,
		WMMA_REGS A, WMMA_REGS B, WMMA_REGS C, WMMA_REGS D> {
		string signature = MMA_SIGNATURE<A, B, C, D>.ret;
		string llvm = "llvm.nvvm.mma."
		# A.geom
		# "." # ALayout
		# "." # BLayout
		# !if(Satfinite, ".satfinite", "")
		# signature;
string record = !subst(".", "_",		string record = !subst(".", "_",
!subst("llvm.", "int_", llvm));		!subst("llvm.", "int_", llvm));
}		}
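For comparison, the two new name classes can be approximated in Python as shown below. The sketch follows the string concatenation of the right-hand side of the diff; the function names, and passing the signature in as a ready-made string with its leading dot, are simplifications assumed for this example.

```python
def wmma_name(alayout, blayout, satfinite, rnd, geom, signature):
    # Models WMMA_NAME: optional rounding mode precedes the signature,
    # ".satfinite" comes last.
    llvm = ("llvm.nvvm.wmma." + geom + ".mma"
            + "." + alayout + "." + blayout
            + ("." + rnd if rnd != "" else "")
            + signature
            + (".satfinite" if satfinite else ""))
    return llvm, record_name(llvm)

def mma_name(alayout, blayout, satfinite, geom, signature):
    # Models MMA_NAME: ".satfinite" precedes the signature, no rounding mode.
    llvm = ("llvm.nvvm.mma." + geom
            + "." + alayout + "." + blayout
            + (".satfinite" if satfinite else "")
            + signature)
    return llvm, record_name(llvm)

def record_name(llvm):
    # Equivalent of !subst(".", "_", !subst("llvm.", "int_", llvm)).
    return llvm.replace("llvm.", "int_").replace(".", "_")

wmma_f64 = wmma_name("row", "col", 0, "rn", "m8n8k4", ".f64")
mma_s8u8 = mma_name("row", "col", 1, "m16n8k16", ".s8.u8")
```

The differing position of `.satfinite` relative to the signature is the main reason the two classes are kept separate rather than sharing one `!if`.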

// Generates list of 4-tuples of WMMA_REGS representing a valid MMA op.		// Generates list of 4-tuples of WMMA_REGS representing a valid MMA op.
// Geom: list of supported geometries.		// Geom: list of supported geometries.
// TypeN: PTX type of the corresponding fragment's element.		// TypeN: PTX type of the corresponding fragment's element.
// TypeB and TypeD may be empty if it must match that of TypeA or TypeC.		// TypeB and TypeD may be empty if it must match that of TypeA or TypeC.
		list<WMMA_REGS> ret =
!foldl([]<WMMA_REGS>, Geom, t1, geom, !listconcat(t1,		!foldl([]<WMMA_REGS>, Geom, t1, geom, !listconcat(t1,
!foldl([]<WMMA_REGS>, Frags, t2, frag, !listconcat(t2,		!foldl([]<WMMA_REGS>, Frags, t2, frag, !listconcat(t2,
!foldl([]<WMMA_REGS>, Types, t3, type, !listconcat(t3,		!foldl([]<WMMA_REGS>, Types, t3, type, !listconcat(t3,
[WMMA_REGS<geom, frag, type>]))))));		[WMMA_REGS<geom, frag, type>]))))));
// Debugging aid for readable representation of the list above.		// Debugging aid for readable representation of the list above.
list<string> ops = !foreach(x, ret, x.gft);		list<string> ops = !foreach(x, ret, x.gft);
}		}



// Creates list of valid combinations of fragments. This is the master list that		// Creates list of valid combinations of fragments. This is the master list that
// drives generation of corresponding intrinsics and instructions.		// drives generation of corresponding intrinsics and instructions.
class NVVM_MMA_OPS<int _ = 0> {		class NVVM_MMA_OPS<int _ = 0> {
list<list<WMMA_REGS>> fp_mma_ops = MMA_OPS<		list<list<WMMA_REGS>> tf32_wmma_ops = MMA_OPS<
		["m16n16k8"],
		["tf32"], [], ["f32"], []>.ret;
		list<list<WMMA_REGS>> bf16_wmma_ops = MMA_OPS<
		["m16n16k16", "m32n8k16", "m8n32k16"],
		["bf16"], [], ["f32"], []>.ret;
		list<list<WMMA_REGS>> f64_wmma_ops = MMA_OPS<
["m8n8k4"],		["m8n8k4"],
["f16"], [], ["f16", "f32"], ["f16", "f32"]>.ret;		["f64"], [], ["f64"], []>.ret;
list<list<WMMA_REGS>> fp_wmma_ops = MMA_OPS<		list<list<WMMA_REGS>> fp_wmma_ops = MMA_OPS<
["m16n16k16", "m32n8k16", "m8n32k16"],		["m16n16k16", "m32n8k16", "m8n32k16"],
["f16"], [], ["f16", "f32"], ["f16", "f32"]>.ret;		["f16"], [], ["f16", "f32"], ["f16", "f32"]>.ret;
list<list<WMMA_REGS>> int_wmma_ops = MMA_OPS<		list<list<WMMA_REGS>> int_wmma_ops = MMA_OPS<
["m16n16k16", "m32n8k16", "m8n32k16"],		["m16n16k16", "m32n8k16", "m8n32k16"],
["s8", "u8"], [], ["s32"], []>.ret;		["s8", "u8"], [], ["s32"], []>.ret;
list<list<WMMA_REGS>> subint_wmma_ops = MMA_OPS<		list<list<WMMA_REGS>> subint_wmma_ops = MMA_OPS<
["m8n8k32"],		["m8n8k32"],
["s4", "u4"], [], ["s32"], []>.ret;		["s4", "u4"], [], ["s32"], []>.ret;
list<list<WMMA_REGS>> bit_wmma_ops = MMA_OPS<		list<list<WMMA_REGS>> bit_wmma_ops = MMA_OPS<
["m8n8k128"],		["m8n8k128"],
["b1"], [], ["s32"], []>.ret;		["b1"], [], ["s32"], []>.ret;
		list<list<WMMA_REGS>> all_wmma_ops = !listconcat(
		tf32_wmma_ops, bf16_wmma_ops, f64_wmma_ops,
		fp_wmma_ops, int_wmma_ops, subint_wmma_ops, bit_wmma_ops);

		list<list<WMMA_REGS>> tf32_mma_ops = MMA_OPS<
		["m16n8k4", "m16n8k8"],
		["tf32"], [], ["f32"], []>.ret;
		list<list<WMMA_REGS>> bf16_mma_ops = MMA_OPS<
		["m16n8k16", "m16n8k8"],
		["bf16"], [], ["f32"], []>.ret;
		list<list<WMMA_REGS>> f64_mma_ops = MMA_OPS<
		["m8n8k4"],
		["f64"], [], ["f64"], []>.ret;
		list<list<WMMA_REGS>> fp_mma_ops = MMA_OPS<
		["m8n8k4", "m16n8k8", "m16n8k16"],
		["f16"], [], ["f16", "f32"], ["f16", "f32"]>.ret;
		list<list<WMMA_REGS>> int_mma_ops = MMA_OPS<
		["m8n8k16", "m16n8k16", "m16n8k32"],
		["s8", "u8"], ["s8", "u8"], ["s32"], []>.ret;
		list<list<WMMA_REGS>> subint_mma_ops = MMA_OPS<
		["m8n8k32", "m16n8k32", "m16n8k64"],
		["s4", "u4"], ["s4", "u4"], ["s32"], []>.ret;
		list<list<WMMA_REGS>> bit_mma_ops = MMA_OPS<
		["m8n8k128", "m16n8k128", "m16n8k256"],
		["b1"], [], ["s32"], []>.ret;
list<list<WMMA_REGS>> all_mma_ops = !listconcat(		list<list<WMMA_REGS>> all_mma_ops = !listconcat(
fp_mma_ops, fp_wmma_ops, int_wmma_ops,		tf32_mma_ops, bf16_mma_ops, f64_mma_ops,
subint_wmma_ops, bit_wmma_ops);		fp_mma_ops, int_mma_ops, subint_mma_ops, bit_mma_ops);

list<WMMA_REGS> ldst_ab_ops = MMA_LDST_OPS<		list<WMMA_REGS> ldst_ab_ops = MMA_LDST_OPS<
["m16n16k16", "m32n8k16", "m8n32k16"],		["m16n16k16", "m32n8k16", "m8n32k16"],
["a", "b"], ["f16", "u8", "s8"]>.ret;		["a", "b"], ["f16", "u8", "s8", "bf16"]>.ret;
list<WMMA_REGS> ldst_cd_ops = MMA_LDST_OPS<		list<WMMA_REGS> ldst_cd_ops = MMA_LDST_OPS<
["m16n16k16", "m32n8k16", "m8n32k16"],		["m16n16k16", "m32n8k16", "m8n32k16"],
["c", "d"], ["f16", "f32", "s32"]>.ret;		["c", "d"], ["f16", "f32", "s32"]>.ret;
		list<WMMA_REGS> ldst_tf32_ab_ops = MMA_LDST_OPS<
		["m16n16k8"],
		["a", "b"], ["tf32"]>.ret;
		list<WMMA_REGS> ldst_tf32_cd_ops = MMA_LDST_OPS<
		["m16n16k8"],
		["c", "d"], ["f32"]>.ret;
		list<WMMA_REGS> ldst_f64_abcd_ops = MMA_LDST_OPS<
		["m8n8k4"],
		["a", "b", "c", "d"], ["f64"]>.ret;
list<WMMA_REGS> ldst_subint_ab_ops = MMA_LDST_OPS<		list<WMMA_REGS> ldst_subint_ab_ops = MMA_LDST_OPS<
["m8n8k32"], ["a", "b"], ["s4","u4"]>.ret;		["m8n8k32"], ["a", "b"], ["s4","u4"]>.ret;
list<WMMA_REGS> ldst_bit_ab_ops = MMA_LDST_OPS<		list<WMMA_REGS> ldst_bit_ab_ops = MMA_LDST_OPS<
["m8n8k128"], ["a", "b"], ["b1"]>.ret;		["m8n8k128"], ["a", "b"], ["b1"]>.ret;
list<WMMA_REGS> ldst_subint_cd_ops = MMA_LDST_OPS<		list<WMMA_REGS> ldst_subint_cd_ops = MMA_LDST_OPS<
["m8n8k32", "m8n8k128"], ["c", "d"], ["s32"]>.ret;		["m8n8k32", "m8n8k128"], ["c", "d"], ["s32"]>.ret;
list<WMMA_REGS> all_ldst_ops = !listconcat(ldst_ab_ops, ldst_cd_ops,		list<WMMA_REGS> all_ldst_ops = !listconcat(ldst_ab_ops, ldst_cd_ops,
		ldst_tf32_ab_ops,
		ldst_tf32_cd_ops,
		ldst_f64_abcd_ops,
ldst_subint_ab_ops,		ldst_subint_ab_ops,
ldst_bit_ab_ops,		ldst_bit_ab_ops,
ldst_subint_cd_ops);		ldst_subint_cd_ops);
// Separate A/B/C fragments (loads) from D (stores).		// Separate A/B/C fragments (loads) from D (stores).
list<WMMA_REGS> all_ld_ops = !filter(op, all_ldst_ops, !ne(op.frag, "d"));		list<WMMA_REGS> all_ld_ops = !filter(op, all_ldst_ops, !ne(op.frag, "d"));
list<WMMA_REGS> all_st_ops = !filter(op, all_ldst_ops, !eq(op.frag, "d"));		list<WMMA_REGS> all_st_ops = !filter(op, all_ldst_ops, !eq(op.frag, "d"));
}		}

def NVVM_MMA_OPS : NVVM_MMA_OPS;		def NVVM_MMA_OPS : NVVM_MMA_OPS;

// Returns true if this combination of layout/satf is supported; false otherwise.
// MMA ops must provide all parameters. Loads and stores -- only frags and layout_a.		// Returns true if this combination of fragment and layout for WMMA load/store
// The class is used to prevent generation of records for the unsupported variants.		// ops is supported; false otherwise.
		// E.g.
		// if NVVM_WMMA_LDST_SUPPORTED<...>.ret then
		// def : FOO<>; // The record will only be defined for supported ops.
		//
		class NVVM_WMMA_LDST_SUPPORTED<WMMA_REGS frag, string layout> {
		string f = frag.frag;
		string t = frag.ptx_elt_type;

		bit ret = !cond(
		// Sub-int load and store requires A fragment to be of row layout and B
		// fragments to be of column layout.
		!and(!or(!eq(t, "b1"),
		!eq(t, "u4"),
		!eq(t, "s4")),
		!or(!and(!eq(f, "a"),
		!ne(layout, "row")),
		!and(!eq(f, "b"),
		!ne(layout, "col")))) : false,
		true: true
		);
		}
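The load/store predicate is easier to read when unfolded from its nested `!and`/`!or` form. The following Python function is a sketch of the same condition; the function name and the flat string arguments are assumptions of this example, not part of the patch.

```python
def wmma_ldst_supported(frag, ptx_type, layout):
    # Sub-int types (b1/u4/s4) only allow row-major A fragments and
    # column-major B fragments; everything else is accepted.
    sub_int = ptx_type in ("b1", "u4", "s4")
    bad_a = frag == "a" and layout != "row"
    bad_b = frag == "b" and layout != "col"
    return not (sub_int and (bad_a or bad_b))
```

For example, `wmma_ldst_supported("a", "s4", "col")` is rejected, whereas the same layout with an `f16` fragment is allowed.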

		// Returns true if this combination of layout/satf/rnd for WMMA ops is
		// supported; false otherwise.
		// E.g.
		// if NVVM_WMMA_SUPPORTED<...>.ret then
		// def : FOO<>; // The record will only be defined for supported ops.
		//
		class NVVM_WMMA_SUPPORTED<list<WMMA_REGS> frags, string layout_a, string layout_b, int satf, string rnd> {
		// WMMA ops check both layouts.
		string layout = layout_a # ":" # layout_b;
		string t = frags[0].ptx_elt_type;

		bit ret = !cond(
		// only f64 wmma functions support rnd options
		// any non f64 type that uses a rnd value is invalid
		!and(!ne(t, "f64"), !ne(rnd, "")) : false,

		// satf is only valid for select types
		!and(!eq(satf, 1),
		!ne(t, "s8"),
		!ne(t, "u8"),
		!ne(t, "s4"),
		!ne(t, "u4"),
		!ne(t, "f16")): false,

		// Sub-int wmma requires row/column layout
		!and(!or(!eq(t, "s4"),
		!eq(t, "u4"),
		!eq(t, "b1")),
		!ne(layout, "row:col")) : false,
		true: true
		);
		}
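A Python rendering of the three rejection rules in `NVVM_WMMA_SUPPORTED` may help when checking which intrinsic records get generated. This is a sketch only: `t` is the element type of the A fragment, and the function name is hypothetical.

```python
def wmma_supported(t, layout_a, layout_b, satf, rnd):
    layout = layout_a + ":" + layout_b
    # Only f64 WMMA ops accept a rounding mode.
    if t != "f64" and rnd != "":
        return False
    # .satfinite is limited to the integer, sub-int and f16 variants.
    if satf == 1 and t not in ("s8", "u8", "s4", "u4", "f16"):
        return False
    # Sub-int and bit ops require the row:col layout.
    if t in ("s4", "u4", "b1") and layout != "row:col":
        return False
    return True
```

As written, the first rule only rejects a rounding mode on non-f64 types, so an f64 op with an empty `rnd` also passes; whether that is intended is not stated in the review.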

		// Returns true if this combination of layout/satf for MMA ops is supported;
		// false otherwise.
// E.g.		// E.g.
// if NVVM_MMA_SUPPORTED<...>.ret then		// if NVVM_MMA_SUPPORTED<...>.ret then
// def : FOO<>; // The record will only be defined for supported ops.		// def : FOO<>; // The record will only be defined for supported ops.
//		//
class NVVM_MMA_SUPPORTED<list<WMMA_REGS> frags, string layout_a, string layout_b="-", int satf=-1> {		class NVVM_MMA_SUPPORTED<list<WMMA_REGS> frags, string layout_a, string layout_b, int satf> {
// MMA ops check both layouts.		// MMA ops check both layouts.
string mma = frags[0].ptx_elt_type		string layout = layout_a # ":" # layout_b;
# ":" # layout_a		string a_type = frags[0].ptx_elt_type;
# ":" # layout_b;		string b_type = frags[1].ptx_elt_type;
// Load ops only need type/fragment/layout.		string c_type = frags[2].ptx_elt_type;
string ld = frags[0].ptx_elt_type		string d_type = frags[3].ptx_elt_type;
# ":" # frags[0].frag		string geom = frags[0].geom;
# ":" # layout_a
;
string ldf = frags[0].ptx_elt_type
# ":" # frags[0].frag
;
string t = frags[0].ptx_elt_type;

// gcd is a shortcut used to identify instructions that depend on		// gcd is a shortcut used to identify instructions that depend on
// geom+frag_c+frag_d. Not all instances of this class have all fragments		// geom+frag_c+frag_d.
// specified. If there are not enough fragments, the tail evaluates to '?'.		string gcd = geom # ":" # c_type # d_type;
string gcd = frags[0].geom
# ":"
# !if(!eq(!size(frags), 4),
frags[2].ptx_elt_type # frags[3].ptx_elt_type,
"?");
bit ret = !cond(		bit ret = !cond(
// Sub-int MMA only supports fixed A/B layout.
// b1 does not support .satf.
!eq(mma#":"#satf, "b1:row:col:0") : true,
// mma.m8n8k4 has no .satf modifier.
!and(!eq(frags[0].geom, "m8n8k4"),
!ne(satf, 0)): false,

// mma.m8n8k4 has no C=f32 D=f16 variant.		// Limit satf to valid types
		!and(!eq(satf, 1),
		!ne(a_type, "s8"),
		!ne(a_type, "u8"),
		!ne(a_type, "s4"),
		!ne(a_type, "u4")): false,

		// m8n8k4 has no C=f32 D=f16 variant.
!eq(gcd, "m8n8k4:f32f16"): false,		!eq(gcd, "m8n8k4:f32f16"): false,
!eq(mma, "s4:row:col") : true,
!eq(mma, "u4:row:col") : true,		// only m8n8k4 for f16 does not require row:col layout
!eq(mma, "s4:row:col") : true,		!and(!ne(layout, "row:col"),
!eq(mma, "u4:row:col") : true,		!or(!ne(geom, "m8n8k4"),
// Sub-int load/stores have fixed layout for A and B.		!ne(a_type, "f16"))) : false,
!and(!eq(layout_b, "-"), // It's a Load or Store op
!or(!eq(ld, "b1:a:row"),		// m16n8k8 requires A and B to be the same type and C and D to be the same
!eq(ld, "b1:b:col"),		// type.
!eq(ldf, "b1:c"),		!and(!eq(geom, "m16n8k8"),
!eq(ldf, "b1:d"),		!or(!ne(a_type, b_type),
!eq(ld, "s4:a:row"),		!ne(c_type, d_type))): false,
!eq(ld, "s4:b:col"),
!eq(ldf, "s4:c"),		// m16n8k8 requires C and D to be the same type.
!eq(ldf, "s4:d"),		!and(!eq(geom, "m16n8k8"),
!eq(ld, "u4:a:row"),		!ne(c_type, d_type)): false,
!eq(ld, "u4:b:col"),
!eq(ldf, "u4:c"),		// All other are OK.
!eq(ldf, "u4:d"))) : true,
// All other sub-int ops are not supported.
!eq(t, "b1") : false,
!eq(t, "s4") : false,
!eq(t, "u4") : false,
// All other (non sub-int) are OK.
true: true		true: true
);		);
}		}
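The rewritten `NVVM_MMA_SUPPORTED` can likewise be sketched as a plain Python predicate. The version below follows the new (right-hand) conditions in order; the function name and flat argument list are assumptions for illustration.

```python
def mma_supported(geom, a_type, b_type, c_type, d_type,
                  layout_a, layout_b, satf):
    layout = layout_a + ":" + layout_b
    gcd = geom + ":" + c_type + d_type
    # .satfinite is limited to integer and sub-int inputs.
    if satf == 1 and a_type not in ("s8", "u8", "s4", "u4"):
        return False
    # m8n8k4 has no C=f32 D=f16 variant.
    if gcd == "m8n8k4:f32f16":
        return False
    # Only the f16 m8n8k4 op may use a layout other than row:col.
    if layout != "row:col" and (geom != "m8n8k4" or a_type != "f16"):
        return False
    # m16n8k8 requires A == B and C == D.
    if geom == "m16n8k8" and (a_type != b_type or c_type != d_type):
        return False
    # The diff has a second m16n8k8 check (C == D only); it appears to
    # be subsumed by the check above and is kept only for fidelity.
    if geom == "m16n8k8" and c_type != d_type:
        return False
    return True
```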

class SHFL_INFO<bit sync, string mode, string type, bit return_pred> {		class SHFL_INFO<bit sync, string mode, string type, bit return_pred> {
string Suffix = !if(sync, "sync_", "")		string Suffix = !if(sync, "sync_", "")
# mode # "_"		# mode # "_"
# type		# type
		: Intrinsic<[],
!if(WithStride, [llvm_i32_ty], [])),		!if(WithStride, [llvm_i32_ty], [])),
[IntrWriteMem, IntrArgMemOnly, WriteOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>],		[IntrWriteMem, IntrArgMemOnly, WriteOnly<ArgIndex<0>>, NoCapture<ArgIndex<0>>],
WMMA_NAME_LDST<"store", Frag, Layout, WithStride>.intr>;		WMMA_NAME_LDST<"store", Frag, Layout, WithStride>.intr>;

// Create all load/store variants		// Create all load/store variants
foreach layout = ["row", "col"] in {		foreach layout = ["row", "col"] in {
foreach stride = [0, 1] in {		foreach stride = [0, 1] in {
foreach frag = NVVM_MMA_OPS.all_ld_ops in		foreach frag = NVVM_MMA_OPS.all_ld_ops in
if NVVM_MMA_SUPPORTED<[frag], layout>.ret then		if NVVM_WMMA_LDST_SUPPORTED<frag, layout>.ret then
def WMMA_NAME_LDST<"load", frag, layout, stride>.record		def WMMA_NAME_LDST<"load", frag, layout, stride>.record
: NVVM_WMMA_LD<frag, layout, stride>;		: NVVM_WMMA_LD<frag, layout, stride>;
foreach frag = NVVM_MMA_OPS.all_st_ops in		foreach frag = NVVM_MMA_OPS.all_st_ops in
if NVVM_MMA_SUPPORTED<[frag], layout>.ret then		if NVVM_WMMA_LDST_SUPPORTED<frag, layout>.ret then
def WMMA_NAME_LDST<"store", frag, layout, stride>.record		def WMMA_NAME_LDST<"store", frag, layout, stride>.record
: NVVM_WMMA_ST<frag, layout, stride>;		: NVVM_WMMA_ST<frag, layout, stride>;
}		}
}		}

// WMMA.MMA		// WMMA.MMA
class NVVM_WMMA_MMA<string ALayout, string BLayout, int Satfinite,		class NVVM_WMMA_MMA<string ALayout, string BLayout, int Satfinite, string rnd,
WMMA_REGS A, WMMA_REGS B,		WMMA_REGS A, WMMA_REGS B,
WMMA_REGS C, WMMA_REGS D>		WMMA_REGS C, WMMA_REGS D>
: Intrinsic<D.regs,		: Intrinsic<D.regs,
!listconcat(A.regs, B.regs, C.regs),		!listconcat(A.regs, B.regs, C.regs),
[IntrNoMem],		[IntrNoMem],
WMMA_NAME_MMA<ALayout, BLayout, Satfinite, A, B, C, D>.llvm>;		WMMA_NAME<ALayout, BLayout, Satfinite, rnd, A, B, C, D>.llvm>;

foreach layout_a = ["row", "col"] in {		foreach layout_a = ["row", "col"] in {
foreach layout_b = ["row", "col"] in {		foreach layout_b = ["row", "col"] in {
foreach satf = [0, 1] in {		foreach satf = [0, 1] in {
foreach op = NVVM_MMA_OPS.all_mma_ops in {		foreach rnd = ["", "rn", "rz", "rm", "rp"] in {
		tra (inline comment, not done): We're often using an empty string to represent a `none`. Comparisons with `-` where we check `rnd` look like we're doing something special there. I'd use an empty string for `rnd`, too.
		steffenlarsen (author, inline comment, done): Empty string works for me. I think there are/were some places that used "-" as a default parameter meaning `none`, but I agree with your assessment.
if NVVM_MMA_SUPPORTED<op, layout_a, layout_b, satf>.ret then {		foreach op = NVVM_MMA_OPS.all_wmma_ops in {
def WMMA_NAME_MMA<layout_a, layout_b, satf,		if NVVM_WMMA_SUPPORTED<op, layout_a, layout_b, satf, rnd>.ret then {
		def WMMA_NAME<layout_a, layout_b, satf, rnd,
op[0], op[1], op[2], op[3]>.record		op[0], op[1], op[2], op[3]>.record
: NVVM_WMMA_MMA<layout_a, layout_b, satf,		: NVVM_WMMA_MMA<layout_a, layout_b, satf, rnd,
op[0], op[1], op[2], op[3]>;		op[0], op[1], op[2], op[3]>;
}		}
		} // op
		} // rnd
		} // satf
		} // layout_b
		} // layout_a

		// MMA
		class NVVM_MMA<string ALayout, string BLayout, int Satfinite,
		WMMA_REGS A, WMMA_REGS B, WMMA_REGS C, WMMA_REGS D>
		: Intrinsic<D.regs,
		!listconcat(A.regs, B.regs, C.regs),
		[IntrNoMem],
		MMA_NAME<ALayout, BLayout, Satfinite, A, B, C, D>.llvm>;

		foreach layout_a = ["row", "col"] in {
		foreach layout_b = ["row", "col"] in {
		foreach satf = [0, 1] in {
		foreach op = NVVM_MMA_OPS.all_mma_ops in {
		if NVVM_MMA_SUPPORTED<op, layout_a, layout_b, satf>.ret then {
		def MMA_NAME<layout_a, layout_b, satf, op[0], op[1], op[2], op[3]>.record
		: NVVM_MMA<layout_a, layout_b, satf, op[0], op[1], op[2], op[3]>;
}		}
		} // op
} // satf		} // satf
} // layout_b		} // layout_b
} // layout_a		} // layout_a

} // let TargetPrefix = "nvvm"		} // let TargetPrefix = "nvvm"
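
The inline discussion above settles on an empty string, rather than "-", to mean "no rounding mode". A minimal C++ sketch of why that convention keeps name construction free of special cases is given here. The helper `mmaNameSketch` and the dotted name layout it produces are hypothetical and only illustrate the idea; they are not the actual `WMMA_NAME` logic from IntrinsicsNVVM.td.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of LLVM): joins the pieces of an
// MMA-style name. An empty `rnd` means "no rounding mode" and simply
// contributes nothing, so no sentinel such as "-" has to be compared.
std::string mmaNameSketch(const std::string &geom, const std::string &aLayout,
                          const std::string &bLayout, const std::string &rnd,
                          bool satfinite) {
  std::string name = "wmma." + geom + ".mma." + aLayout + "." + bLayout;
  if (!rnd.empty())
    name += "." + rnd; // only present when a rounding mode was requested
  if (satfinite)
    name += ".satfinite";
  return name;
}

// Small usage check for the sketch above.
void demoNames() {
  assert(mmaNameSketch("m8n8k4", "row", "col", "", false) ==
         "wmma.m8n8k4.mma.row.col");
  assert(mmaNameSketch("m8n8k4", "row", "col", "rn", true) ==
         "wmma.m8n8k4.mma.row.col.rn.satfinite");
}
```

With this convention the "none" case is the natural default of the string type, which matches the reviewer's point that a comparison against "-" looks like special handling.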

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

[3,484 lines not shown]	bool NVPTXTargetLowering::getTgtMemIntrinsic(
case Intrinsic::nvvm_wmma_m16n16k16_load_a_s8_col:		case Intrinsic::nvvm_wmma_m16n16k16_load_a_s8_col:
case Intrinsic::nvvm_wmma_m16n16k16_load_a_s8_col_stride:		case Intrinsic::nvvm_wmma_m16n16k16_load_a_s8_col_stride:
case Intrinsic::nvvm_wmma_m16n16k16_load_a_u8_col_stride:		case Intrinsic::nvvm_wmma_m16n16k16_load_a_u8_col_stride:
case Intrinsic::nvvm_wmma_m16n16k16_load_a_u8_col:		case Intrinsic::nvvm_wmma_m16n16k16_load_a_u8_col:
case Intrinsic::nvvm_wmma_m16n16k16_load_a_s8_row:		case Intrinsic::nvvm_wmma_m16n16k16_load_a_s8_row:
case Intrinsic::nvvm_wmma_m16n16k16_load_a_s8_row_stride:		case Intrinsic::nvvm_wmma_m16n16k16_load_a_s8_row_stride:
case Intrinsic::nvvm_wmma_m16n16k16_load_a_u8_row_stride:		case Intrinsic::nvvm_wmma_m16n16k16_load_a_u8_row_stride:
case Intrinsic::nvvm_wmma_m16n16k16_load_a_u8_row:		case Intrinsic::nvvm_wmma_m16n16k16_load_a_u8_row:
		case Intrinsic::nvvm_wmma_m8n32k16_load_a_bf16_col:
		case Intrinsic::nvvm_wmma_m8n32k16_load_a_bf16_col_stride:
		case Intrinsic::nvvm_wmma_m8n32k16_load_a_bf16_row:
		case Intrinsic::nvvm_wmma_m8n32k16_load_a_bf16_row_stride:
case Intrinsic::nvvm_wmma_m16n16k16_load_b_s8_col:		case Intrinsic::nvvm_wmma_m16n16k16_load_b_s8_col:
case Intrinsic::nvvm_wmma_m16n16k16_load_b_s8_col_stride:		case Intrinsic::nvvm_wmma_m16n16k16_load_b_s8_col_stride:
case Intrinsic::nvvm_wmma_m16n16k16_load_b_u8_col_stride:		case Intrinsic::nvvm_wmma_m16n16k16_load_b_u8_col_stride:
case Intrinsic::nvvm_wmma_m16n16k16_load_b_u8_col:		case Intrinsic::nvvm_wmma_m16n16k16_load_b_u8_col:
case Intrinsic::nvvm_wmma_m16n16k16_load_b_s8_row:		case Intrinsic::nvvm_wmma_m16n16k16_load_b_s8_row:
case Intrinsic::nvvm_wmma_m16n16k16_load_b_s8_row_stride:		case Intrinsic::nvvm_wmma_m16n16k16_load_b_s8_row_stride:
case Intrinsic::nvvm_wmma_m16n16k16_load_b_u8_row_stride:		case Intrinsic::nvvm_wmma_m16n16k16_load_b_u8_row_stride:
case Intrinsic::nvvm_wmma_m16n16k16_load_b_u8_row: {		case Intrinsic::nvvm_wmma_m16n16k16_load_b_u8_row:
		case Intrinsic::nvvm_wmma_m32n8k16_load_b_bf16_col:
		case Intrinsic::nvvm_wmma_m32n8k16_load_b_bf16_col_stride:
		case Intrinsic::nvvm_wmma_m32n8k16_load_b_bf16_row:
		case Intrinsic::nvvm_wmma_m32n8k16_load_b_bf16_row_stride: {
Info.opc = ISD::INTRINSIC_W_CHAIN;		Info.opc = ISD::INTRINSIC_W_CHAIN;
Info.memVT = MVT::v2i32;		Info.memVT = MVT::v2i32;
Info.ptrVal = I.getArgOperand(0);		Info.ptrVal = I.getArgOperand(0);
Info.offset = 0;		Info.offset = 0;
Info.flags = MachineMemOperand::MOLoad;		Info.flags = MachineMemOperand::MOLoad;
Info.align = Align(8);		Info.align = Align(8);
return true;		return true;
}		}

case Intrinsic::nvvm_wmma_m32n8k16_load_a_s8_col:		case Intrinsic::nvvm_wmma_m32n8k16_load_a_s8_col:
case Intrinsic::nvvm_wmma_m32n8k16_load_a_s8_col_stride:		case Intrinsic::nvvm_wmma_m32n8k16_load_a_s8_col_stride:
case Intrinsic::nvvm_wmma_m32n8k16_load_a_u8_col_stride:		case Intrinsic::nvvm_wmma_m32n8k16_load_a_u8_col_stride:
case Intrinsic::nvvm_wmma_m32n8k16_load_a_u8_col:		case Intrinsic::nvvm_wmma_m32n8k16_load_a_u8_col:
case Intrinsic::nvvm_wmma_m32n8k16_load_a_s8_row:		case Intrinsic::nvvm_wmma_m32n8k16_load_a_s8_row:
case Intrinsic::nvvm_wmma_m32n8k16_load_a_s8_row_stride:		case Intrinsic::nvvm_wmma_m32n8k16_load_a_s8_row_stride:
case Intrinsic::nvvm_wmma_m32n8k16_load_a_u8_row_stride:		case Intrinsic::nvvm_wmma_m32n8k16_load_a_u8_row_stride:
case Intrinsic::nvvm_wmma_m32n8k16_load_a_u8_row:		case Intrinsic::nvvm_wmma_m32n8k16_load_a_u8_row:
		case Intrinsic::nvvm_wmma_m16n16k16_load_a_bf16_col:
		case Intrinsic::nvvm_wmma_m16n16k16_load_a_bf16_col_stride:
		case Intrinsic::nvvm_wmma_m16n16k16_load_a_bf16_row:
		case Intrinsic::nvvm_wmma_m16n16k16_load_a_bf16_row_stride:
		case Intrinsic::nvvm_wmma_m16n16k8_load_a_tf32_col:
		case Intrinsic::nvvm_wmma_m16n16k8_load_a_tf32_col_stride:
		case Intrinsic::nvvm_wmma_m16n16k8_load_a_tf32_row:
		case Intrinsic::nvvm_wmma_m16n16k8_load_a_tf32_row_stride:

case Intrinsic::nvvm_wmma_m8n32k16_load_b_s8_col:		case Intrinsic::nvvm_wmma_m8n32k16_load_b_s8_col:
case Intrinsic::nvvm_wmma_m8n32k16_load_b_s8_col_stride:		case Intrinsic::nvvm_wmma_m8n32k16_load_b_s8_col_stride:
case Intrinsic::nvvm_wmma_m8n32k16_load_b_u8_col_stride:		case Intrinsic::nvvm_wmma_m8n32k16_load_b_u8_col_stride:
case Intrinsic::nvvm_wmma_m8n32k16_load_b_u8_col:		case Intrinsic::nvvm_wmma_m8n32k16_load_b_u8_col:
case Intrinsic::nvvm_wmma_m8n32k16_load_b_s8_row:		case Intrinsic::nvvm_wmma_m8n32k16_load_b_s8_row:
case Intrinsic::nvvm_wmma_m8n32k16_load_b_s8_row_stride:		case Intrinsic::nvvm_wmma_m8n32k16_load_b_s8_row_stride:
case Intrinsic::nvvm_wmma_m8n32k16_load_b_u8_row_stride:		case Intrinsic::nvvm_wmma_m8n32k16_load_b_u8_row_stride:
case Intrinsic::nvvm_wmma_m8n32k16_load_b_u8_row: {		case Intrinsic::nvvm_wmma_m8n32k16_load_b_u8_row:
		case Intrinsic::nvvm_wmma_m16n16k16_load_b_bf16_col:
		case Intrinsic::nvvm_wmma_m16n16k16_load_b_bf16_col_stride:
		case Intrinsic::nvvm_wmma_m16n16k16_load_b_bf16_row:
		case Intrinsic::nvvm_wmma_m16n16k16_load_b_bf16_row_stride:
		case Intrinsic::nvvm_wmma_m16n16k8_load_b_tf32_col:
		case Intrinsic::nvvm_wmma_m16n16k8_load_b_tf32_col_stride:
		case Intrinsic::nvvm_wmma_m16n16k8_load_b_tf32_row:
		case Intrinsic::nvvm_wmma_m16n16k8_load_b_tf32_row_stride: {
Info.opc = ISD::INTRINSIC_W_CHAIN;		Info.opc = ISD::INTRINSIC_W_CHAIN;
Info.memVT = MVT::v4i32;		Info.memVT = MVT::v4i32;
Info.ptrVal = I.getArgOperand(0);		Info.ptrVal = I.getArgOperand(0);
Info.offset = 0;		Info.offset = 0;
Info.flags = MachineMemOperand::MOLoad;		Info.flags = MachineMemOperand::MOLoad;
Info.align = Align(16);		Info.align = Align(16);
return true;		return true;
}		}
[63 lines not shown]	bool NVPTXTargetLowering::getTgtMemIntrinsic(
case Intrinsic::nvvm_wmma_m16n16k16_load_c_f32_row_stride:		case Intrinsic::nvvm_wmma_m16n16k16_load_c_f32_row_stride:
case Intrinsic::nvvm_wmma_m32n8k16_load_c_f32_col:		case Intrinsic::nvvm_wmma_m32n8k16_load_c_f32_col:
case Intrinsic::nvvm_wmma_m32n8k16_load_c_f32_row:		case Intrinsic::nvvm_wmma_m32n8k16_load_c_f32_row:
case Intrinsic::nvvm_wmma_m32n8k16_load_c_f32_col_stride:		case Intrinsic::nvvm_wmma_m32n8k16_load_c_f32_col_stride:
case Intrinsic::nvvm_wmma_m32n8k16_load_c_f32_row_stride:		case Intrinsic::nvvm_wmma_m32n8k16_load_c_f32_row_stride:
case Intrinsic::nvvm_wmma_m8n32k16_load_c_f32_col:		case Intrinsic::nvvm_wmma_m8n32k16_load_c_f32_col:
case Intrinsic::nvvm_wmma_m8n32k16_load_c_f32_row:		case Intrinsic::nvvm_wmma_m8n32k16_load_c_f32_row:
case Intrinsic::nvvm_wmma_m8n32k16_load_c_f32_col_stride:		case Intrinsic::nvvm_wmma_m8n32k16_load_c_f32_col_stride:
case Intrinsic::nvvm_wmma_m8n32k16_load_c_f32_row_stride: {		case Intrinsic::nvvm_wmma_m8n32k16_load_c_f32_row_stride:
		case Intrinsic::nvvm_wmma_m16n16k8_load_c_f32_col:
		case Intrinsic::nvvm_wmma_m16n16k8_load_c_f32_row:
		case Intrinsic::nvvm_wmma_m16n16k8_load_c_f32_col_stride:
		case Intrinsic::nvvm_wmma_m16n16k8_load_c_f32_row_stride: {
Info.opc = ISD::INTRINSIC_W_CHAIN;		Info.opc = ISD::INTRINSIC_W_CHAIN;
Info.memVT = MVT::v8f32;		Info.memVT = MVT::v8f32;
Info.ptrVal = I.getArgOperand(0);		Info.ptrVal = I.getArgOperand(0);
Info.offset = 0;		Info.offset = 0;
Info.flags = MachineMemOperand::MOLoad;		Info.flags = MachineMemOperand::MOLoad;
Info.align = Align(16);		Info.align = Align(16);
return true;		return true;
}		}

		case Intrinsic::nvvm_wmma_m32n8k16_load_a_bf16_col:
		case Intrinsic::nvvm_wmma_m32n8k16_load_a_bf16_col_stride:
		case Intrinsic::nvvm_wmma_m32n8k16_load_a_bf16_row:
		case Intrinsic::nvvm_wmma_m32n8k16_load_a_bf16_row_stride:

		case Intrinsic::nvvm_wmma_m8n32k16_load_b_bf16_col:
		case Intrinsic::nvvm_wmma_m8n32k16_load_b_bf16_col_stride:
		case Intrinsic::nvvm_wmma_m8n32k16_load_b_bf16_row:
		case Intrinsic::nvvm_wmma_m8n32k16_load_b_bf16_row_stride:

case Intrinsic::nvvm_wmma_m16n16k16_load_c_s32_col:		case Intrinsic::nvvm_wmma_m16n16k16_load_c_s32_col:
case Intrinsic::nvvm_wmma_m16n16k16_load_c_s32_col_stride:		case Intrinsic::nvvm_wmma_m16n16k16_load_c_s32_col_stride:
case Intrinsic::nvvm_wmma_m16n16k16_load_c_s32_row:		case Intrinsic::nvvm_wmma_m16n16k16_load_c_s32_row:
case Intrinsic::nvvm_wmma_m16n16k16_load_c_s32_row_stride:		case Intrinsic::nvvm_wmma_m16n16k16_load_c_s32_row_stride:
case Intrinsic::nvvm_wmma_m32n8k16_load_c_s32_col:		case Intrinsic::nvvm_wmma_m32n8k16_load_c_s32_col:
case Intrinsic::nvvm_wmma_m32n8k16_load_c_s32_col_stride:		case Intrinsic::nvvm_wmma_m32n8k16_load_c_s32_col_stride:
case Intrinsic::nvvm_wmma_m32n8k16_load_c_s32_row:		case Intrinsic::nvvm_wmma_m32n8k16_load_c_s32_row:
case Intrinsic::nvvm_wmma_m32n8k16_load_c_s32_row_stride:		case Intrinsic::nvvm_wmma_m32n8k16_load_c_s32_row_stride:
[22 lines not shown]	case Intrinsic::nvvm_wmma_m8n8k32_load_c_s32_row_stride: {
Info.memVT = MVT::v2i32;		Info.memVT = MVT::v2i32;
Info.ptrVal = I.getArgOperand(0);		Info.ptrVal = I.getArgOperand(0);
Info.offset = 0;		Info.offset = 0;
Info.flags = MachineMemOperand::MOLoad;		Info.flags = MachineMemOperand::MOLoad;
Info.align = Align(8);		Info.align = Align(8);
return true;		return true;
}		}

		case Intrinsic::nvvm_wmma_m8n8k4_load_a_f64_col:
		case Intrinsic::nvvm_wmma_m8n8k4_load_a_f64_col_stride:
		case Intrinsic::nvvm_wmma_m8n8k4_load_a_f64_row:
		case Intrinsic::nvvm_wmma_m8n8k4_load_a_f64_row_stride:

		case Intrinsic::nvvm_wmma_m8n8k4_load_b_f64_col:
		case Intrinsic::nvvm_wmma_m8n8k4_load_b_f64_col_stride:
		case Intrinsic::nvvm_wmma_m8n8k4_load_b_f64_row:
		case Intrinsic::nvvm_wmma_m8n8k4_load_b_f64_row_stride: {
		Info.opc = ISD::INTRINSIC_W_CHAIN;
		Info.memVT = MVT::f64;
		Info.ptrVal = I.getArgOperand(0);
		Info.offset = 0;
		Info.flags = MachineMemOperand::MOLoad;
		Info.align = Align(8);
		return true;
		}

		case Intrinsic::nvvm_wmma_m8n8k4_load_c_f64_col:
		case Intrinsic::nvvm_wmma_m8n8k4_load_c_f64_col_stride:
		case Intrinsic::nvvm_wmma_m8n8k4_load_c_f64_row:
		case Intrinsic::nvvm_wmma_m8n8k4_load_c_f64_row_stride: {
		Info.opc = ISD::INTRINSIC_W_CHAIN;
		Info.memVT = MVT::v2f64;
		Info.ptrVal = I.getArgOperand(0);
		Info.offset = 0;
		Info.flags = MachineMemOperand::MOLoad;
		Info.align = Align(16);
		return true;
		}

case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_col:		case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_col:
case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_row:		case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_row:
case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_col_stride:		case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_col_stride:
case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_row_stride:		case Intrinsic::nvvm_wmma_m16n16k16_store_d_f16_row_stride:
case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_col:		case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_col:
case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_row:		case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_row:
case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_col_stride:		case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_col_stride:
case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_row_stride:		case Intrinsic::nvvm_wmma_m32n8k16_store_d_f16_row_stride:
[16 lines not shown]	bool NVPTXTargetLowering::getTgtMemIntrinsic(
case Intrinsic::nvvm_wmma_m16n16k16_store_d_f32_row_stride:		case Intrinsic::nvvm_wmma_m16n16k16_store_d_f32_row_stride:
case Intrinsic::nvvm_wmma_m32n8k16_store_d_f32_col:		case Intrinsic::nvvm_wmma_m32n8k16_store_d_f32_col:
case Intrinsic::nvvm_wmma_m32n8k16_store_d_f32_row:		case Intrinsic::nvvm_wmma_m32n8k16_store_d_f32_row:
case Intrinsic::nvvm_wmma_m32n8k16_store_d_f32_col_stride:		case Intrinsic::nvvm_wmma_m32n8k16_store_d_f32_col_stride:
case Intrinsic::nvvm_wmma_m32n8k16_store_d_f32_row_stride:		case Intrinsic::nvvm_wmma_m32n8k16_store_d_f32_row_stride:
case Intrinsic::nvvm_wmma_m8n32k16_store_d_f32_col:		case Intrinsic::nvvm_wmma_m8n32k16_store_d_f32_col:
case Intrinsic::nvvm_wmma_m8n32k16_store_d_f32_row:		case Intrinsic::nvvm_wmma_m8n32k16_store_d_f32_row:
case Intrinsic::nvvm_wmma_m8n32k16_store_d_f32_col_stride:		case Intrinsic::nvvm_wmma_m8n32k16_store_d_f32_col_stride:
case Intrinsic::nvvm_wmma_m8n32k16_store_d_f32_row_stride: {		case Intrinsic::nvvm_wmma_m8n32k16_store_d_f32_row_stride:
		case Intrinsic::nvvm_wmma_m16n16k8_store_d_f32_col:
		case Intrinsic::nvvm_wmma_m16n16k8_store_d_f32_row:
		case Intrinsic::nvvm_wmma_m16n16k8_store_d_f32_col_stride:
		case Intrinsic::nvvm_wmma_m16n16k8_store_d_f32_row_stride: {
Info.opc = ISD::INTRINSIC_VOID;		Info.opc = ISD::INTRINSIC_VOID;
Info.memVT = MVT::v8f32;		Info.memVT = MVT::v8f32;
Info.ptrVal = I.getArgOperand(0);		Info.ptrVal = I.getArgOperand(0);
Info.offset = 0;		Info.offset = 0;
Info.flags = MachineMemOperand::MOStore;		Info.flags = MachineMemOperand::MOStore;
Info.align = Align(16);		Info.align = Align(16);
return true;		return true;
}		}
[31 lines not shown]	case Intrinsic::nvvm_wmma_m8n8k32_store_d_s32_row_stride: {
Info.memVT = MVT::v2i32;		Info.memVT = MVT::v2i32;
Info.ptrVal = I.getArgOperand(0);		Info.ptrVal = I.getArgOperand(0);
Info.offset = 0;		Info.offset = 0;
Info.flags = MachineMemOperand::MOStore;		Info.flags = MachineMemOperand::MOStore;
Info.align = Align(8);		Info.align = Align(8);
return true;		return true;
}		}

		case Intrinsic::nvvm_wmma_m8n8k4_store_d_f64_col:
		case Intrinsic::nvvm_wmma_m8n8k4_store_d_f64_col_stride:
		case Intrinsic::nvvm_wmma_m8n8k4_store_d_f64_row:
		case Intrinsic::nvvm_wmma_m8n8k4_store_d_f64_row_stride: {
		Info.opc = ISD::INTRINSIC_VOID;
		Info.memVT = MVT::v2f64;
		Info.ptrVal = I.getArgOperand(0);
		Info.offset = 0;
		Info.flags = MachineMemOperand::MOStore;
		Info.align = Align(16);
		return true;
		}

case Intrinsic::nvvm_atomic_load_inc_32:		case Intrinsic::nvvm_atomic_load_inc_32:
case Intrinsic::nvvm_atomic_load_dec_32:		case Intrinsic::nvvm_atomic_load_dec_32:

case Intrinsic::nvvm_atomic_add_gen_f_cta:		case Intrinsic::nvvm_atomic_add_gen_f_cta:
case Intrinsic::nvvm_atomic_add_gen_f_sys:		case Intrinsic::nvvm_atomic_add_gen_f_sys:
case Intrinsic::nvvm_atomic_add_gen_i_cta:		case Intrinsic::nvvm_atomic_add_gen_i_cta:
case Intrinsic::nvvm_atomic_add_gen_i_sys:		case Intrinsic::nvvm_atomic_add_gen_i_sys:
case Intrinsic::nvvm_atomic_and_gen_i_cta:		case Intrinsic::nvvm_atomic_and_gen_i_cta:
[1,298 lines not shown]
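
In the `getTgtMemIntrinsic` cases visible in this diff, the alignment assigned to each fragment appears to follow the size of its memory type, capped at 16 bytes (`v2i32` and `f64` get `Align(8)`; `v4i32`, `v2f64`, and `v8f32` get `Align(16)`). The sketch below captures that observation. It is an inference from the cases shown here, not a documented rule, and `FragMemSketch` and `alignSketch` are hypothetical names that do not exist in LLVM.

```cpp
// Hypothetical description of a fragment's memory type: element count and
// element size in bytes, standing in for the MVT values used in the diff.
struct FragMemSketch {
  unsigned numElts;
  unsigned eltBytes;
};

constexpr unsigned totalBytes(FragMemSketch f) {
  return f.numElts * f.eltBytes;
}

// Observed pattern in the cases shown: alignment is the total size in
// bytes, capped at 16.
constexpr unsigned alignSketch(FragMemSketch f) {
  return totalBytes(f) < 16 ? totalBytes(f) : 16;
}

// The memVT/Align pairs that appear in the diff.
static_assert(alignSketch(FragMemSketch{2, 4}) == 8, "v2i32 -> Align(8)");
static_assert(alignSketch(FragMemSketch{4, 4}) == 16, "v4i32 -> Align(16)");
static_assert(alignSketch(FragMemSketch{8, 4}) == 16, "v8f32 -> Align(16)");
static_assert(alignSketch(FragMemSketch{1, 8}) == 8, "f64 -> Align(8)");
static_assert(alignSketch(FragMemSketch{2, 8}) == 16, "v2f64 -> Align(16)");
```

The diff itself spells out each case explicitly rather than computing the alignment, so the sketch is only a compact way to read the table.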

llvm/lib/Target/NVPTX/NVPTXInstrInfo.td

	[138 lines not shown]

	def True : Predicate<"true">;			def True : Predicate<"true">;

	def hasPTX31 : Predicate<"Subtarget->getPTXVersion() >= 31">;			def hasPTX31 : Predicate<"Subtarget->getPTXVersion() >= 31">;
	def hasPTX60 : Predicate<"Subtarget->getPTXVersion() >= 60">;			def hasPTX60 : Predicate<"Subtarget->getPTXVersion() >= 60">;
	def hasPTX61 : Predicate<"Subtarget->getPTXVersion() >= 61">;			def hasPTX61 : Predicate<"Subtarget->getPTXVersion() >= 61">;
	def hasPTX63 : Predicate<"Subtarget->getPTXVersion() >= 63">;			def hasPTX63 : Predicate<"Subtarget->getPTXVersion() >= 63">;
	def hasPTX64 : Predicate<"Subtarget->getPTXVersion() >= 64">;			def hasPTX64 : Predicate<"Subtarget->getPTXVersion() >= 64">;
				def hasPTX65 : Predicate<"Subtarget->getPTXVersion() >= 65">;
	def hasPTX70 : Predicate<"Subtarget->getPTXVersion() >= 70">;			def hasPTX70 : Predicate<"Subtarget->getPTXVersion() >= 70">;

	def hasSM30 : Predicate<"Subtarget->getSmVersion() >= 30">;			def hasSM30 : Predicate<"Subtarget->getSmVersion() >= 30">;
	def hasSM70 : Predicate<"Subtarget->getSmVersion() >= 70">;			def hasSM70 : Predicate<"Subtarget->getSmVersion() >= 70">;
	def hasSM72 : Predicate<"Subtarget->getSmVersion() >= 72">;			def hasSM72 : Predicate<"Subtarget->getSmVersion() >= 72">;
	def hasSM75 : Predicate<"Subtarget->getSmVersion() >= 75">;			def hasSM75 : Predicate<"Subtarget->getSmVersion() >= 75">;
	def hasSM80 : Predicate<"Subtarget->getSmVersion() >= 80">;			def hasSM80 : Predicate<"Subtarget->getSmVersion() >= 80">;

	[2,991 lines not shown]
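
Each predicate above is a ">=" test on the subtarget's PTX or SM version, and an instruction that lists several predicates (for example `[hasSM75, hasPTX65]`) is selectable only when all of them hold. A minimal C++ sketch of that gating follows. `SubtargetSketch` and the helper functions are hypothetical stand-ins for `Subtarget->getPTXVersion()` and `Subtarget->getSmVersion()`, not LLVM APIs.

```cpp
// Hypothetical stand-in for the subtarget queries used by the predicates.
struct SubtargetSketch {
  unsigned smVersion;  // e.g. 80 for sm_80
  unsigned ptxVersion; // e.g. 70 for PTX 7.0
};

// Same shape as hasSM75 / hasPTX65: a ">=" test on a version number.
constexpr bool hasSM(SubtargetSketch st, unsigned v) {
  return st.smVersion >= v;
}
constexpr bool hasPTX(SubtargetSketch st, unsigned v) {
  return st.ptxVersion >= v;
}

// A predicate list holds only if every entry holds.
constexpr bool satisfies(SubtargetSketch st, unsigned sm, unsigned ptx) {
  return hasSM(st, sm) && hasPTX(st, ptx);
}

// sm_75 with PTX 6.5 meets [hasSM75, hasPTX65]; PTX 6.4 does not.
static_assert(satisfies(SubtargetSketch{75, 65}, 75, 65), "meets both");
static_assert(!satisfies(SubtargetSketch{75, 64}, 75, 65), "PTX too old");
// A newer target also meets older requirements.
static_assert(satisfies(SubtargetSketch{80, 70}, 75, 65), "newer target");
```

This is why the new `hasPTX65` definition is enough to gate the PTX 6.5 instructions: it slots into predicate lists next to the existing SM checks.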

llvm/lib/Target/NVPTX/NVPTXIntrinsics.td


[1,937 lines not shown]
def _ari64: NVPTXInst<(outs regclass:$dst1, regclass:$dst2),		def _ari64: NVPTXInst<(outs regclass:$dst1, regclass:$dst2),
(ins MEMri64:$src),		(ins MEMri64:$src),
!strconcat("ldu.global.", TyStr), []>;		!strconcat("ldu.global.", TyStr), []>;
def _avar: NVPTXInst<(outs regclass:$dst1, regclass:$dst2),		def _avar: NVPTXInst<(outs regclass:$dst1, regclass:$dst2),
(ins imemAny:$src),		(ins imemAny:$src),
!strconcat("ldu.global.", TyStr), []>;		!strconcat("ldu.global.", TyStr), []>;
}		}

multiclass VLDU_G_ELE_V4<string TyStr, NVPTXRegClass regclass> {		multiclass VLDU_G_ELE_V4<string TyStr, NVPTXRegClass regclass> {
def _areg32: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,		def _areg32: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,
regclass:$dst4), (ins Int32Regs:$src),		regclass:$dst4), (ins Int32Regs:$src),
!strconcat("ldu.global.", TyStr), []>;		!strconcat("ldu.global.", TyStr), []>;
def _areg64: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,		def _areg64: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,
regclass:$dst4), (ins Int64Regs:$src),		regclass:$dst4), (ins Int64Regs:$src),
!strconcat("ldu.global.", TyStr), []>;		!strconcat("ldu.global.", TyStr), []>;
def _ari32: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,		def _ari32: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,
regclass:$dst4), (ins MEMri:$src),		regclass:$dst4), (ins MEMri:$src),
!strconcat("ldu.global.", TyStr), []>;		!strconcat("ldu.global.", TyStr), []>;
def _ari64: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,		def _ari64: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,
regclass:$dst4), (ins MEMri64:$src),		regclass:$dst4), (ins MEMri64:$src),
!strconcat("ldu.global.", TyStr), []>;		!strconcat("ldu.global.", TyStr), []>;
def _avar: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,		def _avar: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,
regclass:$dst4), (ins imemAny:$src),		regclass:$dst4), (ins imemAny:$src),
!strconcat("ldu.global.", TyStr), []>;		!strconcat("ldu.global.", TyStr), []>;
}		}

defm INT_PTX_LDU_G_v2i8_ELE		defm INT_PTX_LDU_G_v2i8_ELE
: VLDU_G_ELE_V2<"v2.u8 \t{{$dst1, $dst2}}, [$src];", Int16Regs>;		: VLDU_G_ELE_V2<"v2.u8 \t{{$dst1, $dst2}}, [$src];", Int16Regs>;
defm INT_PTX_LDU_G_v2i16_ELE		defm INT_PTX_LDU_G_v2i16_ELE
: VLDU_G_ELE_V2<"v2.u16 \t{{$dst1, $dst2}}, [$src];", Int16Regs>;		: VLDU_G_ELE_V2<"v2.u16 \t{{$dst1, $dst2}}, [$src];", Int16Regs>;
defm INT_PTX_LDU_G_v2i32_ELE		defm INT_PTX_LDU_G_v2i32_ELE
[23 lines not shown]	defm INT_PTX_LDU_G_v4f16x2_ELE
: VLDU_G_ELE_V4<"v4.b32 \t{{$dst1, $dst2, $dst3, $dst4}}, [$src];",		: VLDU_G_ELE_V4<"v4.b32 \t{{$dst1, $dst2, $dst3, $dst4}}, [$src];",
Float16x2Regs>;		Float16x2Regs>;
defm INT_PTX_LDU_G_v4f32_ELE		defm INT_PTX_LDU_G_v4f32_ELE
: VLDU_G_ELE_V4<"v4.f32 \t{{$dst1, $dst2, $dst3, $dst4}}, [$src];",		: VLDU_G_ELE_V4<"v4.f32 \t{{$dst1, $dst2, $dst3, $dst4}}, [$src];",
Float32Regs>;		Float32Regs>;


//-----------------------------------		//-----------------------------------
// Support for ldg on sm_35 or later		// Support for ldg on sm_35 or later
//-----------------------------------		//-----------------------------------

// Don't annotate ld.global.nc as mayLoad, because these loads go through the		// Don't annotate ld.global.nc as mayLoad, because these loads go through the
// non-coherent texture cache, and therefore the values read must be read-only		// non-coherent texture cache, and therefore the values read must be read-only
// during the lifetime of the kernel.		// during the lifetime of the kernel.

multiclass LDG_G<string TyStr, NVPTXRegClass regclass> {		multiclass LDG_G<string TyStr, NVPTXRegClass regclass> {
def areg: NVPTXInst<(outs regclass:$result), (ins Int32Regs:$src),		def areg: NVPTXInst<(outs regclass:$result), (ins Int32Regs:$src),
[31 lines not shown]	defm INT_PTX_LDG_GLOBAL_f64
: LDG_G<"f64 \t$result, [$src];", Float64Regs>;		: LDG_G<"f64 \t$result, [$src];", Float64Regs>;
defm INT_PTX_LDG_GLOBAL_p32		defm INT_PTX_LDG_GLOBAL_p32
: LDG_G<"u32 \t$result, [$src];", Int32Regs>;		: LDG_G<"u32 \t$result, [$src];", Int32Regs>;
defm INT_PTX_LDG_GLOBAL_p64		defm INT_PTX_LDG_GLOBAL_p64
: LDG_G<"u64 \t$result, [$src];", Int64Regs>;		: LDG_G<"u64 \t$result, [$src];", Int64Regs>;

// vector		// vector

// Elementized vector ldg		// Elementized vector ldg
multiclass VLDG_G_ELE_V2<string TyStr, NVPTXRegClass regclass> {		multiclass VLDG_G_ELE_V2<string TyStr, NVPTXRegClass regclass> {
def _areg32: NVPTXInst<(outs regclass:$dst1, regclass:$dst2),		def _areg32: NVPTXInst<(outs regclass:$dst1, regclass:$dst2),
(ins Int32Regs:$src),		(ins Int32Regs:$src),
!strconcat("ld.global.nc.", TyStr), []>;		!strconcat("ld.global.nc.", TyStr), []>;
def _areg64: NVPTXInst<(outs regclass:$dst1, regclass:$dst2),		def _areg64: NVPTXInst<(outs regclass:$dst1, regclass:$dst2),
(ins Int64Regs:$src),		(ins Int64Regs:$src),
!strconcat("ld.global.nc.", TyStr), []>;		!strconcat("ld.global.nc.", TyStr), []>;
def _ari32: NVPTXInst<(outs regclass:$dst1, regclass:$dst2),		def _ari32: NVPTXInst<(outs regclass:$dst1, regclass:$dst2),
(ins MEMri:$src),		(ins MEMri:$src),
!strconcat("ld.global.nc.", TyStr), []>;		!strconcat("ld.global.nc.", TyStr), []>;
def _ari64: NVPTXInst<(outs regclass:$dst1, regclass:$dst2),		def _ari64: NVPTXInst<(outs regclass:$dst1, regclass:$dst2),
(ins MEMri64:$src),		(ins MEMri64:$src),
!strconcat("ld.global.nc.", TyStr), []>;		!strconcat("ld.global.nc.", TyStr), []>;
def _avar: NVPTXInst<(outs regclass:$dst1, regclass:$dst2),		def _avar: NVPTXInst<(outs regclass:$dst1, regclass:$dst2),
(ins imemAny:$src),		(ins imemAny:$src),
!strconcat("ld.global.nc.", TyStr), []>;		!strconcat("ld.global.nc.", TyStr), []>;
}		}

multiclass VLDG_G_ELE_V4<string TyStr, NVPTXRegClass regclass> {		multiclass VLDG_G_ELE_V4<string TyStr, NVPTXRegClass regclass> {
def _areg32: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,		def _areg32: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,
regclass:$dst4), (ins Int32Regs:$src),		regclass:$dst4), (ins Int32Regs:$src),
!strconcat("ld.global.nc.", TyStr), []>;		!strconcat("ld.global.nc.", TyStr), []>;
def _areg64: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,		def _areg64: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,
regclass:$dst4), (ins Int64Regs:$src),		regclass:$dst4), (ins Int64Regs:$src),
!strconcat("ld.global.nc.", TyStr), []>;		!strconcat("ld.global.nc.", TyStr), []>;
def _ari32: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,		def _ari32: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,
regclass:$dst4), (ins MEMri:$src),		regclass:$dst4), (ins MEMri:$src),
!strconcat("ld.global.nc.", TyStr), []>;		!strconcat("ld.global.nc.", TyStr), []>;
def _ari64: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,		def _ari64: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,
regclass:$dst4), (ins MEMri64:$src),		regclass:$dst4), (ins MEMri64:$src),
!strconcat("ld.global.nc.", TyStr), []>;		!strconcat("ld.global.nc.", TyStr), []>;
def _avar: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,		def _avar: NVPTXInst<(outs regclass:$dst1, regclass:$dst2, regclass:$dst3,
regclass:$dst4), (ins imemAny:$src),		regclass:$dst4), (ins imemAny:$src),
!strconcat("ld.global.nc.", TyStr), []>;		!strconcat("ld.global.nc.", TyStr), []>;
}		}

// FIXME: 8-bit LDG should be fixed once LDG/LDU nodes are made into proper loads.		// FIXME: 8-bit LDG should be fixed once LDG/LDU nodes are made into proper loads.
defm INT_PTX_LDG_G_v2i8_ELE		defm INT_PTX_LDG_G_v2i8_ELE
: VLDG_G_ELE_V2<"v2.u8 \t{{$dst1, $dst2}}, [$src];", Int16Regs>;		: VLDG_G_ELE_V2<"v2.u8 \t{{$dst1, $dst2}}, [$src];", Int16Regs>;
defm INT_PTX_LDG_G_v2i16_ELE		defm INT_PTX_LDG_G_v2i16_ELE
: VLDG_G_ELE_V2<"v2.u16 \t{{$dst1, $dst2}}, [$src];", Int16Regs>;		: VLDG_G_ELE_V2<"v2.u16 \t{{$dst1, $dst2}}, [$src];", Int16Regs>;
[5,473 lines not shown]
def INT_PTX_SREG_WARPSIZE :		def INT_PTX_SREG_WARPSIZE :
NVPTXInst<(outs Int32Regs:$dst), (ins), "mov.u32 \t$dst, WARP_SZ;",		NVPTXInst<(outs Int32Regs:$dst), (ins), "mov.u32 \t$dst, WARP_SZ;",
[(set Int32Regs:$dst, (int_nvvm_read_ptx_sreg_warpsize))]>;		[(set Int32Regs:$dst, (int_nvvm_read_ptx_sreg_warpsize))]>;

// Helper class that represents a 'fragment' of an NVPTX *MMA instruction.		// Helper class that represents a 'fragment' of an NVPTX *MMA instruction.
// In addition to target-independent fields provided by WMMA_REGS, it adds		// In addition to target-independent fields provided by WMMA_REGS, it adds
// the fields commonly used to implement specific PTX instruction -- register		// the fields commonly used to implement specific PTX instruction -- register
// types and names, constraints, parts of assembly, etc.		// types and names, constraints, parts of assembly, etc.
class WMMA_REGINFO<WMMA_REGS r>		class WMMA_REGINFO<WMMA_REGS r, string op>
: WMMA_REGS<r.geom, r.frag, r.ptx_elt_type> {		: WMMA_REGS<r.geom, r.frag, r.ptx_elt_type> {
// NVPTX register types used to carry fragment data.		// NVPTX register types used to carry fragment data.
NVPTXRegClass regclass = !cond(		NVPTXRegClass regclass = !cond(
!eq(ptx_elt_type, "f16") : Float16x2Regs,		!eq(ptx_elt_type, "f16") : Float16x2Regs,
!eq(ptx_elt_type, "f32") : Float32Regs,		!eq(ptx_elt_type, "f32") : Float32Regs,
		!eq(ptx_elt_type, "f64") : Float64Regs,
		!eq(ptx_elt_type, "bf16") : Int32Regs,
		!eq(ptx_elt_type, "tf32") : Int32Regs,
!eq(ptx_elt_type, "s32") : Int32Regs,		!eq(ptx_elt_type, "s32") : Int32Regs,
!eq(ptx_elt_type, "s8") : Int32Regs,		!eq(ptx_elt_type, "s8") : Int32Regs,
!eq(ptx_elt_type, "u8") : Int32Regs,		!eq(ptx_elt_type, "u8") : Int32Regs,
!eq(ptx_elt_type, "s4") : Int32Regs,		!eq(ptx_elt_type, "s4") : Int32Regs,
!eq(ptx_elt_type, "u4") : Int32Regs,		!eq(ptx_elt_type, "u4") : Int32Regs,
!eq(ptx_elt_type, "b1") : Int32Regs);		!eq(ptx_elt_type, "b1") : Int32Regs);

// Instruction input/output arguments for the fragment.		// Instruction input/output arguments for the fragment.
[12 lines not shown]	class WMMA_REGINFO<WMMA_REGS r, string op>
// longer the case, we can concat all per-fragment predicates to enforce that		// longer the case, we can concat all per-fragment predicates to enforce that
// all fragments of the instruction are viable.		// all fragments of the instruction are viable.
list<Predicate> Predicates = !cond(		list<Predicate> Predicates = !cond(
// fp16 -> fp16/fp32 @ m16n16k16		// fp16 -> fp16/fp32 @ m16n16k16
!and(!eq(geom, "m16n16k16"),		!and(!eq(geom, "m16n16k16"),
!or(!eq(ptx_elt_type, "f16"),		!or(!eq(ptx_elt_type, "f16"),
!eq(ptx_elt_type, "f32"))) : [hasSM70, hasPTX60],		!eq(ptx_elt_type, "f32"))) : [hasSM70, hasPTX60],

		!and(!eq(geom,"m8n8k4"),
		!eq(ptx_elt_type, "f64")) : [hasSM80, hasPTX70],

// fp16 -> fp16/fp32 @ m8n32k16/m32n8k16		// fp16 -> fp16/fp32 @ m8n32k16/m32n8k16
!and(!or(!eq(geom, "m8n32k16"),		!and(!or(!eq(geom, "m8n32k16"),
!eq(geom, "m32n8k16")),		!eq(geom, "m32n8k16")),
!or(!eq(ptx_elt_type, "f16"),		!or(!eq(ptx_elt_type, "f16"),
!eq(ptx_elt_type, "f32"))) : [hasSM70, hasPTX61],		!eq(ptx_elt_type, "f32"))) : [hasSM70, hasPTX61],

// u8/s8 -> s32 @ m16n16k16/m8n32k16/m32n8k16		// u8/s8 -> s32 @ m16n16k16/m8n32k16/m32n8k16
!and(!or(!eq(geom,"m16n16k16"),		!and(!or(!eq(geom,"m16n16k16"),
!eq(geom,"m8n32k16"),		!eq(geom,"m8n32k16"),
!eq(geom,"m32n8k16")),		!eq(geom,"m32n8k16")),
!or(!eq(ptx_elt_type, "u8"),		!or(!eq(ptx_elt_type, "u8"),
!eq(ptx_elt_type, "s8"),		!eq(ptx_elt_type, "s8"),
!eq(ptx_elt_type, "s32"))) : [hasSM72, hasPTX63],		!eq(ptx_elt_type, "s32"))) : [hasSM72, hasPTX63],

// u4/s4/b1 -> s32 @ m8n8k32 (u4/s4), m8n8k128(b1)		!and(!or(!eq(geom,"m16n16k16"),
!or(!eq(geom,"m8n8k128"),		!eq(geom,"m8n32k16"),
		!eq(geom,"m32n8k16")),
		!eq(ptx_elt_type, "bf16")) : [hasSM80, hasPTX70],

		!and(!eq(geom,"m16n16k8"),
		!eq(ptx_elt_type, "tf32")) : [hasSM80, hasPTX70],

		!and(!eq(geom,"m16n16k8"),
		!eq(ptx_elt_type, "f32")) : [hasSM80, hasPTX70],

		// b1 -> s32 @ m8n8k128(b1)
		!and(!ne(op,"mma"),
		!eq(geom,"m8n8k128")) : [hasSM75, hasPTX63],

		// u4/s4 -> s32 @ m8n8k32 (u4/s4)
		!and(!ne(op,"mma"),
!eq(geom,"m8n8k32")) : [hasSM75, hasPTX63],		!eq(geom,"m8n8k32")) : [hasSM75, hasPTX63],

!eq(geom, "m8n8k4") : [hasSM70, hasPTX64]);		!or(!eq(geom,"m16n8k8"),
		!eq(geom,"m8n8k16")) : [hasSM75, hasPTX65],

		!and(!ne(ptx_elt_type,"f64"),
		!eq(geom, "m8n8k4")) : [hasSM70, hasPTX64],

		// mma m8n8k32 requires higher PTX version
		!and(!eq(op,"mma"),
		!eq(geom,"m8n8k32")) : [hasSM75, hasPTX65],

		!and(!eq(ptx_elt_type,"f64"),
		!eq(geom, "m8n8k4")) : [hasSM80, hasPTX70],

		!and(!eq(op,"mma"),
		!or(!eq(geom, "m16n8k16"),
		!eq(geom, "m16n8k4"),
		!eq(geom, "m16n8k32"),
		!eq(geom, "m16n8k64"),
		!eq(geom, "m8n8k128"),
		!eq(geom, "m16n8k128"),
		!eq(geom, "m16n8k256"))) : [hasSM80, hasPTX70]);

// template DAGs for instruction inputs/output.		// template DAGs for instruction inputs/output.
dag Outs = !dag(outs, ptx_regs, reg_names);		dag Outs = !dag(outs, ptx_regs, reg_names);
dag Ins = !dag(ins, ptx_regs, reg_names);		dag Ins = !dag(ins, ptx_regs, reg_names);
}		}

// Convert dag of arguments into a dag to match given intrinsic.		// Convert dag of arguments into a dag to match given intrinsic.
class BuildPatternI<Intrinsic Intr, dag Ins> {		class BuildPatternI<Intrinsic Intr, dag Ins> {
[... 107 unchanged lines not shown ...]

// Create all load/store variants		// Create all load/store variants
defset list<WMMA_INSTR> MMA_LDSTs = {		defset list<WMMA_INSTR> MMA_LDSTs = {
foreach layout = ["row", "col"] in {		foreach layout = ["row", "col"] in {
foreach stride = [false, true] in {		foreach stride = [false, true] in {
foreach space = [".global", ".shared", ""] in {		foreach space = [".global", ".shared", ""] in {
foreach addr = [imem, Int32Regs, Int64Regs, MEMri, MEMri64] in {		foreach addr = [imem, Int32Regs, Int64Regs, MEMri, MEMri64] in {
foreach frag = NVVM_MMA_OPS.all_ld_ops in		foreach frag = NVVM_MMA_OPS.all_ld_ops in
if NVVM_MMA_SUPPORTED<[frag], layout>.ret then		if NVVM_WMMA_LDST_SUPPORTED<frag, layout>.ret then
def : WMMA_LOAD<WMMA_REGINFO<frag>, layout, space, stride, addr>;		def : WMMA_LOAD<WMMA_REGINFO<frag, "load">, layout, space, stride, addr>;
foreach frag = NVVM_MMA_OPS.all_st_ops in		foreach frag = NVVM_MMA_OPS.all_st_ops in
if NVVM_MMA_SUPPORTED<[frag], layout>.ret then		if NVVM_WMMA_LDST_SUPPORTED<frag, layout>.ret then
def : WMMA_STORE_D<WMMA_REGINFO<frag>, layout, space, stride, addr>;		def : WMMA_STORE_D<WMMA_REGINFO<frag, "store">, layout, space, stride, addr>;
} // addr		} // addr
} // space		} // space
} // stride		} // stride
} // layout		} // layout
} // defset		} // defset

// WMMA.MMA		// WMMA.MMA
class WMMA_MMA<WMMA_REGINFO FragA, WMMA_REGINFO FragB,		class WMMA_MMA<WMMA_REGINFO FragA, WMMA_REGINFO FragB,
WMMA_REGINFO FragC, WMMA_REGINFO FragD,		WMMA_REGINFO FragC, WMMA_REGINFO FragD,
string ALayout, string BLayout, int Satfinite>		string ALayout, string BLayout, int Satfinite, string rnd>
: WMMA_INSTR<WMMA_NAME_MMA<ALayout, BLayout, Satfinite, FragA, FragB, FragC, FragD>.record,		: WMMA_INSTR<WMMA_NAME<ALayout, BLayout, Satfinite, rnd, FragA, FragB, FragC, FragD>.record,
[FragA.Ins, FragB.Ins, FragC.Ins]>,		[FragA.Ins, FragB.Ins, FragC.Ins]>,
// Requires does not seem to have effect on Instruction w/o Patterns.		// Requires does not seem to have effect on Instruction w/o Patterns.
// We set it here anyways and propagate to the Pat<> we construct below.		// We set it here anyways and propagate to the Pat<> we construct below.
Requires<FragA.Predicates> {		Requires<FragA.Predicates> {
let OutOperandList = FragD.Outs;		let OutOperandList = FragD.Outs;
let InOperandList = !con(Args, (ins MmaCode:$ptx));		let InOperandList = !con(Args, (ins MmaCode:$ptx));
string TypeList = !cond(		string TypeList = !cond(
!eq(FragD.geom, "m8n8k4") : "." # FragD.ptx_elt_type		!eq(FragA.ptx_elt_type, "f16") : "." # FragD.ptx_elt_type
# ".f16.f16."		# "." # FragC.ptx_elt_type,
# FragC.ptx_elt_type,		1: "." # FragD.ptx_elt_type
!eq(FragD.ptx_elt_type, "s32") : ".s32"
# "." # FragA.ptx_elt_type		# "." # FragA.ptx_elt_type
# "." # FragB.ptx_elt_type		# "." # FragB.ptx_elt_type
# ".s32",		# "." # FragC.ptx_elt_type,
1: "." # FragD.ptx_elt_type # "." # FragC.ptx_elt_type,
);		);
let AsmString = !if(!eq(FragA.geom, "m8n8k4"),		let AsmString = "wmma.mma"
"mma.sync.aligned.m8n8k4"
# "." # ALayout
# "." # BLayout
# TypeList # "\n\t\t"
# FragD.regstring # ",\n\t\t"
# FragA.regstring # ",\n\t\t"
# FragB.regstring # ",\n\t\t"
# FragC.regstring # ";",
"wmma.mma"
# !if(!eq(FragA.ptx_elt_type, "b1"), ".xor.popc", "")		# !if(!eq(FragA.ptx_elt_type, "b1"), ".xor.popc", "")
# ".sync"		# ".sync"
# "${ptx:aligned}"		# "${ptx:aligned}"
# "." # ALayout		# "." # ALayout
# "." # BLayout		# "." # BLayout
# "." # FragA.geom		# "." # FragA.geom
		# !if(!ne(rnd, ""), !strconcat(".", rnd), "")
# TypeList		# TypeList
# !if(Satfinite, ".satfinite", "") # "\n\t\t"		# !if(Satfinite, ".satfinite", "") # "\n\t\t"
# FragD.regstring # ",\n\t\t"		# FragD.regstring # ",\n\t\t"
# FragA.regstring # ",\n\t\t"		# FragA.regstring # ",\n\t\t"
# FragB.regstring # ",\n\t\t"		# FragB.regstring # ",\n\t\t"
# FragC.regstring # ";");		# FragC.regstring # ";";
		}

		defset list<WMMA_INSTR> WMMAs = {
		foreach layout_a = ["row", "col"] in {
		foreach layout_b = ["row", "col"] in {
		foreach satf = [0, 1] in {
		foreach rnd = ["", "rn", "rz", "rm", "rp"] in {
		foreach op = NVVM_MMA_OPS.all_wmma_ops in {
		if NVVM_WMMA_SUPPORTED<op, layout_a, layout_b, satf, rnd>.ret then {
		def : WMMA_MMA<WMMA_REGINFO<op[0], "wmma.mma">,
		WMMA_REGINFO<op[1], "wmma.mma">,
		WMMA_REGINFO<op[2], "wmma.mma">,
		WMMA_REGINFO<op[3], "wmma.mma">,
		layout_a, layout_b, satf, rnd>;
		}
		} // op
		} // rnd
		} // satf
		} // layout_b
		} // layout_a
		} // defset

		// MMA
		class MMA<WMMA_REGINFO FragA, WMMA_REGINFO FragB,
		WMMA_REGINFO FragC, WMMA_REGINFO FragD,
		string ALayout, string BLayout, int Satfinite>
		: WMMA_INSTR<MMA_NAME<ALayout, BLayout, Satfinite, FragA, FragB, FragC, FragD>.record,
		[FragA.Ins, FragB.Ins, FragC.Ins]>,
		// Requires does not seem to have effect on Instruction w/o Patterns.
		// We set it here anyways and propagate to the Pat<> we construct below.
		Requires<FragA.Predicates> {
		let OutOperandList = FragD.Outs;
		let InOperandList = !con(Args, (ins MmaCode:$ptx));
		string TypeList = "." # FragD.ptx_elt_type
		# "." # FragA.ptx_elt_type
		# "." # FragB.ptx_elt_type
		# "." # FragC.ptx_elt_type;
		let AsmString = "mma.sync.aligned."
		# FragA.geom
		# "." # ALayout
		# "." # BLayout
		# !if(Satfinite, ".satfinite", "")
		# TypeList
		# !if(!eq(FragA.ptx_elt_type, "b1"), ".xor.popc", "") # "\n\t\t"
		# FragD.regstring # ",\n\t\t"
		# FragA.regstring # ",\n\t\t"
		# FragB.regstring # ",\n\t\t"
		# FragC.regstring # ";";
}		}

defset list<WMMA_INSTR> MMAs = {		defset list<WMMA_INSTR> MMAs = {
foreach layout_a = ["row", "col"] in {		foreach layout_a = ["row", "col"] in {
foreach layout_b = ["row", "col"] in {		foreach layout_b = ["row", "col"] in {
foreach satf = [0, 1] in {		foreach satf = [0, 1] in {
foreach op = NVVM_MMA_OPS.all_mma_ops in {		foreach op = NVVM_MMA_OPS.all_mma_ops in {
if NVVM_MMA_SUPPORTED<op, layout_a, layout_b, satf>.ret then {		if NVVM_MMA_SUPPORTED<op, layout_a, layout_b, satf>.ret then {
def : WMMA_MMA<WMMA_REGINFO<op[0]>,		def : MMA<WMMA_REGINFO<op[0], "mma">,
WMMA_REGINFO<op[1]>,		WMMA_REGINFO<op[1], "mma">,
WMMA_REGINFO<op[2]>,		WMMA_REGINFO<op[2], "mma">,
WMMA_REGINFO<op[3]>,		WMMA_REGINFO<op[3], "mma">,
layout_a, layout_b, satf>;		layout_a, layout_b, satf>;
}		}
} // op		} // op
} // satf		} // satf
} // layout_b		} // layout_b
} // layout_a		} // layout_a
} // defset		} // defset
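
As a rough illustration of how the two AsmString definitions on the new side of the diff above assemble their mnemonics, here is a minimal Python sketch. The helper names mma_asm and wmma_asm and the aligned parameter are hypothetical and not part of the patch. The aligned flag stands in for the ${ptx:aligned} operand modifier, which the accompanying test script expects to print ".aligned" for PTX 6.3 and later.

```python
def mma_asm(geom, alayout, blayout, d, a, b, c, satfinite=False):
    """Mirror of the new MMA class: mma.sync.aligned.<geom>.<layouts>[.satfinite].D.A.B.C[.xor.popc]."""
    parts = ["mma.sync.aligned", geom, alayout, blayout]
    if satfinite:
        parts.append("satfinite")
    # The MMA class always spells out all four element types as D.A.B.C.
    parts += [d, a, b, c]
    if a == "b1":
        # Single-bit inputs carry the .xor.popc suffix after the type list.
        parts.append("xor.popc")
    return ".".join(parts)


def wmma_asm(geom, alayout, blayout, d, a, b, c,
             rnd="", satfinite=False, aligned=True):
    """Mirror of the new WMMA_MMA class; `aligned` is a hypothetical stand-in for ${ptx:aligned}."""
    asm = "wmma.mma"
    if a == "b1":
        # For wmma.mma the .xor.popc suffix comes right after the opcode.
        asm += ".xor.popc"
    asm += ".sync"
    if aligned:
        asm += ".aligned"
    asm += "." + alayout + "." + blayout + "." + geom
    if rnd:
        asm += "." + rnd
    # f16 inputs use the short D.C type list; everything else uses D.A.B.C.
    if a == "f16":
        asm += "." + d + "." + c
    else:
        asm += "." + ".".join([d, a, b, c])
    if satfinite:
        asm += ".satfinite"
    return asm


print(mma_asm("m16n8k8", "row", "col", "f32", "tf32", "tf32", "f32"))
print(wmma_asm("m8n8k4", "row", "col", "f64", "f64", "f64", "f64", rnd="rn"))
```

The sketch only reproduces the string layout; it does not check whether a given shape/type combination is legal, which the patch handles separately through the NVVM_WMMA_SUPPORTED and NVVM_MMA_SUPPORTED filters.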


// Constructing non-flat DAGs is still a pain. I can't !subst a dag node with a		// Constructing non-flat DAGs is still a pain. I can't !subst a dag node with a
// dag, so the ptx.version must be appended after foreach replaces 'ins' with		// dag, so the ptx.version must be appended after foreach replaces 'ins' with
// the instruction record.		// the instruction record.
class WMMA_PAT<WMMA_INSTR wi>		class MMA_PAT<WMMA_INSTR wi>
: Pat<wi.IntrinsicPattern,		: Pat<wi.IntrinsicPattern,
!con(!foreach(tmp, wi.Args, !subst(ins, wi, tmp)),		!con(!foreach(tmp, wi.Args, !subst(ins, wi, tmp)),
(wi ptx.version))>,		(wi ptx.version))>,
Requires<wi.Predicates>;		Requires<wi.Predicates>;

// Build intrinsic->instruction patterns for all MMA instructions.		// Build intrinsic->instruction patterns for all MMA instructions.
foreach mma = !listconcat(MMAs, MMA_LDSTs) in		foreach mma = !listconcat(MMAs, WMMAs, MMA_LDSTs) in
def : WMMA_PAT<mma>;		def : MMA_PAT<mma>;
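
The Predicates !cond chain in WMMA_REGINFO (new side of the diff) takes the first case that matches, so the order of the cases matters. The following Python sketch paraphrases that chain as an if-ladder returning (minimum SM, minimum PTX) pairs. The function name required_features is hypothetical. The first case is cut off in this excerpt and is assumed here to be f16/f32 at m16n16k16. Raising ValueError when nothing matches is also an assumption about how to model a !cond with no default.

```python
def required_features(geom, ptx_elt_type, op):
    """Return (min_sm, min_ptx) for a fragment; the first matching case wins."""
    wmma_geoms = ("m16n16k16", "m8n32k16", "m32n8k16")

    # ASSUMPTION: this first case is truncated in the excerpt.
    if geom == "m16n16k16" and ptx_elt_type in ("f16", "f32"):
        return (70, 60)
    if geom == "m8n8k4" and ptx_elt_type == "f64":
        return (80, 70)
    if geom in ("m8n32k16", "m32n8k16") and ptx_elt_type in ("f16", "f32"):
        return (70, 61)
    if geom in wmma_geoms and ptx_elt_type in ("u8", "s8", "s32"):
        return (72, 63)
    if geom in wmma_geoms and ptx_elt_type == "bf16":
        return (80, 70)
    if geom == "m16n16k8" and ptx_elt_type in ("tf32", "f32"):
        return (80, 70)
    # Sub-integer and single-bit WMMA shapes (anything that is not plain mma).
    if op != "mma" and geom in ("m8n8k128", "m8n8k32"):
        return (75, 63)
    if geom in ("m16n8k8", "m8n8k16"):
        return (75, 65)
    if geom == "m8n8k4" and ptx_elt_type != "f64":
        return (70, 64)
    # mma at m8n8k32 needs a newer PTX version than the wmma form.
    if op == "mma" and geom == "m8n8k32":
        return (75, 65)
    # The diff repeats the f64/m8n8k4 case here; it is already handled above.
    if op == "mma" and geom in ("m16n8k16", "m16n8k4", "m16n8k32", "m16n8k64",
                                "m8n8k128", "m16n8k128", "m16n8k256"):
        return (80, 70)
    raise ValueError("no predicate case for %s/%s/%s" % (geom, ptx_elt_type, op))


print(required_features("m8n8k32", "u4", "wmma.mma"))
print(required_features("m8n8k32", "u4", "mma"))
```

The two calls at the end show why the chain branches on op: the same m8n8k32 geometry maps to PTX 6.3 for the WMMA form and to PTX 6.5 for mma.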

llvm/test/CodeGen/NVPTX/lit.local.cfg

	if not 'NVPTX' in config.root.targets:			if not 'NVPTX' in config.root.targets:
	config.unsupported = True			config.unsupported = True
				config.suffixes.add('.py')

llvm/test/CodeGen/NVPTX/wmma.py

# This test generates all variants of wmma intrinsics and verifies that LLVM		# This test generates all variants of wmma intrinsics and verifies that LLVM
# generates correct instructions for them.		# generates correct instructions for them.

# Check all variants of instructions supported by PTX60 on SM70		# Check all variants of instructions supported by PTX60 on SM70
# RUN: python %s --ptx=60 --gpu-arch=70 > %t-ptx60-sm_70.ll		# RUN: python %s --ptx=60 --gpu-arch=70 > %t-ptx60-sm_70.ll
# RUN: FileCheck %t-ptx60-sm_70.ll < %t-ptx60-sm_70.ll \		# RUN: FileCheck %t-ptx60-sm_70.ll < %t-ptx60-sm_70.ll \
# RUN: --check-prefixes=INTRINSICS,M16N16		# RUN: --check-prefixes=INTRINSICS,M16N16
# RUN: FileCheck %t-ptx60-sm_70.ll < %t-ptx60-sm_70.ll \		# RUN: FileCheck %t-ptx60-sm_70.ll < %t-ptx60-sm_70.ll \
# RUN: --check-prefixes=INTRINSICS,NOEXTGEOM,NOINT,NOSUBINT,NOMMA		# RUN: --check-prefixes=INTRINSICS,NOEXTGEOM,NOINT,NOSUBINT,NOMMA,NODOUBLE,NOALTFLOAT
# RUN: llc < %t-ptx60-sm_70.ll -march=nvptx64 -mcpu=sm_70 -mattr=+ptx60 \		# RUN: llc < %t-ptx60-sm_70.ll -march=nvptx64 -mcpu=sm_70 -mattr=+ptx60 \
# RUN: | FileCheck %t-ptx60-sm_70.ll		# RUN: | FileCheck %t-ptx60-sm_70.ll

# Check all variants of instructions supported by PTX61 on SM70		# Check all variants of instructions supported by PTX61 on SM70
# RUN: python %s --ptx=61 --gpu-arch=70 > %t-ptx61-sm_70.ll		# RUN: python %s --ptx=61 --gpu-arch=70 > %t-ptx61-sm_70.ll
# RUN: FileCheck %t-ptx61-sm_70.ll < %t-ptx61-sm_70.ll \		# RUN: FileCheck %t-ptx61-sm_70.ll < %t-ptx61-sm_70.ll \
# RUN: --check-prefixes=INTRINSICS,M16N16,EXTGEOM		# RUN: --check-prefixes=INTRINSICS,M16N16,EXTGEOM
# RUN: FileCheck %t-ptx61-sm_70.ll < %t-ptx61-sm_70.ll \		# RUN: FileCheck %t-ptx61-sm_70.ll < %t-ptx61-sm_70.ll \
# RUN: --check-prefixes=INTRINSICS,NOINT,NOSUBINT,NOMMA		# RUN: --check-prefixes=INTRINSICS,NOINT,NOSUBINT,NOMMA,NODOUBLE,NOALTFLOAT
# RUN: llc < %t-ptx61-sm_70.ll -march=nvptx64 -mcpu=sm_70 -mattr=+ptx61 \		# RUN: llc < %t-ptx61-sm_70.ll -march=nvptx64 -mcpu=sm_70 -mattr=+ptx61 \
# RUN: | FileCheck %t-ptx61-sm_70.ll		# RUN: | FileCheck %t-ptx61-sm_70.ll

# Check all variants of instructions supported by PTX63 on SM72		# Check all variants of instructions supported by PTX63 on SM72
# RUN: python %s --ptx=63 --gpu-arch=72 > %t-ptx63-sm_72.ll		# RUN: python %s --ptx=63 --gpu-arch=72 > %t-ptx63-sm_72.ll
# RUN: FileCheck %t-ptx63-sm_72.ll < %t-ptx63-sm_72.ll \		# RUN: FileCheck %t-ptx63-sm_72.ll < %t-ptx63-sm_72.ll \
# RUN: --check-prefixes=INTRINSICS,M16N16,EXTGEOM,INT		# RUN: --check-prefixes=INTRINSICS,M16N16,EXTGEOM,INT
# RUN: FileCheck %t-ptx63-sm_72.ll < %t-ptx63-sm_72.ll \		# RUN: FileCheck %t-ptx63-sm_72.ll < %t-ptx63-sm_72.ll \
# RUN: --check-prefixes=INTRINSICS,NOSUBINT,NOMMA		# RUN: --check-prefixes=INTRINSICS,NOSUBINT,NOMMA,NODOUBLE,NOALTFLOAT
# RUN: llc < %t-ptx63-sm_72.ll -march=nvptx64 -mcpu=sm_72 -mattr=+ptx63 \		# RUN: llc < %t-ptx63-sm_72.ll -march=nvptx64 -mcpu=sm_72 -mattr=+ptx63 \
# RUN: | FileCheck %t-ptx63-sm_72.ll		# RUN: | FileCheck %t-ptx63-sm_72.ll

# Check all variants of instructions supported by PTX63 on SM75		# Check all variants of instructions supported by PTX63 on SM75
# RUN: python %s --ptx=63 --gpu-arch=75 > %t-ptx63-sm_75.ll		# RUN: python %s --ptx=63 --gpu-arch=75 > %t-ptx63-sm_75.ll
# RUN: FileCheck %t-ptx63-sm_75.ll < %t-ptx63-sm_75.ll \		# RUN: FileCheck %t-ptx63-sm_75.ll < %t-ptx63-sm_75.ll \
# RUN: --check-prefixes=INTRINSICS,M16N16,EXTGEOM,INT,SUBINT		# RUN: --check-prefixes=INTRINSICS,M16N16,EXTGEOM,INT,SUBINT
# RUN: FileCheck %t-ptx63-sm_75.ll < %t-ptx63-sm_75.ll \		# RUN: FileCheck %t-ptx63-sm_75.ll < %t-ptx63-sm_75.ll \
# RUN: --check-prefixes=INTRINSICS,NOMMA		# RUN: --check-prefixes=INTRINSICS,NOMMA,NODOUBLE,NOALTFLOAT
# RUN: llc < %t-ptx63-sm_75.ll -march=nvptx64 -mcpu=sm_75 -mattr=+ptx63 \		# RUN: llc < %t-ptx63-sm_75.ll -march=nvptx64 -mcpu=sm_75 -mattr=+ptx63 \
# RUN: | FileCheck %t-ptx63-sm_75.ll		# RUN: | FileCheck %t-ptx63-sm_75.ll

# Check all variants of instructions supported by PTX64 on SM70+		# Check all variants of instructions supported by PTX64 on SM70+
# RUN: python %s --ptx=64 --gpu-arch=70 > %t-ptx64-sm_70.ll		# RUN: python %s --ptx=64 --gpu-arch=70 > %t-ptx64-sm_70.ll
# RUN: FileCheck %t-ptx64-sm_70.ll < %t-ptx64-sm_70.ll \		# RUN: FileCheck %t-ptx64-sm_70.ll < %t-ptx64-sm_70.ll \
# RUN: --check-prefixes=INTRINSICS,M16N16,EXTGEOM,MMA		# RUN: --check-prefixes=INTRINSICS,M16N16,EXTGEOM,MMA
# RUN: FileCheck %t-ptx64-sm_70.ll < %t-ptx64-sm_70.ll \		# RUN: FileCheck %t-ptx64-sm_70.ll < %t-ptx64-sm_70.ll \
# RUN: --check-prefixes=INTRINSICS,NOINT,NOSUBINT		# RUN: --check-prefixes=INTRINSICS,NOINT,NOSUBINT,NODOUBLE,NOALTFLOAT
# RUN: llc < %t-ptx64-sm_70.ll -march=nvptx64 -mcpu=sm_70 -mattr=+ptx64 \		# RUN: llc < %t-ptx64-sm_70.ll -march=nvptx64 -mcpu=sm_70 -mattr=+ptx64 \
# RUN: | FileCheck %t-ptx64-sm_70.ll		# RUN: | FileCheck %t-ptx64-sm_70.ll

		# Check all variants of instructions supported by PTX65 on SM75+
		# RUN: python %s --ptx=65 --gpu-arch=75 > %t-ptx65-sm_75.ll
		# RUN: FileCheck %t-ptx65-sm_75.ll < %t-ptx65-sm_75.ll \
		# RUN: --check-prefixes=INTRINSICS,M16N16,EXTGEOM,INT,SUBINT,MMA,PTX65MMA
		# RUN: FileCheck %t-ptx65-sm_75.ll < %t-ptx65-sm_75.ll \
		# RUN: --check-prefixes=INTRINSICS
		# RUN: llc < %t-ptx65-sm_75.ll -march=nvptx64 -mcpu=sm_75 -mattr=+ptx65 \
		# RUN: | FileCheck %t-ptx65-sm_75.ll

		# Check all variants of instructions supported by PTX70 on SM80+
		# RUN: python %s --ptx=70 --gpu-arch=80 > %t-ptx70-sm_80.ll
		# RUN: FileCheck %t-ptx70-sm_80.ll < %t-ptx70-sm_80.ll \
		# RUN: --check-prefixes=INTRINSICS,M16N16,EXTGEOM,INT,SUBINT,MMA,ALTFLOAT,DOUBLE,PTX65MMA,PTX70MMA
		# RUN: FileCheck %t-ptx70-sm_80.ll < %t-ptx70-sm_80.ll \
		# RUN: --check-prefixes=INTRINSICS
		# RUN: llc < %t-ptx70-sm_80.ll -march=nvptx64 -mcpu=sm_80 -mattr=+ptx70 \
		# RUN: | FileCheck %t-ptx70-sm_80.ll

from __future__ import print_function		from __future__ import print_function

import argparse		import argparse
from itertools import product		from itertools import product
from string import Template		from string import Template

class MMAType:		class MMAType:
def __init__(self, ptx_type):		def __init__(self, ptx_type):
self.ptx_type = ptx_type		self.ptx_type = ptx_type
self.llvm_type = {		self.llvm_type = {
"f16" : "<2 x half>",		"f16" : "<2 x half>",
"f32" : "float",		"f32" : "float",
		"f64" : "double",
"s32" : "i32",		"s32" : "i32",
"s8" : "i32",		"s8" : "i32",
"u8" : "i32",		"u8" : "i32",
"s4" : "i32",		"s4" : "i32",
"u4" : "i32",		"u4" : "i32",
"b1" : "i32",		"b1" : "i32",
		"bf16" : "i32",
		"tf32" : "i32",
}[ptx_type];		}[ptx_type];

self.ptx_reg_pattern = {		self.ptx_reg_pattern = {
"f16" : "%hh[0-9]+",		"f16" : "%hh[0-9]+",
"f32" : "%f[0-9]+",		"f32" : "%f[0-9]+",
		"f64" : "%fd[0-9]+",
}.get(ptx_type, "%r[0-9]+")		}.get(ptx_type, "%r[0-9]+")

def __repr__(self):		def __repr__(self):
return "%s/%s" % (self.ptx_type, self.llvm_type)		return "%s/%s" % (self.ptx_type, self.llvm_type)

class MMAFrag:		class MMAFrag:
def __init__(self, geom, frag, ptx_elt_type):		def __init__(self, geom, frag, ptx_elt_type):
self.geom = geom		self.geom = geom
self.frag = frag		self.frag = frag
self.is_mma = True if geom == "m8n8k4" else False;
self.mma_type = MMAType(ptx_elt_type);		self.mma_type = MMAType(ptx_elt_type);
self.nregs = {		self.nregs = {
"a:f16" : 2 if self.is_mma else 8,
"b:f16" : 2 if self.is_mma else 8,
"c:f16" : 4,
"d:f16" : 4,
"c:f32" : 8,
"d:f32" : 8,
}.get("%s:%s" % (frag, ptx_elt_type), {
# u8/s8 -> s32 @ m16n16k16/m8n32k16/m32n8k16		# u8/s8 -> s32 @ m16n16k16/m8n32k16/m32n8k16
"m16n16k16:a:u8" : 2,		"m16n16k16:a:u8" : 2,
"m16n16k16:a:s8" : 2,		"m16n16k16:a:s8" : 2,
"m16n16k16:b:u8" : 2,		"m16n16k16:b:u8" : 2,
"m16n16k16:b:s8" : 2,		"m16n16k16:b:s8" : 2,
"m16n16k16:c:s32" : 8,		"m16n16k16:c:s32" : 8,
"m16n16k16:d:s32" : 8,		"m16n16k16:d:s32" : 8,

"m8n32k16:a:u8" : 1,		"m8n32k16:a:u8" : 1,
"m8n32k16:a:s8" : 1,		"m8n32k16:a:s8" : 1,
"m8n32k16:b:u8" : 4,		"m8n32k16:b:u8" : 4,
"m8n32k16:b:s8" : 4,		"m8n32k16:b:s8" : 4,
"m8n32k16:c:s32" : 8,		"m8n32k16:c:s32" : 8,
"m8n32k16:d:s32" : 8,		"m8n32k16:d:s32" : 8,

"m32n8k16:a:u8" : 4,		"m32n8k16:a:u8" : 4,
"m32n8k16:a:s8" : 4,		"m32n8k16:a:s8" : 4,
"m32n8k16:b:u8" : 1,		"m32n8k16:b:u8" : 1,
"m32n8k16:b:s8" : 1,		"m32n8k16:b:s8" : 1,
"m32n8k16:c:s32" : 8,		"m32n8k16:c:s32" : 8,
"m32n8k16:d:s32" : 8,		"m32n8k16:d:s32" : 8,

# u4/s4/b1 -> s32 @ m8n8k32 (u4/s4), m8n8k128(b1)		"m8n8k16:a:u8": 1,
"m8n8k128:a:b1" : 1,		"m8n8k16:a:s8": 1,
		"m8n8k16:b:u8": 1,
		"m8n8k16:b:s8": 1,
		"m8n8k16:c:s32": 2,
		"m8n8k16:d:s32": 2,

		"m16n8k16:a:u8": 2,
		"m16n8k16:a:s8": 2,
		"m16n8k16:b:u8": 1,
		"m16n8k16:b:s8": 1,
		"m16n8k16:c:s32": 4,
		"m16n8k16:d:s32": 4,

		"m16n8k32:a:u8": 4,
		"m16n8k32:a:s8": 4,
		"m16n8k32:b:u8": 2,
		"m16n8k32:b:s8": 2,
		"m16n8k32:c:s32": 4,
		"m16n8k32:d:s32": 4,

		# u4/s4 -> s32 @ m8n8k32 (u4/s4)
"m8n8k32:a:u4" : 1,		"m8n8k32:a:u4" : 1,
"m8n8k32:a:s4" : 1,		"m8n8k32:a:s4" : 1,
"m8n8k128:b:b1" : 1,
"m8n8k32:b:u4" : 1,		"m8n8k32:b:u4" : 1,
"m8n8k32:b:s4" : 1,		"m8n8k32:b:s4" : 1,
"m8n8k128:c:s32" : 2,
"m8n8k128:d:s32" : 2,
"m8n8k32:c:s32" : 2,		"m8n8k32:c:s32" : 2,
"m8n8k32:d:s32" : 2,		"m8n8k32:d:s32" : 2,
}.get("%s:%s:%s" % (geom, frag, ptx_elt_type), None));
		"m16n8k32:a:u4" : 2,
		"m16n8k32:a:s4" : 2,
		"m16n8k32:b:u4" : 1,
		"m16n8k32:b:s4" : 1,
		"m16n8k32:c:s32" : 4,
		"m16n8k32:d:s32" : 4,

		"m16n8k64:a:u4" : 4,
		"m16n8k64:a:s4" : 4,
		"m16n8k64:b:u4" : 2,
		"m16n8k64:b:s4" : 2,
		"m16n8k64:c:s32" : 4,
		"m16n8k64:d:s32" : 4,

		# b1 -> s32 @ m8n8k128(b1)
		"m8n8k128:a:b1" : 1,
		"m8n8k128:b:b1" : 1,
		"m8n8k128:c:s32" : 2,
		"m8n8k128:d:s32" : 2,

		"m16n8k128:a:b1" : 2,
		"m16n8k128:b:b1" : 1,
		"m16n8k128:c:s32" : 4,
		"m16n8k128:d:s32" : 4,

		"m16n8k256:a:b1" : 4,
		"m16n8k256:b:b1" : 2,
		"m16n8k256:c:s32" : 4,
		"m16n8k256:d:s32" : 4,

		# bf16 -> s32 @ m16n16k16/m8n32k16/m32n8k16
		"m16n16k16:a:bf16" : 4,
		"m16n16k16:b:bf16" : 4,
		"m8n32k16:a:bf16" : 2,
		"m8n32k16:b:bf16" : 8,
		"m32n8k16:a:bf16" : 8,
		"m32n8k16:b:bf16" : 2,

		"m16n8k16:a:bf16" : 4,
		"m16n8k16:b:bf16" : 2,
		"m16n8k16:c:f32" : 4,
		"m16n8k16:d:f32" : 4,
		"m16n8k8:a:bf16" : 2,
		"m16n8k8:b:bf16" : 1,
		"m16n8k8:c:f32" : 4,
		"m16n8k8:d:f32" : 4,

		"m8n8k4:a:f64" : 1,
		"m8n8k4:b:f64" : 1,
		"m8n8k4:c:f64" : 2,
		"m8n8k4:d:f64" : 2,

		# tf32 -> s32 @ m16n16k8
		"m16n16k8:a:tf32" : 4,
		"m16n16k8:b:tf32" : 4,

		"m16n8k4:a:tf32" : 2,
		"m16n8k4:b:tf32" : 1,
		"m16n8k4:c:f32" : 4,
		"m16n8k4:d:f32" : 4,
		"m16n8k8:a:tf32" : 4,
		"m16n8k8:b:tf32" : 2,
		"m16n8k8:c:f32" : 4,
		"m16n8k8:d:f32" : 4,

		"m8n8k4:a:f16": 2,
		"m8n8k4:b:f16": 2,
		"m16n8k8:a:f16": 2,
		"m16n8k8:b:f16": 1,
		"m16n8k8:c:f16": 2,
		"m16n8k8:d:f16": 2,
		"m16n8k8:c:f32": 4,
		"m16n8k8:d:f32": 4,
		"m16n8k16:a:f16": 4,
		"m16n8k16:b:f16": 2,
		"m16n8k16:c:f16": 2,
		"m16n8k16:d:f16": 2,
		"m16n8k16:c:f32": 4,
		"m16n8k16:d:f32": 4,
		}.get("%s:%s:%s" % (geom, frag, ptx_elt_type), {
		# All other FP shape/fragment/type combinations have the same size
		"a:f16" : 8,
		"b:f16" : 8,
		"c:f16" : 4,
		"d:f16" : 4,
		"c:f32" : 8,
		"d:f32" : 8,
		}.get("%s:%s" % (frag, ptx_elt_type), None))
assert(self.nregs);		assert(self.nregs);

def __repr__(self):		def __repr__(self):
return "%s:%s:%s%s" % (self.geom, self.frag, self.mma_type,		return "%s:%s:%s%s" % (self.geom, self.frag, self.mma_type,
"" if self.nregs == 1 else ("*%d" % self.nregs))		"" if self.nregs == 1 else ("*%d" % self.nregs))

class MMAOp:		class MMAOp:
def __init__(self, a, b, c, d):		def __init__(self, a, b, c, d):
[... 15 unchanged lines not shown ...]	for type_b, type_d in product(types_b if types_b else [type_a],
MMAFrag(geom, "c", type_c),		MMAFrag(geom, "c", type_c),
MMAFrag(geom, "d", type_d)))		MMAFrag(geom, "d", type_d)))
return ops		return ops

def make_ldst_ops(geoms, frags, types):		def make_ldst_ops(geoms, frags, types):
return [MMAFrag(geom, frag, ptx_type) for (geom, frag, ptx_type)		return [MMAFrag(geom, frag, ptx_type) for (geom, frag, ptx_type)
in product(geoms, frags, types)]		in product(geoms, frags, types)]

def get_mma_ops():		def get_wmma_ops():
return (make_mma_ops(["m8n8k4"],		return (make_mma_ops(["m16n16k8"],
["f16"], [], ["f16", "f32"], ["f16", "f32"]) +		["tf32"], [], ["f32"], []) +
		make_mma_ops(["m16n16k16", "m32n8k16", "m8n32k16"],
		["bf16"], [], ["f32"], []) +
		make_mma_ops(["m8n8k4"],
		["f64"], [], ["f64"], []) +
make_mma_ops(["m16n16k16", "m32n8k16", "m8n32k16"],		make_mma_ops(["m16n16k16", "m32n8k16", "m8n32k16"],
["f16"], [], ["f16", "f32"], ["f16", "f32"]) +		["f16"], [], ["f16", "f32"], ["f16", "f32"]) +
make_mma_ops(["m16n16k16", "m32n8k16", "m8n32k16"],		make_mma_ops(["m16n16k16", "m32n8k16", "m8n32k16"],
["s8", "u8"], [], ["s32"], []) +		["s8", "u8"], [], ["s32"], []) +
make_mma_ops(["m8n8k32"],		make_mma_ops(["m8n8k32"],
["s4", "u4"], [], ["s32"], []) +		["s4", "u4"], [], ["s32"], []) +
make_mma_ops(["m8n8k128"],		make_mma_ops(["m8n8k128"],
["b1"], [], ["s32"], []))		["b1"], [], ["s32"], []))

		def get_mma_ops():
		return (make_mma_ops(["m8n8k4"],
		["f64"], [], ["f64"], []) +
		make_mma_ops(["m16n8k4", "m16n8k8"],
		["tf32"], [], ["f32"], []) +
		make_mma_ops(["m16n8k16", "m16n8k8"],
		["bf16"], [], ["f32"], []) +
		make_mma_ops(["m8n8k4", "m16n8k8", "m16n8k16"],
		["f16"], [], ["f16", "f32"], ["f16", "f32"]) +
		make_mma_ops(["m8n8k16", "m16n8k16", "m16n8k32"],
		["s8", "u8"], ["s8", "u8"], ["s32"], []) +
		make_mma_ops(["m8n8k32", "m16n8k32", "m16n8k64"],
		["s4", "u4"], ["s4", "u4"], ["s32"], []) +
		make_mma_ops(["m8n8k128", "m16n8k128", "m16n8k256"],
		["b1"], [], ["s32"], []))

def get_ldst_ops(kind):		def get_ldst_ops(kind):
ldst_ops = (make_ldst_ops(["m16n16k16", "m32n8k16", "m8n32k16"],		ldst_ops = (make_ldst_ops(["m16n16k16", "m32n8k16", "m8n32k16"],
["a", "b"], ["f16", "u8", "s8"]) +		["a", "b"], ["f16", "u8", "s8", "bf16"]) +
make_ldst_ops(["m16n16k16", "m32n8k16", "m8n32k16"],		make_ldst_ops(["m16n16k16", "m32n8k16", "m8n32k16"],
["c", "d"], ["f16", "f32", "s32"]) +		["c", "d"], ["f16", "f32", "s32"]) +
make_ldst_ops(["m8n8k32"], ["a", "b"], ["s4","u4"]) +		make_ldst_ops(["m8n8k32"], ["a", "b"], ["s4","u4"]) +
make_ldst_ops(["m8n8k128"], ["a", "b"], ["b1"]) +		make_ldst_ops(["m8n8k128"], ["a", "b"], ["b1"]) +
make_ldst_ops(["m8n8k32", "m8n8k128"], ["c", "d"], ["s32"]))		make_ldst_ops(["m8n8k32", "m8n8k128"], ["c", "d"], ["s32"]) +
		make_ldst_ops(["m8n8k4"], ["a", "b", "c", "d"], ["f64"]) +
		make_ldst_ops(["m16n16k8"], ["a", "b"], ["tf32"]) +
		make_ldst_ops(["m16n16k8"], ["c", "d"], ["f32"]))
return [ x for x in ldst_ops if (x.frag == "d") == (kind == "store")]		return [ x for x in ldst_ops if (x.frag == "d") == (kind == "store")]

def is_geom_supported(geom):		def is_wmma_geom_supported(geom):
# geometries for FP and ints.		# geometries for FP and ints.
if geom == "m8n8k4":
return ptx_version >= 64
if geom in ["m8n32k16", "m32n8k16"]:		if geom in ["m8n32k16", "m32n8k16"]:
return ptx_version >= 61		return ptx_version >= 61
# geometries for sub-ints.		# geometries for sub-ints.
if geom in ["m8n8k32", "m8n8k128"]:		if geom in ["m8n8k32", "m8n8k128"]:
return ptx_version >= 63 and gpu_arch >= 75		return ptx_version >= 63 and gpu_arch >= 75
if geom == "m16n16k16":		if geom == "m16n16k16":
return ptx_version >= 60		return ptx_version >= 60
		if geom == "m16n8k8":
		return ptx_version >= 65
		if geom in ["m16n16k8", "m8n8k4"]:
		return ptx_version >= 70
		assert(False) # Unexpected geometry.

		def is_mma_geom_supported(geom):
		# geometries for FP and ints.
		if geom == "m8n8k4":
		return ptx_version >= 64
		if geom in ["m16n8k8", "m8n8k16", "m8n8k32"]:
		return ptx_version >= 65
		if geom in ["m16n8k16", "m16n8k4", "m16n8k32", "m16n8k64", "m8n8k128",
		"m16n8k128", "m16n8k256"]:
		return ptx_version >= 70
assert(False) # Unexpected geometry.		assert(False) # Unexpected geometry.

def is_type_supported(ptx_type):		def is_type_supported(ptx_type):
if ptx_type in ["s8", "u8", "s32"]:		if ptx_type in ["s8", "u8", "s32"]:
return ptx_version >= 63 and gpu_arch >= 72		return ptx_version >= 63 and gpu_arch >= 72
if ptx_type in ["s4", "u4", "b1"]:		if ptx_type in ["s4", "u4", "b1"]:
return ptx_version >= 63 and gpu_arch >= 75		return ptx_version >= 63 and gpu_arch >= 75
		if ptx_type in ["bf16", "tf32", "f64"]:
		return ptx_version >= 70
return ptx_version >= 60 and gpu_arch >= 70		return ptx_version >= 60 and gpu_arch >= 70

		def is_wmma_variant_supported(op, layout_a, layout_b, rnd, satf):
def is_mma_variant_supported(op, layout_a, layout_b, satf):
if not (is_type_supported(op.a.mma_type.ptx_type)		if not (is_type_supported(op.a.mma_type.ptx_type)
and is_geom_supported(op.a.geom)):		and is_wmma_geom_supported(op.a.geom)):
		return False

		# rnd is only supported for FP64 WMMA
		if rnd and op.a.mma_type.ptx_type != "f64":
return False		return False
if op.a.geom == "m8n8k4":
if satf:		if satf:
		# satfinite for floating points was removed in PTX 6.5
		if op.a.mma_type.ptx_type == "f16" and ptx_version >= 65:
		return False
		if not op.a.mma_type.ptx_type in ["f16", "s8", "u8", "s4", "u4"]:
return False		return False
if op.c.mma_type.ptx_type == "f32":
# If C is f32, D must be, too.
return op.d.mma_type.ptx_type == "f32"

# sub-integer require row/col layout, and no satf.		# sub-integer require row/col layout.
if op.a.mma_type.ptx_type in ["s4", "u4", "b1"]:		if op.a.mma_type.ptx_type in ["s4", "u4", "b1"]:
if op.a.mma_type.ptx_type == "b1" and satf:		return layout_a == "row" and layout_b == "col"
		return True

		def is_mma_variant_supported(op, layout_a, layout_b, satf):
		if not (is_type_supported(op.a.mma_type.ptx_type)
		and is_mma_geom_supported(op.a.geom)):
return False		return False

		if satf and not op.a.mma_type.ptx_type in ["s8", "u8", "s4", "u4"]:
		return False

		# If the type of C is f32 then so must the type of D
		if (op.a.geom == "m8n8k4" and op.c.mma_type.ptx_type == "f32"
		and op.d.mma_type.ptx_type != "f32"):
		return False

		# A and B type must be the same. C and D type must be the same
		if (op.a.geom == "m16n8k8"
		and (op.a.mma_type.ptx_type != op.b.mma_type.ptx_type
		or op.c.mma_type.ptx_type != op.d.mma_type.ptx_type)):
		return False

		# C and D type must be the same
		if (op.a.geom == "m16n8k16"
		and op.c.mma_type.ptx_type != op.d.mma_type.ptx_type):
		return False

		# Require row/col layout for all MMA except m8n8k4 on FP16
		if not (op.a.geom == "m8n8k4" and op.a.mma_type.ptx_type == "f16"):
return layout_a == "row" and layout_b == "col"		return layout_a == "row" and layout_b == "col"
return True		return True

def is_ldst_variant_supported(frag, layout):		def is_ldst_variant_supported(frag, layout):
if not (is_type_supported(frag.mma_type.ptx_type)		if not (is_type_supported(frag.mma_type.ptx_type)
and is_geom_supported(frag.geom)):		and is_wmma_geom_supported(frag.geom)):
return False		return False
if frag.mma_type.ptx_type in ["s4", "u4", "b1"]:		if frag.mma_type.ptx_type in ["s4", "u4", "b1"]:
# sub-integer require sm_75 and ptx63, row/col layout for a/b.		# sub-integer require sm_75 and ptx63, row/col layout for a/b.
return ((frag.frag == "a" and layout == "row")		return ((frag.frag == "a" and layout == "row")
or (frag.frag == "b" and layout == "col")		or (frag.frag == "b" and layout == "col")
or frag.frag in ["c", "d"])		or frag.frag in ["c", "d"])
return True		return True

[... 164 unchanged lines not shown ...]	for frag, layout, space, stride in product(

print(Template(store_template).substitute(test_params))		print(Template(store_template).substitute(test_params))
generated_items.append((test_params["intrinsic"],		generated_items.append((test_params["intrinsic"],
test_params["instruction"]))		test_params["instruction"]))

return generated_items		return generated_items

def mma_signature(op):		def mma_signature(op):
if op.a.mma_type.ptx_type in ["s8", "u8", "s4", "u4", "b1"]:		if op.a.mma_type.ptx_type == "f16":
# int and sub-int ops are identified by input type.		# FP16 ops identified by accumulator & result type.
return op.a.mma_type.ptx_type
else:
# the rest are FP ops identified by accumulator & result type.
return "%s.%s" % (op.d.mma_type.ptx_type, op.c.mma_type.ptx_type)		return "%s.%s" % (op.d.mma_type.ptx_type, op.c.mma_type.ptx_type)
		elif op.a.mma_type.ptx_type != op.b.mma_type.ptx_type:
		# other ops are identified by input types.
		return "%s.%s" % (op.a.mma_type.ptx_type, op.b.mma_type.ptx_type)
		else:
		# if input types are the same, it only appears once.
		return op.a.mma_type.ptx_type

def mma_ptx_signature(op):		def mma_ptx_signature(op):
if op.a.mma_type.ptx_type in ["s8", "u8", "s4", "u4", "b1"]:		# Encode all four types as D.A.B.C
# int and sub-int instructions encode all four types as D.A.B.C
return ".".join(x.mma_type.ptx_type for x in (op.d, op.a, op.b, op.c))		return ".".join(x.mma_type.ptx_type for x in (op.d, op.a, op.b, op.c))
if op.a.geom == "m8n8k4":
return "%s.f16.f16.%s" % (op.d.mma_type.ptx_type, op.c.mma_type.ptx_type)		def wmma_signature(op):
		if op.a.mma_type.ptx_type == "f16":
		# FP16 ops identified by accumulator & result type.
		return "%s.%s" % (op.d.mma_type.ptx_type, op.c.mma_type.ptx_type)
else:		else:
# the rest are FP instructions use D.C		# other ops are identified by input type.
		return op.a.mma_type.ptx_type

		def wmma_ptx_signature(op):
		if op.a.mma_type.ptx_type == "f16":
		# FP16 instructions use D.C
return "%s.%s" % (op.d.mma_type.ptx_type, op.c.mma_type.ptx_type)		return "%s.%s" % (op.d.mma_type.ptx_type, op.c.mma_type.ptx_type)
		else:
		# other instructions encode all four types as D.A.B.C
		return ".".join(x.mma_type.ptx_type for x in (op.d, op.a, op.b, op.c))

def gen_wmma_mma_tests():		def common_mma_test_gen(params, op, intrinsic_template, instruction_template):
mma_template = """		mma_template = """
declare ${ret_ty} @${intrinsic}(		declare ${ret_ty} @${intrinsic}(
${args});		${args});

; CHECK-LABEL: .func {{.*}}test_${function}(		; CHECK-LABEL: .func {{.*}}test_${function}(
define ${ret_ty} @test_${function}(		define ${ret_ty} @test_${function}(
${args}) {		${args}) {
; CHECK: ${instruction}		; CHECK: ${instruction}
; CHECK-NEXT: ${check_d}		; CHECK-NEXT: ${check_d}
; CHECK-NEXT: ${check_a}		; CHECK-NEXT: ${check_a}
; CHECK-NEXT: ${check_b}		; CHECK-NEXT: ${check_b}
; CHECK-NEXT: ${check_c}		; CHECK-NEXT: ${check_c}
%r = call ${ret_ty} @${intrinsic}(		%r = call ${ret_ty} @${intrinsic}(
${args});		${args});
ret ${ret_ty} %r;		ret ${ret_ty} %r;
}		}
"""		"""
wmma_intrinsic_template = "llvm.nvvm.wmma.${geom}.mma.${alayout}.${blayout}.${intrinsic_signature}${satf}"
wmma_instruction_template = "wmma.mma${mma_variant}.sync${aligned}.${alayout}.${blayout}.${geom}.${ptx_signature}${satf}"		test_params = params
mma_intrinsic_template = "llvm.nvvm.mma.${geom}.${alayout}.${blayout}.${intrinsic_signature}"		test_params["intrinsic"] = Template(intrinsic_template).substitute(params)
mma_instruction_template = "mma.sync${aligned}.${geom}.${alayout}.${blayout}.${ptx_signature}"		test_params["function"] = test_params["intrinsic"].replace(".", "_")
		test_params["instruction"] = Template(instruction_template).substitute(params)
		test_params["ret_ty"] = make_wmma_ld_ret_ty(op.d)
		test_params["check_a"] = check_pattern(op.a)
		test_params["check_b"] = check_pattern(op.b)
		test_params["check_c"] = check_pattern(op.c)
		test_params["check_d"] = check_pattern(op.d)
		args = ",\n ".join(make_wmma_slice_args(frag)
		for frag in (op.a, op.b, op.c))
		test_params["args"] = args
		print(Template(mma_template).substitute(test_params))
		return (test_params["intrinsic"], test_params["instruction"])

		def gen_wmma_mma_tests():
		wmma_intrinsic_template = "llvm.nvvm.wmma.${geom}.mma.${alayout}.${blayout}${rnd}.${intrinsic_signature}${satf}"
		wmma_instruction_template = "wmma.mma${mma_variant}.sync${aligned}.${alayout}.${blayout}.${geom}${rnd}.${ptx_signature}${satf}"

		generated_items=[]

		for op, alayout, blayout, rnd, satf in product(
		get_wmma_ops(),
		["row","col"],
		["row","col"],
		[".rn", ".rz", ".rm", ".rp", ""],
		[".satfinite", ""]):

		if not is_wmma_variant_supported(op, alayout, blayout, rnd, satf):
		continue

		params = {
		"aligned" : ".aligned" if ptx_version >= 63 else "",
		"alayout" : alayout,
		"blayout" : blayout,
		"intrinsic_signature" : wmma_signature(op),
		"ptx_signature" : wmma_ptx_signature(op),
		"satf" : satf,
		"rnd" : rnd,
		"geom" : op.a.geom,
		"mma_variant" : ".xor.popc" if op.a.mma_type.ptx_type == "b1" else "",
		}

		intrinsic_template = wmma_intrinsic_template
		instruction_template = wmma_instruction_template

		generated_items.append(common_mma_test_gen(params, op,
		intrinsic_template, instruction_template))

		return generated_items

		def gen_mma_tests():
		mma_intrinsic_template = "llvm.nvvm.mma.${geom}.${alayout}.${blayout}${satf}.${intrinsic_signature}"
		mma_instruction_template = "mma.sync${aligned}.${geom}.${alayout}.${blayout}${satf}.${ptx_signature}${mma_variant}"

generated_items=[]		generated_items=[]

for op, alayout, blayout, satf in product(		for op, alayout, blayout, satf in product(
get_mma_ops(),		get_mma_ops(),
["row","col"],		["row","col"],
["row","col"],		["row","col"],
[".satfinite", ""]):		[".satfinite", ""]):

if not is_mma_variant_supported(op, alayout, blayout, satf):		if not is_mma_variant_supported(op, alayout, blayout, satf):
continue		continue

params = {		params = {
"aligned" : ".aligned" if ptx_version >= 63 else "",		"aligned" : ".aligned" if ptx_version >= 63 else "",
"alayout" : alayout,		"alayout" : alayout,
"blayout" : blayout,		"blayout" : blayout,
"intrinsic_signature" : mma_signature(op),		"intrinsic_signature" : mma_signature(op),
"ptx_signature" : mma_ptx_signature(op),		"ptx_signature" : mma_ptx_signature(op),
"satf" : satf,		"satf" : satf,
"geom" : op.a.geom,		"geom" : op.a.geom,
"mma_variant" : ".xor.popc" if op.a.mma_type.ptx_type == "b1" else "",		"mma_variant" : ".xor.popc" if op.a.mma_type.ptx_type == "b1" else "",
}		}

if op.a.geom == "m8n8k4":
intrinsic_template = mma_intrinsic_template		intrinsic_template = mma_intrinsic_template
instruction_template = mma_instruction_template		instruction_template = mma_instruction_template
else:
intrinsic_template = wmma_intrinsic_template
instruction_template = wmma_instruction_template

test_params = params		generated_items.append(common_mma_test_gen(params, op,
test_params["intrinsic"] = Template(intrinsic_template).substitute(params)		intrinsic_template, instruction_template))
test_params["function"] = test_params["intrinsic"].replace(".", "_")
test_params["instruction"] = Template(instruction_template).substitute(params)
test_params["ret_ty"] = make_wmma_ld_ret_ty(op.d)
test_params["check_a"] = check_pattern(op.a)
test_params["check_b"] = check_pattern(op.b)
test_params["check_c"] = check_pattern(op.c)
test_params["check_d"] = check_pattern(op.d)
args = ",\n ".join(make_wmma_slice_args(frag)
for frag in (op.a, op.b, op.c))
test_params["args"] = args
print(Template(mma_template).substitute(test_params))
generated_items.append((test_params["intrinsic"],
test_params["instruction"]))

return generated_items		return generated_items

# Append complete list of intrinsics and instructions we've generated tests for.		# Append complete list of intrinsics and instructions we've generated tests for.
# Generate a set of checks to verify that we did generate a sensible set of		# Generate a set of checks to verify that we did generate a sensible set of
# tests for the given combination of PTX and SM variants.		# tests for the given combination of PTX and SM variants.
#		#
def gen_check_unsupported_ops(items):		def gen_check_unsupported_ops(items):
print("; Complete list of intrinsics supported by PTX%d on sm_%d"		print("; Complete list of intrinsics supported by PTX%d on sm_%d"
% (ptx_version, gpu_arch))		% (ptx_version, gpu_arch))
print("; INTRINSICS: {{^; INTRINSICS_LIST_BEGIN}}")		print("; INTRINSICS: {{^; INTRINSICS_LIST_BEGIN}}")
print("""		print("""

; NOEXTGEOM-NOT: {{m8n32\|m32n8}}		; NOEXTGEOM-NOT: {{m8n32\|m32n8}}
; NOINT-NOT: .{{s32\|s8}}		; NOINT-NOT: .{{s32\|s8}}
; NOSUBINT-NOT: {{s4\|u4\|b1}}		; NOSUBINT-NOT: {{s4\|u4\|b1}}
; NOMMA-NOT: .m8n8k4.		; NOMMA-NOT: .m8n8k4.
		; NOALTFLOAT-NOT: .{{bf16\|tf32}}
		; NODOUBLE-NOT: .f64

; M16N16-DAG: m16n16k16.load.{{[ab].*}}.f16.p		; M16N16-DAG: m16n16k16.load.{{[ab].*}}.f16.p
; M16N16-DAG: m16n16k16.{{load\|store}}.{{[cd].*\.(f16\|f32)}}.p		; M16N16-DAG: m16n16k16.{{load\|store}}.{{[cd].*\.(f16\|f32)}}.p
; M16N16-DAG: m16n16k16.mma.{{.*}}.f16.f32		; M16N16-DAG: m16n16k16.mma.{{.*}}.f16.f32
; M16N16-DAG: m16n16k16.mma.{{.*}}.f32.f16		; M16N16-DAG: m16n16k16.mma.{{.*}}.f32.f16
; M16N16-DAG: m16n16k16.mma.{{.*}}.f16.f16		; M16N16-DAG: m16n16k16.mma.{{.*}}.f16.f16
; M16N16-DAG: m16n16k16.mma.{{.*}}.f32.f32		; M16N16-DAG: m16n16k16.mma.{{.*}}.f32.f32

; SUBINT-DAG: m8n8k32.load.{{[ab].*}}.s4.p		; SUBINT-DAG: m8n8k32.load.{{[ab].*}}.s4.p
; SUBINT-DAG: m8n8k32.load.{{[ab].*}}.u4.p		; SUBINT-DAG: m8n8k32.load.{{[ab].*}}.u4.p
; SUBINT-DAG: m8n8k128.{{load\|store}}.{{[cd].*\.s32}}.p		; SUBINT-DAG: m8n8k128.{{load\|store}}.{{[cd].*\.s32}}.p
; SUBINT-DAG: m8n8k32.{{load\|store}}.{{[cd].*\.s32}}.p		; SUBINT-DAG: m8n8k32.{{load\|store}}.{{[cd].*\.s32}}.p
; SUBINT-DAG: m8n8k32.mma.{{.*}}.u4		; SUBINT-DAG: m8n8k32.mma.{{.*}}.u4
; SUBINT-DAG: m8n8k32.mma.{{.*}}.s4		; SUBINT-DAG: m8n8k32.mma.{{.*}}.s4
; SUBINT-DAG: m8n8k128.mma.{{.*}}.b1		; SUBINT-DAG: m8n8k128.mma.{{.*}}.b1

		; ALTFLOAT-DAG: m16n16k16.load.{{[ab].*}}.bf16.p
		; ALTFLOAT-DAG: m8n32k16.load.{{[ab].*}}.bf16.p
		; ALTFLOAT-DAG: m32n8k16.load.{{[ab].*}}.bf16.p
		; ALTFLOAT-DAG: m16n16k8.load.{{[ab].*}}.tf32.p
		; ALTFLOAT-DAG: m16n16k16.mma.{{.*}}.bf16
		; ALTFLOAT-DAG: m8n32k16.mma.{{.*}}.bf16
		; ALTFLOAT-DAG: m32n8k16.mma.{{.*}}.bf16
		; ALTFLOAT-DAG: m16n16k8.mma.{{.*}}.tf32

		; DOUBLE-DAG: m8n8k4.load.{{[abc].*}}.f64.p
		; DOUBLE-DAG: m8n8k4.store.d.{{.*}}.f64.p
		; DOUBLE-DAG: m8n8k4.mma.{{.*}}.f64

; MMA-DAG: mma.m8n8k4.{{.*}}.f16.f32		; MMA-DAG: mma.m8n8k4.{{.*}}.f16.f32
; MMA-DAG: mma.m8n8k4.{{.*}}.f32.f16		; MMA-DAG: mma.m8n8k4.{{.*}}.f32.f16
; MMA-DAG: mma.m8n8k4.{{.*}}.f16.f16		; MMA-DAG: mma.m8n8k4.{{.*}}.f16.f16
; MMA-DAG: mma.m8n8k4.{{.*}}.f32.f32		; MMA-DAG: mma.m8n8k4.{{.*}}.f32.f32

		; PTX65MMA-DAG: mma.m16n8k8.row.col.f16.f16
		; PTX65MMA-DAG: mma.m16n8k8.row.col.f32.f32
		; PTX65MMA-DAG: mma.m8n8k16.row.col{{.*}}.u8.u8
		; PTX65MMA-DAG: mma.m8n8k16.row.col{{.*}}.s8.s8
		; PTX65MMA-DAG: mma.m8n8k16.row.col{{.*}}.s8.u8
		; PTX65MMA-DAG: mma.m8n8k16.row.col{{.*}}.u8.s8
		; PTX65MMA-DAG: mma.m8n8k32.row.col{{.*}}.u4.u4
		; PTX65MMA-DAG: mma.m8n8k32.row.col{{.*}}.s4.s4
		; PTX65MMA-DAG: mma.m8n8k32.row.col{{.*}}.s4.u4
		; PTX65MMA-DAG: mma.m8n8k32.row.col{{.*}}.u4.s4

		; PTX70MMA-DAG: mma.m8n8k4.row.col.f64
		; PTX70MMA-DAG: mma.m16n8k4.row.col.tf32
		; PTX70MMA-DAG: mma.m16n8k8.row.col.tf32
		; PTX70MMA-DAG: mma.m16n8k16.row.col.bf16
		; PTX70MMA-DAG: mma.m16n8k8.row.col.bf16
		; PTX70MMA-DAG: mma.m16n8k16.row.col.f16.f16
		; PTX70MMA-DAG: mma.m16n8k16.row.col.f32.f32
		; PTX70MMA-DAG: mma.m16n8k16.row.col{{.*}}.u8.u8
		; PTX70MMA-DAG: mma.m16n8k16.row.col{{.*}}.s8.s8
		; PTX70MMA-DAG: mma.m16n8k16.row.col{{.*}}.s8.u8
		; PTX70MMA-DAG: mma.m16n8k16.row.col{{.*}}.u8.s8
		; PTX70MMA-DAG: mma.m16n8k32.row.col{{.*}}.u8.u8
		; PTX70MMA-DAG: mma.m16n8k32.row.col{{.*}}.s8.s8
		; PTX70MMA-DAG: mma.m16n8k32.row.col{{.*}}.s8.u8
		; PTX70MMA-DAG: mma.m16n8k32.row.col{{.*}}.u8.s8
		; PTX70MMA-DAG: mma.m16n8k32.row.col{{.*}}.u4.u4
		; PTX70MMA-DAG: mma.m16n8k32.row.col{{.*}}.s4.s4
		; PTX70MMA-DAG: mma.m16n8k32.row.col{{.*}}.s4.u4
		; PTX70MMA-DAG: mma.m16n8k32.row.col{{.*}}.u4.s4
		; PTX70MMA-DAG: mma.m16n8k64.row.col{{.*}}.u4.u4
		; PTX70MMA-DAG: mma.m16n8k64.row.col{{.*}}.s4.s4
		; PTX70MMA-DAG: mma.m16n8k64.row.col{{.*}}.s4.u4
		; PTX70MMA-DAG: mma.m16n8k64.row.col{{.*}}.u4.s4
		; PTX70MMA-DAG: mma.m8n8k128.row.col.b1
		; PTX70MMA-DAG: mma.m16n8k128.row.col.b1
		; PTX70MMA-DAG: mma.m16n8k256.row.col.b1
;		;

""")		""")

print("; INTRINSICS_LIST_BEGIN")		print("; INTRINSICS_LIST_BEGIN")
for intrinsic, instruction in sorted(items):		for intrinsic, instruction in sorted(items):
print("; ", intrinsic, " -> ", instruction,"")		print("; ", intrinsic, " -> ", instruction,"")
print("; INTRINSICS_LIST_END")		print("; INTRINSICS_LIST_END")
print("; INTRINSICS: ; INTRINSICS_LIST_END")		print("; INTRINSICS: ; INTRINSICS_LIST_END")

def gen_tests():		def gen_tests():
items = gen_wmma_load_tests()		items = gen_wmma_load_tests()
items += gen_wmma_store_tests()		items += gen_wmma_store_tests()
items += gen_wmma_mma_tests()		items += gen_wmma_mma_tests()
		items += gen_mma_tests()
gen_check_unsupported_ops(items)		gen_check_unsupported_ops(items)

parser = argparse.ArgumentParser()		parser = argparse.ArgumentParser()
parser.add_argument("--ptx", type=int, default=60)		parser.add_argument("--ptx", type=int, default=60)
parser.add_argument("--gpu-arch", type=int, default=70)		parser.add_argument("--gpu-arch", type=int, default=70)
args = parser.parse_args()		args = parser.parse_args()
ptx_version = args.ptx		ptx_version = args.ptx
gpu_arch = args.gpu_arch		gpu_arch = args.gpu_arch

gen_tests()		gen_tests()
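
The name expansion done by gen_mma_tests() and common_mma_test_gen() can be sketched in isolation as follows. This is a minimal illustration, not the script itself: the two template strings are copied from the right-hand side of the diff, while `expand`, `enumerate_variants`, `hypothetical_is_supported`, and the fixed `geom` and signature values are assumptions made for the example (the real script derives the signatures from the op and filters with is_mma_variant_supported).

```python
from itertools import product
from string import Template

# Templates copied from gen_mma_tests() in the diff above.
MMA_INTRINSIC = "llvm.nvvm.mma.${geom}.${alayout}.${blayout}${satf}.${intrinsic_signature}"
MMA_INSTRUCTION = "mma.sync${aligned}.${geom}.${alayout}.${blayout}${satf}.${ptx_signature}${mma_variant}"


def expand(params):
    """Return (intrinsic, instruction, test function name) for one parameter set."""
    intrinsic = Template(MMA_INTRINSIC).substitute(params)
    instruction = Template(MMA_INSTRUCTION).substitute(params)
    # The test function name is the intrinsic name with dots replaced by underscores.
    return intrinsic, instruction, intrinsic.replace(".", "_")


def hypothetical_is_supported(alayout, blayout, satf):
    # Assumption for illustration only: accept row/col without saturation.
    return (alayout, blayout, satf) == ("row", "col", "")


def enumerate_variants(ptx_version=70):
    """Walk the layout/satf cross product and expand every supported variant."""
    items = []
    for alayout, blayout, satf in product(
            ["row", "col"], ["row", "col"], [".satfinite", ""]):
        if not hypothetical_is_supported(alayout, blayout, satf):
            continue
        params = {
            "aligned": ".aligned" if ptx_version >= 63 else "",
            "alayout": alayout,
            "blayout": blayout,
            "satf": satf,
            "geom": "m8n8k4",
            # Placeholder signatures; the real script computes them per op.
            "intrinsic_signature": "f64",
            "ptx_signature": "f64.f64.f64.f64",
            "mma_variant": "",
        }
        items.append(expand(params))
    return items


if __name__ == "__main__":
    for intrinsic, instruction, function in enumerate_variants():
        print(";", intrinsic, "->", instruction, "(test_%s)" % function)
```

With these placeholder values the single surviving variant has an intrinsic name containing `mma.m8n8k4.row.col.f64`, which is the kind of substring the PTX70MMA-DAG checks above look for in the generated intrinsics list.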

Revision Contents

Diff 355390

clang/include/clang/Basic/BuiltinsNVPTX.def

clang/lib/CodeGen/CGBuiltin.cpp

clang/test/CodeGen/builtins-nvptx-mma.cu

clang/test/CodeGen/builtins-nvptx-mma.py

llvm/include/llvm/IR/IntrinsicsNVVM.td

llvm/lib/Target/NVPTX/NVPTXISelLowering.cpp

llvm/lib/Target/NVPTX/NVPTXInstrInfo.td

llvm/lib/Target/NVPTX/NVPTXIntrinsics.td

llvm/test/CodeGen/NVPTX/lit.local.cfg

llvm/test/CodeGen/NVPTX/wmma.py
