This is an archive of the discontinued LLVM Phabricator instance.


Differential D95331

[MLIR][GPU][NVVM] Add conversion of warp synchronous matrix-multiply accumulate GPU ops
ClosedPublic

Authored by navdeepkk on Jan 24 2021, 11:17 PM.


Details

Reviewers

bondhugula
herhut
ftynse

Commits

rGeaaf7a6a09da: [MLIR][GPU][NVVM] Add conversion of warp synchronous matrix-multiply accumulate…

Summary

[MLIR][GPU][NVVM] Add conversion of warp synchronous matrix-multiply
accumulate GPU ops
Add conversion of warp synchronous matrix-multiply accumulate GPU ops to
NVVM ops. The following conversions are added :-

1.) subgroup_mma_load_matrix -> wmma.m16n16k16.load.[a,b,c]..row.stride
2.) subgroup_mma_store_matrix -> wmma.m16n16k16.store.d.[f16,f32].row.stride
3.) subgroup_mma_compute -> wmma.m16n16k16.mma.row.row.[f16,f32].[f16,f32]

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

navdeepkk created this revision.Jan 24 2021, 11:17 PM

Herald added subscribers: teijeong, rdzhabarov, tatianashp and 16 others. · View Herald TranscriptJan 24 2021, 11:17 PM

navdeepkk requested review of this revision.Jan 24 2021, 11:17 PM

Herald added a reviewer: herhut. · View Herald TranscriptJan 24 2021, 11:17 PM

Herald added a project: Restricted Project. · View Herald Transcript

Herald added subscribers: stephenneuendorffer, nicolasvasilache. · View Herald Transcript

navdeepkk added a parent revision: D95330: [MLIR][GPU][NVVM] Add warp synchronous matrix-multiply accumulate ops.Jan 24 2021, 11:18 PM

navdeepkk added a child revision: D95332: [MLIR][CUDA-RUNNER] Add CL options to pass SM version and index-bitwidth.Jan 24 2021, 11:22 PM

Harbormaster completed remote builds in B86507: Diff 318906.Jan 24 2021, 11:32 PM

ftynse added a reviewer: ftynse.Jan 25 2021, 5:28 AM

This needs to be rebased on master; I removed LLVMType several weeks ago.

ftynse requested changes to this revision.Jan 25 2021, 5:28 AM

This revision now requires changes to proceed.Jan 25 2021, 5:28 AM

Changes in this diff :-

1.) Rebase on master to drop the use of LLVMType.
2.) Make changes to use the !gpu.mmafragment type introduced in parent
  revision D95330.

Harbormaster completed remote builds in B87841: Diff 321330.Feb 4 2021, 12:52 AM

I haven't done a detailed review since this commit may change if its parent changes.

mlir/lib/Conversion/GPUToNVVM/CommonTypes.h
1 ↗	(On Diff #321330)	Headers need to have `-*- C++ -*-` at the end of the first line.
mlir/lib/Conversion/GPUToNVVM/WmmaLoadOptoNvvmLowering.h
13 ↗	(On Diff #321330)	I don't see the point of putting _implementations_ of conversion patterns into separate _header_ files.
mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir
16–37	I don't think we care if these operations are on the next line or not. Furthermore, they seem to be testing the conversion of allocations, which isn't in any way relevant to what this test is about. Such tests lead to excessive churn and spurious breakages. Please only test what the new code does.

ftynse requested changes to this revision.Feb 9 2021, 6:22 AM

This revision now requires changes to proceed.Feb 9 2021, 6:22 AM

Changes in this diff :-

1.) Address comments on diff 321330.

Harbormaster completed remote builds in B89559: Diff 324328.Feb 17 2021, 10:02 AM

Changes in this diff :-

1.) Rebase on upstream/main.
2.) Make changes to operate with the newly introduced gpu.mma_matrix type.

Herald added subscribers: dcaballe, cota. · View Herald TranscriptMay 2 2021, 1:24 PM

Harbormaster completed remote builds in B102196: Diff 342267.May 2 2021, 2:07 PM

bondhugula added inline comments.May 2 2021, 5:54 PM

mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp
146	C++ style comment here. Terminate with full stop.
mlir/lib/Conversion/GPUToNVVM/WmmaLoadStoreToNvvmLowering.h
31 ↗	(On Diff #342267)	Cannot -> cannot
36 ↗	(On Diff #342267)	Typo: implements
64 ↗	(On Diff #342267)	Expected -> expected
73 ↗	(On Diff #342267)	Remove commented out code please.
302–307 ↗	(On Diff #342267)	Nit: In such cases, consider using braces for the `then` and `else` blocks.
mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir
42–57	Nit: Leave a space after `//`, i.e., `//CHECK` -> `// CHECK`

bondhugula added inline comments.May 2 2021, 5:59 PM

mlir/lib/Conversion/GPUToNVVM/CommonTypes.h
25 ↗	(On Diff #342267)	Many of these are now built-in MLIR types. `common LLVM` -> `common LLVM and builtin MLIR`
62–63 ↗	(On Diff #342267)	The context here isn't clear as to operands to which op. Mention about mma/wmma when referring to fragments?
74–75 ↗	(On Diff #342267)	Doc comments for these two.
mlir/lib/Conversion/GPUToNVVM/WmmaLoadStoreToNvvmLowering.h
119 ↗	(On Diff #342267)	Drop commented out code.

navdeepkk edited the summary of this revision. (Show Details)May 5 2021, 9:54 PM

Changes in this diff :-

1.) Address comments on previous diff(342267).

Harbormaster completed remote builds in B102911: Diff 343284.May 5 2021, 10:47 PM

bondhugula added inline comments.May 6 2021, 7:48 AM

mlir/lib/Conversion/GPUToNVVM/CommonTypes.h
15 ↗	(On Diff #343284)	clang-tidy warning if this is a meaningful one.
mlir/lib/Conversion/GPUToNVVM/WmmaLoadStoreToNvvmLowering.h
82–90 ↗	(On Diff #343284)	All of this code in header files instead of a source file? Is this an exception for some reason?

ftynse requested changes to this revision.May 6 2021, 8:15 AM

ftynse added inline comments.

mlir/lib/Conversion/GPUToNVVM/WmmaLoadStoreToNvvmLowering.h
25 ↗	(On Diff #343284)	Add a documentation comment please.
47 ↗	(On Diff #343284)	How about having this class (privately) derive `CommonLLVMAndBuiltInMLIRTypes` instead of containing it? This will let it use the fields directly without the annoying prefix. Also, I'm not entirely certain why this is even needed. Most patterns don't need all of the types, yet this will create the types inside each pattern, taking locks in the context. Yes, we won't need to do it every time the pattern is applied, but it feels like the number of spuriously created types compensates this. Another alternative would be to have a single instance of `CommonLLVMAndBuiltInMLIRTypes` and pass a reference to this instance to each pattern.
64 ↗	(On Diff #343284)	Typo: meref -> memref ?
75 ↗	(On Diff #343284)	Please expand auto here.
75 ↗	(On Diff #343284)	Also, I suppose you need promoteOneMemRefDescriptor instead. Then you can wrap the result into a MemRefDescriptor class and get much more readable code below.
76 ↗	(On Diff #343284)	Could this use OpAdaptor to get named operand accessors instead of magic constants?
82–90 ↗	(On Diff #343284)	I made the same comment on the first version of this patch, there is no point for this code to be in a header, even if the header lives under lib/.
93 ↗	(On Diff #343284)	Please avoid magic constants.
186 ↗	(On Diff #343284)	Same comments as above in this class.
302–307 ↗	(On Diff #342267)	This doesn't look done, did you forget to upload?
mlir/lib/Conversion/GPUToNVVM/WmmaMmaOptoNvvmLowering.h
35–44 ↗	(On Diff #343284)	A similar function is present in another file this patch is adding. Consider refactoring to only have one definition.
60 ↗	(On Diff #343284)	SmallVector now has a default number of stack elements. Drop the manually specified number unless there is a specific reason to choose the value. Here, I don't see why 24 is special.
87 ↗	(On Diff #343284)	Please use OpAdaptors here and below to avoid `operands[magic-constant]`
mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir
16–37	I still see a lot of CHECK-NEXT, and I am not convinced that we should care about these operations being on subsequent lines. If tomorrow we decide to change the syntax of `llvm.extractvalue` to print the type on a separate line, these tests would break for no good reason.
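
To illustrate the OpAdaptor suggestion made in the inline comments above, here is a minimal standalone sketch. It does not use MLIR; `Value` and `LoadMatrixAdaptor` are hypothetical stand-ins written by hand, whereas real MLIR adaptors are generated from the op definition. The point is only that named accessors replace `operands[0]`, `operands[1]` and similar magic constants.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an MLIR Value; a string keeps the sketch small.
using Value = std::string;

// Hypothetical adaptor for a load-like op whose operands are laid out as
// [srcMemref, index0, index1, ...]. Hand-written for illustration only.
class LoadMatrixAdaptor {
public:
  explicit LoadMatrixAdaptor(std::vector<Value> operands)
      : operands(std::move(operands)) {
    assert(!this->operands.empty() && "expected at least the source memref");
  }

  // Named accessor that replaces `operands[0]` at the call site.
  const Value &srcMemref() const { return operands[0]; }

  // Named accessor that replaces `operands[1]`, `operands[2]`, ...
  std::vector<Value> indices() const {
    return std::vector<Value>(operands.begin() + 1, operands.end());
  }

private:
  std::vector<Value> operands;
};
```

With such an adaptor the operand layout is written down once, so a later change to the operand order only touches the accessors and not every pattern that reads the operands.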

This revision now requires changes to proceed.May 6 2021, 8:15 AM

Changes in this diff :-

1.)  Address comments on previous diff(343284).

Herald added a subscriber: mgorny. · View Herald TranscriptMay 14 2021, 12:13 AM

Harbormaster completed remote builds in B104439: Diff 345358.May 14 2021, 12:48 AM

Changes in this diff :-

1.) Fix formatting in WmmaOpsToNvvmLowering.cpp.

Harbormaster completed remote builds in B104452: Diff 345382.May 14 2021, 3:34 AM

bondhugula accepted this revision.May 16 2021, 12:00 AM

This is okay with me except the splitting between files. I really don't understand the motivation behind adding header files to lib/. It is justified if there are some declarations or template private implementations that must be shared between several .cpp files and not visible to the external users, but here it is not the case.

This can be organized as follows:

the file mlir/include/mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h contains a populateGPUWMMAPAtterns(...) next to the populateGPUToNVVM that it already has;
there's one file, mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNVVM.cpp (also, drop the "lowering" from the name while we are here), that contains everything from CommonTypes.h, and the current header/source pair, all in an anonymous namespace to avoid exporting the names and slowing down the linker;
whoever needs these patterns just includes the header and calls the populate function.

This is simple to navigate, reason about and is the pattern that all other conversions adopt.

mlir/lib/Conversion/GPUToNVVM/CommonTypes.h
1 ↗	(On Diff #321330)	Looks like this comment wasn't addressed..
mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp
146	Doesn't look addressed. C++-style comments are line comments starting with `//`.
mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvmLowering.cpp
31 ↗	(On Diff #345382)	Nit: add braces here, this isn't a trivial condition anymore.
50 ↗	(On Diff #345382)	I would have just used `unsigned` here.
140–142 ↗	(On Diff #345382)	Nit: can you put this comment inside `return rewriter.notifyMatchFailure("")` instead? Here and below.

This revision now requires changes to proceed.May 20 2021, 1:53 AM

In D95331#2770496, @ftynse wrote:

This is okay with me except the splitting between files. I really don't understand the motivation behind adding header files to lib/. It is justified if there are some declarations or template private implementations that must be shared between several .cpp files and not visible to the external users, but here it is not the case.

This can be organized as follows:

the file mlir/include/mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h contains a populateGPUWMMAPAtterns(...) next to the populateGPUToNVVM that it already has;

there's one file, mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNVVM.cpp (also, drop the "lowering" from the name while we are here), that contains everything from CommonTypes.h, and the current header/source pair, all in an anonymous namespace to avoid exporting the names and slowing down the linker;

whoever needs these patterns just includes the header and calls the populate function.

This is simple to navigate, reason about and is the pattern that all other conversions adopt.

Hi @ftynse,
Thanks for the insightful comments. I have a question regarding this new structure. Once I put all the code in WmmaOpsToNVVM.cpp the lowering patterns are no longer exposed. So I cannot directly use them in LowerGpuOpsToNVVMOps.cpp where they are actually added into RewritePatternSet. The way I could find is to include WmmaOpsToNVVm.cpp in LowerGpuOpsToNVVMOps.cpp. Is that okay to do in this context? Or can we just expose the lowerings in a header file?

mlir/lib/Conversion/GPUToNVVM/CommonTypes.h
15 ↗	(On Diff #343284)	The style given here is already followed: https://llvm.org/docs/CodingStandards.html#header-guard
mlir/lib/Conversion/GPUToNVVM/WmmaLoadOptoNvvmLowering.h
13 ↗	(On Diff #321330)	We may add more versions of NVVM ops in the future, so each lowering file may grow large. To save hassle later, I kept them in separate files. Let me know if you want me to merge them.
mlir/lib/Conversion/GPUToNVVM/WmmaLoadStoreToNvvmLowering.h
47 ↗	(On Diff #343284)	Took the first approach and removed all the spurious types present. Now the types are minimal and the overheads would be less. Let me know what you think.
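
The two alternatives discussed for `CommonLLVMAndBuiltInMLIRTypes` (private derivation, or one shared instance passed by reference) can be sketched in plain C++ as follows. This is an illustration under simplifying assumptions, not the patch's code: `CommonTypes`, `LoadLowering` and `StoreLowering` are hypothetical names, and strings stand in for MLIR types.

```cpp
#include <string>

// Hypothetical bundle of commonly used types; strings stand in for MLIR types.
struct CommonTypes {
  std::string i32Ty = "i32";
  std::string f16Ty = "f16";
};

// Option 1: privately derive from the bundle, so its members can be used
// without a prefix. Each pattern instance constructs its own copy.
struct LoadLowering : private CommonTypes {
  std::string indexType() const { return i32Ty; }
};

// Option 2: build a single bundle once and hand every pattern a reference.
// The bundle must outlive all patterns that refer to it.
struct StoreLowering {
  explicit StoreLowering(const CommonTypes &types) : types(types) {}
  std::string indexType() const { return types.i32Ty; }

private:
  const CommonTypes &types;
};
```

Option 1 keeps call sites short at the cost of one bundle per pattern; option 2 constructs the types once but adds a lifetime requirement on the shared instance.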

In D95331#2773383, @navdeepkk wrote:

Hi @ftynse,
Thanks for the insightful comments. I have a question regarding this new structure. Once I put all the code in WmmaOpsToNVVM.cpp the lowering patterns are no longer exposed. So I cannot directly use them in LowerGpuOpsToNVVMOps.cpp where they are actually added into RewritePatternSet.

You shouldn't even need to use them directly. Just declare a function populateWmmaToNVVMPatterns(RewritePatternSet &set) in GPUToNVVMPass.h, put its declaration in the header and implementation in WmmaOpsToNVVM.cpp. Then call this function from LowerGpuOpsToNVVMOps.cpp. You understand how function declaration works in C++, right? It's enough for the caller to see the declaration (not definition) to be able to call the function.

The way I could find is to include WmmaOpsToNVVm.cpp in LowerGpuOpsToNVVMOps.cpp. Is that okay to do in this context?

Absolutely not! It's almost never okay to include a .cpp into another .cpp.

Or can we just expose the lowerings in a header file?

You don't need to expose the classes, the function that adds instances of these classes into the set will amply suffice, as I mentioned repeatedly. If you look around a bit in the code base, LowerGpuOpsToNVVMOps.cpp defines the pattern classes in an anonymous namespace and they are not at all visible in the headers. The users of this lowering just call populateGpuToNVVMConversionPatterns and never ever need to see the classes. I see no reason why you can't just do the same.
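
The structure described here (a populate function declared in the header, pattern classes hidden in an anonymous namespace in one .cpp file) can be sketched in a single translation unit. All names below (`Pattern`, `PatternSet`, `populateWmmaPatterns` and the three pattern classes) are hypothetical stand-ins and are not the MLIR API; the comments mark which part would live in which file.

```cpp
#include <memory>
#include <string>
#include <vector>

// ---- Would live in the public header: declarations only. ----
struct Pattern {
  virtual ~Pattern() = default;
  virtual std::string name() const = 0;
};
using PatternSet = std::vector<std::unique_ptr<Pattern>>;

// Callers only need to see this declaration to call the function.
void populateWmmaPatterns(PatternSet &patterns);

// ---- Would live in the single .cpp file. ----
namespace {
// The pattern classes have internal linkage and are invisible to other files.
struct LoadPattern : Pattern {
  std::string name() const override { return "load"; }
};
struct StorePattern : Pattern {
  std::string name() const override { return "store"; }
};
struct ComputePattern : Pattern {
  std::string name() const override { return "compute"; }
};
} // namespace

// The definition is the only place that names the concrete classes.
void populateWmmaPatterns(PatternSet &patterns) {
  patterns.push_back(std::make_unique<LoadPattern>());
  patterns.push_back(std::make_unique<StorePattern>());
  patterns.push_back(std::make_unique<ComputePattern>());
}
```

A caller in another file would include the header, create a `PatternSet`, and call `populateWmmaPatterns(set)` without ever seeing the class definitions.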

Changes in this diff :-

1.) Address comments on diff 345382.

Changes in this diff :-

1.) Fix spelling in file title.

Thanks!

This revision is now accepted and ready to land.May 21 2021, 6:02 AM

In D95331#2773568, @ftynse wrote:

Thanks!

Thanks for the long series of great insightful comments. And apologies for taking too long in the re-structuring part.

bondhugula accepted this revision.May 21 2021, 6:25 AM

Harbormaster completed remote builds in B105611: Diff 347002.May 21 2021, 6:59 AM

In D95331#2773585, @navdeepkk wrote:

In D95331#2773568, @ftynse wrote:

Thanks!

Thanks for the long series of great insightful comments. And apologies for taking too long in the re-structuring part.

There appears to be something weird with this patch - doing an arc patch D95331 yields an exception and then shows zero changes. Looks like an arc bug:

...
Applied patch mlir/include/mlir/Target/LLVMIR/ModuleTranslation.h cleanly.
Applied patch mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td cleanly.
Applied patch mlir/include/mlir/Dialect/GPU/GPUOps.td cleanly.
Applied patch mlir/include/mlir/Dialect/GPU/GPUDialect.h cleanly.
Applied patch mlir/include/mlir/Dialect/GPU/GPUBase.td cleanly.
 COMMITTED  Successfully committed patch.

 Cherry Pick Failed!
 Exception 
Command failed with error #1!
COMMAND
git cherry-pick -- arcpatch-D95330

STDOUT
Auto-merging mlir/test/Dialect/LLVMIR/invalid.mlir
Auto-merging mlir/lib/Target/LLVMIR/ModuleTranslation.cpp
Auto-merging mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
Auto-merging mlir/include/mlir/Target/LLVMIR/ModuleTranslation.h
Auto-merging mlir/include/mlir/Dialect/GPU/GPUOps.td
On branch arcpatch-D95331_1
You are currently cherry-picking commit 372dcf47bd93.
  (all conflicts fixed: run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	mlir/0001-MLIR-Update-affine.for-unroll-for-iter_args-support.patch
	mlir/compile_commands.json
	mlir/tags
	mlir/tf.mlir

nothing added to commit but untracked files present (use "git add" to track)


STDERR
The previous cherry-pick is now empty, possibly due to conflict resolution.
If you wish to commit it anyway, use:

    git commit --allow-empty

Otherwise, please use 'git cherry-pick --skip'

(Run with `--trace` for a full exception trace.)

Using --trace shows:

Otherwise, please use 'git cherry-pick --skip'
 at [<arcanist>/src/future/exec/ExecFuture.php:421]
arcanist(head=master, ref.master=239ad5c55d8d)
  #0 ExecFuture::raiseResultError(array) called at [<arcanist>/src/future/exec/ExecFuture.php:325]
  #1 ExecFuture::resolvex() called at [<arcanist>/src/repository/api/ArcanistRepositoryAPI.php:399]
  #2 ArcanistRepositoryAPI::execxLocal(string, string) called at [<arcanist>/src/workflow/ArcanistPatchWorkflow.php:778]
  #3 ArcanistPatchWorkflow::run() called at [<arcanist>/src/workflow/ArcanistPatchWorkflow.php:983]
  #4 ArcanistPatchWorkflow::applyDependencies(ArcanistBundle) called at [<arcanist>/src/workflow/ArcanistPatchWorkflow.php:469]
  #5 ArcanistPatchWorkflow::run() called at [<arcanist>/src/workflow/ArcanistPatchWorkflow.php:398]
  #6 ArcanistPatchWorkflow::run() called at [<arcanist>/scripts/arcanist.php:419]
<<< [1] (+15,126) <exec> 15,126,163 us

Rebase on upstream/main.

This revision was landed with ongoing or failed builds.May 21 2021, 8:51 AM

Closed by commit rGeaaf7a6a09da: [MLIR][GPU][NVVM] Add conversion of warp synchronous matrix-multiply accumulate… (authored by navdeepkk, committed by bondhugula). · Explain Why

This revision was automatically updated to reflect the committed changes.

bondhugula added a commit: rGeaaf7a6a09da: [MLIR][GPU][NVVM] Add conversion of warp synchronous matrix-multiply accumulate….

Harbormaster completed remote builds in B105646: Diff 347046.May 21 2021, 9:19 AM

navdeepkk mentioned this in D95333: [MLIR][NVVM] Add test cases to check translation of matrix-multiply accumulate ops to the corresponding intrinsics in NVPTX backend.May 22 2021, 3:09 AM

Revision Contents

Path	Size
mlir/include/mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h	4 lines
mlir/lib/Conversion/GPUToNVVM/CMakeLists.txt	1 line
mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp	33 lines
mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp	451 lines
mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir	91 lines

Diff 347002

mlir/include/mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h

[25 lines not shown]

  /// Configure target to convert from the GPU dialect to NVVM.
  void configureGpuToNVVMConversionLegality(ConversionTarget &target);

  /// Collect a set of patterns to convert from the GPU dialect to NVVM.
  void populateGpuToNVVMConversionPatterns(LLVMTypeConverter &converter,
                                           RewritePatternSet &patterns);

+ /// Collect a set of patterns to convert WMMA ops from GPU dialect to NVVM.
+ void populateGpuWMMAToNVVMConversionPatterns(LLVMTypeConverter &converter,
+                                              RewritePatternSet &patterns);

  /// Creates a pass that lowers GPU dialect operations to NVVM counterparts. The
  /// index bitwidth used for the lowering of the device side index computations
  /// is configurable.
  std::unique_ptr<OperationPass<gpu::GPUModuleOp>> createLowerGpuOpsToNVVMOpsPass(
      unsigned indexBitwidth = kDeriveIndexBitwidthFromDataLayout);

  } // namespace mlir

  #endif // MLIR_CONVERSION_GPUTONVVM_GPUTONVVMPASS_H_

mlir/lib/Conversion/GPUToNVVM/CMakeLists.txt

  set(LLVM_TARGET_DEFINITIONS GPUToNVVM.td)
  mlir_tablegen(GPUToNVVM.cpp.inc -gen-rewriters)
  add_public_tablegen_target(MLIRGPUToNVVMIncGen)

  add_mlir_conversion_library(MLIRGPUToNVVMTransforms
    LowerGpuOpsToNVVMOps.cpp
+   WmmaOpsToNvvm.cpp

    DEPENDS
    MLIRConversionPassIncGen
    MLIRGPUToNVVMIncGen

    LINK_LIBS PUBLIC
    MLIRGPU
    MLIRGPUToGPURuntimeTransforms
    MLIRLLVMIR
    MLIRMemRef
    MLIRNVVMIR
    MLIRPass
    MLIRStandardToLLVM
    MLIRTransformUtils
    )

mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp

[121 lines not shown; enclosing context: void runOnOperation() override {]

      LLVMTypeConverter converter(m.getContext(), options);
      converter.addConversion([&](MemRefType type) -> Optional<Type> {
        if (type.getMemorySpaceAsInt() !=
            gpu::GPUDialect::getPrivateAddressSpace())
          return llvm::None;
        return converter.convertType(MemRefType::Builder(type).setMemorySpace(0));
      });

+     // Lowering for MMAMatrixType.
+     converter.addConversion([&](gpu::MMAMatrixType type) -> Type {
+       // The number of items in structToReturn are dependent on the dataType
+       // and the MMA operand that this operation is associated with.
+       llvm::DenseMap<StringRef, int64_t> numElemsPerThreadF16,
+           numElemsPerThreadF32;
+       numElemsPerThreadF16["AOp"] = 8;
+       numElemsPerThreadF16["BOp"] = 8;
+       numElemsPerThreadF16["COp"] = 4;
+       numElemsPerThreadF16["DOp"] = 4;
+       numElemsPerThreadF32["AOp"] = 8;
+       numElemsPerThreadF32["BOp"] = 8;
+       numElemsPerThreadF32["COp"] = 8;
+       numElemsPerThreadF32["DOp"] = 8;
+       Type structToReturn;
+       if (type.getElementType().isF16()) {
+         // Number of f16's in 32-bit.
[inline comment] bondhugula: C++ style comment here. Terminate with full stop.
[inline comment] ftynse: Doesn't look addressed. C++-style comments are line comments starting with `//`.
+         unsigned vecSize = 2;
+         Type vec = VectorType::get(vecSize, FloatType::getF16(&getContext()));
+         unsigned size = numElemsPerThreadF16[type.getOperand()];
+         SmallVector<Type> elements(size, vec);
+         structToReturn =
+             LLVM::LLVMStructType::getLiteral(&getContext(), elements);
+       } else if (type.getElementType().isF32()) {
+         unsigned size = numElemsPerThreadF32[type.getOperand()];
+         SmallVector<Type> elements(size, FloatType::getF32(&getContext()));
+         structToReturn =
+             LLVM::LLVMStructType::getLiteral(&getContext(), elements);
+       }
+       return structToReturn;
+     });

      RewritePatternSet patterns(m.getContext());
      RewritePatternSet llvmPatterns(m.getContext());

      // Apply in-dialect lowering first. In-dialect lowering will replace ops
      // which need to be lowered further, which is not supported by a single
      // conversion pass.
      populateGpuRewritePatterns(patterns);
      (void)applyPatternsAndFoldGreedily(m, std::move(patterns));

      populateStdToLLVMConversionPatterns(converter, llvmPatterns);
      populateGpuToNVVMConversionPatterns(converter, llvmPatterns);
+     populateGpuWMMAToNVVMConversionPatterns(converter, llvmPatterns);
      LLVMConversionTarget target(getContext());
      configureGpuToNVVMConversionLegality(target);
      if (failed(applyPartialConversion(m, target, std::move(llvmPatterns))))
        signalPassFailure();
    }
  };

  } // anonymous namespace

[76 lines not shown]
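
The per-thread fragment sizes used by the `MMAMatrixType` conversion above can be restated as a small standalone function. This is a sketch, not the patch's code: `ElemType`, `FragmentLayout` and `fragmentLayout` are hypothetical names, and only the numbers (8 or 4 members of `<2 x f16>` for f16, 8 scalar members for f32) are taken from the diff. As with the `DenseMap` lookup in the diff, an unrecognized operand string yields zero elements.

```cpp
#include <string>

// Hypothetical element-type tag; the diff queries the MLIR element type instead.
enum class ElemType { F16, F32 };

// Shape of the LLVM struct that one thread holds for a fragment.
struct FragmentLayout {
  unsigned numElements; // Number of struct members.
  unsigned vectorWidth; // 2 for <2 x f16> members, 1 for scalar f32 members.
};

// Restates the table from the type-conversion lambda in the diff.
FragmentLayout fragmentLayout(ElemType elem, const std::string &operand) {
  bool isAB = operand == "AOp" || operand == "BOp";
  bool isCD = operand == "COp" || operand == "DOp";
  unsigned width = (elem == ElemType::F16) ? 2u : 1u;
  if (!isAB && !isCD)
    return {0u, width}; // Unknown operand: empty struct.
  if (elem == ElemType::F16)
    return {isAB ? 8u : 4u, width}; // A/B: 8 x <2 x f16>; C/D: 4 x <2 x f16>.
  return {8u, width};               // f32: 8 scalar members for every operand.
}
```

For example, `fragmentLayout(ElemType::F16, "COp")` describes a struct of four `<2 x f16>` members, while any f32 fragment is a struct of eight scalars.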

mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp

This file was added.

//===------ WmmaOpsToNVVM.cpp - WMMA LD/ST/Compute to NVVM lowering -------===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// This file contains definitions of patterns to lower GPU Subgroup MMA ops to
// NVVM Dialect.
//
//===----------------------------------------------------------------------===//

#include "mlir/Conversion/StandardToLLVM/ConvertStandardToLLVM.h"
#include "mlir/Dialect/GPU/GPUDialect.h"
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "mlir/Dialect/LLVMIR/NVVMDialect.h"

using namespace mlir;

namespace {

/// Contains all the common LLVM types which are used across the lowerings of
/// GPU subgroup ops to NVVM dialect.
struct CommonLLVMAndBuiltInMLIRTypes {
public:
  CommonLLVMAndBuiltInMLIRTypes(MLIRContext *context) {
    numHalfsInOpFrags.resize(4);
    numHalfsInOpFrags[A] = 8;
    numHalfsInOpFrags[B] = 8;
    numHalfsInOpFrags[C] = 4;
    numHalfsInOpFrags[D] = 4;
    i32Ty = IntegerType::get(context, 32);
    f16Ty = FloatType::getF16(context);
    f32Ty = FloatType::getF32(context);
    f16x2Ty = VectorType::get(2, f16Ty);
    fragArrayABTy = LLVM::LLVMStructType::getLiteral(
        context, SmallVector<Type>(8, f16x2Ty));
    fragArrayCDTy = LLVM::LLVMStructType::getLiteral(
        context, SmallVector<Type>(4, f16x2Ty));
    fragArrayCDF32Ty =
        LLVM::LLVMStructType::getLiteral(context, SmallVector<Type>(8, f32Ty));
  };

  Type i32Ty;
  Type f16Ty;
  Type f32Ty;
  Type f16x2Ty;
  /// Type for the fragment of A and B operands that a single thread holds for
  /// fp16 data type in a WMMA operation of the form D = (alpha(AB)) +
  /// (beta*C).
  Type fragArrayABTy;
  /// Type for the fragment of C and D operands that a single thread holds for
  /// fp16 data type in a WMMA operation of the form D = (alpha(AB)) +
  /// (beta*C).
  Type fragArrayCDTy;
  /// Type for the fragment of C and D operands that a single thread holds for
  /// fp32 data type in a WMMA operation of the form D = (alpha(AB)) +
  /// (beta*C).
  Type fragArrayCDF32Ty;
  /// Represents the number of f16 elements a single thread holds in a WMMA
  /// operation of the form D = (alpha(AB)) + (beta*C) .
  SmallVector<unsigned, 4> numHalfsInOpFrags;
  /// Represents the operands of a MMA operation of the form D = (alpha(AB)) +
  /// (beta*C).
  enum OperandMap { A, B, C, D };
};

/// Checks if all the operands of the op being lowered are of LLVM Types. The
/// types are expected to be converted by the `LLVMTypeConverter` before the op
/// is actually lowered. If the type of an operand is not already converted, it
/// hints at a missing type conversion and failure is returned in that case.
static LogicalResult areAllLLVMTypes(Operation *op, ValueRange operands,
                                     ConversionPatternRewriter &rewriter) {
  if (!llvm::all_of(operands, [](Value value) {
        return LLVM::isCompatibleType(value.getType());
      })) {
    return rewriter.notifyMatchFailure(
        op, "cannot convert if operands aren't of LLVM type.");
  }

  return success();
}

/// Error string to emit when unimplemented WMMA variant is encountered.
static constexpr StringRef kInvalidCaseStr =
    "Unimplemented WMMA variant, Only M16N16K16 version implemented.";

				/// This class implements the conversion of GPU MMA loadOp to wmma.load op
				/// in the NVVM dialect. The conversion not only emits the NVVM op but also
				/// emits code that is necessary to store the data in the destination memref
				/// after it has been loaded.
				struct WmmaLoadOpToNVVMLowering
				: public ConvertOpToLLVMPattern<gpu::SubgroupMmaLoadMatrixOp>,
				private CommonLLVMAndBuiltInMLIRTypes {
				public:
				explicit WmmaLoadOpToNVVMLowering(LLVMTypeConverter &typeConverter)
				: ConvertOpToLLVMPattern<gpu::SubgroupMmaLoadMatrixOp>(typeConverter),
				CommonLLVMAndBuiltInMLIRTypes(&this->getTypeConverter()->getContext()) {
				}

				LogicalResult
				matchAndRewrite(gpu::SubgroupMmaLoadMatrixOp subgroupMmaLoadMatrixOp,
				ArrayRef<Value> operands,
				ConversionPatternRewriter &rewriter) const override {
				Operation *op = subgroupMmaLoadMatrixOp.getOperation();
				if (failed(areAllLLVMTypes(op, operands, rewriter)))
				return failure();

				unsigned indexTypeBitwidth =
				this->getTypeConverter()->getIndexTypeBitwidth();

				// The corresponding intrinsics expects leadDimension to be a 32-bit
				// integer, so all the calculations of linearizing the load address
				// must also follow this restriction.
				if (indexTypeBitwidth != 32)
				return rewriter.notifyMatchFailure(
				op, "Expected indices to the memref to be 32-bit wide.");

				// Source memref of the original op.
				MemRefType srcMemrefType =
				subgroupMmaLoadMatrixOp.srcMemref().getType().cast<MemRefType>();
				Location loc = op->getLoc();

				auto leadDimension = subgroupMmaLoadMatrixOp.leadDimensionAttr();

				// MemRefDescriptor to extract alignedPtr and offset.
				MemRefDescriptor promotedSrcOp(
				gpu::SubgroupMmaLoadMatrixOpAdaptor(operands).srcMemref());

				// Emit ops which compute the load offset using `srcOffsetI`,
				// `srcOffsetJ`. The actualOffset is (memrefOffset + (alignedPtr +
				// ((leadDimension * srcOffsetI) + srcOffsetJ)). The memrefs here are
				// assumed to be normalized and hence the simple conversion works.
				SmallVector<Value> indices(subgroupMmaLoadMatrixOp.indices());
				Value srcOffsetIVal = indices[0];
				Value srcOffsetJVal = indices[1];
				Value leadingDim32 =
				rewriter.create<LLVM::ConstantOp>(loc, i32Ty, leadDimension);
				Value numElemsLeadDim =
				rewriter.create<LLVM::MulOp>(loc, i32Ty, leadingDim32, srcOffsetIVal);
				Value loadOffset = rewriter.create<LLVM::AddOp>(loc, i32Ty, numElemsLeadDim,
				srcOffsetJVal);

				Value promotedSrcOpToUse;
				promotedSrcOpToUse = promotedSrcOp.offset(rewriter, loc);
				Value actualOffset = rewriter.create<LLVM::AddOp>(loc, i32Ty, loadOffset,
				promotedSrcOpToUse);
				Value loadAddress = rewriter.create<LLVM::GEPOp>(
				loc,
				LLVM::LLVMPointerType::get(f16Ty, srcMemrefType.getMemorySpaceAsInt()),
				promotedSrcOp.alignedPtr(rewriter, loc), ArrayRef<Value>{actualOffset});

				// Bitcast the base address pointer of the destination memref, So that
				// values can be stored in chunks of 32-bits and semantics match with the
				// intrinsic exposed by NVPTX backend.
				Value loadAddressCasted = rewriter.create<LLVM::BitcastOp>(
				loc,
				LLVM::LLVMPointerType::get(i32Ty, srcMemrefType.getMemorySpaceAsInt()),
				loadAddress);

				// Get the shape of the MMAMatrix type being returned. The shape will
				// choose which intrinsic this op will be lowered to.
				gpu::MMAMatrixType retType =
				subgroupMmaLoadMatrixOp.res().getType().cast<gpu::MMAMatrixType>();
				ArrayRef<int64_t> retTypeShape = retType.getShape();

				Type resType;
				StringRef operandStr = retType.getOperand();
				if (operandStr.equals("AOp") || operandStr.equals("BOp")) {
				resType = fragArrayABTy;
				} else {
				if (srcMemrefType.getElementType().isF16())
				resType = fragArrayCDTy;
				else if (srcMemrefType.getElementType().isF32())
				resType = fragArrayCDF32Ty;
				else
				return failure();
				}

				// Create nvvm.mma_load op according to the operand types.
				SmallVector<Value, 2> loadOpOperands({loadAddressCasted, leadingDim32});
				if (operandStr.equals("AOp")) {
				if (retTypeShape[0] == 16 && retTypeShape[1] == 16) {
				NVVM::WMMALoadAM16N16K16Op wmmaLoadAOp =
				rewriter.create<NVVM::WMMALoadAM16N16K16Op>(loc, resType,
				loadOpOperands);
				rewriter.replaceOp(op, wmmaLoadAOp.getResult());
				} else {
				return rewriter.notifyMatchFailure(op, kInvalidCaseStr);
				}
				} else if (operandStr.equals("BOp")) {
				if (retTypeShape[0] == 16 && retTypeShape[1] == 16) {
				NVVM::WMMALoadBM16N16K16Op wmmaLoadBOp =
				rewriter.create<NVVM::WMMALoadBM16N16K16Op>(loc, resType,
				loadOpOperands);
				rewriter.replaceOp(op, wmmaLoadBOp.getResult());
				} else {
				return rewriter.notifyMatchFailure(op, kInvalidCaseStr);
				}
				} else {
				if (retTypeShape[0] == 16 && retTypeShape[1] == 16) {
				if (srcMemrefType.getElementType().isF16()) {
				NVVM::WMMALoadCF16M16N16K16Op wmmaLoadCOp =
				rewriter.create<NVVM::WMMALoadCF16M16N16K16Op>(loc, resType,
				loadOpOperands);
				rewriter.replaceOp(op, wmmaLoadCOp.getResult());
				} else if (srcMemrefType.getElementType().isF32()) {
				NVVM::WMMALoadCF32M16N16K16Op wmmaLoadCOp =
				rewriter.create<NVVM::WMMALoadCF32M16N16K16Op>(loc, resType,
				loadOpOperands);
				rewriter.replaceOp(op, wmmaLoadCOp.getResult());
				}
				} else {
				return rewriter.notifyMatchFailure(op, kInvalidCaseStr);
				}
				}
				return success();
				}
				};
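
The offset arithmetic emitted by the load lowering above can be checked by hand with a small stand-alone sketch. The helper names `linearizedOffset` and `byteOffsetF16` are hypothetical and not part of the patch; the sketch only mirrors the i32 computation `memrefOffset + (leadDimension * i + j)` that is then fed to the GEP on the aligned pointer, assuming a normalized (identity-layout) memref as the comment in the code states.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not in the patch): element offset passed to the GEP,
// i.e. memrefOffset + (leadDimension * i + j), computed in 32-bit integers
// just like the emitted llvm.mul / llvm.add ops.
constexpr int32_t linearizedOffset(int32_t memrefOffset, int32_t leadDimension,
                                   int32_t i, int32_t j) {
  return memrefOffset + (leadDimension * i + j);
}

// The GEP indexes a pointer to f16, so one element step is 2 bytes.
constexpr int64_t byteOffsetF16(int32_t elementOffset) {
  return static_cast<int64_t>(elementOffset) * 2;
}

// Values taken from the test in this revision: indices (16, 16) into a
// 32x32 memref with leadDimension = 32 and a zero memref offset.
static_assert(linearizedOffset(0, 32, 16, 16) == 528,
              "16 full rows of 32 elements plus 16 columns");
```

With the test's values the tile starts 528 elements past the aligned pointer, which is 1056 bytes for f16 data.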

				/// This class implements the conversion of GPU MMA storeOp to wmma.store op
				/// in the NVVM dialect. The conversion not only emits the NVVM op but also
				/// emits code that is necessary to unpack the data in the source and
				/// convert the data in the format that is needed by the NVVM op.
				struct WmmaStoreOpToNVVMLowering
				: public ConvertOpToLLVMPattern<gpu::SubgroupMmaStoreMatrixOp>,
				private CommonLLVMAndBuiltInMLIRTypes {
				public:
				explicit WmmaStoreOpToNVVMLowering(LLVMTypeConverter &typeConverter)
				: ConvertOpToLLVMPattern<gpu::SubgroupMmaStoreMatrixOp>(typeConverter),
				CommonLLVMAndBuiltInMLIRTypes(&this->getTypeConverter()->getContext()) {
				}

				LogicalResult
				matchAndRewrite(gpu::SubgroupMmaStoreMatrixOp subgroupMmaStoreMatrixOp,
				ArrayRef<Value> operands,
				ConversionPatternRewriter &rewriter) const override {
				Operation *op = subgroupMmaStoreMatrixOp.getOperation();
				if (failed(areAllLLVMTypes(op, operands, rewriter)))
				return failure();

				unsigned indexTypeBitwidth =
				this->getTypeConverter()->getIndexTypeBitwidth();
				// The corresponding intrinsic expects leadDimension to be a 32-bit
				// integer, so all the calculations that linearize the store address
				// must also follow this restriction.
				if (indexTypeBitwidth != 32)
				return rewriter.notifyMatchFailure(
				op, "expected indices to the memref to be 32-bit wide.");

				Location loc = op->getLoc();

				// Destination memref of the original op.
				MemRefType dstMemrefType =
				subgroupMmaStoreMatrixOp.dstMemref().getType().cast<MemRefType>();

				// MemRefDescriptor to extract alignedPtr and offset.
				MemRefDescriptor promotedDstOp(
				gpu::SubgroupMmaStoreMatrixOpAdaptor(operands).dstMemref());

				auto leadDimension = subgroupMmaStoreMatrixOp.leadDimensionAttr();

				// Emit ops which compute the store offset using `dstOffsetI` and
				// `dstOffsetJ`. The store address is alignedPtr + (memrefOffset +
				// ((leadDimension * dstOffsetI) + dstOffsetJ)).
				SmallVector<Value> indices(subgroupMmaStoreMatrixOp.indices());
				Value dstOffsetIVal = indices[0];
				Value dstOffsetJVal = indices[1];
				Value leadingDim32 =
				rewriter.create<LLVM::ConstantOp>(loc, i32Ty, leadDimension);
				Value numElemsLeadDim =
				rewriter.create<LLVM::MulOp>(loc, i32Ty, leadingDim32, dstOffsetIVal);
				Value storeOffset = rewriter.create<LLVM::AddOp>(loc, i32Ty, numElemsLeadDim,
				dstOffsetJVal);

				Value promotedDstOpToUse;
				promotedDstOpToUse = promotedDstOp.offset(rewriter, loc);
				Value actualOffset = rewriter.create<LLVM::AddOp>(loc, i32Ty, storeOffset,
				promotedDstOpToUse);
				Value storeAddress = rewriter.create<LLVM::GEPOp>(
				loc,
				LLVM::LLVMPointerType::get(f16Ty, dstMemrefType.getMemorySpaceAsInt()),
				promotedDstOp.alignedPtr(rewriter, loc), ArrayRef<Value>{actualOffset});

				// Bitcast the base address pointer of the destination memref so that
				// values can be stored in chunks of 32 bits and the semantics match
				// the intrinsic exposed by the NVPTX backend.
				Value storeAddressCasted = rewriter.create<LLVM::BitcastOp>(
				loc,
				LLVM::LLVMPointerType::get(i32Ty, dstMemrefType.getMemorySpaceAsInt()),
				storeAddress);

				SmallVector<Value, 4> storeOpOperands;
				storeOpOperands.push_back(storeAddressCasted);

				// Get the shape of the MMAMatrix type being stored. The shape will
				// choose which intrinsic this op will be lowered to.
				gpu::MMAMatrixType srcType =
				subgroupMmaStoreMatrixOp.src().getType().cast<gpu::MMAMatrixType>();
				ArrayRef<int64_t> srcTypeShape = srcType.getShape();

				// Unpack the results from the source.
				if (subgroupMmaStoreMatrixOp.src()
				.getType()
				.cast<gpu::MMAMatrixType>()
				.getElementType() == f16Ty) {
				for (unsigned i = 0, e = numHalfsInOpFrags[D]; i < e; ++i) {
				Value toUse = rewriter.create<LLVM::ExtractValueOp>(
				loc, f16x2Ty, operands[0], rewriter.getI32ArrayAttr(i));
				storeOpOperands.push_back(toUse);
				}
				storeOpOperands.push_back(leadingDim32);

				// Create nvvm.mma_store op.
				if (srcTypeShape[0] == 16 && srcTypeShape[1] == 16) {
				rewriter.create<NVVM::WMMAStoreF16M16N16K16Op>(loc, storeOpOperands);
				} else {
				return rewriter.notifyMatchFailure(op, kInvalidCaseStr);
				}
				rewriter.eraseOp(op);
				return success();
				} else if (subgroupMmaStoreMatrixOp.src()
				// Lint (pre-merge checks, clang-tidy): warning: do not use 'else' after 'return' [llvm-else-after-return]
				.getType()
				.cast<gpu::MMAMatrixType>()
				.getElementType() == f32Ty) {
				for (unsigned i = 0, e = 8; i < e; ++i) {
				Value toUse = rewriter.create<LLVM::ExtractValueOp>(
				loc, f32Ty, operands[0], rewriter.getI32ArrayAttr(i));
				storeOpOperands.push_back(toUse);
				}
				storeOpOperands.push_back(leadingDim32);

				// Create nvvm.mma_store op.
				if (srcTypeShape[0] == 16 && srcTypeShape[1] == 16)
				rewriter.create<NVVM::WMMAStoreF32M16N16K16Op>(loc, storeOpOperands);
				else {
				return rewriter.notifyMatchFailure(op, kInvalidCaseStr);
				}
				rewriter.eraseOp(op);
				return success();
				}

				return failure();
				}
				};
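
The operand list built by the store lowering above has a fixed layout: the bitcast address, then the unpacked elements of the D fragment, then the leading dimension. The following sketch counts those operands. The names `StoreElemKind`, `numDFragmentElems` and `numStoreOperands` are hypothetical; the element counts (four `vector<2xf16>` values for f16, eight `f32` values for f32) are read off the loops in the code and the test in this revision, and apply only to the 16x16 tile handled here.

```cpp
#include <cassert>

// Hypothetical classification of the element type of the stored matrix.
enum class StoreElemKind { F16, F32, Other };

// Number of values extracted from the D fragment struct for a 16x16 tile.
// Returns -1 for element types the lowering rejects.
inline int numDFragmentElems(StoreElemKind kind) {
  switch (kind) {
  case StoreElemKind::F16:
    return 4; // four vector<2xf16> values
  case StoreElemKind::F32:
    return 8; // eight scalar f32 values
  case StoreElemKind::Other:
    return -1;
  }
  return -1;
}

// Operands of the store op: address + fragment elements + leading dimension.
// Returns -1 when the element type is unsupported.
inline int numStoreOperands(StoreElemKind kind) {
  const int frag = numDFragmentElems(kind);
  if (frag < 0)
    return -1;
  return 1 + frag + 1;
}
```

For the f16 case this gives six operands, matching the `nvvm.wmma.m16n16k16.store.d.f16.row.stride` line checked in the test (address, four elements, leading dimension).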

				/// This class implements the conversion of GPU MMA computeOp to wmma.mma op
				/// in the NVVM dialect.
				struct WmmaMmaOpToNVVMLowering
				: public ConvertOpToLLVMPattern<gpu::SubgroupMmaComputeOp>,
				private CommonLLVMAndBuiltInMLIRTypes {
				explicit WmmaMmaOpToNVVMLowering(LLVMTypeConverter &typeConverter)
				: ConvertOpToLLVMPattern<gpu::SubgroupMmaComputeOp>(typeConverter),
				CommonLLVMAndBuiltInMLIRTypes(&this->getTypeConverter()->getContext()) {
				}

				LogicalResult
				matchAndRewrite(gpu::SubgroupMmaComputeOp subgroupMmaComputeOp,
				ArrayRef<Value> operands,
				ConversionPatternRewriter &rewriter) const override {
				Operation *op = subgroupMmaComputeOp.getOperation();
				if (failed(areAllLLVMTypes(op, operands, rewriter)))
				return failure();

				Location loc = op->getLoc();

				// The wmma.mma intrinsic in llvm requires the operands as individual
				// values. So individual elements from the memrefs need to be extracted and
				// then passed on to the intrinsic call. Emit llvm ops to extract individual
				// values from lowered memrefs.
				SmallVector<Value> unpackedOps;

				auto unpackOp = [&](CommonLLVMAndBuiltInMLIRTypes::OperandMap op,
				Value operand, unsigned numElems, Type elemType) {
				for (unsigned i = 0; i < numElems; ++i) {
				Value toUse = rewriter.create<LLVM::ExtractValueOp>(
				loc, elemType, operand, rewriter.getI32ArrayAttr(i));
				unpackedOps.push_back(toUse);
				}
				};

				// Get the shapes of the MMAMatrix type being used. The shapes will
				// choose which intrinsic this op will be lowered to.
				gpu::MMAMatrixType aType =
				subgroupMmaComputeOp.opA().getType().cast<gpu::MMAMatrixType>();
				ArrayRef<int64_t> aTypeShape = aType.getShape();
				gpu::MMAMatrixType bType =
				subgroupMmaComputeOp.opB().getType().cast<gpu::MMAMatrixType>();
				ArrayRef<int64_t> bTypeShape = bType.getShape();
				gpu::MMAMatrixType cType =
				subgroupMmaComputeOp.opC().getType().cast<gpu::MMAMatrixType>();
				ArrayRef<int64_t> cTypeShape = cType.getShape();

				gpu::SubgroupMmaComputeOpAdaptor transformedOperands(operands);
				if (subgroupMmaComputeOp.opC()
				.getType()
				.cast<gpu::MMAMatrixType>()
				.getElementType() == f16Ty) {
				unpackOp(A, transformedOperands.opA(), numHalfsInOpFrags[A], f16x2Ty);
				unpackOp(B, transformedOperands.opB(), numHalfsInOpFrags[B], f16x2Ty);
				unpackOp(C, transformedOperands.opC(), numHalfsInOpFrags[C], f16x2Ty);

				if (aTypeShape[0] == 16 && aTypeShape[1] == 16 && bTypeShape[0] == 16 &&
				bTypeShape[1] == 16 && cTypeShape[0] == 16 && cTypeShape[1] == 16) {
				// Create nvvm.wmma.mma op.
				NVVM::WMMAMmaF16F16M16N16K16Op wmmaMmaOp =
				rewriter.create<NVVM::WMMAMmaF16F16M16N16K16Op>(loc, fragArrayCDTy,
				unpackedOps);

				rewriter.replaceOp(op, wmmaMmaOp.getResult());
				return success();
				} else {
				// Lint (pre-merge checks, clang-tidy): warning: do not use 'else' after 'return' [llvm-else-after-return]
				return rewriter.notifyMatchFailure(op, kInvalidCaseStr);
				}
				} else if (subgroupMmaComputeOp.opC()
				.getType()
				.cast<gpu::MMAMatrixType>()
				.getElementType() == f32Ty) {
				unpackOp(A, transformedOperands.opA(), numHalfsInOpFrags[A], f16x2Ty);
				unpackOp(B, transformedOperands.opB(), numHalfsInOpFrags[B], f16x2Ty);
				unpackOp(C, transformedOperands.opC(), 8, f32Ty);

				if (aTypeShape[0] == 16 && aTypeShape[1] == 16 && bTypeShape[0] == 16 &&
				bTypeShape[1] == 16 && cTypeShape[0] == 16 && cTypeShape[1] == 16) {
				// Create nvvm.wmma.mma op.
				NVVM::WMMAMmaF32F32M16N16K16Op wmmaMmaOp =
				rewriter.create<NVVM::WMMAMmaF32F32M16N16K16Op>(
				loc, fragArrayCDF32Ty, unpackedOps);

				rewriter.replaceOp(op, wmmaMmaOp.getResult());
				return success();
				} else {
				// Lint (pre-merge checks, clang-tidy): warning: do not use 'else' after 'return' [llvm-else-after-return]
				return rewriter.notifyMatchFailure(op, kInvalidCaseStr);
				}
				}

				return failure();
				}
				};
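
The control flow of the compute lowering above can also be written with early returns, which is the shape the clang-tidy `llvm-else-after-return` check asks for. The sketch below is a stand-alone illustration, not the patch's code: `AccumKind`, `TileShape`, `is16x16` and `numMmaOperands` are hypothetical names, and the function only computes how many flattened operands the wmma.mma op receives (eight each for A and B, plus four or eight for C depending on the accumulator type), returning -1 where the lowering would report a match failure.

```cpp
#include <cassert>

// Hypothetical classification of the accumulator (C operand) element type.
enum class AccumKind { F16, F32, Other };

// Hypothetical stand-in for the shape of an MMA matrix type.
struct TileShape {
  long rows;
  long cols;
};

inline bool is16x16(TileShape s) { return s.rows == 16 && s.cols == 16; }

// Number of flattened operands passed to the wmma.mma op, or -1 when the
// combination is not handled. Every branch returns, so no `else` is needed.
inline int numMmaOperands(AccumKind accum, TileShape a, TileShape b,
                          TileShape c) {
  if (!is16x16(a) || !is16x16(b) || !is16x16(c))
    return -1; // only the m16n16k16 variant is lowered

  const int numAB = 8 + 8; // eight vector<2xf16> values each for A and B
  if (accum == AccumKind::F16)
    return numAB + 4; // four vector<2xf16> values for C
  if (accum == AccumKind::F32)
    return numAB + 8; // eight f32 values for C
  return -1; // unsupported accumulator element type
}
```

The f16 case yields 20 operands, which agrees with the `nvvm.wmma.m16n16k16.mma.row.row.f16.f16` line in the test (A1 to A8, B1 to B8, C1 to C4).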

				} // anonymous namespace

				namespace mlir {
				void populateGpuWMMAToNVVMConversionPatterns(LLVMTypeConverter &converter,
				RewritePatternSet &patterns) {
				patterns.insert<WmmaLoadOpToNVVMLowering>(converter);
				patterns.insert<WmmaMmaOpToNVVMLowering>(converter);
				patterns.insert<WmmaStoreOpToNVVMLowering>(converter);
				}
				} // namespace mlir

mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir

This file was added.

				// RUN: mlir-opt --convert-gpu-to-nvvm="index-bitwidth=32" --split-input-file %s | FileCheck %s

				gpu.module @test_module {

				// CHECK-LABEL: func @gpu_wmma_load_op() ->
				// CHECK-SAME: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)> {
				func @gpu_wmma_load_op() -> (!gpu.mma_matrix<16x16xf16, "AOp">) {
				%wg = memref.alloca() {alignment = 32} : memref<32x32xf16, 3>
				%i = constant 16 : index
				%j = constant 16 : index
				%0 = gpu.subgroup_mma_load_matrix %wg[%i, %j] {leadDimension = 32 : index} : memref<32x32xf16, 3> -> !gpu.mma_matrix<16x16xf16, "AOp">
				// CHECK: %[[INX:.*]] = llvm.mlir.constant(16 : index) : i32
				// CHECK: %{{.*}} = llvm.insertvalue %{{.*}}, %{{.*}}[{{.*}}, {{.*}}]
				// CHECK: %[[LDM:.*]] = llvm.mlir.constant(32 : index) : i32
				// CHECK: %[[LI:.*]] = llvm.mul %[[LDM]], %[[INX]] : i32
				// CHECK: %[[LIJ:.*]] = llvm.add %[[LI]], %[[INX]] : i32
				// CHECK: %[[OFFSET:.*]] = llvm.extractvalue %{{.*}}[2] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				// CHECK: %[[LIJO:.*]] = llvm.add %[[LIJ]], %[[OFFSET]] : i32
				// CHECK: %[[BASE:.*]] = llvm.extractvalue %{{.*}}[1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				// CHECK: %[[ADDRESS:.*]] = llvm.getelementptr %[[BASE]][%[[LIJO]]] : (!llvm.ptr<f16, 3>, i32) -> !llvm.ptr<f16, 3>
				// CHECK: %[[CADDRESS:.*]] = llvm.bitcast %[[ADDRESS]] : !llvm.ptr<f16, 3> to !llvm.ptr<i32, 3>
				// CHECK: %[[FRAG:.*]] = nvvm.wmma.m16n16k16.load.a.f16.row.stride %[[CADDRESS]], %[[LDM]] : (!llvm.ptr<i32, 3>, i32) -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: llvm.return %[[FRAG]] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				return %0 : !gpu.mma_matrix<16x16xf16, "AOp">
				}
				}

				// -----

				gpu.module @test_module {

				// CHECK-LABEL: func @gpu_wmma_store_op
				// CHECK-SAME: (%[[D:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>) {
				func @gpu_wmma_store_op(%arg0 : !gpu.mma_matrix<16x16xf16, "DOp">) -> () {
				%sg = memref.alloca(){alignment = 32} : memref<32x32xf16, 3>
				%i = constant 16 : index
				%j = constant 16 : index
				ftynse: I don't think we care if these operations are on the next line or not. Furthermore, they seem to be testing the conversion of allocations, which isn't anyhow relevant to what this test is about. Such tests lead to excessive churn and spurious breakages. Please only test what the new code does.
				ftynse: I still see a lot of CHECK-NEXT, and I am not convinced that we should care about these operations being on subsequent lines. If tomorrow we decide to change the syntax of `llvm.extractvalue` to print the type on a separate line, these tests would break for no good reason.
				gpu.subgroup_mma_store_matrix %arg0, %sg[%i,%j] {leadDimension= 32 : index} : !gpu.mma_matrix<16x16xf16, "DOp">, memref<32x32xf16, 3>
				// CHECK: %[[INX:.*]] = llvm.mlir.constant(16 : index) : i32
				// CHECK: %{{.*}} = llvm.insertvalue %{{.*}}, %{{.*}}[{{.*}}, {{.*}}]
				// CHECK: %[[LDM:.*]] = llvm.mlir.constant(32 : index) : i32
				// CHECK: %[[LI:.*]] = llvm.mul %[[LDM]], %[[INX]] : i32
				// CHECK: %[[LIJ:.*]] = llvm.add %[[LI]], %[[INX]] : i32
				// CHECK: %[[OFFSET:.*]] = llvm.extractvalue %{{.*}}[2] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				// CHECK: %[[LIJO:.*]] = llvm.add %[[LIJ]], %[[OFFSET]] : i32
				// CHECK: %[[BASE:.*]] = llvm.extractvalue %{{.*}}[1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				// CHECK: %[[ADDRESS:.*]] = llvm.getelementptr %[[BASE]][%[[LIJO]]] : (!llvm.ptr<f16, 3>, i32) -> !llvm.ptr<f16, 3>
				// CHECK: %[[CADDRESS:.*]] = llvm.bitcast %[[ADDRESS]] : !llvm.ptr<f16, 3> to !llvm.ptr<i32, 3>
				// CHECK: %[[EL1:.*]] = llvm.extractvalue %[[D]][0 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[EL2:.*]] = llvm.extractvalue %[[D]][1 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[EL3:.*]] = llvm.extractvalue %[[D]][2 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[EL4:.*]] = llvm.extractvalue %[[D]][3 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: nvvm.wmma.m16n16k16.store.d.f16.row.stride %[[CADDRESS]], %[[EL1]], %[[EL2]], %[[EL3]], %[[EL4]], %[[LDM]] : !llvm.ptr<i32, 3>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, i32
				// CHECK: llvm.return
				return
				}
				}
				bondhugula: Nit: Leave a space after `//`, i.e., //CHECK -> // CHECK

				// -----

				gpu.module @test_module {

				// CHECK-LABEL: func @gpu_wmma_mma_op
				// CHECK-SAME: (%[[A:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>, %[[B:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>, %[[C:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>) {
				func @gpu_wmma_mma_op(%A : !gpu.mma_matrix<16x16xf16, "AOp">, %B : !gpu.mma_matrix<16x16xf16, "BOp">, %C : !gpu.mma_matrix<16x16xf16, "COp">) -> () {
				%D = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp">, !gpu.mma_matrix<16x16xf16, "COp"> -> !gpu.mma_matrix<16x16xf16, "DOp">
				// CHECK: %[[A1:.*]] = llvm.extractvalue %[[A]][0 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[A2:.*]] = llvm.extractvalue %[[A]][1 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[A3:.*]] = llvm.extractvalue %[[A]][2 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[A4:.*]] = llvm.extractvalue %[[A]][3 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[A5:.*]] = llvm.extractvalue %[[A]][4 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[A6:.*]] = llvm.extractvalue %[[A]][5 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[A7:.*]] = llvm.extractvalue %[[A]][6 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[A8:.*]] = llvm.extractvalue %[[A]][7 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[B1:.*]] = llvm.extractvalue %[[B]][0 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[B2:.*]] = llvm.extractvalue %[[B]][1 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[B3:.*]] = llvm.extractvalue %[[B]][2 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[B4:.*]] = llvm.extractvalue %[[B]][3 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[B5:.*]] = llvm.extractvalue %[[B]][4 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[B6:.*]] = llvm.extractvalue %[[B]][5 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[B7:.*]] = llvm.extractvalue %[[B]][6 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[B8:.*]] = llvm.extractvalue %[[B]][7 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[C1:.*]] = llvm.extractvalue %[[C]][0 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[C2:.*]] = llvm.extractvalue %[[C]][1 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[C3:.*]] = llvm.extractvalue %[[C]][2 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %[[C4:.*]] = llvm.extractvalue %[[C]][3 : i32] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: %{{.*}} = nvvm.wmma.m16n16k16.mma.row.row.f16.f16 %[[A1]], %[[A2]], %[[A3]], %[[A4]], %[[A5]], %[[A6]], %[[A7]], %[[A8]], %[[B1]], %[[B2]], %[[B3]], %[[B4]], %[[B5]], %[[B6]], %[[B7]], %[[B8]], %[[C1]], %[[C2]], %[[C3]], %[[C4]] : vector<2xf16> -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				// CHECK: llvm.return
				return
				}
				}

Revision Contents

Diff 347002
