This is an archive of the discontinued LLVM Phabricator instance.


Differential D95331

[MLIR][GPU][NVVM] Add conversion of warp synchronous matrix-multiply accumulate GPU ops
ClosedPublic

Authored by navdeepkk on Jan 24 2021, 11:17 PM.


Details

Reviewers

bondhugula
herhut
ftynse

Commits

rGeaaf7a6a09da: [MLIR][GPU][NVVM] Add conversion of warp synchronous matrix-multiply accumulate…

Summary

[MLIR][GPU][NVVM] Add conversion of warp synchronous matrix-multiply
accumulate GPU ops
Add conversion of warp synchronous matrix-multiply accumulate GPU ops to
NVVM ops. The following conversions are added :-

1.) subgroup_mma_load_matrix -> wmma.m16n16k16.load.[a,b,c]..row.stride
2.) subgroup_mma_store_matrix -> wmma.m16n16k16.store.d.[f16,f32].row.stride
3.) subgroup_mma_compute -> wmma.m16n16k16.mma.row.row.[f16,f32].[f16,f32]
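Reading each bracketed group in the summary as a choice of one token per position, the store and mma op names expand as sketched below. This is only a string-composition illustration, not code from the patch; the helper names `wmmaStoreName`, `wmmaMmaName` and `allWmmaMmaNames` are hypothetical, and the load name is left out because part of it is elided in the summary.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: composes the store op name from the summary,
// `wmma.m16n16k16.store.d.[f16,f32].row.stride`, for one type token.
std::string wmmaStoreName(const std::string &type) {
  return "wmma.m16n16k16.store.d." + type + ".row.stride";
}

// Hypothetical helper: composes the mma op name from the summary,
// `wmma.m16n16k16.mma.row.row.[f16,f32].[f16,f32]`, for two type tokens.
std::string wmmaMmaName(const std::string &first, const std::string &second) {
  return "wmma.m16n16k16.mma.row.row." + first + "." + second;
}

// Enumerates every mma name implied by the two bracketed groups.
std::vector<std::string> allWmmaMmaNames() {
  const std::vector<std::string> types = {"f16", "f32"};
  std::vector<std::string> names;
  for (const std::string &first : types)
    for (const std::string &second : types)
      names.push_back(wmmaMmaName(first, second));
  return names;
}

// Usage sketch: the two groups of two tokens give four mma names.
inline void checkWmmaNameCount() { assert(allWmmaMmaNames().size() == 4); }
```

Under this reading there are two store names and four mma names; whether every combination is actually lowered by the patch is not something the summary states.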

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	130 ms	x64 windows > MLIR.Conversion/GPUToNVVM::wmma-ops-to-nvvm.mlir

Event Timeline

navdeepkk created this revision. Jan 24 2021, 11:17 PM

Herald added subscribers: teijeong, rdzhabarov, tatianashp and 16 others. Jan 24 2021, 11:17 PM

navdeepkk requested review of this revision. Jan 24 2021, 11:17 PM

Herald added a reviewer: herhut. Jan 24 2021, 11:17 PM

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

navdeepkk added a parent revision: D95330: [MLIR][GPU][NVVM] Add warp synchronous matrix-multiply accumulate ops. Jan 24 2021, 11:18 PM

navdeepkk added a child revision: D95332: [MLIR][CUDA-RUNNER] Add CL options to pass SM version and index-bitwidth. Jan 24 2021, 11:22 PM

Harbormaster completed remote builds in B86507: Diff 318906. Jan 24 2021, 11:32 PM

ftynse added a reviewer: ftynse. Jan 25 2021, 5:28 AM

This needs to be rebased on master; I removed LLVMType several weeks ago.

ftynse requested changes to this revision. Jan 25 2021, 5:28 AM

This revision now requires changes to proceed. Jan 25 2021, 5:28 AM

Changes in this diff :-

1.) Rebase on master to drop the use of LLVMType.
2.) Make changes to use the !gpu.mmafragment type introduced in parent
  revision D95330.

Harbormaster completed remote builds in B87841: Diff 321330. Feb 4 2021, 12:52 AM

I haven't done a detailed review since this commit may change if its parent changes.

mlir/lib/Conversion/GPUToNVVM/CommonTypes.h
1	Headers need to have `-- C++ --` at the end of the first line
mlir/lib/Conversion/GPUToNVVM/WmmaLoadOptoNvvmLowering.h
13	I don't see the point of putting _implementations_ of conversion patterns into separate _header_ files.
mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir
15–36	I don't think we care if these operations are on the next line or not. Furthermore, they seem to be testing the conversion of allocations, which isn't anyhow relevant to what this test is about. Such tests lead to excessive churn and spurious breakages. Please only test what the new code does.

ftynse requested changes to this revision. Feb 9 2021, 6:22 AM

This revision now requires changes to proceed. Feb 9 2021, 6:22 AM

Changes in this diff :-

1.) Address comments on diff 321330.

Harbormaster completed remote builds in B89559: Diff 324328. Feb 17 2021, 10:02 AM

Changes in this diff :-

1.) Rebase on upstream/main.
2.) Make changes to operate with the newly introduced gpu.mma_matrix type.

Herald added subscribers: dcaballe, cota. May 2 2021, 1:24 PM

Harbormaster completed remote builds in B102196: Diff 342267. May 2 2021, 2:07 PM

bondhugula added inline comments. May 2 2021, 5:54 PM

mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp
144	C++ style comment here. Terminate with full stop.
mlir/lib/Conversion/GPUToNVVM/WmmaLoadStoreToNvvmLowering.h
31 ↗	(On Diff #342267)	Cannot -> cannot
36 ↗	(On Diff #342267)	Typo: implements
64 ↗	(On Diff #342267)	Expected -> expected
73 ↗	(On Diff #342267)	Remove commented out code please.
302–307 ↗	(On Diff #342267)	Nit: In such cases, consider using braces for the `then` and `else` blocks.
mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir
42–57	Nit: Leave a space after `//`, i.e., `//CHECK` -> `// CHECK`

bondhugula added inline comments. May 2 2021, 5:59 PM

mlir/lib/Conversion/GPUToNVVM/CommonTypes.h
26	Many of these are now built-in MLIR types. `common LLVM` -> `common LLVM and builtin MLIR`
63–64	The context here isn't clear as to operands to which op. Mention about mma/wmma when referring to fragments?
75–76	Doc comments for these two.
mlir/lib/Conversion/GPUToNVVM/WmmaLoadStoreToNvvmLowering.h
119 ↗	(On Diff #342267)	Drop commented out code.

navdeepkk edited the summary of this revision. May 5 2021, 9:54 PM

Changes in this diff :-

1.) Address comments on previous diff(342267).

Harbormaster completed remote builds in B102911: Diff 343284. May 5 2021, 10:47 PM

bondhugula added inline comments. May 6 2021, 7:48 AM

mlir/lib/Conversion/GPUToNVVM/CommonTypes.h
16	clang-tidy warning if this is a meaningful one.
mlir/lib/Conversion/GPUToNVVM/WmmaLoadStoreToNvvmLowering.h
82–90 ↗	(On Diff #343284)	All of this code in header files instead of a source file? Is this an exception for some reason?

ftynse requested changes to this revision. May 6 2021, 8:15 AM

ftynse added inline comments.

mlir/lib/Conversion/GPUToNVVM/WmmaLoadStoreToNvvmLowering.h
25 ↗	(On Diff #343284)	Add a documentation comment please.
47 ↗	(On Diff #343284)	How about having this class (privately) derive `CommonLLVMAndBuiltInMLIRTypes` instead of containing it? This will let it use the fields directly without the annoying prefix. Also, I'm not entirely certain why this is even needed. Most patterns don't need all of the types, yet this will create the types inside each pattern, taking locks in the context. Yes, we won't need to do it every time the pattern is applied, but it feels like the number of spuriously created types compensates this. Another alternative would be to have a single instance of `CommonLLVMAndBuiltInMLIRTypes` and pass a reference to this instance to each pattern.
64 ↗	(On Diff #343284)	Typo: meref -> memref ?
75 ↗	(On Diff #343284)	Please expand auto here.
75 ↗	(On Diff #343284)	Also, I suppose you need promoteOneMemRefDescriptor instead. Then you can wrap the result into a MemRefDescriptor class and get much more readable code below.
76 ↗	(On Diff #343284)	Could this use OpAdaptor to get named operand accessors instead of magic constants?
82–90 ↗	(On Diff #343284)	I made the same comment on the first version of this patch, there is no point for this code to be in a header, even if the header lives under lib/.
93 ↗	(On Diff #343284)	Please avoid magic constants.
186 ↗	(On Diff #343284)	Same comments as above in this class.
302–307 ↗	(On Diff #342267)	This doesn't look done, did you forget to upload?
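The three layouts discussed in the comment on line 47 above (containing the types struct, privately deriving from it, or sharing one instance by reference) can be compared in a small standalone sketch. `CommonTypes` is a stand-in for `CommonLLVMAndBuiltInMLIRTypes` that uses strings instead of MLIR types, and the three pattern class names are hypothetical.

```cpp
#include <cassert>
#include <string>

// Stand-in for CommonLLVMAndBuiltInMLIRTypes; the real struct holds MLIR
// types, plain strings are used here so the sketch compiles on its own.
struct CommonTypes {
  std::string f16Ty = "f16";
  std::string int32Type = "i32";
};

// Containment, as in the reviewed diff: every use needs the member prefix.
struct LoadLoweringContaining {
  CommonTypes llvmTypes;
  std::string elementType() const { return llvmTypes.f16Ty; }
};

// First suggestion: private derivation lets the pattern name fields directly.
struct LoadLoweringDerived : private CommonTypes {
  std::string elementType() const { return f16Ty; }
};

// Second suggestion: a single shared instance is passed by reference, so the
// types are created once rather than once per pattern.
class LoadLoweringShared {
public:
  explicit LoadLoweringShared(const CommonTypes &types) : types(types) {}
  std::string elementType() const { return types.f16Ty; }

private:
  const CommonTypes &types; // Must outlive the pattern.
};

// Usage sketch: all three layouts expose the same type to the pattern body.
inline void checkLayouts() {
  CommonTypes shared;
  assert(LoadLoweringContaining().elementType() ==
         LoadLoweringShared(shared).elementType());
}
```

The shared-reference form trades the convenience of unprefixed field access for a single construction of the types, which is the cost the reviewer is concerned about.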
mlir/lib/Conversion/GPUToNVVM/WmmaMmaOptoNvvmLowering.h
36–45	A similar function is present in another file this patch is adding. Consider refactoring to only have one definition.
61	SmallVector now has a default number of stack elements. Drop the manually specified number unless there is a specific reason to choose the value. Here, I don't see why 24 is special.
88	Please use OpAdaptors here and below to avoid `operands[magic-constant]`
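The request to replace `operands[magic-constant]` with named accessors can be illustrated without MLIR. The sketch below is a hand-written stand-in, not the generated OpAdaptor class: `Value` is an `int` placeholder, `LoadMatrixAdaptor` is a hypothetical name, and the operand layout (source memref first, indices after it) is assumed from the load lowering shown in this revision.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Placeholder for mlir::Value so the sketch compiles on its own.
using Value = int;

// Hypothetical adaptor: wraps the flat operand list and names each operand,
// so callers never write operands[0] or operands[1] directly.
class LoadMatrixAdaptor {
public:
  explicit LoadMatrixAdaptor(std::vector<Value> operands)
      : operands(std::move(operands)) {}

  // Assumed layout: operand 0 is the source memref.
  Value srcMemref() const { return operands.at(0); }

  // Assumed layout: every operand after the memref is an index.
  std::vector<Value> indices() const {
    if (operands.empty())
      return {};
    return std::vector<Value>(operands.begin() + 1, operands.end());
  }

private:
  std::vector<Value> operands;
};

// Usage sketch: the call sites read as names rather than positions.
inline void checkAdaptor() {
  LoadMatrixAdaptor adaptor({7, 1, 2});
  assert(adaptor.srcMemref() == 7);
  assert(adaptor.indices().size() == 2);
}
```

If the operand order of the op ever changes, only the accessor bodies need updating, whereas positional indexing would have to be fixed at every use.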
mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir
15–36	I still see a lot of CHECK-NEXT, and I am not convinced that we should care about these operations being on subsequent strings. If tomorrow we decide to change the syntax of `llvm.extractvalue` to print the type a separate line, these tests would break for no good reason.

This revision now requires changes to proceed. May 6 2021, 8:15 AM

Changes in this diff :-

1.)  Address comments on previous diff(343284).

Herald added a subscriber: mgorny. May 14 2021, 12:13 AM

Harbormaster completed remote builds in B104439: Diff 345358. May 14 2021, 12:48 AM

Changes in this diff :-

1.) Fix formatting in WmmaOpsToNvvmLowering.cpp.

Harbormaster completed remote builds in B104452: Diff 345382. May 14 2021, 3:34 AM

bondhugula accepted this revision. May 16 2021, 12:00 AM

This is okay with me except the splitting between files. I really don't understand the motivation behind adding header files to lib/. It is justified if there are some declarations or template private implementations that must be shared between several .cpp files and not visible to the external users, but here it is not the case.

This can be organized as follows:

the file mlir/include/mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h contains a populateGPUWMMAPAtterns(...) next to the populateGPUToNVVM that it already has;
there's one file, mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNVVM.cpp (also, drop the "lowering" from the name while we are here), that contains everything from CommonTypes.h, and the current header/source pair, all in an anonymous namespace to avoid exporting the names and slowing down the linker;
whoever needs these patterns just includes the header and calls the populate function.

This is simple to navigate, reason about and is the pattern that all other conversions adopt.

mlir/lib/Conversion/GPUToNVVM/CommonTypes.h
1	Looks like this comment wasn't addressed..
mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp
144	Doesn't look addressed. C++-style comments are line comments starting with `//`.
mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvmLowering.cpp
31 ↗	(On Diff #345382)	Nit: add braces here, this isn't a trivial condition anymore.
50 ↗	(On Diff #345382)	I would have just used `unsigned` here.
140–142 ↗	(On Diff #345382)	Nit: can you put this comment inside `return rewriter.notifyMatchFailure("")` instead? Here and below.

This revision now requires changes to proceed. May 20 2021, 1:53 AM

In D95331#2770496, @ftynse wrote:

This is okay with me except the splitting between files. I really don't understand the motivation behind adding header files to lib/. It is justified if there are some declarations or template private implementations that must be shared between several .cpp files and not visible to the external users, but here it is not the case.

This can be organized as follows:

the file mlir/include/mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h contains a populateGPUWMMAPAtterns(...) next to the populateGPUToNVVM that it already has;

there's one file, mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNVVM.cpp (also, drop the "lowering" from the name while we are here), that contains everything from CommonTypes.h, and the current header/source pair, all in an anonymous namespace to avoid exporting the names and slowing down the linker;

whoever needs these patterns just includes the header and calls the populate function.

This is simple to navigate, reason about and is the pattern that all other conversions adopt.

Hi @ftynse,
Thanks for the insightful comments. I have a question regarding this new structure. Once I put all the code in WmmaOpsToNVVM.cpp the lowering patterns are no longer exposed. So I cannot directly use them in LowerGpuOpsToNVVMOps.cpp where they are actually added into RewritePatternSet. The way I could find is to include WmmaOpsToNVVm.cpp in LowerGpuOpsToNVVMOps.cpp. Is that okay to do in this context? Or can we just expose the lowerings in a header file?

mlir/lib/Conversion/GPUToNVVM/CommonTypes.h
16	The style given here is already followed: https://llvm.org/docs/CodingStandards.html#header-guard
mlir/lib/Conversion/GPUToNVVM/WmmaLoadOptoNvvmLowering.h
13	We may add more versions of NVVM ops in the future, so each lowering file may grow large. To save hassle later, I kept them in separate files. Let me know if you want me to merge them.
mlir/lib/Conversion/GPUToNVVM/WmmaLoadStoreToNvvmLowering.h
47 ↗	(On Diff #343284)	Took the first approach and removed all the spurious types present. Now the types are minimal and the overheads would be less. Let me know what you think.

In D95331#2773383, @navdeepkk wrote:

Hi @ftynse,
Thanks for the insightful comments. I have a question regarding this new structure. Once I put all the code in WmmaOpsToNVVM.cpp the lowering patterns are no longer exposed. So I cannot directly use them in LowerGpuOpsToNVVMOps.cpp where they are actually added into RewritePatternSet.

You shouldn't even need to use them directly. Just declare a function populateWmmaToNVVMPatterns(RewritePatternSet &set) in GPUToNVVMPass.h, put its declaration in the header and implementation in WmmaOpsToNVVM.cpp. Then call this function from LowerGpuOpsToNVVMOps.cpp. You understand how function declaration works in C++, right? It's enough for the caller to see the declaration (not definition) to be able to call the function.

The way I could find is to include WmmaOpsToNVVm.cpp in LowerGpuOpsToNVVMOps.cpp. Is that okay to do in this context?

Absolutely not! It's almost never okay to include a .cpp into another .cpp.

Or can we just expose the lowerings in a header file?

You don't need to expose the classes, the function that adds instances of these classes into the set will amply suffice, as I mentioned repeatedly. If you look around a bit in the code base, LowerGpuOpsToNVVMOps.cpp defines the pattern classes in an anonymous namespace and they are not at all visible in the headers. The users of this lowering just call populateGpuToNVVMConversionPatterns and never ever need to see the classes. I see no reason why you can't just do the same.
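The arrangement described in this reply can be sketched in a single standalone file, with comments marking which part would live where. `Pattern` and `PatternSet` are stand-ins for the MLIR pattern types, the three pattern class names are hypothetical, and `populateWmmaToNVVMPatterns` takes the name used in the reply (the signature in the committed code may differ).

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Stand-ins for mlir::RewritePattern and mlir::RewritePatternSet so that the
// sketch compiles without MLIR.
struct Pattern {
  virtual ~Pattern() = default;
  virtual std::string name() const = 0;
};
using PatternSet = std::vector<std::unique_ptr<Pattern>>;

// --- Would live in the public header (GPUToNVVMPass.h in the discussion). ---
// Declaration only: this is all a caller needs to see.
void populateWmmaToNVVMPatterns(PatternSet &patterns);

// --- Would live in the caller (LowerGpuOpsToNVVMOps.cpp). ---
// Compiles against the declaration; the pattern classes are not visible here.
PatternSet buildPatterns() {
  PatternSet patterns;
  populateWmmaToNVVMPatterns(patterns);
  return patterns;
}

// --- Would live in WmmaOpsToNVVM.cpp. ---
namespace {
// Hypothetical pattern classes, kept in an anonymous namespace so their names
// are not exported from the translation unit.
struct WmmaLoadLowering : Pattern {
  std::string name() const override { return "load"; }
};
struct WmmaStoreLowering : Pattern {
  std::string name() const override { return "store"; }
};
struct WmmaMmaLowering : Pattern {
  std::string name() const override { return "mma"; }
};
} // namespace

// Definition: the only symbol from this part that other files can link to.
void populateWmmaToNVVMPatterns(PatternSet &patterns) {
  patterns.push_back(std::make_unique<WmmaLoadLowering>());
  patterns.push_back(std::make_unique<WmmaStoreLowering>());
  patterns.push_back(std::make_unique<WmmaMmaLowering>());
}

// Usage sketch: the caller gets all three patterns without naming any class.
inline void checkPopulate() { assert(buildPatterns().size() == 3); }
```

Note that `buildPatterns` is written before the definition of `populateWmmaToNVVMPatterns` and before the pattern classes exist, which is the point being made: the declaration alone is enough for the call to compile.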

Changes in this diff :-

1.) Address comments on diff 345382.

Changes in this diff :-

1.) Fix spelling in file title.

Thanks!

This revision is now accepted and ready to land. May 21 2021, 6:02 AM

In D95331#2773568, @ftynse wrote:

Thanks!

Thanks for the long series of great insightful comments. And apologies for taking too long in the re-structuring part.

bondhugula accepted this revision. May 21 2021, 6:25 AM

Harbormaster completed remote builds in B105611: Diff 347002. May 21 2021, 6:59 AM

In D95331#2773585, @navdeepkk wrote:

In D95331#2773568, @ftynse wrote:

Thanks!

Thanks for the long series of great insightful comments. And apologies for taking too long in the re-structuring part.

There appears to be something weird with this patch - doing an arc patch D95331 yields an exception and then shows zero changes. Looks like an arc bug:

...
Applied patch mlir/include/mlir/Target/LLVMIR/ModuleTranslation.h cleanly.
Applied patch mlir/include/mlir/Dialect/LLVMIR/NVVMOps.td cleanly.
Applied patch mlir/include/mlir/Dialect/GPU/GPUOps.td cleanly.
Applied patch mlir/include/mlir/Dialect/GPU/GPUDialect.h cleanly.
Applied patch mlir/include/mlir/Dialect/GPU/GPUBase.td cleanly.
 COMMITTED  Successfully committed patch.

 Cherry Pick Failed!
 Exception 
Command failed with error #1!
COMMAND
git cherry-pick -- arcpatch-D95330

STDOUT
Auto-merging mlir/test/Dialect/LLVMIR/invalid.mlir
Auto-merging mlir/lib/Target/LLVMIR/ModuleTranslation.cpp
Auto-merging mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
Auto-merging mlir/include/mlir/Target/LLVMIR/ModuleTranslation.h
Auto-merging mlir/include/mlir/Dialect/GPU/GPUOps.td
On branch arcpatch-D95331_1
You are currently cherry-picking commit 372dcf47bd93.
  (all conflicts fixed: run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	mlir/0001-MLIR-Update-affine.for-unroll-for-iter_args-support.patch
	mlir/compile_commands.json
	mlir/tags
	mlir/tf.mlir

nothing added to commit but untracked files present (use "git add" to track)


STDERR
The previous cherry-pick is now empty, possibly due to conflict resolution.
If you wish to commit it anyway, use:

    git commit --allow-empty

Otherwise, please use 'git cherry-pick --skip'

(Run with `--trace` for a full exception trace.)

Using --trace shows:

Otherwise, please use 'git cherry-pick --skip'
 at [<arcanist>/src/future/exec/ExecFuture.php:421]
arcanist(head=master, ref.master=239ad5c55d8d)
  #0 ExecFuture::raiseResultError(array) called at [<arcanist>/src/future/exec/ExecFuture.php:325]
  #1 ExecFuture::resolvex() called at [<arcanist>/src/repository/api/ArcanistRepositoryAPI.php:399]
  #2 ArcanistRepositoryAPI::execxLocal(string, string) called at [<arcanist>/src/workflow/ArcanistPatchWorkflow.php:778]
  #3 ArcanistPatchWorkflow::run() called at [<arcanist>/src/workflow/ArcanistPatchWorkflow.php:983]
  #4 ArcanistPatchWorkflow::applyDependencies(ArcanistBundle) called at [<arcanist>/src/workflow/ArcanistPatchWorkflow.php:469]
  #5 ArcanistPatchWorkflow::run() called at [<arcanist>/src/workflow/ArcanistPatchWorkflow.php:398]
  #6 ArcanistPatchWorkflow::run() called at [<arcanist>/scripts/arcanist.php:419]
<<< [1] (+15,126) <exec> 15,126,163 us

Rebase on upstream/main.

This revision was landed with ongoing or failed builds. May 21 2021, 8:51 AM

Closed by commit rGeaaf7a6a09da: [MLIR][GPU][NVVM] Add conversion of warp synchronous matrix-multiply accumulate… (authored by navdeepkk, committed by bondhugula).

This revision was automatically updated to reflect the committed changes.

bondhugula added a commit: rGeaaf7a6a09da: [MLIR][GPU][NVVM] Add conversion of warp synchronous matrix-multiply accumulate….

Harbormaster completed remote builds in B105646: Diff 347046. May 21 2021, 9:19 AM

navdeepkk mentioned this in D95333: [MLIR][NVVM] Add test cases to check translation of matrix-multiply accumulate ops to the corresponding intrinsics in NVPTX backend. May 22 2021, 3:09 AM

Revision Contents

Path	Size
mlir/lib/Conversion/GPUToNVVM/CommonTypes.h	68 lines
mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp	17 lines
mlir/lib/Conversion/GPUToNVVM/WmmaLoadOptoNvvmLowering.h	151 lines
mlir/lib/Conversion/GPUToNVVM/WmmaMmaOptoNvvmLowering.h	100 lines
mlir/lib/Conversion/GPUToNVVM/WmmaStoreOptoNvvmLowering.h	137 lines
mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir	135 lines

Diff 321330

mlir/lib/Conversion/GPUToNVVM/CommonTypes.h

This file was added.

				//===----- CommonTypes.h - Contains LLVM Types common to all Lowerings. ---===//
				ftynse: Headers need to have `-- C++ --` at the end of the first line
				ftynse: Looks like this comment wasn't addressed..
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file contains the common LLVM types that are used by the lowerings of
				// GPU MMA Ops to NVVM ops.
				//
				//===----------------------------------------------------------------------===//

				#ifndef MLIR_LIB_CONVERSION_GPUTONVVM_COMMONTYPES_H
				Lint: Pre-merge checks: clang-tidy: warning: header guard does not follow preferred style [llvm-header-guard]
				#define MLIR_LIB_CONVERSION_GPUTONVVM_COMMONTYPES_H

				bondhugula: clang-tidy warning if this is a meaningful one.
				navdeepkk: The style given here is already followed: https://llvm.org/docs/CodingStandards.html#header-guard
				#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
				#include "mlir/IR/Builders.h"
				#include "llvm/IR/DerivedTypes.h"

				namespace mlir {

				/// Contains all the common LLVM types which are used across the lowerings of
				/// GPU subgroup ops to NVVM dialect.
				struct CommonLLVMTypes {
				public:
				bondhugula: Many of these are now built-in MLIR types. `common LLVM` -> `common LLVM and builtin MLIR`
				CommonLLVMTypes(MLIRContext *context) {
				numHalfsInOpFrags.resize(4);
				numHalfsInOpFrags[A] = 8;
				numHalfsInOpFrags[B] = 8;
				numHalfsInOpFrags[C] = 4;
				numHalfsInOpFrags[D] = 4;
				int8Type = IntegerType::get(context, 8);
				int64Type = IntegerType::get(context, 64);
				int32Type = IntegerType::get(context, 32);
				int32PtrTy = LLVM::LLVMPointerType::get(int32Type);
				f16Ty = FloatType::getF16(context);
				f16PtrTy = LLVM::LLVMPointerType::get(f16Ty);
				f16x8Ty = VectorType::get(8, f16Ty);
				f16x16Ty = VectorType::get(16, f16Ty);
				f16x2Ty = VectorType::get(2, f16Ty);
				fragArrayABTy = LLVM::LLVMStructType::getLiteral(
				context, SmallVector<Type>(8, f16x2Ty));
				fragArrayABPtrTy = LLVM::LLVMPointerType::get(fragArrayABTy);
				fragArrayCDTy = LLVM::LLVMStructType::getLiteral(
				context, SmallVector<Type>(4, f16x2Ty));
				fragArrayCDPtrTy = LLVM::LLVMPointerType::get(fragArrayCDTy);
				};

				Type int8Type;
				Type int32Type;
				Type int64Type;
				Type int32PtrTy;
				Type f16Ty;
				Type f16PtrTy;
				Type f16x2Ty;
				Type f16x8Ty;
				Type f16x16Ty;
				Type fragArrayABTy;
				Type fragArrayABPtrTy;
				Type fragArrayCDTy;
				Type fragArrayCDPtrTy;
				SmallVector<unsigned, 4> numHalfsInOpFrags;
				enum OperandMap { A, B, C, D };
				bondhugula: The context here isn't clear as to operands to which op. Mention about mma/wmma when referring to fragments?
				};

				} // namespace mlir
				#endif
				bondhugula: Doc comments for these two.

mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp

[20 lines not shown]
  #include "mlir/Transforms/DialectConversion.h"
  #include "mlir/Transforms/GreedyPatternRewriteDriver.h"
  #include "llvm/Support/FormatVariadic.h"

  #include "../GPUCommon/GPUOpsLowering.h"
  #include "../GPUCommon/IndexIntrinsicsOpLowering.h"
  #include "../GPUCommon/OpToFuncCallLowering.h"
  #include "../PassDetail.h"
+ #include "WmmaLoadOptoNvvmLowering.h"
+ #include "WmmaMmaOptoNvvmLowering.h"
+ #include "WmmaStoreOptoNvvmLowering.h"

  using namespace mlir;

  namespace {

  struct GPUShuffleOpLowering : public ConvertOpToLLVMPattern<gpu::ShuffleOp> {
  using ConvertOpToLLVMPattern<gpu::ShuffleOp>::ConvertOpToLLVMPattern;

[80 lines not shown; enclosing context: void runOnOperation() override {]
  /// converter drops the private memory space to support the use case above.
  LLVMTypeConverter converter(m.getContext(), options);
  converter.addConversion([&](MemRefType type) -> Optional<Type> {
  if (type.getMemorySpace() != gpu::GPUDialect::getPrivateAddressSpace())
  return llvm::None;
  return converter.convertType(MemRefType::Builder(type).setMemorySpace(0));
  });

+ converter.addConversion([&](gpu::MMAFragmentType type) -> Type {
+ VectorType vecTy = type.getElementType().cast<VectorType>();
+ unsigned vecSize = vecTy.getDimSize(0);
+ Type vec = VectorType::get(vecSize, FloatType::getF16(&getContext()));
+ unsigned size = type.getSize();
+ SmallVector<Type, 8> elements(size, vec);
+ auto structType =
+ LLVM::LLVMStructType::getLiteral(&getContext(), elements);
+ return structType;
+ });

  OwningRewritePatternList patterns, llvmPatterns;

  // Apply in-dialect lowering first. In-dialect lowering will replace ops
  // which need to be lowered further, which is not supported by a single
  // conversion pass.
  populateGpuRewritePatterns(m.getContext(), patterns);
		bondhugula: C++ style comment here. Terminate with full stop.
		ftynse: Doesn't look addressed. C++-style comments are line comments starting with `//`.
  applyPatternsAndFoldGreedily(m, std::move(patterns));

  populateStdToLLVMConversionPatterns(converter, llvmPatterns);
  populateGpuToNVVMConversionPatterns(converter, llvmPatterns);
  LLVMConversionTarget target(getContext());
  configureGpuToNVVMConversionLegality(target);
  if (failed(applyPartialConversion(m, target, std::move(llvmPatterns))))
  signalPassFailure();
[27 lines not shown]
  NVVM::BlockIdYOp, NVVM::BlockIdZOp>,
  GPUIndexIntrinsicOpLowering<gpu::GridDimOp, NVVM::GridDimXOp,
  NVVM::GridDimYOp, NVVM::GridDimZOp>,
  GPUShuffleOpLowering, GPUReturnOpLowering,
  // Explicitly drop memory space when lowering private memory
  // attributions since NVVM models it as `alloca`s in the default
  // memory space and does not support `alloca`s with addrspace(5).
  GPUFuncOpLowering<0>>(converter);
+ patterns.insert<WmmaLoadOptoNVVMLowering>(converter);
+ patterns.insert<WmmaMmaOptoNVVMLowering>(converter);
+ patterns.insert<WmmaStoreOptoNVVMLowering>(converter);
  patterns.insert<OpToFuncCallLowering<AbsFOp>>(converter, "__nv_fabsf",
  "__nv_fabs");
  patterns.insert<OpToFuncCallLowering<AtanOp>>(converter, "__nv_atanf",
  "__nv_atan");
  patterns.insert<OpToFuncCallLowering<Atan2Op>>(converter, "__nv_atan2f",
  "__nv_atan2");
  patterns.insert<OpToFuncCallLowering<CeilFOp>>(converter, "__nv_ceilf",
  "__nv_ceil");
[30 lines not shown]

mlir/lib/Conversion/GPUToNVVM/WmmaLoadOptoNvvmLowering.h

This file was added.

				//===-- WmmaLoadOptoNVVMLowering.h - GPU MMA loadOp to NVVM Op lowering ---===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file contains patterns to lower the GPU subgroup MMA_loadOp to NVVM
				// Dialect.
				//
				//===----------------------------------------------------------------------===//

				ftynse: I don't see the point of putting _implementations_ of conversion patterns into separate _header_ files.
				navdeepkk: We may add more versions of NVVM ops in the future, so each lowering file may grow large. To save hassle later, I kept them in separate files. Let me know if you want me to merge them.
				#ifndef MLIR_LIB_CONVERSION_GPUTONVVM_WMMALOADOPTONVVMLOWERING_H
				Lint: Pre-merge checks: clang-tidy: warning: header guard does not follow preferred style [llvm-header-guard]
				#define MLIR_LIB_CONVERSION_GPUTONVVM_WMMALOADOPTONVVMLOWERING_H

				#include "CommonTypes.h"
				#include "mlir/Conversion/StandardToLLVM/ConvertStandardToLLVM.h"
				#include "mlir/Dialect/GPU/GPUDialect.h"
				#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
				#include "mlir/Dialect/LLVMIR/NVVMDialect.h"

				namespace mlir {
				/// This class implemtents the conversion of GPU MMA loadOp to wmma.load op
				/// in the NVVM dialect. The conversion not only emits the NVVM op but also
				/// emits code that is necessary to store the data in the destination memref
				/// after it has been loaded.
				struct WmmaLoadOptoNVVMLowering
				: public ConvertOpToLLVMPattern<gpu::SubgroupMmaLoadMatrixOp> {
				public:
				MLIRContext *context = &this->getTypeConverter()->getContext();

				explicit WmmaLoadOptoNVVMLowering(LLVMTypeConverter &typeConverter)
				: ConvertOpToLLVMPattern<gpu::SubgroupMmaLoadMatrixOp>(typeConverter),
				llvmTypes(context) {}

				static LogicalResult areAllLLVMTypes(Operation *op, ValueRange operands,
				ConversionPatternRewriter &rewriter) {
				if (!llvm::all_of(operands, [](Value value) {
				return LLVM::isCompatibleType(value.getType());
				}))
				return rewriter.notifyMatchFailure(
				op, "Cannot convert if operands aren't of LLVM type.");

				return success();
				}

				LogicalResult
				matchAndRewrite(gpu::SubgroupMmaLoadMatrixOp subgroupMmaLoadMatrixOp,
				ArrayRef<Value> operands,
				ConversionPatternRewriter &rewriter) const override {
				Operation *op = subgroupMmaLoadMatrixOp.getOperation();
				if (failed(areAllLLVMTypes(op, operands, rewriter)))
				return failure();

				int8_t indexTypeBitwidth = this->getTypeConverter()->getIndexTypeBitwidth();

				// The corresponding intrinsics expects leadDimension to be a 32-bit
				// integer, so all the calculations of linearizing the load address
				// must also follow this restriction.
				if (indexTypeBitwidth != 32)
				return rewriter.notifyMatchFailure(
				op, "Expected indices to the meref to be 32-bit wide.");

				// Source memref of the original op.
				MemRefType srcMemrefType =
				subgroupMmaLoadMatrixOp.srcMemref().getType().cast<MemRefType>();
				Location loc = op->getLoc();

				auto beginInx = subgroupMmaLoadMatrixOp.indices().getBeginOperandIndex();
				auto leadDimension = subgroupMmaLoadMatrixOp.leadDimensionAttr();
				auto operand = subgroupMmaLoadMatrixOp.operandAttr();

				// Emit information for the memref operands.
				auto promotedSrcOp = this->getTypeConverter()->promoteOperands(
				loc, op->getOperand(0), operands[0], rewriter);

				// Emit ops which compute the load offset using `srcOffsetI`,
				// `srcOffsetJ`. The load address is (alignedPtr + (memrefOffset +
				// ((leadDimension * srcOffsetI) + srcOffsetJ))). The memrefs here are
				// assumed to be normalized and hence the simple conversion works.
				Value srcOffsetIVal = subgroupMmaLoadMatrixOp->getOpOperand(beginInx).get();
				Value srcOffsetJVal =
				subgroupMmaLoadMatrixOp->getOpOperand(beginInx + 1).get();
				Value leadingDim32 = rewriter.create<LLVM::ConstantOp>(
				loc, llvmTypes.int32Type, leadDimension);
				Value numElemsLeadDim = rewriter.create<LLVM::MulOp>(
				loc, llvmTypes.int32Type, leadingDim32, srcOffsetIVal);
				Value loadOffset = rewriter.create<LLVM::AddOp>(
				loc, llvmTypes.int32Type, numElemsLeadDim, srcOffsetJVal);
				// Add the offset stored in the memref descriptor so that the final index
				// is relative to the aligned pointer.
				Value promotedSrcOpToUse;

				promotedSrcOpToUse = promotedSrcOp[2];
				Value actualOffset = rewriter.create<LLVM::AddOp>(
				loc, llvmTypes.int32Type, loadOffset, promotedSrcOpToUse);
				Value loadAddress = rewriter.create<LLVM::GEPOp>(
				loc,
				LLVM::LLVMPointerType::get(llvmTypes.f16Ty,
				srcMemrefType.getMemorySpace()),
				promotedSrcOp[1], ArrayRef<Value>{actualOffset});

				// Bitcast the pointer from half to i32 so that it matches the semantics
				// of the intrinsic exposed by the NVPTX backend.
				Value loadAddressCasted = rewriter.create<LLVM::BitcastOp>(
				loc,
				LLVM::LLVMPointerType::get(llvmTypes.int32Type,
				srcMemrefType.getMemorySpace()),
				loadAddress);

				Type resType;
				unsigned numElemsInResFrag;
				StringRef operandStr = operand.cast<mlir::StringAttr>().getValue();

				if (operandStr.equals("AOp") || operandStr.equals("BOp")) {
				resType = llvmTypes.fragArrayABTy;
				numElemsInResFrag = llvmTypes.numHalfsInOpFrags[llvmTypes.A];
				} else {
				resType = llvmTypes.fragArrayCDTy;
				numElemsInResFrag = llvmTypes.numHalfsInOpFrags[llvmTypes.C];
				}

				ValueRange loadOpOperands({loadAddressCasted, leadingDim32});

				// Create nvvm.mma_load op according to the operand.
				if (operandStr.equals("AOp")) {
				NVVM::WMMALoadAOp wmmaLoadAOp =
				rewriter.create<NVVM::WMMALoadAOp>(loc, resType, loadOpOperands);
				rewriter.replaceOp(op, wmmaLoadAOp.getResult());
				} else if (operandStr.equals("BOp")) {
				NVVM::WMMALoadBOp wmmaLoadBOp =
				rewriter.create<NVVM::WMMALoadBOp>(loc, resType, loadOpOperands);
				rewriter.replaceOp(op, wmmaLoadBOp.getResult());
				} else {
				NVVM::WMMALoadCOp wmmaLoadCOp =
				rewriter.create<NVVM::WMMALoadCOp>(loc, resType, loadOpOperands);
				rewriter.replaceOp(op, wmmaLoadCOp.getResult());
				}

				return success();
				}

				private:
				/// Contains definitions of all the LLVM types which are used for lowering
				/// this GPU subgroupMmaLoadMatrixOp.
				CommonLLVMTypes llvmTypes;
				};
				} // namespace mlir

				#endif
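
The index arithmetic that the load pattern emits as LLVM dialect ops can be sketched in plain C++. This is only an illustration under stated assumptions: `linearizedElementOffset` is a hypothetical name that does not appear in the patch, and the sketch assumes a normalized row-major memref, as the comment in the pattern does.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the patch) mirroring the 32-bit index
// arithmetic of the pattern: (leadDimension * i + j) + memrefOffset.
// The result is an element index in units of f16, not a byte offset; the
// pattern feeds it to a GEP on the aligned f16 pointer.
std::int32_t linearizedElementOffset(std::int32_t memrefOffset,
                                     std::int32_t leadDimension,
                                     std::int32_t i, std::int32_t j) {
  std::int32_t rowStart = leadDimension * i; // corresponds to llvm.mul
  std::int32_t inTile = rowStart + j;        // corresponds to the first llvm.add
  return inTile + memrefOffset;              // corresponds to the second llvm.add
}
```

With the values used in the accompanying test (leadDimension = 32, i = j = 16, memref offset 0), the element index is 32 * 16 + 16 = 528.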

mlir/lib/Conversion/GPUToNVVM/WmmaMmaOptoNvvmLowering.h

This file was added.

				//===--- WmmaMmaOptoNVVMLowering.h - GPU MMA mmaOp to NVVM Op lowering ----===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file contains patterns to lower the GPU subgroup MMA_compute op to the
				// NVVM Dialect.
				//
				//===----------------------------------------------------------------------===//

				#ifndef MLIR_LIB_CONVERSION_GPUTONVVM_WMMASTOREOPTONVVMLOWERING_H
				Lint: Pre-merge checks: clang-tidy: warning: header guard does not follow preferred style [llvm-header-guard]
				#define MLIR_LIB_CONVERSION_GPUTONVVM_WMMASTOREOPTONVVMLOWERING_H

				#include "CommonTypes.h"
				#include "mlir/Conversion/StandardToLLVM/ConvertStandardToLLVM.h"
				#include "mlir/Dialect/GPU/GPUDialect.h"
				#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
				#include "mlir/Dialect/LLVMIR/NVVMDialect.h"

				namespace mlir {
				/// This class implements the conversion of GPU MMA computeOp to wmma.mma op
				/// in the NVVM dialect. The conversion not only emits the NVVM op but also
				/// emits code that is necessary to unpack the data from source memrefs to give
				/// them to the NVVM OP and then again pack the results to store them into the
				/// destination memref.
				struct WmmaMmaOptoNVVMLowering
				: public ConvertOpToLLVMPattern<gpu::SubgroupMmaComputeOp> {
				public:
				MLIRContext *context = &this->getTypeConverter()->getContext();

				explicit WmmaMmaOptoNVVMLowering(LLVMTypeConverter &typeConverter)
				: ConvertOpToLLVMPattern<gpu::SubgroupMmaComputeOp>(typeConverter),
				llvmTypes(context) {}

				static LogicalResult areAllLLVMTypes(Operation *op, ValueRange operands,
				ConversionPatternRewriter &rewriter) {
				if (!llvm::all_of(operands, [](Value value) {
				return LLVM::isCompatibleType(value.getType());
				}))
				return rewriter.notifyMatchFailure(
				op, "Cannot convert if operands aren't of LLVM type.");

				ftynse: A similar function is present in another file this patch is adding. Consider refactoring to only have one definition.
				return success();
				}

				LogicalResult
				matchAndRewrite(gpu::SubgroupMmaComputeOp subgroupMmaComputeOp,
				ArrayRef<Value> operands,
				ConversionPatternRewriter &rewriter) const override {
				Operation *op = subgroupMmaComputeOp.getOperation();
				if (failed(areAllLLVMTypes(op, operands, rewriter)))
				return failure();

				Location loc = op->getLoc();

				SmallVector<mlir::VectorType, 4> opTypes;
				SmallVector<Type, 4> elemTypes;
				SmallVector<Type, 4> llvmElemTypes;
				ftynse: SmallVector now has a default number of stack elements. Drop the manually specified number unless there is a specific reason to choose the value. Here, I don't see why 24 is special.
				SmallVector<Value, 4> opIndices;

				// The wmma.mma intrinsic in llvm requires the operands as individual
				// values. So individual elements from the memrefs need to be extracted and
				// then passed on to the intrinsic call. Emit llvm ops to extract individual
				// values from lowered memrefs.
				SmallVector<Value, 24> unpackedOps;

				auto unpackOp = [&](CommonLLVMTypes::OperandMap op, Value operand) {
				for (unsigned i = 0, e = llvmTypes.numHalfsInOpFrags[op]; i < e; ++i) {
				Value toUse = rewriter.create<LLVM::ExtractValueOp>(
				loc, llvmTypes.f16x2Ty, operand, rewriter.getI64ArrayAttr(i));
				unpackedOps.push_back(toUse);
				}
				};

				unpackOp(llvmTypes.A, operands[0]);
				unpackOp(llvmTypes.B, operands[1]);
				unpackOp(llvmTypes.C, operands[2]);

				// Operand holder for wmma.mma.op.
				ValueRange wmmaMmaOpOperands(unpackedOps);

				// Create nvvm.wmma.mma op.
				NVVM::WMMAMmaOp wmmaMmaOp = rewriter.create<NVVM::WMMAMmaOp>(
				loc, llvmTypes.fragArrayCDTy, wmmaMmaOpOperands);

				ftynse: Please use OpAdaptors here and below to avoid `operands[magic-constant]`
				rewriter.replaceOp(op, wmmaMmaOp.getResult());
				return success();
				}

				private:
				/// Contains definitions of all the LLVM types which are used for lowering
				/// this GPU subgroupMmaComputeOp.
				CommonLLVMTypes llvmTypes;
				};
				} // namespace mlir

				#endif
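
The operand flattening done by the compute pattern can be modelled without MLIR. The sketch below is a simplification, not the patch's code: `FakeValue`, `unpackFragment` and `buildMmaOperands` are hypothetical names, strings stand in for SSA values, and the element counts 8, 8 and 4 are taken from the fragment types in the accompanying test.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an SSA value; the real pattern uses mlir::Value.
using FakeValue = std::string;

// Appends one placeholder per fragment element, in element order, the way
// the pattern emits one llvm.extractvalue per struct member.
void unpackFragment(const std::string &name, std::size_t numElems,
                    std::vector<FakeValue> &out) {
  for (std::size_t i = 0; i < numElems; ++i)
    out.push_back(name + std::to_string(i));
}

// Builds the flat operand list in A, B, C order. The counts are assumptions
// read off the test's fragment types, not values taken from CommonTypes.h.
std::vector<FakeValue> buildMmaOperands() {
  std::vector<FakeValue> operands;
  unpackFragment("A", 8, operands);
  unpackFragment("B", 8, operands);
  unpackFragment("C", 4, operands);
  return operands;
}
```

Under these counts the list has 8 + 8 + 4 = 20 entries, which matches the number of operands in the `nvvm.wmma.m16n16k16.mma.row.row.f16.f16` CHECK line of the test.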

mlir/lib/Conversion/GPUToNVVM/WmmaStoreOptoNvvmLowering.h

This file was added.

				//==-- WmmaStoreOptoNVVMLowering.h - GPU MMA storeOp to NVVM Op lowering --==//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				// This file contains patterns to lower the GPU subgroup MMA_store op to the
				// NVVM Dialect.
				//
				//===----------------------------------------------------------------------===//

				#ifndef MLIR_LIB_CONVERSION_GPUTONVVM_WMMAMMAOPTONVVMLOWERING_H
				Lint: Pre-merge checks: clang-tidy: warning: header guard does not follow preferred style [llvm-header-guard]
				#define MLIR_LIB_CONVERSION_GPUTONVVM_WMMAMMAOPTONVVMLOWERING_H

				#include "CommonTypes.h"
				#include "mlir/Conversion/StandardToLLVM/ConvertStandardToLLVM.h"
				#include "mlir/Dialect/GPU/GPUDialect.h"
				#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
				#include "mlir/Dialect/LLVMIR/NVVMDialect.h"

				namespace mlir {
				/// This class implements the conversion of GPU MMA storeOp to wmma.store op
				/// in the NVVM dialect. The conversion not only emits the NVVM op but also
				/// emits code that is necessary to unpack the data in the source memref and
				/// convert the data in the format that is needed by the NVVM op.
				struct WmmaStoreOptoNVVMLowering
				: public ConvertOpToLLVMPattern<gpu::SubgroupMmaStoreMatrixOp> {
				public:
				MLIRContext *context = &this->getTypeConverter()->getContext();

				explicit WmmaStoreOptoNVVMLowering(LLVMTypeConverter &typeConverter)
				: ConvertOpToLLVMPattern<gpu::SubgroupMmaStoreMatrixOp>(typeConverter),
				llvmTypes(context) {}

				static LogicalResult areAllLLVMTypes(Operation *op, ValueRange operands,
				ConversionPatternRewriter &rewriter) {
				if (!llvm::all_of(operands, [](Value value) {
				return LLVM::isCompatibleType(value.getType());
				}))
				return rewriter.notifyMatchFailure(
				op, "Cannot convert if operands aren't of LLVM type.");

				return success();
				}

				LogicalResult
				matchAndRewrite(gpu::SubgroupMmaStoreMatrixOp subgroupMmaStoreMatrixOp,
				ArrayRef<Value> operands,
				ConversionPatternRewriter &rewriter) const override {
				Operation *op = subgroupMmaStoreMatrixOp.getOperation();
				if (failed(areAllLLVMTypes(op, operands, rewriter)))
				return failure();

				int8_t indexTypeBitwidth = this->getTypeConverter()->getIndexTypeBitwidth();
				// The corresponding intrinsic expects leadDimension to be a 32-bit
				// integer, so all the calculations of linearizing the store address
				// must also follow this restriction.
				if (indexTypeBitwidth != 32)
				return rewriter.notifyMatchFailure(
				op, "Expected indices to the memref to be 32-bit wide.");

				Location loc = op->getLoc();

				// Destination memref of the original op.
				MemRefType dstMemrefType =
				subgroupMmaStoreMatrixOp.dstMemref().getType().cast<MemRefType>();

				auto promotedDstOp = this->getTypeConverter()->promoteOperands(
				loc, op->getOperand(1), operands[1], rewriter);

				auto leadDimension = subgroupMmaStoreMatrixOp.leadDimensionAttr();
				unsigned beginInx =
				subgroupMmaStoreMatrixOp.indices().getBeginOperandIndex();

				// Emit ops which compute the store offset using `dstOffsetI`,
				// `dstOffsetJ`. The store address is (alignedPtr + (memrefOffset +
				// ((leadDimension * dstOffsetI) + dstOffsetJ))).
				Value dstOffsetIVal = subgroupMmaStoreMatrixOp.getOperand(beginInx);
				Value dstOffsetJVal = subgroupMmaStoreMatrixOp.getOperand(beginInx + 1);
				Value leadingDim32 = rewriter.create<LLVM::ConstantOp>(
				loc, llvmTypes.int32Type, leadDimension);
				Value numElemsLeadDim = rewriter.create<LLVM::MulOp>(
				loc, llvmTypes.int32Type, leadingDim32, dstOffsetIVal);
				Value loadOffset = rewriter.create<LLVM::AddOp>(
				loc, llvmTypes.int32Type, numElemsLeadDim, dstOffsetJVal);
				// Add the offset stored in the memref descriptor so that the final index
				// is relative to the aligned pointer.
				Value promotedDstOpToUse;

				promotedDstOpToUse = promotedDstOp[2];
				Value actualOffset = rewriter.create<LLVM::AddOp>(
				loc, llvmTypes.int32Type, loadOffset, promotedDstOpToUse);
				Value storeAddress = rewriter.create<LLVM::GEPOp>(
				loc,
				LLVM::LLVMPointerType::get(llvmTypes.f16Ty,
				dstMemrefType.getMemorySpace()),
				promotedDstOp[1], ArrayRef<Value>{actualOffset});

				// Bitcast the base address pointer of the destination memref, so that
				// values can be stored in chunks of 32 bits.
				Value storeAddressCasted = rewriter.create<LLVM::BitcastOp>(
				loc,
				LLVM::LLVMPointerType::get(llvmTypes.int32Type,
				dstMemrefType.getMemorySpace()),
				storeAddress);

				SmallVector<Value, 4> storeOpOperands;
				storeOpOperands.push_back(storeAddressCasted);

				// Unpack the results from the source memref.
				for (unsigned i = 0, e = llvmTypes.numHalfsInOpFrags[llvmTypes.D]; i < e;
				++i) {
				Value toUse = rewriter.create<LLVM::ExtractValueOp>(
				loc, llvmTypes.f16x2Ty, operands[0], rewriter.getI64ArrayAttr(i));
				storeOpOperands.push_back(toUse);
				}

				storeOpOperands.push_back(leadingDim32);

				// Create nvvm.mma_store op.
				ValueRange unpackedValueRange(storeOpOperands);
				rewriter.create<NVVM::WMMAStoreOp>(loc, storeOpOperands);

				rewriter.eraseOp(op);
				return success();
				}

				private:
				/// Contains definitions of all the LLVM types which are used for lowering
				/// this GPU SubgroupMmaStoreMatrixOp.
				CommonLLVMTypes llvmTypes;
				};
				} // namespace mlir

				#endif
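
The order in which the store pattern assembles its operands can be shown with a small stand-alone model. As a hedged sketch only: `buildStoreOperands` is a hypothetical name, and strings stand in for the SSA values that the real pattern collects in `storeOpOperands`.

```cpp
#include <string>
#include <vector>

// Hypothetical model of the operand list: the casted address comes first,
// then the unpacked elements of the D fragment, then the leading dimension.
std::vector<std::string>
buildStoreOperands(const std::string &address,
                   const std::vector<std::string> &fragmentElems,
                   const std::string &leadDimension) {
  std::vector<std::string> operands;
  operands.push_back(address);
  operands.insert(operands.end(), fragmentElems.begin(), fragmentElems.end());
  operands.push_back(leadDimension);
  return operands;
}
```

For a four-element D fragment this yields six operands, the same shape as the `nvvm.wmma.m16n16k16.store.d.f16.row.stride` CHECK line in the test: address, four `vector<2xf16>` values, leading dimension.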

mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir

This file was added.

				// RUN: mlir-opt --convert-gpu-to-nvvm="index-bitwidth=32" --split-input-file %s | FileCheck %s

				gpu.module @test_module {

				// CHECK-LABEL: func @gpu_wmma_load_op() ->
				// CHECK-SAME: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)> {
				func @gpu_wmma_load_op() -> (!gpu.mmafragment<8, vector<2xf16>>) {
				%wg = alloca() {alignment = 32} : memref<32x32xf16, 3>
				%A = alloca() : memref<1xvector<16xf16>, 5>
				%i = constant 16 : index
				%j = constant 16 : index
				%bias = constant dense<42.> : vector<2xf16>
				%0 = gpu.subgroup_mma_load_matrix %wg[%i, %j] {operand = "AOp", leadDimension = 32 : i32} : memref<32x32xf16, 3> -> !gpu.mmafragment<8, vector<2xf16>>
				//CHECK: %[[INX:.*]] = llvm.mlir.constant(16 : index) : i32
				//CHECK-NEXT: %{{.*}} = llvm.mlir.constant(32 : index) : i32
				//CHECK-NEXT: %{{.*}} = llvm.mlir.constant(32 : index) : i32
				//CHECK-NEXT: %{{.*}} = llvm.mlir.constant(1 : index) : i32
				//CHECK-NEXT: %{{.*}} = llvm.mlir.constant(1024 : index) : i32
				//CHECK-NEXT: %{{.*}} = llvm.mlir.null : !llvm.ptr<f16, 3>
				//CHECK-NEXT: %{{.*}} = llvm.getelementptr %{{.*}}[%{{.*}}] : (!llvm.ptr<f16, 3>, i32) -> !llvm.ptr<f16, 3>
				//CHECK-NEXT: %{{.*}} = llvm.ptrtoint %{{.*}} : !llvm.ptr<f16, 3> to i32
				//CHECK-NEXT: %{{.*}} = llvm.alloca %{{.*}} x f16 {alignment = 32 : i64} : (i32) -> !llvm.ptr<f16, 3>
				//CHECK-NEXT: %{{.*}} = llvm.mlir.undef : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %{{.*}} = llvm.insertvalue %{{.*}}, %{{.*}}[0] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %{{.*}} = llvm.insertvalue %{{.*}}, %{{.*}}[1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %{{.*}} = llvm.mlir.constant(0 : index) : i32
				//CHECK-NEXT: %{{.*}} = llvm.insertvalue %{{.*}}, %{{.*}}[2] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %{{.*}} = llvm.insertvalue %{{.*}}, %{{.*}}[3, 0] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %{{.*}} = llvm.insertvalue %{{.*}}, %{{.*}}[3, 1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %{{.*}} = llvm.insertvalue %{{.*}}, %{{.*}}[4, 0] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %{{.*}} = llvm.insertvalue %{{.*}}, %{{.*}}[4, 1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %{{.*}} = llvm.extractvalue %{{.*}}[0] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %[[BASE:.*]] = llvm.extractvalue %{{.*}}[1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %[[OFFSET:.*]] = llvm.extractvalue %{{.*}}[2] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %{{.*}} = llvm.extractvalue %{{.*}}[3, 0] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %{{.*}} = llvm.extractvalue %{{.*}}[3, 1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				ftynse: I don't think we care if these operations are on the next line or not. Furthermore, they seem to be testing the conversion of allocations, which isn't anyhow relevant to what this test is about. Such tests lead to excessive churn and spurious breakages. Please only test what the new code does.
				ftynse: I still see a lot of CHECK-NEXT, and I am not convinced that we should care about these operations being on subsequent strings. If tomorrow we decide to change the syntax of `llvm.extractvalue` to print the type a separate line, these tests would break for no good reason.
				//CHECK-NEXT: %{{.*}} = llvm.extractvalue %{{.*}}[4, 0] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %{{.*}} = llvm.extractvalue %{{.*}}[4, 1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %[[LDM:.*]] = llvm.mlir.constant(32 : i32) : i32
				//CHECK-NEXT: %[[LI:.*]] = llvm.mul %[[LDM]], %[[INX]] : i32
				//CHECK-NEXT: %[[LIJ:.*]] = llvm.add %[[LI]], %[[INX]] : i32
				//CHECK-NEXT: %[[LIJO:.*]] = llvm.add %[[LIJ]], %[[OFFSET]] : i32
				//CHECK-NEXT: %[[ADDRESS:.*]] = llvm.getelementptr %[[BASE]][%[[LIJO]]] : (!llvm.ptr<f16, 3>, i32) -> !llvm.ptr<f16, 3>
				//CHECK-NEXT: %[[CADDRESS:.*]] = llvm.bitcast %[[ADDRESS]] : !llvm.ptr<f16, 3> to !llvm.ptr<i32, 3>
				//CHECK-NEXT: %[[FRAG:.*]] = nvvm.wmma.m16n16k16.load.a.f16.row.stride %[[CADDRESS]], %[[LDM]] : (!llvm.ptr<i32, 3>, i32) -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: llvm.return %[[FRAG]] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				return %0 : !gpu.mmafragment<8, vector<2xf16>>
				}
				}

				// -----

				gpu.module @test_module {

				// CHECK-LABEL: func @gpu_wmma_store_op
				// CHECK-SAME: (%[[D:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>) {
				func @gpu_wmma_store_op(%arg0 : !gpu.mmafragment<4, vector<2xf16>>) -> () {
				bondhugula: Nit: Leave a space after `//`, i.e., //CHECK -> // CHECK
				%sg = alloca(){alignment = 32} : memref<32x32xf16, 3>
				%i = constant 16 : index
				%j = constant 16 : index
				gpu.subgroup_mma_store_matrix %arg0, %sg[%i,%j] {leadDimension= 32 : i32} : !gpu.mmafragment<4, vector<2xf16>>, memref<32x32xf16, 3>
				//CHECK: %[[INX:.*]] = llvm.mlir.constant(16 : index) : i32
				//CHECK-NEXT: %1 = llvm.mlir.constant(32 : index) : i32
				//CHECK-NEXT: %2 = llvm.mlir.constant(32 : index) : i32
				//CHECK-NEXT: %3 = llvm.mlir.constant(1 : index) : i32
				//CHECK-NEXT: %4 = llvm.mlir.constant(1024 : index) : i32
				//CHECK-NEXT: %5 = llvm.mlir.null : !llvm.ptr<f16, 3>
				//CHECK-NEXT: %6 = llvm.getelementptr %5[%4] : (!llvm.ptr<f16, 3>, i32) -> !llvm.ptr<f16, 3>
				//CHECK-NEXT: %7 = llvm.ptrtoint %6 : !llvm.ptr<f16, 3> to i32
				//CHECK-NEXT: %8 = llvm.alloca %7 x f16 {alignment = 32 : i64} : (i32) -> !llvm.ptr<f16, 3>
				//CHECK-NEXT: %9 = llvm.mlir.undef : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %10 = llvm.insertvalue %8, %9[0] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %11 = llvm.insertvalue %8, %10[1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %12 = llvm.mlir.constant(0 : index) : i32
				//CHECK-NEXT: %13 = llvm.insertvalue %12, %11[2] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %14 = llvm.insertvalue %1, %13[3, 0] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %15 = llvm.insertvalue %2, %14[3, 1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %16 = llvm.insertvalue %2, %15[4, 0] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %17 = llvm.insertvalue %3, %16[4, 1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %18 = llvm.extractvalue %17[0] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %[[BASE:.*]] = llvm.extractvalue %17[1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %[[OFFSET:.*]] = llvm.extractvalue %17[2] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %21 = llvm.extractvalue %17[3, 0] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %22 = llvm.extractvalue %17[3, 1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %23 = llvm.extractvalue %17[4, 0] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %24 = llvm.extractvalue %17[4, 1] : !llvm.struct<(ptr<f16, 3>, ptr<f16, 3>, i32, array<2 x i32>, array<2 x i32>)>
				//CHECK-NEXT: %[[LDM:.*]] = llvm.mlir.constant(32 : i32) : i32
				//CHECK-NEXT: %[[LI:.*]] = llvm.mul %[[LDM]], %[[INX]] : i32
				//CHECK-NEXT: %[[LIJ:.*]] = llvm.add %[[LI]], %[[INX]] : i32
				//CHECK-NEXT: %[[LIJO:.*]] = llvm.add %[[LIJ]], %[[OFFSET]] : i32
				//CHECK-NEXT: %[[ADDRESS:.*]] = llvm.getelementptr %[[BASE]][%[[LIJO]]] : (!llvm.ptr<f16, 3>, i32) -> !llvm.ptr<f16, 3>
				//CHECK-NEXT: %[[CADDRESS:.*]] = llvm.bitcast %[[ADDRESS]] : !llvm.ptr<f16, 3> to !llvm.ptr<i32, 3>
				//CHECK-NEXT: %[[EL1:.*]] = llvm.extractvalue %[[D]][0] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[EL2:.*]] = llvm.extractvalue %[[D]][1] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[EL3:.*]] = llvm.extractvalue %[[D]][2] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[EL4:.*]] = llvm.extractvalue %[[D]][3] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: nvvm.wmma.m16n16k16.store.d.f16.row.stride %[[CADDRESS]], %[[EL1]], %[[EL2]], %[[EL3]], %[[EL4]], %[[LDM]] : !llvm.ptr<i32, 3>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, i32
				//CHECK-NEXT: llvm.return
				return
				}
				}

				// -----

				gpu.module @test_module {

				// CHECK-LABEL: func @gpu_wmma_mma_op
				// CHECK-SAME: (%[[A:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>, %[[B:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>, %[[C:.*]]: !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>) {
				func @gpu_wmma_mma_op(%A : !gpu.mmafragment<8, vector<2xf16>>, %B : !gpu.mmafragment<8, vector<2xf16>>, %C : !gpu.mmafragment<4, vector<2xf16>>) -> () {
				%D = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mmafragment<8, vector<2xf16>>, !gpu.mmafragment<8, vector<2xf16>>, !gpu.mmafragment<4, vector<2xf16>> -> !gpu.mmafragment<4, vector<2xf16>>
				//CHECK: %[[A1:.*]] = llvm.extractvalue %[[A]][0] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[A2:.*]] = llvm.extractvalue %[[A]][1] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[A3:.*]] = llvm.extractvalue %[[A]][2] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[A4:.*]] = llvm.extractvalue %[[A]][3] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[A5:.*]] = llvm.extractvalue %[[A]][4] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[A6:.*]] = llvm.extractvalue %[[A]][5] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[A7:.*]] = llvm.extractvalue %[[A]][6] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[A8:.*]] = llvm.extractvalue %[[A]][7] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[B1:.*]] = llvm.extractvalue %[[B]][0] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[B2:.*]] = llvm.extractvalue %[[B]][1] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[B3:.*]] = llvm.extractvalue %[[B]][2] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[B4:.*]] = llvm.extractvalue %[[B]][3] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[B5:.*]] = llvm.extractvalue %[[B]][4] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[B6:.*]] = llvm.extractvalue %[[B]][5] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[B7:.*]] = llvm.extractvalue %[[B]][6] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[B8:.*]] = llvm.extractvalue %[[B]][7] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[C1:.*]] = llvm.extractvalue %[[C]][0] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[C2:.*]] = llvm.extractvalue %[[C]][1] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[C3:.*]] = llvm.extractvalue %[[C]][2] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %[[C4:.*]] = llvm.extractvalue %[[C]][3] : !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: %20 = nvvm.wmma.m16n16k16.mma.row.row.f16.f16 %[[A1]], %[[A2]], %[[A3]], %[[A4]], %[[A5]], %[[A6]], %[[A7]], %[[A8]], %[[B1]], %[[B2]], %[[B3]], %[[B4]], %[[B5]], %[[B6]], %[[B7]], %[[B8]], %[[C1]], %[[C2]], %[[C3]], %[[C4]] : vector<2xf16> -> !llvm.struct<(vector<2xf16>, vector<2xf16>, vector<2xf16>, vector<2xf16>)>
				//CHECK-NEXT: llvm.return
				return
				}
				}

Revision Contents

Diff 321330

mlir/lib/Conversion/GPUToNVVM/CommonTypes.h

mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp

mlir/lib/Conversion/GPUToNVVM/WmmaLoadOptoNvvmLowering.h

mlir/lib/Conversion/GPUToNVVM/WmmaMmaOptoNvvmLowering.h

mlir/lib/Conversion/GPUToNVVM/WmmaStoreOptoNvvmLowering.h

mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir
