Allow lowering of wmma ops with 64-bit indexes
This looks good to me - thanks.
mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp, lines 187–189:
I assume this explanation wasn't a valid one. Could you clarify?
mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir, lines 13–23:
The default has changed to 64-bit here: could you please capture this in the commit summary?

mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f32.mlir, line 3:
This can be run with a 32-bit index bitwidth. Do we need this change?
mlir/test/Conversion/GPUToNVVM/wmma-ops-to-nvvm.mlir, lines 13–23:
I now realize that the default just works as opposed to failing: it's derived from the data layout and happens to be 64-bit by default.
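For context, the lit test presumably exercises both configurations along these lines (a sketch: the exact RUN lines and CHECK prefixes are assumptions, not copied from the test):

```mlir
// Default: the index bitwidth is derived from the data layout (64-bit here).
// RUN: mlir-opt --convert-gpu-to-nvvm --split-input-file %s | FileCheck %s

// Explicit 32-bit indexing via the pass option.
// RUN: mlir-opt --convert-gpu-to-nvvm='index-bitwidth=32' --split-input-file %s \
// RUN:   | FileCheck --check-prefix=CHECK32 %s
```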
Thanks for the review!
mlir/lib/Conversion/GPUToNVVM/WmmaOpsToNvvm.cpp, lines 187–189:
Yes, even though the intrinsic only accepts a 32-bit stride, those ops can be used with 64-bit indexes.
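To make the point concrete, a minimal sketch of such an op (names, shapes, and the element type are illustrative, not taken from the patch):

```mlir
// %i and %j have type `index`, which lowers to i64 under the default
// 64-bit index bitwidth. The conversion then truncates the stride to
// i32, since the NVVM wmma load intrinsic only takes a 32-bit stride.
%a = gpu.subgroup_mma_load_matrix %src[%i, %j] {leadDimension = 32 : index}
    : memref<32x32xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
```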
mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f32.mlir, line 3:
Correct, it can run with a 32-bit index, but most tests use the default 64-bit mode. Overall I don't believe the 32-bit mode is as relevant for CUDA. Let me know if you want me to keep an integration test in 32-bit mode; note that I'm still testing both modes in the lit tests.
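For reference, keeping a 32-bit variant would presumably just mean threading the option through the kernel lowering in the RUN line, roughly like this (the rest of the pipeline is elided and assumed):

```mlir
// Force 32-bit indexing when lowering the GPU kernel to NVVM:
// RUN: mlir-opt %s ... \
// RUN:   -pass-pipeline='gpu.module(convert-gpu-to-nvvm{index-bitwidth=32})' ...
```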
mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f32.mlir, line 3:
Regarding the 32-bit mode's relevance for CUDA, my understanding is exactly the opposite. 32-bit indexing is quite important and common: it's often sufficient and is expected to only perform better. I recall @herhut or others from his team reporting on the MLIR forum that they saw better performance in many cases when using 32-bit indexing. So I think it's completely fine to use 32-bit indexing for cases where it's guaranteed to be sufficient.
auto -> IntegerAttr here.