
Details

Reviewers

qcolombet
nicolasvasilache
herhut
ThomasRaoux
manishucsd

Commits

rGcce3e8ed895b: [MLIR][NVGPU] Introduction of wgmma.generate.descriptor Op

Summary

This work introduces a new op, `wgmma.generate.descriptor`, which creates a wgmma descriptor for the inputs of matrix multiply-and-accumulate operations that use the `wgmma.mma_async` PTX instruction.

The descriptor format is specified at the following link:
https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-shared-memory-layout-matrix-descriptor

This op is in its initial phase and has limitations: it supports only 128b swizzling and does not support interleaving. Follow-up work will add the remaining descriptor calculations and expand the op's capabilities.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

guraypp created this revision. Aug 8 2023, 4:22 AM

Herald added a project: Restricted Project. Aug 8 2023, 4:22 AM

Herald added subscribers: bviyer, Moerafaat, zero9178 and 24 others.

guraypp requested review of this revision. Aug 8 2023, 4:22 AM

Herald added a reviewer: herhut. Aug 8 2023, 4:22 AM

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, jholewinski.

guraypp added a reviewer: ThomasRaoux. Aug 8 2023, 4:23 AM

Harbormaster completed remote builds in B251051: Diff 548145. Aug 8 2023, 5:47 AM

guraypp added a reviewer: manishucsd. Aug 9 2023, 8:24 AM

qcolombet added inline comments. Aug 10 2023, 12:40 AM

mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td
646	Maybe add: Where `Mod` is the swizzling mode.
mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
944	Stupid question, do we have to unwrap the descriptor, as opposed to using it directly?
983	That's not enough to get the start address, is it? The first three fields (start address, leading dim, and offset) are stored as `matrix-descriptor-encode(x) = (x & 0x3FFFF) >> 0x4`, if I'm not mistaken. (Though the spec is weird because I feel we would lose some information in the encoding process for the offset and leading dim at least.) Anyhow, this leads back to my other question, should we even unwrap the descriptor.
986	Could you introduce constants for the various shifts and sizes? E.g., StartAddrSizeInBits = ... StartAddrBitStartPos = ... And use that in the shifts and masks etc.
mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
610	I found it strange to see a `0xf16`. Could we use `f16` instead? Also I would expect generally speaking we would have something like `memref<?xf16>` for this kind of global variable, wouldn't we?
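
The encoding quoted in the comment on line 983 can be checked on plain integers. The following is a minimal sketch, assuming the formula `matrix-descriptor-encode(x) = (x & 0x3FFFF) >> 4` exactly as cited in that comment; `matrixDescriptorEncode` is a hypothetical helper name and is not part of MLIR or of this patch. It needs C++17 for the message-less `static_assert`.

```cpp
#include <cstdint>

// Hypothetical helper (not from the patch): keeps bits [4, 18) of x,
// i.e. the low 18 bits with the 4 least-significant bits dropped,
// which yields a 14-bit field value.
constexpr uint64_t matrixDescriptorEncode(uint64_t x) {
  return (x & 0x3FFFF) >> 4;
}

// A multiple of 16 below 2^18 survives the round trip (decode is x << 4).
static_assert(matrixDescriptorEncode(0x12340) == 0x1234);
// The 4 LSBs are discarded, so other values are rounded down.
static_assert(matrixDescriptorEncode(0x1234F) == 0x1234);
// Bits at position 18 and above are masked off.
static_assert(matrixDescriptorEncode(0x40000) == 0);
// The result always fits in 14 bits.
static_assert(matrixDescriptorEncode(~0ull) == 0x3FFF);
```

The second and third assertions illustrate the information loss the reviewer mentions: any input that is not a multiple of 16, or that needs more than 18 bits, cannot be recovered from the encoded field.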

qcolombet added inline comments. Aug 10 2023, 1:05 AM

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
944	Scratch that, I'm stupid, we're wrapping the value in here, not unwrapping it, so we need to create that thing. I thought there was a dedicated nvidia instruction that creates the matrix descriptor, but no :).

address comments

guraypp added a subscriber: ftynse. Aug 10 2023, 2:20 AM

guraypp added inline comments.

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
944	We had a chat offline. Here we create the descriptor from scratch, so we need to fill in the bits.
983	Good catch. I added this to the other fields as well. Thanks.
mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
610	Very good question. `0xf16,3` is needed for `dynamic shared` memory; the NVPTX backend of LLVM is implemented that way. `memref<?xf16,3>` IR does not get verified as a global object. One could think of using `memref<8192xf16,3>`. However, there is a CUDA limitation here: sized shared memory is generated as static shared memory, and its limit is 48k, whereas dynamic shared memory (`0xf16`) has a larger limit. I had a chat about that with @ftynse. I am planning to model dynamic shared memory in MLIR.

calculate stride at compile-time

calculate leading dim also at compile time, and exclude 4 LSB

use const for LSB bit

fix the test

Harbormaster completed remote builds in B251645: Diff 548974. Aug 10 2023, 8:16 AM

qcolombet added inline comments. Aug 10 2023, 9:53 AM

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
945	Stick to "swizzle mode" or say "layout mode" for the swizzle bits, because it is difficult to connect the two when the comment explaining the structure of the bits says swizzle and the variable says layoutBit.
949	That should be `0` I believe.
987	I would suggest a higher-level API (to hide the left/right shift used for masking) of the form:
    val = insertBits(dst, field, start, size)
And chain that together, e.g.:
    addressWo4LSB = makeShl(start_address, 4)
    desc = insertBits(zero, addressWo4LSB, startAddressPos /*i.e., 0*/, startAddressSize /*i.e., 14*/)
    strideDimWo4LSB = makeShl(strideDim, 4)
    desc = insertBits(desc, strideDimWo4LSB, startStridePos, strideSize)
    ...
Where `insertBits` would produce something like:
    insertBits(dst, field, start, size):
      mask = (1 << size) - 1 // computed as a constant at compile time
      masked_field = field & mask
      res = dst | (masked_field << start)
      return res
Note: at this point, you can simply implement `exclude4LSB` with `makeShl(..., 4)`. (I.e., keep that function, the name helps the understanding, but the implementation becomes simpler.)
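
The `insertBits` idea from the comment on line 987 can be tried out on host integers. This is a minimal sketch under stated assumptions, not the lowering code: it operates on `uint64_t` instead of emitting LLVM dialect ops, the field positions (bit 0 for the start address, bit 32 for the stride, 14 bits each) are taken from the layout table in the op description, and `exclude4LSB` is written as a right shift by 4, which is an assumption about the intent of "without 4 LSB". All function names are hypothetical. It needs C++17.

```cpp
#include <cstdint>

// Hypothetical host-side analogue of the suggested helper: OR the low
// `size` bits of `field` into `dst`, starting at bit `start`.
constexpr uint64_t insertBits(uint64_t dst, uint64_t field, unsigned start,
                              unsigned size) {
  // Guard the shift: 1 << 64 would be undefined behavior.
  const uint64_t mask = (size >= 64) ? ~0ull : ((1ull << size) - 1);
  return dst | ((field & mask) << start);
}

// Assumption: dropping the 4 LSBs means a logical right shift by 4.
constexpr uint64_t exclude4LSB(uint64_t value) { return value >> 4; }

// Chain two fields together, as in the comment.
constexpr uint64_t buildPartialDescriptor(uint64_t startAddress,
                                          uint64_t strideBytes) {
  uint64_t desc = 0;
  desc = insertBits(desc, exclude4LSB(startAddress), /*start=*/0, /*size=*/14);
  desc = insertBits(desc, exclude4LSB(strideBytes), /*start=*/32, /*size=*/14);
  return desc;
}

// 0x12340 >> 4 = 0x1234 lands in bits [0, 14);
// 1024 >> 4 = 64 lands in bits [32, 46).
static_assert(buildPartialDescriptor(0x12340, 1024) == 0x0000004000001234ull);
// Out-of-range field bits are masked instead of leaking into neighbors.
static_assert(insertBits(0, 0xFFFFF, 0, 14) == 0x3FFF);
```

Passing the size explicitly keeps the masking in one place, which is the readability point the reviewer raises.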

qcolombet added inline comments. Aug 10 2023, 9:54 AM

mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir
610	That's fine for now. Maybe just add a comment saying that's how dynamic shared memory is currently represented.

address comments

guraypp marked 3 inline comments as done. Aug 16 2023, 12:29 PM

Harbormaster completed remote builds in B253013: Diff 550842. Aug 16 2023, 1:22 PM

guraypp added a child revision: D158151: [MLIR] Add sm_90a integration test with `wgmma`. Aug 17 2023, 12:20 AM

guraypp added a child revision: D158403: [MLIR][NVGPU] Introduce Warpgroup Matrix Descriptor Type. Aug 21 2023, 3:19 AM

qcolombet added inline comments. Aug 22 2023, 3:12 AM

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
968	We'll also need the size of the field (e.g., 14 for `BaseAddr`) to mask the bits appropriately. Technically we may not need that because our values should always be in range (i.e., the masking should be a no-op), but I'd rather we emit the proper code sequence and have it optimized away. Alternatively, we could emit `llvm.assume(val < ((1 << size) - 1))` if you believe the masking is overkill.

simplify the pattern, return ssa values

Harbormaster completed remote builds in B254049: Diff 552300. Aug 22 2023, 3:47 AM

guraypp added inline comments. Aug 22 2023, 5:02 AM

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
968	For the baseAddr, I do the following. Is that not enough? `Value basePtr14bit = shiftRight(shiftLeft(basePtr, 46), 50);`
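
Why the shift pair is sufficient can be seen on plain 64-bit integers. The sketch below assumes that `shiftLeft` and `shiftRight` behave like unsigned `<<` and `>>` on a 64-bit value (the lowering in the diff uses `LLVM::ShlOp` and `LLVM::LShrOp`); the function names `shiftPair` and `maskShift` are hypothetical. It needs C++17.

```cpp
#include <cstdint>

// Shift pair from the reply above: the left shift by 46 discards every bit
// at position 18 and above, and the logical right shift by 50 then discards
// the 4 LSBs and moves bits [4, 18) down to [0, 14).
constexpr uint64_t shiftPair(uint64_t x) { return (x << 46) >> 50; }

// Mask-then-shift form quoted in the earlier comment on line 983.
constexpr uint64_t maskShift(uint64_t x) { return (x & 0x3FFFF) >> 4; }

// Both forms agree, including for inputs with high bits set.
static_assert(shiftPair(0x12340) == maskShift(0x12340));
static_assert(shiftPair(0xABCDE12345ull) == maskShift(0xABCDE12345ull));
static_assert(shiftPair(~0ull) == maskShift(~0ull));
// The result is always a 14-bit value.
static_assert(shiftPair(~0ull) == 0x3FFF);
```

The two forms compute the same 14-bit field; the difference is only in how visible the masking is to a reader.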

qcolombet accepted this revision. Aug 22 2023, 6:25 AM

qcolombet added inline comments.

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp
968	Ah right. It is enough, though it is not super readable (e.g., I missed it :)). I would have hidden the left/right shift directly in `insertBits` by passing a size directly. Anyhow, aside from `BaseAddr` I see that everything is known at compile time, so let's go with what you have here. The systematic masking is not necessary / overkill at this point.

This revision is now accepted and ready to land. Aug 22 2023, 6:25 AM

Closed by commit rGcce3e8ed895b: [MLIR][NVGPU] Introduction of wgmma.generate.descriptor Op (authored by guraypp). Aug 22 2023, 7:12 AM

This revision was automatically updated to reflect the committed changes.

guraypp added a commit: rGcce3e8ed895b: [MLIR][NVGPU] Introduction of wgmma.generate.descriptor Op.

Diff 548935

mlir/include/mlir/Dialect/NVGPU/IR/NVGPU.td

[... unchanged lines not shown ...]
let arguments = (ins AnyUnrankedMemRef:$tensor,
Variadic<Index>:$boxDimensions);
let results = (outs NVGPU_TensorMapDescriptor:$tensorMap);
let assemblyFormat = [{
$tensor `box` `[` $boxDimensions `]` attr-dict `:` type($tensor) `->` type($tensorMap)
}];
let hasVerifier = 1;
}

		def NVGPU_GenerateGmmaDescriptorOp : NVGPU_Op<"wgmma.generate.descriptor", []> {
		let summary = "Generate a wgmma matrix descriptor";
		let description = [{
		This Op builds a wgmma descriptor that is used by wgmma matrix multiply
		and accumulate.

		The descriptor specifies the properties of the matrix in shared memory that
		is a multiplicand in the matrix multiply and accumulate operation.

		The descriptor is a 64-bit value contained in a register with the following
		layout:
		```
		+---------+-----+-----------+-----+-----------+-----+-----+-----------+-----+
		|   0-13  |14-15|   16-29   |30-31|   32-45   |46-48|49-51|   52-61   |62-63|
		+---------+-----+-----------+-----+-----------+-----+-----+-----------+-----+
		|  14bits |2bits|   14bits  |2bits|   14bits  |3bits|3bits|   10bits  |2bits|
		+---------+-----+-----------+-----+-----------+-----+-----+-----------+-----+
		| BaseAddr|  0  | LeadingDim|  0  |   Stride  |  0  |Offst|     0     |Swzle|
		+---------+-----+-----------+-----+-----------+-----+-----+-----------+-----+
		```

		See for more details:
		https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#asynchronous-warpgroup-level-matrix-shared-memory-layout-matrix-descriptor

		}];
		let results = (outs I64:$descriptor);
		let arguments = (ins Arg<AnyMemRef, "", [MemWrite]>:$tensor,
		NVGPU_TensorMapDescriptor:$tensorMap);
		let assemblyFormat = [{$tensor `,` $tensorMap attr-dict `:` type($tensor) `,` type($tensorMap)}];
		let hasVerifier = 1;
		}

#endif // NVGPU
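
The bit layout in the op description can be written down as constants and checked mechanically. The following is a minimal sketch, assuming the field ranges shown in the table (bits 46-48 are counted as three bits, which is what the range implies); the `Field` struct, the constant names, and `extractField` are all hypothetical and do not exist in the patch. It needs C++17.

```cpp
#include <cstdint>

// Hypothetical description of one descriptor field: first bit and width.
struct Field {
  unsigned start;
  unsigned size;
};

// Positions and widths transcribed from the layout table.
constexpr Field kBaseAddr{0, 14};
constexpr Field kLeadingDim{16, 14};
constexpr Field kStride{32, 14};
constexpr Field kBaseOffset{49, 3};
constexpr Field kSwizzle{62, 2};

// The named fields plus the zero-filled gaps must cover all 64 bits.
static_assert(14 + 2 + 14 + 2 + 14 + 3 + 3 + 10 + 2 == 64);

// Hypothetical helper: read one field back out of a 64-bit descriptor.
constexpr uint64_t extractField(uint64_t desc, Field f) {
  return (desc >> f.start) & ((1ull << f.size) - 1);
}

// A made-up descriptor value, used only to exercise the helper.
constexpr uint64_t kExample =
    (1ull << 62) | (64ull << 32) | (1ull << 16) | 0x1234ull;

static_assert(extractField(kExample, kSwizzle) == 1);
static_assert(extractField(kExample, kBaseOffset) == 0);
static_assert(extractField(kExample, kStride) == 64);
static_assert(extractField(kExample, kLeadingDim) == 1);
static_assert(extractField(kExample, kBaseAddr) == 0x1234);
```

Keeping start and size together per field is one way to address the request for named constants for the shifts and sizes.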

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp

[... unchanged lines not shown ...]
for (auto [index, value] : llvm::enumerate(coords)) {
coords[index] = truncToI32(rewriter, op->getLoc(), value);
}

rewriter.replaceOpWithNewOp<NVVM::CpAsyncBulkTensorGlobalToSharedClusterOp>(
op, dest, adaptor.getTensorMapDescriptor(), barrier, coords);
return success();
}
};
		struct NVGPUGenerateGmmaDescriptorLowering
		: public ConvertOpToLLVMPattern<nvgpu::GenerateGmmaDescriptorOp> {
		using ConvertOpToLLVMPattern<
		nvgpu::GenerateGmmaDescriptorOp>::ConvertOpToLLVMPattern;

		LogicalResult
		matchAndRewrite(nvgpu::GenerateGmmaDescriptorOp op, OpAdaptor adaptor,
		ConversionPatternRewriter &rewriter) const override {
		constexpr int startLayoutBit = 62;
		constexpr int startOffsetBit = 49;
		constexpr int startStrideBit = 32;
		constexpr int startLeadingDimBit = 16;
		constexpr int startBaseAddrBit = 50;

		Location loc = op->getLoc();

		nvgpu::TensorMapSwizzleKind swizzleKind =
		op.getTensorMap().getType().getSwizzle();

		unsigned swizzle =
		(swizzleKind == nvgpu::TensorMapSwizzleKind::SWIZZLE_128B) ? 128
		: (swizzleKind == nvgpu::TensorMapSwizzleKind::SWIZZLE_64B) ? 64
		: (swizzleKind == nvgpu::TensorMapSwizzleKind::SWIZZLE_32B) ? 32
		: 1;

		auto ti64 = rewriter.getIntegerType(64);
		auto makeConst = [&](uint64_t index) -> Value {
		return rewriter.create<LLVM::ConstantOp>(
		loc, ti64, rewriter.getI64IntegerAttr(index));
		};
		auto shiftLeft = [&](Value value, unsigned shift) -> Value {
		return rewriter.create<LLVM::ShlOp>(loc, ti64, value, makeConst(shift));
		};
		auto shiftRight = [&](Value value, unsigned shift) -> Value {
		return rewriter.create<LLVM::LShrOp>(loc, ti64, value, makeConst(shift));
		};
		auto makeOr = [&](Value lhs, Value rhs) -> Value {
		return rewriter.create<LLVM::OrOp>(loc, ti64, lhs, rhs);
		};
		auto exclude4LSB = [&](Value value, unsigned startBit) -> Value {
		return shiftRight(shiftLeft(value, (startBit - 4)), startBit);
		};

		Value desc = makeConst(0);
		// [62,64) layout type
		// 6 bits unused, 2 bits [6,8)
		desc = makeOr(desc, shiftLeft(makeConst(uint64_t(1)), startLayoutBit));
		// [49,52) base_offset
		// 1 bit unused, 3 bits [1,4), 4 bits unused
		// Valid only for SWIZZLE_128B and SWIZZLE_64B
		desc = makeOr(desc, shiftLeft(makeConst(0), startOffsetBit));
		// [32,46) stride
		// 14 bits [0,14), 2 bits unused
		Value strideDim = shiftRight(shiftLeft(makeConst(swizzle), 3), 4);
		// Exclude 4LSB
		desc = makeOr(desc, exclude4LSB(strideDim, startStrideBit));
		// [16,30) leading dimension
		// 14 bits [0,14), 2 bits unused
		// Not used with swizzling. Exclude 4LSB
		desc = makeOr(desc, exclude4LSB(makeConst(1), startLeadingDimBit));
		// [0,14) start_address
		// 14 bits [0,14), 2 bits unused
		Value basePtr = rewriter.create<LLVM::ExtractValueOp>(
		op->getLoc(), adaptor.getTensor(), 1);
		Value ptri64 = rewriter.create<LLVM::PtrToIntOp>(loc, ti64, basePtr);
		// Exclude 4LSB
		Value startAdress = exclude4LSB(ptri64, startBaseAddrBit);
		desc = makeOr(desc, startAdress);

		rewriter.replaceOp(op, desc);
		return success();
		}
		};

static Value makeI64Const(RewriterBase &rewriter, Operation *op,
int32_t index) {
return rewriter.create<LLVM::ConstantOp>(op->getLoc(),
rewriter.getIntegerType(64),
rewriter.getI32IntegerAttr(index));
}

[... unchanged lines not shown ...]
patterns.add<
NVGPUMBarrierArriveLowering, // nvgpu.mbarrier.arrive
NVGPUMBarrierArriveNoCompleteLowering, // nvgpu.mbarrier.arrive.no_complete
NVGPUMBarrierTestWaitLowering, // nvgpu.mbarrier.test_wait_parity
NVGPUMBarrierTryWaitParityLowering, // nvgpu.mbarrier.try_wait_parity
NVGPUTmaAsyncLoadOpLowering, // nvgpu.tma.async.load
NVGPUTmaCreateDescriptorOpLowering, // nvgpu.tma.create.descriptor
NVGPUMBarrierArriveExpectTxLowering, // nvgpu.mbarrier.arrive.expect_tx
NVGPUTmaAsyncLoadOpLowering, // nvgpu.tma.async.load
		NVGPUGenerateGmmaDescriptorLowering, // nvgpu.wgmma.generate.descriptor
MmaSyncOptoNVVM, MmaLdMatrixOpToNVVM, NVGPUAsyncCopyLowering,
NVGPUAsyncCreateGroupLowering, NVGPUAsyncWaitLowering,
NVGPUMmaSparseSyncLowering>(converter);
}

mlir/lib/Dialect/NVGPU/IR/NVGPUDialect.cpp

[... unchanged lines not shown ...]
LogicalResult TmaCreateDescriptorOp::verify() {
nvgpu::TensorMapDescriptorType desc = getTensorMap().getType();
if (desc.getInterleave() != TensorMapInterleaveKind::INTERLEAVE_NONE)
return emitError() << "Interleave options are not supported yet.";

return success();
}

//===----------------------------------------------------------------------===//
		// NVGPU_GenerateGmmaDescriptorOp
		//===----------------------------------------------------------------------===//

		LogicalResult GenerateGmmaDescriptorOp::verify() {
		MemRefType memrefType = getTensor().getType();
		MemRefType tensorMapType = getTensorMap().getType().getTensor();

		if (memrefType != tensorMapType)
		return emitError() << "memref and tensor map type mismatch";

		if (!memrefType.hasStaticShape() || !tensorMapType.hasStaticShape())
		return emitError() << "supports only static shapes";

		if (memrefType.getRank() != 2)
		return emitError() << "supports only 2d memref is supported for now";

		if (getTensorMap().getType().getSwizzle() !=
		TensorMapSwizzleKind::SWIZZLE_128B) {
		return emitError() << "supports only "
		<< stringifyTensorMapSwizzleKind(
		TensorMapSwizzleKind::SWIZZLE_128B)
		<< " is supported for the time being";
		}

		if (getTensorMap().getType().getInterleave() !=
		TensorMapInterleaveKind::INTERLEAVE_NONE) {
		return emitError() << "supports only "
		<< stringifyTensorMapInterleaveKind(
		TensorMapInterleaveKind::INTERLEAVE_NONE)
		<< " is supported for the time being";
		}

		return success();
		}

		//===----------------------------------------------------------------------===//
// TableGen'd dialect, type, and op definitions
//===----------------------------------------------------------------------===//

#define GET_ATTRDEF_CLASSES
#include "mlir/Dialect/NVGPU/IR/NVGPUAttrDefs.cpp.inc"

#include "mlir/Dialect/NVGPU/IR/NVGPUEnums.cpp.inc"

#define GET_OP_CLASSES
#include "mlir/Dialect/NVGPU/IR/NVGPU.cpp.inc"

#define GET_TYPEDEF_CLASSES
#include "mlir/Dialect/NVGPU/IR/NVGPUTypes.cpp.inc"

mlir/test/Conversion/NVGPUToNVVM/nvgpu-to-nvvm.mlir

// RUN: mlir-opt %s -convert-nvgpu-to-nvvm='use-opaque-pointers=1' | FileCheck %s
// RUN: mlir-opt %s -test-transform-dialect-interpreter | FileCheck %s
		// RUN: mlir-opt --convert-nvgpu-to-nvvm='use-opaque-pointers=1' --split-input-file -cse -canonicalize %s | FileCheck %s

// CHECK-LABEL: @m16n8k16_fp16
func.func @m16n8k16_fp16(%arg0: vector<4x2xf16>, %arg1: vector<2x2xf16>, %arg2: vector<2x2xf16>) -> vector<2x2xf16> {
// CHECK: llvm.extractvalue %{{.*}}[0] : !llvm.array<4 x vector<2xf16>>
// CHECK: llvm.extractvalue %{{.*}}[1] : !llvm.array<4 x vector<2xf16>>
// CHECK: llvm.extractvalue %{{.*}}[2] : !llvm.array<4 x vector<2xf16>>
// CHECK: llvm.extractvalue %{{.*}}[3] : !llvm.array<4 x vector<2xf16>>
// CHECK: llvm.extractvalue %{{.*}}[0] : !llvm.array<2 x vector<2xf16>>
[... unchanged lines not shown ...]
func.func @create_tensor_map(%devicePtr2d : memref<64x128xf32>, %devicePtr1d : memref<128xf32>) {
// CHECK : llvm.call @mgpuTensorMapEncodeTiledMemref
%tensorMap1d = nvgpu.tma.create.descriptor %devicePtr1d_unranked box[%crd1] : memref<*xf32> -> !tensorMap1d
func.return
}

!lhsTensorMap = !nvgpu.tensormap.descriptor<tensor = memref<128x64xf16, 3>, swizzle = swizzle_128b, l2promo = none, oob = zero, interleave = none>
!rhsTensorMap = !nvgpu.tensormap.descriptor<tensor = memref<64x128xf16, strided<[128, 1], offset: 8192>, 3>, swizzle = swizzle_128b, l2promo = none, oob = zero, interleave = none>

!shmemlhs = memref<128x64xf16,3>
!shmemrhs = memref<64x128xf16, strided<[128, 1], offset: 8192>, 3>

module @mymodule {
// Dynamic Shared memory
memref.global "private" @dynamicShmem : memref<0xf16,3>

func.func @async_tma_load(%lhsTensorMap: !lhsTensorMap, %rhsTensorMap: !rhsTensorMap, %mbarrier: !barrierType) {
%c0 = arith.constant 0 : index
[... unchanged lines not shown ...]
^bb1(%arg1: !transform.any_op):
%0 = transform.structured.match ops{["func.func"]} in %arg1
: (!transform.any_op) -> !transform.any_op
transform.apply_conversion_patterns to %0 {
transform.apply_conversion_patterns.nvgpu.nvgpu_to_nvvm
} with type_converter {
transform.apply_conversion_patterns.memref.memref_to_llvm_type_converter
{use_opaque_pointers = true}
} {legal_dialects = ["arith", "func", "llvm", "memref", "nvvm", "scf"], partial_conversion} : !transform.any_op

		// -----

		!tensorMap = !nvgpu.tensormap.descriptor<tensor = memref<128x64xf16,3>, swizzle = swizzle_128b, l2promo=none, oob=zero, interleave=none>
		memref.global "private" @dynamicShmem : memref<0xf16,3>
		func.func @create_wgmma_descriptor(%tensorMap : !tensorMap) -> i64{
		%dynamicMem = memref.get_global @dynamicShmem : memref<0xf16, 3>
		%lhsShmem = memref.reinterpret_cast %dynamicMem to offset: [0], sizes: [128,64], strides: [64,1] : memref<0xf16, 3> to memref<128x64xf16,3>
		// CHECK: %[[S0:.+]] = memref.get_global @dynamicShmem : memref<0xf16, 3>
		// CHECK: %[[R0:.+]] = memref.reinterpret_cast %[[S0]] to offset: [0], sizes: [128, 64], strides: [64, 1] : memref<0xf16, 3> to memref<128x64xf16, 3>
		// CHECK: %[[S1:.+]] = builtin.unrealized_conversion_cast %[[R0]] : memref<128x64xf16, 3> to !llvm.struct<(ptr<3>, ptr<3>, i64, array<2 x i64>, array<2 x i64>)>
		// CHECK: %[[S2:.+]] = llvm.extractvalue %[[S1]][0] : !llvm.struct<(ptr<3>, ptr<3>, i64, array<2 x i64>, array<2 x i64>)>
		// CHECK: %[[PTR:.+]] = llvm.ptrtoint %[[S2]] : !llvm.ptr<3> to i64
		// CHECK: %[[S10:.+]] = llvm.mlir.constant(46 : i64) : i64
		// CHECK: %[[S11:.+]] = llvm.shl %[[PTR]], %[[S10]] : i64
		// CHECK: %[[S12:.+]] = llvm.mlir.constant(50 : i64) : i64
		// CHECK: %[[S13:.+]] = llvm.lshr %[[S11]], %[[S12]] : i64
		// CHECK: %[[S14:.+]] = llvm.mlir.constant(128 : i64) : i64
		// CHECK: %[[S15:.+]] = llvm.mlir.constant(3 : i64) : i64
		// CHECK: %[[S16:.+]] = llvm.shl %[[S14]], %[[S15]] : i64
		// CHECK: %[[S17:.+]] = llvm.mlir.constant(4 : i64) : i64
		// CHECK: %[[S18:.+]] = llvm.lshr %[[S16]], %[[S17]] : i64
		// CHECK: %[[S19:.+]] = llvm.mlir.constant(16384 : i64) : i64
		// CHECK: %[[S20:.+]] = llvm.mlir.constant(4 : i64) : i64
		// CHECK: %[[S21:.+]] = llvm.lshr %[[S19]], %[[S20]] : i64
		// CHECK: %[[S22:.+]] = llvm.mlir.constant(0 : i64) : i64
		// CHECK: %[[S23:.+]] = llvm.mlir.constant(3 : i64) : i64
		// CHECK: %[[S24:.+]] = llvm.mlir.constant(62 : i64) : i64
		// CHECK: %[[S25:.+]] = llvm.shl %[[S23]], %[[S24]] : i64
		// CHECK: %[[S26:.+]] = llvm.or %[[S22]], %[[S25]] : i64
		// CHECK: %[[S27:.+]] = llvm.mlir.constant(32 : i64) : i64
		// CHECK: %[[S28:.+]] = llvm.shl %[[S18]], %[[S27]] : i64
		// CHECK: %[[S29:.+]] = llvm.or %[[S26]], %[[S28]] : i64
		// CHECK: %[[S30:.+]] = llvm.mlir.constant(16 : i64) : i64
		// CHECK: %[[S31:.+]] = llvm.shl %[[S21]], %[[S30]] : i64
		// CHECK: %[[S32:.+]] = llvm.or %[[S29]], %[[S31]] : i64
		// CHECK: %[[S33:.+]] = llvm.mlir.constant(0 : i64) : i64
		// CHECK: %[[S34:.+]] = llvm.mlir.constant(49 : i64) : i64
		// CHECK: %[[S35:.+]] = llvm.shl %[[S33]], %[[S34]] : i64
		// CHECK: %[[S36:.+]] = llvm.or %[[S32]], %[[S35]] : i64
		// CHECK: %[[DESC:.+]] = llvm.or %[[S36]], %[[S13]] : i64
		// CHECK: return %[[DESC]] : i64
		%descA = nvgpu.wgmma.generate.descriptor %lhsShmem, %tensorMap : memref<128x64xf16,3>, !tensorMap
		func.return %descA : i64
}
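
The arithmetic spelled out by the CHECK lines of `@create_wgmma_descriptor` can be replayed on host integers. This is a sketch under stated assumptions: the constants (128, 16384, the swizzle value 3, and the shift amounts) are copied from the CHECK lines themselves rather than from the lowering code, the shared-memory address is a made-up input, and `expectedDescriptor` is a hypothetical name. It needs C++17.

```cpp
#include <cstdint>

// Replays the checked instruction sequence on a plain 64-bit integer.
// The trailing comments name the FileCheck variables each step mirrors.
constexpr uint64_t expectedDescriptor(uint64_t sharedMemAddr) {
  const uint64_t base = (sharedMemAddr << 46) >> 50;  // S10..S13
  const uint64_t stride = (128ull << 3) >> 4;         // S14..S18, equals 64
  const uint64_t leadingDim = 16384ull >> 4;          // S19..S21, equals 1024
  uint64_t desc = 0;                                  // S22
  desc |= 3ull << 62;                                 // S23..S26
  desc |= stride << 32;                               // S27..S29
  desc |= leadingDim << 16;                           // S30..S32
  desc |= 0ull << 49;                                 // S33..S36
  desc |= base;                                       // DESC
  return desc;
}

// With a zero address only the compile-time fields remain.
static_assert(expectedDescriptor(0) == 0xC000004004000000ull);
// A made-up 16-byte-aligned address contributes its bits [4, 18).
static_assert(expectedDescriptor(0x12340) == 0xC000004004001234ull);
```

Everything except the base address is a compile-time constant here, which matches the reviewer's closing remark that only `BaseAddr` depends on a runtime value.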

This is an archive of the discontinued LLVM Phabricator instance.

[MLIR][NVGPU] Introduction of wgmma.generate.descriptor Op
Closed, Public
