This is an archive of the discontinued LLVM Phabricator instance.

Differential D156695

[mlir][NVGPU] Support 2D masks in transform.nvgpu.create_async_groups
ClosedPublic

Authored by springerm on Jul 31 2023, 7:42 AM.


Details

Reviewers

nicolasvasilache
herhut
guraypp

Commits

rG39d8876da363: [mlir][NVGPU] Support 2D masks in transform.nvgpu.create_async_groups

Summary

Support IR that is generated by the vector-to-scf lowering of 2D vector transfers with a mask. Only 2D transfers that were fully unrolled are supported at the moment.
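As an illustration of the summary, here is a minimal sketch in plain C++ (not MLIR) of why a per-row element count can stand in for a 2D mask. The helper names `makeMask2D`, `countOnes` and `numReadElementsForRow` are hypothetical and do not appear in the patch. The closed form mirrors the `arith.cmpi slt` / `arith.select` pair emitted by the diff, and the sketch assumes `sz1` does not exceed the number of columns.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Reference semantics assumed here for a 2D mask built from two sizes:
// element (i, j) is set iff i < sz0 and j < sz1. Plain C++ stand-in, not MLIR.
std::vector<std::vector<bool>> makeMask2D(int64_t rows, int64_t cols,
                                          int64_t sz0, int64_t sz1) {
  std::vector<std::vector<bool>> mask(rows, std::vector<bool>(cols, false));
  for (int64_t i = 0; i < rows; ++i)
    for (int64_t j = 0; j < cols; ++j)
      mask[i][j] = i < sz0 && j < sz1;
  return mask;
}

// Count the set bits of one extracted 1-D row.
int64_t countOnes(const std::vector<bool> &row) {
  int64_t n = 0;
  for (bool bit : row)
    n += bit ? 1 : 0;
  return n;
}

// Closed form mirroring the cmpi/select pair in the diff: the row at
// `extractPos` reads sz1 elements if it lies below sz0, otherwise none.
int64_t numReadElementsForRow(int64_t extractPos, int64_t sz0, int64_t sz1) {
  return extractPos < sz0 ? sz1 : 0;
}
```

For a 3x4 mask with `sz0 = 2` and `sz1 = 3`, the closed form agrees with counting the set bits of each extracted row, which is the property the unrolled copies rely on.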

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

springerm created this revision. · Jul 31 2023, 7:42 AM

Herald added a project: Restricted Project. · Jul 31 2023, 7:42 AM

Herald added subscribers: bviyer, Moerafaat, zero9178 and 23 others.

springerm requested review of this revision. · Jul 31 2023, 7:42 AM

Herald added a reviewer: herhut. · Jul 31 2023, 7:42 AM

Herald added a project: Restricted Project.

Herald added a subscriber: stephenneuendorffer.

Harbormaster completed remote builds in B249215: Diff 545656. · Jul 31 2023, 8:50 AM

springerm added a reviewer: guraypp. · Aug 7 2023, 1:40 AM

nicolasvasilache accepted this revision. · Aug 7 2023, 2:55 AM

nicolasvasilache added inline comments.

mlir/lib/Dialect/NVGPU/Transforms/CreateAsyncGroups.cpp
63	Can this return a failure rather than crash? I see the original impl has an assert higher up; can we fail here and assert at the same place rather than sink the assertion?
83	Then should this return failure?
90	Let's propagate the error up and fail at the caller.
106	Seems like we could refactor these preconditions into getMaskOp so that the transform fails gracefully with proper error messages rather than crashing in a few places.
244	Seems like we could refactor this to precompute the masks and perform the rewrites only if all of them are valid. This way the transform could fail more gracefully, with tested, readable error messages.

This revision is now accepted and ready to land. · Aug 7 2023, 2:55 AM

Thanks for working on this. It looks good.

Only 2D transfers that were fully unrolled are supported at the moment.

Out of curiosity - why do we have this restriction? unrolling isn't always beneficial.

In D156695#4565076, @guraypp wrote:

Thanks for working on this. It looks good.

Only 2D transfers that were fully unrolled are supported at the moment.

Out of curiosity - why do we have this restriction? unrolling isn't always beneficial.

What I mean by this is that only certain IR is supported, e.g., only IR where the mask is a vector.extract(...) from a 2D mask. That's the IR that is generated by vector-to-scf. This transformation itself does not unroll. If unrolling is not beneficial, we do not have to unroll. (But then we cannot use copy_async.)

This code is mostly copied from IREE and refactored a bit. To keep it simple, that implementation only supports 2D masks. But it could be extended to higher dimensions if needed. We just didn't need it so far, so it is not implemented.
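The supported mask shapes described above can be sketched as a small classifier in plain C++. This is only a model of the case analysis in `getMaskOp` as shown in the diff. `MaskSource`, `MaskInfo`, `MaskResult` and `classifyMask` are hypothetical names introduced for illustration, and `supported == false` plays the role of `failure()`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for "what defines the mask operand of a read".
enum class MaskSource { None, CreateMask, ExtractOfCreateMask, Other };

struct MaskInfo {
  MaskSource source;
  int64_t createMaskRank;                // rank of the create_mask result
  std::vector<int64_t> extractPosition;  // positions of the extract, if any
};

// Models FailureOr<TransferMask> without the MLIR op handle.
struct MaskResult {
  bool supported;      // false plays the role of failure()
  bool hasCreateMask;  // false means the read has no mask at all
  std::vector<int64_t> extractPosition;
};

MaskResult classifyMask(const MaskInfo &info) {
  switch (info.source) {
  case MaskSource::None:
    // No mask: supported, nothing to compute.
    return MaskResult{true, false, {}};
  case MaskSource::CreateMask:
    // Case 1: the mask is the direct result of a create_mask.
    return MaskResult{true, true, {}};
  case MaskSource::ExtractOfCreateMask:
    // Case 2: only 2D -> 1D extracts are accepted.
    if (info.extractPosition.size() == 1 && info.createMaskRank == 2)
      return MaskResult{true, true, info.extractPosition};
    return MaskResult{false, false, {}};
  case MaskSource::Other:
    break;
  }
  // All other cases: not supported.
  return MaskResult{false, false, {}};
}
```

Under this model, extending support to higher-rank masks would mean relaxing the rank and position-count check in the extract case, which matches the remark that the restriction is one of implementation effort rather than principle.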

mlir/lib/Dialect/NVGPU/Transforms/CreateAsyncGroups.cpp
63	This is already handled by this piece of code: `if (cast<VectorType>(vectorVal.getType()).getRank() != 1) return;` The assert here just makes sure that `createAsyncGroups` and this helper function stay in sync. It should never be triggered.
90	Same, this is already handled by the following code; the assertion will never fail unless someone changes the code: `if (cast<VectorType>(vectorVal.getType()).getRank() != 1) return;`
106	This is also already handled by a check in `getMaskOp`: `if (extractOp.getPosition().size() == 1 && extractOp.getSourceVectorType().getRank() == 2)` Note that `getMaskOp` is called twice. The first call happens when looking for "eligible" ops. If `getMaskOp` returns `failure` at that point, we never reach `buildNumReadElements`. All the assertions here are just to highlight what is supported and what is not. They cannot fail at the moment.
244	That is happening during the call to `getMaskOp` further up in this function: `// Look for compatible mask and padding.` If the mask/padding is not supported, the op is skipped (no error).
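The two-phase structure described in these replies can be sketched in plain C++ as follows. This is a hypothetical model, not the patch itself: `Read`, `collectEligible` and `rewriteAll` are illustrative names, and `maskSupported` stands in for `succeeded(getMaskOp(readOp))`.

```cpp
#include <cassert>
#include <vector>

// Hypothetical read descriptor; `maskSupported` stands in for
// succeeded(getMaskOp(readOp)).
struct Read {
  int id;
  bool maskSupported;
};

// Phase 1: collect only eligible reads. Unsupported reads are skipped
// silently, so no error is reported for them.
std::vector<Read> collectEligible(const std::vector<Read> &reads) {
  std::vector<Read> eligible;
  for (const Read &r : reads)
    if (r.maskSupported)
      eligible.push_back(r);
  return eligible;
}

// Phase 2: rewrite. The assert documents an invariant that phase 1 already
// established; it cannot fire unless the two phases drift apart.
std::vector<int> rewriteAll(const std::vector<Read> &eligible) {
  std::vector<int> rewritten;
  for (const Read &r : eligible) {
    assert(r.maskSupported && "invalid transfer mask");
    rewritten.push_back(r.id);
  }
  return rewritten;
}
```

The trade-off raised in the review is visible here: skipping in phase 1 keeps the transform from crashing, but it also means an unsupported read produces no diagnostic.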

Closed by commit rG39d8876da363: [mlir][NVGPU] Support 2D masks in transform.nvgpu.create_async_groups (authored by springerm). · Aug 7 2023, 6:47 AM

This revision was automatically updated to reflect the committed changes.

springerm marked 4 inline comments as done.

springerm added a commit: rG39d8876da363: [mlir][NVGPU] Support 2D masks in transform.nvgpu.create_async_groups.

Revision Contents

Path	Size
mlir/lib/Dialect/NVGPU/Transforms/CreateAsyncGroups.cpp	87 lines
mlir/test/Dialect/NVGPU/transform-create-async-groups.mlir	48 lines

Diff 545656

mlir/lib/Dialect/NVGPU/Transforms/CreateAsyncGroups.cpp

(40 lines not shown)
/// vector.transfer_read or vector.load op.		/// vector.transfer_read or vector.load op.
static bool isContiguousRead(Operation *read) {		static bool isContiguousRead(Operation *read) {
if (auto transferRead = dyn_cast<vector::TransferReadOp>(read))		if (auto transferRead = dyn_cast<vector::TransferReadOp>(read))
return isContiguousXferOp(transferRead);		return isContiguousXferOp(transferRead);
// vector.load are always contiguous.		// vector.load are always contiguous.
return isa<vector::LoadOp>(read);		return isa<vector::LoadOp>(read);
}		}

		namespace {
		/// A vector.create_mask op and extract position.
		struct TransferMask {
		vector::CreateMaskOp createMaskOp;
		SmallVector<int64_t> extractPosition;
		};
		} // namespace

/// If the given vector load op has a mask that is defined by		/// If the given vector load op has a mask that is defined by
/// vector.create_mask, return that op.		/// vector.create_mask, return that op.
static vector::CreateMaskOp getMaskOp(Operation *loadOp) {		static FailureOr<TransferMask> getMaskOp(Operation *loadOp) {
auto transferRead = dyn_cast<vector::TransferReadOp>(loadOp);		auto transferRead = dyn_cast<vector::TransferReadOp>(loadOp);
if (!transferRead || !transferRead.getMask())		if (!transferRead || !transferRead.getMask())
		return TransferMask{{}, {}};
		assert(transferRead.getMask().getType().getRank() == 1 &&
		"expected 1-D mask");

		// Case 1: Mask is the result of a vector.create_mask.
		if (auto maskOp =
		transferRead.getMask().getDefiningOp<vector::CreateMaskOp>())
		return TransferMask{maskOp, {}};

		// Case 2: Mask is the result of a vector.extract(vector.create_mask). Only
		// 2D -> 1D extracts are supported at the moment.
		if (auto extractOp =
		transferRead.getMask().getDefiningOp<vector::ExtractOp>())
		if (auto maskOp =
		extractOp.getVector().getDefiningOp<vector::CreateMaskOp>())
		if (extractOp.getPosition().size() == 1 &&
		extractOp.getSourceVectorType().getRank() == 2)
		return TransferMask{maskOp,
		SmallVector<int64_t>(extractOp.getPosition())};

		// All other cases: not supported.
return {};		return {};
auto maskOp = transferRead.getMask().getDefiningOp<vector::CreateMaskOp>();		}
// TODO: Support 2D masks and higher. Ops with a >1D mask are ignored at the
// moment.		/// Build an SSA value that represents the number of read elements.
if (maskOp.getVectorType().getRank() != 1)		static Value buildNumReadElements(OpBuilder &b, Location loc,
return {};		Operation *readOp) {
return maskOp;		FailureOr<TransferMask> transferMask = getMaskOp(readOp);
		assert(succeeded(transferMask) && "invalid transfer mask");

		// No mask => no num_read_elements.
		if (!transferMask->createMaskOp)
		return Value();

		// No extract: return size of "ones" segment in the mask.
		if (transferMask->extractPosition.empty()) {
		assert(transferMask->createMaskOp.getNumOperands() == 1 &&
		"expected single operand");
		return transferMask->createMaskOp.getOperand(0);
		}

		// vector.extract(vector.create_mask).
		// If extract_pos < num_ones, take number of elements from the least
		// significant dimension.
		assert(transferMask->createMaskOp.getVectorType().getRank() == 2 &&
		"expected 2D mask");
		assert(transferMask->extractPosition.size() == 1 &&
		"expected 2D->1D extract");
		Value cmp = b.create<arith::CmpIOp>(
		loc, arith::CmpIPredicate::slt,
		b.create<arith::ConstantIndexOp>(loc,
		transferMask->extractPosition.front()),
		transferMask->createMaskOp->getOperands().front());
		return b.create<arith::SelectOp>(
		loc, cmp, transferMask->createMaskOp->getOperands().back(),
		b.create<arith::ConstantIndexOp>(loc, 0));
}		}

/// Return "true" if the conversion to async copy is supported by "async copy".		/// Return "true" if the conversion to async copy is supported by "async copy".
static bool resultsInSupportedAsyncCopy(MemRefType memrefType,		static bool resultsInSupportedAsyncCopy(MemRefType memrefType,
Operation::operand_range indices,
VectorType vecType) {		VectorType vecType) {
assert(vecType.getRank() == 1 && "expected 1-D vector");		assert(vecType.getRank() == 1 && "expected 1-D vector");
constexpr int64_t kSupportedCpAsyncAlignmentsInBytes[3] = {4, 8, 16};		constexpr int64_t kSupportedCpAsyncAlignmentsInBytes[3] = {4, 8, 16};

// Condition 1: the copy size must be supported.		// Condition 1: the copy size must be supported.
bool supportedCopySize = false;		bool supportedCopySize = false;
int64_t numElements = vecType.getNumElements();		int64_t numElements = vecType.getNumElements();
Type elementType = vecType.getElementType();		Type elementType = vecType.getElementType();
(42 lines not shown)	if (nvgpu::NVGPUDialect::hasSharedMemoryAddressSpace(
return;		return;

// Look for compatible mask and padding.		// Look for compatible mask and padding.
if (auto transferRead = dyn_cast<vector::TransferReadOp>(readOp)) {		if (auto transferRead = dyn_cast<vector::TransferReadOp>(readOp)) {
if (Value mask = transferRead.getMask()) {		if (Value mask = transferRead.getMask()) {
if (getConstantIntValue(transferRead.getPadding()) ==		if (getConstantIntValue(transferRead.getPadding()) ==
static_cast<int64_t>(0))		static_cast<int64_t>(0))
return;		return;
if (!getMaskOp(readOp))		if (failed(getMaskOp(readOp)))
return;		return;
}		}
}		}

// Check whether both accesses are supported before we emit: this is		// Check whether both accesses are supported before we emit: this is
// necessary to ensure the correctness of DeviceAsyncCopyOp.		// necessary to ensure the correctness of DeviceAsyncCopyOp.
VectorType vecType = cast<VectorType>(vectorVal.getType());		VectorType vecType = cast<VectorType>(vectorVal.getType());

if (!resultsInSupportedAsyncCopy(cast<MemRefType>(loadBase.getType()),		if (!resultsInSupportedAsyncCopy(cast<MemRefType>(loadBase.getType()),
nvgpu::getIndices(readOp), vecType) ||		vecType) ||
!resultsInSupportedAsyncCopy(cast<MemRefType>(storeBase.getType()),		!resultsInSupportedAsyncCopy(cast<MemRefType>(storeBase.getType()),
nvgpu::getIndices(writeOp), vecType))		vecType))
return;		return;

copyToSharedMem.insert(writeOp);		copyToSharedMem.insert(writeOp);
return;		return;
});		});

while (!copyToSharedMem.empty()) {		while (!copyToSharedMem.empty()) {
// Start a group with the first write.		// Start a group with the first write.
(34 lines not shown)	while (!copyToSharedMem.empty()) {
for (Operation *writeOp : group) {		for (Operation *writeOp : group) {
rewriter.setInsertionPoint(writeOp);		rewriter.setInsertionPoint(writeOp);
Value vectorVal = nvgpu::getValueStored(writeOp);		Value vectorVal = nvgpu::getValueStored(writeOp);
auto vectorType = cast<VectorType>(vectorVal.getType());		auto vectorType = cast<VectorType>(vectorVal.getType());
int64_t numElements = vectorType.getNumElements();		int64_t numElements = vectorType.getNumElements();
Operation *readOp = vectorVal.getDefiningOp();		Operation *readOp = vectorVal.getDefiningOp();
Value storeBase = nvgpu::getMemrefOperand(writeOp);		Value storeBase = nvgpu::getMemrefOperand(writeOp);
Value loadBase = nvgpu::getMemrefOperand(readOp);		Value loadBase = nvgpu::getMemrefOperand(readOp);
Value numReadElements;		Value numReadElements =
if (vector::CreateMaskOp maskOp = getMaskOp(readOp)) {		buildNumReadElements(rewriter, writeOp->getLoc(), readOp);
assert(maskOp.getNumOperands() == 1 && "expected single operand");
numReadElements = maskOp.getOperand(0);
}
auto dstMemref = cast<MemRefType>(storeBase.getType());		auto dstMemref = cast<MemRefType>(storeBase.getType());
int64_t sizeInBytes =		int64_t sizeInBytes =
(dstMemref.getElementTypeBitWidth() * numElements) / 8;		(dstMemref.getElementTypeBitWidth() * numElements) / 8;
// bypass_l1 only possible with 16 byte transfer.		// bypass_l1 only possible with 16 byte transfer.
Value token = rewriter.create<nvgpu::DeviceAsyncCopyOp>(		Value token = rewriter.create<nvgpu::DeviceAsyncCopyOp>(
writeOp->getLoc(), nvgpu::DeviceAsyncTokenType::get(op->getContext()),		writeOp->getLoc(), nvgpu::DeviceAsyncTokenType::get(op->getContext()),
/*dst=*/storeBase, /*dstIndices=*/nvgpu::getIndices(writeOp),		/*dst=*/storeBase, /*dstIndices=*/nvgpu::getIndices(writeOp),
/*src=*/loadBase,		/*src=*/loadBase,
(19 lines not shown)

mlir/test/Dialect/NVGPU/transform-create-async-groups.mlir

(145 lines not shown)	builtin.module {
}		}

transform.sequence failures(propagate) {		transform.sequence failures(propagate) {
^bb1(%variant_op: !transform.any_op):		^bb1(%variant_op: !transform.any_op):
%top_level_func = transform.structured.match ops{["func.func"]} in %variant_op : (!transform.any_op) -> !transform.any_op		%top_level_func = transform.structured.match ops{["func.func"]} in %variant_op : (!transform.any_op) -> !transform.any_op
transform.nvgpu.create_async_groups %top_level_func {bypass_l1} : (!transform.any_op) -> (!transform.any_op)		transform.nvgpu.create_async_groups %top_level_func {bypass_l1} : (!transform.any_op) -> (!transform.any_op)
}		}
}		}

		// -----

		// 2D vector.transfer_read with a mask.
		builtin.module {
		// CHECK-LABEL: @read_2d_with_mask(
		// CHECK-SAME: %[[sz0:.*]]: index, %[[sz1:.*]]: index, %[[a:.*]]: memref<1024x1024xf32>
		func.func @read_2d_with_mask(%sz0: index, %sz1: index, %a: memref<1024x1024xf32>) {
		// CHECK: %[[c0:.*]] = arith.constant 0 : index
		// CHECK: %[[c1:.*]] = arith.constant 1 : index
		// CHECK: %[[c2:.*]] = arith.constant 2 : index
		%0 = memref.alloc() : memref<4x32x16xf32, #gpu.address_space<workgroup>>
		%c0 = arith.constant 0 : index
		%cst_0 = arith.constant 0.000000e+00 : f32
		// CHECK: %[[mask:.*]] = vector.create_mask
		// CHECK: %[[e0:.*]] = vector.extract %[[mask]][0] : vector<3x4xi1>
		// CHECK: %[[e1:.*]] = vector.extract %[[mask]][1] : vector<3x4xi1>
		// CHECK: %[[e2:.*]] = vector.extract %[[mask]][2] : vector<3x4xi1>

		// CHECK: %[[cmpi0:.*]] = arith.cmpi slt, %[[c0]], %[[sz0]]
		// CHECK: %[[s0:.*]] = arith.select %[[cmpi0]], %[[sz1]], %[[c0]]
		// CHECK: nvgpu.device_async_copy %[[a]][%[[c0]], %[[c0]]], {{.*}}, 4, %[[s0]] {bypassL1}

		// CHECK: %[[cmpi1:.*]] = arith.cmpi slt, %[[c1]], %[[sz0]]
		// CHECK: %[[s1:.*]] = arith.select %[[cmpi1]], %[[sz1]], %[[c0]]
		// CHECK: nvgpu.device_async_copy %[[a]][%[[c1]], %[[c0]]], {{.*}}, 4, %[[s1]] {bypassL1}

		// CHECK: %[[cmpi2:.*]] = arith.cmpi slt, %[[c2]], %[[sz0]]
		// CHECK: %[[s2:.*]] = arith.select %[[cmpi2]], %[[sz1]], %[[c0]]
		// CHECK: nvgpu.device_async_copy %[[a]][%[[c2]], %[[c0]]], {{.*}}, 4, %[[s2]] {bypassL1}
		%mask = vector.create_mask %sz0, %sz1 : vector<3x4xi1>
		%1 = vector.transfer_read %a[%c0, %c0], %cst_0, %mask {in_bounds = [true, true]} : memref<1024x1024xf32>, vector<3x4xf32>
		vector.transfer_write %1, %0[%c0, %c0, %c0] {in_bounds = [true, true]} : vector<3x4xf32>, memref<4x32x16xf32, #gpu.address_space<workgroup>>

		return
		}

		transform.sequence failures(propagate) {
		^bb1(%variant_op: !transform.any_op):
		%top_level_func = transform.structured.match ops{["func.func"]} in %variant_op : (!transform.any_op) -> !transform.any_op
		transform.apply_patterns to %top_level_func {
		transform.apply_patterns.vector.transfer_to_scf max_transfer_rank = 1 full_unroll = true
		} : !transform.any_op
		transform.nvgpu.create_async_groups %top_level_func {bypass_l1} : (!transform.any_op) -> (!transform.any_op)
		%top_level_func_2 = transform.structured.match ops{["func.func"]} in %variant_op : (!transform.any_op) -> !transform.any_op
		transform.apply_cse to %top_level_func_2 : !transform.any_op
		}
		}