This is an archive of the discontinued LLVM Phabricator instance.

[mlir][nvgpu] Add a nvgpu.rewrite_copy_as_tma transform operation.
ClosedPublic

Authored by nicolasvasilache on Aug 4 2023, 5:10 AM.

Details

Summary

This revision adds support for directly lowering a linalg.copy between global and shared memory buffers to a TMA async load plus synchronization operations.
This uses the recently introduced Hopper NVVM and NVGPU abstraction to connect things end to end.
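As a rough sketch, the transform might be driven from a transform-dialect script along the following lines (the op name comes from the patch title; the match pattern and result types are illustrative assumptions, not taken from the diff):

```mlir
// Illustrative only: match linalg.copy ops in the payload and rewrite each
// one as a TMA async load with the required synchronization.
transform.sequence failures(propagate) {
^bb0(%arg0: !transform.any_op):
  %copies = transform.structured.match ops{["linalg.copy"]} in %arg0
    : (!transform.any_op) -> !transform.any_op
  transform.nvgpu.rewrite_copy_as_tma %copies : (!transform.any_op) -> ()
}
```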

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald Transcript · Aug 4 2023, 5:10 AM
nicolasvasilache requested review of this revision. · Aug 4 2023, 5:10 AM

Add higher-level IR test

ftynse added inline comments. · Aug 4 2023, 5:45 AM
mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
759–772
792

Nit: ///

836–837

Why is this necessary? Are we not setting the listener properly when creating the nested builder? If so, we should fix that.

854–855

It's impossible to yield a different number of values from the then and else branches.

Speaking of which, I don't know much about the barrier behavior, but double-check that it's okay to have a barrier with and without operands in diverging branches.

865–871

This seems to be confusing terminology. The GPU dialect is supposed to use vendor-neutral terminology aligned with Khronos group specs (e.g. OpenCL), though I know it's not systematic. Mapping that terminology to CUDA terminology gives us: "workgroup" -> "block", "workgroup memory" -> "shared memory", i.e. memory accessible by any workitem/thread in the workgroup/block. It's unclear to me what "workgroup address space" refers to here and how it differs from "shared memory space". The enum values for the different address spaces in the GPU dialect are arbitrary and must not be interpreted as LLVM-compatible integers. A proper conversion, which should already be available in type converters, should be used to convert these.
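To illustrate the distinction being drawn here, a sketch of the two levels (the memref shapes are made up for the example):

```mlir
// GPU-dialect level: the workgroup ("shared" in CUDA terms) address space is
// expressed symbolically with an attribute, not a raw integer.
%a = memref.alloc() : memref<64x64xf32, #gpu.address_space<workgroup>>

// NVVM/LLVM level, after type conversion: the address space becomes the
// target-specific integer 3. The mapping between the two must go through a
// type converter; the dialect enum value itself is arbitrary.
// memref<64x64xf32, 3>
```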

917–925

This is the third occurrence in a row of this snippet with commented-out code. It is likely worth factoring out into a function that can be easily updated.

943

Please expand auto unless the type is obvious from context or impossible to spell.

974–975

Nit: explain the magic number.

1003

Does this have to hardcode num threads?

1049

It would be helpful to indicate which of the ops failed the precondition.

1057–1060

I don't actually see any way for the callee to return failure. Consider changing its return type and dropping the message here.

1062–1063

Would it make sense for the builder logic to erase this, so it is usable as a C++ call?

guraypp added inline comments. · Aug 7 2023, 1:09 AM
mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
789

I guess this comment is left over from elsewhere, because it talks about wgmma descriptors.

826

The use of threadIdx.x is going to change; see the example below. It is okay right now, but just giving you a heads-up: we will select the leader thread using the elect instruction in PTX. I need to implement that first.

https://github.com/NVIDIA/cutlass/blob/main/include/cutlass/gemm/collective/sm90_mma_tma_gmma_ss.hpp#L344-L345

1003

We can use blockDim.x instead of 128 for this code, but programs are slightly faster when it's hardcoded.

It could also be read from %block_x = %c128 on gpu.launch.
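For context, a minimal sketch of what reading the block size from the launch op would look like (the surrounding values are illustrative, not from the patch):

```mlir
// The x block dimension is bound to a region argument of gpu.launch, so the
// constant 128 could be recovered from %sx instead of being hardcoded.
%c1 = arith.constant 1 : index
%c128 = arith.constant 128 : index
gpu.launch blocks(%bx, %by, %bz) in (%gx = %c1, %gy = %c1, %gz = %c1)
           threads(%tx, %ty, %tz) in (%sx = %c128, %sy = %c1, %sz = %c1) {
  // ... %sx holds the x block dimension (here, 128) ...
  gpu.terminator
}
```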

nicolasvasilache marked 15 inline comments as done.

Address comments.

mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
826

ack, thanks!

836–837

no reason, remnant from a previous state, thanks for catching!

854–855

Well, the other path would just yield 0.
Refactored to make it less confusing.

865–871

Ok, I understand the issue now. I was mistakenly using gpu::GPUMemorySpaceMappingAttr, which seems to have been added for the purpose of transforms but does not lower any further.

guraypp accepted this revision. · Aug 8 2023, 1:58 AM

It is really nice that we can use linalg.copy for Hopper's TMA load. Thanks for working on this.
Looks clear to me.

mlir/lib/Dialect/NVGPU/TransformOps/NVGPUTransformOps.cpp
955

Nice, this is way better.

This revision is now accepted and ready to land. · Aug 8 2023, 1:58 AM