This is an archive of the discontinued LLVM Phabricator instance.

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
657	What is the proper way to replace the magic constant in the GPU -> LLVM conversion?
mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp
206	What is the proper way to replace the magic constant in the GPU -> NVVM conversion?

guraypp added inline comments. Jul 28 2023, 7:30 AM

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
657	I put `isSharedMemoryAddressSpace` in nvgpu dialect, not sure you can use it here.
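
One way to read the suggestion above is to give the constant a name and route the check through a predicate. The sketch below is standalone C++ that only models the idea. Everything in namespace `sketch` is a hypothetical stand-in rather than the nvgpu dialect's actual declarations, and the value 3 is simply the one hardcoded in this diff.

```cpp
namespace sketch {
// Assumption: 3 is the shared memory address space, as hardcoded in the diff.
constexpr unsigned kSharedMemoryAddressSpace = 3;

// Hypothetical predicate operating on a plain integer memory space.
constexpr bool isSharedMemoryAddressSpace(unsigned memorySpace) {
  return memorySpace == kSharedMemoryAddressSpace;
}

// Same decision as getIndexTypeForMemRef in the diff, expressed as a bitwidth:
// 32 bits for shared memory, 64 bits for everything else.
constexpr unsigned indexBitwidthForMemorySpace(unsigned memorySpace) {
  return isSharedMemoryAddressSpace(memorySpace) ? 32 : 64;
}
} // namespace sketch
```

With a named predicate, the two copies of the helper in GPUToLLVMConversion.cpp and LowerGpuOpsToNVVMOps.cpp would at least share one definition of what "shared memory" means.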

Harbormaster completed remote builds in B248845: Diff 545151. Jul 28 2023, 8:20 AM

nicolasvasilache added reviewers: mehdi_amini, qcolombet, kerrmudgeon, guraypp. Jul 29 2023, 3:01 AM

Update after debugging with --mlir-print-ir-after-all and ensuring all intermediate IRs are valid.

Harbormaster completed remote builds in B249148: Diff 545569. Jul 31 2023, 4:33 AM

qcolombet added inline comments. Aug 7 2023, 5:10 AM

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
659	The size of the address space pointer is technically carried by the datalayout in LLVM. For NVPTX in particular, this is controlled by `nvptx-short-ptr` (https://github.com/llvm/llvm-project/blob/0b17e9d2859acfec2cf757472f3822f6b5aad020/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp#L60). So we may need to query the datalayout here (no idea if that's even feasible) instead of hardcoding this.

guraypp added inline comments. Aug 7 2023, 6:14 AM

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
659	I highlighted this flag earlier in another PR. Let's make sure it is set in MLIR; otherwise LLVM promotes 32-bit registers to 64-bit no matter what we do in MLIR. Even so, keeping 32-bit in MLIR, as this patch does, has advantages: for instance, it is crucial when generating assembly or PTX directly.
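
To make the datalayout point concrete, here is a minimal standalone sketch that reads the pointer width for one address space out of an LLVM data layout string. It does not use `llvm::DataLayout` or MLIR's `DataLayout`; `parseUnsigned`, `pointerSizeInBits`, and `exampleUsage` are hypothetical helpers, and only the `p[n]:<size>:...` component is understood. The layout string in the example is an assumption, written to resemble what NVPTX produces when `nvptx-short-ptr` is enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <sstream>
#include <string>

// Parses a short non-negative decimal number; returns nullopt otherwise.
std::optional<unsigned> parseUnsigned(const std::string &text) {
  if (text.empty() || text.size() > 9)
    return std::nullopt;
  unsigned value = 0;
  for (char c : text) {
    if (c < '0' || c > '9')
      return std::nullopt;
    value = value * 10 + static_cast<unsigned>(c - '0');
  }
  return value;
}

// Returns the pointer size in bits that `dataLayout` declares explicitly for
// `addressSpace`, or nullopt when there is no such `p[n]:<size>` entry.
std::optional<unsigned> pointerSizeInBits(const std::string &dataLayout,
                                          unsigned addressSpace) {
  std::istringstream specs(dataLayout);
  std::string spec;
  while (std::getline(specs, spec, '-')) {
    if (spec.empty() || spec[0] != 'p')
      continue;
    std::size_t firstColon = spec.find(':');
    if (firstColon == std::string::npos)
      continue;
    // A bare "p" (no number) refers to address space 0.
    std::string spaceText = spec.substr(1, firstColon - 1);
    std::optional<unsigned> space = spaceText.empty()
                                        ? std::optional<unsigned>(0u)
                                        : parseUnsigned(spaceText);
    if (!space || *space != addressSpace)
      continue;
    std::size_t secondColon = spec.find(':', firstColon + 1);
    std::size_t length = secondColon == std::string::npos
                             ? std::string::npos
                             : secondColon - firstColon - 1;
    return parseUnsigned(spec.substr(firstColon + 1, length));
  }
  return std::nullopt;
}

// Usage: shared memory (address space 3) is 32-bit in this assumed layout,
// while address space 1 has no explicit entry.
void exampleUsage() {
  const std::string shortPtr =
      "e-p3:32:32-p4:32:32-p5:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64";
  assert(pointerSizeInBits(shortPtr, 3) == 32u);
  assert(!pointerSizeInBits(shortPtr, 1).has_value());
}
```

A conversion that queried the layout this way would fall back to its default index width whenever no explicit entry exists, instead of assuming 32 bits for address space 3 unconditionally.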

Revision Contents

Path	Size
mlir/include/mlir/Conversion/LLVMCommon/LoweringOptions.h	4 lines
mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp	8 lines
mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp	8 lines
mlir/lib/Conversion/LLVMCommon/TypeConverter.cpp	4 lines
mlir/lib/Conversion/MemRefToLLVM/CMakeLists.txt	1 line
mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp	154 lines
mlir/test/Integration/GPU/CUDA/sm90/tmaload.mlir	16 lines
mlir/test/lib/Dialect/GPU/TestLowerToNVVM.cpp	55 lines

Diff 545569

mlir/include/mlir/Conversion/LLVMCommon/LoweringOptions.h

//===- LoweringOptions.h - Common config for lowering to LLVM ---- C++ --===//		//===- LoweringOptions.h - Common config for lowering to LLVM ---- C++ --===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
//		//
// Provides a configuration shared by several conversions targeting the LLVM		// Provides a configuration shared by several conversions targeting the LLVM
// dialect.		// dialect.
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#ifndef MLIR_CONVERSION_LLVMCOMMON_LOWERINGOPTIONS_H		#ifndef MLIR_CONVERSION_LLVMCOMMON_LOWERINGOPTIONS_H
#define MLIR_CONVERSION_LLVMCOMMON_LOWERINGOPTIONS_H		#define MLIR_CONVERSION_LLVMCOMMON_LOWERINGOPTIONS_H

		#include "mlir/IR/BuiltinTypes.h"
#include "llvm/IR/DataLayout.h"		#include "llvm/IR/DataLayout.h"

namespace mlir {		namespace mlir {

class DataLayout;		class DataLayout;
class MLIRContext;		class MLIRContext;

/// Value to pass as bitwidth for the index type when the converter is expected		/// Value to pass as bitwidth for the index type when the converter is expected
[36 lines elided]	void overrideIndexBitwidth(unsigned bitwidth) {
assert(bitwidth != kDeriveIndexBitwidthFromDataLayout &&		assert(bitwidth != kDeriveIndexBitwidthFromDataLayout &&
"can only override to a concrete bitwidth");		"can only override to a concrete bitwidth");
indexBitwidth = bitwidth;		indexBitwidth = bitwidth;
}		}

/// Get the index bitwidth.		/// Get the index bitwidth.
unsigned getIndexBitwidth() const { return indexBitwidth; }		unsigned getIndexBitwidth() const { return indexBitwidth; }

		/// Hook to customize the conversion of MemRefType to LLVMType.
		llvm::function_ref<Type(MemRefType)> memrefIndexTypeConverter = nullptr;

private:		private:
unsigned indexBitwidth;		unsigned indexBitwidth;
};		};

} // namespace mlir		} // namespace mlir

#endif // MLIR_CONVERSION_LLVMCOMMON_LOWERINGOPTIONS_H		#endif // MLIR_CONVERSION_LLVMCOMMON_LOWERINGOPTIONS_H

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp

	[647 lines elided]
	private:			private:
	LogicalResult			LogicalResult
	matchAndRewrite(gpu::SDDMMOp op, OpAdaptor adaptor,			matchAndRewrite(gpu::SDDMMOp op, OpAdaptor adaptor,
	ConversionPatternRewriter &rewriter) const override;			ConversionPatternRewriter &rewriter) const override;
	};			};

	} // namespace			} // namespace

				static IntegerType getIndexTypeForMemRef(MemRefType type) {
				if (type.getMemorySpaceAsInt() == 3)
				[inline comment by nicolasvasilache (author), marked done] what is the proper way to replace the magic constant at the GPU -> LLVM conversion ?
				[inline comment by guraypp] I put `isSharedMemoryAddressSpace` in nvgpu dialect, not sure you can use it here.
				// nvgpu::NVGPUDialect::kSharedMemoryAddressSpace)
				return IntegerType::get(type.getContext(), 32);
				[inline comment by qcolombet] The size of the address space pointer is technically carried by the datalayout in LLVM. For NVPTX in particular, this is controlled by `nvptx-short-ptr` (https://github.com/llvm/llvm-project/blob/0b17e9d2859acfec2cf757472f3822f6b5aad020/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp#L60). So we may need to query the datalayout here (no idea if that's even feasible) instead of hardcoding this.
				[inline comment by guraypp] I previously highlighted this flag in another PR. Let's ensure its setting in MLIR, otherwise LLVM promotes 32-bit registers to 64-bit no matter what we do in MLIR. Nevertheless, having 32-bit in MLIR, as in this work, offers advantages. For instance, when generating assembly or PTX directly, maintaining 32-bit is crucial.
				return IntegerType::get(type.getContext(), 64);
				}

	void GpuToLLVMConversionPass::runOnOperation() {			void GpuToLLVMConversionPass::runOnOperation() {
	LowerToLLVMOptions options(&getContext());			LowerToLLVMOptions options(&getContext());
	options.useOpaquePointers = useOpaquePointers;			options.useOpaquePointers = useOpaquePointers;
	options.useBarePtrCallConv = hostBarePtrCallConv;			options.useBarePtrCallConv = hostBarePtrCallConv;
				options.memrefIndexTypeConverter = getIndexTypeForMemRef;

	LLVMTypeConverter converter(&getContext(), options);			LLVMTypeConverter converter(&getContext(), options);
	RewritePatternSet patterns(&getContext());			RewritePatternSet patterns(&getContext());
	LLVMConversionTarget target(getContext());			LLVMConversionTarget target(getContext());

	target.addIllegalDialect<gpu::GPUDialect>();			target.addIllegalDialect<gpu::GPUDialect>();

	mlir::arith::populateArithToLLVMConversionPatterns(converter, patterns);			mlir::arith::populateArithToLLVMConversionPatterns(converter, patterns);
	[1,135 lines elided]

mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp

[196 lines elided]	matchAndRewrite(gpu::LaneIdOp op, gpu::LaneIdOp::Adaptor adaptor,
rewriter.replaceOp(op, {newOp});		rewriter.replaceOp(op, {newOp});
return success();		return success();
}		}
};		};

/// Import the GPU Ops to NVVM Patterns.		/// Import the GPU Ops to NVVM Patterns.
#include "GPUToNVVM.cpp.inc"		#include "GPUToNVVM.cpp.inc"

		static IntegerType getIndexTypeForMemRef(MemRefType type) {
		if (type.getMemorySpaceAsInt() == 3)
		[inline comment by nicolasvasilache (author), marked done] what is the proper way to replace the magic constant at the GPU -> NVVM conversion ?
		// nvgpu::NVGPUDialect::kSharedMemoryAddressSpace)
		return IntegerType::get(type.getContext(), 32);
		return IntegerType::get(type.getContext(), 64);
		}

/// A pass that replaces all occurrences of GPU device operations with their		/// A pass that replaces all occurrences of GPU device operations with their
/// corresponding NVVM equivalent.		/// corresponding NVVM equivalent.
///		///
/// This pass only handles device code and is not meant to be run on GPU host		/// This pass only handles device code and is not meant to be run on GPU host
/// code.		/// code.
struct LowerGpuOpsToNVVMOpsPass		struct LowerGpuOpsToNVVMOpsPass
: public impl::ConvertGpuOpsToNVVMOpsBase<LowerGpuOpsToNVVMOpsPass> {		: public impl::ConvertGpuOpsToNVVMOpsBase<LowerGpuOpsToNVVMOpsPass> {
LowerGpuOpsToNVVMOpsPass() = default;		LowerGpuOpsToNVVMOpsPass() = default;
[14 lines elided]	void runOnOperation() override {
// Customize the bitwidth used for the device side index computations.		// Customize the bitwidth used for the device side index computations.
LowerToLLVMOptions options(		LowerToLLVMOptions options(
m.getContext(),		m.getContext(),
DataLayout(cast<DataLayoutOpInterface>(m.getOperation())));		DataLayout(cast<DataLayoutOpInterface>(m.getOperation())));
if (indexBitwidth != kDeriveIndexBitwidthFromDataLayout)		if (indexBitwidth != kDeriveIndexBitwidthFromDataLayout)
options.overrideIndexBitwidth(indexBitwidth);		options.overrideIndexBitwidth(indexBitwidth);
options.useOpaquePointers = useOpaquePointers;		options.useOpaquePointers = useOpaquePointers;
options.useBarePtrCallConv = useBarePtrCallConv;		options.useBarePtrCallConv = useBarePtrCallConv;
		options.memrefIndexTypeConverter = getIndexTypeForMemRef;

// Apply in-dialect lowering. In-dialect lowering will replace		// Apply in-dialect lowering. In-dialect lowering will replace
// ops which need to be lowered further, which is not supported by a		// ops which need to be lowered further, which is not supported by a
// single conversion pass.		// single conversion pass.
{		{
RewritePatternSet patterns(m.getContext());		RewritePatternSet patterns(m.getContext());
populateGpuRewritePatterns(patterns);		populateGpuRewritePatterns(patterns);
if (failed(applyPatternsAndFoldGreedily(m, std::move(patterns))))		if (failed(applyPatternsAndFoldGreedily(m, std::move(patterns))))
[139 lines elided]

mlir/lib/Conversion/LLVMCommon/TypeConverter.cpp

[333 lines elided]	emitError(UnknownLoc::get(type.getContext()),
"conversion of memref memory space ")		"conversion of memref memory space ")
<< type.getMemorySpace()		<< type.getMemorySpace()
<< " to integer address space "		<< " to integer address space "
"failed. Consider adding memory space conversions.";		"failed. Consider adding memory space conversions.";
return {};		return {};
}		}
auto ptrTy = getPointerType(elementType, *addressSpace);		auto ptrTy = getPointerType(elementType, *addressSpace);

auto indexTy = getIndexType();		auto indexTy = options.memrefIndexTypeConverter
		? options.memrefIndexTypeConverter(type)
		: getIndexType();

SmallVector<Type, 5> results = {ptrTy, ptrTy, indexTy};		SmallVector<Type, 5> results = {ptrTy, ptrTy, indexTy};
auto rank = type.getRank();		auto rank = type.getRank();
if (rank == 0)		if (rank == 0)
return results;		return results;

if (unpackAggregates)		if (unpackAggregates)
results.insert(results.end(), 2 * rank, indexTy);		results.insert(results.end(), 2 * rank, indexTy);
[300 lines elided]
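
The TypeConverter.cpp change above follows an "optional hook with fallback" pattern: use the per-memref callback when one is installed, otherwise keep the default index type. The following is a simplified standalone sketch of that pattern, not the MLIR code. `MemRefInfo`, `LoweringOptions`, `descriptorIndexBitwidth`, and `exampleUsage` are hypothetical stand-ins, bitwidths replace MLIR types, and `std::function` is swapped in for the non-owning `llvm::function_ref` used in the diff.

```cpp
#include <cassert>
#include <functional>

namespace sketch {
// Hypothetical stand-in for MemRefType: only the memory space matters here.
struct MemRefInfo {
  unsigned memorySpace;
};

// Hypothetical stand-in for LowerToLLVMOptions.
struct LoweringOptions {
  unsigned indexBitwidth = 64;
  // Optional hook; left empty to mean "use indexBitwidth for every memref".
  std::function<unsigned(const MemRefInfo &)> memrefIndexBitwidth;
};

// Mirrors the ternary added to TypeConverter.cpp: prefer the hook when it is
// set, otherwise fall back to the default index bitwidth.
unsigned descriptorIndexBitwidth(const LoweringOptions &options,
                                 const MemRefInfo &type) {
  return options.memrefIndexBitwidth ? options.memrefIndexBitwidth(type)
                                     : options.indexBitwidth;
}

// Usage: without a hook every memref gets 64-bit indices; with the hook from
// this diff's policy, memory space 3 gets 32-bit indices.
void exampleUsage() {
  LoweringOptions options;
  assert(descriptorIndexBitwidth(options, MemRefInfo{3}) == 64u);
  options.memrefIndexBitwidth = [](const MemRefInfo &type) {
    return type.memorySpace == 3 ? 32u : 64u;
  };
  assert(descriptorIndexBitwidth(options, MemRefInfo{3}) == 32u);
  assert(descriptorIndexBitwidth(options, MemRefInfo{1}) == 64u);
}
} // namespace sketch
```

Because the fallback is the old behavior, conversions that never set the hook are unaffected, which matches how only the GPU passes in this revision opt in.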

mlir/lib/Conversion/MemRefToLLVM/CMakeLists.txt

	add_mlir_conversion_library(MLIRMemRefToLLVM			add_mlir_conversion_library(MLIRMemRefToLLVM
	AllocLikeConversion.cpp			AllocLikeConversion.cpp
	MemRefToLLVM.cpp			MemRefToLLVM.cpp

	ADDITIONAL_HEADER_DIRS			ADDITIONAL_HEADER_DIRS
	${MLIR_MAIN_INCLUDE_DIR}/mlir/Conversion/MemRefToLLVM			${MLIR_MAIN_INCLUDE_DIR}/mlir/Conversion/MemRefToLLVM

	DEPENDS			DEPENDS
	MLIRConversionPassIncGen			MLIRConversionPassIncGen

	LINK_COMPONENTS			LINK_COMPONENTS
	Core			Core

	LINK_LIBS PUBLIC			LINK_LIBS PUBLIC
	MLIRAnalysis			MLIRAnalysis
	MLIRDataLayoutInterfaces			MLIRDataLayoutInterfaces
				MLIRIndexDialect
	MLIRLLVMCommonConversion			MLIRLLVMCommonConversion
	MLIRMemRefDialect			MLIRMemRefDialect
	MLIRMemRefUtils			MLIRMemRefUtils
	MLIRLLVMDialect			MLIRLLVMDialect
	MLIRTransforms			MLIRTransforms
	)			)

mlir/lib/Conversion/NVGPUToNVVM/NVGPUToNVVM.cpp

//===- NVGPUToNVVM.cpp - NVGPU to NVVM dialect conversion -----------------===//		//===- NVGPUToNVVM.cpp - NVGPU to NVVM dialect conversion -----------------===//
//		//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.		// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.		// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception		// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//		//
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

#include "mlir/Conversion/NVGPUToNVVM/NVGPUToNVVM.h"		#include "mlir/Conversion/NVGPUToNVVM/NVGPUToNVVM.h"

#include "mlir/Conversion/GPUCommon/GPUCommonPass.h"		#include "mlir/Conversion/GPUCommon/GPUCommonPass.h"
#include "mlir/Conversion/LLVMCommon/ConversionTarget.h"		#include "mlir/Conversion/LLVMCommon/ConversionTarget.h"
#include "mlir/Conversion/LLVMCommon/Pattern.h"		#include "mlir/Conversion/LLVMCommon/Pattern.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"		#include "mlir/Dialect/GPU/IR/GPUDialect.h"
		#include "mlir/Dialect/Index/IR/IndexDialect.h"
		#include "mlir/Dialect/Index/IR/IndexOps.h"
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"		#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "mlir/Dialect/LLVMIR/LLVMTypes.h"		#include "mlir/Dialect/LLVMIR/LLVMTypes.h"
#include "mlir/Dialect/LLVMIR/NVVMDialect.h"		#include "mlir/Dialect/LLVMIR/NVVMDialect.h"
#include "mlir/Dialect/MemRef/IR/MemRef.h"		#include "mlir/Dialect/MemRef/IR/MemRef.h"
#include "mlir/Dialect/NVGPU/IR/NVGPUDialect.h"		#include "mlir/Dialect/NVGPU/IR/NVGPUDialect.h"
		#include "mlir/IR/BuiltinTypes.h"
#include "mlir/IR/PatternMatch.h"		#include "mlir/IR/PatternMatch.h"
#include "mlir/IR/TypeUtilities.h"		#include "mlir/IR/TypeUtilities.h"
#include "mlir/Pass/Pass.h"		#include "mlir/Pass/Pass.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"

namespace mlir {		namespace mlir {
#define GEN_PASS_DEF_CONVERTNVGPUTONVVMPASS		#define GEN_PASS_DEF_CONVERTNVGPUTONVVMPASS
#include "mlir/Conversion/Passes.h.inc"		#include "mlir/Conversion/Passes.h.inc"
} // namespace mlir		} // namespace mlir

using namespace mlir;		using namespace mlir;

/// GPU has 32 bit registers, this function truncates values when larger width		/// GPU has 32 bit registers, this function truncates values when larger
/// is not needed.		/// width is not needed.
static Value truncToI32(ConversionPatternRewriter &rewriter, Location loc,		static Value truncToI32(ConversionPatternRewriter &rewriter, Location loc,
Value value) {		Value value) {
Type type = value.getType();		Type type = value.getType();
		if (llvm::isa<IndexType>(type))
		return rewriter.create<index::CastSOp>(loc, rewriter.getI32Type(), value);

assert(llvm::isa<IntegerType>(type) && "expected an integer Value");		assert(llvm::isa<IntegerType>(type) && "expected an integer Value");
if (type.getIntOrFloatBitWidth() <= 32)		if (type.getIntOrFloatBitWidth() <= 32)
return value;		return value;
return rewriter.create<LLVM::TruncOp>(loc, rewriter.getI32Type(), value);		// Avoid direct use of LLVM and instead roundtrip through index dialect which
		// connects things properly.
		Value index =
		rewriter.create<index::CastSOp>(loc, rewriter.getIndexType(), value);
		return rewriter.create<index::CastSOp>(loc, rewriter.getI32Type(), index);
}		}

/// Returns the type for the intrinsic given the vectorResultType of the		/// Returns the type for the intrinsic given the vectorResultType of the
/// `gpu.mma.sync` operation.		/// `gpu.mma.sync` operation.
static Type inferIntrinsicResultType(Type vectorResultType) {		static Type inferIntrinsicResultType(Type vectorResultType) {
MLIRContext *ctx = vectorResultType.getContext();		MLIRContext *ctx = vectorResultType.getContext();
auto a = cast<LLVM::LLVMArrayType>(vectorResultType);		auto a = cast<LLVM::LLVMArrayType>(vectorResultType);
auto f16x2Ty = LLVM::getFixedVectorType(Float16Type::get(ctx), 2);		auto f16x2Ty = LLVM::getFixedVectorType(Float16Type::get(ctx), 2);
[43 lines elided]	static Value convertIntrinsicResult(Location loc, Type intrinsicResultType,
Type f64Ty = rewriter.getF64Type();		Type f64Ty = rewriter.getF64Type();
Type f16x2Ty = LLVM::getFixedVectorType(rewriter.getF16Type(), 2);		Type f16x2Ty = LLVM::getFixedVectorType(rewriter.getF16Type(), 2);
Type i32x2Ty = LLVM::getFixedVectorType(i32Ty, 2);		Type i32x2Ty = LLVM::getFixedVectorType(i32Ty, 2);
Type f64x2Ty = LLVM::getFixedVectorType(f64Ty, 2);		Type f64x2Ty = LLVM::getFixedVectorType(f64Ty, 2);
Type f32x2Ty = LLVM::getFixedVectorType(f32Ty, 2);		Type f32x2Ty = LLVM::getFixedVectorType(f32Ty, 2);
Type f32x1Ty = LLVM::getFixedVectorType(f32Ty, 1);		Type f32x1Ty = LLVM::getFixedVectorType(f32Ty, 1);

auto makeConst = [&](int32_t index) -> Value {		auto makeConst = [&](int32_t index) -> Value {
return rewriter.create<LLVM::ConstantOp>(loc, IntegerType::get(ctx, 32),		return rewriter.create<index::ConstantOp>(loc,
rewriter.getI32IntegerAttr(index));		rewriter.getIndexAttr(index));
};		};

if (arrayType) {		if (arrayType) {
SmallVector<Value, 4> elements;		SmallVector<Value, 4> elements;

// The intrinsic returns 32-bit wide elements in a form which can be		// The intrinsic returns 32-bit wide elements in a form which can be
// directly bitcasted and inserted into the result vector.		// directly bitcasted and inserted into the result vector.
if (arrayType.getElementType() == f16x2Ty ||		if (arrayType.getElementType() == f16x2Ty ||
[81 lines elided]	for (unsigned i = 0, e = arrayTy.getNumElements(); i < e; ++i) {
VectorType innerArrayTy = dyn_cast<VectorType>(arrayTy.getElementType());		VectorType innerArrayTy = dyn_cast<VectorType>(arrayTy.getElementType());
if (innerArrayTy && (innerArrayTy.getElementType() == i32Ty ||		if (innerArrayTy && (innerArrayTy.getElementType() == i32Ty ||
innerArrayTy.getElementType() == f64Ty ||		innerArrayTy.getElementType() == f64Ty ||
innerArrayTy.getElementType() == f32Ty)) {		innerArrayTy.getElementType() == f32Ty)) {
for (unsigned idx = 0, innerSize = innerArrayTy.getNumElements();		for (unsigned idx = 0, innerSize = innerArrayTy.getNumElements();
idx < innerSize; idx++) {		idx < innerSize; idx++) {
result.push_back(rewriter.create<LLVM::ExtractElementOp>(		result.push_back(rewriter.create<LLVM::ExtractElementOp>(
loc, toUse,		loc, toUse,
rewriter.create<LLVM::ConstantOp>(		rewriter.create<index::ConstantOp>(loc,
loc, rewriter.getI64Type(), rewriter.getI64IntegerAttr(idx))));		rewriter.getIndexAttr(idx))));
}		}
continue;		continue;
}		}
result.push_back(toUse);		result.push_back(toUse);
}		}
return result;		return result;
}		}

[176 lines elided]	static Value getMbarrierPtr(ConversionPatternRewriter &rewriter,
TypedValue<nvgpu::MBarrierType> barrier,		TypedValue<nvgpu::MBarrierType> barrier,
Value barrierMemref) {		Value barrierMemref) {
MemRefType memrefType = createMBarrierMemrefType(rewriter, barrier.getType());		MemRefType memrefType = createMBarrierMemrefType(rewriter, barrier.getType());
MemRefDescriptor memRefDescriptor(barrierMemref);		MemRefDescriptor memRefDescriptor(barrierMemref);
return memRefDescriptor.bufferPtr(rewriter, barrier.getLoc(), typeConverter,		return memRefDescriptor.bufferPtr(rewriter, barrier.getLoc(), typeConverter,
memrefType);		memrefType);
}		}

struct ConvertNVGPUToNVVMPass
: public impl::ConvertNVGPUToNVVMPassBase<ConvertNVGPUToNVVMPass> {
using Base::Base;

void getDependentDialects(DialectRegistry &registry) const override {
registry
.insert<memref::MemRefDialect, LLVM::LLVMDialect, NVVM::NVVMDialect>();
}

void runOnOperation() override {
LowerToLLVMOptions options(&getContext());
options.useOpaquePointers = useOpaquePointers;
RewritePatternSet patterns(&getContext());
LLVMTypeConverter converter(&getContext(), options);
IRRewriter rewriter(&getContext());
/// device-side async tokens cannot be materialized in nvvm. We just
/// convert them to a dummy i32 type in order to easily drop them during
/// conversion.
converter.addConversion([&](nvgpu::DeviceAsyncTokenType type) -> Type {
return converter.convertType(IntegerType::get(type.getContext(), 32));
});
converter.addConversion([&](nvgpu::MBarrierTokenType type) -> Type {
return converter.convertType(IntegerType::get(type.getContext(), 64));
});
converter.addConversion([&](nvgpu::MBarrierType type) -> Type {
return converter.convertType(createMBarrierMemrefType(rewriter, type));
});
converter.addConversion([&](nvgpu::TensorMapDescriptorType type) -> Type {
return converter.getPointerType(type.getTensor().getElementType());
});
populateNVGPUToNVVMConversionPatterns(converter, patterns);
LLVMConversionTarget target(getContext());
target.addLegalDialect<::mlir::LLVM::LLVMDialect>();
target.addLegalDialect<::mlir::memref::MemRefDialect>();
target.addLegalDialect<::mlir::NVVM::NVVMDialect>();
if (failed(applyPartialConversion(getOperation(), target,
std::move(patterns))))
signalPassFailure();
}
};

/// Returns the constraints for the sparse MMA inline assembly instruction.		/// Returns the constraints for the sparse MMA inline assembly instruction.
static std::string buildMmaSparseAsmConstraintString(unsigned matASize,		static std::string buildMmaSparseAsmConstraintString(unsigned matASize,
unsigned matBSize,		unsigned matBSize,
unsigned matCSize) {		unsigned matCSize) {
std::string str;		std::string str;
llvm::raw_string_ostream ss(str);		llvm::raw_string_ostream ss(str);
for (unsigned i = 0; i < matCSize; i++)		for (unsigned i = 0; i < matCSize; i++)
ss << "=r,";		ss << "=r,";
[208 lines elided]	matchAndRewrite(nvgpu::DeviceAsyncCopyOp op, OpAdaptor adaptor,
// memory) to fill DstElements number of elements in the destination		// memory) to fill DstElements number of elements in the destination
// (shared memory).		// (shared memory).
Value srcBytes = adaptor.getSrcElements();		Value srcBytes = adaptor.getSrcElements();
if (srcBytes) {		if (srcBytes) {
// When the optional SrcElements argument is present, the source (global		// When the optional SrcElements argument is present, the source (global
// memory) of CpAsyncOp is read only for SrcElements number of elements.		// memory) of CpAsyncOp is read only for SrcElements number of elements.
// The rest of the DstElements in the destination (shared memory) are		// The rest of the DstElements in the destination (shared memory) are
// filled with zeros.		// filled with zeros.
Value c3I32 = rewriter.create<LLVM::ConstantOp>(		Value c3I32 =
loc, rewriter.getI32Type(), rewriter.getI32IntegerAttr(3));		rewriter.create<index::ConstantOp>(loc, rewriter.getIndexAttr(3));
Value bitwidth = rewriter.create<LLVM::ConstantOp>(		Value bitwidth = rewriter.create<index::ConstantOp>(
loc, rewriter.getI32Type(),		loc, rewriter.getIndexAttr(srcMemrefType.getElementTypeBitWidth()));
rewriter.getI32IntegerAttr(srcMemrefType.getElementTypeBitWidth()));
Value srcElementsI32 =		Value srcElementsI32 =
rewriter.create<LLVM::TruncOp>(loc, rewriter.getI32Type(), srcBytes);		rewriter.create<LLVM::TruncOp>(loc, rewriter.getI32Type(), srcBytes);
srcBytes = rewriter.create<LLVM::LShrOp>(		srcBytes = rewriter.create<LLVM::LShrOp>(
loc, rewriter.create<LLVM::MulOp>(loc, bitwidth, srcElementsI32),		loc, rewriter.create<LLVM::MulOp>(loc, bitwidth, srcElementsI32),
c3I32);		c3I32);
}		}
// Cache global (.cg) for 16 dst bytes, Cache all (.ca) for sizes other than		// Cache global (.cg) for 16 dst bytes, Cache all (.ca) for sizes other than
// 16 dst bytes.		// 16 dst bytes.
NVVM::LoadCacheModifierKind cacheModifier =		NVVM::LoadCacheModifierKind cacheModifier =
(op.getBypassL1().value_or(false) && sizeInBytes == 16)		(op.getBypassL1().value_or(false) && sizeInBytes == 16)
? NVVM::LoadCacheModifierKind::CG		? NVVM::LoadCacheModifierKind::CG
: NVVM::LoadCacheModifierKind::CA;		: NVVM::LoadCacheModifierKind::CA;

rewriter.create<NVVM::CpAsyncOp>(		rewriter.create<NVVM::CpAsyncOp>(
loc, dstPtr, scrPtr, rewriter.getI32IntegerAttr(sizeInBytes),		loc, dstPtr, scrPtr, rewriter.getI32IntegerAttr(sizeInBytes),
NVVM::LoadCacheModifierKindAttr::get(op->getContext(), cacheModifier),		NVVM::LoadCacheModifierKindAttr::get(op->getContext(), cacheModifier),
srcBytes);		srcBytes);

// Drop the result token.		// Drop the result token.
Value zero = rewriter.create<LLVM::ConstantOp>(		Value zero = rewriter.create<index::ConstantOp>(op->getLoc(),
op->getLoc(), IntegerType::get(op.getContext(), 32),		rewriter.getIndexAttr(0));
rewriter.getI32IntegerAttr(0));
rewriter.replaceOp(op, zero);		rewriter.replaceOp(op, zero);
return success();		return success();
}		}
};		};

struct NVGPUAsyncCreateGroupLowering		struct NVGPUAsyncCreateGroupLowering
: public ConvertOpToLLVMPattern<nvgpu::DeviceAsyncCreateGroupOp> {		: public ConvertOpToLLVMPattern<nvgpu::DeviceAsyncCreateGroupOp> {
using ConvertOpToLLVMPattern<		using ConvertOpToLLVMPattern<
nvgpu::DeviceAsyncCreateGroupOp>::ConvertOpToLLVMPattern;		nvgpu::DeviceAsyncCreateGroupOp>::ConvertOpToLLVMPattern;

LogicalResult		LogicalResult
matchAndRewrite(nvgpu::DeviceAsyncCreateGroupOp op, OpAdaptor adaptor,		matchAndRewrite(nvgpu::DeviceAsyncCreateGroupOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {		ConversionPatternRewriter &rewriter) const override {
rewriter.create<NVVM::CpAsyncCommitGroupOp>(op.getLoc());		rewriter.create<NVVM::CpAsyncCommitGroupOp>(op.getLoc());
// Drop the result token.		// Drop the result token.
Value zero = rewriter.create<LLVM::ConstantOp>(		Value zero = rewriter.create<index::ConstantOp>(op->getLoc(),
op->getLoc(), IntegerType::get(op.getContext(), 32),		rewriter.getIndexAttr(0));
rewriter.getI32IntegerAttr(0));
rewriter.replaceOp(op, zero);		rewriter.replaceOp(op, zero);
return success();		return success();
}		}
};		};

struct NVGPUAsyncWaitLowering		struct NVGPUAsyncWaitLowering
: public ConvertOpToLLVMPattern<nvgpu::DeviceAsyncWaitOp> {		: public ConvertOpToLLVMPattern<nvgpu::DeviceAsyncWaitOp> {
using ConvertOpToLLVMPattern<		using ConvertOpToLLVMPattern<
[59 lines elided]	struct NVGPUMBarrierInitLowering
using ConvertOpToLLVMPattern<nvgpu::MBarrierInitOp>::ConvertOpToLLVMPattern;		using ConvertOpToLLVMPattern<nvgpu::MBarrierInitOp>::ConvertOpToLLVMPattern;

LogicalResult		LogicalResult
matchAndRewrite(nvgpu::MBarrierInitOp op, OpAdaptor adaptor,		matchAndRewrite(nvgpu::MBarrierInitOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {		ConversionPatternRewriter &rewriter) const override {
rewriter.setInsertionPoint(op);		rewriter.setInsertionPoint(op);
Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(),		Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(),
op.getBarrier(), adaptor.getBarrier());		op.getBarrier(), adaptor.getBarrier());
		Value count = truncToI32(rewriter, op->getLoc(), op.getCount());
Value count = truncToI32(rewriter, op->getLoc(), adaptor.getCount());

if (isMbarrierShared(op.getBarrier().getType())) {		if (isMbarrierShared(op.getBarrier().getType())) {
rewriter.replaceOpWithNewOp<NVVM::MBarrierInitSharedOp>(op, barrier,		rewriter.replaceOpWithNewOp<NVVM::MBarrierInitSharedOp>(op, barrier,
count);		count);
} else {		} else {
rewriter.replaceOpWithNewOp<NVVM::MBarrierInitOp>(op, barrier, count);		rewriter.replaceOpWithNewOp<NVVM::MBarrierInitOp>(op, barrier, count);
}		}
return success();		return success();
[31 lines elided]	struct NVGPUMBarrierArriveNoCompleteLowering

LogicalResult		LogicalResult
matchAndRewrite(nvgpu::MBarrierArriveNoCompleteOp op, OpAdaptor adaptor,		matchAndRewrite(nvgpu::MBarrierArriveNoCompleteOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {		ConversionPatternRewriter &rewriter) const override {
Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(),		Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(),
op.getBarrier(), adaptor.getBarrier());		op.getBarrier(), adaptor.getBarrier());
Type tokenType = getTypeConverter()->convertType(		Type tokenType = getTypeConverter()->convertType(
nvgpu::MBarrierTokenType::get(op->getContext()));		nvgpu::MBarrierTokenType::get(op->getContext()));
Value count = truncToI32(rewriter, op->getLoc(), adaptor.getCount());		Value count = truncToI32(rewriter, op->getLoc(), op.getCount());
if (isMbarrierShared(op.getBarrier().getType())) {		if (isMbarrierShared(op.getBarrier().getType())) {
rewriter.replaceOpWithNewOp<NVVM::MBarrierArriveNocompleteSharedOp>(		rewriter.replaceOpWithNewOp<NVVM::MBarrierArriveNocompleteSharedOp>(
op, tokenType, barrier, count);		op, tokenType, barrier, count);
} else {		} else {
rewriter.replaceOpWithNewOp<NVVM::MBarrierArriveNocompleteOp>(		rewriter.replaceOpWithNewOp<NVVM::MBarrierArriveNocompleteOp>(
op, tokenType, barrier, count);		op, tokenType, barrier, count);
}		}
return success();		return success();
[28 lines elided]	struct NVGPUMBarrierArriveExpectTxLowering
using ConvertOpToLLVMPattern<		using ConvertOpToLLVMPattern<
nvgpu::MBarrierArriveExpectTxOp>::ConvertOpToLLVMPattern;		nvgpu::MBarrierArriveExpectTxOp>::ConvertOpToLLVMPattern;

LogicalResult		LogicalResult
matchAndRewrite(nvgpu::MBarrierArriveExpectTxOp op, OpAdaptor adaptor,		matchAndRewrite(nvgpu::MBarrierArriveExpectTxOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {		ConversionPatternRewriter &rewriter) const override {
Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(),		Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(),
op.getBarrier(), adaptor.getBarrier());		op.getBarrier(), adaptor.getBarrier());
Value txcount = truncToI32(rewriter, op->getLoc(), adaptor.getTxcount());		Value txcount = truncToI32(rewriter, op->getLoc(), op.getTxcount());

if (isMbarrierShared(op.getBarrier().getType())) {		if (isMbarrierShared(op.getBarrier().getType())) {
rewriter.replaceOpWithNewOp<NVVM::MBarrierArriveExpectTxSharedOp>(		rewriter.replaceOpWithNewOp<NVVM::MBarrierArriveExpectTxSharedOp>(
op, barrier, txcount);		op, barrier, txcount);
return success();		return success();
}		}

rewriter.replaceOpWithNewOp<NVVM::MBarrierArriveExpectTxOp>(op, barrier,		rewriter.replaceOpWithNewOp<NVVM::MBarrierArriveExpectTxOp>(op, barrier,
txcount);		txcount);
return success();		return success();
}		}
};		};

struct NVGPUMBarrierTryWaitParityLowering		struct NVGPUMBarrierTryWaitParityLowering
: public ConvertOpToLLVMPattern<nvgpu::MBarrierTryWaitParityOp> {		: public ConvertOpToLLVMPattern<nvgpu::MBarrierTryWaitParityOp> {
using ConvertOpToLLVMPattern<		using ConvertOpToLLVMPattern<
nvgpu::MBarrierTryWaitParityOp>::ConvertOpToLLVMPattern;		nvgpu::MBarrierTryWaitParityOp>::ConvertOpToLLVMPattern;

LogicalResult		LogicalResult
matchAndRewrite(nvgpu::MBarrierTryWaitParityOp op, OpAdaptor adaptor,		matchAndRewrite(nvgpu::MBarrierTryWaitParityOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {		ConversionPatternRewriter &rewriter) const override {
Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(),		Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(),
op.getBarrier(), adaptor.getBarrier());		op.getBarrier(), adaptor.getBarrier());
Value ticks = truncToI32(rewriter, op->getLoc(), adaptor.getTicks());		Value ticks = truncToI32(rewriter, op->getLoc(), op.getTicks());
Value phase = truncToI32(rewriter, op->getLoc(), adaptor.getPhase());		Value phase = truncToI32(rewriter, op->getLoc(), op.getPhase());

if (isMbarrierShared(op.getBarrier().getType())) {		if (isMbarrierShared(op.getBarrier().getType())) {
rewriter.replaceOpWithNewOp<NVVM::MBarrierTryWaitParitySharedOp>(		rewriter.replaceOpWithNewOp<NVVM::MBarrierTryWaitParitySharedOp>(
op, barrier, phase, ticks);		op, barrier, phase, ticks);
return success();		return success();
}		}

rewriter.replaceOpWithNewOp<NVVM::MBarrierTryWaitParityOp>(op, barrier,		rewriter.replaceOpWithNewOp<NVVM::MBarrierTryWaitParityOp>(op, barrier,
phase, ticks);		phase, ticks);
return success();		return success();
}		}
};		};

struct NVGPUTmaAsyncLoadOpLowering		struct NVGPUTmaAsyncLoadOpLowering
: public ConvertOpToLLVMPattern<nvgpu::TmaAsyncLoadOp> {		: public ConvertOpToLLVMPattern<nvgpu::TmaAsyncLoadOp> {
using ConvertOpToLLVMPattern<nvgpu::TmaAsyncLoadOp>::ConvertOpToLLVMPattern;		using ConvertOpToLLVMPattern<nvgpu::TmaAsyncLoadOp>::ConvertOpToLLVMPattern;
LogicalResult		LogicalResult
matchAndRewrite(nvgpu::TmaAsyncLoadOp op, OpAdaptor adaptor,		matchAndRewrite(nvgpu::TmaAsyncLoadOp op, OpAdaptor adaptor,
ConversionPatternRewriter &rewriter) const override {		ConversionPatternRewriter &rewriter) const override {
auto dest = rewriter.create<LLVM::ExtractValueOp>(op->getLoc(),		auto dest = rewriter.create<LLVM::ExtractValueOp>(op->getLoc(),
adaptor.getDst(), 1);		adaptor.getDst(), 1);
Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(),		Value barrier = getMbarrierPtr(rewriter, *getTypeConverter(),
op.getBarrier(), adaptor.getBarrier());		op.getBarrier(), adaptor.getBarrier());

SmallVector<Value> coords = adaptor.getCoordinates();		SmallVector<Value> coords = op.getCoordinates();
for (auto [index, value] : llvm::enumerate(coords)) {		for (auto [index, value] : llvm::enumerate(coords)) {
coords[index] = truncToI32(rewriter, op->getLoc(), value);		coords[index] = truncToI32(rewriter, op->getLoc(), value);
}		}

rewriter.replaceOpWithNewOp<NVVM::CpAsyncBulkTensorGlobalToSharedClusterOp>(		rewriter.replaceOpWithNewOp<NVVM::CpAsyncBulkTensorGlobalToSharedClusterOp>(
op, dest, adaptor.getTensorMapDescriptor(), barrier, coords);		op, dest, adaptor.getTensorMapDescriptor(), barrier, coords);
return success();		return success();
}		}
};		};

		/// Create an i64 LLVM constant value. This should only be used with unambiguous
		/// sink operations where we know for a fact the underlying LLVM will precisely
		/// want i64.
static Value makeI64Const(RewriterBase &rewriter, Operation *op,		static Value makeI64Const(RewriterBase &rewriter, Operation *op,
int32_t index) {		int32_t index) {
return rewriter.create<LLVM::ConstantOp>(op->getLoc(),		return rewriter.create<LLVM::ConstantOp>(op->getLoc(),
rewriter.getIntegerType(64),		rewriter.getIntegerType(64),
rewriter.getI32IntegerAttr(index));		rewriter.getI32IntegerAttr(index));
}		}

/// Returns a Value that holds data type enum that is expected by CUDA driver.		/// Returns a Value that holds data type enum that is expected by CUDA driver.
[... 117 lines not shown ...]	patterns.add<
NVGPUTmaAsyncLoadOpLowering, // nvgpu.tma.async.load		NVGPUTmaAsyncLoadOpLowering, // nvgpu.tma.async.load
NVGPUTmaCreateDescriptorOpLowering, // nvgpu.tma.create.descriptor		NVGPUTmaCreateDescriptorOpLowering, // nvgpu.tma.create.descriptor
NVGPUMBarrierArriveExpectTxLowering, // nvgpu.mbarrier.arrive.expect_tx		NVGPUMBarrierArriveExpectTxLowering, // nvgpu.mbarrier.arrive.expect_tx
NVGPUTmaAsyncLoadOpLowering, // nvgpu.tma.async.load		NVGPUTmaAsyncLoadOpLowering, // nvgpu.tma.async.load
MmaSyncOptoNVVM, MmaLdMatrixOpToNVVM, NVGPUAsyncCopyLowering,		MmaSyncOptoNVVM, MmaLdMatrixOpToNVVM, NVGPUAsyncCopyLowering,
NVGPUAsyncCreateGroupLowering, NVGPUAsyncWaitLowering,		NVGPUAsyncCreateGroupLowering, NVGPUAsyncWaitLowering,
NVGPUMmaSparseSyncLowering>(converter);		NVGPUMmaSparseSyncLowering>(converter);
}		}

		static IntegerType getIndexTypeForMemRef(MemRefType type) {
		if (type.getMemorySpaceAsInt() ==
		nvgpu::NVGPUDialect::kSharedMemoryAddressSpace)
		return IntegerType::get(type.getContext(), 32);
		return IntegerType::get(type.getContext(), 64);
		}

		namespace {

		struct ConvertNVGPUToNVVMPass
		: public impl::ConvertNVGPUToNVVMPassBase<ConvertNVGPUToNVVMPass> {
		using Base::Base;

		void getDependentDialects(DialectRegistry &registry) const override {
		registry.insert<index::IndexDialect, memref::MemRefDialect,
		LLVM::LLVMDialect, NVVM::NVVMDialect>();
		}

		void runOnOperation() override {
		LowerToLLVMOptions options(&getContext());
		options.useOpaquePointers = useOpaquePointers;
		options.memrefIndexTypeConverter = getIndexTypeForMemRef;
		RewritePatternSet patterns(&getContext());
		LLVMTypeConverter converter(&getContext(), options);
		IRRewriter rewriter(&getContext());
		/// device-side async tokens cannot be materialized in nvvm. We just
		/// convert them to a dummy i32 type in order to easily drop them during
		/// conversion.
		converter.addConversion([&](nvgpu::DeviceAsyncTokenType type) -> Type {
		return converter.convertType(IntegerType::get(type.getContext(), 32));
		});
		converter.addConversion([&](nvgpu::MBarrierTokenType type) -> Type {
		return converter.convertType(IntegerType::get(type.getContext(), 64));
		});
		converter.addConversion([&](nvgpu::MBarrierType type) -> Type {
		return converter.convertType(createMBarrierMemrefType(rewriter, type));
		});
		converter.addConversion([&](nvgpu::TensorMapDescriptorType type) -> Type {
		return converter.getPointerType(type.getTensor().getElementType());
		});
		populateNVGPUToNVVMConversionPatterns(converter, patterns);
		LLVMConversionTarget target(getContext());
		target.addLegalDialect<::mlir::index::IndexDialect>();
		target.addLegalDialect<::mlir::LLVM::LLVMDialect>();
		target.addLegalDialect<::mlir::memref::MemRefDialect>();
		target.addLegalDialect<::mlir::NVVM::NVVMDialect>();
		if (failed(applyPartialConversion(getOperation(), target,
		std::move(patterns))))
		signalPassFailure();
		}
		};

		} // namespace
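
The `getIndexTypeForMemRef` hook added above picks the memref index width from the memory space: 32 bits for shared memory, 64 bits otherwise. The following is a minimal standalone sketch of that selection rule, not the patch's code. `indexBitwidthForMemorySpace` is a hypothetical helper that returns a plain bitwidth so it compiles without MLIR, and the value 3 for the shared memory space is an assumption (it matches the `memref<8x128xf32, 3>` buffers in tmaload.mlir). As noted in the review, the real pointer width is carried by the LLVM datalayout (see `nvptx-short-ptr`), so the hardcoded 32 here only mirrors this diff.

```cpp
// Assumption: NVGPU shared memory corresponds to memref memory space 3,
// consistent with the memref<8x128xf32, 3> buffers used in tmaload.mlir.
constexpr unsigned kSharedMemoryAddressSpace = 3;

// Hypothetical stand-in for getIndexTypeForMemRef: returns the index
// bitwidth rather than an mlir::IntegerType so it builds without MLIR.
constexpr unsigned indexBitwidthForMemorySpace(unsigned memorySpace) {
  // Shared-memory memrefs get 32-bit indices; every other space gets 64-bit.
  return memorySpace == kSharedMemoryAddressSpace ? 32u : 64u;
}

// Compile-time checks of the rule described above.
static_assert(indexBitwidthForMemorySpace(3) == 32,
              "shared memory uses 32-bit indices");
static_assert(indexBitwidthForMemorySpace(0) == 64,
              "default memory space uses 64-bit indices");
static_assert(indexBitwidthForMemorySpace(1) == 64,
              "non-shared spaces use 64-bit indices");
```

In the patch itself the rule is installed through `options.memrefIndexTypeConverter`, so the type converter applies it per memref type rather than through one global index bitwidth.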

mlir/test/Integration/GPU/CUDA/sm90/tmaload.mlir

// RUN: mlir-opt %s --convert-nvgpu-to-nvvm -gpu-kernel-outlining \		// RUN: mlir-opt %s
// RUN: -convert-scf-to-cf -convert-nvvm-to-llvm \		// RUN: -test-lower-to-nvvm="kernel-index-bitwidth=32 cubin-chip=sm_90 cubin-features=+ptx80 dump-ptx"
// RUN: -convert-vector-to-llvm \
// RUN: -convert-math-to-llvm \
// RUN: -expand-strided-metadata \
// RUN: -lower-affine \
// RUN: -convert-index-to-llvm=index-bitwidth=32 \
// RUN: -convert-arith-to-llvm \
// RUN: -finalize-memref-to-llvm \
// RUN: -convert-func-to-llvm \
// RUN: -canonicalize \
// RUN: | mlir-opt -pass-pipeline='builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-nvvm,convert-nvgpu-to-nvvm{use-opaque-pointers=1},lower-affine,convert-scf-to-cf,convert-vector-to-llvm,convert-math-to-llvm,expand-strided-metadata,lower-affine,convert-index-to-llvm{index-bitwidth=32},convert-arith-to-llvm,reconcile-unrealized-casts,gpu-to-cubin{chip=sm_90 features=+ptx80 dump-ptx}))' \
// RUN: 2&>1 | FileCheck %s --check-prefixes=CHECK-PTX		// RUN: 2&>1 | FileCheck %s --check-prefixes=CHECK-PTX

// CHECK-PTX: mbarrier.init.shared.b64		// CHECK-PTX: mbarrier.init.shared.b64
// CHECK-PTX: mbarrier.arrive.expect_tx.shared.b64		// CHECK-PTX: mbarrier.arrive.expect_tx.shared.b64
// CHECK-PTX: cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes		// CHECK-PTX: cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes
// CHECK-PTX: cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes		// CHECK-PTX: cp.async.bulk.tensor.2d.shared::cluster.global.mbarrier::complete_tx::bytes
// CHECK-PTX: mbarrier.arrive.expect_tx.shared.b64		// CHECK-PTX: mbarrier.arrive.expect_tx.shared.b64
// CHECK-PTX: mbarrier.try_wait.parity.shared.b64		// CHECK-PTX: mbarrier.try_wait.parity.shared.b64
[... 61 lines not shown ...]	gpu.launch blocks(%arg0, %arg1, %arg2) in (%arg6 = %c1, %arg7 = %c1, %arg8 = %c1) threads(%arg3, %arg4, %arg5) in (%arg9 = %c128, %arg10 = %c1, %arg11 = %c1) {
%12 = memref.load %8[%c7, %c0] : memref<8x128xf32, 3>		%12 = memref.load %8[%c7, %c0] : memref<8x128xf32, 3>
gpu.printf "[GPU] TMA LOADED lhs[45][7] %f\0A" %11 : f32		gpu.printf "[GPU] TMA LOADED lhs[45][7] %f\0A" %11 : f32
gpu.printf "[GPU] TMA LOADED rhs[7][0] %f\0A" %12 : f32		gpu.printf "[GPU] TMA LOADED rhs[7][0] %f\0A" %12 : f32
}		}
gpu.terminator		gpu.terminator
}		}
return		return
}		}
}		}
No newline at end of file

mlir/test/lib/Dialect/GPU/TestLowerToNVVM.cpp

[... 14 lines not shown ...]
#include "mlir/Conversion/ArithToLLVM/ArithToLLVM.h"		#include "mlir/Conversion/ArithToLLVM/ArithToLLVM.h"
#include "mlir/Conversion/FuncToLLVM/ConvertFuncToLLVMPass.h"		#include "mlir/Conversion/FuncToLLVM/ConvertFuncToLLVMPass.h"
#include "mlir/Conversion/GPUCommon/GPUCommonPass.h"		#include "mlir/Conversion/GPUCommon/GPUCommonPass.h"
#include "mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h"		#include "mlir/Conversion/GPUToNVVM/GPUToNVVMPass.h"
#include "mlir/Conversion/IndexToLLVM/IndexToLLVM.h"		#include "mlir/Conversion/IndexToLLVM/IndexToLLVM.h"
#include "mlir/Conversion/MathToLLVM/MathToLLVM.h"		#include "mlir/Conversion/MathToLLVM/MathToLLVM.h"
#include "mlir/Conversion/MemRefToLLVM/MemRefToLLVM.h"		#include "mlir/Conversion/MemRefToLLVM/MemRefToLLVM.h"
#include "mlir/Conversion/NVGPUToNVVM/NVGPUToNVVM.h"		#include "mlir/Conversion/NVGPUToNVVM/NVGPUToNVVM.h"
		#include "mlir/Conversion/NVVMToLLVM/NVVMToLLVM.h"
#include "mlir/Conversion/ReconcileUnrealizedCasts/ReconcileUnrealizedCasts.h"		#include "mlir/Conversion/ReconcileUnrealizedCasts/ReconcileUnrealizedCasts.h"
#include "mlir/Conversion/SCFToControlFlow/SCFToControlFlow.h"		#include "mlir/Conversion/SCFToControlFlow/SCFToControlFlow.h"
#include "mlir/Conversion/VectorToLLVM/ConvertVectorToLLVM.h"		#include "mlir/Conversion/VectorToLLVM/ConvertVectorToLLVM.h"
#include "mlir/Conversion/VectorToSCF/VectorToSCF.h"		#include "mlir/Conversion/VectorToSCF/VectorToSCF.h"
#include "mlir/Dialect/Func/IR/FuncOps.h"		#include "mlir/Dialect/Func/IR/FuncOps.h"
#include "mlir/Dialect/GPU/IR/GPUDialect.h"		#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "mlir/Dialect/GPU/Transforms/Passes.h"		#include "mlir/Dialect/GPU/Transforms/Passes.h"
#include "mlir/Dialect/MemRef/Transforms/Passes.h"		#include "mlir/Dialect/MemRef/Transforms/Passes.h"
[... 34 lines not shown ...]	PassOptions::Option<std::string> cubinTriple{
llvm::cl::init("nvptx64-nvidia-cuda")};		llvm::cl::init("nvptx64-nvidia-cuda")};
PassOptions::Option<std::string> cubinChip{		PassOptions::Option<std::string> cubinChip{
*this, "cubin-chip", llvm::cl::desc("Chip to use to serialize to cubin."),		*this, "cubin-chip", llvm::cl::desc("Chip to use to serialize to cubin."),
llvm::cl::init("sm_80")};		llvm::cl::init("sm_80")};
PassOptions::Option<std::string> cubinFeatures{		PassOptions::Option<std::string> cubinFeatures{
*this, "cubin-features",		*this, "cubin-features",
llvm::cl::desc("Features to use to serialize to cubin."),		llvm::cl::desc("Features to use to serialize to cubin."),
llvm::cl::init("+ptx76")};		llvm::cl::init("+ptx76")};
		PassOptions::Option<bool> dumpPtx{
		*this, "dump-ptx", llvm::cl::desc("Whether to dump the produced ptx)"),
		llvm::cl::init(false)};
};		};

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// GPUModule-specific stuff.		// GPUModule-specific stuff.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
void buildGpuPassPipeline(OpPassManager &pm,		void buildGpuPassPipeline(OpPassManager &pm,
const TestLowerToNVVMOptions &options) {		const TestLowerToNVVMOptions &options) {
pm.addNestedPass<gpu::GPUModuleOp>(createStripDebugInfoPass());		pm.addNestedPass<gpu::GPUModuleOp>(createStripDebugInfoPass());
[... 39 lines not shown ...]	void buildGpuPassPipeline(OpPassManager &pm,
// Must be 64b on the host, things don't compose properly around		// Must be 64b on the host, things don't compose properly around
// gpu::LaunchOp and gpu::HostRegisterOp.		// gpu::LaunchOp and gpu::HostRegisterOp.
// TODO: fix GPU layering.		// TODO: fix GPU layering.
convertIndexToLLVMPassOpt.indexBitwidth = options.kernelIndexBitWidth;		convertIndexToLLVMPassOpt.indexBitwidth = options.kernelIndexBitWidth;
pm.addNestedPass<gpu::GPUModuleOp>(		pm.addNestedPass<gpu::GPUModuleOp>(
createConvertIndexToLLVMPass(convertIndexToLLVMPassOpt));		createConvertIndexToLLVMPass(convertIndexToLLVMPassOpt));

// TODO: C++20 designated initializers.		// TODO: C++20 designated initializers.
		ConvertNVGPUToNVVMPassOptions convertNVGPUToNVVMPassOptions;
		convertNVGPUToNVVMPassOptions.useOpaquePointers = true;
		pm.addNestedPass<gpu::GPUModuleOp>(
		createConvertNVGPUToNVVMPass(convertNVGPUToNVVMPassOptions));

		pm.addNestedPass<gpu::GPUModuleOp>(createConvertSCFToCFPass());

		// TODO: C++20 designated initializers.
// The following pass is inconsistent.		// The following pass is inconsistent.
// ConvertGpuOpsToNVVMOpsOptions convertGpuOpsToNVVMOpsOptions;		// ConvertGpuOpsToNVVMOpsOptions convertGpuOpsToNVVMOpsOptions;
// convertGpuOpsToNVVMOpsOptions.indexBitwidth =		// convertGpuOpsToNVVMOpsOptions.indexBitwidth =
// options.kernelIndexBitWidth;		// options.kernelIndexBitWidth;
pm.addNestedPass<gpu::GPUModuleOp>(		pm.addNestedPass<gpu::GPUModuleOp>(
// TODO: fix inconsistence.		// TODO: fix inconsistence.
createLowerGpuOpsToNVVMOpsPass(/*indexBitWidth=*/		createLowerGpuOpsToNVVMOpsPass(/*indexBitWidth=*/
options.kernelIndexBitWidth));		options.kernelIndexBitWidth));

// TODO: C++20 designated initializers.		// TODO: C++20 designated initializers.
ConvertNVGPUToNVVMPassOptions convertNVGPUToNVVMPassOptions;
convertNVGPUToNVVMPassOptions.useOpaquePointers = true;
pm.addNestedPass<gpu::GPUModuleOp>(
createConvertNVGPUToNVVMPass(convertNVGPUToNVVMPassOptions));
pm.addNestedPass<gpu::GPUModuleOp>(createConvertSCFToCFPass());

// TODO: C++20 designated initializers.
GpuToLLVMConversionPassOptions gpuToLLVMConversionOptions;		GpuToLLVMConversionPassOptions gpuToLLVMConversionOptions;
// Note: hostBarePtrCallConv must be false for now otherwise		// Note: hostBarePtrCallConv must be false for now otherwise
// gpu::HostRegister is ill-defined: it wants unranked memrefs but can't		// gpu::HostRegister is ill-defined: it wants unranked memrefs but can't
// lower the to bare ptr.		// lower the to bare ptr.
gpuToLLVMConversionOptions.hostBarePtrCallConv =		gpuToLLVMConversionOptions.hostBarePtrCallConv =
options.hostUseBarePtrCallConv;		options.hostUseBarePtrCallConv;
gpuToLLVMConversionOptions.kernelBarePtrCallConv =		gpuToLLVMConversionOptions.kernelBarePtrCallConv =
options.kernelUseBarePtrCallConv;		options.kernelUseBarePtrCallConv;
gpuToLLVMConversionOptions.useOpaquePointers = true;		gpuToLLVMConversionOptions.useOpaquePointers = true;

// TODO: something useful here.		// TODO: something useful here.
// gpuToLLVMConversionOptions.gpuBinaryAnnotation = "";		// gpuToLLVMConversionOptions.gpuBinaryAnnotation = "";
pm.addNestedPass<gpu::GPUModuleOp>(		pm.addNestedPass<gpu::GPUModuleOp>(
createGpuToLLVMConversionPass(gpuToLLVMConversionOptions));		createGpuToLLVMConversionPass(gpuToLLVMConversionOptions));

// Convert vector to LLVM (always needed).		// Convert vector to LLVM (always needed).
// TODO: C++20 designated initializers.		// TODO: C++20 designated initializers.
ConvertVectorToLLVMPassOptions convertVectorToLLVMPassOptions;		ConvertVectorToLLVMPassOptions convertVectorToLLVMPassOptions;
convertVectorToLLVMPassOptions.reassociateFPReductions = true;		convertVectorToLLVMPassOptions.reassociateFPReductions = true;
pm.addNestedPass<gpu::GPUModuleOp>(		pm.addNestedPass<gpu::GPUModuleOp>(
createConvertVectorToLLVMPass(convertVectorToLLVMPassOptions));		createConvertVectorToLLVMPass(convertVectorToLLVMPassOptions));

		pm.addNestedPass<gpu::GPUModuleOp>(createConvertNVVMToLLVMPass());

// Sprinkle some cleanups.		// Sprinkle some cleanups.
pm.addPass(createCanonicalizerPass());		pm.addPass(createCanonicalizerPass());
pm.addPass(createCSEPass());		pm.addPass(createCSEPass());

// Finally we can reconcile unrealized casts.		// Finally we can reconcile unrealized casts.
pm.addNestedPass<gpu::GPUModuleOp>(createReconcileUnrealizedCastsPass());		pm.addNestedPass<gpu::GPUModuleOp>(createReconcileUnrealizedCastsPass());

#if MLIR_GPU_TO_CUBIN_PASS_ENABLE		#if MLIR_GPU_TO_CUBIN_PASS_ENABLE
pm.addNestedPass<gpu::GPUModuleOp>(createGpuSerializeToCubinPass(		pm.addNestedPass<gpu::GPUModuleOp>(createGpuSerializeToCubinPass(
options.cubinTriple, options.cubinChip, options.cubinFeatures));		options.cubinTriple, options.cubinChip, options.cubinFeatures,
		/*optLevel=*/2, /*dumpPtx=*/options.dumpPtx));
#endif // MLIR_GPU_TO_CUBIN_PASS_ENABLE		#endif // MLIR_GPU_TO_CUBIN_PASS_ENABLE
}		}

void buildLowerToNVVMPassPipeline(OpPassManager &pm,		void buildLowerToNVVMPassPipeline(OpPassManager &pm,
const TestLowerToNVVMOptions &options) {		const TestLowerToNVVMOptions &options) {
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Host-specific stuff.		// Host-specific stuff.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Important, must be run at the top-level.
pm.addPass(createGpuKernelOutliningPass());

// Important, all host passes must be run at the func level so that host		// Important, all host passes must be run at the func level so that host
// conversions can remain with 64 bit indices without polluting the GPU		// conversions can remain with 64 bit indices without polluting the GPU
// kernel that may have 32 bit indices.		// kernel that may have 32 bit indices.
// Must be 64b on the host, things don't compose properly around		// Must be 64b on the host, things don't compose properly around
// gpu::LaunchOp and gpu::HostRegisterOp.		// gpu::LaunchOp and gpu::HostRegisterOp.
// TODO: fix GPU layering.		// TODO: fix GPU layering.
pm.addNestedPass<func::FuncOp>(createConvertVectorToSCFPass());		pm.addNestedPass<func::FuncOp>(createConvertVectorToSCFPass());
[... 28 lines not shown ...]	void buildLowerToNVVMPassPipeline(OpPassManager &pm,
// TODO: fix GPU layering.		// TODO: fix GPU layering.
convertFuncToLLVMPassOptions.indexBitwidth = options.hostIndexBitWidth;		convertFuncToLLVMPassOptions.indexBitwidth = options.hostIndexBitWidth;
convertFuncToLLVMPassOptions.useBarePtrCallConv =		convertFuncToLLVMPassOptions.useBarePtrCallConv =
options.hostUseBarePtrCallConv;		options.hostUseBarePtrCallConv;
convertFuncToLLVMPassOptions.useOpaquePointers = true;		convertFuncToLLVMPassOptions.useOpaquePointers = true;
pm.addNestedPass<func::FuncOp>(		pm.addNestedPass<func::FuncOp>(
createConvertFuncToLLVMPass(convertFuncToLLVMPassOptions));		createConvertFuncToLLVMPass(convertFuncToLLVMPassOptions));

// TODO: C++20 designated initializers.
ConvertIndexToLLVMPassOptions convertIndexToLLVMPassOpt;
// Must be 64b on the host, things don't compose properly around
// gpu::LaunchOp and gpu::HostRegisterOp.
// TODO: fix GPU layering.
convertIndexToLLVMPassOpt.indexBitwidth = options.hostIndexBitWidth;
pm.addNestedPass<func::FuncOp>(
createConvertIndexToLLVMPass(convertIndexToLLVMPassOpt));

pm.addNestedPass<func::FuncOp>(createArithToLLVMConversionPass());

// Sprinkle some cleanups.		// Sprinkle some cleanups.
pm.addNestedPass<func::FuncOp>(createCanonicalizerPass());		pm.addNestedPass<func::FuncOp>(createCanonicalizerPass());
pm.addNestedPass<func::FuncOp>(createCSEPass());		pm.addNestedPass<func::FuncOp>(createCSEPass());

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// GPUModule-specific stuff.		// GPUModule-specific stuff.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

		// Due to gpu::LaunchOp and gpu::LaunchFuncOp layering and conversions, there
		// is currently a need to call convertNVGPUToNVVM at the top-level to get
		// proper types at the function boundary for the TMADescriptors.
		// TODO: Fix this broken layering: conversion of TMA descriptor should be
		// separated from introducing LLVM types.
		// TODO: C++20 designated initializers.
		ConvertNVGPUToNVVMPassOptions convertNVGPUToNVVMPassOptions;
		convertNVGPUToNVVMPassOptions.useOpaquePointers = true;
		pm.addPass(createConvertNVGPUToNVVMPass(convertNVGPUToNVVMPassOptions));

		// Important, must be run at the top-level.
		pm.addPass(createGpuKernelOutliningPass());

buildGpuPassPipeline(pm, options);		buildGpuPassPipeline(pm, options);

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Host post-GPUModule-specific stuff.		// Host post-GPUModule-specific stuff.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Convert vector to LLVM (always needed).		// Convert vector to LLVM (always needed).
// TODO: C++20 designated initializers.		// TODO: C++20 designated initializers.
ConvertVectorToLLVMPassOptions convertVectorToLLVMPassOptions;		ConvertVectorToLLVMPassOptions convertVectorToLLVMPassOptions;
convertVectorToLLVMPassOptions.reassociateFPReductions = true;		convertVectorToLLVMPassOptions.reassociateFPReductions = true;
pm.addNestedPass<func::FuncOp>(		pm.addNestedPass<func::FuncOp>(
createConvertVectorToLLVMPass(convertVectorToLLVMPassOptions));		createConvertVectorToLLVMPass(convertVectorToLLVMPassOptions));

		pm.addPass(createConvertNVVMToLLVMPass());

ConvertIndexToLLVMPassOptions convertIndexToLLVMPassOpt3;		ConvertIndexToLLVMPassOptions convertIndexToLLVMPassOpt3;
// Must be 64b on the host, things don't compose properly around		// Must be 64b on the host, things don't compose properly around
// gpu::LaunchOp and gpu::HostRegisterOp.		// gpu::LaunchOp and gpu::HostRegisterOp.
// TODO: fix GPU layering.		// TODO: fix GPU layering.
convertIndexToLLVMPassOpt3.indexBitwidth = options.hostIndexBitWidth;		convertIndexToLLVMPassOpt3.indexBitwidth = options.hostIndexBitWidth;
pm.addPass(createConvertIndexToLLVMPass(convertIndexToLLVMPassOpt3));		pm.addPass(createConvertIndexToLLVMPass(convertIndexToLLVMPassOpt3));

		pm.addNestedPass<func::FuncOp>(createArithToLLVMConversionPass());

// This must happen after cubin translation otherwise gpu.launch_func is		// This must happen after cubin translation otherwise gpu.launch_func is
// illegal if no cubin annotation is present.		// illegal if no cubin annotation is present.
// TODO: C++20 designated initializers.		// TODO: C++20 designated initializers.
GpuToLLVMConversionPassOptions gpuToLLVMConversionOptions;		GpuToLLVMConversionPassOptions gpuToLLVMConversionOptions;
// Note: hostBarePtrCallConv must be false for now otherwise		// Note: hostBarePtrCallConv must be false for now otherwise
// gpu::HostRegister is ill-defined: it wants unranked memrefs but can't		// gpu::HostRegister is ill-defined: it wants unranked memrefs but can't
// lower the to bare ptr.		// lower the to bare ptr.
gpuToLLVMConversionOptions.hostBarePtrCallConv =		gpuToLLVMConversionOptions.hostBarePtrCallConv =
[... 40 lines not shown ...]

[mlir][NVGPU] WIP - Apply layering changes for e2e NVVM
Needs Review
Public
