When converting to NVVM, lowering gpu.printf to vprintf allows us to support printing when running on CUDA.
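For reference, CUDA's device-side printf is built on vprintf, which takes the format string plus a single pointer to a buffer containing all the arguments packed together. The sketch below shows the contract the lowering targets; the struct name and layout are illustrative only, not the exact IR this patch emits.

```cpp
#include <cstdint>

// Device-side entry point documented in the CUDA C++ Programming Guide;
// gpu.printf is lowered to a call to it:
//   extern "C" __device__ int vprintf(const char *format, void *args);
//
// Conceptually, for a gpu.printf with format "%d %f\n" and (i32, f64)
// operands, the lowering allocates one local buffer, stores the operands
// into it, and passes its address as `args`. Illustrative layout only:
struct VprintfArgPack {
  int32_t first;   // the i32 operand
  double second;   // the f64 operand
};
// ... store the operands into an alloca'd VprintfArgPack, then:
// vprintf(formatStringGlobal, &pack);
```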
Nice. Just some comments.
mlir/lib/Conversion/GPUToNVVM/LowerGpuOpsToNVVMOps.cpp:254
nit: why not add this above?

mlir/test/Integration/GPU/CUDA/printf.mlir:16
Could you test this with multiple arguments that have different sizes? I wonder whether your struct packing approach fulfills the alignment requirements for the arguments to the vprintf call.
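To make the alignment concern concrete: vprintf reads each argument at its naturally aligned offset in the buffer, so an i32 followed by an f64 needs padding before the double. A quick host-side illustration (assumes a typical 64-bit target; not part of the patch):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct Packed {
  int32_t a;  // 4 bytes
  double b;   // requires 8-byte alignment on common 64-bit ABIs
};

int main() {
  // On a typical 64-bit host this prints offset 8 and size 16: the compiler
  // inserts 4 bytes of padding after `a`. A lowering that appends argument
  // bytes back-to-back would place the double at offset 4 and hand vprintf
  // a misaligned f64.
  std::printf("offsetof(b) = %zu, sizeof(Packed) = %zu\n",
              offsetof(Packed, b), sizeof(Packed));
}
```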
F32 needs to be extended to f64. If you add a test, you will see it fails.
mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp:381
This same idiom (lines around 380 through 400) is repeated 3 times in this file, so it might be useful to outline it into a helper function? Since a SymbolTable can't be constructed in the middle of a conversion, maybe it would be helpful if this type of function (e.g. getUniqueSymbolName) were added to the OpTrait::SymbolTable class in IR/SymbolTable.h, but maybe that's too heavy-handed?
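Roughly the kind of helper that could be outlined; the name getUniqueSymbolName and its placement here are only a suggestion, not code from the patch:

```cpp
#include "mlir/IR/Operation.h"
#include "mlir/IR/SymbolTable.h"
#include "llvm/ADT/Twine.h"
#include <string>

// Hypothetical helper: return a symbol name that does not clash with any
// existing symbol nested under `symbolTableOp`, by appending a numeric
// suffix to `base` until the lookup comes back empty.
static std::string getUniqueSymbolName(mlir::Operation *symbolTableOp,
                                       llvm::StringRef base) {
  std::string name = base.str();
  unsigned suffix = 0;
  while (mlir::SymbolTable::lookupSymbolIn(symbolTableOp, name))
    name = (base + llvm::Twine('_') + llvm::Twine(suffix++)).str();
  return name;
}
```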
mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp:408
F32 needs to be extended to f64 before being pushed into the struct. If you add a test, you will see it currently fails for f32 args. Integers need to be extended to 32 bits, but based on my initial test this happens automatically.
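Something along these lines would cover the f32 case; this is a sketch that assumes the lowering has an OpBuilder and the converted operand at hand, and the helper name is made up:

```cpp
#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "mlir/IR/Builders.h"

// Hypothetical helper: apply the default argument promotion vprintf expects
// before the value is stored into the argument struct. f32 is widened to
// f64 with llvm.fpext; smaller integers are left to the existing lowering,
// which (per the comment above) already appears to widen them to i32.
static mlir::Value promoteForVprintf(mlir::OpBuilder &builder,
                                     mlir::Location loc, mlir::Value value) {
  if (value.getType().isF32())
    return builder.create<mlir::LLVM::FPExtOp>(loc, builder.getF64Type(),
                                               value);
  return value;
}
```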
mlir/test/Integration/GPU/CUDA/printf.mlir:16
F32 needs to be extended to f64 before being pushed into the struct. If you add a test, you will see it currently fails for f32 args.