Page MenuHomePhabricator

[MLIR][GPU] Define gpu.printf op and its lowerings
ClosedPublic

Authored by krzysz00 on Sep 24 2021, 2:42 PM.

Details

Summary
  • Define a gpu.printf op, which can be lowered to any GPU printf() support (which is present in CUDA, HIP, and OpenCL). This op only supports constant format strings and scalar arguments
  • Define the lowering of gpu.pirntf to a call to printf() (which is what is required for AMD GPUs when using OpenCL) as well as to the hostcall interface present in the AMD Open Compute device library, which is the interface present when kernels are running under HIP.
  • Add a "runtime" enum that allows specifying which of the possible runtimes a ROCDL kernel will be executed under or that the runtime is unknown. This enum controls how gpu.printf is lowered

This change does not enable lowering for Nvidia GPUs, but such a lowering should be possible in principle.

And:
[MLIR][AMDGPU] Always set amdgpu-implicitarg-num-bytes=56 on kernels

This is something that Clang always sets on both OpenCL and HIP kernels, and failing to include it causes mysterious crashes with printf() support.

In addition, revert the max-flat-work-group-size to (1, 256) to avoid triggering bugs in the AMDGPU backend.

Diff Detail

Event Timeline

krzysz00 created this revision.Sep 24 2021, 2:42 PM
krzysz00 requested review of this revision.Sep 24 2021, 2:42 PM
rriddle added inline comments.Sep 24 2021, 3:17 PM
mlir/lib/Conversion/GPUCommon/GPUOpsLowering.h
86

This overload is deprecated, please switch to the adaptor variant.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
908–916 ↗(On Diff #374967)

Can you encode this in ODS instead?

bondhugula added inline comments.
mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp
149

Doc comment here or on the struct.

197

As compact to just use op.args() inline below.

198–200

Reserve first before pushing?

201

Much more compact with:

printArgs.append(..., ...);
mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp
121

Please add this in sorted order.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
910 ↗(On Diff #374967)

This constraint can be placed on the TD definition.

911–912 ↗(On Diff #374967)

return op.emitOpError(...

bondhugula requested changes to this revision.Sep 24 2021, 6:21 PM
bondhugula added inline comments.
mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp
160

This isn't needed. Ops have a default null init.
Nit: printf -> printfOp.

171–195

Code comments for the major blocks please.

This revision now requires changes to proceed.Sep 24 2021, 6:21 PM

It isn't clear to me that this belong to the GPU dialect: what is GPU specific to this op or the lowering right now?

krzysz00 updated this revision to Diff 375289.Sep 27 2021, 8:40 AM
  • Adress first round of review comments
krzysz00 marked 9 inline comments as done.Sep 27 2021, 8:47 AM

It isn't clear to me that this belong to the GPU dialect: what is GPU specific to this op or the lowering right now?

For the lowering I've defined, there's not much GPU-specific about it (except that the globals are placed in the gpu.module). However, IIRC, platforms like Vulkan have a much different printf() machinery that would require a different lowering.

From what I can tell, there could be an argument for putting printf as a concept somewhere else. Do you think std would be a good place for it instead?

mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp
160

Good to know, and done.

198–200

Whoops, fixed.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
910 ↗(On Diff #374967)

Thanks for pointing out that's possible now. This whole function has been removed

Commit title fixes:

ILVM -> LLVM
it's -> its

bondhugula added inline comments.Sep 27 2021, 8:40 PM
mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp
195

pointr -> pointer

203

args() gives you a ValueRange. You don't need rvalue references here. You can simply use:
ValueRange args = adaptor.args(); like it's done everywhere.

krzysz00 marked 3 inline comments as done.Sep 28 2021, 10:53 AM

To update y'all, it turns out the printf() lowering I was trying to use targets OpenCL, while AMD's MLIR kernels are often run under HIP, which has a different printf() arrangement. I'll likely be resubmitting with a much different lowering pass soon

krzysz00 updated this revision to Diff 382865.Oct 27 2021, 5:42 PM
krzysz00 retitled this revision from [MLIR][GPU] Add a gpu.printf op (and it's lowering to lLVM) to [MLIR][GPU] Define gpu.printf op and its lowerings.
krzysz00 edited the summary of this revision. (Show Details)

Rework printf lowering to account for HIP being different

krzysz00 edited the summary of this revision. (Show Details)Oct 27 2021, 5:44 PM

(I'm definitely open to having this op in a different dialect, say std, but that feels like a broader discussion that would need to happen)

(adding back Uday since the issue isn't inactive anymore?)

@herhut @ftynse - could I get another look at this, since it's been sitting here for a while?

krzysz00 updated this revision to Diff 389056.Nov 22 2021, 3:33 PM
  • [MLIR][AMDGPU] Always set amdgpu-implicitarg-num-bytes=56 on kernels
  • [MLIR][GPU] Define gpu.printf op and its lowering
  • Add integration test and call to set amdgpu_hostcall attribute
krzysz00 edited the summary of this revision. (Show Details)Nov 22 2021, 3:34 PM
krzysz00 updated this revision to Diff 389245.Nov 23 2021, 10:03 AM
  • Correctly promote floats
mehdi_amini accepted this revision.Dec 7 2021, 5:46 PM
mehdi_amini added inline comments.
mlir/include/mlir/Conversion/GPUToROCDL/GPUToROCDLPass.h
29

This should be documented as requiring to be executed on a ModulePass.

mlir/include/mlir/Dialect/GPU/GPUOps.td
562

This would seem nicer :)

mlir/lib/Conversion/GPUCommon/GPUOpsLowering.h
51
mlir/lib/Conversion/GPUToROCDL/LowerGpuOpsToROCDLOps.cpp
129

Can you document the address space?

This revision is now accepted and ready to land.Dec 7 2021, 5:46 PM
krzysz00 updated this revision to Diff 392953.Dec 8 2021, 3:29 PM
  • [MLIR][AMDGPU] Always set amdgpu-implicitarg-num-bytes=56 on kernels
  • [MLIR][GPU] Define gpu.printf op and its lowering
  • Add integration test and call to set amdgpu_hostcall attribute
  • Correctly promote floats
  • Address pre-landing review commants on gpu.printf
krzysz00 updated this revision to Diff 392961.Dec 8 2021, 3:47 PM

Re-push for patch weirdness

krzysz00 updated this revision to Diff 392964.Dec 8 2021, 3:51 PM

Removed dependency on old SerializeToHsaco revision

This revision was automatically updated to reflect the committed changes.
mehdi_amini added inline comments.Dec 9 2021, 6:00 PM
mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl-hip.mlir
1

This test is crashing on one of my bot.

The CMake invocation is:

cmake -G Ninja ${src_dir}/llvm -DLLVM_TARGETS_TO_BUILD="X86;AMDGPU;NVPTX" -DLLVM_ENABLE_PROJECTS=mlir -DLLVM_BUILD_EXAMPLES=ON -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_ASSERTIONS=Off -DLLVM_LINK_LLVM_DYLIB=ON -DMLIR_INCLUDE_INTEGRATION_TESTS=ON -DMLIR_INCLUDE_INTEGRATION_TESTS=ON

(on Linux Ubuntu with gcc 8 as host compiler)

Can you have a look at it?

mehdi_amini added inline comments.Dec 9 2021, 9:00 PM
mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp
243

The issue is here, this ValueRange refers to a temporary. I'll fix it.

mlir/test/Conversion/GPUToROCDL/gpu-to-rocdl-hip.mlir
1

ASAN:

=================================================================
==4963==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7f17d905cde0 at pc 0x560629d0fdcf bp 0x7ffd9716f9b0 sp 0x7ffd9716f9a8
READ of size 8 at 0x7f17d905cde0 thread T0
    #0 0x560629d0fdce in dereference_iterator third_party/llvm/llvm-project/mlir/lib/IR/OperationSupport.cpp:608:12
    #1 0x560629d0fdce in operator* third_party/llvm/llvm-project/llvm/include/llvm/ADT/STLExtras.h:1133:14
    #2 0x560629d0fdce in mlir::Value* std::__u::uninitialized_copy<llvm::detail::indexed_accessor_range_base<mlir::ValueRange, llvm::PointerUnion<mlir::Value const*, mlir::OpOperand*, mlir::detail::OpResultImpl*>, mlir::Value, mlir::Value, mlir::Value>::iterator, mlir::Value*>(llvm::detail::indexed_accessor_range_base<mlir::ValueRange, llvm::PointerUnion<mlir::Value const*, mlir::OpOperand*, mlir::detail::OpResultImpl*>, mlir::Value, mlir::Value, mlir::Value>::iterator, llvm::detail::indexed_accessor_range_base<mlir::ValueRange, llvm::PointerUnion<mlir::Value const*, mlir::OpOperand*, mlir::detail::OpResultImpl*>, mlir::Value, mlir::Value, mlir::Value>::iterator, mlir::Value*) third_party/crosstool/v18/stable/toolchain/bin/../include/c++/v1/__memory/uninitialized_algorithms.h:36:62
    #3 0x560629d054d4 in uninitialized_copy<llvm::detail::indexed_accessor_range_base<mlir::ValueRange, llvm::PointerUnion<const mlir::Value *, mlir::OpOperand *, mlir::detail::OpResultImpl *>, mlir::Value, mlir::Value, mlir::Value>::iterator, mlir::Value *> third_party/llvm/llvm-project/llvm/include/llvm/ADT/SmallVector.h:490:5
    #4 0x560629d054d4 in append<llvm::detail::indexed_accessor_range_base<mlir::ValueRange, llvm::PointerUnion<const mlir::Value *, mlir::OpOperand *, mlir::detail::OpResultImpl *>, mlir::Value, mlir::Value, mlir::Value>::iterator, void> third_party/llvm/llvm-project/llvm/include/llvm/ADT/SmallVector.h:652:5
    #5 0x560629d054d4 in mlir::OperationState::addOperands(mlir::ValueRange) third_party/llvm/llvm-project/mlir/lib/IR/OperationSupport.cpp:188:12
    #6 0x560628e65810 in mlir::LLVM::CallOp::build(mlir::OpBuilder&, mlir::OperationState&, mlir::LLVM::LLVMFuncOp, mlir::ValueRange, llvm::ArrayRef<mlir::NamedAttribute>) blaze-out/k8-asan-opt/bin/third_party/llvm/llvm-project/mlir/_virtual_includes/LLVMOpsIncGen/mlir/Dialect/LLVMIR/LLVMOps.cpp.inc:4117:16
    #7 0x560623ffb6b8 in mlir::LLVM::CallOp mlir::OpBuilder::create<mlir::LLVM::CallOp, mlir::LLVM::LLVMFuncOp&, mlir::ValueRange&>(mlir::Location, mlir::LLVM::LLVMFuncOp&, mlir::ValueRange&) third_party/llvm/llvm-project/mlir/include/mlir/IR/Builders.h:427:5
    #8 0x560623ff8bc8 in mlir::GPUPrintfOpToHIPLowering::matchAndRewrite(mlir::gpu::PrintfOp, mlir::gpu::PrintfOpAdaptor, mlir::ConversionPatternRewriter&) const third_party/llvm/llvm-project/mlir/lib/Conversion/GPUCommon/GPUOpsLowering.cpp:245:16
    #9 0x560623fff4e0 in mlir::ConvertOpToLLVMPattern<mlir::gpu::PrintfOp>::matchAndRewrite(mlir::Operation*, llvm::ArrayRef<mlir::Value>,