This is an archive of the discontinued LLVM Phabricator instance.

[mlir][gpu] Adds a gpu serialization pipeline for offloading GPUDialect Ops to clang compatible annotations.
Abandoned · Public

Authored by fmorac on Apr 30 2023, 3:23 PM.

Details

Summary

For general context see:
https://discourse.llvm.org/t/rfc-extending-mlir-gpu-device-codegen-pipeline/70199/1

What this diff is not:

  • It's not a replacement of the current serialization pipeline.
    • However, this diff provides the infrastructure to reimplement the current pipeline with little effort and to address several of its current shortcomings, most notably the lack of support for linking against bitcode libraries.
    • There are several reasons why this diff doesn't include that re-implementation. The top ones: the patch is already large enough; I handle AMDGPU code generation by linking against the bitcode libraries instead of introducing symbols (see: https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp#L198-L269); and I don't know if that's something the current code owners of that pipeline want.
  • This patch doesn't introduce a clang build dependency; in fact, it adds no additional build dependencies.

What this diff is:

  • It is an additional pipeline that currently comes with several restrictions but also with many new features.
    • Restrictions: It requires a compatible clang compiler to generate executables. The clang features this patch relies on are currently available only on Linux, so until clang extends its support *this pipeline is restricted to Linux*.
    • Features:
      • Link to device bitcode libraries.
      • Additional AMDGPU features like fast math.
      • Automatic linking to libdevice provided there's a valid CUDA toolkit path.
      • This pipeline is always available as long as the respective target is built (AMDGPU, NVPTX).
      • Enables access to clang's code generation features.

Code walkthrough:

Summary

This diff introduces:

  • --gpu-to-(nvptx|amdgpu): These passes serialize gpu.modules to LLVM bitcode, which then gets further serialized to an offload object format supported by LLVM and compatible with clang.
  • --gpu-name-mangling: It mangles the names of symbols inside gpu modules. This pass might be required because clang unpacks all offload objects and links them together; if two functions share a name, clang will merge the symbols. The mangling scheme is: __G<gpu module name>_S<function name> (see the sketch after this list).
  • --gpu-to-offload: This pass is equivalent to --gpu-to-llvm, except that it introduces clang offload annotations and handles the conversion of LaunchFuncOp differently.
  • Creates the libraries mlir_cudart_runtime & mlir_hiprt_runtime, as clang uses runtime functions instead of driver functions.
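
For illustration, here is a minimal sketch of the mangling scheme used by --gpu-name-mangling (the helper below is hypothetical, not the actual NameMangling.cpp code):

#include <string>

// Hypothetical helper illustrating the __G<gpu module name>_S<function name>
// scheme; the real pass rewrites the symbols and all their uses in place.
static std::string mangleGpuSymbol(const std::string &gpuModuleName,
                                   const std::string &symbolName) {
  return "__G" + gpuModuleName + "_S" + symbolName;
}

// Example: a gpu module and kernel both named "test_mlir_kernel" produce
// "__Gtest_mlir_kernel_Stest_mlir_kernel", the name visible in the nsys
// output further down.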

Key files walkthrough:

  • GpuToDeviceObjectCommon.h: this file does the heavy lifting for --gpu-to-(nvptx|amdgpu) as it handles the serialization pipeline. The classes in this file would be the ones used for re-implementing the current pipeline.
  • GpuToDeviceOffload.cpp: implements the passes --gpu-to-(nvptx|amdgpu).
  • NameMangling.cpp: implements the pass --gpu-name-mangling.
  • CudaRuntimeWrappers.cpp: implements the library mlir_cudart_runtime. Instead of creating a new file, I decided it was better to keep everything in one file and use the macro MLIR_USE_CUDART_RUNNER to handle both libraries. The advantage of this approach is that it ensures developers always update both versions of this library (a rough illustration follows after this list).
  • GPUToLLVMConversion.cpp: handles the pass --gpu-to-offload; the key modifications in this file are the addition of the GPUOffloadBuilder class, the GpuToOffloadConversionPass pass, and the populateGpuToLLVMOffloadConversionPatterns function, as well as an update to ConvertLaunchFuncOpToGpuRuntimeCallPattern::matchAndRewrite.
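
For illustration, a rough sketch of the single-source approach (the function below is simplified and its exact signature is an assumption, not the patch's actual code):

#include <cstdint>

// One wrapper source, two libraries: with MLIR_USE_CUDART_RUNNER the file is
// built against the CUDA runtime API (for the clang offload path), otherwise
// against the CUDA driver API as before.
#ifdef MLIR_USE_CUDART_RUNNER
#include <cuda_runtime.h>
extern "C" void *mgpuMemAlloc(uint64_t sizeBytes, void * /*stream*/) {
  void *ptr = nullptr;
  cudaMalloc(&ptr, sizeBytes); // runtime API
  return ptr;
}
#else
#include <cuda.h>
extern "C" void *mgpuMemAlloc(uint64_t sizeBytes, CUstream /*stream*/) {
  CUdeviceptr ptr = 0;
  cuMemAlloc(&ptr, sizeBytes); // driver API
  return reinterpret_cast<void *>(ptr);
}
#endif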

Real world testing:

This patch was tested on the following 3 platforms:

  1. Frontier at OLCF - ORNL, using AMD MI250x gfx90a
  2. Perlmutter at NERSC - LBNL, using NVIDIA A100.
  3. A local server, using NVIDIA V100.
    • CUDA: 11.8
    • clang version: 17.0.0 (++20230417095441+43ac269bdd00-1~exp1~20230417215605.872)

The test consists of a GPU kernel written in MLIR (test.mlir) and a C++ file with the main and verification functionality (test.cpp).

In order to run the tests I built MLIR from scratch on the above systems and compiled the test using an already existing clang installation. On Perlmutter I also ran an additional test using the LLVM IR file test.ll generated on platform 3, to test a sort of 'cross-compiling'; it ran successfully.

The steps to compile the test for NVIDIA sm_70 targets are:

mlir-opt test.mlir \
  -gpu-launch-sink-index-computations \
  -gpu-kernel-outlining \
  -gpu-async-region \
  -gpu-name-mangling \
  -convert-scf-to-cf \
  -convert-gpu-to-nvvm \
  -convert-math-to-llvm \
  -convert-arith-to-llvm \
  -convert-index-to-llvm \
  -canonicalize \
  -gpu-to-nvptx="chip=sm_70 cuda-path=<cuda toolkit path>" \
  -gpu-to-offload \
  -canonicalize \
  -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
clang++ -fgpu-rdc --offload-new-driver test.ll test.cpp \
		-L${LLVM_PATH}/lib/ -lmlir_cudart_runtime -lcudart \
		-O3 -o test.exe

In all cases the tests completed successfully. I ensured that all of them were indeed calling the appropriate runtime functions by profiling the code with nsys & rocprof. Here is the output from nsys on Perlmutter:

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)           Name         
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  -----------  ----------------------
     99.9      259,394,921          1  259,394,921.0  259,394,921.0  259,394,921  259,394,921          0.0  cudaStreamCreate      
      0.1          137,695          2       68,847.5       68,847.5        5,250      132,445     89,940.4  cudaMalloc            
      0.0          103,289          2       51,644.5       51,644.5        7,655       95,634     62,210.5  cudaFree              
      0.0           60,957          1       60,957.0       60,957.0       60,957       60,957          0.0  cuLibraryLoadData     
      0.0           41,681          3       13,893.7       18,426.0        3,647       19,608      8,893.5  cudaMemcpyAsync       
      0.0           22,644          1       22,644.0       22,644.0       22,644       22,644          0.0  cudaLaunchKernel      
      0.0           11,492          1       11,492.0       11,492.0       11,492       11,492          0.0  cudaStreamDestroy     
      0.0            4,188          1        4,188.0        4,188.0        4,188        4,188          0.0  cudaStreamSynchronize 
      0.0            1,132          1        1,132.0        1,132.0        1,132        1,132          0.0  cuModuleGetLoadingMode

[6/8] Executing 'gpukernsum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)     GridXYZ         BlockXYZ                     Name                 
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  --------------  --------------  -------------------------------------
    100.0            4,320          1   4,320.0   4,320.0     4,320     4,320          0.0     3    1    1   128    1    1  __Gtest_mlir_kernel_Stest_mlir_kernel

Test files

Diff Detail

Event Timeline

fmorac created this revision.Apr 30 2023, 3:23 PM
Herald added a reviewer: dcaballe.
Herald added a project: Restricted Project.
fmorac updated this revision to Diff 518365.Apr 30 2023, 3:52 PM

Fixes an error on the tests, caused by not finding the CUDA & ROCm paths.

fmorac edited the summary of this revision. (Show Details)May 1 2023, 11:15 AM
fmorac added reviewers: mehdi_amini, tra, krzysz00.
fmorac edited the summary of this revision. (Show Details)May 1 2023, 11:16 AM
fmorac edited the summary of this revision. (Show Details)May 1 2023, 11:20 AM
fmorac edited the summary of this revision. (Show Details)
fmorac published this revision for review.May 1 2023, 11:22 AM
tra added a subscriber: jhuber6.May 1 2023, 12:21 PM
mlir-opt test.mlir \
...
  -gpu-to-nvptx="chip=sm_70 cuda-path=<cuda toolkit path>" \
  -gpu-to-offload \
  -canonicalize \
  -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
clang++ -fgpu-rdc --offload-new-driver test.ll test.cpp \
		-L${LLVM_PATH}/lib/ -lmlir_cudart_runtime -lcudart \
		-O3 -o test.exe

Can you elaborate on what the clang compilation is expected to do? I'm curious about the interaction between the -fgpu-rdc , the compilation of test.ll and test.cpp in the same invocation.

  • Where does the test.cpp come from?
  • -fgpu-rdc should be enabled by the new driver by default, I think.
  • Is clang supposed to assume that test.ll is the GPU IR? If so, which GPU variant? I think clang may default to a GPU that's different from the GPU you've specified for mlir-opt. Can you post all sub-commands clang++ prints with -### ?

@jhuber6 -- is this combination supposed to work in general?

mlir/lib/Dialect/GPU/CMakeLists.txt
50

Do I understand it correctly that this patch does not add build or run-time dependencies on CUDA or GPU driver on the MLIRGPUTransforms library itself, and that the HIP/CUDA-dependent bits will only be needed by the runtime wrappers and the test executable?

mlir-opt test.mlir \
...
  -gpu-to-nvptx="chip=sm_70 cuda-path=<cuda toolkit path>" \
  -gpu-to-offload \
  -canonicalize \
  -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
clang++ -fgpu-rdc --offload-new-driver test.ll test.cpp \
		-L${LLVM_PATH}/lib/ -lmlir_cudart_runtime -lcudart \
		-O3 -o test.exe

Can you elaborate on what the clang compilation is expected to do? I'm curious about the interaction between the -fgpu-rdc , the compilation of test.ll and test.cpp in the same invocation.

  • Where does the test.cpp come from?

Presumably it's a corresponding host implementation file. Alternatively you could use an empty file compiled with -nostdlib to treat it as a GPU fatbinary with no CPU related code. This is how we build the OpenMP device runtime static library.

  • -fgpu-rdc should be enabled by the new driver by default, I think.

It's not currently; I kept it for consistency with the existing CUDA support. Also, nvcc supports RDC but doesn't enable it by default for performance reasons. If you enable -foffload-lto it should have no performance penalty, at least.

  • Is clang supposed to assume that test.ll is the GPU IR? If so, which GPU variant? I think clang may default to a GPU that's different from the GPU you've specified for mlir-opt. Can you post all sub-commands clang++ prints with -### ?

@jhuber6 -- is this combination supposed to work in general?

I'm assuming that the test.ll is meant to be embedded in test.cpp although it's not listed that way. The way to handle that would be -Xclang -fembed-offload-object=test.ll.

fmorac marked an inline comment as done.May 1 2023, 1:50 PM

Can you elaborate on what the clang compilation is expected to do? I'm curious about the interaction between the -fgpu-rdc , the compilation of test.ll and test.cpp in the same invocation.

Here's the -### output:

"/usr/lib/llvm-17/bin/clang" "-cc1" "-triple" "x86_64-pc-linux-gnu" "-emit-obj" "-disable-free" "-clear-ast-before-backend" "-disable-llvm-verifier" "-discard-value-names" "-main-file-name" "test.ll" "-mrelocation-model" "pic" "-pic-level" "2" "-pic-is-pie" "-mframe-pointer=none" "-fmath-errno" "-ffp-contract=on" "-fno-rounding-math" "-mconstructor-aliases" "-funwind-tables=2" "-target-cpu" "x86-64" "-tune-cpu" "generic" "-debugger-tuning=gdb" "-fcoverage-compilation-dir=/usa/fmorac/mlir_leia" "-resource-dir" "/usr/lib/llvm-17/lib/clang/17" "-O3" "-fdebug-compilation-dir=/usa/fmorac/mlir_leia" "-ferror-limit" "19" "--offload-new-driver" "-fgnuc-version=4.2.1" "-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-faddrsig" "-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o" "/tmp/test-fe1f0c.o" "-x" "ir" "test.ll"
"/usr/lib/llvm-17/bin/clang" "-cc1" "-triple" "x86_64-pc-linux-gnu" "-emit-obj" "-disable-free" "-clear-ast-before-backend" "-disable-llvm-verifier" "-discard-value-names" "-main-file-name" "main.cpp" "-mrelocation-model" "pic" "-pic-level" "2" "-pic-is-pie" "-mframe-pointer=none" "-fmath-errno" "-ffp-contract=on" "-fno-rounding-math" "-mconstructor-aliases" "-funwind-tables=2" "-target-cpu" "x86-64" "-tune-cpu" "generic" "-debugger-tuning=gdb" "-fcoverage-compilation-dir=/usa/fmorac/mlir_leia" "-resource-dir" "/usr/lib/llvm-17/lib/clang/17" "-I/usr/local/cuda/include" "-internal-isystem" "/usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12" "-internal-isystem" "/usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12" "-internal-isystem" "/usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/backward" "-internal-isystem" "/usr/lib/llvm-17/lib/clang/17/include" "-internal-isystem" "/usr/local/include" "-internal-isystem" "/usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include" "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" "-internal-externc-isystem" "/usr/include" "-O3" "-fdeprecated-macro" "-fdebug-compilation-dir=/usa/fmorac/mlir_leia" "-ferror-limit" "19" "--offload-new-driver" "-fgnuc-version=4.2.1" "-fcxx-exceptions" "-fexceptions" "-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-faddrsig" "-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o" "/tmp/main-c4f25e.o" "-x" "c++" "main.cpp"
"/usr/lib/llvm-17/bin/clang-linker-wrapper" "--host-triple=x86_64-pc-linux-gnu" "--linker-path=/usr/bin/ld" "--" "-pie" "-z" "relro" "--hash-style=gnu" "--build-id" "--eh-frame-hdr" "-m" "elf_x86_64" "-dynamic-linker" "/lib64/ld-linux-x86-64.so.2" "-o" "test.exe" "/lib/x86_64-linux-gnu/Scrt1.o" "/lib/x86_64-linux-gnu/crti.o" "/usr/bin/../lib/gcc/x86_64-linux-gnu/12/crtbeginS.o" "-L/opt/llvm/build//lib/" "-L/usr/bin/../lib/gcc/x86_64-linux-gnu/12" "-L/usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../lib64" "-L/lib/x86_64-linux-gnu" "-L/lib/../lib64" "-L/usr/lib/x86_64-linux-gnu" "-L/usr/lib/../lib64" "-L/lib" "-L/usr/lib" "-L/usr/local/cuda/lib64" "/tmp/test-fe1f0c.o" "/tmp/main-c4f25e.o" "-lmlir_cudart_runtime" "-lcudart" "-lstdc++" "-lm" "-lgcc_s" "-lgcc" "-lc" "-lgcc_s" "-lgcc" "/usr/bin/../lib/gcc/x86_64-linux-gnu/12/crtendS.o" "/lib/x86_64-linux-gnu/crtn.o"

Most of the interesting stuff happens when clang invokes clang-linker-wrapper, which unpacks the device image, compiles it, links any device code, adds kernel registration code, and assembles everything together.

  • Where does the test.cpp come from?

I attached both test.cpp and test.mlir at the end of the summary, they should be after the profiler results.

  • -fgpu-rdc should be enabled by the new driver by default, I think.

As @jhuber6 said, it's not enabled by default yet.

  • Is clang supposed to assume that test.ll is the GPU IR? If so, which GPU variant? I think clang may default to a GPU that's different from the GPU you've specified for mlir-opt. Can you post all sub-commands clang++ prints with -### ?

test.ll is the file containing the host IR and an embedded object containing the device bitcode. The embedded object contains all the information needed by clang including the arch.
If you look at:

mlir-opt test.mlir \
...
  -gpu-to-nvptx="chip=sm_70 cuda-path=<cuda toolkit path>" \
...

The arch is there.

I'm assuming that the test.ll is meant to be embedded in test.cpp although it's not listed that way. The way to handle that would be -Xclang -fembed-offload-object=test.ll.

No, the pass uses #include "llvm/Object/OffloadBinary.h" to create a complete file, with the code already embedded.

I'm attaching the files generated by mlir-opt (test_llvm.mlir) and mlir-translate (test.ll):

mlir/lib/Dialect/GPU/CMakeLists.txt
50

You're correct. If NVPTX or AMDGPU is listed in the targets, this pipeline will be available; there are no strict CUDA or ROCm dependencies. This is possible because this pipeline never steps outside LLVM; however, that also means it cannot get down to an executable without clang. Having said that, if CUDA or ROCm are not present, then you need to supply those libraries to clang, as the generated bitcode would only contain declarations.

fmorac edited the summary of this revision. (Show Details)May 1 2023, 1:55 PM

This is quite a large patch, I'm not sure if this is really coupled or could actually be split reasonably? Left a few comments skimming through.

What seems to be really needed to me at the moment is better overall documentation of the GPU compilation flows and how the two end-to-end integrations are set up and compare; do you think you can sketch this?

mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
46

This says what it does, but lacks a bit of context, as in the "why would someone want to do this?"

53

Lowerings are in general, by convention, implemented in the Conversion directory; why did you go with Transforms here?

80

I'm a bit confused as of why GPUDialect is dependent here, isn't it the actual input of the pass? The dependent should be whatever is produced by the pass that isn't already in the input.

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
78

LLVM has the StringLiteral class for this, I think.

545

This is expensive, can you cache this by using a SymbolTable initialized once in the run method?

713

This is a recursive IR walk, seems a bit overkill, isn't it? Can we just walk the immediately nested operations?

734

Nit: You haven't flushed the stream here.

766

LLVM convention for the comments is to use /*argname=*/ where argname matches the actual parameters of the functions.

1400

(same nit for the comment format)

mlir/lib/Dialect/GPU/Transforms/GpuToDeviceObjectCommon.h
46

Can we move everything that doesn't strictly depend on the derived class into a non-templated base class? (thinking about code size here...)

264

I don't think this brace is correctly placed (assuming you intended to use this for flushing the ostream with RAII)

mlir/lib/Dialect/GPU/Transforms/NameMangling.cpp
107

Aren't there utilities to do that already? Seems like reimplementing some logic that should be fairly generic?

114

Can this be a pass that runs on GPUModuleOp instead?

tra added a comment.May 1 2023, 3:45 PM

Most of the interesting stuff happens when clang invokes clang-linker-wrapper, which unpacks the device image, compiles it, links any device code, adds kernel registration code, and assembles everything together.

I would not expect it to work with an IR as an input as that would require reimplementing offload binary embedding just so.

I see that that's exactly what the patch does. OK. Now it makes sense.

@jhuber6 should we consolidate offload-related glue generation into one place, so we don't have everybody rolling a DIY implementation? Object/OffloadBinary.h seems to be providing mostly APIs for parsing offload binaries.

Most of the interesting stuff happens when clang invokes clang-linker-wrapper, which unpacks the device image, compiles it, links any device code, adds kernel registration code, and assembles everything together.

I would not expect it to work with an IR as an input as that would require reimplementing offload binary embedding just so.

It should; it handles LLVM IR from the host or the GPU through LTO. The host-side support obviously requires a linker that accepts it, however.

I see that that's exactly what the patch does. OK. Now it makes sense.

@jhuber6 should we consolidate offload-related glue generation into one place, so we don't have everybody rolling a DIY implementation? Object/OffloadBinary.h seems to be providing mostly APIs for parsing offload binaries.

All of these tools are minimally bound to clang so we could move them to LLVM. The OffloadBinary.h file provides interfaces for creating and extracting them.

fmorac marked 10 inline comments as done.May 1 2023, 6:20 PM

I'll answer comments first, and leave the large explanation at the end.

This is quite a large patch, I'm not sure if this is really coupled or could be split reasonnably actually? Left a few comments skimming through.

I thought of maybe splitting it into 3: gpu-name-mangling, gpu-to-(nvptx|amdgpu), and gpu-to-offload. However, there's a functional dependency between gpu-to-(nvptx|amdgpu) and gpu-to-offload. I added gpu-name-mangling here because it was a relatively small change and it makes sense for it to be here.

What seems to be really needed at the moment to me is a better overall documentation of the GPU compilation flows and how the two end-to-end integration are setup and compare, do you think you can sketch this?

I'll address that at the end of this comment.

@jhuber6 should we consolidate offload-related glue generation into one place, so we don't have everybody rolling a DIY implementation? Object/OffloadBinary.h seems to be providing mostly APIs for parsing offload binaries.

That could knock a couple hundred lines off this patch; however, what I'm thinking now is that the embedding would then have to be performed by mlir-translate, as it would require both the device bitcode and the host LLVM IR. I can't find any drawbacks to this approach, but I don't know if we would want that.

Full break down of the compilation process

To make things easier, I'll attach files to show the changes after each step. Input files: test.mlir

and test.cpp .

  1. We start with test.mlir; this file is quite high-level, so we lower it a bit to test_prenvptx.mlir. This step doesn't use anything from this patch:
    • CMD:
mlir-opt test.mlir \
	-gpu-launch-sink-index-computations \
	-gpu-kernel-outlining \
	-gpu-async-region \
	-convert-scf-to-cf \
	-convert-gpu-to-nvvm \
	-convert-math-to-llvm \
	-convert-arith-to-llvm \
	-convert-index-to-llvm \
	-canonicalize \
	-o test_prenvptx.mlir
  2. We take test_prenvptx.mlir and apply mlir-opt -gpu-to-nvptx="chip=sm_70" to obtain test_postnvptx.mlir. This step introduces binary annotations on the gpu.module: one annotation with the LLVM bitcode corresponding to the gpu.module and one with the offloading kind, either hip or cuda. This pass is similar to what gpu-to-cubin does, but instead of generating a cubin it generates bitcode.
  3. We take test_postnvptx.mlir and apply mlir-opt -gpu-to-offload -canonicalize to obtain test_llvm.mlir. This pass performs a similar function to gpu-to-llvm, as it lowers down to LLVM. It takes all binary gpu.module images and creates one big binary blob with all images at the start of the module; clang will use this blob to compile down to a fatbinary. It also inserts an offload annotation and a function stub (similar to kernel stubs in CUDA) for each kernel that is called by a LaunchFuncOp; these annotations are the key to making clang aware that it needs to do kernel registration.
  4. We translate test_llvm.mlir via mlir-translate to LLVM IR to obtain test.ll.

Now we call clang with:

clang++ -fgpu-rdc --offload-new-driver test.ll test.cpp \
		-L${LLVM_PATH}/lib/ -lmlir_cudart_runtime -lcudart \
		-O3 -o test.exe

Simplifying the process a bit, what clang is doing is:

clang <...> "-o" "test.s" "-x" "ir" "test.ll"
clang <...> "-o" "test.o" "test.s"
clang <...> "-o" "main.o" "-x" "c++" "main.cpp"
clang-linker-wrapper <...> "-o" "test.exe" <...> "test.o" "main.o" <...>

The key step for us happens in clang-linker-wrapper; here clang will unpack the binary blob with our device bitcode and will start:

  1. Linking bitcode.
  2. Performing further optimization, including LTO.
  3. Generating PTX.
  4. Calling ptxas.
  5. Producing the cubin.
  6. Calling nvlink.
  7. Calling fatbinary.
  8. Adding kernel registration code, as this method uses the CUDA runtime.
  9. Linking everything up to get test.exe.

One important thing to mention is that the binary blob contains all the info required by clang, including the triple, arch, and offloading kind, and that the blob is bitcode.
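
For reference, a rough sketch of how such a blob can be produced with llvm/Object/OffloadBinary.h (this is not the patch's code; the API is as I recall it around LLVM 17, so treat the field names and string keys as assumptions):

#include "llvm/ADT/SmallString.h"
#include "llvm/Object/OffloadBinary.h"
#include "llvm/Support/MemoryBuffer.h"

// Wrap device bitcode into an offload image that clang's new offload driver
// can unpack; `deviceBitcode` is assumed to hold the gpu.module's bitcode.
llvm::SmallString<0>
makeOffloadImage(std::unique_ptr<llvm::MemoryBuffer> deviceBitcode) {
  using namespace llvm::object;
  OffloadBinary::OffloadingImage image;
  image.TheImageKind = IMG_Bitcode; // the blob is bitcode, not a cubin
  image.TheOffloadKind = OFK_Cuda;  // or OFK_HIP for the AMDGPU path
  image.Flags = 0;
  image.StringData["triple"] = "nvptx64-nvidia-cuda";
  image.StringData["arch"] = "sm_70";
  image.Image = std::move(deviceBitcode);
  // The serialized bytes are what gets embedded in the host module for
  // clang-linker-wrapper to find.
  return OffloadBinary::write(image);
}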

Comparison with the current pipeline

The current pipeline has no optimization for the CUDA backend (there's no IR optimization and no PTX optimization), and there's no linking in general, so no libdevice. On the AMD side, it hard-codes several globals, like the ABI version being set to 400 (500 is already out there), and other variables, which adds constraints to the generated code; there's no fast math and no general linking. Many of these shortcomings can be addressed by re-implementing the current pipeline using mlir/lib/Dialect/GPU/Transforms/GpuToDeviceObjectCommon.h.

This new pipeline provides extensibility, more optimization opportunities, LTO, device linking, AMDGPU fast math, and it removes the need for libcuda or the CUDA toolkit at build time or at runtime, as it moves all the burden of generating actual device code to clang. The drawbacks of this pipeline are: no JIT; it requires a compatible clang, as clang must be able to understand the IR produced by MLIR; and until clang extends the new driver to other platforms, generating executables is supported only on Linux.

mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
46

I'll add that to the docs, but just so that we're all on the same page, I'll explain with an example why it is needed.

gpu.module @module_1 {
    func.func @bar() {
      return
    }
}
gpu.module @module_2 {
    func.func @bar() {
      return
    }
}

In this case clang will link these into a single bitcode file containing all the offloading code. Thus one of module_1::bar or module_2::bar is going to get removed, which is a problem if they are different.

53

The reason I added these here is that they provide functionality similar to gpu-to-cubin, and those passes are implemented in the Transforms folder. I have no issue with moving them to Conversions.

80

To be honest I was also confused, but several passes in this file listed it as a dependency, so I didn't want to diverge:

https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/GPU/Transforms/Passes.td#L17

However, I must have misinterpreted the reason for those. I'll remove that dependency.

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
78

You're right, I'll change them to use StringLiteral.

545

This function is called only once per LaunchFuncOp, in line 697 of this file. I didn't worry too much because the existing conversion pattern (pre-patch) already does a similar call:
https://github.com/llvm/llvm-project/blob/main/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp#L800
So no overhead was added beyond what already existed pre-patch. But I guess I could try to fix that for both cases.
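
For reference, a minimal sketch of the caching idea (a hypothetical pass skeleton, not the actual patch code):

#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/SymbolTable.h"
#include "mlir/Pass/Pass.h"

// Build the symbol table once per module in the run method and let the
// conversion patterns reuse it, instead of re-walking the module for every
// LaunchFuncOp.
struct GpuToOffloadConversionPassSketch
    : mlir::PassWrapper<GpuToOffloadConversionPassSketch,
                        mlir::OperationPass<mlir::ModuleOp>> {
  void runOnOperation() override {
    mlir::SymbolTable symbolTable(getOperation());
    // Patterns would take `symbolTable` by reference in their constructor and
    // call symbolTable.lookup(name) instead of scanning the module each time.
    (void)symbolTable;
  }
};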

713

We could, but we would stop handling cases like:

module {
  my.module {
    gpu.module {
    }
  }
}

We could say it's a precondition for this pass and assume that the above never occurs.

734

Will add that.

766

Will change it.

mlir/lib/Dialect/GPU/Transforms/GpuToDeviceObjectCommon.h
46

This is the class that would help re-implement the existing gpu serialization pipeline, as it handles translating down to bitcode, linking, optimization, and creating a binary object, so a child class only needs to implement certain methods. The reason I used CRTP is that inside the class I call pass functions (getOperation, signalPassFailure), so it made sense to me to do it this way; also, in the run method I use getDerived().<func> to handle the inheritance bit. However, I guess I can lower the size of the class by moving part of the code to a cpp.
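
For reference, a stripped-down sketch of the CRTP shape being described (names are illustrative, not the actual GpuToDeviceObjectCommon.h contents):

// The base class drives the common serialization steps and calls into the
// derived class for the target-specific pieces via getDerived().
template <typename Derived>
class SerializeToObjectBase {
public:
  void run() {
    Derived &derived = getDerived();
    derived.translateToBitcode();   // gpu.module -> llvm::Module
    derived.linkBitcodeLibraries(); // e.g. libdevice or ROCm device libs
    derived.optimize();             // run the LLVM optimization pipeline
    derived.createBinaryObject();   // wrap the result as an offload object
  }

protected:
  Derived &getDerived() { return static_cast<Derived &>(*this); }
};

class SerializeToNVPTXSketch
    : public SerializeToObjectBase<SerializeToNVPTXSketch> {
public:
  void translateToBitcode() { /* NVVM-specific translation */ }
  void linkBitcodeLibraries() { /* link libdevice when a CUDA path is set */ }
  void optimize() { /* optimization pipeline */ }
  void createBinaryObject() { /* emit the offload object */ }
};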

264

I was intending to also trash the buffer binaryData asap, but forgot to flush.

mlir/lib/Dialect/GPU/Transforms/NameMangling.cpp
107

The problem is that as far as I know the current implementation handles renaming only flat symbols:
https://mlir.llvm.org/doxygen/classmlir_1_1SymbolTable.html#a256c12869c03f20d4d1a122ec02eb417

And in this case I'm renaming <symbol> in <gpu_module>::<symbol>. But I could be missing something.

114

No, as this pass also needs to update LaunchFuncOps with the new symbol names, and those are located outside the GPUModule.

Full break down of the compilation process

To make things easier, I'll attach files to show the changes after each step.

Fantastic, thanks a lot for this!

  4. We translate test_llvm.mlir via mlir-translate to LLVM IR to obtain test.ll.

...

The key step for us happens in clang-linker-wrapper; here clang will unpack the binary blob with our device bitcode and will start:

  1. Linking bitcode.
  2. Performing further optimization, including LTO.
  3. Generating PTX.
  4. Calling ptxas.
  5. Producing the cubin.
  6. Calling nvlink.
  7. Calling fatbinary.
  8. Adding kernel registration code, as this method uses the CUDA runtime.
  9. Linking everything up to get test.exe.

One important thing to mention is that the binary blob contains all the info required by clang, including the triple, arch, and offloading kind, and that the blob is bitcode.

That's nice. I wonder about the logic of clang-linker-wrapper; it seems to me that there is very little coupling to the object files and that most of this could operate directly on LLVM IR just the same.
With a good layering and API, we should be able to use this exact flow in a JIT environment as well: anything I am missing?

fmorac marked 9 inline comments as done.May 1 2023, 6:50 PM

That's nice, I wonder about the logic of clang-linker-wrapper, it seems to me that there is very little coupling to the object files and that most of this can operate directly on LLVM IR just the same.

Yes, for the most part; here is the implementation of that tool:
https://github.com/llvm/llvm-project/tree/main/clang/tools/clang-linker-wrapper
If I understood the code correctly, the tool calls clang under the hood, and clang is the one that calls all of the device tools. I would think all these utilities could be pure LLVM, but that's clang's code, so it's up to them.

With a good layering and API, we should be able to use this exact flow in a JIT environment as well: anything I am missing?

For JIT, we would have to require a working CUDA toolkit installation. But yes, it would be possible to have a JIT; however, I wouldn't add device linking (nvlink) to it, as we would probably end up with mlirc, an MLIR-to-executable compiler.

With a good layering and API, we should be able to use this exact flow in a JIT environment as well: anything I am missing?

For JIT, we would have to require a working CUDA toolkit installation. But yes, it would be possible to have a JIT; however, I wouldn't add device linking (nvlink) to it, as we would probably end up with mlirc, an MLIR-to-executable compiler.

Right, but in a JIT environment you always need to have a working CUDA installation anyway, and with https://reviews.llvm.org/D145527 the ptx->cubin step can happen directly using the nvptxcompile library.

mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
53

That's fine to keep this aligned with gpu-to-cubin; we should probably be careful about the naming and extend the documentation to clarify what this all does (that is: it actually invokes the entire LLVM backend, right?)

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
713

Right, if someone has a my.module, they should schedule the pass there. We can make this an "Operation pass" so that it can run on any kind of module.

mlir/lib/Dialect/GPU/Transforms/NameMangling.cpp
114

OK, so let's just make this runnable on any SymbolTable operation (like the other case) and replace the full IR walk with building a symbol table and using it directly?

fmorac marked 2 inline comments as done.EditedMay 2 2023, 5:05 AM

And now it's time to potentially self-sabotage this diff.

LLVM Offloading

I really like @tra's idea of further consolidating LLVM offload stuff under llvm, which I think is something that the clang & OpenMP team have been doing for a while.

What I'm thinking is:
Moving certain bits from clang & clang-linker-wrapper down to llvm, namely kernel registration from clang-linker-wrapper, IR embedding from clang, and the creation of offload annotations (namely generating offloading entries).
For the latter I'm thinking something along the lines of:

SomeKindOfErrorCode 
llvm::offloading::embedOffloadEntriesToModule(llvm::Module &module, SmallVector<std::pair<StringRef, OffloadBinary::OffloadingImage>> &&offloadingEntries);

That function would embed the objects by concatenating them, and generate the offloading entries from the StringRef.

What do you think @jhuber6 & @jdoerfert ?

MLIR

What would this new route mean for MLIR?

  • Getting rid of gpu-to-(nvptx|cubin|amdgpu|hsaco) passes, that functionality would be moved and consolidated under translation.
  • Requiring the CUDA toolkit for JIT, which @mehdi_amini already said would be OK.
  • Moving away from the cuda driver for the runner to a single library based on the cuda runtime.
  • Moving away from loading device code modules, and switching to kernel registration.

Final comments:
If all the above parties agree, we could go with that option; it would mean either trashing this diff, or committing it (after addressing all the comments) and switching over once the above functionality is ready.
In other circumstances I'd tell you: give me one week and I can do it. But I'm chasing a paper deadline, so I cannot commit (pun intended) to the above proposal until after the 19th of May; that's why I'd prefer to commit this diff and then make the switch (provided that we agree on the above proposal).

mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
53

Yeah, to be honest there's not much documentation on this. If you're referring to the MLIR LLVM backend, then yes; if you're referring to the LLVM codegen backend, then no, as gpu-to-nvptx stops at bitcode, while gpu-to-cubin goes down to a cubin.

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
713

That would work.

mlir/lib/Dialect/GPU/Transforms/NameMangling.cpp
114

If I understood you correctly, yes, we could switch this to run on SymbolTable Ops, and collect GPUModule symbols defined on that particular symbol table.

And now it's time to potentially self-sabotage this diff.

LLVM Offloading

I really like @tra's idea of further consolidating LLVM offload stuff under llvm, which I think is something that the clang & OpenMP team has been doing for a while.

What I'm thinking is:
Moving certain bits from clang & clang-linker-wrapper down to llvm, namely kernel registration from clang-linker-wrapper, IR embedding from clang, and the creation of offload annotations (namely generating offloading entries).
For the latter I'm thinking something along the lines of:

SomeKindOfErrorCode 
llvm::offloading::embedOffloadEntriesToModule(llvm::Module &module, SmallVector<std::pair<StringRef, OffloadBinary::OffloadingImage>> &&offloadingEntries);

That function would embed the objects by concatenating them, and generate the offloading entries from the StringRef.

What do you think @jhuber6 & @jdoerfert ?

I'm fully in favor of moving more of this offload handling into LLVM. It was mostly put in clang because that was its only consumer at the time of writing. The only clang-related dependency in the clang-offload-packager and the clang-linker-wrapper is the version string, which could easily be removed. We use clang internally, but could allow using different methods if needed.

There is some work I haven't yet gotten around to doing for the offloading entries, however. Currently, they rely on a C-identifier section name to get the linker to emit symbols at the beginning and end of the section. This currently only works on Linux, with MacOS and Windows having slightly different methods for iterating through a section. If we want something generic we'll need to either add special handling when we emit the sections based on the triple, or just shove that logic into the linker wrapper. The linker wrapper already includes a lot of magic, and the more I put in it the closer it gets to an actual linker, but it would be fairly trivial to just search the linked image for the section name and emit an array independent of the target OS.
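
For context, a rough sketch of the ELF mechanism being described (the section and struct names below approximate what clang currently emits for CUDA with the new driver and are not guaranteed; other platforms need a different scheme):

#include <cstddef>
#include <cstdint>

// The offloading entry layout copied from OpenMP.
struct __tgt_offload_entry {
  void *addr;    // address of the kernel stub or global
  char *name;    // symbol name to register
  size_t size;   // 0 for functions
  int32_t flags;
  int32_t reserved;
};

// Because the section has a C-identifier name, the ELF linker defines
// __start_<section> / __stop_<section> symbols bracketing all entries.
extern __tgt_offload_entry __start_cuda_offloading_entries[];
extern __tgt_offload_entry __stop_cuda_offloading_entries[];

static void registerAllEntries() {
  for (__tgt_offload_entry *entry = __start_cuda_offloading_entries;
       entry != __stop_cuda_offloading_entries; ++entry) {
    // register entry->addr / entry->name with the CUDA runtime here
  }
}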

Furthermore, the actual structure of the offloading entry was copied from OpenMP mostly for convenience. It doesn't have enough fields to actually describe some more esoteric CUDA / HIP constructs like textures so I just ignored those for now. That should be changed as well.

I really like @tra's idea of further consolidating LLVM offload stuff under llvm, which I think is something that the clang & OpenMP team have been doing for a while.
...

What would this new route mean for MLIR?

  • Getting rid of gpu-to-(nvptx|cubin|amdgpu|hsaco) passes, that functionality would be moved and consolidated under translation.

Just to clarify: it means that we'll get JIT and AOT aligned on the same path in MLIR right? If so then that looks great :)

I'm not sure what the input to "translate" will look like exactly, can you clarify this a bit?

fmorac marked 2 inline comments as done.EditedMay 2 2023, 11:39 AM

Just to clarify: it means that we'll get JIT and AOT aligned on the same path in MLIR right? If so then that looks great :)

Yes, we should have AOT and JIT for devices as long as there's a CUDA toolkit or ROCm installation. In reality we just need device bitcode libraries and a mechanism for compiling device assembly down to an object.

I'm not sure what the input to "translate" will look like exactly, can you clarify this a bit?

What I'm thinking (I'm not 100% sure it's possible, but I don't see a reason why it wouldn't be) is that if we add some logic to the LLVM translation interface, it should be able to handle:

module attributes {gpu.container_module} {
  llvm.func @bar() {
   ...
  }
  gpu.module {
    llvm.func kernel() {
       ...
    }
  }
}

Thus, we would be able to remove all interactions with LLVM IR from the passes. The obvious benefit here is that we would be able to use the pure LLVM IR API to embed the device object or do other manipulations, making things easier to implement in this case, provided that we have a unified LLVM offload backend.

What I'm realizing just now is that I'm not sure the translation interface is already capable of handling options for this use case, e.g. to specify the arch and such; however, for now those could use gpu.module attributes.

In summary:

  1. We would remove all the gpu-to-(cubin|...) passes.
  2. Add logic to the LLVM translation interface so that it is able to handle generating LLVM offload code (the intention of this patch), or generating complete executables by using CUDA or HIP tools.

What I'm thinking (I'm not 100% sure it's possible, but I don't see a reason why it wouldn't be) is that if we add some logic to the LLVM translation interface, it should be able to handle:

module attributes {gpu.container_module} {
  llvm.func @bar() {
   ...
  }
  gpu.module {
    llvm.func kernel() {
       ...
    }
  }
}

Thus, we would be able to remove all interactions with LLVM IR from the passes. The obvious benefit here is that we would be able to use the pure LLVM IR API to embed the device object or do other manipulations, making things easier to implement in this case, provided that we have a unified LLVM offload backend.

What I'm realizing just now is that I'm not sure the translation interface is already capable of handling options for this use case, e.g. to specify the arch and such; however, for now those could be gpu.module attributes.

I think that this may be close to how we handle the OpenMP translation right now, but @ftynse should be able to confirm.

Just to check, with this new pipeline, is there a way to carry your own copy of the relevant clang/LLVM IR offloading support in the same binary/library? That is, suppose I know that /opt/rocm/llvm/bin/* is too old (ex. the intrinsics MLIR generates have attributes that said system LLVM can't understand), and I want to just link in the LLVM that lives beside whichever MLIR commit I'm targeting. Can I do that?

Overall, I'm liking this refactor.

(and part of why I had the global constants in SerializeToHsaco is that we needed that pass to work - so long as device libraries weren't actually needed - in environments where ROCm either wasn't installed or was in some weird path we didn't know about. I can perhaps go poke the people who make our PyTorch wheels to try and find out more about what exactly is going on there, but still, I hope that helps with context.)

mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
83

On both these passes, would it be reasonable to add an llvm-args or clang-args option or the like, so I can pass in whatever weird flags (for instance, -global-isel) I feel like?

fmorac added a comment.May 4 2023, 8:52 AM

Just to check, with this new pipeline, is there a way to carry your own copy of the relevant clang/LLVM IR offloading support in the same binary/library? That is, suppose I know that /opt/rocm/llvm/bin/* is too old (ex. the intrinsics MLIR generates have attributes that said system LLVM can't understand), and I want to just link in the LLVM that lives beside whichever MLIR commit I'm targeting. Can I do that?

If I understood your question correctly: yes and no. This pipeline doesn't carry any clang dependencies; the user is the one responsible for supplying a valid clang compiler.

For example, ROCm is used only to link against <ROCm path>/amdgcn/bitcode libraries, and the user is responsible for supplying the clang compiler they want to use to compile the offload code.
You can even use a different ROCm version by supplying rocm-path to the pass, and if you supply no ROCm path, then no linking is performed and it's up to the user to supply the appropriate bitcode libraries when calling clang.

ftynse added a comment.May 5 2023, 5:38 AM

I think that this may be close to how we handle the OpenMP translation right now, but @ftynse should be able to confirm.

For OpenMP, we are storing an OpenMPIRBuilder instance inside ModuleTranslation. Interface implementations for specific operations can use that builder. It should be possible to follow the same approach for GPU. The contract for the interface method is rather simple: it needs to leave the llvm::IRBuilder in a valid state for further insertion, and it needs to update moduleTranslation so that it has mappings between any values and symbols defined by this operation and their LLVM IR counterparts. The implementation itself can be arbitrarily complex. AFAICS, the main thing to take care of would be preserving the "launch" calls from host code to device code through some sort of mapping stored in moduleTranslation.
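
For reference, a rough sketch of what a GPU translation interface could look like, following the existing LLVMTranslationDialectInterface hook (this is a sketch of the idea, not a concrete design):

#include "mlir/Target/LLVMIR/LLVMTranslationInterface.h"
#include "mlir/Target/LLVMIR/ModuleTranslation.h"

// Hypothetical interface implementation: when translation reaches a
// gpu.module, serialize it to device bitcode, wrap it as an offload image,
// embed it into the host llvm::Module, and record the kernel symbols so the
// launch lowerings can refer to them later.
class GPUOffloadTranslationSketch
    : public mlir::LLVMTranslationDialectInterface {
public:
  using LLVMTranslationDialectInterface::LLVMTranslationDialectInterface;

  mlir::LogicalResult
  convertOperation(mlir::Operation *op, llvm::IRBuilderBase &builder,
                   mlir::LLVM::ModuleTranslation &moduleTranslation) const final {
    // 1. Translate the gpu.module body to an llvm::Module.
    // 2. Link device bitcode libraries, optimize, wrap into an offload image.
    // 3. Embed the image as a global in the host module and keep a mapping
    //    in moduleTranslation for the corresponding launch calls.
    return mlir::success();
  }
};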

fmorac added a comment.May 8 2023, 7:03 AM

I think that this may be close to how we handle the OpenMP translation right now, but @ftynse should be able to confirm.

.... The implementation itself can be arbitrarily complex. AFAICS, the main thing to take care of would be preserving the "launch" calls from host code to device code through some sort of mapping stored in moduleTranslation.

After digging around in how OpenMP handles translation, and given @ftynse's comment, I think all the serialization pipelines should be moved to Translation; they make more sense there. I can take care of moving them; however, I wouldn't be able to do it until the end of May. Now my question is: what should we do with this diff? I could fix all the comments (it would take me less than a day) and push it, or we can abandon it, wait, and then move everything to translation. Thoughts @mehdi_amini, @tra?

mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
83

This approach doesn't call clang directly. You can inject those flags when you call clang.

krzysz00 added inline comments.May 8 2023, 9:19 AM
mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
83

Wait, hold on, these new serialize passes handle the LLVM IR -> binary translation, so where in this diff does the relevant clang driver get called?

fmorac marked an inline comment as done.May 8 2023, 9:51 AM
fmorac added inline comments.
mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
83

No, this is how you would do it:

mlir-opt test.mlir \
  -gpu-launch-sink-index-computations \
  -gpu-kernel-outlining \
  -gpu-async-region \
  -gpu-name-mangling \
  -convert-scf-to-cf \
  -convert-gpu-to-nvvm \
  -convert-math-to-llvm \
  -convert-arith-to-llvm \
  -convert-index-to-llvm \
  -canonicalize \
  -gpu-to-nvptx="chip=sm_70 cuda-path=<cuda toolkit path>" \
  -gpu-to-offload \
  -canonicalize \
  -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
clang++ -fgpu-rdc --offload-new-driver test.ll test.cpp \
		-L${LLVM_PATH}/lib/ -lmlir_cudart_runtime -lcudart \
		-O3 -o test.exe

Calling clang to get the final executable is the responsibility of the user.

MLIR

What would this new route mean for MLIR?

  • Getting rid of gpu-to-(nvptx|cubin|amdgpu|hsaco) passes, that functionality would be moved and consolidated under translation.
  • Requiring the CUDA toolkit for JIT, which @mehdi_amini already said would be OK.
  • Moving away from the cuda driver for the runner to a single library based on the cuda runtime.
  • Moving away from loading device code modules, and switching to kernel registration.

Final comments:
If all the above parties agree, we could go with that option; it would mean either trashing this diff, or committing it (after addressing all the comments) and switching over once the above functionality is ready.
In other circumstances I'd tell you: give me one week and I can do it. But I'm chasing a paper deadline, so I cannot commit (pun intended) to the above proposal until after the 19th of May; that's why I'd prefer to commit this diff and then make the switch (provided that we agree on the above proposal).

All of this would be a drastic change to the GPU compilation approach in MLIR, and it needs to be discussed on Discourse. Most users wouldn't have seen this redesign proposal.

If all the above parties agree, we could go with that option; it would mean either trashing this diff, or committing it (after addressing all the comments) and switching over once the above functionality is ready.

I think @mehdi_amini commented on this upthread. This patch has too many things to be considered a single MLIR revision, and it can't be committed in this form, irrespective of the approach to take.

The current pipeline has no optimization for the CUDA backend (there's no IR optimization and no PTX optimization), and there's no linking in general, so no libdevice

But it's simple to address these, and the lack of them alone can't be the only motivation to change the architecture here. One can add a transformer on the LLVM module specifying the optimization level in the serialize-to-cubin pass before translating to PTX. The linking to libdevice can also be performed there. See https://discourse.llvm.org/t/nvptx-codegen-for-llvm-sin-and-friends/58170/16 (@csigg provided the pointer there; it is the approach used by XLA to link in libdevice). Supporting these isn't more than 20 lines of code each.
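
For example, a minimal sketch of such a transformer over the serializer's llvm::Module (standard new pass manager boilerplate; not code from the existing pass):

#include "llvm/IR/Module.h"
#include "llvm/Passes/PassBuilder.h"

// Run a default O3 pipeline on the module produced by the serializer before
// it is translated to PTX; linking libdevice would happen just before this.
void optimizeModule(llvm::Module &module) {
  llvm::LoopAnalysisManager lam;
  llvm::FunctionAnalysisManager fam;
  llvm::CGSCCAnalysisManager cgam;
  llvm::ModuleAnalysisManager mam;

  llvm::PassBuilder pb;
  pb.registerModuleAnalyses(mam);
  pb.registerCGSCCAnalyses(cgam);
  pb.registerFunctionAnalyses(fam);
  pb.registerLoopAnalyses(lam);
  pb.crossRegisterProxies(lam, fam, cgam, mam);

  llvm::ModulePassManager mpm =
      pb.buildPerModuleDefaultPipeline(llvm::OptimizationLevel::O3);
  mpm.run(module, mam);
}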

fmorac abandoned this revision.Jun 13 2023, 5:33 AM
fmorac marked an inline comment as done.