For general context see:
https://discourse.llvm.org/t/rfc-extending-mlir-gpu-device-codegen-pipeline/70199/1
What this diff is not:
- It's not a replacement of the current serialization pipeline.
- However, this diff provides the infrastructure to reimplement the current pipeline with little effort and to address several of its current shortcomings, in particular its inability to link against bitcode libraries.
- There are several reasons why this diff doesn't include that re-implementation; the top ones: the patch is already large enough; I handle AMDGPU code generation by linking against the bitcode libraries instead of introducing symbols (see: https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp#L198-L269), and I don't know whether that's something the current code owners of that pipeline want.
- This patch doesn't introduce a clang build dependency; in fact, there are no additional build dependencies at all.
What this diff is:
- It's an additional pipeline that currently comes with several restrictions, but also with many new features.
- Restrictions: it requires a compatible clang compiler to generate executables. The clang features this patch relies on are currently available only on Linux, so until clang extends its support *this pipeline is restricted to Linux*.
- Features:
- Link to device bitcode libraries.
- Additional AMDGPU features like fast math.
- Automatic linking to libdevice, provided there's a valid CUDA toolkit path (see the example after this list).
- This pipeline is always available as long as the respective target is built (AMDGPU, NVPTX).
- Enables access to clang's code generation features.
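To make the libdevice point concrete: a kernel like the sketch below (hypothetical, not part of the patch) ends up calling __nv_expf once -convert-gpu-to-nvvm has run, and that symbol can only be resolved by linking against libdevice:

  gpu.module @kernels {
    gpu.func @exp_kernel(%arg0: memref<128xf32>) kernel {
      %tid = gpu.thread_id x
      %v = memref.load %arg0[%tid] : memref<128xf32>
      // -convert-gpu-to-nvvm lowers math.exp to a call to __nv_expf,
      // a function defined in libdevice.
      %e = math.exp %v : f32
      memref.store %e, %arg0[%tid] : memref<128xf32>
      gpu.return
    }
  }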
Code walkthrough:
Summary
This diff introduces:
- --gpu-to-(nvptx|amdgpu): these passes serialize gpu.modules to LLVM bitcode, which is then further serialized to an offload object format supported by LLVM and compatible with clang.
- --gpu-name-mangling: mangles the names of symbols inside gpu modules. This pass might be required because clang unpacks all offload objects and links them together; if functions in different gpu modules share a name, clang will merge the symbols. The mangling scheme is: __G<gpu module name>_S<function name> (see the example after this list).
- --gpu-to-offload: this pass is equivalent to --gpu-to-llvm, except that it introduces clang offload annotations and handles the conversion of LaunchFuncOp differently.
- Creates the libraries mlir_cudart_runtime & mlir_hiprt_runtime, as clang uses runtime functions (e.g. cudaLaunchKernel) instead of driver functions (e.g. cuLaunchKernel).
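As an example of the mangling scheme: the kernel used in the tests below lives in a gpu.module that shares its name, and --gpu-name-mangling rewrites it roughly as follows (signatures and bodies elided):

  // Before:
  gpu.module @test_mlir_kernel {
    gpu.func @test_mlir_kernel(...) kernel { ... }
  }
  // After --gpu-name-mangling:
  gpu.module @test_mlir_kernel {
    gpu.func @__Gtest_mlir_kernel_Stest_mlir_kernel(...) kernel { ... }
  }

This is the same symbol that shows up in the nsys trace at the end of this description.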
Key files walkthrough:
- GpuToDeviceObjectCommon.h: this file does the heavy lifting for --gpu-to-(nvptx|amdgpu), as it handles the serialization pipeline. The classes in this file would be the ones used to re-implement the current pipeline.
- GpuToDeviceOffload.cpp: implements the passes --gpu-to-(nvptx|amdgpu).
- NameMangling.cpp: implements the pass --gpu-name-mangling.
- CudaRuntimeWrappers.cpp: implements the library mlir_cudart_runtime. Instead of creating a new file, I decided it was better to keep everything in one file and use the macro MLIR_USE_CUDART_RUNNER to handle both libraries. The advantage of this approach is that it ensures developers always update both versions of this library.
- GPUToLLVMConversion.cpp: handles the pass --gpu-to-offload. The key modifications in this file are the addition of the GPUOffloadBuilder class, the GpuToOffloadConversionPass pass, the populateGpuToLLVMOffloadConversionPatterns function, and an update to ConvertLaunchFuncOpToGpuRuntimeCallPattern::matchAndRewrite (sketched below).
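For context on the LaunchFuncOp change: by the time --gpu-to-offload runs, a launch in the test looks roughly like the op below (the operands are illustrative). Instead of the driver-API module-loading lowering used by --gpu-to-llvm, it is rewritten to call into the mlir_cudart_runtime / mlir_hiprt_runtime wrappers, which sit on top of runtime-API calls such as cudaLaunchKernel, as the nsys trace at the end confirms:

  gpu.launch_func @test_mlir_kernel::@__Gtest_mlir_kernel_Stest_mlir_kernel
      blocks in (%c3, %c1, %c1) threads in (%c128, %c1, %c1)
      args(%buf : memref<?xf32>)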
Real world testing:
This patch was tested on the following three platforms:
- Frontier at OLCF - ORNL, using AMD MI250x gfx90a
- ROCm version: 5.4.3
- clang version: 17.0.0 (https://github.com/llvm/llvm-project.git ea3a8700328050a4dec29904b2c72d53a3be0660)
- Perlmutter at NERSC - LBNL, using NVIDIA A100.
- CUDA: 12.0
- clang version: 17.0.0 (https://github.com/llvm/llvm-project.git 6875424135312aeb26ab8e0358ba7f9e6e80e741)
- A local server, using NVIDIA V100.
- CUDA: 11.8
- clang version: 17.0.0 (++20230417095441+43ac269bdd00-1~exp1~20230417215605.872)
The test consists of a GPU kernel written in MLIR (test.mlir) and a C++ file with the main function and verification logic (test.cpp); a sketch of the kernel's shape is shown below.
To run the tests I built MLIR from scratch on the above systems and compiled the test using an already existing clang. On Perlmutter I also ran an additional test using the LLVM IR file test.ll generated on platform 3, to test a form of cross-compilation; it ran successfully.
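The test sources aren't reproduced here, but to give an idea of their shape, here is a hypothetical sketch of what test.mlir could look like. Only the naming and the 3x128 launch configuration are taken from the nsys trace below; the body and buffer sizes are illustrative:

  func.func @test_mlir(%a: memref<384xf32>, %b: memref<384xf32>) {
    %c1 = arith.constant 1 : index
    %c3 = arith.constant 3 : index
    %c128 = arith.constant 128 : index
    gpu.launch blocks(%bx, %by, %bz) in (%gx = %c3, %gy = %c1, %gz = %c1)
               threads(%tx, %ty, %tz) in (%sx = %c128, %sy = %c1, %sz = %c1) {
      // Global index = blockIdx.x * blockDim.x + threadIdx.x.
      %off = arith.muli %bx, %sx : index
      %i = arith.addi %off, %tx : index
      %v = memref.load %a[%i] : memref<384xf32>
      %w = memref.load %b[%i] : memref<384xf32>
      %s = arith.addf %v, %w : f32
      memref.store %s, %a[%i] : memref<384xf32>
      gpu.terminator
    }
    return
  }

After -gpu-kernel-outlining, a function named @test_mlir yields gpu.func @test_mlir_kernel inside gpu.module @test_mlir_kernel, which is exactly the symbol the mangling example above renames.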
The steps to compile the test for NVIDIA sm_70 targets are:
mlir-opt test.mlir \
  -gpu-launch-sink-index-computations \
  -gpu-kernel-outlining \
  -gpu-async-region \
  -gpu-name-mangling \
  -convert-scf-to-cf \
  -convert-gpu-to-nvvm \
  -convert-math-to-llvm \
  -convert-arith-to-llvm \
  -convert-index-to-llvm \
  -canonicalize \
  -gpu-to-nvptx="chip=sm_70 cuda-path=<cuda toolkit path>" \
  -gpu-to-offload \
  -canonicalize \
  -o test_llvm.mlir

mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll

clang++ -fgpu-rdc --offload-new-driver test.ll test.cpp \
  -L${LLVM_PATH}/lib/ -lmlir_cudart_runtime -lcudart \
  -O3 -o test.exe
In all cases the tests completed successfully. I verified that all of them were indeed calling the appropriate runtime functions by profiling the code with nsys & rocprof. Here is the nsys output from Perlmutter:
Time (%)  Total Time (ns)  Num Calls  Avg (ns)       Med (ns)       Min (ns)     Max (ns)     StdDev (ns)  Name
--------  ---------------  ---------  -------------  -------------  -----------  -----------  -----------  ----------------------
    99.9      259,394,921          1  259,394,921.0  259,394,921.0  259,394,921  259,394,921          0.0  cudaStreamCreate
     0.1          137,695          2       68,847.5       68,847.5        5,250      132,445     89,940.4  cudaMalloc
     0.0          103,289          2       51,644.5       51,644.5        7,655       95,634     62,210.5  cudaFree
     0.0           60,957          1       60,957.0       60,957.0       60,957       60,957          0.0  cuLibraryLoadData
     0.0           41,681          3       13,893.7       18,426.0        3,647       19,608      8,893.5  cudaMemcpyAsync
     0.0           22,644          1       22,644.0       22,644.0       22,644       22,644          0.0  cudaLaunchKernel
     0.0           11,492          1       11,492.0       11,492.0       11,492       11,492          0.0  cudaStreamDestroy
     0.0            4,188          1        4,188.0        4,188.0        4,188        4,188          0.0  cudaStreamSynchronize
     0.0            1,132          1        1,132.0        1,132.0        1,132        1,132          0.0  cuModuleGetLoadingMode

[6/8] Executing 'gpukernsum' stats report

Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  GridXYZ  BlockXYZ  Name
--------  ---------------  ---------  --------  --------  --------  --------  -----------  -------  --------  -------------------------------------
   100.0            4,320          1   4,320.0   4,320.0     4,320     4,320          0.0  3 1 1    128 1 1   __Gtest_mlir_kernel_Stest_mlir_kernel