Brief
The intent of this diff is to move the serialization passes `gpu-to-cubin` & `gpu-to-hsaco` to the translation step, while also introducing all the infrastructure needed for adding further serialization pipelines.
Why?
From a conceptual point of view, serialization involves translating the ops inside a GPU module into a serialized string; as such, it shouldn't happen in a pass but rather during translation. From an implementation point of view, it's easier to perform and control serialization when both the host and device LLVM modules are available; this is not possible during a pass, but it is possible during translation.
Overview
The biggest changes introduced by this patch are:
- Introducing the `TranslationTarget` attribute and companion C++ interfaces defined in GPUTranslationTargets.h. This attribute conveys the serialization options to the translation stage, and it must be present as an attribute on the `gpu.module` for translation to be performed (a hedged lookup sketch appears after this list). Format:
```
#gpu.target<PIPELINE: triple = TARGETTRIPLE, chip = TARGETCHIP,
            features = TARGETFEATURES, toolkit = TOOLKITPATH,
            link = [LIST OF BITCODE FILES TO LINK], opts = {EXTRA OPTS}>

; AMDGPU example using default chip = gfx600.
#gpu.target<AMDGPU: toolkit = "/opt/rocm/5.4.3", link = ["mylib.bc"], opts = {fast, ftz}>

; NVPTX example with default options, chip = sm_35, triple = nvptx64-nvidia-cuda.
#gpu.target<NVPTX>
```
Example:
```
gpu.module @kernel_module attributes {
    rocdl.hsaco = #gpu.target<AMDGPU : chip = "gfx90a">,
    target = #gpu.target<NVPTX>} {
  llvm.func @kernel(%arg0: i32, %arg1: !llvm.ptr<f32>, %arg2: !llvm.ptr<f32>,
                    %arg3: i64, %arg4: i64, %arg5: i64) attributes {gpu.kernel} {
    llvm.return
  }
}
```
- Modifying the `gpu-to-llvm` pass so that it no longer removes the `gpu.module`s, while also adding a stub for the serialized string, to be filled in during translation. Additionally, this pass can be used to set or add a target to the `gpu.module`s. Example:
```
; Selects the `rocdl.hsaco` target; this target must be present in the
; attributes of every `gpu.module`, i.e. `gpu.module ... attributes {rocdl.hsaco = ...}`.
--gpu-to-llvm='target=rocdl.hsaco'

; Sets the GPU target to a specific target. The format used for specifying the
; target is the format of the body of the `TranslationTarget` attribute; the
; quotation marks have to be managed carefully for the attribute to parse successfully.
--gpu-to-llvm='target="NVPTX: chip = "sm_90", opts = {ftz}"'
```
- The addition of the `ModuleToObject` class. This class controls the behavior of all serialization pipelines. It allows linking against any specified bitcode files and, if toolkit paths are detected or specified, linking against the device libraries found in those toolkits. A hedged sketch of its possible shape follows below.
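To make the first bullet concrete, a translation stage could look up the target attribute along the following lines. This is only a sketch: `TranslationTargetAttr` stands in for the class generated from TranslationTargetAttr.td, and `lookupTarget` is an invented helper, not a declaration from GPUTranslationTargets.h.

```
// Hypothetical sketch; `TranslationTargetAttr` stands in for the attribute
// class generated from TranslationTargetAttr.td, and `lookupTarget` is an
// invented helper, not a declaration from GPUTranslationTargets.h.
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "llvm/ADT/StringRef.h"

// Fetch the serialization options from a gpu.module; `targetName` is the
// attribute name selected with --gpu-to-llvm='target=...' (the examples in
// this summary use "target" and "rocdl.hsaco").
static mlir::gpu::TranslationTargetAttr
lookupTarget(mlir::gpu::GPUModuleOp module,
             llvm::StringRef targetName = "target") {
  return module->getAttrOfType<mlir::gpu::TranslationTargetAttr>(targetName);
}
```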
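Similarly, a minimal sketch of the shape `ModuleToObject` could take; every member name here is an illustrative assumption rather than the actual declarations from ModuleToObject.h.

```
// Hypothetical sketch of ModuleToObject; member names are assumptions.
#include "mlir/IR/Operation.h"
#include "mlir/Support/LogicalResult.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include <memory>
#include <optional>
#include <string>

class ModuleToObject {
public:
  ModuleToObject(mlir::Operation &module, llvm::StringRef triple,
                 llvm::StringRef chip, llvm::StringRef features = {})
      : module(module), triple(triple.str()), chip(chip.str()),
        features(features.str()) {}
  virtual ~ModuleToObject() = default;

  // Drives the whole pipeline: translate the GPU module to LLVM IR, link in
  // the requested bitcode files and toolkit device libraries, then serialize.
  std::optional<llvm::SmallVector<char, 0>> run() {
    llvm::LLVMContext context;
    std::unique_ptr<llvm::Module> llvmModule = translateToLLVMIR(context);
    if (!llvmModule || mlir::failed(linkFiles(*llvmModule)))
      return std::nullopt;
    return serializeToObject(*llvmModule);
  }

protected:
  // Hooks that each serialization pipeline (e.g. NVPTXPipeline.cpp,
  // AMDGPUPipeline.cpp) would specialize.
  virtual std::unique_ptr<llvm::Module>
  translateToLLVMIR(llvm::LLVMContext &context) = 0;
  virtual mlir::LogicalResult linkFiles(llvm::Module &module) = 0;
  virtual std::optional<llvm::SmallVector<char, 0>>
  serializeToObject(llvm::Module &module) = 0;

  mlir::Operation &module;
  std::string triple, chip, features;
};
```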
Why is the patch so big?
- Such a big change can't be done in a series of steps without leaving broken bits in between.
- Many SLOC are reused from the original pipelines (this is especially true for the files NVPTXPipeline.cpp and AMDGPUPipeline.cpp); the only truly original files are TranslationTargetAttr.td, GPUTranslationTargets.*, and ModuleToObject.*.
TODO
This diff is the first in a series of patches extending the GPU serialization pipeline.
The remaining patches will:
- Remove all the serialization passes while updating in-tree projects to use the updated pipeline.
- Introduce LIT tests for testing translation.
- Introduce the offload pipeline proposed in this RFC. With this change, this patch should be less than 100 source lines.
Testing
For testing the patch, two machines were used:
- A local machine with an NVIDIA V100, CUDA Toolkit 11.8, and Ubuntu 22.04.2.
- Frontier at ORNL, with an AMD MI250X.
In all instances the test completed successfully.
Clang was used to compile the final executable purely out of convenience; the JIT should remain functional.
The input files were:
- test.cpp, which verifies the results produced by MLIR (a hypothetical sketch follows this list).
- test.mlir, the GPU kernel.
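For reference, a driver along the following lines could play the role of test.cpp. This is only a sketch; the actual file is not part of this diff, and the entry point `test_mlir`, its signature, and the `expected` reference are invented for illustration.

```
// Hypothetical stand-in for test.cpp; the real file is not part of this diff.
#include <cstdint>
#include <cstdio>
#include <vector>

// Entry point assumed to be exported by the lowered MLIR module; the actual
// name and signature may differ.
extern "C" void test_mlir(float *out, int64_t n);

// CPU reference for whatever the kernel computes; placeholder logic.
static float expected(int64_t i) { return static_cast<float>(i); }

int main() {
  // 3 x 128 matches the GridXYZ/BlockXYZ reported in the profile below.
  constexpr int64_t n = 3 * 128;
  std::vector<float> out(n, 0.0f);
  test_mlir(out.data(), n);
  for (int64_t i = 0; i < n; ++i) {
    if (out[i] != expected(i)) {
      std::printf("Mismatch at %lld\n", static_cast<long long>(i));
      return 1;
    }
  }
  std::printf("PASS\n");
  return 0;
}
```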
Setup 1
For compiling the test for NVIDIA targets, the following commands were used:
```
mlir-opt test.mlir \
  -gpu-launch-sink-index-computations \
  -gpu-kernel-outlining \
  -gpu-async-region \
  -convert-scf-to-cf \
  -convert-gpu-to-nvvm \
  -convert-math-to-llvm \
  -convert-arith-to-llvm \
  -convert-index-to-llvm \
  -canonicalize \
  -gpu-to-llvm='target="NVPTX: chip="sm_70" "' \
  -canonicalize \
  -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
clang++ test.ll test.cpp -lmlir_cuda_runtime -o test.exe
```
The following profile was generated with nsys.
```
 Time (%)  Total Time (ns)  Num Calls   Avg (ns)   Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  -------------------
     33.2          207,866          2  103,933.0  103,933.0     4,538   203,328    140,565.8  cuMemAlloc_v2
     23.4          146,329          1  146,329.0  146,329.0   146,329   146,329          0.0  cuModuleLoadData
     17.8          111,172          1  111,172.0  111,172.0   111,172   111,172          0.0  cuModuleUnload
     12.0           75,213          2   37,606.5   37,606.5     4,769    70,444     46,439.2  cuMemFree_v2
      6.6           41,398          3   13,799.3   17,192.0     3,466    20,740      9,123.1  cuMemcpyAsync
      3.1           19,197          1   19,197.0   19,197.0    19,197    19,197          0.0  cuLaunchKernel
      2.8           17,343          1   17,343.0   17,343.0    17,343    17,343          0.0  cuStreamCreate
      0.6            4,068          1    4,068.0    4,068.0     4,068     4,068          0.0  cuStreamDestroy_v2
      0.5            3,416          1    3,416.0    3,416.0     3,416     3,416          0.0  cuStreamSynchronize

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)     GridXYZ        BlockXYZ           Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  --------------  --------------  ----------------
    100.0            4,096          1   4,096.0   4,096.0     4,096     4,096          0.0     3    1    1   128    1    1  test_mlir_kernel
```
Setup 2
For compiling the test for AMDGPU targets, the following commands were used:
```
mlir-opt test.mlir \
  -gpu-launch-sink-index-computations \
  -gpu-kernel-outlining \
  -gpu-async-region \
  -convert-scf-to-cf \
  -convert-gpu-to-rocdl \
  -convert-math-to-llvm \
  -convert-arith-to-llvm \
  -convert-index-to-llvm \
  -canonicalize \
  -gpu-to-llvm='target="AMDGPU: chip="gfx90a" "' \
  -canonicalize \
  -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
clang++ test.ll test.cpp -lmlir_rocm_runtime -o test.exe
```
The following profile was generated with rocprof.