This is an archive of the discontinued LLVM Phabricator instance.

[mlir][gpu] Adds a gpu serialization pipeline for offloading GPUDialect Ops to clang compatible annotations.
Abandoned · Public

Authored by fmorac on Apr 30 2023, 3:23 PM.

Details

Summary

For general context see:
https://discourse.llvm.org/t/rfc-extending-mlir-gpu-device-codegen-pipeline/70199/1

What this diff is not:

  • It's not a replacement of the current serialization pipeline.
    • However, this diff provides the infrastructure to reimplement the current pipeline with little effort and to address several of its current shortcomings, most notably the lack of support for linking against bitcode libraries.
    • There are several reasons why this diff doesn't include that re-implementation. The top ones: the patch is already large enough; I handle AMDGPU code generation by linking against the bitcode libraries instead of introducing symbols (see: https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/GPU/Transforms/SerializeToHsaco.cpp#L198-L269); and I don't know if that's something the current code owners of that pipeline want.
  • This patch doesn't introduce a clang build dependency; in fact, it adds no additional build dependencies.

What this diff is:

  • It is an additional pipeline that currently comes with several restrictions but also with many new features.
    • Restrictions: It requires a compatible clang compiler to generate executables. The clang features this patch relies on are currently available only on Linux, so until clang extends its support *this pipeline is restricted to Linux*.
    • Features:
      • Link to device bitcode libraries.
      • Additional AMDGPU features like fast math.
      • Automatic linking to libdevice provided there's a valid CUDA toolkit path.
      • This pipeline is always available as long as the respective target is built (AMDGPU, NVPTX).
      • Enables access to clang's code generation features.

Code walkthrough:

Summary

This diff introduces:

  • --gpu-to-(nvptx|amdgpu): These passes serialize gpu.modules to LLVM bitcode, which then gets further serialized to an offload object format supported by LLVM and compatible with clang.
  • --gpu-name-mangling: It mangles the names of symbols inside gpu modules. This pass might be required because clang unpacks all offload objects and links them together; if two functions share a name, clang will merge the symbols. The mangling scheme is: __G<gpu module name>_S<function name> (see the sketch after this list).
  • --gpu-to-offload: This pass is equivalent to --gpu-to-llvm, except that it introduces clang offload annotations and handles the conversion of LaunchFuncOp differently.
  • Creates the libraries mlir_cudart_runtime & mlir_hiprt_runtime, as clang uses runtime functions instead of driver functions.
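
For illustration, here is a minimal sketch of the mangling scheme used by --gpu-name-mangling (the helper below is hypothetical, not the actual NameMangling.cpp code):

#include <string>

// Hypothetical helper illustrating the __G<gpu module name>_S<function name>
// scheme; the real pass rewrites the symbols and all their uses in place.
static std::string mangleGpuSymbol(const std::string &gpuModuleName,
                                   const std::string &symbolName) {
  return "__G" + gpuModuleName + "_S" + symbolName;
}

// Example: a gpu module and kernel both named "test_mlir_kernel" produce
// "__Gtest_mlir_kernel_Stest_mlir_kernel", the name visible in the nsys
// output further down.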

Key files walkthrough:

  • GpuToDeviceObjectCommon.h: this file does the heavy lifting for --gpu-to-(nvptx|amdgpu) as it handles the serialization pipeline. The classes in this file would be the ones used for re-implementing the current pipeline.
  • GpuToDeviceOffload.cpp: implements the passes --gpu-to-(nvptx|amdgpu).
  • NameMangling.cpp: implements the pass --gpu-name-mangling.
  • CudaRuntimeWrappers.cpp: implements the library mlir_cudart_runtime. Instead of creating a new file, I decided it was better to keep everything in one file and use the macro MLIR_USE_CUDART_RUNNER to handle both libraries. The advantage of this approach is that it ensures developers always update both versions of this library (a rough illustration follows after this list).
  • GPUToLLVMConversion.cpp: handles the pass --gpu-to-offload; the key modifications in this file are the addition of the GPUOffloadBuilder class, the GpuToOffloadConversionPass pass, and the populateGpuToLLVMOffloadConversionPatterns function, as well as an update to ConvertLaunchFuncOpToGpuRuntimeCallPattern::matchAndRewrite.
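
For illustration, a rough sketch of the single-source approach (the function below is simplified and its exact signature is an assumption, not the patch's actual code):

#include <cstdint>

// One wrapper source, two libraries: with MLIR_USE_CUDART_RUNNER the file is
// built against the CUDA runtime API (for the clang offload path), otherwise
// against the CUDA driver API as before.
#ifdef MLIR_USE_CUDART_RUNNER
#include <cuda_runtime.h>
extern "C" void *mgpuMemAlloc(uint64_t sizeBytes, void * /*stream*/) {
  void *ptr = nullptr;
  cudaMalloc(&ptr, sizeBytes); // runtime API
  return ptr;
}
#else
#include <cuda.h>
extern "C" void *mgpuMemAlloc(uint64_t sizeBytes, CUstream /*stream*/) {
  CUdeviceptr ptr = 0;
  cuMemAlloc(&ptr, sizeBytes); // driver API
  return reinterpret_cast<void *>(ptr);
}
#endif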

Real world testing:

This patch was tested on the following 3 platforms:

  1. Frontier at OLCF - ORNL, using AMD MI250x gfx90a
  2. Perlmutter at NERSC - LBNL, using NVIDIA A100.
  3. A local server, using NVIDIA V100.
    • CUDA: 11.8
    • clang version: 17.0.0 (++20230417095441+43ac269bdd00-1~exp1~20230417215605.872)

The test consists of a GPU kernel written in MLIR (test.mlir) and a C++ file with the main and verification functionality (test.cpp).

In order to run the tests I built MLIR from scratch on the above systems and compiled the test using an already existing clang installation. On Perlmutter I also ran an additional test using the LLVM IR file test.ll generated on platform 3, to test a sort of 'cross-compiling'; it ran successfully.

The steps to compile the test for NVIDIA sm_70 targets are:

mlir-opt test.mlir \
  -gpu-launch-sink-index-computations \
  -gpu-kernel-outlining \
  -gpu-async-region \
  -gpu-name-mangling \
  -convert-scf-to-cf \
  -convert-gpu-to-nvvm \
  -convert-math-to-llvm \
  -convert-arith-to-llvm \
  -convert-index-to-llvm \
  -canonicalize \
  -gpu-to-nvptx="chip=sm_70 cuda-path=<cuda toolkit path>" \
  -gpu-to-offload \
  -canonicalize \
  -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
clang++ -fgpu-rdc --offload-new-driver test.ll test.cpp \
		-L${LLVM_PATH}/lib/ -lmlir_cudart_runtime -lcudart \
		-O3 -o test.exe

In all cases the tests completed successfully. I ensured that all of them were indeed calling the appropriate runtime functions by profiling the code with nsys & rocprof. Here is the output from nsys on Perlmutter:

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)       Med (ns)      Min (ns)     Max (ns)    StdDev (ns)           Name         
 --------  ---------------  ---------  -------------  -------------  -----------  -----------  -----------  ----------------------
     99.9      259,394,921          1  259,394,921.0  259,394,921.0  259,394,921  259,394,921          0.0  cudaStreamCreate      
      0.1          137,695          2       68,847.5       68,847.5        5,250      132,445     89,940.4  cudaMalloc            
      0.0          103,289          2       51,644.5       51,644.5        7,655       95,634     62,210.5  cudaFree              
      0.0           60,957          1       60,957.0       60,957.0       60,957       60,957          0.0  cuLibraryLoadData     
      0.0           41,681          3       13,893.7       18,426.0        3,647       19,608      8,893.5  cudaMemcpyAsync       
      0.0           22,644          1       22,644.0       22,644.0       22,644       22,644          0.0  cudaLaunchKernel      
      0.0           11,492          1       11,492.0       11,492.0       11,492       11,492          0.0  cudaStreamDestroy     
      0.0            4,188          1        4,188.0        4,188.0        4,188        4,188          0.0  cudaStreamSynchronize 
      0.0            1,132          1        1,132.0        1,132.0        1,132        1,132          0.0  cuModuleGetLoadingMode

[6/8] Executing 'gpukernsum' stats report

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)     GridXYZ         BlockXYZ                     Name                 
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  --------------  --------------  -------------------------------------
    100.0            4,320          1   4,320.0   4,320.0     4,320     4,320          0.0     3    1    1   128    1    1  __Gtest_mlir_kernel_Stest_mlir_kernel

Test files

Diff Detail

Event Timeline

fmorac created this revision.Apr 30 2023, 3:23 PM
Herald added a reviewer: dcaballe.
Herald added a project: Restricted Project.
fmorac updated this revision to Diff 518365.Apr 30 2023, 3:52 PM

Fixes an error on the tests, caused by not finding the CUDA & ROCm paths.

fmorac edited the summary of this revision. (Show Details)May 1 2023, 11:15 AM
fmorac added reviewers: mehdi_amini, tra, krzysz00.
fmorac edited the summary of this revision. (Show Details)May 1 2023, 11:16 AM
fmorac edited the summary of this revision. (Show Details)May 1 2023, 11:20 AM
fmorac edited the summary of this revision. (Show Details)
fmorac published this revision for review.May 1 2023, 11:22 AM
tra added a subscriber: jhuber6.May 1 2023, 12:21 PM
mlir-opt test.mlir \
...
  -gpu-to-nvptx="chip=sm_70 cuda-path=<cuda toolkit path>" \
  -gpu-to-offload \
  -canonicalize \
  -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
clang++ -fgpu-rdc --offload-new-driver test.ll test.cpp \
		-L${LLVM_PATH}/lib/ -lmlir_cudart_runtime -lcudart \
		-O3 -o test.exe

Can you elaborate on what the clang compilation is expected to do? I'm curious about the interaction between the -fgpu-rdc , the compilation of test.ll and test.cpp in the same invocation.

  • Where does the test.cpp come from?
  • -fgpu-rdc should be enabled by the new driver by default, I think.
  • Is clang supposed to assume that test.ll is the GPU IR? If so, which GPU variant? I think clang may default to a GPU that's different from the GPU you've specified for mlir-opt. Can you post all sub-commands clang++ prints with -### ?

@jhuber6 -- is this combination supposed to work in general?

mlir/lib/Dialect/GPU/CMakeLists.txt
50

Do I understand it correctly that this patch does not add build or run-time dependencies on CUDA or GPU driver on the MLIRGPUTransforms library itself, and that the HIP/CUDA-dependent bits will only be needed by the runtime wrappers and the test executable?

mlir-opt test.mlir \
...
  -gpu-to-nvptx="chip=sm_70 cuda-path=<cuda toolkit path>" \
  -gpu-to-offload \
  -canonicalize \
  -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
clang++ -fgpu-rdc --offload-new-driver test.ll test.cpp \
		-L${LLVM_PATH}/lib/ -lmlir_cudart_runtime -lcudart \
		-O3 -o test.exe

Can you elaborate on what the clang compilation is expected to do? I'm curious about the interaction between the -fgpu-rdc , the compilation of test.ll and test.cpp in the same invocation.

  • Where does the test.cpp come from?

Presumably it's a corresponding host implementation file. Alternatively you could use an empty file compiled with -nostdlib to treat it as a GPU fatbinary with no CPU related code. This is how we build the OpenMP device runtime static library.

  • -fgpu-rdc should be enabled by the new driver by default, I think.

It's not currently; I kept it for consistency with the existing CUDA support. Also, nvcc supports RDC but doesn't enable it by default for performance reasons. If you enable -foffload-lto it should have no performance penalty, at least.

  • Is clang supposed to assume that test.ll is the GPU IR? If so, which GPU variant? I think clang may default to a GPU that's different from the GPU you've specified for mlir-opt. Can you post all sub-commands clang++ prints with -### ?

@jhuber6 -- is this combination supposed to work in general?

I'm assuming that the test.ll is meant to be embedded in test.cpp although it's not listed that way. The way to handle that would be -Xclang -fembed-offload-object=test.ll.

fmorac marked an inline comment as done.May 1 2023, 1:50 PM

Can you elaborate on what the clang compilation is expected to do? I'm curious about the interaction between the -fgpu-rdc , the compilation of test.ll and test.cpp in the same invocation.

Here's the -### output:

"/usr/lib/llvm-17/bin/clang" "-cc1" "-triple" "x86_64-pc-linux-gnu" "-emit-obj" "-disable-free" "-clear-ast-before-backend" "-disable-llvm-verifier" "-discard-value-names" "-main-file-name" "test.ll" "-mrelocation-model" "pic" "-pic-level" "2" "-pic-is-pie" "-mframe-pointer=none" "-fmath-errno" "-ffp-contract=on" "-fno-rounding-math" "-mconstructor-aliases" "-funwind-tables=2" "-target-cpu" "x86-64" "-tune-cpu" "generic" "-debugger-tuning=gdb" "-fcoverage-compilation-dir=/usa/fmorac/mlir_leia" "-resource-dir" "/usr/lib/llvm-17/lib/clang/17" "-O3" "-fdebug-compilation-dir=/usa/fmorac/mlir_leia" "-ferror-limit" "19" "--offload-new-driver" "-fgnuc-version=4.2.1" "-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-faddrsig" "-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o" "/tmp/test-fe1f0c.o" "-x" "ir" "test.ll"
"/usr/lib/llvm-17/bin/clang" "-cc1" "-triple" "x86_64-pc-linux-gnu" "-emit-obj" "-disable-free" "-clear-ast-before-backend" "-disable-llvm-verifier" "-discard-value-names" "-main-file-name" "main.cpp" "-mrelocation-model" "pic" "-pic-level" "2" "-pic-is-pie" "-mframe-pointer=none" "-fmath-errno" "-ffp-contract=on" "-fno-rounding-math" "-mconstructor-aliases" "-funwind-tables=2" "-target-cpu" "x86-64" "-tune-cpu" "generic" "-debugger-tuning=gdb" "-fcoverage-compilation-dir=/usa/fmorac/mlir_leia" "-resource-dir" "/usr/lib/llvm-17/lib/clang/17" "-I/usr/local/cuda/include" "-internal-isystem" "/usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12" "-internal-isystem" "/usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/x86_64-linux-gnu/c++/12" "-internal-isystem" "/usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/backward" "-internal-isystem" "/usr/lib/llvm-17/lib/clang/17/include" "-internal-isystem" "/usr/local/include" "-internal-isystem" "/usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../x86_64-linux-gnu/include" "-internal-externc-isystem" "/usr/include/x86_64-linux-gnu" "-internal-externc-isystem" "/include" "-internal-externc-isystem" "/usr/include" "-O3" "-fdeprecated-macro" "-fdebug-compilation-dir=/usa/fmorac/mlir_leia" "-ferror-limit" "19" "--offload-new-driver" "-fgnuc-version=4.2.1" "-fcxx-exceptions" "-fexceptions" "-fcolor-diagnostics" "-vectorize-loops" "-vectorize-slp" "-faddrsig" "-D__GCC_HAVE_DWARF2_CFI_ASM=1" "-o" "/tmp/main-c4f25e.o" "-x" "c++" "main.cpp"
"/usr/lib/llvm-17/bin/clang-linker-wrapper" "--host-triple=x86_64-pc-linux-gnu" "--linker-path=/usr/bin/ld" "--" "-pie" "-z" "relro" "--hash-style=gnu" "--build-id" "--eh-frame-hdr" "-m" "elf_x86_64" "-dynamic-linker" "/lib64/ld-linux-x86-64.so.2" "-o" "test.exe" "/lib/x86_64-linux-gnu/Scrt1.o" "/lib/x86_64-linux-gnu/crti.o" "/usr/bin/../lib/gcc/x86_64-linux-gnu/12/crtbeginS.o" "-L/opt/llvm/build//lib/" "-L/usr/bin/../lib/gcc/x86_64-linux-gnu/12" "-L/usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../lib64" "-L/lib/x86_64-linux-gnu" "-L/lib/../lib64" "-L/usr/lib/x86_64-linux-gnu" "-L/usr/lib/../lib64" "-L/lib" "-L/usr/lib" "-L/usr/local/cuda/lib64" "/tmp/test-fe1f0c.o" "/tmp/main-c4f25e.o" "-lmlir_cudart_runtime" "-lcudart" "-lstdc++" "-lm" "-lgcc_s" "-lgcc" "-lc" "-lgcc_s" "-lgcc" "/usr/bin/../lib/gcc/x86_64-linux-gnu/12/crtendS.o" "/lib/x86_64-linux-gnu/crtn.o"

Most of the interesting stuff happens when clang invokes clang-linker-wrapper, which unpacks the device image, compiles it, links any device code, adds kernel registration code, and assembles everything together.

  • Where does the test.cpp come from?

I attached both test.cpp and test.mlir at the end of the summary, they should be after the profiler results.

  • -fgpu-rdc should be enabled by the new driver by default, I think.

As @jhuber6 said, it's not enabled by default yet.

  • Is clang supposed to assume that test.ll is the GPU IR? If so, which GPU variant? I think clang may default to a GPU that's different from the GPU you've specified for mlir-opt. Can you post all sub-commands clang++ prints with -### ?

test.ll is the file containing the host IR and an embedded object containing the device bitcode. The embedded object contains all the information needed by clang including the arch.
If you look at:

mlir-opt test.mlir \
...
  -gpu-to-nvptx="chip=sm_70 cuda-path=<cuda toolkit path>" \
...

The arch is there.

I'm assuming that the test.ll is meant to be embedded in test.cpp although it's not listed that way. The way to handle that would be -Xclang -fembed-offload-object=test.ll.

No, the pass uses #include "llvm/Object/OffloadBinary.h" to create a complete file, with the code already embedded.

I'm attaching the files generated by mlir-opt (test_llvm.mlir) and mlir-translate (test.ll):

mlir/lib/Dialect/GPU/CMakeLists.txt
50

You're correct. If NVPTX or AMDGPU is listed in the targets, this pipeline will be available; there are no strict CUDA or ROCm dependencies. This is possible because this pipeline never steps outside LLVM; however, that also means it cannot get down to an executable without clang. Having said that, if CUDA or ROCm are not present, then you need to supply those libraries to clang, as the generated bitcode would only contain declarations.

fmorac edited the summary of this revision. (Show Details)May 1 2023, 1:55 PM

This is quite a large patch, I'm not sure if this is really coupled or could actually be split reasonably? Left a few comments skimming through.

What seems to be really needed to me at the moment is better overall documentation of the GPU compilation flows and how the two end-to-end integrations are set up and compare; do you think you can sketch this?

mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
46

This says what it does, but lacks a bit of context, as in the "why would someone want to do this?"

53

Lowerings are in general, by convention, implemented in the Conversion directory; why did you go with Transforms here?

80

I'm a bit confused as of why GPUDialect is dependent here, isn't it the actual input of the pass? The dependent should be whatever is produced by the pass that isn't already in the input.

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
78

LLVM has the StringLiteral class for this, I think.

545

This is expensive, can you cache this by using a SymbolTable initialized once in the run method?

713

This is a recursive IR walk, seems a bit overkill, isn't it? Can we just walk the immediately nested operations?

734

Nit: You haven't flushed the stream here.

766

LLVM convention for the comments is to use /*argname=*/ where argname matches the actual parameters of the functions.

1400

(same nit for the comment format)

mlir/lib/Dialect/GPU/Transforms/GpuToDeviceObjectCommon.h
46

Can we move everything that doesn't strictly depend on the derived class into a non-templated base class? (thinking about code size here...)

264

I don't think this brace is correctly placed (assuming you intended to use this for flushing the ostream with RAII)

mlir/lib/Dialect/GPU/Transforms/NameMangling.cpp
107

Aren't there utilities to do that already? Seems like reimplementing some logic that should be fairly generic?

114

Can this be a pass that runs on GPUModuleOp instead?

tra added a comment.May 1 2023, 3:45 PM

Most of the interesting stuff happens when clang invokes clang-linker-wrapper, which unpacks the device image, compiles it, links any device code, adds kernel registration code, and assembles everything together.

I would not expect it to work with an IR as an input as that would require reimplementing offload binary embedding just so.

I see that that's exactly what the patch does. OK. Now it makes sense.

@jhuber6 should we consolidate offload-related glue generation into one place, so we don't have everybody rolling a DIY implementation? Object/OffloadBinary.h seems to be providing mostly APIs for parsing offload binaries.

Most of the interesting stuff happens when clang invokes clang-linker-wrapper, which unpacks the device image, compiles it, links any device code, adds kernel registration code, and assembles everything together.

I would not expect it to work with an IR as an input as that would require reimplementing offload binary embedding just so.

It should; it handles LLVM IR from the host or the GPU through LTO. The host-side support obviously requires a linker that accepts it, however.

I see that that's exactly what the patch does. OK. Now it makes sense.

@jhuber6 should we consolidate offload-related glue generation into one place, so we don't have everybody rolling a DIY implementation? Object/OffloadBinary.h seems to be providing mostly APIs for parsing offload binaries.

All of these tools are minimally bound to clang so we could move them to LLVM. The OffloadBinary.h file provides interfaces for creating and extracting them.

fmorac marked 10 inline comments as done.May 1 2023, 6:20 PM

I'll answer comments first, and leave the large explanation at the end.

This is quite a large patch, I'm not sure if this is really coupled or could be split reasonnably actually? Left a few comments skimming through.

I thought of maybe splitting it into 3: gpu-name-mangling, gpu-to-(nvptx|amdgpu), and gpu-to-offload. However, there's a functional dependency between gpu-to-(nvptx|amdgpu) and gpu-to-offload. I added gpu-name-mangling here because it was a relatively small change and it makes sense for it to be here.

What seems to be really needed at the moment to me is a better overall documentation of the GPU compilation flows and how the two end-to-end integration are setup and compare, do you think you can sketch this?

I'll address that at the end of this comment.

@jhuber6 should we consolidate offload-related glue generation into one place, so we don't have everybody rolling a DIY implementation? Object/OffloadBinary.h seems to be providing mostly APIs for parsing offload binaries.

That could knock a couple hundred lines off this patch; however, what I'm thinking now is that the embedding would then have to be performed by mlir-translate, as it would require both the device bitcode and the host LLVM IR. I can't find any drawbacks to this approach, but I don't know if we would want that.

Full break down of the compilation process

To make things easier, I'll attach files to show the changes after each step. Input files: test.mlir

and test.cpp .

  1. We start with test.mlir; this file is quite high-level, so we lower it a bit to test_prenvptx.mlir. This step doesn't use anything from this patch:
    • CMD:
mlir-opt test.mlir \
	-gpu-launch-sink-index-computations \
	-gpu-kernel-outlining \
	-gpu-async-region \
	-convert-scf-to-cf \
	-convert-gpu-to-nvvm \
	-convert-math-to-llvm \
	-convert-arith-to-llvm \
	-convert-index-to-llvm \
	-canonicalize \
	-o test_prenvptx.mlir
  2. We take test_prenvptx.mlir and apply mlir-opt -gpu-to-nvptx="chip=sm_70" to obtain test_postnvptx.mlir. This step introduces binary annotations on the gpu.module: one annotation with the LLVM bitcode corresponding to the gpu.module and one with the offloading kind, either hip or cuda. This pass is similar to what gpu-to-cubin does, but instead of generating a cubin it generates bitcode.
  3. We take test_postnvptx.mlir and apply mlir-opt -gpu-to-offload -canonicalize to obtain test_llvm.mlir. This pass performs a similar function to gpu-to-llvm, as it lowers down to LLVM. It takes all binary gpu.module images and creates one big binary blob with all images at the start of the module; clang will use this blob to compile down to a fatbinary. It also inserts an offload annotation and a function stub (similar to kernel stubs in CUDA) for each kernel that is called by a LaunchFuncOp; these annotations are the key to making clang aware that it needs to do kernel registration.
  4. We translate test_llvm.mlir via mlir-translate to LLVM IR to obtain test.ll.

Now we call clang with:

clang++ -fgpu-rdc --offload-new-driver test.ll test.cpp \
		-L${LLVM_PATH}/lib/ -lmlir_cudart_runtime -lcudart \
		-O3 -o test.exe

Simplifying the process a bit, what clang is doing is:

clang <...> "-o" "test.s" "-x" "ir" "test.ll"
clang <...> "-o" "test.o" "test.s"
clang <...> "-o" "main.o" "-x" "c++" "main.cpp"
clang-linker-wrapper <...> "-o" "test.exe" <...> "test.o" "main.o" <...>

The key step for us happens in clang-linker-wrapper; here clang will unpack the binary blob with our device bitcode and will start:

  1. Linking bitcode.
  2. Performing further optimization, including LTO.
  3. Generating PTX.
  4. Calling ptxas.
  5. Producing the cubin.
  6. Calling nvlink.
  7. Calling fatbinary.
  8. Adding kernel registration code, as this method uses the CUDA runtime.
  9. Linking everything up to get test.exe.

One important thing to mention is that the binary blob contains all the info required by clang, including the triple, arch, and offloading kind, and that the blob is bitcode.
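
For reference, a rough sketch of how such a blob can be produced with llvm/Object/OffloadBinary.h (this is not the patch's code; the API is as I recall it around LLVM 17, so treat the field names and string keys as assumptions):

#include "llvm/ADT/SmallString.h"
#include "llvm/Object/OffloadBinary.h"
#include "llvm/Support/MemoryBuffer.h"

// Wrap device bitcode into an offload image that clang's new offload driver
// can unpack; `deviceBitcode` is assumed to hold the gpu.module's bitcode.
llvm::SmallString<0>
makeOffloadImage(std::unique_ptr<llvm::MemoryBuffer> deviceBitcode) {
  using namespace llvm::object;
  OffloadBinary::OffloadingImage image;
  image.TheImageKind = IMG_Bitcode; // the blob is bitcode, not a cubin
  image.TheOffloadKind = OFK_Cuda;  // or OFK_HIP for the AMDGPU path
  image.Flags = 0;
  image.StringData["triple"] = "nvptx64-nvidia-cuda";
  image.StringData["arch"] = "sm_70";
  image.Image = std::move(deviceBitcode);
  // The serialized bytes are what gets embedded in the host module for
  // clang-linker-wrapper to find.
  return OffloadBinary::write(image);
}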

Comparison with the current pipeline

The current pipeline has no optimization for the CUDA backend (there's no IR optimization and no PTX optimization), and there's no linking in general, so no libdevice. On the AMD side, it hard-codes several globals, like the ABI version being set to 400 (500 is already out there), and other variables, which adds constraints to the generated code; there's no fast math and no general linking. Many of these shortcomings can be addressed by re-implementing the current pipeline using mlir/lib/Dialect/GPU/Transforms/GpuToDeviceObjectCommon.h.

This new pipeline provides extensibility, more optimization opportunities, LTO, device linking, AMDGPU fast math, and it removes the need for libcuda or the CUDA toolkit at build time or at runtime, as it moves all the burden of generating actual device code to clang. The drawbacks of this pipeline are: no JIT; it requires a compatible clang, as clang must be able to understand the IR produced by MLIR; and until clang extends the new driver to other platforms, generating executables is supported only on Linux.

mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
46

I'll add that to the docs, but just so that we're all on the same page, I'll explain with an example why it is needed.

gpu.module @module_1 {
    func.func @bar() {
      return
    }
}
gpu.module @module_2 {
    func.func @bar() {
      return
    }
}

In this case clang will link these into a single bitcode file containing all the offloading code. Thus one of module_1::bar or module_2::bar is going to get removed, which is a problem if they are different.

53

The reason I added these here is that they provide functionality similar to gpu-to-cubin, and those passes are implemented in the Transforms folder. I have no issue with moving them to Conversions.

80

To be honest I was also confused, but several passes in this file listed it as a dependency, so I didn't want to diverge:

https://github.com/llvm/llvm-project/blob/main/mlir/include/mlir/Dialect/GPU/Transforms/Passes.td#L17

However, I must have misinterpreted the reason for those. I'll remove that dependency.

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
78

You're right, I'll change them to use StringLiteral.

545

This function is called only once per LaunchFuncOp, in line 697 of this file. I didn't worry too much because the existing conversion pattern (pre-patch) already does a similar call:
https://github.com/llvm/llvm-project/blob/main/mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp#L800
So no overhead was added beyond what already existed pre-patch. But I guess I could try to fix that for both cases.
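
For reference, a minimal sketch of the caching idea (a hypothetical pass skeleton, not the actual patch code):

#include "mlir/IR/BuiltinOps.h"
#include "mlir/IR/SymbolTable.h"
#include "mlir/Pass/Pass.h"

// Build the symbol table once per module in the run method and let the
// conversion patterns reuse it, instead of re-walking the module for every
// LaunchFuncOp.
struct GpuToOffloadConversionPassSketch
    : mlir::PassWrapper<GpuToOffloadConversionPassSketch,
                        mlir::OperationPass<mlir::ModuleOp>> {
  void runOnOperation() override {
    mlir::SymbolTable symbolTable(getOperation());
    // Patterns would take `symbolTable` by reference in their constructor and
    // call symbolTable.lookup(name) instead of scanning the module each time.
    (void)symbolTable;
  }
};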

713

We could, but we would stop handling cases like:

module {
  my.module {
    gpu.module {
    }
  }
}

We could say it's a precondition for this pass and assume that the above never occurs.

734

Will add that.

766

Will change it.

mlir/lib/Dialect/GPU/Transforms/GpuToDeviceObjectCommon.h
46

This is the class that would help re-implement the existing gpu serialization pipeline, as it handles translating down to bitcode, linking, optimization, and creating a binary object, so a child class only needs to implement certain methods. The reason I used CRTP is that inside the class I call pass functions (getOperation, signalPassFailure), so it made sense to me to do it this way; also, in the run method I use getDerived().<func> to handle the inheritance bit. However, I guess I can lower the size of the class by moving part of the code to a cpp.
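
For reference, a stripped-down sketch of the CRTP shape being described (names are illustrative, not the actual GpuToDeviceObjectCommon.h contents):

// The base class drives the common serialization steps and calls into the
// derived class for the target-specific pieces via getDerived().
template <typename Derived>
class SerializeToObjectBase {
public:
  void run() {
    Derived &derived = getDerived();
    derived.translateToBitcode();   // gpu.module -> llvm::Module
    derived.linkBitcodeLibraries(); // e.g. libdevice or ROCm device libs
    derived.optimize();             // run the LLVM optimization pipeline
    derived.createBinaryObject();   // wrap the result as an offload object
  }

protected:
  Derived &getDerived() { return static_cast<Derived &>(*this); }
};

class SerializeToNVPTXSketch
    : public SerializeToObjectBase<SerializeToNVPTXSketch> {
public:
  void translateToBitcode() { /* NVVM-specific translation */ }
  void linkBitcodeLibraries() { /* link libdevice when a CUDA path is set */ }
  void optimize() { /* optimization pipeline */ }
  void createBinaryObject() { /* emit the offload object */ }
};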

264

I was intending to also trash the buffer binaryData asap, but forgot to flush.

mlir/lib/Dialect/GPU/Transforms/NameMangling.cpp
107

The problem is that as far as I know the current implementation handles renaming only flat symbols:
https://mlir.llvm.org/doxygen/classmlir_1_1SymbolTable.html#a256c12869c03f20d4d1a122ec02eb417

And in this case I'm renaming <symbol> in <gpu_module>::<symbol>. But I could be missing something.

114

No, as this pass also needs to update LaunchFuncOps with the new symbol names, and those are located outside the GPUModule.

Full break down of the compilation process

To make things easier, I'll attach files to show the changes after each step.

Fantastic, thanks a lot for this!

  4. We translate test_llvm.mlir via mlir-translate to LLVM IR to obtain test.ll.

...

The key step for us happens in clang-linker-wrapper; here clang will unpack the binary blob with our device bitcode and will start:

  1. Linking bitcode.
  2. Performing further optimization, including LTO.
  3. Generating PTX.
  4. Calling ptxas.
  5. Producing the cubin.
  6. Calling nvlink.
  7. Calling fatbinary.
  8. Adding kernel registration code, as this method uses the CUDA runtime.
  9. Linking everything up to get test.exe.

One important thing to mention is that the binary blob contains all the info required by clang, including the triple, arch, and offloading kind, and that the blob is bitcode.

That's nice. I wonder about the logic of clang-linker-wrapper; it seems to me that there is very little coupling to the object files and that most of this could operate directly on LLVM IR just the same.
With a good layering and API, we should be able to use this exact flow in a JIT environment as well: anything I am missing?

fmorac marked 9 inline comments as done.May 1 2023, 6:50 PM

That's nice, I wonder about the logic of clang-linker-wrapper, it seems to me that there is very little coupling to the object files and that most of this can operate directly on LLVM IR just the same.

Yes, for the most part; here is the implementation of that tool:
https://github.com/llvm/llvm-project/tree/main/clang/tools/clang-linker-wrapper
If I understood the code correctly, the tool calls clang under the hood, and clang is the one that calls all of the device tools. I would think all these utilities could be pure LLVM, but that's clang's code, so it's up to them.

With a good layering and API, we should be able to use this exact flow in a JIT environment as well: anything I am missing?

For JIT, we would have to require a working CUDA toolkit installation. But yes, it would be possible to have a JIT; however, I wouldn't add device linking (nvlink) to it, as we would probably end up with mlirc, an MLIR-to-executable compiler.

With a good layering and API, we should be able to use this exact flow in a JIT environment as well: anything I am missing?

For JIT, we would have to require a working CUDA toolkit installation. But yes, it would be possible to have a JIT; however, I wouldn't add device linking (nvlink) to it, as we would probably end up with mlirc, an MLIR-to-executable compiler.

Right, but in a JIT environment you always need to have a working CUDA installation anyway, and with https://reviews.llvm.org/D145527 the ptx->cubin step can happen directly using the nvptxcompile library.

mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
53

That's fine to keep this aligned with gpu-to-cubin; we should probably be careful about the naming and extend the documentation to clarify what this all does (that is: it actually invokes the entire LLVM backend, right?)

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
713

Right, if someone has a my.module, they should schedule the pass there. We can make this an "Operation pass" so that it can run on any kind of module.

mlir/lib/Dialect/GPU/Transforms/NameMangling.cpp
114

OK, so let's just make this runnable on any SymbolTable operation (like the other case) and replace the full IR walk with building a symbol table and using it directly?

fmorac marked 2 inline comments as done.EditedMay 2 2023, 5:05 AM

And now it's time to potentially self-sabotage this diff.

LLVM Offloading

I really like @tra's idea of further consolidating LLVM offload stuff under llvm, which I think is something that the clang & OpenMP team have been doing for a while.

What I'm thinking is:
Moving certain bits from clang & clang-linker-wrapper down to llvm, namely kernel registration from clang-linker-wrapper, IR embedding from clang, and the creation of offload annotations (namely generating offloading entries).
For the latter I'm thinking something along the lines of:

SomeKindOfErrorCode 
llvm::offloading::embedOffloadEntriesToModule(llvm::Module &module, SmallVector<std::pair<StringRef, OffloadBinary::OffloadingImage>> &&offloadingEntries);

That function would embed the objects by concatenating them, and generate the offloading entries from the StringRef.

What do you think @jhuber6 & @jdoerfert ?

MLIR

What would this new route mean for MLIR?

  • Getting rid of gpu-to-(nvptx|cubin|amdgpu|hsaco) passes, that functionality would be moved and consolidated under translation.
  • Requiring the CUDA toolkit for JIT, which @mehdi_amini already said would be OK.
  • Moving away from the cuda driver for the runner to a single library based on the cuda runtime.
  • Moving away from loading device code modules, and switching to kernel registration.

Final comments:
If all the above parties agree, we could go with that option; it would mean either trashing this diff, or committing it (after addressing all the comments) and switching over once the above functionality is ready.
In other circumstances I'd tell you: give me one week and I can do it. But I'm chasing a paper deadline, so I cannot commit (pun intended) to the above proposal until after the 19th of May; that's why I'd prefer to commit this diff and then make the switch (provided that we agree on the above proposal).

mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
53

Yeah, to be honest there's not much documentation on this. If you're referring to the MLIR LLVM backend, then yes; if you're referring to the LLVM codegen backend, then no, as gpu-to-nvptx stops at bitcode, while gpu-to-cubin goes down to a cubin.

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
713

That would work.

mlir/lib/Dialect/GPU/Transforms/NameMangling.cpp
114

If I understood you correctly, yes, we could switch this to run on SymbolTable Ops, and collect GPUModule symbols defined on that particular symbol table.

And now it's time to potentially self-sabotage this diff.

LLVM Offloading

I really like @tra's idea of further consolidating LLVM offload stuff under llvm, which I think is something that the clang & OpenMP team has been doing for a while.

What I'm thinking is:
Moving certain bits from clang & clang-linker-wrapper down to llvm, namely kernel registration from clang-linker-wrapper, IR embedding from clang, and the creation of offload annotations (namely generating offloading entries).
For the latter I'm thinking something along the lines of:

SomeKindOfErrorCode 
llvm::offloading::embedOffloadEntriesToModule(llvm::Module &module, SmallVector<std::pair<StringRef, OffloadBinary::OffloadingImage>> &&offloadingEntries);

That function would embed the objects by concatenating them, and generate the offloading entries from the StringRef.

What do you think @jhuber6 & @jdoerfert ?

I'm fully in favor of moving more of this offload handling into LLVM. It was mostly put in clang because that was its only consumer at the time of writing. The only clang-related dependency in the clang-offload-packager and the clang-linker-wrapper is the version string, which could easily be removed. We use clang internally, but could allow using different methods if needed.

There is some work I haven't yet gotten around to doing for the offloading entries, however. Currently, they rely on a C-identifier section name to get the linker to emit symbols at the beginning and end of the section. This currently only works on Linux, with MacOS and Windows having slightly different methods for iterating through a section. If we want something generic we'll need to either add special handling when we emit the sections based on the triple, or just shove that logic into the linker wrapper. The linker wrapper already includes a lot of magic, and the more I put in it the closer it gets to an actual linker, but it would be fairly trivial to just search the linked image for the section name and emit an array independent of the target OS.
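
For context, a rough sketch of the ELF mechanism being described (the section and struct names below approximate what clang currently emits for CUDA with the new driver and are not guaranteed; other platforms need a different scheme):

#include <cstddef>
#include <cstdint>

// The offloading entry layout copied from OpenMP.
struct __tgt_offload_entry {
  void *addr;    // address of the kernel stub or global
  char *name;    // symbol name to register
  size_t size;   // 0 for functions
  int32_t flags;
  int32_t reserved;
};

// Because the section has a C-identifier name, the ELF linker defines
// __start_<section> / __stop_<section> symbols bracketing all entries.
extern __tgt_offload_entry __start_cuda_offloading_entries[];
extern __tgt_offload_entry __stop_cuda_offloading_entries[];

static void registerAllEntries() {
  for (__tgt_offload_entry *entry = __start_cuda_offloading_entries;
       entry != __stop_cuda_offloading_entries; ++entry) {
    // register entry->addr / entry->name with the CUDA runtime here
  }
}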

Furthermore, the actual structure of the offloading entry was copied from OpenMP mostly for convenience. It doesn't have enough fields to actually describe some more esoteric CUDA / HIP constructs like textures so I just ignored those for now. That should be changed as well.

I really like @tra's idea of further consolidating LLVM offload stuff under llvm, which I think is something that the clang & OpenMP team have been doing for a while.
...

What would this new route mean for MLIR?

  • Getting rid of gpu-to-(nvptx|cubin|amdgpu|hsaco) passes, that functionality would be moved and consolidated under translation.

Just to clarify: it means that we'll get JIT and AOT aligned on the same path in MLIR right? If so then that looks great :)

I'm not sure what the input to "translate" will look like exactly, can you clarify this a bit?

fmorac marked 2 inline comments as done.EditedMay 2 2023, 11:39 AM

Just to clarify: it means that we'll get JIT and AOT aligned on the same path in MLIR right? If so then that looks great :)

Yes, we should have AOT and JIT for devices as long as there's a CUDA toolkit or ROCm installation. In reality we just need device bitcode libraries and a mechanism for compiling device assembly down to an object.

I'm not sure what the input to "translate" will look like exactly, can you clarify this a bit?

What I'm thinking (I'm not 100% sure it's possible, but I don't see a reason why it wouldn't be) is that if we add some logic to the LLVM translation interface, it should be able to handle:

module attributes {gpu.container_module} {
  llvm.func @bar() {
   ...
  }
  gpu.module {
    llvm.func kernel() {
       ...
    }
  }
}

Thus, we would be able to remove all interactions with LLVM IR from the passes. The obvious benefit here is that we would be able to use the pure LLVM IR API to embed the device object or do other manipulations, making things easier to implement in this case, provided that we have a unified LLVM offload backend.

What I'm realizing just now is that I'm not sure the translation interface is already capable of handling options for this use case, e.g. to specify the arch and such; however, for now those could use gpu.module attributes.

In summary:

  1. We would remove all the gpu-to-(cubin|...) passes.
  2. Add logic to the LLVM translation interface so that it is able to handle generating LLVM offload code (the intention of this patch), or generating complete executables by using CUDA or HIP tools.

What I'm thinking (I'm not 100% sure it's possible, but I don't see a reason why it wouldn't be) is that if we add some logic to the LLVM translation interface, it should be able to handle:

module attributes {gpu.container_module} {
  llvm.func @bar() {
   ...
  }
  gpu.module {
    llvm.func kernel() {
       ...
    }
  }
}

Thus, we would be able to remove all interactions with LLVM IR from the passes. The obvious benefit here is that we would be able to use the pure LLVM IR API to embed the device object or do other manipulations, making things easier to implement in this case, provided that we have a unified LLVM offload backend.

What I'm realizing just now is that I'm not sure the translation interface is already capable of handling options for this use case, e.g. to specify the arch and such; however, for now those could be gpu.module attributes.

I think that this may be close to how we handle the OpenMP translation right now, but @ftynse should be able to confirm.

Just to check, with this new pipeline, is there a way to carry your own copy of the relevant clang/LLVM IR offloading support in the same binary/library? That is, suppose I know that /opt/rocm/llvm/bin/* is too old (ex. the intrinsics MLIR generates have attributes that said system LLVM can't understand), and I want to just link in the LLVM that lives beside whichever MLIR commit I'm targeting. Can I do that?

Overall, I'm liking this refactor.

(and part of why I had the global constants in SerializeToHsaco is that we needed that pass to work - so long as device libraries weren't actually needed - in environments where ROCm either wasn't installed or was in some weird path we didn't know about. I can perhaps go poke the people who make our PyTorch wheels to try and find out more about what exactly is going on there, but still, I hope that helps with context.)

mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
83

On both these passes, would it be reasonable to add an llvm-args or clang-args option or the like, so I can pass in whatever weird flags (for instance, -global-isel) I feel like?

fmorac added a comment.May 4 2023, 8:52 AM

Just to check, with this new pipeline, is there a way to carry your own copy of the relevant clang/LLVM IR offloading support in the same binary/library? That is, suppose I know that /opt/rocm/llvm/bin/* is too old (ex. the intrinsics MLIR generates have attributes that said system LLVM can't understand), and I want to just link in the LLVM that lives beside whichever MLIR commit I'm targeting. Can I do that?

If I understood your question correctly: yes and no. This pipeline doesn't carry any clang dependencies; the user is the one responsible for supplying a valid clang compiler.

For example, ROCm is used only to link against <ROCm path>/amdgcn/bitcode libraries, and the user is responsible for supplying the clang compiler they want to use to compile the offload code.
You can even use a different ROCm version by supplying rocm-path to the pass, and if you supply no ROCm path, then no linking is performed and it's up to the user to supply the appropriate bitcode libraries when calling clang.

ftynse added a comment.May 5 2023, 5:38 AM

I think that this may be close to how we handle the OpenMP translation right now, but @ftynse should be able to confirm.

For OpenMP, we are storing an OpenMPIRBuilder instance inside ModuleTranslation. Interface implementations for specific operations can use that builder. It should be possible to follow the same approach for GPU. The contract for the interface method is rather simple: it needs to leave the llvm::IRBuilder in a valid state for further insertion, and it needs to update moduleTranslation so that it has mappings between any values and symbols defined by this operation and their LLVM IR counterparts. The implementation itself can be arbitrarily complex. AFAICS, the main thing to take care of would be preserving the "launch" calls from host code to device code through some sort of mapping stored in moduleTranslation.
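
For reference, a rough sketch of what a GPU translation interface could look like, following the existing LLVMTranslationDialectInterface hook (this is a sketch of the idea, not a concrete design):

#include "mlir/Target/LLVMIR/LLVMTranslationInterface.h"
#include "mlir/Target/LLVMIR/ModuleTranslation.h"

// Hypothetical interface implementation: when translation reaches a
// gpu.module, serialize it to device bitcode, wrap it as an offload image,
// embed it into the host llvm::Module, and record the kernel symbols so the
// launch lowerings can refer to them later.
class GPUOffloadTranslationSketch
    : public mlir::LLVMTranslationDialectInterface {
public:
  using LLVMTranslationDialectInterface::LLVMTranslationDialectInterface;

  mlir::LogicalResult
  convertOperation(mlir::Operation *op, llvm::IRBuilderBase &builder,
                   mlir::LLVM::ModuleTranslation &moduleTranslation) const final {
    // 1. Translate the gpu.module body to an llvm::Module.
    // 2. Link device bitcode libraries, optimize, wrap into an offload image.
    // 3. Embed the image as a global in the host module and keep a mapping
    //    in moduleTranslation for the corresponding launch calls.
    return mlir::success();
  }
};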

fmorac added a comment.May 8 2023, 7:03 AM

I think that this may be close to how we handle the OpenMP translation right now, but @ftynse should be able to confirm.

.... The implementation itself can be arbitrarily complex. AFAICS, the main thing to take care of would be preserving the "launch" calls from host code to device code through some sort of mapping stored in moduleTranslation.

After digging around in how OpenMP handles translation, and given @ftynse's comment, I think all the serialization pipelines should be moved to Translation; they make more sense there. I can take care of moving them; however, I wouldn't be able to do it until the end of May. Now my question is: what should we do with this diff? I could fix all the comments (it would take me less than a day) and push it, or we can abandon it, wait, and then move everything to translation. Thoughts @mehdi_amini, @tra?

mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
83

This approach doesn't call clang directly. You can inject those flags when you call clang.

krzysz00 added inline comments.May 8 2023, 9:19 AM
mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
83

Wait, hold on, these new serialize passes handle the LLVM IR -> binary translation, so where in this diff does the relevant clang driver get called?

fmorac marked an inline comment as done.May 8 2023, 9:51 AM
fmorac added inline comments.
mlir/include/mlir/Dialect/GPU/Transforms/Passes.td
83

No, this is how you would do it:

mlir-opt test.mlir \
  -gpu-launch-sink-index-computations \
  -gpu-kernel-outlining \
  -gpu-async-region \
  -gpu-name-mangling \
  -convert-scf-to-cf \
  -convert-gpu-to-nvvm \
  -convert-math-to-llvm \
  -convert-arith-to-llvm \
  -convert-index-to-llvm \
  -canonicalize \
  -gpu-to-nvptx="chip=sm_70 cuda-path=<cuda toolkit path>" \
  -gpu-to-offload \
  -canonicalize \
  -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
clang++ -fgpu-rdc --offload-new-driver test.ll test.cpp \
		-L${LLVM_PATH}/lib/ -lmlir_cudart_runtime -lcudart \
		-O3 -o test.exe

Calling clang to get the final executable is the responsibility of the user.

MLIR

What would this new route mean for MLIR?

  • Getting rid of gpu-to-(nvptx|cubin|amdgpu|hsaco) passes, that functionality would be moved and consolidated under translation.
  • Requiring the CUDA toolkit for JIT, which @mehdi_amini already said would be OK.
  • Moving away from the cuda driver for the runner to a single library based on the cuda runtime.
  • Moving away from loading device code modules, and switching to kernel registration.

Final comments:
If all the above parties agree, we could go with that option; it would mean either trashing this diff, or committing it (after addressing all the comments) and switching over once the above functionality is ready.
In other circumstances I'd tell you: give me one week and I can do it. But I'm chasing a paper deadline, so I cannot commit (pun intended) to the above proposal until after the 19th of May; that's why I'd prefer to commit this diff and then make the switch (provided that we agree on the above proposal).

All of this would be a drastic change to the GPU compilation approach in MLIR, and it needs to be discussed on Discourse. Most users wouldn't have seen this redesign proposal.

If all the above parties agree, we could go with that option; it would mean either trashing this diff, or committing it (after addressing all the comments) and switching over once the above functionality is ready.

I think @mehdi_amini commented on this upthread. This patch has too many things to be considered a single MLIR revision, and it can't be committed in this form, irrespective of the approach to take.

The current pipeline has no optimization for the CUDA backend (there's no IR optimization and no PTX optimization), and there's no linking in general, so no libdevice

But it's simple to address these, and the lack of them alone can't be the only motivation to change the architecture here. One can add a transformer on the LLVM module specifying the optimization level in the serialize-to-cubin pass before translating to PTX. The linking to libdevice can also be performed there. See https://discourse.llvm.org/t/nvptx-codegen-for-llvm-sin-and-friends/58170/16 (@csigg provided the pointer there; it is the approach used by XLA to link in libdevice). Supporting these isn't more than 20 lines of code each.
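
For example, a minimal sketch of such a transformer over the serializer's llvm::Module (standard new pass manager boilerplate; not code from the existing pass):

#include "llvm/IR/Module.h"
#include "llvm/Passes/PassBuilder.h"

// Run a default O3 pipeline on the module produced by the serializer before
// it is translated to PTX; linking libdevice would happen just before this.
void optimizeModule(llvm::Module &module) {
  llvm::LoopAnalysisManager lam;
  llvm::FunctionAnalysisManager fam;
  llvm::CGSCCAnalysisManager cgam;
  llvm::ModuleAnalysisManager mam;

  llvm::PassBuilder pb;
  pb.registerModuleAnalyses(mam);
  pb.registerCGSCCAnalyses(cgam);
  pb.registerFunctionAnalyses(fam);
  pb.registerLoopAnalyses(lam);
  pb.crossRegisterProxies(lam, fam, cgam, mam);

  llvm::ModulePassManager mpm =
      pb.buildPerModuleDefaultPipeline(llvm::OptimizationLevel::O3);
  mpm.run(module, mam);
}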

fmorac abandoned this revision.Jun 13 2023, 5:33 AM
fmorac marked an inline comment as done.