Brief
The intent of this diff is to move the serialization passes `gpu-to-cubin` & `gpu-to-hsaco` to the translation step, while also introducing all the infrastructure needed for adding further serialization pipelines.
Why?
From a conceptual point of view, serialization involves translating the ops inside a GPU module into a serialized string; as such, it shouldn't happen in a pass but rather during translation. From an implementation point of view, it's easier to perform and control serialization when both the host and device LLVM modules are available; this is not possible during a pass, but it is possible during translation.
Overview
The biggest changes introduced by this patch are:
- Introducing the `TranslationTarget` attribute and companion C++ interfaces defined in GPUTranslationTargets.h. This attribute conveys the serialization options to the translation stage, and it must be present as an attribute on the `gpu.module` for translation to be performed (a hedged lookup sketch appears after this list). Format:
```
#gpu.target<PIPELINE: triple = TARGETTRIPLE, chip = TARGETCHIP,
            features = TARGETFEATURES, toolkit = TOOLKITPATH,
            link = [LIST OF BITCODE FILES TO LINK], opts = {EXTRA OPTS}>

; AMDGPU example using default chip = gfx600.
#gpu.target<AMDGPU: toolkit = "/opt/rocm/5.4.3", link = ["mylib.bc"], opts = {fast, ftz}>

; NVPTX example with default options, chip = sm_35, triple = nvptx64-nvidia-cuda.
#gpu.target<NVPTX>
```
Example:
```
gpu.module @kernel_module attributes {
    rocdl.hsaco = #gpu.target<AMDGPU : chip = "gfx90a">,
    target = #gpu.target<NVPTX>} {
  llvm.func @kernel(%arg0: i32, %arg1: !llvm.ptr<f32>, %arg2: !llvm.ptr<f32>,
                    %arg3: i64, %arg4: i64, %arg5: i64) attributes {gpu.kernel} {
    llvm.return
  }
}
```
- Modifying the `gpu-to-llvm` pass so that it no longer removes the `gpu.module`s, while also adding a stub for the serialized string, to be filled in during translation. Additionally, this pass can be used to set or add a target to the `gpu.module`s. Example:
```
; Selects the `rocdl.hsaco` target; this target must be present in the
; attributes of every `gpu.module`, i.e. `gpu.module ... attributes {rocdl.hsaco = ...}`.
--gpu-to-llvm='target=rocdl.hsaco'

; Sets the GPU target to a specific target. The format used for specifying the
; target is the format of the body of the `TranslationTarget` attribute; the
; quotation marks have to be managed carefully for the attribute to parse successfully.
--gpu-to-llvm='target="NVPTX: chip = "sm_90", opts = {ftz}"'
```
- The addition of the `ModuleToObject` class. This class controls the behavior of all serialization pipelines. It allows linking against any specified bitcode files and, if toolkit paths are detected or specified, linking against the device libraries found in those toolkits. A hedged sketch of its possible shape follows below.
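To make the first bullet concrete, a translation stage could look up the target attribute along the following lines. This is only a sketch: `TranslationTargetAttr` stands in for the class generated from TranslationTargetAttr.td, and `lookupTarget` is an invented helper, not a declaration from GPUTranslationTargets.h.

```
// Hypothetical sketch; `TranslationTargetAttr` stands in for the attribute
// class generated from TranslationTargetAttr.td, and `lookupTarget` is an
// invented helper, not a declaration from GPUTranslationTargets.h.
#include "mlir/Dialect/GPU/IR/GPUDialect.h"
#include "llvm/ADT/StringRef.h"

// Fetch the serialization options from a gpu.module; `targetName` is the
// attribute name selected with --gpu-to-llvm='target=...' (the examples in
// this summary use "target" and "rocdl.hsaco").
static mlir::gpu::TranslationTargetAttr
lookupTarget(mlir::gpu::GPUModuleOp module,
             llvm::StringRef targetName = "target") {
  return module->getAttrOfType<mlir::gpu::TranslationTargetAttr>(targetName);
}
```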
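Similarly, a minimal sketch of the shape `ModuleToObject` could take; every member name here is an illustrative assumption rather than the actual declarations from ModuleToObject.h.

```
// Hypothetical sketch of ModuleToObject; member names are assumptions.
#include "mlir/IR/Operation.h"
#include "mlir/Support/LogicalResult.h"
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Module.h"
#include <memory>
#include <optional>
#include <string>

class ModuleToObject {
public:
  ModuleToObject(mlir::Operation &module, llvm::StringRef triple,
                 llvm::StringRef chip, llvm::StringRef features = {})
      : module(module), triple(triple.str()), chip(chip.str()),
        features(features.str()) {}
  virtual ~ModuleToObject() = default;

  // Drives the whole pipeline: translate the GPU module to LLVM IR, link in
  // the requested bitcode files and toolkit device libraries, then serialize.
  std::optional<llvm::SmallVector<char, 0>> run() {
    llvm::LLVMContext context;
    std::unique_ptr<llvm::Module> llvmModule = translateToLLVMIR(context);
    if (!llvmModule || mlir::failed(linkFiles(*llvmModule)))
      return std::nullopt;
    return serializeToObject(*llvmModule);
  }

protected:
  // Hooks that each serialization pipeline (e.g. NVPTXPipeline.cpp,
  // AMDGPUPipeline.cpp) would specialize.
  virtual std::unique_ptr<llvm::Module>
  translateToLLVMIR(llvm::LLVMContext &context) = 0;
  virtual mlir::LogicalResult linkFiles(llvm::Module &module) = 0;
  virtual std::optional<llvm::SmallVector<char, 0>>
  serializeToObject(llvm::Module &module) = 0;

  mlir::Operation &module;
  std::string triple, chip, features;
};
```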
Why is the patch so big?
- Such a big change can't be done in a series of steps without leaving broken bits in between.
- Many SLOC are reused from the original pipelines (this is especially true for the files NVPTXPipeline.cpp and AMDGPUPipeline.cpp); the only truly original files are TranslationTargetAttr.td, GPUTranslationTargets.*, and ModuleToObject.*.
TODO
This diff is the first in a series of patches extending the GPU serialization pipeline.
The remaining patches will:
- Remove all the serialization passes while updating in-tree projects to use the updated pipeline.
- Introduce LIT tests for testing translation.
- Introduce the offload pipeline proposed in this RFC. With this change, this patch should be less than 100 source lines.
Testing
For testing the patch, two machines were used:
- A local machine with an NVIDIA V100, CUDA Toolkit 11.8, and Ubuntu 22.04.2.
- Frontier at ORNL, with an AMD MI250X.
In all instances the test completed successfully.
Clang was used to compile the final executable purely out of convenience; the JIT should remain functional.
The input files were:
- test.cpp, which verifies the results produced by MLIR (a hypothetical sketch follows this list).
- test.mlir, the GPU kernel.
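For reference, a driver along the following lines could play the role of test.cpp. This is only a sketch; the actual file is not part of this diff, and the entry point `test_mlir`, its signature, and the `expected` reference are invented for illustration.

```
// Hypothetical stand-in for test.cpp; the real file is not part of this diff.
#include <cstdint>
#include <cstdio>
#include <vector>

// Entry point assumed to be exported by the lowered MLIR module; the actual
// name and signature may differ.
extern "C" void test_mlir(float *out, int64_t n);

// CPU reference for whatever the kernel computes; placeholder logic.
static float expected(int64_t i) { return static_cast<float>(i); }

int main() {
  // 3 x 128 matches the GridXYZ/BlockXYZ reported in the profile below.
  constexpr int64_t n = 3 * 128;
  std::vector<float> out(n, 0.0f);
  test_mlir(out.data(), n);
  for (int64_t i = 0; i < n; ++i) {
    if (out[i] != expected(i)) {
      std::printf("Mismatch at %lld\n", static_cast<long long>(i));
      return 1;
    }
  }
  std::printf("PASS\n");
  return 0;
}
```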
Setup 1
For compiling the test for NVIDIA targets, the following commands were used:
```
mlir-opt test.mlir \
  -gpu-launch-sink-index-computations \
  -gpu-kernel-outlining \
  -gpu-async-region \
  -convert-scf-to-cf \
  -convert-gpu-to-nvvm \
  -convert-math-to-llvm \
  -convert-arith-to-llvm \
  -convert-index-to-llvm \
  -canonicalize \
  -gpu-to-llvm='target="NVPTX: chip="sm_70" "' \
  -canonicalize \
  -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
clang++ test.ll test.cpp -lmlir_cuda_runtime -o test.exe
```
The following profile was generated with nsys.
```
 Time (%)  Total Time (ns)  Num Calls   Avg (ns)   Med (ns)  Min (ns)  Max (ns)  StdDev (ns)  Name
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  -------------------
     33.2          207,866          2  103,933.0  103,933.0     4,538   203,328    140,565.8  cuMemAlloc_v2
     23.4          146,329          1  146,329.0  146,329.0   146,329   146,329          0.0  cuModuleLoadData
     17.8          111,172          1  111,172.0  111,172.0   111,172   111,172          0.0  cuModuleUnload
     12.0           75,213          2   37,606.5   37,606.5     4,769    70,444     46,439.2  cuMemFree_v2
      6.6           41,398          3   13,799.3   17,192.0     3,466    20,740      9,123.1  cuMemcpyAsync
      3.1           19,197          1   19,197.0   19,197.0    19,197    19,197          0.0  cuLaunchKernel
      2.8           17,343          1   17,343.0   17,343.0    17,343    17,343          0.0  cuStreamCreate
      0.6            4,068          1    4,068.0    4,068.0     4,068     4,068          0.0  cuStreamDestroy_v2
      0.5            3,416          1    3,416.0    3,416.0     3,416     3,416          0.0  cuStreamSynchronize

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)     GridXYZ        BlockXYZ           Name
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  --------------  --------------  ----------------
    100.0            4,096          1   4,096.0   4,096.0     4,096     4,096          0.0     3    1    1   128    1    1  test_mlir_kernel
```
Setup 2
For compiling the test for AMDGPU targets, the following commands were used:
```
mlir-opt test.mlir \
  -gpu-launch-sink-index-computations \
  -gpu-kernel-outlining \
  -gpu-async-region \
  -convert-scf-to-cf \
  -convert-gpu-to-rocdl \
  -convert-math-to-llvm \
  -convert-arith-to-llvm \
  -convert-index-to-llvm \
  -canonicalize \
  -gpu-to-llvm='target="AMDGPU: chip="gfx90a" "' \
  -canonicalize \
  -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
clang++ test.ll test.cpp -lmlir_rocm_runtime -o test.exe
```
The following profile was generated with rocprof.