This is an archive of the discontinued LLVM Phabricator instance.

[mlir][gpu] Move the GPU serialization passes to translation.
Abandoned · Public

Authored by fmorac on May 30 2023, 6:19 PM.

Details

Summary

Brief

This diff moves the serialization passes gpu-to-cubin & gpu-to-hsaco to the translation step, and introduces the infrastructure needed for adding further serialization pipelines.

Why?
From a conceptual point of view serialization involves translating the Ops inside a GPU module to a serialized string, as such this shouldn't happen in a pass but rather during translation. From an implementation point of view it's easier to serialize and manipulate the process when both the host and device LLVM Modules are available, this is not possible during a pass however it's possible in translation.

Overview

The biggest changes introduced by this patch are:

  • Introducing the TranslationTarget attribute and its companion C++ interfaces, defined in GPUTranslationTargets.h. This attribute conveys the serialization options to the translation stage; it must be present as an attribute on the gpu.module for the translation to be performed. Format:
#gpu.target<PIPELINE: triple = TARGETTRIPLE, chip = TARGETCHIP, features = TARGETFEATURES, toolkit = TOOLKITPATH, link = [LIST OF BITCODE FILES TO LINK], opts = {EXTRA OPTS}>
; AMDGPU example using default chip = gfx600
#gpu.target<AMDGPU: toolkit = "/opt/rocm/5.4.3", link = ["mylib.bc"], opts = {fast, ftz}>

; NVPTX example, with default options, chip = sm_35, triple = nvptx64-nvidia-cuda.
#gpu.target<NVPTX>

Example:

gpu.module @kernel_module attributes {rocdl.hsaco = #gpu.target<AMDGPU : chip = "gfx90a">, target = #gpu.target<NVPTX>} {
  llvm.func @kernel(%arg0: i32, %arg1: !llvm.ptr<f32>, %arg2: !llvm.ptr<f32>, %arg3: i64, %arg4: i64, %arg5: i64) attributes {gpu.kernel} {
    llvm.return
  }
}
  • Modifying the gpu-to-llvm pass so that it no longer removes the gpu.modules, and adding a stub for the serialized string, to be filled in during translation. Additionally, this pass can be used to set or add a target on the gpu.modules. Example:
; Selects the `rocdl.hsaco` target; this target must be present in the attributes of every `gpu.module`, i.e. `gpu.module ... attributes {rocdl.hsaco = ...}`
--gpu-to-llvm='target=rocdl.hsaco'

; Sets the GPU target explicitly. The option takes the body of the `TranslationTarget` attribute as its format; the quotation marks have to be managed carefully for the attribute to parse successfully.
--gpu-to-llvm='target="NVPTX: chip = "sm_90", opts = {ftz}"'
  • The addition of the ModuleToObject class. This class controls the behavior of all serialization pipelines. It allows linking against any specified bitcode files and, if toolkit paths are detected or specified, against the device libraries found in those toolkits.
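
For orientation, a minimal sketch of the shape such a class could take follows; the names and signatures here are illustrative assumptions, not the patch's actual API:

// Hypothetical sketch of the ModuleToObject interface described above;
// names and signatures are assumptions, not the patch's actual API.
#include "llvm/ADT/SmallVector.h"
#include "llvm/ADT/StringRef.h"
#include "llvm/IR/Module.h"
#include <optional>
#include <string>

namespace mlir {
class Operation;

class ModuleToObject {
public:
  ModuleToObject(Operation &module, llvm::StringRef triple,
                 llvm::StringRef chip, llvm::StringRef features)
      : module(module), triple(triple.str()), chip(chip.str()),
        features(features.str()) {}
  virtual ~ModuleToObject() = default;

  // Translates the gpu.module to LLVM IR, links bitcode files and toolkit
  // device libraries, optimizes, and serializes to the target object format.
  virtual std::optional<llvm::SmallVector<char, 0>> run() = 0;

protected:
  // Links the bitcode files listed in the target attribute (`link = [...]`)
  // and, when a toolkit path is detected or specified, its device libraries.
  virtual void linkFiles(llvm::Module &llvmModule) = 0;

  Operation &module;
  std::string triple, chip, features;
};
} // namespace mlir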

Why is the patch so big?

  • Such a big change can't be made in a series of smaller steps without leaving broken bits in between.
  • Many SLOC are reused from the original pipelines (this is especially true for NVPTXPipeline.cpp and AMDGPUPipeline.cpp); the only truly original files are TranslationTargetAttr.td, GPUTranslationTargets.*, and ModuleToObject.*.

TODO

This diff is the first in a series of patches extending the GPU serialization pipeline.
The remaining patches will:

  • Remove all the serialization passes while updating in-tree projects to use the updated pipeline.
  • Introduce LIT tests for testing translation.
  • Introduce the offload pipeline proposed in this RFC. With the infrastructure from this change, that patch should be fewer than 100 source lines.

Testing

For testing the patch, two machines were used:

  1. Local machine, with NVIDIA V100, CUDA Toolkit 11.8 and Ubuntu 22.04.2.
  2. Frontier at ORNL, MI250X.

In all instances the test was successfully completed.

Clang was used to compile the final executable purely out of convenience; the JIT should remain functional.

The input files were:

  • test.cpp, a host-side file that verifies the results produced by MLIR (a hypothetical sketch follows below).
  • test.mlir, the GPU kernel.
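
As an illustration only, the host-side verifier could look roughly like the following; the entry-point name, its signature, and the kernel's assumed semantics (out[i] = in[i] + 1.0f) are placeholders, not the actual test:

// Hypothetical sketch of a host-side verifier such as test.cpp; the entry
// point and the kernel's contract are placeholders, not the actual test.
#include <cstdint>
#include <cstdio>
#include <vector>

// Entry point assumed to be produced from test_llvm.mlir by mlir-translate
// and linked in by clang++ (hypothetical name and signature).
extern "C" void test_mlir(float *input, float *output, int64_t n);

int main() {
  constexpr int64_t n = 384; // Matches grid 3 x 1 x 1, block 128 x 1 x 1.
  std::vector<float> in(n, 1.0f), out(n, 0.0f);
  test_mlir(in.data(), out.data(), n);
  for (int64_t i = 0; i < n; ++i) {
    if (out[i] != in[i] + 1.0f) {
      std::printf("FAILED at %lld\n", static_cast<long long>(i));
      return 1;
    }
  }
  std::printf("PASSED\n");
  return 0;
}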

Setup 1

For compiling the test for NVIDIA targets, the following commands were used:

mlir-opt test.mlir \
  -gpu-launch-sink-index-computations \
  -gpu-kernel-outlining \
  -gpu-async-region \
  -convert-scf-to-cf \
  -convert-gpu-to-nvvm \
  -convert-math-to-llvm \
  -convert-arith-to-llvm \
  -convert-index-to-llvm \
  -canonicalize \
  -gpu-to-llvm='target="NVPTX: chip="sm_70" "' \
  -canonicalize \
  -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
clang++ test.ll test.cpp -lmlir_cuda_runtime -o test.exe

The following profile was generated with nsys.

Time (%)  Total Time (ns)  Num Calls  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)         Name        
--------  ---------------  ---------  ---------  ---------  --------  --------  -----------  -------------------
    33.2          207,866          2  103,933.0  103,933.0     4,538   203,328    140,565.8  cuMemAlloc_v2      
    23.4          146,329          1  146,329.0  146,329.0   146,329   146,329          0.0  cuModuleLoadData   
    17.8          111,172          1  111,172.0  111,172.0   111,172   111,172          0.0  cuModuleUnload     
    12.0           75,213          2   37,606.5   37,606.5     4,769    70,444     46,439.2  cuMemFree_v2       
     6.6           41,398          3   13,799.3   17,192.0     3,466    20,740      9,123.1  cuMemcpyAsync      
     3.1           19,197          1   19,197.0   19,197.0    19,197    19,197          0.0  cuLaunchKernel     
     2.8           17,343          1   17,343.0   17,343.0    17,343    17,343          0.0  cuStreamCreate     
     0.6            4,068          1    4,068.0    4,068.0     4,068     4,068          0.0  cuStreamDestroy_v2 
     0.5            3,416          1    3,416.0    3,416.0     3,416     3,416          0.0  cuStreamSynchronize

Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)     GridXYZ         BlockXYZ           Name      
--------  ---------------  ---------  --------  --------  --------  --------  -----------  --------------  --------------  ----------------
   100.0            4,096          1   4,096.0   4,096.0     4,096     4,096          0.0     3    1    1   128    1    1  test_mlir_kernel

Setup 2

For compiling the test for AMDGPU targets, the following commands were used:

mlir-opt test.mlir \
  -gpu-launch-sink-index-computations \
  -gpu-kernel-outlining \
  -gpu-async-region \
  -convert-scf-to-cf \
  -convert-gpu-to-rocdl \
  -convert-math-to-llvm \
  -convert-arith-to-llvm \
  -convert-index-to-llvm \
  -canonicalize \
  -gpu-to-llvm='target="AMDGPU: chip="gfx90a" "' \
  -canonicalize \
  -o test_llvm.mlir
mlir-translate -mlir-to-llvmir test_llvm.mlir -o test.ll
clang++ test.ll test.cpp -lmlir_rocm_runtime -o test.exe

The following profile was generated with rocprof.

Diff Detail

Event Timeline

fmorac created this revision. May 30 2023, 6:19 PM
fmorac updated this revision to Diff 526863. May 30 2023, 6:20 PM

Fixed a typo in the commit message.

fmorac retitled this revision from "[mlir][gpu] Move the GPU serialization passes to translationn." to "[mlir][gpu] Move the GPU serialization passes to translation.". May 30 2023, 6:50 PM
fmorac edited the summary of this revision.
fmorac added a project: Restricted Project.
fmorac edited the summary of this revision. May 30 2023, 7:31 PM
fmorac published this revision for review. May 30 2023, 7:37 PM
fmorac added inline comments.
mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
639

Parse the target attribute used by the gpu-to-llvm pass.

662

Method for updating the target attribute.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
61

Verify some preconditions of the TranslationTargetAttr.

mlir/lib/Target/LLVMIR/Dialect/GPU/AMDGPUPipeline.cpp
196–331

Code taken from the original hsaco serialization pipeline.

mlir/lib/Target/LLVMIR/Dialect/GPU/CMakeLists.txt
44–115

Code taken from the CMake files of the original serialization passes.

mlir/lib/Target/LLVMIR/Dialect/GPU/NVPTXPipeline.cpp
129–204

Code taken from the original gpu-to-cubin pass.

mehdi_amini added inline comments. May 30 2023, 9:34 PM
mlir/include/mlir/Dialect/GPU/IR/TranslationTargetAttr.td
58

Would 2 be a better default?

63

It's not clear to me that this kind of information belongs in the IR: it seems oddly specific to one filesystem?

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
653

Can't you just do TranslationTargetAttr::parse(attrStr) or something like that here? Adding the dialect prefix seems like jumping through hoops somehow.

671

I think the function could benefit from early returns in general:

// If `target` is valid, set the attribute.
if (target) {
  module->setAttr(getTargetAttrName(), target);
  return success();
}
// Try selecting the attribute from the existing module attributes.
StringRef attrName = getTargetAttrName();
if (attrName == targetAttrName)
  return success();

...
mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
41

Any reason this isn't promoted to a first-class supported inherent attribute on the gpu.module op?

mlir/lib/Target/LLVMIR/Dialect/GPU/ModuleToObject.cpp
45

Does it have to be a std::string? It seems to be consumed as a StringRef everywhere; I'd write it as auto triple = ...

124

Nit: !gv.hasName() || !gvs.contains(gv.getName())

214

Don't we have access to the current LLVM context used by the overall translation? Can we just reuse it?

mlir/lib/Target/LLVMIR/Dialect/GPU/NVPTXPipeline.cpp
99

We should have a CMake variable telling us about this?

129

When is the code above this guard used when this guard is false? I can't find the entry point...

129–204

I would think that we could either serialize to PTX or serialize to Cubin based on an option (possibly on the TargetAttr)?

Also, is this the code that right now requires a GPU available in the machine at compile time, and that we should update to use the CUDA runtime library instead?

156
if (!serializedISA) {
  getOperation().emitError() << "Failed translating the Module to ISA.";
  return std::nullopt;
}
226

I'm not sure why this is not a failure?
Seems like we'd just "ignore" the GPU module, succeed with the translation even though it is incomplete?

This looks pretty nicely done overall!

fmorac marked 5 inline comments as done. May 31 2023, 6:46 AM

Once we agree on the overall idea of the patch, I'll update it to address all the comments.

mlir/include/mlir/Dialect/GPU/IR/TranslationTargetAttr.td
58

Yes, it makes more sense; e.g. in ptxas the default is 3.

63

This is here because, unlike passes, there's no general way to pass options to translation, as far as I know. However, I ended up liking this idea because then you can have:

gpu.module @A1 attributes {target = #gpu.target<NVPTX: link = ["foo1.bc"]>} {...}


gpu.module @A2 attributes {target = #gpu.target<NVPTX: link = ["foo2.bc"]>} {...}

This allows linking on a per-module basis.

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
653

The reason I add the prefix is that it makes for a cleaner command-line option, i.e.:

--gpu-to-llvm='target="NVPTX"'
; Versus
--gpu-to-llvm='target="#gpu.target<NVPTX>"'

But I can change it.

671

I'll change it.

mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
41

None, I'll change it.

mlir/lib/Target/LLVMIR/Dialect/GPU/ModuleToObject.cpp
45
124

Will change it.

214

Yes, and I think so; however, I figured we could eventually add threads to this, so I created a new one.

mlir/lib/Target/LLVMIR/Dialect/GPU/NVPTXPipeline.cpp
99

The function target.getToolkitPath() either returns the path specified in the translation attribute or the path found by CMake at build time.
See file GPUTranslationTargets.cpp lines 51 and 99.
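
In other words, something with this shape (a sketch of a member of the target interface only; the macro standing in for the CMake-detected path is a made-up placeholder):

// Sketch of the fallback described above; MLIR_DETECTED_TOOLKIT_PATH is a
// hypothetical placeholder for the path CMake detects at build time.
llvm::StringRef getToolkitPath() const {
  if (!toolkitPath.empty())
    return toolkitPath;              // Path given in the #gpu.target attribute.
  return MLIR_DETECTED_TOOLKIT_PATH; // Path found by CMake during the build.
}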

129

In this patch, none; however, the code above will be reused by the offload pipeline, which does not depend on MLIR_GPU_TO_CUBIN_TRANSLATION_ENABLED.

129–204

We could; however, if we stop at PTX we might have to modify the runner library or the JIT. Right now this serializes to cubin.

Yes, this is the code from the old pipeline that requires the driver. Since the patch was already too big, I preferred to leave that migration to a future patch.

156

Will change it.

226

Seems like we'd just "ignore" the GPU module, succeed with the translation even though it is incomplete?

That's exactly what it's doing; however, I agree with you, this should be an error.

I like the overall idea of this patch and am glad to see someone come along and clean things up!

As a minor nit, would it be possible to get a debug option (analogous to current --debug-only=serialize-to-blob) for dumping out ISA after the pipeline runs?

mlir/lib/Target/LLVMIR/Dialect/GPU/AMDGPUPipeline.cpp
188

Ooh, these are the magic incantations! Avoiding all this spam is why I wrote the old serialize-to-hsaco to be much more selective about when things got linked in.

214

Nit: don't we usually not auto these?

fmorac marked 6 inline comments as done. May 31 2023, 1:01 PM

I like the overall idea of this patch and am glad to see someone come along and clean things up!

As a minor nit, would it be possible to get a debug option (analogous to current --debug-only=serialize-to-blob) for dumping out ISA after the pipeline runs?

Yeah, for sure. Just the ISA, or also the device LLVM module?

mlir/lib/Target/LLVMIR/Dialect/GPU/AMDGPUPipeline.cpp
214

I think you're right; I believe the policy is to use auto only for variables whose type is explicit in the initializer. Some of these are leftovers from the serialization passes, but I'll do one more check on all the auto variables and update them accordingly.

Re the debug switch(es), I've had good reason to call for "let's see the assembly" (which is why I added serialize-to-blob's debug output ages back), but there's also "let's see what opt did" (which can be done with strategic use of --print-before on the command line, but maybe it'd be good to have a knob for dumping it).

There's what may be a separate utility or option (we have it downstream as a translate option called something like -gpu-module-to-rocdlir) that dumps out the LLVM IR before we start doing much to it (other than perhaps setting the data layout). It may be good to grow an upstream version of that.

I'd say that, most immediately, "print out the ISA" is what I'd want to preserve, but it could be good to provide a richer set of options while we're here.

and on the "while we're here" note, if GPU modules are going to have targets, would it be reasonable to, up in the --gpu-to{rocdl,nvvm} passes (the ones that go to MLIR-flavored LLVM), set the LLVM data layout during the translation process? This'll make sure that things like alloca address space = 5 gets handled correctly. (Or, in the future, that we don't trip over address space 7 being p7:128:160:160:32 during translation once it becomes a thing)

fmorac marked an inline comment as done. May 31 2023, 1:13 PM

I'd say that, most immediately, "print out the ISA" is what I'd want to preserve, but it could be good to provide a richer set of options while we're here.

I like the idea, but let's do this: I'll add the option for dumping the ISA that was present in the original pass, and postpone dumping more stuff and more options to a future patch. That way we keep the spirit of this diff, which is just moving things down to translation, and leave other stuff for a future patch. What do you think?

Yeah, let's not add more debug options in this patch - it's busy enough as it is.

and on the "while we're here" note, if GPU modules are going to have targets, would it be reasonable to, up in the --gpu-to{rocdl,nvvm} passes (the ones that go to MLIR-flavored LLVM), set the LLVM data layout during the translation process? This'll make sure that things like alloca address space = 5 gets handled correctly. (Or, in the future, that we don't trip over address space 7 being p7:128:160:160:32 during translation once it becomes a thing)

I actually thought that we could even remove those passes in favor of a single gpu-to-llvm, as I believe breaking the lowering of gpu into multiple passes was to accommodate the serialization passes. But let's do that in a future patch.

mehdi_amini added a comment (edited). May 31 2023, 2:56 PM

and on the "while we're here" note, if GPU modules are going to have targets, would it be reasonable to, up in the --gpu-to{rocdl,nvvm} passes (the ones that go to MLIR-flavored LLVM), set the LLVM data layout during the translation process? This'll make sure that things like alloca address space = 5 gets handled correctly. (Or, in the future, that we don't trip over address space 7 being p7:128:160:160:32 during translation once it becomes a thing)

I actually thought that we could even remove those passes in favor of a single gpu-to-llvm, as I believe breaking the lowering of gpu into multiple passes was to accommodate the serialization passes. But let's do that in a future patch.

+1 to everything here on how to proceed: iterative development is better!

(when we understand the direction, see my inline comment about this)

mlir/include/mlir/Dialect/GPU/IR/TranslationTargetAttr.td
63

I think there are two things: "list of libraries to link", which I am onboard with (I see it as the -l flag for the linker), and then there is the path component (which would be the -L of the linker).
What I am saying is that -l makes sense, but I have concerns about -L if you see what I mean?

So here toolkit for example looks like a path, and link isn't clear if this is a list of libraries to link or absolute file paths (I'd rather make it a list of libs, really)

As of "translation options", you makes me realize that it's actually not clear to me that this all fits a "translation" actually: that is a translation is a "simple" almost 1-1 layer: I lost track of why we can stage things to keep "translations" simpler?
I guess I'm re-reading this patch description:

From a conceptual point of view, serialization translates the ops inside a GPU module to a serialized string, so it should happen during translation rather than in a pass. From an implementation point of view, it is easier to serialize and manage the process when both the host and device LLVM modules are available; this is not possible during a pass, but it is during translation.

But I'm not sure I agree with this: first, I don't get the explanation of why this should be a translation and not a pass. Then I don't follow the next sentence either: the pass also seems to me to have access to both the host and the device code.

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
653

Actually my point was that the prefix is necessary so that parseAttribute() knows how to route it, but it'll internally unwrap and pass the payload to TranslationTargetAttr::parse(attrStr), while you could skip these extra steps and call TranslationTargetAttr::parse(attrStr) without the prefix directly. I'm not suggesting the user write the prefix in the option itself.

mlir/lib/Target/LLVMIR/Dialect/GPU/NVPTXPipeline.cpp
129–204

We could; however, if we stop at PTX we might have to modify the runner library or the JIT. Right now this serializes to cubin.

I'm wondering about things like AOT cases actually: ultimately we may want to be able to emit only LLVM IR, right?

fmorac added inline comments. May 31 2023, 4:35 PM
mlir/include/mlir/Dialect/GPU/IR/TranslationTargetAttr.td
63

I think there are two things: "list of libraries to link", which I am onboard with (I see it as the -l flag for the linker), and then there is the path component (which would be the -L of the linker).
What I am saying is that -l makes sense, but I have concerns about -L if you see what I mean?

So here toolkit for example looks like a path, and link isn't clear if this is a list of libraries to link or absolute file paths (I'd rather make it a list of libs, really)

You're right on the -l; however, toolkit is more of a convenience option. It's just for loading the device libraries we know are in the toolkit without having to specify all the library files with link, so it's not -L; for example, the toolkit option loads libdevice (e.g. with #gpu.target<NVPTX: toolkit = "/usr/local/cuda">, a hypothetical path). In the case of NVIDIA it's just libdevice, but for AMD there are many of them.

As of "translation options", you makes me realize that it's actually not clear to me that this all fits a "translation" actually: that is a translation is a "simple" almost 1-1 layer: I lost track of why we can stage things to keep "translations" simpler?

We could have a translation without options; however, that requires stopping at LLVM IR and having more general AOT and JIT infrastructure.

I guess I'm re-reading this patch description:

From a conceptual point of view, serialization translates the ops inside a GPU module to a serialized string, so it should happen during translation rather than in a pass. From an implementation point of view, it is easier to serialize and manage the process when both the host and device LLVM modules are available; this is not possible during a pass, but it is during translation.

But I'm not sure I agree with this: first, I don't get the explanation of why this should be a translation and not a pass. Then I don't follow the next sentence either: the pass also seems to me to have access to both the host and the device code.

The conceptual point of view, that things involving translation should happen in translation, is just my opinion of how I view translation.

However, from the implementation perspective, having things in translation makes them easier. The reason the other patch (D149559) is so big is that I had to recreate a lot of infrastructure already available in Clang, and even if we moved that infrastructure to LLVM, the patch would still be big: all of it relies on having both the host LLVM IR and the device LLVM IR and interacting with both at the same time. That's not possible in a pass; in a pass you would have host MLIR and device LLVM IR.

I mean, both routes are possible (pass or translation); the other patch already does the job, but it doesn't allow reusing LLVM infrastructure.
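
To make the implementation point concrete, here is a rough sketch of the step that translation enables; translateModuleToLLVMIR is the real MLIR entry point, while runModuleToObject and replaceBinaryStub are hypothetical helpers standing in for the pipelines above:

// Rough sketch only: both the host llvm::Module and the device llvm::Module
// exist at this point, so the serialized blob can be patched directly into
// the host module. runModuleToObject and replaceBinaryStub are hypothetical.
#include "mlir/IR/Operation.h"
#include "mlir/Support/LogicalResult.h"
#include "mlir/Target/LLVMIR/Export.h"
#include "llvm/IR/Module.h"
#include <optional>

using namespace mlir;

LogicalResult serializeGPUModule(Operation *gpuModule,
                                 llvm::Module &hostModule,
                                 llvm::LLVMContext &ctx) {
  // Translate the device-side MLIR into its own llvm::Module.
  std::unique_ptr<llvm::Module> deviceModule =
      translateModuleToLLVMIR(gpuModule, ctx);
  if (!deviceModule)
    return failure();

  // Link device libraries, optimize, and emit the target object.
  std::optional<llvm::SmallVector<char, 0>> blob =
      runModuleToObject(*deviceModule);
  if (!blob)
    return failure();

  // Replace the stub left behind by the gpu-to-llvm pass with the binary.
  replaceBinaryStub(hostModule, *gpuModule, *blob);
  return success();
}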

mlir/lib/Conversion/GPUCommon/GPUToLLVMConversion.cpp
653

Oh, OK. The reason I didn't use the class method is that it requires an AsmParser, and it was not super clear to me how to get one from the string. When I asked on Discord for the best way to do it from a string, someone pointed me to that method. But if there's a way, we can change it.

mlir/lib/Target/LLVMIR/Dialect/GPU/NVPTXPipeline.cpp
129–204

The answer is maybe. In the offload pipeline we can do that: we end at LLVM IR and let another stage take care of the compilation. However, I'm not proposing that yet; instead I was proposing two co-existing pipelines:

  1. The offload pipeline: stops at LLVM IR, and we handle code generation in a separate stage (something like the Clang driver).
  2. These legacy pipelines.

However, in the future we can discuss having just option 1, hence the maybe. Having just option 1 entirely in MLIR requires more patches and, I think, a broader conversation about AOT and MLIR.

mehdi_amini added inline comments. May 31 2023, 5:07 PM
mlir/include/mlir/Dialect/GPU/IR/TranslationTargetAttr.td
63

however, toolkit is more of a convenience option. It's just for loading the device libraries we know are in the toolkit without having to specify all the library files with link [...]

I would need to see some examples; it's not clear to me what you mean.

The conceptual point of view, that things involving translation should happen in translation, is just my opinion of how I view translation.

Sure: what I'm saying is that in the process of "taking an IR with a GPU module, translating it to LLVM, running LLVM optimizations, emitting assembly, assembling to binary, embedding this as a string", there is "translating" in the middle; the rest isn't "translation".

mlir/lib/Target/LLVMIR/Dialect/GPU/NVPTXPipeline.cpp
129–204

I think we discussed it, but I'll repeat it here because I'm not sure we're aligned on this: I am quite concerned about moving forward with 2 co-existing pipelines before we have a plan to converge to just one.

lamb-j added a subscriber: lamb-j. Jun 1 2023, 11:26 AM
lamb-j added inline comments.
mlir/include/mlir/Target/LLVMIR/Dialect/GPU/GPUTranslationTargets.h
130

I think gfx803 or gfx900 may be a better default. HSA never supported gfx600, and gfx700 was experimental.

ROCr supports gfx803, but it has been removed for CLR (so ROCm OpenCL doesn't support gfx803).

136

We're in the process of updating the default ABI to code object v5 (500). Just something to be aware of; this may need to be changed depending on which patch lands first.

mlir/lib/Target/LLVMIR/Dialect/GPU/AMDGPUPipeline.cpp
38

This call isn't thread-safe (specifically RegisterTarget()). It may not be an issue in this context, but just a heads-up (I recently hit a tricky bug because of this).

45

I believe ManagedStatic is in the process of being removed from LLVM. See here: https://reviews.llvm.org/D129134

47

AMDGPU?

80

AMDGPU? hsaco?