This is an archive of the discontinued LLVM Phabricator instance.

[mlir][gpu] Improving Cubin Serialization with ptxas Compiler
ClosedPublic

Authored by guraypp on Jul 18 2023, 1:07 AM.

Details

Summary

This work improves how we compile the generated PTX code using the ptxas compiler. Currently, we rely on the driver's JIT API to compile the PTX code. However, this approach has some limitations: it does not always produce the same binary output as the ptxas compiler, leading to potential inconsistencies in the generated cubin files.

This work introduces a significant improvement by directly utilizing the ptxas compiler for PTX compilation. By doing so, we can achieve more consistent and reliable results in generating cubin files (a rough sketch of such an invocation is shown after the list below). Key Benefits:

  • Using the ptxas compiler directly ensures that the cubin files generated during the build process remain consistent with CUDA compilation using nvcc or clang.
  • Another advantage of this work is that it allows developers to experiment with different ptxas versions without changing the rest of the toolchain. Performance varies among ptxas versions, so one can easily try a different ptxas compiler.
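For illustration, here is a minimal sketch of what invoking ptxas directly can look like from the serialization pass, using LLVM's temporary-file and process utilities. The helper name, the -arch flag value, and the omission of error handling are assumptions made for brevity; this is not the patch's actual code.

    #include "llvm/ADT/SmallString.h"
    #include "llvm/ADT/SmallVector.h"
    #include "llvm/ADT/Twine.h"
    #include "llvm/Support/FileSystem.h"
    #include "llvm/Support/MemoryBuffer.h"
    #include "llvm/Support/Program.h"
    #include "llvm/Support/raw_ostream.h"

    // Hypothetical helper: write the PTX to a temporary file, run ptxas on it,
    // and read the resulting cubin back into memory. Error handling is elided.
    static std::unique_ptr<llvm::MemoryBuffer>
    compilePtxWithPtxas(llvm::StringRef ptx, llvm::StringRef ptxasPath,
                        llvm::StringRef arch /*e.g. "sm_90a"*/) {
      llvm::SmallString<64> ptxFile, cubinFile;
      int ptxFd;
      llvm::sys::fs::createTemporaryFile("mlir-gpu", "ptx", ptxFd, ptxFile);
      llvm::sys::fs::createTemporaryFile("mlir-gpu", "cubin", cubinFile);
      {
        llvm::raw_fd_ostream os(ptxFd, /*shouldClose=*/true);
        os << ptx;
      }

      // Equivalent to: ptxas -arch=<arch> <input.ptx> -o <output.cubin>
      std::string archFlag = ("-arch=" + arch).str();
      llvm::SmallVector<llvm::StringRef, 5> args = {ptxasPath, archFlag,
                                                    ptxFile, "-o", cubinFile};
      if (llvm::sys::ExecuteAndWait(ptxasPath, args) != 0)
        return nullptr;

      auto cubinOrErr = llvm::MemoryBuffer::getFile(cubinFile);
      llvm::sys::fs::remove(ptxFile);
      llvm::sys::fs::remove(cubinFile);
      if (!cubinOrErr)
        return nullptr;
      return std::move(*cubinOrErr);
    }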

Diff Detail

Event Timeline

guraypp created this revision. Jul 18 2023, 1:07 AM
guraypp requested review of this revision. Jul 18 2023, 1:07 AM
Herald added a project: Restricted Project.
mlir/include/mlir/Dialect/GPU/Transforms/Passes.h
176

this now has enough magic flags that you may want an options struct with named fields to hold the information.

guraypp added inline comments. Jul 19 2023, 12:06 AM
mlir/include/mlir/Dialect/GPU/Transforms/Passes.h
176

Right, we have many flags here, and we might even have more since PTX compilation is done by another compiler.

What does the options struct look like? Do we have an example?

mlir/include/mlir/Dialect/GPU/Transforms/Passes.h
176

Yes, quite a few actually.

It depends on how much you want to be configurable with CLI options.

You could look at ConvertFuncToLLVMPass, which has a let options = declaration, and at how it is used in https://reviews.llvm.org/D155463.

Alternatively, you could also create your own struct manually; there are examples in Dialect/Linalg/Transforms/Transforms.h (a rough sketch of such a struct follows below).

Approving, conditioned on a better options struct to avoid proliferation of flags.
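For concreteness, a rough sketch of what a manually defined options struct could look like, in the spirit of the examples in Dialect/Linalg/Transforms/Transforms.h. The struct and field names below are illustrative assumptions, not the names this patch ultimately uses.

    #include <string>

    // Hypothetical options struct (all names illustrative): groups the pass's
    // "magic flags" behind named fields instead of a long constructor
    // parameter list, so call sites stay readable as more knobs are added.
    struct SerializeToCubinOptions {
      std::string triple = "nvptx64-nvidia-cuda"; // target triple
      std::string chip = "sm_80";                 // GPU architecture
      std::string features = "+ptx76";            // PTX ISA feature string
      int optLevel = 2;                           // optimization level
      bool dumpPtx = false;                       // print the generated PTX
      bool useDriverJit = true;                   // keep the driver's JIT path
      std::string ptxasPath = "ptxas";            // which ptxas binary to run
    };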

mlir/lib/Dialect/GPU/Transforms/SerializeToCubin.cpp
228

spurious format change

This revision is now accepted and ready to land. Jul 20 2023, 11:57 PM
guraypp updated this revision to Diff 542968. Jul 21 2023, 9:32 AM

add parameters for ptxas, and rebase

guraypp updated this revision to Diff 543405. Jul 24 2023, 12:40 AM
guraypp marked an inline comment as done.

use struct for the options

Also, I have a concern with the patch in the first place: the previous plan on Discourse was to use the CUDA toolkit API instead of shelling out to the ptxas binary.

See also the plans for changing the entire GPU-to-LLVM lowering path here: https://reviews.llvm.org/D154153 (series of patches).

A lot of context here: https://discourse.llvm.org/t/rfc-extending-mlir-gpu-device-codegen-pipeline/70199/

Thanks @mehdi_amini for addressing the issue promptly. I made two additional fixes after seeing buildbot failures. I thought they fixed it. Where can I see the latest failures?

Also, I have a concern with the patch in the first place: the previous plan on Discourse was to use the CUDA toolkit API instead of shelling out to the ptxas binary.

I'm happy to discuss it on Discourse, but I thought this work would make everyone happy. Let me explain the need here; if it gets long, we can continue on Discourse.

This work impacts only PTX-to-SASS compilation (not GPU-to-LLVM-to-PTX). It enables using the ptxas compiler rather than the CUDA driver. It's crucial for two reasons:

  1. The flexibility to choose a different ptxas compiler when facing performance regressions. Currently, MLIR uses the CUDA driver for PTX compilation, which limits the ptxas compiler to the underlying driver version, leaving no room for choice.
  2. Ensuring you get the same SASS code that nvcc produces. Interestingly, the SASS produced by the CUDA driver differs from that of ptxas. During my implementation of Hopper's TMA load instruction, I encountered MISALIGNED address issues.

It is not easy to find an answer when you hit one of these problems. SASS is not documented.

The nvcc pipeline is the state of the art for NVIDIA compilation, and it uses ptxas for that. I think we should mimic what nvcc does in MLIR.
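For readers comparing the two paths: the driver-based route goes roughly through the cuLink* JIT API sketched below, whereas this proposal shells out to the standalone ptxas binary instead. This is a simplified illustration that assumes a CUDA context is already current and omits all error checking; it is not the pass's exact code.

    #include <cuda.h>
    #include <vector>

    // Simplified sketch of the existing driver-JIT path: the PTX string is
    // handed to the CUDA driver, which compiles it with whatever ptxas version
    // is baked into the installed driver, so the resulting SASS depends on the
    // driver version. Assumes a valid CUDA context is current.
    static void jitPtxWithDriver(const char *ptx, size_t ptxSize,
                                 std::vector<char> &cubinOut) {
      CUlinkState linkState;
      cuLinkCreate(/*numOptions=*/0, /*options=*/nullptr,
                   /*optionValues=*/nullptr, &linkState);
      cuLinkAddData(linkState, CU_JIT_INPUT_PTX, const_cast<char *>(ptx),
                    ptxSize, /*name=*/"kernel", /*numOptions=*/0, nullptr,
                    nullptr);
      void *cubin = nullptr;
      size_t cubinSize = 0;
      cuLinkComplete(linkState, &cubin, &cubinSize);
      // The cubin buffer is owned by linkState; copy it out before destroying.
      cubinOut.assign(static_cast<char *>(cubin),
                      static_cast<char *>(cubin) + cubinSize);
      cuLinkDestroy(linkState);
    }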

See also the plans for changing the entire GPU-to-LLVM lowering path here: https://reviews.llvm.org/D154153 (series of patches).

A lot of context here: https://discourse.llvm.org/t/rfc-extending-mlir-gpu-device-codegen-pipeline/70199/

That's cool. The current GPU pipeline has some shortcomings. I am very interested in every aspect of GPUs; please add me as a reviewer when you have something new, so I can take a look.

mlir/include/mlir/Dialect/GPU/Transforms/Passes.h
176

I still did not understand what kind of struct I need to use. There is no .td file for this pass, so I cannot do let options.
I think I had better land this and implement the struct in a follow-up PR.

This work impacts only PTX-to-SASS compilation (not GPU-to-LLVM-to-PTX). It enables using the ptxas compiler rather than the CUDA driver. It's crucial for two reasons:
The flexibility to choose a different ptxas compiler when facing performance regressions. Currently, MLIR uses the CUDA driver for PTX compilation, which limits the ptxas compiler to the underlying driver version, leaving no room for choice.
Ensuring you get the same SASS code that nvcc produces. Interestingly, the SASS produced by the CUDA driver differs from that of ptxas. During my implementation of Hopper's TMA load instruction, I encountered MISALIGNED address issues.
It is not easy to find an answer when you hit one of these problems. SASS is not documented.

Sure, but aren't ptxas and nvrtc both just distributed with the CUDA toolkit? From the point of view you mentioned above, what is the difference between using one or the other?

The nvcc pipeline is state-of-art for nvidia compilation, and it uses ptxas for that. I think we should mimic what nvcc is doing in MLIR.

Actually, it does so only if you ask it to, right? Otherwise it embeds PTX?

Sure, but aren't ptxas and nvrtc both just distributed with the CUDA toolkit? From the point of view you mentioned above, what is the difference between using one or the other?

I have extracted the relevant code for you to review. The link below contains the ptxas-generated SASS code on the left side, the driver-generated SASS code on the right side, and the PTX code used for both cases in the box at the bottom. Unfortunately, the program compiled with the driver crashes with a Warp Misaligned Address error. I suspect that the issue may be related to the UTMLDG.2D instruction, but it's hard to say, and I am unable to debug this further. The other program works as expected.
https://godbolt.org/z/6onxv6enz

I'm uncertain about whether ptxas and the CUDA driver's compiler consistently generate the same SASS code, as there has been no confirmation on this matter in the past.

Nevertheless, to maintain our sanity and ensure consistent performance, I strongly advocate selecting ptxas. It's really sad to see a 10% performance drop, even when generating identical PTX code, simply because someone installed a new CUDA driver :(

The nvcc pipeline is the state of the art for NVIDIA compilation, and it uses ptxas for that. I think we should mimic what nvcc does in MLIR.

Actually, it does so only if you ask it to, right? Otherwise it embeds PTX?

Try nvcc -v code.cu; you will see ptxas in the compilation flow.

Additionally, I want to clarify that I am not suggesting the removal of driver compilation. Instead, I propose adding ptxas compilation as an option behind a flag. This way, anyone can choose to opt in or out based on their specific requirements and preferences.
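Purely as an illustration of that opt-in idea, reusing the sketch struct from the earlier inline comment: all names below are hypothetical and do not reflect the committed interface.

    #include "mlir/Pass/PassManager.h"

    // Hypothetical opt-in sketch: the driver-JIT path stays the default and
    // ptxas is selected explicitly at the call site. Names are illustrative.
    void buildSerializationPipeline(mlir::PassManager &pm) {
      SerializeToCubinOptions options;                 // sketch struct from above
      options.useDriverJit = false;                    // opt in to ptxas
      options.ptxasPath = "/usr/local/cuda/bin/ptxas"; // pin a specific ptxas
      pm.addPass(createGpuSerializeToCubinPass(options)); // assumed factory
    }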