This is an archive of the discontinued LLVM Phabricator instance.

[mlir][cuda] Add a test-lower-to-nvvm catchall passpipeline.
ClosedPublic

Authored by nicolasvasilache on Jul 17 2023, 7:17 AM.

Details

Summary

This mirrors the test-lower-to-llvm pass pipeline, which provides some sanity when running e2e examples.

One peculiarity of the GPU pipeline is that we want to allow 32b indexing in kernels.
This is currently not straightforward as there are dependencies between passes.
This new test pass orders passes in a way that connects end-to-end.
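
As an aside for readers of this archive, here is a minimal sketch of the shape of such a pipeline, using upstream MLIR pass-creation functions from around the time of this revision; the exact spelling of the 32b index-bitwidth option is an assumption, and the committed TestLowerToNVVM.cpp is authoritative.

  #include "mlir/Conversion/Passes.h"
  #include "mlir/Dialect/GPU/IR/GPUDialect.h"
  #include "mlir/Dialect/GPU/Transforms/Passes.h"
  #include "mlir/Pass/PassManager.h"

  using namespace mlir;

  // Sketch only, not the committed TestLowerToNVVM.cpp.
  static void buildTestLowerToNVVMSketch(OpPassManager &pm) {
    // Host side: outline gpu.launch bodies into gpu.module kernels first,
    // so kernel and host lowering can be configured independently.
    pm.addPass(createGpuKernelOutliningPass());
    // Kernel side: lower GPU ops to NVVM with 32b in-kernel indexing
    // (option spelling assumed).
    pm.addNestedPass<gpu::GPUModuleOp>(
        createLowerGpuOpsToNVVMOpsPass(/*indexBitwidth=*/32));
    // Host side: lower launches and memory management to runtime calls,
    // then clean up leftover casts between lowered and unlowered types.
    pm.addPass(createGpuToLLVMConversionPass());
    pm.addPass(createReconcileUnrealizedCastsPass());
  }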

Event Timeline

Herald added a project: Restricted Project.
nicolasvasilache requested review of this revision. Jul 17 2023, 7:17 AM
ftynse accepted this revision. Jul 17 2023, 7:27 AM
ftynse added inline comments.
mlir/test/lib/Dialect/NVVM/TestLowerToNVVM.cpp
134

inconsistent?

208

Can we put this 64 in a named constant?

This revision is now accepted and ready to land. Jul 17 2023, 7:27 AM
guraypp accepted this revision. Jul 17 2023, 7:33 AM

Setting the pass pipeline manually was a headache. This is really useful to run something quickly on the GPU.

> One peculiarity of the GPU pipeline is that we want to allow 32b indexing in kernels.

That's right: GPUs only have 32-bit registers; however, they can use a pair of 32-bit registers for a 64-bit address space.

The constant and shared memory spaces are 32-bit addressable. In LLVM we have a flag for that; can we set this flag (see below)?
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp#L60-L65
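
For reference, the flag behind that link is NVPTX's nvptx-short-ptr option, which feeds the module data layout; a simplified excerpt (verify against the linked file):

  // Simplified from llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp: the flag
  // opts the const/local/shared address spaces into 32-bit pointers via the
  // p3/p4/p5 entries of the data layout string.
  static cl::opt<bool> UseShortPointersOpt(
      "nvptx-short-ptr",
      cl::desc("Use 32-bit pointers for accessing const/local/shared "
               "address spaces."),
      cl::init(false), cl::Hidden);

  static std::string computeDataLayout(bool is64Bit, bool UseShortPointers) {
    std::string Ret = "e";
    if (!is64Bit)
      Ret += "-p:32:32";
    else if (UseShortPointers) // shared (3), const (4), local (5) as 32-bit
      Ret += "-p3:32:32-p4:32:32-p5:32:32";
    Ret += "-i64:64-i128:128-v16:16-v32:32-n16:32:64";
    return Ret;
  }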

guraypp added inline comments. Jul 17 2023, 7:35 AM
mlir/test/lib/Dialect/NVVM/TestLowerToNVVM.cpp
143

We need to call createConvertNVVMToLLVMPass as well. I added that for the PTX builder.
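
If it helps future readers, the placement would presumably extend the kernel-side portion of the pipeline sketched earlier, nested on the GPU module like the other kernel passes (an assumption, not the actual follow-up patch):

  // Assumed placement: run NVVM-to-LLVM after the GPU-to-NVVM lowering so
  // that NVVM ops with inline-PTX lowerings are converted as well.
  pm.addNestedPass<gpu::GPUModuleOp>(createConvertNVVMToLLVMPass());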

nicolasvasilache marked 3 inline comments as done. Jul 17 2023, 7:59 AM
nicolasvasilache added inline comments.
mlir/test/lib/Dialect/NVVM/TestLowerToNVVM.cpp
143

Discussed offline: @guraypp will pick this up in a follow-up, as he knows what needs to be done and I'd be in discovery mode.

nicolasvasilache marked an inline comment as done.

Update

This revision was landed with ongoing or failed builds. Jul 17 2023, 8:18 AM
This revision was automatically updated to reflect the committed changes.

Thanks for doing this! It's useful to have this kind of reference pipeline available :)

kerrmudgeon added a subscriber: kerrmudgeon. Edited Jul 25 2023, 10:44 AM

Enabling 32b indexing is a useful optimization, but limiting kernels to strictly 32b linear offsets and strides is not suitable for many classes of production kernels, which must be able to index the entire physical GPU memory.

The following use-cases come to mind with corresponding index types:

  • Global Memory offsets and strides: i64
  • Shared Memory offsets: i32
  • Tensor Coordinates: i32

Is it feasible to one day support a multiplicity of index types within the GPU kernel? When lowering the index type, the relevant pass would need to consider the memory space the index dereferences (see the sketch below). This is admittedly complicated, but range limits have a significant impact on the functionality of the resulting kernels.
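
To illustrate the decision such a pass would face, a hypothetical helper (not an existing MLIR API) picking a bitwidth per NVVM address space might look like:

  // Hypothetical: choose an index bitwidth from the NVVM address space of
  // the memref being indexed. Not an existing MLIR API.
  static unsigned pickIndexBitwidth(unsigned addressSpace) {
    switch (addressSpace) {
    case 3: // shared: 32-bit addressable
    case 4: // constant: 32-bit addressable
      return 32;
    default: // generic (0) and global (1) must span all device memory
      return 64;
    }
  }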

> Enabling 32b indexing is a useful optimization, but limiting kernels to strictly 32b linear offsets and strides is not suitable for many classes of production kernels [...] Is it feasible to one day support a multiplicity of index types within the GPU kernel?

Yes, absolutely this would be a good landing state.

In this first commit I went purely after the only thing I was able to make run without crashing; it was quite an unfortunate trial-and-error process to get out of the failure modes for now...
The mixed mode will unfortunately break in various ways that I have not yet had time to investigate (see some TODOs).

As layering improves I'd love to see much more usable compositions of the things you mention.

So please treat this only as the first e2e thing that is known to work, and let's iterate on improvements and generalizations.

> Enabling 32b indexing is a useful optimization, but limiting kernels to strictly 32b linear offsets and strides is not suitable for many classes of production kernels [...] Is it feasible to one day support a multiplicity of index types within the GPU kernel?

I talked offline with @nicolasvasilache; I will follow up on this part as well. I think we want to set an LLVM flag (see the link below) that sets the data layout for the shared and constant memory spaces. As these memories are 32-bit addressable, it is always safe to use 32b there. You are right that we cannot easily use 32b directly for the global address space.
https://github.com/llvm/llvm-project/blob/main/llvm/lib/Target/NVPTX/NVPTXTargetMachine.cpp#L60-L65

Even if we set kernel-index-bitwidth=32 right now, I believe the PTX still uses 64-bit registers, because the flag does not change the data layout and LLVM promotes 32b to 64b. It works, but it is not performant.
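
Concretely, the follow-up would amount to changing the data layout carried by the kernel's LLVM module. A sketch, assuming a handle llvmModule to that module (the layout string mirrors computeDataLayout() in the linked NVPTXTargetMachine.cpp and should be verified there):

  // Give the kernel module the NVPTX "short pointer" layout so that
  // shared (3), const (4) and local (5) pointers are 32-bit wide.
  llvmModule->setTargetTriple("nvptx64-nvidia-cuda");
  llvmModule->setDataLayout(
      "e-p3:32:32-p4:32:32-p5:32:32-i64:64-i128:128-v16:16-v32:32-n16:32:64");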

Yes, this all makes sense. Thanks for contributing this installment!
Iterating toward the optimal design is a process. :)