This is an archive of the discontinued LLVM Phabricator instance.

[MLIR] Run the TMA test for sm_90
ClosedPublic

Authored by guraypp on Sep 1 2023, 3:20 AM.

Details

Summary

TMA support was introduced to MLIR; however, it needs the ptxas compiler. The recent work in D154117 introduced that!

This work runs the existing integration test.

Diff Detail

Event Timeline

guraypp created this revision.Sep 1 2023, 3:20 AM
Herald added a project: Restricted Project.Sep 1 2023, 3:20 AM
guraypp requested review of this revision.Sep 1 2023, 3:20 AM

@fmorac I use the gpu-module-to-binary pass you recently introduced for mlir->llvm->ptx->cubin, and eventually link the host's LLVM IR (which has the embedded cubin) with clang to generate the executable. Is this the right way to use your pass?

I used to run GPU MLIR integration tests with mlir-cpu-runner, but I guess gpu-module-to-binary is not compatible with it.
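A rough sketch of that flow (the test file name, intermediate file names, and library paths here are placeholders, not the actual setup):

mlir-opt tma_test.mlir -gpu-kernel-outlining -nvvm-attach-target="chip=sm_90 features=+ptx80" \
    | mlir-opt -pass-pipeline='builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-nvvm))' \
    | mlir-opt -gpu-to-llvm -gpu-module-to-binary > host_with_cubin.mlir
# Translate the host module (which now embeds the cubin) to LLVM IR.
mlir-translate --mlir-to-llvmir host_with_cubin.mlir > host.ll
# Link the host LLVM IR against the runtime wrapper libraries with clang.
clang host.ll -L${LLVM_LIB} -lmlir_cuda_runtime -lmlir_runner_utils -o tma_test && ./tma_test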

fmorac added a comment.Sep 4 2023, 4:13 AM

@fmorac I use the gpu-module-to-binary pass you recently introduced for mlir->llvm->ptx->cubin, and eventually link the host's LLVM IR (which has the embedded cubin) with clang to generate the executable. Is this the right way to use your pass?

I used to run GPU MLIR integration tests with mlir-cpu-runner, but I guess gpu-module-to-binary is not compatible with it.

A couple of things: mlir-cpu-runner should work. For example, the following should work (if you have an sm_70 GPU) with the all-reduce-and.mlir test:

mlir-opt all-reduce-and.mlir -gpu-kernel-outlining -nvvm-attach-target=chip=sm_70 \
    | mlir-opt -pass-pipeline='builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-nvvm))' \
    | mlir-opt -gpu-to-llvm -gpu-module-to-binary \
    |  mlir-cpu-runner --shared-libs=${LLVM_LIB}/libmlir_cuda_runtime.so --shared-libs=${LLVM_LIB}/libmlir_runner_utils.so --entry-point-result=void

Adding module=main_kernel in --nvvm-attach-target= is not necessary; that option is just there to filter which modules the target gets added to.
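For illustration, the filter form looks roughly like this (module name as mentioned above, chip as in the recipe):

# Attach the NVVM target only to gpu.module ops whose name matches the filter.
--nvvm-attach-target="module=main_kernel chip=sm_70"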

There might be issues if the chip doesn't match the GPU the code is running on, i.e. chip=sm_80 but the GPU is sm_90.

The clang target is not yet supported upstream as you have it.

If the above workflow with mlir-cpu-runner is not working, could you send me the error?

guraypp updated this revision to Diff 555710.Sep 4 2023, 5:05 AM

use mlir-cpu-runner

@fmorac I use the gpu-module-to-binary pass you recently introduced for mlir->llvm->ptx->cubin, and eventually link the host's LLVM IR (which has the embedded cubin) with clang to generate the executable. Is this the right way to use your pass?

I used to run GPU MLIR integration tests with mlir-cpu-runner, but I guess gpu-module-to-binary is not compatible with it.

A couple of things: mlir-cpu-runner should work. For example, the following should work (if you have an sm_70 GPU) with the all-reduce-and.mlir test:

mlir-opt all-reduce-and.mlir -gpu-kernel-outlining -nvvm-attach-target=chip=sm_70 \
    | mlir-opt -pass-pipeline='builtin.module(gpu.module(strip-debuginfo,convert-gpu-to-nvvm))' \
    | mlir-opt -gpu-to-llvm -gpu-module-to-binary \
    |  mlir-cpu-runner --shared-libs=${LLVM_LIB}/libmlir_cuda_runtime.so --shared-libs=${LLVM_LIB}/libmlir_runner_utils.so --entry-point-result=void

Thanks for the recipe. My test works now with mlir-cpu-runner. I updated the test code.

Adding module=main_kernel in --nvvm-attach-target= is not necessary; that option is just there to filter which modules the target gets added to.

I actually need it to set the PTX version. The default version for sm_90 is 7.8, which does not support the PTX instructions for TMA, so I set it to ptx80+.
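Concretely, the attach-target option then looks roughly like this (illustrative only, not the exact invocation in the test):

# module= filters which gpu.module gets the target; features=+ptx80 raises the PTX ISA for TMA.
--nvvm-attach-target="module=main_kernel chip=sm_90 features=+ptx80"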

There might be issues if the chip doesn't match the GPU the code is running on, i.e. chip=sm_80 but the GPU is sm_90.

The clang target is not yet supported upstream as you have it.

If the above workflow with mlir-cpu-runner is not working, could you send me the error?

I used to get an "interface is not implemented" error, but I cannot recall the details. I cannot reproduce it now; I guess I was using it incorrectly.

fmorac added a comment.Sep 4 2023, 6:33 AM

Adding module=main_kernel in --nvvm-attach-target= is not necessary; that option is just there to filter which modules the target gets added to.

I actually need it to set the PTX version. The default version for sm_90 is 7.8, which does not support the PTX instructions for TMA, so I set it to ptx80+.

What I was saying is that this is enough:

--nvvm-attach-target="features=+ptx80 chip=sm_90 O=3"

I used to get an "interface is not implemented" error, but I cannot recall the details. I cannot reproduce it now; I guess I was using it incorrectly.

Ok, I see. That was a missing registration call, but you shouldn't get it. If it ever pops up again, please let me know.

fmorac accepted this revision.Sep 4 2023, 9:09 AM

LGTM!

This revision is now accepted and ready to land.Sep 4 2023, 9:09 AM
This revision was automatically updated to reflect the committed changes.