This is an archive of the discontinued LLVM Phabricator instance.

[mlir] Fix gpu MMA integrations tests
Closed, Public

Authored by ThomasRaoux on May 25 2021, 9:34 AM.

Details

Reviewers

navdeepkk
bondhugula
mehdi_amini
mravishankar
herhut

Commits

rG750799b7bc3f: [mlir][NFC] Don't outline kernel in MMA integration tests

Summary

Don't outline the kernel in the test file, as this prevents some debug info from being stripped out. The CUDA driver doesn't support PTX with debug info, which causes the conversion to cubin to fail.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

ThomasRaoux created this revision. May 25 2021, 9:34 AM

Herald added a reviewer: mravishankar. May 25 2021, 9:34 AM

Herald added subscribers: dcaballe, cota, teijeong and 16 others.

ThomasRaoux requested review of this revision. May 25 2021, 9:34 AM

Herald added a reviewer: herhut. May 25 2021, 9:34 AM

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

Harbormaster completed remote builds in B106116: Diff 347704. May 25 2021, 9:34 AM

The tests started failing with https://github.com/llvm/llvm-project/commit/81467f500f6ad106a69088bc276024c5e1938571. Once this is fixed, I'll also enable these tests on the Google build bots that have Tesla T4 GPUs.

Thanks! This is something I wasn't aware of. BTW, I tested these on a Turing GPU with CUDA 10.2 and they passed, but maybe they fail on some other devices.

mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f16.mlir
Line 46: I think this has been present since this file was added but is no longer required. Can you please drop it? Or should I remove it in a subsequent patch?

This revision is now accepted and ready to land. May 27 2021, 9:12 AM

In D103099#2785041, @navdeepkk wrote:

Thanks! This is something I wasn't aware of. BTW, I tested these on a Turing GPU with CUDA 10.2 and they passed, but maybe they fail on some other devices.

Yes, I ended up doing a more generic fix, as this was causing more general problems: https://reviews.llvm.org/D103187

This still feels like a small improvement so I'll move forward with this patch unless you have any concerns.

Yes. We can go ahead with this patch.

ThomasRaoux updated this revision to Diff 348315. May 27 2021, 9:42 AM

ThomasRaoux marked an inline comment as done.

This revision was landed with ongoing or failed builds. May 27 2021, 9:45 AM

Closed by commit rG750799b7bc3f: [mlir][NFC] Don't outline kernel in MMA integration tests (authored by ThomasRaoux).

This revision was automatically updated to reflect the committed changes.

ThomasRaoux added a commit: rG750799b7bc3f: [mlir][NFC] Don't outline kernel in MMA integration tests.

Harbormaster completed remote builds in B106546: Diff 348315. May 27 2021, 10:20 AM

Revision Contents

Path	Size
mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f16.mlir	128 lines
mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f32.mlir	119 lines

Diff 348317

mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f16.mlir

 // RUN: mlir-opt %s \
 // RUN: -gpu-kernel-outlining \
 // RUN: -pass-pipeline='gpu.module(strip-debuginfo,convert-gpu-to-nvvm{index-bitwidth=32},gpu-to-cubin{chip=sm_70})' \
 // RUN: --convert-scf-to-std -gpu-to-llvm \
 // RUN: | mlir-cpu-runner \
 // RUN: --shared-libs=%linalg_test_lib_dir/libmlir_cuda_runtime%shlibext \
 // RUN: --shared-libs=%linalg_test_lib_dir/libmlir_runner_utils%shlibext \
 // RUN: --entry-point-result=void \
 // RUN: | FileCheck %s
 // Test case to check the working of Tensor cores on Nvidia GPUs. The kernel has already
 // been outlined to prevent crashing due to introduction of an empty basic block by --gpu-
 // kernel-outling.
-module attributes {gpu.container_module} {
 func @main() {
 %0 = memref.alloc() : memref<16x16xf16>
 %22 = memref.alloc() : memref<16x16xf16>
 %1 = memref.alloc() : memref<16x16xf32>

 %f1 = constant 1.0e+00 : f16
 %f0 = constant 0.0e+00 : f16
 %c0 = constant 0 : index
 %c16 = constant 16 : index
 %c32 = constant 32 : index
 %c1 = constant 1 : index

 // Intialize the Input matrix with ones.
 scf.for %arg0 = %c0 to %c16 step %c1 {
 scf.for %arg1 = %c0 to %c16 step %c1 {
 memref.store %f1, %0[%arg0, %arg1] : memref<16x16xf16>
 }
 }
 // Intialize the accumulator matrix with zeros.
 scf.for %arg0 = %c0 to %c16 step %c1 {
 scf.for %arg1 = %c0 to %c16 step %c1 {
 memref.store %f0, %22[%arg0, %arg1] : memref<16x16xf16>
 }
 }

 %2 = memref.cast %0 : memref<16x16xf16> to memref<*xf16>
 %33 = memref.cast %22 : memref<16x16xf16> to memref<*xf16>
 %3 = memref.cast %1 : memref<16x16xf32> to memref<*xf32>
 gpu.host_register %2 : memref<*xf16>
 gpu.host_register %33 : memref<*xf16>

-gpu.launch_func @main_kernel::@main_kernel blocks in (%c1, %c1, %c1) threads in (%c32, %c1, %c1) args(%0 : memref<16x16xf16>, %22 : memref<16x16xf16>)
+gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
+threads(%tx, %ty, %tz) in (%block_x = %c32, %block_y = %c1, %block_z = %c1) {
+%A = gpu.subgroup_mma_load_matrix %0[%c0, %c0] {leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
+%B = gpu.subgroup_mma_load_matrix %0[%c0, %c0] {leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
+%C = gpu.subgroup_mma_load_matrix %22[%c0, %c0] {leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "COp">
+
+%R = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf16, "COp">
+
+gpu.subgroup_mma_store_matrix %R, %0[%c0, %c0] {leadDimension = 16 : index}: !gpu.mma_matrix<16x16xf16, "COp">, memref<16x16xf16>
+gpu.terminator
+}

 // Convert the results from f16 to f32 for printing.
 scf.for %arg0 = %c0 to %c16 step %c1 {
 scf.for %arg1 = %c0 to %c16 step %c1 {
 %6 = memref.load %0[%arg0, %arg1] : memref<16x16xf16>
 %7 = fpext %6 : f16 to f32
 memref.store %7, %1[%arg0, %arg1] : memref<16x16xf32>
 }
 }

 // Print the memref after computation.
 call @print_memref_f32(%3) : (memref<*xf32>) -> ()
 // CHECK: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16]
 return
 }

-gpu.module @main_kernel {
-gpu.func @main_kernel(%arg0: memref<16x16xf16>, %arg22 : memref<16x16xf16>) kernel {
-%c0 = constant 0 : index
-
-%0 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {operand = "AOp", leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
-%1 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {operand = "BOp", leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
-%2 = gpu.subgroup_mma_load_matrix %arg22[%c0, %c0] {operand = "COp", leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "COp">
-
-%3 = gpu.subgroup_mma_compute %0, %1, %2 : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf16, "COp">
-
-gpu.subgroup_mma_store_matrix %3, %arg0[%c0, %c0] {leadDimension = 16 : index}: !gpu.mma_matrix<16x16xf16, "COp">, memref<16x16xf16>
-
-gpu.return
-}
-}

 func private @print_memref_f32(memref<*xf32>)
-}

mlir/test/Integration/GPU/CUDA/TensorCore/wmma-matmul-f32.mlir

 // RUN: mlir-opt %s \
 // RUN: -gpu-kernel-outlining \
 // RUN: -pass-pipeline='gpu.module(strip-debuginfo,convert-gpu-to-nvvm{index-bitwidth=32},gpu-to-cubin{chip=sm_70})' \
 // RUN: --convert-scf-to-std -gpu-to-llvm \
 // RUN: | mlir-cpu-runner \
 // RUN: --shared-libs=%linalg_test_lib_dir/libmlir_cuda_runtime%shlibext \
 // RUN: --shared-libs=%linalg_test_lib_dir/libmlir_runner_utils%shlibext \
 // RUN: --entry-point-result=void \
 // RUN: | FileCheck %s
-// Test case to check the working of Tensor cores on Nvidia GPUs. The kernel has already
-// been outlined to prevent crashing due to introduction of an empty basic block by --gpu-
-// kernel-outling.
-module attributes {gpu.container_module} {
 func @main() {
 %0 = memref.alloc() : memref<16x16xf16>
 %22 = memref.alloc() : memref<16x16xf32>
 %1 = memref.alloc() : memref<16x16xf32>

 %f1 = constant 1.0e+00 : f16
 %f0 = constant 0.0e+00 : f32
 %c0 = constant 0 : index
 %c16 = constant 16 : index
 %c32 = constant 32 : index
 %c1 = constant 1 : index

 // Intialize the Input matrix with ones.
 scf.for %arg0 = %c0 to %c16 step %c1 {
 scf.for %arg1 = %c0 to %c16 step %c1 {
 memref.store %f1, %0[%arg0, %arg1] : memref<16x16xf16>
 }
 }
 // Intialize the accumulator matrix with zeros.
 scf.for %arg0 = %c0 to %c16 step %c1 {
 scf.for %arg1 = %c0 to %c16 step %c1 {
 memref.store %f0, %22[%arg0, %arg1] : memref<16x16xf32>
 }
 }

 %2 = memref.cast %0 : memref<16x16xf16> to memref<*xf16>
 %33 = memref.cast %22 : memref<16x16xf32> to memref<*xf32>
 %3 = memref.cast %1 : memref<16x16xf32> to memref<*xf32>
 gpu.host_register %2 : memref<*xf16>
 gpu.host_register %33 : memref<*xf32>

-gpu.launch_func @main_kernel::@main_kernel blocks in (%c1, %c1, %c1) threads in (%c32, %c1, %c1) args(%0 : memref<16x16xf16>, %22 : memref<16x16xf32>)
+gpu.launch blocks(%bx, %by, %bz) in (%grid_x = %c1, %grid_y = %c1, %grid_z = %c1)
+threads(%tx, %ty, %tz) in (%block_x = %c32, %block_y = %c1, %block_z = %c1) {
+%A = gpu.subgroup_mma_load_matrix %0[%c0, %c0] {leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
+%B = gpu.subgroup_mma_load_matrix %0[%c0, %c0] {leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
+%C = gpu.subgroup_mma_load_matrix %22[%c0, %c0] {leadDimension = 16 : index} : memref<16x16xf32> -> !gpu.mma_matrix<16x16xf32, "COp">
+
+%R = gpu.subgroup_mma_compute %A, %B, %C : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf32, "COp">
+
+gpu.subgroup_mma_store_matrix %R, %22[%c0, %c0] {leadDimension = 16 : index}: !gpu.mma_matrix<16x16xf32, "COp">, memref<16x16xf32>
+gpu.terminator
+}
 // Print the memref after computation.
 call @print_memref_f32(%33) : (memref<*xf32>) -> ()
 // CHECK: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16],
 // CHECK-NEXT: [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16]
 return
 }

-gpu.module @main_kernel {
-gpu.func @main_kernel(%arg0: memref<16x16xf16>, %arg22 : memref<16x16xf32>) kernel {
-%c0 = constant 0 : index
-
-%0 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {operand = "AOp", leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "AOp">
-%1 = gpu.subgroup_mma_load_matrix %arg0[%c0, %c0] {operand = "BOp", leadDimension = 16 : index} : memref<16x16xf16> -> !gpu.mma_matrix<16x16xf16, "BOp">
-%2 = gpu.subgroup_mma_load_matrix %arg22[%c0, %c0] {operand = "COp", leadDimension = 16 : index} : memref<16x16xf32> -> !gpu.mma_matrix<16x16xf32, "COp">
-
-%3 = gpu.subgroup_mma_compute %0, %1, %2 : !gpu.mma_matrix<16x16xf16, "AOp">, !gpu.mma_matrix<16x16xf16, "BOp"> -> !gpu.mma_matrix<16x16xf32, "COp">
-
-gpu.subgroup_mma_store_matrix %3, %arg22[%c0, %c0] {leadDimension = 16 : index}: !gpu.mma_matrix<16x16xf32, "COp">, memref<16x16xf32>
-
-gpu.return
-}
-}

 func private @print_memref_f32(memref<*xf32>)
-}
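
As a sanity check on the CHECK lines in both tests, here is a minimal host-side sketch in plain Python; it is not part of the patch. It assumes the MMA compute op yields A*B + C, and it mirrors the test inputs: A and B are both loaded from the all-ones 16x16 buffer, and C from the all-zeros accumulator. The helper name `mma_reference` is hypothetical and exists only for this illustration.

```python
def mma_reference(a, b, c):
    """Return a*b + c for square matrices given as lists of lists.

    Hypothetical reference helper, not an MLIR or CUDA API.
    """
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) + c[i][j] for j in range(n)]
        for i in range(n)
    ]


N = 16
ones = [[1.0] * N for _ in range(N)]   # input matrix, used as both A and B
zeros = [[0.0] * N for _ in range(N)]  # accumulator matrix C

result = mma_reference(ones, ones, zeros)

# Each element is a sum of 16 products of 1 * 1 plus 0.
for row in result:
    print([int(v) for v in row])
```

Under these assumptions every entry is 16, which matches the 16 rows of sixteen 16s that the tests check for.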