Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sparse-matmul-2-4-lib.mlir

| Line | Comment |
|---|---|
| 18 | This seems a bit copy-and-paste from the sparse-mma-2-4-f16.mlir test (which really uses device code for this method by means of e.g. nvgpu.mma.sp.sync). Here, however, the library calls are still made from the host. So I would remove the whole device/host comments here (at L17 and at L62). Also, the gpu.container_module is not needed, since no method is defined as a gpu.module. |
| 35 | Commented out code? |
| 208 | Avoid commented out code. |
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sparse-matmul-2-4-lib.mlir

| Line | Comment |
|---|---|
| 18 | Thanks for all these comments! They are all addressed now. |
mlir/test/Integration/Dialect/SparseTensor/GPU/CUDA/sparse-matmul-2-4-lib.mlir

| Line | Comment |
|---|---|
| 6 | It looks like this pipeline can be simplified quite a bit; all the gpu.module(....) passes can go, right? |
| 16 | Remove gpu.container_module. |
| 26 | Add a comment to the magic constant here. |
| 43 | Does it work without? In any case, make the TODO jump out a bit more. |
| 120 | Copy-and-paste comment: this is no longer the compressed matrix, but the full 2:4 matrix A. |
| 146 | Empty // line after this comment to separate it from the CHECK. |
| 198 | There are no warps in this code, so simply say "Call the kernel". |
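The comment at L120 distinguishes the full 2:4 matrix A from its compressed form. As a rough illustration (not the actual cuSPARSELt metadata encoding, and `compress_2_to_4` is a hypothetical helper), in 2:4 structured sparsity every group of four consecutive elements along a row holds at most two nonzeros; the compressed form keeps only those two values per group, plus metadata recording their positions within the group:

```python
import numpy as np

def compress_2_to_4(a):
    """Compress a row-major 2:4 structured-sparse matrix.

    Returns (values, meta): `values` keeps 2 entries per group of 4,
    `meta` records each kept entry's position (0..3) within its group.
    Illustrative sketch only; real libraries pack `meta` into bitfields.
    """
    rows, cols = a.shape
    assert cols % 4 == 0, "column count must be a multiple of 4"
    values = np.zeros((rows, cols // 2), dtype=a.dtype)
    meta = np.zeros((rows, cols // 2), dtype=np.int8)
    for i in range(rows):
        for g in range(cols // 4):
            group = a[i, 4 * g : 4 * g + 4]
            nz = list(np.flatnonzero(group))
            assert len(nz) <= 2, "matrix violates 2:4 sparsity"
            while len(nz) < 2:  # pad with an unused slot (stored value is 0)
                nz.append(next(p for p in range(4) if p not in nz))
            nz.sort()
            for k, p in enumerate(nz):
                values[i, 2 * g + k] = group[p]
                meta[i, 2 * g + k] = p
    return values, meta

# A 1x8 row with 2 nonzeros in each group of 4:
a = np.array([[1, 0, 0, 2, 0, 3, 4, 0]], dtype=np.float32)
values, meta = compress_2_to_4(a)
# values -> [[1, 2, 3, 4]], meta -> [[0, 3, 1, 2]]
```

This makes the reviewer's point concrete: the test at L120 builds the full matrix `a`, not the half-width `values` array that the compressed representation would hold.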
It looks like this pipeline can be simplified quite a bit; all the gpu.module(....) passes can go, right?
Also, the vector-to-llvm conversion and probably more. Perhaps you can actually get rid of the first mlir-opt call and just start at L7 (a bit hard to tell just by looking, but run it by hand and see how far you can strip it).