This is an archive of the discontinued LLVM Phabricator instance.

The intrinsic returns i32 values. So I thought that as long as the data type is <= 32 bits and you are reading 8 rows of 128 bits (8xf16, 4xf32, 16xi8, etc.), there is no problem.

In your test you are reading an 8x8xf32 B operand. So I was under the impression that such an operand could be loaded with two ldmatrix calls, which load two 8x128-bit tiles. The distributed values (one per tile / thread) would be returned as two i32 values.

Replying to @christopherbate's comment D126846#3552445 (above):

I thought that loading 32-bit elements with transpose was wrong, because each 32-bit value would be read as 2xf16 and its two halves would end up on different rows once transposed. Is that not the case? There are some miscompiles when using ldmatrix in this case, but they could be due to a different reason. Do you expect the transpose version of the op to work for 32-bit elements?

Replying to @ThomasRaoux's comment D126846#3552462 (above):

You're right, I got mixed up with the transpose vs non-transpose versions. Transpose definitely needs the fp16 constraint.

Thanks for finding this, LGTM!
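The concern raised in this thread can be illustrated with a small simulation. The sketch below is a simplified model, not code from the patch: it assumes the 8x8 tile is addressed in 16-bit slots and that thread t receives slots (t / 4, 2 * (t % 4)) and (t / 4, 2 * (t % 4) + 1) of the (possibly transposed) tile. The names `Half`, `Tile`, `makeTileOf32BitElements`, and `registersHoldWholeElements` are hypothetical and exist only for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// One 16-bit slot of the tile, tagged with the 32-bit element it came from.
struct Half {
  int element; // index of the 32-bit element this slot belongs to
  int part;    // 0 = low half, 1 = high half
};

// Simplified model: an 8x8 tile of 16-bit slots (8 rows of 128 bits).
using Tile = std::array<std::array<Half, 8>, 8>;

// Fill the tile as if every 128-bit row held four 32-bit elements.
Tile makeTileOf32BitElements() {
  Tile tile{};
  for (int row = 0; row < 8; ++row)
    for (int col = 0; col < 8; ++col)
      tile[row][col] = Half{row * 4 + col / 2, col % 2};
  return tile;
}

// Returns true if every thread's 32-bit register holds the low and the high
// half of the same 32-bit element, in that order.
bool registersHoldWholeElements(const Tile &tile, bool transpose) {
  for (int thread = 0; thread < 32; ++thread) {
    int row = thread / 4;
    int col = 2 * (thread % 4);
    // Assumed layout: each thread gets two adjacent 16-bit slots of one row
    // of the tile; with transpose, rows and columns are swapped first.
    Half lo = transpose ? tile[col][row] : tile[row][col];
    Half hi = transpose ? tile[col + 1][row] : tile[row][col + 1];
    if (lo.element != hi.element || lo.part != 0 || hi.part != 1)
      return false;
  }
  return true;
}
```

Under these assumptions, `registersHoldWholeElements(makeTileOf32BitElements(), false)` returns true, while the same call with `transpose = true` returns false: each register then pairs 16-bit halves taken from two different source rows, which matches the reasoning for restricting the transposed case to 16-bit elements.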

This revision is now accepted and ready to land. Jun 2 2022, 1:31 PM

Closed by commit rG271a48e02917: [mlir][VectorToGPU] Fix bug generating incorrect ldmatrix ops (authored by ThomasRaoux). Jun 2 2022, 9:30 PM

This revision was automatically updated to reflect the committed changes.

ThomasRaoux added a commit: rG271a48e02917: [mlir][VectorToGPU] Fix bug generating incorrect ldmatrix ops.

Revision Contents

Path                                                                Size
mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp                     3 lines
mlir/test/Conversion/VectorToGPU/vector-to-mma-ops-mma-sync.mlir    60 lines

Diff 433947

mlir/lib/Conversion/VectorToGPU/VectorToGPU.cpp

convertTransferReadToLoads(vector::TransferReadOp op,

   VectorType vecTy = op.getVectorType();
   int64_t bitWidth = vecTy.getElementType().getIntOrFloatBitWidth();

   // When we are transposing the B operand, ldmatrix will only work if we have
   // at least 8 rows to read and the width to read for the transpose is 128
   // bits.
   if (!op.getPermutationMap().isMinorIdentity() &&
-      (vecTy.getDimSize(1) < 8 || vecTy.getDimSize(0) * bitWidth < 128))
+      (bitWidth != 16 || vecTy.getDimSize(1) < 8 ||
+       vecTy.getDimSize(0) * bitWidth < 128))
     isLdMatrixCompatible = false;

   if (!isLdMatrixCompatible)
     return createNonLdMatrixLoads(op, b, valueMapping);

   return creatLdMatrixCompatibleLoads(op, b, valueMapping);
 }
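To see what the changed condition does for concrete shapes, the check can be restated as a standalone predicate. This is a minimal sketch, not the code in VectorToGPU.cpp: the function names are hypothetical, `dim0` and `dim1` stand in for `vecTy.getDimSize(0)` and `vecTy.getDimSize(1)`, and it only models the transposed-read part of the decision (whatever else sets `isLdMatrixCompatible` is outside the lines shown in this diff).

```cpp
#include <cassert>
#include <cstdint>

// Check as it stood before this revision (transposed reads only).
bool transposedReadWasLdMatrixCompatible(int64_t bitWidth, int64_t dim0,
                                         int64_t dim1) {
  return !(dim1 < 8 || dim0 * bitWidth < 128);
}

// Check after this revision: the element type must also be 16 bits wide.
bool transposedReadIsLdMatrixCompatible(int64_t bitWidth, int64_t dim0,
                                        int64_t dim1) {
  return !(bitWidth != 16 || dim1 < 8 || dim0 * bitWidth < 128);
}
```

For the transposed `vector<8x8xf32>` B operand in the new test (bitWidth 32, both dimensions 8), the old predicate returns true and the new one returns false, so the read falls back to the non-ldmatrix path. A transposed 8x8 read of 16-bit elements still passes both versions.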

mlir/test/Conversion/VectorToGPU/vector-to-mma-ops-mma-sync.mlir

func.func @m16n8k4_tf32_f32_row_row_row(%arg0: memref<20x20xf32, 3>, %arg1: memref<20x20xf32, 3>, %arg2: memref<20x20xf32>) {

   // CHECK: vector.store
   // CHECK: vector.extract [[d_frag]][1] : vector<2x2xf32>
   // CHECK: affine.apply [[$rowC8_map]]
   // CHECK: affine.apply [[$colC_map]]
   // CHECK: vector.store
   vector.transfer_write %D, %arg2[%c0, %c0] {in_bounds = [true, true]} : vector<16x8xf32>, memref<20x20xf32>
   return
 }
+
+// -----
+
+#map0 = affine_map<(d0, d1) -> (d1, d0)>
+#map1 = affine_map<(d0, d1, d2) -> (d0, d2)>
+#map2 = affine_map<(d0, d1, d2) -> (d1, d2)>
+#map3 = affine_map<(d0, d1, d2) -> (d0, d1)>
+
+// CHECK-DAG: [[$rowA_map:#.+]] = affine_map<()[s0] -> (s0 mod 16 + 1)>
+// CHECK-DAG: [[$colA_map:#.+]] = affine_map<()[s0] -> ((s0 floordiv 16) * 4 + 3)>
+
+// CHECK-DAG: [[$rowB_map:#.+]] = affine_map<()[s0] -> (s0 mod 4 + 3)>
+// CHECK-DAG: [[$colB_map:#.+]] = affine_map<()[s0] -> (s0 floordiv 4 + 3)>
+
+// CHECK-DAG: [[$rowC_map:#.+]] = affine_map<()[s0] -> (s0 floordiv 4)>
+// CHECK-DAG: [[$rowC8_map:#.+]] = affine_map<()[s0] -> (s0 floordiv 4 + 8)>
+// CHECK-DAG: [[$colC_map:#.+]] = affine_map<()[s0] -> (s0 * 2 - (s0 floordiv 4) * 8)>
+
+// CHECK-LABEL: func @m16n8k8_tf32_f32_row_row_row
+func.func @m16n8k8_tf32_f32_row_row_row(%arg0: memref<20x20xf32, 3>, %arg1: memref<20x20xf32, 3>, %arg2: memref<20x20xf32>) {
+  %cst_0 = arith.constant dense<0.000000e+00> : vector<16x8xf32>
+  %c0 = arith.constant 0 : index
+  %c1 = arith.constant 1 : index
+  %c3 = arith.constant 3 : index
+  %cst = arith.constant 0.000000e+00 : f32
+
+  // CHECK: [[c_frag:%.+]] = arith.constant {{.*}} : vector<2x2xf32>
+
+  // CHECK-DAG: [[row:%.+]] = affine.apply [[$rowA_map]]
+  // CHECK-DAG: [[col:%.+]] = affine.apply [[$colA_map]]
+  // CHECK: [[a_frag:%.+]] = nvgpu.ldmatrix %arg0[[[row]], [[col]]] {numTiles = 4 : i32, transpose = false}
+
+  // b and c are not loaded by ldmatrix in this test.
+  // CHECK-NOT: nvgpu.ldmatrix
+
+  // CHECK-DAG: [[row:%.+]] = affine.apply [[$rowB_map]]
+  // CHECK-DAG: [[col:%.+]] = affine.apply [[$colB_map]]
+  // CHECK: [[b_el0:%.+]] = memref.load {{%.+}} : memref<20x20xf32, 3>
+  // CHECK: [[b_frag0:%.+]] = vector.insert [[b_el0]], {{.*}} : f32 into vector<2x1xf32>
+  // CHECK: [[b_el1:%.+]] = memref.load {{%.+}} : memref<20x20xf32, 3>
+  // CHECK: [[b_frag1:%.+]] = vector.insert [[b_el1]], {{.*}} : f32 into vector<2x1xf32>
+
+  // CHECK: [[d_frag:%.+]] = nvgpu.mma.sync([[a_frag]], [[b_frag1]], [[c_frag]])
+  // CHECK-SAME: mmaShape = [16, 8, 8]
+  // CHECK-SAME: -> vector<2x2xf32>
+  %A = vector.transfer_read %arg0[%c1, %c3], %cst {in_bounds = [true, true]} : memref<20x20xf32, 3>, vector<16x8xf32>
+  %B = vector.transfer_read %arg1[%c3, %c3], %cst {permutation_map = #map0, in_bounds = [true, true]} : memref<20x20xf32, 3>, vector<8x8xf32>
+  %D = vector.contract {indexing_maps = [#map1, #map2, #map3], iterator_types = ["parallel", "parallel", "reduction"], kind = #vector.kind<add>} %A, %B, %cst_0 : vector<16x8xf32>, vector<8x8xf32> into vector<16x8xf32>
+
+  // CHECK: vector.extract [[d_frag]][0] : vector<2x2xf32>
+  // CHECK: affine.apply [[$rowC_map]]
+  // CHECK: affine.apply [[$colC_map]]
+  // CHECK: vector.store
+  // CHECK: vector.extract [[d_frag]][1] : vector<2x2xf32>
+  // CHECK: affine.apply [[$rowC8_map]]
+  // CHECK: affine.apply [[$colC_map]]
+  // CHECK: vector.store
+  vector.transfer_write %D, %arg2[%c0, %c0] {in_bounds = [true, true]} : vector<16x8xf32>, memref<20x20xf32>
+  return
+}
