This is an archive of the discontinued LLVM Phabricator instance.

Differential D134163

[MLIR][Linalg] introduce batch-reduce GEMM
Closed, Public

Authored by chelini on Sep 19 2022, 12:04 AM.


Details

Reviewers

nicolasvasilache

Commits

rG3718082e2b11: [MLIR][Linalg] introduce batch-reduce GEMM
rGf381768a8da6: [MLIR][Linalg] introduce batch-reduce GEMM

Summary

The batch-reduce GEMM kernel multiplies a sequence of pairs of input tensor
blocks (which together form a batch) and reduces the partial products into a
single output tensor block.

See: https://ieeexplore.ieee.org/document/9139809 for more details.
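The semantics described above can be sketched in plain Python. This is only a reference model under stated assumptions, not the code added by this revision: the function name and the nested-list tensor layout are hypothetical and chosen for illustration.

```python
def batch_reduce_matmul(a, b, c):
    """Accumulate the sum over the batch of a[i] @ b[i] into c, in place.

    a: list of `batch` matrices, each m x k (nested lists)
    b: list of `batch` matrices, each k x n
    c: a single m x n accumulator matrix
    """
    for a_block, b_block in zip(a, b):          # reduction over the batch
        for m, a_row in enumerate(a_block):     # parallel over rows
            for n in range(len(c[m])):          # parallel over columns
                for k, a_val in enumerate(a_row):   # reduction over k
                    c[m][n] += a_val * b_block[k][n]
    return c


a = [[[1, 2]], [[3, 4]]]        # batch=2, m=1, k=2
b = [[[1], [1]], [[2], [2]]]    # batch=2, k=2, n=1
result = batch_reduce_matmul(a, b, [[0]])   # (1*1 + 2*1) + (3*2 + 4*2) = 17
```

Unlike a batched matmul, which produces one output block per batch entry, every batch entry here accumulates into the same output block.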

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

chelini created this revision. Sep 19 2022, 12:04 AM

Herald added a project: Restricted Project. Sep 19 2022, 12:04 AM

Herald added subscribers: bzcheeseman, mravishankar, sdasgup3 and 19 others.

chelini requested review of this revision. Sep 19 2022, 12:04 AM

Herald added a reviewer: nicolasvasilache. Sep 19 2022, 12:04 AM

Herald added a project: Restricted Project.

Herald added subscribers: limo1996, stephenneuendorffer, nicolasvasilache.

Improve commit msg.

chelini edited the summary of this revision. Sep 19 2022, 12:09 AM

Harbormaster completed remote builds in B187445: Diff 461150. Sep 19 2022, 12:41 AM

This is great, thank you Lorenzo!

This revision is now accepted and ready to land. Sep 19 2022, 2:48 AM

Closed by commit rGf381768a8da6: [MLIR][Linalg] introduce batch-reduce GEMM (authored by chelini). Sep 19 2022, 3:12 AM

This revision was automatically updated to reflect the committed changes.

chelini added a commit: rGf381768a8da6: [MLIR][Linalg] introduce batch-reduce GEMM.

chelini added a reverting change: rGe9dd2b2d4b9c: Revert "[MLIR][Linalg] introduce batch-reduce GEMM". Sep 19 2022, 3:18 AM

chelini added a commit: rG3718082e2b11: [MLIR][Linalg] introduce batch-reduce GEMM. Sep 19 2022, 3:51 AM

Revision Contents

Path	Size
mlir/include/mlir/Dialect/Linalg/IR/LinalgNamedStructuredOps.yaml	70 lines
mlir/python/mlir/dialects/linalg/opdsl/ops/core_named_ops.py	14 lines
mlir/test/Dialect/Linalg/generalize-named-ops.mlir	24 lines
mlir/test/Dialect/Linalg/named-ops.mlir	20 lines

Diff 461149

mlir/include/mlir/Dialect/Linalg/IR/LinalgNamedStructuredOps.yaml

value: !ScalarExpression
kind: type		kind: type
fn_name: cast_signed		fn_name: cast_signed
type_var: U		type_var: U
operands:		operands:
- !ScalarExpression		- !ScalarExpression
scalar_arg: B		scalar_arg: B
--- !LinalgOpConfig		--- !LinalgOpConfig
metadata: !LinalgOpMetadata		metadata: !LinalgOpMetadata
		name: batch_reduce_matmul
		cpp_class_name: BatchReduceMatmulOp
		doc: \|-
		Performs a batch-reduce matrix multiplication of two 3D inputs.
		The partial multiplication results are reduced into a 2D output.

		Numeric casting is performed on the operands to the inner multiply, promoting
		them to the same data type as the accumulator/output.
		implements:
		- LinalgContractionOpInterface
		structured_op: !LinalgStructuredOpConfig
		args:
		- !LinalgOperandDefConfig
		name: A
		kind: input_tensor
		type_var: T1
		shape_map: affine_map<()[s0, s1, s2, s3] -> (s0, s1, s3)>
		- !LinalgOperandDefConfig
		name: B
		kind: input_tensor
		type_var: T2
		shape_map: affine_map<()[s0, s1, s2, s3] -> (s0, s3, s2)>
		- !LinalgOperandDefConfig
		name: C
		kind: output_tensor
		type_var: U
		shape_map: affine_map<()[s0, s1, s2, s3] -> (s1, s2)>
		indexing_maps: !LinalgIndexingMapsConfig
		static_indexing_maps:
		- affine_map<(d0, d1, d2, d3)[s0, s1, s2, s3] -> (d0, d1, d3)>
		- affine_map<(d0, d1, d2, d3)[s0, s1, s2, s3] -> (d0, d3, d2)>
		- affine_map<(d0, d1, d2, d3)[s0, s1, s2, s3] -> (d1, d2)>
		iterator_types:
		- reduction
		- parallel
		- parallel
		- reduction
		assignments:
		- !ScalarAssign
		arg: C
		value: !ScalarExpression
		scalar_fn:
		kind: binary
		fn_name: add
		operands:
		- !ScalarExpression
		scalar_arg: C
		- !ScalarExpression
		scalar_fn:
		kind: binary
		fn_name: mul
		operands:
		- !ScalarExpression
		scalar_fn:
		kind: type
		fn_name: cast_signed
		type_var: U
		operands:
		- !ScalarExpression
		scalar_arg: A
		- !ScalarExpression
		scalar_fn:
		kind: type
		fn_name: cast_signed
		type_var: U
		operands:
		- !ScalarExpression
		scalar_arg: B
		--- !LinalgOpConfig
		metadata: !LinalgOpMetadata
name: quantized_batch_matmul		name: quantized_batch_matmul
cpp_class_name: QuantizedBatchMatmulOp		cpp_class_name: QuantizedBatchMatmulOp
doc: \|-		doc: \|-
Performs a batched matrix multiplication of two 3D inputs.		Performs a batched matrix multiplication of two 3D inputs.

Numeric casting is performed on the operands to the inner multiply, promoting		Numeric casting is performed on the operands to the inner multiply, promoting
them to the same data type as the accumulator/output. The quantized variant		them to the same data type as the accumulator/output. The quantized variant
includes zero-point adjustments for the left and right operands of the		includes zero-point adjustments for the left and right operands of the
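To illustrate how the three static indexing maps and the iterator types above determine the computation, here is a small Python sketch that walks the four-dimensional iteration space (d0, d1, d2, d3) and applies each map to pick an element of A, B, and C. The names `MAP_A`, `MAP_B`, `MAP_C`, `load`, and `run_generic` are hypothetical, and the loop bounds are passed in explicitly rather than inferred from the operand shapes.

```python
from itertools import product

# Positions of the loop indices (d0, d1, d2, d3) used to address each operand.
MAP_A = (0, 1, 3)   # (d0, d1, d3)
MAP_B = (0, 3, 2)   # (d0, d3, d2)
MAP_C = (1, 2)      # (d1, d2)


def load(tensor, index):
    """Read one element from a nested-list tensor."""
    for i in index:
        tensor = tensor[i]
    return tensor


def run_generic(a, b, c, bounds):
    """Apply C[map_c(d)] += A[map_a(d)] * B[map_b(d)] over every point d."""
    for d in product(*(range(extent) for extent in bounds)):
        index_a = tuple(d[i] for i in MAP_A)
        index_b = tuple(d[i] for i in MAP_B)
        m, n = (d[i] for i in MAP_C)
        c[m][n] += load(a, index_a) * load(b, index_b)
    return c


a = [[[1, 2]], [[3, 4]]]        # batch=2, m=1, k=2
b = [[[1], [1]], [[2], [2]]]    # batch=2, k=2, n=1
out = run_generic(a, b, [[0]], bounds=(2, 1, 1, 2))   # out == [[17]]
```

The indices d0 and d3 do not appear in the output map, so every value they take contributes to the same output element; this is why they are the `reduction` iterators, while d1 and d2 are `parallel`.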

mlir/python/mlir/dialects/linalg/opdsl/ops/core_named_ops.py

def quantized_batch_matmul(A=TensorDef(T1, Batch, S.M, S.K),
includes zero-point adjustments for the left and right operands of the		includes zero-point adjustments for the left and right operands of the
matmul.		matmul.
"""		"""
domain(D.b, D.m, D.n, D.k)		domain(D.b, D.m, D.n, D.k)
C[D.b, D.m, D.n] += (TypeFn.cast_signed(U, A[D.b, D.m, D.k]) -		C[D.b, D.m, D.n] += (TypeFn.cast_signed(U, A[D.b, D.m, D.k]) -
TypeFn.cast_signed(U, AZp)) * (TypeFn.cast_signed(		TypeFn.cast_signed(U, AZp)) * (TypeFn.cast_signed(
U, B[D.b, D.k, D.n]) - TypeFn.cast_signed(U, BZp))		U, B[D.b, D.k, D.n]) - TypeFn.cast_signed(U, BZp))

		@linalg_structured_op
		def batch_reduce_matmul(A=TensorDef(T1, Batch, S.M, S.K),
		B=TensorDef(T2, Batch, S.K, S.N),
		C=TensorDef(U, S.M, S.N, output=True)):
		"""Performs a batch-reduce matrix multiplication of two 3D inputs.
		The partial multiplication results are reduced into a 2D output.

		Numeric casting is performed on the operands to the inner multiply, promoting
		them to the same data type as the accumulator/output.
		"""
		domain(D.b, D.m, D.n, D.k)
		implements(ContractionOpInterface)
		C[D.m, D.n] += TypeFn.cast_signed(U, A[D.b, D.m, D.k]) * TypeFn.cast_signed(
		U, B[D.b, D.k, D.n])

@linalg_structured_op		@linalg_structured_op
def matvec(A=TensorDef(T1, S.M, S.N),		def matvec(A=TensorDef(T1, S.M, S.N),
y=TensorDef(T2, S.N),		y=TensorDef(T2, S.N),
x=TensorDef(U, S.M, output=True)):		x=TensorDef(U, S.M, output=True)):
"""Performs a matrix-vector multiplication.		"""Performs a matrix-vector multiplication.

Numeric casting is performed on the operands to the inner multiply, promoting		Numeric casting is performed on the operands to the inner multiply, promoting
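The docstring for `batch_reduce_matmul` states that the operands of the inner multiply are promoted to the accumulator/output type. The following Python sketch models why the order of casting and multiplying matters for narrow integer inputs. It is a simplified model of signed two's-complement arithmetic only; the helper names and the 8-bit/32-bit widths are assumptions made for illustration and are not part of the OpDSL.

```python
def wrap_signed(value, bits):
    """Wrap an integer to a signed two's-complement value of the given width."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value


def mul_then_cast(a, b, in_bits=8, acc_bits=32):
    # Multiply in the narrow input type (the product wraps), then widen.
    return wrap_signed(wrap_signed(a * b, in_bits), acc_bits)


def cast_then_mul(a, b, acc_bits=32):
    # Widen each operand to the accumulator type first, then multiply.
    return wrap_signed(wrap_signed(a, acc_bits) * wrap_signed(b, acc_bits), acc_bits)


narrow = mul_then_cast(100, 100)   # 10000 does not fit in 8 bits and wraps to 16
wide = cast_then_mul(100, 100)     # 10000 is preserved in 32 bits
```

Casting each operand before the multiply, as the docstring describes, keeps the full product available to the accumulation.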

mlir/test/Dialect/Linalg/generalize-named-ops.mlir

	// CHECK-SAME: ins(%{{.+}}, %{{.+}} : memref<?x?x?xi8>, memref<?x?xi8>)			// CHECK-SAME: ins(%{{.+}}, %{{.+}} : memref<?x?x?xi8>, memref<?x?xi8>)
	// CHECK-SAME: outs(%{{.+}} : memref<?x?xf32>)			// CHECK-SAME: outs(%{{.+}} : memref<?x?xf32>)
	// CHECK: ^{{.+}}(%[[BBARG0:.+]]: i8, %[[BBARG1:.+]]: i8, %[[BBARG2:.+]]: f32)			// CHECK: ^{{.+}}(%[[BBARG0:.+]]: i8, %[[BBARG1:.+]]: i8, %[[BBARG2:.+]]: f32)
	// CHECK: %[[BBARG0_F32:.+]] = arith.sitofp %[[BBARG0]] : i8 to f32			// CHECK: %[[BBARG0_F32:.+]] = arith.sitofp %[[BBARG0]] : i8 to f32
	// CHECK: %[[BBARG1_F32:.+]] = arith.sitofp %[[BBARG1]] : i8 to f32			// CHECK: %[[BBARG1_F32:.+]] = arith.sitofp %[[BBARG1]] : i8 to f32
	// CHECK: %[[MUL:.+]] = arith.mulf %[[BBARG0_F32]], %[[BBARG1_F32]]			// CHECK: %[[MUL:.+]] = arith.mulf %[[BBARG0_F32]], %[[BBARG1_F32]]
	// CHECK: %[[ADD:.+]] = arith.addf %[[BBARG2]], %[[MUL]]			// CHECK: %[[ADD:.+]] = arith.addf %[[BBARG2]], %[[MUL]]
	// CHECK: linalg.yield %[[ADD]] : f32			// CHECK: linalg.yield %[[ADD]] : f32

				// -----

				func.func @batch_reduce_gemm(%lhs: memref<7x8x9xf32>, %rhs: memref<7x9x8xf32>, %out: memref<8x8xf32>) {
				linalg.batch_reduce_matmul ins(%lhs, %rhs: memref<7x8x9xf32>, memref<7x9x8xf32>)
				outs(%out: memref<8x8xf32>)
				return
				}

				// CHECK-DAG: #[[MAP0:.+]] = affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>
				// CHECK-DAG: #[[MAP1:.+]] = affine_map<(d0, d1, d2, d3) -> (d0, d3, d2)>
				// CHECK-DAG: #[[MAP2:.+]] = affine_map<(d0, d1, d2, d3) -> (d1, d2)>

				// CHECK: @batch_reduce_gemm

				// CHECK: linalg.generic
				// CHECK-SAME: indexing_maps = [#[[MAP0]], #[[MAP1]], #[[MAP2]]]
				// CHECK-SAME: iterator_types = ["reduction", "parallel", "parallel", "reduction"]}
				// CHECK-SAME: ins(%{{.+}}, %{{.+}} : memref<7x8x9xf32>, memref<7x9x8xf32>)
				// CHECK-SAME: outs(%{{.+}} : memref<8x8xf32>)
				// CHECK: ^{{.+}}(%[[BBARG0:.+]]: f32, %[[BBARG1:.+]]: f32, %[[BBARG2:.+]]: f32)
				// CHECK: %[[MUL:.+]] = arith.mulf %[[BBARG0]], %[[BBARG1]] : f32
				// CHECK: %[[ADD:.+]] = arith.addf %[[BBARG2]], %[[MUL]] : f32
				// CHECK: linalg.yield %[[ADD]] : f32

mlir/test/Dialect/Linalg/named-ops.mlir

func.func @conv_interface_wrong_num_operands(
%0 = "linalg.conv_2d_nhwc_hwcf"(%arg0, %arg1, %arg2) ({		%0 = "linalg.conv_2d_nhwc_hwcf"(%arg0, %arg1, %arg2) ({
^bb0(%arg3: f32, %arg4: f32, %arg5 : f32):		^bb0(%arg3: f32, %arg4: f32, %arg5 : f32):
%1 = "arith.mulf"(%arg3, %arg4) : (f32, f32) -> f32		%1 = "arith.mulf"(%arg3, %arg4) : (f32, f32) -> f32
%2 = "arith.addf"(%arg5, %1) : (f32, f32) -> f32		%2 = "arith.addf"(%arg5, %1) : (f32, f32) -> f32
"linalg.yield"(%2) : (f32) -> ()		"linalg.yield"(%2) : (f32) -> ()
}) {dilations = dense<1> : tensor<2xi64>, linalg.memoized_indexing_maps = [#map0, #map1, #map2], operand_segment_sizes = array<i32: 2, 1>, strides = dense<1> : tensor<2xi64>} : (tensor<?x?x?x?xf32>, tensor<?x?x?x?x?xf32>, tensor<?x?x?x?xf32>) -> tensor<?x?x?x?xf32>		}) {dilations = dense<1> : tensor<2xi64>, linalg.memoized_indexing_maps = [#map0, #map1, #map2], operand_segment_sizes = array<i32: 2, 1>, strides = dense<1> : tensor<2xi64>} : (tensor<?x?x?x?xf32>, tensor<?x?x?x?x?xf32>, tensor<?x?x?x?xf32>) -> tensor<?x?x?x?xf32>
return %0 : tensor<?x?x?x?xf32>		return %0 : tensor<?x?x?x?xf32>
}		}

		// -----

		func.func @batch_reduce_matmul(%arg0: tensor<8x128x256xf32>, %arg1: tensor<8x256x512xf32>, %arg2: tensor<128x512xf32>) -> tensor<128x512xf32> {
		// CHECK: %{{.+}} = linalg.batch_reduce_matmul
		// CHECK-SAME: ins(%{{.+}}, %{{.+}} : tensor<8x128x256xf32>, tensor<8x256x512xf32>)
		// CHECK-SAME: outs(%{{.+}} : tensor<128x512xf32>) -> tensor<128x512xf32>
		%0 = linalg.batch_reduce_matmul ins(%arg0, %arg1 : tensor<8x128x256xf32>, tensor<8x256x512xf32>) outs(%arg2: tensor<128x512xf32>) -> tensor<128x512xf32>
		return %0: tensor<128x512xf32>
		}

		// -----

		func.func @batch_reduce_matmul(%arg0: memref<?x?x?xf32>, %arg1: memref<?x?x?xf32>, %arg2: memref<?x?xf32>) {
		// CHECK: linalg.batch_reduce_matmul
		// CHECK-SAME: ins(%{{.+}}, %{{.+}} : memref<?x?x?xf32>, memref<?x?x?xf32>)
		// CHECK-SAME: outs(%{{.+}} : memref<?x?xf32>)
		linalg.batch_reduce_matmul ins(%arg0, %arg1 : memref<?x?x?xf32>, memref<?x?x?xf32>) outs(%arg2: memref<?x?xf32>)
		return
		}
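The operand shapes used in these tests follow one rule: the inputs share the batch and K extents, and the output drops the batch dimension. A minimal Python sketch of that rule is given below, assuming static shapes given as tuples; the function name and error messages are hypothetical and do not correspond to any verifier in MLIR.

```python
def batch_reduce_matmul_result_shape(lhs_shape, rhs_shape):
    """Return the (M, N) output shape for (batch, M, K) x (batch, K, N) inputs."""
    batch, m, k = lhs_shape
    rhs_batch, rhs_k, n = rhs_shape
    if batch != rhs_batch:
        raise ValueError(f"batch mismatch: {batch} vs {rhs_batch}")
    if k != rhs_k:
        raise ValueError(f"reduction dimension mismatch: {k} vs {rhs_k}")
    return (m, n)


# Shapes taken from the tests in this revision.
tensor_case = batch_reduce_matmul_result_shape((8, 128, 256), (8, 256, 512))
memref_case = batch_reduce_matmul_result_shape((7, 8, 9), (7, 9, 8))
```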
