This is an archive of the discontinued LLVM Phabricator instance.

Differential D140781

[mlir] Broadcast scalars when vectorising tensor.extract
ClosedPublic

Authored by awarzynski on Dec 30 2022, 7:36 AM.

Details

Reviewers

aartbik
nicolasvasilache
dcaballe
pzread
rsuderman

Commits

rGa63853e6acc5: [mlir] Broadcast scalars when vectorising tensor.extract

Summary

When vectorizing tensor.extract embedded within linalg.generic, the
default option is to rewrite it as vector.gather. When doing so, we need
to make sure that the corresponding indices are vectorized accordingly.
However, the Linalg vectorizer does not vectorize scalar constants such as
%c0 and %c1 in the following example, so they have to be broadcast explicitly.

func.func @example(%arg0: tensor<3x3xf32>, %arg2: tensor<1x1x3xf32>) -> tensor<1x1x3xf32> {
  %c0 = arith.constant 1 : index
  %c1 = arith.constant 2 : index
  %1 = linalg.generic {
    (...)
  } outs(...) {
  ^bb0(...):
    %2 = tensor.extract %arg0[%c0, %c1] : tensor<3x3xf32>
    linalg.yield %2 : f32
  } -> tensor<1x1x3xf32>
  return %1 : tensor<1x1x3xf32>
}

This patch makes sure that in this case the vectorizer broadcasts %c0 and %c1.
Additional tests are added to check other scenarios as well.
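
The index arithmetic behind the gather can be modelled outside MLIR. The following is a minimal C++ sketch, not the vectorizer's code: `Lanes`, `broadcastToLanes` and `gatherOffsets` are hypothetical names, a one-element `Lanes` stands in for a scalar constant, and the tensor is assumed to be addressed through a row-major linearised offset. With the constant indices 1 and 2 into a 3x3 tensor, every lane ends up with offset 1 * 3 + 2 = 5.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One index operand. A single element models a scalar (e.g. a constant);
// `numLanes` elements model an index that is already vectorised.
// (Hypothetical representation, chosen for this sketch only.)
using Lanes = std::vector<int64_t>;

// Expand a scalar to `numLanes` copies; return a per-lane index unchanged.
Lanes broadcastToLanes(const Lanes &idx, std::size_t numLanes) {
  if (idx.size() == 1)
    return Lanes(numLanes, idx[0]);
  assert(idx.size() == numLanes && "index has the wrong number of lanes");
  return idx;
}

// Row-major linearisation, computed lane by lane:
//   offset = (...(idx[0] * dim[1] + idx[1]) * dim[2] + ...) + idx[n-1]
Lanes gatherOffsets(const std::vector<Lanes> &indices,
                    const std::vector<int64_t> &shape, std::size_t numLanes) {
  assert(!indices.empty() && indices.size() == shape.size());
  Lanes offset = broadcastToLanes(indices[0], numLanes);
  for (std::size_t i = 1; i < indices.size(); ++i) {
    // Every index is brought to the same width before the multiply-add,
    // so scalars and per-lane indices can be mixed freely.
    Lanes idx = broadcastToLanes(indices[i], numLanes);
    for (std::size_t lane = 0; lane < numLanes; ++lane)
      offset[lane] = offset[lane] * shape[i] + idx[lane];
  }
  return offset;
}
```

Broadcasting every operand up front keeps the multiply-add uniform across lanes, at the cost of redundant work when all indices are scalars.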

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

awarzynski created this revision. Dec 30 2022, 7:36 AM

Herald added a reviewer: aartbik. Dec 30 2022, 7:36 AM

Herald added a project: Restricted Project.

Herald added subscribers: hanchung, Moerafaat, zero9178 and 23 others.

awarzynski requested review of this revision. Dec 30 2022, 7:36 AM

Herald added a reviewer: nicolasvasilache. Dec 30 2022, 7:36 AM

Herald added a reviewer: dcaballe.

Herald added a project: Restricted Project.

Herald added subscribers: pcwang-thead, limo1996, stephenneuendorffer, nicolasvasilache.

This is just a small follow-up for https://reviews.llvm.org/D137660. My main goal was to make sure that the scenario tested in "vectorize_nd_tensor_extract_constant_idx" behaves sanely rather than throwing an error:

within split at test.mlir:1 offset :11:10: error: 'arith.addi' op requires the same type for all operands and results
    %7 = tensor.extract %arg0[%c0, %c0, %3] : tensor<3x3x3xf32>
         ^
within split at test.mlir:1 offset :11:10: note: see current operation: %9 = "arith.addi"(%4, %0) : (vector<1x1x3xindex>, index) -> index

Although the change is (hopefully) straightforward, I'm not 100% sure whether this is the right approach. @dcaballe previously pointed out that I should avoid broadcasting scalars and instead do the address calculation with scalars. That's on my TODO list.
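
The error quoted above comes from adding a vector of indices to a scalar index. A toy C++ model can reproduce that mismatch and the effect of broadcasting first. Everything here is an assumption made for illustration: `ToyValue`, `addi` and `broadcastScalarIfNeeded` are hypothetical names and do not correspond to MLIR's classes or to the real `arith.addi` verifier.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Toy stand-in for a value of `index` element type: an empty shape is a
// scalar, a non-empty shape is a vector. (Hypothetical, not MLIR's Value.)
struct ToyValue {
  std::vector<int64_t> shape;
  std::vector<int64_t> data; // row-major, one entry per element
};

int64_t numElements(const std::vector<int64_t> &shape) {
  int64_t n = 1;
  for (int64_t d : shape)
    n *= d;
  return n;
}

// Mirrors the rule from the error message: both operands must have the
// same type, which in this model means the same shape.
ToyValue addi(const ToyValue &lhs, const ToyValue &rhs) {
  if (lhs.shape != rhs.shape)
    throw std::invalid_argument("addi requires the same type for all operands");
  ToyValue result = lhs;
  for (std::size_t i = 0; i < result.data.size(); ++i)
    result.data[i] += rhs.data[i];
  return result;
}

// Broadcast a scalar to `shape`; a value that already has that shape is
// returned untouched.
ToyValue broadcastScalarIfNeeded(const ToyValue &v,
                                 const std::vector<int64_t> &shape) {
  if (v.shape == shape)
    return v;
  if (!v.shape.empty())
    throw std::invalid_argument("this sketch only broadcasts scalars");
  assert(v.data.size() == 1 && "a scalar holds exactly one element");
  return ToyValue{shape, std::vector<int64_t>(numElements(shape), v.data[0])};
}
```

In this model, `addi(vec, scalar)` throws for a `1x1x3` vector and a scalar, while `addi(vec, broadcastScalarIfNeeded(scalar, vec.shape))` succeeds.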

Harbormaster completed remote builds in B205219: Diff 485700. Dec 30 2022, 8:09 AM

awarzynski added reviewers: pzread, rsuderman. Jan 9 2023, 8:04 AM

awarzynski edited the summary of this revision.

awarzynski edited the summary of this revision. Jan 9 2023, 10:01 AM

Refine the comment inside Vectorization.cpp

I've realised that the Linalg Vectorizer does not vectorize scalar constants "by design". Thanks @dcaballe for the pointer!

Harbormaster completed remote builds in B206585: Diff 487495. Jan 9 2023, 11:02 AM

dcaballe added inline comments. Jan 9 2023, 12:09 PM

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
638	Could we use one of the existing utilities that already generate a broadcast op? For example, `broadcastIfNeeded`. I also think we have to make sure that this constant goes through the `vectorizeOne` code, as a copy is generated for those cases where the constant also has a user that still needs a scalar version of it after vectorization.
mlir/test/Dialect/Linalg/vectorization.mlir
1521	I understand that supporting this case is mostly for completeness as we should generate a `vector.broadcast` instead of a `vector.gather` in the future (this access is loading the same element at every iteration, right?)

awarzynski added inline comments. Jan 11 2023, 7:31 AM

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
638	Wish I'd known about `broadcastIfNeeded` before, thanks! This will significantly simplify this function.
mlir/test/Dialect/Linalg/vectorization.mlir
1521	Correct. This is to document what's being generated "today" rather than what we should be aiming for. Not sure whether such tests are desirable. I do find them very helpful when going over the vectoriser. But I might be in a minority :)
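
The point made in this exchange, that a gather whose offsets are all equal loads the same element into every lane, can be checked with a small C++ sketch. It assumes a flattened row-major buffer and an all-true mask (so the pass-through value is never used); `gather` and `loadAndBroadcast` are hypothetical helpers, not the semantics of `vector.gather` or `vector.broadcast` in full.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Gather with an all-true mask: result[lane] = base[offsets[lane]].
std::vector<float> gather(const std::vector<float> &base,
                          const std::vector<int64_t> &offsets) {
  std::vector<float> result;
  result.reserve(offsets.size());
  for (int64_t off : offsets)
    result.push_back(base.at(static_cast<std::size_t>(off)));
  return result;
}

// A single scalar load followed by a broadcast to `numLanes` lanes.
std::vector<float> loadAndBroadcast(const std::vector<float> &base,
                                    int64_t offset, std::size_t numLanes) {
  return std::vector<float>(numLanes,
                            base.at(static_cast<std::size_t>(offset)));
}
```

For a 3x3 tensor stored as nine consecutive floats, `gather(flat, {5, 5, 5})` and `loadAndBroadcast(flat, 5, 3)` produce the same three-lane result, which is why a broadcast would be the cheaper lowering for the constant-index case.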

Switch to using broadcastIfNeeded

Harbormaster completed remote builds in B207090: Diff 488214. Jan 11 2023, 8:07 AM

Thanks!

This revision is now accepted and ready to land. Jan 11 2023, 6:31 PM

Closed by commit rGa63853e6acc5: [mlir] Broadcast scalars when vectorising tensor.extract (authored by awarzynski). Jan 12 2023, 8:34 AM

This revision was automatically updated to reflect the committed changes.

awarzynski added a commit: rGa63853e6acc5: [mlir] Broadcast scalars when vectorising tensor.extract.

Revision Contents

Path                                                    Size
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp    24 lines
mlir/test/Dialect/Linalg/vectorization.mlir             77 lines

Diff 488670

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp

@@ ... @@ calculateGatherOffset(OpBuilder &b, tensor::ExtractOp extractOp,
   auto indexVecType = VectorType::get(targetShape, b.getIndexType());
   auto loc = extractOp.getLoc();
 
   Value offset = b.create<vector::BroadcastOp>(
       loc, indexVecType, bvm.lookup(extractOp.getIndices()[0]));
 
   const size_t numIndices = extractOp.getIndices().size();
   for (size_t i = 1; i < numIndices; i++) {
-    auto dimSizeBcast = b.create<vector::BroadcastOp>(
-        loc, indexVecType,
+    auto dimSize = broadcastIfNeeded(
+        b,
         b.create<arith::ConstantIndexOp>(
             loc,
-            extractOp.getTensor().getType().cast<ShapedType>().getDimSize(i)));
-    offset = b.create<arith::MulIOp>(loc, offset, dimSizeBcast);
+            extractOp.getTensor().getType().cast<ShapedType>().getDimSize(i)),
+        indexVecType.getShape());
 
-    auto originalIndexBcast = bvm.lookup(extractOp.getIndices()[i]);
-    if (i == numIndices - 1) {
-      // We only need an additional broadcast for the trailing index. All other
-      // indices have already been broadcast by `vectorizeLinalgIndex` to match
-      // the output size.
-      originalIndexBcast = b.create<vector::BroadcastOp>(
-          loc, indexVecType, bvm.lookup(extractOp.getIndices()[i]));
-    }
+    offset = b.create<arith::MulIOp>(loc, offset, dimSize);
+
+    auto extractOpIndex = broadcastIfNeeded(
+        b, bvm.lookup(extractOp.getIndices()[i]), indexVecType.getShape());
 
-    offset = b.create<arith::AddIOp>(loc, originalIndexBcast, offset);
+    offset = b.create<arith::AddIOp>(loc, extractOpIndex, offset);
   }
 
   return offset;
 }
 
 /// Helper function to vectorize the tensor.extract operations. Returns
 /// VectorizationStatus::NewOp to signal the vectorization algorithm that it
 /// should map the produced operations. This function is meant to be used as a

mlir/test/Dialect/Linalg/vectorization.mlir

@@ ... @@
 ^bb1(%arg1: !pdl.operation):
   %0 = transform.structured.match ops{["linalg.generic"]} in %arg1
   %1 = get_closest_isolated_parent %0 : (!pdl.operation) -> !pdl.operation
   %2 = transform.structured.vectorize %1
 }
 
 // -----
 
+#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
+func.func @vectorize_nd_tensor_extract_constant_idx(%arg0: tensor<3x3xf32>, %arg2: tensor<1x1x3xf32>) -> tensor<1x1x3xf32> {
+  %c0 = arith.constant 1 : index
+  %c1 = arith.constant 2 : index
+  %2 = linalg.generic {
+    indexing_maps = [#map1],
+    iterator_types = ["parallel", "parallel", "parallel"]
+  } outs(%arg2 : tensor<1x1x3xf32>) {
+  ^bb0(%arg4: f32):
+    %3 = linalg.index 2 : index
+    %7 = tensor.extract %arg0[%c0, %c1] : tensor<3x3xf32>
+    linalg.yield %7 : f32
+  } -> tensor<1x1x3xf32>
+  return %2 : tensor<1x1x3xf32>
+}
+
+// CHECK-LABEL: func.func @vectorize_nd_tensor_extract_constant_idx
+// CHECK-SAME: %[[ARG0:.*]]: tensor<3x3xf32>
+// CHECK-SAME: %[[ARG1:.*]]: tensor<1x1x3xf32>
+// CHECK: %[[MASK:.*]] = arith.constant dense<true> : vector<1x1x3xi1>
+// CHECK: %[[PASSTHRU:.*]] = arith.constant dense<0.000000e+00> : vector<1x1x3xf32>
+// CHECK: %[[C0:.*]] = arith.constant 0 : index
+// Magic "5" below comes from (1 * 3 + 2) (1: index into dim 1, 2: index into dim 2)
+// CHECK: %[[IDX:.*]] = arith.constant dense<5> : vector<1x1x3xindex>
+// CHECK: %[[GATHER:.*]] = vector.gather %[[ARG0]][%[[C0]], %[[C0]]] [%[[IDX]]], %[[MASK]], %[[PASSTHRU]] : tensor<3x3xf32>, vector<1x1x3xindex>, vector<1x1x3xi1>, vector<1x1x3xf32> into vector<1x1x3xf32>
+// CHECK: vector.transfer_write %[[GATHER]]
+// CHECK: }
+
+transform.sequence failures(propagate) {
+^bb1(%arg1: !pdl.operation):
+  %0 = transform.structured.match ops{["linalg.generic"]} in %arg1
+  %1 = get_closest_isolated_parent %0 : (!pdl.operation) -> !pdl.operation
+  %2 = transform.structured.vectorize %1 { vectorize_nd_extract }
+}
+
+// -----
+
+#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
+func.func @vectorize_nd_tensor_extract_idx_from_iteration_index(%arg0: tensor<3x3x3xf32>, %arg2: tensor<1x1x3xf32>) -> tensor<1x1x3xf32> {
+  %1 = linalg.generic {
+    indexing_maps = [#map1],
+    iterator_types = ["parallel", "parallel", "parallel"]
+  } outs(%arg2 : tensor<1x1x3xf32>) {
+  ^bb0(%arg4: f32):
+    %2 = linalg.index 0 : index
+    %3 = linalg.index 1 : index
+    %4 = linalg.index 2 : index
+    %5 = tensor.extract %arg0[%2, %3, %4] : tensor<3x3x3xf32>
+    linalg.yield %5 : f32
+  } -> tensor<1x1x3xf32>
+  return %1 : tensor<1x1x3xf32>
+}
+
+// CHECK-LABEL: func.func @vectorize_nd_tensor_extract_idx_from_iteration_index
+// CHECK-SAME: %[[ARG0:.*]]: tensor<3x3x3xf32>
+// CHECK-SAME: %[[ARG1:.*]]: tensor<1x1x3xf32>
+// CHECK: %[[INDICES:.*]] = arith.constant dense<[0, 1, 2]> : vector<3xindex>
+// CHECK: %[[MASK:.*]] = arith.constant dense<true> : vector<1x1x3xi1>
+// CHECK: %[[PASSTHRU:.*]] = arith.constant dense<0.000000e+00> : vector<1x1x3xf32>
+// CHECK: %[[C0:.*]] = arith.constant 0 : index
+// CHECK: %[[B:.*]] = vector.broadcast %[[INDICES]] : vector<3xindex> to vector<1x1x3xindex>
+// CHECK: %[[GATHER:.*]] = vector.gather %[[ARG0]][%[[C0]], %[[C0]], %[[C0]]] [%[[B]]], %[[MASK]], %[[PASSTHRU]] : tensor<3x3x3xf32>, vector<1x1x3xindex>, vector<1x1x3xi1>, vector<1x1x3xf32> into vector<1x1x3xf32>
+// CHECK: vector.transfer_write %[[GATHER]]
+
+transform.sequence failures(propagate) {
+^bb1(%arg1: !pdl.operation):
+  %0 = transform.structured.match ops{["linalg.generic"]} in %arg1
+  %1 = get_closest_isolated_parent %0 : (!pdl.operation) -> !pdl.operation
+  %2 = transform.structured.vectorize %1 { vectorize_nd_extract }
+}
+
+// -----
+
 #map0 = affine_map<(d0, d1, d2, d3) -> (d0, d2)>
 #map1 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d3)>
 #map2 = affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>
-func.func @vectorize_nd_tensor_extract(%arg0: tensor<3x3xf32>, %arg1: tensor<4x3xi32>, %arg2: tensor<4x3xi32>, %arg3: tensor<4x7x2xf32>, %arg4: tensor<4x7x3x2xf32>) -> tensor<4x7x3x2xf32> {
+func.func @vectorize_nd_tensor_extract_index_from_tensor(%arg0: tensor<3x3xf32>, %arg1: tensor<4x3xi32>, %arg2: tensor<4x3xi32>, %arg3: tensor<4x7x2xf32>, %arg4: tensor<4x7x3x2xf32>) -> tensor<4x7x3x2xf32> {
   %2 = linalg.generic {
     indexing_maps = [#map0, #map0, #map1, #map2],
     iterator_types = ["parallel", "parallel", "parallel", "parallel"]
   } ins(%arg1, %arg2, %arg3 : tensor<4x3xi32>, tensor<4x3xi32>, tensor<4x7x2xf32>) outs(%arg4 : tensor<4x7x3x2xf32>) {
   ^bb0(%arg5: i32, %arg6: i32, %arg7: f32, %arg8: f32):
     %3 = arith.index_cast %arg5 : i32 to index
     %4 = arith.index_cast %arg6 : i32 to index
     %7 = tensor.extract %arg0[%3, %4] : tensor<3x3xf32>
     linalg.yield %7 : f32
   } -> tensor<4x7x3x2xf32>
   return %2 : tensor<4x7x3x2xf32>
 }
-// CHECK-LABEL: func.func @vectorize_nd_tensor_extract
+// CHECK-LABEL: func.func @vectorize_nd_tensor_extract_index_from_tensor
 // CHECK-SAME: %[[ARG0:.*]]: tensor<3x3xf32>
 // CHECK-SAME: %[[ARG1:arg1]]: tensor<4x3xi32>
 // CHECK-SAME: %[[ARG2:arg2]]: tensor<4x3xi32>
 // CHECK-SAME: %[[ARG3:.*]]: tensor<4x7x2xf32>
 // CHECK-SAME: %[[ARG4:.*]]: tensor<4x7x3x2xf32>
 // CHECK: %[[C0:.*]] = arith.constant 0 : index
 // CHECK: %[[C0_i32:.*]] = arith.constant 0 : i32
 // CHECK: %[[CST:.*]] = arith.constant dense<3> : vector<7x2x4x3xindex>