This is an archive of the discontinued LLVM Phabricator instance.

Differential D148537

[mlir][linalg] Refine `tensor.extract` vectorization
ClosedPublic

Authored by awarzynski on Apr 17 2023, 9:06 AM.

Details

Reviewers

aartbik
nicolasvasilache
dcaballe
jpienaar

Commits

rGa3ae3931d4e0: [mlir][linalg] Refine `tensor.extract` vectorisation

Summary

This patch updates the vectorization of the extract Op so that the
permutation map for the transfer_read Op is defined explicitly by the
vectorizer.

This change is needed for cases where the rank of the source tensor is
lower than the rank of the output vector generated by the vectorizer:

    %17 = vector.transfer_read %arg1[%14, %16], %cst_4 {in_bounds = [true, true]} : tensor<257x24xf32>, vector<1x1x4xf32>

In cases like this, the vectorizer will create the following permutation map:

(d0, d1) -> (0, d0, d1)

In other cases the behavior remains unchanged.
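
As a rough illustration of the map construction described above, the following standalone sketch models the two steps from the patch (build a minor identity map, then prepend one constant `0` result per missing leading dim) without depending on MLIR. The names `buildReadMap`, `formatMap` and `kBroadcast` are hypothetical and exist only for this example; the actual patch uses `AffineMap::getMinorIdentityMap` and `AffineMap::insertResult`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one affine map result: a value >= 0 names the
// source dimension dN, while kBroadcast models the constant-0 result.
constexpr int kBroadcast = -1;

// Models the map construction: start from the minor identity over the source
// dims, then prepend one constant-0 result per missing leading dim.
std::vector<int> buildReadMap(int srcRank, int dstRank) {
  assert(srcRank > 0 && dstRank > 0 && "ranks must be positive");
  int numResults = std::min(srcRank, dstRank);
  std::vector<int> results;
  // Minor identity: keep the trailing `numResults` source dims.
  for (int d = srcRank - numResults; d < srcRank; ++d)
    results.push_back(d);
  // Extend with 0s at the front while the destination rank is larger.
  for (int rankDiff = dstRank - srcRank; rankDiff > 0; --rankDiff)
    results.insert(results.begin(), kBroadcast);
  return results;
}

// Renders the map in the notation used in the summary.
std::string formatMap(int srcRank, const std::vector<int> &results) {
  std::string out = "(";
  for (int d = 0; d < srcRank; ++d)
    out += std::string(d ? ", d" : "d") + std::to_string(d);
  out += ") -> (";
  for (std::size_t i = 0; i < results.size(); ++i) {
    if (i)
      out += ", ";
    out += results[i] == kBroadcast ? std::string("0")
                                    : "d" + std::to_string(results[i]);
  }
  return out + ")";
}
```

For the case from the summary (source rank 2, result rank 3), `formatMap(2, buildReadMap(2, 3))` yields `(d0, d1) -> (0, d0, d1)`. When the ranks match, the sketch returns the plain identity map, which corresponds to the unchanged behavior.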

Fixes https://github.com/openxla/iree/issues/13036. That's also where
the test case was extracted from.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

awarzynski created this revision. Apr 17 2023, 9:06 AM

Herald added a reviewer: aartbik. Apr 17 2023, 9:06 AM

Herald added a project: Restricted Project.

Herald added subscribers: bviyer, hanchung, Moerafaat and 24 others.

awarzynski requested review of this revision. Apr 17 2023, 9:06 AM

Herald added a reviewer: nicolasvasilache. Apr 17 2023, 9:06 AM

Herald added a reviewer: dcaballe.

Herald added a project: Restricted Project.

Herald added subscribers: pcwang-thead, limo1996, stephenneuendorffer, nicolasvasilache.

awarzynski mentioned this in D148265: [mlir][linalg] Refine `tensor.extract` vectorisation. Apr 17 2023, 9:07 AM

awarzynski added a reviewer: jpienaar. Apr 17 2023, 9:17 AM

awarzynski edited the summary of this revision. Apr 17 2023, 9:31 AM

tschuett added a subscriber: tschuett. Apr 17 2023, 9:53 AM

tschuett added inline comments.

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
956	Please z!

Harbormaster completed remote builds in B226121: Diff 514267. Apr 17 2023, 10:28 AM

dcaballe added inline comments. Apr 17 2023, 10:48 AM

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
956	I think neither American nor British English is enforced by the guidelines so `Vectorised` should be fine as long as it's consistent with the rest of the debug messages, which seems to be the case?
993	nit: `.`
999	It would help if you mentioned that this is to broadcast to the unitary leading dims
mlir/test/Dialect/Linalg/vectorization.mlir
1869	These constants would probably need a `-DAG`
1879	This should be a contiguous load, right? Are you fixing this separately?
1887	Hmm... How is it possible that we generate a contiguous load if there is a gather in its address computation?

Thanks for taking a look!

mlir/test/Dialect/Linalg/vectorization.mlir
1879	> This should be a contiguous load, right?
	Not quite. It's a scalar load which is loop invariant. `vector.gather` is correct, but very inefficient.
	> Are you fixing this separately?
	Dealing with this `vector.gather` separately makes more sense to me. Otherwise this patch would be addressing two different issues. Also, I'd like to resolve https://github.com/openxla/iree/issues/13036 before tackling anything else. I should be able to upload a separate patch shortly.
1887	`%extracted_0 = tensor.extract %input_1[%c0, %14] : tensor<1x20xi32>` is loop invariant :) I'll add a comment above.

Update and add comments as per PR suggestions

Harbormaster completed remote builds in B226817: Diff 515250. Apr 20 2023, 1:58 AM

Thanks! It makes sense!

This revision is now accepted and ready to land. Apr 20 2023, 8:47 AM

Closed by commit rGa3ae3931d4e0: [mlir][linalg] Refine `tensor.extract` vectorisation (authored by awarzynski). Apr 21 2023, 12:48 AM

This revision was automatically updated to reflect the committed changes.

awarzynski added a commit: rGa3ae3931d4e0: [mlir][linalg] Refine `tensor.extract` vectorisation.

Revision Contents

Path	Size
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp	28 lines
mlir/test/Dialect/Linalg/vectorization.mlir	72 lines

Diff 515633

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp

[... 947 lines not shown ...]
Operation *gatherOp = rewriter.create<vector::GatherOp>(		Operation *gatherOp = rewriter.create<vector::GatherOp>(
maskConstantOp, passThruConstantOp);		maskConstantOp, passThruConstantOp);
gatherOp = state.maskOperation(rewriter, gatherOp, linalgOp);		gatherOp = state.maskOperation(rewriter, gatherOp, linalgOp);

LDBG("Vectorised as gather load: " << extractOp);		LDBG("Vectorised as gather load: " << extractOp);
return VectorizationResult{VectorizationStatus::NewOp, gatherOp};		return VectorizationResult{VectorizationStatus::NewOp, gatherOp};
}		}

// 2. Handle contiguous access.		// 2. Handle contiguous access.
		LDBG("Vectorised as contiguous load: " << extractOp);
SmallVector<Value> transferReadIdxs;		SmallVector<Value> transferReadIdxs;
auto resTrailingDim = resultType.getShape().back();		auto resTrailingDim = resultType.getShape().back();
auto zero = rewriter.create<arith::ConstantOp>(		auto zero = rewriter.create<arith::ConstantOp>(
loc, rewriter.getI32Type(), rewriter.getZeroAttr(rewriter.getI32Type()));		loc, rewriter.getI32Type(), rewriter.getZeroAttr(rewriter.getI32Type()));

// Collect indices for `vector.transfer_read`. At this point, the indices will		// Collect indices for `vector.transfer_read`. At this point, the indices will
// either be scalars or would have been broadcast to vectors matching the		// either be scalars or would have been broadcast to vectors matching the
// result type. For indices that are vectors, there are two options:		// result type. For indices that are vectors, there are two options:
[... 17 lines not shown ...]
for (size_t i = 0; i < extractOp.getIndices().size(); i++) {		for (size_t i = 0; i < extractOp.getIndices().size(); i++) {
auto indexAs1dVector = rewriter.create<vector::ShapeCastOp>(		auto indexAs1dVector = rewriter.create<vector::ShapeCastOp>(
loc, VectorType::get({resTrailingDim}, rewriter.getIndexType()),		loc, VectorType::get({resTrailingDim}, rewriter.getIndexType()),
bvm.lookup(extractOp.getIndices()[i]));		bvm.lookup(extractOp.getIndices()[i]));
transferReadIdxs.push_back(		transferReadIdxs.push_back(
rewriter.create<vector::ExtractElementOp>(loc, indexAs1dVector, zero));		rewriter.create<vector::ExtractElementOp>(loc, indexAs1dVector, zero));
}		}

// `tensor.extract_element` is always in-bounds, hence the following holds.		// `tensor.extract_element` is always in-bounds, hence the following holds.
SmallVector<bool> inBounds(resultType.getRank(), true);		auto dstRank = resultType.getRank();
		SmallVector<bool> inBounds(dstRank, true);

auto transferReadOp = rewriter.create<vector::TransferReadOp>(		// Create a permutation map for transfer_read Op.
loc, resultType, extractOp.getTensor(), transferReadIdxs, inBounds);		auto srcRank = extractOp.getTensor().getType().getRank();
		auto permutationMap = AffineMap::getMinorIdentityMap(
		srcRank, std::min(dstRank, srcRank), rewriter.getContext());

		int32_t rankDiff = dstRank - srcRank;
		// When dstRank > srcRank, broadcast the source tensor to the unitary leading
		// dims so that the ranks match. This is done by extending the map with 0s.
		// For example, for dstRank = 3, srcRank = 2, the following map created
		// above:
		// (d0, d1) --> (d0, d1)
		// is extended as:
		// (d0, d1) --> (0, d0, d1)
		while (rankDiff > 0) {
		permutationMap = permutationMap.insertResult(
		mlir::getAffineConstantExpr(0, rewriter.getContext()), 0);
		rankDiff--;
		}

LDBG("Vectorised as contiguous load: " << extractOp);		auto transferReadOp = rewriter.create<vector::TransferReadOp>(
		loc, resultType, extractOp.getTensor(), transferReadIdxs, permutationMap,
		inBounds);
return VectorizationResult{VectorizationStatus::NewOp, transferReadOp};		return VectorizationResult{VectorizationStatus::NewOp, transferReadOp};
}		}

/// Emit reduction operations if the shapes of the value to reduce is different		/// Emit reduction operations if the shapes of the value to reduce is different
/// that the result shape.		/// that the result shape.
// Note: this is a true builder that notifies the OpBuilder listener.		// Note: this is a true builder that notifies the OpBuilder listener.
// TODO: Consider moving as a static helper on the ReduceOp.		// TODO: Consider moving as a static helper on the ReduceOp.
static Operation *reduceIfNeeded(OpBuilder &b, LinalgOp linalgOp, Operation *op,		static Operation *reduceIfNeeded(OpBuilder &b, LinalgOp linalgOp, Operation *op,
[... 1,933 lines not shown ...]

mlir/test/Dialect/Linalg/vectorization.mlir

	[... 1,828 lines not shown ...]
	^bb1(%arg1: !pdl.operation):			^bb1(%arg1: !pdl.operation):
	%0 = transform.structured.match ops{["linalg.generic"]} in %arg1 : (!pdl.operation) -> !pdl.operation			%0 = transform.structured.match ops{["linalg.generic"]} in %arg1 : (!pdl.operation) -> !pdl.operation
	%1 = get_closest_isolated_parent %0 : (!pdl.operation) -> !pdl.operation			%1 = get_closest_isolated_parent %0 : (!pdl.operation) -> !pdl.operation
	%2 = transform.structured.vectorize %1 { vectorize_nd_extract }			%2 = transform.structured.vectorize %1 { vectorize_nd_extract }
	}			}

	// -----			// -----

				func.func @vectorize_nd_tensor_extract_with_tensor_extract(%input_1: tensor<1x20xi32>, %input_2: tensor<257x24xf32>, %arg0 : index, %arg1 : index, %arg2 : index, %arg3 : index) -> tensor<1x1x4xf32> {
				%c0 = arith.constant 0 : index
				%c256 = arith.constant 256 : index
				%output = tensor.empty() : tensor<1x1x4xf32>
				%1 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2) -> (d0, d1, d2)>], iterator_types = ["parallel", "parallel", "parallel"]} outs(%output : tensor<1x1x4xf32>) {
				^bb0(%out: f32):
				%13 = linalg.index 0 : index
				%14 = affine.apply affine_map<(d0, d1, d2) -> (d0 + d1 + d2)>(%arg0, %13, %arg2)
				%15 = linalg.index 2 : index
				%16 = linalg.index 1 : index
				%17 = affine.apply affine_map<(d0, d1, d2, d3) -> (d0 + d1 * 24 + d2 + d3)>(%arg1, %16, %15, %arg3)
				%extracted_0 = tensor.extract %input_1[%c0, %14] : tensor<1x20xi32>
				%18 = arith.index_cast %extracted_0 : i32 to index
				%19 = arith.maxsi %18, %c0 : index
				%20 = arith.minsi %19, %c256 : index
				%extracted_1 = tensor.extract %input_2[%20, %17] : tensor<257x24xf32>
				linalg.yield %extracted_1 : f32
				} -> tensor<1x1x4xf32>
				return %1 : tensor<1x1x4xf32>
				}

				// First `tensor.extract` is a loop invariant scalar load. This way, the
				// following `tensor.extract` Op becomes a contiguous load (all other Ops used
				// for address calculation also satisfy the required conditions).
				// TODO: Don't use vector.gather for the first tensor.extract.

				// CHECK-LABEL: func.func @vectorize_nd_tensor_extract_with_tensor_extract(
				// CHECK-SAME: %[[VAL_0:.*]]: tensor<1x20xi32>,
				// CHECK-SAME: %[[VAL_1:.*]]: tensor<257x24xf32>,
				// CHECK-SAME: -> tensor<1x1x4xf32> {
				// CHECK-DAG: %[[VAL_6:.*]] = arith.constant dense<0> : vector<1x1x4xindex>
				// CHECK-DAG: %[[VAL_7:.*]] = arith.constant dense<[0, 1, 2, 3]> : vector<4xindex>
				// CHECK-DAG: %[[VAL_8:.*]] = arith.constant dense<true> : vector<1x1x4xi1>
				// CHECK-DAG: %[[VAL_9:.*]] = arith.constant dense<0> : vector<1x1x4xi32>
				// CHECK-DAG: %[[VAL_10:.*]] = arith.constant 0 : index
				// CHECK-DAG: %[[VAL_11:.*]] = arith.constant dense<256> : vector<1x1x4xindex>
				// CHECK-DAG: %[[VAL_12:.*]] = arith.constant 0 : i32
				// CHECK-DAG: %[[VAL_13:.*]] = arith.constant 0.000000e+00 : f32
				// CHECK: %[[VAL_14:.*]] = tensor.empty() : tensor<1x1x4xf32>
				// CHECK: %[[VAL_15:.*]] = vector.broadcast %{{.*}} : index to vector<1x1x4xindex>
				// CHECK: %[[VAL_16:.*]] = vector.broadcast %{{.*}} : index to vector<1x1x4xindex>
				// CHECK: %[[VAL_17:.*]] = arith.addi %[[VAL_15]], %[[VAL_16]] : vector<1x1x4xindex>
				// CHECK: %[[VAL_18:.*]] = vector.broadcast %{{.*}} : index to vector<1x1x4xindex>
				// CHECK: %[[VAL_19:.*]] = vector.broadcast %[[VAL_7]] : vector<4xindex> to vector<1x1x4xindex>
				// CHECK: %[[VAL_20:.*]] = arith.addi %[[VAL_18]], %[[VAL_19]] : vector<1x1x4xindex>
				// CHECK: %[[VAL_21:.*]] = vector.broadcast %{{.*}} : index to vector<1x1x4xindex>
				// CHECK: %[[VAL_22:.*]] = arith.addi %[[VAL_20]], %[[VAL_21]] : vector<1x1x4xindex>
				// CHECK: %[[VAL_23:.*]] = vector.gather %[[VAL_0]]{{\[}}%[[VAL_10]], %[[VAL_10]]] {{\[}}%[[VAL_17]]], %[[VAL_8]], %[[VAL_9]] : tensor<1x20xi32>, vector<1x1x4xindex>, vector<1x1x4xi1>, vector<1x1x4xi32> into vector<1x1x4xi32>
				// CHECK: %[[VAL_24:.*]] = arith.index_cast %[[VAL_23]] : vector<1x1x4xi32> to vector<1x1x4xindex>
				// CHECK: %[[VAL_25:.*]] = arith.maxsi %[[VAL_24]], %[[VAL_6]] : vector<1x1x4xindex>
				// CHECK: %[[VAL_26:.*]] = arith.minsi %[[VAL_25]], %[[VAL_11]] : vector<1x1x4xindex>
				// CHECK: %[[VAL_27:.*]] = vector.shape_cast %[[VAL_26]] : vector<1x1x4xindex> to vector<4xindex>
				// CHECK: %[[VAL_28:.*]] = vector.extractelement %[[VAL_27]]{{\[}}%[[VAL_12]] : i32] : vector<4xindex>
				// CHECK: %[[VAL_29:.*]] = vector.shape_cast %[[VAL_22]] : vector<1x1x4xindex> to vector<4xindex>
				// CHECK: %[[VAL_30:.*]] = vector.extractelement %[[VAL_29]]{{\[}}%[[VAL_12]] : i32] : vector<4xindex>
				// CHECK: %[[VAL_31:.*]] = vector.transfer_read %[[VAL_1]]{{\[}}%[[VAL_28]], %[[VAL_30]]], %[[VAL_13]] {in_bounds = [true, true]} : tensor<257x24xf32>, vector<1x4xf32>
				// CHECK: %[[VAL_32:.*]] = vector.broadcast %[[VAL_31]] : vector<1x4xf32> to vector<1x1x4xf32>
				// CHECK: %[[VAL_33:.*]] = vector.transfer_write %[[VAL_32]], %[[VAL_14]]{{\[}}%[[VAL_10]], %[[VAL_10]], %[[VAL_10]]] {in_bounds = [true, true, true]} : vector<1x1x4xf32>, tensor<1x1x4xf32>
				// CHECK: return %[[VAL_33]] : tensor<1x1x4xf32>
				// CHECK: }

				transform.sequence failures(propagate) {
				^bb1(%arg1: !pdl.operation):
				%0 = transform.structured.match ops{["linalg.generic"]} in %arg1 : (!pdl.operation) -> !pdl.operation
				%1 = get_closest_isolated_parent %0 : (!pdl.operation) -> !pdl.operation
				%2 = transform.structured.vectorize %1 { vectorize_nd_extract }
				}

				// -----

	func.func @masked_static_vectorize_nd_tensor_extract_with_affine_apply_contiguous(%6: tensor<80x16xf32>, %arg0: index, %extracted_slice : tensor<1x3xf32>) -> tensor<1x3xf32> {			func.func @masked_static_vectorize_nd_tensor_extract_with_affine_apply_contiguous(%6: tensor<80x16xf32>, %arg0: index, %extracted_slice : tensor<1x3xf32>) -> tensor<1x3xf32> {
	%c79 = arith.constant 79 : index			%c79 = arith.constant 79 : index
	%1 = linalg.generic {			%1 = linalg.generic {
	indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>],			indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>],
	iterator_types = ["parallel", "parallel"]			iterator_types = ["parallel", "parallel"]
	} outs(%extracted_slice : tensor<1x3xf32>) {			} outs(%extracted_slice : tensor<1x3xf32>) {
	^bb0(%out: f32):			^bb0(%out: f32):
	%2 = linalg.index 1 : index			%2 = linalg.index 1 : index
	[... 68 lines not shown ...]
	// CHECK: %[[VAL_27:.*]] = vector.mask %[[VAL_10]] { vector.transfer_write %[[VAL_25]], %[[VAL_2]]{{\[}}%[[VAL_26]], %[[VAL_26]]] {in_bounds = [true, true]} : vector<1x4xf32>, tensor<?x?xf32> } : vector<1x4xi1> -> tensor<?x?xf32>			// CHECK: %[[VAL_27:.*]] = vector.mask %[[VAL_10]] { vector.transfer_write %[[VAL_25]], %[[VAL_2]]{{\[}}%[[VAL_26]], %[[VAL_26]]] {in_bounds = [true, true]} : vector<1x4xf32>, tensor<?x?xf32> } : vector<1x4xi1> -> tensor<?x?xf32>
	// CHECK: return %[[VAL_27]] : tensor<?x?xf32>			// CHECK: return %[[VAL_27]] : tensor<?x?xf32>
	// CHECK: }			// CHECK: }

	transform.sequence failures(propagate) {			transform.sequence failures(propagate) {
	^bb1(%arg1: !pdl.operation):			^bb1(%arg1: !pdl.operation):
	%0 = transform.structured.match ops{["linalg.generic"]} in %arg1 : (!pdl.operation) -> !pdl.operation			%0 = transform.structured.match ops{["linalg.generic"]} in %arg1 : (!pdl.operation) -> !pdl.operation
	transform.structured.masked_vectorize %0 vector_sizes [1, 4] { vectorize_nd_extract }			transform.structured.masked_vectorize %0 vector_sizes [1, 4] { vectorize_nd_extract }
	}			}

	// -----			// -----

	// The vectorizer converts `affine.apply` so that the subsequent Ops can be vectorised based on the converted ops. Gather load.			// The vectorizer converts `affine.apply` so that the subsequent Ops can be vectorised based on the converted ops. Gather load.
	func.func @vectorize_nd_tensor_extract_with_affine_apply_gather(%6: tensor<80x16xf32>, %arg0: index, %extracted_slice : tensor<1x4xf32>) -> tensor<1x4xf32> {			func.func @vectorize_nd_tensor_extract_with_affine_apply_gather(%6: tensor<80x16xf32>, %arg0: index, %extracted_slice : tensor<1x4xf32>) -> tensor<1x4xf32> {
	%c16 = arith.constant 16 : index			%c16 = arith.constant 16 : index
	%1 = linalg.generic {			%1 = linalg.generic {
	indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>],			indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>],
	[... 864 lines not shown ...]