This is an archive of the discontinued LLVM Phabricator instance.

[mlir][Linalg] Revisit heuristic ordering of tensor.insert_slice in comprehensive bufferize.
ClosedPublic

Authored by nicolasvasilache on Sep 20 2021, 7:04 AM.

Details

Reviewers

ftynse
springerm
gysit
silvas
aartbik

Commits

rG101d017a6438: [mlir][Linalg] Revisit heuristic ordering of tensor.insert_slice in…

Summary

It was previously assumed that tensor.insert_slice should be bufferized first in a greedy fashion to avoid out-of-place bufferization of the large tensor. This heuristic does not hold upon further inspection.

This CL removes the special handling of such ops and adds a test that exhibits better behavior and appears in real use cases.

The only test adversely affected is an artificial test which results in a returned memref: this pattern is not allowed by comprehensive bufferization in real scenarios anyway and the offending test is deleted.
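The change in analysis order described above can be illustrated with a small standalone sketch. It is not the MLIR implementation: `ToyOp`, `oldAnalysisOrder`, `newAnalysisOrder`, and `insertSliceChain` are hypothetical names introduced here, and ops are modeled as plain strings taken from the `insert_slice_chain` test added in this revision. The sketch only shows which op the in-place analysis would visit first under each scheme.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for an MLIR operation (assumption: only the name
// and whether it is a tensor.insert_slice matter for the ordering).
struct ToyOp {
  std::string name;
  bool isInsertSlice;
};

// Old scheme: visit all insert_slice ops first (in reverse), then all
// remaining ops (in reverse).
std::vector<std::string> oldAnalysisOrder(const std::vector<ToyOp> &ops) {
  std::vector<std::string> order;
  for (auto it = ops.rbegin(); it != ops.rend(); ++it)
    if (it->isInsertSlice)
      order.push_back(it->name);
  for (auto it = ops.rbegin(); it != ops.rend(); ++it)
    if (!it->isInsertSlice)
      order.push_back(it->name);
  assert(order.size() == ops.size() && "every op is visited exactly once");
  return order;
}

// New scheme: a single reverse traversal with no special casing.
std::vector<std::string> newAnalysisOrder(const std::vector<ToyOp> &ops) {
  std::vector<std::string> order;
  for (auto it = ops.rbegin(); it != ops.rend(); ++it)
    order.push_back(it->name);
  return order;
}

// Tensor ops of the insert_slice_chain test, in program order.
std::vector<ToyOp> insertSliceChain() {
  return {{"%0 = linalg.fill", false},
          {"%2 = tensor.extract_slice", false},
          {"%7 = vector.transfer_write", false},
          {"%8 = tensor.insert_slice", true},
          {"%10 = tensor.extract_slice", false},
          {"%14 = vector.transfer_write", false},
          {"%15 = tensor.insert_slice", true}};
}
```

Calling `oldAnalysisOrder(insertSliceChain())` visits `%15` and then `%8` before any other op, whereas `newAnalysisOrder` visits `%15`, `%14`, `%10`, and only then `%8`, so each slice is analyzed together with the ops that produce and consume it.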

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

nicolasvasilache created this revision. Sep 20 2021, 7:04 AM

Herald added a reviewer: aartbik. Sep 20 2021, 7:04 AM

Herald added subscribers: wenzhicui, wrengr, Chia-hungDuan and 20 others.

nicolasvasilache requested review of this revision. Sep 20 2021, 7:04 AM

Herald added a project: Restricted Project. Sep 20 2021, 7:04 AM

Herald added subscribers: limo1996, stephenneuendorffer.

gysit accepted this revision. Sep 20 2021, 7:33 AM

This revision is now accepted and ready to land. Sep 20 2021, 7:33 AM

springerm accepted this revision. Sep 20 2021, 5:47 PM

springerm added inline comments.

mlir/lib/Dialect/Linalg/Transforms/ComprehensiveBufferize.cpp
2483–2493	Less code is always good, but I'm wondering if there are any important cases that see a performance degradation (more out-of-place bufferization) because of this.

nicolasvasilache marked an inline comment as done. Sep 21 2021, 7:28 AM

nicolasvasilache added inline comments.

mlir/lib/Dialect/Linalg/Transforms/ComprehensiveBufferize.cpp
2483–2493	I'd turn the burden of proof back :) Evidence has shown that this was a premature concern, as:
- the only existing test case that benefits from this is artificial and would fail bufferization anyway;
- the new test case we were discussing suffers from the order and hides the subalias intersection problem.
It is possible (even likely) that future use cases will appear that require a real heuristic; when we have a concrete need, we should improve the heuristic further.
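The "subalias intersection problem" appears to correspond to the TODO in the new test, which asks for a range analysis able to prove that `[0, 32)x[0, 90)` and `[32, 62)x[0, 90)` do not intersect. A minimal sketch of such a check follows, assuming static offsets and sizes and unit strides only. `HalfOpenRange`, `Box`, `rangesOverlap`, and `boxesDisjoint` are hypothetical names for illustration and are not part of ComprehensiveBufferize.cpp.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Half-open interval [begin, end) along one dimension.
struct HalfOpenRange {
  std::int64_t begin;
  std::int64_t end;
};

// A hyper-rectangular slice: one half-open range per dimension.
template <std::size_t Rank>
using Box = std::array<HalfOpenRange, Rank>;

// Two half-open ranges overlap iff the larger begin is below the smaller end.
inline bool rangesOverlap(const HalfOpenRange &a, const HalfOpenRange &b) {
  return std::max(a.begin, b.begin) < std::min(a.end, b.end);
}

// Two boxes are disjoint iff they fail to overlap in at least one dimension.
template <std::size_t Rank>
bool boxesDisjoint(const Box<Rank> &a, const Box<Rank> &b) {
  for (std::size_t i = 0; i < Rank; ++i)
    if (!rangesOverlap(a[i], b[i]))
      return true;
  return false;
}
```

For the two slices in the test, `Box<2>{{{0, 32}, {0, 90}}}` and `Box<2>{{{32, 62}, {0, 90}}}` fail to overlap in the first dimension, so `boxesDisjoint` returns true. Dynamic offsets, dynamic sizes, or non-unit strides would need a more general analysis.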

This revision was landed with ongoing or failed builds. Sep 21 2021, 7:31 AM

Closed by commit rG101d017a6438: [mlir][Linalg] Revisit heuristic ordering of tensor.insert_slice in… (authored by nicolasvasilache).

This revision was automatically updated to reflect the committed changes.

nicolasvasilache marked an inline comment as done.

nicolasvasilache added a commit: rG101d017a6438: [mlir][Linalg] Revisit heuristic ordering of tensor.insert_slice in….

Revision Contents

Path	Size
mlir/lib/Dialect/Linalg/Transforms/ComprehensiveBufferize.cpp	23 lines
mlir/test/Dialect/Linalg/comprehensive-module-bufferize-analysis.mlir	47 lines
mlir/test/Dialect/Linalg/comprehensive-module-bufferize.mlir	28 lines

Diff 373924

mlir/lib/Dialect/Linalg/Transforms/ComprehensiveBufferize.cpp

	static LogicalResult			static LogicalResult
	inPlaceAnalysisFuncOpBody(FuncOp funcOp, BufferizationAliasInfo &aliasInfo,			inPlaceAnalysisFuncOpBody(FuncOp funcOp, BufferizationAliasInfo &aliasInfo,
	const DominanceInfo &domInfo) {			const DominanceInfo &domInfo) {
	LLVM_DEBUG(llvm::dbgs() << "\n\n");			LLVM_DEBUG(llvm::dbgs() << "\n\n");
	LDBG("Begin InPlaceAnalysisFuncOpInternals:\n" << funcOp << '\n');			LDBG("Begin InPlaceAnalysisFuncOpInternals:\n" << funcOp << '\n');
	assert(funcOp && funcOp->getNumRegions() > 0 && !funcOp.body().empty() &&			assert(funcOp && funcOp->getNumRegions() > 0 && !funcOp.body().empty() &&
	"expected a funcOp definition with a body");			"expected a funcOp definition with a body");

	// Collect ops so we can build our own traversal.			// Collect ops so we can build our own reverse traversal.
	SmallVector<Operation *> otherOps;			SmallVector<Operation *> ops;
	SmallVector<InsertSliceOp> insertSliceOps;
	funcOp.walk([&](Operation *op) {			funcOp.walk([&](Operation *op) {
	if (auto insertSliceOp = dyn_cast<InsertSliceOp>(op))
	return insertSliceOps.push_back(insertSliceOp);
	// No tensors => no buffers.			// No tensors => no buffers.
	if (none_of(op->getOperandTypes(), isaTensor) &&			if (none_of(op->getOperandTypes(), isaTensor) &&
	none_of(op->getResultTypes(), isaTensor))			none_of(op->getResultTypes(), isaTensor))
	return;			return;
	otherOps.push_back(op);			ops.push_back(op);
	});			});

	// First, analyze InsertSliceOp greedily: we almost never want to bufferize
	// the tensor "inserted into" to become out-of-place. This implementation
	// does not distinguish between different InsertSliceOp. If we want
	// finer-grained behavior, we could order the InsertSliceOp with some metric.
	for (InsertSliceOp insertSliceOp : reverse(insertSliceOps)) {
	OpOperand &destOpOperand = insertSliceOp->getOpOperand(1);
	if (failed(bufferizableInPlaceAnalysis(
	destOpOperand, getInplaceableOpResult(destOpOperand), aliasInfo,
	domInfo)))
	return failure();
	}

	// Walk ops in reverse for better interference analysis.			// Walk ops in reverse for better interference analysis.
	for (Operation *op : reverse(otherOps)) {			for (Operation *op : reverse(ops)) {
	for (OpOperand &opOperand : op->getOpOperands()) {			for (OpOperand &opOperand : op->getOpOperands()) {
	if (OpResult result = getInplaceableOpResult(opOperand))			if (OpResult result = getInplaceableOpResult(opOperand))
	if (result.getType().isa<TensorType>() &&			if (result.getType().isa<TensorType>() &&
	failed(bufferizableInPlaceAnalysis(opOperand, result, aliasInfo,			failed(bufferizableInPlaceAnalysis(opOperand, result, aliasInfo,
	domInfo)))			domInfo)))
	return failure();			return failure();
	}			}
	// Special logic to analyze ExtractSliceOp.			// Special logic to analyze ExtractSliceOp.

mlir/test/Dialect/Linalg/comprehensive-module-bufferize-analysis.mlir

%r = linalg.matmul
outs(%arg2 : tensor<256x256xf32>) -> tensor<256x256xf32>		outs(%arg2 : tensor<256x256xf32>) -> tensor<256x256xf32>

return %r : tensor<256x256xf32>		return %r : tensor<256x256xf32>
}		}

// -----		// -----

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
		// Chain of tensor.insert_slice is better traversed in reverse order without
		// prioritizing the tensor.insert_slice ops.
		//===----------------------------------------------------------------------===//

		func @insert_slice_chain(
		%v1: vector<32x90xf32>,
		%v2: vector<30x90xf32>,
		%arg0: tensor<62x126xf32> {linalg.buffer_layout = affine_map<(d0, d1) -> (d0, d1)>, linalg.inplaceable = false},
		%arg1: tensor<126x90xf32> {linalg.buffer_layout = affine_map<(d0, d1) -> (d0, d1)>, linalg.inplaceable = false},
		%arg2: tensor<62x90xf32> {linalg.buffer_layout = affine_map<(d0, d1) -> (d0, d1)>, linalg.inplaceable = true})
		-> tensor<62x90xf32> attributes {passthrough = [["target-cpu", "skylake-avx512"], ["prefer-vector-width", "512"]]}
		{
		%c0 = constant 0 : index
		%cst = constant 0.000000e+00 : f32

		// CHECK: linalg.fill
		// CHECK-SAME: {__inplace_results_attr__ = ["true"]
		%0 = linalg.fill(%cst, %arg2) : f32, tensor<62x90xf32> -> tensor<62x90xf32>

		// CHECK: tensor.extract_slice
		// CHECK-SAME: {__inplace_results_attr__ = ["false"]
		// TODO: in order to have this extract_slice bufferize inplace, we need to write a range
		// analysis and determine that intersection([0, 32)x[0, 90), [32, 62)x[0, 90)) is empty.
		%2 = tensor.extract_slice %0[0, 0] [32, 90] [1, 1] : tensor<62x90xf32> to tensor<32x90xf32>
		// CHECK: vector.transfer_write
		// CHECK-SAME: {__inplace_results_attr__ = ["true"]
		%7 = vector.transfer_write %v1, %2[%c0, %c0] {in_bounds = [true, true]} : vector<32x90xf32>, tensor<32x90xf32>
		// CHECK: tensor.insert_slice
		// CHECK-SAME: {__inplace_results_attr__ = ["true"]
		%8 = tensor.insert_slice %7 into %0[0, 0] [32, 90] [1, 1] : tensor<32x90xf32> into tensor<62x90xf32>

		// CHECK: tensor.extract_slice
		// CHECK-SAME: {__inplace_results_attr__ = ["true"]
		%10 = tensor.extract_slice %8[32, 0] [30, 90] [1, 1] : tensor<62x90xf32> to tensor<30x90xf32>
		// CHECK: vector.transfer_write
		// CHECK-SAME: {__inplace_results_attr__ = ["true"]
		%14 = vector.transfer_write %v2, %10[%c0, %c0] {in_bounds = [true, true]} : vector<30x90xf32>, tensor<30x90xf32>
		// CHECK: tensor.insert_slice
		// CHECK-SAME: {__inplace_results_attr__ = ["true"]
		%15 = tensor.insert_slice %14 into %8[32, 0] [30, 90] [1, 1] : tensor<30x90xf32> into tensor<62x90xf32>

		return %15 : tensor<62x90xf32>
		}

		// -----

		//===----------------------------------------------------------------------===//
// Insert point issue cases.		// Insert point issue cases.
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

// Only test IR validity wrt dominance.		// Only test IR validity wrt dominance.
// CHECK-LABEL: func @ip		// CHECK-LABEL: func @ip
func @ip(%t: tensor<10x20xf32> {linalg.inplaceable = true},		func @ip(%t: tensor<10x20xf32> {linalg.inplaceable = true},
%x: index, %y: index, %v: vector<5x6xf32>)		%x: index, %y: index, %v: vector<5x6xf32>)
-> tensor<10x20xf32>		-> tensor<10x20xf32>

mlir/test/Dialect/Linalg/comprehensive-module-bufferize.mlir

func @insert_slice_fun_not_inplace(%A : tensor<?xf32>, %t : tensor<4xf32>)
// CHECK: linalg.copy(%[[t]], %[[SV]]) : memref<4xf32, #map>, memref<4xf32>		// CHECK: linalg.copy(%[[t]], %[[SV]]) : memref<4xf32, #map>, memref<4xf32>
// CHECK: memref.dealloc %[[ALLOC]] : memref<?xf32>		// CHECK: memref.dealloc %[[ALLOC]] : memref<?xf32>
%r0 = tensor.insert_slice %t into %A[0][4][1] : tensor<4xf32> into tensor<?xf32>		%r0 = tensor.insert_slice %t into %A[0][4][1] : tensor<4xf32> into tensor<?xf32>

// CHECK: return %{{.*}} : memref<?xf32>		// CHECK: return %{{.*}} : memref<?xf32>
return %r0: tensor<?xf32>		return %r0: tensor<?xf32>
}		}

// -----

// CHECK-DAG: #[[$map_1d_dyn:.*]] = affine_map<(d0)[s0, s1] -> (d0 * s1 + s0)>

// CHECK-LABEL: func @insert_slice_fun_not_inplace
// CHECK-SAME: %[[A:[a-zA-Z0-9]*]]: memref<?xf32, #[[$map_1d_dyn]]>
// CHECK-SAME: %[[t:[a-zA-Z0-9]*]]: memref<4xf32, #[[$map_1d_dyn]]>
func @insert_slice_fun_not_inplace(%A : tensor<?xf32> {linalg.inplaceable = true}, %t : tensor<4xf32>)
-> (tensor<?xf32>, tensor<?xf32>)
{
%f0 = constant 0.0 : f32

// tensor.insert_slice is bufferized first, %A is inplaceable so we can make this inplace
// CHECK-DAG: %[[SV_A:.*]] = memref.subview %[[A]][0] [4] [1] : memref<?xf32, {{.*}}> to memref<4xf32, {{.*}}>
// CHECK-DAG: linalg.copy(%[[t]], %[[SV_A]]) : memref<4xf32, {{.*}}>, memref<4xf32, {{.*}}>
%r0 = tensor.insert_slice %t into %A[0][4][1] : tensor<4xf32> into tensor<?xf32>

// fill would interfere with %r0 that is also being returned.
// So we need to bufferize it out of place and make a new alloc.
// CHECK-DAG: %[[ALLOC:.*]] = memref.alloc({{.*}}) {alignment = 128 : i64} : memref<?xf32>
// CHECK: linalg.fill(%{{.*}}, %[[ALLOC]]
%r1 = linalg.fill(%f0, %A) : f32, tensor<?xf32> -> tensor<?xf32>

// CHECK: memref.dealloc %[[ALLOC]] : memref<?xf32>
// CHECK: return %[[ALLOC]] : memref<?xf32>
return %r1, %r0: tensor<?xf32>, tensor<?xf32>
}

//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//
// Simple loop cases		// Simple loop cases
//===----------------------------------------------------------------------===//		//===----------------------------------------------------------------------===//

// -----		// -----

// CHECK-DAG: #[[$map_1d_dyn:.*]] = affine_map<(d0)[s0, s1] -> (d0 * s1 + s0)>		// CHECK-DAG: #[[$map_1d_dyn:.*]] = affine_map<(d0)[s0, s1] -> (d0 * s1 + s0)>
