This is an archive of the discontinued LLVM Phabricator instance.

Differential D148408

[mlir][tensor][bufferize] Fix dealloc placement in scf.forall op
ClosedPublic

Authored by springerm on Apr 14 2023, 8:20 PM.

Details

Reviewers

ftynse
nicolasvasilache

Commits

rG7c06f63176da: [mlir][tensor][bufferize] Fix dealloc placement in scf.forall op

Summary

The terminator of this op is special: it does not just yield a value,
but bufferizes to a memcpy. This requires special treatment to make sure
that deallocs are placed after the memcpy. (By default, deallocs are
placed right before the terminator.)
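
As an illustration only (not part of the patch), the ordering problem can be reproduced with a toy model in which a block is a list of op names. Everything here is hypothetical: `Block`, `insertBeforeTerminator`, `moveDeallocBeforeTerminator`, `indexOf` and `buildBlock` are made-up names, and no MLIR API is used. The sketch assumes that the dealloc is inserted before the terminator first and that the memcpy is later inserted before the terminator as well, which leaves the memcpy after the dealloc until the dealloc is moved again.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy stand-in for a block: an ordered list of op names whose last
// element plays the role of the terminator. Hypothetical, not MLIR.
using Block = std::vector<std::string>;

// Inserts `op` immediately before the terminator (the last element).
void insertBeforeTerminator(Block &block, const std::string &op) {
  assert(!block.empty() && "block must end with a terminator");
  block.insert(block.end() - 1, op);
}

// Moves the first "dealloc" (if any) so that it again sits right before
// the terminator, i.e. after anything inserted there in the meantime.
void moveDeallocBeforeTerminator(Block &block) {
  auto it = std::find(block.begin(), block.end(), "dealloc");
  if (it == block.end())
    return;
  block.erase(it);
  insertBeforeTerminator(block, "dealloc");
}

// Position of `op` in the block; returns block.size() if it is absent.
std::size_t indexOf(const Block &block, const std::string &op) {
  return static_cast<std::size_t>(
      std::find(block.begin(), block.end(), op) - block.begin());
}

// Builds the toy block, optionally applying the reordering step.
Block buildBlock(bool applyFix) {
  Block block = {"alloc", "store", "in_parallel"};
  // Default strategy: the dealloc goes right before the terminator.
  insertBeforeTerminator(block, "dealloc");
  // Bufferizing the terminator's contents later adds a memcpy before it.
  insertBeforeTerminator(block, "memcpy");
  if (applyFix)
    moveDeallocBeforeTerminator(block);
  return block;
}
```

Calling `buildBlock(false)` leaves the dealloc ahead of the memcpy, while `buildBlock(true)` yields a block in which the memcpy runs before the buffer is freed.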

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

springerm created this revision. (Apr 14 2023, 8:20 PM)

Herald added a project: Restricted Project. (Apr 14 2023, 8:20 PM)

Herald added subscribers: bviyer, hanchung, Moerafaat and 22 others.

springerm requested review of this revision. (Apr 14 2023, 8:20 PM)

Herald added a reviewer: nicolasvasilache. (Apr 14 2023, 8:20 PM)

Herald added a project: Restricted Project.

Herald added subscribers: stephenneuendorffer, nicolasvasilache.

springerm mentioned this in D147790: [mlir] [bufferization] Fix dealloc errors. (Apr 14 2023, 8:22 PM)

Harbormaster completed remote builds in B225798: Diff 513837. (Apr 14 2023, 8:34 PM)

ftynse accepted this revision. (Apr 15 2023, 3:00 AM)

This revision is now accepted and ready to land. (Apr 15 2023, 3:00 AM)

This patch fixes the dealloc placement error, but I still have some questions:

The 'memref.dealloc' is generated while bufferizing the 'scf.forall' operation. It seems to me that the placement error is caused by an illegal insertion there, yet other ops are expected to fix it. Is this over-coupling?
If we accept this approach, any operation contained in the terminator region has to do the same thing as tensor.parallel_insert_slice when it bufferizes. Should we abstract this behavior into an interface or something similar?

I have only just started contributing to the LLVM community, so I may have missed some important rationale. Thank you in advance for any help and explanation.

In reply to @cxy-1993 (D148408#4271164):

A bit of background: This bug was caused by our buffer deallocation strategy, which is currently too simple: it always places deallocs at the end of a block, right before the terminator. This worked well until we added scf.forall, the first op whose terminator does more than just yield a value.

You are right, this is not an ideal solution because the root problem is the original placement of the dealloc. We plan to revamp the buffer deallocation logic this summer: One-Shot Bufferize would no longer insert any deallocations. This would be done by a separate pass. The change here is the smallest one that fixes the issue for now, without writing a bunch of code that will be deleted in a month.

Contributions are always welcome, so please keep them coming :)
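
The thread does not say how the planned separate pass would decide where deallocations go. Purely as a sketch of the decoupling idea, and assuming a simple "free right after the last use in the block" policy that is not taken from the discussion, a toy version could look like the following. `Op`, `Block`, `bufferizeWithoutDeallocs` and `insertDeallocAfterLastUse` are hypothetical names and no MLIR API is involved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy op: a name plus the buffer it touches (empty if none).
// Hypothetical stand-in, not an MLIR type.
struct Op {
  std::string name;
  std::string buffer;
};
using Block = std::vector<Op>;

// Stand-in for bufferization that allocates and uses a buffer but
// never frees it; the last element plays the role of the terminator.
Block bufferizeWithoutDeallocs() {
  return {{"alloc", "%buf"},
          {"store", "%buf"},
          {"memcpy", "%buf"},
          {"in_parallel", ""}};
}

// Stand-in for a separate deallocation step: insert a dealloc right
// after the last op in the block that touches `buffer`. Assumes the
// terminator itself never touches the buffer (true in this toy model).
Block insertDeallocAfterLastUse(Block block, const std::string &buffer) {
  std::size_t lastUse = block.size();
  for (std::size_t i = 0; i < block.size(); ++i)
    if (block[i].buffer == buffer)
      lastUse = i;
  if (lastUse == block.size())
    return block; // The buffer is not used in this block.
  block.insert(block.begin() + static_cast<std::ptrdiff_t>(lastUse + 1),
               Op{"dealloc", buffer});
  return block;
}
```

Because the dealloc position is derived from the uses that already exist when the second step runs, the memcpy produced for the terminator is taken into account without any op-specific reordering.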

Closed by commit rG7c06f63176da: [mlir][tensor][bufferize] Fix dealloc placement in scf.forall op (authored by springerm). (Apr 15 2023, 5:49 PM)

This revision was automatically updated to reflect the committed changes.

springerm added a commit: rG7c06f63176da: [mlir][tensor][bufferize] Fix dealloc placement in scf.forall op.

In reply to @springerm (D148408#4271629):

Thank you for your patient explanation. Inserting deallocations in a separate pass is a better approach.
By the way, how can I get involved and contribute to the refactoring?

Revision Contents

Path                                                                 Size
mlir/lib/Dialect/Tensor/Transforms/BufferizableOpInterfaceImpl.cpp   15 lines
mlir/test/Dialect/Tensor/one-shot-bufferize.mlir                     30 lines

Diff 513952

mlir/lib/Dialect/Tensor/Transforms/BufferizableOpInterfaceImpl.cpp

[earlier lines not shown; nearest context: Value subview = rewriter.create<memref::SubViewOp>(]
         parallelInsertSliceOp.getMixedSizes(),
         parallelInsertSliceOp.getMixedStrides());

     // This memcpy will fold away if everything bufferizes in-place.
     if (failed(options.createMemCpy(rewriter, parallelInsertSliceOp.getLoc(),
                                     *srcBuffer, subview)))
       return failure();

+    // In case the source was allocated in the same block, make sure that the
+    // deallocation op (if any) appears after the memcpy. By default, deallocs
+    // are placed before the terminator, but this does not work for ForallOp
+    // because the terminator does more than just yielding a value.
+    //
+    // Note: This is not a problem for the destination buffer because these are
+    // assumed to always bufferize in-place.
+    for (Operation *user : srcBuffer->getUsers()) {
+      if (hasEffect<MemoryEffects::Free>(user)) {
+        if (user->getBlock() == parallelCombiningParent->getBlock())
+          user->moveBefore(user->getBlock()->getTerminator());
+        break;
+      }
+    }
+
     // Delete the op.
     rewriter.eraseOp(op);
     return success();
   }

   bool isNotConflicting(Operation *op, OpOperand *uRead,
                         OpOperand *uConflictingWrite,
                         const AnalysisState &state) const {
[33 more lines not shown]

mlir/test/Dialect/Tensor/one-shot-bufferize.mlir

[earlier lines not shown]
 func.func @insert_slice_full_overwrite(%t: tensor<10xf32>, %b: tensor<10xf32>) -> tensor<10xf32> {
   %2 = tensor.insert_slice %b into %t[0][10][1] : tensor<10xf32> into tensor<10xf32>
   return %2 : tensor<10xf32>
 }

 // -----

 // CHECK-LABEL: func @dim_not_reading(
 // CHECK-SAME: %[[t:.*]]: memref<?xf32
 func.func @dim_not_reading(%t: tensor<?xf32>, %f: f32, %pos: index)
     -> (tensor<?xf32>, index)
 {
   %c0 = arith.constant 0 : index
   // CHECK-NOT: memref.alloc
   // CHECK-NOT: memref.copy
   // CHECK: memref.store %{{.*}}, %[[t]]
   %0 = tensor.insert %f into %t[%pos] : tensor<?xf32>
   // CHECK: memref.dim %[[t]]
[18 lines not shown; nearest context: func.func @cast_retains_buffer_layout(]
   %casted = tensor.cast %t : tensor<?xf32> to tensor<10xf32>
   %slice = tensor.extract_slice %casted[2][%sz][1] : tensor<10xf32> to tensor<?xf32>

   // Note: The %casted return type is folded away because both buffers are
   // equivalent. Therefore, we currently loose some static type information
   // in the caller.
   return %casted, %slice : tensor<10xf32>, tensor<?xf32>
 }
+
+// -----
+
+// CHECK-LABEL: func.func @parallel_insert_slice_source_out_of_place
+func.func @parallel_insert_slice_source_out_of_place(%in: tensor<1xf32>, %out: tensor<100xf32>, %f: f32) {
+  %c0 = arith.constant 0 : index
+  %c1 = arith.constant 1 : index
+  %num_threads = arith.constant 50 : index
+
+  // CHECK: scf.forall {{.*}} {
+  %result = scf.forall (%thread_idx) in (%num_threads) shared_outs (%o = %out) -> tensor<100xf32> {
+    // The tensor.insert must bufferize out-of-place.
+    // CHECK: memref.alloc
+    // CHECK: memref.store
+    %insert = tensor.insert %f into %in[%c0] : tensor<1xf32>
+    %r = tensor.extract %in[%c0] : tensor<1xf32>
+    vector.print %r : f32
+
+    // CHECK: memref.copy
+    // CHECK: memref.dealloc
+    scf.forall.in_parallel {
+      tensor.parallel_insert_slice %insert into %o[%thread_idx][1][1] :
+          tensor<1xf32> into tensor<100xf32>
+    }
+  }
+  // CHECK: }
+  return
+}