This is an archive of the discontinued LLVM Phabricator instance.

[mlir][linalg] Do not emit FillOp for tensor.pad with zero padding
ClosedPublic

Authored by springerm on Jun 27 2023, 7:24 AM.

Details

Summary

There is no need to fill the destination buffer if no padding is added, i.e., when the tensor.pad is packing-only.

Also improve tensor::createPadHighOp to use attributes instead of SSA values for high padding sizes when possible.
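For illustration, a hypothetical sketch (not taken from this revision's diff) of a packing-only tensor.pad: both low and high padding are zero, and the op is kept alive only by its nofold attribute, so filling the destination with the padding value is unnecessary:

```mlir
// Packing-only pad: zero low/high padding, kept by `nofold`.
// Before this change, lowering to destination-passing style still
// emitted a linalg.fill of %cst into the new allocation even though
// no padded elements exist; with this change the fill is skipped.
%cst = arith.constant 0.0 : f32
%packed = tensor.pad %src nofold low[0, 0] high[0, 0] {
^bb0(%i: index, %j: index):
  tensor.yield %cst : f32
} : tensor<8x16xf32> to tensor<8x16xf32>
```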

Depends On: D153555

Diff Detail

Event Timeline

springerm created this revision.Jun 27 2023, 7:24 AM
Herald added a project: Restricted Project.Jun 27 2023, 7:24 AM
springerm requested review of this revision.Jun 27 2023, 7:24 AM
mravishankar requested changes to this revision.Jun 28 2023, 9:15 PM
mravishankar added inline comments.
mlir/lib/Dialect/Linalg/Transforms/ConvertToDestinationStyle.cpp
183–188

This seems strange. Why is there even a pad op if we have neither low pad nor high pad? It seems like the pad should be folded away.

This revision now requires changes to proceed.Jun 28 2023, 9:15 PM
springerm marked an inline comment as done.Jun 29 2023, 2:42 AM
springerm added inline comments.
mlir/lib/Dialect/Linalg/Transforms/ConvertToDestinationStyle.cpp
183–188

This is used for "packing" tensor.pad ops, i.e., tensor.pad ops that do not necessarily add any padding, but where a new buffer allocation is required (e.g., in shared memory on GPU). Such tensor.pad ops have a nofold attribute.

If `nofold` is set, the padding operation will not be folded away even
if the source type and the padded type have the same static shape. This can
be used, e.g., for packing or promotion to faster memory.

I'd like to remove the nofold attribute and this kind of use of tensor.pad, but it's used in quite a few places, so this will still take a while. The infrastructure that I've been adding in ConvertToDestinationStyle.cpp (in particular bufferizeToAllocation) over the past few weeks will bring us closer to that. The idea is that we generate a memref allocation right away and do not even create a tensor.pad if there's nothing to pad.
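The direction described above could look roughly like the following hypothetical IR (a sketch of the intended replacement, not code from this patch): instead of a packing-only tensor.pad with nofold, the allocation is materialized directly and the source is copied into it.

```mlir
// Hypothetical replacement for a packing-only tensor.pad with nofold:
// allocate the target buffer directly (e.g., in GPU shared/workgroup
// memory) and copy the source into it; no pad op, no fill.
%alloc = memref.alloc() : memref<8x16xf32, #gpu.address_space<workgroup>>
memref.copy %src_buf, %alloc
    : memref<8x16xf32> to memref<8x16xf32, #gpu.address_space<workgroup>>
```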

springerm marked an inline comment as done.Jun 29 2023, 2:45 AM
springerm added inline comments.
mlir/lib/Dialect/Linalg/Transforms/ConvertToDestinationStyle.cpp
183–188

To clarify: We currently use tensor.pad with nofold to force a shared memory allocation (or any allocation in general). I'm working on using a different abstraction instead (memref.alloc). We could not do this until recently because bufferization of mixed tensor/memref IR was not supported.

mravishankar resigned from this revision.Jul 4 2023, 8:45 AM
This revision now requires review to proceed.Jul 4 2023, 8:45 AM
This revision is now accepted and ready to land.Jul 4 2023, 9:22 AM