The static shape information is known in the pattern because it restricts
the outer dims to all be 1s. In this context, inserting a tensor.cast op
is safe, and it gives other analyses more information.
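For illustration, a minimal sketch of the kind of cast this enables (hypothetical %src and sizes, assuming the pattern has already proven the data is exactly one 8x1 inner tile):

  // Hypothetical %src: all outer dims are 1, so the meaningful data is
  // known to be exactly 8x1; refining the type with a cast is safe.
  %cast = tensor.cast %src : tensor<?x?xf32> to tensor<8x1xf32>
  %tile = tensor.extract_slice %cast[0, 0] [8, 1] [1, 1]
      : tensor<8x1xf32> to tensor<8x1xf32>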
Details
- Reviewers
mravishankar chelini nicolasvasilache
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
This seems strange to me. There is nothing that prevents a subsequent transformation from collapsing the cast into the slice. Indeed, it is probably better to do so.
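For example (my own sketch of the fold being suggested), the cast can be absorbed by giving the slice a static result type directly:

  // Before folding: a cast refines the type, then the slice is taken.
  %cast = tensor.cast %src : tensor<?x?xf32> to tensor<8x1xf32>
  %tile = tensor.extract_slice %cast[0, 0] [8, 1] [1, 1]
      : tensor<8x1xf32> to tensor<8x1xf32>
  // After folding: the slice reads %src directly and carries the static
  // result type itself, so the cast disappears.
  %tile_folded = tensor.extract_slice %src[0, 0] [8, 1] [1, 1]
      : tensor<?x?xf32> to tensor<8x1xf32>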
Let me add some more context, and we can think about how to handle it correctly. The problem comes from generic + tensor.pack vectorization. The vectorization flow is tile+fuse, with tile sizes that make the outer dims all ones. E.g.,
%6 = scf.for %arg0 = %c0_0 to %c16 step %c1
  %7 = scf.for %arg2 = %c0_0 to %c384 step %c1
    %21 = linalg.generic {
        indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                         affine_map<(d0, d1) -> (d0, d1)>],
        iterator_types = ["parallel", "parallel"]}
        ins(%extracted_slice : tensor<?x?xf32>)
        outs(%extracted_slice_5 : tensor<?x?xf32>) {
      ...
    } -> tensor<?x?xf32>
    %pack = tensor.pack %21 inner_dims_pos = [0, 1] inner_tiles = [8, 1]
        into %extracted_slice_15 : tensor<?x?xf32> -> tensor<1x1x8x1xf32>
    ...
Then we generalize the %pack op and kick in the generic vectorizer. The issue actually happens at generalization: it converts the pack op into extract_slice + transpose + insert_slice. The IR after generalization:
%5 = scf.for %arg0 = %c0 to %c16 step %c1 iter_args(%arg1 = %2) -> (tensor<16x384x8x1xf32>) {
  %6 = scf.for %arg2 = %c0 to %c384 step %c1 iter_args(%arg3 = %arg1) -> (tensor<16x384x8x1xf32>) {
    %7 = affine.min affine_map<(d0) -> (d0 * -8 + 128, 8)>(%arg0)
    %8 = affine.min affine_map<(d0) -> (-d0 + 384, 1)>(%arg2)
    %9 = affine.apply affine_map<(d0) -> (d0 * 8)>(%arg0)
    %10 = affine.apply affine_map<(d0) -> (d0 * 8)>(%arg0)
    %extracted_slice = tensor.extract_slice %3[%9, %arg2] [%7, %8] [1, 1] : tensor<128x384xf32> to tensor<?x?xf32>
    %extracted_slice_0 = tensor.extract_slice %4[%10, %arg2] [%7, %8] [1, 1] : tensor<128x384xf32> to tensor<?x?xf32>
    %11 = linalg.generic {
        indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>,
                         affine_map<(d0, d1) -> (d0, d1)>],
        iterator_types = ["parallel", "parallel"]}
        ins(%extracted_slice : tensor<?x?xf32>)
        outs(%extracted_slice_0 : tensor<?x?xf32>) {
    ^bb0(%in: f32, %out: f32):
      %13 = arith.addf %in, %in : f32
      linalg.yield %13 : f32
    } -> tensor<?x?xf32>
    %extracted_slice_1 = tensor.extract_slice %arg3[%arg0, %arg2, 0, 0] [1, 1, 8, 1] [1, 1, 1, 1] : tensor<16x384x8x1xf32> to tensor<1x1x8x1xf32>
    %extracted_slice_2 = tensor.extract_slice %11[0, 0] [8, 1] [1, 1] : tensor<?x?xf32> to tensor<8x1xf32>
    %12 = tensor.empty() : tensor<8x1xf32>
    %transposed = linalg.transpose ins(%extracted_slice_2 : tensor<8x1xf32>) outs(%12 : tensor<8x1xf32>) permutation = [0, 1]
    %inserted_slice = tensor.insert_slice %transposed into %extracted_slice_1[0, 0, 0, 0] [1, 1, 8, 1] [1, 1, 1, 1] : tensor<8x1xf32> into tensor<1x1x8x1xf32>
    %inserted_slice_3 = tensor.insert_slice %inserted_slice into %arg3[%arg0, %arg2, 0, 0] [1, 1, 8, 1] [1, 1, 1, 1] : tensor<1x1x8x1xf32> into tensor<16x384x8x1xf32>
    scf.yield %inserted_slice_3 : tensor<16x384x8x1xf32>
  }
If we look into the IR, we find that the generic op still has dynamic shapes. The extract_slice op stops the shape propagation from the transpose back to the generic op, so the generic op keeps its dynamic shape, and vectorization fails on it.
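Boiled down (simplified from the IR above), the blocker looks like this: the static 8x1 type lives only on the slice's result and never reaches the generic op:

  // The consumer knows the data is 8x1, but that knowledge sits on the
  // extract_slice's result type; nothing forces %generic to go static.
  %generic = linalg.generic ... ins(%in : tensor<?x?xf32>)
      outs(%out : tensor<?x?xf32>) { ... } -> tensor<?x?xf32>
  %tile = tensor.extract_slice %generic[0, 0] [8, 1] [1, 1]
      : tensor<?x?xf32> to tensor<8x1xf32>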
In the regular tile+fuse of linalg ops + vectorization, there are direct dependences between the linalg ops, and the linalg canonicalization pattern (InferStaticShapeOfOperands) can infer static shapes for the operands. It basically inserts tensor.cast ops around the linalg ops and folds them into the producers. That's why I'm thinking about inserting some known information (like a tensor.cast op) before the extract_slice op. It is a valid insertion because the input is either aligned or padded. Once the cast op is inserted, it can be used to infer static shapes for the generic op and make vectorization work.
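Sketching the proposed insertion on the reduced example above (valid only under the aligned/padded assumption):

  // Refine the generic's result with a cast before slicing. Shape
  // inference (e.g. InferStaticShapeOfOperands) can then fold the cast
  // into %generic and rewrite it with static operand types.
  %cast = tensor.cast %generic : tensor<?x?xf32> to tensor<8x1xf32>
  %tile = tensor.extract_slice %cast[0, 0] [8, 1] [1, 1]
      : tensor<8x1xf32> to tensor<8x1xf32>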
After writing this down, I realized that we still can't vectorize the generic op if it is not aligned to the inner tile sizes. Hopefully the vector masking trick or whole-program data-layout propagation can handle that better; it's a separate issue. (We can chat offline if we need more bandwidth.)
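For reference, a rough sketch of what the masked read could look like (illustrative names and sizes only; not part of this patch):

  // Read a dynamically sized tile as a fixed 8x1 vector under a mask.
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %pad = arith.constant 0.0 : f32
  %d0 = tensor.dim %t, %c0 : tensor<?x?xf32>
  %d1 = tensor.dim %t, %c1 : tensor<?x?xf32>
  %mask = vector.create_mask %d0, %d1 : vector<8x1xi1>
  %v = vector.mask %mask {
    vector.transfer_read %t[%c0, %c0], %pad
        : tensor<?x?xf32>, vector<8x1xf32>
  } : vector<8x1xi1> -> vector<8x1xf32>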