This is an archive of the discontinued LLVM Phabricator instance.

Differential D159008

[mlir][MemRef] Make `getDroppedDims` on `memref.subview` account for preserved unit dimensions.
Needs Review · Public

Authored by mravishankar on Aug 28 2023, 11:16 AM.

Details

Reviewers

nicolasvasilache
springerm

Summary

When computing dropped dimensions, if any unit-dimension is preserved,
then that should not be accounted for in the dropped dimensions.

The side-effect of this is that there is some ambiguity in the
dimensions being dropped sometimes. It seems like the convention is
that the outermost dimensions are preferred to be dropped.

Fixes #60091
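
The rule described in the summary can be tried outside of MLIR. What follows is a minimal standalone C++ sketch, not the patch itself. `Size`, `kDynamic`, and `droppedDimsSketch` are hypothetical names, `std::optional<int64_t>` stands in for `OpFoldResult`, and the guard before the decrement is an addition for this standalone version. It walks the sizes from innermost to outermost and only marks a static unit size as dropped when the result shape has no unit dimension left to match it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Assumption: stand-in for a dynamic extent ("?") in a result shape.
constexpr int64_t kDynamic = std::numeric_limits<int64_t>::min();

// Assumption: stand-in for OpFoldResult. A value is a static size,
// std::nullopt is a dynamic size (an SSA value in real MLIR).
using Size = std::optional<int64_t>;

// Returns one flag per source dimension; true means "dropped".
std::vector<bool> droppedDimsSketch(const std::vector<Size> &sizes,
                                    const std::vector<int64_t> &reducedShape) {
  assert(reducedShape.size() <= sizes.size());
  std::vector<bool> dropped(sizes.size(), false);
  std::size_t reducedPos = reducedShape.size();
  // Walk from the innermost dimension to the outermost one.
  for (std::size_t index = sizes.size(); index-- > 0;) {
    bool isStaticUnit = sizes[index].has_value() && *sizes[index] == 1;
    bool resultKeepsUnit =
        reducedPos > 0 && reducedShape[reducedPos - 1] == 1;
    if (isStaticUnit && !resultKeepsUnit) {
      // No unit dimension in the result to pair with: treat as dropped.
      dropped[index] = true;
      continue;
    }
    // This source dimension is paired with the current result dimension.
    if (reducedPos > 0) // Guard added for the sketch to avoid underflow.
      --reducedPos;
  }
  return dropped;
}
```

With sizes `[1, ?, 1]` and a `1x?` result, only the innermost dimension is marked as dropped. With sizes `[1, 1, 1]` and a `1x1` result the choice is ambiguous, and this walk settles on the outermost dimension.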

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

mravishankar created this revision. Aug 28 2023, 11:16 AM

mravishankar requested review of this revision. Aug 28 2023, 11:16 AM

@springerm This fixes https://github.com/llvm/llvm-project/issues/60091 , but it seems like we have some ambiguity in which dimensions get dropped when some unit-dimensions are preserved. From tests it seems like the preference is to have the outer dimensions dropped. This should be recorded somewhere. Suggestions?

Harbormaster completed remote builds in B255268: Diff 553999. Aug 28 2023, 11:44 AM

Apart from some not-so-great lit tests, this change seems fine. https://github.com/openxla/iree/pull/14851 tests this in IREE and no errors from this.

In D159008#4622080, @mravishankar wrote:

@springerm This fixes https://github.com/llvm/llvm-project/issues/60091 , but it seems like we have some ambiguity in which dimensions get dropped when some unit-dimensions are preserved. From tests it seems like the preference is to have the outer dimensions dropped. This should be recorded somewhere. Suggestions?

It seems to me that it is impossible to implement this function correctly because we don't know which dimension corresponds to which stride. What we have right now is best effort but we found cases where we can't be sure.

An even bigger concern to me are dynamic strides. We should do the exact same thing regardless of whether something is static or dynamic. The only difference is whether it can be done at compile time or running time. That would mean that the dropped dims cannot be determined at compile time and this function would need to bail out in case of dynamic strides (which is likely a big problem for existing transformations). Either that or maintain a (non-dynamic) mapping from dims to strides in the MemRefType, which would replace the entire getNumOccurences-based logic. Then there would also be no ambiguities wrt. to unit strides anymore.
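
To make the stride concern concrete, here is a small standalone C++ sketch of an occurrence-counting disambiguation in the spirit of what this thread calls the getNumOccurences-based logic. It is written from the description in the discussion only; `countOccurrences`, `refineDroppedDims`, and `kDynamic` are hypothetical names, and the real code in MemRefOps.cpp may differ in details. A unit-size candidate counts as dropped only if its stride value appears more often in the source type than in the result type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <map>
#include <optional>
#include <vector>

// Assumption: stand-in for a dynamic stride ("?").
constexpr int64_t kDynamic = std::numeric_limits<int64_t>::min();

std::map<int64_t, unsigned> countOccurrences(const std::vector<int64_t> &vals) {
  std::map<int64_t, unsigned> counts;
  for (int64_t v : vals)
    ++counts[v];
  return counts;
}

// `candidates[i]` is true when source dimension i has static size 1 and might
// have been dropped. Returns the refined mask, or std::nullopt when the
// result strides cannot come from the source strides.
std::optional<std::vector<bool>>
refineDroppedDims(std::vector<bool> candidates,
                  const std::vector<int64_t> &originalStrides,
                  const std::vector<int64_t> &reducedStrides) {
  assert(candidates.size() == originalStrides.size());
  std::map<int64_t, unsigned> unaccounted = countOccurrences(originalStrides);
  std::map<int64_t, unsigned> inResult = countOccurrences(reducedStrides);
  for (std::size_t dim = 0; dim < candidates.size(); ++dim) {
    if (!candidates[dim])
      continue;
    int64_t stride = originalStrides[dim];
    unsigned have = unaccounted[stride];
    unsigned need = inResult[stride];
    if (have > need) {
      // A surplus copy of this stride exists: treat the dim as dropped.
      --unaccounted[stride];
      continue;
    }
    if (have == need) {
      // Every copy of this stride is still needed: the dim is preserved.
      candidates[dim] = false;
      continue;
    }
    return std::nullopt; // The result uses this stride more often than the source.
  }
  return candidates;
}
```

With static strides `[5, 6, 7]` reduced to `[5, 6]`, the sketch drops the last dimension. With `[?, ?, ?]` reduced to `[?, ?]`, every stride looks the same, so the first candidate is dropped purely because of iteration order. Nothing in the types distinguishes the two outcomes once the strides are dynamic.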

In D159008#4625297, @springerm wrote:

In D159008#4622080, @mravishankar wrote:

@springerm This fixes https://github.com/llvm/llvm-project/issues/60091 , but it seems like we have some ambiguity in which dimensions get dropped when some unit-dimensions are preserved. From tests it seems like the preference is to have the outer dimensions dropped. This should be recorded somewhere. Suggestions?

It seems to me that it is impossible to implement this function correctly because we don't know which dimension corresponds to which stride. What we have right now is best effort but we found cases where we can't be sure.

We need to change the op definition of operations that allow "rank-reducing" so that they explicitly take the dropped dimensions. Without that it is going to be hard.

An even bigger concern to me are dynamic strides. We should do the exact same thing regardless of whether something is static or dynamic. The only difference is whether it can be done at compile time or running time. That would mean that the dropped dims cannot be determined at compile time and this function would need to bail out in case of dynamic strides (which is likely a big problem for existing transformations). Either that or maintain a (non-dynamic) mapping from dims to strides in the MemRefType, which would replace the entire getNumOccurences-based logic. Then there would also be no ambiguities wrt. to unit strides anymore.

It's hard to make this argument in the abstract. The issue really comes down to this: if you drop dimensions, then the corresponding static strides need to be dropped as well to be consistent, which is what the logic here is for. If it is dynamic, it actually doesn't matter. It's all ?s anyway. Whichever you drop, it is statically consistent, and things get adjusted automatically at runtime. In any case, that is an orthogonal issue. This change seems fine AFAICS.

I found a test case that breaks with this change (and passed before):

// RUN: mlir-opt %s -generate-runtime-verification -expand-strided-metadata

func.func @static_case(%arg0: memref<?x?x?xf32, strided<[5, 6, 7], offset: ?>>, %arg1: index, %idx: index) -> f32 {
  // Force the last dim to be rank-reduced by picking the first and second stride.
  %s = memref.subview %arg0[0, 0, 0] [1, 1, 1] [1, 1, 1] : memref<?x?x?xf32, strided<[5, 6, 7], offset: ?>> to memref<1x1xf32, strided<[5, 6], offset: ?>>
  %l = memref.load %s[%idx, %idx] : memref<1x1xf32, strided<[5, 6], offset: ?>>
  return %l : f32
}

It fails in the verifier:

/usr/local/google/_blaze_springerm/9abb62e345f5287adac38e0018bb89c9/execroot/google3/blaze-out/k8-dbg/bin/third_party/llvm/llvm-project/mlir/test/Integration/Dialect/Memref/cast-runtime-verification.mlir.test.runfiles/google3/third_party/llvm/llvm-project/mlir/test/Integration/Dialect/Memref/cast-runtime-verification.mlir:37:8: error: expected result type with stride = 6 instead of 5 in dim = 0
  %s = memref.subview %arg0[0, 0, 0] [1, 1, 1] [1, 1, 1] : memref<?x?x?xf32, strided<[5, 6, 7], offset: ?>> to memref<1x1xf32, strided<[5, 6], offset: ?>>
       ^
/usr/local/google/_blaze_springerm/9abb62e345f5287adac38e0018bb89c9/execroot/google3/blaze-out/k8-dbg/bin/third_party/llvm/llvm-project/mlir/test/Integration/Dialect/Memref/cast-runtime-verification.mlir.test.runfiles/google3/third_party/llvm/llvm-project/mlir/test/Integration/Dialect/Memref/cast-runtime-verification.mlir:37:8: note: see current operation: %1 = "memref.reinterpret_cast"(%0#0, %0#1) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808>, static_sizes = array<i64: 1, 1>, static_strides = array<i64: 6, 7>}> : (memref<f32>, index) -> memref<1x1xf32, strided<[5, 6], offset: ?>>

I found this test case while trying to construct two identical test cases whose only difference is whether the strides are dynamic or static, such that different dims are dropped depending on which it is. This would mean that by inserting static->dynamic stride casts, I can make getDroppedDims, computeMemRefRankReductionMask (and its callers) behave differently and circumvent the getNumOccurences-based logic.

For reference, this is the dynamic case:

func.func @dynamic_case(%arg0: memref<?x?x?xf32, strided<[?, ?, ?], offset: ?>>, %arg1: index, %idx: index) -> f32 {
  // Cannot decide rank-reduced dim based on strides, because they are all the same. Will fall back to "first dim is rank-reduced".
  %s = memref.subview %arg0[0, 0, 0] [1, 1, 1] [1, 1, 1] : memref<?x?x?xf32, strided<[?, ?, ?], offset: ?>> to memref<1x1xf32, strided<[?, ?], offset: ?>>
  %l = memref.load %s[%idx, %idx] : memref<1x1xf32, strided<[?, ?], offset: ?>>
  return %l : f32
}

Expands to:

func.func @dynamic_case(%arg0: memref<?x?x?xf32, strided<[?, ?, ?], offset: ?>>, %arg1: index, %arg2: index) -> f32 {
  %base_buffer, %offset, %sizes:3, %strides:3 = memref.extract_strided_metadata %arg0 : memref<?x?x?xf32, strided<[?, ?, ?], offset: ?>> -> memref<f32>, index, index, index, index, index, index, index
  %reinterpret_cast = memref.reinterpret_cast %base_buffer to offset: [%offset], sizes: [1, 1], strides: [%strides#1, %strides#2] : memref<f32> to memref<1x1xf32, strided<[?, ?], offset: ?>>
  %0 = memref.load %reinterpret_cast[%arg2, %arg2] : memref<1x1xf32, strided<[?, ?], offset: ?>>
  return %0 : f32
}

(This shows the strides do not get adjusted at runtime. We always pick the leading dimensions, regardless of what the runtime strides are.)
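
The static failure and the silent dynamic case can be reproduced with a simplified consistency check. This is a standalone C++ sketch under stated assumptions: `firstStrideMismatch` and `kDynamic` are hypothetical names, and treating a dynamic stride as compatible with anything is a simplification, not the actual MLIR verifier rule. It keeps the strides of the non-dropped dimensions and compares them with the strides declared in the result type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Assumption: stand-in for a dynamic stride ("?").
constexpr int64_t kDynamic = std::numeric_limits<int64_t>::min();

// Returns the first result dimension whose declared static stride disagrees
// with the stride kept from the source, or std::nullopt if none disagrees.
std::optional<std::size_t>
firstStrideMismatch(const std::vector<int64_t> &originalStrides,
                    const std::vector<bool> &dropped,
                    const std::vector<int64_t> &declaredStrides) {
  assert(originalStrides.size() == dropped.size());
  std::vector<int64_t> kept;
  for (std::size_t i = 0; i < originalStrides.size(); ++i)
    if (!dropped[i])
      kept.push_back(originalStrides[i]);
  assert(kept.size() == declaredStrides.size());
  for (std::size_t i = 0; i < kept.size(); ++i) {
    // Simplification: only two static strides can disagree.
    bool bothStatic = kept[i] != kDynamic && declaredStrides[i] != kDynamic;
    if (bothStatic && kept[i] != declaredStrides[i])
      return i;
  }
  return std::nullopt;
}
```

Dropping the outermost dimension of strides `[5, 6, 7]` keeps `[6, 7]`, which disagrees with the declared `[5, 6]` at dim 0. That is the same disagreement the verifier message reports. With all-dynamic strides the check passes for every choice of dropped dimension, so a wrong choice goes unnoticed at compile time.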

In D159008#4627836, @springerm wrote:

I found a test case that breaks with this change (and passed before):

// RUN: mlir-opt %s -generate-runtime-verification -expand-strided-metadata

func.func @static_case(%arg0: memref<?x?x?xf32, strided<[5, 6, 7], offset: ?>>, %arg1: index, %idx: index) -> f32 {
  // Force the last dim to be rank-reduced by picking the first and second stride.
  %s = memref.subview %arg0[0, 0, 0] [1, 1, 1] [1, 1, 1] : memref<?x?x?xf32, strided<[5, 6, 7], offset: ?>> to memref<1x1xf32, strided<[5, 6], offset: ?>>
  %l = memref.load %s[%idx, %idx] : memref<1x1xf32, strided<[5, 6], offset: ?>>
  return %l : f32
}

It fails in the verifier:

/usr/local/google/_blaze_springerm/9abb62e345f5287adac38e0018bb89c9/execroot/google3/blaze-out/k8-dbg/bin/third_party/llvm/llvm-project/mlir/test/Integration/Dialect/Memref/cast-runtime-verification.mlir.test.runfiles/google3/third_party/llvm/llvm-project/mlir/test/Integration/Dialect/Memref/cast-runtime-verification.mlir:37:8: error: expected result type with stride = 6 instead of 5 in dim = 0
  %s = memref.subview %arg0[0, 0, 0] [1, 1, 1] [1, 1, 1] : memref<?x?x?xf32, strided<[5, 6, 7], offset: ?>> to memref<1x1xf32, strided<[5, 6], offset: ?>>
       ^
/usr/local/google/_blaze_springerm/9abb62e345f5287adac38e0018bb89c9/execroot/google3/blaze-out/k8-dbg/bin/third_party/llvm/llvm-project/mlir/test/Integration/Dialect/Memref/cast-runtime-verification.mlir.test.runfiles/google3/third_party/llvm/llvm-project/mlir/test/Integration/Dialect/Memref/cast-runtime-verification.mlir:37:8: note: see current operation: %1 = "memref.reinterpret_cast"(%0#0, %0#1) <{operandSegmentSizes = array<i32: 1, 1, 0, 0>, static_offsets = array<i64: -9223372036854775808>, static_sizes = array<i64: 1, 1>, static_strides = array<i64: 6, 7>}> : (memref<f32>, index) -> memref<1x1xf32, strided<[5, 6], offset: ?>>

Good catch. Let me dig into this a bit, but it will be a while.

To reiterate, the only real way to fix all of this is to change the memref.subview operation to explicitly take the list of dropped dimensions (and not rely on an auto-magic inference, which IMO can always be tripped up).
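
As an illustration of why an explicit list removes the ambiguity, here is a standalone C++ sketch. `StridedLayout` and `dropDimsExplicitly` are hypothetical names, and nothing here is an existing or proposed MLIR API. Once the dropped dimensions are given, the result shape and strides follow mechanically, and the only thing left to verify is that each dropped dimension has size 1.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Assumption: a plain shape/strides pair standing in for a strided MemRefType.
struct StridedLayout {
  std::vector<int64_t> shape;
  std::vector<int64_t> strides;
};

// Returns the rank-reduced layout for an explicit dropped-dimension mask, or
// std::nullopt if the mask has the wrong length or names a non-unit dimension.
std::optional<StridedLayout>
dropDimsExplicitly(const StridedLayout &src, const std::vector<bool> &dropped) {
  assert(src.shape.size() == src.strides.size());
  if (dropped.size() != src.shape.size())
    return std::nullopt;
  StridedLayout result;
  for (std::size_t i = 0; i < src.shape.size(); ++i) {
    if (dropped[i]) {
      // Only unit dimensions may be dropped.
      if (src.shape[i] != 1)
        return std::nullopt;
      continue;
    }
    result.shape.push_back(src.shape[i]);
    result.strides.push_back(src.strides[i]);
  }
  return result;
}
```

For a `1x1x1` source with strides `[5, 6, 7]`, dropping the last dimension yields strides `[5, 6]` and dropping the first yields `[6, 7]`. The caller states which one is meant, so no inference from sizes or strides is needed.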

Revision Contents

Path                                                   Size
mlir/lib/Dialect/MemRef/IR/MemRefOps.cpp               19 lines
mlir/test/Dialect/MemRef/canonicalize.mlir             13 lines
mlir/test/Dialect/MemRef/fold-memref-alias-ops.mlir    14 lines

Diff 553999

mlir/lib/Dialect/MemRef/IR/MemRefOps.cpp

@@ ... @@
 /// using the strides. If a dimension is dropped the stride must be dropped too.
 static std::optional<llvm::SmallBitVector>
 computeMemRefRankReductionMask(MemRefType originalType, MemRefType reducedType,
                                ArrayRef<OpFoldResult> sizes) {
   llvm::SmallBitVector unusedDims(originalType.getRank());
   if (originalType.getRank() == reducedType.getRank())
     return unusedDims;

-  for (const auto &dim : llvm::enumerate(sizes))
-    if (auto attr = llvm::dyn_cast_if_present<Attribute>(dim.value()))
-      if (llvm::cast<IntegerAttr>(attr).getInt() == 1)
-        unusedDims.set(dim.index());
+  ArrayRef<int64_t> reducedShape = reducedType.getShape();
+  size_t reducedShapePos = reducedShape.size();
+  for (size_t ri = 0, re = sizes.size(); ri < re; ++ri) {
+    size_t index = sizes.size() - 1 - ri;
+    OpFoldResult dim = sizes[index];
+    if (auto attr = llvm::dyn_cast_if_present<Attribute>(dim)) {
+      if (llvm::cast<IntegerAttr>(attr).getInt() == 1) {
+        if (!(reducedShapePos > 0 && reducedShape[reducedShapePos - 1] == 1)) {
+          unusedDims.set(index);
+          continue;
+        }
+      }
+    }
+    reducedShapePos--;
+  }

   // Early exit for the case where the number of unused dims matches the number
   // of ranks reduced.
   if (static_cast<int64_t>(unusedDims.count()) + reducedType.getRank() ==
       originalType.getRank())
     return unusedDims;

   SmallVector<int64_t> originalStrides, candidateStrides;

mlir/test/Dialect/MemRef/canonicalize.mlir

@@ ... @@ func.func @subview_rank_reduction(%arg0: memref<1x384x384xf32>, %idx: index)
   %c1 = arith.constant 1 : index
   // CHECK: %[[subview:.*]] = memref.subview %[[arg0]][0, %[[arg1]], %[[arg1]]] [1, 1, %[[arg1]]] [1, 1, 1] : memref<1x384x384xf32> to memref<1x?xf32, strided<[384, 1], offset: ?>>
   // CHECK: %[[cast:.*]] = memref.cast %[[subview]] : memref<1x?xf32, strided<[384, 1], offset: ?>> to memref<?x?xf32, strided<[384, 1], offset: ?>>
   %0 = memref.subview %arg0[0, %idx, %idx] [1, %c1, %idx] [1, 1, 1]
       : memref<1x384x384xf32> to memref<?x?xf32, strided<[384, 1], offset: ?>>
   // CHECK: return %[[cast]]
   return %0 : memref<?x?xf32, strided<[384, 1], offset: ?>>
 }

+// -----
+
+func.func @keep_preserved_unit_dimensions(%arg0: tensor<?x?x?xf32>, %arg1: index) -> index {
+  %0 = bufferization.to_memref %arg0 : memref<?x?x?xf32, strided<[?, ?, ?], offset: ?>>
+  %c1 = arith.constant 1 : index
+  %subview = memref.subview %0[0, 0, 0] [1, %arg1, 1] [1, 1, 1] : memref<?x?x?xf32, strided<[?, ?, ?], offset: ?>> to memref<1x?xf32, strided<[?, ?], offset: ?>>
+  %dim = memref.dim %subview, %c1 : memref<1x?xf32, strided<[?, ?], offset: ?>>
+  return %dim : index
+}
+// CHECK-LABEL: func @keep_preserved_unit_dimensions
+// CHECK-SAME: %[[ARG1:[a-zA-Z0-9]+]]: index
+// CHECK: return %[[ARG1]]

mlir/test/Dialect/MemRef/fold-memref-alias-ops.mlir

@@ ... @@
 // CHECK-SAME: %[[ARG12:[a-zA-Z0-9_]+]]: index
 // CHECK-SAME: %[[ARG13:[a-zA-Z0-9_]+]]: index
 // CHECK-SAME: %[[ARG14:[a-zA-Z0-9_]+]]: index
 // CHECK-SAME: %[[ARG15:[a-zA-Z0-9_]+]]: index
 // CHECK-SAME: %[[ARG16:[a-zA-Z0-9_]+]]: index
 // CHECK-DAG: %[[I0:.+]] = affine.apply #[[MAP]]()[%[[ARG1]], %[[ARG13]], %[[ARG7]]]
 // CHECK-DAG: %[[I2:.+]] = affine.apply #[[MAP]]()[%[[ARG3]], %[[ARG14]], %[[ARG9]]]
 // CHECK-DAG: %[[I3:.+]] = affine.apply #[[MAP]]()[%[[ARG4]], %[[ARG15]], %[[ARG10]]]
-// CHECK-DAG: %[[I4:.+]] = affine.apply #[[MAP]]()[%[[ARG5]], %[[ARG16]], %[[ARG11]]]
-// CHECK: memref.load %[[ARG0]][%[[I0]], %[[ARG2]], %[[I2]], %[[I3]], %[[I4]], %[[ARG6]]]
+// CHECK-DAG: %[[I4:.+]] = affine.apply #[[MAP]]()[%[[ARG6]], %[[ARG16]], %[[ARG12]]]
+// CHECK: memref.load %[[ARG0]][%[[I0]], %[[ARG2]], %[[I2]], %[[I3]], %[[ARG5]], %[[I4]]]

 // -----

 func.func @fold_vector_transfer_read_with_rank_reduced_subview(
     %arg0 : memref<?x?x?xf32, strided<[?, ?, ?], offset: ?>>,
     %arg1: index, %arg2 : index, %arg3 : index, %arg4: index, %arg5 : index,
     %arg6 : index) -> vector<4xf32> {
   %cst = arith.constant 0.0 : f32
@@ ... @@ func.func @fold_src_nvgpu_device_async_copy(%gmem_memref_3d : memref<2x128x768xf16>, %src_idx_0 : index, %src_idx_1 : index, %src_idx_2 : index, %src_sub_idx_0 : index, %src_sub_idx_1 : index) {
   %async_token = nvgpu.device_async_copy %gmem_memref_subview_2d[%src_sub_idx_0, %src_sub_idx_1], %smem_memref_4d[%c0, %c0, %c0, %c0], 8 {bypassL1} : memref<1x8xf16, strided<[98304, 1], offset: ?>> to memref<5x1x64x64xf16, #gpu.address_space<workgroup>>
   return
 }

 // CHECK-DAG: #[[MAP:.+]] = affine_map<()[s0, s1] -> (s0 + s1)>
 // CHECK: func.func @fold_src_nvgpu_device_async_copy
 // CHECK-SAME: (%[[GMEM_MEMREF_3d:.+]]: memref<2x128x768xf16>, %[[SRC_IDX_0:.+]]: index, %[[SRC_IDX_1:.+]]: index, %[[SRC_IDX_2:.+]]: index, %[[SRC_SUB_IDX_0:.+]]: index, %[[SRC_SUB_IDX_1:.+]]: index)
 // CHECK-DAG: %[[c0:.+]] = arith.constant 0 : index
-// CHECK-DAG: %[[RESOLVED_SRC_IDX_0:.+]] = affine.apply #[[MAP]]()[%[[SRC_IDX_0]], %[[SRC_SUB_IDX_0]]]
+// CHECK-DAG: %[[RESOLVED_SRC_IDX_0:.+]] = affine.apply #[[MAP]]()[%[[SRC_IDX_1]], %[[SRC_SUB_IDX_0]]]
 // CHECK-DAG: %[[RESOLVED_SRC_IDX_1:.+]] = affine.apply #[[MAP]]()[%[[SRC_IDX_2]], %[[SRC_SUB_IDX_1]]]
-// CHECK-DAG: nvgpu.device_async_copy %[[GMEM_MEMREF_3d]][%[[RESOLVED_SRC_IDX_0]], %[[SRC_IDX_1]], %[[RESOLVED_SRC_IDX_1]]], %[[SMEM_MEMREF_4d]][%[[c0]], %[[c0]], %[[c0]], %[[c0]]], 8 {bypassL1} : memref<2x128x768xf16> to memref<5x1x64x64xf16, #gpu.address_space<workgroup>>
+// CHECK-DAG: nvgpu.device_async_copy %[[GMEM_MEMREF_3d]][%[[SRC_IDX_0]], %[[RESOLVED_SRC_IDX_0]], %[[RESOLVED_SRC_IDX_1]]], %[[SMEM_MEMREF_4d]][%[[c0]], %[[c0]], %[[c0]], %[[c0]]], 8 {bypassL1} : memref<2x128x768xf16> to memref<5x1x64x64xf16, #gpu.address_space<workgroup>>

 // -----


 func.func @fold_src_fold_dest_nvgpu_device_async_copy(%gmem_memref_3d : memref<2x128x768xf16>, %src_idx_0 : index, %src_idx_1 : index, %src_idx_2 : index, %src_sub_idx_0 : index, %src_sub_idx_1 : index, %dest_idx_0 : index, %dest_idx_1 : index, %dest_idx_2 : index, %dest_idx_3 : index, %dest_sub_idx_0 : index, %dest_sub_idx_1 : index) {
   %c0 = arith.constant 0 : index
   %smem_memref_4d = memref.alloc() : memref<5x1x64x64xf16, #gpu.address_space<workgroup>>
   %gmem_memref_subview_2d = memref.subview %gmem_memref_3d[%src_idx_0, %src_idx_1, %src_idx_2] [1, 1, 8] [1, 1, 1] : memref<2x128x768xf16> to memref<1x8xf16, strided<[98304, 1], offset: ?>>
   %smem_memref_2d = memref.subview %smem_memref_4d[%dest_idx_0, %dest_idx_1, %dest_idx_2, %dest_idx_3] [1, 1, 1, 8] [1, 1, 1, 1] : memref<5x1x64x64xf16, #gpu.address_space<workgroup>> to memref<1x8xf16, strided<[4096, 1], offset: ?>, #gpu.address_space<workgroup>>
   %async_token = nvgpu.device_async_copy %gmem_memref_subview_2d[%src_sub_idx_0, %src_sub_idx_1], %smem_memref_2d[%dest_sub_idx_0, %dest_sub_idx_1], 8 {bypassL1} : memref<1x8xf16, strided<[98304, 1], offset: ?>> to memref<1x8xf16, strided<[4096, 1], offset: ?>, #gpu.address_space<workgroup>>
   return
 }

 // CHECK-DAG: #[[MAP:.+]] = affine_map<()[s0, s1] -> (s0 + s1)>
 // CHECK: func.func @fold_src_fold_dest_nvgpu_device_async_copy
 // CHECK-SAME: (%[[GMEM_MEMREF_3d:.+]]: memref<2x128x768xf16>, %[[SRC_IDX_0:.+]]: index, %[[SRC_IDX_1:.+]]: index, %[[SRC_IDX_2:.+]]: index, %[[SRC_SUB_IDX_0:.+]]: index, %[[SRC_SUB_IDX_1:.+]]: index, %[[DEST_IDX_0:.+]]: index, %[[DEST_IDX_1:.+]]: index, %[[DEST_IDX_2:.+]]: index, %[[DEST_IDX_3:.+]]: index, %[[DEST_SUB_IDX_0:.+]]: index, %[[DEST_SUB_IDX_1:.+]]: index)
-// CHECK-DAG: %[[RESOLVED_SRC_IDX_0:.+]] = affine.apply #[[MAP]]()[%[[SRC_IDX_0]], %[[SRC_SUB_IDX_0]]]
+// CHECK-DAG: %[[RESOLVED_SRC_IDX_0:.+]] = affine.apply #[[MAP]]()[%[[SRC_IDX_1]], %[[SRC_SUB_IDX_0]]]
 // CHECK-DAG: %[[RESOLVED_SRC_IDX_1:.+]] = affine.apply #[[MAP]]()[%[[SRC_IDX_2]], %[[SRC_SUB_IDX_1]]]
-// CHECK-DAG: %[[RESOLVED_DST_IDX_1:.+]] = affine.apply #[[MAP]]()[%[[DEST_IDX_1]], %[[DEST_SUB_IDX_0]]]
+// CHECK-DAG: %[[RESOLVED_DST_IDX_1:.+]] = affine.apply #[[MAP]]()[%[[DEST_IDX_2]], %[[DEST_SUB_IDX_0]]]
 // CHECK-DAG: %[[RESOLVED_DST_IDX_3:.+]] = affine.apply #[[MAP]]()[%[[DEST_IDX_3]], %[[DEST_SUB_IDX_1]]]
-// CHECK-DAG: nvgpu.device_async_copy %[[GMEM_MEMREF_3d]][%[[RESOLVED_SRC_IDX_0]], %[[SRC_IDX_1]], %[[RESOLVED_SRC_IDX_1]]], %[[SMEM_MEMREF_4d]][%[[DEST_IDX_0]], %[[RESOLVED_DST_IDX_1]], %[[DEST_IDX_2]], %[[RESOLVED_DST_IDX_3]]], 8 {bypassL1} : memref<2x128x768xf16> to memref<5x1x64x64xf16, #gpu.address_space<workgroup>>
+// CHECK-DAG: nvgpu.device_async_copy %[[GMEM_MEMREF_3d]][%[[SRC_IDX_0]], %[[RESOLVED_SRC_IDX_0]], %[[RESOLVED_SRC_IDX_1]]], %[[SMEM_MEMREF_4d]][%[[DEST_IDX_0]], %[[DEST_IDX_1]], %[[RESOLVED_DST_IDX_1]], %[[RESOLVED_DST_IDX_3]]], 8 {bypassL1} : memref<2x128x768xf16> to memref<5x1x64x64xf16, #gpu.address_space<workgroup>>

 // -----

 #map = affine_map<()[s0] -> (-s0 + 4)>
 #map1 = affine_map<()[s0] -> (-s0 + 32)>

 func.func @test_ldmatrix(%arg0: memref<4x32x32xf16, 3>, %arg1: index, %arg2: index, %arg3: index) -> vector<4x2xf16> {
   %c0 = arith.constant 0 : index