This patch implements vectorization of tensor.extract for n-D tensors
(n >= 2) using contiguous load operations, i.e. vector.transfer_read.
It is a follow-up to https://reviews.llvm.org/D137660, in which gather
loads, i.e. vector.gather, were used.
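To illustrate the difference, compare the two load variants below (a
simplified sketch rather than the exact IR produced by the vectorizer;
%src, %offsets, %idx etc. are made-up names):

  // Gather load, as generated by https://reviews.llvm.org/D137660. The
  // offset of every vector element is computed and passed explicitly.
  %0 = vector.gather %src[%c0, %c0, %c0] [%offsets], %mask, %pass_thru
      : tensor<1x1x8xi32>, vector<1x1x4xindex>, vector<1x1x4xi1>,
        vector<1x1x4xi32> into vector<1x1x4xi32>

  // Contiguous load, as generated by this patch when the access pattern
  // allows it. A single start index is enough.
  %1 = vector.transfer_read %src[%c0, %c0, %idx], %pad
      {in_bounds = [true, true, true]} : tensor<1x1x8xi32>, vector<1x1x4xi32>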
It is always safe to use gather loads even when the underlying memory
access pattern is contiguous, but not vice versa. At the moment, the
following conditions have to be met for contiguous loads to be
generated:
1. The _output tensor_ must be effectively 1-D, i.e. the trailing dim
   is > 1 and all other dims are unit dims, e.g. tensor<1x1x4xi32>.
2. The trailing dim of the _input tensor_ must be > 1, e.g.
   tensor<1x1x4xi32> would be fine, but not tensor<1x4x1xi32>.
If these conditions are not satisfied, gather loads are generated
instead.
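For instance, a linalg.generic like the one below satisfies both
conditions and is vectorised using a contiguous load (a hand-written
sketch; the function and value names are made up):

  #map = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
  func.func @extract_contiguous(%src: tensor<1x1x8xi32>,
                                %idx: index) -> tensor<1x1x4xi32> {
    %c0 = arith.constant 0 : index
    %init = tensor.empty() : tensor<1x1x4xi32>
    // The output is effectively 1-D (condition 1) and the trailing dim
    // of %src is 8 > 1 (condition 2).
    %res = linalg.generic
        {indexing_maps = [#map],
         iterator_types = ["parallel", "parallel", "parallel"]}
        outs(%init : tensor<1x1x4xi32>) {
    ^bb0(%out: i32):
      // The extract index is a linear function of the trailing loop
      // index, so consecutive iterations read consecutive elements.
      %i = linalg.index 2 : index
      %off = arith.addi %i, %idx : index
      %v = tensor.extract %src[%c0, %c0, %off] : tensor<1x1x8xi32>
      linalg.yield %v : i32
    } -> tensor<1x1x4xi32>
    return %res : tensor<1x1x4xi32>
  }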
Condition 1 guarantees that the iteration space of the corresponding
linalg.generic Op is relatively simple. That makes analysing the
indices for tensor.extract rather straightforward.
Condition 2 is mostly there to avoid degenerate vectorisation patterns
like vector<1x1x1xi32>. In practice, tensors like tensor<1x4x1xi32>
should first be collapsed to tensor<1x4xi32> before vectorisation, but
that should happen elsewhere (see the sketch below).
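For example, such a collapse could look like this (shown only to
illustrate the pre-processing meant above; it is not part of this
patch):

  %collapsed = tensor.collapse_shape %t [[0], [1, 2]]
      : tensor<1x4x1xi32> into tensor<1x4xi32>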
If needed, both conditions can be relaxed. I have not been able to find
a good motivating example for doing so, hence skipping that for now.
For reference, tosa.resize (lowered to Linalg) was the driving example
used here.
Co-authored-by: Diego Caballero <diegocaballero@google.com>