This is an archive of the discontinued LLVM Phabricator instance.

[mlir][linalg] Vectorize tensor.extract using contiguous loads
ClosedPublic

Authored by awarzynski on Jan 18 2023, 1:23 AM.

Details

Summary

This patch implements vectorization of tensor.extract for n-D tensors
(n >= 2) using contiguous load operations, i.e. vector.transfer_read. This
is a follow-up to https://reviews.llvm.org/D137660, in which gather loads
were used, i.e. vector.gather.

It is always safe to use gather load operations when the underlying
memory access pattern is contiguous, but not vice versa. At the moment, the
following conditions have to be met for contiguous loads to be
generated:

  1. The _output tensor_ must be a 1-D vector with the trailing dim > 1, e.g. tensor<1x1x4xi32>,
  2. The trailing dim in the _input tensor_ must be > 1, e.g. tensor<1x1x4xi32> would be fine, but not tensor<1x4x1xi32>.

If these conditions are not satisfied, gather loads are generated
instead.

Condition 1 guarantees that the iteration space of the corresponding
linalg.generic Op is relatively simple. That makes analysing the
indices for tensor.extract rather straightforward.

Condition 2 is mostly there to avoid weird vectorisation patterns, e.g.
vector<1x1x1xi32>. In practice, tensors like tensor<1x4x1xi32>
should first be collapsed to tensor<1x4xi32> before vectorisation, but
that should happen somewhere else.

If needed, both conditions can be relaxed. I've not been able to find a
good motivating example for these, hence skipping. For reference,
tosa.resize (lowered to Linalg) was the driving example used here.

Co-authored-by: Diego Caballero <diegocaballero@google.com>

Diff Detail

Event Timeline

awarzynski created this revision.Jan 18 2023, 1:23 AM
Herald added a reviewer: hanchung. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
awarzynski requested review of this revision.Jan 18 2023, 1:23 AM
awarzynski edited the summary of this revision. (Show Details)Jan 18 2023, 1:23 AM

Thanks @awarzynski! I think the initial conditions are fair as a starting point. I need more time to digest the logic but here are some nits for now!

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
638

nit: I think we are using var instead of /p to highlight source code names in the doc but perhaps I'm missing something.

642

ar? -> and?

644

Can you elaborate on what strideZero is? Also, perhaps rename val to index or indexVal?

645

typo

649

-> this is?

689

You can use isa instead of dyn_cast when you only want to check for the class but you don't need the actual class object.

704

I'm a bit lost here with the findAncestorOpInBlock and the constant op check. What are you trying to do?

821

Why? Wouldn't the backend emulate it if it's not natively supported by the target?

Thanks for the comments @dcaballe !

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
638

Not a Doxygen expert. This is taken from LLVM's coding standards, but I suspect that there's going to be other styles used elsewhere. Happy to update!

644

Can you elaborate on what strideZero is?

Sure, and apologies for the poor naming.

So, we would like the values in the vector with the trailing indices to have stride one, e.g. v0 = [0, 1, 2, 3]. This way we know that we are accessing adjacent elements. For non-trailing indices, to make things a bit simpler, I assume that the indices ought to have stride zero (or be constant), i.e. v1 = [5, 5, 5, 5] (the actual values don't matter). This example should clarify what I mean:

func.func @transfer_read(%arg0: tensor<3x3x3xf32>, %arg2: tensor<1x1x3xf32>) -> tensor<1x1x3xf32> {
  %1 = linalg.generic {
    indexing_maps = [#map1],
    iterator_types = ["parallel", "parallel", "parallel"]
  } outs(%arg2 : tensor<1x1x3xf32>) {
  ^bb0(%arg4: f32):
    %2 = linalg.index 0 : index
    %3 = linalg.index 1 : index
    %4 = linalg.index 2 : index
    %5 = tensor.extract %arg0[%2, %3, %4] : tensor<3x3x3xf32>
    linalg.yield %5 : f32
  } -> tensor<1x1x3xf32>
  return %1 : tensor<1x1x3xf32>
}
  • It is fine to use %2, %3 and %4 to calculate the trailing index for tensor.extract (%2 and %3 are effectively constant, %4 has stride one: [0, 1, 2]).
  • Things get tricky once we try to use %4 for calculating non-trailing indices (i.e. deciding whether this is a contiguous or non-contiguous load becomes trickier).

We could add the logic that you originally proposed and it should just work ™ :

bool isStrideOneAlongDimension(Value indexVal, unsigned dim) {
  Operation *defOp = indexVal.getDefiningOp();
  if (!defOp)
    return false;
  if (auto indexOp = dyn_cast<linalg::IndexOp>(defOp))
    return indexOp.getDim() == dim;
  // TODO: explore UD chain.
  return false;
}

Does this make sense?

704

IIUC, for %c0, ancestor is %c0 = arith.constant 0 : index. It's an arith::ConstantOp, so ... But arith.constant has no operands, so this is not needed, is it?

#map0 = affine_map<(d0, d1, d2) -> (d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
func.func @vector_gather(%arg0: tensor<3x3x3xf32>, %arg1: tensor<3xi32>, %arg2: tensor<1x1x3xf32>) -> tensor<1x1x3xf32> {
  %c123 = arith.constant 123 : index
  %2 = linalg.generic {
    indexing_maps = [#map0, #map1],
    iterator_types = ["parallel", "parallel", "parallel"]
  } ins(%arg1 : tensor<3xi32>) outs(%arg2 : tensor<1x1x3xf32>) {
  ^bb0(%arg3: i32, %arg4: f32):
    %c0 = arith.constant 0 : index
    %3 = arith.index_cast %arg3 : i32 to index
    %7 = tensor.extract %arg0[%c0, %c0, %3] : tensor<3x3x3xf32>
    linalg.yield %7 : f32
  } -> tensor<1x1x3xf32>
  return %2 : tensor<1x1x3xf32>
}

Add a check for AffineMaps (contiguous loads require identity maps), fix typos etc.

tschuett added inline comments.
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
633

You use VectorMemoryAccessKind as if it is an enum class, but it is not.

I am about to disappear for a week and just wanted to comment on this condition:

The _output tensor_ must be a 1-D vector with the trailing dim > 1, e.g. tensor<1x1x4xi32>,

I do believe that this can be relaxed, but might be a bit too tricky and not really needed. Consider the following scenario:

func.func @transfer_read(%arg0: tensor<3x3xf32>, %arg2: tensor<1x4x3xf32>) -> tensor<1x4x3xf32> {
  %1 = linalg.generic {
    indexing_maps = [#map1],
    iterator_types = ["parallel", "parallel", "parallel"]
  } outs(%arg2 : tensor<1x4x3xf32>) {
  ^bb0(%arg4: f32):
    %2 = linalg.index 0 : index
    %3 = linalg.index 1 : index
    %4 = linalg.index 2 : index
    %5 = arith.addi %3, %4 : index
    %6 = tensor.extract %arg0[%2, %5] : tensor<3x3xf32>
    linalg.yield %6 : f32
  } -> tensor<1x4x3xf32>
  return %1 : tensor<1x4x3xf32>
}

This linalg.generic Op is similar to the following nested loops:

for (i = 0; i < 1; i++)
  for (j = 0; j < 4: j++)
    for (k = 0; k < 3; k++) {
       int l = j + k;
       a = v[i][l];
    }

This will lead to the following accesses:

v[0][0]
v[0][1]
v[0][2]
v[0][1]
v[0][2]
v[0][3]
v[0][2]
v[0][3]
...

These are not contiguous loads, but that's hard to infer without analyzing the expression used for calculating the indices. Instead of doing that, I decided to limit the space of the supported cases. This is still sufficient for my use case, i.e. vectorising tosa.resize that I mentioned in iree-issues/#9198 _after_ tiling. I hope that this makes sense :)

Removed vectorize_tensor_extract.mlir (all tests are already available in vectorization.mlir), deleted unnecessary check for whether the affine maps are minor identities.

awarzynski marked 3 inline comments as done.

Fix typo

awarzynski added inline comments.Jan 27 2023, 11:44 AM
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
644

Also, perhaps rename val to index or indexVal?

Not claiming that val is a great name, but index and indexVal would suggest that this is an index, whereas it could be any value that is used to compute an index 🤔 .

821

It would, but I was thinking that it would be useful to be able to disable vectorization when only gather loads can be used (which can be very slow).

  • Remove the vectorize_nd_extract attribute
  • Rebase on top of main

The vectorize_nd_extract attribute was introduced to prevent vectorisation using gather loads in situations where contiguous loads would be better but weren't yet supported. As this patch introduces support for vectorisation using contiguous loads, that attribute is no longer required.

Restore the "vectorize_nd_extract" attribute

For context, see https://github.com/iree-org/iree/pull/12288#issuecomment-1437824237

dcaballe accepted this revision.Feb 21 2023, 10:13 PM
This revision is now accepted and ready to land.Feb 21 2023, 10:13 PM
This revision was landed with ongoing or failed builds.Feb 22 2023, 11:33 AM
This revision was automatically updated to reflect the committed changes.

With this change I'm seeing vectorization of

#map = affine_map<(d0) -> (d0)>
module @GatherV2_2.10 attributes {mhlo.cross_program_prefetches = [], mhlo.dynamic_parameter_bindings = [], mhlo.is_dynamic = true, mhlo.use_auto_spmd_partitioning = false} {
  rt.export @main ordinal 0
  func.func @main(%arg0: tensor<6xf32> {bufferization.writable = false, xla_framework.input_mapping = 0 : i32}, %arg1: tensor<5xi32> {bufferization.writable = false, xla_framework.input_mapping = 2 : i32}) -> tensor<5xf32> attributes {xla_framework.result_mapping = 1 : i32} {
    %c5 = arith.constant 5 : index
    %c0 = arith.constant 0 : index
    %0 = tensor.empty() : tensor<5xf32>
    %1 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]} outs(%0 : tensor<5xf32>) {
    ^bb0(%out: f32):
      %2 = linalg.index 0 : index
      %extracted = tensor.extract %arg1[%2] : tensor<5xi32>
      %3 = arith.index_cast %extracted : i32 to index
      %4 = arith.maxsi %3, %c0 : index
      %5 = arith.minsi %4, %c5 : index
      %extracted_0 = tensor.extract %arg0[%5] : tensor<6xf32>
      linalg.yield %extracted_0 : f32
    } -> tensor<5xf32>
    return %1 : tensor<5xf32>
  }
}

into

module @GatherV2_2.10 attributes {mhlo.cross_program_prefetches = [], mhlo.dynamic_parameter_bindings = [], mhlo.is_dynamic = true, mhlo.use_auto_spmd_partitioning = false} {
  rt.export @main ordinal 0
  func.func @main(%arg0: tensor<6xf32> {bufferization.writable = false, xla_framework.input_mapping = 0 : i32}, %arg1: tensor<5xi32> {bufferization.writable = false, xla_framework.input_mapping = 2 : i32}) -> tensor<5xf32> attributes {xla_framework.result_mapping = 1 : i32} {
    %c0 = arith.constant 0 : index
    %c0_i32 = arith.constant 0 : i32
    %cst = arith.constant dense<0> : vector<5xindex>
    %cst_0 = arith.constant dense<5> : vector<5xindex>
    %cst_1 = arith.constant 0.000000e+00 : f32
    %0 = tensor.empty() : tensor<5xf32>
    %1 = vector.transfer_read %arg1[%c0], %c0_i32 {in_bounds = [true]} : tensor<5xi32>, vector<5xi32>
    %2 = arith.index_cast %1 : vector<5xi32> to vector<5xindex>
    %3 = arith.maxsi %2, %cst : vector<5xindex>
    %4 = arith.minsi %3, %cst_0 : vector<5xindex>
    %5 = vector.extractelement %4[%c0_i32 : i32] : vector<5xindex>
    %6 = vector.transfer_read %arg0[%5], %cst_1 {in_bounds = [true]} : tensor<6xf32>, vector<5xf32>
    %7 = vector.transfer_write %6, %0[%c0] {in_bounds = [true]} : vector<5xf32>, tensor<5xf32>
    return %7 : tensor<5xf32>
  }
}

Looks like it somehow thinks this access is contiguous even though it's not.

Reverted this in e28bbfea5d482c1825b1799c57aedff4e0116619. Is the test case above enough to see what's going wrong?

@bkramer , sorry about this issue and thanks for reverting!

is the test case above enough to see what's going wrong?

Yes! Thanks for sending this :) I think that I know what the problem is and will send a fix tomorrow (I've been traveling today and only just got back, so need a break). Just to double check - the 1st tensor.extract is a contiguous load, the 2nd is a gather. Correct?

Right, it's loading from a contiguous tensor of indices and using them as an input to gather.

awarzynski reopened this revision.Mar 1 2023, 8:39 AM
This revision is now accepted and ready to land.Mar 1 2023, 8:39 AM
awarzynski updated this revision to Diff 501532.Mar 1 2023, 8:47 AM

Refine how contiguous loads are identified

Main change (in getTensorExtractMemoryAccessPattern):

// Reject Ops that would lead to non-contiguous accesses.
if (!isa<arith::AddIOp, arith::SubIOp, linalg::IndexOp>(ancestor))
  return false;

I took the liberty of renaming some variables and updating some comments.
Hopefully the overall logic is clearer now.

Also rebased on top of main.

@bkramer - wdyt? I added your repro as a test - does the generated output make sense?

bkramer accepted this revision.Mar 1 2023, 9:21 AM

Using vector.gather for that case looks right to me.

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp