This is an archive of the discontinued LLVM Phabricator instance.

[mlir][linalg] Vectorize tensor.extract using contiguous loads
ClosedPublic

Authored by awarzynski on Jan 18 2023, 1:23 AM.

Details

Summary

This patch implements vectorization of tensor.extract for n-D tensors
(n >= 2) using contiguous load operations, i.e. vector.transfer_read. This
is a follow-up to https://reviews.llvm.org/D137660, in which gather loads
were used, i.e. vector.gather.

It is always safe to use gather load operations when the underlying
memory access pattern is contiguous, but not vice versa. At the moment, the
following conditions have to be met for contiguous loads to be
generated:

  1. The _output tensor_ must be a 1-D vector with the trailing dim > 1, e.g. tensor<1x1x4xi32>,
  2. The trailing dim in the _input tensor_ must be > 1, e.g. tensor<1x1x4xi32> would be fine, but not tensor<1x4x1xi32>.

If these conditions are not satisfied, gather loads are generated
instead.

Condition 1 guarantees that the iteration space of the corresponding
linalg.generic Op is relatively simple. That makes analysing the
indices for tensor.extract rather straightforward.

Condition 2 is mostly there to avoid weird vectorisation patterns, e.g.
vector<1x1x1xi32>. In practice, tensors like tensor<1x4x1xi32>
should first be collapsed to tensor<1x4xi32> before vectorisation, but
that should happen somewhere else.

If needed, both conditions can be relaxed. I've not been able to find a
good motivating example for these, hence skipping. For reference,
tosa.resize (lowered to Linalg) was the driving example used here.

Co-authored-by: Diego Caballero <diegocaballero@google.com>

Diff Detail

Event Timeline

awarzynski created this revision.Jan 18 2023, 1:23 AM
Herald added a reviewer: hanchung. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
awarzynski requested review of this revision.Jan 18 2023, 1:23 AM
awarzynski edited the summary of this revision. (Show Details)Jan 18 2023, 1:23 AM

Thanks @awarzynski! I think the initial conditions are fair as a starting point. I need more time to digest the logic but here are some nits for now!

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
638

nit: I think we are using var instead of /p to highlight source code names in the doc but perhaps I'm missing something.

642

ar? -> and?

644

Can you elaborate on what strideZero is? Also, perhaps rename val to index or indexVal?

645

typo

649

-> this is?

689

You can use isa instead of dyn_cast when you only want to check for the class but you don't need the actual class object.

704

I'm a bit lost here with the findAncestorOpInBlock and the constant op check. What are you trying to do?

821

Why? Wouldn't the backend emulate it if it's not natively supported by the target?

Thanks for the comments @dcaballe !

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
638

Not a Doxygen expert. This is taken from LLVM's coding standards, but I suspect that there's going to be other styles used elsewhere. Happy to update!

644

Can you elaborate on what strideZero is?

Sure, and apologies for the poor naming.

So, we would like the values in the vector with the trailing indices to have stride one, e.g. v0 = [0, 1, 2, 3]. This way we know that we are accessing adjacent elements. For non-trailing indices, to make things a bit simpler, I assume that the indices ought to have stride zero (or be constant), i.e. v1 = [5, 5, 5, 5] (the actual values don't matter). This example should clarify what I mean:

func.func @transfer_read(%arg0: tensor<3x3x3xf32>, %arg2: tensor<1x1x3xf32>) -> tensor<1x1x3xf32> {
  %1 = linalg.generic {
    indexing_maps = [#map1],
    iterator_types = ["parallel", "parallel", "parallel"]
  } outs(%arg2 : tensor<1x1x3xf32>) {
  ^bb0(%arg4: f32):
    %2 = linalg.index 0 : index
    %3 = linalg.index 1 : index
    %4 = linalg.index 2 : index
    %5 = tensor.extract %arg0[%2, %3, %4] : tensor<3x3x3xf32>
    linalg.yield %5 : f32
  } -> tensor<1x1x3xf32>
  return %1 : tensor<1x1x3xf32>
}
  • It is fine to use %2, %3 and %4 to calculate the trailing index for tensor.extract (%2 and %3 are effectively constant, %4 has stride one: [0, 1, 2]).
  • Things get tricky once we try to use %4 for calculating non-trailing indices (i.e. deciding whether this is a contiguous or non-contiguous load becomes trickier).

We could add the logic that you originally proposed and it should just work ™ :

bool isStrideOneAlongDimension(Value indexVal, unsigned dim) {
  Operation *defOp = indexVal.getDefiningOp();
  if (!defOp)
    return false;
  if (auto indexOp = dyn_cast<linalg::IndexOp>(defOp))
    return indexOp.getDim() == dim;
  // TODO: explore UD chain.
  return false;
}

Does this make sense?

704

IIUC, for %c0, ancestor is %c0 = arith.constant 0 : index. It's an arith::ConstantOp, so ... But arith.constant has no operands, so this is not needed, is it?

#map0 = affine_map<(d0, d1, d2) -> (d2)>
#map1 = affine_map<(d0, d1, d2) -> (d0, d1, d2)>
func.func @vector_gather(%arg0: tensor<3x3x3xf32>, %arg1: tensor<3xi32>, %arg2: tensor<1x1x3xf32>) -> tensor<1x1x3xf32> {
  %c123 = arith.constant 123 : index
  %2 = linalg.generic {
    indexing_maps = [#map0, #map1],
    iterator_types = ["parallel", "parallel", "parallel"]
  } ins(%arg1 : tensor<3xi32>) outs(%arg2 : tensor<1x1x3xf32>) {
  ^bb0(%arg3: i32, %arg4: f32):
    %c0 = arith.constant 0 : index
    %3 = arith.index_cast %arg3 : i32 to index
    %7 = tensor.extract %arg0[%c0, %c0, %3] : tensor<3x3x3xf32>
    linalg.yield %7 : f32
  } -> tensor<1x1x3xf32>
  return %2 : tensor<1x1x3xf32>
}

Add a check for AffineMaps (contiguous loads require identity maps), fix typos etc.

tschuett added inline comments.
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
633

You use VectorMemoryAccessKind as if it is an enum class, but it is not.

I am about to disappear for a week and just wanted to comment on this condition:

The _output tensor_ must be a 1-D vector with the trailing dim > 1, e.g. tensor<1x1x4xi32>,

I do believe that this can be relaxed, but might be a bit too tricky and not really needed. Consider the following scenario:

func.func @transfer_read(%arg0: tensor<3x3xf32>, %arg2: tensor<1x4x3xf32>) -> tensor<1x4x3xf32> {
  %1 = linalg.generic {
    indexing_maps = [#map1],
    iterator_types = ["parallel", "parallel", "parallel"]
  } outs(%arg2 : tensor<1x4x3xf32>) {
  ^bb0(%arg4: f32):
    %2 = linalg.index 0 : index
    %3 = linalg.index 1 : index
    %4 = linalg.index 2 : index
    %5 = arith.addi %3, %4 : index
    %6 = tensor.extract %arg0[%2, %5] : tensor<3x3xf32>
    linalg.yield %6 : f32
  } -> tensor<1x4x3xf32>
  return %1 : tensor<1x4x3xf32>
}

This linalg.generic Op is similar to the following nested loops:

for (i = 0; i < 1; i++)
  for (j = 0; j < 4: j++)
    for (k = 0; k < 3; k++) {
       int l = j + k;
       a = v[i][l];
    }

This will lead to the following accesses:

v[0][0]
v[0][1]
v[0][2]
v[0][1]
v[0][2]
v[0][3]
v[0][2]
v[0][3]
...

These are not contiguous loads, but that's hard to infer without analyzing the expression used for calculating the indices. Instead of doing that, I decided to limit the space of the supported cases. This is still sufficient for my use case, i.e. vectorising tosa.resize that I mentioned in iree-issues/#9198 _after_ tiling. I hope that this makes sense :)

Removed vectorize_tensor_extract.mlir (all tests are already available in vectorization.mlir), deleted unnecessary check for whether the affine maps are minor identities.

awarzynski marked 3 inline comments as done.

Fix typo

awarzynski added inline comments.Jan 27 2023, 11:44 AM
mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
644

Also, perhaps rename val to index or indexVal?

Not claiming that val is a great name, but index and indexVal would suggest that this is an index, whereas it could be any value that is used to compute an index 🤔 .

821

It would, but I was thinking that it would be useful to be able to disable vectorization when only gather loads can be used (which can be very slow).

  • Remove the vectorize_nd_extract attribute
  • Rebase on top of main

The vectorize_nd_extract attribute was introduced to prevent vectorisation using gather loads in situations where contiguous loads would be better but weren't yet supported. As this patch introduces support for vectorisation using contiguous loads, that attribute is no longer required.

Restore the "vectorize_nd_extract" attribute

For context, see https://github.com/iree-org/iree/pull/12288#issuecomment-1437824237

dcaballe accepted this revision.Feb 21 2023, 10:13 PM
This revision is now accepted and ready to land.Feb 21 2023, 10:13 PM
This revision was landed with ongoing or failed builds.Feb 22 2023, 11:33 AM
This revision was automatically updated to reflect the committed changes.

With this change I'm seeing vectorization of

#map = affine_map<(d0) -> (d0)>
module @GatherV2_2.10 attributes {mhlo.cross_program_prefetches = [], mhlo.dynamic_parameter_bindings = [], mhlo.is_dynamic = true, mhlo.use_auto_spmd_partitioning = false} {
  rt.export @main ordinal 0
  func.func @main(%arg0: tensor<6xf32> {bufferization.writable = false, xla_framework.input_mapping = 0 : i32}, %arg1: tensor<5xi32> {bufferization.writable = false, xla_framework.input_mapping = 2 : i32}) -> tensor<5xf32> attributes {xla_framework.result_mapping = 1 : i32} {
    %c5 = arith.constant 5 : index
    %c0 = arith.constant 0 : index
    %0 = tensor.empty() : tensor<5xf32>
    %1 = linalg.generic {indexing_maps = [#map], iterator_types = ["parallel"]} outs(%0 : tensor<5xf32>) {
    ^bb0(%out: f32):
      %2 = linalg.index 0 : index
      %extracted = tensor.extract %arg1[%2] : tensor<5xi32>
      %3 = arith.index_cast %extracted : i32 to index
      %4 = arith.maxsi %3, %c0 : index
      %5 = arith.minsi %4, %c5 : index
      %extracted_0 = tensor.extract %arg0[%5] : tensor<6xf32>
      linalg.yield %extracted_0 : f32
    } -> tensor<5xf32>
    return %1 : tensor<5xf32>
  }
}

into

module @GatherV2_2.10 attributes {mhlo.cross_program_prefetches = [], mhlo.dynamic_parameter_bindings = [], mhlo.is_dynamic = true, mhlo.use_auto_spmd_partitioning = false} {
  rt.export @main ordinal 0
  func.func @main(%arg0: tensor<6xf32> {bufferization.writable = false, xla_framework.input_mapping = 0 : i32}, %arg1: tensor<5xi32> {bufferization.writable = false, xla_framework.input_mapping = 2 : i32}) -> tensor<5xf32> attributes {xla_framework.result_mapping = 1 : i32} {
    %c0 = arith.constant 0 : index
    %c0_i32 = arith.constant 0 : i32
    %cst = arith.constant dense<0> : vector<5xindex>
    %cst_0 = arith.constant dense<5> : vector<5xindex>
    %cst_1 = arith.constant 0.000000e+00 : f32
    %0 = tensor.empty() : tensor<5xf32>
    %1 = vector.transfer_read %arg1[%c0], %c0_i32 {in_bounds = [true]} : tensor<5xi32>, vector<5xi32>
    %2 = arith.index_cast %1 : vector<5xi32> to vector<5xindex>
    %3 = arith.maxsi %2, %cst : vector<5xindex>
    %4 = arith.minsi %3, %cst_0 : vector<5xindex>
    %5 = vector.extractelement %4[%c0_i32 : i32] : vector<5xindex>
    %6 = vector.transfer_read %arg0[%5], %cst_1 {in_bounds = [true]} : tensor<6xf32>, vector<5xf32>
    %7 = vector.transfer_write %6, %0[%c0] {in_bounds = [true]} : vector<5xf32>, tensor<5xf32>
    return %7 : tensor<5xf32>
  }
}

Looks like it somehow thinks this access is contiguous even though it's not.

Reverted this in e28bbfea5d482c1825b1799c57aedff4e0116619. Is the test case above enough to see what's going wrong?

@bkramer , sorry about this issue and thanks for reverting!

is the test case above enough to see what's going wrong?

Yes! Thanks for sending this :) I think that I know what the problem is and will send a fix tomorrow (I've been traveling today and only just got back, so need a break). Just to double check - the 1st tensor.extract is a contiguous load, the 2nd is a gather. Correct?

Right, it's loading from a contiguous tensor of indices and using them as an input to gather.

awarzynski reopened this revision.Mar 1 2023, 8:39 AM
This revision is now accepted and ready to land.Mar 1 2023, 8:39 AM
awarzynski updated this revision to Diff 501532.Mar 1 2023, 8:47 AM

Refine how contiguous loads are identified

Main change (in getTensorExtractMemoryAccessPattern):

// Reject Ops that would lead to non-contiguous accesses.
if (!isa<arith::AddIOp, arith::SubIOp, linalg::IndexOp>(ancestor))
  return false;

I took the liberty of renaming some variables and updating some comments.
Hopefully the overall logic is clearer now.

Also rebased on top of main.

@bkramer - wdyt? I added your repro as a test - does the generated output make sense?

bkramer accepted this revision.Mar 1 2023, 9:21 AM

Using vector.gather for that case looks right to me.

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp