This is an archive of the discontinued LLVM Phabricator instance.

[mlir][linalg] Add scalar broadcast load case to the vectoriser
ClosedPublic

Authored by awarzynski on May 2 2023, 12:14 PM.

Details

Summary

This patch extends the Linalg vectoriser so that scalar loads are
correctly identified as scalar rather than gather loads. Below is an
example of a scalar load:

func.func @example(%arg0: tensor<80x16xf32>, %arg2: tensor<1x4xf32>) -> tensor<1x4xf32> {
%c8 = arith.constant 8 : index
%c16 = arith.constant 16 : index
%1 = linalg.generic {
    indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>],
    iterator_types = ["parallel", "parallel"]
  } outs(%arg2 : tensor<1x4xf32>) {
  ^bb0(%out: f32):
    %2 = linalg.index 0 : index
    %extracted = tensor.extract %arg0[%2, %c16] : tensor<80x16xf32>
    linalg.yield %extracted : f32
  } -> tensor<1x4xf32>
  return %1 : tensor<1x4xf32>
}

This patch also makes sure that these scalar loads are indeed lowered to
scalar loads (implemented as vector.gather) followed by broadcast:

%8 = vector.gather %arg0[%c0, %c0] [%7], %cst_1, %cst_2 : tensor<80x16xf32>, vector<1xindex>, vector<1xi1>, vector<1xf32> into vector<1xf32>
%9 = vector.broadcast %8 : vector<1xf32> to vector<1x4xf32>

We still need to check what backends do in these cases and whether the
vectoriser could generate genuinely scalar code instead.

Diff Detail

Event Timeline

awarzynski created this revision.May 2 2023, 12:14 PM
Herald added a project: Restricted Project. · View Herald Transcript
awarzynski requested review of this revision.May 2 2023, 12:14 PM
Matt added a subscriber: Matt.May 19 2023, 2:21 PM
awarzynski updated this revision to Diff 527782.EditedJun 2 2023, 2:11 AM

Rebase

@dcaballe, I was hoping that I could replace vector.gather with either vector.extractelement or vector.extract, but:

  • vector.extractelement only works for 1-D or 0-D vectors,
  • vector.extract requires positions as "64-bit integer array attribute".

Neither of these is satisified though. Any thoughts? Perhaps I'm missing something 🤔 .

awarzynski edited the summary of this revision. (Show Details)Jun 2 2023, 2:12 AM

You can represent a scalar load + broadcast with a vector.transfer_read if you provide the right affine map with results set to zero. It would be something like for a scalar load + 2D broadcast:

%0 = vector.transfer_read %M[%i, %j], %cst {permutation_map = affine_map<(d0, d1)->(0, 0)>} :  tensor<?x?xf32>, vector<4x8xf32>

Use vector.transfer_read instead of vector.gather

Funnily enough, some canonicalisation turns this back into tensor.extract. The output looks OK, though I've not been able to trace that back (the debug dump suggests TransferOpReduceRank).

dcaballe accepted this revision.Jun 9 2023, 8:37 AM

Awesome, thank you so much! Some minor comments but this is good to go!

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
877

is ValueRange needed here?

900

is this message accurate now given that we can return Gather?

904

This is really clear now, thanks!

1021

cool!

This revision is now accepted and ready to land.Jun 9 2023, 8:37 AM

Thanks for the review, Diego! I will incorporate your suggestions in the final version that I will be merging shortly.

mlir/lib/Dialect/Linalg/Transforms/Vectorization.cpp
877

Nope :) I will remove it before merging this change.

900

Given this on L886:

if (!leadingIdxsLoopInvariant)
  return VectorMemoryAccessKind::Gather;

I can safely simplify this block so that it can only return VectorMemoryAccessKind::ScalarBroadcast. With that change the message would be valid. Will update before merging.

This revision was landed with ongoing or failed builds.Jun 12 2023, 7:19 AM
This revision was automatically updated to reflect the committed changes.