This is an archive of the discontinued LLVM Phabricator instance.

Order parallel indices before transposing the input in multireductions
ClosedPublic

Authored by harsh on Jun 24 2021, 4:33 PM.

Details

Summary

The current code does not preserve the order of the parallel
dimensions when doing multi-reductions and thus we can end
up in scenarios where the result shape does not match the
desired shape after reduction.

This patch fixes that by ensuring that the parallel indices
are in order and then concatenates them to the reduction dimensions
so that the reduction dimensions are innermost.

Diff Detail

Event Timeline

harsh created this revision.Jun 24 2021, 4:33 PM
harsh requested review of this revision.Jun 24 2021, 4:33 PM
harsh updated this revision to Diff 354411.Jun 24 2021, 6:19 PM

Fixed lint changes

ThomasRaoux added inline comments.
mlir/lib/Dialect/Vector/VectorTransforms.cpp
3954

Why don't we sort the indices so that the non reduction dimensions are in order in this transpose instead? This way we don't need a second transpose if they are not in the right order.

ThomasRaoux added inline comments.
mlir/lib/Dialect/Vector/VectorTransforms.cpp
3938–3947

I think you can simplify this logic and solve the problem by just pushing the non reduced dimensions first in order then push the reduction dimensions.

asaadaldien added inline comments.
mlir/lib/Dialect/Vector/VectorTransforms.cpp
3939

Good catch the bug is here! the linear swap isn't correct because it shuffles parallel dims around. You can replace this by inserting parallel induces first then reductions, e.g something like the following:

+1 to what @ThomasRaoux said.
It should also be possible to rewrite that logic in a much cleaner and simpler form like:

auto reductionDimsRange = op.reduction_dims.getAsValueRange<int64_t>();
SmallVector<int64_t, 4> reductionDims = llvm::to_vector<4>(llvm::map_range(reductionDimsRange.begin(), reductionDimsRange.end(), [](APInt a) { return a.getZExtValue(); }));
SmallDenseSet<int64_T> reductionDimsSet;
// fill it from reductionDims or from llvm::map_range, whichever makes most sense. 
SmallVector<int64_t> parallelDims;
for (int64_t idx = 0, e = ; idx != e; ++idx)
  if (!reductionDimsSet.contains(idx))
    parallelDims.push_back(idx);

if (parallelDims != llvm::seq<int64_t>(0, parallelDims.size())) { // This is the condition for "reductions must be innermost", see comment below.
  // concat parallelDims, reductionDims + add transpose. 
}

One thing to note is that the current lowering seems to always put reductions inside and use vector.reduction.
I think in most cases this will result in quite bad code because it will rely on horizontal reductions and add a ton of insert/extract.

We should have an option to emit the other way around too: put the reductionDims outside and just unroll pointwise operations on the parallel dims.
This is closer to the "outerproduct-like" semantics and has much better chances of being rewritten as fmas when possible; missing fma instructions is immediately a 2x performance penalty on many systems.

With the code structuring I propose above, the proper transposition amounts to just changing the order of concatenated dims.

harsh marked 3 inline comments as done.Jun 28 2021, 2:44 PM

Thanks @ThomasRaoux , @asaadaldien , @nicolasvasilache for the comments. @nicolasvasilache I have modified the patch as per your changes but kept the current patch to just handle moving reductions to the inner most dimensions. I will put up another patch to handle moving reductions to the outermost dimensions and based on the performance of inner vs outer, we can decide which path we want to take.

mlir/lib/Dialect/Vector/VectorTransforms.cpp
3938–3947

Thanks you are right. I have fixed this as per Nicolas' comments.

3939

Thanks! Since I will be attempting moving the reduction dims to be outermost also, I have modified the patch as per Nicolas' comments.

3954

Makes sense.

Thanks @ThomasRaoux , @asaadaldien , @nicolasvasilache for the comments. @nicolasvasilache I have modified the patch as per your changes but kept the current patch to just handle moving reductions to the inner most dimensions. I will put up another patch to handle moving reductions to the outermost dimensions and based on the performance of inner vs outer, we can decide which path we want to take.

Did you upload the new patch?

harsh updated this revision to Diff 355051.Jun 28 2021, 2:59 PM
harsh marked 3 inline comments as done.
harsh retitled this revision from Add additional transpose to multireduction op to restore output to original shape to Order parallel indices before transposing the input in multireductions.
harsh edited the summary of this revision. (Show Details)

Fixed as per comments

harsh added a comment.Jun 28 2021, 3:02 PM

@ThomasRaoux - yes I just uploaded it.

ThomasRaoux accepted this revision.Jun 28 2021, 3:12 PM

LGTM

mlir/lib/Dialect/Vector/VectorTransforms.cpp
3934

nit: formatting looks off.

3939
3947
This revision is now accepted and ready to land.Jun 28 2021, 3:12 PM
harsh marked 3 inline comments as done.Jun 28 2021, 4:10 PM
harsh added inline comments.
mlir/lib/Dialect/Vector/VectorTransforms.cpp
3947

Just to clarify, this means I should check if the inner most dims are reduction and if so, exit early here?

harsh updated this revision to Diff 355059.Jun 28 2021, 4:11 PM
harsh marked an inline comment as done.

Address comments and linting errors

ThomasRaoux accepted this revision.Jun 28 2021, 5:07 PM
ThomasRaoux added inline comments.
mlir/lib/Dialect/Vector/VectorTransforms.cpp
3947

Yes, looks good.

harsh added a comment.Jun 28 2021, 5:09 PM

Thanks and feel free to merge because I don't have merge priveleges.

Thanks and feel free to merge because I don't have merge priveleges.

Done