Page MenuHomePhabricator

[mlir][Affine][Vector] Support vectorizing reduction loops
ClosedPublic

Authored by sgrechanik on Apr 16 2021, 7:29 PM.

Details

Summary

This patch adds support for vectorizing loops with 'iter_args'
implementing known reductions along the vector dimension. Comparing to
the non-vector-dimension case, two additional things are done during
vectorization of such loops:

  • The resulting vector returned from the loop is reduced to a scalar using vector.reduce.
  • In some cases a mask is applied to the vector yielded at the end of the loop to prevent garbage values from being written to the accumulator.

Vectorization of reduction loops is disabled by default. To enable it, a
map from loops to array of reduction descriptors should be explicitly passed to
vectorizeAffineLoops, or vectorize-reductions=true should be passed
to the SuperVectorize pass.

Current limitations:

  • Loops with a non-unit step size are not supported.
  • n-D vectorization with n > 1 is not supported.

Diff Detail

Event Timeline

sgrechanik created this revision.Apr 16 2021, 7:29 PM
sgrechanik requested review of this revision.Apr 16 2021, 7:29 PM
nicolasvasilache accepted this revision.Apr 19 2021, 5:25 AM

Looks good, thanks @sgrechanik !

mlir/lib/Dialect/Affine/Transforms/SuperVectorize.cpp
996

Note: there is also the case where all the vector.transfer_read in the backward slice of the reduction (intersected with the filter "nested under loop of interest") have inBounds == true.
This is the way we can inject static information into dynamic memrefs atm.

1031

for composability you prob. want this to be an AffineApplyOp itself.

1428

Note: some of this is quite ancient and predates OpBuilder::InsertionGuard.
We should cleanup the load-bearing "insertion point"-passing across function boundaries via state.builder at some point.

1775

Nice!
This is completely unrelated to affine though and could help other places (e.g. the Linalg vectorizer).
Can you please move this to a dialect-independent/std-dialect utils ?

This revision is now accepted and ready to land.Apr 19 2021, 5:25 AM

Thank you for the feedback, I'll update the patch soon.

mlir/include/mlir/Analysis/Utils.h
364 ↗(On Diff #338270)

After talking to my colleagues I decided that I overcomplicated this function. I'll split it in the next patch update into two functions: isLoopParallel and isParallelReductionLoop.

mlir/lib/Dialect/Affine/Transforms/SuperVectorize.cpp
996

Not sure I understand your comment correctly, but there may be cases when the read is within bounds, but we still need to filter out some elements for the reduction (e.g. when the dimension size of a memref is larger than the upper bound of the corresponding loop).

1031

Yes, makes sense.

1775

Yeah, I'll try to move it into some more publicly accessible place.

Changes:

  • Generate affine.apply instead of subi when creating a mask.
  • Moved ReductionInfo and ReductionRecognizer into separate files.
  • Created a separate isParallelReductionLoop utility to simplify isLoopParallel.
nicolasvasilache accepted this revision.Apr 23 2021, 3:50 AM

sounds good, please fix the include order to appease the linter and let's land this!

I ran into some obscure optimization-dependent linking errors while trying to reuse the StandardReductionInfo template, so I moved most of its implementation to the header file.

Work around the MSVC issue

bondhugula added inline comments.
mlir/include/mlir/Analysis/Utils.h
359–360 ↗(On Diff #340428)

Making the method return true with the ignoreIterArgs set to true looks hacky. That's an incorrect return result since the loop isn't actually parallel. I'd recommend making the change suggested in your comment below - since it's in the right direction to start with.

mlir/include/mlir/Dialect/Affine/Utils.h
16–22

Prune includes please.

bondhugula requested changes to this revision.Apr 27 2021, 6:55 AM
This revision now requires changes to proceed.Apr 27 2021, 6:55 AM
sgrechanik added inline comments.Apr 27 2021, 9:03 AM
mlir/include/mlir/Analysis/Utils.h
359–360 ↗(On Diff #340428)

Ok, I'll split it further then, so that we'll have three functions:

  • isLoopMemoryParallel (better name suggestions are welcome) that will check only memory dependences
  • isLoopParallel that will work as the old version of isLoopParallel, i.e. loops with iter_args will be considered non-parallel.
  • isParallelReductionLoop that will check both memory dependences and try to recognize reduction loops.
  • Split isLoopParallel into isLoopParallel and isLoopMemoryParallel
  • Removed unnecessary includes from Affine/Utils.h
sgrechanik edited the summary of this revision. (Show Details)
sgrechanik added a reviewer: ftynse.
sgrechanik added a subscriber: ftynse.

I removed my reduction recognition machinery in favor of the code introduced by @ftynse. I still think that in the future we might want to replace AtomicRMWKind with something more extensible to support custom reductions, but I was unhappy with my solution, and it should probably be done in a separate patch (and maybe by someone else).

To retain some extensibility I did this:

  • I added the isLoopMemoryParallel function that checks only memory dependences and ignores iter_args. This will facilitate writing custom isLoopParallel-like functions with custom reduction recognition logic.
  • Vectorization utilities don't call isLoopParallel directly, they receive a map from loops to arrays of recognized reductions. So that they can be used together with custom isLoopParallel-like functions.

Please review.

bondhugula resigned from this revision.May 1 2021, 9:47 PM

This looks good to me - the earlier comment was the only one I had based on a partial look. Please go with the earlier review or perhaps @ftynse can review the changes.

This revision is now accepted and ready to land.May 1 2021, 9:47 PM

@ftynse Please check if the reduction detection changes look ok, I added the condition !yielded.hasOneUse() to exclude the case vecdim_partial_sums_1_rejected where we leak partial sums by storing them to a memref. I also added a helper getVectorReductionOp to generate vector reduction ops by AtomicRMWKind and move the getIdentityValue and getReductionOp helpers to make them publicly available.

If there are no more comments I'll merge it tomorrow morning