This is an archive of the discontinued LLVM Phabricator instance.

[MLIR] fix/update affine data copy utility for max/min bounds
ClosedPublic

Authored by bondhugula on Apr 2 2020, 10:09 AM.

Details

Summary

Fix point-wise copy generation to work with bounds that have max/min.
Change structure of copy loop nest to use absolute loop indices and
subtracting base from the indexes of the fast buffers. Update supporting
utilities: Fix FlatAffineConstraints::getLowerAndUpperBound to look at
equalities as well and for a missing division. Update unionBoundingBox
to not discard common constraints (leads to a tighter system). Update
MemRefRegion::getConstantBoundingSizeAndShape to add memref dimension
constraints. Run removeTrivialRedundancy at the end of
MemRefRegion::compute. Run single iteration loop promotion and
load/store canonicalization after affine data copy (in its test pass as
well).

Diff Detail

Event Timeline

bondhugula created this revision.Apr 2 2020, 10:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptApr 2 2020, 10:09 AM
bondhugula edited the summary of this revision. (Show Details)Apr 2 2020, 10:09 AM
bondhugula added a reviewer: dcaballe.
bondhugula edited the summary of this revision. (Show Details)Apr 2 2020, 10:18 AM
andydavis1 added inline comments.Apr 2 2020, 3:30 PM
mlir/lib/Analysis/Utils.cpp
106

Have you run into over approximations again?

dcaballe added inline comments.Apr 2 2020, 4:31 PM
mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp
288

I'm not asking for any changes now but just wondering if it would make sense in the future to do all of these "clean-up" optimizations in a separate pass(es) that we can invoke as needed, maybe after running a bunch of optimizations instead of trying to optimize right after each one if it's not absolutely necessary. I guess that could reduce compile time and avoid duplicating this clean-up per pass. IIRC, loop fusion performed also some optimization around temporary tensors after fusion. Not sure if that optimizations would also fit into this category.

mlir/lib/Transforms/Utils/LoopUtils.cpp
1488

used in line 1496

mlir/test/Dialect/Affine/affine-data-copy.mlir
114

nit: you can remove the %{{.*}}

bondhugula added inline comments.Apr 2 2020, 11:06 PM
mlir/lib/Analysis/Utils.cpp
106

It's the same over approximation that existed - but I've changed affine data copy generate to use /*addMemRefDimBounds=*/false with MemRefRegion::compute to prevent redundant bounds from being added for the common case. So, instead, this is adding them when getting the constant bounding size and shape, but not when we do getLowerAndUpperBound on that region to get the range for the copy loops. This basically means the code that does the copying now risks going out of bounds when there is overapproximation. Ultimately, we shouldn't be using approximation based projection at all for region computation, and instead work with the equalities/local expressions to keep the bounds accurate -- if that's not possible (due to yet unimplemented detection or complex cases we may not be interested in), the region computation should just fail and bail out. We have a similar over approximation with unionBoundingBox. This approximation shouldn't be done for write regions; we should bail out if we can't be exact in those case.

For this patch, we have two options: (1) we could keep it like this (use addMemRefDimBounds = false with region compute) and then work on getting rid of the use of project in region compute. Once that's done, we don't need to add memref dim bounds anywhere; (2) we addMemrefDimBounds = true for region computation and update test cases because there'd be some redundant bounds. This still means we would later need to get rid of the over approximation (and fail instead) to avoid extra writes (which impact correctness) and extra reads (which may only impact performance). Let me know which one you prefer.

mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp
288

Yes, this is an issue common to several passes - as do whether we want to do light weight cleanup at the end. If it's really simple canonicalizations, it should really have no impact on compile time (so long as you are doing only the necessary stuff). Its real benefit is that it makes the output of the pass more intuitive to read and test cases easier to write / more readable. One issue here is that the current greedy pattern rewriter would run folding and DCE on *all* ops irrespective of the patterns, and so we get all sorts of unexpected simplifications from the pass and in the test cases. I'm sending out a patch/proposal to add a flag to applyPatternsGreedily that makes it only run the supplied patterns and not do any folding/DCE. This is also needed when entering the pass/utility when you want to canonicalize things by selectively applying some patterns (instead of requiring the client to do it). It's not always feasible to check whether it's already in the canonical form - would require a lot of extra code.

andydavis1 added inline comments.Apr 3 2020, 9:13 AM
mlir/lib/Analysis/Utils.cpp
106

OK thanks. Yes, lets go with option (1). Do you need to make additional changes to this revision for option (1)?

mehdi_amini added inline comments.Apr 3 2020, 9:17 PM
mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp
288

It is common that a pass would clean-up behind itself when it knows exactly what to cleanup: for example while you're promoting a single iteration loop you know that you may have specific code to clean in the promoted block and you perform these directly.
This is very targeted and "cheap".

Here is seems borderline though: it applies a some canonicalization patterns unconditionally at the function scope level.

bondhugula marked 2 inline comments as done.Apr 4 2020, 12:03 AM
bondhugula added inline comments.
mlir/lib/Analysis/Utils.cpp
106

This revision already does option (1). No more additional changes needed for it.

mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp
288

That's right. I'd like to ideally avoid doing function scope canonicalizations and only restrict this cleanup to load/store op's we touched (which can be easily collected and they are few in number - as many as the memrefs packed/copied) - but the current pattern rewriter doesn't support that.

dcaballe added inline comments.Apr 5 2020, 12:03 PM
mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp
288

Agree! I like the idea of doing some trivial simplifications (1-it loop) as long as it is on code that the pass is generating/modifying and it's not too involved to do so. However, "trivial" and "involved" are a bit subjective terms and I think this is one of those things that will get convoluted over time if we don't keep it really low profile. My personal opinion is that between simplicity and convenience, I lean towards the former and I would do this only for really trivial cases. I would say that bookkeeping operations over the algorithm to be simplified at the end for convenience is a bit borderline for me. Of course, this is my personal opinion, totally arguable :).

I think we should also not simplify anything if the pass doesn't do any core transformation. Otherwise, that might create a dependency between the simplifications done by the pass and subsequent ones. Should we limit the load/store simplification to cases where there is a least a copyNest?

mehdi_amini added inline comments.Apr 5 2020, 2:37 PM
mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp
288

I would say that bookkeeping operations over the algorithm to be simplified at the end for convenience is a bit borderline for me.

I would say it has to be balanced with having to schedule a complete run of canonicalize before the following pass, I suspect this is gonna be case-by-case.
Sometimes the context changes over time as well, for example this complex technique in LLVM had more impact when it was introduced and now will get removed based on more recent benchmarks: http://lists.llvm.org/pipermail/llvm-dev/2020-April/140542.html

I think we should also not simplify anything if the pass doesn't do any core transformation. Otherwise, that might create a dependency between the simplifications done by the pass and subsequent ones.

+1 on this. But I suspect Uday would like to make it even more targeted when it'll be possible to.

bondhugula updated this revision to Diff 255374.Apr 6 2020, 9:13 AM
bondhugula marked 6 inline comments as done.

Address review comments.

mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp
288

@dcaballe @mehdi_amini - all of these are really good points, and I strongly agree. For this patch, I'd like to change the whole function load/store canononicalization happening here, but that depends on when another line of revisions that are pending land. I could remove this and update test cases making the test case checks more clumsy or have a TODO and make the canonicalization more focussed when the pattern rewriter support improves.

mlir/lib/Transforms/Utils/LoopUtils.cpp
1488

Thanks!

No more comments from my side. LGTM! I'll leave the final approval to Andy, once the over approximation discussion is done. Thanks!

mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp
288

I'm ok with the TODO + later revisit approach and protecting the load/store canonicalization with if (!copyNests.empty()), if that makes sense.

andydavis1 accepted this revision.Apr 6 2020, 4:10 PM
This revision is now accepted and ready to land.Apr 6 2020, 4:10 PM
bondhugula updated this revision to Diff 255585.Apr 6 2020, 9:44 PM
bondhugula marked 7 inline comments as done.

Add TODO comment for targeted canonicalization.

mehdi_amini added inline comments.Apr 6 2020, 10:01 PM
mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp
288

I don't expect you to hold this revision, coming back and improve later is perfectly fine with me (in case it wasn't clear).

bondhugula updated this revision to Diff 255611.Apr 7 2020, 1:27 AM
bondhugula marked an inline comment as done.

Update dma-generate test checks post canonicalization guard !copyNests.empty()
(some of the affine.applys that became dead are no longer removed - these were
already in the input test case).

This revision was automatically updated to reflect the committed changes.