Fix point-wise copy generation to work with bounds that have max/min. Restructure the copy loop nest to use absolute loop indices, subtracting the base from the indices into the fast buffers. Update supporting utilities: fix FlatAffineConstraints::getLowerAndUpperBound to also look at equalities and to account for a missing division; update unionBoundingBox to not discard common constraints (which leads to a tighter system); update MemRefRegion::getConstantBoundingSizeAndShape to add memref dimension constraints; run removeTrivialRedundancy at the end of MemRefRegion::compute. Run single-iteration loop promotion and load/store canonicalization after affine data copy generation (in its test pass as well).
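For illustration, here is a minimal sketch of the copy-nest shape this describes; the memref names, element counts, and the constant bounds 32/96 are made up, and real output may use max/min of affine maps as loop bounds rather than constants.

```mlir
// Hypothetical copy nest (illustrative only, not this pass's actual output).
// %slow is the original memref and %fast the fast buffer; the copied region
// is [32, 96). The loop runs over absolute indices of %slow, and the
// fast-buffer subscript subtracts the region's base (32) so %fast is indexed
// from zero.
affine.for %i = 32 to 96 {
  %v = affine.load %slow[%i] : memref<1024xf32>
  affine.store %v, %fast[%i - 32] : memref<64xf32>
}
```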
Diff Detail
Repository: rG LLVM Github Monorepo

Event Timeline
mlir/lib/Analysis/Utils.cpp:106
Have you run into over-approximations again?
mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp:288
I'm not asking for any changes now, but just wondering if it would make sense in the future to do all of these "clean-up" optimizations in a separate pass (or passes) that we can invoke as needed, maybe after running a bunch of optimizations, instead of trying to optimize right after each one if it's not absolutely necessary. I guess that could reduce compile time and avoid duplicating this clean-up per pass. IIRC, loop fusion also performed some optimization around temporary tensors after fusion. Not sure if those optimizations would also fit into this category.
mlir/lib/Transforms/Utils/LoopUtils.cpp:1488
used in line 1496
mlir/test/Dialect/Affine/affine-data-copy.mlir:114
nit: you can remove the %{{.*}}
mlir/lib/Analysis/Utils.cpp:106
It's the same over-approximation that existed before, but I've changed affine data copy generation to use /*addMemRefDimBounds=*/false with MemRefRegion::compute to prevent redundant bounds from being added in the common case. Instead, the memref dimension bounds are added when getting the constant bounding size and shape, but not when we do getLowerAndUpperBound on that region to get the range for the copy loops. This basically means the code that does the copying now risks going out of bounds when there is over-approximation.

Ultimately, we shouldn't be using approximation-based projection at all for region computation, and should instead work with the equalities/local expressions to keep the bounds accurate. If that's not possible (due to yet-unimplemented detection or complex cases we may not be interested in), the region computation should just fail and bail out. We have a similar over-approximation with unionBoundingBox. This approximation shouldn't be done for write regions; we should bail out if we can't be exact in those cases.

For this patch, we have two options: (1) keep it like this (use addMemRefDimBounds = false with region compute) and then work on getting rid of the use of project in region compute; once that's done, we don't need to add memref dim bounds anywhere. (2) Use addMemRefDimBounds = true for region computation and update the test cases, because there'd be some redundant bounds. This still means we would later need to get rid of the over-approximation (and fail instead) to avoid extra writes (which impact correctness) and extra reads (which may only impact performance). Let me know which one you prefer.
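As a made-up illustration of the unionBoundingBox style of over-approximation mentioned above (the loop, memref, and constants below are hypothetical, not from this patch): two disjoint access ranges get merged into one bounding box, so copy loops generated from that region also touch elements the nest never accesses; if such a box extended past the memref's extent, the copy would go out of bounds unless the memref dimension constraints clamp it.

```mlir
// The two loads touch [0, 64) and [512, 576) of %A, but a single bounding-box
// region for %A covers [0, 576), so a copy generated from it would also read
// the untouched middle range [64, 512).
affine.for %i = 0 to 64 {
  %a = affine.load %A[%i] : memref<1024xf32>
  %b = affine.load %A[%i + 512] : memref<1024xf32>
}
```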
mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp:288
Yes, this is an issue common to several passes, as is the question of whether we want to do lightweight cleanup at the end. If it's really simple canonicalizations, it should have no real impact on compile time (as long as you are doing only the necessary stuff). Its real benefit is that it makes the output of the pass more intuitive to read and the test cases easier to write / more readable. One issue here is that the current greedy pattern rewriter runs folding and DCE on *all* ops irrespective of the patterns, so we get all sorts of unexpected simplifications from the pass and in the test cases. I'm sending out a patch/proposal to add a flag to applyPatternsGreedily that makes it only run the supplied patterns and not do any folding/DCE. This is also needed on entry to a pass/utility when you want to canonicalize things by selectively applying some patterns (instead of requiring the client to do it). It's not always feasible to check whether the input is already in canonical form; that would require a lot of extra code.
mlir/lib/Analysis/Utils.cpp:106
OK, thanks. Yes, let's go with option (1). Do you need to make additional changes to this revision for option (1)?
mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp:288
It is common for a pass to clean up behind itself when it knows exactly what to clean up: for example, while promoting a single-iteration loop you know there may be specific code to clean in the promoted block, and you perform that directly. Here it seems borderline though: it applies some canonicalization patterns unconditionally at function scope.
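To make the single-iteration-loop cleanup being discussed concrete, here is a made-up before/after sketch (the memrefs and bounds are hypothetical, not this pass's output):

```mlir
// Before: a single-iteration copy loop of the kind copy generation can create.
affine.for %i = 4 to 5 {
  %v = affine.load %slow[%i] : memref<1024xf32>
  affine.store %v, %fast[%i - 4] : memref<1xf32>
}
```

After promoting the loop and canonicalizing the load/store subscripts, the loop disappears:

```mlir
// After: the body is inlined with the induction variable replaced by 4.
%v = affine.load %slow[4] : memref<1024xf32>
affine.store %v, %fast[0] : memref<1xf32>
```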
mlir/lib/Analysis/Utils.cpp:106
This revision already does option (1). No additional changes are needed for it.
mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp:288
That's right. Ideally, I'd like to avoid function-scope canonicalizations and restrict this cleanup to the load/store ops we touched (which can be easily collected, and they are few in number: as many as the memrefs packed/copied), but the current pattern rewriter doesn't support that.
mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp:288
Agree! I like the idea of doing some trivial simplifications (e.g., single-iteration loops) as long as it is on code that the pass is generating/modifying and it's not too involved to do so. However, "trivial" and "involved" are somewhat subjective terms, and I think this is one of those things that will get convoluted over time if we don't keep it really low profile. My personal opinion is that between simplicity and convenience, I lean towards the former, and I would do this only for really trivial cases. Bookkeeping operations throughout the algorithm so they can be simplified at the end, just for convenience, is a bit borderline for me. Of course, this is my personal opinion, totally arguable :). I think we should also not simplify anything if the pass doesn't do any core transformation; otherwise, that might create a dependency between the simplifications done by the pass and subsequent ones. Should we limit the load/store simplification to cases where there is at least a copy nest?
mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp:288
> I would say it has to be balanced with having to schedule a complete run of canonicalize before the following pass; I suspect this is going to be case by case.

+1 on this. But I suspect Uday would like to make it even more targeted when it becomes possible to.
Address review comments.
mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp:288
@dcaballe @mehdi_amini - all of these are really good points, and I strongly agree. For this patch, I'd like to change the whole-function load/store canonicalization happening here, but that depends on another line of pending revisions landing. I could either remove this and update the test cases (making the test-case checks more clumsy), or leave a TODO and make the canonicalization more focused when the pattern rewriter support improves.
mlir/lib/Transforms/Utils/LoopUtils.cpp:1488
Thanks!
No more comments from my side. LGTM! I'll leave the final approval to Andy, once the over approximation discussion is done. Thanks!
mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp:288
I'm ok with the TODO + later revisit approach and protecting the load/store canonicalization with if (!copyNests.empty()), if that makes sense.
mlir/lib/Dialect/Affine/Transforms/AffineDataCopyGeneration.cpp:288
I don't expect you to hold this revision; coming back and improving it later is perfectly fine with me (in case that wasn't clear).
Update dma-generate test checks after the canonicalization guard !copyNests.empty() (some of the affine.applys that became dead are no longer removed; those were already in the input test case).