Page MenuHomePhabricator

[mlir] Add an option to still use bottom-up traversal
ClosedPublic

Authored by akuegel on Mon, Mar 22, 1:49 AM.

Details

Summary

GreedyPatternRewriteDriver was changed from bottom-up traversal to top-down traversal. Not all passes work yet with that change for traversal order. To give some time for fixing, add an option to allow to switch back to bottom-up traversal. Use this option in FusionOfTensorOpsPass which fails otherwise.

Diff Detail

Event Timeline

akuegel created this revision.Mon, Mar 22, 1:49 AM
akuegel requested review of this revision.Mon, Mar 22, 1:49 AM
ftynse accepted this revision.Mon, Mar 22, 1:55 AM
This revision is now accepted and ready to land.Mon, Mar 22, 1:55 AM
This revision was landed with ongoing or failed builds.Mon, Mar 22, 1:59 AM
This revision was automatically updated to reflect the committed changes.

No test? For a revision touching the core infrastructure this is a bit light and looked a bit rushed in right now.

No test? For a revision touching the core infrastructure this is a bit light and looked a bit rushed in right now.

Yes, it was rushed in, but with a good reason. It means no revert of the change to TopDown traversal is needed, despite it breaking the FusionOfTensorOpsPass.
Arguably the change to TopDown traversal should have made it configurable from the beginning to allow for incremental rollout.

No test? For a revision touching the core infrastructure this is a bit light and looked a bit rushed in right now.

Yes, it was rushed in, but with a good reason.

Sorry but from an upstream point of view I don't see the reason for a rush here, do you mind expanding?

It means no revert of the change to TopDown traversal is needed, despite it breaking the FusionOfTensorOpsPass.

you mean it exposed a bug (!= breaking) in FusionOfTensorOpsPass in some downstream project,

Arguably the change to TopDown traversal should have made it configurable from the beginning to allow for incremental rollout.

This may be arguable, but that's not a justification for rushing this upstream in itself.

I'm not sure I understand the justification of this patch, given that I'm pretty sure all passes upstream do properly support the new behavior. In general I'd prefer to have proper discussion before submitting changes like this.

The iteration order reversal exposed a bug in FusionOfTensorOpsPass, which basically made it unusable on any large input and required a non-trivial change to fix it. It's not great that this wasn't caught by any tests in the mlir repo, but that doesn't mean the problem doesn't exist. The alternative to get to green would've been to revert Chris' change, which would've created even more havoc as we have way too many overspecified lit tests with no automatic update mechanism.

That being said I'm quite unhappy with the tone on your code review comments. Some people are just trying to do their job and keep MLIR healthy, there's no need to constantly try to block their work instead of providing constructive feedback. I hear people complain about the hostility of the MLIR project all the time, and that's certainly driven by a couple of actors on code reviews.

This is tested, although indirectly, by the linalg fusion-on-tensors pass that was changed to use the functionality. The buggy functionality is in upstream code, not exposing the bug upstream is justification enough for me. We can argue whether the linalg code was actually buggy, or was just relying on the behavior of the rewriter that was not specified as either part of the contract or not part of it (also, https://www.hyrumslaw.com/). I think minimizing the test that exposes the problem and adding it upstream would be beneficial for everyone, and is reasonable request, post-commit or otherwise. Suggestions on how to test the order-of-traversal directly are welcome. And I agree that we can do better on explaining our changes and our requests.

No test? For a revision touching the core infrastructure this is a bit light and looked a bit rushed in right now.

Yes, it was rushed in, but with a good reason. It means no revert of the change to TopDown traversal is needed, despite it breaking the FusionOfTensorOpsPass.
Arguably the change to TopDown traversal should have made it configurable from the beginning to allow for incremental rollout.

In fact, I was chatting with the author of the traversal order patch about other matters when we started hitting these snags and he offered a rollback for a time if it made things easier. This override had already landed and seemed like a better middle ground, so I declined the offer.

For context, in addition to the out of tree use that highlighted a bug in a core patch, the traversal order change literally melted Google's use of MLIR. We've been working around the clock for days to patch all of the things -- it's good in the end to flush out such fragility, but it is better done incrementally once such an impact is discovered. I think providing that optionality for all of the users who are going to struggle with this over the coming weeks is pragmatic justification enough.

Thanks for submitting the switch.

The iteration order reversal exposed a bug in FusionOfTensorOpsPass, which basically made it unusable on any large input and required a non-trivial change to fix it. It's not great that this wasn't caught by any tests in the mlir repo, but that doesn't mean the problem doesn't exist.

Maybe, but:

The alternative to get to green would've been to revert Chris' change, which would've created even more havoc as we have way too many overspecified lit tests with no automatic update mechanism.

With an IR repro for a pass upstream, that would have been an option. Otherwise reverting the patch in your downstream project, or landing this patch in your downstream project is always possible.

That being said I'm quite unhappy with the tone on your code review comments. Some people are just trying to do their job and keep MLIR healthy, there's no need to constantly try to block their work instead of providing constructive feedback. I hear people complain about the hostility of the MLIR project all the time, and that's certainly driven by a couple of actors on code reviews.

Maybe we should talk more about it, I'd love for you to elaborate here.
It isn't clear to me that "keeping MLIR healthy" requires urgently landing patches like that when there is no bot broken and no reproducer. Right now this is basically "dead code" in the repo, and this is implemented in a way that does not seem long-term maintainable: it can be fine as a stop-gap and a transition plan but none of this requires urgent landing, instead of taking a couple of days to evaluate the options.

You may not perceive it, but "keeping MLIR healthy" on the long-term is my main focus and has been for over two years, and that's exactly the motivation here.

Let me get a bit more practical here. Do you think this patch would have been more acceptable if it also included a test that exposed the buggy behavior in that Linalg pass?

Note that it would have been still possible to just revert this patch without breaking buildbots simply because it would also remove the test as part of the revert commit. I am generally not convinced that "buildbots are not broken" is a sufficient condition to say upstream fixes are not permitted. The inverse is true though: broken buildbots are a sufficient condition to mandate upstream fixes.

I will let Ben elaborate, but I am under a strong impression that if the first comment had been phrased more along the lines of "Please include a test that exposes the buggy behavior and consider providing more detailed justification for such commits in the future", we wouldn't be having this discussion and would have given the author more concrete guidance for improvement.

Let me get a bit more practical here. Do you think this patch would have been more acceptable if it also included a test that exposed the buggy behavior in that Linalg pass?

With a test, I would have preferred to revert Chris' patch in this case.
Exposing the buggy behavior in linalg is cause to fix linalg, would we revert this patch immediately when fixing Linalg's bug?

Note that it would have been still possible to just revert this patch without breaking buildbots simply because it would also remove the test as part of the revert commit.

This is a bit pedantic: this would apply to any bug fix commit :)
I meant reverting the code itself (not the test...) wouldn't break make check. In general if I can add assert(false) (or assert(flag==false)) in a function implementation and the test suite runs, this is a problem.

As a corollary, a "bug fix" commit that comes with a test that is passing without the code change is also problematic.
As an anecdote, a few years ago I setup an LLVM pre-submit CI system that would patch-in only the test-part of the patch and make sure ninja check fails before applying the code change and checking that it passes.

I am generally not convinced that "buildbots are not broken" is a sufficient condition to say upstream fixes are not permitted. The inverse is true though: broken buildbots are a sufficient condition to mandate upstream fixes.

You write "upstream fixes are not permitted", I'm more nuanced and I looked at it from the perspective of "emergency". I.e. fixes are always permitted, that does not mean they should be rushed in.
Note also that I didn't restrict it to "buildbots are not broken", I wrote exactly: "when there is no bot broken and no reproducer".

I will let Ben elaborate, but I am under a strong impression that if the first comment had been phrased more along the lines of "Please include a test that exposes the buggy behavior and consider providing more detailed justification for such commits in the future", we wouldn't be having this discussion and would have given the author more concrete guidance for improvement.

As I mentioned above, with a test I would have seen the original patch reverted instead. My main concern here is the added tech debt: I see this not as a bug fix but as a new feature, and endorsing a "promise" of something that this API isn't intended to support.

This switch to top down traversal also impacts onnx-mlir which has passes that do not currently work correctly with top down traversal (but do work with bottom up). I realize it's hard to keep track of all possible users, but I think it makes a lot of sense to make such changes optional for a time until the downstream components can adjust without breaking.

This switch to top down traversal also impacts onnx-mlir which has passes that do not currently work correctly with top down traversal (but do work with bottom up). I realize it's hard to keep track of all possible users, but I think it makes a lot of sense to make such changes optional for a time until the downstream components can adjust without breaking.

Thank you Stella for the onnx-mlir perspective. If this is heavily impacting several downstream users, an incremental transition approach is definitely the right one. It isn't always easy to gauge how much a change will impact downstream, so thank you for chiming in.

With a test, I would have preferred to revert Chris' patch in this case.
Exposing the buggy behavior in linalg is cause to fix linalg, would we revert this patch immediately when fixing Linalg's bug?

I would expect so, yes. The patch description says this: "To give some time for fixing, add an option to allow to switch back to bottom-up traversal.". Given that Linalg pass seems to be the only one broken upstream, it being fixed would remove the need for this patch.

Note that it would have been still possible to just revert this patch without breaking buildbots simply because it would also remove the test as part of the revert commit.

This is a bit pedantic: this would apply to any bug fix commit :)
I meant reverting the code itself (not the test...) wouldn't break make check. In general if I can add assert(false) (or assert(flag==false)) in a function implementation and the test suite runs, this is a problem.

As a corollary, a "bug fix" commit that comes with a test that is passing without the code change is also problematic.
As an anecdote, a few years ago I setup an LLVM pre-submit CI system that would patch-in only the test-part of the patch and make sure ninja check fails before applying the code change and checking that it passes.

I agree that I am being pedantic here. Purposefully. It is possible to misread or overinterpret your review and comments as implying something different than what you intended. For example, by taking this pedantic approach. Everybody would win from being more precise and clarifying intent when possible.

And I agree that not having to change tests indicates missing coverage.

I am generally not convinced that "buildbots are not broken" is a sufficient condition to say upstream fixes are not permitted. The inverse is true though: broken buildbots are a sufficient condition to mandate upstream fixes.

You write "upstream fixes are not permitted", I'm more nuanced and I looked at it from the perspective of "emergency". I.e. fixes are always permitted, that does not mean they should be rushed in.
Note also that I didn't restrict it to "buildbots are not broken", I wrote exactly: "when there is no bot broken and no reproducer".

More nuanced is great, I am wholeheartedly for this! More nuanced also means somebody needs to consider the nuances and make the decision. I considered the situation of having buggy behavior in a relatively commonly used path, the scope of the patch to revert and related changes (I reverted that patch before), the complexity of fixing forward linalg (took Nicolas almost a day, and he knows that code better than anyone), the proposed patch and its potential utility for other users. With all this nuances included, it appeared reasonable to accept it, assuming that post-commit concerns will be addressed as usual.

I will let Ben elaborate, but I am under a strong impression that if the first comment had been phrased more along the lines of "Please include a test that exposes the buggy behavior and consider providing more detailed justification for such commits in the future", we wouldn't be having this discussion and would have given the author more concrete guidance for improvement.

As I mentioned above, with a test I would have seen the original patch reverted instead. My main concern here is the added tech debt: I see this not as a bug fix but as a new feature, and endorsing a "promise" of something that this API isn't intended to support.

I may not be explaining well enough, so let me try differently. I think it matters for people _how_ the things are said, not only what those things are. So I attempted to rephrase your comment as more unambiguously expressing an opinion and a giving a concrete indication of how to improve this and further patches. I feel like it would come across as helpful mentoring rather than as negative critique disregarding the concerns of the author and reviewer. Something that I personally would find helpful and productive if I were the patch author was the exact phrase from the last comment: "My main concern here is the added tech debt: I see this not as a bug fix but as a new feature", complemented with how you would like this concern to be addressed.

With a test, I would have preferred to revert Chris' patch in this case.
Exposing the buggy behavior in linalg is cause to fix linalg, would we revert this patch immediately when fixing Linalg's bug?

I would expect so, yes. The patch description says this: "To give some time for fixing, add an option to allow to switch back to bottom-up traversal.". Given that Linalg pass seems to be the only one broken upstream, it being fixed would remove the need for this patch.

OK, isn't linalg fixed now (you're saying something along these long below)? Are you gonna revert this patch then or followup with a long-term design? I suspect reverting this will cause some other issues for downstream users, but they at least have a patch they can keep in their own fork!

You write "upstream fixes are not permitted", I'm more nuanced and I looked at it from the perspective of "emergency". I.e. fixes are always permitted, that does not mean they should be rushed in.
Note also that I didn't restrict it to "buildbots are not broken", I wrote exactly: "when there is no bot broken and no reproducer".

More nuanced is great, I am wholeheartedly for this! More nuanced also means somebody needs to consider the nuances and make the decision. I considered the situation of having buggy behavior in a relatively commonly used path
the scope of the patch to revert and related changes (I reverted that patch before), the complexity of fixing forward linalg (took Nicolas almost a day, and he knows that code better than anyone),

You're not really addressing the "urgency" aspect, from my point of view upstream isn't obviously broken if you can't provide a test. This really could have waited one more day (or two...) to land.

assuming that post-commit concerns will be addressed as usual.

Sure, my post-commit concern is: we need a plan for this, because endorsing a promise on the visitation order of the greedy pattern rewriter does not seem maintainable to me. And we also need testing for whatever solution we come up with.
In general my request is also: please don't approve changes that don't have tests, and be cautious about the promise we make on core APIs which are maintainability hazard.

With a test, I would have preferred to revert Chris' patch in this case.
Exposing the buggy behavior in linalg is cause to fix linalg, would we revert this patch immediately when fixing Linalg's bug?

I would expect so, yes. The patch description says this: "To give some time for fixing, add an option to allow to switch back to bottom-up traversal.". Given that Linalg pass seems to be the only one broken upstream, it being fixed would remove the need for this patch.

OK, isn't linalg fixed now (you're saying something along these long below)? Are you gonna revert this patch then or followup with a long-term design? I suspect reverting this will cause some other issues for downstream users, but they at least have a patch they can keep in their own fork!

You write "upstream fixes are not permitted", I'm more nuanced and I looked at it from the perspective of "emergency". I.e. fixes are always permitted, that does not mean they should be rushed in.
Note also that I didn't restrict it to "buildbots are not broken", I wrote exactly: "when there is no bot broken and no reproducer".

More nuanced is great, I am wholeheartedly for this! More nuanced also means somebody needs to consider the nuances and make the decision. I considered the situation of having buggy behavior in a relatively commonly used path
the scope of the patch to revert and related changes (I reverted that patch before), the complexity of fixing forward linalg (took Nicolas almost a day, and he knows that code better than anyone),

You're not really addressing the "urgency" aspect, from my point of view upstream isn't obviously broken if you can't provide a test. This really could have waited one more day (or two...) to land.

assuming that post-commit concerns will be addressed as usual.

Sure, my post-commit concern is: we need a plan for this, because endorsing a promise on the visitation order of the greedy pattern rewriter does not seem maintainable to me. And we also need testing for whatever solution we come up with.
In general my request is also: please don't approve changes that don't have tests, and be cautious about the promise we make on core APIs which are maintainability hazard.

I would rather see us err on the side of pragmatism/trust/credit, as was done here, when extraordinary events transpire. No one, including the original author expected this to be such a disruptive change, and by the time some of us figured that out, we were deep in follow the sun triage mode. If anyone had possessed a crystal ball at the right time in this process, we would have reverted the original and taken an incremental path. But that wasn't obvious except after some hindsight. I think folks did the best they could here and I choose to extend some trust/credit on that.

As a follow-up, maybe move to discourse and figure out what to do with this flag now that we have it? I personally think this has descended a bit into pedantry and would rather focus on the problems at hand.