- Fold tensor_load(get_global_memref) of a constant global_memref to the global's initial value (sketched below).
- Also dropped 'NoSideEffects' from global_memref to make sure unused public global_memrefs do not get deleted.
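For reference, a minimal sketch of the IR this fold targets, written with the op spellings in use at the time of this patch (global_memref, get_global_memref, tensor_load); the exact syntax is approximate:

```mlir
// A constant global and a tensor_load of it via get_global_memref.
global_memref "private" constant @cst : memref<2xf32> = dense<[1.0, 2.0]>

func @read_cst() -> tensor<2xf32> {
  %0 = get_global_memref @cst : memref<2xf32>
  %1 = tensor_load %0 : memref<2xf32>
  return %1 : tensor<2xf32>
}

// The proposed fold would replace %1 with the global's initializer:
//   %1 = constant dense<[1.0, 2.0]> : tensor<2xf32>
```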
Details
- Reviewers
mehdi_amini rriddle silvas herhut
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Time | Test
---|---
530 ms | windows > LLVM.CodeGen/AMDGPU::ds_read2.ll
Event Timeline
I don't think this is a good idea, because tensor_load(std.get_global_memref) is precisely the pattern we expect to create from bufferizing a std.constant op on tensors. As such, this canonicalization would undo that lowering, making canonicalization not play well with bufferization.
Do you have a use case where you have std.global_memrefs but basic tensor-level constant folding has not already happened? Typically, I expect that tensor_load(std.get_global_memref) will mainly exist as an intermediate state of bufferization -- presumably, if the user is bufferizing their program, they are already happy with the canonicalizations that have happened at the tensor level.
Thoughts?
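To make the interaction concrete, here is a rough sketch of what bufferizing a tensor-level constant is expected to produce (the global name @__constant_4xi32 is illustrative, and the syntax is approximate):

```mlir
// Before bufferization:
func @f() -> tensor<4xi32> {
  %0 = constant dense<[1, 2, 3, 4]> : tensor<4xi32>
  return %0 : tensor<4xi32>
}

// After bufferizing the constant:
global_memref "private" constant @__constant_4xi32 : memref<4xi32> = dense<[1, 2, 3, 4]>

func @f() -> tensor<4xi32> {
  %0 = get_global_memref @__constant_4xi32 : memref<4xi32>
  %1 = tensor_load %0 : memref<4xi32>
  // The proposed fold would turn %1 back into the original tensor constant,
  // undoing the bufferization of the constant.
  return %1 : tensor<4xi32>
}
```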
How this interacts with bufferization is a good point. I am a bit apprehensive, however, about what the story is w.r.t. optimizing the intermixed tensor/memref world. Can you elaborate your thoughts on that? If we are making a deliberate decision not to perform these optimizations, it seems only right that we document our position somewhere so that in the future we have a reference for what our policy is. This type of optimization seems "obvious" to many newcomers (including myself to some extent, as I haven't followed the bufferization work that closely).
Yes, to me this is one of those obvious optimizations that can/should happen. But I see Sean's point about it undoing bufferization. Maybe this could be a separate optimization pass rather than a canonicalization? TBH, this just seemed like something natural to implement and I don't have an immediate use case in mind.
mlir/lib/Dialect/StandardOps/IR/Ops.cpp |
---|---
4031 | I thought folders are not allowed to create new IR, and canonicalizers are. This one is creating new IR (ConstantOp).
mlir/lib/Dialect/StandardOps/IR/Ops.cpp |
---|---
4031 | A fold method may return either an already-existing SSA Value or an Attribute (a constant). This would use the second, i.e. fold also supports constant folding. You wouldn't create the new IR yourself; you would rely on the folder to do that for you.
I don't quite get the point about the canonicalization hindering bufferization; can you craft a small example, Sean?
Type conversion in general involves rewrites like
source_op(x) -> source_materialization(target_op(target_materialization(x)))
(where the infra usually inserts the materializations). If we have patterns like
source_materialization(target_op(target_materialization(x))) -> source_op(x)
then that pattern is liable to undo a type conversion. In the case of this patch, source_op/target_op have zero operands, so we don't see target_materialization, i.e. tensor_to_memref, in the pattern. But it's clearer if we consider other ops, like canonicalizing tensor_load(memref_cast(tensor_to_memref(x))) -> tensor_cast(x). It's just too open-ended and seems unproductive.
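A sketch of that last example (op spellings as of this patch; syntax approximate):

```mlir
func @round_trip(%arg0: tensor<?xf32>) -> tensor<4xf32> {
  // Materializations inserted by bufferization / type conversion:
  %0 = tensor_to_memref %arg0 : memref<?xf32>
  %1 = memref_cast %0 : memref<?xf32> to memref<4xf32>
  %2 = tensor_load %1 : memref<4xf32>
  return %2 : tensor<4xf32>
}

// The hypothetical canonicalization would rewrite this back to:
//   %2 = tensor_cast %arg0 : tensor<?xf32> to tensor<4xf32>
// i.e. it would undo the deliberate lowering to the memref level.
```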
I would say that the general rule is "avoid writing patterns involving materialization ops". One exception that can be useful is something like source_materialization(target_materialization(x)) -> x, which can help reduce intermediate IR volume.
Also, if looked at from the perspective of a type conversion that can plausibly go both directions, such as !perl.string -> !python.string and !python.string -> !perl.string, then it's clear that neither direction is "canonical". Arguably that's the case for one-way conversions too: neither direction is "canonical", one is just a higher level of abstraction that is lowered to a different level of abstraction.
We would expand constant() -> tensor_load(get_global_memref()) as part of bufferization, but if we happen to canonicalize before finalizing the bufferization, then we would fold it back to constant() and have to re-run bufferization for constant.
Generally, I haven't yet found a reason to need to run canonicalizations during the partially-bufferized state, but this just seems like a gotcha. Also, if you look at my previous post in reply to River, there's a slippery slope here. Do we convert memref_cast back to tensor_cast? Do we convert std.load back to extract_element? It seems like every bufferization pattern admits an inverse de-bufferization pattern that, when looked at in isolation, seems like an "optimization", but is just undoing a deliberate lowering.
mlir/lib/Dialect/StandardOps/IR/Ops.cpp |
---|---
4031 | Thanks. Will update (once the discussion settles).
I have started https://reviews.llvm.org/D90768 for the NoSideEffects change and one more change. For this one, given the concerns and the fact that we don't have an immediate use case, we can leave it out for now and revisit when we actually need it. Does that sound ok?
Abandoning for now. This should be simple to implement when the need arises (as a folder or a separate pass).
If this wasn't a symbol lookup, I'd say that this could just be a fold.