This adds a method to directly invoke mlirOperationDestroy on the
MlirOperation wrapped by a PyOperation.
Diff Detail
- Repository: rG LLVM Github Monorepo
Event Timeline
Frankly, I'm not sure this is the best implementation. I do have a use case where I think an explicit destroy call is needed, and to support that I had to make the destructor change. At that point, it seemed like I might as well just call the destructor in the new destroy method, but I'm not sure whether that is some sort of anti-pattern.
It isn't clear to me where/how this assertion was hit before?
Frankly, I'm not sure. In circt, I wrote some code similar to the test case that triggered it. In that case, liveOperations.count(operation.ptr) was 0.
It'd be nice to have a test case that could show it, otherwise we may be hiding another bug somewhere...
I need to study this a bit: there is some subtlety to the ownership model. In the meantime, I'm trying to imagine the use case for needing this, and failing. Can you share more of the motivation?
The backstory is we often need to create cycles in graph regions, and I have been working on something to aid this in Python. Circuit IR often looks like:
%a = op1(%b)
%b = op2(%a)
And we want to be able to say something like this in Python:
a = op1.create(b=None)
b = op2.create(a=a)
a.set_operand(b=b)
We've kicked around some different ideas, but the one that stuck is to create temporary operations to supply values until they can be filled in. I have a very WIP branch to do this in Python.
The creation of temporary operations happens just like in the test case added to this revision: https://github.com/llvm/circt/pull/940/files#diff-118e1917f4ef350934484c7f6899e55475acec2a62d58461a839b97b3872475eR15.
Later on, when that "set_operand" call happens (that isn't the exact API), I call the new destroy method: https://github.com/llvm/circt/pull/940/files#diff-118e1917f4ef350934484c7f6899e55475acec2a62d58461a839b97b3872475eR23.
One thing I just noticed, and I'm not sure if it matters, is that the temporary operation is not registered to any dialect.
Hopefully this helps. I'm trying to learn about liveOperations myself, and to understand why we'd need to assert that the operation to be destroyed is in that list.
mlir/lib/Bindings/Python/IRCore.cpp:2096
Can you call this erase to keep alignment with the C++ API? (we should rename the C API, I don't know why it is called "destroy"...)
I did some more digging, and I think my confusion came from when operations are removed from the liveOperations map. This happens in the PyOperation destructor, and only there. When I was testing this before, I was hitting the assertion in the destructor because I had already called it once and removed the operation! Part of my confusion was how I previously wrote the test: I was checking the live operations map while I still had a Python reference, so the destructor hadn't been called.
Now, the test is just a standard FileCheck assertion on the IR. This revision just calls mlirOperationDestroy directly. This does what I wanted, but I'm still not sure if this is valid. The MlirOperation's pointer will be in the live operations map until the PyOperation destructor is called. I'm not sure if there will be any major issue with that...
This simple case works, but the previous concern about having the erased operation's pointer in the live operations map is a real issue. In fact, if I try to create any operation after erasing one, I hit this assertion: https://github.com/llvm/llvm-project/blob/86f291ebb2dfc5a93e1ee83b2d5965dcfa5d8bb1/mlir/lib/Bindings/Python/IRCore.cpp#L805. I haven't dug that far, but perhaps an implementation detail of MLIR means the same pointer gets re-used for the next operation created? In any case, it seems we must remove the pointer when erase is called... I am adding a test for this and trying to address it.
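To illustrate the hazard being described here (nothing MLIR-specific), the following is a self-contained sketch of how a map keyed on raw pointers goes stale when an entry is not erased at destruction time. The Operation struct and liveOperations map are just a toy model of the real bookkeeping, and whether the freed address is actually handed out again is up to the allocator:

```cpp
#include <cstdio>
#include <map>

struct Operation {
  int id;
};

int main() {
  // Toy stand-in for the bindings' liveOperations map: raw pointer -> data.
  std::map<void *, int> liveOperations;

  Operation *first = new Operation{1};
  liveOperations[first] = 1; // track the "live" operation

  // Destroy the operation but forget to erase its map entry (the bug).
  delete first;

  // The allocator is free to reuse the freed address for the next allocation,
  // so a brand-new operation can collide with the stale entry.
  Operation *second = new Operation{2};
  if (liveOperations.count(second))
    std::printf("stale entry collides with a brand-new operation\n");

  delete second;
  return 0;
}
```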
I've updated the test case to show what I mean. The bottom line is I need to make sure the Python reference is dropped and liveOperations is cleaned out before I can create any more operations. In the test, the addition of Operation.create("custom.op2") would hit the previously linked assertion, unless I added the del op.
This is somewhat non-intuitive to me, and when I first hit this, I was hoping to just have erase take care of removing the pointer from liveOperations. But I wonder if that's not the right approach. The assertion says "cannot create detached operation that already exists", but I am quite sure this is a new operation. I did some debugging, and indeed, when Operation.create("custom.op2") is called, the operation.ptr is the same one that was used when Operation.create("custom.op1") was called. I think this hasn't really come up because we have only been able to create new operations thus far. Is there something fundamental to change here, or is this just how creating operations in MLIR works?
If this is just how creating operations works, I'm back to my original idea of updating the assertion in the destructor. What if it asserted that there was 0 or 1 operation in the liveOperations map? If something, like erase, already did the work of removing it, shouldn't that be ok? This is an operation that is about to no longer exist, and the map will reflect that.
Ah, there are sharp edges there indeed. The ownership model for operations was set up so that we could have "invalid" PyOperation instances (i.e. their backing operation was destroyed or is otherwise in an unknown state) -- however, as with all things being exercised for the first time, I wouldn't be surprised at all to find that the aspiration was not completely implemented/bug-free. I believe the right thing to do on erase is to (see the sketch further below):
- Remove the PyOperation from the live set.
- Set the PyOperation::valid bit to false.
- Have a quick look over PyOperation methods and make sure things are guarded correctly with checkValid(). I wouldn't be surprised if some are missing.
- Condition everything in PyOperation::~PyOperation() on if (valid) { ... }. This should have been done from the get-go and is a bug.
Once a backing MLIR Operation is deleted, by whatever means, it is not legal to keep a dangling pointer to it: just like anything in C++, this pointer can/will likely be handed out again by malloc and friends.
I can look more deeply at this if the above doesn't work/you need help (it will just be a context switch as there is subtlety here that I need to remember).
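For concreteness, here is a minimal sketch of those four steps. The member and helper names used here (getContext(), liveOperations, valid, isAttached()) are assumptions pieced together from the names mentioned in this thread, not the actual patch:

```cpp
// Sketch only -- not the actual revision. Assumes PyOperation has a `valid`
// flag, an `operation` handle, and access to the context's liveOperations map.
void PyOperation::erase() {
  checkValid();
  // 1. Remove the raw pointer from the live set so a later allocation that
  //    reuses the same address does not collide with a stale entry.
  getContext()->liveOperations.erase(operation.ptr);
  // 2. Invalidate this Python wrapper; subsequent checkValid() calls should
  //    report the operation as no longer usable.
  valid = false;
  // 3. Destroy the backing MLIR operation through the C API.
  mlirOperationDestroy(operation);
}

PyOperation::~PyOperation() {
  // 4. Only do the bookkeeping if the wrapper is still valid; erase() (or a
  //    future tree invalidation) may already have cleaned everything up.
  if (valid) {
    getContext()->liveOperations.erase(operation.ptr);
    if (!isAttached())
      mlirOperationDestroy(operation);
  }
}
```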
Thanks @stellaraccident, I had missed that we can use the valid bit for this. I will give it a shot.
Also, when considering trees of operations, it is rather hard to make these kinds of mutations 100% memory safe. We had discussed some approaches, but as it stands, if you erase a tree of ops while holding a deep reference into the tree you just deleted, you're going to have a bad time. The expensive solution is, prior to erasing, to iterate over children recursively, see if they have a corresponding PyOperation, and invalidate it. I had toyed with other approaches to optimize that but never implemented them (i.e. if every operation had a monotonically increasing number associated with it, you could do some cheaper things and not run afoul of pointer re-use issues).
mlir/lib/Bindings/Python/IRCore.cpp:1045
Can you add a TODO here:
// TODO: Fix memory hazards when erasing a tree of operations for which a deep python reference to a child is live.
For efficiency, we probably want to implement a C-API for walk that takes a callback which performs the invalidation. That will likely be quite a bit more efficient than doing all of the switchy recursion directly against the C-API.
> Also, when considering trees of operations, it is rather hard to make these kinds of mutations 100% memory safe. [...] The expensive solution is, prior to erasing, to iterate over children recursively, see if they have a corresponding PyOperation, and invalidate it.
I missed this comment and didn't include anything about it in the current revision. For now, should I implement the "expensive solution" you mentioned? And leave optimizations to future work?
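If it helps, the "expensive solution" could look roughly like the walk below, written against the public MLIR C API. This is only a sketch: PyMlirContext::invalidateOperation is a hypothetical helper that would erase the pointer from liveOperations and clear the corresponding wrapper's valid bit, and none of this is from the actual patch:

```cpp
#include "mlir-c/IR.h"

// Sketch only: recursively visit every operation nested under `op` and
// invalidate any live PyOperation that wraps it, before the tree is erased.
static void invalidateSubtree(PyMlirContext &context, MlirOperation op) {
  for (intptr_t i = 0, e = mlirOperationGetNumRegions(op); i < e; ++i) {
    MlirRegion region = mlirOperationGetRegion(op, i);
    for (MlirBlock block = mlirRegionGetFirstBlock(region);
         !mlirBlockIsNull(block); block = mlirBlockGetNextInRegion(block)) {
      for (MlirOperation child = mlirBlockGetFirstOperation(block);
           !mlirOperationIsNull(child);
           child = mlirOperationGetNextInBlock(child)) {
        // Hypothetical helper: drop child from liveOperations / clear valid.
        context.invalidateOperation(child);
        invalidateSubtree(context, child);
      }
    }
  }
}
```

A dedicated C-API walk that takes an invalidation callback, as suggested in the inline comment, would avoid the many individual C-API calls this recursion makes.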
> Can you add a TODO here:
> // TODO: Fix memory hazards when erasing a tree of operations for which a deep python reference to a child is live.
> For efficiency, we probably want to implement a C-API for walk that takes a callback which performs the invalidation. That will likely be quite a bit more efficient than doing all of the switchy recursion directly against the C-API.